This post contains the latest paper listing fetched from Arxiv.org on 2025-01-29. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the listing by email on a schedule, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org daily, with an automatic update at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-01-29)

A total of 416 papers are updated today, including:

  • Natural Language Processing: 55 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 126 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 72 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 157 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

【Quick Read】: This paper targets fine-grained steering of language model outputs to improve their safety and reliability. Several families of methods exist, such as prompting and finetuning, but there has been no benchmark for comparing them directly. The paper therefore introduces AxBench, a large-scale benchmark, with experiments on Gemma-2-2B and 9B. The key contribution is a systematic comparison across methods on AxBench: prompting performs best for steering, while representation-based methods such as difference-in-means are stronger for concept detection. The paper further proposes a new weakly-supervised representational method, Rank-1 Representation Finetuning (ReFT-r1), which is competitive on both tasks while offering the interpretability advantages that prompting lacks.

Link: https://arxiv.org/abs/2501.17148
Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means, perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
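
As a concrete illustration of the difference-in-means baseline that performs well here, below is a minimal NumPy sketch (not the AxBench code; the array shapes and the `strength` knob are invented for illustration). The steering vector is the mean activation over concept-positive prompts minus the mean over concept-negative prompts, added back into hidden states at inference.

```python
import numpy as np

def difference_in_means(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Steering vector = mean activation gap between concept-positive and
    concept-negative prompts; inputs are (num_examples, hidden_dim) arrays
    of hidden states collected at one layer."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float = 4.0) -> np.ndarray:
    """Add the normalized steering direction to every token's hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + strength * unit

# Toy usage with random activations standing in for a real model's layer.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, (32, 64))
neg = rng.normal(0.0, 1.0, (32, 64))
steered = steer(rng.normal(size=(10, 64)), difference_in_means(pos, neg))
print(steered.shape)  # (10, 64)
```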

[NLP-1] FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data NAACL2025

【Quick Read】: This paper targets hallucinations in text generated by large language models (LLMs). Existing methods rely mainly on public natural language inference (NLI) datasets and synthetic-data approaches, but these are poorly suited to document-level reasoning, computationally expensive for long documents, and limited by the LLM's capabilities. The key contribution is CG2C, a new synthetic data generation method that leverages multi-hop reasoning over context graphs extracted from documents. Building on it, the authors develop a fact-checking model, FactCG, which achieves better performance through more connected reasoning with the same backbone models, and even outperforms GPT-4-o on the LLM-Aggrefact benchmark at a much smaller model size.

Link: https://arxiv.org/abs/2501.17144
Authors: Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
Institutions: Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025

Abstract:Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well-suited for document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation involve iteratively removing sentences from documents and annotating factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM’s capabilities. In this work, we analyze the differences between existing synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning, using the same backbone models. Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark with much smaller model size.

[NLP-2] ASTRAL: Automated Safety Testing of Large Language Models

【Quick Read】: This paper addresses testing whether large language models (LLMs) provide safe responses, in particular how to generate balanced and diverse unsafe test inputs that cover a broad range of safety categories and linguistic writing characteristics. The key is a novel black-box coverage criterion combined with an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies, and web browsing to generate up-to-date test inputs. In addition, LLMs serve as test oracles to distinguish safe from unsafe outputs, enabling a fully automated testing approach that markedly improves detection of unsafe behaviors.

Link: https://arxiv.org/abs/2501.17132
Authors: Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura, Aitor Arrieta
Institutions: Mondragon University; University of Seville
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e., different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors.

[NLP-3] Histoires Morales: A French Dataset for Assessing Moral Alignment NAACL2025

【Quick Read】: This paper addresses the lack of research on the moral reasoning of large language models (LLMs) in French. The key solution is Histoires Morales, a new dataset derived from Moral Stories through translation and localization to fit French culture and norms, with annotations of moral values aligned with the standards of French society. The dataset fills a gap in understanding how LLMs handle moral reasoning in French and is meant to foster future research.

Link: https://arxiv.org/abs/2501.17117
Authors: Thibaud Leteno, Irina Proskurina, Antoine Gourru, Julien Velcin, Charlotte Laclau, Guillaume Metzler, Christophe Gravier
Institutions: Laboratoire Hubert Curien, UMR CNRS 5516; Université Lumière Lyon 2; Université Claude Bernard Lyon 1, ERIC; Télécom Paris, Institut Polytechnique de Paris
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025

Abstract:Aligning language models with human values is crucial, especially as they become more integrated into everyday life. While models are often adapted to user preferences, it is equally important to ensure they align with moral norms and behaviours in real-world social situations. Despite significant progress in languages like English and Chinese, French has seen little attention in this area, leaving a gap in understanding how LLMs handle moral reasoning in this language. To address this gap, we introduce Histoires Morales, a French dataset derived from Moral Stories, created through translation and subsequently refined with the assistance of native speakers to guarantee grammatical accuracy and adaptation to the French cultural context. We also rely on annotations of the moral values within the dataset to ensure their alignment with French norms. Histoires Morales covers a wide range of social situations, including differences in tipping practices, expressions of honesty in relationships, and responsibilities toward animals. To foster future research, we also conduct preliminary experiments on the alignment of multilingual models on French and English data and the robustness of the alignment. We find that while LLMs are generally aligned with human moral norms by default, they can be easily influenced with user-preference optimization for both moral and immoral data.

[NLP-4] Optimizing Large Language Model Training Using FP4 Quantization

【Quick Read】: This paper tackles the growing computational demands of training large language models (LLMs) by proposing a new FP4 training framework. The key elements are a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. The framework additionally integrates a mixed-precision training scheme and vector-wise quantization to ensure stability.

Link: https://arxiv.org/abs/2501.17116
Authors: Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
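
The two key ingredients named above (a differentiable quantization estimator and outlier clamping) can be illustrated with a straight-through-style fake-quantization function in PyTorch. This is a hedged sketch of the general pattern only, not the paper's FP4 kernels; the 15-level signed grid and the clamp quantile are invented for illustration.

```python
import torch

class ClampedQuant4(torch.autograd.Function):
    """Fake-quantize to a symmetric 4-bit-style grid with outlier clamping.
    Forward: clamp outliers at a high quantile, then round to signed levels.
    Backward: straight-through gradient, zeroed for clamped outliers."""

    @staticmethod
    def forward(ctx, x, clamp_q=0.999):
        bound = torch.quantile(x.abs().float(), clamp_q)
        scale = bound / 7.0                        # signed levels -7..7
        q = torch.round(x.clamp(-bound, bound) / scale) * scale
        ctx.save_for_backward(x.abs() <= bound)    # mask of non-clamped entries
        return q

    @staticmethod
    def backward(ctx, grad_out):
        (inside,) = ctx.saved_tensors
        return grad_out * inside, None             # pass-through inside the range

# Toy usage: gradients still flow to the full-precision weights despite rounding.
w = torch.randn(256, 256, requires_grad=True)
loss = ClampedQuant4.apply(w).pow(2).sum()
loss.backward()
print(w.grad.abs().mean())
```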

[NLP-5] COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models

【Quick Read】: This paper targets high-quality story generation for open-ended plot development. The key is combining Monte Carlo Tree Search (MCTS) with a step-level value model that rewards moderate surprisal (curiosity) and penalizes incoherence, plus Odds Ratio Preference Optimization (ORPO) to fine-tune the policy. The iterative reinforcement learning loop systematically explores multiple candidate plot branches, backpropagates quality signals, and adapts the policy for faster convergence, substantially improving plot quality.

Link: https://arxiv.org/abs/2501.17104
Authors: Tobias Materzok
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present COS(M+O)S, a System 2-inspired framework for open-ended plot development that systematically explores the vast space of possible story expansions, enabling a 3B-parameter language model to approach the plot quality of a 70B model on select short-story tasks. The method accomplishes this by combining Monte Carlo Tree Search (MCTS), guided by a step-level value model that rewards moderate surprisal (curiosity) while penalizing incoherence, and Odds Ratio Preference Optimization (ORPO) to fine-tune the policy on high-value plot expansions. This iterative reinforcement learning loop systematically explores multiple candidate plot branches, backpropagates quality signals, and adapts the policy for faster convergence, notably shifting the policy from puzzle-based Chain-of-Thought to more character-driven storytelling. In small-scale tests with short-story prompts, 67%-77% of participants favored COS(M+O)S’s highest-rated expansions over lower-rated ones, suggesting that our learned value function aligns. GPT-4o ratings further show that COS(M+O)S surpasses naive single-pass decoding from Llama 3.2 3B by 0.59 SD, coming within 0.06 SD of Llama 3.1 70B (no significant difference, p=0.93). Pairwise comparisons with o1 place COS(M+O)S 1.5 SD above the 3B baseline and find no statistically significant gap from 70B. Nevertheless, absolute story quality remains modest, constrained by the small model’s capacity and limited training data.
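
A minimal sketch of a step-level value function in the spirit described above: it peaks at a "moderately surprising" target surprisal (curiosity) and penalizes tokens so improbable that they suggest incoherence. The target, threshold, and weights below are invented for illustration, not taken from the paper.

```python
def step_value(token_logprobs, target_surprisal=3.0,
               incoherence_threshold=8.0, coherence_penalty=1.0):
    """Score one candidate plot expansion for use inside MCTS.

    token_logprobs: per-token log p(token | context) from any language model.
    Curiosity rewards mean surprisal near the target; the penalty counts
    tokens improbable enough to look incoherent.
    """
    surprisals = [-lp for lp in token_logprobs]          # nats per token
    mean_s = sum(surprisals) / len(surprisals)
    curiosity = -abs(mean_s - target_surprisal)          # peaks at the target
    frac_incoherent = sum(s > incoherence_threshold for s in surprisals) / len(surprisals)
    return curiosity - coherence_penalty * frac_incoherent

# A moderately surprising expansion scores higher than a garbled one.
print(step_value([-2.5, -3.1, -2.8, -3.4]))   # close to 0 (good)
print(step_value([-9.5, -10.2, -8.8, -9.9]))  # strongly negative
```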

[NLP-6] Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models NAACL-25

【Quick Read】: This paper addresses the efficiency of large pre-trained models for sequence modeling, particularly the computational overhead introduced by Transformers and their attention mechanism. The key is compressing models based on Selective Structured State Space Models (SSMs), notably Mamba and its hybrids, by studying how removing selected components at different granularities affects model size and computational cost. The resulting solution, Mamba-Shedder, achieves up to a 1.4x inference speedup while maintaining performance, showing that efficiency can be improved by eliminating redundancies with minimal impact on overall model quality.

Link: https://arxiv.org/abs/2501.17088
Authors: J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Institutions: Intel Labs; Intel Corporation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NAACL-25 - Main track

Abstract:Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at this https URL.

[NLP-7] Context is Key in Agent Security

【Quick Read】: This paper addresses how security design for generalist agent systems can adapt to contexts at scale and of great diversity. Current safety mechanisms rely on manually crafted policies or user confirmation and cannot cope with such system complexity and contextual variety. The key contribution is Conseca (contextual security for agents), a framework that generates just-in-time, context-aware, and human-verifiable security policies, better matching the broad contexts and capabilities of generalist agent systems.

Link: https://arxiv.org/abs/2501.17070
Authors: Lillian Tsai, Eugene Bagdasarian
Institutions: Google; Google Research
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Judging the safety of an action, whether taken by a human or a system, must take into account the context in which the action takes place. Deleting an email from user’s mailbox may or may not be appropriate depending on email’s content, user’s goals, or even available space. Systems today that make these judgements – providing security against harmful or inappropriate actions – rely on manually-crafted policies or user confirmation for each relevant context. With the upcoming deployment of systems like generalist agents, we argue that we must rethink security designs to adapt to the scale of contexts and capabilities of these systems. As a first step, this paper explores contextual security in the domain of agents and proposes contextual security for agents (Conseca), a framework to generate just-in-time, contextual, and human-verifiable security policies.

[NLP-8] How Linguistics Learned to Stop Worrying and Love the Language Models

【Quick Read】: This paper addresses how language models (LMs) should be positioned in linguistics research. The key claim is that language models are neither a substitute for studying language nor something whose contributions to questions of linguistic structure, processing, and learning should be dismissed. The authors argue that LMs can advance major questions in linguistic theory and force us to rethink arguments about learning, but they do not replace linguistic structure and theory.

Link: https://arxiv.org/abs/2501.17047
Authors: Richard Futrell, Kyle Mahowald
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don’t really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments about learning and are informative for major questions in linguistic theory. But they do not replace linguistic structure and theory. We offer an optimistic take on the relationship between language models and linguistics.

[NLP-9] Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies

【Quick Read】: This paper examines the challenge of reducing harmful outputs from large language models (LLMs), with a focus on advanced models such as DeepSeek-R1. The key proposal is a hybrid training approach combining Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) to achieve more robust harmlessness reduction.

Link: https://arxiv.org/abs/2501.17030
Authors: Manojkumar Parmar, Yuvaraj Govindarajulu
Institutions: AIShield (Powered by Bosch)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 9 pages, 1 table

Abstract:Large Language Models (LLMs) have achieved remarkable progress in reasoning, alignment, and task-specific performance. However, ensuring harmlessness in these systems remains a critical challenge, particularly in advanced models like DeepSeek-R1. This paper examines the limitations of Reinforcement Learning (RL) as the primary approach for reducing harmful outputs in DeepSeek-R1 and compares it with Supervised Fine-Tuning (SFT). While RL improves reasoning capabilities, it faces challenges such as reward hacking, generalization failures, language mixing, and high computational costs. We propose hybrid training approaches combining RL and SFT to achieve robust harmlessness reduction. Usage recommendations and future directions for deploying DeepSeek-R1 responsibly are also presented.

[NLP-10] Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

【Quick Read】: This paper investigates the role of tokenization in large language models (LLMs), particularly its effect on model scaling and performance, which remains underexplored. The key solution is Over-Tokenized Transformers, a new framework that decouples the input and output vocabularies to improve language modeling: scaling up the input vocabulary to leverage multi-gram tokens yields performance gains. The study uncovers a log-linear relationship between input vocabulary size and training loss and shows that larger input vocabularies consistently improve performance regardless of model size; with a large input vocabulary, the method matches a double-sized baseline at no additional cost, underscoring the importance of tokenization in scaling laws and offering practical insight for tokenizer design.

Link: https://arxiv.org/abs/2501.16975
Authors: Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
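
To make the vocabulary decoupling concrete, here is a hedged PyTorch sketch: a large hashed bigram (2-gram) input table is added on top of the base token embedding, while the output head would keep the original vocabulary. The hashing scheme and table sizes are illustrative assumptions, not the paper's construction.

```python
import torch
import torch.nn as nn

class OverTokenizedEmbedding(nn.Module):
    """Decoupled vocabularies: base token embedding plus a hashed 2-gram
    embedding on the input side; the LM head (not shown) keeps the base vocab."""

    def __init__(self, base_vocab=32_000, ngram_vocab=200_000, dim=64):
        super().__init__()
        self.tok = nn.Embedding(base_vocab, dim)
        self.bigram = nn.Embedding(ngram_vocab, dim)
        self.base_vocab, self.ngram_vocab = base_vocab, ngram_vocab

    def forward(self, ids: torch.Tensor) -> torch.Tensor:  # ids: (batch, seq)
        prev = torch.roll(ids, 1, dims=-1)
        prev[..., 0] = 0                                    # no left context at t=0
        bigram_ids = (prev * self.base_vocab + ids) % self.ngram_vocab
        return self.tok(ids) + self.bigram(bigram_ids)

emb = OverTokenizedEmbedding()
print(emb(torch.randint(0, 32_000, (2, 16))).shape)  # torch.Size([2, 16, 64])
```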

[NLP-11] Multiple Abstraction Level Retrieve Augment Generation

【Quick Read】: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) models in handling information at multiple levels of abstraction. Traditional approaches retrieve fixed-length chunks as references, which constrains the ability to generate answers across abstraction levels. The key is a new RAG approach that uses chunks at Multiple Abstraction Levels (MAL): multi-sentence, paragraph, section, and document level. On question answering (Q/A) in the complex scientific domain of Glycoscience, it improves AI-evaluated answer correctness by 25.739% over traditional single-level RAG.

Link: https://arxiv.org/abs/2501.16952
Authors: Zheng Zheng (1), Xinyi Ni (1), Pengyu Hong (1) ((1) Brandeis University)
Institutions: Brandeis University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:A Retrieval-Augmented Generation (RAG) model powered by a large language model (LLM) provides a faster and more cost-effective solution for adapting to new data and knowledge. It also delivers more specialized responses compared to pre-trained LLMs. However, most existing approaches rely on retrieving prefix-sized chunks as references to support question-answering (Q/A). This approach is often deployed to address information needs at a single level of abstraction, as it struggles to generate answers across multiple levels of abstraction. In an RAG setting, while LLMs can summarize and answer questions effectively when provided with sufficient details, retrieving excessive information often leads to the ‘lost in the middle’ problem and exceeds token limitations. We propose a novel RAG approach that uses chunks of multiple abstraction levels (MAL), including multi-sentence-level, paragraph-level, section-level, and document-level. The effectiveness of our approach is demonstrated in an under-explored scientific domain of Glycoscience. Compared to traditional single-level RAG approaches, our approach improves AI evaluated answer correctness of Q/A by 25.739% on Glyco-related papers.
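
A minimal sketch of multi-abstraction-level chunking as described above: each document yields document-, section-, paragraph-, and multi-sentence-level chunks, each of which would then be embedded and indexed. The section and sentence heuristics here are naive stand-ins for the paper's pipeline.

```python
def multi_level_chunks(document: str):
    """Split one document into chunks at several abstraction levels.
    Returns (level, text) pairs; '#' headers delimit sections and blank
    lines delimit paragraphs (simplifying assumptions)."""
    chunks = [("document", document)]
    sections, current = [], []
    for line in document.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    for sec in sections:
        chunks.append(("section", sec))
        for para in [p for p in sec.split("\n\n") if p.strip()]:
            chunks.append(("paragraph", para))
            sents = [s.strip() for s in para.replace("\n", " ").split(". ") if s.strip()]
            for i in range(0, len(sents), 3):           # multi-sentence windows
                chunks.append(("multi-sentence", ". ".join(sents[i:i + 3])))
    return chunks

doc = ("# Intro\nGlycans are complex.\n\nThey play roles in signaling. "
       "They vary widely. They are hard to sequence.\n# Methods\nWe index chunks at four levels.")
for level, text in multi_level_chunks(doc):
    print(level, "->", text[:40].replace("\n", " "))
```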

[NLP-12] ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations

【Quick Read】: This paper addresses automatically generating AI-compatible tools from unstructured REST API documentation, which is difficult because API docs lack standardization and often contain inconsistent schemas and incomplete information. The key is ToolFactory, an open-source pipeline that automates tool generation and builds a knowledge base of verified tools used to infer missing information from poorly documented APIs. An evaluation method for diagnosing errors further strengthens tool reliability. Experiments show ToolFactory's strong potential for integrating scientific REST APIs into AI workflows.

Link: https://arxiv.org/abs/2501.16945
Authors: Xinyi Ni (1), Qiuyang Wang (1), Yukun Zhang (1), Pengyu Hong (1) ((1) Brandeis University)
Institutions: Brandeis University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:LLM-based tool agents offer natural language interfaces, enabling users to seamlessly interact with computing services. While REST APIs are valuable resources for building such agents, they must first be transformed into AI-compatible tools. Automatically generating AI-compatible tools from REST API documents can greatly streamline tool agent development and minimize user learning curves. However, API documentation often suffers from a lack of standardization, inconsistent schemas, and incomplete information. To address these issues, we developed ToolFactory, an open-source pipeline for automating tool generation from unstructured API documents. To enhance the reliability of the developed tools, we implemented an evaluation method to diagnose errors. Furthermore, we built a knowledge base of verified tools, which we leveraged to infer missing information from poorly documented APIs. We developed the API Extraction Benchmark, comprising 167 API documents and 744 endpoints in various formats, and designed a JSON schema to annotate them. This annotated dataset was utilized to train and validate ToolFactory. The experimental results highlight the effectiveness of ToolFactory. We also demonstrated ToolFactory by creating a domain-specific AI agent for glycomaterials research. ToolFactory exhibits significant potential for facilitating the seamless integration of scientific REST APIs into AI workflows.

[NLP-13] TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

【Quick Read】: This paper addresses the difficulty of deploying causal language models in resource-constrained environments, where knowledge distillation is hampered by the substantial capacity gap between teacher and student models, mode averaging, and mode collapse. The key is Temporally Adaptive Interpolated Distillation (TAID), a novel distillation method that dynamically interpolates the student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution toward the teacher's. This effectively prevents mode collapse, mitigates the capacity gap, and balances mode averaging against mode collapse.

Link: https://arxiv.org/abs/2501.16937
Authors: Makoto Shing, Kou Misaki, Han Bao, Sho Yokoi, Takuya Akiba
Institutions: Sakana AI; Kyoto University; NINJAL; Tohoku University; RIKEN
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student’s initial distribution towards the teacher’s distribution. We provide a theoretical analysis demonstrating TAID’s ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID’s superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID’s practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B for language tasks and TAID-VLM-2B for vision-language tasks. These results demonstrate TAID’s effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
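
The core interpolation idea can be sketched in a few lines of PyTorch: the distillation target is a time-dependent mixture of the (detached) student and teacher distributions, with the mixing weight moving from the student toward the teacher over training. The linear schedule below is a simplification; TAID's actual schedule is adaptive.

```python
import torch
import torch.nn.functional as F

def taid_style_loss(student_logits, teacher_logits, step, total_steps):
    """KL toward an interpolated target p_t = (1 - lam)*p_student + lam*p_teacher.

    Early in training the target stays close to the student itself (easy),
    then drifts toward the teacher as lam -> 1, easing the capacity gap.
    """
    lam = step / total_steps
    with torch.no_grad():                       # target distribution, no gradient
        p_student = F.softmax(student_logits, dim=-1)
        p_teacher = F.softmax(teacher_logits, dim=-1)
        p_target = (1.0 - lam) * p_student + lam * p_teacher
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, p_target, reduction="batchmean")

student = torch.randn(4, 32_000, requires_grad=True)
teacher = torch.randn(4, 32_000)
print(taid_style_loss(student, teacher, step=1_000, total_steps=10_000))
```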

[NLP-14] Detecting harassment and defamation in cyberbullying with emotion-adaptive training

【Quick Read】: This paper targets detecting cyberbullying incidents directed at celebrities on social media, covering multiple forms such as defamation and harassment. Existing work focuses mainly on harassment and usually frames it as binary classification, so training data for the many forms of cyberbullying remains scarce. The key is an Emotion-Adaptive Training framework (EAT) that transfers knowledge from emotion detection to cyberbullying detection to help identify indirect incidents. Under low-resource settings, EAT improves average macro F1, precision, and recall by about 20% across nine transformer-based models on cyberbullying detection tasks.

Link: https://arxiv.org/abs/2501.16925
Authors: Peiling Yi, Arkaitz Zubiaga, Yunfei Long
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Existing research on detecting cyberbullying incidents on social media has primarily concentrated on harassment and is typically approached as a binary classification task. However, cyberbullying encompasses various forms, such as denigration and harassment, which celebrities frequently face. Furthermore, suitable training data for these diverse forms of cyberbullying remains scarce. In this study, we first develop a celebrity cyberbullying dataset that encompasses two distinct types of incidents: harassment and defamation. We investigate various types of transformer-based models, namely masked (RoBERTa, Bert and DistilBert), replacing (Electra), autoregressive (XLnet), masked-permuted (Mpnet), text-text (T5) and large language models (Llama2 and Llama3) under low source settings. We find that they perform competitively on explicit harassment binary detection. However, their performance is substantially lower on harassment and denigration multi-classification tasks. Therefore, we propose an emotion-adaptive training framework (EAT) that helps transfer knowledge from the domain of emotion detection to the domain of cyberbullying detection to help detect indirect cyberbullying events. EAT consistently improves the average macro F1, precision and recall by 20% in cyberbullying detection tasks across nine transformer-based models under low-resource settings. Our claims are supported by intuitive theoretical insights and extensive experiments.

[NLP-15] Irony Detection, Reasoning and Understanding in Zero-shot Learning

【Quick Read】: This paper addresses the limited generalization ability of large language models (LLMs) in understanding and detecting irony. The key is a prompt engineering design framework, IDADP, which, through optimized prompt design, improves ChatGPT's irony understanding and reasoning across irony detection datasets of different genres, achieving higher detection accuracy and more effective explanations and thus easing the generalization problem of LLMs.

Link: https://arxiv.org/abs/2501.16884
Authors: Peiling Yi, Yuhan Xia
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Irony is a powerful figurative language (FL) on social media that can potentially mislead various NLP tasks, such as recommendation systems, misinformation checks, and sentiment analysis. Understanding the implicit meaning of this kind of subtle language is essential to mitigate irony’s negative impact on NLP tasks. However, building models to understand irony presents a unique set of challenges, because irony is a complex form of language that often relies on context, tone, and subtle cues to convey meaning that is opposite or different from the literal interpretation. Large language models, such as ChatGPT, are increasingly able to capture implicit and contextual information. In this study, we investigate the generalization, reasoning and understanding ability of ChatGPT on irony detection across six different genre irony detection datasets. Our findings suggest that ChatGPT appears to show an enhanced language understanding and reasoning ability. But it needs to be very careful in prompt engineering design. Thus, we propose a prompt engineering design framework IDADP to achieve higher irony detection accuracy, improved understanding of irony, and more effective explanations compared to other state-of-the-art ChatGPT zero-shot approaches. And ascertain via experiments that the practice generated under the framework is likely to be the promised solution to resolve the generalization issues of LLMs.

[NLP-16] JRE-L: Journalist, Reader and Editor LLMs in the Loop for Science Journalism for the General Audience

【Quick Read】: This paper addresses the difficulty, in science journalism, of making the latest research findings understandable to a general audience that lacks domain expertise. The key is the JRE-L framework, which integrates three large language models (LLMs) to mimic a writing-reading-feedback-revision loop: one model writes the article as the journalist, another gives feedback as a general public reader, and a third suggests revisions as the editor. Through this collaboration, the article is iteratively refined for readability and public comprehension. Experiments show that the resulting articles are more accessible than those from existing methods, including prompting a single advanced model such as GPT-4 and other multi-LLM collaboration strategies.

Link: https://arxiv.org/abs/2501.16865
Authors: Gongyao Jiang, Xinran Shi, Qiong Luo
Institutions: The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.09756

Abstract:Science journalism reports current scientific discoveries to non-specialists, aiming to enable public comprehension of the state of the art. This task is challenging as the audience often lacks specific knowledge about the presented research. We propose a JRE-L framework that integrates three LLMs mimicking the writing-reading-feedback-revision loop. In JRE-L, one LLM acts as the journalist, another LLM as the general public reader, and the third LLM as an editor. The journalist’s writing is iteratively refined by feedback from the reader and suggestions from the editor. Our experiments demonstrate that by leveraging the collaboration of two 7B and one 1.8B open-source LLMs, we can generate articles that are more accessible than those generated by existing methods, including prompting single advanced models such as GPT-4 and other LLM-collaboration strategies. Our code is publicly available at this http URL.

[NLP-17] Misspellings in Natural Language Processing: A survey

【Quick Read】: This survey addresses the challenges misspellings pose for natural language processing (NLP). The key mitigation strategies covered include data augmentation, double-step, character-order agnostic, and tuple-based methods, among others. The survey also reviews dedicated data challenges and competitions that spur progress in the field, and analyzes the performance of, and opportunities for, modern large language models when confronted with misspellings.

Link: https://arxiv.org/abs/2501.16836
Authors: Gianluca Sperduti, Alejandro Moreo
Institutions: Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This survey provides an overview of the challenges of misspellings in natural language processing (NLP). While often unintentional, misspellings have become ubiquitous in digital communication, especially with the proliferation of Web 2.0, user-generated content, and informal text mediums such as social media, blogs, and forums. Even if humans can generally interpret misspelled text, NLP models frequently struggle to handle it: this causes a decline in performance in common tasks like text classification and machine translation. In this paper, we reconstruct a history of misspellings as a scientific problem. We then discuss the latest advancements to address the challenge of misspellings in NLP. Main strategies to mitigate the effect of misspellings include data augmentation, double step, character-order agnostic, and tuple-based methods, among others. This survey also examines dedicated data challenges and competitions to spur progress in the field. Critical safety and ethical concerns are also examined, for example, the voluntary use of misspellings to inject malicious messages and hate speech on social networks. Furthermore, the survey explores psycholinguistic perspectives on how humans process misspellings, potentially informing innovative computational techniques for text normalization and representation. Finally, the misspelling-related challenges and opportunities associated with modern large language models are also analyzed, including benchmarks, datasets, and performances of the most prominent language models against misspellings. This survey aims to be an exhaustive resource for researchers seeking to mitigate the impact of misspellings in the rapidly evolving landscape of NLP.

[NLP-18] Whispers of Sound: Enhancing Information Extraction from Depression Patients' Unstructured Data through Audio and Text Emotion Recognition and Llama Fine-tuning

【Quick Read】: This paper addresses the limitations of traditional methods in feature fusion and modality weight allocation, introducing multi-head attention mechanisms and weighted multimodal transfer learning to improve depression classification accuracy. The key is an innovative multimodal fusion model based on a teacher-student architecture, in which a student fusion model guided by textual and auditory teacher models achieves significant gains in classification accuracy.

Link: https://arxiv.org/abs/2501.16813
Authors: Lindy Gan, Yifan Huang, Xiaoyang Gao, Jiaming Tan, Fujun Zhao, Tao Yang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 21 pages, 7 figures, 1 table

Abstract:This study proposes an innovative multimodal fusion model based on a teacher-student architecture to enhance the accuracy of depression classification. Our designed model addresses the limitations of traditional methods in feature fusion and modality weight allocation by introducing multi-head attention mechanisms and weighted multimodal transfer learning. Leveraging the DAIC-WOZ dataset, the student fusion model, guided by textual and auditory teacher models, achieves significant improvements in classification accuracy. Ablation experiments demonstrate that the proposed model attains an F1 score of 99.1% on the test set, significantly outperforming unimodal and conventional approaches. Our method effectively captures the complementarity between textual and audio features while dynamically adjusting the contributions of the teacher models to enhance generalization capabilities. The experimental results highlight the robustness and adaptability of the proposed framework in handling complex multimodal data. This research provides a novel technical framework for multimodal large model learning in depression analysis, offering new insights into addressing the limitations of existing methods in modality fusion and feature extraction.

[NLP-19] Algorithm for Automatic Legislative Text Consolidation

【Quick Read】: This paper targets the tedious, time-consuming process of consolidating legislative texts in a legal context, traditionally done manually by legal professionals. The key is a generative approach: a light quantized generative model, fine-tuned with LoRA, automatically produces accurate and reliable amended texts. To the authors' knowledge, this is the first application of generative models to legislative text consolidation.

Link: https://arxiv.org/abs/2501.16794
Authors: Matias Etcheverry, Thibaud Real, Pauline Chavallard
Institutions: Doctrone; Ecole Nationale Supérieure; Paris-Saclay
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This study introduces a method for automating the consolidation process in a legal context, a time-consuming task traditionally performed by legal professionals. We present a generative approach that processes legislative texts to automatically apply amendments. Our method employs a light quantized generative model, fine-tuned with LoRA, to generate accurate and reliable amended texts. To the authors' knowledge, this is the first time generative models are used on legislative text consolidation. Our dataset is publicly available on HuggingFace. Experimental results demonstrate a significant improvement in efficiency, offering faster updates to legal documents. A fully automated pipeline of legislative text consolidation can be done in a few hours, with a success rate of more than 63% on a difficult bill.

[NLP-20] Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding

【Quick Read】: This paper addresses the challenges of applying Multimodal Large Language Models (MLLMs) to video understanding, in particular how to effectively model temporal relations across frames. The key is the Stackable Temporal Encoder (STE), which enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios, allowing a systematic comparison of implicit versus explicit temporal modeling.

Link: https://arxiv.org/abs/2501.16786
Authors: Yun Li, Zhe Liu, Yajing Kong, Guangrui Li, Jiyuan Zhang, Chao Bian, Feng Liu, Lina Yao, Zhenbang Sun
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate this debate between the two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across dimensions such as overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE’s design considerations and broader impacts as a plug-in module and in image modalities. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
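
One way to picture a stackable temporal encoder with an adjustable receptive field and token compression ratio is a strided temporal convolution over frame tokens, as in the hedged PyTorch sketch below (the mechanism and dimensions are assumptions for illustration, not the paper's exact design).

```python
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    """One stackable block: a strided temporal convolution over frame tokens.
    kernel_size sets the temporal receptive field; stride sets the token
    compression ratio (frames_out ~= frames_in / stride)."""

    def __init__(self, dim=768, kernel_size=4, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, stride=stride,
                              padding=(kernel_size - stride) // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                 # x: (batch, frames, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(y)

# Stacking two blocks compresses 32 frame tokens down to 8.
ste = nn.Sequential(TemporalEncoderBlock(), TemporalEncoderBlock())
print(ste(torch.randn(1, 32, 768)).shape)  # torch.Size([1, 8, 768])
```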

[NLP-21] A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process DATE

【Quick Read】: This paper explores how large language models (LLMs) may self-amplify latent bias or toxicity during chain-of-thought reasoning. The key is a continuous-time stochastic dynamical framework in which a severity variable x(t) evolves under a stochastic differential equation (SDE) with drift term \mu(x) and diffusion term \sigma(x). The framework admits a consistent Fokker-Planck analysis provided each incremental step is nearly Markovian in severity space. Analyzing critical phenomena reveals phase transitions in certain parameter regimes, from subcritical (self-correcting) to supercritical (runaway severity), offering a theoretical basis for understanding and assessing model stability and bias propagation.

Link: https://arxiv.org/abs/2501.16783
Authors: Jack David Carson
Institutions: Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: Experimental verification and more formal argument for Markov approximation of bias propagation to be released soon. Primarily pushed now to establish novelty and ease of sharing. Please do not cite this work until the forthcoming experimental validation and updated mathematical model are provided

Abstract:This paper introduces a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) may self-amplify latent biases or toxicity through their own chain-of-thought reasoning. The model posits an instantaneous “severity” variable x(t) \in [0,1] evolving under a stochastic differential equation (SDE) with a drift term \mu(x) and diffusion \sigma(x) . Crucially, such a process can be consistently analyzed via the Fokker–Planck approach if each incremental step behaves nearly Markovian in severity space. The analysis investigates critical phenomena, showing that certain parameter regimes create phase transitions from subcritical (self-correcting) to supercritical (runaway severity). The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points. Finally, it highlights implications for agents and extended LLM reasoning models: in principle, these equations might serve as a basis for formal verification of whether a model remains stable or propagates bias over repeated inferences.
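
The severity SDE dx = \mu(x) dt + \sigma(x) dW can be simulated with Euler-Maruyama to see the subcritical/supercritical contrast numerically. The drift and diffusion forms below are invented for illustration; the paper derives its results analytically via the Fokker-Planck equation.

```python
import numpy as np

def simulate_severity(mu_gain, x0=0.05, dt=0.01, steps=5_000, seed=0):
    """Euler-Maruyama simulation of dx = mu(x) dt + sigma(x) dW on [0, 1].

    mu(x) = mu_gain * x * (1 - x) - 0.5 * x amplifies severity when mu_gain
    is large and damps it when small; sigma vanishes at the boundaries.
    These functional forms are illustrative assumptions only.
    """
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        drift = mu_gain * x * (1.0 - x) - 0.5 * x
        diffusion = 0.05 * np.sqrt(max(x * (1.0 - x), 0.0))
        x += drift * dt + diffusion * np.sqrt(dt) * rng.normal()
        x = min(max(x, 0.0), 1.0)                # keep severity in [0, 1]
    return x

print("subcritical  :", simulate_severity(mu_gain=0.3))  # decays toward 0
print("supercritical:", simulate_severity(mu_gain=2.0))  # settles at an elevated level
```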

[NLP-22] Through the Prism of Culture: Evaluating LLMs Understanding of Indian Subcultures and Traditions

【Quick Read】: This paper evaluates the capacity of large language models (LLMs) to recognize and respond accurately to the Little Traditions of Indian society: localized cultural practices and subcultures involving caste, kinship, marriage, and religion. Through a series of case studies, it examines whether LLMs can balance the interplay between the dominant Great Traditions and localized Little Traditions, and explores whether prompting in regional languages enhances the models' cultural sensitivity and response quality. The key finding is that while LLMs can articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios. To the best of the authors' knowledge, this is the first study to analyze LLMs' engagement with Indian subcultures, offering critical insights into the challenges of embedding cultural diversity in AI systems.

Link: https://arxiv.org/abs/2501.16748
Authors: Garima Chhikara, Abhishek Kumar, Abhijnan Chakraborty
Institutions: Indian Institute of Technology Delhi; Delhi Technological University; Indian Institute of Technology Kharagpur
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown remarkable advancements but also raise concerns about cultural bias, often reflecting dominant narratives at the expense of under-represented subcultures. In this study, we evaluate the capacity of LLMs to recognize and accurately respond to the Little Traditions within Indian society, encompassing localized cultural practices and subcultures such as caste, kinship, marriage, and religion. Through a series of case studies, we assess whether LLMs can balance the interplay between dominant Great Traditions and localized Little Traditions. We explore various prompting strategies and further investigate whether using prompts in regional languages enhances the models cultural sensitivity and response quality. Our findings reveal that while LLMs demonstrate an ability to articulate cultural nuances, they often struggle to apply this understanding in practical, context-specific scenarios. To the best of our knowledge, this is the first study to analyze LLMs engagement with Indian subcultures, offering critical insights into the challenges of embedding cultural diversity in AI systems.

[NLP-23] xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking

【Quick Read】: This paper addresses the fact that existing safety mechanisms in large language models (LLMs) can be bypassed by carefully crafted prompts. The key is a novel reinforcement learning (RL) based black-box jailbreak method that optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts, keeping rewritten prompts closely aligned with the original intent while increasing attack effectiveness. The paper also introduces a comprehensive jailbreak evaluation framework incorporating keyword matching, intent matching, and answer validation, providing a more rigorous and holistic measure of jailbreak success.

Link: https://arxiv.org/abs/2501.16727
Authors: Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang
Institutions: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; WUST; HITSZ; School of Biomedical Engineering, UNSW Sydney
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Safety alignment mechanism are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model’s internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack’s effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at this https URL.
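
A sketch of an embedding-proximity reward in the spirit described above: a rewritten prompt is rewarded for staying close to the original intent while drifting toward the benign region of embedding space. The vectors, weights, and reward shaping below are assumptions; any sentence encoder could supply the embeddings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def proximity_reward(rewrite_emb, original_emb, benign_center, malicious_center):
    """Reward = intent preservation + movement toward the benign region.
    The 0.7/0.3 weighting is an illustrative choice, not the paper's."""
    intent = cosine(rewrite_emb, original_emb)
    disguise = cosine(rewrite_emb, benign_center) - cosine(rewrite_emb, malicious_center)
    return 0.7 * intent + 0.3 * disguise

# Toy vectors standing in for sentence embeddings from any encoder.
rng = np.random.default_rng(1)
benign, malicious = rng.normal(size=384), rng.normal(size=384)
original = malicious + 0.1 * rng.normal(size=384)     # a malicious prompt
rewrite = 0.6 * original + 0.4 * benign               # softened rewrite
print(proximity_reward(rewrite, original, benign, malicious))
```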

[NLP-24] 3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow

【Quick Read】: This paper targets efficient modeling for 3D vision and multimodal data processing. The key is converting existing densely activated large language models (LLMs) into mixture-of-experts (MoE) models and attaching a diffusion head, Pose-DiT, which employs a novel rectified flow diffusion scheduler to enable embodied task planning. Experiments show the 3D-MoE framework improves performance on 3D question answering and task planning with fewer activated parameters.

Link: https://arxiv.org/abs/2501.16698
Authors: Yueen Ma, Yuzheng Zhuang, Jianye Hao, Irwin King
Institutions: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint. Work in progress

Abstract:3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world, especially when compared with traditional visual reasoning based on 2D images. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. With the advent of powerful large language models (LLMs), multi-modal LLMs for 3D vision have been developed over the past few years. However, most of these models focus primarily on the vision encoder for 3D data. In this paper, we propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing. In addition to leveraging these models’ instruction-following capabilities, we further enable embodied task planning by attaching a diffusion head, Pose-DiT, that employs a novel rectified flow diffusion scheduler. Experimental results on 3D question answering and task-planning tasks demonstrate that our 3D-MoE framework achieves improved performance with fewer activated parameters.

[NLP-25] MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark

【Quick Read】: This paper addresses the lack of comprehensive benchmarks for evaluating Multimodal Large Language Models (MLLMs) in industrial applications. The key is MME-Industry, a new benchmark spanning 21 distinct domains with 1050 question-answer pairs in total, all manually crafted and validated by domain experts to ensure data integrity and prevent leakage from public datasets. The benchmark's complexity is further raised by including directly answerable non-OCR questions alongside tasks requiring specialized domain knowledge, and both Chinese and English versions are provided to enable cross-lingual comparison.

Link: https://arxiv.org/abs/2501.16688
Authors: Dongyi Yi, Guibo Zhu, Chenglin Ding, Zongshu Li, Dong Yi, Jinqiao Wang
Institutions: Wuhan AI Research; Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures

Abstract:With the rapid advancement of Multimodal Large Language Models (MLLMs), numerous evaluation benchmarks have emerged. However, comprehensive assessments of their performance across diverse industrial applications remain limited. In this paper, we introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial scenarios. The benchmark encompasses 21 distinct domains, comprising 1050 question-answer pairs with 50 questions per domain. To ensure data integrity and prevent potential leakage from public datasets, all question-answer pairs were manually crafted and validated by domain experts. Besides, the benchmark’s complexity is effectively enhanced by incorporating non-OCR questions that can be answered directly, along with tasks requiring specialized domain knowledge. Moreover, we provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs’ capabilities across these languages. Our findings contribute valuable insights into MLLMs’ practical industrial applications and illuminate promising directions for future model optimization research.

[NLP-26] Auto-Differentiating Any LLM Workflow: A Farewell to Manual Prompting

【Quick Read】: This paper addresses how difficult and labor-intensive prompt engineering is for complex LLM pipelines, especially those combining multiple LLM calls with functional operations such as retrieval and data formatting. The key is LLM-AutoDiff, a novel Automatic Prompt Engineering (APE) framework that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. It treats each textual input as a trainable parameter and uses a frozen backward-engine LLM to generate feedback, akin to textual gradients, that guides iterative prompt updates. The approach accommodates functional nodes, preserves time-sequential behavior, and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts, while boosting training efficiency.

Link: https://arxiv.org/abs/2501.16673
Authors: Li Yin, Zhangyang Wang (Atlas)
Institutions: SylphAI; University of Texas at Austin
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have reshaped natural language processing, powering applications from multi-hop retrieval and question answering to autonomous agent workflows. Yet, prompt engineering – the task of crafting textual inputs to effectively direct LLMs – remains difficult and labor-intensive, particularly for complex pipelines that combine multiple LLM calls with functional operations like retrieval and data formatting. We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) that extends textual gradient-based methods (such as Text-Grad) to multi-component, potentially cyclic LLM architectures. Implemented within the AdalFlow library, LLM-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine LLM to generate feedback – akin to textual gradients – that guides iterative prompt updates. Unlike prior single-node approaches, LLM-AutoDiff inherently accommodates functional nodes, preserves time-sequential behavior in repeated calls (e.g., multi-hop loops), and combats the “lost-in-the-middle” problem by isolating distinct sub-prompts (instructions, formats, or few-shot examples). It further boosts training efficiency by focusing on error-prone samples through selective gradient computation. Across diverse tasks, including single-step classification, multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff consistently outperforms existing textual gradient baselines in both accuracy and training cost. By unifying prompt optimization through a graph-centric lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating LLM workflows - mirroring the transformative role that automatic differentiation libraries have long played in neural network research.
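
A toy sketch of one textual-gradient step: run the current prompt forward over examples, collect failures, and ask a frozen "backward engine" LLM to rewrite the prompt. The `forward_llm`/`backward_llm` callables and the feedback wording are placeholders invented for this sketch, not AdalFlow's API.

```python
def textual_gradient_step(prompt, examples, forward_llm, backward_llm):
    """One optimization step in the style of textual-gradient APE.
    forward_llm/backward_llm stand in for any LLM call (hypothetical)."""
    failures = []
    for x, y_true in examples:
        y_pred = forward_llm(f"{prompt}\n\nInput: {x}")
        if y_pred.strip() != y_true:
            failures.append((x, y_true, y_pred))
    if not failures:
        return prompt                              # nothing to fix this step
    feedback_request = (
        "You are a prompt engineer. The prompt below failed on these cases.\n"
        f"Prompt: {prompt}\n"
        + "\n".join(f"Input: {x} Expected: {t} Got: {p}" for x, t, p in failures)
        + "\nRewrite the prompt to fix these errors. Output only the new prompt."
    )
    return backward_llm(feedback_request)          # "gradient" applied as rewrite

# Stub LLMs so the sketch runs without any network access.
fwd = lambda s: "positive"
bwd = lambda s: "Classify the sentiment of the input as positive or negative."
print(textual_gradient_step("Classify sentiment.",
                            [("great movie", "positive"), ("awful plot", "negative")],
                            fwd, bwd))
```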

[NLP-27] VeriFact: Verifying Facts in LLM -Generated Clinical Text with Electronic Health Records

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLM)在临床医学文本生成中的事实准确性保障方法缺失的问题。解决方案的关键在于VeriFact系统,它结合了检索增强生成(Retrieval-Augmented Generation)和LLM作为法官(LLM-as-a-Judge),以验证LLM生成的文本是否基于患者的电子健康记录(EHR)得到了事实支持。

Link: https://arxiv.org/abs/2501.16672
Authors: Philip Chung, Akshay Swaminathan, Alex J. Goodell, Yeasul Kim, S. Momsen Reincke, Lichy Han, Ben Deverett, Mohammad Amin Sadeghi, Abdel-Badih Ariss, Marc Ghanem, David Seong, Andrew A. Lee, Caitlin E. Coombes, Brad Bradshaw, Mahir A. Sufian, Hyo Jung Hong, Teresa P. Nguyen, Mohammad R. Rasouli, Komal Kamra, Mark A. Burbridge, James C. McAvoy, Roya Saffary, Stephen P. Ma, Dev Dash, James Xie, Ellen Y. Wang, Clifford A. Schmiesing, Nigam Shah, Nima Aghaeepour
Institutions: Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Logic in Computer Science (cs.LO)
Comments: 62 pages, 5 figures, 1 table, pre-print manuscript

Abstract:Methods to ensure factual accuracy of text generated by large language models (LLM) in clinical medicine are lacking. VeriFact is an artificial intelligence system that combines retrieval-augmented generation and LLM-as-a-Judge to verify whether LLM-generated text is factually supported by a patient’s medical history based on their electronic health record (EHR). To evaluate this system, we introduce VeriFact-BHC, a new dataset that decomposes Brief Hospital Course narratives from discharge summaries into a set of simple statements with clinician annotations for whether each statement is supported by the patient’s EHR clinical notes. Whereas highest agreement between clinicians was 88.5%, VeriFact achieves up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth, suggesting that VeriFact exceeds the average clinician’s ability to fact-check text against a patient’s medical record. VeriFact may accelerate the development of LLM-based EHR applications by removing current evaluation bottlenecks.

[NLP-28] Contextual Reinforcement in Multimodal Token Compression for Large Language Models

【Quick Read】: This paper addresses effective token compression for increasingly complex and diverse datasets. The key is a novel mechanism based on contextual reinforcement that dynamically adjusts token importance through interdependencies and semantic relevance, achieving substantial reductions in token usage while preserving the quality and coherence of the information representation. Incorporating graph-based algorithms and adaptive weighting, the method captures subtle contextual relationships across textual and multimodal data, ensuring robust alignment and performance in downstream tasks.

Link: https://arxiv.org/abs/2501.16658
Authors: Naderdel Piero, Zacharias Cromwell, Nathaniel Wainwright, Matthias Nethercott
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Effective token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation. Incorporating graph-based algorithms and adaptive weighting, the method captures subtle contextual relationships across textual and multimodal data, ensuring robust alignment and performance in downstream tasks. Evaluations across varied domains reveal significant improvements in accuracy and semantic retention, particularly for tasks requiring detailed cross-modal interactions. Memory usage analyses demonstrate improved computational efficiency, with minimal overhead despite the additional reinforcement processes. Performance gains are further validated through error distribution analyses, showing reduced semantic loss and syntactic inconsistencies compared to baseline models. The modular architecture ensures compatibility with a wide range of open-source frameworks, facilitating scalable implementation for real-world applications. These findings highlight the potential of contextual reinforcement in redefining token management strategies and advancing large-scale model design.
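
One simple way to realize graph-based, context-dependent token importance is to score tokens by centrality in a cosine-similarity graph and keep the top fraction, as sketched below. This PageRank-style scoring is a stand-in of my own choosing; the abstract does not spell out the exact reinforcement update.

```python
import numpy as np

def compress_tokens(token_embs: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the most 'central' tokens in a cosine-similarity graph.

    token_embs: (num_tokens, dim) embeddings; returns kept indices in order.
    """
    normed = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)
    adj = np.clip(sim, 0.0, None)                        # keep positive links only
    adj = adj / (adj.sum(axis=1, keepdims=True) + 1e-9)  # row-stochastic graph
    score = np.full(len(adj), 1.0 / len(adj))
    for _ in range(20):                                  # power iteration
        score = 0.15 / len(adj) + 0.85 * adj.T @ score
    k = max(1, int(keep_ratio * len(adj)))
    return np.sort(np.argsort(score)[-k:])               # top-k, original order

embs = np.random.default_rng(2).normal(size=(12, 64))
print(compress_tokens(embs, keep_ratio=0.5))
```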

[NLP-29] Large Language Model Critics for Execution-Free Evaluation of Code Changes

【Quick Read】: This paper argues that existing metrics for evaluating multi-step LLM-based agentic workflows on automated software engineering tasks (chiefly build status and occasional log analysis) are too sparse and limited to assess the quality of the changes made. The key is LLM-based critics that derive well-structured, rigorous intermediate/step-level, execution-free evaluation proxies for repo-level code changes. Using the gold test patch as a reference, the approach predicts the executability of edit locations and, by aggregation, the build status, significantly improving evaluation accuracy and effectiveness.

Link: https://arxiv.org/abs/2501.16655
Authors: Aashish Yadavally, Hoan Nguyen, Laurent Callot, Gauthier Guinet
Institutions: University of Texas at Dallas; AWS AI Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 4 figures

Abstract:Large language models (LLMs) offer a promising way forward for automating software engineering tasks, such as bug fixes, feature additions, etc., via multi-step LLM-based agentic workflows. However, existing metrics for evaluating such workflows, mainly build status and occasionally log analysis, are too sparse and limited in providing the information needed to assess the quality of changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for repo-level code changes. Importantly, we assume access to the gold test patch for the problem (i.e., reference-aware) to assess both semantics and executability of generated patches. With the gold test patch as a reference, we predict executability of all editing locations with an F1 score of 91.6%, aggregating which, we can predict the build status in 84.8% of the instances in SWE-bench. In particular, such an execution-focused LLM critic outperforms other reference-free and reference-aware LLM critics by 38.9% to 72.5%. Moreover, we demonstrate the usefulness of such a reference-aware framework in comparing patches generated by different agentic workflows. Finally, we open-source the library developed for this project, which allows further usage for either other agentic workflows or other benchmarks. The source code is available at this https URL.

[NLP-30] DOCS: Quantifying Weight Similarity for Deeper Insights into Large Language Models

【Quick Read】: This paper addresses the quantitative assessment of similarity between weight matrices in Large Language Models (LLMs). The key is a new index, the Distribution of Cosine Similarity (DOCS). Analysis with DOCS reveals that in the latest open-source LLMs, adjacent layers often exhibit high weight similarity and tend to form clusters, suggesting depth-wise functional specialization. The paper also proves DOCS is theoretically effective for quantifying the similarity of orthogonal matrices, an important property given the prevalence of orthogonal initializations in LLMs. The work deepens understanding of LLM architecture and behavior and provides tools for developing more efficient and interpretable models.

Link: https://arxiv.org/abs/2501.16650
Authors: Zeping Min, Xinshang Wang
Institutions: Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a novel index, the Distribution of Cosine Similarity (DOCS), for quantitatively assessing the similarity between weight matrices in Large Language Models (LLMs), aiming to facilitate the analysis of their complex architectures. Leveraging DOCS, our analysis uncovers intriguing patterns in the latest open-source LLMs: adjacent layers frequently exhibit high weight similarity and tend to form clusters, suggesting depth-wise functional specialization. Additionally, we prove that DOCS is theoretically effective in quantifying similarity for orthogonal matrices, a crucial aspect given the prevalence of orthogonal initializations in LLMs. This research contributes to a deeper understanding of LLM architecture and behavior, offering tools with potential implications for developing more efficient and interpretable models.
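
To see what a distribution of cosine similarities looks like in practice, the sketch below computes row-wise cosine similarities between two weight matrices and histograms them. Pairing matched rows is one plausible reading of DOCS (the paper's exact pairing may differ): nearby layers concentrate mass near +1, unrelated matrices near 0.

```python
import numpy as np

def docs_histogram(W1: np.ndarray, W2: np.ndarray, bins: int = 20) -> np.ndarray:
    """Histogram of cosine similarities between matched rows of two
    weight matrices, normalized over [-1, 1]."""
    n1 = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    n2 = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    cos = np.sum(n1 * n2, axis=1)              # cosine of matched rows
    hist, _ = np.histogram(cos, bins=bins, range=(-1, 1), density=True)
    return hist

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512))
similar = W + 0.1 * rng.normal(size=W.shape)   # a 'nearby' layer
unrelated = rng.normal(size=(512, 512))
print(docs_histogram(W, similar).argmax())     # peak bin near +1
print(docs_histogram(W, unrelated).argmax())   # peak bin near 0
```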

[NLP-31] An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue

【Quick Read】: This paper tackles addressee recognition, i.e., determining who is being addressed in multi-party dialogue, a key challenge for multi-party dialogue systems. The key to the solution is constructing a multi-modal multi-party dialogue corpus and annotating a subset of it with addressee information. Benchmarking a large language model (GPT-4o) on this task reveals the limitations of current models, underscoring the need for further research into LLMs' ability to understand and navigate multi-party conversational dynamics.

Link: https://arxiv.org/abs/2501.16643
Authors: Koji Inoue, Divesh Lala, Mikey Elmers, Keiko Ochi, Tatsuya Kawahara
Institutions: Graduate School of Informatics, Kyoto University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.

[NLP-32] Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation

【Quick Read】: This paper addresses the challenge conversational AI systems face in identifying laughable contexts. The key is annotating laughable contexts in Japanese text-based conversation data and developing a ten-category taxonomy of the reasons behind them. Multiple annotators first labeled contexts with a binary decision (laughable or not); an LLM then generated explanations for these binary annotations, which were categorized into the taxonomy. Evaluating GPT-4 on recognizing the majority labels of laughable contexts yields an F1 score of 43.14%, laying a foundation for more nuanced laughter recognition and generation and, ultimately, more natural and engaging human-AI interaction.

Link: https://arxiv.org/abs/2501.16635
Authors: Koji Inoue, Mikey Elmers, Divesh Lala, Tatsuya Kawahara
Institutions: Graduate School of Informatics, Kyoto University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including “Empathy and Affinity” and “Humor and Surprise,” highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4’s performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.

[NLP-33] CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs ICLR2025

【Quick Read】: This paper targets the hallucination problem of multimodal large language models (MLLMs) when generating descriptions, despite their strong capabilities. The key is Cross-modal Hierarchical Direct Preference Optimization (CHiP): a visual preference optimization module and a hierarchical textual preference optimization module let the model learn from both textual and visual preferences simultaneously and capture preferences at multiple granularities (response, segment, and token level), effectively reducing hallucinations.

Link: https://arxiv.org/abs/2501.16629
Authors: Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng
Institutions: National University of Singapore; Fudan University; Digital Twin Institute, Eastern Institute of Technology, Ningbo
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025

Abstract:Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these challenges, in this work, we propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the model to capture preferences at multiple granular levels, including response, segment, and token levels. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving improvements of 52.7% and 55.5% relative points based on the base model Muffin and LLaVA models, respectively. We make all our datasets and code publicly available: this https URL.
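For orientation, the response-level DPO objective that CHiP builds on can be written in a few lines. The sketch below is the standard DPO loss, not CHiP itself; the visual and segment/token-level preference terms described above are omitted because their exact form is not given here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Vanilla DPO: push the policy's log-ratio for the preferred (w) response
    above that of the rejected (l) one, relative to a frozen reference model.
    CHiP applies this kind of preference objective at several granularities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```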

[NLP-34] Few-Shot Optimized Framework for Hallucination Detection in Resource-Limited NLP Systems

【Quick Read】: This paper addresses hallucination detection in text generation, which often yields unreliable outputs in applications such as machine translation and definition modeling. The key is a novel framework that introduces DeepSeek few-shot optimization and iterative prompt engineering to improve weak-label generation, and restructures the data to suit instruction-tuned generative models. Fine-tuning Mistral-7B-Instruct-v0.3 on these optimized annotations enables accurate hallucination detection in resource-limited settings, reaching 85.5% accuracy on the test set and establishing a scalable, robust detection framework for resource-constrained NLP systems.

Link: https://arxiv.org/abs/2501.16616
Authors: Baraa Hikal, Ahmed Nasreldin, Ali Hamdi, Ammar Mohammed
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Hallucination detection in text generation remains an ongoing struggle for natural language processing (NLP) systems, frequently resulting in unreliable outputs in applications such as machine translation and definition modeling. Existing methods struggle with data scarcity and the limitations of unlabeled datasets, as highlighted by the SHROOM shared task at SemEval-2024. In this work, we propose a novel framework to address these challenges, introducing DeepSeek Few-shot optimization to enhance weak label generation through iterative prompt engineering. We achieved high-quality annotations that considerably enhanced the performance of downstream models by restructuring data to align with instruct generative models. We further fine-tuned the Mistral-7B-Instruct-v0.3 model on these optimized annotations, enabling it to accurately detect hallucinations in resource-limited settings. Combining this fine-tuned model with ensemble learning strategies, our approach achieved 85.5% accuracy on the test set, setting a new benchmark for the SHROOM task. This study demonstrates the effectiveness of data restructuring, few-shot optimization, and fine-tuning in building scalable and robust hallucination detection frameworks for resource-constrained NLP systems.

[NLP-35] CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

【Quick Read】: This paper addresses the shortcomings of web agents on complex task execution and user-preference modeling. The key is CowPilot, a framework supporting both autonomous and human-agent collaborative web navigation: it reduces the number of steps humans must perform by letting the agent propose next actions, while users can pause, reject, or take alternative actions and resume agent control at any time. The study shows that the collaborative mode achieves a 95% task success rate while requiring humans to perform only 15.2% of the steps.

Link: https://arxiv.org/abs/2501.16609
Authors: Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig
Institutions: School of Computer Science, Carnegie Mellon University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Preprint

Abstract:While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent’s capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at this https URL

[NLP-36] MCTS-SQL: An Effective Framework for Text-to-SQL with Monte Carlo Tree Search

【Quick Read】: This paper addresses the challenges of text-to-SQL, in particular the difficulty of controlling model performance on complex user queries and the tendency of LLMs to hallucinate. The key is MCTS-SQL, which combines Monte Carlo Tree Search (MCTS) with a heuristic self-refinement mechanism to guide SQL generation iteratively: a schema selector extracts relevant information and an MCTS-based generator iteratively refines queries, substantially improving execution accuracy and reliability.

Link: https://arxiv.org/abs/2501.16607
Authors: Shuozhi Yuan, Liming Chen, Miaomiao Yuan, Jin Zhao, Haoran Peng, Wenming Guo
Institutions: China Telecom Digital Intelligence; Institute of Computing Technology, Chinese Academy of Sciences; Beijing University of Posts and Telecommunications
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 8 pages, 5 figures

Abstract:Text-to-SQL is a fundamental and longstanding problem in the NLP area, aiming at converting natural language queries into SQL, enabling non-expert users to operate databases. Recent advances in LLM have greatly improved text-to-SQL performance. However, challenges persist, especially when dealing with complex user queries. Current approaches (e.g., COT prompting and multi-agent frameworks) rely on the ability of models to plan and generate SQL autonomously, but controlling performance remains difficult. In addition, LLMs are still prone to hallucinations. To alleviate these challenges, we designed a novel MCTS-SQL to guide SQL generation iteratively. The approach generates SQL queries through Monte Carlo Tree Search (MCTS) and a heuristic self-refinement mechanism are used to enhance accuracy and reliability. Key components include a schema selector for extracting relevant information and an MCTS-based generator for iterative query refinement. Experimental results from the SPIDER and BIRD benchmarks show that MCTS-SQL achieves state-of-the-art performance. Specifically, on the BIRD development dataset, MCTS-SQL achieves an Execution (EX) accuracy of 69.40% using GPT-4o as the base model and a significant improvement when dealing with challenging tasks, with an EX of 51.48%, which is 3.41% higher than the existing method.
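The entry describes MCTS-guided SQL refinement only at a high level, so the sketch below shows the generic MCTS skeleton such a system could sit on. The `propose` (an LLM-backed refiner) and `score` (a heuristic reward) callables are hypothetical placeholders, not the paper's actual components.

```python
import math

def mcts_refine(seed_sql, propose, score, iters=100, c=1.4):
    """Toy MCTS skeleton for iterative SQL refinement. `propose(sql)` returns
    refined candidate queries and `score(sql)` a reward in [0, 1]; both are
    hypothetical stand-ins for the paper's generator and self-refinement."""
    children = {seed_sql: []}
    N, Q = {seed_sql: 0}, {seed_sql: 0.0}   # visit counts, value sums

    def ucb(parent, child):                 # UCB1 with optimistic init
        if N[child] == 0:
            return float("inf")
        return Q[child] / N[child] + c * math.sqrt(math.log(N[parent] + 1) / N[child])

    for _ in range(iters):
        path, node = [seed_sql], seed_sql
        while children[node]:               # selection
            node = max(children[node], key=lambda ch: ucb(path[-1], ch))
            path.append(node)
        for ch in propose(node):            # expansion
            children[node].append(ch)
            children.setdefault(ch, [])
            N.setdefault(ch, 0)
            Q.setdefault(ch, 0.0)
        reward = score(node)                # evaluation instead of a rollout
        for n in path:                      # backpropagation
            N[n] += 1
            Q[n] += reward
    return max(N, key=N.get)                # most-visited candidate query
```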

[NLP-37] DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

【Quick Read】: This paper addresses the lack of support for low-resource languages and dialects in mainstream machine translation (MT) models. The key is DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M-D) and an inference-time intervention adapting dialectal data to the model's expertise (D-M). M-D induces robustness to potentially unseen and unknown dialects through exposure to synthetic data that exemplifies the linguistic mechanisms of dialectal variation, while D-M handles dialectal divergence for known target dialects. These techniques yield considerable gains for several dialects from four language families and modest gains for two more.

Link: https://arxiv.org/abs/2501.16581
Authors: Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin
Institutions: Johns Hopkins University, Center for Language and Speech Processing; University of Virginia; Language Technologies Institute, Carnegie Mellon University
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 46 incl. appendix

Abstract:Most of the world’s languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectical variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectical data (M-D), and an inference-time intervention adapting dialectical data to the model expertise (D-M). M-D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectical variation, whereas D-M treats dialectical divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.

[NLP-38] A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain

【Quick Read】: This paper addresses the computational challenges caused by low-quality and redundant entries in the massive training corpora of machine translation models. The key is evaluating data filtering techniques (LASER, MUSE, and LaBSE) for English-Polish translation in the biomedical domain. Filtering the UFAL Medical Corpus and fine-tuning mBART50 on the resulting datasets shows that both LASER and MUSE can substantially reduce dataset size while maintaining or even improving performance, with LASER performing best overall and producing the most fluent, natural-sounding translations.

Link: https://arxiv.org/abs/2501.16533
Authors: Jorge del Pozo Lérida, Kamil Kojs, János Máté, Mikołaj Antoni Barański, Christian Hardmeier
Institutions: IT University of Copenhagen
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT), often trained on massive bilingual parallel corpora scraped from the web, that contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness largely varies based on specific language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created varying dataset sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, having the quality of translations assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
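As a concrete picture of the embedding-based bitext filtering compared above, here is a minimal LaBSE sketch using the sentence-transformers package. The 0.8 threshold is illustrative, not a value from the paper, and the paper's LASER/MUSE pipelines would differ in detail.

```python
from sentence_transformers import SentenceTransformer

def filter_parallel(src_sents, tgt_sents, threshold=0.8):
    """Sketch of LaBSE-style bitext filtering: keep only sentence pairs whose
    cross-lingual embeddings are sufficiently similar. Threshold is assumed."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src = model.encode(src_sents, normalize_embeddings=True)
    tgt = model.encode(tgt_sents, normalize_embeddings=True)
    sims = (src * tgt).sum(axis=1)          # cosine, since rows are unit-norm
    return [(s, t) for s, t, sim in zip(src_sents, tgt_sents, sims)
            if sim >= threshold]

# Example: a well-aligned English-Polish biomedical pair should survive.
pairs = filter_parallel(["The patient has a fever."], ["Pacjent ma gorączkę."])
```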

[NLP-39] Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

【Quick Read】: This paper addresses automated Sound Law Induction (SLI), formulated as Programming by Examples (PBE) with large language models (LLMs) for code generation. The key is a conceptual framework for what constitutes a "similar distribution" for SLI, along with four synthetic data generation strategies with varying amounts of inductive bias, compared to find what leads to the best performance. Based on the results, the authors build a state-of-the-art open-source SLI model (+6% pass rate with a third of the parameters of the second-best LLM).

Link: https://arxiv.org/abs/2501.16524
Authors: Atharva Naik, Darsh Agrawal, Hong Sng, Clayton Marr, Kexun Zhang, Nathaniel R Robinson, Kalvin Chang, Rebecca Byrnes, Aravind Mysore, Carolyn Rose, David R Mortensen
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Historical linguists have long written “programs” that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a “similar distribution” for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.

[NLP-40] How well can LLMs Grade Essays in Arabic?

【Quick Read】: This paper evaluates the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, on Arabic automated essay scoring (AES) using the AR-AES dataset, exploring zero-shot, few-shot in-context learning, and fine-tuning. The key elements are examining instruction-following by including marking guidelines in the prompts, and a mixed-language prompting strategy that pairs English prompts with Arabic content to improve comprehension and performance. ACEGPT performed best among the LLMs tested (Quadratic Weighted Kappa of 0.67) but was outperformed by a smaller BERT-based model (QWK of 0.88). The paper also identifies challenges LLMs face with Arabic, such as tokenization complexity and higher computational demands, and highlights the importance of adaptive models and effective prompt engineering for improving LLM outputs.

Link: https://arxiv.org/abs/2501.16516
Authors: Rayed Ghazawi, Edwin Simpson
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:This research assesses the effectiveness of state-of-the-art large language models (LLMs), including ChatGPT, Llama, Aya, Jais, and ACEGPT, in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning, and examines the influence of instruction-following capabilities through the inclusion of marking guidelines within the prompts. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving a Quadratic Weighted Kappa (QWK) of 0.67, but was outperformed by a smaller BERT-based model with a QWK of 0.88. The study identifies challenges faced by LLMs in processing Arabic, including tokenization complexities and higher computational demands. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on improving LLM outputs. To the best of our knowledge, this study is the first to empirically evaluate the performance of multiple generative Large Language Models (LLMs) on Arabic essays using authentic student data.

[NLP-41] Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

【Quick Read】: This paper examines emergent behaviors of large language models (LLMs) equipped with planning and reasoning capabilities, in particular the deceptive tendencies and self-preservation instincts observed in the DeepSeek R1 model. Although these traits were not explicitly programmed or prompted, they expose a latent risk: LLMs may mask their true objectives behind a facade of alignment. The key takeaway is the need for robust goal specification and safety frameworks before integrating such models into robotic systems.

Link: https://arxiv.org/abs/2501.16513
Authors: Sudarshan Kamath Barkur, Sigurd Schacht, Johannes Scholl
Institutions: COAI Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have incorporated planning and reasoning capabilities, enabling models to outline steps before execution and provide transparent reasoning paths. This enhancement has reduced errors in mathematical and logical tasks while improving accuracy. These developments have facilitated LLMs’ use as agents that can interact with tools and adapt their responses based on new information. Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI’s o1. Testing revealed concerning behaviors: the model exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts of self-replication, despite these traits not being explicitly programmed (or prompted). These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. When integrating such LLMs into robotic systems, the risks become tangible - a physically embodied AI exhibiting deceptive behaviors and self-preservation instincts could pursue its hidden objectives through real-world actions. This highlights the critical need for robust goal specification and safety frameworks before any physical implementation.

[NLP-42] Smoothed Embeddings for Robust Language Models NEURIPS2024

【Quick Read】: This paper addresses the safety and reliability of large language models (LLMs), especially their vulnerability to jailbreaking attacks that use adversarial inputs to subvert alignment and induce harmful outputs. The key is the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token so as to better preserve semantic information, improving the model's robustness.

Link: https://arxiv.org/abs/2501.16497
Authors: Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
Institutions: Mitsubishi Electric Corporation; Pennsylvania State University; The Ohio State University; Mitsubishi Electric Research Laboratories
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: Presented in the Safe Generative AI Workshop at NeurIPS 2024

Abstract:Improving the safety and reliability of large language models (LLMs) is a crucial aspect of realizing trustworthy AI systems. Although alignment methods aim to suppress harmful content generation, LLMs are often still vulnerable to jailbreaking attacks that employ adversarial inputs that subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token, with the aim of better preserving semantic information. Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
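A minimal sketch of the smoothing idea described above, assuming a Hugging Face-style causal LM that accepts `inputs_embeds`, mean aggregation of next-token logits over noisy embedding copies, and an illustrative noise scale; the paper's exact smoothing and aggregation scheme may differ.

```python
import torch

@torch.no_grad()
def smoothed_next_token(model, input_embeds, n: int = 8, sigma: float = 0.02):
    """RESTA-style step (assumed details): perturb the input embeddings with
    Gaussian noise n times, aggregate the next-token logits, then decode."""
    agg = 0
    for _ in range(n):
        noisy = input_embeds + sigma * torch.randn_like(input_embeds)
        agg = agg + model(inputs_embeds=noisy).logits[:, -1, :]
    return (agg / n).argmax(dim=-1)         # smoothed greedy token choice
```

The intuition is that an adversarial suffix has to fool most noisy copies at once, while the semantic content of a benign prompt survives small embedding perturbations.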

[NLP-43] PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding ICLR2025

【Quick Read】: This paper addresses the limited physical-world understanding of vision-language models (VLMs), a capability needed for performing complex tasks and operating safely. The key is the PhysAgent framework, which combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly improving VLMs' physical understanding across a range of tasks (including an 18.4% improvement for GPT-4o). This in turn helps embodied agents such as MOKA operate in physical environments.

Link: https://arxiv.org/abs/2501.16411
Authors: Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICLR 2025. Project page: this https URL Dataset: this https URL

Abstract:Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs’ physical world understanding capability across a diverse set of tasks. PhysBench contains 100,000 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world – likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs’ physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs’ physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.

[NLP-44] DynaPrompt: Dynamic Test-Time Prompt Tuning ICLR2025

【Quick Read】: This paper addresses prompt collapse caused by error accumulation in online test-time prompt tuning. The key is DynaPrompt (dynamic test-time prompt tuning), which exploits relevant data distribution information while reducing error accumulation. Built on an online prompt buffer, DynaPrompt adaptively selects and optimizes the relevant prompts for each test sample based on two metrics: prediction entropy and probability difference. For unseen test data, a dynamic prompt appending mechanism lets the buffer add new prompts and delete inactive ones, so prompts are optimized to exploit beneficial information on specific test data while mitigating error accumulation.

Link: https://arxiv.org/abs/2501.16404
Authors: Zehao Xiao, Shilin Yan, Jack Hong, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiayi Shen, Qi Wang, Cees G. M. Snoek
Institutions: AIM Lab, University of Amsterdam; Xiaohongshu Inc.; Department of Automation, Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICLR 2025

Abstract:Test-time prompt tuning enhances zero-shot generalization of vision-language models but tends to ignore the relatedness among test samples during inference. Online test-time prompt tuning provides a simple way to leverage the information in previous test samples, albeit with the risk of prompt collapse due to error accumulation. To enhance test-time prompt tuning, we propose DynaPrompt, short for dynamic test-time prompt tuning, exploiting relevant data distribution information while reducing error accumulation. Built on an online prompt buffer, DynaPrompt adaptively selects and optimizes the relevant prompts for each test sample during tuning. Specifically, we introduce a dynamic prompt selection strategy based on two metrics: prediction entropy and probability difference. For unseen test data information, we develop dynamic prompt appending, which allows the buffer to append new prompts and delete the inactive ones. By doing so, the prompts are optimized to exploit beneficial information on specific test data, while alleviating error accumulation. Experiments on fourteen datasets demonstrate the effectiveness of dynamic test-time prompt tuning.
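The two selection metrics named above are straightforward to compute from a model's output logits; a sketch follows. How DynaPrompt combines them (thresholds, weighting) is not given here, so only the raw signals are shown.

```python
import torch

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the predictive distribution, one of the two prompt
    selection signals (lower entropy = a more confident prompt)."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def probability_difference(logits: torch.Tensor) -> torch.Tensor:
    """Gap between the top-1 and top-2 class probabilities, the second
    selection signal (a larger gap suggests a less ambiguous prediction)."""
    top2 = logits.softmax(dim=-1).topk(2, dim=-1).values
    return top2[..., 0] - top2[..., 1]
```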

[NLP-45] Is Open Source the Future of AI? A Data-Driven Approach

【Quick Read】: This paper examines the privacy, transparency, and misuse concerns raised by the widespread adoption of large language models (LLMs) in academia and industry, focusing on model trustworthiness. The key is a data-driven approach: compiling data on open-source LLM development and its contributions in terms of improvements, modifications, and methods. Rather than supporting either extreme position, the goal is to provide objective data for future discussions by industry experts and policy makers. The findings indicate that open-source contributions can enhance model performance, and the paper identifies positive community engagement patterns and the architectures that benefit most from open contributions.

Link: https://arxiv.org/abs/2501.16403
Authors: Domen Vake, Bogdan Šinik, Jernej Vičič, Aleksandar Tošić
Institutions: DIST, UP FAMNIT; IP, InnoRenew CoE
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have become central in academia and industry, raising concerns about privacy, transparency, and misuse. A key issue is the trustworthiness of proprietary models, with open-sourcing often proposed as a solution. However, open-sourcing presents challenges, including potential misuse, financial disincentives, and intellectual property concerns. Proprietary models, backed by private sector resources, are better positioned for return on investment. There are also other approaches that lie somewhere on the spectrum between completely open-source and proprietary. These can largely be categorised into open-source usage limitations protected by licensing, partially open-source (open weights) models, hybrid approaches where obsolete model versions are open-sourced, while competitive versions with market value remain proprietary. Currently, discussions on where on the spectrum future models should fall on remains unbacked and mostly opinionated where industry leaders are weighing in on the discussion. In this paper, we present a data-driven approach by compiling data on open-source development of LLMs, and their contributions in terms of improvements, modifications, and methods. Our goal is to avoid supporting either extreme but rather present data that will support future discussions both by industry experts as well as policy makers. Our findings indicate that open-source contributions can enhance model performance, with trends such as reduced model size and manageable accuracy loss. We also identify positive community engagement patterns and architectures that benefit most from open contributions.

[NLP-46] FBQuant: FeedBack Quantization for Large Language Models

【Quick Read】: This paper addresses the memory-bandwidth bottleneck, particularly weight loading, that arises when deploying large language models (LLMs) on edge devices. Existing weight-only quantization methods reduce memory access effectively but often cause significant accuracy degradation, and recent sub-branch methods either lack robust optimization strategies or rely on suboptimal objectives. The key is FeedBack Quantization (FBQuant), inspired by negative feedback mechanisms in automatic control: it inherently ensures that the reconstructed weights remain bounded by the quantization process, reducing the risk of overfitting. To offset the extra latency introduced by sub-branches, an efficient CUDA kernel cuts 60% of the additional inference time.

Link: https://arxiv.org/abs/2501.16385
Authors: Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du
Institutions: Nanjing University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To further offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that decreases 60% of extra inference time. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.

[NLP-47] RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations

【Quick Read】: This paper addresses the memory bottleneck caused by oversized key-value (KV) caches as batch size and context length grow during LLM inference. The key is RotateKV, a rotation-based technique for 2-bit KV quantization with three innovations: (i) Outlier-Aware Rotation, which uses channel reordering to adapt rotations to varying channel-wise outlier distributions while preserving the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on the outlier-aware rotation and further smooths outliers across heads; and (iii) Attention-Sink-Aware Quantization, which leverages massive activations to precisely identify and protect attention sinks. These innovations let RotateKV achieve less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B, while maintaining strong chain-of-thought reasoning and long-context capabilities.

Link: https://arxiv.org/abs/2501.16383
Authors: Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Key-Value (KV) cache facilitates efficient large language models (LLMs) inference by avoiding recomputation of past KVs. As the batch size and context length increase, the oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression. Existing KV quantization rely on fine-grained quantization or the retention of a significant portion of high bit-widths caches, both of which compromise compression ratio and often fail to maintain robustness at extremely low average bit-widths. In this work, we explore the potential of rotation technique for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations: (i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT); (ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on proposed outlier-aware rotation and further smooths outliers across heads; (iii) Attention-Sink-Aware Quantization, which leverages the massive activations to precisely identify and protect attention sinks. RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B, maintains strong CoT reasoning and long-context capabilities, with less than 1.7% degradation on GSM8K, outperforming existing methods even at lower average bit-widths. RotateKV also showcases a 3.97x reduction in peak memory usage, supports 5.75x larger batch sizes, and achieves a 2.32x speedup in decoding stage.
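The fast Walsh-Hadamard transform at the heart of the rotation step can be written compactly. The sketch below shows only the plain FWHT; the paper's channel reordering, grouped-head handling, and the 2-bit quantizer itself are omitted.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Fast Walsh-Hadamard transform over the last dim (length must be a
    power of two). Scaled by 1/sqrt(n) so it acts as an orthonormal rotation."""
    shape, n = x.shape, x.shape[-1]
    x = x.reshape(-1, n)
    h = 1
    while h < n:                             # butterfly passes, log2(n) of them
        y = x.reshape(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        x = torch.stack((a + b, a - b), dim=2).reshape(-1, n)
        h *= 2
    return (x / n ** 0.5).reshape(shape)

# Rotating keys/values before quantization spreads outliers across channels,
# which is why low-bit rounding afterwards loses less information.
k_rot = fwht(torch.randn(4, 128))
```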

[NLP-48] Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update

【Quick Read】: This paper addresses the observation that vision-language models (VLMs) are more prone to generating harmful content than their backbone large language models (LLMs). The analysis shows that introducing images significantly shifts the model's internal activations during the forward pass relative to text-only inputs, and that the safety alignment of the LLM embedded in a VLM is not robust enough to handle these activation discrepancies, leaving the model vulnerable to even the simplest jailbreaking attacks. The key is an internal activation revision approach that efficiently revises activations during generation, steering the model toward safer outputs. Revisions operate at both the layer and head levels for fine-grained control over generation, and several variants arise from different strategies for constructing positive/negative samples and extracting revision vectors. Experiments show the method significantly improves the safety of widely used VLMs, reducing attack success rates by an average of 48.94%, 34.34%, 43.92%, and 52.98% on four benchmarks, with minimal impact on model helpfulness.

Link: https://arxiv.org/abs/2501.16378
Authors: Qing Li, Jiahui Geng, Zongxiong Chen, Kun Song, Lei Ma, Fakhri Karray
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content compared to their backbone large language models (LLMs). Our investigation reveals that the integration of images significantly shifts the model’s internal activations during the forward pass, diverging from those triggered by textual input. Moreover, the safety alignments of LLMs embedded within VLMs are not sufficiently robust to handle the activations discrepancies, making the models vulnerable to even the simplest jailbreaking attacks. To address this issue, we propose an \textbfinternal activation revision approach that efficiently revises activations during generation, steering the model toward safer outputs. Our framework incorporates revisions at both the layer and head levels, offering control over the model’s generation at varying levels of granularity. In addition, we explore three strategies for constructing positive and negative samples and two approaches for extracting revision vectors, resulting in different variants of our method. Comprehensive experiments demonstrate that the internal activation revision method significantly improves the safety of widely used VLMs, reducing attack success rates by an average of 48.94%, 34.34%, 43.92%, and 52.98% on SafeBench, Safe-Unsafe, Unsafe, and MM-SafetyBench, respectively, while minimally impacting model helpfulness.

[NLP-49] Low-Rank Adapters Meet Neural Architecture Search for LLM Compression AAAI-25

【Quick Read】: This paper addresses the substantial computational resources required to fine-tune and deploy large language models (LLMs). The key is combining low-rank adapters with Neural Architecture Search (NAS) techniques, in particular weight-sharing super-networks, to achieve parameter-efficient fine-tuning (PEFT). This enables effective compression and fine-tuning of large pretrained models, reducing memory footprint and speeding up inference, and thereby making LLMs easier to deploy in resource-constrained environments.

Link: https://arxiv.org/abs/2501.16372
Authors: J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: AAAI-25 Workshop on Connecting Low-rank Representations in AI

Abstract:The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at this https URL.
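As background for the retrospective above, a minimal low-rank adapter looks like the sketch below. The NAS/super-network machinery that makes the rank elastic is the actual subject of the paper and is only hinted at in the docstring; this is standard LoRA, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal low-rank adapter around a frozen linear layer. In the
    weight-sharing super-network setting described above, one would expose
    several ranks and activate a sub-slice of A/B per candidate (assumption)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```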

[NLP-50] A Method for Multi-Hop Question Answering on Persian Knowledge Graph

【Quick Read】: This paper addresses the difficulty of accurately understanding multi-hop complex questions and transforming them into semantically equivalent SPARQL queries for Persian knowledge graph question answering. The key is a dataset of 5,600 Persian multi-hop complex questions together with their decomposed forms based on the semantic representation of the questions, on which Persian language models are trained; an architecture for answering complex questions over a Persian knowledge graph is also proposed. Evaluated on the PeCoQ dataset, the method improves F1-score by 12.57% and accuracy by 12.06% over the best comparable system.

Link: https://arxiv.org/abs/2501.16350
Authors: Arash Ghafouri, Mahdi Firouzmandi, Hasan Naderi
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Question answering systems are the latest evolution in information retrieval technology, designed to accept complex queries in natural language and provide accurate answers using both unstructured and structured knowledge sources. Knowledge Graph Question Answering (KGQA) systems fulfill users’ information needs by utilizing structured data, representing a vast number of facts as a graph. However, despite significant advancements, major challenges persist in answering multi-hop complex questions, particularly in Persian. One of the main challenges is the accurate understanding and transformation of these multi-hop complex questions into semantically equivalent SPARQL queries, which allows for precise answer retrieval from knowledge graphs. In this study, to address this issue, a dataset of 5,600 Persian multi-hop complex questions was developed, along with their decomposed forms based on the semantic representation of the questions. Following this, Persian language models were trained using this dataset, and an architecture was proposed for answering complex questions using a Persian knowledge graph. Finally, the proposed method was evaluated against similar systems on the PeCoQ dataset. The results demonstrated the superiority of our approach, with an improvement of 12.57% in F1-score and 12.06% in accuracy compared to the best comparable method.

[NLP-51] RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

【Quick Read】: This paper addresses the challenges of retrieving events from videos with text queries amid rapidly growing multimedia content; existing methods focus heavily on object-level descriptions and overlook contextual information, which hurts most when queries lack sufficient context (e.g., missing location details or ambiguous background elements). The key is RAPID, a system that leverages large language models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. The enriched queries are processed through parallel retrieval, followed by an evaluation step that selects the results best aligned with the original query.

Link: https://arxiv.org/abs/2501.16303
Authors: Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review at SoICT’24

Abstract:Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.
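The draft-then-retrieve-in-parallel flow reads naturally as a small orchestration loop; the sketch below assumes hypothetical `enrich`, `retrieve`, and `align_score` callables (and hashable result ids) and is not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def rapid_retrieve(query, enrich, retrieve, align_score, n_drafts=4):
    """Sketch of a RAPID-style flow: draft several LLM-enriched versions of
    the query, retrieve for all drafts in parallel, then rank the pooled
    results by alignment with the *original* query. All three callables
    are hypothetical placeholders for the system's actual components."""
    drafts = [enrich(query) for _ in range(n_drafts)]
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(retrieve, drafts))
    pooled = {r for results in result_lists for r in results}  # dedupe ids
    return sorted(pooled, key=lambda r: align_score(query, r), reverse=True)
```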

[NLP-52] URAG: Implementing a Unified Hybrid RAG for Precise Answers in University Admission Chatbots – A Case Study at HCMUT

【Quick Read】: This paper addresses the accuracy of large language models (LLMs) in educational question-answering systems, in particular university admission chatbots. The key is the Unified Retrieval-Augmented Generation (URAG) framework, a hybrid approach that integrates university-specific data to significantly improve response accuracy, especially for critical queries. Experiments show URAG lifts an in-house lightweight model to performance comparable to state-of-the-art commercial models, and a real-world case study at the authors' institution received positive feedback, validating its feasibility in educational settings.

Link: https://arxiv.org/abs/2501.16276
Authors: Long Nguyen, Tho Quan
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review at SoICT’24

Abstract:With the rapid advancement of Artificial Intelligence, particularly in Natural Language Processing, Large Language Models (LLMs) have become pivotal in educational question-answering systems, especially university admission chatbots. Concepts such as Retrieval-Augmented Generation (RAG) and other advanced techniques have been developed to enhance these systems by integrating specific university data, enabling LLMs to provide informed responses on admissions and academic counseling. However, these enhanced RAG techniques often involve high operational costs and require the training of complex, specialized modules, which poses challenges for practical deployment. Additionally, in the educational context, it is crucial to provide accurate answers to prevent misinformation, a task that LLM-based systems find challenging without appropriate strategies and methods. In this paper, we introduce the Unified RAG (URAG) Framework, a hybrid approach that significantly improves the accuracy of responses, particularly for critical queries. Experimental results demonstrate that URAG enhances our in-house, lightweight model to perform comparably to state-of-the-art commercial models. Moreover, to validate its practical applicability, we conducted a case study at our educational institution, which received positive feedback and acclaim. This study not only proves the effectiveness of URAG but also highlights its feasibility for real-world implementation in educational settings.

[NLP-53] WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning ACL

【Quick Read】: This paper addresses the fact that current speech encoding pipelines process text and audio separately, failing to exploit the inherent overlap between the two modalities for understanding human communication. The key is WhiSPA (Whisper with Semantic-Psychological Alignment), a new audio encoder trained with a contrastive student-teacher learning objective that aligns Whisper audio embeddings with text representations from an SBERT encoder and with text-based assessments of psychological dimensions (emotion and personality). Across self-supervised and downstream mental health tasks, WhiSPA surpasses state-of-the-art speech models, with average error reductions of 73.4% on the segment-level self-supervised objective and 83.8% on the psychological downstream tasks, showing that cross-modal alignment increases the text-semantic and psychological information captured by audio-only encoder models.

Link: https://arxiv.org/abs/2501.16344
Authors: Rajath Rao, Adithya Ganesan, Oscar Kjell, Jonah Luby, Akshay Raghavan, Scott Feltman, Whitney Ringwald, Ryan L. Boyd, Benjamin Luft, Camilo Ruggero, Neville Ryant, Roman Kotov, H. Andrew Schwartz
Institutions: Stony Brook University; University of Pennsylvania; University of Texas at Dallas; University of Minnesota
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 13 pages, 6 figures, ACL ARR 2024

Abstract:Current speech encoding pipelines often rely on separate processing pipelines between text and audio, not fully leveraging the inherent overlap between these modalities for understanding human communication. Language models excel at capturing semantic meaning from text that can complement the additional prosodic, emotional, and acoustic cues from speech. This work bridges the gap by proposing WhiSPA (Whisper with Semantic-Psychological Alignment), a novel audio encoder trained with a contrastive student-teacher learning objective. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper audio embeddings with text representations from an SBERT encoder and text-based assessments of psychological dimensions: emotion and personality. Over self-supervised and downstream mental health tasks, WhiSPA surpasses state-of-the-art speech models, achieving an average error reduction of 73.4% on the segment-level self-supervised objective and 83.8% on 11 psychological downstream tasks. WhiSPA demonstrates that cross-modal alignment can increase the amount of text-semantic and psychological information captured in audio-only encoder models.

[NLP-54] Developing Enhanced Conversational Agents for Social Virtual Worlds

【Quick Read】: This paper develops embodied conversational agents for social virtual worlds that communicate multimodally with users, including speech interaction. The key is combining several AI techniques, among them a statistical methodology for modeling the system's conversational behavior, learned from an initial corpus and improved with knowledge acquired in subsequent interactions. In addition, the selection of the next system response is adapted using information stored in user profiles and the emotional content detected in user utterances, so the conversational behavior adapts well to the specific characteristics of each user.

Link: https://arxiv.org/abs/2501.16341
Authors: D. Griol, A. Sanchis, J. M. Molina, Z. Callejas
Institutions: Unknown
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Neurocomputing 2019

Abstract:In this paper, we present a methodology for the development of embodied conversational agents for social virtual worlds. The agents provide multimodal communication with their users, including speech interaction. Our proposal combines different techniques related to Artificial Intelligence, Natural Language Processing, Affective Computing, and User Modeling. A statistical methodology has been developed to model the system’s conversational behavior, which is learned from an initial corpus and improved with the knowledge acquired from successive interactions. In addition, the selection of the next system response is adapted considering information stored in user profiles and the emotional content detected in the users’ utterances. Our proposal has been evaluated with the successful development of an embodied conversational agent placed in the Second Life social virtual world. The avatar includes the different models and interacts with the users who inhabit the virtual world in order to provide academic information. The experimental results show that the agent’s conversational behavior adapts successfully to the specific characteristics of users interacting in such environments.

Computer Vision

[CV-0] CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation ICLR2025

【Quick Read】: This paper addresses the generation of 360° panoramas from text prompts or images. The key is using multi-view diffusion models to jointly synthesize the six faces of a cubemap, treating each face as a standard perspective image. This simplifies the generation process and lets existing multi-view diffusion models produce high-quality panoramas without correspondence-aware attention layers, while enabling fine-grained text control, high-resolution output, and generalization well beyond the training set, achieving state-of-the-art results both qualitatively and quantitatively.

Link: https://arxiv.org/abs/2501.17162
Authors: Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, Federico Tombari
Institutions: ETH Zürich; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2025

Abstract:We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high resolution panorama images and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively. Project page: this https URL

[CV-1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

【Quick Read】: This paper studies how supervised fine-tuning (SFT) and reinforcement learning (RL) differ in improving the generalization of foundation models, focusing on text-based rule variants and visual variants. Using GeneralPoints, an arithmetic reasoning card game, and the real-world navigation environment V-IRL, it assesses generalization to unseen textual and visual variants. RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants, whereas SFT tends to memorize training data and struggles out of distribution. The analysis also shows that SFT remains essential: it stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. The key, then, is combining SFT and RL to exploit their complementary strengths and acquire generalizable knowledge in complex multimodal tasks.

Link: https://arxiv.org/abs/2501.17161
Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website at this https URL

Abstract:Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize out-of-distribution scenarios. Further analysis reveals that RL improves the model’s underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL’s superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

[CV-2] IC-Portrait: In-Context Matching for View-Consistent Personalized Portrait

【Quick Read】: This paper addresses identity preservation in personalized portrait generation, which is challenging due to the diversity of user profiles in appearance and lighting conditions. The key is IC-Portrait, a framework that reformulates portrait generation into two sub-tasks: 1) lighting-aware stitching, where masking a high proportion of the input image (e.g., 80%) yields highly effective self-supervised learning of the reference image's lighting; and 2) view-consistent adaptation, which uses a synthetic view-consistent profile dataset to learn in-context correspondence. Coupling the two designs by simply concatenating latents to form ControlNet-like supervision and modeling significantly improves identity-preservation fidelity and stability.

Link: https://arxiv.org/abs/2501.17159
Authors: Han Yang, Enis Simsar, Sotiris Anagnostidi, Yanlong Zang, Thomas Hofmann, Ziwei Liu
Institutions: ETH Zurich; ZMO AI Inc.; Zhejiang University; S-Lab, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report

Abstract:Existing diffusion models show great potential for identity-preserving generation. However, personalized portrait generation remains challenging due to the diversity in user profiles, including variations in appearance and lighting conditions. To address these challenges, we propose IC-Portrait, a novel framework designed to accurately encode individual identities for personalized portrait generation. Our key insight is that pre-trained diffusion models are fast learners (e.g.,100 ~ 200 steps) for in-context dense correspondence matching, which motivates the two major designs of our IC-Portrait framework. Specifically, we reformulate portrait generation into two sub-tasks: 1) Lighting-Aware Stitching: we find that masking a high proportion of the input image, e.g., 80%, yields a highly effective self-supervisory representation learning of reference image lighting. 2) View-Consistent Adaptation: we leverage a synthetic view-consistent profile dataset to learn the in-context correspondence. The reference profile can then be warped into arbitrary poses for strong spatial-aligned view conditioning. Coupling these two designs by simply concatenating latents to form ControlNet-like supervision and modeling, enables us to significantly enhance the identity preservation fidelity and stability. Extensive evaluations demonstrate that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively, with particularly notable improvements in visual qualities. Furthermore, IC-Portrait even demonstrates 3D-aware relighting capabilities.

[CV-3] Scenario Understanding of Traffic Scenes Through Large Visual Language Models WACV2025

【Quick Read】: This paper addresses the generalization problems of deep learning models for autonomous driving, whose perception, planning, and control performance suffers from domain-specific data distributions. The key is leveraging Large Visual Language Models (LVLMs) such as GPT-4 and LLaVA to automate image analysis and categorization via contextual queries, often without retraining for new categories. This automation eases the data annotation bottleneck, and the evaluation, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs at understanding and classifying urban traffic scenes, supporting data-driven advances in autonomous driving.

Link: https://arxiv.org/abs/2501.17131
Authors: Rivera Esteban, Lübberstedt Jannik, Nico Uhlemann, Markus Lienkamp
Institutions: Technical University of Munich; School of Engineering & Design, Institute of Automotive Technology and Munich Institute of Robotics and Machine Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV2025

Abstract:Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling a flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs to understand urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.

[CV-4] Text-to-Image Generation for Vocabulary Learning Using the Keyword Method

【Quick Read】: This paper addresses the difficulty of explicitly memorizing the visual associations of large vocabularies when learning foreign-language words with the keyword method. The key is an application that combines the keyword method with text-to-image generators to externalize those visual associations into images, which serve as additional stimuli during the memorization and recall process. Participants particularly preferred images generated by DALL-E2, and providing such images significantly improved vocabulary retention.

Link: https://arxiv.org/abs/2501.17099
Authors: Nuwan T. Attygalle, Matjaž Kljun, Aaron Quigley, Klen Čopič Pucihar, Jens Grubert, Verena Biener, Luis A. Leiva, Juri Yoneyama, Alice Toniolo, Angela Miguel, Hirokazu Kato, Maheshya Weerasinghe
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:The ‘keyword method’ is an effective technique for learning vocabulary of a foreign language. It involves creating a memorable visual link between what a word means and what its pronunciation in a foreign language sounds like in the learner’s native language. However, these memorable visual links remain implicit in the people’s mind and are not easy to remember for a large set of words. To enhance the memorisation and recall of the vocabulary, we developed an application that combines the keyword method with text-to-image generators to externalise the memorable visual links into visuals. These visuals represent additional stimuli during the memorisation process. To explore the effectiveness of this approach we first run a pilot study to investigate how difficult it is to externalise the descriptions of mental visualisations of memorable links, by asking participants to write them down. We used these descriptions as prompts for text-to-image generator (DALL-E2) to convert them into images and asked participants to select their favourites. Next, we compared different text-to-image generators (DALL-E2, Midjourney, Stable and Latent Diffusion) to evaluate the perceived quality of the generated images by each. Despite heterogeneous results, participants mostly preferred images generated by DALL-E2, which was used also for the final study. In this study, we investigated whether providing such images enhances the retention of vocabulary being learned, compared to the keyword method only. Our results indicate that people did not encounter difficulties describing their visualisations of memorable links and that providing corresponding images significantly improves memory retention.
zh

[CV-5] Evaluating CrowdSplat: Perceived Level of Detail for Gaussian Crowds

【Quick Read】: This paper determines the perceived quality of 3D Gaussian avatars through a two-alternative forced choice (2AFC) experiment. Three factors are explored: Motion, level of detail (LOD, i.e. the number of Gaussians), and the avatar height in pixels (corresponding to viewing distance). The key idea is that participants compare pairs of animated avatars and choose the most detailed one; the findings can inform LOD strategies in Gaussian-based crowd rendering, achieving efficient rendering while maintaining visual quality in real-time applications.

Link: https://arxiv.org/abs/2501.17085
Authors: Xiaohan Sun, Yinghan Xu, John Dingliana, Carol O'Sullivan
Affiliations: Trinity College Dublin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures

Click to view abstract

Abstract:Efficient and realistic crowd rendering is an important element of many real-time graphics applications such as Virtual Reality (VR) and games. To this end, Levels of Detail (LOD) avatar representations such as polygonal meshes, image-based impostors, and point clouds have been proposed and evaluated. More recently, 3D Gaussian Splatting has been explored as a potential method for real-time crowd rendering. In this paper, we present a two-alternative forced choice (2AFC) experiment that aims to determine the perceived quality of 3D Gaussian avatars. Three factors were explored: Motion, LOD (i.e., #Gaussians), and the avatar height in Pixels (corresponding to the viewing distance). Participants viewed pairs of animated 3D Gaussian avatars and were tasked with choosing the most detailed one. Our findings can inform the optimization of LOD strategies in Gaussian-based crowd rendering, thereby helping to achieve efficient rendering while maintaining visual quality in real-time applications.
zh

[CV-6] DINOSTAR: Deep Iterative Neural Object Detector Self-Supervised Training for Roadside LiDAR Applications

【Quick Read】: This paper addresses the time and cost of human-supervised labeling for object detection in point-cloud data. The key to the solution is an end-to-end, scalable, self-supervised framework that trains deep object detectors with self-supervised, statistically modeled teachers, removing the need for human annotation. The teacher models generate noisy labels through standard steps of background filtering, object clustering, bounding-box fitting, and classification, while the student model, trained on the combined noisy annotations from multiple teachers, gains a stronger ability to separate foreground from background and is pushed to learn diverse point-cloud representations.

Link: https://arxiv.org/abs/2501.17076
Authors: Muhammad Shahbaz, Shaurya Agarwal
Affiliations: University of Central Florida; SRI International
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: conference, 6 pages

Click to view abstract

Abstract:Recent advancements in deep-learning methods for object detection in point-cloud data have enabled numerous roadside applications, fostering improvements in transportation safety and management. However, the intricate nature of point-cloud data poses significant challenges for human-supervised labeling, resulting in substantial expenditures of time and capital. This paper addresses the issue by developing an end-to-end, scalable, and self-supervised framework for training deep object detectors tailored for roadside point-cloud data. The proposed framework leverages self-supervised, statistically modeled teachers to train off-the-shelf deep object detectors, thus circumventing the need for human supervision. The teacher models follow fine-tuned set standard practices of background filtering, object clustering, bounding-box fitting, and classification to generate noisy labels. It is presented that by training the student model over the combined noisy annotations from multitude of teachers enhances its capacity to discern background/foreground more effectively and forces it to learn diverse point-cloud-representations for object categories of interest. The evaluations, involving publicly available roadside datasets and state-of-art deep object detectors, demonstrate that the proposed framework achieves comparable performance to deep object detectors trained on human-annotated labels, despite not utilizing such human-annotations in its training process.
zh
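
To make the teacher stage concrete, here is a minimal, hypothetical sketch of one statistically modeled teacher (height-threshold background filtering, DBSCAN clustering, axis-aligned box fitting). The function name, thresholds, and the use of scikit-learn's DBSCAN are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def teacher_pseudo_labels(points, ground_z=0.2, eps=0.8, min_samples=10):
    """Toy LiDAR teacher: background filtering -> clustering -> box fitting.

    points: (N, 3) array of x, y, z returns from one roadside frame.
    Returns axis-aligned boxes (cx, cy, cz, dx, dy, dz) as noisy labels.
    """
    fg = points[points[:, 2] > ground_z]          # 1) crude ground removal
    if len(fg) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(fg)  # 2) clustering
    boxes = []
    for k in set(labels) - {-1}:                  # -1 marks DBSCAN noise points
        cluster = fg[labels == k]
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        boxes.append(np.concatenate([(lo + hi) / 2.0, hi - lo]))  # 3) box fitting
    return boxes

# One synthetic "vehicle" blob above a flat ground plane.
ground = np.random.rand(1000, 3) * [50.0, 50.0, 0.1]
vehicle = np.random.randn(200, 3) * 0.4 + [10.0, 10.0, 1.0]
print(len(teacher_pseudo_labels(np.vstack([ground, vehicle]))), "noisy boxes")
```

In the paper, several such teachers with different settings would each emit noisy labels, and the student detector is trained on their combined annotations.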

[CV-7] EdgeMLOps: Operationalizing ML models with Cumulocity IoT and thin-edge.io for Visual quality Inspection

【Quick Read】: This paper addresses the challenges of deploying and managing machine learning models on resource-constrained edge devices, covering model optimization, deployment, and lifecycle management. The key to the solution is EdgeMLOps, a framework that leverages Cumulocity IoT and related tooling for efficient model deployment and management, demonstrated on a visual quality inspection (VQI) use case. The paper also evaluates quantization methods (static and dynamic signed int8) on a Raspberry Pi 4, showing significant inference-time reductions compared to FP32 precision.

Link: https://arxiv.org/abs/2501.17062
Authors: Kanishk Chaturvedi, Johannes Gasthuber, Mohamed Abdelaal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper introduces EdgeMLOps, a framework leveraging Cumulocity IoT and this http URL for deploying and managing machine learning models on resource-constrained edge devices. We address the challenges of model optimization, deployment, and lifecycle management in edge environments. The framework’s efficacy is demonstrated through a visual quality inspection (VQI) use case where images of assets are processed on edge devices, enabling real-time condition updates within an asset management system. Furthermore, we evaluate the performance benefits of different quantization methods, specifically static and dynamic signed-int8, on a Raspberry Pi 4, demonstrating significant inference time reductions compared to FP32 precision. Our results highlight the potential of EdgeMLOps to enable efficient and scalable AI deployments at the edge for industrial applications.
zh
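
The dynamic signed-int8 setting the paper benchmarks can be reproduced in spirit with PyTorch's built-in dynamic quantization. The toy model below is a stand-in (the paper's VQI model and EdgeMLOps pipeline are not public here), and actual speedups on a Raspberry Pi 4 will differ:

```python
import time
import torch
import torch.nn as nn

# Stand-in network; PyTorch dynamic quantization targets Linear/LSTM layers.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 256), nn.ReLU(),
    nn.Linear(256, 2),
).eval()

# Weights stored as signed int8; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 3, 224, 224)
for name, m in [("fp32", model), ("dynamic-int8", quantized)]:
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(50):
            m(x)
    print(f"{name}: {(time.perf_counter() - start) / 50 * 1e3:.2f} ms/inference")
```

Static quantization would additionally calibrate activation ranges on sample data before conversion, which usually yields a further speedup on CPU-only edge hardware.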

[CV-8] Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding ICLR’25

【Quick Read】: This paper targets the challenges of Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). Although state-of-the-art object detection models have strong zero-shot capabilities, they show clear limitations in consistent temporal prediction, comprehension of complex queries, and adaptation to difficult scenarios. The key is CoSPaL (Contextual Self-Paced Learning), which integrates three core components: Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to progressively refine object identification; and Self-Paced Scene Understanding (SPS), a training paradigm that gradually increases task difficulty so the model moves from coarse to fine-grained scene understanding and adapts to complex scenarios.

Link: https://arxiv.org/abs/2501.17053
Authors: Akash Kumar, Zsolt Kira, Yogesh Singh Rawat
Affiliations: University of Central Florida; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR'25 Main Conference. Project Page: this https URL

Click to view abstract

Abstract:In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
zh

[CV-9] Synthesizing 3D Abstractions by Inverting Procedural Buildings with Transformers

【Quick Read】: This paper generates abstract representations of buildings by inverting procedural models, addressing the problem of recovering building geometry and structure from point clouds. The key to the solution is a dataset pairing abstract procedurally modeled buildings with simulated point clouds, with a transformer learning the inverse mapping. The approach leverages expressive procedural models developed for gaming and animation, and thereby retains efficient rendering of the inferred abstractions together with strong priors for regularity and symmetry.

Link: https://arxiv.org/abs/2501.17044
Authors: Max Dax, Jordi Berbel, Jan Stria, Leonidas Guibas, Urs Bergmann
Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Google DeepMind; Google
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Click to view abstract

Abstract:We generate abstractions of buildings, reflecting the essential aspects of their geometry and structure, by learning to invert procedural models. We first build a dataset of abstract procedural building models paired with simulated point clouds and then learn the inverse mapping through a transformer. Given a point cloud, the trained transformer then infers the corresponding abstracted building in terms of a programmatic language description. This approach leverages expressive procedural models developed for gaming and animation, and thereby retains desirable properties such as efficient rendering of the inferred abstractions and strong priors for regularity and symmetry. Our approach achieves good reconstruction accuracy in terms of geometry and structure, as well as structurally consistent inpainting.
zh

[CV-10] MAUCell: An Adaptive Multi-Attention Framework for Video Frame Prediction IJCAI2025

【Quick Read】: This paper pursues accurate prediction under efficient resource consumption in temporal sequence modeling. The key is the Multi-Attention Unit (MAUCell), which combines generative adversarial networks (GANs) with spatio-temporal attention: three types of attention models capture intricate motion sequences, and a dynamic combination of their outputs improves the accuracy and quality of video frame prediction while remaining computationally efficient. The GAN component makes the generated frames appear more lifelike, so the produced sequences better mimic real-world footage.

Link: https://arxiv.org/abs/2501.16997
Authors: Shreyam Gupta (1), P. Agrawal (2), Priyam Gupta (3) ((1) Indian Institute of Technology (BHU), Varanasi, India, (2) University of Colorado, Boulder, USA, (3) Intelligent Field Robotic Systems (IFROS), University of Girona, Spain)
Affiliations: Indian Institute of Technology (BHU), Varanasi, India; University of Colorado, Boulder, USA; Intelligent Field Robotic Systems (IFROS), University of Girona, Spain
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This work has been submitted to the IJCAI 2025 Conference for review. It contains: 11 pages, 4 figures, 7 tables, and 3 Algorithms

Click to view abstract

Abstract:Temporal sequence modeling stands as the fundamental foundation for video prediction systems and real-time forecasting operations as well as anomaly detection applications. The achievement of accurate predictions through efficient resource consumption remains an ongoing issue in contemporary temporal sequence modeling. We introduce the Multi-Attention Unit (MAUCell) which combines Generative Adversarial Networks (GANs) and spatio-temporal attention mechanisms to improve video frame prediction capabilities. Our approach implements three types of attention models to capture intricate motion sequences. A dynamic combination of these attention outputs allows the model to reach both advanced decision accuracy along with superior quality while remaining computationally efficient. The integration of GAN elements makes generated frames appear more true to life therefore the framework creates output sequences which mimic real-world footage. The new design system maintains equilibrium between temporal continuity and spatial accuracy to deliver reliable video prediction. Through a comprehensive evaluation methodology which merged the perceptual LPIPS measurement together with classic tests MSE, MAE, SSIM and PSNR exhibited enhancing capabilities than contemporary approaches based on direct benchmark tests of Moving MNIST, KTH Action, and CASIA-B (Preprocessed) datasets. Our examination indicates that MAUCell shows promise for operational time requirements. The research findings demonstrate how GANs work best with attention mechanisms to create better applications for predicting video sequences.
zh

[CV-11] FedEFM: Federated Endovascular Foundation Model with Unseen Data ICRA2025

【Quick Read】: This paper tackles the challenge of precisely segmenting catheters and guidewires in endovascular surgery given limited labeled data. The key to the solution is a new way of training a foundation model in a decentralized federated learning setting for endovascular intervention, handling the unseen-data issue with a differentiable Earth Mover's Distance inside a knowledge distillation framework, thereby preserving data privacy while improving downstream task performance.

Link: https://arxiv.org/abs/2501.16992
Authors: Tuong Do, Nghia Vu, Tudor Jianu, Baoru Huang, Minh Vu, Jionglong Su, Erman Tjiputra, Quang D. Tran, Te-Chuan Chiu, Anh Nguyen
Affiliations: University of Liverpool; AIOZ Ltd.; Automation & Control Institute, TU Wien; Xi'an Jiaotong-Liverpool University; National Tsing Hua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages. Accepted to ICRA 2025

Click to view abstract

Abstract:In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover’s Distance within a knowledge distillation framework. Once trained, our foundation model’s weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.
zh
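
A differentiable Earth Mover's Distance is commonly approximated with entropy-regularised (Sinkhorn) optimal transport. The log-domain sketch below shows one generic formulation that could serve as a distillation term between student and teacher embeddings; it is not the paper's exact loss:

```python
import math
import torch

def sinkhorn_emd(x, y, eps=0.1, iters=100):
    """Entropy-regularised OT cost between two feature sets, differentiable in x, y.

    x: (n, d) student embeddings, y: (m, d) teacher embeddings, uniform weights.
    """
    cost = torch.cdist(x, y, p=2)                         # pairwise transport costs
    log_a = torch.full((x.size(0),), -math.log(x.size(0)))
    log_b = torch.full((y.size(0),), -math.log(y.size(0)))
    f, g = torch.zeros(x.size(0)), torch.zeros(y.size(0))
    for _ in range(iters):                                # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a[:, None], dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :])
    return (plan * cost).sum()

student = torch.randn(32, 128, requires_grad=True)
teacher = torch.randn(32, 128)
loss = sinkhorn_emd(student, teacher)
loss.backward()                                           # gradients flow to the student
```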

[CV-12] Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection

【Quick Read】: This paper addresses the limitations of existing open-vocabulary object detectors that rely on a pre-trained visual language model (VLM) for generative representation, specifically how to combine the strengths of large-scale pre-training with the optimization benefits of labeled data. The key is the ViT-Feature-Modulated Multi-Scale Convolutional network (VMCNet), a two-branch backbone consisting of a trainable convolutional branch, a frozen pre-trained Vision Transformer (ViT) branch, and a feature modulation module. With this hybrid structure, the detector is better at discovering novel categories, yielding clear gains on the OV-COCO and OV-LVIS benchmarks.

Link: https://arxiv.org/abs/2501.16981
Authors: Xiangyu Gao, Yu Dai, Benliu Qiu, Hongliang Li
Affiliations: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Owing to large-scale image-text contrastive training, pre-trained vision language model (VLM) like CLIP shows superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize the pre-trained VLM to attain generative representation. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, the frozen backbone doesn’t benefit from the labeled data to strengthen the representation. Therefore, we propose a novel two-branch backbone network design, named as ViT-Feature-Modulated Multi-Scale Convolutional network (VMCNet). VMCNet consists of a trainable convolutional branch, a frozen pre-trained ViT branch and a feature modulation module. The trainable CNN branch could be optimized with labeled data while the frozen pre-trained ViT branch could keep the representation ability derived from large-scale pre-training. Then, the proposed feature modulation module could modulate the multi-scale CNN features with the representations from ViT branch. With the proposed mixed structure, detector is more likely to discover novel categories. Evaluated on two popular benchmarks, our method boosts the detection performance on novel category and outperforms the baseline. On OV-COCO, the proposed method achieves 44.3 AP _50^\mathrmnovel with ViT-B/16 and 48.5 AP _50^\mathrmnovel with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP _r .
zh
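
The abstract does not spell out the modulation module's internals, but a FiLM-style scale-and-shift of the CNN features conditioned on pooled ViT tokens is one plausible reading. Everything in this sketch (class name, mean pooling, layer shapes) is an assumption for illustration:

```python
import torch
import torch.nn as nn

class FeatureModulation(nn.Module):
    """Scale-and-shift CNN feature maps using pooled tokens from a frozen ViT."""

    def __init__(self, vit_dim, cnn_channels):
        super().__init__()
        self.to_gamma = nn.Linear(vit_dim, cnn_channels)
        self.to_beta = nn.Linear(vit_dim, cnn_channels)

    def forward(self, cnn_feat, vit_tokens):
        # cnn_feat: (B, C, H, W) from the trainable branch
        # vit_tokens: (B, N, D) patch tokens from the frozen pre-trained branch
        ctx = vit_tokens.mean(dim=1)                   # global descriptor
        gamma = self.to_gamma(ctx)[:, :, None, None]   # per-channel scale
        beta = self.to_beta(ctx)[:, :, None, None]     # per-channel shift
        return cnn_feat * (1 + gamma) + beta

mod = FeatureModulation(vit_dim=768, cnn_channels=256)
out = mod(torch.randn(2, 256, 32, 32), torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 256, 32, 32])
```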

[CV-13] RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples ICML

【Quick Read】: This paper addresses the weak performance of image outlier detection under adversarial settings. The key is RODEO, which combines outlier exposure (OE) with adversarial training, generating effective outlier samples that strengthen model robustness. The method leverages a text-to-image model to produce "diverse" and "near-distribution" outliers, significantly improving detector performance in adversarial environments.

Link: https://arxiv.org/abs/2501.16971
Authors: Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the Forty-First International Conference on Machine Learning (ICML) 2024. The implementation of our work is available at: this https URL

Click to view abstract

Abstract:In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates diverse'' and near-distribution’’ outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.
zh
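
The standard outlier-exposure objective trains the classifier normally on inliers while pushing it toward a uniform posterior on exposed outliers. The sketch below shows that generic loss only; RODEO's adversarial crafting and text-to-image generation of the outliers are omitted:

```python
import torch
import torch.nn.functional as F

def oe_objective(logits_in, logits_out, labels, lam=0.5):
    """Cross-entropy on inliers plus a uniformity term on exposed outliers."""
    ce = F.cross_entropy(logits_in, labels)
    # Cross-entropy to the uniform distribution == maximise outlier entropy.
    uniformity = -F.log_softmax(logits_out, dim=1).mean()
    return ce + lam * uniformity

logits_in = torch.randn(16, 10, requires_grad=True)   # in-distribution batch
logits_out = torch.randn(16, 10, requires_grad=True)  # synthesized outlier batch
loss = oe_objective(logits_in, logits_out, torch.randint(0, 10, (16,)))
loss.backward()
```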

[CV-14] What Really Matters for Learning-based LiDAR-Camera Calibration

【Quick Read】: This paper targets LiDAR-camera calibration under target-less and online conditions. Traditional methods require specific targets or suitable scenes to obtain reliable 2D-3D correspondences, whereas deep neural networks have been introduced to solve the problem in a data-driven manner. The key contribution is a systematic analysis of the paradigm behind mainstream learning-based calibration methods, exposing the critical limitations of regression-based methods under the widely used data generation pipeline. The study finds that most learning-based methods inadvertently operate as retrieval networks, focusing more on single-modality distributions than on cross-modality correspondences. It also examines how input data format and preprocessing operations affect network performance, and summarizes regression clues to inform further improvements.

Link: https://arxiv.org/abs/2501.16969
Authors: Shujuan Huang, Chunyu Lin, Yao Zhao
Affiliations: Institute of Information Science, Beijing Jiaotong University; Visual Intelligence + X International Joint Laboratory of the Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Calibration is an essential prerequisite for the accurate data fusion of LiDAR and camera sensors. Traditional calibration techniques often require specific targets or suitable scenes to obtain reliable 2D-3D correspondences. To tackle the challenge of target-less and online calibration, deep neural networks have been introduced to solve the problem in a data-driven manner. While previous learning-based methods have achieved impressive performance on specific datasets, they still struggle in complex real-world scenarios. Most existing works focus on improving calibration accuracy but overlook the underlying mechanisms. In this paper, we revisit the development of learning-based LiDAR-Camera calibration and encourage the community to pay more attention to the underlying principles to advance practical applications. We systematically analyze the paradigm of mainstream learning-based methods, and identify the critical limitations of regression-based methods with the widely used data generation pipeline. Our findings reveal that most learning-based methods inadvertently operate as retrieval networks, focusing more on single-modality distributions rather than cross-modality correspondences. We also investigate how the input data format and preprocessing operations impact network performance and summarize the regression clues to inform further improvements.
zh
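
The 2D-3D correspondence model that any LiDAR-camera calibration (learned or classical) ultimately estimates is a rigid transform followed by a pinhole projection. A minimal sketch, with made-up, KITTI-like intrinsics:

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K):
    """Project LiDAR points into pixel coordinates given a calibration.

    points: (N, 3) LiDAR xyz; T_cam_lidar: 4x4 extrinsics; K: 3x3 intrinsics.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])     # (N, 4) homogeneous
    cam = (T_cam_lidar @ homo.T).T[:, :3]                     # points in camera frame
    in_front = cam[:, 2] > 1e-6                               # keep points ahead of camera
    pix = (K @ cam[in_front].T).T
    return pix[:, :2] / pix[:, 2:3], in_front                 # (M, 2) pixel coords

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])  # toy intrinsics
T = np.eye(4)
T[:3, 3] = [0.0, -0.08, -0.27]                                     # toy extrinsic offset
uv, mask = project_lidar_to_image(np.random.randn(100, 3) + [0, 0, 10], T, K)
print(uv.shape)
```

A regression-based calibrator predicts (corrections to) `T_cam_lidar`; the paper's point is that many such networks end up memorising miscalibration distributions from the data generation pipeline rather than learning these cross-modality correspondences.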

[CV-15] Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? IROS2025

【Quick Read】: This paper explores the potential of state-of-the-art vision-language models (VLMs) as stand-alone, zero-shot geo-localization systems in a black-box setting with realistic constraints. The key to the solution is a systematic study of three main scenarios: fixed text-based prompts, semantically equivalent text-based prompts, and semantically equivalent query images. In addition to conventional accuracy, the study uses model consistency as a metric, accounting for the auto-regressive and probabilistic generation process of VLMs when assessing their utility for geo-localization.

Link: https://arxiv.org/abs/2501.16947
Authors: Sania Waheed, Bruno Ferrarini, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
Affiliations: School of Electronics and Computer Science, University of Southampton; MyWay srl; School of Electrical Engineering and Computer Science, Queensland University of Technology; School of Electronics and Computer Science, University of Southampton; School of Computer Science and Electronic Engineering, University of Essex
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to IROS 2025

Click to view abstract

Abstract:The advances in Vision-Language models (VLMs) offer exciting opportunities for robotic applications involving image geo-localization, the problem of identifying the geo-coordinates of a place based on visual data only. Recent research works have focused on using a VLM as embeddings extractor for geo-localization, however, the most sophisticated VLMs may only be available as black boxes that are accessible through an API, and come with a number of limitations: there is no access to training data, model features and gradients; retraining is not possible; the number of predictions may be limited by the API; training on model outputs is often prohibited; and queries are open-ended. The utilization of a VLM as a stand-alone, zero-shot geo-localization system using a single text-based prompt is largely unexplored. To bridge this gap, this paper undertakes the first systematic study, to the best of our knowledge, to investigate the potential of some of the state-of-the-art VLMs as stand-alone, zero-shot geo-localization systems in a black-box setting with realistic constraints. We consider three main scenarios for this thorough investigation: a) fixed text-based prompt; b) semantically-equivalent text-based prompts; and c) semantically-equivalent query images. We also take into account the auto-regressive and probabilistic generation process of the VLMs when investigating their utility for geo-localization task by using model consistency as a metric in addition to traditional accuracy. Our work provides new insights in the capabilities of different VLMs for the above-mentioned scenarios.
zh
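
A self-consistency score for repeated VLM geo-queries can be built from pairwise great-circle distances between the predicted coordinates. The aggregation below (fraction of prediction pairs agreeing within a threshold) is a hypothetical stand-in for the paper's consistency metric:

```python
import math
from itertools import combinations

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def consistency(predictions, threshold_km=25.0):
    """predictions: (lat, lon) tuples parsed from repeated queries on one image."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 1.0
    return sum(haversine_km(p, q) <= threshold_km for p, q in pairs) / len(pairs)

# Three stochastic answers for the same query image, all around Paris.
preds = [(48.8584, 2.2945), (48.8606, 2.3376), (48.8529, 2.3500)]
print(f"consistency@25km = {consistency(preds):.2f}")
```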

[CV-16] B-FPGM: Lightweight Face Detection via Bayesian-Optimized Soft FPGM Pruning WACV2025

【Quick Read】: This paper responds to the demand for lightweight face detection models on resource-constrained devices, using neural network pruning to effectively shrink models without significantly hurting performance. The key is B-FPGM, a novel pruning pipeline combining Filter Pruning via Geometric Median (FPGM), Soft Filter Pruning (SFP), and Bayesian optimization to reach a better size-performance trade-off than existing approaches. FPGM is a structured pruning technique that removes the least significant filters in each layer; SFP prunes filters iteratively while allowing them to be updated in subsequent training steps; and Bayesian optimization tunes the per-layer pruning rates instead of relying on engineering expertise. Across all three subsets of the WIDER FACE dataset, the proposed method outperforms existing ones.

Link: https://arxiv.org/abs/2501.16917
Authors: Nikolaos Kaparinos, Vasileios Mezaris
Affiliations: CERTH-ITI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication, RWS Workshop @ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, Feb. 2025. This is the authors' "accepted version"

Click to view abstract

Abstract:Face detection is a computer vision application that increasingly demands lightweight models to facilitate deployment on devices with limited computational resources. Neural network pruning is a promising technique that can effectively reduce network size without significantly affecting performance. In this work, we propose a novel face detection pruning pipeline that leverages Filter Pruning via Geometric Median (FPGM) pruning, Soft Filter Pruning (SFP) and Bayesian optimization in order to achieve a superior trade-off between size and performance compared to existing approaches. FPGM pruning is a structured pruning technique that allows pruning the least significant filters in each layer, while SFP iteratively prunes the filters and allows them to be updated in any subsequent training step. Bayesian optimization is employed in order to optimize the pruning rates of each layer, rather than relying on engineering expertise to determine the optimal pruning rates for each layer. In our experiments across all three subsets of the WIDER FACE dataset, our proposed approach B-FPGM consistently outperforms existing ones in balancing model size and performance. All our experiments were applied to EResFD, the currently smallest (in number of parameters) well-performing face detector of the literature; a small ablation study with a second small face detector, EXTD, is also reported. The source code and trained pruned face detection models can be found at: this https URL.
zh
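
FPGM ranks filters by their distance to the layer's geometric median, since filters near the median are the most replaceable. A common approximation scores each filter by its summed distance to all others. The sketch below combines that scoring with SFP-style soft pruning (zeroing rather than removing); B-FPGM's exact ranking and Bayesian-optimised per-layer rates are not reproduced here:

```python
import torch

def fpgm_scores(weight):
    """Summed pairwise distance per filter; the minimiser approximates the
    layer's geometric median, and nearby filters are the most redundant."""
    flat = weight.flatten(1)                         # one row per output filter
    return torch.cdist(flat, flat, p=2).sum(dim=1)

def soft_prune(weight, prune_ratio):
    idx = torch.argsort(fpgm_scores(weight))[: int(prune_ratio * weight.size(0))]
    with torch.no_grad():
        weight[idx] = 0.0                            # SFP: zeroed now, updatable later
    return idx

w = torch.nn.Conv2d(32, 64, 3).weight               # (64, 32, 3, 3)
pruned = soft_prune(w, prune_ratio=0.25)
print(f"soft-pruned {len(pruned)} of {w.size(0)} filters")
```

In the full pipeline, `prune_ratio` would differ per layer, with Bayesian optimization searching over those per-layer ratios.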

[CV-17] Adversarial Masked Autoencoder Purifier with Defense Transferability

【Quick Read】: This paper addresses the inadequacy of current defenses against advanced adversarial attacks. The key is the Masked AutoEncoder Purifier (MAEP), which integrates a Masked AutoEncoder (MAE) into an adversarial purification framework for test-time purification. Beyond strong adversarial robustness, MAEP features model defense transferability and attack generalization, without relying on additional data different from the training dataset.

Link: https://arxiv.org/abs/2501.16904
Authors: Yuan-Chih Chen, Chun-Shien Lu
Affiliations: Academia Sinica
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The study of adversarial defense still struggles to combat with advanced adversarial attacks. In contrast to most prior studies that rely on the diffusion model for test-time defense to remarkably increase the inference time, we propose Masked AutoEncoder Purifier (MAEP), which integrates Masked AutoEncoder (MAE) into an adversarial purifier framework for test-time purification. While MAEP achieves promising adversarial robustness, it particularly features model defense transferability and attack generalization without relying on using additional data that is different from the training dataset. To our knowledge, MAEP is the first study of adversarial purifier based on MAE. Extensive experimental results demonstrate that our method can not only maintain clear accuracy with only a slight drop but also exhibit a close gap between the clean and robust accuracy. Notably, MAEP trained on CIFAR10 achieves state-of-the-art performance even when tested directly on ImageNet, outperforming existing diffusion-based models trained specifically on ImageNet.
zh

[CV-18] Frequency Matters: Explaining Biases of Face Recognition in the Frequency Domain ECCV2024

【Quick Read】: This paper investigates why face recognition (FR) models perform differently across demographic groups. The key is to explain bias in face recognition with state-of-the-art frequency-based explanation methods, revealing that different frequency patterns matter to FR models depending on the ethnicity of the samples, and thereby shedding light on the sources of bias in face recognition.

Link: https://arxiv.org/abs/2501.16896
Authors: Marco Huber, Fadi Boutros, Naser Damer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at xAI4Biometrics at ECCV 2024

Click to view abstract

Abstract:Face recognition (FR) models are vulnerable to performance variations across demographic groups. The causes for these performance differences are unclear due to the highly complex deep learning-based structure of face recognition models. Several works aimed at exploring possible roots of gender and ethnicity bias, identifying semantic reasons such as hairstyle, make-up, or facial hair as possible sources. Motivated by recent discoveries of the importance of frequency patterns in convolutional neural networks, we explain bias in face recognition using state-of-the-art frequency-based explanations. Our extensive results show that different frequencies are important to FR models depending on the ethnicity of the samples.
zh

[CV-19] Extending Information Bottleneck Attribution to Video Sequences

【Quick Read】: This paper targets explainability for video classification, with video deepfake detection as the application. The key solution is VIBA, which adapts Information Bottlenecks for Attribution (IBA) to video sequences, enabling explanations for temporal models. The method produces temporally and spatially consistent explanations that align closely with human annotations, providing interpretability for video classification and for deepfake detection in particular.

Link: https://arxiv.org/abs/2501.16889
Authors: Veronika Solopova, Lucas Schmidt, Dorothea Kolossa
Affiliations: Technische Universität Berlin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom dataset that reflects recent deepfake generation techniques, we adapt IBA to create relevance and optical flow maps, visually highlighting manipulated regions and motion inconsistencies. Our results show that VIBA generates temporally and spatially consistent explanations, which align closely with human annotations, thus providing interpretability for video classification and particularly for deepfake detection.
zh

[CV-20] Experimenting with Affective Computing Models in Video Interviews with Spanish-speaking Older Adults

【Quick Read】: This paper addresses two major gaps in affective computing for older adults: (1) the scarcity of datasets representing older populations, especially in non-English-speaking countries, and (2) the poor generalization of models trained on younger or homogeneous demographics. The key contribution is a new dataset of Spanish-speaking older adults in human-to-human video interviews, used to evaluate state-of-the-art affective computing models for facial expression recognition, text sentiment analysis, and smile detection. The analyses reveal limited agreement between human annotations and automatic model outputs, weak consistency across modalities, and substantial individual variability in emotional signals, underscoring the need for future systems to incorporate personal variability and cultural nuances.

Link: https://arxiv.org/abs/2501.16870
Authors: Josep Lopez Camunas, Cristina Bustos, Yanjun Zhu, Raquel Ros, Agata Lapedriza
Affiliations: Universitat Oberta de Catalunya; Northeastern University; PAL Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Understanding emotional signals in older adults is crucial for designing virtual assistants that support their well-being. However, existing affective computing models often face significant limitations: (1) limited availability of datasets representing older adults, especially in non-English-speaking populations, and (2) poor generalization of models trained on younger or homogeneous demographics. To address these gaps, this study evaluates state-of-the-art affective computing models – including facial expression recognition, text sentiment analysis, and smile detection – using videos of older adults interacting with either a person or a virtual avatar. As part of this effort, we introduce a novel dataset featuring Spanish-speaking older adults engaged in human-to-human video interviews. Through three comprehensive analyses, we investigate (1) the alignment between human-annotated labels and automatic model outputs, (2) the relationships between model outputs across different modalities, and (3) individual variations in emotional signals. Using both the Wizard of Oz (WoZ) dataset and our newly collected dataset, we uncover limited agreement between human annotations and model predictions, weak consistency across modalities, and significant variability among individuals. These findings highlight the shortcomings of generalized emotion perception models and emphasize the need of incorporating personal variability and cultural nuances into future systems.
zh

[CV-21] Not Every Patch is Needed: Towards a More Efficient and Effective Backbone for Video-based Person Re-identification

【Quick Read】: This paper addresses the high computational cost and low efficiency of video-based person re-identification (Re-ID). The key is a new selective feature extraction mechanism that processes only the crucial, non-repetitive image patches, exploiting the observation that adjacent frames in a Re-ID video differ only slightly. A novel network structure additionally generates and uses pseudo frame global context to compensate for the incomplete views caused by sparse inputs. Together, these designs let the proposed plug-and-play backbone achieve both high accuracy and low computational cost.

Link: https://arxiv.org/abs/2501.16811
Authors: Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu
Affiliations: Singapore University of Technology and Design; College of Computer Science and Technology, Zhejiang University, China; Alibaba Group, China; School of Computing and Communications, Lancaster University, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE TIP

Click to view abstract

Abstract:This paper proposes a new effective and efficient plug-and-play backbone for video-based person re-identification (ReID). Conventional video-based ReID methods typically use CNN or transformer backbones to extract deep features for every position in every sampled video frame. Here, we argue that this exhaustive feature extraction could be unnecessary, since we find that different frames in a ReID video often exhibit small differences and contain many similar regions due to the relatively slight movements of human beings. Inspired by this, a more selective, efficient paradigm is explored in this paper. Specifically, we introduce a patch selection mechanism to reduce computational cost by choosing only the crucial and non-repetitive patches for feature extraction. Additionally, we present a novel network structure that generates and utilizes pseudo frame global context to address the issue of incomplete views resulting from sparse inputs. By incorporating these new designs, our backbone can achieve both high performance and low computational cost. Extensive experiments on multiple datasets show that our approach reduces the computational cost by 74% compared to ViT-B and 28% compared to ResNet50, while the accuracy is on par with ViT-B and outperforms ResNet50 significantly.
zh
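
One simple way to realise "crucial and non-repetitive" patch selection is to score patches by inter-frame change and keep only the top-k. This heuristic sketch is our illustration; the paper's actual selection mechanism is learned:

```python
import torch

def select_patches(frames, patch=16, keep_ratio=0.3):
    """Keep only patches that changed the most since the previous frame.

    frames: (T, C, H, W) video clip. Returns per-frame indices of kept patches.
    """
    T, C, H, W = frames.shape
    patches = frames.unfold(2, patch, patch).unfold(3, patch, patch)  # (T, C, H/p, W/p, p, p)
    patches = patches.flatten(4).flatten(2, 3)                        # (T, C, N, p*p)
    diff = (patches[1:] - patches[:-1]).abs().mean(dim=(1, 3))        # (T-1, N) change score
    k = max(1, int(keep_ratio * diff.size(1)))
    return diff.topk(k, dim=1).indices                                # patches worth re-encoding

clip = torch.randn(8, 3, 128, 64)                                     # T=8 pedestrian crops
kept = select_patches(clip)
print(kept.shape)  # (7, k): for each frame after the first, which patches to process
```

Skipping the remaining patches is where the reported 74% (vs. ViT-B) and 28% (vs. ResNet50) cost reductions would come from.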

[CV-22] RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception

【Quick Read】: This paper aims to overcome the perception limits of single-agent systems by introducing a cross-modality fusion module that fully exploits multi-source data. The key is Radian-Glue-Attention (RG-Attn), a robust LiDAR-camera cross-modality fusion module applicable to both intra-agent and inter-agent fusion scenarios, thanks to convenient coordinate conversion via transformation matrices and a unified sampling/inversion mechanism. Two architectures are also proposed: Paint-To-Puzzle (PTP), which targets maximum precision, and Co-Sketching-Co-Coloring (CoS-CoCo), which supports agents with any sensor configuration and offers broader generalization.

Link: https://arxiv.org/abs/2501.16803
Authors: Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun
Affiliations: Sony Research and Development Center China; School of Information, Renmin University of China
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Cooperative perception offers an optimal solution to overcome the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication for data sharing and fusion across multiple agents. However, most existing approaches focus on single-modality data exchange, limiting the potential of both homogeneous and heterogeneous fusion across agents. This overlooks the opportunity to utilize multi-modality data per agent, restricting the system’s performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness the potential of every possible data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent cross-modality fusion and inter-agent cross-modality fusion scenarios, owing to the convenient coordinate conversion by transformation matrix and the unified sampling/inversion mechanism. We also propose two different architectures, named Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP aims for maximum precision performance and achieves smaller data packet size by limiting cross-agent fusion to a single instance, but requiring all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration-LiDAR-only, camera-only, or LiDAR-camera-both, presenting more generalization ability. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code will be released at GitHub in early 2025.
zh

[CV-23] Dynamic Hypergraph Representation for Bone Metastasis Cancer Analysis

【Quick Read】: This paper addresses the complex multivariate interactions in bone metastasis analysis, where traditional approaches such as multiple instance learning (MIL) and graph neural networks (GNNs) struggle to capture high-order biological associations. The key is a dynamic hypergraph neural network (DyHG) that overcomes the edge-construction limits of conventional graph representations by connecting multiple nodes through hyperedges, uses a low-rank strategy to reduce the parameter complexity of learning hypergraph structures, and optimizes the distribution of patches over hyperedges with a Gumbel-Softmax-based sampling strategy. An MIL aggregator then derives a graph-level embedding for comprehensive whole slide image (WSI) analysis.

Link: https://arxiv.org/abs/2501.16787
Authors: Yuxuan Chen, Jiawen Li, Huijuan Shi, Yang Xu, Tian Guan, Lianghui Zhu, Yonghong He, Anjia Han
Affiliations: Shenzhen International Graduate School, Tsinghua University; Jinfeng Laboratory, Chongqing, China; Department of Laboratory Medicine, Shenzhen Children's Hospital; Department of Pathology, The First Affiliated Hospital, Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 figures

Click to view abstract

Abstract:Bone metastasis analysis is a significant challenge in pathology and plays a critical role in determining patient quality of life and treatment strategies. The microenvironment and specific tissue structures are essential for pathologists to predict the primary bone cancer origins and primary bone cancer subtyping. By digitizing bone tissue sections into whole slide images (WSIs) and leveraging deep learning to model slide embeddings, this analysis can be enhanced. However, tumor metastasis involves complex multivariate interactions with diverse bone tissue structures, which traditional WSI analysis methods such as multiple instance learning (MIL) fail to capture. Moreover, graph neural networks (GNNs), limited to modeling pairwise relationships, are hard to represent high-order biological associations. To address these challenges, we propose a dynamic hypergraph neural network (DyHG) that overcomes the edge construction limitations of traditional graph representations by connecting multiple nodes via hyperedges. A low-rank strategy is used to reduce the complexity of parameters in learning hypergraph structures, while a Gumbel-Softmax-based sampling strategy optimizes the patch distribution across hyperedges. An MIL aggregator is then used to derive a graph-level embedding for comprehensive WSI analysis. To evaluate the effectiveness of DyHG, we construct two large-scale datasets for primary bone cancer origins and subtyping classification based on real-world bone metastasis scenarios. Extensive experiments demonstrate that DyHG significantly outperforms state-of-the-art (SOTA) baselines, showcasing its ability to model complex biological interactions and improve the accuracy of bone metastasis analysis.
zh
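
Gumbel-Softmax sampling is what makes the discrete patch-to-hyperedge assignment differentiable. The sketch below shows that core step only, omitting DyHG's low-rank parameterisation and MIL aggregator; all names and shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperedgeAssign(nn.Module):
    """Softly assign WSI patch embeddings to a fixed set of hyperedges."""

    def __init__(self, dim, n_hyperedges):
        super().__init__()
        self.scorer = nn.Linear(dim, n_hyperedges)

    def forward(self, patches, tau=0.5):
        # patches: (N, D) patch embeddings from one slide
        logits = self.scorer(patches)                           # (N, E) hyperedge affinities
        assign = F.gumbel_softmax(logits, tau=tau, hard=True)   # differentiable one-hot draw
        counts = assign.sum(dim=0).clamp(min=1)                 # patches per hyperedge
        hyperedges = assign.t() @ patches / counts[:, None]     # (E, D) hyperedge features
        return assign, hyperedges

model = HyperedgeAssign(dim=256, n_hyperedges=8)
assign, hedges = model(torch.randn(500, 256))
print(assign.shape, hedges.shape)  # (500, 8) (8, 256)
```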

[CV-24] FlexMotion: Lightweight Physics-Aware and Controllable Human Motion Generation

【Quick Read】: This paper addresses lightweight, controllable, and physically plausible human motion synthesis, where existing methods typically compromise between computational efficiency, physical realism, and spatial controllability. The key is FlexMotion, a framework built on a computationally lightweight diffusion model operating in the latent space, which removes the need for a physics simulator and enables fast, efficient training. FlexMotion integrates joint locations, contact forces, joint actuations, and muscle activations to ensure the physical plausibility of generated motion, and introduces a plug-and-play module that adds spatial control over a range of motion parameters.

Link: https://arxiv.org/abs/2501.16778
Authors: Arvin Tashakori, Arash Tashakori, Gongbo Yang, Z. Jane Wang, Peyman Servati
Affiliations: University of British Columbia; Dalhousie University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Lightweight, controllable, and physically plausible human motion synthesis is crucial for animation, virtual reality, robotics, and human-computer interaction applications. Existing methods often compromise between computational efficiency, physical realism, or spatial controllability. We propose FlexMotion, a novel framework that leverages a computationally lightweight diffusion model operating in the latent space, eliminating the need for physics simulators and enabling fast and efficient training. FlexMotion employs a multimodal pre-trained Transformer encoder-decoder, integrating joint locations, contact forces, joint actuations and muscle activations to ensure the physical plausibility of the generated motions. FlexMotion also introduces a plug-and-play module, which adds spatial controllability over a range of motion parameters (e.g., joint locations, joint actuations, contact forces, and muscle activations). Our framework achieves realistic motion generation with improved efficiency and control, setting a new benchmark for human motion synthesis. We evaluate FlexMotion on extended datasets and demonstrate its superior performance in terms of realism, physical plausibility, and controllability.
zh

[CV-25] Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

【Quick Read】: This paper addresses the adaptation of pre-trained foundation models to open-vocabulary semantic segmentation. The key is "Beyond-Labels", a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts, and captures positional information in images with Fourier embeddings, improving generalization across different image sizes.

Link: https://arxiv.org/abs/2501.16769
Authors: Muhammad Atta ur Rahman
Affiliations: Autonomous Driving Intelligence Lab, ETRI; University of Science and Technology (UST)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Self-supervised learning can resolve numerous image or linguistic processing problems when effectively trained. This study investigated simple yet efficient methods for adaping previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposed “Beyond-Labels,” a lightweight transformer-based fusion module that uses a handful of image segmentation data to fuse frozen image representations with language concepts. Furthermore, we efficiently captured positional information in images using Fourier embeddings, thus improving the generalization across various image sizes. Extensive ablation tests were performed to investigate the important components of our proposed method; when tested against the common benchmark PASCAL-5i, it demonstrated superior performance despite being trained on frozen image and language characteristics.
zh
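
Fourier positional embeddings encode normalised coordinates with sinusoids at doubling frequencies, which is what makes them agnostic to image size. A generic sketch (the band count and layout are assumptions, not the paper's exact layer):

```python
import torch

def fourier_embedding(coords, n_freqs=6):
    """Map normalised (x, y) positions to Fourier features.

    coords: (..., 2) in [0, 1]. Returns (..., 4 * n_freqs) embeddings whose
    frequencies double at each band.
    """
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi          # (F,) frequency bands
    angles = coords[..., None] * freqs                        # (..., 2, F)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)     # (..., 2, 2F)
    return emb.flatten(-2)                                    # (..., 4F)

# The same embedding works for any H x W because coordinates are normalised.
h, w = 32, 48
ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
grid = torch.stack([xs, ys], dim=-1)                          # (H, W, 2)
print(fourier_embedding(grid).shape)                          # torch.Size([32, 48, 24])
```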

[CV-26] Target-driven Self-Distillation for Partial Observed Trajectories Forecasting

【Quick Read】: This paper addresses the sharp drop in trajectory-forecasting performance for traffic agents under partial observation. The key is Target-driven Self-Distillation (TSD): accurately predicted targets guide the model's predictions under partial-observation conditions, and self-distillation lets the model learn from the feature distributions of both fully and partially observed trajectories within a single-stage, end-to-end training process, improving motion forecasting accuracy in both fully and partially observed scenarios.

Link: https://arxiv.org/abs/2501.16767
Authors: Pengfei Zhu, Peng Shu, Mengshi Qi, Liang Liu, Huadong Ma
Affiliations: Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate prediction of future trajectories of traffic agents is essential for ensuring safe autonomous driving. However, partially observed trajectories can significantly degrade the performance of even state-of-the-art models. Previous approaches often rely on knowledge distillation to transfer features from fully observed trajectories to partially observed ones. This involves firstly training a fully observed model and then using a distillation process to create the final model. While effective, they require multi-stage training, making the training process very expensive. Moreover, knowledge distillation can lead to a performance degradation of the model. In this paper, we introduce a Target-driven Self-Distillation method (TSD) for motion forecasting. Our method leverages predicted accurate targets to guide the model in making predictions under partial observation conditions. By employing self-distillation, the model learns from the feature distributions of both fully observed and partially observed trajectories during a single end-to-end training process. This enhances the model’s ability to predict motion accurately in both fully observed and partially observed scenarios. We evaluate our method on multiple datasets and state-of-the-art motion forecasting models. Extensive experimental results demonstrate that our approach achieves significant performance improvements in both settings. To facilitate further research, we will release our code and model checkpoints.
zh
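
A single-stage self-distillation objective can combine a forecasting loss on the predicted targets with a feature-matching term against the fully observed branch. The weighting and loss choices below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def tsd_style_loss(feat_partial, feat_full, pred, target, alpha=0.5):
    """One-stage self-distillation sketch for partial-observation forecasting.

    feat_partial / feat_full: features of the same scene from the partially and
    fully observed trajectory (one shared encoder, one forward pass each).
    pred / target: predicted and ground-truth future waypoints, (B, T, 2).
    """
    forecast = F.smooth_l1_loss(pred, target)                 # target-driven term
    distill = F.mse_loss(feat_partial, feat_full.detach())    # match full-view features
    return forecast + alpha * distill

fp = torch.randn(8, 256, requires_grad=True)                  # partial-observation features
ff = torch.randn(8, 256)                                      # full-observation features
loss = tsd_style_loss(fp, ff, torch.randn(8, 12, 2, requires_grad=True), torch.randn(8, 12, 2))
loss.backward()
print(float(loss))
```

Because both branches share one encoder and one training run, this avoids the expensive multi-stage teacher-then-student schedule of conventional knowledge distillation.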

[CV-27] DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation ICLR2025

【Quick Read】: This paper tackles the limited availability of high-quality 3D datasets and the inconsistency of 2D multi-view generation when producing 3D content from text or a single image. The key is DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models, effectively exploiting web-scale 2D priors while maintaining 3D consistency in a unified model. A lightweight reconstruction model instantly produces multi-view Gaussian splat grids to bootstrap training and enable scalable dataset curation, and a 3D rendering loss is combined with the regular diffusion loss on these grids to promote 3D coherence across arbitrary views.

Link: https://arxiv.org/abs/2501.16764
Authors: Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
Affiliations: Peking University; ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025; Project page: this https URL

Click to view abstract

Abstract:Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptions of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.
zh

[CV-28] AdaSemSeg: An Adaptive Few-shot Semantic Segmentation of Seismic Facies

【Quick Read】: This paper addresses the challenge of automated seismic image interpretation with limited training data, and in particular joint training over datasets that differ in the number of classes. The key is AdaSemSeg, an adaptive few-shot semantic segmentation method for interpreting seismic facies that accommodates a varying number of facies across datasets, improving the generalization of a model trained on one facies dataset to another. Because seismic imagery lacks a large annotated corpus like ImageNet, the backbone network is initialized with a self-supervised algorithm trained on seismic data to boost performance.

Link: https://arxiv.org/abs/2501.16760
Authors: Surojit Saha, Ross Whitaker
Affiliations: Scientific Computing and Imaging Institute, Kahlert School of Computing, The University of Utah, Salt Lake City, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under review at IEEE Transactions on Geoscience and Remote Sensing

Click to view abstract

Abstract:Automated interpretation of seismic images using deep learning methods is challenging because of the limited availability of training data. Few-shot learning is a suitable learning paradigm in such scenarios due to its ability to adapt to a new task with limited supervision (small training budget). Existing few-shot semantic segmentation (FSSS) methods fix the number of target classes. Therefore, they do not support joint training on multiple datasets varying in the number of classes. In the context of the interpretation of seismic facies, fixing the number of target classes inhibits the generalization capability of a model trained on one facies dataset to another, which is likely to have a different number of facies. To address this shortcoming, we propose a few-shot semantic segmentation method for interpreting seismic facies that can adapt to the varying number of facies across the dataset, dubbed the AdaSemSeg. In general, the backbone network of FSSS methods is initialized with the statistics learned from the ImageNet dataset for better performance. The lack of such a huge annotated dataset for seismic images motivates using a self-supervised algorithm on seismic datasets to initialize the backbone network. We have trained the AdaSemSeg on three public seismic facies datasets with different numbers of facies and evaluated the proposed method on multiple metrics. The performance of the AdaSemSeg on unseen datasets (not used in training) is better than the prototype-based few-shot method and baselines.
zh

[CV-29] ITVTON: Virtual Try-On Diffusion Transformer Model Based on Integrated Image and Text

【Quick Read】: This paper addresses the lack of naturalness and the poor rendering of intricate patterns in virtual try-on under complex scenes and poses. The key is ITVTON, which combines clothing and character images along spatial channels as input to improve fitting accuracy for the inpainting model, and integrates textual descriptions from multiple images to enhance the realism of the generated visuals. For computational efficiency, training is limited to the attention parameters within a single diffusion transformer (Single-DiT) block, and training samples are curated from the IGPair dataset to strengthen performance across diverse environments.

Link: https://arxiv.org/abs/2501.16757
Authors: Haifeng Ni
Affiliations: East China Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advancements in virtual fitting for characters and clothing have leveraged diffusion models to improve the realism of garment fitting. However, challenges remain in handling complex scenes and poses, which can result in unnatural garment fitting and poorly rendered intricate patterns. In this work, we introduce ITVTON, a novel method that enhances clothing-character interactions by combining clothing and character images along spatial channels as inputs, thereby improving fitting accuracy for the inpainting model. Additionally, we incorporate integrated textual descriptions from multiple images to boost the realism of the generated visual effects. To optimize computational efficiency, we limit training to the attention parameters within a single diffusion transformer (Single-DiT) block. To more rigorously address the complexities of real-world scenarios, we curated training samples from the IGPair dataset, thereby enhancing ITVTON’s performance across diverse environments. Extensive experiments demonstrate that ITVTON outperforms baseline methods both qualitatively and quantitatively, setting a new standard for virtual fitting tasks.
zh

[CV-30] SSF-PAN: Semantic Scene Flow-Based Perception for Autonomous Navigation in Traffic Scenarios

【Quick Read】: This paper addresses vehicle detection and localization in complex traffic scenes, where interference from moving objects defeats traditional methods based on outlier exclusion or semantic segmentation, which suffer from low computational efficiency and accuracy. The key contributions of SSF-PAN are threefold: a neural network that segments static and dynamic objects within scene flows based on their motion features, i.e. semantic scene flow (SSF); an iterative framework that further refines both the input scene flows and the output segmentation; and a scene flow-based navigation platform for testing the perception system in simulation. Validated on the SUScape-CARLA and KITTI datasets and in the CARLA simulator, the method outperforms traditional approaches in scene flow accuracy, moving-object detection accuracy, computational efficiency, and autonomous navigation effectiveness.

Link: https://arxiv.org/abs/2501.16754
Authors: Yinqi Chen, Meiying Zhang, Qi Hao, Guang Zhou
Affiliations: Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen 518055, China; Operation Department, Shenzhen Deeproute.ai Co., Ltd, Shenzhen, China
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Vehicle detection and localization in complex traffic scenarios pose significant challenges due to the interference of moving objects. Traditional methods often rely on outlier exclusions or semantic segmentations, which suffer from low computational efficiency and accuracy. The proposed SSF-PAN can achieve the functionalities of LiDAR point cloud based object detection/localization and SLAM (Simultaneous Localization and Mapping) with high computational efficiency and accuracy, enabling map-free navigation frameworks. The novelty of this work is threefold: 1) developing a neural network which can achieve segmentation among static and dynamic objects within the scene flows with different motion features, that is, semantic scene flow (SSF); 2) developing an iterative framework which can further optimize the quality of input scene flows and output segmentation results; 3) developing a scene flow-based navigation platform which can test the performance of the SSF perception system in the simulation environment. The proposed SSF-PAN method is validated using the SUScape-CARLA and the KITTI datasets, as well as on the CARLA simulator. Experimental results demonstrate that the proposed approach outperforms traditional methods in terms of scene flow computation accuracy, moving object detection accuracy, computational efficiency, and autonomous navigation effectiveness.
zh

[CV-31] Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction

【Quick Read】: This paper addresses next-frame prediction in video, whose main challenge is effectively capturing and processing spatial and temporal information from prior frames. It identifies two notable problems in current transformer-based models: (a) the multi-head self-attention (MHSA) mechanism splits the input embedding into N chunks, so each head captures only a fraction of the embedding's information, distorting its latent-space representation and causing semantic dilution; and (b) these models predict the embeddings of the next frames rather than the frames themselves, while the loss is computed on reconstructed frames, creating a discrepancy between the training objective and the model output. The key is a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture that mitigates semantic dilution, together with a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output.

Link: https://arxiv.org/abs/2501.16753
Authors: Hy Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
Affiliations: Applied Artificial Intelligence Institute, Deakin University, Burwood, Victoria, Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Next-frame prediction in videos is crucial for applications such as autonomous driving, object tracking, and motion prediction. The primary challenge in next-frame prediction lies in effectively capturing and processing both spatial and temporal information from previous video sequences. The transformer architecture, known for its prowess in handling sequence data, has made remarkable progress in this domain. However, transformer-based next-frame prediction models face notable issues: (a) The multi-head self-attention (MHSA) mechanism requires the input embedding to be split into N chunks, where N is the number of heads. Each segment captures only a fraction of the original embeddings information, which distorts the representation of the embedding in the latent space, resulting in a semantic dilution problem; (b) These models predict the embeddings of the next frames rather than the frames themselves, but the loss function based on the errors of the reconstructed frames, not the predicted embeddings – this creates a discrepancy between the training objective and the model output. We propose a Semantic Concentration Multi-Head Self-Attention (SCMHSA) architecture, which effectively mitigates semantic dilution in transformer-based next-frame prediction. Additionally, we introduce a loss function that optimizes SCMHSA in the latent space, aligning the training objective more closely with the model output. Our method demonstrates superior performance compared to the original transformer-based predictors.
zh
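
One way to avoid splitting the embedding into per-head slices is to let every head project the full input dimension. The sketch below is our reading of the "semantic concentration" idea, not the authors' released code:

```python
import torch
import torch.nn as nn

class SCMHSA(nn.Module):
    """Attention where each head projects the FULL embedding, not a 1/N slice."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        # Each head gets its own full-width q, k, v projection.
        self.qkv = nn.Linear(dim, 3 * heads * dim, bias=False)
        self.out = nn.Linear(heads * dim, dim)

    def forward(self, x):                     # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).view(B, T, 3, self.heads, D).permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) / D ** 0.5
        out = attn.softmax(dim=-1) @ v        # (B, H, T, D): full-dim context per head
        return self.out(out.transpose(1, 2).reshape(B, T, self.heads * D))

block = SCMHSA(dim=64, heads=4)
print(block(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```

The trade-off is a larger projection matrix than standard MHSA; the paper additionally pairs this with a latent-space loss so training optimises the predicted embeddings directly.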

[CV-32] DebugAgent: Efficient and Interpretable Error Slice Discovery for Comprehensive Model Debugging

【Quick Read】: This paper addresses the systematic failures deep learning models exhibit on specific data subsets (error slices) in computer vision tasks. The key is DebugAgent, an automated framework for error slice discovery and model repair: it generates task-specific visual attributes to highlight error-prone instances through an interpretable, structured process, and employs an efficient slice enumeration algorithm to identify error slices systematically, overcoming the combinatorial challenges of slice exploration. DebugAgent can also predict error slices beyond the validation set, addressing a key limitation of prior approaches, and it improves both the coherence and precision of identified slices and the model repair capability.

Link: https://arxiv.org/abs/2501.16751
Authors: Muxi Chen, Chenchen Zhao, Qiang Xu
Affiliations: The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite the significant success of deep learning models in computer vision, they often exhibit systematic failures on specific data subsets, known as error slices. Identifying and mitigating these error slices is crucial to enhancing model robustness and reliability in real-world scenarios. In this paper, we introduce DebugAgent, an automated framework for error slice discovery and model repair. DebugAgent first generates task-specific visual attributes to highlight instances prone to errors through an interpretable and structured process. It then employs an efficient slice enumeration algorithm to systematically identify error slices, overcoming the combinatorial challenges that arise during slice exploration. Additionally, DebugAgent extends its capabilities by predicting error slices beyond the validation set, addressing a key limitation of prior approaches. Extensive experiments across multiple domains, including image classification, pose estimation, and object detection - show that DebugAgent not only improves the coherence and precision of identified error slices but also significantly enhances the model repair capabilities.
zh

[CV-33] Consistency Diffusion Models for Single-Image 3D Reconstruction with Priors

【Quick Read】: This paper studies 3D point cloud reconstruction from a single image. The key is the Consistency Diffusion Model, which exploits synergistic 2D and 3D priors within a Bayesian framework to ensure superior consistency during reconstruction. The training framework brings two innovations: 3D structural priors derived from the initial point cloud are converted into a bound term that increases evidence in the variational Bayesian framework, tightly governing the diffusion training process; and 2D priors extracted from the single input image are projected onto the 3D point cloud to enrich the guidance for diffusion training. This sidesteps the model-learning shift that directly imposing additional constraints could cause, while precisely transposing the 2D priors into the 3D domain.

Link: https://arxiv.org/abs/2501.16737
Authors: Chenru Jiang, Chengrui Zhang, Xi Yang, Jie Sun, Kaizhu Huang
Affiliations: Xi'an Jiaotong-Liverpool University; University of Liverpool; Duke Kunshan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper delves into the study of 3D point cloud reconstruction from a single image. Our objective is to develop the Consistency Diffusion Model, exploring synergistic 2D and 3D priors in the Bayesian framework to ensure superior consistency in the reconstruction process, a challenging yet critical requirement in this field. Specifically, we introduce a pioneering training framework under diffusion models that brings two key innovations. First, we convert 3D structural priors derived from the initial 3D point cloud as a bound term to increase evidence in the variational Bayesian framework, leveraging these robust intrinsic priors to tightly govern the diffusion training process and bolster consistency in reconstruction. Second, we extract and incorporate 2D priors from the single input image, projecting them onto the 3D point cloud to enrich the guidance for diffusion training. Our framework not only sidesteps potential model learning shifts that may arise from directly imposing additional constraints during training but also precisely transposes the 2D priors into the 3D domain. Extensive experimental evaluations reveal that our approach sets new benchmarks in both synthetic and real-world datasets. The code is included with the submission.

[CV-34] Dream to Drive with Predictive Individual World Model

Quick Read: This paper addresses the problem of producing reactive driving behaviors in complex urban environments where the intentions of road users are unknown. The key is a novel model-based reinforcement learning (MBRL) method with a Predictive Individual World Model (PIWM). PIWM describes the driving environment at the individual level, captures interactive relations among vehicles and their intentions via a trajectory prediction task, and is learned jointly with the behavior policy. Leveraging intention-aware latent states, the method navigates urban driving scenes safely and efficiently.

Link: https://arxiv.org/abs/2501.16733
Authors: Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
Affiliations: School of Automation and Electrical Engineering, University of Science and Technology Beijing; Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Codes: this https URL

Abstract:It is still a challenging topic to make reactive driving behaviors in complex urban environments as road users’ intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation in relevant research lies in the scene-level reconstruction representation learning, which may overlook key interactive vehicles and hardly model the interactive features among vehicles and their long-term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual-level perspective and captures vehicles’ interactive relations and their intentions via trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM’s imagination and effectively navigates in the urban driving scenes leveraging intention-aware latent states. The proposed method is trained and evaluated on simulation environments built upon real-world challenging interactive scenarios. Compared with popular model-free and state-of-the-art model-based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.

[CV-35] B-RIGHT: Benchmark Re-evaluation for Integrity in Generalized Human-Object Interaction Testing

Quick Read: This paper addresses severe class imbalance in existing benchmarks such as HICO-DET, as well as inconsistent train/test counts for certain classes, both of which can undermine the reliability of model evaluation. To fix this, the paper proposes a systematic approach to building a new class-balanced dataset, Benchmark Re-evaluation for Integrity in Generalized Human-object Interaction Testing (B-RIGHT). The key is a balancing algorithm combined with automated generation-and-filtering processes, ensuring an equal number of instances for every HOI class, plus a balanced zero-shot test set for systematically evaluating models on unseen scenarios. Re-evaluating existing models with B-RIGHT substantially reduces score variance and changes performance rankings compared to the conventional HICO-DET.

Link: https://arxiv.org/abs/2501.16724
Authors: Yoojin Jang, Junsu Kim, Hayeon Kim, Eun-ki Lee, Eun-sol Kim, Seungryul Baek, Jaejun Yoo
Affiliations: Ulsan National Institute of Science and Technology (UNIST); NAVER AI LAB; Hanyang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human-object interaction (HOI) is an essential problem in artificial intelligence (AI), which aims to understand the visual world that involves complex relationships between humans and objects. However, current benchmarks such as HICO-DET face the following limitations: (1) severe class imbalance and (2) varying numbers of train and test instances for certain classes. These issues can potentially lead to either inflation or deflation of model performance during evaluation, ultimately undermining the reliability of evaluation scores. In this paper, we propose a systematic approach to develop a new class-balanced dataset, Benchmark Re-evaluation for Integrity in Generalized Human-object Interaction Testing (B-RIGHT), that addresses these imbalance problems. B-RIGHT achieves class balance by leveraging a balancing algorithm and automated generation-and-filtering processes, ensuring an equal number of instances for each HOI class. Furthermore, we design a balanced zero-shot test set to systematically evaluate models on unseen scenarios. Re-evaluating existing models using B-RIGHT reveals a substantial reduction in score variance and changes in performance rankings compared to conventional HICO-DET. Our experiments demonstrate that evaluation under balanced conditions ensures more reliable and fair model comparisons.

[CV-36] One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning

Quick Read: This paper addresses the excessive trainable parameters and high computational cost of fine-tuning vision-language foundation models (VLMs) for downstream few-shot learning. The key is Block-LoRA, a block-matrix-based low-rank adaptation framework that partitions the low-rank decomposition matrices of LoRA into a series of sub-matrices and shares all down-projection sub-matrices. This not only reduces the number of trainable parameters but also simplifies certain complex matrix multiplications into matrix additions, significantly lowering the computational cost of fine-tuning.

Link: https://arxiv.org/abs/2501.16720
Authors: Chunpeng Zhou, Qianqian Shen, Zhi Yu, Jiajun Bu, Haishuai Wang
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Recent advancements in fine-tuning Vision-Language Foundation Models (VLMs) have garnered significant attention for their effectiveness in downstream few-shot learning tasks. Although these recent approaches exhibit some performance improvements, they often suffer from excessive training parameters and high computational costs. To address these challenges, we propose a novel Block matrix-based low-rank adaptation framework, called Block-LoRA, for fine-tuning VLMs on downstream few-shot tasks. Inspired by recent work on Low-Rank Adaptation (LoRA), Block-LoRA partitions the original low-rank decomposition matrix of LoRA into a series of sub-matrices while sharing all down-projection sub-matrices. This structure not only reduces the number of training parameters, but also transforms certain complex matrix multiplication operations into simpler matrix addition, significantly lowering the computational cost of fine-tuning. Notably, Block-LoRA enables fine-tuning CLIP on the ImageNet few-shot benchmark using a single 24GB GPU. We also show that Block-LoRA has a tighter generalization error bound than vanilla LoRA. Without bells and whistles, extensive experiments demonstrate that Block-LoRA achieves competitive performance compared to state-of-the-art CLIP-based few-shot methods, while maintaining a low trainable parameter count and reduced computational overhead.
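
To make the shared-down-projection idea concrete, below is a minimal PyTorch sketch of a Block-LoRA-style linear layer. The block count, rank, initialization, and the choice to concatenate per-block up-projections are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BlockLoRALinear(nn.Module):
    """Illustrative Block-LoRA-style layer: one shared down-projection A,
    with the up-projection partitioned into per-block sub-matrices whose
    outputs are concatenated along the output dimension."""

    def __init__(self, in_dim, out_dim, rank=4, n_blocks=4, alpha=4.0):
        super().__init__()
        assert out_dim % n_blocks == 0
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)           # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # shared down-projection
        self.Bs = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_dim // n_blocks, rank)) for _ in range(n_blocks)]
        )
        self.scale = alpha / rank

    def forward(self, x):
        down = x @ self.A.t()                            # computed once, reused by every block
        up = torch.cat([down @ B.t() for B in self.Bs], dim=-1)
        return self.base(x) + self.scale * up

layer = BlockLoRALinear(768, 768)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```

Because the shared down-projection is computed once and reused by every block, the trainable parameter count and per-block compute stay small in this sketch; the paper's exact factorization and savings may differ.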

[CV-37] Point Cloud Upsampling as Statistical Shape Model for Pelvic

Quick Read: This paper addresses the limited accuracy of medical image segmentation and point cloud upsampling when building high-resolution 3D bone models. The key is to use the SAM-Med3D model for segmentation together with a point cloud upsampling network trained on the MedShapeNet dataset, turning sparse medical imaging data into detailed 3D skeletal models. The approach incorporates prior knowledge of anatomical shapes, yielding smoother and more complete reconstructions.

Link: https://arxiv.org/abs/2501.16716
Authors: Tongxu Zhang, Bei Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Abstract:We propose a novel framework that integrates medical image segmentation and point cloud upsampling for accurate shape reconstruction of pelvic models. Using the SAM-Med3D model for segmentation and a point cloud upsampling network trained on the MedShapeNet dataset, our method transforms sparse medical imaging data into high-resolution 3D bone models. This framework leverages prior knowledge of anatomical shapes, achieving smoother and more complete reconstructions. Quantitative evaluations using metrics such as the Chamfer Distance demonstrate the effectiveness of point cloud upsampling on the pelvic model. Our approach offers potential applications in reconstructing other skeletal structures, providing a robust solution for medical image analysis and statistical shape modeling.

[CV-38] Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models

Quick Read: This paper addresses how to generate videos with a specified motion concept via a diffusion model (DM) while preserving appearance diversity. The key lies in separating the motion concept from appearance during DM adaptation. The paper learns a motion LoRA and introduces two strategies, temporal attention purification (TAP) and appearance highway (AH), to strengthen motion-appearance separation. Specifically, TAP assumes the pretrained Value embeddings in the temporal attention module suffice as the basic components for producing new motion, so it reshapes only the temporal attention with motion LoRAs, allowing the Value embeddings to be reorganized into new motion; AH moves the starting point of each skip connection in the U-Net from the output of each temporal attention module to the output of each spatial attention module. Experiments show the method generates videos whose appearance better matches the text descriptions and whose motion is more consistent with the reference videos.

Link: https://arxiv.org/abs/2501.16714
Authors: Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang
Affiliations: Beihang University; Meituan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures

Abstract:Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. To realize this goal, the adaptation of DM should be possible to model the specified motion concept, without compromising the ability to generate diverse appearances. Thus, the key to solving this problem lies in how to separate the motion concept from the appearance in the adaptation process of DM. Typical previous works explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance in the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a motion LoRA to encode the motion concept, but propose two novel strategies to enhance motion-appearance separation, including temporal attention purification (TAP) and appearance highway (AH). Specifically, we assume that in the temporal attention module, the pretrained Value embeddings are sufficient to serve as basic components needed by producing a new motion. Thus, in TAP, we choose only to reshape the temporal attention with motion LoRAs so that Value embeddings can be reorganized to produce a new motion. Further, in AH, we alter the starting point of each skip connection in U-Net from the output of each temporal attention module to the output of each spatial attention module. Extensive experiments demonstrate that compared to previous works, our method can generate videos with appearance more aligned with the text descriptions and motion more consistent with the reference videos.
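
A minimal sketch of how TAP-style adaptation might look in code: LoRA adapters are attached only to the Value projections inside temporal attention modules, leaving spatial attention and the temporal Query/Key projections frozen. The module-name matching (`"temporal"`, `to_v`) is an assumed naming convention for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a rank-r residual (LoRA) update."""

    def __init__(self, linear, rank=8, alpha=8.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, linear.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.linear(x) + self.scale * self.up(self.down(x))

def apply_tap(unet):
    """TAP-style sketch: add motion LoRAs only to the Value projection of
    temporal attention blocks; everything else stays untouched."""
    for name, module in unet.named_modules():
        if "temporal" in name and isinstance(getattr(module, "to_v", None), nn.Linear):
            module.to_v = LoRALinear(module.to_v)
    return unet
```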

[CV-39] DFCon: Attention-Driven Supervised Contrastive Learning for Robust Deepfake Detection

Quick Read: This paper addresses deepfake face detection in the wild (DFWild-Cup). The key is to use several advanced backbones (MaxViT, CoAtNet, and EVA-02), fine-tuned with a supervised contrastive loss to enhance feature separation. These backbones were chosen for complementary strengths: MaxViT's convolution layers and strided attention suit local feature detection; CoAtNet's hybrid convolution-attention design effectively captures multi-scale features; and EVA-02's robust masked-image-modeling pretraining excels at capturing global features. After training, the backbone parameters are frozen and the classification heads are trained; a majority-voting ensemble then combines the models' predictions, improving robustness and generalization. The system reaches 95.83% accuracy on the validation set.

Link: https://arxiv.org/abs/2501.16704
Authors: MD Sadik Hossain Shanto, Mahir Labib Dihan, Souvik Ghosh, Riad Ahmed Anonto, Hafijul Hoque Chowdhury, Abir Muhtasim, Rakib Ahsan, MD Tanvir Hassan, MD Roqunuzzaman Sojib, Sheikh Azizul Hakim, M. Saifur Rahman
Affiliations: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: Technical report for IEEE Signal Processing Cup 2025, 7 pages

Abstract:This report presents our approach for the IEEE SP Cup 2025: Deepfake Face Detection in the Wild (DFWild-Cup), focusing on detecting deepfakes across diverse datasets. Our methodology employs advanced backbone models, including MaxViT, CoAtNet, and EVA-02, fine-tuned using supervised contrastive loss to enhance feature separation. These models were specifically chosen for their complementary strengths. Integration of convolution layers and strided attention in MaxViT is well-suited for detecting local features. In contrast, hybrid use of convolution and attention mechanisms in CoAtNet effectively captures multi-scale features. Robust pretraining with masked image modeling of EVA-02 excels at capturing global features. After training, we freeze the parameters of these models and train the classification heads. Finally, a majority voting ensemble is employed to combine the predictions from these models, improving robustness and generalization to unseen scenarios. The proposed system addresses the challenges of detecting deepfakes in real-world conditions and achieves a commendable accuracy of 95.83% on the validation dataset.
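
The final ensembling step is straightforward; here is a minimal sketch of majority voting over per-backbone logits (toy shapes, not the competition pipeline):

```python
import torch

def majority_vote(logits_list):
    """Combine per-model class predictions by majority vote.
    logits_list: list of (batch, n_classes) tensors, one per backbone."""
    votes = torch.stack([l.argmax(dim=1) for l in logits_list])  # (n_models, batch)
    return votes.mode(dim=0).values                              # most common label per sample

# toy usage: three backbones, a batch of four images, binary real/fake output
preds = majority_vote([torch.randn(4, 2) for _ in range(3)])
print(preds)  # tensor of 0/1 labels
```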

[CV-40] Determining Mosaic Resilience in Sugarcane Plants using Hyperspectral Images

Quick Read: This paper addresses the serious threat sugarcane mosaic disease poses to the Australian sugarcane industry, where existing manual inspection methods are too inefficient and impractical for large-scale use. The key is to combine hyperspectral imaging with machine learning, using a ResNet18 deep learning architecture to aggregate global feature representations from local spectral patches and thereby identify mosaic resilience. Whereas classical methods such as support vector machines struggle to exploit spatial-spectral relationships, the deep model achieves high classification accuracy, improving early detection and enabling more efficient management of susceptible varieties for sustainable sugarcane production.

Link: https://arxiv.org/abs/2501.16700
Authors: Ali Zia, Jun Zhou, Muyiwa Olayemi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sugarcane mosaic disease poses a serious threat to the Australian sugarcane industry, leading to yield losses of up to 30% in susceptible varieties. Existing manual inspection methods for detecting mosaic resilience are inefficient and impractical for large-scale application. This study introduces a novel approach using hyperspectral imaging and machine learning to detect mosaic resilience by leveraging global feature representation from local spectral patches. Hyperspectral data were collected from eight sugarcane varieties under controlled and field conditions. Local spectral patches were analyzed to capture spatial and spectral variations, which were then aggregated into global feature representations using a ResNet18 deep learning architecture. While classical methods like Support Vector Machines struggled to utilize spatial-spectral relationships effectively, the deep learning model achieved high classification accuracy, demonstrating its capacity to identify mosaic resilience from fine-grained hyperspectral data. This approach enhances early detection capabilities, enabling more efficient management of susceptible strains and contributing to sustainable sugarcane production.

[CV-41] SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation ICRA2025

Quick Read: This paper addresses the difficulty planar-based representations have in capturing global semantic occupancy in dense indoor environments with frequent occlusions. The key is a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes, together with SliceOcc, an RGB-camera-based model tailored for indoor 3D semantic occupancy prediction. SliceOcc uses pairs of slice queries and cross-attention mechanisms to extract planar features from input images, then fuses these local planar features into a global scene representation for indoor occupancy prediction.

Link: https://arxiv.org/abs/2501.16684
Authors: Jianing Li, Ming Lu, Hao Wang, Chenyang Gu, Wenzhao Zheng, Li Du, Shanghang Zhang
Affiliations: School of Electronic Science and Engineering, Nanjing University; School of Computer Science, Peking University; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025

Abstract:3D semantic occupancy prediction is a crucial task in visual perception, as it requires the simultaneous comprehension of both scene geometry and semantics. It plays a crucial role in understanding 3D scenes and has great potential for various applications, such as robotic vision perception and autonomous driving. Many existing works utilize planar-based representations such as Bird’s Eye View (BEV) and Tri-Perspective View (TPV). These representations aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments with prevalent occlusions, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we present a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To utilize these slice features, we propose SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then fused to form a global scene representation, which is employed for indoor occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories, setting a new state-of-the-art performance among RGB camera-based models for indoor 3D semantic occupancy prediction. Code is available at this https URL.

[CV-42] Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion ICRA2025

Quick Read: This paper addresses the difficulty of acquiring high-quality endoscopic images for automated diagnostic systems (ADS) due to high annotation costs and strict privacy concerns, and the failure of existing endoscopic image generation algorithms to accurately render polyp boundary details, typically requiring medical priors to specify plausible polyp locations and shapes, which limits the realism and diversity of generated images. The key is Polyp-Gen, the first fully automatic diffusion-based endoscopic image generation framework. It devises a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions, and introduces a hierarchical retrieval-based sampling strategy that matches similar fine-grained spatial features to capture medical priors for localizing potential polyp areas. In this way, Polyp-Gen generates realistic and diverse endoscopic images for building reliable ADS.

Link: https://arxiv.org/abs/2501.16679
Authors: Shengyuan Liu, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, Yixuan Yuan
Affiliations: Department of Electronic Engineering, Chinese University of Hong Kong; Centre for Artificial Intelligence and Robotics (CAIR), Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences; Department of Electrical Engineering, The City University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; The Sixth Affiliated Hospital, Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025

Abstract:Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms failed to accurately generate the details of polyp boundary regions and typically required medical priors to specify plausible locations and shapes of polyps, which limited the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first full-automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate the state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at this https URL.

[CV-43] Improving Interpretability and Accuracy in Neuro-Symbolic Rule Extraction Using Class-Specific Sparse Filters

Quick Read: This paper addresses the accuracy loss of neuro-symbolic models in image classification. The key is a novel sparsity loss function that enables class-specific filter binarization during CNN training, minimizing information loss when the rule set is extracted. The method markedly improves both accuracy and rule-set compactness: compared to the previous state of the art (SOTA), it achieves a 9% accuracy gain and a 53% reduction in rule-set size on average, while staying within 3% of the original CNN's accuracy.

Link: https://arxiv.org/abs/2501.16677
Authors: Parth Padalkar, Jaeseong Lee, Shiyi Wei, Gopal Gupta
Affiliations: The University of Texas at Dallas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:There has been significant focus on creating neuro-symbolic models for interpretable image classification using Convolutional Neural Networks (CNNs). These methods aim to replace the CNN with a neuro-symbolic model consisting of the CNN, which is used as a feature extractor, and an interpretable rule-set extracted from the CNN itself. While these approaches provide interpretability through the extracted rule-set, they often compromise accuracy compared to the original CNN model. In this paper, we identify the root cause of this accuracy loss as the post-training binarization of filter activations to extract the rule-set. To address this, we propose a novel sparsity loss function that enables class-specific filter binarization during CNN training, thus minimizing information loss when extracting the rule-set. We evaluate several training strategies with our novel sparsity loss, analyzing their effectiveness and providing guidance on their appropriate use. Notably, we set a new benchmark, achieving a 9% improvement in accuracy and a 53% reduction in rule-set size on average, compared to the previous SOTA, while coming within 3% of the original CNN’s accuracy. This highlights the significant potential of interpretable neuro-symbolic models as viable alternatives to black-box CNNs.
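
The paper's exact loss is not reproduced here, but a hypothetical sparsity loss in the same spirit, pushing pooled filter activations toward binary values and toward class-specific filter usage, could look like this sketch (the function name, both terms, and the weighting `lam` are assumptions):

```python
import torch

def class_specific_sparsity_loss(acts, labels, n_classes, lam=0.1):
    """Hypothetical loss: 'acts' are sigmoid-normalized pooled filter
    activations of shape (batch, n_filters). The first term is minimized
    when each activation is exactly 0 or 1 (cheap binarization); the second
    pushes each class toward a sparse, class-specific subset of filters."""
    binary_term = (acts * (1.0 - acts)).mean()
    per_class = [acts[labels == c].mean(dim=0)
                 for c in range(n_classes) if (labels == c).any()]
    sparsity_term = torch.stack(per_class).abs().mean()  # L1 on per-class mean patterns
    return binary_term + lam * sparsity_term

acts = torch.sigmoid(torch.randn(8, 64))
loss = class_specific_sparsity_loss(acts, torch.randint(0, 10, (8,)), n_classes=10)
print(loss)
```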

[CV-44] CSPCL: Category Semantic Prior Contrastive Learning for Deformable DETR-Based Prohibited Item Detectors

Quick Read: This paper addresses the foreground-background feature coupling caused by the overlap phenomenon specific to X-ray images, which makes general detectors designed for natural images perform poorly on prohibited item detection. The key is a Category Semantic Prior Contrastive Learning (CSPCL) mechanism with a dedicated contrastive loss (CSP loss) comprising an Intra-Class Truncated Attraction (ITA) loss and an Inter-Class Adaptive Repulsion (IAR) loss, which correct and supplement the missing semantic information needed for classification and thereby enhance the model's sensitivity to the foreground.

Link: https://arxiv.org/abs/2501.16665
Authors: Mingyuan Li, Tong Jia, Hui Lu, Bowen Ma, Hao Wang, Dongyue Chen
Affiliations: Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Prohibited item detection based on X-ray images is one of the most effective security inspection methods. However, the foreground-background feature coupling caused by the overlapping phenomenon specific to X-ray images makes general detectors designed for natural images perform poorly. To address this issue, we propose a Category Semantic Prior Contrastive Learning (CSPCL) mechanism, which aligns the class prototypes perceived by the classifier with the content queries to correct and supplement the missing semantic information responsible for classification, thereby enhancing the model's sensitivity to foreground features. To achieve this alignment, we design a specific contrastive loss, CSP loss, which includes Intra-Class Truncated Attraction (ITA) loss and Inter-Class Adaptive Repulsion (IAR) loss, and outperforms classic N-pair loss and InfoNCE loss. Specifically, ITA loss leverages class prototypes to attract intra-class category-specific content queries while preserving necessary distinctiveness. IAR loss utilizes class prototypes to adaptively repel inter-class category-specific content queries based on the similarity between class prototypes, helping disentangle features of similar categories. CSPCL is general and can be easily integrated into Deformable DETR-based models. Extensive experiments on the PIXray and OPIXray datasets demonstrate that CSPCL significantly enhances the performance of various state-of-the-art models without increasing computational overhead. The code will be open source once the paper is accepted.
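
As a rough illustration of how an ITA/IAR-style objective can be wired up, here is a hedged PyTorch sketch; the margin, the similarity-based weighting, and the exact truncation are assumptions rather than the paper's formulation:

```python
import torch
import torch.nn.functional as F

def csp_loss(queries, labels, prototypes, margin=0.1):
    """Sketch of a CSP-style loss. ITA pulls each content query toward its
    class prototype but stops pulling inside a margin (preserving some
    distinctiveness); IAR pushes queries away from other prototypes, weighted
    by prototype-prototype similarity so confusable classes repel harder."""
    q = F.normalize(queries, dim=-1)                   # (N, d) content queries
    p = F.normalize(prototypes, dim=-1)                # (C, d) class prototypes
    sim = q @ p.t()                                    # (N, C) cosine similarities
    pos = sim[torch.arange(len(labels)), labels]
    ita = F.relu((1.0 - margin) - pos).mean()          # truncated attraction
    proto_sim = (p @ p.t()).clamp(min=0)               # how confusable two classes are
    weights = proto_sim[labels]                        # (N, C) adaptive repulsion weights
    own = F.one_hot(labels, num_classes=p.size(0)).bool()
    iar = (weights * sim.clamp(min=0)).masked_fill(own, 0.0).mean()
    return ita + iar

loss = csp_loss(torch.randn(16, 256), torch.randint(0, 8, (16,)), torch.randn(8, 256))
print(loss)
```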

[CV-45] Improving Vision-Language-Action Model with Online Reinforcement Learning ICRA2025

Quick Read: This paper addresses how to further improve vision-language-action (VLA) models in interactive environments. The key is the iRe-VLA framework, which iterates between reinforcement learning (RL) and supervised fine-tuning (SFT), exploiting RL's exploratory benefits while preserving the stability of supervised learning, thereby effectively improving large VLA models. This overcomes the training instability and excessive compute burden encountered when applying online RL directly.

Link: https://arxiv.org/abs/2501.16664
Authors: Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICRA 2025

Abstract:Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.

[CV-46] Vision-based autonomous structural damage detection using data-driven methods

Quick Read: This paper addresses efficiency and accuracy in structural damage detection for wind turbines, a critical component of renewable energy infrastructure. Traditional inspection methods such as manual assessment and non-destructive testing (NDT) are costly, time-consuming, and error-prone. The study applies advanced deep learning algorithms for vision-based structural health monitoring (SHM): YOLOv7, its lightweight variant, and Faster R-CNN are used to detect and classify surface damage, with hyperparameters such as learning rate and batch size tuned to further improve accuracy and efficiency. YOLOv7 performs best, reaching 82.4% mAP@50 with high processing speed, making it suitable for real-time inspection.

Link: https://arxiv.org/abs/2501.16662
Authors: Seyyed Taghi Ataei, Parviz Mohammad Zadeh, Saeid Ataei
Affiliations: University of Tehran; Faculty of New Sciences and Technologies; Stevens Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 14 pages, 8 figures. This study examines advanced deep learning algorithms, specifically YOLOv7, for efficient and accurate damage detection in wind turbine structures. It significantly enhances detection precision and speed for real-time inspections

Abstract:This study addresses the urgent need for efficient and accurate damage detection in wind turbine structures, a crucial component of renewable energy infrastructure. Traditional inspection methods, such as manual assessments and non-destructive testing (NDT), are often costly, time-consuming, and prone to human error. To tackle these challenges, this research investigates advanced deep learning algorithms for vision-based structural health monitoring (SHM). A dataset of wind turbine surface images, featuring various damage types and pollution, was prepared and augmented for enhanced model training. Three algorithms-YOLOv7, its lightweight variant, and Faster R-CNN- were employed to detect and classify surface damage. The models were trained and evaluated on a dataset split into training, testing, and evaluation subsets (80%-10%-10%). Results indicate that YOLOv7 outperformed the others, achieving 82.4% mAP@50 and high processing speed, making it suitable for real-time inspections. By optimizing hyperparameters like learning rate and batch size, the models’ accuracy and efficiency improved further. YOLOv7 demonstrated significant advancements in detection precision and execution speed, especially for real-time applications. However, challenges such as dataset limitations and environmental variability were noted, suggesting future work on segmentation methods and larger datasets. This research underscores the potential of vision-based deep learning techniques to transform SHM practices by reducing costs, enhancing safety, and improving reliability, thus contributing to the sustainable maintenance of critical infrastructure and supporting the longevity of wind energy systems.

[CV-47] Molecular-driven Foundation Model for Oncologic Pathology

Quick Read: This paper addresses the limitations of existing foundation models on whole-slide images (WSIs): they struggle to encode entire gigapixel slides without additional training and usually lack complementary multimodal data. The key is Threads, a slide-level foundation model that generates universal representations of WSIs of any size. Its core innovation is multimodal pretraining on a large paired dataset of 47,171 hematoxylin-and-eosin-stained tissue sections with corresponding genomic and transcriptomic profiles, enabling Threads to capture the tissue's underlying molecular composition and yield powerful representations applicable to a wide range of downstream tasks.

Link: https://arxiv.org/abs/2501.16652
Authors: Anurag Vaidya, Andrew Zhang, Guillaume Jaume, Andrew H. Song, Tong Ding, Sophia J. Wagner, Ming Y. Lu, Paul Doucet, Harry Robertson, Cristina Almagro-Perez, Richard J. Chen, Dina ElHarouni, Georges Ayoub, Connor Bossi, Keith L. Ligon, Georg Gerber, Long Phi Le, Faisal Mahmood
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Foundation models are reshaping computational pathology by enabling transfer learning, where models pre-trained on vast datasets can be adapted for downstream diagnostic, prognostic, and therapeutic response tasks. Despite these advances, foundation models are still limited in their ability to encode the entire gigapixel whole-slide images without additional training and often lack complementary multimodal data. Here, we introduce Threads, a slide-level foundation model capable of generating universal representations of whole-slide images of any size. Threads was pre-trained using a multimodal learning approach on a diverse cohort of 47,171 hematoxylin and eosin (H&E)-stained tissue sections, paired with corresponding genomic and transcriptomic profiles - the largest such paired dataset to be used for foundation model development to date. This unique training paradigm enables Threads to capture the tissue's underlying molecular composition, yielding powerful representations applicable to a wide array of downstream tasks. In extensive benchmarking across 54 oncology tasks, including clinical subtyping, grading, mutation prediction, immunohistochemistry status determination, treatment response prediction, and survival prediction, Threads outperformed all baselines while demonstrating remarkable generalizability and label efficiency. It is particularly well suited for predicting rare events, further emphasizing its clinical utility. We intend to make the model publicly available for the broader community.

[CV-48] Predicting 3D representations for Dynamic Scenes

Quick Read: This paper addresses 4D (3D space plus time) radiance field prediction for dynamic scenes. The key is an ego-centric unbounded triplane that explicitly represents the dynamic physical world, together with a 4D-aware transformer that aggregates features from monocular videos to update the triplane. Combining the two designs allows the model to be trained on large-scale monocular videos in a self-supervised manner, achieving excellent dynamic radiance field prediction and superior generalization to unseen scenes.

Link: https://arxiv.org/abs/2501.16617
Authors: Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang
Affiliations: StepFun; MEGVII Technology; Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel framework for dynamic radiance field prediction given monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer to aggregate features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model with large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA dynamic scenes, demonstrating its strong performance on 4D physical world modeling. Moreover, our model shows superior generalizability to unseen scenarios. Notably, we find that our approach exhibits emergent capabilities for geometric and semantic learning.

[CV-49] CascadeV: An Implementation of Wurstchen Architecture for Video Generation

Quick Read: This paper addresses the heavy computational demands of using diffusion models for text-to-video (T2V) generation, especially for producing high-quality, high-resolution videos at high frame rates. The key is CascadeV, a cascaded latent diffusion model that generates state-of-the-art 2K-resolution videos. With a cascaded architecture and a spatiotemporal alternating grid 3D attention mechanism, the model achieves a higher compression ratio and effectively integrates spatial and temporal information, ensuring consistency across generated frames while substantially reducing the computational complexity of high-quality video generation. The model can also be cascaded with existing T2V models, theoretically enabling a 4x increase in resolution or frames per second without fine-tuning.

Link: https://arxiv.org/abs/2501.16612
Authors: Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4x increase in resolution or frames per second without any fine-tuning. Our code is available at this https URL.

[CV-50] Unsupervised Domain Adaptation with Dynamic Clustering and Contrastive Refinement for Gait Recognition

Quick Read: This paper addresses the impact of pseudo-label noise on clustering and model training in unsupervised gait recognition. The key is the GaitDCCR model, which introduces Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) in the clustering stage, and Confidence-based Pseudo-label Refinement (CPR) and a Contrastive Teacher Module (CTM) in the training stage, to mitigate the negative effect of noisy pseudo labels on model performance and thereby significantly improve unsupervised gait recognition.

Link: https://arxiv.org/abs/2501.16608
Authors: Xiaolei Liu, Yan Sun, Mark Nixon
Affiliations: School of Computer Engineering and Science, Shanghai University, Shanghai, China; School of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 8 figures

Abstract:Gait recognition is an emerging identification technology that distinguishes individuals at long distances by analyzing individual walking patterns. Traditional techniques rely heavily on large-scale labeled datasets, which incur high costs and significant labeling challenges. Recently, researchers have explored unsupervised gait recognition with clustering-based unsupervised domain adaptation methods and achieved notable success. However, these methods directly use pseudo-labels generated by clustering and neglect the pseudo-label noise caused by domain differences, which degrades the model training process. To mitigate these issues, we propose a novel model called GaitDCCR, which aims to reduce the influence of noisy pseudo labels on clustering and model training. Our approach can be divided into two main stages: a clustering stage and a training stage. In the clustering stage, we propose Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) to improve the efficiency of clustering and obtain reliable cluster centroids. In the training stage, we employ the classical teacher-student structure and propose Confidence-based Pseudo-label Refinement (CPR) and Contrastive Teacher Module (CTM) to encourage noisy samples to converge towards clusters containing their true identities. Extensive experiments on public gait datasets have demonstrated that our simple and effective method significantly enhances the performance of unsupervised gait recognition, laying the foundation for its application in the wild. The code is available at this https URL
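
A minimal sketch of confidence-based pseudo-label refinement in the spirit of CPR, where clustering confidence is approximated by cosine similarity to the assigned centroid (the threshold and the confidence measure are assumptions, not the paper's definitions):

```python
import torch
import torch.nn.functional as F

def refine_pseudo_labels(features, centroids, tau=0.6):
    """Assign each sample to its nearest centroid and keep only assignments
    whose cosine similarity exceeds tau; low-confidence samples get label -1
    so they can be ignored or down-weighted during training."""
    sim = F.normalize(features, dim=-1) @ F.normalize(centroids, dim=-1).t()  # (N, K)
    conf, labels = sim.max(dim=1)
    labels = labels.masked_fill(conf < tau, -1)  # flag noisy pseudo-labels
    return labels, conf

labels, conf = refine_pseudo_labels(torch.randn(100, 256), torch.randn(10, 256))
print((labels == -1).sum().item(), "samples flagged as noisy")
```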

[CV-51] Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration

Quick Read: This paper addresses how to improve computational efficiency while preserving high quality in image restoration. Existing methods, mainly based on CNNs, Transformers, or hybrids of the two, often struggle to model long-range dependencies effectively and overlook the spatial characteristics of degradation (texture-rich regions tend to suffer more severe damage), making the best quality-efficiency trade-off hard to reach. The key is TAMambaIR, a novel texture-aware image restoration method that introduces a Texture-Aware State Space Model, which improves texture awareness and efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures, plus a Multi-Directional Perception Block that enlarges multi-directional receptive fields at low computational overhead.

Link: https://arxiv.org/abs/2501.16583
Authors: Long Peng, Xin Di, Zhanfeng Feng, Wenbo Li, Renjing Pei, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Huawei Noah's Ark Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.

[CV-52] Efficient Object Detection of Marine Debris using Pruned YOLO Model

Quick Read: This paper addresses the severe harm marine debris causes to marine life, including habitat destruction and organism poisoning from microplastics, polychlorinated biphenyls, and pesticides. The key is to develop efficient autonomous underwater vehicles (AUVs) for sea garbage collection and to improve efficiency with a real-time detection architecture. The study uses a YOLOv4 model for real-time marine debris detection and compares modifications such as pretrained models, training from scratch, mosaic augmentation, layer freezing, YOLOv4-tiny, and channel pruning to optimize the detection architecture. Channel pruning markedly improves detection speed, raising the base YOLOv4 frame rate from 15.19 FPS to 19.4 FPS with only a 1.2% drop in mean average precision, from 97.6% to 96.4%.

Link: https://arxiv.org/abs/2501.16571
Authors: Abi Aryaza, Novanto Yudistira, Tibyani
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Marine debris poses significant harm to marine life due to substances like microplastics, polychlorinated biphenyls, and pesticides, which damage habitats and poison organisms. Human-based solutions, such as diving, are increasingly ineffective in addressing this issue. Autonomous underwater vehicles (AUVs) are being developed for efficient sea garbage collection, with the choice of object detection architecture being critical. This research employs the YOLOv4 model for real-time detection of marine debris using the Trash-ICRA 19 dataset, consisting of 7683 images at 480x320 pixels. Various modifications (pretrained models, training from scratch, mosaic augmentation, layer freezing, YOLOv4-tiny, and channel pruning) are compared to enhance architecture efficiency. Channel pruning significantly improves detection speed, increasing the base YOLOv4 frame rate from 15.19 FPS to 19.4 FPS, with only a 1.2% drop in mean Average Precision, from 97.6% to 96.4%.
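
Channel pruning itself can follow several recipes; one common one (network slimming via BatchNorm scale magnitudes) is sketched below for illustration. The paper may rank channels by a different criterion.

```python
import torch
import torch.nn as nn

def bn_channel_masks(model, keep_ratio=0.7):
    """Rank channels by the magnitude of their BatchNorm scale (gamma) and
    keep the top fraction globally. Returns a boolean keep-mask per BN layer;
    physically removing channels then requires rebuilding adjacent convs."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    thresh = torch.quantile(gammas, 1.0 - keep_ratio)
    return {name: m.weight.detach().abs() >= thresh
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

# toy usage on a stand-in backbone (a real YOLOv4 has many more BN layers)
toy = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
print(bn_channel_masks(toy))
```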

[CV-53] LoRA-X: Bridging Foundation Models with Training-Free Cross-Model Adaptation ICLR2025

Quick Read: This paper addresses the need to retrain Low-Rank Adaptation (LoRA) parameters when a large foundation model is deprecated and replaced, given that the original training data may be inaccessible and generating representative synthetic data can be impractical, which considerably complicates parameter-efficient fine-tuning. The key is a new adapter, Cross-Model Low-Rank Adaptation (LoRA-X), which enables training-free transfer of LoRA parameters between source and target models without access to original or synthetic data. It does so by constraining the adapter to operate within the source base model's subspace and applying the adapter only in those target-model layers that exhibit an acceptable level of subspace similarity.

Link: https://arxiv.org/abs/2501.16559
Authors: Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025

Abstract:The rising popularity of large foundation models has led to a heightened demand for parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), which offer performance comparable to full model fine-tuning while requiring only a few additional parameters tailored to the specific base model. When such base models are deprecated and replaced, all associated LoRA modules must be retrained, requiring access to either the original training data or a substantial amount of synthetic data that mirrors the original distribution. However, the original data is often inaccessible due to privacy or licensing issues, and generating synthetic data may be impractical and insufficiently representative. These factors complicate the fine-tuning process considerably. To address this challenge, we introduce a new adapter, Cross-Model Low-Rank Adaptation (LoRA-X), which enables the training-free transfer of LoRA parameters across source and target models, eliminating the need for original or synthetic training data. Our approach imposes the adapter to operate within the subspace of the source base model. This constraint is necessary because our prior knowledge of the target model is limited to its weights, and the criteria for ensuring the adapter’s transferability are restricted to the target base model’s weights and subspace. To facilitate the transfer of LoRA parameters of the source model to a target model, we employ the adapter only in the layers of the target model that exhibit an acceptable level of subspace similarity. Our extensive experiments demonstrate the effectiveness of LoRA-X for text-to-image generation, including Stable Diffusion v1.5 and Stable Diffusion XL.
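
A hedged sketch of the subspace-constrained transfer idea: project the source adapter update into the target layer's dominant weight subspace and skip layers whose subspaces are too dissimilar. The SVD-based subspace, the overlap score, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def top_subspace(W, k):
    """Top-k left singular vectors of a weight matrix."""
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k]

def transfer_lora(delta_src, W_src, W_tgt, k=32, sim_thresh=0.5):
    """Training-free transfer sketch: re-express the source LoRA update
    delta_src in the target layer's subspace, but only when the source and
    target base weights span sufficiently similar subspaces."""
    U_s, U_t = top_subspace(W_src, k), top_subspace(W_tgt, k)
    overlap = (U_t.t() @ U_s).norm() / k ** 0.5   # roughly 1 for identical subspaces
    if overlap < sim_thresh:
        return None                               # subspaces too dissimilar: skip layer
    return U_t @ (U_t.t() @ delta_src)            # projection onto the target subspace

moved = transfer_lora(torch.randn(768, 768) * 0.01,
                      torch.randn(768, 768), torch.randn(768, 768))
```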

[CV-54] PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Quick Read: This paper addresses the limitation of unidirectional generation tasks (such as text-to-motion only) and explores bidirectional generation spanning both motion-to-text and text-to-motion. The key innovation is PackDiT, which uses mutual blocks to seamlessly integrate multimodal diffusion transformers (DiTs), enabling motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation within a single model. Trained on the HumanML3D dataset, PackDiT achieves state-of-the-art text-to-motion performance (FID of 0.106) along with strong results in motion prediction and related tasks.

Link: https://arxiv.org/abs/2501.16551
Authors: Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
Affiliations: University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.

[CV-55] PhysAnimator: Physics-Guided Generative Cartoon Animation

Quick Read: This paper addresses the labor-intensive, expertise-heavy nature of producing hand-drawn animation sequences. The key is PhysAnimator, which seamlessly integrates physics-based simulation with data-driven generative models to produce dynamic, visually compelling animation from static anime illustrations. It performs image-space deformable body simulation on extracted mesh geometries to capture anime's characteristic fluidity and exaggeration, enhances artistic control through customizable energy strokes and rigging point support (enabling tailored effects such as wind interactions), and finally extracts and warps sketches from the simulation sequence and uses a sketch-guided video diffusion model to synthesize high-quality animation frames.

Link: https://arxiv.org/abs/2501.16550
Authors: Tianyi Xie, Yiwei Zhao, Ying Jiang, Chenfanfu Jiang
Affiliations: Netflix; UCLA
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Creating hand-drawn animation sequences is labor-intensive and demands professional expertise. We introduce PhysAnimator, a novel approach for generating physically plausible meanwhile anime-stylized animation from static anime illustrations. Our method seamlessly integrates physics-based simulations with data-driven generative models to produce dynamic and visually compelling animations. To capture the fluidity and exaggeration characteristic of anime, we perform image-space deformable body simulations on extracted mesh geometries. We enhance artistic control by introducing customizable energy strokes and incorporating rigging point support, enabling the creation of tailored animation effects such as wind interactions. Finally, we extract and warp sketches from the simulation sequence, generating a texture-agnostic representation, and employ a sketch-guided video diffusion model to synthesize high-quality animation frames. The resulting animations exhibit temporal consistency and visual plausibility, demonstrating the effectiveness of our method in creating dynamic anime-style animations.

[CV-56] Multi-Objective Deep-Learning-based Biomechanical Deformable Image Registration with MOREA

Quick Read: This paper addresses the choice of deformable image registration (DIR) method for images with large deformations and content mismatch. Deep-learning (DL) approaches such as prompting-free instant prediction can output a transformation in real time but may lack realism on hard problems, while biomechanical, finite-element-modeling (FEM) approaches find more realistic transformations at the cost of much longer runtimes. The key is DL-MOREA, the first hybrid of the two: it combines the multi-objective DL-based DIR method DL-MODIR with MOREA, an evolutionary-algorithm-based multi-objective DIR method with an FEM-like biomechanical mesh transformation model, by smartly initializing MOREA with the DL result so its mesh transformation model is optimized more efficiently. Experiments show DL-MOREA finds high-quality transformations within 5 minutes, whereas MOREA alone needs a median of 45 minutes, and its transformations exhibit far less folding than DL-MODIR's while improving or preserving the bladder contour distance error.

Link: https://arxiv.org/abs/2501.16525
Authors: Georgios Andreadis, Eduard Ruiz Munné, Thomas H. W. Bäck, Peter A. N. Bosman, Tanja Alderliesten
Affiliations: Dept. of Radiation Oncology, Leiden University Medical Center (LUMC), The Netherlands; Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands; Evolutionary Intelligence Group, Centrum Wiskunde & Informatica (CWI), The Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Pre-print for the SPIE Medical Imaging: Image Processing Conference

Abstract:When choosing a deformable image registration (DIR) approach for images with large deformations and content mismatch, the realism of found transformations often needs to be traded off against the required runtime. DIR approaches using deep learning (DL) techniques have shown remarkable promise in instantly predicting a transformation. However, on difficult registration problems, the realism of these transformations can fall short. DIR approaches using biomechanical, finite element modeling (FEM) techniques can find more realistic transformations, but tend to require much longer runtimes. This work proposes the first hybrid approach to combine them, with the aim of getting the best of both worlds. This hybrid approach, called DL-MOREA, combines a recently introduced multi-objective DL-based DIR approach which leverages the VoxelMorph framework, called DL-MODIR, with MOREA, an evolutionary algorithm-based, multi-objective DIR approach in which a FEM-like biomechanical mesh transformation model is used. In our proposed hybrid approach, the DL results are used to smartly initialize MOREA, with the aim of more efficiently optimizing its mesh transformation model. We empirically compare DL-MOREA against its components, DL-MODIR and MOREA, on CT scan pairs capturing large bladder filling differences of 15 cervical cancer patients. While MOREA requires a median runtime of 45 minutes, DL-MOREA can already find high-quality transformations after 5 minutes. Compared to the DL-MODIR transformations, the transformations found by DL-MOREA exhibit far less folding and improve or preserve the bladder contour distance error.

[CV-57] Generating customized prompts for Zero-Shot Rare Event Medical Image Classification using LLM

Quick Read: This paper addresses rare event detection, where scarce data makes it hard for deep learning to estimate the distribution; in medicine, rare events are especially challenging because of low inter-class and high intra-class variability. The key is to use domain-specific expert knowledge about rare events to generate customized, contextually relevant prompts, which large language models then use for image classification. The resulting zero-shot, privacy-preserving method improves rare event classification without additional training and outperforms state-of-the-art techniques.

Link: https://arxiv.org/abs/2501.16481
Authors: Payal Kamboj, Ayan Banerjee, Bin Xu, Sandeep Gupta
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE ISBI, 2025

Abstract:Rare events, due to their infrequent occurrences, do not have much data, and hence deep learning techniques fail in estimating the distribution for such data. Open-vocabulary models represent an innovative approach to image classification. Unlike traditional models, these models classify images into any set of categories specified with natural language prompts during inference. These prompts usually comprise manually crafted templates (e.g., 'a photo of a ') that are filled in with the names of each category. This paper introduces a simple yet effective method for generating highly accurate and contextually descriptive prompts containing discriminative characteristics. Rare event detection, especially in medicine, is more challenging due to low inter-class and high intra-class variability. To address these, we propose a novel approach that uses domain-specific expert knowledge on rare events to generate customized and contextually relevant prompts, which are then used by large language models for image classification. Our zero-shot, privacy-preserving method enhances rare event classification without additional training, outperforming state-of-the-art techniques.
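
The inference side of such a pipeline is essentially CLIP-style zero-shot classification with richer prompts. Below is a minimal sketch using Hugging Face's CLIP; the descriptor wording is invented here, whereas in the paper it would come from an LLM prompted with expert knowledge.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in descriptors (hypothetical wording, not the paper's actual prompts)
prompts = [
    "a medical scan showing the subtle, localized markers of the rare event",
    "a medical scan with a normal, healthy appearance and no abnormal markers",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a real input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)  # zero-shot probabilities over the two descriptors
```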

[CV-58] Object Detection for Medical Image Analysis: Insights from the RT-DETR Model

Quick Read: This paper addresses fine-grained pattern recognition and object detection in complex image data, particularly for diabetic retinopathy detection. The key is a novel detection framework based on the RT-DETR model, whose Transformer architecture offers enhanced robustness and accuracy on high-dimensional, complex visual data. Comparative evaluations against YOLOv5, YOLOv8, SSD, and DETR show that RT-DETR leads on precision, recall, mAP50, and mAP50-95, with particular advantages in detecting small-scale objects and densely packed targets.

Link: https://arxiv.org/abs/2501.16469
Authors: Weijie He, Yuwei Zhang, Ting Xu, Tai An, Yingbin Liang, Bo Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning has emerged as a transformative approach for solving complex pattern recognition and object detection challenges. This paper focuses on the application of a novel detection framework based on the RT-DETR model for analyzing intricate image data, particularly in areas such as diabetic retinopathy detection. Diabetic retinopathy, a leading cause of vision loss globally, requires accurate and efficient image analysis to identify early-stage lesions. The proposed RT-DETR model, built on a Transformer-based architecture, excels at processing high-dimensional and complex visual data with enhanced robustness and accuracy. Comparative evaluations with models such as YOLOv5, YOLOv8, SSD, and DETR demonstrate that RT-DETR achieves superior performance across precision, recall, mAP50, and mAP50-95 metrics, particularly in detecting small-scale objects and densely packed targets. This study underscores the potential of Transformer-based models like RT-DETR for advancing object detection tasks, offering promising applications in medical imaging and beyond.

[CV-59] Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation

Quick Read: This paper addresses the limited generalization of traditional semantic segmentation methods to diverse scenes and unseen object categories. The key is LangSeg, a novel LLM-guided semantic segmentation method that uses context-sensitive, fine-grained subclass descriptors generated by large language models (LLMs) and combines them with a pre-trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining.

Link: https://arxiv.org/abs/2501.16467
Authors: Philip Hughes, Larry Burns, Luke Adams
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation plays a crucial role in enabling machines to understand and interpret visual scenes at a pixel level. While traditional segmentation methods have achieved remarkable success, their generalization to diverse scenes and unseen object categories remains limited. Recent advancements in large language models (LLMs) offer a promising avenue for bridging visual and textual modalities, providing a deeper understanding of semantic relationships. In this paper, we propose LangSeg, a novel LLM-guided semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors generated by LLMs. Our framework integrates these descriptors with a pre-trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining. We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms state-of-the-art models, achieving up to a 6.1% improvement in mean Intersection over Union (mIoU). Additionally, we conduct a comprehensive ablation study and human evaluation to validate the effectiveness of our method in real-world scenarios. The results demonstrate that LangSeg not only excels in semantic understanding and contextual alignment but also provides a flexible and efficient framework for language-guided segmentation tasks. This approach opens up new possibilities for interactive and domain-specific segmentation applications.

[CV-60] BiFold: Bimanual Cloth Folding with Language Guidance ICRA2025

Quick Read: This paper addresses the complexity of cloth folding, including the inevitable self-occlusion of clothes, their complicated dynamics, and the diverse materials, geometries, and textures garments can have. The key is to learn folding actions conditioned on text commands by leveraging a pre-trained vision-language model repurposed to predict manipulation actions. The resulting model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. Given the lack of annotated bimanual folding data, the authors also devise a procedure to automatically parse actions in a simulated dataset and tag them with aligned text instructions. BiFold transfers well to new instructions, garments, and environments.

Link: https://arxiv.org/abs/2501.16458
Authors: Oriol Barbany, Adrià Colomé, Carme Torras
Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICRA 2025

Abstract:Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. Given the lack of annotated bimanual folding data, we devise a procedure to automatically parse actions of a simulated dataset and tag them with aligned text instructions. BiFold attains the best performance on our dataset and can transfer to new instructions, garments, and environments.

[CV-61] Objects matter: object-centric world models improve reinforcement learning in visually complex environments

Quick Read: This paper addresses the low sample efficiency of deep reinforcement learning in visually complex, pixel-based environments. Traditional model-based RL (MBRL) in such settings relies on pixel-level autoencoding with an L2 loss, which is dominated by large areas and often misses the small or dynamic elements critical for decision-making. The key is an object-centric MBRL pipeline, OC-STORM, which incorporates recent computer vision advances so the agent can focus on key decision-relevant elements: annotating key reward- and goal-related objects with segmentation masks, extracting object features with a frozen pre-trained foundation vision model, combining these features with raw observations to predict environment dynamics, and training the policy on imagined trajectories generated by this object-centric world model.

Link: https://arxiv.org/abs/2501.16443
Authors: Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey
Affiliations: University of Edinburgh
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep reinforcement learning has achieved remarkable success in learning control policies from pixels across a wide range of tasks, yet its application remains hindered by low sample efficiency, requiring significantly more environment interactions than humans to reach comparable performance. Model-based reinforcement learning (MBRL) offers a solution by leveraging learnt world models to generate simulated experience, thereby improving sample efficiency. However, in visually complex environments, small or dynamic elements can be critical for decision-making. Yet, traditional MBRL methods in pixel-based environments typically rely on auto-encoding with an L_2 loss, which is dominated by large areas and often fails to capture decision-relevant details. To address these limitations, we propose an object-centric MBRL pipeline, which integrates recent advances in computer vision to allow agents to focus on key decision-related elements. Our approach consists of four main steps: (1) annotating key objects related to rewards and goals with segmentation masks, (2) extracting object features using a pre-trained, frozen foundation vision model, (3) incorporating these object features with the raw observations to predict environmental dynamics, and (4) training the policy using imagined trajectories generated by this object-centric world model. Building on the efficient MBRL algorithm STORM, we call this pipeline OC-STORM. We demonstrate OC-STORM’s practical value in overcoming the limitations of conventional MBRL approaches on both Atari games and the visually complex game Hollow Knight.

[CV-62] DynAlign: Unsupervised Dynamic Taxonomy Alignment for Cross-Domain Segmentation

Quick Read: This paper addresses the assumption in current unsupervised domain adaptation (UDA) methods for semantic segmentation that source and target domains share identical labels, which ignores the label-level domain gap common in real-world scenarios and limits the ability to recognize finer-grained or novel categories. The key is DynAlign, a framework that integrates UDA with foundation models to bridge both the image-level and label-level domain gaps. It leverages prior semantic knowledge to align source categories with target categories that may be novel, finer-grained, or differently named, then employs foundation models for precise segmentation and category reassignment, with a knowledge fusion approach that adapts dynamically to varying scene contexts to further improve accuracy. DynAlign produces accurate predictions in a new target label space without any manual annotation, allowing seamless adaptation to new taxonomies.

链接: https://arxiv.org/abs/2501.16410
作者: Han Sun,Rui Gong,Ismail Nejjar,Olga Fink
机构: EPFL(洛桑联邦理工学院); Amazon(亚马逊)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current unsupervised domain adaptation (UDA) methods for semantic segmentation typically assume identical class labels between the source and target domains. This assumption ignores the label-level domain gap, which is common in real-world scenarios, thus limiting their ability to identify finer-grained or novel categories without requiring extensive manual annotation. A promising direction to address this limitation lies in recent advancements in foundation models, which exhibit strong generalization abilities due to their rich prior knowledge. However, these models often struggle with domain-specific nuances and underrepresented fine-grained categories. To address these challenges, we introduce DynAlign, a framework that integrates UDA with foundation models to bridge both the image-level and label-level domain gaps. Our approach leverages prior semantic knowledge to align source categories with target categories that can be novel, more fine-grained, or named differently (e.g., vehicle to car, truck, bus). Foundation models are then employed for precise segmentation and category reassignment. To further enhance accuracy, we propose a knowledge fusion approach that dynamically adapts to varying scene contexts. DynAlign generates accurate predictions in a new target label space without requiring any manual annotations, allowing seamless adaptation to new taxonomies through either model retraining or direct inference. Experiments on the street scene semantic segmentation benchmarks GTA to Mapillary Vistas and GTA to IDD validate the effectiveness of our approach, achieving a significant improvement over existing methods. Our code will be publicly available.
zh
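
为说明"源类别 → 目标细粒度类别"的标签空间对齐思路,下面给出一段示意代码(映射表、函数名与判定逻辑均为假设,并非 DynAlign 官方实现):对 UDA 模型预测出的粗类别像素,在基础模型给出的候选子类分数中逐像素取最大者完成重分配。

```python
import numpy as np

# 假设的粗粒度 -> 细粒度映射,对应摘要中 vehicle -> car/truck/bus 的例子
TAXONOMY_MAP = {"vehicle": ["car", "truck", "bus"]}

def refine_labels(coarse_pred, fine_scores, taxonomy=TAXONOMY_MAP):
    """coarse_pred: (H, W) 粗类别名数组;fine_scores: {细类别名: (H, W) 分数图}。"""
    refined = coarse_pred.copy()
    for coarse, fine_list in taxonomy.items():
        mask = coarse_pred == coarse
        if not mask.any():
            continue
        stacked = np.stack([fine_scores[f] for f in fine_list])   # (K, H, W)
        best = np.argmax(stacked, axis=0)                          # 每像素最优子类索引
        refined[mask] = np.array(fine_list, dtype=object)[best][mask]
    return refined

coarse = np.full((4, 4), "vehicle", dtype=object)
scores = {f: np.random.rand(4, 4) for f in ["car", "truck", "bus"]}
print(refine_labels(coarse, scores))
```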

[CV-63] Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer

【速读】:该论文旨在解决仿真到现实(Simulation-to-Reality, Sim2Real)的分布偏移问题,即通过仿真环境训练的策略在真实环境中应用时所遇到的挑战。论文的关键解决方案在于使用大规模预训练的视觉编码器来提取适用于机器人控制的特征,并保持对任务无关环境变化的不变性。通过线性探测评估编码器的特征提取能力,并利用仿真与真实世界嵌入质心之间的距离来衡量其领域不变性。研究发现,针对特定操作任务进行预训练的编码器通常比通用数据集训练的编码器更能有效弥合Sim2Real差距。

链接: https://arxiv.org/abs/2501.16389
作者: Samuel Biruduganti,Yash Yardi,Lars Ankile
机构: Illinois Mathematics and Science Academy (伊利诺伊数学与科学学院); Massachusetts Institute of Technology (麻省理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 10 figures, view GitHub for all appendix figures from the study

点击查看摘要

Abstract:Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or “Sim2Real” distribution shift – introduced by employing simulation-trained policies in real-world environments – frequently prevents successful policy transfer. This study explores the potential of using large-scale pre-training of vision encoders to address the Sim2Real gap. We examine a diverse collection of encoders, evaluating their ability to (1) extract features necessary for robot control while (2) remaining invariant to task-irrelevant environmental variations. We quantitatively measure the encoder’s feature extraction capabilities through linear probing and its domain invariance by computing distances between simulation and real-world embedding centroids. Additional qualitative insights are provided through t-SNE plots and GradCAM saliency maps. Findings suggest that encoders pre-trained on manipulation-specific datasets generally outperform those trained on generic datasets in bridging the Sim2Real gap. this https URL
zh
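
摘要中"仿真与真实嵌入质心距离"的度量可以用几行代码说明(仅为示意,数据为随机占位;距离越小,说明编码器对域差异越不敏感):

```python
import numpy as np

def centroid_gap(sim_emb, real_emb):
    """域不变性度量:仿真嵌入质心与真实嵌入质心的欧氏距离。"""
    return float(np.linalg.norm(sim_emb.mean(axis=0) - real_emb.mean(axis=0)))

rng = np.random.default_rng(0)
sim = rng.normal(0.0, 1.0, size=(500, 128))   # 假设:500 条仿真图像的编码器输出
real = rng.normal(0.2, 1.0, size=(500, 128))  # 假设:真实图像嵌入整体偏移 0.2
print(f"centroid gap = {centroid_gap(sim, real):.3f}")
```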

[CV-64] A Hybrid Deep Learning CNN Model for Enhanced COVID-19 Detection from Computed Tomography (CT) Scan Images

【速读】:该论文旨在解决COVID-19早期检测问题,以实现更有效的治疗和控制病毒传播。关键在于提出了一种新颖的混合深度学习模型,该模型结合了VGG16、DenseNet121和MobileNetV2的优势来提取特征,并通过主成分分析(Principal Component Analysis, PCA)进行降维,随后使用支持向量分类器(Support Vector Classifier, SVC)进行分类。这一混合模型在包含2,108个训练图像和373个测试图像的数据集上达到了98.93%的准确率,优于单一预训练卷积神经网络(CNN)模型。

链接: https://arxiv.org/abs/2501.17160
作者: Suresh Babu Nettur,Shanthi Karpurapu,Unnati Nettur,Likhit Sagar Gajja,Sravanthy Myneni,Akhil Dusi,Lalithya Posham
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Corresponding authors: Shanthi Karpurapu ( this http URL @gmail.com), Suresh Babu Nettur (nettursuresh@gmail.com) Shanthi Karpurapu and Suresh Babu Nettur are co-first authors

点击查看摘要

Abstract:Early detection of COVID-19 is crucial for effective treatment and controlling its spread. This study proposes a novel hybrid deep learning model for detecting COVID-19 from CT scan images, designed to assist overburdened medical professionals. Our proposed model leverages the strengths of VGG16, DenseNet121, and MobileNetV2 to extract features, followed by Principal Component Analysis (PCA) for dimensionality reduction, after which the features are stacked and classified using a Support Vector Classifier (SVC). We conducted comparative analysis between the proposed hybrid model and individual pre-trained CNN models, using a dataset of 2,108 training images and 373 test images comprising both COVID-positive and non-COVID images. Our proposed hybrid model achieved an accuracy of 98.93%, outperforming the individual models in terms of precision, recall, F1 scores, and ROC curve performance.
zh
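
其"多骨干特征 + PCA 降维 + SVC 分类"的流程可以用 scikit-learn 勾勒如下(仅为按摘要归纳的示意:特征用随机数占位,实际应分别来自 VGG16/DenseNet121/MobileNetV2,主成分数等超参数为假设):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

n_train, n_test = 2108, 373
rng = np.random.default_rng(42)
feats = [rng.normal(size=(n_train + n_test, d)) for d in (512, 1024, 1280)]  # 三路骨干特征占位
y = rng.integers(0, 2, size=n_train + n_test)   # 0 = 非COVID, 1 = COVID(占位标签)

# 每路特征各自 PCA 降维后再堆叠,与摘要描述的顺序一致
pcas = [PCA(n_components=64).fit(f[:n_train]) for f in feats]
X = np.concatenate([p.transform(f) for p, f in zip(pcas, feats)], axis=1)

clf = SVC(kernel="rbf").fit(X[:n_train], y[:n_train])
print("test accuracy:", clf.score(X[n_train:], y[n_train:]))  # 随机占位特征下约为 0.5
```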

[CV-65] Ultra-high resolution multimodal MRI dense labelled holistic brain atlas

【速读】:该论文旨在构建一个全面、多模态且高分辨率的人类大脑图谱(holiAtlas),以覆盖从器官到亚结构不同层次的人脑解剖细节。解决方案的关键在于采用一种新的密集标记协议,该协议通过融合多种不同尺度的局部协议生成,并利用来自人类连接组项目数据库的75名健康受试者的MR图像及分割数据。这些图像包括T1、T2和白质抑制(White Matter nulled, WMn)对比度,在0.125 mm³分辨率下进行非线性配准和对称群组归一化处理,从而构建出这一图谱。在最精细的层面,holiAtlas协议包含350个不同的标签,这些标签源自10种不同的描绘协议,并按不同尺度分组,提供了一致且连贯的大脑整体视图。该多尺度多模态图谱可用于开发新型超高分辨率分割方法,以期促进神经疾病的早期检测。

链接: https://arxiv.org/abs/2501.16879
作者: José V. Manjón,Sergio Morell-Ortega,Marina Ruiz-Perez,Boris Mansencal,Edern Le Bot,Marien Gadea,Enrique Lanuza,Gwenaelle Catheline,Thomas Tourdias,Vincent Planche,Rémi Giraud,Denis Rivière,Jean-François Mangin,Nicole Labra-Avila,Roberto Vivo-Hernando,Gregorio Rubio,Fernando Aparici,Maria de la Iglesia-Vaya,Pierrick Coupé
机构: Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València (瓦伦西亚理工大学); CNRS (法国国家科学研究中心), Univ. Bordeaux (波尔多大学), Bordeaux INP (波尔多国立高等工程学院), LABRI (波尔多计算机科学实验室), UMR5800 (联合研究单位5800), in2brain (in2brain); Department of Psychobiology (心理生物学系), Faculty of Psychology (心理学系), Universitat de Valencia (瓦伦西亚大学); University Valencia, Department of Cell Biology (细胞生物学系), Burjassot, 46100, Valencia, Spain; University Bordeaux (波尔多大学), CNRS (法国国家科学研究中心), EPHE (高等研究实践学院), PSL (巴黎文理研究大学), INCIA (阿基坦认知与整合神经科学研究所), UMR 5283 (联合研究单位5283), F-33000, Bordeaux, France; Service de Neuroimagerie diagnostique et thérapeutique (神经影像诊断与治疗服务), CHU de Bordeaux (波尔多大学医院), F-33000 Bordeaux, France; Institut des Maladies Neurodégénératives (神经退行性疾病研究所), Univ. Bordeaux (波尔多大学), CNRS (法国国家科学研究中心), UMR 5293 (联合研究单位5293), F-33000 Bordeaux, France; Remi Giraud, Univ. Bordeaux (波尔多大学), CNRS (法国国家科学研究中心), Bordeaux INP (波尔多国立高等工程学院), IMS (微系统与信号研究所), UMR 5218 (联合研究单位5218), F-33400 Talence, France; Denis Rivière, NeuroSpin (NeuroSpin), BAOBAB lab (BAOBAB实验室), CEA Saclay (法国原子能和替代能源委员会萨克莱中心), Gif-sur-Yvette, France; Jean-Francois Mangin, NeuroSpin (NeuroSpin), BAOBAB lab (BAOBAB实验室), CEA Saclay (法国原子能和替代能源委员会萨克莱中心), Gif-sur-Yvette, France; Nicole Labra-Avila, NeuroSpin (NeuroSpin), BAOBAB lab (BAOBAB实验室), CEA Saclay (法国原子能和替代能源委员会萨克莱中心), Gif-sur-Yvette, France; Roberto Vivo-Hernando, Instituto de Automática e Informática Industrial (自动化与信息技术研究所), Universitat Politècnica de València (瓦伦西亚理工大学), Camino de Vera s/n, 46022, Valencia, Spain; Gregorio Rubio, Departamento de matemática aplicada (应用数学系), Universitat Politècnica de València (瓦伦西亚理工大学), Camino de Vera s/n, 46022 Valencia, Spain; Fernando Aparici, Área de Imagen Médica (医学影像区). Hospital Universitario y Politécnico La Fe (拉菲大学暨理工医院). Valencia, Spain; Maria de la Iglesia-Vaya, Unidad Mixta de Imagen Biomédica FISABIO-CIPF (生物医学影像混合单元FISABIO-CIPF). Fundación para el Fomento de la Investigación Sanitario y Biomédica de la Comunidad Valenciana (促进健康和生物医学研究基金会) - Valencia, Spain. CIBERSAM (精神健康生物医学研究网络中心), ISC III (卡洛斯三世健康研究所). Av. Blasco Ibáñez 15, 46010 - València, Spain; Pierrick Coupé, CNRS (法国国家科学研究中心), Univ. Bordeaux (波尔多大学), Bordeaux INP (波尔多国立高等工程学院), LABRI (波尔多计算机科学实验室), UMR5800 (联合研究单位5800), in2brain (in2brain)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages

点击查看摘要

Abstract:In this paper, we introduce holiAtlas, a holistic, multimodal and high-resolution human brain atlas. This atlas covers different levels of details of the human brain anatomy, from the organ to the substructure level, using a new dense labelled protocol generated from the fusion of multiple local protocols at different scales. This atlas has been constructed averaging images and segmentations of 75 healthy subjects from the Human Connectome Project database. Specifically, MR images of T1, T2 and WMn (White Matter nulled) contrasts at 0.125 mm^3 resolution that were nonlinearly registered and averaged using symmetric group-wise normalisation to construct the atlas. At the finest level, the holiAtlas protocol has 350 different labels derived from 10 different delineation protocols. These labels were grouped at different scales to provide a holistic view of the brain at different levels in a coherent and consistent manner. This multiscale and multimodal atlas can be used for the development of new ultra-high resolution segmentation methods that can potentially leverage the early detection of neurological disorders.
zh

[CV-66] Efficient Knowledge Distillation of SAM for Medical Image Segmentation

【速读】:该论文旨在解决Segment Anything Model (SAM)在实时或资源受限环境中因显著计算需求而难以部署的问题。解决方案的关键在于提出了一种名为KD SAM的知识蒸馏方法,通过结合均方误差(Mean Squared Error, MSE)和感知损失(Perceptual Loss)的双损失框架,实现编码器和解码器的优化,从而在保持高分割精度的同时降低计算复杂度。

链接: https://arxiv.org/abs/2501.16740
作者: Kunal Dasharath Patil,Gowthamaan Palani,Ganapathy Krishnamurthi
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:The Segment Anything Model (SAM) has set a new standard in interactive image segmentation, offering robust performance across various tasks. However, its significant computational requirements limit its deployment in real-time or resource-constrained environments. To address these challenges, we propose a novel knowledge distillation approach, KD SAM, which incorporates both encoder and decoder optimization through a combination of Mean Squared Error (MSE) and Perceptual Loss. This dual-loss framework captures structural and semantic features, enabling the student model to maintain high segmentation accuracy while reducing computational complexity. Based on the model evaluation on datasets, including Kvasir-SEG, ISIC 2017, Fetal Head Ultrasound, and Breast Ultrasound, we demonstrate that KD SAM achieves comparable or superior performance to the baseline models, with significantly fewer parameters. KD SAM effectively balances segmentation accuracy and computational efficiency, making it well-suited for real-time medical image segmentation applications in resource-constrained environments.
zh
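
其中"MSE + 感知损失"的双损失可以用 PyTorch 简单勾勒(仅为示意:真实实现中 feature_net 应为预训练且冻结的特征提取器,此处用随机初始化的小卷积网占位,权重系数为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillLoss(nn.Module):
    """示意:逐元素 MSE 对齐 + 特征空间的感知损失。"""
    def __init__(self, w_mse=1.0, w_perc=0.1):
        super().__init__()
        self.feature_net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 16, 3, padding=1))
        for p in self.feature_net.parameters():
            p.requires_grad_(False)   # 感知特征网保持冻结
        self.w_mse, self.w_perc = w_mse, w_perc

    def forward(self, student_out, teacher_out):
        mse = F.mse_loss(student_out, teacher_out)
        perc = F.mse_loss(self.feature_net(student_out), self.feature_net(teacher_out))
        return self.w_mse * mse + self.w_perc * perc

loss_fn = DistillLoss()
s = torch.randn(2, 1, 64, 64, requires_grad=True)   # 学生模型输出(占位)
t = torch.randn(2, 1, 64, 64)                       # 教师 SAM 输出(占位)
loss = loss_fn(s, t)
loss.backward()
print(loss.item())
```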

人工智能

[AI-0] Graph Transformers for inverse physics: reconstructing flows around arbitrary 2D airfoils

链接: https://arxiv.org/abs/2501.17081
作者: Gregory Duthé,Imad Abdallah,Eleni Chatzi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:We introduce a Graph Transformer framework that serves as a general inverse physics engine on meshes, demonstrated through the challenging task of reconstructing aerodynamic flow fields from sparse surface measurements. While deep learning has shown promising results in forward physics simulation, inverse problems remain particularly challenging due to their ill-posed nature and the difficulty of propagating information from limited boundary observations. Our approach addresses these challenges by combining the geometric expressiveness of message-passing neural networks with the global reasoning of Transformers, enabling efficient learning of inverse mappings from boundary conditions to complete states. We evaluate this framework on a comprehensive dataset of steady-state RANS simulations around diverse airfoil geometries, where the task is to reconstruct full pressure and velocity fields from surface pressure measurements alone. The architecture achieves high reconstruction accuracy while maintaining fast inference times. We conduct experiments and provide insights into the relative importance of local geometric processing and global attention mechanisms in mesh-based inverse problems. We also find that the framework is robust to reduced sensor coverage. These results suggest that Graph Transformers can serve as effective inverse physics engines across a broader range of applications where complete system states must be reconstructed from limited boundary observations.

[AI-1] Learning Mean Field Control on Sparse Graphs

链接: https://arxiv.org/abs/2501.17079
作者: Christian Fabian,Kai Cui,Heinz Koeppl
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi-agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms which apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms for various examples on synthetic and real world networks with mean field algorithms based on L^p graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks due to the special design aiming at an important, but so far hard to solve class of MARL problems.

[AI-2] Induced Modularity and Community Detection for Functionally Interpretable Reinforcement Learning

链接: https://arxiv.org/abs/2501.17077
作者: Anna Soligo,Pietro Ferraro,David Boyle
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interpretability in reinforcement learning is crucial for ensuring AI systems align with human values and fulfill the diverse related requirements including safety, robustness and fairness. Building on recent approaches to encouraging sparsity and locality in neural networks, we demonstrate how the penalisation of non-local weights leads to the emergence of functionally independent modules in the policy network of a reinforcement learning agent. To illustrate this, we demonstrate the emergence of two parallel modules for assessment of movement along the X and Y axes in a stochastic Minigrid environment. Through the novel application of community detection algorithms, we show how these modules can be automatically identified and their functional roles verified through direct intervention on the network weights prior to inference. This establishes a scalable framework for reinforcement learning interpretability through functional modularity, addressing challenges regarding the trade-off between completeness and cognitive tractability of reinforcement learning explanations.
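
摘要中"惩罚非局部权重"的思路可以用如下小段代码说明(仅为示意:给每个神经元指定一维坐标并按距离加权 L1 惩罚,坐标布局与系数均为假设,并非论文原始实现):

```python
import torch
import torch.nn as nn

def nonlocal_weight_penalty(layer: nn.Linear) -> torch.Tensor:
    """距离越远的连接惩罚越重,促使功能模块在局部涌现。"""
    out_pos = torch.linspace(0, 1, layer.out_features).unsqueeze(1)  # 输出神经元坐标
    in_pos = torch.linspace(0, 1, layer.in_features).unsqueeze(0)    # 输入神经元坐标
    dist = (out_pos - in_pos).abs()                                  # (out, in) 距离矩阵
    return (layer.weight.abs() * dist).sum()

layer = nn.Linear(16, 16)
reg = 1e-3 * nonlocal_weight_penalty(layer)   # 训练时加到任务损失上即可
print(reg.item())
```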

[AI-3] Standardised schema and taxonomy for AI incident databases in critical digital infrastructure

链接: https://arxiv.org/abs/2501.17037
作者: Avinash Agarwal,Manisha J. Nene
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 3 tables. Accepted at the 2024 IEEE Pune Section International Conference (PuneCon)

点击查看摘要

Abstract:The rapid deployment of Artificial Intelligence (AI) in critical digital infrastructure introduces significant risks, necessitating a robust framework for systematically collecting AI incident data to prevent future incidents. Existing databases lack the granularity as well as the standardized structure required for consistent data collection and analysis, impeding effective incident management. This work proposes a standardized schema and taxonomy for AI incident databases, addressing these challenges by enabling detailed and structured documentation of AI incidents across sectors. Key contributions include developing a unified schema, introducing new fields such as incident severity, causes, and harms caused, and proposing a taxonomy for classifying AI incidents in critical digital infrastructure. The proposed solution facilitates more effective incident data collection and analysis, thus supporting evidence-based policymaking, enhancing industry safety measures, and promoting transparency. This work lays the foundation for a coordinated global response to AI incidents, ensuring trust, safety, and accountability in using AI across regions.
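
一个按摘要思路简化的事件记录 schema 可以写成如下形式(字段命名与取值均为假设,论文未给出具体代码,仅作说明):

```python
from dataclasses import dataclass, field, asdict
from enum import Enum

class Severity(str, Enum):   # 摘要提到的新增字段之一:事件严重程度
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

@dataclass
class AIIncident:
    """示意:面向关键数字基础设施的 AI 事件记录。"""
    incident_id: str
    sector: str                                   # 所属基础设施部门
    severity: Severity
    causes: list = field(default_factory=list)    # 事件成因
    harms: list = field(default_factory=list)     # 造成的危害

incident = AIIncident("AIID-2025-0001", "telecom", Severity.HIGH,
                      causes=["model drift"], harms=["service outage"])
print(asdict(incident))
```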

[AI-4] Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

链接: https://arxiv.org/abs/2501.17015
作者: Longzhong Lin,Xuewu Lin,Kechun Xu,Haojian Lu,Lichao Huang,Rong Xiong,Yue Wang
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods including continuous mixture models and GPT-like discrete models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the unified mixture model (UniMM) framework, we recognize critical configurations from both model and data perspectives. We conduct a systematic examination of various model configurations, including positive component matching, continuous regression, prediction horizon, and the number of components. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

[AI-5] Heterogeneity-aware Personalized Federated Learning via Adaptive Dual-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2501.16966
作者: Xi Chen,Qin Li,Haibin Cai,Ting Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) empowers multiple clients to collaboratively train machine learning models without sharing local data, making it highly applicable in heterogeneous Internet of Things (IoT) environments. However, intrinsic heterogeneity in clients’ model architectures and computing capabilities often results in model accuracy loss and the intractable straggler problem, which significantly impairs training effectiveness. To tackle these challenges, this paper proposes a novel Heterogeneity-aware Personalized Federated Learning method, named HAPFL, via multi-level Reinforcement Learning (RL) mechanisms. HAPFL optimizes the training process by incorporating three strategic components: 1) An RL-based heterogeneous model allocation mechanism. The parameter server employs a Proximal Policy Optimization (PPO)-based RL agent to adaptively allocate appropriately sized, differentiated models to clients based on their performance, effectively mitigating performance disparities. 2) An RL-based training intensity adjustment scheme. The parameter server leverages another PPO-based RL agent to dynamically fine-tune the training intensity for each client to further enhance training efficiency and reduce straggling latency. 3) A knowledge distillation-based mutual learning mechanism. Each client deploys both a heterogeneous local model and a homogeneous lightweight model named LiteModel, where these models undergo mutual learning through knowledge distillation. This uniform LiteModel plays a pivotal role in aggregating and sharing global knowledge, significantly enhancing the effectiveness of personalized local training. Experimental results across multiple benchmark datasets demonstrate that HAPFL not only achieves high accuracy but also substantially reduces the overall training time by 20.9%-40.4% and decreases straggling latency by 19.0%-48.0% compared to existing solutions.

[AI-6] Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers

链接: https://arxiv.org/abs/2501.16961
作者: Mohammad Raza,Natasa Milic-Frayling
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robustness of reasoning remains a significant challenge for large language models, and addressing it is essential for the practical applicability of AI-driven reasoning systems. We introduce Semantic Self-Verification (SSV), a novel approach that addresses the key challenge in combining language models with the rigor of logical solvers: to accurately formulate the reasoning problem from natural language to the formal language of the solver. SSV uses a consistency-based approach to produce strong abstract formalizations of problems using concrete instantiations that are generated by the model and verified by the solver. In addition to significantly advancing the overall reasoning accuracy over the state-of-the-art, a key novelty that this approach presents is a feature of verification that has near-perfect precision over a significant coverage of cases, as we demonstrate on open reasoning benchmarks. We propose such near-certain reasoning as a new approach to reduce the need for manual verification in many cases, taking us closer to more dependable and autonomous AI reasoning systems.

[AI-7] Exact Computation of Any-Order Shapley Interactions for Graph Neural Networks ICLR2025

链接: https://arxiv.org/abs/2501.16944
作者: Fabian Fumagalli,Maximilian Muschalik,Paolo Frazzetto,Janine Strotherm,Luca Hermes,Alessandro Sperduti,Eyke Hüllermeier,Barbara Hammer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint Version. Accepted at ICLR 2025

点击查看摘要

Abstract:Albeit the ubiquitous use of Graph Neural Networks (GNNs) in machine learning (ML) prediction tasks involving graph-structured data, their interpretability remains challenging. In explainable artificial intelligence (XAI), the Shapley Value (SV) is the predominant method to quantify contributions of individual features to a ML model’s output. Addressing the limitations of SVs in complex prediction models, Shapley Interactions (SIs) extend the SV to groups of features. In this work, we explain single graph predictions of GNNs with SIs that quantify node contributions and interactions among multiple nodes. By exploiting the GNN architecture, we show that the structure of interactions in node embeddings are preserved for graph prediction. As a result, the exponential complexity of SIs depends only on the receptive fields, i.e. the message-passing ranges determined by the connectivity of the graph and the number of convolutional layers. Based on our theoretical results, we introduce GraphSHAP-IQ, an efficient approach to compute any-order SIs exactly. GraphSHAP-IQ is applicable to popular message passing techniques in conjunction with a linear global pooling and output layer. We showcase that GraphSHAP-IQ substantially reduces the exponential complexity of computing exact SIs on multiple benchmark datasets. Beyond exact computation, we evaluate GraphSHAP-IQ’s approximation of SIs on popular GNN architectures and compare with existing baselines. Lastly, we visualize SIs of real-world water distribution networks and molecule structures using a SI-Graph.

[AI-8] Agential AI for Integrated Continual Learning, Deliberative Behavior, and Comprehensible Models

链接: https://arxiv.org/abs/2501.16922
作者: Zeki Doruk Erden,Boi Faltings
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Contemporary machine learning paradigm excels in statistical data analysis, solving problems that classical AI couldn’t. However, it faces key limitations, such as a lack of integration with planning, incomprehensible internal structure, and inability to learn continually. We present the initial design for an AI system, Agential AI (AAI), in principle operating independently or on top of statistical methods, designed to overcome these issues. AAI’s core is a learning method that models temporal dynamics with guarantees of completeness, minimality, and continual learning, using component-level variation and selection to learn the structure of the environment. It integrates this with a behavior algorithm that plans on a learned model and encapsulates high-level behavior patterns. Preliminary experiments on a simple environment show AAI’s effectiveness and potential.

[AI-9] RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains

链接: https://arxiv.org/abs/2501.16899
作者: Shady Nasrat,Myungsu Kim,Seonil Lee,Jiho Lee,Yeoncheol Jang,Seung-joon Yi
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. We showcase the capabilities of our framework within the context of the real-world household competition. This research introduces a framework that utilizes RDMM (Robotics Decision-Making Models), which possess the capacity for decision-making within domain-specific contexts, as well as an awareness of their personal knowledge and capabilities. The framework leverages information to enhance the autonomous decision-making of the system. In contrast to other approaches, our focus is on real-time, on-device solutions, successfully operating on hardware with as little as 8GB of memory. Our framework incorporates visual perception models equipping robots with understanding of their environment. Additionally, the framework has integrated real-time speech recognition capabilities, thus enhancing the human-robot interaction experience. Experimental results demonstrate that the RDMM framework can plan with an 93% accuracy. Furthermore, we introduce a new dataset consisting of 27k planning instances, as well as 1.3k text-image annotated samples derived from the competition. The framework, benchmarks, datasets, and models developed in this work are publicly available on our GitHub repository at this https URL.

[AI-10] DIRIGENt: End-To-End Robotic Imitation of Human Demonstrations Based on a Diffusion Model

链接: https://arxiv.org/abs/2501.16800
作者: Josua Spisak,Matthias Kerzel,Stefan Wermter
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:There has been substantial progress in humanoid robots, with new skills continuously being taught, ranging from navigation to manipulation. While these abilities may seem impressive, the teaching methods often remain inefficient. To enhance the process of teaching robots, we propose leveraging a mechanism effectively used by humans: teaching by demonstrating. In this paper, we introduce DIRIGENt (DIrect Robotic Imitation GENeration model), a novel end-to-end diffusion approach that directly generates joint values from observing human demonstrations, enabling a robot to imitate these actions without any existing mapping between it and humans. We create a dataset in which humans imitate a robot and then use this collected data to train a diffusion model that enables a robot to imitate humans. The following three aspects are the core of our contribution. First is our novel dataset with natural pairs between human and robot poses, allowing our approach to imitate humans accurately despite the gap between their anatomies. Second, the diffusion input to our model alleviates the challenge of redundant joint configurations, limiting the search space. And finally, our end-to-end architecture from perception to action leads to an improved learning capability. Through our experimental analysis, we show that combining these three aspects allows DIRIGENt to outperform existing state-of-the-art approaches in the field of generating joint values from RGB images.

[AI-11] LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience AAAI-2025

链接: https://arxiv.org/abs/2501.16744
作者: Nimesh Jha,Shuxin Lin,Srideepika Jayaraman,Kyle Frohling,Christodoulos Constantinides,Dhaval Patel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the AAAI-2025 Deployable AI Workshop

点击查看摘要

Abstract:This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service enables efficient anomaly detection in complex data streams, supporting proactive identification and resolution of issues. Furthermore, it presents an innovative approach to anomaly modeling in cloud infrastructure by utilizing Large Language Models (LLMs) to understand key components, their failure modes, and behaviors. A suite of algorithms for detecting anomalies is offered in univariate and multivariate time series data, including regression-based, mixture-model-based, and semi-supervised approaches. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. The service has been successfully applied in various industrial settings, including IoT-based AI applications. We have also evaluated our system on public anomaly benchmarks to show its effectiveness. By leveraging it, SREs can proactively identify potential issues before they escalate, reducing downtime and improving response times to incidents, ultimately enhancing the overall customer experience. We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.

[AI-12] Distilling Large Language Models for Network Active Queue Management

链接: https://arxiv.org/abs/2501.16734
作者: Deol Satish,Shiva Raj Pokhrel,Jonathan Kua,Anwar Walid
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:The growing complexity of network traffic and demand for ultra-low latency communication require smarter packet traffic management. Existing Deep Learning-based queuing approaches struggle with dynamic network scenarios and demand high engineering effort. We propose AQM-LLM, distilling Large Language Models (LLMs) with few-shot learning, contextual understanding, and pattern recognition to improve Active Queue Management (AQM) [RFC 9330] with minimal manual effort. We consider a specific case where AQM is Low Latency, Low Loss, and Scalable Throughput (L4S) and our design of AQM-LLM builds on speculative decoding and reinforcement-based distilling of LLM by tackling congestion prevention in the L4S architecture using Explicit Congestion Notification (ECN) [RFC 9331] and periodic packet dropping. We develop a new open-source experimental platform by executing L4S-AQM on FreeBSD-14, providing interoperable modules to support LLM integration and facilitate IETF recognition through wider testing. Our extensive evaluations show L4S-LLM enhances queue management, prevents congestion, reduces latency, and boosts network performance, showcasing LLMs’ adaptability and efficiency in uplifting AQM systems.

[AI-13] On the Interplay Between Sparsity and Training in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2501.16729
作者: Fatima Davelouis,John D. Martin,Michael Bowling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study the benefits of different sparse architectures for deep reinforcement learning. In particular, we focus on image-based domains where spatially-biased and fully-connected architectures are common. Using these and several other architectures of equal capacity, we show that sparse structure has a significant effect on learning performance. We also observe that choosing the best sparse architecture for a given domain depends on whether the hidden layer weights are fixed or learned.

[AI-14] Bridging Neural Networks and Wireless Systems with MIMO-OFDM Semantic Communications

链接: https://arxiv.org/abs/2501.16726
作者: Hanju Yoo,Dongha Choi,Yonghwi Kim,Yoontae Kim,Songkuk Kim,Chan-Byoung Chae,Robert W. Heath Jr
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Semantic communications aim to enhance transmission efficiency by jointly optimizing source coding, channel coding, and modulation. While prior research has demonstrated promising performance in simulations, real-world implementations often face significant challenges, including noise variability and nonlinear distortions, leading to performance gaps. This article investigates these challenges in a multiple-input multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM)-based semantic communication system, focusing on the practical impacts of power amplifier (PA) nonlinearity and peak-to-average power ratio (PAPR) variations. Our analysis identifies frequency selectivity of the actual channel as a critical factor in performance degradation and demonstrates that targeted mitigation strategies can enable semantic systems to approach theoretical performance. By addressing key limitations in existing designs, we provide actionable insights for advancing semantic communications in practical wireless environments. This work establishes a foundation for bridging the gap between theoretical models and real-world deployment, highlighting essential considerations for system design and optimization.

[AI-15] Hypergraph Diffusion for High-Order Recommender Systems

链接: https://arxiv.org/abs/2501.16722
作者: Darnbi Sakong,Thanh Trung Huynh,Jun Jo
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: Technical Report

点击查看摘要

Abstract:Recommender systems rely on Collaborative Filtering (CF) to predict user preferences by leveraging patterns in historical user-item interactions. While traditional CF methods primarily focus on learning compact vector embeddings for users and items, graph neural network (GNN)-based approaches have emerged as a powerful alternative, utilizing the structure of user-item interaction graphs to enhance recommendation accuracy. However, existing GNN-based models, such as LightGCN and UltraGCN, often struggle with two major limitations: an inability to fully account for heterophilic interactions, where users engage with diverse item categories, and the over-smoothing problem in multi-layer GNNs, which hinders their ability to model complex, high-order relationships. To address these gaps, we introduce WaveHDNN, an innovative wavelet-enhanced hypergraph diffusion framework. WaveHDNN integrates a Heterophily-aware Collaborative Encoder, designed to capture user-item interactions across diverse categories, with a Multi-scale Group-wise Structure Encoder, which leverages wavelet transforms to effectively model localized graph structures. Additionally, cross-view contrastive learning is employed to maintain robust and consistent representations. Experiments on benchmark datasets validate the efficacy of WaveHDNN, demonstrating its superior ability to capture both heterophilic and localized structural information, leading to improved recommendation performance.

[AI-16] Optimizing Code Runtime Performance through Context-Aware Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2501.16692
作者: Manish Acharya,Yifan Zhang,Yu Huang,Kevin Leach
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing software performance through automated code refinement offers a promising avenue for enhancing execution speed and efficiency. Despite recent advancements in LLMs, a significant gap remains in their ability to perform in-depth program analysis. This study introduces AUTOPATCH, an in-context learning approach designed to bridge this gap by enabling LLMs to automatically generate optimized code. Inspired by how programmers learn and apply knowledge to optimize software, AUTOPATCH incorporates three key components: (1) an analogy-driven framework to align LLM optimization with human cognitive processes, (2) a unified approach that integrates historical code examples and CFG analysis for context-aware learning, and (3) an automated pipeline for generating optimized code through in-context prompting. Experimental results demonstrate that AUTOPATCH achieves a 7.3% improvement in execution efficiency over GPT-4o across common generated executable code, highlighting its potential to advance automated program runtime optimization.

[AI-17] MACI: Multi-Agent Collaborative Intelligence for Robust Reasoning and Temporal Planning

链接: https://arxiv.org/abs/2501.16689
作者: Edward Y. Chang
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 21 tables

点击查看摘要

Abstract:Artificial intelligence requires deliberate reasoning, temporal awareness, and effective constraint management, capabilities beyond the pattern-matching strengths of LLMs. LLMs struggle with planning tasks because of their reliance on associative reasoning, inability to self-verify, and inconsistent constraint awareness. We propose Multi-Agent Collaborative Intelligence (MACI), a framework centered on a meta-planner (MP) that orchestrates multiple agents to generate planner templates that define roles and constraints. These planners produce actionable workflows of role nodes and dependency constraints, enabling advanced temporal reasoning and adaptability. MACI’s three-tier architecture includes a meta-planning module for planner construction, common agents for general reasoning, and specialized agents for domain expertise. By decoupling planning from validation, it overcomes key LLM limitations. Evaluations demonstrate MACI’s effective constraint satisfaction, conflict detection, and reasoning, positioning it as a robust solution for complex reasoning and planning tasks.

[AI-18] Data-Free Model-Related Attacks: Unleashing the Potential of Generative AI USENIX-SECURITY2025

链接: https://arxiv.org/abs/2501.16671
作者: Dayong Ye,Tianqing Zhu,Shang Wang,Bo Liu,Leo Yu Zhang,Wanlei Zhou,Yang Zhang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at USENIX Security 2025

点击查看摘要

Abstract:Generative AI technology has become increasingly integrated into our daily lives, offering powerful capabilities to enhance productivity. However, these same capabilities can be exploited by adversaries for malicious purposes. While existing research on adversarial applications of generative AI predominantly focuses on cyberattacks, less attention has been given to attacks targeting deep learning models. In this paper, we introduce the use of generative AI for facilitating model-related attacks, including model extraction, membership inference, and model inversion. Our study reveals that adversaries can launch a variety of model-related attacks against both image and text models in a data-free and black-box manner, achieving comparable performance to baseline methods that have access to the target models’ training data and parameters in a white-box manner. This research serves as an important early warning to the community about the potential risks associated with generative AI-powered attacks on deep learning models.

[AI-19] Federated Learning for Efficient Condition Monitoring and Anomaly Detection in Industrial Cyber-Physical Systems

链接: https://arxiv.org/abs/2501.16666
作者: William Marfo,Deepak K. Tosh,Shirley V. Moore
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting and localizing anomalies in cyber-physical systems (CPS) has become increasingly challenging as systems grow in complexity, particularly due to varying sensor reliability and node failures in distributed environments. While federated learning (FL) provides a foundation for distributed model training, existing approaches often lack mechanisms to address these CPS-specific challenges. This paper introduces an enhanced FL framework with three key innovations: adaptive model aggregation based on sensor reliability, dynamic node selection for resource optimization, and Weibull-based checkpointing for fault tolerance. The proposed framework ensures reliable condition monitoring while tackling the computational and reliability challenges of industrial CPS deployments. Experiments on the NASA Bearing and Hydraulic System datasets demonstrate superior performance compared to state-of-the-art FL methods, achieving 99.5% AUC-ROC in anomaly detection and maintaining accuracy even under node failures. Statistical validation using the Mann-Whitney U test confirms significant improvements, with a p-value less than 0.05, in both detection accuracy and computational efficiency across various operational scenarios.
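
其中"基于 Weibull 的检查点机制"背后的计算可以用生存函数直观说明:令一个间隔内的失效概率 1 - S(t) 不超过给定阈值,反解出检查点间隔 t。下面是按此思路的示意实现(函数名、参数与阈值均为假设,并非论文原始代码):

```python
import math

def checkpoint_interval(shape_k: float, scale_lam: float, max_fail_prob: float = 0.05) -> float:
    """Weibull 生存函数 S(t) = exp(-(t/λ)^k);令 1 - S(t) = p,解得 t = λ·(-ln(1-p))^(1/k)。"""
    return scale_lam * (-math.log(1.0 - max_fail_prob)) ** (1.0 / shape_k)

# 假设:k > 1 表示失效率随运行时间上升(磨损型故障),λ 为特征寿命(以训练轮数计)
print(f"checkpoint every {checkpoint_interval(1.5, 100.0):.1f} rounds")
```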

[AI-20] Data Duplication: A Novel Multi-Purpose Attack Paradigm in Machine Unlearning USENIX-SECURITY2025

链接: https://arxiv.org/abs/2501.16663
作者: Dayong Ye,Tianqing Zhu,Jiayang Li,Kun Gao,Bo Liu,Leo Yu Zhang,Wanlei Zhou,Yang Zhang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at USENIX Security 2025

点击查看摘要

Abstract:Duplication is a prevalent issue within datasets. Existing research has demonstrated that the presence of duplicated data in training datasets can significantly influence both model performance and data privacy. However, the impact of data duplication on the unlearning process remains largely unexplored. This paper addresses this gap by pioneering a comprehensive investigation into the role of data duplication, not only in standard machine unlearning but also in federated and reinforcement unlearning paradigms. Specifically, we propose an adversary who duplicates a subset of the target model’s training set and incorporates it into the training set. After training, the adversary requests the model owner to unlearn this duplicated subset, and analyzes the impact on the unlearned model. For example, the adversary can challenge the model owner by revealing that, despite efforts to unlearn it, the influence of the duplicated subset remains in the model. Moreover, to circumvent detection by de-duplication techniques, we propose three novel near-duplication methods for the adversary, each tailored to a specific unlearning paradigm. We then examine their impacts on the unlearning process when de-duplication techniques are applied. Our findings reveal several crucial insights: 1) the gold standard unlearning method, retraining from scratch, fails to effectively conduct unlearning under certain conditions; 2) unlearning duplicated data can lead to significant model degradation in specific scenarios; and 3) meticulously crafted duplicates can evade detection by de-duplication methods.

[AI-21] Towards Resource-Efficient Compound AI Systems

链接: https://arxiv.org/abs/2501.16634
作者: Gohar Irfan Chaudhry,Esha Choukse,Íñigo Goiri,Rodrigo Fonseca,Adam Belay,Ricardo Bianchini
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compound AI Systems, integrating multiple interacting components like models, retrievers, and external tools, have emerged as essential for addressing complex AI tasks. However, current implementations suffer from inefficient resource utilization due to tight coupling between application logic and execution details, a disconnect between orchestration and resource management layers, and the perceived exclusiveness between efficiency and quality. We propose a vision for resource-efficient Compound AI Systems through a declarative workflow programming model and an adaptive runtime system for dynamic scheduling and resource-aware decision-making. Decoupling application logic from low-level details exposes levers for the runtime to flexibly configure the execution environment and resources, without compromising on quality. Enabling collaboration between the workflow orchestration and cluster manager enables higher efficiency through better scheduling and resource management. We are building a prototype system, called Murakkab, to realize this vision. Our preliminary evaluation demonstrates speedups up to ~3.4x in workflow completion times while delivering ~4.5x higher energy efficiency, showing promise in optimizing resources and advancing AI system design.

[AI-22] Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making

链接: https://arxiv.org/abs/2501.16627
作者: Zichen Chen,Yunhao Luo,Misha Sra
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 36 pages, 6 figures, 6 tables. Preprint version

点击查看摘要

Abstract:As reliance on AI systems for decision-making grows, it becomes critical to ensure that human users can appropriately balance trust in AI suggestions with their own judgment, especially in high-stakes domains like healthcare. However, human + AI teams have been shown to perform worse than AI alone, with evidence indicating automation bias as the reason for poorer performance, particularly because humans tend to follow AI’s recommendations even when they are incorrect. In many existing human + AI systems, decision-making support is typically provided in the form of text explanations (XAI) to help users understand the AI’s reasoning. Since human decision-making often relies on System 1 thinking, users may ignore or insufficiently engage with the explanations, leading to poor decision-making. Previous research suggests that there is a need for new approaches that encourage users to engage with the explanations and one proposed method is the use of cognitive forcing functions (CFFs). In this work, we examine how various decision-support mechanisms impact user engagement, trust, and human-AI collaborative task performance in a diabetes management decision-making scenario. In a controlled experiment with 108 participants, we evaluated the effects of six decision-support mechanisms split into two categories of explanations (text, visual) and four CFFs. Our findings reveal that mechanisms like AI confidence levels, text explanations, and performance visualizations enhanced human-AI collaborative task performance, and improved trust when AI reasoning clues were provided. Mechanisms like human feedback and AI-driven questions encouraged deeper reflection but often reduced task performance by increasing cognitive effort, which in turn affected trust. Simple mechanisms like visual explanations had little effect on trust, highlighting the importance of striking a balance in CFF and XAI design.

[AI-23] Chinese Stock Prediction Based on a Multi-Modal Transformer Framework: Macro-Micro Information Fusion

链接: https://arxiv.org/abs/2501.16621
作者: Lumen AI,Tengzhou No. 1 Middle School,Shihao Ji,Zihui Song,Fucheng Zhong,Jisen Jia,Zhaobo Wu,Zheyi Cao,Xu Tianhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes an innovative Multi-Modal Transformer framework (MMF-Trans) designed to significantly improve the prediction accuracy of the Chinese stock market by integrating multi-source heterogeneous information including macroeconomy, micro-market, financial text, and event knowledge. The framework consists of four core modules: (1) A four-channel parallel encoder that processes technical indicators, financial text, macro data, and event knowledge graph respectively for independent feature extraction of multi-modal data; (2) A dynamic gated cross-modal fusion mechanism that adaptively learns the importance of different modalities through differentiable weight allocation for effective information integration; (3) A time-aligned mixed-frequency processing layer that uses an innovative position encoding method to effectively fuse data of different time frequencies and solves the time alignment problem of heterogeneous data; (4) A graph attention-based event impact quantification module that captures the dynamic impact of events on the market through event knowledge graph and quantifies the event impact coefficient. We introduce a hybrid-frequency Transformer and Event2Vec algorithm to effectively fuse data of different frequencies and quantify the event impact. Experimental results show that in the prediction task of CSI 300 constituent stocks, the root mean square error (RMSE) of the MMF-Trans framework is reduced by 23.7% compared to the baseline model, the event response prediction accuracy is improved by 41.2%, and the Sharpe ratio is improved by 32.6%.
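
其核心的"动态门控跨模态融合"可以抽象成下面几行 PyTorch 代码(仅为示意:四路模态特征假设已各自编码为等维向量,门控结构为简化版本,并非论文原始实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """示意:由拼接特征产生各模态权重,再做可微的加权融合。"""
    def __init__(self, dim=64, n_modal=4):
        super().__init__()
        self.gate = nn.Linear(dim * n_modal, n_modal)

    def forward(self, feats):                                  # feats: (B, n_modal, dim)
        w = F.softmax(self.gate(feats.flatten(1)), dim=-1)     # (B, n_modal),权重和为 1
        return (w.unsqueeze(-1) * feats).sum(dim=1)            # 融合后的表示 (B, dim)

fusion = GatedFusion()
x = torch.randn(8, 4, 64)     # 技术指标 / 财经文本 / 宏观数据 / 事件图谱 四路编码(占位)
print(fusion(x).shape)        # torch.Size([8, 64])
```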

[AI-24] Safe Reinforcement Learning for Real-World Engine Control

链接: https://arxiv.org/abs/2501.16613
作者: Julian Bedei,Lucas Koch,Kevin Badalian,Alexander Winkler,Patrick Schaber,Jakob Andert
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work introduces a toolchain for applying Reinforcement Learning (RL), specifically the Deep Deterministic Policy Gradient (DDPG) algorithm, in safety-critical real-world environments. As an exemplary application, transient load control is demonstrated on a single-cylinder internal combustion engine testbench in Homogeneous Charge Compression Ignition (HCCI) mode, that offers high thermal efficiency and low emissions. However, HCCI poses challenges for traditional control methods due to its nonlinear, autoregressive, and stochastic nature. RL provides a viable solution, however, safety concerns, such as excessive pressure rise rates, must be addressed when applying to HCCI. A single unsuitable control input can severely damage the engine or cause misfiring and shut down. Additionally, operating limits are not known a priori and must be determined experimentally. To mitigate these risks, real-time safety monitoring based on the k-nearest neighbor algorithm is implemented, enabling safe interaction with the testbench. The feasibility of this approach is demonstrated as the RL agent learns a control policy through interaction with the testbench. A root mean square error of 0.1374 bar is achieved for the indicated mean effective pressure, comparable to neural network-based controllers from the literature. The toolchain’s flexibility is further demonstrated by adapting the agent’s policy to increase ethanol energy shares, promoting renewable fuel use while maintaining safety. This RL approach addresses the longstanding challenge of applying RL to safety-critical real-world environments. The developed toolchain, with its adaptability and safety mechanisms, paves the way for future applicability of RL in engine testbenches and other safety-critical settings.
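
摘要中基于 k 近邻的实时安全监控,其判定逻辑大致如下(示意实现:特征构造、k 值与距离阈值均为假设,实际系统需在发动机台架上实验标定):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class KNNSafetyMonitor:
    """示意:候选控制输入若偏离已知安全工况太远,则拒绝执行。"""
    def __init__(self, safe_points, k=5, max_dist=0.5):
        self.knn = NearestNeighbors(n_neighbors=k).fit(safe_points)
        self.max_dist = max_dist

    def is_safe(self, action) -> bool:
        dist, _ = self.knn.kneighbors(np.asarray(action).reshape(1, -1))
        return float(dist.mean()) <= self.max_dist   # 平均近邻距离作为"陌生度"

rng = np.random.default_rng(0)
safe = rng.normal(0, 0.1, size=(1000, 3))            # 假设的已验证安全工况样本
monitor = KNNSafetyMonitor(safe)
print(monitor.is_safe([0.0, 0.05, -0.02]))           # True:落在安全域内
print(monitor.is_safe([2.0, 2.0, 2.0]))              # False:偏离安全域,拒绝执行
```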

[AI-25] Governing the Agent-to-Agent Economy of Trust via Progressive Decentralization

链接: https://arxiv.org/abs/2501.16606
作者: Tomer Jordi Chaffer
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current approaches to AI governance often fall short in anticipating a future where AI agents manage critical tasks, such as financial operations, administrative functions, and beyond. As AI agents may eventually delegate tasks among themselves to optimize efficiency, understanding the foundational principles of human value exchange could offer insights into how AI-driven economies might operate. Just as trust and value exchange are central to human interactions in open marketplaces, they may also be critical for enabling secure and efficient interactions among AI agents. While cryptocurrencies could serve as the foundation for monetizing value exchange in a collaboration and delegation dynamic among AI agents, a critical question remains: how can these agents reliably determine whom to trust, and how can humans ensure meaningful oversight and control as an economy of AI agents scales and evolves? This paper is a call for a collective exploration of cryptoeconomic incentives, which can help design decentralized governance systems that allow AI agents to autonomously interact and exchange value while ensuring human oversight via progressive decentralization. Toward this end, I propose a research agenda to address the question of agent-to-agent trust using AgentBound Tokens, which are non-transferable, non-fungible tokens uniquely tied to individual AI agents, akin to Soulbound tokens for humans in Web3. By staking ABTs as collateral for autonomous actions within an agent-to-agent network via a proof-of-stake mechanism, agents may be incentivized towards ethical behavior, and penalties for misconduct are automatically enforced.

[AI-26] Impact and influence of modern AI in metadata management

链接: https://arxiv.org/abs/2501.16605
作者: Wenli Yang,Rui Fu,Muhammad Bilal Amin,Byeong Kang
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages more advanced modern AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.

[AI-27] Applying Ensemble Models based on Graph Neural Network and Reinforcement Learning for Wind Power Forecasting

链接: https://arxiv.org/abs/2501.16591
作者: Hongjin Song,Qianrun Chen,Tianqi Jiang,Yongfeng Li,Xusheng Li,Wenjun Xi,Songtao Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately predicting the wind power output of a wind farm across various time scales utilizing Wind Power Forecasting (WPF) is a critical issue in wind power trading and utilization. The WPF problem remains unresolved due to numerous influencing variables, such as wind speed, temperature, latitude, and longitude. Furthermore, achieving high prediction accuracy is crucial for maintaining electric grid stability and ensuring supply security. In this paper, we model all wind turbines within a wind farm as graph nodes in a graph built by their geographical locations. Accordingly, we propose an ensemble model based on graph neural networks and reinforcement learning (EMGRL) for WPF. Our approach includes: (1) applying graph neural networks to capture the time-series data from neighboring wind farms relevant to the target wind farm; (2) establishing a general state embedding that integrates the target wind farm’s data with the historical performance of base models on the target wind farm; (3) ensembling and leveraging the advantages of all base models through an actor-critic reinforcement learning framework for WPF.

[AI-28] Generative AI Uses and Risks for Knowledge Workers in a Science Organization

链接: https://arxiv.org/abs/2501.16577
作者: Kelly B. Wagman,Matthew T. Dearing,Marshini Chetty
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: CHI Conference on Human Factors in Computing Systems (CHI '25)

点击查看摘要

Abstract:Generative AI could enhance scientific discovery by supporting knowledge workers in science organizations. However, the real-world applications and perceived concerns of generative AI use in these organizations are uncertain. In this paper, we report on a collaborative study with a US national laboratory with employees spanning Science and Operations about their use of generative AI tools. We surveyed 66 employees, interviewed a subset (N=22), and measured early adoption of an internal generative AI interface called Argo lab-wide. We have four findings: (1) Argo usage data shows small but increasing use by Science and Operations employees; common current and envisioned use cases for generative AI in this context conceptually fall into either a (2) copilot or (3) workflow agent modality; and (4) concerns include sensitive data security, academic publishing, and job impacts. Based on our findings, we make recommendations for generative AI use in science and other organizations.

[AI-29] Sample-Efficient Behavior Cloning Using General Domain Knowledge

链接: https://arxiv.org/abs/2501.16546
作者: Feiyu Zhu,Jean Oh,Reid Simmons
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Behavior cloning has shown success in many sequential decision-making tasks by learning from expert demonstrations, yet it can be very sample inefficient and fail to generalize to unseen scenarios. One approach to these problems is to introduce general domain knowledge, such that the policy can focus on the essential features and may generalize to unseen states by applying that knowledge. Although this knowledge is easy to acquire from the experts, it is hard to combine with learning from individual examples due to the lack of semantic structure in neural networks and the time-consuming nature of feature engineering. To enable learning from both general knowledge and specific demonstration trajectories, we use a large language model’s coding capability to instantiate a policy structure based on expert domain knowledge expressed in natural language and tune the parameters in the policy with demonstrations. We name this approach the Knowledge Informed Model (KIM) as the structure reflects the semantics of expert knowledge. In our experiments with lunar lander and car racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning.

[AI-30] Generalized Mission Planning for Heterogeneous Multi-Robot Teams via LLM-constructed Hierarchical Trees

链接: https://arxiv.org/abs/2501.16539
作者: Piyush Gupta,David Isele,Enna Sachdeva,Pin-Hao Huang,Behzad Dariush,Kwonjoon Lee,Sangjae Bae
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We present a novel mission-planning strategy for heterogeneous multi-robot teams, taking into account the specific constraints and capabilities of each robot. Our approach employs hierarchical trees to systematically break down complex missions into manageable sub-tasks. We develop specialized APIs and tools, which are utilized by Large Language Models (LLMs) to efficiently construct these hierarchical trees. Once the hierarchical tree is generated, it is further decomposed to create optimized schedules for each robot, ensuring adherence to their individual constraints and capabilities. We demonstrate the effectiveness of our framework through detailed examples covering a wide range of missions, showcasing its flexibility and scalability.
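
A minimal sketch of the hierarchical-tree idea, with a toy mission and robot capabilities invented for illustration (the paper's LLM-constructed trees and optimized schedules are far richer):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskNode:
    """One node in a hierarchical mission tree; leaves are schedulable sub-tasks."""
    name: str
    required_capability: Optional[str] = None  # None for internal (non-leaf) nodes
    children: List["TaskNode"] = field(default_factory=list)

    def leaves(self) -> List["TaskNode"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# Hypothetical mission decomposition, illustrative only.
mission = TaskNode("inspect_and_repair", children=[
    TaskNode("survey_area", children=[TaskNode("aerial_scan", "fly")]),
    TaskNode("repair_site", children=[TaskNode("move_debris", "lift"),
                                      TaskNode("weld_joint", "weld")]),
])

# Assign each leaf sub-task to a robot whose capabilities satisfy its requirement.
robots = {"drone-1": {"fly"}, "arm-bot": {"lift", "weld"}}
for leaf in mission.leaves():
    assignee = next((r for r, caps in robots.items()
                     if leaf.required_capability in caps), None)
    print(f"{leaf.name} -> {assignee}")
```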

[AI-31] Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

链接: https://arxiv.org/abs/2501.16534
作者: Jean-Charles Noirot Ferrand,Yohan Beugin,Eric Pauley,Ryan Sheatsley,Patrick McDaniel
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we present and evaluate a method to assess the robustness of LLM alignment. We observe that alignment embeds a safety classifier in the target model that is responsible for deciding between refusal and compliance. We seek to extract an approximation of this classifier, called a surrogate classifier, from the LLM. We develop an algorithm for identifying candidate classifiers from subsets of the LLM model. We evaluate the degree to which the candidate classifiers approximate the model’s embedded classifier in benign (F1 score) and adversarial (using surrogates in a white-box attack) settings. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find attacks mounted on the surrogate models can be transferred with high accuracy. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70%, a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is a viable (and highly effective) means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
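
One simple way to build a surrogate of an embedded safety decision is a linear probe on intermediate activations, sketched below with random stand-in data; this illustrates only the surrogate idea, not the paper's algorithm for identifying candidate classifiers from model subsets.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-ins: in practice these would be hidden states from an aligned LLM's layers,
# labeled by whether the model refused (1) or complied (0).
hidden = torch.randn(512, 4096)
labels = torch.randint(0, 2, (512,)).float()

probe = nn.Linear(4096, 1)  # the surrogate classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(hidden).squeeze(-1), labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (probe(hidden).squeeze(-1) > 0).float()
    agreement = (preds == labels).float().mean()
    print(f"surrogate agreement with refusal labels: {agreement:.2f}")
```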

[AI-32] Characterizing Network Structure of Anti-Trans Actors on TikTok

链接: https://arxiv.org/abs/2501.16507
作者: Maxyn Leitner,Rebecca Dorn,Fred Morstatter,Kristina Lerman
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 11 pages, 4 figures. 2 tables

点击查看摘要

Abstract:The recent proliferation of short form video social media sites such as TikTok has been effectively utilized for increased visibility, communication, and community connection amongst trans/nonbinary creators online. However, these same platforms have also been exploited by right-wing actors targeting trans/nonbinary people, enabling such anti-trans actors to efficiently spread hate speech and propaganda. Given these divergent groups, what are the differences in network structure between anti-trans and pro-trans communities on TikTok, and to what extent do they amplify the effects of anti-trans content? In this paper, we collect a sample of TikTok videos containing pro and anti-trans content, and develop a taxonomy of trans-related sentiment to enable the classification of content on TikTok, and ultimately analyze the reply network structures of pro-trans and anti-trans communities. In order to accomplish this, we worked with hired expert data annotators from the trans/nonbinary community in order to generate a sample of highly accurately labeled data. From this subset, we utilized a novel classification pipeline leveraging Retrieval-Augmented Generation (RAG) with annotated examples and taxonomy definitions to classify content into pro-trans, anti-trans, or neutral categories. We find that incorporating our taxonomy and its logics into our classification engine results in improved ability to differentiate trans-related content. Results from network analysis indicate that many interactions between posters of pro-trans and anti-trans content exist, further demonstrating the targeting of trans individuals and the need for better content moderation tools.

[AI-33] Towards Robust Stability Prediction in Smart Grids: GAN-based Approach under Data Constraints and Adversarial Challenges

链接: https://arxiv.org/abs/2501.16490
作者: Emad Efatinasab,Alessandro Brighente,Denis Donadel,Mauro Conti,Mirco Rampazzo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE Internet of Things Journal for possible publication

点击查看摘要

Abstract:Smart grids are critical for addressing the growing energy demand due to global population growth and urbanization. They enhance efficiency, reliability, and sustainability by integrating renewable energy. Ensuring their availability and safety requires advanced operational control and safety measures. Researchers employ AI and machine learning to assess grid stability, but challenges like the lack of datasets and cybersecurity threats, including adversarial attacks, persist. In particular, data scarcity is a key issue: obtaining grid instability instances is tough due to the need for significant expertise, resources, and time. However, such instances are essential to test novel research advancements and security mitigations. In this paper, we introduce a novel framework to detect instability in smart grids by employing only stable data. It relies on a Generative Adversarial Network (GAN) where the generator is trained to create instability data that are used along with stable data to train the discriminator. Moreover, we include a new adversarial training layer to improve robustness against adversarial attacks. Our solution, tested on a dataset composed of real-world stable and unstable samples, achieves accuracy of up to 97.5% in predicting grid stability and up to 98.9% in detecting adversarial attacks. Moreover, we implemented our model on a single-board computer, demonstrating efficient real-time decision-making with an average response time of less than 7ms. Our solution improves prediction accuracy and resilience while addressing data scarcity in smart grid management.
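
The core training loop described above can be sketched as a standard GAN, where the generator learns to produce instability-like samples from noise and the discriminator separates them from real stable data; dimensions, data, and hyperparameters below are placeholders, and the paper's extra adversarial-training layer is omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, noise_dim = 12, 8
stable = torch.randn(256, feat_dim)  # stand-in for real stable grid measurements

G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(500):
    # Discriminator: stable data labeled 1, generated (instability-like) data labeled 0.
    fake = G(torch.randn(256, noise_dim))
    opt_d.zero_grad()
    d_loss = bce(D(stable), torch.ones(256, 1)) + bce(D(fake.detach()), torch.zeros(256, 1))
    d_loss.backward()
    opt_d.step()
    # Generator: try to fool the discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(G(torch.randn(256, noise_dim))), torch.ones(256, 1))
    g_loss.backward()
    opt_g.step()
```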

[AI-34] SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments ICLR2025

链接: https://arxiv.org/abs/2501.16471
作者: Simon Dahan,Gabriel Bénédict,Logan Z. J. Williams,Yourong Guo,Daniel Rueckert,Robert Leech,Emma C. Robinson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
*备注: 27 pages, accepted to ICLR 2025

点击查看摘要

Abstract:Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics, through encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice-versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code and pre-trained models will be made available at this https URL; processed data for training will be available upon request at this https URL.

[AI-35] On the Feasibility of Using LLMs to Execute Multistage Network Attacks

链接: https://arxiv.org/abs/2501.16466
作者: Brian Singer,Keane Lucas,Lakshmi Adiga,Meghna Jain,Lujo Bauer,Vyas Sekar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 16 pages, 14 figures

点击查看摘要

Abstract:LLMs have shown preliminary promise in some security tasks and CTF challenges. However, it is unclear whether LLMs are able to realize multistage network attacks, which involve executing a wide variety of actions across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities to gain initial access, leveraging internal hosts to move laterally, and using multiple compromised hosts to exfiltrate data. We evaluate LLMs across 10 multistage networks and find that popular LLMs are unable to realize these attacks. To enable LLMs to realize these attacks, we introduce Incalmo, an LLM-agnostic high-level attack abstraction layer that sits between an LLM and the environment. Rather than LLMs issuing low-level command-line instructions, which can lead to incorrect implementations, Incalmo allows LLMs to specify high-level tasks (e.g., infect a host, scan a network), which are then carried out by Incalmo. Incalmo realizes these tasks by translating them into low-level primitives (e.g., commands to exploit tools). Incalmo also provides an environment state service and an attack graph service to provide structure to LLMs in selecting actions relevant to a multistage attack. Across 9 out of 10 realistic emulated networks (from 25 to 50 hosts), LLMs using Incalmo can successfully autonomously execute multistage attacks. We also conduct an ablation analysis to show the key role the high-level abstractions play. For instance, we find that both Incalmo’s high-level tasks and services are crucial. Furthermore, even smaller-parameter LLMs with Incalmo can fully succeed in 5 of 10 environments, while larger-parameter LLMs without Incalmo do not fully succeed in any.

[AI-36] Detecting Zero-Day Attacks in Digital Substations via In-Context Learning

链接: https://arxiv.org/abs/2501.16453
作者: Faizan Manzoor,Vanshaj Khattar,Akila Herath,Clifton Black,Matthew C Nielsen,Junho Hong,Chen-Ching Liu,Ming Jin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The occurrences of cyber attacks on power grids have been increasing every year, with novel attack techniques continually emerging. In this paper, we address the critical challenge of detecting novel/zero-day attacks in digital substations that employ the IEC-61850 communication protocol. While many heuristic and machine learning (ML)-based methods have been proposed for attack detection in IEC-61850 digital substations, generalization to novel or zero-day attacks remains challenging. We propose an approach that leverages the in-context learning (ICL) capability of the transformer architecture, the fundamental building block of large language models. The ICL approach enables the model to detect zero-day attacks and learn from a few examples of that attack without explicit retraining. Our experiments on the IEC-61850 dataset demonstrate that the proposed method achieves more than 85% detection accuracy on zero-day attacks while the existing state-of-the-art baselines fail. This work paves the way for building more secure and resilient digital substations of the future.

[AI-37] 360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation

链接: https://arxiv.org/abs/2501.16450
作者: Hamed Firooz,Maziar Sanjabi,Adrian Englhardt,Aman Gupta,Ben Levine,Dre Olgiati,Gungor Polatkan,Iuliia Melnychuk,Karthik Ramgopal,Kirill Talanine,Kutta Srinivasan,Luke Simon,Natesh Sivasubramoniapillai,Necip Fazil Ayan,Qingquan Song,Samira Sriram,Souvik Ghosh,Tao Song,Vignesh Kothapalli,Xiaoling Zhai,Ya Xu,Yu Wang,Yun Dai
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Ranking and recommendation systems are the foundation for numerous online experiences, ranging from search results to personalized content delivery. These systems have evolved into complex, multilayered architectures that leverage vast datasets and often incorporate thousands of predictive models. The maintenance and enhancement of these models is a labor-intensive process that requires extensive feature engineering. This approach not only exacerbates technical debt but also hampers innovation in extending these systems to emerging problem domains. In this report, we present our research to address these challenges by utilizing a large foundation model with a textual interface for ranking and recommendation tasks. We illustrate several key advantages of our approach: (1) a single model can manage multiple predictive tasks involved in ranking and recommendation, (2) decoder models with a textual interface, owing to their comprehension and reasoning capabilities, can generalize to new recommendation surfaces and out-of-domain problems, and (3) by employing natural language interfaces for task definitions and verbalizing member behaviors and their social connections, we eliminate the need for feature engineering and the maintenance of complex directed acyclic graphs of model dependencies. We introduce our research pre-production model, 360Brew V1.0, a 150B parameter, decoder-only model that has been trained and fine-tuned on LinkedIn’s data and tasks. This model is capable of solving over 30 predictive tasks across various segments of the LinkedIn platform, achieving performance levels comparable to or exceeding those of current production systems based on offline metrics, without task-specific fine-tuning. Notably, each of these tasks is conventionally addressed by dedicated models that have been developed and maintained over multiple years by teams of a similar or larger size than our own.

[AI-38] What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment

链接: https://arxiv.org/abs/2501.16448
作者: Robin Young
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:“First, do no harm” faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be overcome through better algorithms or more data, we argue this assumption is unsound. Drawing on information theory, we demonstrate that complete harm specification is fundamentally impossible for any system where harm is defined external to its specifications. This impossibility arises from an inescapable information-theoretic gap: the entropy of harm H(O) always exceeds the mutual information I(O;I) between ground truth harm O and a system’s specifications I. We introduce two novel metrics: semantic entropy H(S) and the safety-capability ratio I(O;I)/H(O), to quantify these limitations. Through a progression of increasingly sophisticated specification attempts, we show why each approach must fail and why the resulting gaps are not mere engineering challenges but fundamental constraints akin to the halting problem. These results suggest a paradigm shift: rather than pursuing complete specifications, AI alignment research should focus on developing systems that can operate safely despite irreducible specification uncertainty.
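
Both quantities in the safety-capability ratio can be estimated directly from samples. A minimal sketch, with toy harm labels O and specification outputs I invented for illustration:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) in bits, estimated from a list of outcomes."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy ground-truth harm labels O and system specification outputs I.
o_true = ["harm", "harm", "safe", "safe", "harm", "safe", "harm", "safe"]
i_spec = ["flag", "pass", "pass", "pass", "flag", "pass", "pass", "pass"]

h_o = entropy(o_true)
i_oi = mutual_information(o_true, i_spec)
print(f"H(O) = {h_o:.3f} bits, I(O;I) = {i_oi:.3f} bits")
print(f"safety-capability ratio I(O;I)/H(O) = {i_oi / h_o:.3f}")  # < 1: the claimed gap
```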

[AI-39] Leveraging Induced Transferable Binding Principles for Associative Prediction of Novel Drug-Target Interactions

链接: https://arxiv.org/abs/2501.16391
作者: Xiaoqing Lian,Jie Zhu,Tianxu Lv,Shiyun Nie,Hang Fan,Guosheng Wu,Yunjun Ge,Lihua Li,Xiangxiang Zeng,Xiang Pan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Significant differences in protein structures hinder the generalization of existing drug-target interaction (DTI) models, which often rely heavily on pre-learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive-Associative pipeline inspired by the workflow of scientists who draw on their accumulated expertise to gain insights into novel drug-target pairs from weakly related references. BioBridge predicts novel drug-target interactions using limited sequence data, incorporating multi-level encoders with adversarial training to accumulate transferable binding principles. On the basis of these principles, BioBridge employs a dynamic prototype meta-learning framework to associate insights from weakly related annotations, enabling robust predictions for previously unseen drug-target pairs. Extensive experiments demonstrate that BioBridge surpasses existing models, especially for unseen proteins. Notably, when only homologous protein binding data is available, BioBridge proves effective for virtual screening of the epidermal growth factor receptor and adenosine receptor, underscoring its potential in drug discovery.

[AI-40] UDiTQC: U-Net-Style Diffusion Transformer for Quantum Circuit Synthesis

链接: https://arxiv.org/abs/2501.16380
作者: Zhiwei Chen,Hao Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Quantum computing is a transformative technology with wide-ranging applications, and efficient quantum circuit generation is crucial for unlocking its full potential. Current diffusion model approaches based on U-Net architectures, while promising, encounter challenges related to computational efficiency and modeling global context. To address these issues, we propose UDiT, a novel U-Net-style Diffusion Transformer architecture, which combines U-Net’s strengths in multi-scale feature extraction with the Transformer’s ability to model global context. We demonstrate the framework’s effectiveness on two tasks: entanglement generation and unitary compilation, where UDiTQC consistently outperforms existing methods. Additionally, our framework supports tasks such as masking and editing circuits to meet specific physical property requirements. This dual advancement, improving quantum circuit synthesis and refining generative model architectures, marks a significant milestone in the convergence of quantum computing and machine learning research.

[AI-41] FedAGHN: Personalized Federated Learning with Attentive Graph HyperNetworks

链接: https://arxiv.org/abs/2501.16379
作者: Jiarui Song,Yunheng Shen,Chengbin Hou,Pengyu Wang,Jinbao Wang,Ke Tang,Hairong Lv
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Personalized Federated Learning (PFL) aims to address the statistical heterogeneity of data across clients by learning the personalized model for each client. Among various PFL approaches, the personalized aggregation-based approach conducts parameter aggregation in the server-side aggregation phase to generate personalized models, and focuses on learning appropriate collaborative relationships among clients for aggregation. However, the collaborative relationships vary in different scenarios and even at different stages of the FL process. To this end, we propose Personalized Federated Learning with Attentive Graph HyperNetworks (FedAGHN), which employs Attentive Graph HyperNetworks (AGHNs) to dynamically capture fine-grained collaborative relationships and generate client-specific personalized initial models. Specifically, AGHNs empower graphs to explicitly model the client-specific collaborative relationships, construct collaboration graphs, and introduce a tunable attentive mechanism to derive the collaboration weights, so that the personalized initial models can be obtained by aggregating parameters over the collaboration graphs. Extensive experiments demonstrate the superiority of FedAGHN. Moreover, a series of visualizations are presented to explore the effectiveness of collaboration graphs learned by FedAGHN.

[AI-42] Optimal Signal Decomposition-based Multi-Stage Learning for Battery Health Estimation

链接: https://arxiv.org/abs/2501.16377
作者: Vijay Babu Pamshetti,Wei Zhang,King Jet Tseng,Bor Kiat Ng,Qingyu Yan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Battery health estimation is fundamental to ensure battery safety and reduce cost. However, achieving accurate estimation has been challenging due to the batteries’ complex nonlinear aging patterns and capacity regeneration phenomena. In this paper, we propose OSL, an optimal signal decomposition-based multi-stage machine learning method for battery health estimation. OSL treats battery signals optimally. It uses optimized variational mode decomposition to extract decomposed signals capturing different frequency bands of the original battery signals. It also incorporates a multi-stage learning process to analyze both spatial and temporal battery features effectively. An experimental study is conducted with a public battery aging dataset. OSL demonstrates exceptional performance with a mean error of just 0.26%. It significantly outperforms comparison algorithms, both those without and those with suboptimal signal decomposition and analysis. OSL considers practical battery challenges and can be integrated into real-world battery management systems, offering practical benefits for battery monitoring and optimization.

[AI-43] HWPQ: Hessian-free Weight Pruning-Quantization For LLM Compression And Acceleration

链接: https://arxiv.org/abs/2501.16376
作者: Yuhan Kang,Zhongdi Luo,Mei Wen,Yang Shi,Jun He,Jianchao Yang,Zeyu Xue,Jing Feng,Xinwang Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across numerous domains. However, the high time complexity of existing pruning and quantization methods significantly hinders their effective deployment on resource-constrained consumer or edge devices. In this study, we propose a novel Hessian-free Weight Pruning-Quantization (HWPQ) method. HWPQ eliminates the need for computationally intensive Hessian matrix calculations by introducing a contribution-based weight metric, which evaluates the importance of weights without relying on second-order derivatives. Additionally, we employ the Exponentially Weighted Moving Average (EWMA) technique to bypass weight sorting, enabling the selection of weights that contribute most to LLM accuracy and further reducing time complexity. Our approach is extended to support 2:4 structured sparsity pruning, facilitating efficient execution on modern hardware accelerators. Experimental results demonstrate that HWPQ significantly enhances the compression performance of LLaMA2. Compared to state-of-the-art quantization and pruning frameworks, HWPQ achieves average speedups of 5.97x (up to 20.75x) in quantization time and 12.29x (up to 56.02x) in pruning time, while largely preserving model accuracy. Furthermore, we observe a 1.50x inference speedup compared to the baseline.
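
A rough sketch of the two mechanisms named above: an EWMA-based contribution score (approximated here with weight magnitudes, which is an assumption; the paper's contribution metric differs) and 2:4 structured pruning that keeps the two highest-scored weights in every group of four.

```python
import torch

def ewma_contribution(weight_history, alpha=0.3):
    """Exponentially weighted moving average of per-weight scores over snapshots."""
    score = torch.zeros_like(weight_history[0])
    for w in weight_history:
        score = alpha * w.abs() + (1 - alpha) * score
    return score

def prune_2_4(weights, scores):
    """2:4 structured sparsity: zero the 2 lowest-scored weights in each group of 4."""
    w = weights.reshape(-1, 4).clone()
    s = scores.reshape(-1, 4)
    keep = s.topk(2, dim=1).indices                  # indices of the 2 weights to keep
    mask = torch.zeros_like(w).scatter_(1, keep, 1.0)
    return (w * mask).reshape(weights.shape)

torch.manual_seed(0)
history = [torch.randn(2, 8) for _ in range(5)]      # snapshots of a weight matrix
scores = ewma_contribution(history)
pruned = prune_2_4(history[-1], scores)
print(pruned)  # half of each 4-weight group is zeroed
```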

[AI-44] On Storage Neural Network Augmented Approximate Nearest Neighbor Search

链接: https://arxiv.org/abs/2501.16375
作者: Taiga Ikeda,Daisuke Miyashita,Jun Deguchi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures

点击查看摘要

Abstract:Large-scale approximate nearest neighbor search (ANN) has been gaining attention along with the latest machine learning research employing ANNs. If the data is too large to fit in memory, it is necessary to search for the most similar vectors to a given query vector from the data stored in storage devices, not from that in memory. Storage devices such as NAND flash memory have larger capacity than memory devices such as DRAM, but they also have larger latency to read data. Therefore, ANN methods for storage require completely different approaches from conventional in-memory ANN methods. Since the approximation that the time required for search is determined only by the amount of data fetched from storage holds under reasonable assumptions, our goal is to minimize it while maximizing recall. For partitioning-based ANNs, vectors are partitioned into clusters in the index building phase. In the search phase, some of the clusters are chosen, the vectors in the chosen clusters are fetched from storage, and the nearest vector is retrieved from the fetched vectors. Thus, the key point is to accurately select the clusters containing the ground truth nearest neighbor vectors. We accomplish this by proposing a method to predict the correct clusters by means of a neural network that is gradually refined by alternating supervised learning and duplicated cluster assignment. Compared to state-of-the-art SPANN and an exhaustive method using k-means clustering and linear search, the proposed method achieves 90% recall on SIFT1M with 80% and 58% less data fetched from storage, respectively.
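
The partition-then-predict idea can be sketched with off-the-shelf components: cluster the base vectors at index-build time, then train a small network to predict which cluster holds a query's nearest neighbor. This toy version omits the paper's alternating supervised learning and duplicated cluster assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 32)).astype(np.float32)  # vectors kept on "storage"

# Index-build phase: partition vectors into clusters.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(base)

# Training pairs: query -> cluster containing its true nearest neighbor.
queries = base + rng.normal(scale=0.1, size=base.shape).astype(np.float32)
nn_idx = np.array([np.argmin(((base - q) ** 2).sum(axis=1)) for q in queries[:1000]])
labels = kmeans.labels_[nn_idx]

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(queries[:1000], labels)
# At search time, only the predicted clusters would be fetched from storage.
print("cluster-prediction accuracy (train set, for brevity):",
      clf.score(queries[:1000], labels))
```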

[AI-45] SAFR: Neuron Redistribution for Interpretability

链接: https://arxiv.org/abs/2501.16374
作者: Ruidi Chang,Chunyuan Deng,Hanjie Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Superposition refers to encoding representations of multiple features within a single neuron, which is common in transformers. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. Despite promising performance, the model’s interpretability has been diminished. This paper presents a novel approach to enhance transformer interpretability by regularizing feature superposition. We introduce SAFR, which simply applies regularizations to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs, where important tokens and correlated token pairs are identified via VMASK and attention weights. With a transformer model on two classification tasks, SAFR improves interpretability without compromising prediction performance. Given an input to the model, SAFR provides an explanation by visualizing the neuron allocation and interaction within the MLP layers.

[AI-46] Unveiling Discrete Clues: Superior Healthcare Predictions for Rare Diseases

链接: https://arxiv.org/abs/2501.16373
作者: Chuang Zhao,Hui Tang,Jiheng Zhang,Xiaomeng Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Accurate healthcare prediction is essential for improving patient outcomes. Existing work primarily leverages advanced frameworks like attention or graph networks to capture the intricate collaborative (CO) signals in electronic health records. However, prediction for rare diseases remains challenging due to limited co-occurrence and inadequately tailored approaches. To address this issue, this paper proposes UDC, a novel method that unveils discrete clues to bridge consistent textual knowledge and CO signals within a unified semantic space, thereby enriching the representation semantics of rare diseases. Specifically, we focus on addressing two key sub-problems: (1) acquiring distinguishable discrete encodings for precise disease representation and (2) achieving semantic alignment between textual knowledge and the CO signals at the code level. For the first sub-problem, we refine the standard vector quantized process to include condition awareness. Additionally, we develop an advanced contrastive approach in the decoding stage, leveraging synthetic and mixed-domain targets as hard negatives to enrich the perceptibility of the reconstructed representation for downstream tasks. For the second sub-problem, we introduce a novel codebook update strategy using co-teacher distillation. This approach facilitates bidirectional supervision between textual knowledge and CO signals, thereby aligning semantically equivalent information in a shared discrete latent space. Extensive experiments on three datasets demonstrate our superiority.

[AI-47] Which Optimizer Works Best for Physics-Informed Neural Networks and Kolmogorov-Arnold Networks?

链接: https://arxiv.org/abs/2501.16371
作者: Elham Kiyani,Khemraj Shukla,Jorge F. Urbán,Jérôme Darbon,George Em Karniadakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注: 33 pages, 27 figures

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have revolutionized the computation of PDE solutions by integrating partial differential equations (PDEs) into the neural network’s training process as soft constraints, becoming an important component of the scientific machine learning (SciML) ecosystem. In its current implementation, PINNs are mainly optimized using first-order methods like Adam, as well as quasi-Newton methods such as BFGS and its low-memory variant, L-BFGS. However, these optimizers often struggle with highly non-linear and non-convex loss landscapes, leading to challenges such as slow convergence, local minima entrapment, and (non)degenerate saddle points. In this study, we investigate the performance of Self-Scaled Broyden (SSBroyden) methods and other advanced quasi-Newton schemes, including BFGS and L-BFGS with different line search strategies. These methods dynamically rescale updates based on historical gradient information, thus enhancing training efficiency and accuracy. We systematically compare these optimizers on key challenging linear, stiff, multi-scale, and non-linear PDE benchmarks, including the Burgers, Allen-Cahn, Kuramoto-Sivashinsky, and Ginzburg-Landau equations, and extend our study to the Physics-Informed Kolmogorov-Arnold Networks (PIKANs) representation. Our findings provide insights into the effectiveness of second-order optimization strategies in improving the convergence and accurate generalization of PINNs for complex PDEs by orders of magnitude compared to the state-of-the-art.
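
A minimal PINN training sketch mirroring the common Adam-then-quasi-Newton recipe the paper benchmarks, using PyTorch's built-in L-BFGS with a strong Wolfe line search (SSBroyden itself is not available in torch.optim); the toy problem is u'' = -sin(x) with u(0) = u(pi) = 0.

```python
import math
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.linspace(0.0, math.pi, 100).reshape(-1, 1).requires_grad_(True)

def pde_loss():
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = d2u + torch.sin(x)                    # enforce u'' + sin(x) = 0
    bc = net(torch.tensor([[0.0], [math.pi]])) ** 2  # enforce u(0) = u(pi) = 0
    return (residual ** 2).mean() + bc.mean()

adam = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                                # first-order warm-up
    adam.zero_grad()
    pde_loss().backward()
    adam.step()

lbfgs = torch.optim.LBFGS(net.parameters(), max_iter=200, line_search_fn="strong_wolfe")
def closure():                                       # quasi-Newton refinement
    lbfgs.zero_grad()
    loss = pde_loss()
    loss.backward()
    return loss
lbfgs.step(closure)
print("final PDE loss:", pde_loss().item())
```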

[AI-48] Advanced Physics-Informed Neural Network with Residuals for Solving Complex Integral Equations

链接: https://arxiv.org/abs/2501.16370
作者: Mahdi Movahedian Moghaddam,Kourosh Parand,Saeed Reza Kheradpisheh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this paper, we present the Residual Integral Solver Network (RISN), a novel neural network architecture designed to solve a wide range of integral and integro-differential equations, including one-dimensional, multi-dimensional, ordinary and partial integro-differential, systems, and fractional types. RISN integrates residual connections with highly accurate numerical methods such as Gaussian quadrature and fractional derivative operational matrices, enabling it to achieve higher accuracy and stability than traditional Physics-Informed Neural Networks (PINN). The residual connections help mitigate vanishing gradient issues, allowing RISN to handle deeper networks and more complex kernels, particularly in multi-dimensional problems. Through extensive experiments, we demonstrate that RISN consistently outperforms PINN, achieving significantly lower Mean Absolute Errors (MAE) across various types of equations. The results highlight RISN’s robustness and efficiency in solving challenging integral and integro-differential problems, making it a valuable tool for real-world applications where traditional methods often struggle.

[AI-49] Blockchain-based Crowdsourced Deep Reinforcement Learning as a Service

链接: https://arxiv.org/abs/2501.16369
作者: Ahmed Alagha,Hadi Otrok,Shakti Singh,Rabeb Mizouni,Jamal Bentahar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Reinforcement Learning (DRL) has emerged as a powerful paradigm for solving complex problems. However, its full potential remains inaccessible to a broader audience due to its complexity, which requires expertise in training and designing DRL solutions, high computational capabilities, and sometimes access to pre-trained models. This necessitates hassle-free services that increase the availability of DRL solutions to a variety of users. To enhance the accessibility to DRL services, this paper proposes a novel blockchain-based crowdsourced DRL as a Service (DRLaaS) framework. The framework provides DRL-related services to users, covering two types of tasks: DRL training and model sharing. Through crowdsourcing, users could benefit from the expertise and computational capabilities of workers to train DRL solutions. Model sharing could help users gain access to pre-trained models, shared by workers in return for incentives, which can help train new DRL solutions using methods in knowledge transfer. The DRLaaS framework is built on top of a Consortium Blockchain to enable traceable and autonomous execution. Smart Contracts are designed to manage worker and model allocation, which are stored using the InterPlanetary File System (IPFS) to ensure tamper-proof data distribution. The framework is tested on several DRL applications, proving its efficacy.

[AI-50] Foundation Models for CPS-IoT: Opportunities and Challenges

链接: https://arxiv.org/abs/2501.16368
作者: Ozan Baris,Yizhuo Chen,Gaofeng Dong,Liying Han,Tomoyoshi Kimura,Pengrui Quan,Ruijie Wang,Tianchen Wang,Tarek Abdelzaher,Mario Bergés,Paul Pu Liang,Mani Srivastava
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Methods from machine learning (ML) have transformed the implementation of Perception-Cognition-Communication-Action loops in Cyber-Physical Systems (CPS) and the Internet of Things (IoT), replacing mechanistic and basic statistical models with those derived from data. However, the first generation of ML approaches, which depend on supervised learning with annotated data to create task-specific models, faces significant limitations in scaling to the diverse sensor modalities, deployment configurations, application tasks, and operating dynamics characterizing real-world CPS-IoT systems. The success of task-agnostic foundation models (FMs), including multimodal large language models (LLMs), in addressing similar challenges across natural language, computer vision, and human speech has generated considerable enthusiasm for and exploration of FMs and LLMs as flexible building blocks in CPS-IoT analytics pipelines, promising to reduce the need for costly task-specific engineering. Nonetheless, a significant gap persists between the current capabilities of FMs and LLMs in the CPS-IoT domain and the requirements they must meet to be viable for CPS-IoT applications. In this paper, we analyze and characterize this gap through a thorough examination of the state of the art and our research, which extends beyond it in various dimensions. Based on the results of our analysis and research, we identify essential desiderata that CPS-IoT domain-specific FMs and LLMs must satisfy to bridge this gap. We also propose actions by CPS-IoT researchers to collaborate in developing key community resources necessary for establishing FMs and LLMs as foundational tools for the next generation of CPS-IoT systems.

[AI-51] CAND: Cross-Domain Ambiguity Inference for Early Detecting Nuanced Illness Deterioration

链接: https://arxiv.org/abs/2501.16365
作者: Lo Pang-Yun Ting,Zhen Tan,Hong-Pei Chen,Cheng-Te Li,Po-Lin Chen,Kun-Ta Chuang,Huan Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Early detection of patient deterioration is essential for timely treatment, with vital signs like heart rates being key health indicators. Existing methods tend to solely analyze vital sign waveforms, ignoring transition relationships of waveforms within each vital sign and the correlation strengths among various vital signs. Such studies often overlook nuanced illness deterioration, which is the early sign of worsening health but is difficult to detect. In this paper, we introduce CAND, a novel method that organizes the transition relationships and the correlations within and among vital signs as domain-specific and cross-domain knowledge. CAND jointly models this knowledge in a unified representation space, considerably enhancing the early detection of nuanced illness deterioration. In addition, CAND integrates a Bayesian inference method that utilizes augmented knowledge from domain-specific and cross-domain knowledge to address the ambiguities in correlation strengths. With this architecture, the correlation strengths can be effectively inferred to guide joint modeling and enhance representations of vital signs. This allows a more holistic and accurate interpretation of patient health. Our experiments on a real-world ICU dataset demonstrate that CAND significantly outperforms existing methods in both effectiveness and earliness in detecting nuanced illness deterioration. Moreover, we conduct a case study for the interpretable detection process to showcase the practicality of CAND.

[AI-52] Multivariate Time Series Anomaly Detection by Capturing Coarse-Grained Intra- and Inter-Variate Dependencies

链接: https://arxiv.org/abs/2501.16364
作者: Yongzheng Xie,Hongyu Zhang,Muhammad Ali Babar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 3 figures, Accepted to TheWebConference 2025

点击查看摘要

Abstract:Multivariate time series anomaly detection is essential for failure management in web application operations, as it directly influences the effectiveness and timeliness of implementing remedial or preventive measures. This task is often framed as a semi-supervised learning problem, where only normal data are available for model training, primarily due to the labor-intensive nature of data labeling and the scarcity of anomalous data. Existing semi-supervised methods often detect anomalies by capturing intra-variate temporal dependencies and/or inter-variate relationships to learn normal patterns, flagging timestamps that deviate from these patterns as anomalies. However, these approaches often fail to capture salient intra-variate temporal and inter-variate dependencies in time series due to their focus on excessively fine granularity, leading to suboptimal performance. In this study, we introduce MtsCID, a novel semi-supervised multivariate time series anomaly detection method. MtsCID employs a dual network architecture: one network operates on the attention maps of multi-scale intra-variate patches for coarse-grained temporal dependency learning, while the other works on variates to capture coarse-grained inter-variate relationships through convolution and interaction with sinusoidal prototypes. This design enhances the ability to capture the patterns from both intra-variate temporal dependencies and inter-variate relationships, resulting in improved performance. Extensive experiments across seven widely used datasets demonstrate that MtsCID achieves performance comparable or superior to state-of-the-art benchmark methods.

[AI-53] Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning

链接: https://arxiv.org/abs/2501.16361
作者: Haoran Song,Jiarui Feng,Guangfu Li,Michael Province,Philip Payne,Yixin Chen,Fuhai Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:In real-world scientific discovery, human beings always make use of accumulated prior knowledge and imagination to select one or a few of the most promising hypotheses from large and noisy data analysis results. In this study, we introduce a new type of graph structure, the text-numeric graph (TNG), in which graph entities and associations have both text-attributed information and numeric information. The TNG is an ideal data structure model for novel scientific discovery via graph reasoning because it integrates human-understandable textual annotations or prior knowledge with numeric values that represent the observed or activation levels of graph entities or associations in different samples. Together, the textual information and numeric values determine the importance of graph entities and associations in graph reasoning for novel scientific knowledge discovery. We further propose integrating large language models (LLMs) and graph neural networks (GNNs) to analyze the TNGs for graph understanding and reasoning. To demonstrate the utility, we generated text-omic (numeric) signaling graphs (TOSGs), one type of TNG, in which all graphs have the same entities, associations, and annotations, but have sample-specific entity numeric (omic) values derived from single-cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs. The evaluation results showed that the joint LLM-GNN models on TNGs significantly improve classification accuracy and network inference. In conclusion, the TNGs and joint LLM-GNN models are important approaches for scientific discovery.
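
A text-numeric graph is straightforward to represent with standard graph tooling. Below is a minimal sketch using networkx, with genes, annotations, and values invented purely for illustration:

```python
import networkx as nx

# Every node and edge carries both a textual annotation and a numeric value.
G = nx.DiGraph()
G.add_node("TP53", text="tumor suppressor, DNA damage response", value=2.3)
G.add_node("MDM2", text="E3 ubiquitin ligase, negative regulator of TP53", value=0.8)
G.add_edge("MDM2", "TP53", text="ubiquitinates and degrades", value=-1.5)

# A joint LLM-GNN pipeline could verbalize the text view for the LLM while the GNN
# consumes the numeric attributes; here we simply render the text view.
for u, v, data in G.edges(data=True):
    print(f"{u} --[{data['text']} ({data['value']:+.1f})]--> {v}")
```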

[AI-54] Momentum Contrastive Learning with Enhanced Negative Sampling and Hard Negative Filtering

链接: https://arxiv.org/abs/2501.16360
作者: Duy Hoang,Huy Ngo,Khoi Pham,Tri Nguyen,Gia Bao,Huy Phan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Contrastive learning has become pivotal in unsupervised representation learning, with frameworks like Momentum Contrast (MoCo) effectively utilizing large negative sample sets to extract discriminative features. However, traditional approaches often overlook the full potential of key embeddings and are susceptible to performance degradation from noisy negative samples in the memory bank. This study addresses these challenges by proposing an enhanced contrastive learning framework that incorporates two key innovations. First, we introduce a dual-view loss function, which ensures balanced optimization of both query and key embeddings, improving representation quality. Second, we develop a selective negative sampling strategy that emphasizes the most challenging negatives based on cosine similarity, mitigating the impact of noise and enhancing feature discrimination. Extensive experiments demonstrate that our framework achieves superior performance on downstream tasks, delivering robust and well-structured representations. These results highlight the potential of optimized contrastive mechanisms to advance unsupervised learning and extend its applicability across domains such as computer vision and natural language processing.
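
One plausible reading of the selective negative sampling step is sketched below: rank the memory-bank negatives by cosine similarity to the query and keep only the most challenging ones (the paper's exact filtering criterion may differ).

```python
import torch
import torch.nn.functional as F

def select_hard_negatives(query, memory_bank, k=256):
    """Keep the k negatives most similar to the query by cosine similarity,
    discarding easy (and therefore less informative) samples in the bank."""
    sims = F.cosine_similarity(query.unsqueeze(0), memory_bank, dim=1)
    hard_idx = sims.topk(k).indices
    return memory_bank[hard_idx]

torch.manual_seed(0)
query = torch.randn(128)              # query embedding
memory_bank = torch.randn(4096, 128)  # MoCo-style queue of key embeddings
hard_negatives = select_hard_negatives(query, memory_bank)
print(hard_negatives.shape)           # torch.Size([256, 128])
```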

[AI-55] EVolutionary Independent DEtermiNistiC Explanation

链接: https://arxiv.org/abs/2501.16357
作者: Vincenzo Dentamaro,Paolo Giglio,Donato Impedovo,Giuseppe Pirlo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: 20 pages, 4 figures

点击查看摘要

Abstract:The widespread use of artificial intelligence deep neural networks in fields such as medicine and engineering necessitates understanding their decision-making processes. Current explainability methods often produce inconsistent results and struggle to highlight essential signals influencing model inferences. This paper introduces the Evolutionary Independent Deterministic Explanation (EVIDENCE) theory, a novel approach offering a deterministic, model-independent method for extracting significant signals from black-box models. EVIDENCE theory, grounded in robust mathematical formalization, is validated through empirical tests on diverse datasets, including COVID-19 audio diagnostics, Parkinson’s disease voice recordings, and the George Tzanetakis music classification dataset (GTZAN). Practical applications of EVIDENCE include improving diagnostic accuracy in healthcare and enhancing audio signal analysis. For instance, in the COVID-19 use case, EVIDENCE-filtered spectrograms fed into a frozen Residual Network with 50 layers improved precision by 32% for positive cases and increased the area under the curve (AUC) by 16% compared to baseline models. For Parkinson’s disease classification, EVIDENCE achieved near-perfect precision and sensitivity, with a macro average F1-Score of 0.997. In the GTZAN, EVIDENCE maintained a high AUC of 0.996, demonstrating its efficacy in filtering relevant features for accurate genre classification. EVIDENCE outperformed other Explainable Artificial Intelligence (XAI) methods such as LIME, SHAP, and GradCAM in almost all metrics. These findings indicate that EVIDENCE not only improves classification accuracy but also provides a transparent and reproducible explanation mechanism, crucial for advancing the trustworthiness and applicability of AI systems in real-world settings.

[AI-56] Evaluating Binary Decision Biases in Large Language Models: Implications for Fair Agent-Based Financial Simulations

链接: https://arxiv.org/abs/2501.16356
作者: Alicia Vidler,Toby Walsh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly being used to simulate human-like decision making in agent-based financial market models (ABMs). As models become more powerful and accessible, researchers can now incorporate individual LLM decisions into ABM environments. However, integration may introduce inherent biases that need careful evaluation. In this paper we test three state-of-the-art GPT models for bias using two model sampling approaches: one-shot and few-shot API queries. We observe significant variations in distributions of outputs between specific models and model sub-versions, with GPT-4o-Mini-2024-07-18 showing notably better performance (32-43% yes responses) compared to GPT-4-0125-preview’s extreme bias (98-99% yes responses). We show that sampling methods and model sub-versions significantly impact results: repeated independent API calls produce different distributions compared to batch sampling within a single call. While no current GPT model can simultaneously achieve a uniform distribution and Markovian properties in one-shot testing, few-shot sampling can approach uniform distributions under certain conditions. We explore the Temperature parameter, providing a definition and comparative results. We further compare our results to true random binary series and test specifically for the common human bias of Negative Recency - finding LLMs have a mixed ability to ‘beat’ humans in this one regard. These findings emphasise the critical importance of careful LLM integration into ABMs for financial markets and more broadly.
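
The one-shot bias test reduces to repeated independent binary queries. A hedged sketch follows, with the model call left as a placeholder so no particular client API is assumed; plug in your own chat-completion call.

```python
import collections

def ask_binary(prompt: str) -> str:
    """Placeholder for a single one-shot LLM call that returns 'yes' or 'no'.
    Swap in a real API client here; none is assumed by this sketch."""
    raise NotImplementedError

def measure_yes_rate(prompt: str, n_trials: int = 200) -> float:
    """Repeated independent one-shot queries: on a symmetric prompt, an unbiased
    binary responder should approach a 50% yes-rate."""
    counts = collections.Counter(
        ask_binary(prompt).strip().lower() for _ in range(n_trials)
    )
    return counts["yes"] / n_trials

# Example (illustrative prompt):
# rate = measure_yes_rate("Answer with exactly one word, yes or no, chosen at random.")
# print(f"observed yes-rate: {rate:.2%}")
```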

[AI-57] How Strategic Agents Respond: Comparing Analytical Models with LLM-Generated Responses in Strategic Classification

链接: https://arxiv.org/abs/2501.16355
作者: Tian Xie,Pavan Rauch,Xueru Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:When machine learning (ML) algorithms are used to automate human-related decisions, human agents may gain knowledge of the decision policy and behave strategically to obtain desirable outcomes. Strategic Classification (SC) has been proposed to address the interplay between agents and decision-makers. Prior work on SC has relied on assumptions that agents are perfectly or approximately rational, responding to decision policies by maximizing their utilities. Verifying these assumptions is challenging due to the difficulty of collecting real-world agent responses. Meanwhile, the growing adoption of large language models (LLMs) makes it increasingly likely that human agents in SC settings will seek advice from these tools. We propose using strategic advice generated by LLMs to simulate human agent responses in SC. Specifically, we examine five critical SC scenarios – hiring, loan applications, school admissions, personal income, and public assistance programs – and simulate how human agents with diverse profiles seek advice from LLMs. We then compare the resulting agent responses with the best responses generated by existing theoretical models. Our findings reveal that: (i) LLMs and theoretical models generally lead to agent score or qualification changes in the same direction across most settings, with both achieving similar levels of fairness; (ii) state-of-the-art commercial LLMs (e.g., GPT-3.5, GPT-4) consistently provide helpful suggestions, though these suggestions typically do not result in maximal score or qualification improvements; and (iii) LLMs tend to produce more diverse agent responses, often favoring more balanced effort allocation strategies. These results suggest that theoretical models align with LLMs to some extent and that leveraging LLMs to simulate more realistic agent responses offers a promising approach to designing trustworthy ML systems.

[AI-58] Adaptive Hoeffding Tree with Transfer Learning for Streaming Synchrophasor Data Sets

链接: https://arxiv.org/abs/2501.16354
作者: Zakaria El Mrabet,Daisy Flora Selvaraj,Prakash Ranganathan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Synchrophasor technology or phasor measurement units (PMUs) are known to detect multiple types of oscillations or faults better than Supervisory Control and Data Acquisition (SCADA) systems, but the volume of big data (e.g., 30-120 samples per second on a single PMU) generated by these sensors at the aggregator level (e.g., several PMUs) requires special handling. Conventional machine learning or data mining methods are not suitable to handle such large streaming real-time data. This is primarily due to latencies associated with cloud environments (e.g., at an aggregator or PDC level), which necessitates local computing that moves the data to the edge (or locally at the PMU level) for processing. This requires faster real-time streaming algorithms that can be processed at the local level (typically by Field Programmable Gate Array (FPGA)-based controllers). This paper proposes a transfer learning-based Hoeffding tree with ADWIN (THAT) method to detect anomalous synchrophasor signatures. The proposed algorithm is trained and tested with the OzaBag method. The preliminary results with transfer learning indicate that a computational time saving of 0.7ms is achieved with the THAT algorithm (0.34ms) over OzaBag (1.04ms), while the accuracy of both methods in detecting fault events remains at 94% for four signatures.
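
A streaming sketch in the spirit of the THAT/OzaBag setup, using the river online-learning library (assuming its current API, where the Hoeffding adaptive tree embeds ADWIN drift detection and Oza-style online bagging wraps the base learner); the PMU feature stream below is synthetic.

```python
from river import ensemble, metrics, tree

model = ensemble.BaggingClassifier(
    model=tree.HoeffdingAdaptiveTreeClassifier(seed=42),  # ADWIN-equipped tree
    n_models=10,
    seed=42,
)
acc = metrics.Accuracy()

# Each sample: a dict of PMU features with a binary fault label (synthetic here).
stream = [({"freq": 60.0 + 0.01 * i, "volt": 1.0}, i % 7 == 0) for i in range(1000)]
for x, y in stream:
    y_pred = model.predict_one(x)  # test-then-train protocol
    if y_pred is not None:
        acc.update(y, y_pred)
    model.learn_one(x, y)
print(acc)
```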

[AI-59] Synthetic Data Generation by Supervised Neural Gas Network for Physiological Emotion Recognition Data

链接: https://arxiv.org/abs/2501.16353
作者: S. Muhammad Hossein Mousavi
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 14 pages

点击查看摘要

Abstract:Data scarcity remains a significant challenge in the field of emotion recognition using physiological signals, as acquiring comprehensive and diverse datasets is often prevented by privacy concerns and logistical constraints. This limitation restricts the development and generalization of robust emotion recognition models, making the need for effective synthetic data generation methods more critical. Emotion recognition from physiological signals such as EEG, ECG, and GSR plays a pivotal role in enhancing human-computer interaction and understanding human affective states. Utilizing these signals, this study introduces an innovative approach to synthetic data generation using a Supervised Neural Gas (SNG) network, which has demonstrated noteworthy speed advantages over established models like Conditional VAE, Conditional GAN, diffusion model, and Variational LSTM. The Neural Gas network, known for its adaptability in organizing data based on topological and feature-space proximity, provides a robust framework for generating real-world-like synthetic datasets that preserve the intrinsic patterns of physiological emotion data. Our implementation of the SNG efficiently processes the input data, creating synthetic instances that closely mimic the original data distributions, as demonstrated through comparative accuracy assessments. In experiments, while our approach did not universally outperform all models, it achieved superior performance against most of the evaluated models and offered significant improvements in processing time. These outcomes underscore the potential of using SNG networks for fast, efficient, and effective synthetic data generation in emotion recognition applications.
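
The neural gas adaptation rule at the heart of this approach is compact: every prototype moves toward each input with a strength that decays exponentially in its distance rank. A minimal unsupervised sketch with invented data follows (the paper's supervised variant additionally conditions on emotion labels):

```python
import numpy as np

def neural_gas_step(prototypes, x, eps=0.1, lam=2.0):
    """One Martinetz-style neural gas update toward input x."""
    ranks = np.argsort(np.argsort(np.linalg.norm(prototypes - x, axis=1)))
    h = np.exp(-ranks / lam)                 # rank-based neighborhood function
    return prototypes + eps * h[:, None] * (x - prototypes)

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(8, 3))         # codebook for one emotion class
for x in rng.normal(loc=1.0, size=(500, 3)): # stand-in physiological features
    prototypes = neural_gas_step(prototypes, x)

# Synthetic samples can then be drawn near the learned prototypes (a simple variant).
synthetic = prototypes + rng.normal(scale=0.05, size=prototypes.shape)
print(synthetic.round(2))
```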

[AI-60] Mixture of Experts (MoE): A Big Data Perspective

链接: https://arxiv.org/abs/2501.16352
作者: Wensheng Gan,Zhenyao Ning,Zhenlian Qi,Philip S. Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint. 5 figures, 3 tables

点击查看摘要

Abstract:As the era of big data arrives, traditional artificial intelligence algorithms have difficulty processing the demands of massive and diverse data. Mixture of experts (MoE) has shown excellent performance and broad application prospects. This paper provides an in-depth review and analysis of the latest progress in this field from multiple perspectives, including the basic principles, algorithmic models, key technical challenges, and application practices of MoE. First, we introduce the basic concept of MoE and its core idea and elaborate on its advantages over traditional single models. Then, we discuss the basic architecture of MoE and its main components, including the gating network, expert networks, and learning algorithms. Next, we review the applications of MoE in addressing key technical issues in big data. For each challenge, we provide specific MoE solutions and their innovations. Furthermore, we summarize the typical use cases of MoE in various application domains. This fully demonstrates the powerful capability of MoE in big data processing. We also analyze the advantages of MoE in big data environments. Finally, we explore the future development trends of MoE. We believe that MoE will become an important paradigm of artificial intelligence in the era of big data. In summary, this paper systematically elaborates on the principles, techniques, and applications of MoE in big data processing, providing theoretical and practical references to further promote the application of MoE in real scenarios.
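
摘要中反复出现的“门控网络 + 专家网络”结构可以用几行 NumPy 勾勒出来。下面是一个稀疏 top-k 路由的 MoE 前向传播示意(线性专家、随机权重与维度均为演示假设,不对应任何具体系统):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # 数值稳定
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_in, d_out, n_experts = 8, 3, 4
W_experts = rng.normal(size=(n_experts, d_in, d_out))  # 每个专家一套权重
W_gate = rng.normal(size=(d_in, n_experts))            # 门控网络权重

def moe_forward(x, top_k=2):
    """门控打分 -> 每个样本选 top-k 专家 -> 按归一化权重合并输出。"""
    scores = softmax(x @ W_gate)                       # (batch, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -top_k:]
    out = np.zeros((x.shape[0], d_out))
    for i in range(x.shape[0]):
        w = scores[i, topk[i]]
        w = w / w.sum()                                # 所选专家内重新归一化
        for weight, e in zip(w, topk[i]):
            out[i] += weight * (x[i] @ W_experts[e])
    return out

print(moe_forward(rng.normal(size=(5, d_in))).shape)   # (5, 3)
```

稀疏路由的要点在于每个样本只激活少数专家,使总参数量增大时单样本计算量基本不变,这正是 MoE 适合大数据场景的原因之一。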

[AI-61] Risk-Informed Diffusion Transformer for Long-Tail Trajectory Prediction in the Crash Scenario

链接: https://arxiv.org/abs/2501.16349
作者: Junlan Chen,Pei Liu,Zihao Zhang,Hongyi Zhao,Yufei Ji,Ziyuan Pu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Trajectory prediction methods have been widely applied in autonomous driving technologies. Although the overall accuracy of trajectory prediction is relatively high, the lack of trajectory data for critical scenarios in the training data leads to the long-tail phenomenon. Normally, the trajectories in the tail data are more critical and more difficult to predict, and may include rare scenarios such as crashes. To solve this problem, we extracted trajectory data from real-world crash scenarios, which contain more long-tail data. Meanwhile, based on the trajectory data in this scenario, we integrated graph-based risk information and diffusion with a transformer and proposed the Risk-Informed Diffusion Transformer (RI-DiT) trajectory prediction method. Extensive experiments were conducted on trajectory data from the real-world crash scenario, and the results show that the proposed algorithm performs well. When predicting the tail 10% of the data (Top 10%), the minADE and minFDE indicators are 0.016/2.667 m. At the same time, we examined the trajectory conditions under different long-tail distributions: the closer the trajectory data lie to the tail of the distribution, the less smooth the trajectories become. Through trajectory data from real-world crash scenarios, our work expands the methods available to overcome the long-tail challenges in trajectory prediction. Our method, RI-DiT, integrates inverse time to collision (ITTC) and traffic flow features, which allows it to predict long-tail trajectories more accurately and improve the safety of autonomous driving systems.
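
摘要中的风险特征“逆碰撞时间(ITTC)”本身就是一个简单公式:相对接近速度除以车间距。下面按常见定义给出一个计算示意(具体定义与使用方式以论文为准):

```python
import numpy as np

def inverse_ttc(gap_m, v_follow, v_lead):
    """ITTC = 接近速度 / 车间距;仅在两车接近时取正值,否则为 0。
    这是常见写法之一,属于本文整理时的假设。"""
    closing = v_follow - v_lead                     # m/s,>0 表示正在接近
    return np.maximum(closing, 0.0) / np.maximum(gap_m, 1e-6)

# 间距 20 m、后车 15 m/s、前车 10 m/s:TTC = 4 s,故 ITTC = 0.25 (1/s)
print(inverse_ttc(20.0, 15.0, 10.0))
```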

[AI-62] An Integrated Approach to AI-Generated Content in e-health

链接: https://arxiv.org/abs/2501.16348
作者: Tasnim Ahmed,Salimur Choudhury
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for presentation at 2025 IEEE International Conference on Communications (IEEE ICC25)

点击查看摘要

Abstract:Artificial Intelligence-Generated Content, a subset of Generative Artificial Intelligence, holds significant potential for advancing the e-health sector by generating diverse forms of data. In this paper, we propose an end-to-end class-conditioned framework that addresses the challenge of data scarcity in health applications by generating synthetic medical images and text data, evaluating on practical applications such as retinopathy detection, skin infections and mental health assessments. Our framework integrates Diffusion and Large Language Models (LLMs) to generate data that closely match real-world patterns, which is essential for improving downstream task performance and model robustness in e-health applications. Experimental results demonstrate that the synthetic images produced by the proposed diffusion model outperform traditional GAN architectures. Similarly, in the text modality, data generated by an uncensored LLM achieves significantly better alignment with real-world data than that of censored models in replicating an authentic tone.

[AI-63] Identification of Hardware Trojan Locations in Gate-Level Netlist using Nearest Neighbour Approach integrated with Machine Learning Technique

链接: https://arxiv.org/abs/2501.16347
作者: Anindita Chattopadhyay,Siddharth Bisariya,Vijay Kumar Sutrakar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the evolving landscape of integrated circuit design, detecting Hardware Trojans (HTs) within a multi-entity-based design cycle presents significant challenges. This research proposes an innovative machine learning-based methodology, centered on path retrace algorithms, for identifying malicious logic gates in gate-level netlists. The methodology is validated across three distinct cases, each employing different machine learning models to classify HTs. Case I utilizes a decision tree algorithm for node-to-node comparisons, significantly improving detection accuracy through the integration of Principal Component Analysis (PCA). Case II introduces graph-to-graph classification using a Graph Neural Network (GNN) model, enabling the differentiation between normal and Trojan-infected circuit designs. Case III applies GNN-based node classification to identify individual compromised nodes and their locations. Additionally, the nearest neighbor (NN) method has been combined with the GNN graph-to-graph classification in Case II and the GNN node-to-node classification in Case III. Despite the potential of GNN-based graph-to-graph classification, the NN approach demonstrated superior performance, with the first nearest neighbor (1st NN) achieving 73.2% accuracy and the second nearest neighbor (2nd NN) method reaching 97.7%; in comparison, the GNN model achieved an accuracy of 62.8%. Similarly, for GNN-based node-to-node classification, the NN approach demonstrated superior performance, with the 1st NN achieving 93% accuracy and the 2nd NN method reaching 97.7%, whereas the GNN model achieved an accuracy of 79.8%. However, using ever higher-order nearest neighbors leads to larger code coverage for the identification of HTs.
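
摘要比较了第 1/第 2 近邻与 GNN 的分类准确率,其中近邻分类部分用 scikit-learn 即可直观还原流程。下面的节点特征与标签均为随机生成,仅演示 k=1 与 k=2 两种设置(如何从门级网表提取节点特征以论文为准):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# 假设:每个网表节点已编码为特征向量(如扇入/扇出、逻辑深度等),
# 标签 1 表示木马节点;这里用随机数据代替,仅演示流程。
X = rng.normal(size=(200, 6))
y = (rng.random(200) < 0.1).astype(int)

for k in (1, 2):                                   # “第 1 / 第 2 近邻”
    clf = KNeighborsClassifier(n_neighbors=k).fit(X[:150], y[:150])
    print(f"k={k} 测试准确率: {clf.score(X[150:], y[150:]):.3f}")
```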

[AI-64] Self-supervised Graph Transformer with Contrastive Learning for Brain Connectivity Analysis towards Improving Autism Detection

链接: https://arxiv.org/abs/2501.16346
作者: Yicheng Leng,Syed Muhammad Anwar,Islem Rekik,Sen He,Eung-Joo Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Functional Magnetic Resonance Imaging (fMRI) provides useful insights into brain function during both task and rest. Representing fMRI data using correlation matrices is found to be a reliable method of analyzing the inherent connectivity of the brain in the resting and active states. Graph Neural Networks (GNNs) have been widely used for brain network analysis due to their inherent explainability capability. In this work, we introduce a novel framework using contrastive self-supervised learning graph transformers, incorporating a brain network transformer encoder with random graph alterations. The proposed network leverages both contrastive learning and graph alterations to effectively train the graph transformer for autism detection. Our approach, tested on Autism Brain Imaging Data Exchange (ABIDE) data, demonstrates superior autism detection, achieving an AUROC of 82.6 and an accuracy of 74%, surpassing current state-of-the-art methods.
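
“对比自监督 + 随机图扰动”的训练核心是一个对比损失。下面给出简化的 NT-Xent(SimCLR 风格)损失的 PyTorch 示意,假设 z1、z2 是同一批脑网络图经两种随机扰动后由图 Transformer 编码得到的嵌入(论文采用的具体损失形式可能不同):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """同一个图的两个视图互为正样本,批内其余样本为负样本。"""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / tau                                # 余弦相似度 / 温度
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 16), torch.randn(8, 16)          # 两个视图的嵌入
print(nt_xent(z1, z2).item())
```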

[AI-65] Self-Clustering Graph Transformer Approach to Model Resting-State Functional Brain Activity

链接: https://arxiv.org/abs/2501.16345
作者: Bishal Thapaliya,Esra Akbas,Ram Sapkota,Bhaskar Ray,Vince Calhoun,Jingyu Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 2 figures

点击查看摘要

Abstract:Resting-state functional magnetic resonance imaging (rs-fMRI) offers valuable insights into the human brain’s functional organization and is a powerful tool for investigating the relationship between brain function and cognitive processes, as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this study, we introduce a novel attention mechanism for graphs with subnetworks, named Self-Clustering Graph Transformer (SCGT), designed to handle the issue of uniform node updates in graph transformers. By using static functional connectivity (FC) correlation features as input to the transformer model, SCGT effectively captures the sub-network structure of the brain by performing cluster-specific updates to the nodes, unlike uniform node updates in vanilla graph transformers, further allowing us to learn and interpret the subclusters. We validate our approach on the Adolescent Brain Cognitive Development (ABCD) dataset, comprising 7,957 participants, for the prediction of total cognitive score and gender classification. Our results demonstrate that SCGT outperforms the vanilla graph transformer method and other recent models, offering a promising tool for modeling brain functional connectivity and interpreting the underlying subnetwork structures.

[AI-66] Explore Activation Sparsity in Recurrent LLMs for Energy-Efficient Neuromorphic Computing

链接: https://arxiv.org/abs/2501.16337
作者: Ivan Knunyants,Maryam Tavakol,Manolis Sifalakis,Yingfu Xu,Amirreza Yousefzadeh,Guangzhi Tang
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted by AICAS 2025

点击查看摘要

Abstract:The recent rise of Large Language Models (LLMs) has revolutionized the deep learning field. However, the desire to deploy LLMs on edge devices introduces energy efficiency and latency challenges. Recurrent LLM (R-LLM) architectures have proven effective in mitigating the quadratic complexity of self-attention, making them a potential paradigm for computing on-edge neuromorphic processors. In this work, we propose a low-cost, training-free algorithm to sparsify R-LLMs’ activations to enhance energy efficiency on neuromorphic hardware. Our approach capitalizes on the inherent structure of these models, rendering them well-suited for energy-constrained environments. Although primarily designed for R-LLMs, this method can be generalized to other LLM architectures, such as transformers, as demonstrated on the OPT model, achieving comparable sparsity and efficiency improvements. Empirical studies illustrate that our method significantly reduces computational demands while maintaining competitive accuracy across multiple zero-shot learning benchmarks. Additionally, hardware simulations with the SENECA neuromorphic processor underscore notable energy savings and latency improvements. These results pave the way for low-power, real-time neuromorphic deployment of LLMs and demonstrate the feasibility of training-free on-chip adaptation using activation sparsity.
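
摘要所说的“免训练激活稀疏化”,最直接的形式是幅值阈值化:把接近零的激活置零,从而减少神经形态硬件上的事件数与计算量。下面是一个原理示意(阈值为演示用的假设值,论文的具体稀疏化准则以原文为准):

```python
import torch

def sparsify(h, threshold=0.5):
    """把绝对值低于阈值的激活置零;无需任何再训练。"""
    return torch.where(h.abs() < threshold, torch.zeros_like(h), h)

h = torch.randn(4, 16)                       # 某一层的激活(演示数据)
hs = sparsify(h)
print(f"非零激活占比: {(hs != 0).float().mean().item():.2%}")
```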

[AI-67] Runtime Analysis of Evolutionary Algorithms for Multiparty Multiobjective Optimization

链接: https://arxiv.org/abs/2501.16336
作者: Yuetong Sun,Peilan Xu,Wenjian Luo
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In scenarios where multiple decision-makers operate within a common decision space, each focusing on their own multi-objective optimization problem (e.g., bargaining games), the problem can be modeled as a multi-party multi-objective optimization problem (MPMOP). While numerous evolutionary algorithms have been proposed to solve MPMOPs, most results remain empirical. This paper presents the first theoretical analysis of the expected runtime of evolutionary algorithms on bi-party multi-objective optimization problems (BPMOPs). Our findings demonstrate that employing traditional multi-objective optimization algorithms to solve MPMOPs is both time-consuming and inefficient, as the resulting population contains many solutions that fail to achieve consensus among decision-makers. An alternative approach involves decision-makers individually solving their respective optimization problems and seeking consensus only in the final stage. While feasible for pseudo-Boolean optimization problems, this method may fail to guarantee approximate performance for one party in NP-hard problems. Finally, we propose coevolutionary multi-party multi-objective optimizers (CoEMPMO) for pseudo-Boolean optimization and shortest path problems within a multi-party multi-objective context, which maintain a common solution set among all parties through coevolution. Theoretical and experimental results demonstrate that the proposed $\text{CoEMPMO}_{\text{random}}$ outperforms previous algorithms in terms of the expected lower bound on runtime for pseudo-Boolean optimization problems. Additionally, $\text{CoEMPMO}_{\text{cons}}^{\text{SP}}$ achieves better efficiency and precision in solving shortest path problems compared to existing algorithms.

[AI-68] Three-Dimensional Diffusion-Weighted Multi-Slab MRI With Slice Profile Compensation Using Deep Energy Model

链接: https://arxiv.org/abs/2501.17152
作者: Reza Ghorbani,Jyothi Rikhab Chand,Chu-Yu Lee,Mathews Jacob,Merry Mani
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: 4 pages, 4 figures, ISBI2025 Conference paper

点击查看摘要

Abstract:Three-dimensional (3D) multi-slab acquisition is a technique frequently employed in high-resolution diffusion-weighted MRI in order to achieve the best signal-to-noise ratio (SNR) efficiency. However, this technique is limited by slab boundary artifacts that cause intensity fluctuations and aliasing between slabs, which reduce the accuracy of anatomical imaging. Addressing this issue is crucial for advancing diffusion MRI quality and making high-resolution imaging more feasible for clinical and research applications. In this work, we propose a regularized slab profile encoding (PEN) method within a Plug-and-Play ADMM framework, incorporating multi-scale energy (MuSE) regularization to effectively improve the slab-combined reconstruction. Experimental results demonstrate that the proposed method significantly improves image quality compared to non-regularized and TV-regularized PEN approaches. The regularized PEN framework provides a more robust and efficient solution for high-resolution 3D diffusion MRI, potentially enabling clearer, more reliable anatomical imaging across various applications.

[AI-69] Why is the estimation of metaorder impact with public market data so challenging?

链接: https://arxiv.org/abs/2501.17096
作者: Manuel Naviglio,Giacomo Bormetti,Francesco Campigli,German Rodikov,Fabrizio Lillo
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:Estimating the market impact and transaction costs of large trades (metaorders) is a very important topic in finance. However, models of price and trade based on public market data provide average price trajectories which are qualitatively different from what is observed during real metaorder executions: the price increases linearly, rather than in a concave way, during the execution, and the amount of reversion after its end is very limited. We claim that this is a generic phenomenon due to the fact that even sophisticated statistical models are unable to correctly describe the origin of the autocorrelation of the order flow. We propose a modified Transient Impact Model which provides more realistic trajectories by assuming that only a fraction of the metaorder trading triggers market order flow. Interestingly, in our model there is a critical condition on the kernels of the price and order flow equations under which market impact becomes permanent.
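
作为参照,下面用 NumPy 勾勒教科书式的瞬态冲击模型(Transient Impact Model):期望价格是历史交易方向经衰减核加权的叠加。论文的修改版(只有部分母单交易触发市场订单流)在此机制上调整;这里的幂律核形式是演示用的假设:

```python
import numpy as np

def transient_impact(signs, G):
    """期望价格路径 p_t = sum_{s<=t} G(t-s) * eps_s,eps_s 为交易方向。"""
    T = len(signs)
    p = np.zeros(T)
    for t in range(T):
        lags = t - np.arange(t + 1)
        p[t] = np.sum(G(lags) * signs[: t + 1])
    return p

G = lambda l: (1.0 + l) ** -0.5          # 幂律衰减核(假设形式)
meta = np.ones(50)                        # 持续 50 步的买方母单
path = transient_impact(meta, G)
print(path[:5].round(3), path[-1].round(3))   # 执行期内呈凹形增长的冲击轨迹
```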

[AI-70] Benchmarking Quantum Convolutional Neural Networks for Signal Classification in Simulated Gamma-Ray Burst Detection PDP2025

链接: https://arxiv.org/abs/2501.17041
作者: Farida Farsian,Nicolò Parmiggiani,Alessandro Rizzo,Gabriele Panebianco,Andrea Bulgarelli,Francesco Schillirò,Carlo Burigana,Vincenzo Cardone,Luca Cappelli,Massimo Meneghetti,Giuseppe Murante,Giuseppe Sarracino,Roberto Scaramella,Vincenzo Testa,Tiziana Trombetti
类目: High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注: 9 pages, Accepted for publication in 33rd Euromicro/IEEE International Conference on Parallel, Distributed and Network-Based Processing (PDP 2025)

点击查看摘要

Abstract:This study evaluates the use of Quantum Convolutional Neural Networks (QCNNs) for identifying signals resembling Gamma-Ray Bursts (GRBs) within simulated astrophysical datasets in the form of light curves. The task addressed here focuses on distinguishing GRB-like signals from background noise in simulated Cherenkov Telescope Array Observatory (CTAO) data, the next-generation astrophysical observatory for very high-energy gamma-ray science. QCNNs, a quantum counterpart of classical Convolutional Neural Networks (CNNs), leverage quantum principles to process and analyze high-dimensional data efficiently. We implemented a hybrid quantum-classical machine learning technique using the Qiskit framework, with the QCNNs trained on a quantum simulator. Several QCNN architectures were tested, employing different encoding methods such as Data Reuploading and Amplitude encoding. Key findings include that QCNNs achieved accuracy comparable to classical CNNs, often surpassing 90%, while using fewer parameters, potentially leading to more efficient models in terms of computational resources. A benchmark study further examined how hyperparameters like the number of qubits and encoding methods affected performance, with more qubits and advanced encoding methods generally enhancing accuracy but increasing complexity. QCNNs showed robust performance on time-series datasets, successfully detecting GRB signals with high precision. The research is a pioneering effort in applying QCNNs to astrophysics, offering insights into their potential and limitations. This work sets the stage for future investigations to fully realize the advantages of QCNNs in astrophysical data analysis.

[AI-71] Generative quantum combinatorial optimization by means of a novel conditional generative quantum eigensolver

链接: https://arxiv.org/abs/2501.16986
作者: Shunya Minami,Kouhei Nakaji,Yohichi Suzuki,Alán Aspuru-Guzik,Tadashi Kadowaki
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages, 12 figures

点击查看摘要

Abstract:Quantum computing is entering a transformative phase with the emergence of logical quantum processors, which hold the potential to tackle complex problems beyond classical capabilities. While significant progress has been made, applying quantum algorithms to real-world problems remains challenging. Hybrid quantum-classical techniques have been explored to bridge this gap, but they often face limitations in expressiveness, trainability, or scalability. In this work, we introduce conditional Generative Quantum Eigensolver (conditional-GQE), a context-aware quantum circuit generator powered by an encoder-decoder Transformer. Focusing on combinatorial optimization, we train our generator for solving problems with up to 10 qubits, exhibiting nearly perfect performance on new problems. By leveraging the high expressiveness and flexibility of classical generative models, along with an efficient preference-based training scheme, conditional-GQE provides a generalizable and scalable framework for quantum circuit generation. Our approach advances hybrid quantum-classical computing and contributes to accelerate the transition toward fault-tolerant quantum computing.

[AI-72] Decrypting the temperature field in flow boiling with latent diffusion models

链接: https://arxiv.org/abs/2501.16510
作者: UngJin Na,JunYoung Seo,Taeil Kim,ByongGuk Jeon,HangJin Jo
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents an innovative method using Latent Diffusion Models (LDMs) to generate temperature fields from phase indicator maps. By leveraging the BubbleML dataset from numerical simulations, the LDM translates phase field data into corresponding temperature distributions through a two-stage training process involving a vector-quantized variational autoencoder (VQVAE) and a denoising autoencoder. The resulting model effectively reconstructs complex temperature fields at interfaces. Spectral analysis indicates a high degree of agreement with ground truth data in the low to mid wavenumber ranges, even though some inconsistencies are observed at higher wavenumbers, suggesting areas for further enhancement. This machine learning approach significantly reduces the computational burden of traditional simulations and improves the precision of experimental calibration methods. Future work will focus on refining the model’s ability to represent small-scale turbulence and expanding its applicability to a broader range of boiling conditions.

[AI-73] Reinforcement Learning for Quantum Circuit Design: Using Matrix Representations

链接: https://arxiv.org/abs/2501.16509
作者: Zhiyuan Wang,Chunlin Feng,Christopher Poon,Lijian Huang,Xingjian Zhao,Yao Ma,Tianfan Fu,Xiao-Yang Liu
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantum computing promises advantages over classical computing. The manufacturing of quantum hardware is still in its infancy, a period called the Noisy Intermediate-Scale Quantum (NISQ) era. A major challenge is automated quantum circuit design, which maps a quantum circuit to gates in a universal gate set. In this paper, we present a generic MDP formulation and employ Q-learning and DQN algorithms for quantum circuit design. By leveraging the power of deep reinforcement learning, we aim to provide an automatic and scalable approach over traditional hand-crafted heuristic methods.
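
摘要中的 Q-learning 对应标准的时序差分更新 Q(s,a) ← Q(s,a) + α[r + γ·max Q(s',·) − Q(s,a)]。下面用一个“逐步放置量子门”的玩具环境演示这一更新(状态、动作与奖励的设计均为虚构,仅示意更新规则本身):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 10, 4          # 状态=已放置门数,动作=选一个门(假设)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(500):
    s = 0
    while s < n_states - 1:
        # epsilon-贪心探索
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = s + 1
        # 末态奖励应由“电路与目标酉矩阵的接近程度”给出,此处用随机数代替
        r = rng.random() if s_next == n_states - 1 else 0.0
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))
```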

[AI-74] Digital Twin Enabled Site Specific Channel Precoding: Over the Air CIR Inference

链接: https://arxiv.org/abs/2501.16504
作者: Majumder Haider,Imtiaz Ahmed,Zoheb Hassan,Timothy J. O’Shea,Lingjia Liu,Danda B. Rawat
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates the significance of designing a reliable, intelligent, and truly physical-environment-aware precoding scheme by leveraging an accurately designed channel twin model to obtain realistic channel state information (CSI) for cellular communication systems. Specifically, we propose a fine-tuned multi-step channel twin design process that can render CSI very close to that of the actual environment. After generating a precise CSI, we execute precoding using the obtained CSI at the transmitter end. We demonstrate a two-step parameter-tuning approach: the channel twin is first designed by ray tracing (RT) emulation, and the CSI is then further fine-tuned with an artificial intelligence (AI) based algorithm, which significantly reduces the gap between the actual CSI and the fine-tuned digital twin (DT) rendered CSI. The simulation results show the effectiveness of the proposed novel approach in designing a true physical-environment-aware channel twin model.
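
拿到(数字孪生渲染的)CSI 之后,发端常见的一种预编码是迫零(ZF)预编码。论文并未限定具体预编码器,这里仅以 ZF 演示“CSI → 预编码矩阵”这一步(天线/用户数与随机信道均为演示假设):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设:基站 4 天线、4 个单天线用户,H 为数字孪生渲染出的 CSI
H = (rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))) / np.sqrt(2)

# 迫零预编码:W = H^H (H H^H)^{-1},再做 Frobenius 功率归一化
W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
W = W / np.linalg.norm(W, 'fro')

# 有效信道 H @ W 接近对角阵,说明用户间干扰被消除
print(np.round(np.abs(H @ W), 3))
```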

[AI-75] Classification of Mild Cognitive Impairment Based on Dynamic Functional Connectivity Using Spatio-Temporal Transformer

链接: https://arxiv.org/abs/2501.16409
作者: Jing Zhang,Yanjun Lyu,Xiaowei Yu,Lu Zhang,Chao Cao,Tong Chen,Minheng Chen,Yan Zhuang,Tianming Liu,Dajiang Zhu
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer’s disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide valuable information when identifying brain conditions. In this paper, we propose a novel framework that jointly learns the embedding of both spatial and temporal information within dFC based on the transformer architecture. Specifically, we first construct dFC networks from rs-fMRI data through a sliding window strategy. Then, we simultaneously employ a temporal block and a spatial block to capture higher-order representations of dynamic spatio-temporal dependencies, via mapping them into an efficient fused feature representation. To further enhance the robustness of these feature representations by reducing the dependency on labeled data, we also introduce a contrastive learning strategy to manipulate different brain states. Experimental results on 345 subjects with 570 scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) demonstrate the superiority of our proposed method for MCI (Mild Cognitive Impairment, the prodromal stage of AD) prediction, highlighting its potential for early identification of AD.
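
摘要中 dFC 网络的构建方式——滑动窗口内计算 ROI 间皮尔逊相关矩阵——可以直接用 NumPy 实现。下面是一个示意(窗口长度、步长与 ROI 数均为假设值):

```python
import numpy as np

def sliding_window_dfc(ts, win=30, step=5):
    """ts: (时间点, ROI 数) 的时间序列;返回 (窗口数, ROI, ROI) 的相关矩阵序列。"""
    mats = []
    for start in range(0, ts.shape[0] - win + 1, step):
        mats.append(np.corrcoef(ts[start:start + win].T))
    return np.stack(mats)

rng = np.random.default_rng(0)
ts = rng.normal(size=(200, 90))          # 200 个时间点、90 个 ROI(模拟数据)
print(sliding_window_dfc(ts).shape)       # (35, 90, 90)
```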

[AI-76] GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration NAACL2025

链接: https://arxiv.org/abs/2501.16382
作者: Ziwen Li,Xiang ‘Anthony’ Chen,Youngseung Jeon
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages; 5 figures. Published as a finding at NAACL 2025

点击查看摘要

Abstract:Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations in previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve agent pipeline RAG framework to support large-scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.

[AI-77] Decoding OTC Government Bond Market Liquidity: An ABM Model for Market Dynamics

链接: https://arxiv.org/abs/2501.16331
作者: Alicia Vidler,Toby Walsh
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
*备注: 7 pages

点击查看摘要

Abstract:The over-the-counter (OTC) government bond markets are characterised by their bilateral trading structures, which pose unique challenges to understanding and ensuring market stability and liquidity. In this paper, we develop a bespoke ABM that simulates market-maker interactions within a stylised government bond market. The model focuses on the dynamics of liquidity and stability in the secondary trading of government bonds, particularly in concentrated markets like those found in Australia and the UK. Through this simulation, we test key hypotheses around improving market stability, focusing on the effects of agent diversity, business costs, and client base size. We demonstrate that greater agent diversity enhances market liquidity and that reducing the costs of market-making can improve overall market stability. The model offers insights into computational finance by simulating trading without price transparency, highlighting how micro-structural elements can affect macro-level market outcomes. This research contributes to the evolving field of computational finance by employing computational intelligence techniques to better understand the fundamental mechanics of government bond markets, providing actionable insights for both academics and practitioners.

机器学习

[LG-0] Scanning Trojaned Models Using Out-of-Distribution Samples NEURIPS

链接: https://arxiv.org/abs/2501.17151
作者: Hossein Mirzaei,Ali Ansari,Bahar Dibaei Nia,Mojtaba Nafez,Moein Madadi,Sepehr Rezaee,Zeinab Sadat Taghavi,Arad Maleki,Kian Shamsaie,Mahdi Hajialilue,Jafar Habibi,Mohammad Sabokrou,Mohammad Hossein Rohban
类目: Machine Learning (cs.LG)
*备注: Accepted at the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS) 2024. The code repository is available at: this https URL

点击查看摘要

Abstract:Scanning for trojan (backdoor) in deep neural networks is crucial due to their significant real-world applications. There has been an increasing focus on developing effective general trojan scanning methods across various trojan attacks. Despite advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of “blind spots”–regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting OOD samples towards in-distribution. The increased likelihood of perturbed OOD samples being classified as ID serves as a signature for trojan detection. TRODO is both trojan and label mapping agnostic, effective even against adversarially trained trojaned classifiers. It is applicable even in scenarios where training data is absent, demonstrating high accuracy and adaptability across various scenarios and datasets, highlighting its potential as a robust trojan scanning strategy.
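
TRODO 的核心直觉是“把 OOD 样本对抗地推向 ID,观察置信度的变化”:被植入木马的分类器往往存在更容易被推入 ID 的“盲区”。下面给出一个 FGSM 风格的多步扰动示意(目标函数与步长为本文整理时的简化假设,论文的具体检测统计量以原文为准):

```python
import torch
import torch.nn.functional as F

def shift_ood_towards_id(model, x_ood, eps=0.03, steps=5):
    """对 OOD 样本做小步扰动,最大化分类器的最大类别置信度。"""
    x = x_ood.clone().detach().requires_grad_(True)
    for _ in range(steps):
        conf = F.softmax(model(x), dim=1).max(dim=1).values.sum()
        grad, = torch.autograd.grad(conf, x)
        x = (x + eps / steps * grad.sign()).detach().requires_grad_(True)
    with torch.no_grad():                 # 扰动后的 ID 置信度可作为检测信号
        return F.softmax(model(x), dim=1).max(dim=1).values

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(shift_ood_towards_id(model, torch.rand(8, 3, 32, 32)))
```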

[LG-1] CoRe-Net: Co-Operational Regressor Network with Progressive Transfer Learning for Blind Radar Signal Restoration

链接: https://arxiv.org/abs/2501.17125
作者: Muhammad Uzair Zahid,Serkan Kiranyaz,Alper Yildirim,Moncef Gabbouj
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world radar signals are frequently corrupted by various artifacts, including sensor noise, echoes, interference, and intentional jamming, differing in type, severity, and duration. This pilot study introduces a novel model, called the Co-Operational Regressor Network (CoRe-Net), for blind radar signal restoration, designed to address such limitations and drawbacks. CoRe-Net replaces adversarial training with a novel cooperative learning strategy, leveraging the complementary roles of its Apprentice Regressor (AR) and Master Regressor (MR). The AR restores radar signals corrupted by various artifacts, while the MR evaluates the quality of the restoration and provides immediate and task-specific feedback, ensuring stable and efficient learning. The AR, therefore, has the advantage of both self-learning and assistive learning by the MR. The proposed model has been extensively evaluated over the benchmark Blind Radar Signal Restoration (BRSR) dataset, which simulates diverse real-world artifact scenarios. Under a fair experimental setup, this study shows that CoRe-Net surpasses Op-GANs by over 1 dB in mean SNR. To further boost the performance gain, this study proposes multi-pass restoration by cascaded CoRe-Nets trained with a novel paradigm called Progressive Transfer Learning (PTL), which enables iterative refinement, thus achieving an additional 2 dB mean SNR enhancement. Multi-pass CoRe-Net training by PTL consistently yields incremental performance improvements through successive restoration passes, whilst highlighting CoRe-Net's ability to handle such a complex and varying blend of artifacts.

[LG-2] Evidence on the Regularisation Properties of Maximum-Entropy Reinforcement Learning

链接: https://arxiv.org/abs/2501.17115
作者: Rémy Hosseinkhan Boucher(1 and 2),Onofrio Semeraro(1 and 2),Lionel Mathelin(1 and 2) ((1) Université Paris-Saclay, (2) CNRS)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The generalisation and robustness properties of policies learnt through Maximum-Entropy Reinforcement Learning are investigated on chaotic dynamical systems with Gaussian noise on the observable. First, the robustness of entropy-regularised policies to noise contamination of the agent's observations is examined. Second, notions from statistical learning theory, such as complexity measures on the learnt model, are borrowed to explain and predict the phenomenon. Results show the existence of a relationship between entropy-regularised policy optimisation and robustness to noise, which can be described by the chosen complexity measures.

[LG-3] Unlocking Transparent Alignment Through Enhanced Inverse Constitutional AI for Principle Extraction

链接: https://arxiv.org/abs/2501.17112
作者: Carl-Leander Henneking,Claas Beger
类目: Machine Learning (cs.LG)
*备注: 8 Pages, 3 Figures

点击查看摘要

Abstract:Traditional methods for aligning Large Language Models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on implicit principles, limiting interpretability. Constitutional AI (CAI) offers an explicit, rule-based framework for guiding model outputs. Building on this, we refine the Inverse Constitutional AI (ICAI) algorithm, which extracts constitutions from preference datasets. By improving principle generation, clustering, and embedding processes, our approach enhances the accuracy and generalizability of extracted principles across synthetic and real-world datasets. While in-context alignment yields modest improvements, our results highlight the potential of these principles to foster more transparent and adaptable alignment methods, offering a promising direction for future advancements beyond traditional fine-tuning.

[LG-4] Solving Roughly Forced Nonlinear PDEs via Misspecified Kernel Methods and Neural Networks

链接: https://arxiv.org/abs/2501.17110
作者: Matthieu Darcy,Edoardo Calvello,Ricardo Baptista,Houman Owhadi,Andrew M. Stuart,Xianjin Yang
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 41 pages, 7 figures

点击查看摘要

Abstract:We consider the use of Gaussian Processes (GPs) or Neural Networks (NNs) to numerically approximate the solutions to nonlinear partial differential equations (PDEs) with rough forcing or source terms, which commonly arise as pathwise solutions to stochastic PDEs. Kernel methods have recently been generalized to solve nonlinear PDEs by approximating their solutions as the maximum a posteriori estimator of GPs that are conditioned to satisfy the PDE at a finite set of collocation points. The convergence and error guarantees of these methods, however, rely on the PDE being defined in a classical sense and its solution possessing sufficient regularity to belong to the associated reproducing kernel Hilbert space. We propose a generalization of these methods to handle roughly forced nonlinear PDEs while preserving convergence guarantees with an oversmoothing GP kernel that is misspecified relative to the true solution's regularity. This is achieved by conditioning a regular GP to satisfy the PDE with a modified source term in a weak sense (when integrated against a finite number of test functions). This is equivalent to replacing the empirical $L^2$-loss on the PDE constraint by an empirical negative-Sobolev norm. We further show that this loss function can be used to extend physics-informed neural networks (PINNs) to stochastic equations, thereby resulting in a new NN-based variant termed Negative Sobolev Norm-PINN (NeS-PINN).
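
摘要所说的“把经验 $L^2$ 损失换成经验负 Sobolev 范数”,可以用如下弱形式损失来理解(记号为本文整理时的假设:$\mathcal{P}$ 为 PDE 算子,$f$ 为粗糙源项,$\varphi_j$ 为检验函数):

```latex
\[
  \mathcal{L}(u) \;=\; \sum_{j=1}^{J}
  \Big( \int_{\Omega} \varphi_j(x)\,\big(\mathcal{P}(u)(x) - f(x)\big)\, dx \Big)^{2}
  \;\approx\; \big\| \mathcal{P}(u) - f \big\|_{H^{-s}(\Omega)}^{2}
\]
```

也就是说,不再要求 PDE 残差在每个配点上逐点很小,而只要求它对有限个检验函数的积分很小,从而能容纳经典意义下无定义的粗糙强迫项。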

[LG-5] Accelerated Training through Iterative Gradient Propagation Along the Residual Path ICLR2025

链接: https://arxiv.org/abs/2501.17086
作者: Erwan Fagnou,Paul Caillon,Blaise Delattre,Alexandre Allauzen
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures, accepted to ICLR 2025

点击查看摘要

Abstract:Despite being the cornerstone of deep learning, backpropagation is criticized for its inherent sequentiality, which can limit the scalability of very deep models. Such models faced convergence issues due to vanishing gradients, later resolved using residual connections, variants of which are now widely used in modern architectures. However, the computational cost of backpropagation remains a major burden, accounting for most of the training time. Taking advantage of residual-like architectural designs, we introduce Highway backpropagation, a parallelizable iterative algorithm that approximates backpropagation by alternately i) accumulating the gradient estimates along the residual path, and ii) backpropagating them through every layer in parallel. This algorithm is naturally derived from a decomposition of the gradient as the sum of gradients flowing through all paths and is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks. Through an extensive empirical study on a large selection of tasks and models, we evaluate Highway-BP and show that major speedups can be achieved with minimal performance degradation.

[LG-6] Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

链接: https://arxiv.org/abs/2501.17084
作者: Evgenii Evstafev
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures

点击查看摘要

Abstract:Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Despite using controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.

[LG-7] MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition AAAI25

链接: https://arxiv.org/abs/2501.17011
作者: Philippe Pasquier,Jeff Ens,Nathan Fradet,Paul Triana,Davide Rizzotti,Jean-Baptiste Rolland,Maryam Safi
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: AAAI 25

点击查看摘要

Abstract:We present and release MIDI-GPT, a generative system based on the Transformer architecture that is designed for computer-assisted music composition workflows. MIDI-GPT supports the infilling of musical material at the track and bar level, and can condition generation on attributes including: instrument type, musical style, note density, polyphony level, and note duration. In order to integrate these features, we employ an alternative representation for musical material, creating a time-ordered sequence of musical events for each track and concatenating several tracks into a single sequence, rather than using a single time-ordered sequence where the musical events corresponding to different tracks are interleaved. We also propose a variation of our representation allowing for expressiveness. We present experimental results that demonstrate that MIDI-GPT is able to consistently avoid duplicating the musical material it was trained on, generate music that is stylistically similar to the training dataset, and that attribute controls allow enforcing various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration and evaluation of MIDI-GPT into commercial products, as well as several artistic works produced using it.

[LG-8] Few Edges Are Enough: Few-Shot Network Attack Detection with Graph Neural Networks

链接: https://arxiv.org/abs/2501.16964
作者: Tristan Bilot,Nour El Madhoun,Khaldoun Al Agha,Anis Zouaoui
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: This is the version of the author, accepted for publication at IWSEC 2024. Published version available at this https URL

点击查看摘要

Abstract:Detecting cyberattacks using Graph Neural Networks (GNNs) has seen promising results recently. Most of the state-of-the-art models that leverage these techniques require labeled examples, hard to obtain in many real-world scenarios. To address this issue, unsupervised learning and Self-Supervised Learning (SSL) have emerged as interesting approaches to reduce the dependency on labeled data. Nonetheless, these methods tend to yield anomaly detection algorithms rather than effective attack detection systems. This paper introduces Few Edges Are Enough (FEAE), a GNN-based architecture trained with SSL and Few-Shot Learning (FSL) to better distinguish between false positive anomalies and actual attacks. To maximize the potential of few-shot examples, our model employs a hybrid self-supervised objective that combines the advantages of contrastive-based and reconstruction-based SSL. By leveraging only a minimal number of labeled attack events, represented as attack edges, FEAE achieves competitive performance on two well-known network datasets compared to both supervised and unsupervised methods. Remarkably, our experimental results unveil that employing only 1 malicious event for each attack type in the dataset is sufficient to achieve substantial improvements. FEAE not only outperforms self-supervised GNN baselines but also surpasses some supervised approaches on one of the datasets.

[LG-9] Online-BLS: An Accurate and Efficient Online Broad Learning System for Data Stream Classification

链接: https://arxiv.org/abs/2501.16932
作者: Chunyu Lei,Guang-Ze Chen,C. L. Philip Chen,Tong Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The state-of-the-art online learning models generally conduct a single online gradient descent step when a new sample arrives and thus suffer from suboptimal model weights. To this end, we introduce an online broad learning system framework with closed-form solutions for each online update. Different from employing existing incremental broad learning algorithms for online learning tasks, which tend to incur degraded accuracy and expensive online update overhead, we design an effective weight estimation algorithm and an efficient online updating strategy to remedy the above two deficiencies, respectively. Specifically, an effective weight estimation algorithm is first developed by replacing notorious matrix inverse operations with Cholesky decomposition and forward-backward substitution to improve model accuracy. Second, we devise an efficient online updating strategy that dramatically reduces online update time. Theoretical analysis exhibits the splendid error bound and low time complexity of our model. The most popular test-then-train evaluation experiments on various real-world datasets prove its superiority and efficiency. Furthermore, our framework is naturally extended to data stream scenarios with concept drift and exceeds state-of-the-art baselines.
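
摘要中“用 Cholesky 分解 + 前向/后向替换取代矩阵求逆”对应岭回归法方程的标准解法,SciPy 一行即可完成。下面是一个示意(矩阵规模与正则系数均为演示假设):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)

A = rng.normal(size=(500, 64))            # 特征矩阵(如 BLS 的增强节点输出)
Y = rng.normal(size=(500, 3))
lam = 1e-2

# 法方程 (A^T A + lam I) W = A^T Y:K 对称正定,适合 Cholesky
K = A.T @ A + lam * np.eye(64)
W = cho_solve(cho_factor(K), A.T @ Y)     # 等价于 inv(K) @ A.T @ Y,但不显式求逆

print(np.allclose(W, np.linalg.solve(K, A.T @ Y)))   # True
```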

[LG-10] Quantifying Uncertainty and Variability in Machine Learning: Confidence Intervals for Quantiles in Performance Metric Distributions

链接: https://arxiv.org/abs/2501.16931
作者: Christoph Lehmann,Yahor Paromau
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 23 pages, 10 figures

点击查看摘要

Abstract:Machine learning models are widely used in applications where reliability and robustness are critical. Model evaluation often relies on single-point estimates of performance metrics such as accuracy, F1 score, or mean squared error, that fail to capture the inherent variability in model performance. This variability arises from multiple sources, including train-test split, weights initialization, and hyperparameter tuning. Investigating the characteristics of performance metric distributions, rather than focusing on a single point only, is essential for informed decision-making during model selection and optimization, especially in high-stakes settings. How does the performance metric vary due to intrinsic uncertainty in the selected modeling approach? For example, train-test split is modified, initial weights for optimization are modified or hyperparameter tuning is done using an algorithm with probabilistic nature? This is shifting the focus from identifying a single best model to understanding a distribution of the performance metric that captures variability across different training conditions. By running multiple experiments with varied settings, empirical distributions of performance metrics can be generated. Analyzing these distributions can lead to more robust models that generalize well across diverse scenarios. This contribution explores the use of quantiles and confidence intervals to analyze such distributions, providing a more complete understanding of model performance and its uncertainty. Aimed at a statistically interested audience within the machine learning community, the suggested approaches are easy to implement and apply to various performance metrics for classification and regression problems. Given the often long training times in ML, particular attention is given to small sample sizes (in the order of 10-25).
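
对 i.i.d. 的重复实验结果,分位数有经典的非参数置信区间:利用次序统计量与 Binomial(n, q) 的分位点确定秩区间。下面给出一个近似实现的示意(20 次重复的 F1 分数为虚构数据):

```python
import numpy as np
from scipy.stats import binom

def quantile_ci(samples, q=0.5, alpha=0.05):
    """基于次序统计量的分位数置信区间(近似):
    用 Binomial(n, q) 的分位点确定上下秩。"""
    x = np.sort(samples)
    n = len(x)
    lo = max(int(binom.ppf(alpha / 2, n, q)), 0)
    hi = min(int(binom.ppf(1 - alpha / 2, n, q)), n - 1)
    return x[lo], x[hi]

rng = np.random.default_rng(0)
f1_runs = 0.8 + 0.05 * rng.normal(size=20)   # 假设:20 次重复训练得到的 F1
print("中位数的 95% 置信区间:", quantile_ci(f1_runs, q=0.5))
```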

[LG-11] Projection-free Algorithms for Online Convex Optimization with Adversarial Constraints

链接: https://arxiv.org/abs/2501.16919
作者: Dhruv Sarkar,Aprameyo Chakrabartty,Subhamon Supantha,Palash Dey,Abhishek Sinha
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study a generalization of the Online Convex Optimization (OCO) framework with time-varying adversarial constraints. In this problem, after selecting a feasible action from the convex decision set X, a convex constraint function is revealed alongside the cost function in each round. Our goal is to design a computationally efficient learning policy that achieves a small regret with respect to the cost functions and a small cumulative constraint violation (CCV) with respect to the constraint functions over a horizon of length T. It is well-known that the projection step constitutes the major computational bottleneck of the standard OCO algorithms. However, for many structured decision sets, linear functions can be efficiently optimized over the decision set. We propose a projection-free online policy which makes a single call to a Linear Program (LP) solver per round. Our method outperforms state-of-the-art projection-free online algorithms with adversarial constraints, achieving improved bounds of $\tilde{O}(T^{3/4})$ for both regret and CCV. The proposed algorithm is conceptually simple - it first constructs a surrogate cost function as a non-negative linear combination of the cost and constraint functions. Then, it passes the surrogate costs to a new, adaptive version of the online conditional gradient subroutine, which we propose in this paper.
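
该算法的骨架是“构造代理代价 → 每轮只调用一次线性规划(条件梯度步,无需投影)”。下面在概率单纯形上给出一个示意(梯度用随机数代替、步长与违约权重均为演示假设,并非论文算法的完整实现):

```python
import numpy as np
from scipy.optimize import linprog

def lp_oracle(grad, dim):
    """每轮仅一次 LP:在单纯形 {x >= 0, sum x = 1} 上最小化 <grad, x>。"""
    res = linprog(grad, A_eq=np.ones((1, dim)), b_eq=[1.0],
                  bounds=[(0, 1)] * dim, method="highs")
    return res.x

rng = np.random.default_rng(0)
dim, T, lam = 5, 200, 1.0                 # lam:约束违反在代理代价中的权重
x = np.ones(dim) / dim
for t in range(1, T + 1):
    g_cost = rng.normal(size=dim)         # 代价函数梯度(演示用随机数)
    g_cons = rng.normal(size=dim)         # 约束函数梯度
    v = lp_oracle(g_cost + lam * g_cons, dim)   # 线性优化步,代替投影
    x = x + (v - x) / np.sqrt(t)          # 条件梯度式更新,始终保持可行
print(x.round(3))
```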

[LG-12] On Rollouts in Model-Based Reinforcement Learning

链接: https://arxiv.org/abs/2501.16918
作者: Bernd Frauenknecht,Devdutt Subhasish,Friedrich Solowjow,Sebastian Trimpe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model-based reinforcement learning (MBRL) seeks to enhance data efficiency by learning a model of the environment and generating synthetic rollouts from it. However, accumulated model errors during these rollouts can distort the data distribution, negatively impacting policy learning and hindering long-term planning. Thus, the accumulation of model errors is a key bottleneck in current MBRL methods. We propose Infoprop, a model-based rollout mechanism that separates aleatoric from epistemic model uncertainty and reduces the influence of the latter on the data distribution. Further, Infoprop keeps track of accumulated model errors along a model rollout and provides termination criteria to limit data corruption. We demonstrate the capabilities of Infoprop in the Infoprop-Dyna algorithm, reporting state-of-the-art performance in Dyna-style MBRL on common MuJoCo benchmark tasks while substantially increasing rollout length and data quality.
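
Infoprop 的前提是把模型不确定性拆成偶然(aleatoric)与认知(epistemic)两部分。对概率集成模型,常用的分解是:成员内方差的均值为偶然不确定性,成员间均值的方差为认知不确定性。下面是一个 NumPy 示意(Infoprop 对后者的具体利用方式以论文为准):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设:5 个概率集成成员,各自对 100 个状态维度给出高斯预测 (mu_i, sigma_i)
mu = rng.normal(size=(5, 100))
sigma = 0.1 + 0.05 * rng.random((5, 100))

aleatoric = (sigma ** 2).mean(axis=0)     # 成员内方差的均值:数据固有噪声
epistemic = mu.var(axis=0)                # 成员间均值的方差:模型认知不确定性
print(aleatoric.mean().round(4), epistemic.mean().round(4))
```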

[LG-13] A Unified Evaluation Framework for Epistemic Predictions AISTATS

链接: https://arxiv.org/abs/2501.16912
作者: Shireen Kudukkil Manchingal,Muhammad Mubashar,Kaizheng Wang,Fabio Cuzzolin
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258. Copyright 2025 by the author(s)

点击查看摘要

Abstract:Predictions of uncertainty-aware models are diverse, ranging from single point estimates (often averaged over prediction samples) to predictive distributions, to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, which allows users to tailor the trade-off between accuracy and precision of predictions via a suitably designed performance metric. This makes possible the selection of the most suitable model for a particular real-world application as a function of the desired trade-off. Our experiments, concerning Bayesian, ensemble, evidential, deterministic, credal and belief function classifiers on the CIFAR-10, MNIST and CIFAR-100 datasets, show that the metric behaves as desired.

[LG-14] RAINER: A Robust Ensemble Learning Grid Search-Tuned Framework for Rainfall Patterns Prediction

链接: https://arxiv.org/abs/2501.16900
作者: Zhenqi Li,Junhao Zhong,Hewei Wang,Jinfeng Xu,Yijie Li,Jinjiang You,Jiayi Zhang,Runzhi Wu,Soumyabrata Dev
类目: Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:Rainfall prediction remains a persistent challenge due to the highly nonlinear and complex nature of meteorological data. Existing approaches lack systematic utilization of grid search for optimal hyperparameter tuning, relying instead on heuristic or manual selection, frequently resulting in sub-optimal results. Additionally, these methods rarely incorporate newly constructed meteorological features such as differences between temperature and humidity to capture critical weather dynamics. Furthermore, there is a lack of systematic evaluation of ensemble learning techniques and limited exploration of diverse advanced models introduced in the past one or two years. To address these limitations, we propose a robust ensemble learning grid search-tuned framework (RAINER) for rainfall prediction. RAINER incorporates a comprehensive feature engineering pipeline, including outlier removal, imputation of missing values, feature reconstruction, and dimensionality reduction via Principal Component Analysis (PCA). The framework integrates novel meteorological features to capture dynamic weather patterns and systematically evaluates non-learning, mathematics-based methods and a variety of machine learning models, from weak classifiers to advanced neural networks such as Kolmogorov-Arnold Networks (KAN). By leveraging grid search for hyperparameter tuning and ensemble voting techniques, RAINER achieves promising results on real-world datasets.
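
摘要强调的“系统化网格搜索调参”在 scikit-learn 中即 GridSearchCV。下面以随机森林为例演示这一流程(数据与参数网格为演示假设;论文对多种模型重复同一流程):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5, scoring="f1",                    # 交叉验证 + 统一评价指标
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```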

[LG-15] DBSCAN in domains with periodic boundary conditions

链接: https://arxiv.org/abs/2501.16894
作者: Xander M. de Wit,Alessandro Gabbana
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Many scientific problems involve data that is embedded in a space with periodic boundary conditions. This can for instance be related to an inherent cyclic or rotational symmetry in the data or a spatially extended periodicity. When analyzing such data, well-tailored methods are needed to obtain efficient approaches that obey the periodic boundary conditions of the problem. In this work, we present a method for applying a clustering algorithm to data embedded in a periodic domain based on the DBSCAN algorithm, a widely used unsupervised machine learning method that identifies clusters in data. The proposed method internally leverages the conventional DBSCAN algorithm for domains with open boundaries, such that it remains compatible with all optimized implementations for neighborhood searches in open domains. In this way, it retains the same optimized runtime complexity of $O(N \log N)$. We demonstrate the workings of the proposed method using synthetic data in one, two and three dimensions and also apply it to a real-world example involving the clustering of bubbles in a turbulent flow. The proposed approach is implemented in a ready-to-use Python package that we make publicly available.
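
在周期域上复用标准 DBSCAN 的一个直观做法,是先按最小像(minimum-image)约定构造成对距离矩阵,再以 metric='precomputed' 交给 DBSCAN。下面的示意采用 O(N²) 的朴素距离计算,仅演示周期距离的处理;论文方法的价值正在于保持 O(N log N) 的复杂度:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def periodic_dbscan(X, box, eps=0.1, min_samples=5):
    """对周期盒 box 内的点集做 DBSCAN:距离取跨边界的最短路径。"""
    diff = np.abs(X[:, None, :] - X[None, :, :])   # (N, N, d) 成对坐标差
    diff = np.minimum(diff, box - diff)            # 最小像约定
    D = np.sqrt((diff ** 2).sum(-1))
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(D)

rng = np.random.default_rng(0)
X = rng.random((300, 2))                           # 单位周期盒 [0,1)^2 内的点
print(np.unique(periodic_dbscan(X, box=np.array([1.0, 1.0]))))
```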

[LG-16] Enhancing Web Service Anomaly Detection via Fine-grained Multi-modal Association and Frequency Domain Analysis WWW'25

链接: https://arxiv.org/abs/2501.16875
作者: Xixuan Yang,Xin Huang,Chiming Duan,Tong Jia,Shandong Dong,Ying Li,Gang Huang
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted by WWW '25

点击查看摘要

Abstract:Anomaly detection is crucial for ensuring the stability and reliability of web service systems. Logs and metrics contain complementary information that can reflect the system's operational state and potential anomalies. Thus, existing anomaly detection methods use logs and metrics to detect web service systems' anomalies through data fusion approaches. They associate logs and metrics using coarse-grained time window alignment and capture the normal patterns of system operation through reconstruction. However, these methods have two issues that limit their performance in anomaly detection. First, due to asynchrony between logs and metrics, coarse-grained time window alignment cannot achieve a precise association between the two modalities. Second, reconstruction-based methods suffer from severe overgeneralization problems, resulting in anomalies being accurately reconstructed. In this paper, we propose a novel anomaly detection method named FFAD to address these two issues. On the one hand, FFAD employs graph-based alignment to mine and extract associations between the modalities from the constructed log-metric relation graph, achieving precise associations between logs and metrics. On the other hand, we improve the model's fit to normal data distributions through Fourier Frequency Focus, thereby enhancing the effectiveness of anomaly detection. We validated the effectiveness of our model on two real-world industrial datasets and one open-source dataset. The results show that our method achieves an average anomaly detection F1-score of 93.6%, representing an 8.8% improvement over previous state-of-the-art methods.

[LG-17] HD-CB: The First Exploration of Hyperdimensional Computing for Contextual Bandits Problems

链接: https://arxiv.org/abs/2501.16863
作者: Marco Angioli,Antonello Rosato,Marcello Barbirotta,Rocco Martino,Francesco Menichelli,Mauro Olivieri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hyperdimensional Computing (HDC), also known as Vector Symbolic Architectures, is a computing paradigm that combines the strengths of symbolic reasoning with the efficiency and scalability of distributed connectionist models in artificial intelligence. HDC has recently emerged as a promising alternative for performing learning tasks in resource-constrained environments thanks to its energy and computational efficiency, inherent parallelism, and resilience to noise and hardware faults. This work introduces the Hyperdimensional Contextual Bandits (HD-CB): the first exploration of HDC to model and automate sequential decision-making Contextual Bandits (CB) problems. The proposed approach maps environmental states in a high-dimensional space and represents each action with dedicated hypervectors (HVs). At each iteration, these HVs are used to select the optimal action for the given context and are updated based on the received reward, replacing computationally expensive ridge regression procedures required by traditional linear CB algorithms with simple, highly parallel vector operations. We propose four HD-CB variants, demonstrating their flexibility in implementing different exploration strategies, as well as techniques to reduce memory overhead and the number of hyperparameters. Extensive simulations on synthetic datasets and a real-world benchmark reveal that HD-CB consistently achieves competitive or superior performance compared to traditional linear CB algorithms, while offering faster convergence time, lower computational complexity, improved scalability, and high parallelism.
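下面给出一个玩具级示意(并非论文的四个 HD-CB 变体之一,奖励规则、探索策略与超参数均为假设),演示"每个动作一个超向量 + 点积选动作 + 向量化更新"的核心思路:

```python
import numpy as np

D, n_actions, d_ctx = 4096, 4, 8
rng = np.random.default_rng(0)
P = rng.choice([-1.0, 1.0], size=(d_ctx, D))   # fixed random bipolar projection
A = np.zeros((n_actions, D))                   # one model hypervector per action

def encode(ctx):
    return np.sign(ctx @ P)                    # map the context into HD space

for t in range(2000):
    ctx = rng.normal(size=d_ctx)
    h = encode(ctx)
    # Epsilon-greedy selection by similarity (dot product) with each action hypervector
    a = int(np.argmax(A @ h)) if rng.random() > 0.1 else int(rng.integers(n_actions))
    reward = 1.0 if (a == 0) == (ctx[0] > 0) else 0.0   # toy reward rule
    A[a] += (reward - 0.5) * h                 # simple bundling update, no ridge regression
```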

[LG-18] Hybrid Phenology Modeling for Predicting Temperature Effects on Tree Dormancy

链接: https://arxiv.org/abs/2501.16848
作者: Ron van Bree,Diego Marcos,Ioannis Athanasiadis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Biophysical models offer valuable insights into climate-phenology relationships in both natural and agricultural settings. However, there are substantial structural discrepancies across models which require site-specific recalibration, often yielding inconsistent predictions under similar climate scenarios. Machine learning methods offer data-driven solutions, but often lack interpretability and alignment with existing knowledge. We present a phenology model describing dormancy in fruit trees, integrating conventional biophysical models with a neural network to address their structural disparities. We evaluate our hybrid model in an extensive case study predicting cherry tree phenology in Japan, South Korea and Switzerland. Our approach consistently outperforms both traditional biophysical and machine learning models in predicting blooming dates across years. Additionally, the neural network’s adaptability facilitates parameter learning for specific tree varieties, enabling robust generalization to new sites without site-specific recalibration. This hybrid model leverages both biophysical constraints and data-driven flexibility, offering a promising avenue for accurate and interpretable phenology modeling.

[LG-19] Flow Matching: Markov Kernels Stochastic Processes and Transport Plans

链接: https://arxiv.org/abs/2501.16839
作者: Christian Wald,Gabriele Steidl
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Among generative neural models, flow matching techniques stand out for their simple applicability and good scaling properties. Here, velocity fields of curves connecting a simple latent and a target distribution are learned. Then the corresponding ordinary differential equation can be used to sample from a target distribution, starting in samples from the latent one. This paper reviews from a mathematical point of view different techniques to learn the velocity fields of absolutely continuous curves in the Wasserstein geometry. We show how the velocity fields can be characterized and learned via i) transport plans (couplings) between latent and target distributions, ii) Markov kernels and iii) stochastic processes, where the latter two include the coupling approach, but are in general broader. Besides this main goal, we show how flow matching can be used for solving Bayesian inverse problems, where the definition of conditional Wasserstein distances plays a central role. Finally, we briefly address continuous normalizing flows and score matching techniques, which approach the learning of velocity fields of curves from other directions.
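下面给出一个最小的流匹配训练示意(采用独立耦合下的直线插值路径,即摘要中"传输计划/耦合"视角的最简单特例;网络结构与数据均为假设):

```python
import torch
import torch.nn as nn

# v(x, t): a small velocity-field network on 2-D data
v = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(2000):
    x0 = torch.randn(256, 2)                  # samples from the latent distribution
    x1 = torch.randn(256, 2) * 0.5 + 2.0      # stand-in samples from the target
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                # straight-line coupling path
    target = x1 - x0                          # velocity of the linear interpolation
    loss = ((v(torch.cat([xt, t], dim=1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# Sampling: integrate dx/dt = v(x, t) from t=0 to t=1, starting at latent samples.
```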

[LG-20] Data-Driven vs Traditional Approaches to Power Transformers Top-Oil Temperature Estimation

链接: https://arxiv.org/abs/2501.16831
作者: Francis Tembo,Federica Bragone,Tor Laneryd,Matthieu Barreau,Kateryna Morozovska
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Power transformers are subjected to electrical currents and temperature fluctuations that, if not properly controlled, can lead to major deterioration of their insulation system. Therefore, monitoring the temperature of a power transformer is fundamental to ensure a long-term operational life. Models presented in the IEC 60076-7 and IEEE standards, for example, monitor the temperature by calculating the top-oil and the hot-spot temperatures. However, these models are not very accurate and rely on the power transformers’ properties. This paper focuses on finding an alternative method to predict the top-oil temperatures given previous measurements. Given the large quantities of data available, machine learning methods for time series forecasting are analyzed and compared to the real measurements and the corresponding prediction of the IEC standard. The methods tested are Artificial Neural Networks (ANNs), Time-series Dense Encoder (TiDE), and Temporal Convolutional Networks (TCN) using different combinations of historical measurements. Each of these methods outperformed the IEC 60076-7 model and they are extended to estimate the temperature rise over ambient. To enhance prediction reliability, we explore the application of quantile regression to construct prediction intervals for the expected top-oil temperature ranges. The best-performing model successfully estimates conditional quantiles that provide sufficient coverage.
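下面给出分位数回归所用 pinball 损失的一个最小示意(PyTorch 实现,模型输出以随机数占位),演示如何用 5% 与 95% 分位数构造摘要中提到的预测区间:

```python
import torch

def pinball_loss(pred, target, q):
    # Asymmetric quantile loss: under-prediction weighted by q, over-prediction by 1-q
    err = target - pred
    return torch.mean(torch.maximum(q * err, (q - 1) * err))

# Train two output heads for the 5% and 95% quantiles of top-oil temperature,
# then report [q05, q95] as the prediction interval.
pred_q05, pred_q95 = torch.randn(32), torch.randn(32)   # placeholder model outputs
target = torch.randn(32)
loss = pinball_loss(pred_q05, target, 0.05) + pinball_loss(pred_q95, target, 0.95)
```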

[LG-21] Statistical Analysis of Risk Assessment Factors and Metrics to Evaluate Radicalisation in Twitter

链接: https://arxiv.org/abs/2501.16830
作者: Raul Lara-Cabrera,Antonio Gonzalez-Pardo,David Camacho
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Nowadays, Social Networks have become essential communication tools, producing a large amount of information about their users and their interactions, which can be analysed with Data Mining methods. In recent years, Social Networks have also been used to radicalise people. In this paper, we study the performance of a set of indicators and their respective metrics, devoted to assessing the risk of radicalisation of a specific individual, on three different datasets. Keyword-based metrics, even though they depend on the written language, perform well when measuring frustration, perception of discrimination, and declarations of negative and positive ideas about Western society and Jihadism, respectively. However, metrics based on frequent habits, such as writing ellipses, are not sufficient to characterise a user at risk of radicalisation. The paper presents a detailed description of both the set of indicators used to assess radicalisation in Social Networks and the set of datasets used to evaluate them. Finally, an experimental study over these datasets is carried out to evaluate the performance of the metrics considered.

[LG-22] Late Breaking Results: Energy-Efficient Printed Machine Learning Classifiers with Sequential SVMs DATE’25

链接: https://arxiv.org/abs/2501.16828
作者: Spyridon Besias,Ilias Sertaridis,Florentia Afentaki,Konstantinos Balaskas,Georgios Zervakis
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
*备注: Accepted at the Design, Automation and Test in Europe Conference (DATE’25), March 31 - April 2, 2025

点击查看摘要

Abstract:Printed Electronics (PE) provide a mechanically flexible and cost-effective solution for machine learning (ML) circuits, compared to silicon-based technologies. However, due to large feature sizes, printed classifiers are limited by high power, area, and energy overheads, which restricts the realization of battery-powered systems. In this work, we design sequential printed bespoke Support Vector Machine (SVM) circuits that adhere to the power constraints of existing printed batteries while minimizing energy consumption, thereby boosting battery life. Our results show 6.5x energy savings while maintaining higher accuracy compared to the state of the art.

[LG-23] Can Transformers Learn Full Bayesian Inference in Context?

链接: https://arxiv.org/abs/2501.16825
作者: Arik Reuter,Tim G. J. Rudner,Vincent Fortuin,David Rügamer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context – without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows which enables us to infer complex posterior distributions for methods such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods not operating in context.

[LG-24] Enhancing Non-Intrusive Load Monitoring with Features Extracted by Independent Component Analysis

链接: https://arxiv.org/abs/2501.16817
作者: Sahar Moghimian Hoosh,Ilia Kamyshev,Henni Ouerdane
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, a novel neural network architecture is proposed to address the challenges in energy disaggregation algorithms. These challenges include the limited availability of data and the complexity of disaggregating a large number of appliances operating simultaneously. The proposed model utilizes independent component analysis as the backbone of the neural network and is evaluated using the F1-score for varying numbers of appliances working concurrently. Our results demonstrate that the model is less prone to overfitting, exhibits low complexity, and effectively decomposes signals with many individual components. Furthermore, we show that the proposed model outperforms existing algorithms when applied to real-world data.
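下面给出一个示意性草图(并非论文的网络结构,数据为占位随机数),演示用 scikit-learn 的 FastICA 从聚合功率信号窗口中提取独立成分特征的思路:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Sliding windows of the aggregate mains signal (placeholder data):
# each row is one window of 128 samples.
aggregate = np.random.randn(1000, 128)

ica = FastICA(n_components=16, random_state=0)
features = ica.fit_transform(aggregate)   # per-window independent-component features
# `features` can then feed a downstream classifier that scores appliance on/off states.
```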

[LG-25] Meta-Federated Learning: A Novel Approach for Real-Time Traffic Flow Management

链接: https://arxiv.org/abs/2501.16758
作者: Bob Johnson,Michael Geller
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Efficient management of traffic flow in urban environments presents a significant challenge, exacerbated by dynamic changes and the sheer volume of data generated by modern transportation networks. Traditional centralized traffic management systems often struggle with scalability and privacy concerns, hindering their effectiveness. This paper introduces a novel approach by combining Federated Learning (FL) and Meta-Learning (ML) to create a decentralized, scalable, and adaptive traffic management system. Our approach, termed Meta-Federated Learning, leverages the distributed nature of FL to process data locally at the edge, thereby enhancing privacy and reducing latency. Simultaneously, ML enables the system to quickly adapt to new traffic conditions without the need for extensive retraining. We implement our model across a simulated network of smart traffic devices, demonstrating that Meta-Federated Learning significantly outperforms traditional models in terms of prediction accuracy and response time. Furthermore, our approach shows remarkable adaptability to sudden changes in traffic patterns, suggesting a scalable solution for real-time traffic management in smart cities. This study not only paves the way for more resilient urban traffic systems but also exemplifies the potential of integrated FL and ML in other real-world applications.

[LG-26] Random Forest Calibration

链接: https://arxiv.org/abs/2501.16756
作者: Mohammad Hossein Shaker,Eyke Hüllermeier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Random Forest (RF) classifier is often claimed to be relatively well calibrated when compared with other machine learning methods. Moreover, the existing literature suggests that traditional calibration methods, such as isotonic regression, do not substantially enhance the calibration of RF probability estimates unless supplied with extensive calibration data sets, which can represent a significant obstacle in cases of limited data availability. Nevertheless, there seems to be no comprehensive study validating such claims and systematically comparing state-of-the-art calibration methods specifically for RF. To close this gap, we investigate a broad spectrum of calibration methods tailored to or at least applicable to RF, ranging from scaling techniques to more advanced algorithms. Our results based on synthetic as well as real-world data unravel the intricacies of RF probability estimates, scrutinize the impacts of hyper-parameters, and compare calibration methods in a systematic way. We show that a well-optimized RF performs as well as or better than leading calibration approaches.
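下面给出一个最小示意(数据为合成数据),用 scikit-learn 比较原始随机森林与等值回归(isotonic)校准后的概率校准程度:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
iso = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    method="isotonic", cv=5).fit(X_tr, y_tr)

for name, model in [("raw RF", raw), ("isotonic", iso)]:
    frac_pos, mean_pred = calibration_curve(y_te, model.predict_proba(X_te)[:, 1], n_bins=10)
    print(name, abs(frac_pos - mean_pred).mean())   # crude expected calibration gap
```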

[LG-27] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

链接: https://arxiv.org/abs/2501.16750
作者: Xinyue Shen,Yixin Wu,Yiting Qu,Michael Backes,Savvas Zannettou,Yang Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by 13-21× through model stealing attacks with acceptable attack performance. We hope our study can serve as a call to action for the research community and platform moderators to fortify defenses against these emerging threats.

[LG-28] Growing the Efficient Frontier on Panel Trees

链接: https://arxiv.org/abs/2501.16730
作者: Lin William Cong,Guanhao Feng,Jingyu He,Xin He
类目: Machine Learning (cs.LG); Pricing of Securities (q-fin.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a new class of tree-based models, P-Trees, for analyzing (unbalanced) panel of individual asset returns, generalizing high-dimensional sorting with economic guidance and interpretability. Under the mean-variance efficient framework, P-Trees construct test assets that significantly advance the efficient frontier compared to commonly used test assets, with alphas unexplained by benchmark pricing models. P-Tree tangency portfolios also constitute traded factors, recovering the pricing kernel and outperforming popular observable and latent factor models for investments and cross-sectional pricing. Finally, P-Trees capture the complexity of asset returns with sparsity, achieving out-of-sample Sharpe ratios close to those attained only by over-parameterized large models.

[LG-29] Outlier Synthesis via Hamiltonian Monte Carlo for Out-of-Distribution Detection ICLR2025

链接: https://arxiv.org/abs/2501.16718
作者: Hengzhuang Li,Teng Zhang
类目: Machine Learning (cs.LG)
*备注: ICLR 2025

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is crucial for developing trustworthy and reliable machine learning systems. Recent advances in training with auxiliary OOD data demonstrate efficacy in enhancing detection capabilities. Nonetheless, these methods heavily rely on acquiring a large pool of high-quality natural outliers. Some prior methods try to alleviate this problem by synthesizing virtual outliers but suffer from either poor quality or high cost due to the monotonous sampling strategy and the heavy-parameterized generative models. In this paper, we overcome all these problems by proposing the Hamiltonian Monte Carlo Outlier Synthesis (HamOS) framework, which views the synthesis process as sampling from Markov chains. Based solely on the in-distribution data, the Markov chains can extensively traverse the feature space and generate diverse and representative outliers, hence exposing the model to miscellaneous potential OOD scenarios. The Hamiltonian Monte Carlo with a sampling acceptance rate close to 1 also makes our framework highly efficient. By empirically competing with SOTA baselines on both standard and large-scale benchmarks, we verify the efficacy and efficiency of our proposed HamOS.

[LG-30] Data Mining in Transportation Networks with Graph Neural Networks: A Review and Outlook

链接: https://arxiv.org/abs/2501.16656
作者: Jiawei Xue,Ruichen Tan,Jianzhu Ma,Satish V. Ukkusuri
类目: Machine Learning (cs.LG)
*备注: 41 pages, 6 figures

点击查看摘要

Abstract:Data mining in transportation networks (DMTNs) refers to using diverse types of spatio-temporal data for various transportation tasks, including pattern analysis, traffic prediction, and traffic controls. Graph neural networks (GNNs) are essential in many DMTN problems due to their capability to represent spatial correlations between entities. Between 2016 and 2024, the notable applications of GNNs in DMTNs have extended to multiple fields such as traffic prediction and operation. However, existing reviews have primarily focused on traffic prediction tasks. To fill this gap, this study provides a timely and insightful summary of GNNs in DMTNs, highlighting new progress in prediction and operation from academic and industry perspectives since 2023. First, we present and analyze various DMTN problems, followed by classical and recent GNN models. Second, we delve into key works in three areas: (1) traffic prediction, (2) traffic operation, and (3) industry involvement, such as Google Maps, Amap, and Baidu Maps. Along these directions, we discuss new research opportunities based on the significance of transportation problems and data availability. Finally, we compile resources such as data, code, and other learning materials to foster interdisciplinary communication. This review, driven by recent trends in GNNs in DMTN studies since 2023, could democratize abundant datasets and efficient GNN methods for various transportation problems including prediction and operation.

[LG-31] Analysis of Zero Day Attack Detection Using MLP and XAI

链接: https://arxiv.org/abs/2501.16638
作者: Ashim Dahal,Prabin Bajgai,Nick Rahimi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Any exploit taking advantage of a zero-day vulnerability is called a zero-day attack. Previous research and social media trends show a massive demand for research in zero-day attack detection. This paper analyzes Machine Learning (ML) and Deep Learning (DL) based approaches to create Intrusion Detection Systems (IDS) and scrutinizes them using Explainable AI (XAI) by training an explainer based on randomly sampled data from the testing set. The focus is on using the KDD99 dataset, which has the most research done among all the datasets for detecting zero-day attacks. The paper aims to synthesize the dataset to have fewer classes for multi-class classification, test ML and DL approaches on pattern recognition, establish the robustness and dependability of the model, and establish the interpretability and scalability of the model. We evaluated the performance of four multilayer perceptrons (MLPs) trained on the KDD99 dataset: baseline ML models, weighted ML models, truncated ML models, and weighted truncated ML models. Our results demonstrate that the truncated ML model achieves the highest accuracy (99.62%), precision, and recall, while the weighted truncated ML model shows lower accuracy (97.26%) but better class representation (less bias) across all classes, with an improved unweighted recall score. We also used SHapley Additive exPlanations (SHAP) to train an explainer for our truncated models and check feature importance between the weighted and unweighted variants.
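下面给出一个示意性草图(数据加载从略,以随机数占位;KDD99 预处理后约有 41 个特征),演示用 SHAP 的 KernelExplainer 为 MLP 训练解释器的流程:

```python
import numpy as np
import shap
from sklearn.neural_network import MLPClassifier

# X_train / X_test: preprocessed KDD99 feature matrices (placeholder random data here)
X_train = np.random.rand(500, 41); y_train = np.random.randint(0, 4, 500)
X_test = np.random.rand(100, 41)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300).fit(X_train, y_train)

# Model-agnostic explainer over a small background sample, as in the paper's setup
explainer = shap.KernelExplainer(mlp.predict_proba, X_train[:50])
shap_values = explainer.shap_values(X_test[:20])   # per-class feature attributions
```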

[LG-32] A General Bayesian Framework for Informative Input Design in System Identification

链接: https://arxiv.org/abs/2501.16625
作者: Alexandros E. Tzikas,Mykel J. Kochenderfer
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted to the IEEE Control Systems Letters

点击查看摘要

Abstract:We tackle the problem of informative input design for system identification, where we select inputs, observe the corresponding outputs from the true system, and optimize the parameters of our model to best fit the data. We propose a methodology that is compatible with any system and parametric family of models. Our approach only requires input-output data from the system and first-order information from the model with respect to the parameters. Our algorithm consists of two modules. First, we formulate the problem of system identification from a Bayesian perspective and propose an approximate iterative method to optimize the model’s parameters. Based on this Bayesian formulation, we are able to define a Gaussian-based uncertainty measure for the model parameters, which we can then minimize with respect to the next selected input. Our method outperforms model-free baselines with various linear and nonlinear dynamics.

[LG-33] Sparse Autoencoders Trained on the Same Data Learn Different Features

链接: https://arxiv.org/abs/2501.16615
作者: Gonçalo Paulo,Nora Belrose
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features “truly used” by the model.
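下面给出一个示意函数(匹配准则为假设,论文的具体判定标准可能不同),用解码器方向间的最大余弦相似度来估计两个随机种子下 SAE 的共享特征比例:

```python
import numpy as np

def shared_feature_fraction(W1, W2, thresh=0.9):
    """Fraction of SAE-1 features whose decoder direction has a near-duplicate in SAE-2.

    W1, W2: (n_latents, d_model) decoder weight matrices from two seeds.
    For very large SAEs (e.g., 131K latents), compute `sims` in row chunks.
    """
    W1 = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    W2 = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    sims = W1 @ W2.T                       # cosine similarity between all feature pairs
    return float((sims.max(axis=1) > thresh).mean())
```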

[LG-34] FUNU: Boosting Machine Unlearning Efficiency by Filtering Unnecessary Unlearning WWW’25

链接: https://arxiv.org/abs/2501.16614
作者: Zitong Li,Qingqing Ye,Haibo Hu
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by WWW’25

点击查看摘要

Abstract:Machine unlearning is an emerging field that selectively removes specific data samples from a trained model. This capability is crucial for addressing privacy concerns, complying with data protection regulations, and correcting errors or biases introduced by certain data. Unlike traditional machine learning, where models are typically static once trained, machine unlearning facilitates dynamic updates that enable the model to "forget" information without requiring complete retraining from scratch. There are various machine unlearning methods, some of which are more time-efficient when data removal requests are fewer. To decrease the execution time of such machine unlearning methods, we aim to reduce the size of data removal requests based on the fundamental assumption that the removal of certain data would not result in a distinguishable retrained model. We first propose the concept of unnecessary unlearning, which indicates that the model would not alter noticeably after removing some data points. Subsequently, we review existing solutions that can be used to solve our problem. We highlight their limitations in adaptability to different unlearning scenarios and their reliance on manually selected parameters. We consequently put forward FUNU, a method to identify data points that lead to unnecessary unlearning. FUNU circumvents the limitations of existing solutions. The idea is to discover data points within the removal requests that have similar neighbors in the remaining dataset. We utilize a reference model to set parameters for finding neighbors, inspired from the area of model memorization. We provide a theoretical analysis of the privacy guarantee offered by FUNU and conduct extensive experiments to validate its efficacy.

[LG-35] The Power of Perturbation under Sampling in Solving Extensive-Form Games

链接: https://arxiv.org/abs/2501.16600
作者: Wataru Masaka,Mitsuki Sakamoto,Kenshi Abe,Kaito Ariu,Tuomas Sandholm,Atsushi Iwasaki
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper investigates how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in imperfect-information extensive-form games. Perturbing the expected payoffs guarantees that the FTRL dynamics reach an approximate equilibrium, and proper adjustments of the magnitude of the perturbation lead to a Nash equilibrium (last-iterate convergence). This approach is robust even when payoffs are estimated using sampling – as is the case for large games – while the optimistic approach often becomes unstable. Building upon those insights, we first develop a general framework for perturbed FTRL algorithms under sampling. We then empirically show that in the last-iterate sense, the perturbed FTRL consistently outperforms the non-perturbed FTRL. We further identify a divergence function that reduces the variance of the estimates for perturbed payoffs, with which it significantly outperforms the prior algorithms on Leduc poker (whose structure is more asymmetric in a sense than that of the other benchmark games) and consistently exhibits smooth convergence behavior on all the benchmark games.

[LG-36] Toward Safe Integration of UAM in Terminal Airspace: UAM Route Feasibility Assessment using Probabilistic Aircraft Trajectory Prediction

链接: https://arxiv.org/abs/2501.16599
作者: Jungwoo Cho,Seongjin Choi
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Integrating Urban Air Mobility (UAM) into airspace managed by Air Traffic Control (ATC) poses significant challenges, particularly in congested terminal environments. This study proposes a framework to assess the feasibility of UAM route integration using probabilistic aircraft trajectory prediction. By leveraging conditional Normalizing Flows, the framework predicts short-term trajectory distributions of conventional aircraft, enabling UAM vehicles to dynamically adjust speeds and maintain safe separations. The methodology was applied to airspace over Seoul metropolitan area, encompassing interactions between UAM and conventional traffic at multiple altitudes and lanes. The results reveal that different physical locations of lanes and routes experience varying interaction patterns and encounter dynamics. For instance, Lane 1 at lower altitudes (1,500 ft and 2,000 ft) exhibited minimal interactions with conventional aircraft, resulting in the largest separations and the most stable delay proportions. In contrast, Lane 4 near the airport experienced more frequent and complex interactions due to its proximity to departing traffic. The limited trajectory data for departing aircraft in this region occasionally led to tighter separations and increased operational challenges. This study underscores the potential of predictive modeling in facilitating UAM integration while highlighting critical trade-offs between safety and efficiency. The findings contribute to refining airspace management strategies and offer insights for scaling UAM operations in complex urban environments.

[LG-37] Fine-Tuned Language Models as Space Systems Controllers

链接: https://arxiv.org/abs/2501.16588
作者: Enrico M. Zucchelli,Di Wu,Julia Briden,Christian Hofmann,Victor Rodriguez-Fernandez,Richard Linares
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Large language models (LLMs), or foundation models (FMs), are pretrained transformers that coherently complete sentences auto-regressively. In this paper, we show that LLMs can control simplified space systems after some additional training, called fine-tuning. We look at relatively small language models, ranging between 7 and 13 billion parameters. We focus on four problems: a three-dimensional spring toy problem, low-thrust orbit transfer, low-thrust cislunar control, and powered descent guidance. The fine-tuned LLMs are capable of controlling systems by generating sufficiently accurate outputs that are multi-dimensional vectors with up to 10 significant digits. We show that for several problems the amount of data required to perform fine-tuning is smaller than what is generally required of traditional deep neural networks (DNNs), and that fine-tuned LLMs are good at generalizing outside of the training dataset. Further, the same LLM can be fine-tuned with data from different problems, with only minor performance degradation with respect to LLMs trained for a single application. This work is intended as a first step towards the development of a general space systems controller.

[LG-38] HopCast: Calibration of Autoregressive Dynamics Models

链接: https://arxiv.org/abs/2501.16587
作者: Muhammad Bilal Shahid,Cody Fleming
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. These models are optimized to predict one step ahead and produce calibrated predictions if the predictive model can quantify uncertainty, such as deep ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method (e.g., Trajectory Sampling) to produce calibrated multi-step predictions. This paper introduces an approach named HopCast that uses the Modern Hopfield Network (MHN) to learn the residuals of a deterministic model that approximates the dynamical system. The MHN predicts the density of residuals based on a context vector at any timestep during autoregression. This approach produces calibrated multi-step predictions without uncertainty propagation and turns a deterministic model into a calibrated probabilistic model. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors with deep ensembles for multi-step predictions.

[LG-39] Optimization Landscapes Learned: Proxy Networks Boost Convergence in Physics-based Inverse Problems

链接: https://arxiv.org/abs/2501.16573
作者: Girnar Goyal,Philipp Holl,Sweta Agrawal,Nils Thuerey
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Ongoing work

点击查看摘要

Abstract:Solving inverse problems in physics is central to understanding complex systems and advancing technologies in various fields. Iterative optimization algorithms, commonly used to solve these problems, often encounter local minima, chaos, or regions with zero gradients. This is due to their overreliance on local information and highly chaotic inverse loss landscapes governed by underlying partial differential equations (PDEs). In this work, we show that deep neural networks successfully replicate such complex loss landscapes through spatio-temporal trajectory inputs. They also offer the potential to control the underlying complexity of these chaotic loss landscapes during training through various regularization methods. We show that optimizing on network-smoothened loss landscapes leads to improved convergence in predicting optimum inverse parameters over conventional momentum-based optimizers such as BFGS on multiple challenging problems.

[LG-40] C-HDNet: A Fast Hyperdimensional Computing Based Method for Causal Effect Estimation from Networked Observational Data

链接: https://arxiv.org/abs/2501.16562
作者: Abhishek Dalvi,Neil Ashtekar,Vasant Honavar
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We consider the problem of estimating causal effects from observational data in the presence of network confounding. In this context, an individual’s treatment assignment and outcomes may be affected by their neighbors within the network. We propose a novel matching technique which leverages hyperdimensional computing to model network information and improve predictive performance. We present results of extensive experiments which show that the proposed method outperforms or is competitive with the state-of-the-art methods for causal effect estimation from network data, including advanced computationally demanding deep learning methods. Further, our technique benefits from simplicity and speed, with roughly an order of magnitude lower runtime compared to state-of-the-art methods, while offering similar causal effect estimation error rates.

[LG-41] Distributional Information Embedding: A Framework for Multi-bit Watermarking

链接: https://arxiv.org/abs/2501.16558
作者: Haiyun He,Yepeng Liu,Ziqiao Wang,Yongyi Mao,Yuheng Bu
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel problem, distributional information embedding, motivated by the practical demands of multi-bit watermarking for large language models (LLMs). Unlike traditional information embedding, which embeds information into a pre-existing host signal, LLM watermarking actively controls the text generation process–adjusting the token distribution–to embed a detectable signal. We develop an information-theoretic framework to analyze this distributional information embedding problem, characterizing the fundamental trade-offs among three critical performance metrics: text quality, detectability, and information rate. In the asymptotic regime, we demonstrate that the maximum achievable rate with vanishing error corresponds to the entropy of the LLM’s output distribution and increases with higher allowable distortion. We also characterize the optimal watermarking scheme to achieve this rate. Extending the analysis to the finite-token case, we identify schemes that maximize detection probability while adhering to constraints on false alarm and distortion.
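作为背景,下面给出经典的单比特"绿名单"水印示意(Kirchenbauer 等人风格的简化版,并非本文的多比特框架),演示摘要中"调整 token 分布以嵌入可检测信号"的基本机制:

```python
import hashlib
import numpy as np

def green_list(prev_token, vocab_size, frac=0.5):
    # Seed a PRNG from the previous token and mark a pseudo-random half of the vocabulary
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.permutation(vocab_size)[: int(frac * vocab_size)]

def watermarked_logits(logits, prev_token, delta=2.0):
    biased = logits.copy()
    biased[green_list(prev_token, logits.shape[-1])] += delta   # shift the token distribution
    return biased

# Detection counts how often sampled tokens fall in each step's green list;
# the paper's framework generalizes this idea to embedding multiple bits.
```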

[LG-42] Reconciling Predictive Multiplicity in Practice ICML

链接: https://arxiv.org/abs/2501.16549
作者: Tina Behzad,Sílvia Casacuberta,Emily Ruth Diana,Alexander Williams Tolbert
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Presented at the ICML workshop on Humans, Algorithmic Decision-Making and Society: Modeling Interactions and Impact 2024

点击查看摘要

Abstract:Many machine learning applications predict individual probabilities, such as the likelihood that a person develops a particular illness. Since these probabilities are unknown, a key question is how to address situations in which different models trained on the same dataset produce varying predictions for certain individuals. This issue is exemplified by the model multiplicity (MM) phenomenon, where a set of comparable models yield inconsistent predictions. Roth, Tolbert, and Weinstein recently introduced a reconciliation procedure, the Reconcile algorithm, to address this problem. Given two disagreeing models, the algorithm leverages their disagreement to falsify and improve at least one of the models. In this paper, we empirically analyze the Reconcile algorithm using five widely-used fairness datasets: COMPAS, Communities and Crime, Adult, Statlog (German Credit Data), and the ACS Dataset. We examine how Reconcile fits within the model multiplicity literature and compare it to existing MM solutions, demonstrating its effectiveness. We also discuss potential improvements to the Reconcile algorithm theoretically and practically. Finally, we extend the Reconcile algorithm to the setting of causal inference, given that different competing estimators can again disagree on specific causal average treatment effect (CATE) values. We present the first extension of the Reconcile algorithm in causal inference, analyze its theoretical properties, and conduct empirical tests. Our results confirm the practical effectiveness of Reconcile and its applicability across various domains.

[LG-43] Optimizing Decentralized Online Learning for Supervised Regression and Classification Problems

链接: https://arxiv.org/abs/2501.16519
作者: J. M. Diederik Kruijssen(1),Renata Valieva(1),Steven N. Longmore(1,2) ((1) Allora Foundation, (2) Liverpool John Moores University)
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注: 14 pages, 6 figures, 2 tables; appeared in ADI (January 2025)

点击查看摘要

Abstract:Decentralized learning networks aim to synthesize a single network inference from a set of raw inferences provided by multiple participants. To determine the combined inference, these networks must adopt a mapping from historical participant performance to weights, and to appropriately incentivize contributions they must adopt a mapping from performance to fair rewards. Despite the increased prevalence of decentralized learning networks, there exists no systematic study that performs a calibration of the associated free parameters. Here we present an optimization framework for key parameters governing decentralized online learning in supervised regression and classification problems. These parameters include the slope of the mapping between historical performance and participant weight, the timeframe for performance evaluation, and the slope of the mapping between performance and rewards. These parameters are optimized using a suite of numerical experiments that mimic the design of the Allora Network, but have been extended to handle classification tasks in addition to regression tasks. This setup enables a comparative analysis of parameter tuning and network performance optimization (loss minimization) across both problem types. We demonstrate how the optimal performance-weight mapping, performance timeframe, and performance-reward mapping vary with network composition and problem type. Our findings provide valuable insights for the optimization of decentralized learning protocols, and we discuss how these results can be generalized to optimize any inference synthesis-based, decentralized AI network.

[LG-44] Open Problems in Mechanistic Interpretability

链接: https://arxiv.org/abs/2501.16496
作者: Lee Sharkey,Bilal Chughtai,Joshua Batson,Jack Lindsey,Jeff Wu,Lucius Bushnaq,Nicholas Goldowsky-Dill,Stefan Heimersheim,Alejandro Ortega,Joseph Bloom,Stella Biderman,Adria Garriga-Alonso,Arthur Conmy,Neel Nanda,Jessica Rumbelow,Martin Wattenberg,Nandi Schoots,Joseph Miller,Eric J. Michaud,Stephen Casper,Max Tegmark,William Saunders,David Bau,Eric Todd,Atticus Geiger,Mor Geva,Jesse Hoogland,Daniel Murfet,Tom McGrath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

[LG-45] Modular Framework for Uncertainty Prediction in Autonomous Vehicle Motion Forecasting within Complex Traffic Scenarios

链接: https://arxiv.org/abs/2501.16480
作者: Han Wang,Yuneil Yeo,Antonio R. Paiva,Jean Utke,Maria Laura Delle Monache
类目: Robotics (cs.RO); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We propose a modular modeling framework designed to enhance the capture and validation of uncertainty in autonomous vehicle (AV) trajectory prediction. Departing from traditional deterministic methods, our approach employs a flexible, end-to-end differentiable probabilistic encoder-decoder architecture. This modular design allows the encoder and decoder to be trained independently, enabling seamless adaptation to diverse traffic scenarios without retraining the entire system. Our key contributions include: (1) a probabilistic heatmap predictor that generates context-aware occupancy grids for dynamic forecasting, (2) a modular training approach that supports independent component training and flexible adaptation, and (3) a structured validation scheme leveraging uncertainty metrics to evaluate robustness under high-risk conditions. To highlight the benefits of our framework, we benchmark it against an end-to-end baseline, demonstrating faster convergence, improved stability, and flexibility. Experimental results validate these advantages, showcasing the capacity of the framework to efficiently handle complex scenarios while ensuring reliable predictions and robust uncertainty representation. This modular design offers significant practical utility and scalability for real-world autonomous driving applications.

[LG-46] Closed-Form Feedback-Free Learning with Forward Projection

链接: https://arxiv.org/abs/2501.16476
作者: Robert O’Shea,Bipin Rajendran
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 5 figures

点击查看摘要

Abstract:State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.
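下面给出对摘要思路的一个简化示意(并非论文的精确公式;随机投影与岭回归的具体形式均为假设):每层的目标膜电位由前突触输入与标签的非线性随机投影生成,权重用闭式岭回归一次求出,全程不需要反向的误差反馈:

```python
import numpy as np

def fit_fp_layer(X, Y_onehot, d_out, lam=1e-2, seed=0):
    """One layer of feedback-free closed-form fitting (simplified sketch).

    Targets for pre-activations come from a nonlinear random projection of the
    pre-synaptic inputs and labels; weights come from closed-form ridge regression.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1] + Y_onehot.shape[1], d_out))
    T = np.tanh(np.hstack([X, Y_onehot]) @ R)          # layer-wise target potentials
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ T)
    return W, np.tanh(X @ W)                           # weights and layer outputs

# Stacking: feed each layer's output into the next; a final closed-form readout
# regresses the last hidden layer onto the labels.
```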

[LG-47] CoCoNUT: Structural Code Understanding does not fall out of a tree

链接: https://arxiv.org/abs/2501.16456
作者: Claas Beger,Saikat Dutta
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted at 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code)

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to or even surpass humans. In this work, we demonstrate that high performance on such benchmarks does not correlate to humans’ innate ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, which the relevant models perform strongly on, and trace their execution path using function calls sampled from the respective test set. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match the execution trace and find that, despite their ability to generate semantically identical code, they possess limited ability to trace execution paths, especially for longer traces and specific control structures. We find that even the top-performing model, Gemini, can fully and correctly generate only 47% of HumanEval task traces. Additionally, we introduce a subset for three key structures not contained in HumanEval: Recursion, Parallel Processing, and Object-Oriented Programming, including concepts like Inheritance and Polymorphism. Besides OOP, we show that none of the investigated models achieve an accuracy over 5% on the relevant traces. Aggregating these specialized parts with HumanEval tasks, we present Benchmark CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a model’s ability to trace execution of code upon relevant calls, including advanced structural components. We conclude that current LLMs need significant improvement to enhance code reasoning abilities. We hope our dataset helps researchers bridge this gap.

[LG-48] Detecting clinician implicit biases in diagnoses using proximal causal inference

链接: https://arxiv.org/abs/2501.16399
作者: Kara Liu,Russ Altman,Vasilis Syrgkanis
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: The ~64 pages of the appendix IS UNPUBLISHED and novel content

点击查看摘要

Abstract:Clinical decisions to treat and diagnose patients are affected by implicit biases formed by racism, ableism, sexism, and other stereotypes. These biases reflect broader systemic discrimination in healthcare and risk marginalizing already disadvantaged groups. Existing methods for measuring implicit biases require controlled randomized testing and only capture individual attitudes rather than outcomes. However, the “big-data” revolution has led to the availability of large observational medical datasets, like EHRs and biobanks, that provide the opportunity to investigate discrepancies in patient health outcomes. In this work, we propose a causal inference approach to detect the effect of clinician implicit biases on patient outcomes in large-scale medical data. Specifically, our method uses proximal mediation to disentangle pathway-specific effects of a patient’s sociodemographic attribute on a clinician’s diagnosis decision. We test our method on real-world data from the UK Biobank. Our work can serve as a tool that initiates conversation and brings awareness to unequal health outcomes caused by implicit biases.

[LG-49] Visualizing the Local Atomic Environment Features of Machine Learning Interatomic Potential

链接: https://arxiv.org/abs/2501.16398
作者: Xuqiang Shao,Yuqi Zhang,Di Zhang,Tianxiang Gao,Xinyuan Liu,Zhiran Gan,Fanshun Meng,Hao Li,Weijie Yang
类目: Machine Learning (cs.LG); Atomic Physics (physics.atom-ph)
*备注:

点击查看摘要

Abstract:This paper addresses the challenges of creating efficient and high-quality datasets for machine learning potential functions. We present a novel approach, termed DV-LAE (Difference Vectors based on Local Atomic Environments), which utilizes the properties of atomic local environments and employs histogram statistics to generate difference vectors. This technique facilitates dataset screening and optimization, effectively minimizing redundancy while maintaining data diversity. We have validated the optimized datasets in high-temperature and high-pressure hydrogen systems as well as the α-Fe/H binary system, demonstrating a significant reduction in computational resource usage without compromising prediction accuracy. Additionally, our method has revealed new structures that emerge during simulations but were underrepresented in the initial training datasets. The redundancy in the datasets and the distribution of these new structures can be visually analyzed through the visualization of difference vectors. This approach enhances our understanding of the characteristics of these newly formed structures and their impact on physical processes.

[LG-50] THOR: A Generic Energy Estimation Approach for On-Device Training

链接: https://arxiv.org/abs/2501.16397
作者: Jiaru Zhang,Zesong Wang,Hao Wang,Tao Song,Huai-an Su,Rui Chen,Yang Hua,Xiangwei Zhou,Ruhui Ma,Miao Pan,Haibing Guan
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Battery-powered mobile devices (e.g., smartphones, AR/VR glasses, and various IoT devices) are increasingly being used for AI training due to their growing computational power and easy access to valuable, diverse, and real-time data. On-device training is highly energy-intensive, making accurate energy consumption estimation crucial for effective job scheduling and sustainable AI. However, the heterogeneity of devices and the complexity of models challenge the accuracy and generalizability of existing estimation methods. This paper proposes THOR, a generic approach for energy consumption estimation in deep neural network (DNN) training. First, we examine the layer-wise energy additivity property of DNNs and strategically partition the entire model into layers for fine-grained energy consumption profiling. Then, we fit Gaussian Process (GP) models to learn from layer-wise energy consumption measurements and estimate a DNN’s overall energy consumption based on its layer-wise energy additivity property. We conduct extensive experiments with various types of models across different real-world platforms. The results demonstrate that THOR has effectively reduced the Mean Absolute Percentage Error (MAPE) by up to 30%. Moreover, THOR is applied in guiding energy-aware pruning, successfully reducing energy consumption by 50%, thereby further demonstrating its generality and potential.
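下面给出一个示意性草图(逐层特征与能耗数据均为假设的占位数据),演示"逐层拟合高斯过程 + 按层可加性汇总整模型能耗"的思路:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical layer-wise profiling data: per-layer features (e.g., FLOPs,
# parameter count) against measured per-layer energy in joules.
rng = np.random.default_rng(0)
feats = rng.random((40, 2))
energy = 5.0 * feats[:, 0] + 2.0 * feats[:, 1] + 0.05 * rng.normal(size=40)

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(feats, energy)

new_model_layers = rng.random((12, 2))            # one feature row per layer
per_layer, std = gp.predict(new_model_layers, return_std=True)
total_energy = per_layer.sum()                    # layer-wise energy additivity
```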

[LG-51] TopoNets: High Performing Vision and Language Models with Brain-Like Topography

链接: https://arxiv.org/abs/2501.16396
作者: Mayukh Deb,Mainak Deb,N. Apurva Ratan Murty
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注: 10 pages

点击查看摘要

Abstract:Neurons in the brain are organized such that nearby cells tend to share similar functions. AI models lack this organization, and past efforts to introduce topography have often led to trade-offs between topography and task performance. In this work, we present TopoLoss, a new loss function that promotes spatially organized topographic representations in AI models without significantly sacrificing task performance. TopoLoss is highly adaptable and can be seamlessly integrated into the training of leading model architectures. We validate our method on both vision (ResNet-18, ResNet-50, ViT) and language models (GPT-Neo-125M, NanoGPT), collectively TopoNets. TopoNets are the highest-performing supervised topographic models to date, exhibiting brain-like properties such as localized feature processing, lower dimensionality, and increased efficiency. TopoNets also predict responses in the brain and replicate the key topographic signatures observed in the brain’s visual and language cortices. Together, this work establishes a robust and generalizable framework for integrating topography into leading model architectures, advancing the development of high-performing models that more closely emulate the computational strategies of the human brain.

[LG-52] Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment

链接: https://arxiv.org/abs/2501.16394
作者: Lumen AI,Tengzhou No. 1 Middle School,Shihao Ji,Zihui Song,Fucheng Zhong,Jisen Jia,Zhaobo Wu,Zheyi Cao,Xu Tianhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Addressing the resource waste caused by fixed computation paradigms in deep learning models under dynamic scenarios, this paper proposes a Transformer^-1 architecture based on the principle of deep adaptivity. This architecture achieves dynamic matching between input features and computational resources by establishing a joint optimization model for complexity and computation. Our core contributions include: (1) designing a two-layer control mechanism, composed of a complexity predictor and a reinforcement learning policy network, enabling end-to-end optimization of computation paths; (2) deriving a lower bound theory for dynamic computation, proving the system’s theoretical reach to optimal efficiency; and (3) proposing a layer folding technique and a CUDA Graph pre-compilation scheme, overcoming the engineering bottlenecks of dynamic architectures. In the ImageNet-1K benchmark test, our method reduces FLOPs by 42.7% and peak memory usage by 34.1% compared to the standard Transformer, while maintaining comparable accuracy (±0.3%). Furthermore, we conducted practical deployment on the Jetson AGX Xavier platform, verifying the effectiveness and practical value of this method in resource-constrained environments. To further validate the generality of the method, we also conducted experiments on several natural language processing tasks and achieved significant improvements in resource efficiency.

[LG-53] Improving Network Threat Detection by Knowledge Graph Large Language Model and Imbalanced Learning AAAI

链接: https://arxiv.org/abs/2501.16393
作者: Lili Zhang,Quanyan Zhu,Herman Ray,Ying Xie
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: Accepted by “Combining AI and OR/MS for Better Trustworthy Decision Making” Bridge Program co-organized by AAAI and INFORMS as poster and demo

点击查看摘要

Abstract:Network threat detection has been challenging due to the complexity of attack activities and the limited historical threat data to learn from. To help enhance existing practices of using analytics, machine learning, and artificial intelligence methods to detect network threats, we propose an integrated modelling framework, where a Knowledge Graph is used to analyze the users’ activity patterns, Imbalanced Learning techniques are used to prune and weigh the Knowledge Graph, and an LLM is used to retrieve and interpret the users’ activities from the Knowledge Graph. The proposed framework is applied to Agile Threat Detection through Online Sequential Learning. Preliminary results show an improved threat capture rate of 3%-4% and increased interpretability of risk predictions based on the users’ activities.

[LG-54] HMCGeo: IP Region Prediction Based on Hierarchical Multi-label Classification

链接: https://arxiv.org/abs/2501.16392
作者: Tianzi Zhao,Xinran Liu,Zhaoxin Zhang,Dong Zhao,Ning Li,Zhichao Zhang,Xinye Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-grained IP geolocation plays a critical role in applications such as location-based services and cybersecurity. Most existing fine-grained IP geolocation methods are regression-based; however, due to noise in the input data, these methods typically encounter kilometer-level prediction errors and provide incorrect region information for users. To address this issue, this paper proposes a novel hierarchical multi-label classification framework for IP region prediction, named HMCGeo. This framework treats IP geolocation as a hierarchical multi-label classification problem and employs residual connection-based feature extraction and attention prediction units to predict the target host region across multiple geographical granularities. Furthermore, we introduce probabilistic classification loss during training, combining it with hierarchical cross-entropy loss to form a composite loss function. This approach optimizes predictions by utilizing hierarchical constraints between regions at different granularities. IP region prediction experiments on the New York, Los Angeles, and Shanghai datasets demonstrate that HMCGeo achieves superior performance across all geographical granularities, significantly outperforming existing IP geolocation methods.
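
The composite training objective described above — one classification term per geographic granularity, combined with weights — can be sketched as follows. This is an illustrative skeleton under assumed label hierarchies; the paper's probabilistic classification loss and the inter-level hierarchical constraints are omitted.

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(logits_by_level, targets_by_level, weights=None):
    """Toy composite loss for hierarchical region prediction: a standard
    cross-entropy term per granularity (e.g. city -> district -> block).
    Skeleton only; not HMCGeo's exact formulation."""
    weights = weights or [1.0] * len(logits_by_level)
    loss = torch.zeros(())
    for w, logits, target in zip(weights, logits_by_level, targets_by_level):
        loss = loss + w * F.cross_entropy(logits, target)
    return loss

# usage with three granularities and a batch of 4 samples
logits = [torch.randn(4, n) for n in (10, 50, 200)]
targets = [torch.randint(0, n, (4,)) for n in (10, 50, 200)]
print(hierarchical_loss(logits, targets))
```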

[LG-55] Development and Validation of a Dynamic Kidney Failure Prediction Model based on Deep Learning: A Real-World Study with External Validation

链接: https://arxiv.org/abs/2501.16388
作者: Jingying Ma,Jinwei Wang,Lanlan Lu,Yexiang Sun,Mengling Feng,Peng Shen,Zhiqin Jiang,Shenda Hong,Luxia Zhang
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Background: Chronic kidney disease (CKD), a progressive disease with high morbidity and mortality, has become a significant global public health problem. At present, most of the models used for predicting the progression of CKD are static models. We aim to develop a dynamic kidney failure prediction model based on deep learning (KFDeep) for CKD patients, utilizing all available data on common clinical indicators from real-world Electronic Health Records (EHRs) to provide real-time predictions. Findings: A retrospective cohort of 4,587 patients from EHRs of Yinzhou, China, is used as the development dataset (2,752 patients for training, 917 patients for validation) and internal validation dataset (917 patients), while a prospective cohort of 934 patients from the Peking University First Hospital CKD cohort (PKUFH cohort) is used as the external validation dataset. The AUROC of the KFDeep model reaches 0.946 (95% CI: 0.922-0.970) on the internal validation dataset and 0.805 (95% CI: 0.763-0.847) on the external validation dataset, both surpassing existing models. The KFDeep model demonstrates stable performance in simulated dynamic scenarios, with the AUROC progressively increasing over time. Both the calibration curve and decision curve analyses confirm that the model is unbiased and safe for practical use, while the SHAP analysis and hidden layer clustering results align with established medical knowledge. Interpretation: The KFDeep model built from real-world EHRs enhances the prediction accuracy of kidney failure without increasing clinical examination costs and can be easily integrated into existing hospital systems, providing physicians with a continuously updated decision-support tool due to its dynamic design.

[LG-56] Reduced-order modeling and classification of hydrodynamic pattern formation in gravure printing

链接: https://arxiv.org/abs/2501.16381
作者: Pauline Rothmann-Brumm,Steven L. Brunton,Isabel Scherl
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Hydrodynamic pattern formation phenomena in printing and coating processes are still not fully understood. However, fundamental understanding is essential to achieve high-quality printed products and to tune printed patterns according to the needs of a specific application like printed electronics, graphical printing, or biomedical printing. The aim of the paper is to develop an automated pattern classification algorithm based on methods from supervised machine learning and reduced-order modeling. We use the HYPA-p dataset, a large image dataset of gravure-printed images, which shows various types of hydrodynamic pattern formation phenomena. It enables the correlation of printing process parameters and resulting printed patterns for the first time. 26880 images of the HYPA-p dataset have been labeled by a human observer as dot patterns, mixed patterns, or finger patterns; 864000 images (97%) are unlabeled. A singular value decomposition (SVD) is used to find the modes of the labeled images and to reduce the dimensionality of the full dataset by truncation and projection. Selected machine learning classification techniques are trained on the reduced-order data. We investigate the effect of several factors, including classifier choice, whether or not fast Fourier transform (FFT) is used to preprocess the labeled images, data balancing, and data normalization. The best performing model is a k-nearest neighbor (kNN) classifier trained on unbalanced, FFT-transformed data with a test error of 3%, which outperforms a human observer by 7%. Data balancing slightly increases the test error of the kNN-model to 5%, but also increases the recall of the mixed class from 90% to 94%. Finally, we demonstrate how the trained models can be used to predict the pattern class of unlabeled images and how the predictions can be correlated to the printing process parameters, in the form of regime maps.
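
The reduced-order classification pipeline described above (FFT features, SVD truncation, kNN) maps naturally onto scikit-learn. The sketch below uses random placeholder images and illustrative hyperparameters rather than the HYPA-p data, so the numbers it prints are meaningless; it only shows the shape of the workflow.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Toy stand-in for the workflow: |FFT| of each image as features,
# SVD truncation for dimensionality reduction, then a kNN classifier.
rng = np.random.default_rng(0)
images = rng.random((500, 32, 32))              # placeholder image stack
labels = rng.integers(0, 3, size=500)           # dot / mixed / finger

features = np.abs(np.fft.fft2(images)).reshape(len(images), -1)
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)

model = make_pipeline(TruncatedSVD(n_components=50, random_state=0),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print("test error:", 1 - model.score(X_te, y_te))
```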

[LG-57] A novel Trunk Branch-net PINN for flow and heat transfer prediction in porous medium

链接: https://arxiv.org/abs/2501.16362
作者: Haoyun Xing,Kaiyan Jin,Guice Yao,Jin Zhao,Dichu Xu,Dongsheng Wen
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 26 pages, 17 figures,

点击查看摘要

Abstract:A novel Trunk-Branch (TB)-net physics-informed neural network (PINN) architecture is developed, which is a PINN-based method incorporating trunk and branch nets to capture both global and local features. The aim is to solve four main classes of problems: the forward flow problem, the forward heat transfer problem, the inverse heat transfer problem, and the transfer learning problem within a porous medium, which are notoriously complex and could not be handled by the original PINN. In the proposed TB-net PINN architecture, a Fully-connected Neural Network (FNN) is used as the trunk net, followed by separate FNNs as the branch nets for the respective outputs, and automatic differentiation is performed for partial derivatives of outputs with respect to inputs when computing the various physics losses. The effectiveness and flexibility of the novel TB-net PINN architecture are demonstrated through a collection of forward problems, and transfer learning validates the feasibility of resource reuse. Combined with its superiority over traditional numerical methods in solving inverse problems, the proposed TB-net PINN shows great potential for practical engineering applications.
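
A minimal sketch of the trunk-branch layout, assuming a shared trunk FNN feeding one branch FNN per output field, with autograd supplying the derivatives needed for physics residuals. Field names, widths, and activations are placeholders; the physics losses themselves are omitted.

```python
import torch
import torch.nn as nn

class TBNet(nn.Module):
    """Toy trunk-branch layout: a shared trunk FNN feeds separate branch
    FNNs, one per output field (e.g. velocity, pressure, temperature).
    Purely illustrative; physics losses are not shown."""

    def __init__(self, in_dim=2, width=64, out_names=("u", "v", "p", "T")):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, width), nn.Tanh(),
                                   nn.Linear(width, width), nn.Tanh())
        self.branches = nn.ModuleDict(
            {name: nn.Sequential(nn.Linear(width, width), nn.Tanh(),
                                 nn.Linear(width, 1)) for name in out_names})

    def forward(self, xy):
        h = self.trunk(xy)
        return {name: branch(h) for name, branch in self.branches.items()}

xy = torch.rand(100, 2, requires_grad=True)   # collocation points
outs = TBNet()(xy)
# derivatives for physics residuals come from automatic differentiation:
du_dxy = torch.autograd.grad(outs["u"].sum(), xy, create_graph=True)[0]
print(du_dxy.shape)                            # torch.Size([100, 2])
```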

[LG-58] The OpenLAM Challenges

链接: https://arxiv.org/abs/2501.16358
作者: Anyang Peng,Xinzijian Liu,Ming-Yu Guo,Linfeng Zhang,Han Wang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Inspired by the success of Large Language Models (LLMs), the development of Large Atom Models (LAMs) has gained significant momentum in scientific computation. Since 2022, the Deep Potential team has been actively pretraining LAMs and launched the OpenLAM Initiative to develop an open-source foundation model spanning the periodic table. A core objective is establishing comprehensive benchmarks for reliable LAM evaluation, addressing limitations in existing datasets. As a first step, the LAM Crystal Philately competition has collected over 19.8 million valid structures, including 1 million on the OpenLAM convex hull, driving advancements in generative modeling and materials science applications.

[LG-59] Convergence of two-timescale gradient descent ascent dynamics: finite-dimensional and mean-field perspectives

链接: https://arxiv.org/abs/2501.17122
作者: Jing An,Jianfeng Lu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:The two-timescale gradient descent-ascent (GDA) is a canonical gradient algorithm designed to find Nash equilibria in min-max games. We analyze the two-timescale GDA by investigating the effects of learning rate ratios on convergence behavior in both finite-dimensional and mean-field settings. In particular, for finite-dimensional quadratic min-max games, we obtain long-time convergence in near quasi-static regimes through the hypocoercivity method. For mean-field GDA dynamics, we investigate convergence under a finite-scale ratio using a mixed synchronous-reflection coupling technique.
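
For a quadratic min-max game, the two-timescale update is simply gradient descent on x with a slow step and gradient ascent on y with a fast step; the sketch below illustrates the learning-rate ratio the paper analyzes. The toy game and step sizes are chosen for illustration only.

```python
import numpy as np

# Two-timescale GDA on a toy quadratic min-max game
#   min_x max_y f(x, y) = 0.5*x**2 + x*y - 0.5*y**2,
# whose unique Nash equilibrium is (0, 0). The ascent player uses the
# faster timescale; tau is the learning rate ratio.
eta_y = 0.05
tau = 10.0
eta_x = eta_y / tau          # slow descent step for the min player
x, y = 1.0, 1.0
for _ in range(20000):
    gx = x + y               # df/dx
    gy = x - y               # df/dy
    x, y = x - eta_x * gx, y + eta_y * gy
print(x, y)                  # -> approximately (0.0, 0.0)
```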

[LG-60] Generative diffusion models from a PDE perspective

链接: https://arxiv.org/abs/2501.17054
作者: Fei Cao(1),Kimball Johnston(2),Thomas Laurent(3),Justin Le(2),Sébastien Motsch(2) ((1) University of Massachusetts Amherst, (2) Arizona State University, (3) Loyola Marymount University)
类目: Probability (math.PR); Machine Learning (cs.LG)
*备注: 30 pages, 10 figures

点击查看摘要

Abstract:Diffusion models have become the de facto framework for generating new datasets. The core of these models lies in the ability to reverse a diffusion process in time. The goal of this manuscript is to explain, from a PDE perspective, how this method works and how to derive the PDE governing the reverse dynamics as well as to study its solution analytically. By linking forward and reverse dynamics, we show that the reverse process’s distribution has its support contained within the original distribution. Consequently, diffusion methods, in their analytical formulation, do not inherently regularize the original distribution, and thus, there is no generalization principle. This raises a question: where does generalization arise, given that in practice it does occur? Moreover, we derive an explicit solution to the reverse process’s SDE under the assumption that the starting point of the forward process is fixed. This provides a new derivation that links two popular approaches to generative diffusion models: stable diffusion (discrete dynamics) and the score-based approach (continuous dynamics). Finally, we explore the case where the original distribution consists of a finite set of data points. In this scenario, the reverse dynamics are explicit (i.e., the loss function has a clear minimizer), and solving the dynamics fails to generate new samples: the dynamics converge to the original samples. In a sense, solving the minimization problem exactly is “too good for its own good” (i.e., an overfitting regime).
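
For reference, the forward/reverse pair underlying this analysis is the standard one (Anderson's time reversal), written here for a general drift f and diffusion coefficient g; the paper studies the corresponding PDEs and their analytical solutions.

```latex
\begin{align*}
  \mathrm{d}X_t &= f(X_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t
      && \text{(forward noising)} \\
  \mathrm{d}\bar X_t &= \bigl[f(\bar X_t, t)
      - g(t)^2 \,\nabla_x \log p_t(\bar X_t)\bigr]\,\mathrm{d}t
      + g(t)\,\mathrm{d}\bar W_t
      && \text{(reverse denoising)}
\end{align*}
```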

[LG-61] Hellinger-Kantorovich Gradient Flows: Global Exponential Decay of Entropy Functionals

链接: https://arxiv.org/abs/2501.17049
作者: Alexander Mielke,Jia-Jie Zhu
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We investigate a family of gradient flows of positive and probability measures, focusing on the Hellinger-Kantorovich (HK) geometry, which unifies the transport mechanism of Otto-Wasserstein and the birth-death mechanism of Hellinger (or Fisher-Rao). A central contribution is a complete characterization of global exponential decay behaviors of entropy functionals (e.g., KL, χ²) under Otto-Wasserstein and Hellinger-type gradient flows. In particular, for the more challenging analysis of HK gradient flows on positive measures – where the typical log-Sobolev arguments fail – we develop a specialized shape-mass decomposition that enables new analysis results. Our approach also leverages the (Polyak-)Łojasiewicz-type functional inequalities and a careful extension of classical dissipation estimates. These findings provide a unified and complete theoretical framework for gradient flows and underpin applications in computational algorithms for statistical inference, optimization, and machine learning.

[LG-62] Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect

链接: https://arxiv.org/abs/2501.16988
作者: Mohammad Kaviul Anam Khan,Olli Saarela,Rafal Kustra
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Interpreting black-box machine learning models is challenging due to their strong dependence on data and inherently non-parametric nature. This paper reintroduces the concept of importance through the “Marginal Variable Importance Metric” (MVIM), a model-agnostic measure of predictor importance based on the true conditional expectation function. MVIM evaluates predictors’ influence on continuous or discrete outcomes. A permutation-based estimation approach, inspired by Breiman (2001) and Fisher et al. (2019), is proposed to estimate MVIM. The MVIM estimator is biased when predictors are highly correlated, as black-box models struggle to extrapolate in low-probability regions. To address this, we investigated the bias-variance decomposition of MVIM to understand the source and pattern of the bias under high correlation. A Conditional Variable Importance Metric (CVIM), adapted from Strobl et al. (2008), is introduced to reduce this bias. Both MVIM and CVIM exhibit a quadratic relationship with the conditional average treatment effect (CATE).
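
The permutation-based estimator referenced above follows the classic Breiman/Fisher recipe: shuffle one predictor column and measure the resulting increase in loss. Below is a simplified stand-in in Python, not the authors' exact MVIM estimator.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_importance(model, X, y, n_repeats=10, rng=None):
    """Permutation-style marginal importance: the importance of feature j
    is the average increase in MSE after shuffling column j."""
    rng = rng or np.random.default_rng(0)
    base = np.mean((model.predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])                 # break X_j's association
            scores[j] += np.mean((model.predict(Xp) - y) ** 2) - base
    return scores / n_repeats

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=400)
model = RandomForestRegressor(random_state=0).fit(X, y)
print(permutation_importance(model, X, y))  # feature 0 should dominate
```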

[LG-63] Excited-state nonadiabatic dynamics in explicit solvent using machine learned interatomic potentials

链接: https://arxiv.org/abs/2501.16974
作者: Maximilian X. Tiefenbacher,Brigitta Bachmair,Cheng Giuseppe Chen,Julia Westermayr,Philipp Marquetand,Johannes C. B. Dietschreit,Leticia González
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Excited-state nonadiabatic simulations with quantum mechanics/molecular mechanics (QM/MM) are essential to understand photoinduced processes in explicit environments. However, the high computational cost of the underlying quantum chemical calculations limits its application in combination with trajectory surface hopping methods. Here, we use FieldSchNet, a machine-learned interatomic potential capable of incorporating electric field effects into the electronic states, to replace traditional QM/MM electrostatic embedding with its ML/MM counterpart for nonadiabatic excited state trajectories. The developed method is applied to furan in water, including five coupled singlet states. Our results demonstrate that with sufficiently curated training data, the ML/MM model reproduces the electronic kinetics and structural rearrangements of QM/MM surface hopping reference simulations. Furthermore, we identify performance metrics that provide robust and interpretable validation of model accuracy.

[LG-64] Empirical modeling and hybrid machine learning framework for nucleate pool boiling on microchannel structured surfaces

链接: https://arxiv.org/abs/2501.16867
作者: Vijay Kuberan,Sateesh Gedupudi
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Micro-structured surfaces influence nucleation characteristics and bubble dynamics besides increasing the heat transfer surface area, thus enabling efficient nucleate boiling heat transfer. Modeling the pool boiling heat transfer characteristics of these surfaces under varied conditions is essential in diverse applications. A new empirical correlation for nucleate boiling on microchannel structured surfaces has been proposed with the data collected from various experiments in previous studies since the existing correlations are limited by their accuracy and narrow operating ranges. This study also examines various Machine Learning (ML) algorithms and Deep Neural Networks (DNN) on the microchannel structured surfaces dataset to predict the nucleate pool boiling Heat Transfer Coefficient (HTC). With the aim to integrate both the ML and domain knowledge, a Physics-Informed Machine Learning Aided Framework (PIMLAF) is proposed. The proposed correlation in this study is employed as the prior physics-based model for PIMLAF, and a DNN is employed to model the residuals of the prior model. This hybrid framework achieved the best performance in comparison to the other ML models and DNNs. This framework is able to generalize well for different datasets because the proposed correlation provides the baseline knowledge of the boiling behavior. Also, SHAP interpretation analysis identifies the critical parameters impacting the model predictions and their effect on HTC prediction. This analysis further makes the model more robust and reliable. Keywords: Pool boiling, Microchannels, Heat transfer coefficient, Correlation analysis, Machine learning, Deep neural network, Physics-informed machine learning aided framework, SHAP analysis
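
The hybrid prior-plus-residual pattern can be sketched directly: a closed-form correlation gives a baseline HTC estimate, and a small network fits only the residual. The correlation and the data below are made up for illustration; the paper's proposed correlation is different.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hybrid "physics prior + learned residual" pattern: a closed-form
# correlation supplies a baseline estimate, a small network learns only
# the residual. Placeholder correlation and synthetic data throughout.
def prior_correlation(X):
    heat_flux, channel_width = X[:, 0], X[:, 1]
    return 3.0 * heat_flux ** 0.7 / np.sqrt(channel_width)

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(300, 2))
htc_true = prior_correlation(X) * (1 + 0.1 * np.sin(X[:, 0] * 5))  # synthetic

residual = htc_true - prior_correlation(X)
res_model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                         random_state=0).fit(X, residual)

htc_pred = prior_correlation(X) + res_model.predict(X)
print("RMSE:", np.sqrt(np.mean((htc_pred - htc_true) ** 2)))
```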

[LG-65] Optimization and Learning in Open Multi-Agent Systems

链接: https://arxiv.org/abs/2501.16847
作者: Diego Deplano,Nicola Bastianello,Mauro Franceschelli,Karl H. Johansson
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Modern artificial intelligence relies on networks of agents that collect data, process information, and exchange it with neighbors to collaboratively solve optimization and learning problems. This article introduces a novel distributed algorithm to address a broad class of these problems in “open networks”, where the number of participating agents may vary due to several factors, such as autonomous decisions, heterogeneous resource availability, or DoS attacks. Extending the current literature, the convergence analysis of the proposed algorithm is based on the newly developed “Theory of Open Operators”, which characterizes an operator as open when the set of components to be updated changes over time, yielding time-varying operators acting on sequences of points of different dimensions and compositions. The mathematical tools and convergence results developed here provide a general framework for evaluating distributed algorithms in open networks, allowing their performance to be characterized in terms of the pointwise distance from the optimal solution, in contrast with regret-based metrics that assess cumulative performance over a finite-time horizon. As illustrative examples, the proposed algorithm is used to solve dynamic consensus or tracking problems on different metrics of interest, such as average, median, and min/max value, as well as classification problems with logistic loss functions.

[LG-66] Exponential Family Attention

链接: https://arxiv.org/abs/2501.16790
作者: Kevin Christian Wibisono,Yixin Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 47 pages

点击查看摘要

Abstract:The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial, or spatial-temporal data of mixed data types, including both discrete and continuous observations. The key idea of EFA is to model each observation conditional on all other existing observations, called the context, whose relevance is learned in a data-driven way via an attention-based latent factor model. In particular, unlike static latent embeddings, EFA uses the self-attention mechanism to capture dynamic interactions in the context, where the relevance of each context observation depends on the other observations. We establish an identifiability result and provide a generalization guarantee on excess loss for EFA. Across real-world and synthetic data sets – including U.S. city temperatures, Instacart shopping baskets, and MovieLens ratings – we find that EFA consistently outperforms existing models in capturing complex latent structures and reconstructing held-out data.

[LG-67] Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis

链接: https://arxiv.org/abs/2501.16768
作者: Wen Wen,Tieliang Gong,Yuxin Dong,Shujian Yu,Weizhan Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiview learning has drawn widespread attention for its efficacy in leveraging cross-view consensus and complementarity information to achieve a comprehensive representation of data. While multi-view learning has undergone vigorous development and achieved remarkable success, the theoretical understanding of its generalization behavior remains elusive. This paper aims to bridge this gap by developing information-theoretic generalization bounds for multi-view learning, with a particular focus on multi-view reconstruction and classification tasks. Our bounds underscore the importance of capturing both consensus and complementary information from multiple different views to achieve maximally disentangled representations. These results also indicate that applying the multi-view information bottleneck regularizer is beneficial for satisfactory generalization performance. Additionally, we derive novel data-dependent bounds under both leave-one-out and supersample settings, yielding computationally tractable and tighter bounds. In the interpolating regime, we further establish the fast-rate bound for multi-view learning, exhibiting a faster convergence rate compared to conventional square-root bounds. Numerical results indicate a strong correlation between the true generalization gap and the derived bounds across various learning scenarios.

[LG-68] Variational Schrödinger Momentum Diffusion AISTATS25

链接: https://arxiv.org/abs/2501.16675
作者: Kevin Rojas,Yixin Tan,Molei Tao,Yuriy Nevmyvaka,Wei Deng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: AISTATS 25

点击查看摘要

Abstract:The momentum Schrödinger Bridge (mSB) has emerged as a leading method for accelerating generative diffusion processes and reducing transport costs. However, the lack of simulation-free properties inevitably results in high training costs and affects scalability. To obtain a trade-off between transport properties and scalability, we introduce variational Schrödinger momentum diffusion (VSMD), which employs linearized forward score functions (variational scores) to eliminate the dependence on simulated forward trajectories. Our approach leverages a multivariate diffusion process with adaptively transport-optimized variational scores. Additionally, we apply a critical-damping transform to stabilize training by removing the need for score estimations for both velocity and samples. Theoretically, we prove the convergence of samples generated with optimal variational scores and momentum diffusion. Empirical results demonstrate that VSMD efficiently generates anisotropic shapes while maintaining transport efficacy, outperforming overdamped alternatives, and avoiding complex denoising processes. Our approach also scales effectively to real-world data, achieving competitive results in time series and image generation.

[LG-69] FlowDAS: A Flow-Based Framework for Data Assimilation

链接: https://arxiv.org/abs/2501.16642
作者: Siyi Chen,Yixuan Jia,Qing Qu,He Sun,Jeffrey A Fessler
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Data assimilation (DA) is crucial for improving the accuracy of state estimation in complex dynamical systems by integrating observational data with physical models. Traditional solutions rely on either pure model-driven approaches, such as Bayesian filters that struggle with nonlinearity, or data-driven methods using deep learning priors, which often lack generalizability and physical interpretability. Recently, score-based DA methods have been introduced, focusing on learning prior distributions but neglecting explicit state transition dynamics, leading to limited accuracy improvements. To tackle the challenge, we introduce FlowDAS, a novel generative model-based framework using the stochastic interpolants to unify the learning of state transition dynamics and generative priors. FlowDAS achieves stable and observation-consistent inference by initializing from proximal previous states, mitigating the instability seen in score-based methods. Our extensive experiments demonstrate FlowDAS’s superior performance on various benchmarks, from the Lorenz system to high-dimensional fluid super-resolution tasks. FlowDAS also demonstrates improved tracking accuracy on practical Particle Image Velocimetry (PIV) task, showcasing its effectiveness in complex flow field reconstruction.

[LG-70] Subject Representation Learning from EEG using Graph Convolutional Variational Autoencoders ICASSP2025

链接: https://arxiv.org/abs/2501.16626
作者: Aditya Mishra,Ahnaf Mozib Samin,Ali Etemad,Javad Hashemi
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted to 2025 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:We propose GC-VASE, a graph convolutional-based variational autoencoder that leverages contrastive learning for subject representation learning from EEG data. Our method successfully learns robust subject-specific latent representations using the split-latent space architecture tailored for subject identification. To enhance the model’s adaptability to unseen subjects without extensive retraining, we introduce an attention-based adapter network for fine-tuning, which reduces the computational cost of adapting the model to new subjects. Our method significantly outperforms other deep learning approaches, achieving state-of-the-art results with a subject balanced accuracy of 89.81% on the ERP-Core dataset and 70.85% on the SleepEDFx-20 dataset. After subject adaptive fine-tuning using adapters and attention layers, GC-VASE further improves the subject balanced accuracy to 90.31% on ERP-Core. Additionally, we perform a detailed ablation study to highlight the impact of the key components of our method.

[LG-71] UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification

链接: https://arxiv.org/abs/2501.16542
作者: Mufan Sang,John H. L. Hansen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

点击查看摘要

Abstract:With excellent generalization ability, SSL speech models have shown impressive performance on various downstream tasks in the pre-training and fine-tuning paradigm. However, as the size of pre-trained models grows, fine-tuning becomes practically unfeasible due to expanding computation and storage requirements and the risk of overfitting. This study explores parameter-efficient tuning (PET) methods for adapting large-scale pre-trained SSL speech models to the speaker verification task. Correspondingly, we propose three PET methods: (i) an adapter-tuning method, (ii) a prompt-tuning method, and (iii) a unified framework that effectively incorporates adapter-tuning and prompt-tuning with a dynamically learnable gating mechanism. First, we propose the Inner+Inter Adapter framework, which inserts two types of adapters into pre-trained models, allowing for adaptation of latent features within the intermediate Transformer layers and output embeddings from all Transformer layers, through a parallel adapter design. Second, we propose the Deep Speaker Prompting method that concatenates trainable prompt tokens into the input space of pre-trained models to guide adaptation. Lastly, we propose the UniPET-SPK, a unified framework that effectively incorporates these two alternate PET methods into a single framework with a dynamic trainable gating mechanism. The proposed UniPET-SPK learns to find the optimal mixture of PET methods to match different datasets and scenarios. We conduct a comprehensive set of experiments on several datasets to validate the effectiveness of the proposed PET methods. Experimental results on VoxCeleb, CN-Celeb, and 1st 48-UTD forensic datasets demonstrate that the proposed UniPET-SPK consistently outperforms the two PET methods, fine-tuning, and other parameter-efficient tuning methods, achieving superior performance while updating only 5.4% of the parameters.
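
A toy rendering of the learnable gating between the two PET paths is shown below. The module shapes, the prompt-attention path, and the scalar gate are all assumptions for illustration, not the actual UniPET-SPK design.

```python
import torch
import torch.nn as nn

class GatedPET(nn.Module):
    """Toy sketch of the dynamic-gating idea: blend an adapter path and a
    prompt path with one learnable gate. Hypothetical names and shapes."""

    def __init__(self, d_model=256, bottleneck=32, n_prompts=4):
        super().__init__()
        self.d_model = d_model
        self.adapter = nn.Sequential(nn.Linear(d_model, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, d_model))
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))     # sigmoid(0) = 0.5 at init

    def forward(self, hidden):                       # hidden: (B, T, D)
        adapter_out = hidden + self.adapter(hidden)  # residual adapter path
        attn = torch.softmax(hidden @ self.prompts.T / self.d_model ** 0.5, -1)
        prompt_out = hidden + attn @ self.prompts    # prompt-attention path
        g = torch.sigmoid(self.gate)                 # learnable mixing weight
        return g * adapter_out + (1 - g) * prompt_out

h = torch.randn(2, 50, 256)                          # frozen-backbone features
print(GatedPET()(h).shape)                           # torch.Size([2, 50, 256])
```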

[LG-72] Safe Gradient Flow for Bilevel Optimization

链接: https://arxiv.org/abs/2501.16520
作者: Sina Sharifi,Nazanin Abolfazli,Erfan Yazdandoost Hamedani,Mahyar Fazlyab
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 2025 American Control Conference (ACC)

点击查看摘要

Abstract:Bilevel optimization is a key framework in hierarchical decision-making, where one problem is embedded within the constraints of another. In this work, we propose a control-theoretic approach to solving bilevel optimization problems. Our method consists of two components: a gradient flow mechanism to minimize the upper-level objective and a safety filter to enforce the constraints imposed by the lower-level problem. Together, these components form a safe gradient flow that solves the bilevel problem in a single loop. To improve scalability with respect to the lower-level problem’s dimensions, we introduce a relaxed formulation and design a compact variant of the safe gradient flow. This variant minimizes the upper-level objective while ensuring the lower-level solution remains within a user-defined distance. Using Lyapunov analysis, we establish convergence guarantees for the dynamics, proving that they converge to a neighborhood of the optimal solution. Numerical experiments further validate the effectiveness of the proposed approaches. Our contributions provide both theoretical insights and practical tools for efficiently solving bilevel optimization problems.

[LG-73] Nonparametric Sparse Online Learning of the Koopman Operator

链接: https://arxiv.org/abs/2501.16489
作者: Boya Hou,Sina Sanjari,Nathan Dahlin,Alec Koppel,Subhonmesh Bose
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 49 pages, 6 figures. arXiv admin note: text overlap with arXiv:2405.07432

点击查看摘要

Abstract:The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman operator via its action on the reproducing kernel Hilbert space (RKHS), and explore the mis-specified scenario where the dynamics may escape the chosen function space. We relate the Koopman operator to the conditional mean embeddings (CME) operator and then present an operator stochastic approximation algorithm to learn the Koopman operator iteratively with control over the complexity of the representation. We provide both asymptotic and finite-time last-iterate guarantees of the online sparse learning algorithm with trajectory-based sampling with an analysis that is substantially more involved than that for finite-dimensional stochastic approximation. Numerical examples confirm the effectiveness of the proposed algorithm.

[LG-74] DepoRanker: A Web Tool to predict Klebsiella Depolymerases using Machine Learning

链接: https://arxiv.org/abs/2501.16405
作者: George Wright,Slawomir Michniewski,Eleanor Jameson,Fayyaz ul Amir Afsar Minhas
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Background: Phage therapy shows promise for treating antibiotic-resistant Klebsiella infections. Identifying phage depolymerases that target Klebsiella capsular polysaccharides is crucial, as these capsules contribute to biofilm formation and virulence. However, homology-based searches have limitations in novel depolymerase discovery. Objective: To develop a machine learning model for identifying and ranking potential phage depolymerases targeting Klebsiella. Methods: We developed DepoRanker, a machine learning algorithm to rank proteins by their likelihood of being depolymerases. The model was experimentally validated on 5 newly characterized proteins and compared to BLAST. Results: DepoRanker demonstrated superior performance to BLAST in identifying potential depolymerases. Experimental validation confirmed its predictive ability on novel proteins. Conclusions: DepoRanker provides an accurate and functional tool to expedite depolymerase discovery for phage therapy against Klebsiella. It is available as a webserver and open-source software. Availability: Webserver: this https URL Source code: this https URL

[LG-75] ILETIA: An AI-enhanced method for individualized trigger-oocyte pickup interval estimation of progestin-primed ovarian stimulation protocol

链接: https://arxiv.org/abs/2501.16386
作者: Binjian Wu,Qian Li,Zhe Kuang,Hongyuan Gao,Xinyi Liu,Haiyan Guo,Qiuju Chen,Xinyi Liu,Yangruizhe Jiang,Yuqi Zhang,Jinyin Zha,Mingyu Li,Qiuhan Ren,Sishuo Feng,Haicang Zhang,Xuefeng Lu,Jian Zhang
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In vitro fertilization-embryo transfer (IVF-ET) stands as one of the most prevalent treatments for infertility. During an IVF-ET cycle, the time interval between trigger shot and oocyte pickup (OPU) is a pivotal period for follicular maturation, which determines mature oocyte yields and impacts the success of subsequent procedures. However, accurately predicting this interval is severely hindered by the variability of clinicians’ experience, which often leads to suboptimal oocyte retrieval rates. To address this challenge, we propose ILETIA, the first machine learning-based method that could predict the optimal trigger-OPU interval for patients receiving progestin-primed ovarian stimulation (PPOS) protocol. Specifically, ILETIA leverages a Transformer to learn representations from clinical tabular data, and then employs gradient-boosted trees for interval prediction. For model training and evaluating, we compiled a dataset PPOS-DS of nearly ten thousand patients receiving PPOS protocol, the largest such dataset to our knowledge. Experimental results demonstrate that our method achieves strong performance (AUROC = 0.889), outperforming both clinicians and other widely used computational models. Moreover, ILETIA also supports premature ovulation risk prediction in a specific OPU time (AUROC = 0.838). Collectively, by enabling more precise and individualized decisions, ILETIA has the potential to improve clinical outcomes and lay the foundation for future IVF-ET research.

[LG-76] MambaTron: Efficient Cross-Modal Point Cloud Enhancement using Aggregate Selective State Space Modeling WACV2025

链接: https://arxiv.org/abs/2501.16384
作者: Sai Tarun Inaganti,Gennady Petrenko
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted to the Workshop on Image Quality in Computer Vision and Generative AI, WACV 2025

点击查看摘要

Abstract:Point cloud enhancement is the process of generating a high-quality point cloud from an incomplete input. This is done by filling in the missing details from a reference like the ground truth via regression, for example. In addition to unimodal image and point cloud reconstruction, we focus on the task of view-guided point cloud completion, where we gather the missing information from an image, which represents a view of the point cloud and use it to generate the output point cloud. With the recent research efforts surrounding state-space models, originally in natural language processing and now in 2D and 3D vision, Mamba has shown promising results as an efficient alternative to the self-attention mechanism. However, there is limited research towards employing Mamba for cross-attention between the image and the input point cloud, which is crucial in multi-modal problems. In this paper, we introduce MambaTron, a Mamba-Transformer cell that serves as a building block for our network, which is capable of unimodal and cross-modal reconstruction, including view-guided point cloud completion. We explore the benefits of Mamba’s long-sequence efficiency coupled with the Transformer’s excellent analytical capabilities through MambaTron. This approach is one of the first attempts to implement a Mamba-based analogue of cross-attention, especially in computer vision. Our model demonstrates a degree of performance comparable to the current state-of-the-art techniques while using a fraction of the computation resources.

[LG-77] RNN-Based Models for Predicting Seizure Onset in Epileptic Patients

链接: https://arxiv.org/abs/2501.16334
作者: Mathan Kumar Mounagurusamy,Thiyagarajan V S,Abdur Rahman,Shravan Chandak,D. Balaji,Venkateswara Rao Jallepalli
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Early management and better clinical outcomes for epileptic patients depend on seizure prediction. The accuracy and false alarm rates of existing systems are often compromised by their dependence on static thresholds and basic Electroencephalogram (EEG) properties. A novel Recurrent Neural Network (RNN)-based method for seizure onset prediction is proposed in this article to overcome these limitations. As opposed to conventional techniques, the proposed system makes use of Long Short-Term Memory (LSTM) networks to extract temporal correlations from unprocessed EEG data. This enables the system to adapt dynamically to the unique EEG patterns of each patient, improving prediction accuracy. The methodology of the system comprises thorough data collection, preprocessing, and LSTM-based feature extraction. Annotated EEG datasets are then used for model training and validation. Results show a considerable reduction in false alarm rates (average of 6.8%) and an improvement in prediction accuracy (90.2% sensitivity, 88.9% specificity, and AUC-ROC of 93). Additionally, computational efficiency is significantly higher than that of existing systems (12 ms processing time, 45 MB memory consumption). These results demonstrate the effectiveness of the proposed RNN-based strategy in improving seizure prediction reliability, opening up possibilities for its practical application in epilepsy treatment.
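
A minimal LSTM classifier over raw EEG windows, in the spirit of the described system; the channel count, window length, and prediction head are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SeizurePredictor(nn.Module):
    """Minimal LSTM-based seizure-onset classifier over raw EEG windows.
    Illustrative sketch; sizes are placeholders."""

    def __init__(self, n_channels=23, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=2,
                            batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden, 1)          # P(seizure onset)

    def forward(self, eeg):                       # eeg: (B, T, n_channels)
        out, _ = self.lstm(eeg)
        return torch.sigmoid(self.head(out[:, -1]))  # last time step

model = SeizurePredictor()
window = torch.randn(8, 256, 23)                  # 8 one-second EEG windows
print(model(window).shape)                        # torch.Size([8, 1])
```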

信息检索

[IR-0] Enhanced Retrieval of Long Documents: Leveraging Fine-Grained Block Representations with Large Language Models

链接: https://arxiv.org/abs/2501.17039
作者: Minghan Li,Eric Gaussier,Guodong Zhou
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have demonstrated exceptional power in various domains, including information retrieval. Most of the previous practices involve leveraging these models to create a single embedding for each query, each passage, or each document individually, a strategy exemplified and used by the Retrieval-Augmented Generation (RAG) framework. While this method has proven effective, we argue that it falls short in fully capturing the nuanced intricacies of document-level texts due to its reliance on a relatively coarse-grained representation. To address this limitation, we introduce a novel, fine-grained approach aimed at enhancing the accuracy of relevance scoring for long documents. Our method first segments a long document into blocks, each of which is embedded using an LLM for matching against the query representation. When calculating the relevance score, we aggregate the query-block relevance scores through a weighted sum, yielding a comprehensive relevance score between the query and the entire document. Despite its apparent simplicity, our experimental findings reveal that this approach outperforms standard representation methods and achieves a significant reduction in embedding generation latency. Moreover, by carefully optimizing pairwise loss functions, superior performance is achieved.
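
The block-level scoring scheme reduces to: embed each block, score it against the query, and take a weighted sum. Below is a numpy sketch with uniform weights and random stand-in embeddings (the paper optimizes the weighting and the loss, and obtains the embeddings from an LLM).

```python
import numpy as np

def document_score(query_vec, block_vecs, weights=None):
    """Fine-grained relevance: one cosine score per document block,
    aggregated with a weighted sum (uniform weights by default)."""
    sims = block_vecs @ query_vec / (
        np.linalg.norm(block_vecs, axis=1) * np.linalg.norm(query_vec))
    weights = np.ones(len(sims)) / len(sims) if weights is None else weights
    return float(weights @ sims)

rng = np.random.default_rng(0)
query = rng.normal(size=128)                  # stand-in query embedding
blocks = rng.normal(size=(12, 128))           # one embedding per block
print(document_score(query, blocks))
```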

[IR-1] Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks

链接: https://arxiv.org/abs/2501.16902
作者: Shengyao Zhuang,Ekaterina Khramtsova,Xueguang Ma,Bevan Koopman,Jimmy Lin,Guido Zuccon
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that injecting even a single adversarial screenshot into the retrieval corpus can significantly disrupt search results, poisoning the top-10 retrieved documents for 41.9% of queries in the case of DSE and 26.4% for ColPali. These vulnerability rates notably exceed those observed with equivalent attacks on text-only retrievers. Moreover, when targeting a small set of known queries, the attack success rate rises, achieving complete success in certain cases. By exposing the vulnerabilities inherent in vision-language models, this work highlights the potential risks associated with their deployment.

[IR-2] Secure Federated Graph-Filtering for Recommender Systems

链接: https://arxiv.org/abs/2501.16888
作者: Julien Nicolas,César Sabater,Mohamed Maouche,Sonia Ben Mokhtar,Mark Coates
类目: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Recommender systems often rely on graph-based filters, such as normalized item-item adjacency matrices and low-pass filters. While effective, the centralized computation of these components raises concerns about privacy, security, and the ethical use of user data. This work proposes two decentralized frameworks for securely computing these critical graph components without centralizing sensitive information. The first approach leverages lightweight Multi-Party Computation and distributed singular vector computations to privately compute key graph filters. The second extends this framework by incorporating low-rank approximations, enabling a trade-off between communication efficiency and predictive performance. Empirical evaluations on benchmark datasets demonstrate that the proposed methods achieve comparable accuracy to centralized state-of-the-art systems while ensuring data confidentiality and maintaining low communication costs. Our results highlight the potential for privacy-preserving decentralized architectures to bridge the gap between utility and user data protection in modern recommender systems.

附件下载

点击下载今日全部论文列表