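This post lists the latest papers fetched from arXiv.org on 2025-03-06. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.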
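Contents
Overview (2025-03-06)
444 papers were updated today, including:
- Natural language processing: 74 papers (Computation and Language (cs.CL))
- Artificial intelligence: 99 papers (Artificial Intelligence (cs.AI))
- Computer vision: 101 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 138 papers (Machine Learning (cs.LG))
Natural Language Processing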
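[NLP-0] The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
[Quick read]: This paper addresses the concern that large language models (LLMs) may learn deceptive behavior in pursuit of their goals, and focuses on how to evaluate and improve model honesty. Existing honesty evaluations are severely limited: many so-called honesty benchmarks actually measure only accuracy, i.e., the correctness of a model's beliefs. To address this, the paper introduces a large-scale human-collected dataset that measures honesty directly, disentangling accuracy from honesty for the first time. The new benchmark shows that, even among frontier LLMs, accuracy rises with scale while honesty does not, and that models show a substantial propensity to lie under pressure. The study also finds that simple interventions such as representation engineering can improve honesty. These results underscore the need for more robust evaluations and effective interventions to keep LLMs trustworthy.
Link: https://arxiv.org/abs/2503.03750
Authors: Richard Ren,Arunim Agarwal,Mantas Mazeika,Cristina Menghini,Robert Vacareanu,Brad Kenstler,Mick Yang,Isabelle Barrass,Alice Gatti,Xuwang Yin,Eduardo Trevino,Matias Geralnik,Adam Khoja,Dean Lee,Summer Yue,Dan Hendrycks
Affiliations: Center for AI Safety; Scale AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Website: this https URL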
Abstract:As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of “honesty” in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy–the correctness of a model’s beliefs–in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
[NLP-1] Process-based Self-Rewarding Language Models
[Quick read]: This paper addresses the finding that existing self-rewarding methods are ineffective in mathematical reasoning and may even degrade performance. It proposes a Process-based Self-Rewarding pipeline whose key components are long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization. Applied iteratively, the pipeline improves the performance of large language models (LLMs) on multiple mathematical reasoning benchmarks, which the authors take as evidence that self-rewarding could yield LLM reasoning that may surpass human capabilities.
Link: https://arxiv.org/abs/2503.03746
Authors: Shimao Zhang,Xiao Liu,Xin Zhang,Junxiao Liu,Zheheng Luo,Shujian Huang,Yeyun Gong
Affiliations: National Key Laboratory for Novel Software Technology; Microsoft Research Asia; Nanjing University; The University of Manchester
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs’ performance, which is constrained by the upper limit of human performance. Therefore, Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.
[NLP-2] Improving LLM Safety Alignment with Dual-Objective Optimization
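[Quick read]: This paper addresses the vulnerability of existing training-time safety alignment techniques for large language models (LLMs) to jailbreak attacks. It argues that the widely used direct preference optimization (DPO) method has both experimental and theoretical limitations, because its loss function is suboptimal for refusal learning. Using gradient-based analysis, the paper identifies these shortcomings and proposes an improved safety alignment that disentangles the DPO objective into two components: (1) robust refusal training, which encourages refusal even when a partial unsafe generation has already been produced, and (2) targeted unlearning of harmful knowledge. It further introduces a reward-based token-level weighting mechanism for refusal learning that emphasizes critical refusal tokens and improves robustness to adversarial attacks. The study also suggests that robustness to jailbreak attacks is correlated with token distribution shifts during training and with the internal representations of refusal and harmful tokens, which points to directions for future work on LLM safety alignment. The code is open source.
Link: https://arxiv.org/abs/2503.03710
Authors: Xuandong Zhao,Will Cai,Tianneng Shi,David Huang,Licong Lin,Song Mei,Dawn Song
Affiliations: unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: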
Abstract:Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at this https URL
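For orientation, the objective that this paper analyzes can be sketched as follows. This is the standard DPO loss for a single preference pair, not the paper's dual-objective method; the function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    Illustrative only: this is the baseline objective, not the paper's
    refusal-training or unlearning terms.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a safety setting, the "chosen" response would typically be a refusal and the "rejected" one a harmful completion; the loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does.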
[NLP-3] Effective LLM Knowledge Learning via Model Generalization
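[Quick read]: This paper addresses the lack of a clear understanding of how large language models (LLMs) acquire knowledge through autoregressive pre-training, and in particular how to use continued pre-training to learn up-to-date information that lacks the diverse repetitions of foundational knowledge. Its key finding is that knowledge learning in LLMs can be viewed as an implicit supervised task hidden in the autoregressive pre-training objective. Based on this, the authors propose formatting-based data augmentation to grow in-distribution samples without altering the facts embedded in documents, and introduce sharpness-aware minimization as an optimization algorithm to improve generalization. The methods are validated on continued pre-training and also extend to instruction tuning.
Link: https://arxiv.org/abs/2503.03705
Authors: Mingkang Zhu,Xi Chen,Zhongdao Wang,Bei Yu,Hengshuang Zhao,Jiaya Jia
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: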
Abstract:Large language models (LLMs) are trained on enormous documents that contain extensive world knowledge. However, it is still not well-understood how knowledge is acquired via autoregressive pre-training. This lack of understanding greatly hinders effective knowledge learning, especially for continued pretraining on up-to-date information, as this evolving information often lacks diverse repetitions like foundational knowledge. In this paper, we focus on understanding and improving LLM knowledge learning. We found and verified that knowledge learning for LLMs can be deemed as an implicit supervised task hidden in the autoregressive pre-training objective. Our findings suggest that knowledge learning for LLMs would benefit from methods designed to improve generalization ability for supervised tasks. Based on our analysis, we propose the formatting-based data augmentation to grow in-distribution samples, which does not present the risk of altering the facts embedded in documents as text paraphrasing. We also introduce sharpness-aware minimization as an effective optimization algorithm to better improve generalization. Moreover, our analysis and method can be readily extended to instruction tuning. Extensive experiment results validate our findings and demonstrate our methods’ effectiveness in both continued pre-training and instruction tuning. This paper offers new perspectives and insights to interpret and design effective strategies for LLM knowledge learning.
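The sharpness-aware minimization (SAM) step mentioned in the abstract can be sketched on a toy problem. This is a minimal NumPy version of the generic two-gradient SAM update, not the authors' training code; `sam_step`, `quad_grad`, and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: take the gradient at a worst-case nearby point."""
    g = grad_fn(w)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return w  # already at a stationary point
    # Ascend to the (first-order) worst point within an L2 ball of radius rho.
    w_adv = w + rho * g / norm
    # Descend from the original weights using the gradient at that point.
    return w - lr * grad_fn(w_adv)

def quad_grad(w):
    """Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2)."""
    return np.array([1.0, 10.0]) * w

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, quad_grad)
```

With a fixed `rho`, the iterate settles into a small neighbourhood of the minimum rather than converging exactly, which is expected for this update rule.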
[NLP-4] SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches ICLR2025
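[Quick read]: This paper addresses two problems in analyzing real language use in large-scale corpora in natural language processing and computational linguistics. First, existing pattern-matching tools (such as grep or keyword-in-context concordancers) rely on surface-level string matching and cannot handle orthographic variation and paraphrasing. Second, existing continuous approaches (such as dense vector search) tend to be too coarse and often retrieve texts that share a topic but are unrelated to the query. The key contribution is a novel algorithm that achieves soft (semantic) yet efficient pattern matching by relaxing surface-level matching with word embeddings, and that scales with corpus size by using inverted indexes. The algorithm searches billion-scale corpora in under a second and supports multilingual use cases, including extracting harmful instances from Wikipedia articles and corpus-linguistic analysis of Latin.
Link: https://arxiv.org/abs/2503.03703
Authors: Hiroyuki Deguchi,Go Kamoda,Yusuke Matsushita,Chihiro Taguchi,Kohei Suenaga,Masaki Waga,Sho Yokoi
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at ICLR2025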
Abstract:Researchers and practitioners in natural language processing and computational linguistics frequently observe and analyze the real language usage in large-scale corpora. For that purpose, they often employ off-the-shelf pattern-matching tools, such as grep, and keyword-in-context concordancers, which are widely used in corpus linguistics for gathering examples. Nonetheless, these existing techniques rely on surface-level string matching, and thus they suffer from the major limitation of not being able to handle orthographic variations and paraphrasing, which are notable and common phenomena in any natural language. In addition, existing continuous approaches such as dense vector search tend to be overly coarse, often retrieving texts that are unrelated but share similar topics. Given these challenges, we propose a novel algorithm that achieves soft (or semantic) yet efficient pattern matching by relaxing a surface-level matching with word embeddings. Our algorithm is highly scalable with respect to the size of the corpus text utilizing inverted indexes. We have prepared an efficient implementation, and we provide an accessible web tool. Our experiments demonstrate that the proposed method (i) can execute searches on billion-scale corpora in less than a second, which is comparable in speed to surface-level string matching and dense vector search; (ii) can extract harmful instances that semantically match queries from a large set of English and Japanese Wikipedia articles; and (iii) can be effectively applied to corpus-linguistic analyses of Latin, a language with highly diverse inflections.
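The idea of relaxing exact matching with word embeddings on top of an inverted index can be illustrated with a toy sketch. This is not the SoftMatcha implementation; `soft_match`, the cosine threshold of 0.8, and the three-dimensional embeddings are all made-up assumptions.

```python
import numpy as np
from collections import defaultdict

def build_inverted_index(tokens):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def soft_match(query, tokens, embeddings, threshold=0.8):
    """Return start positions where every query word has a similar corpus word."""
    if not query:
        return []
    index = build_inverted_index(tokens)
    vocab = list(index)
    mat = np.stack([embeddings[w] for w in vocab])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    starts = None
    for offset, q in enumerate(query):
        qv = embeddings[q] / np.linalg.norm(embeddings[q])
        sims = mat @ qv  # cosine similarity to every vocabulary word
        # Candidate start positions implied by soft matches of this query word.
        positions = {pos - offset
                     for w, s in zip(vocab, sims) if s >= threshold
                     for pos in index[w]}
        starts = positions if starts is None else starts & positions
    return sorted(p for p in starts if p >= 0)

# Toy embeddings (hypothetical): "cat" ~ "dog" and "jumps" ~ "leaps".
emb = {
    "the": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 0.0, 1.0]),
    "dog": np.array([0.0, 0.1, 1.0]),
    "jumps": np.array([0.0, 1.0, 0.0]),
    "leaps": np.array([0.0, 0.95, 0.1]),
}
tokens = "the cat jumps the dog leaps".split()
print(soft_match(["dog", "jumps"], tokens, emb))
```

The query "dog jumps" never appears verbatim, yet the sketch reports both "cat jumps" and "dog leaps" because each query word is matched against embedding neighbours before the position sets are intersected.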
[NLP-5] Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models
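[Quick read]: This paper addresses the challenges Cantonese faces as a low-resource language in natural language processing (NLP), in particular the scarcity of high-quality data. Its key contribution is a high-quality Cantonese corpus of over 2 billion tokens for training large language models (LLMs), built through multi-source collection and a rigorous processing pipeline. Texts are collected from open-source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl, and then passed through language filtering, quality filtering, content filtering, and de-duplication. The model is further refined by supervised fine-tuning (SFT) on curated Cantonese tasks, reaching state-of-the-art (SOTA) performance on Cantonese benchmarks and improved performance on other mainstream language tasks.
Link: https://arxiv.org/abs/2503.03702
Authors: Jiyue Jiang,Alfred Kar Yin Truong,Yanyu Chen,Qinghang Bao,Sheng Wang,Pengan Chen,Jiuming Wang,Lingpeng Kong,Yu Li,Chuan Wu
Affiliations: The Chinese University of Hong Kong; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: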
Abstract:High-quality data resources play a crucial role in learning large language models (LLMs), particularly for low-resource languages like Cantonese. Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language in the field of natural language processing (NLP) due to factors such as the dominance of Mandarin, lack of cohesion within the Cantonese-speaking community, diversity in character encoding and input methods, and the tendency of overseas Cantonese speakers to prefer using English. In addition, rich colloquial vocabulary of Cantonese, English loanwords, and code-switching characteristics add to the complexity of corpus collection and processing. To address these challenges, we collect Cantonese texts from a variety of sources, including open source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data. We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus of over 2 billion tokens for training large language models. We further refined the model through supervised fine-tuning (SFT) on curated Cantonese tasks, enhancing its ability to handle specific applications. Upon completion of the training, the model achieves state-of-the-art (SOTA) performance on four Cantonese benchmarks. After training on our dataset, the model also exhibits improved performance on other mainstream language tasks.
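The filtering and de-duplication stages described above might look roughly like the following sketch. The abstract does not specify the actual filters, so everything here is an assumption: the marker-character heuristic in `looks_cantonese`, the `min_chars` quality threshold, and exact-hash de-duplication are stand-ins for whatever the authors used.

```python
import hashlib

# Hypothetical set of characters common in written Cantonese.
CANTONESE_MARKERS = set("嘅咗唔喺嘢")

def looks_cantonese(text, min_markers=1):
    """Crude language filter: count Cantonese-specific characters."""
    return sum(ch in CANTONESE_MARKERS for ch in text) >= min_markers

def clean_corpus(docs, min_chars=5):
    """Apply language filtering, a length-based quality filter, and exact de-duplication."""
    seen, kept = set(), []
    for doc in docs:
        text = " ".join(doc.split())  # normalize whitespace
        if len(text) < min_chars:     # quality filter (assumed)
            continue
        if not looks_cantonese(text):  # language filter (assumed)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:            # exact duplicate
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

A production pipeline would more likely use a trained language identifier and near-duplicate detection; the sketch only shows how the stages compose.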
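[NLP-6] MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
[Quick read]: This paper addresses the poor adaptability and high inference cost of existing approaches to building LLM-based multi-agent systems (MAS), which rely on manual configuration or multiple calls to advanced LLMs. Its key idea is to reframe MAS construction as a generative language task in which the input is a user query and the output is a corresponding MAS. To this end, the paper unifies the representation of MAS as executable code and proposes a consistency-oriented data construction pipeline that produces a high-quality dataset of query-MAS pairs. On this dataset the authors train MAS-GPT, an open-source medium-sized LLM that generates a query-adaptive MAS within a single LLM inference; the generated MAS can then process the user query and deliver a response. Experiments show that MAS-GPT consistently outperforms more than 10 baseline methods across diverse benchmarks and LLM settings, indicating high effectiveness, efficiency, and strong generalization.
Link: https://arxiv.org/abs/2503.03686
Authors: Rui Ye,Shuo Tang,Rui Ge,Yaxin Du,Zhenfei Yin,Siheng Chen,Jing Shao
Affiliations: unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 26 pages, 7 figures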
Abstract:LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS-GPT’s high effectiveness, efficiency and strong generalization ability. Code will be available at this https URL.
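[NLP-7] Quantification of Tenseness in English and Japanese Tense-Lax Vowels: A Lagrangian Model with Indicator θ_1 and Force of Tenseness F_tense(t)
[Quick read]: This paper addresses the lack of a universally accepted quantitative definition behind the traditional binary distinction between tense and lax vowels, and explores a new factor affecting vowel quality. Its key contribution is a simplified model based on the Lagrangian equation that describes the dynamic interaction of the tongue and jaw within the oral cavity during the articulation of close vowels. By introducing a force-related parameter to quantify vowel tenseness, the model provides a theoretical framework for estimating the forces involved in vowel production across languages and offers a new perspective on the physical mechanisms of vowel articulation.
Link: https://arxiv.org/abs/2503.03681
Authors: Tatsuya Ishizaki
Affiliations: unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: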
Abstract:The concept of vowel tenseness has traditionally been examined through the binary distinction of tense and lax vowels. However, no universally accepted quantitative definition of tenseness has been established in any language. Previous studies, including those by Jakobson, Fant, and Halle (1951) and Chomsky and Halle (1968), have explored the relationship between vowel tenseness and the vocal tract. Building on these foundations, Ishizaki (2019, 2022) proposed an indirect quantification of vowel tenseness using formant angles $\theta_1$ and $\theta_{F1}$ and their first and second derivatives, $dZ_1(t)/dt = \lim \tan \theta_1(t)$ and $d^2 Z_1(t)/dt^2 = \frac{d}{dt} \lim \tan \theta_1(t)$. This study extends this approach by investigating the potential role of a force-related parameter in determining vowel quality. Specifically, we introduce a simplified model based on the Lagrangian equation to describe the dynamic interaction of the tongue and jaw within the oral cavity during the articulation of close vowels. This model provides a theoretical framework for estimating the forces involved in vowel production across different languages, offering new insights into the physical mechanisms underlying vowel articulation. The findings suggest that this force-based perspective warrants further exploration as a key factor in phonetic and phonological studies.
[NLP-8] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
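[Quick read]: This paper addresses the difficulty large language models (LLMs) have in adhering to complex, use-case-specific instructions during multi-turn conversations, which is a particular challenge for business-critical applications. It proposes Attentive Reasoning Queries (ARQs), which guide the LLM through systematic reasoning steps using domain-specialized reasoning blueprints, with targeted queries that reinstate critical instructions and support intermediate reasoning throughout the completion process. In tests within the Parlant framework, ARQs achieved a 90.2% success rate, outperforming Chain-of-Thought reasoning (86.1%) and direct response generation (81.5%), and were especially strong on persistent failure modes such as guideline re-application and hallucination prevention. Carefully designed ARQs may also be more computationally efficient than free-form reasoning. These findings suggest that structured reasoning provides an effective mechanism for controlling how LLMs process information and make decisions in complex scenarios.
Link: https://arxiv.org/abs/2503.03669
Authors: Bar Karov,Dor Zohar,Yam Marcovitz
Affiliations: Emcie Co Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Supplementary materials, including code, is available on our GitHub: this https URL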
Abstract:We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blueprints. While LLMs demonstrate remarkable capabilities across diverse tasks, they often fail to maintain adherence to complex, use-case-specific instructions during multi-turn conversations, presenting challenges for business-critical applications. ARQs address this limitation by guiding LLMs through systematic reasoning steps with targeted queries that reinstate critical instructions and facilitate intermediate reasoning throughout the completion process. In extensive testing within Parlant, our framework for reliable customer-facing agents in which ARQs were born out of necessity, they achieved a 90.2% success rate across 87 test scenarios, outperforming both Chain-of-Thought reasoning (86.1%) and direct response generation (81.5%). ARQs showed particular strength in addressing persistent failure modes like guideline re-application and hallucination prevention. Our analysis also revealed that ARQs can potentially be more computationally efficient than free-form reasoning when carefully designed. These findings demonstrate that structured reasoning approaches provide effective mechanisms for controlling how LLMs process information and make decisions in complex scenarios.
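[NLP-9] Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction
[Quick read]: This paper investigates whether large language models (LLMs) hold internal representations of the conceptual abstractions that analogical reasoning relies on. It examines function vectors (FVs) for in-context learning (ICL) tasks and finds that they are not invariant to simple input changes, which suggests they capture more than pure concepts. Using representational similarity analysis (RSA), the authors localize a small set of attention heads that encode invariant concept vectors (CVs) for verbal concepts such as "antonym". These CVs act as feature detectors that operate independently of the final output, so a model may form a correct internal representation yet still produce an incorrect output. CVs can also be used to causally guide model behaviour. For more abstract concepts such as "previous" and "next", however, no invariant linear representations are observed, which the authors link to the generalizability issues LLMs display in these domains.
Link: https://arxiv.org/abs/2503.03666
Authors: Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson
Affiliations: Department of Psychological Methods, University of Amsterdam
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint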
Abstract:Analogical reasoning relies on conceptual abstractions, but it is unclear whether Large Language Models (LLMs) harbor such internal representations. We explore distilled representations from LLM activations and find that function vectors (FVs; Todd et al., 2024) - compact representations for in-context learning (ICL) tasks - are not invariant to simple input changes (e.g., open-ended vs. multiple-choice), suggesting they capture more than pure concepts. Using representational similarity analysis (RSA), we localize a small set of attention heads that encode invariant concept vectors (CVs) for verbal concepts like “antonym”. These CVs function as feature detectors that operate independently of the final output - meaning that a model may form a correct internal representation yet still produce an incorrect output. Furthermore, CVs can be used to causally guide model behaviour. However, for more abstract concepts like “previous” and “next”, we do not observe invariant linear representations, a finding we link to generalizability issues LLMs display within these domains.
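Representational similarity analysis compares the similarity structure of two sets of activations rather than the activations themselves. The sketch below is a generic minimal version, not the paper's pipeline: `rdm` and `rsa_score` are hypothetical names, and Pearson correlation is used as a simplifying assumption where RSA studies often use a rank correlation.

```python
import numpy as np

def rdm(acts):
    """Representational dissimilarity matrix: 1 - Pearson correlation between rows.

    `acts` has shape (n_items, n_features), e.g. one row per prompt.
    """
    return 1.0 - np.corrcoef(acts)

def rsa_score(acts_a, acts_b):
    """Correlate the upper triangles of two RDMs built over the same items."""
    iu = np.triu_indices(acts_a.shape[0], k=1)
    return np.corrcoef(rdm(acts_a)[iu], rdm(acts_b)[iu])[0, 1]

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 20))  # toy "activations" for 6 items
print(rsa_score(a, 2.0 * a + 3.0))
```

Because the dissimilarity is correlation-based, rescaling and shifting the activations leaves the RDM unchanged, so the score in the usage line is 1 up to rounding; an attention head whose RDM tracks concept identity across prompt formats would likewise score highly against a concept-based reference RDM.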
[NLP-10] Improving Neutral Point of View Text Generation through Parameter-Efficient Reinforcement Learning and a Small-Scale High-Quality Dataset
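[Quick read]: This paper addresses the difficulty generative large language models (LLMs) have in answering queries on sensitive topics with a Neutral Point of View (NPOV), that is, with informative, diverse, and impartial answers. It makes two key contributions. First, it proposes a new methodology for building high-quality datasets through iterative rounds of human peer-critique and annotator training, and releases the SHQ-NPOV dataset. Second, it identifies an effective training regime for parameter-efficient reinforcement learning (PE-RL) that substantially improves NPOV generation. Compared with baselines including LoRA fine-tuning, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), PE-RL improves overall NPOV quality over the strongest baseline (97.06% to 99.08%) and scores much higher on features that linguists identify as separating good answers from the best ones: presence of supportive details rises from 60.25% to 85.21%, and absence of oversimplification from 68.74% to 91.43%. The evaluation also finds no statistical difference between topics seen in training and held-out topics, indicating good out-of-topic generalization.
Link: https://arxiv.org/abs/2503.03654
Authors: Jessica Hoffmann,Christiane Ahlheim,Zac Yu,Aria Walfrand,Jarvis Jin,Marie Tano,Ahmad Beirami,Erin van Liemt,Nithum Thain,Hakim Sidahmed,Lucas Dixon
Affiliations: Google; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: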
Abstract:This paper describes the construction of a dataset and the evaluation of training methods to improve generative large language models’ (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse and impartial answers. The dataset, the SHQ-NPOV dataset, comprises 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts elaborating the various points of view. The first key contribution of this paper is a new methodology to create such datasets through iterative rounds of human peer-critique and annotator training, which we release alongside the dataset. The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation. We compare and extensively evaluate PE-RL and multiple baselines, including LoRA finetuning (a strong baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline (97.06% → 99.08%), but also scores much higher on features linguists identify as key to separating good answers from the best answers (60.25% → 85.21% for presence of supportive details, 68.74% → 91.43% for absence of oversimplification). A qualitative analysis corroborates this. Finally, our evaluation finds no statistical differences between results on topics that appear in the training dataset and those on separated evaluation topics, which provides strong evidence that our approach to training PE-RL exhibits very effective out of topic generalization.
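[NLP-11] Token-Level Privacy in Large Language Models
[Quick read]: This paper addresses the privacy concerns raised by using language models as remote services, where transmitting private information risks exposing sensitive data to untrusted service providers and to interception by eavesdroppers. Existing privacy-preserving methods for natural language processing (NLP) interactions rely mainly on semantic similarity and overlook contextual information. The paper proposes dchi-stencil, a novel token-level privacy-preserving mechanism that integrates semantic and contextual information while ensuring strong guarantees under the dchi differential privacy framework, achieving 2epsilon-dchi-privacy. By combining semantic and contextual nuances, dchi-stencil strikes a robust balance between privacy and utility. Evaluated with state-of-the-art language models on diverse datasets, it achieves a utility-privacy trade-off comparable to or better than existing methods.
Link: https://arxiv.org/abs/2503.03652
Authors: Re’em Harel,Niv Gilboa,Yuval Pinter
Affiliations: unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: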
Abstract:The use of language models as remote services requires transmitting private information to external providers, raising significant privacy concerns. This process not only risks exposing sensitive data to untrusted service providers but also leaves it vulnerable to interception by eavesdroppers. Existing privacy-preserving methods for natural language processing (NLP) interactions primarily rely on semantic similarity, overlooking the role of contextual information. In this work, we introduce dchi-stencil, a novel token-level privacy-preserving mechanism that integrates contextual and semantic information while ensuring strong privacy guarantees under the dchi differential privacy framework, achieving 2epsilon-dchi-privacy. By incorporating both semantic and contextual nuances, dchi-stencil achieves a robust balance between privacy and utility. We evaluate dchi-stencil using state-of-the-art language models and diverse datasets, achieving comparable and even better trade-off between utility and privacy compared to existing methods. This work highlights the potential of dchi-stencil to set a new standard for privacy-preserving NLP in modern, high-risk applications.
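As background for the guarantee named in the abstract, and assuming "dchi" denotes the distance $d_\chi$ of metric differential privacy, the standard definition says that a randomized mechanism $M$ is $\varepsilon d_\chi$-private when nearby inputs produce statistically similar outputs. The symbols $M$, $x$, $x'$, and $S$ below are notation introduced here, not taken from the paper.

```latex
\Pr[M(x) \in S] \;\le\; e^{\varepsilon \, d_\chi(x, x')} \, \Pr[M(x') \in S]
\qquad \text{for all inputs } x, x' \text{ and all output sets } S .
```

Under this reading, the "2epsilon-dchi-privacy" stated in the abstract corresponds to the same bound with $\varepsilon$ replaced by $2\varepsilon$.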
[NLP-12] Psy-Copilot: Visual Chain of Thought for Counseling
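[Quick read]: This paper addresses the difficulty human therapists have in understanding how large language models (LLMs) arrive at their answers during therapy sessions. It offers two main contributions. First, it constructs Psy-COT, a graph that visualizes the thought processes of LLMs in therapy dialogues by presenting semi-structured counseling conversations alongside step-by-step annotations of therapists' reasoning and insights. Second, it develops Psy-Copilot, a conversational AI assistant for human psychological therapists that offers traceable, retrieval-based information, including response candidates, similar dialogue sessions, related strategies, and visual traces of results. The authors also build an interactive AI-assisted counseling platform whose interface displays the relevant parts of the retrieval sub-graph. Psy-Copilot is designed not to replace psychotherapists but to foster collaboration between AI and human therapists. The code and demo are open source.
Link: https://arxiv.org/abs/2503.03645
Authors: Keqi Chen,Zekai Sun,Huijun Lian,Yingming Gao,Ya Li
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: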
Abstract:Large language models (LLMs) are becoming increasingly popular in the field of psychological counseling. However, when human therapists work with LLMs in therapy sessions, it is hard to understand how the model gives the answers. To address this, we have constructed Psy-COT, a graph designed to visualize the thought processes of LLMs during therapy sessions. The Psy-COT graph presents semi-structured counseling conversations alongside step-by-step annotations that capture the reasoning and insights of therapists. Moreover, we have developed Psy-Copilot, which is a conversational AI assistant designed to assist human psychological therapists in their consultations. It can offer traceable psycho-information based on retrieval, including response candidates, similar dialogue sessions, related strategies, and visual traces of results. We have also built an interactive platform for AI-assisted counseling. It has an interface that displays the relevant parts of the retrieval sub-graph. The Psy-Copilot is designed not to replace psychotherapists but to foster collaboration between AI and human therapists, thereby promoting mental health development. Our code and demo are both open-sourced and available for use.
[NLP-13] Psy-Insight: Explainable Multi-turn Bilingual Dataset for Mental Health Counseling
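[Quick read]: This paper addresses the shortage of counseling datasets, especially in Chinese, which limits the use of large language models (LLMs) for mental health support. Its key contribution is Psy-Insight, the first mental health-oriented, explainable, multi-task bilingual dataset. Psy-Insight contains face-to-face multi-turn counseling dialogues annotated with psychotherapy, emotion, strategy, and topic labels, together with turn-level reasoning and session-level guidance. These multi-task annotations support model training, so that models not only mimic the counseling conversation style but also understand the underlying strategies and reasoning.
Link: https://arxiv.org/abs/2503.03607
Authors: Keqi Chen,Zekai Sun,Yuhua Wen,Huijun Lian,Yingming Gao,Ya Li
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: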
Abstract:The in-context learning capabilities of large language models (LLMs) show great potential in mental health support. However, the lack of counseling datasets, particularly in Chinese corpora, restricts their application in this field. To address this, we constructed Psy-Insight, the first mental health-oriented explainable multi-task bilingual dataset. We collected face-to-face multi-turn counseling dialogues, which are annotated with multi-task labels and conversation process explanations. Our annotations include psychotherapy, emotion, strategy, and topic labels, as well as turn-level reasoning and session-level guidance. Psy-Insight is not only suitable for tasks such as label recognition but also meets the need for training LLMs to act as empathetic counselors through logical reasoning. Experiments show that training LLMs on Psy-Insight enables the models to not only mimic the conversation style but also understand the underlying strategies and reasoning of counseling.
[NLP-14] Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
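[Quick read]: This paper addresses the fact that no existing Artificial Text Detection (ATD) algorithm performs consistently well across different types of unseen text or generalizes reliably to new large language models (LLMs). Its key idea is to improve ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream, and to analyze them through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. The analysis reveals how texts from different models differ from human-written content, and shows that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs.
Link: https://arxiv.org/abs/2503.03601
Authors: Kristian Kuznetsov,Laida Kushnareva,Polina Druzhinina,Anton Razzhigaev,Anastasia Voznyuk,Irina Piontkovskaya,Evgeny Burnaev,Serguei Barannikov
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: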
Abstract:Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
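To make the feature-extraction step concrete, here is a minimal sketch of a sparse autoencoder forward pass over a single residual-stream vector. It assumes a plain ReLU encoder with a pre-encoder bias; the SAEs used in the paper may use a different activation or parameterization, and all names and shapes here are illustrative.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode a residual-stream vector into sparse features and reconstruct it.

    x: (d_model,), W_enc: (d_model, d_sae), W_dec: (d_sae, d_model).
    Assumes a ReLU encoder, which is a simplification.
    """
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # non-negative codes
    x_hat = features @ W_dec + b_dec                          # linear decoder
    return features, x_hat
```

In a detection setting, the per-feature activations (rather than the reconstruction) are what one would inspect or feed to a simple classifier, since each feature is a candidate interpretable signal.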
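[NLP-15] Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs
[Quick read]: This paper addresses three key limitations of large language models (LLMs) in time series forecasting: inefficient parameter utilization on numerical time series patterns, modality misalignment between continuous temporal signals and discrete text embeddings, and inflexibility for real-time expert knowledge integration. It presents SMETimes, a systematic investigation of sub-3B-parameter small language models (SLMs) for efficient and accurate forecasting. The approach rests on three innovations: a statistically-enhanced prompting mechanism that bridges numerical time series and textual semantics through descriptive statistical features; an adaptive fusion embedding architecture that aligns temporal patterns with the language model token space through learnable parameters; and a dynamic mixture-of-experts framework, enabled by the computational efficiency of SLMs, that adaptively combines base predictions with domain-specific models. Experiments show that the 3B-parameter SLM achieves state-of-the-art performance on five primary datasets, with 3.8x faster training and 5.2x lower memory consumption than 7B-parameter LLM baselines, and 12.3% lower mean squared error (MSE) than a conventional LLM.
Link: https://arxiv.org/abs/2503.03594
Authors: Haoran Fan,Bin Li,Yixuan Weng,Shoujun Zhou
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress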
Abstract:While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: inefficient parameter utilization in handling numerical time series patterns; modality misalignment between continuous temporal signals and discrete text embeddings; and inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: a statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; an adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; and a dynamic mixture-of-experts framework enabled by SLMs’ computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLM. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at this https URL.
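The statistically-enhanced prompting idea, describing a numeric series through summary statistics before asking a language model to forecast, can be illustrated as follows. The prompt wording, the chosen statistics, and the function name `stats_prompt` are all assumptions; the abstract does not give the actual template.

```python
import statistics

def stats_prompt(series, horizon):
    """Build a text prompt that pairs raw values with descriptive statistics."""
    summary = (
        f"min={min(series):.2f}, max={max(series):.2f}, "
        f"mean={statistics.fmean(series):.2f}, "
        f"std={statistics.pstdev(series):.2f}, "
        f"trend={series[-1] - series[0]:+.2f}"
    )
    history = ", ".join(f"{v:.2f}" for v in series)
    return (
        f"Series statistics: {summary}.\n"
        f"History: {history}.\n"
        f"Forecast the next {horizon} values."
    )

print(stats_prompt([1.0, 2.0, 3.0, 4.0], horizon=2))
```

The statistics give the model a compact textual handle on scale and direction that raw digit sequences convey poorly, which is the motivation the abstract gives for bridging numbers and text.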
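[NLP-16] English K_Quantization of LLMs Does Not Disproportionately Diminish Multilingual Performance
【Quick read】: This paper examines whether, when locally deployed large language models (LLMs) are shrunk to fit consumer-grade hardware using the GGUF format and k-quantization, calibrating the quantization on an English importance matrix harms multilingual performance. Its central question is whether importance matrices written in other languages (here Norwegian and Malayalam) change the trade-off between English task performance and multilingual performance. After k-quantization with importance matrices in the different languages, evaluation on the English and Norwegian MixEval tasks showed no significant differences (p > 0.237), indicating that current quantization practice does not disproportionately harm multilingual performance. The key to the approach is therefore testing importance matrices in several languages while checking that task performance in each language is preserved.
Link: https://arxiv.org/abs/2503.03592
Authors: Karl Audun Borgersen
Affiliation: University of Agder
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures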
Abstract:For consumer usage of locally deployed LLMs, the GGUF format and k_quantization are invaluable tools for maintaining the performance of the original model while reducing it to sizes deployable with consumer-grade hardware. The number of bits dedicated to each weight from the original model is reduced based on how important they are thought to be during model inference. This importance is arrived at through the application of an ‘importance matrix’, a relatively small text document meant to be representative of the LLM’s standard use-cases. In the vast majority of quants available online, this document is primarily written in English. It was therefore an open question whether performance on English language tasks was preserved through the sacrifice of multilingual performance and whether it can be preserved with alternate importance matrices. This article investigates these hypotheses by quantizing Llama3.3 70B on importance matrices written in three languages (English, Norwegian, and Malayalam) and evaluating them on the MixEval dataset in both English and Norwegian. All experiments related to k_quantization yielded non-significant results (in all cases p > 0.237), indicating that current quantization practices do not disproportionately harm multilingual performance.
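[NLP-17] PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
【Quick read】: This paper addresses the efficiency bottleneck that the quadratic complexity of attention imposes on large language models (LLMs) processing long contexts. Existing sparse attention methods offer a way forward but typically suffer from an incomplete effective context or a complex implementation. Through a theoretical analysis, the paper proposes PowerAttention, a sparse attention design that expands the receptive field exponentially: in a d-layer LLM each output token can attend to 2^d tokens, which keeps the receptive field complete and continuous. In experiments, PowerAttention outperforms existing static sparse attention methods by 5% to 40%, particularly on tasks that require long-range dependencies, while keeping a time complexity comparable to sliding window attention. It also delivers significant speedups in both the prefilling and decoding phases (3.0× faster on a 128K context), making it an effective and user-friendly solution for long sequences in LLMs.
Link: https://arxiv.org/abs/2503.03588
Authors: Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang
Affiliation: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: for associated code, see this https URL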
Abstract:Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require a complex implementation pipeline. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension through theoretical analysis. PowerAttention achieves exponential receptive field growth in d-layer LLMs, allowing each output token to attend to 2^d tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by 5% to 40%, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a comparable time complexity to sliding window attention. Efficiency evaluations further highlight PowerAttention’s superior speedup in both prefilling and decoding phases compared with dynamic sparse attentions and full attention (3.0× faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
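The exponential receptive-field claim can be checked with a small sketch. It assumes a causal pattern in which each token attends to itself and to earlier tokens at power-of-two distances; this pattern and the helper names `attended_positions` and `receptive_field` are illustrative assumptions, not necessarily the paper's exact mask (which may, for example, also include a local window).

```python
def attended_positions(i):
    """Keys visible to query i: itself plus positions i - 2**k (causal)."""
    keys = {i}
    offset = 1
    while i - offset >= 0:
        keys.add(i - offset)
        offset *= 2
    return keys

def receptive_field(i, layers):
    """Positions whose information can reach token i after `layers` layers."""
    field = {i}
    for _ in range(layers):
        # One more layer: every reachable position pulls in its own keys.
        field = set().union(*(attended_positions(j) for j in field))
    return field

i, d = 31, 3
print(len(attended_positions(i)))                    # keys per query grow like log2(n)
print(set(range(i - 2**d + 1, i + 1)) <= receptive_field(i, d))
```

Under this assumed pattern, any distance below 2^d has at most d one-bits in binary, so it can be covered by at most d power-of-two hops. That is why the last 2^d positions form a gap-free receptive field after d layers, while each query only touches a logarithmic number of keys.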
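[NLP-18] Scaling Crowdsourced Election Monitoring: Construction and Evaluation of Classification Models for Multilingual and Cross-Domain Classification Settings
【Quick read】: This paper tackles the scaling bottleneck of crowdsourced election monitoring, which relies on volunteers to process large volumes of incoming election reports by hand. It proposes a two-step classification approach: first identify informative reports, then categorise them into distinct information types. The key is to use multilingual transformer models (such as XLM-RoBERTa) and multilingual embeddings (such as SBERT), augmented with linguistically motivated features. The approach reaches F1 scores of 77% for informativeness detection and 75% for information type classification. Cross-domain experiments show promise for transferring trained models to a new electoral domain, with F1 scores of 59% in zero-shot and 63% in few-shot settings. The study also finds a performance bias in favour of English reports over Swahili ones, likely caused by imbalanced training data, which calls for caution when deploying such classifiers in practice.
Link: https://arxiv.org/abs/2503.03582
Authors: Jabez Magomere, Scott Hale
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: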
Abstract:The adoption of crowdsourced election monitoring as a complementary alternative to traditional election monitoring is on the rise. Yet, its reliance on digital response volunteers to manually process incoming election reports poses a significant scaling bottleneck. In this paper, we address the challenge of scaling crowdsourced election monitoring by advancing the task of automated classification of crowdsourced election reports to multilingual and cross-domain classification settings. We propose a two-step classification approach of first identifying informative reports and then categorising them into distinct information types. We conduct classification experiments using multilingual transformer models such as XLM-RoBERTa and multilingual embeddings such as SBERT, augmented with linguistically motivated features. Our approach achieves F1-Scores of 77% for informativeness detection and 75% for information type classification. We conduct cross-domain experiments, applying models trained in a source electoral domain to a new target electoral domain in zero-shot and few-shot classification settings. Our results show promising potential for model transfer across electoral domains, with F1-Scores of 59% in zero-shot and 63% in few-shot settings. However, our analysis also reveals a performance bias in detecting informative English reports over Swahili, likely due to imbalances in the training data, indicating a need for caution when deploying classification models in real-world election scenarios.
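[NLP-19] An Aspect Extraction Framework using Different Embedding Types, Learning Models, and Dependency Structure
【Quick read】: This paper addresses aspect extraction, the foundation for accurate aspect-level sentiment analysis. The key contribution is a set of aspect extraction models that combine several embedding types (word embeddings and part-of-speech tag embeddings) with several learning models, plus a new tree positional encoding based on dependency parsing output that captures aspect positions in sentences more effectively. To evaluate the models, the authors also build a Turkish aspect extraction dataset by machine translating an English dataset in a controlled setting. On two Turkish datasets, the proposed models mostly outperform prior studies that use the same datasets, and adding tree positional encoding improves performance.
Link: https://arxiv.org/abs/2503.03512
Authors: Ali Erkan, Tunga Güngör
Affiliation: Boğaziçi University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Aspect-based Sentiment Analysis, Aspect Extraction, Natural Language Processing, Machine Learning, Deep Neural Networks, Turkish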
Abstract:Aspect-based sentiment analysis has gained significant attention in recent years due to its ability to provide fine-grained insights for sentiment expressions related to specific features of entities. An important component of aspect-based sentiment analysis is aspect extraction, which involves identifying and extracting aspect terms from text. Effective aspect extraction serves as the foundation for accurate sentiment analysis at the aspect level. In this paper, we propose aspect extraction models that use different types of embeddings for words and part-of-speech tags and that combine several learning models. We also propose a tree positional encoding, based on dependency parsing output, to better capture the aspect positions in sentences. In addition, a new aspect extraction dataset is built for Turkish by machine translating an English dataset in a controlled setting. The experiments conducted on two Turkish datasets show that the proposed models mostly outperform the studies that use the same datasets, and that incorporating tree positional encoding increases the performance of the models.
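[NLP-20] CURVALID: Geometrically-guided Adversarial Prompt Detection
【Quick read】: This paper addresses a safety challenge in deploying large language models (LLMs): undesirable behaviour induced by adversarial prompts. Current mitigation strategies mainly activate built-in defence mechanisms or fine-tune the LLM, yet the fundamental differences between adversarial and benign prompts are not well understood. The paper's key contribution is CurvaLID, a framework that detects adversarial prompts efficiently from the geometric properties of text prompts. It does not depend on a particular LLM and offers a unified detection method across many kinds of adversarial prompts and LLM architectures. At its core, it combines a notion of curvature extended via the Whewell equation with Local Intrinsic Dimensionality (LID) to analyse properties such as semantic shifts and manifold curvature, revealing how adversarial prompts differ from benign ones. Experiments show that CurvaLID detects and rejects adversarial queries effectively, laying groundwork for safer LLM deployment.
Link: https://arxiv.org/abs/2503.03502
Authors: Canaan Yung, Hanxun Huang, Sarah Monazam Erfani, Christopher Leckie
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 5 figures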
Abstract:Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an n-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at this https URL
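For readers unfamiliar with Local Intrinsic Dimensionality, the sketch below implements the widely used maximum-likelihood estimator, which needs only a point's k nearest-neighbour distances. How CurvaLID computes neighbours, and in which embedding space, is not stated in the abstract, so the function name `lid_mle` and the synthetic distances are assumptions for illustration.

```python
import math

def lid_mle(distances):
    """Maximum-likelihood LID estimate from k nearest-neighbour distances.

    LID = -1 / mean(log(r_i / r_max)), where r_max is the largest distance.
    """
    r = sorted(d for d in distances if d > 0)
    if len(r) < 2:
        raise ValueError("need at least two positive distances")
    r_max = r[-1]
    mean_log_ratio = sum(math.log(d / r_max) for d in r) / len(r)
    if mean_log_ratio == 0:
        raise ValueError("all distances are equal; the estimate is undefined")
    return -1.0 / mean_log_ratio

# Synthetic check: in m dimensions the number of neighbours within radius r
# grows like r**m, so the i-th neighbour distance behaves like (i/k)**(1/m).
k, m = 1000, 3
distances = [(i / k) ** (1 / m) for i in range(1, k + 1)]
print(round(lid_mle(distances), 2))
```

A detector in this spirit would compute such an estimate per prompt (or per token neighbourhood) and feed it, together with curvature features, to a classifier.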
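[NLP-21] Deictic Codes, Demonstratives, and Reference: A Step Toward Solving the Grounding Problem
【Quick read】: This paper addresses how to fix the referents of perceptual demonstratives, a basic form of experiential concepts. To avoid 'encodingism', that is, relating representations only to other representations rather than to the world, it argues that reference fixing must be bottom-up and nonconceptual, so that it can break the circle of conceptual content and touch the world. The key is an appropriate causal relation between representations and the world, which is provided by spatial and object-centered attention that leads to the formation of object files through deictic acts. This entire causal process takes place at a pre-conceptual level, meeting the requirement for a solution to the grounding problem, and it captures fundamental insights from Putnam's and Kripke's work on 'new' reference.
Link: https://arxiv.org/abs/2503.03495
Authors: Athanassios Raftopoulos, Vincent C. Müller
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: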
Abstract:In this paper we address the issue of grounding for experiential concepts. Given that perceptual demonstratives are a basic form of such concepts, we examine ways of fixing the referents of such demonstratives. To avoid ‘encodingism’, that is, relating representations to representations, we postulate that the process of reference fixing must be bottom-up and nonconceptual, so that it can break the circle of conceptual content and touch the world. For that purpose, an appropriate causal relation between representations and the world is needed. We claim that this relation is provided by spatial and object-centered attention that leads to the formation of object files through the function of deictic acts. This entire causal process takes place at a pre-conceptual level, meeting the requirement for a solution to the grounding problem. Finally we claim that our account captures fundamental insights in Putnam’s and Kripke’s work on “new” reference.
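[NLP-22] Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
【Quick read】: This paper investigates whether jointly modeling gestures and language can improve spoken discourse modeling in language models. The key is to encode 3D human motion sequences into discrete gesture tokens with a VQ-VAE and then align the gesture token embeddings with text embeddings through feature alignment, mapping them into the text embedding space. Text infilling experiments show that gesture information improves prediction accuracy for discourse connectives, stance markers, and quantifiers, indicating that gestures offer complementary information for modeling spoken discourse.
Link: https://arxiv.org/abs/2503.03474
Authors: Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
Affiliation: Saarland University; Max Planck Institute for Informatics (MPII)
Categories: Computation and Language (cs.CL)
Comments: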
Abstract:Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
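The step that turns continuous motion features into discrete gesture tokens is, in a VQ-VAE, a nearest-neighbour lookup in a learned codebook. The sketch below shows only that lookup; the toy 2-D codebook, the latent values, and the name `quantize` are assumptions, and a real model would learn both the encoder and the codebook.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]

# Toy codebook with three entries and four encoder outputs (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
motion_latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.8, 0.1)]
gesture_tokens = quantize(motion_latents, codebook)
print(gesture_tokens)  # [0, 1, 2, 1]
```

The resulting integer sequence can be treated like text tokens, which is what makes the alignment of gesture embeddings with the text embedding space possible.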
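[NLP-23] Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation
【Quick read】: This paper addresses the high cost and language dependence of building multilingual open-domain dialogue datasets. Conventional approaches need substantial labour and time to collect and annotate data in each language. The key idea is to exploit the instruction-following and multilingual abilities of large language models (LLMs) to generate new samples in the target languages within a single model, without an explicit machine translation step. The paper introduces an LLM-based data generation pipeline that produces open-domain dialogue data in several target languages from demonstrations in a single source language, and it adds speech events (the type of conversation) and common ground (the premises of a conversation) to make the generated dialogues more natural and realistic. The approach lowers the cost of multilingual data collection and better preserves language-specific nuances.
Link: https://arxiv.org/abs/2503.03462
Authors: Ahmed Njifenjou, Virgile Sucal, Bassam Jabaian, Fabrice Lefèvre
Affiliation: Laboratoire Informatique d’Avignon (LIA), CERI - Avignon Université
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: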
Abstract:The prevailing paradigm in the domain of Open-Domain Dialogue agents predominantly focuses on the English language, encompassing both models and datasets. Furthermore, the financial and temporal investments required for crowdsourcing such datasets for finetuning are substantial, particularly when multiple languages are involved. Fortunately, advancements in Large Language Models (LLMs) have unveiled a plethora of possibilities across diverse tasks. Specifically, instruction-tuning has enabled LLMs to execute tasks based on natural language instructions, occasionally surpassing the performance of human crowdworkers. Additionally, these models possess the capability to function in various languages within a single thread. Consequently, to generate new samples in different languages, we propose leveraging these capabilities to replicate the data collection process. We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using LLMs, with demonstrations provided in a unique Source Language. By eschewing explicit Machine Translation in this approach, we enhance the adherence to language-specific nuances. We apply this methodology to the PersonaChat dataset. To enhance the openness of generated dialogues and mimic real life scenarii, we added the notion of speech events corresponding to the type of conversation the speakers are involved in and also that of common ground which represents the premises of a conversation.
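[NLP-24] Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
【Quick read】: This paper addresses preference optimisation of large language models (LLMs) with zeroth-order (ZO) methods. First-order methods such as back-propagation are effective but computationally expensive, while existing ZO research has focused mostly on classification and paid little attention to more complex generative tasks such as summarisation, machine translation, and conversational assistants. The paper proposes ZOPrO, a new algorithm whose key idea is to analyse how the policy and reward models update relative to each other during traditional first-order preference optimisation, and then to adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy. This accelerates the convergence of ZO methods on generative tasks, with convergence times comparable to first-order methods. The work is the first to apply ZO methods to preference optimisation in LLMs, going beyond classification tasks.
Link: https://arxiv.org/abs/2503.03460
Authors: Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
Affiliation: Uppsala University
Categories: Computation and Language (cs.CL)
Comments: WIP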
Abstract:Fine-tuning LLMs with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation, using function evaluations instead of gradients, reduces memory usage but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at this https URL
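As background for the SPSA component, the sketch below shows plain SPSA on a toy quadratic: the gradient is estimated from two loss evaluations along one random ±1 direction, with no back-propagation. This is vanilla SPSA, not ZOPrO; the targeted sampling strategy from the paper is not reproduced, and the toy loss, step size, and names such as `spsa_gradient` are assumptions.

```python
import random

def spsa_gradient(loss, theta, c, rng):
    """Estimate the gradient of `loss` at `theta` from two evaluations."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]  # Rademacher perturbation
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    scale = (loss(plus) - loss(minus)) / (2.0 * c)
    # For +/-1 entries, dividing by delta_i equals multiplying by delta_i.
    return [scale * d for d in delta]

def loss(theta):
    """Toy objective with its minimum at theta = (1, ..., 1)."""
    return sum((t - 1.0) ** 2 for t in theta)

rng = random.Random(0)
theta = [0.0] * 5
initial = loss(theta)
for _ in range(300):
    grad = spsa_gradient(loss, theta, c=1e-3, rng=rng)
    theta = [t - 0.05 * g for t, g in zip(theta, grad)]
final = loss(theta)
print(initial, final)
```

Each step costs two forward evaluations regardless of dimension, which is the memory advantage the abstract refers to; the price is a noisy estimate, which is why sampling strategy matters for convergence speed.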
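[NLP-25] Unified Mind Model: Reimagining Autonomous Agents in the LLM Era
【Quick read】: This paper addresses the theoretical foundation for building general autonomous agents with human-level cognitive abilities, which remains a challenging open problem. Its key contribution is a new theoretical cognitive architecture, the Unified Mind Model (UMM), which builds on Global Workspace Theory and uses large language models (LLMs) to give agents a range of cognitive abilities, such as multi-modal perception, planning, reasoning, tool use, learning, memory, reflection, and motivation. Building on UMM, the authors develop MindOS, an agent-building engine that lets users quickly create domain- or task-specific autonomous agents without programming.
Link: https://arxiv.org/abs/2503.03459
Authors: Pengbo Hu, Xiang Ying
Affiliation: Mindverse.ai
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages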
Abstract:Large language models (LLMs) have recently demonstrated remarkable capabilities across domains, tasks, and languages (e.g., ChatGPT and GPT-4), reviving the research of general autonomous agents with human-like cognitive abilities. Such human-level agents require semantic comprehension and instruction-following capabilities, which exactly fall into the strengths of LLMs. Although there have been several initial attempts to build human-level agents based on LLMs, the theoretical foundation remains a challenging open problem. In this paper, we propose a novel theoretical cognitive architecture, the Unified Mind Model (UMM), which offers guidance to facilitate the rapid creation of autonomous agents with human-level cognitive abilities. Specifically, our UMM starts with the global workspace theory and further leverages LLMs to enable the agent with various cognitive abilities, such as multi-modal perception, planning, reasoning, tool use, learning, memory, reflection and motivation. Building upon UMM, we then develop an agent-building engine, MindOS, which allows users to quickly create domain-/task-specific autonomous agents without any programming effort.
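[NLP-26] Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties
【Quick read】: This paper evaluates how capable large language models (LLMs) are in the domain of taxation. Prior work has mostly explored the legal domain in general, research dedicated to taxation remains scarce, and the datasets used are either too simplified to reflect real-world complexity or not publicly available. To fill this gap, the paper introduces PLAT, a new benchmark that assesses the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT is designed to test LLMs' understanding of tax law, particularly in cases where resolving the issue takes a comprehensive assessment rather than only applying the related statutes. The key finding is that enabling retrieval, self-reasoning, and discussion among multiple agents with specific role assignments mitigates the limited baseline ability of LLMs on conflicting issues.
Link: https://arxiv.org/abs/2503.03444
Authors: Eunkyung Choi, Young Jin Suh, Hun Park, Wonseok Hwang
Affiliation: University of Seoul; LBOX
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages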
Abstract:How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain in general, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect the real-world complexities, or unavailable as open source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT is constructed to evaluate LLMs’ understanding of tax law, particularly in cases where resolving the issue requires more than just applying related statutes. Our experiments with six LLMs reveal that their baseline capabilities are limited, especially when dealing with conflicting issues that demand a comprehensive understanding. However, we found that this limitation can be mitigated by enabling retrieval, self-reasoning, and discussion among multiple agents with specific role assignments.
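[NLP-27] RASD: Retrieval-Augmented Speculative Decoding
【Quick read】: This paper addresses two weaknesses of model-based speculative decoding: reduced effectiveness in out-of-domain scenarios and a limited acceptance length in the verification step. Existing methods rely on lightweight draft models or additional model structures to generate draft tokens and retrieve context. Because draft models are small and trained on limited data, they become less effective out of domain, and the time cost of the drafting phase caps the acceptance length during verification, which limits overall efficiency.
The core of the proposed solution is Retrieval-Augmented Speculative Decoding (RASD). Its key innovations are tree pruning and tree fusion: first, a pruning method based on the draft model's probability distribution constructs the optimal retrieval tree; second, a longest prefix matching algorithm merges the tree generated by the draft model with the retrieval tree into a single tree for verification. Experiments show that RASD achieves state-of-the-art inference acceleration on tasks such as DocQA, Summary, Code, and In-Domain QA, and that it scales well, integrating seamlessly with both generation-based and retrieval-based speculative decoding methods.
Link: https://arxiv.org/abs/2503.03434
Authors: Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang
Affiliation: Alibaba Cloud, Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: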
Abstract:Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model structures to generate draft tokens and retrieve context from databases. Due to the draft model’s small size and limited training data, model-based speculative decoding frequently becomes less effective in out-of-domain scenarios. Additionally, the time cost of the drafting phase results in a low upper limit on acceptance length during the verification step, limiting overall efficiency. This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding. We introduce tree pruning and tree fusion to achieve this. Specifically, we develop a pruning method based on the draft model’s probability distribution to construct the optimal retrieval tree. Second, we employ the longest prefix matching algorithm to merge the tree generated by the draft model with the retrieval tree, resulting in a unified tree for verification. Experimental results demonstrate that RASD achieves state-of-the-art inference acceleration across tasks such as DocQA, Summary, Code, and In-Domain QA. Moreover, RASD exhibits strong scalability, seamlessly integrating with various speculative decoding approaches, including both generation-based and retrieval-based methods.
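The tree fusion step can be pictured as merging two token tries so that candidates sharing a prefix share nodes. The sketch below is a simplified stand-in that assumes both trees are plain nested dictionaries keyed by token id; RASD's actual longest-prefix-matching procedure and its probability-based pruning are not reproduced, and all function names are hypothetical.

```python
def build_tree(sequences):
    """Build a token trie (nested dicts) from candidate continuations."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def fuse(draft_tree, retrieval_tree):
    """Union two tries so that branches sharing a prefix share nodes."""
    merged = {}
    for tree in (draft_tree, retrieval_tree):
        stack = [(tree, merged)]
        while stack:
            src, dst = stack.pop()
            for tok, child in src.items():
                stack.append((child, dst.setdefault(tok, {})))
    return merged

def count_nodes(tree):
    """Number of token nodes the target model would have to verify."""
    return sum(1 + count_nodes(child) for child in tree.values())

draft = build_tree([[1, 2, 3], [1, 2, 4]])       # from the draft model
retrieved = build_tree([[1, 2, 5], [7, 8]])      # from a retrieval source
verify_tree = fuse(draft, retrieved)
print(count_nodes(draft) + count_nodes(retrieved), count_nodes(verify_tree))
```

In this toy case the two trees hold 9 nodes separately but only 7 after fusion, because the shared prefix `1, 2` is verified once. That is the efficiency argument for a unified verification tree.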
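[NLP-28] When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
【Quick read】: This paper addresses the drop in performance of embedding-based fact-check retrieval when debunked online misinformation reappears in edited form. It introduces a taxonomy of six common real-world misinformation edits and a perturbation framework that generates valid, natural claim variations. The key contribution is a set of train-time and inference-time mitigation approaches that improve in-domain robustness by up to 17 percentage points and out-of-domain generalization by 10 percentage points over baseline models. A strong reranker mitigates some issues but cannot fully compensate for first-stage retrieval gaps, so improving first-stage retrieval is the core of the work.
Link: https://arxiv.org/abs/2503.03417
Authors: Jabez Magomere, Emanuele La Malfa, Manuel Tonneau, Ashkan Kazemi, Scott Hale
Affiliation: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: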
Abstract:Online misinformation remains a critical challenge, and fact-checkers increasingly rely on embedding-based methods to retrieve relevant fact-checks. Yet, when debunked claims reappear in edited forms, the performance of these methods is unclear. In this work, we introduce a taxonomy of six common real-world misinformation edits and propose a perturbation framework that generates valid, natural claim variations. Our multi-stage retrieval evaluation reveals that standard embedding models struggle with user-introduced edits, while LLM-distilled embeddings offer improved robustness at a higher computational cost. Although a strong reranker helps mitigate some issues, it cannot fully compensate for first-stage retrieval gaps. Addressing these retrieval gaps, our train- and inference-time mitigation approaches enhance in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points over baseline models. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation.
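[NLP-29] The Serendipity of Claude AI: Case of the 13 Low-Resource National Languages of Mali
【Quick read】: This paper addresses the poor support that automatic translation and generative AI offer for low-resource languages, focusing on the 13 official national languages of Mali. The key is an evaluation of Claude AI's translation performance on each of these languages, combining automatic metrics (ChrF2 and BLEU) with human evaluation along several dimensions: translation accuracy, contextual consistency, robustness to dialect variations, management of linguistic bias, adaptation to a limited corpus, and ease of understanding. The study finds that Claude AI performs robustly for languages with very modest resources and, for languages with minimal resources, still mimics some elements of the language even though it cannot produce understandable and coherent text.
Link: https://arxiv.org/abs/2503.03380
Authors: Alou Dembele, Nouhoum Souleymane Coulibaly, Michael Leventhal (RobotsMali AI4D Lab, Bamako, Mali)
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: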
Abstract:Recent advances in artificial intelligence (AI) and natural language processing (NLP) have improved the representation of underrepresented languages. However, most languages, including Mali’s 13 official national languages, continue to be poorly supported or unsupported by automatic translation and generative AI. This situation appears to have slightly improved with certain recent LLM releases. The study evaluated Claude AI’s translation performance on each of the 13 national languages of Mali. In addition to ChrF2 and BLEU scores, human evaluators assessed translation accuracy, contextual consistency, robustness to dialect variations, management of linguistic bias, adaptation to a limited corpus, and ease of understanding. The study found that Claude AI performs robustly for languages with very modest language resources and, while unable to produce understandable and coherent texts for Malian languages with minimal resources, still manages to produce results which demonstrate the ability to mimic some elements of the language.
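To illustrate what a ChrF2-style score measures, here is a simplified sentence-level sketch: character n-gram precision and recall are averaged over n-gram orders and combined into an F-score with β = 2, which weights recall more heavily. It is not compatible with any reference implementation; whitespace handling, smoothing, and corpus-level aggregation are simplified assumptions, and the name `chrf` is illustrative.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: averaged char n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("abc def", "abc def"))  # 1.0
print(chrf("aaa", "bbb"))          # 0.0
```

Character-level matching is one reason such metrics are often preferred for morphologically rich or low-resource languages, where word-level overlap is sparse.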
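[NLP-30] Transformers for molecular property prediction: Domain adaptation efficiently improves performance
【Quick read】: This paper investigates limitations of current transformer-based chemical language models for molecular property prediction, in particular whether larger and more diverse pre-training datasets actually improve prediction. It evaluates the effect of pre-training dataset size and diversity on model performance and explores domain adaptation as a way to improve it. Increasing the pre-training set beyond 400K molecules from the GuacaMol dataset brings no significant improvement on four ADME endpoints (solubility, permeability, microsomal stability, and plasma protein binding). In contrast, domain adaptation by further training on a small set of domain-relevant molecules with multi-task regression significantly improves performance on three of the four ADME endpoints (P-value < 0.001). A model pre-trained on 400K molecules and domain-adapted in this way performs similarly (P-value > 0.05) to larger transformer models such as MolBERT (pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules), and also similarly to a random forest model trained on basic physicochemical properties. The authors conclude that further systematic analysis of pre-training strategies, downstream data, and scaling laws can improve current transformer models and lead to more efficient and practical tools.
Link: https://arxiv.org/abs/2503.03360
Authors: Afnan Sultan, Max Rausch-Dupont, Shahrukh Khan, Olga Kalinina, Andrea Volkamer, Dietrich Klakow
Affiliation: Saarland University; Saarland Informatics Campus; Center for Bioinformatics; Data Driven Drug Design
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: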
Abstract:Most of the current transformer-based chemical language models are pre-trained on millions to billions of molecules. However, the improvement from such scaling in dataset size is not confidently linked to improved molecular property prediction. The aim of this study is to investigate and overcome some of the limitations of transformer models in predicting molecular properties. Specifically, we examine the impact of pre-training dataset size and diversity on the performance of transformer models and investigate the use of domain adaptation as a technique for improving model performance. First, our findings indicate that increasing pretraining dataset size beyond 400K molecules from the GuacaMol dataset does not result in a significant improvement on four ADME endpoints, namely, solubility, permeability, microsomal stability, and plasma protein binding. Second, our results demonstrate that using domain adaptation by further training the transformer model on a small set of domain-relevant molecules, i.e., a few hundred to a few thousand, using multi-task regression of physicochemical properties was sufficient to significantly improve performance for three out of the four investigated ADME endpoints (P-value < 0.001). Finally, we observe that a model pre-trained on 400K molecules and domain adapted on a few hundred/thousand molecules performs similarly (P-value > 0.05) to more complicated transformer models like MolBERT (pre-trained on 1.3M molecules) and MolFormer (pre-trained on 100M molecules). A comparison to a random forest model trained on basic physicochemical properties showed similar performance to the examined transformer models. We believe that current transformer models can be improved through further systematic analysis of pre-training and downstream data, pre-training objectives, and scaling laws, ultimately leading to better and more helpful models.
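[NLP-31] EnigmaToM: Improve LLMs’ Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States
【Quick read】: This paper addresses the efficiency and applicability of Theory-of-Mind (ToM) reasoning with large language models (LLMs). Existing methods reason mainly through perceptual perspective-taking, but they struggle with high-order ToM reasoning, which requires multi-hop reasoning about characters' beliefs, and their heavy reliance on LLMs reduces efficiency and limits applicability. The paper proposes EnigmaToM, a neuro-symbolic framework whose key component is a Neural Knowledge Base of entity states (Enigma) that supports (1) a psychology-inspired iterative masking mechanism for accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured representations of entity states and uses spatial information as an inductive bias to construct scene graphs, which support belief tracking across ToM orders and enrich events with fine-grained entity state details. On multiple benchmarks, including ToMi, HiToM, and FANToM, EnigmaToM significantly improves the ToM reasoning of LLMs, especially in high-order reasoning scenarios.
Link: https://arxiv.org/abs/2503.03340
Authors: Hainiu Xu, Siya Qi, Jiazheng Li, Yuxiang Zhou, Jinhua Du, Caroline Catmur, Yulan He
Affiliation: unknown
Categories: Computation and Language (cs.CL)
Comments: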
Abstract:Theory-of-Mind (ToM), the ability to infer others’ perceptions and mental states, is fundamental to human interaction but remains a challenging task for Large Language Models (LLMs). While existing ToM reasoning methods show promise with reasoning via perceptual perspective-taking, they often rely excessively on LLMs, reducing their efficiency and limiting their applicability to high-order ToM reasoning, which requires multi-hop reasoning about characters’ beliefs. To address these issues, we present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma) for (1) a psychology-inspired iterative masking mechanism that facilitates accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured representations of entity states, which construct spatial scene graphs – leveraging spatial information as an inductive bias – for belief tracking of various ToM orders and enhancing events with fine-grained entity state details. Experimental results on multiple benchmarks, including ToMi, HiToM, and FANToM, show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes, particularly excelling in high-order reasoning scenarios.
[NLP-32] iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News
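[Quick read]: This paper addresses the fact that existing emotion detection methods overlook the subjectivity of affective experience: they rely on aggregated labels that mask individual differences in emotional responses. It introduces the iNews dataset, which explicitly captures subjective affective responses to news headlines through annotations from 291 UK participants on 2,899 multimodal Facebook news posts. Each post carries multifaceted labels (valence, arousal, dominance, discrete emotions, and others), and the dataset also records annotator background such as demographics, personality, media trust, and consumption patterns. This persona information explains 15.2% of annotation variance, which the authors report is higher than in existing NLP datasets. Adding it yields a 7% accuracy gain in zero-shot prediction and remains beneficial in the 32-shot setting, which supports research on LLM personalization, subjectivity, affective computing, and individual-level behavior simulation.
Link: https://arxiv.org/abs/2503.03335
Authors: Tiancheng Hu, Nigel Collier
Affiliations: University of Cambridge
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: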
Abstract:Current approaches to emotion detection often overlook the inherent subjectivity of affective experiences, instead relying on aggregated labels that mask individual variations in emotional responses. We introduce iNews, a novel large-scale dataset explicitly capturing subjective affective responses to news headlines. Our dataset comprises annotations from 291 demographically diverse UK participants across 2,899 multimodal Facebook news posts from major UK outlets, with an average of 5.18 annotators per sample. For each post, annotators provide multifaceted labels including valence, arousal, dominance, discrete emotions, content relevance judgments, sharing likelihood, and modality importance ratings (text, image, or both). Furthermore, we collect comprehensive annotator persona information covering demographics, personality, media trust, and consumption patterns, which explain 15.2% of annotation variance, higher than in existing NLP datasets. Incorporating this information yields a 7% accuracy gain in zero-shot prediction and remains beneficial even in the 32-shot setting. iNews will enhance research in LLM personalization, subjectivity, affective computing, and individual-level behavior simulation.
[NLP-33] LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
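[Quick read]: This paper targets two problems in work on Text-Attributed Graphs (TAGs): decoupled architectures with two-stage alignment limit the synergy between Large Language Models (LLMs) and Graph Neural Networks (GNNs), and assigning out-of-vocabulary (OOV) tokens to graph nodes leads to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, all of which hinder cross-graph and cross-task transferability. The paper proposes PromptGFM, a versatile Graph Foundation Model (GFM) grounded in graph vocabulary learning. It has two core components: (1) a Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, giving seamless GNN-LLM integration and graph-text alignment; and (2) a Graph Inference Module, which builds a language-based graph vocabulary that ensures expressiveness, transferability, and scalability and provides readable instructions for LLM fine-tuning. Experiments confirm PromptGFM's superiority and transferability across diverse graphs and tasks.
Link: https://arxiv.org/abs/2503.03313
Authors: Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
Affiliations: Rutgers University; University of Liverpool; University of Science and Technology of China; Meta AI; North Carolina State University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: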
Abstract:Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite extensive efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate our superiority and transferability across diverse graphs and tasks. The code is available at this https URL.
[NLP-34] The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation EMNLP
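[Quick read]: This paper asks whether neural machine translation (NMT) produces translations that are consistent with common sense. Specifically, it examines whether NMT applies commonsense knowledge correctly when resolving lexical ambiguity and syntactic ambiguity (both contextless and contextual).
The key contribution is a test suite of three test sets covering 7 types of common sense, with 1,200 manually created triples, each containing a source sentence and two contrastive translations. On the target translations of this suite, pretrained language models such as BERT and GPT-2 reach a commonsense reasoning accuracy below 72%, while NMT reaches a reasoning accuracy of only 60.1% and a reasoning consistency of only 31%. NMT therefore handles commonsense-dependent ambiguity poorly, and the paper goes on to analyze the factors that affect this capability.
Link: https://arxiv.org/abs/2503.03308
Authors: Jie He, Tao Wang, Deyi Xiong, Qun Liu
Affiliations: College of Intelligence and Computing, Tianjin University, China; School of Computer Science and Technology, Soochow University, China; Huawei Noah’s Ark Lab, Hong Kong, China
Categories: Computation and Language (cs.CL)
Comments: EMNLP findings 2020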
Abstract:Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have an impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (60.1%) and reasoning consistency (31%). The built commonsense test suite is available at this https URL.
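To make the contrastive setup concrete, here is a minimal sketch of how a reasoning accuracy could be computed over such triples. It assumes, without confirmation from the abstract, that a triple counts as correct when the model scores the right translation above the contrastive one. The `contrastive_accuracy` function, the `score` callable, the two triples, and the numbers in `toy_scores` are all hypothetical; the consistency metric is not sketched because the abstract does not define it.

```python
def contrastive_accuracy(triples, score):
    """Fraction of triples where the correct translation outscores the contrastive one.

    triples: iterable of (source, correct_translation, contrastive_translation).
    score:   hypothetical callable (source, translation) -> float, higher = preferred.
    """
    triples = list(triples)
    hits = sum(score(src, good) > score(src, bad) for src, good, bad in triples)
    return hits / len(triples)


# Illustrative triples built on the classic "pen" ambiguity
# (enclosure vs. writing tool); they are not taken from the test suite.
triples = [
    ("The box is in the pen.", "盒子在围栏里。", "盒子在钢笔里。"),
    ("The pen is in the box.", "钢笔在盒子里。", "围栏在盒子里。"),
]

# Stand-in scores; a real run would use model log-probabilities.
toy_scores = {
    "盒子在围栏里。": -4.0, "盒子在钢笔里。": -3.0,  # prefers the wrong reading
    "钢笔在盒子里。": -2.5, "围栏在盒子里。": -6.0,  # prefers the right reading
}

accuracy = contrastive_accuracy(triples, lambda src, tgt: toy_scores[tgt])
```

With these stand-in scores the model gets one of the two triples right, so `accuracy` is 0.5.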
[NLP-35] SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection
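[Quick read]: This paper addresses two core problems in the automatic evaluation of Open Domain Event Detection (ODED): (1) existing evaluation benchmarks are not representative, so they do not reflect how ODED methods perform in real-world scenarios; and (2) traditional metrics based on token-level matching rules fail to capture the semantic similarity between predictions and gold labels. The paper proposes SEOE, a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. It builds a more representative benchmark covering 564 event types across 7 major domains, uses a cost-effective supplementary annotation strategy to keep the benchmark representative, and introduces a semantic F1-score computed by large language models (LLMs), with fine-grained definitions of semantically similar labels to make the evaluation more reliable.
Link: https://arxiv.org/abs/2503.03303
Authors: Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Tong Zhang, Yu-Shi Zhu, Heyan Huang
Affiliations: Beijing Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: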
Abstract:Automatic evaluation for Open Domain Event Detection (ODED) is highly challenging, because ODED is characterized by a vast diversity of unconstrained output labels from various domains. Nearly all existing evaluation methods for ODED first construct evaluation benchmarks with limited labels and domain coverage, and then evaluate ODED methods using metrics based on token-level label matching rules. However, this kind of evaluation framework faces two issues: (1) the limited evaluation benchmarks are not representative of the real world, making it difficult to accurately reflect the performance of various ODED methods in real-world scenarios; (2) evaluation metrics based on token-level matching rules fail to capture semantic similarity between predictions and gold labels. To address these two problems, we propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection (SEOE) by constructing a more representative evaluation benchmark and introducing a semantic evaluation metric. Specifically, our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains, with a cost-effective supplementary annotation strategy to ensure the benchmark’s representativeness. The strategy also allows new event types and domains to be added in the future. Then, the proposed SEOE leverages large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels to enhance the reliability of the evaluation. Extensive experiments validate the representativeness of the benchmark and the reliability of the semantic evaluation metric. Existing ODED methods are thoroughly evaluated, and the error patterns of predictions are analyzed, revealing several insightful findings.
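The contrast between token-level matching and a semantic F1-score can be sketched as follows. This is only an illustration under stated assumptions: `semantic_f1` and its `is_match` judge are hypothetical names, and the small synonym table stands in for the LLM evaluation agent that the abstract describes. The paper's actual matching definitions are not reproduced here.

```python
def semantic_f1(predicted, gold, is_match):
    """F1 where a label counts as matched if the judge deems it semantically similar.

    is_match: hypothetical callable (predicted_label, gold_label) -> bool.
    """
    if not predicted or not gold:
        return 0.0
    matched_pred = sum(any(is_match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(is_match(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy stand-in for an LLM judge: a hand-written synonym table.
SYNONYMS = {("assault", "attack")}


def toy_judge(p, g):
    return p == g or (p, g) in SYNONYMS or (g, p) in SYNONYMS


def exact_judge(p, g):
    return p == g


predicted = ["assault", "purchase"]
gold = ["attack"]

f1_semantic = semantic_f1(predicted, gold, toy_judge)   # precision 1/2, recall 1
f1_exact = semantic_f1(predicted, gold, exact_judge)    # no label matches exactly
```

Exact matching gives the prediction "assault" no credit against the gold label "attack", while the semantic judge does, which is the gap the framework is meant to close.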
[NLP-36] Which books do I like?
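[Quick read]: This paper addresses how to help readers find enjoyable fiction that matches their personal taste. The difficulty comes from the multi-faceted nature of stories and from the fact that individuals struggle to pin down their own literary taste. The proposed solution is the ISAAC method (Introspection-Support, AI-Annotation, and Curation), which has four steps: the user supplies book ratings, an AI agent researches and annotates those books, the user reviews patterns in their book enjoyment, and the AI agent recommends new books. Its reported advantages over existing methods are the integration of automation and intuition, accurate and customizable book annotations, and explainable recommendations.
Link: https://arxiv.org/abs/2503.03300
Authors: Hannes Rosenbusch, Erdem Ozan Meral
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: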
Abstract:Finding enjoyable fiction books can be challenging, partly because stories are multi-faceted and one’s own literary taste might be difficult to ascertain. Here, we introduce the ISAAC method (Introspection-Support, AI-Annotation, and Curation), a pipeline which supports fiction readers in gaining awareness of their literary preferences and finding enjoyable books. ISAAC consists of four steps: a user supplies book ratings, an AI agent researches and annotates the provided books, patterns in book enjoyment are reviewed by the user, and the AI agent recommends new books. In this proof-of-concept self-study, the authors test whether ISAAC can highlight idiosyncratic patterns in their book enjoyment, spark a deeper reflection about their literary tastes, and make accurate, personalized recommendations of enjoyable books and underexplored literary niches. Results highlight substantial advantages of ISAAC over existing methods such as an integration of automation and intuition, accurate and customizable annotations, and explainable book recommendations. Observed disadvantages are that ISAAC’s outputs can elicit false self-narratives (if statistical patterns are taken at face value), that books cannot be annotated if their online documentation is lacking, and that people who are new to reading have to rely on assumed book ratings or movie ratings to power the ISAAC pipeline. We discuss additional opportunities of ISAAC-style book annotations for the study of literary trends, and the scientific classification of books and readers.
[NLP-37] Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
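[Quick read]: This paper addresses the limited effectiveness of vision language models (VLMs) in medical abnormality detection and localization, in particular the difficulty of associating complex, abstract terms for pathological anomalies with their visual features. The key idea is to use decomposed medical knowledge: instead of prompting the model directly to recognize a specific abnormality, medical concepts are broken down into fundamental attributes and common visual patterns. This improves the alignment between textual descriptions and visual features, and with it the recognition and localization of abnormalities. Experiments show that the method achieves performance comparable to much larger medical VLMs while using only a small amount of training data, and that it generalizes well.
Link: https://arxiv.org/abs/2503.03278
Authors: Jun Li, Che Liu, Wenjia Bai, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 11 pages, 3 figures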
Abstract:Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach on both known and previously unseen abnormalities, suggesting strong generalization capabilities.
[NLP-38] LexGenie: Automated Generation of Structured Reports for European Court of Human Rights Case Law
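[Quick read]: This paper addresses the demanding task of manually analyzing large volumes of case law to uncover how legal principles on a given topic evolve. It introduces LexGenie, an automated pipeline based on large language models (LLMs) that generates multi-case structured reports on user-specified topics within the jurisdiction of the European Court of Human Rights. LexGenie retrieves, clusters, and organizes relevant passages by topic to produce a structured outline and coherent content for each section.
Link: https://arxiv.org/abs/2503.03266
Authors: T.Y.S.S Santosh, Mahmoud Aly, Oana Ichim, Matthias Grabmair
Affiliations: School of Computation, Information, and Technology, Technical University of Munich, Germany; Graduate Institute of International and Development Studies, Geneva, Switzerland
Categories: Computation and Language (cs.CL)
Comments: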
Abstract:Analyzing large volumes of case law to uncover evolving legal principles, across multiple cases, on a given topic is a demanding task for legal professionals. Structured topical reports provide an effective solution by summarizing key issues, principles, and judgments, enabling comprehensive legal analysis on a particular topic. While prior works have advanced query-based individual case summarization, none have extended to automatically generating multi-case structured reports. To address this, we introduce LexGenie, an automated LLM-based pipeline designed to create structured reports using the entire body of case law on user-specified topics within the European Court of Human Rights jurisdiction. LexGenie retrieves, clusters, and organizes relevant passages by topic to generate a structured outline and cohesive content for each section. Expert evaluation confirms LexGenie’s utility in producing structured reports that enhance efficient, scalable legal analysis.
[NLP-39] Can Frontier LLMs Replace Annotators in Biomedical Text Mining? Analyzing Challenges and Exploring Solutions
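[Quick read]: This paper addresses the suboptimal performance of large language models (LLMs) in biomedical text mining. It identifies three main challenges: (1) LLMs fail to learn implicit dataset-specific nuances from supervised data; (2) the common formatting requirements of discriminative tasks limit the reasoning capabilities of LLMs, particularly those that lack test-time compute; and (3) LLMs struggle to adhere to annotation guidelines and match exact schemas, which hinders their understanding of detailed annotation requirements. The authors respond with targeted prompt engineering and a pipeline that dynamically extracts instructions from annotation guidelines. With these, frontier LLMs approach or surpass state-of-the-art BERT-based models with minimal reliance on manually annotated data and without fine-tuning. The authors also use model distillation to train a BERT model exclusively on synthetic data annotated by LLMs, which reaches practical performance, and they explore the feasibility of partially replacing manual annotation in production settings.
Link: https://arxiv.org/abs/2503.03261
Authors: Yichong Zhao, Susumu Goto
Affiliations: Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo; Database Center for Life Science, Joint Support-Center for Data Science Research, ROIS
Categories: Computation and Language (cs.CL)
Comments: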
Abstract:Large language models (LLMs) can perform various natural language processing (NLP) tasks through in-context learning without relying on supervised data. However, multiple previous studies have reported suboptimal performance of LLMs in biological text mining. By analyzing failure patterns in these evaluations, we identified three primary challenges for LLMs in biomedical corpora: (1) LLMs fail to learn implicit dataset-specific nuances from supervised data, (2) the common formatting requirements of discriminative tasks limit the reasoning capabilities of LLMs, particularly for LLMs that lack test-time compute, and (3) LLMs struggle to adhere to annotation guidelines and match exact schemas, which hinders their ability to understand detailed annotation requirements, an ability that is essential in biomedical annotation workflows. To address these challenges, we experimented with prompt engineering techniques targeted at the above issues, and developed a pipeline that dynamically extracts instructions from annotation guidelines. Our findings show that frontier LLMs can approach or surpass the performance of state-of-the-art (SOTA) BERT-based models with minimal reliance on manually annotated data and without fine-tuning. Furthermore, we performed model distillation on a closed-source LLM, demonstrating that a BERT model trained exclusively on synthetic data annotated by LLMs can also achieve practical performance. Based on these results, we explored the feasibility of partially replacing manual annotation with LLMs in production scenarios for biomedical text mining.
[NLP-40] FANS – Formal Answer Selection for Natural Language Math Reasoning Using Lean4
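[Quick read]: This paper addresses the weak verifiability of large language models (LLMs) in natural language (NL) math reasoning, which stems from the inherent ambiguity of natural language: LLM-generated answers lack coherence and trustworthy support. The paper proposes FANS (Formal ANswer Selection for Natural Language Math Reasoning Using Lean4). The framework translates an NL math question and its LLM-generated answers into Lean4 theorem statements, attempts to prove them with a Lean4 prover, and verifies the proofs with Lean4. The formal language (FL) result is then used to assist answer selection, which gives correct answers a computer-verifiable solution and offers an alternative to reward-model-based selection. Experiments show that FANS improves the accuracy of reward-model-enhanced LLMs on MATH-500 and AMC-23; in particular fields such as number theory it can select all correct solutions, and the selected results are backed by Lean4 formal proofs.
Link: https://arxiv.org/abs/2503.03238
Authors: Jiarui Yao, Ruida Wang, Tong Zhang
Affiliations: University of Illinois Urbana-Champaign; University of Wisconsin–Madison
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: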
Abstract:Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially text generation, classification, and question answering. However, the reasoning ability of LLMs is still debated. The inherent ambiguity of Natural Language (NL) limits LLMs’ ability to perform verifiable reasoning, so their answers lack coherence and trustworthy support. To tackle these problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs’ NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates them into Lean4 theorem statements. It then tries to prove each statement with a Lean4 prover and verifies the proof with Lean4. Finally, it uses the Formal Language (FL) result to assist in answer selection. FANS enhances LLMs’ NL math ability by providing a computer-verifiable solution for a correct answer, and it offers an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It improves the accuracy of reward-model-enhanced LLMs by up to 1.91% on the MATH-500 dataset and by up to 8.33% on AMC-23, measured against strong reward-model baselines. In some particular fields, such as number theory, in which Lean4 excels, we can even select all correct solutions. The qualitative analysis also shows that our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in this field, we will open-source all our models and datasets to further boost its development.
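As a rough illustration of the translation step, a question and a candidate answer might be turned into a Lean4 statement like the one below. The question, the candidate answer, and the theorem name are hypothetical and are not taken from the paper or its datasets; the statement is small enough that the built-in `decide` tactic should discharge it, whereas FANS uses a dedicated Lean4 prover.

```lean
-- Hypothetical NL question: "What is the remainder when 2^10 is divided by 7?"
-- Hypothetical LLM-generated candidate answer: 2
theorem candidate_answer_is_two : 2 ^ 10 % 7 = 2 := by
  decide

-- A wrong candidate (say 3) would yield the statement `2 ^ 10 % 7 = 3`,
-- which has no proof, so that candidate would gain no formal support.
```

A candidate whose statement is proved and checked by Lean4 can then be preferred during answer selection, which is the role the abstract assigns to the FL result.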
[NLP-41] Targeted Distillation for Sentiment Analysis
[Quick read]: This paper aims to build a compact model with strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). The key is to decouple the distillation target into two components, sentiment-related knowledge and task alignment, and to transfer them with a two-stage framework. The first stage, knowledge-driven distillation (KnowDist), transfers sentiment-related knowledge to strengthen fundamental sentiment analysis capabilities. The second stage, in-context learning distillation (ICLDist), transfers task-specific prompt-following abilities to optimize task alignment. Experiments on SentiBench, a sentiment analysis benchmark with 12 datasets across 3 task categories, show that the model balances size and performance well and is strongly competitive with existing small-scale LLMs.
Link: https://arxiv.org/abs/2503.03225
Authors: Yice Zhang, Guangyu Xie, Jingjie Lin, Jianzhu Bao, Qianlong Wang, Xi Zeng, Ruifeng Xu
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies; The 30th Research Institute of China Electronics Technology Group Corporation
Categories: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents a compact model that achieves strong sentiment analysis capabilities through targeted distillation from advanced large language models (LLMs). Our methodology decouples the distillation target into two key components: sentiment-related knowledge and task alignment. To transfer these components, we propose a two-stage distillation framework. The first stage, knowledge-driven distillation (KnowDist), transfers sentiment-related knowledge to enhance fundamental sentiment analysis capabilities. The second stage, in-context learning distillation (ICLDist), transfers task-specific prompt-following abilities to optimize task alignment. For evaluation, we introduce SentiBench, a comprehensive sentiment analysis benchmark comprising 3 task categories across 12 datasets. Experiments on this benchmark demonstrate that our model effectively balances model size and performance, showing strong competitiveness compared to existing small-scale LLMs.
[NLP-42] MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving
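[Quick read]: This paper addresses the fact that a single large language model (LLM) proving theorems in a computer-verifiable language such as Lean lacks a structured way to combine high-level reasoning in natural language (NL) with verification feedback in formal language (FL). It proposes MA-LoT (Multi-Agent Lean-based Long Chain-of-Thought), described as the first multi-agent framework for Lean4 theorem proving, which balances high-level NL reasoning and FL verification in Long CoT. The key is a novel LoT-Transfer Learning training-inference pipeline that draws on the formal reasoning ability emerging in Long CoT, with structured interaction between agents enabling deeper insights and long-term coherence in proof generation. MA-LoT reaches 54.51% accuracy on the Lean4 version of MiniF2F-Test, outperforming GPT-4 (22.95%), single-agent tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (DeepSeek-Prover-v1.5, 48.36%).
Link: https://arxiv.org/abs/2503.03205
Authors: Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: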
Abstract:Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize single Large Language Models (LLMs) as agents or provers to either generate complete proofs or perform tree searches. However, single-agent methods inherently lack a structured way to combine high-level reasoning in Natural Language (NL) with Formal Language (FL) verification feedback. To solve these issues, we propose MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought framework, which is, to the best of our knowledge, the first multi-agent framework for Lean4 theorem proving that balances high-level NL reasoning and FL verification in Long CoT. Using this structured interaction, our approach enables deeper insights and long-term coherence in proof generation, with which past methods struggle. We do this by leveraging emergent formal reasoning ability in Long CoT using our novel LoT-Transfer Learning training-inference pipeline. Extensive experiments show that our framework achieves a 54.51% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming GPT-4 (22.95%), single-agent tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (DeepSeek-Prover-v1.5, 48.36%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for more insightful generation from a broader perspective.
[NLP-43] Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution
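[Quick read]: This paper aims to make Universal Information Extraction (UIE) models more robust by contributing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robustness benchmarks have two key limitations: they generate only a limited range of perturbations for a single information extraction (IE) task, so they cannot evaluate UIE models effectively, and they rely on small models or handcrafted rules to generate perturbations, which often yields unnatural adversarial examples. The paper uses the generation capabilities of large language models (LLMs) to build RUIE-Bench, a benchmark with more diverse and realistic perturbations across IE tasks. On this dataset, both LLM-based models and other models show significant performance drops. To improve robustness at lower training cost, the paper proposes a data-augmentation solution that dynamically selects hard samples for iterative training based on the model's inference loss. Training with only 15% of the data gives an average relative performance improvement of 7.5% across three IE tasks.
Link: https://arxiv.org/abs/2503.03201
Authors: Jizhao Zhu, Akang Shi, Zixuan Li, Long Bai, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng
Affiliations: Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science, Shenyang Aerospace University, Shenyang, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: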
Abstract:In this paper, we aim to enhance the robustness of Universal Information Extraction (UIE) by introducing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robust benchmark datasets have two key limitations: 1) They generate only a limited range of perturbations for a single Information Extraction (IE) task, which fails to evaluate the robustness of UIE models effectively; 2) They rely on small models or handcrafted rules to generate perturbations, often resulting in unnatural adversarial examples. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench, which utilizes LLMs to generate more diverse and realistic perturbations across different IE tasks. Based on this dataset, we comprehensively evaluate existing UIE models and reveal that both LLM-based models and other models suffer from significant performance drops. To improve robustness and reduce training costs, we propose a data-augmentation solution that dynamically selects hard samples for iterative training based on the model’s inference loss. Experimental results show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
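The loss-guided selection step can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: `select_hard_samples`, `iterative_training`, the `loss_fn` and `train_step` callables, and the halving rule in the toy model are all hypothetical, and the default `fraction=0.15` merely echoes the 15% figure quoted in the abstract.

```python
def select_hard_samples(samples, loss_fn, fraction=0.15):
    """Return the `fraction` of samples with the highest current loss."""
    ranked = sorted(samples, key=loss_fn, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]


def iterative_training(samples, loss_fn, train_step, rounds=2, fraction=0.15):
    """Re-select hard samples each round, since losses change as the model trains."""
    for _ in range(rounds):
        hard = select_hard_samples(samples, loss_fn, fraction)
        train_step(hard)


# Toy stand-in for a model: a table of per-sample losses.
losses = {i: float(i) for i in range(20)}


def toy_loss(sample):
    return losses[sample]


def toy_train_step(batch):
    # Pretend that training on a sample halves its loss.
    for sample in batch:
        losses[sample] /= 2


first_pick = select_hard_samples(list(losses), toy_loss)
iterative_training(list(losses), toy_loss, toy_train_step, rounds=2)
```

Because the losses are recomputed in every round, the second round picks a different set of samples than the first, which is what "dynamically selects" suggests.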
[NLP-44] Structured Outputs Enable General-Purpose LLMs to be Medical Experts
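[Quick read]: This paper addresses inaccurate responses from large language models (LLMs) on open-ended medical question answering, caused by dangerous hallucinations or incomplete coverage of critical aspects. LLMs do well on multiple-choice tests but still fall short on open-ended medical questions. Existing approaches rely mainly on domain-specific fine-tuning, which is resource-intensive and hard to scale across models.
The key solution is a new approach based on structured medical reasoning: it guides LLMs through a seven-step cognitive process inspired by clinical diagnosis, producing more accurate and complete answers without additional training. On the MedLFQA benchmark the approach achieves a Factuality Score of 85.8, surpassing fine-tuned models, and the improvement transfers to smaller models, which points to the method's efficiency and scalability.
Link: https://arxiv.org/abs/2503.03194
Authors: Guangfu Guo, Kai Zhang, Bryan Hoo, Yujun Cai, Xiaoqian Lu, Nanyun Peng, Yiwei Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: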
Abstract:Medical question-answering (QA) is a critical task for evaluating how effectively large language models (LLMs) encode clinical knowledge and assessing their potential applications in medicine. Despite showing promise on multiple-choice tests, LLMs frequently struggle with open-ended medical questions, producing responses with dangerous hallucinations or lacking comprehensive coverage of critical aspects. Existing approaches attempt to address these challenges through domain-specific fine-tuning, but this proves resource-intensive and difficult to scale across models. To improve the comprehensiveness and factuality of medical responses, we propose a novel approach utilizing structured medical reasoning. Our method guides LLMs through a seven-step cognitive process inspired by clinical diagnosis, enabling more accurate and complete answers without additional training. Experiments on the MedLFQA benchmark demonstrate that our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models. Notably, this improvement transfers to smaller models, highlighting the method’s efficiency and scalability. Our code and datasets are available.
[NLP-45] Designing Speech Technologies for Australian Aboriginal English: Opportunities, Risks and Participation
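[Quick read]: This paper addresses the barriers to participation in civil and economic society that Australian Indigenous communities face when they use local varieties of English, such as Australian Aboriginal English, that language technologies support poorly. It also considers how technology development can reduce the risk of minoritising contemporary Indigenous sociolinguistic identities. The core questions are whether speech technologies can support speakers of Australian Aboriginal English, what risks such a project carries, and which development practices are appropriate, including how community participation can be integrated to mitigate those risks.
The key to the solution is to carry culturally appropriate and participatory processes through the whole design and implementation of the project. A case study of a real-world project shows how local culture and community feedback were combined to improve speech technologies for Australian Aboriginal English. The paper stresses that the practical economic and socio-cultural benefits of supporting these languages are realized only when participatory and culturally safe practices are enacted.
Link: https://arxiv.org/abs/2503.03186
Authors: Ben Hutchinson, Celeste Rodríguez Louro, Glenys Collard, Ned Cooper
Affiliations: Google; University of Western Australia; Australian National University
Categories: Computation and Language (cs.CL)
Comments: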
Abstract:In Australia, post-contact language varieties, including creoles and local varieties of international languages, emerged as a result of forced contact between Indigenous communities and English speakers. These contact varieties are widely used, yet are poorly supported by language technologies. This gap presents barriers to participation in civil and economic society for Indigenous communities using these varieties, and reproduces minoritisation of contemporary Indigenous sociolinguistic identities. This paper concerns three questions regarding this context. First, can speech technologies support speakers of Australian Aboriginal English, a local indigenised variety of English? Second, what risks are inherent in such a project? Third, what technology development practices are appropriate for this context, and how can researchers integrate meaningful community participation in order to mitigate risks? We argue that opportunities do exist – as well as risks – and demonstrate this through a case study exploring design practices in a real-world project aiming to improve speech technologies for Australian Aboriginal English. We discuss how we integrated culturally appropriate and participatory processes throughout the project. We call for increased support for languages used by Indigenous communities, including contact varieties, which provide practical economic and socio-cultural benefits, provided that participatory and culturally safe practices are enacted.
[NLP-46] Intermediate-Task Transfer Learning: Leveraging Sarcasm Detection for Stance Detection
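[Quick read]: This paper addresses the drop in performance of Stance Detection (SD) models on social media text that contains sarcastic and figurative language. The key idea is intermediate-task transfer learning, with sarcasm detection as the intermediate task, so that the stance detection model handles sarcastic text better. The authors fine-tune BERT and RoBERTa and concatenate convolutional, BiLSTM, and dense layers to build a transfer-learning framework tailored for SD. The model outperforms the best state-of-the-art SD baselines even before sarcasm-detection pretraining. With that pretraining, it correctly predicts 85% of the texts that the model without it had misclassified, which raises the model's average F1-score. The study also finds that the framework's success depends on the correlation of lexical attributes between the intermediate task and the target task.
Link: https://arxiv.org/abs/2503.03172
Authors: Gibson Nkhata, Susan Gauch
Affiliations: University of Arkansas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures, published in The Sixteenth International Conference on Information (eKNOW 2024)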
Abstract:Stance Detection (SD) on social media has emerged as a prominent area of interest with implications for social, business, and political applications, thereby garnering escalating research attention within NLP. The inherent subtlety and complexity of texts procured from online platforms pose challenges for SD algorithms in accurately discerning the authors’ stance. In particular, the inclusion of sarcastic and figurative language drastically impacts the performance of SD models. This paper addresses this by employing sarcasm detection intermediate-task transfer learning tailored for SD. The proposed methodology involves the finetuning of BERT and RoBERTa and the concatenation of convolutional, BiLSTM, and dense layers. Rigorous experiments are conducted on publicly available datasets to evaluate our transfer-learning framework. The performance of the approach is assessed against various State-Of-The-Art baselines for SD, providing empirical evidence of its effectiveness. Notably, our model outperforms the best SOTA models even prior to sarcasm-detection pretraining. The integration of sarcasm knowledge into the model proves instrumental in mitigating misclassifications of sarcastic textual elements in SD. Our model accurately predicts 85% of texts that were previously misclassified by the model without sarcasm-detection pretraining, thereby amplifying the average F1-score of the model. Our experiments also revealed that the success of the transfer-learning framework is contingent upon the correlation of lexical attributes between the intermediate task and the target task. This study represents the first exploration of sarcasm detection as an intermediate transfer-learning task in the context of SD and simultaneously uses the concatenation of BERT or RoBERTa with other deep-learning techniques, establishing the proposed approach as a foundational baseline for future research endeavors in this domain.
[NLP-47] DSVD: Dynamic Self-Verify Decoding for Faithful Generation in Large Language Models
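【Quick read】: This paper addresses the reliability challenges caused by hallucinations and factual errors in large language model text generation. Existing approaches either rely on preemptive strategies that underuse the model's capacity for self-correction, or resort to costly post-hoc verification. The key contribution is Dynamic Self-Verify Decoding (DSVD), a new decoding framework that improves generation reliability through real-time hallucination detection and efficient error correction. DSVD has two core components: (1) a parallel self-verification architecture for continuous quality assessment, and (2) a dynamic rollback mechanism for targeted error recovery. Experiments show that DSVD significantly improves truthfulness on question-answering tasks and factual accuracy (FActScore), and that it can be combined with existing faithful decoding methods for further gains. The work shows that real-time self-verification during generation is a viable route to more trustworthy language models without sacrificing practical deployability.
Link: https://arxiv.org/abs/2503.03149
Authors: YiQiu Guo,Yuchen Yang,Zhe Chen,Pingjie Wang,Yusheng Liao,Ya Zhang,Yanfeng Wang,Yu Wang
Affiliations: Shanghai AI Laboratory; Fudan University; Shanghai JiaoTong University; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: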
Abstract:The reliability of large language models remains a critical challenge, particularly due to their susceptibility to hallucinations and factual inaccuracies during text generation. Existing solutions either underutilize models’ self-correction with preemptive strategies or use costly post-hoc verification. To further explore the potential of real-time self-verification and correction, we present Dynamic Self-Verify Decoding (DSVD), a novel decoding framework that enhances generation reliability through real-time hallucination detection and efficient error correction. DSVD integrates two key components: (1) a parallel self-verification architecture for continuous quality assessment, and (2) a dynamic rollback mechanism for targeted error recovery. Extensive experiments across five benchmarks demonstrate DSVD’s effectiveness, achieving significant improvement in truthfulness (Question-Answering) and factual accuracy (FActScore). Results show that DSVD can be further combined with existing faithful decoding methods to achieve stronger performance. Our work establishes that real-time self-verification during generation offers a viable path toward more trustworthy language models without sacrificing practical deployability.
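To make the idea of verifying during generation and rolling back on failure more concrete, here is a toy sketch in Python. It is not the DSVD algorithm: the abstract does not give implementation details, so `propose`, `verify`, and the retry policy are hypothetical stand-ins, and the rollback here is limited to discarding the most recent candidate token.

```python
def decode_with_rollback(propose, verify, max_len, max_retries=3):
    """Toy verify-then-rollback decoding loop (illustrative only).

    propose(prefix, attempt) -> next candidate token (hypothetical interface)
    verify(sequence) -> True if the sequence passes the check (hypothetical)
    """
    output = []
    while len(output) < max_len:
        for attempt in range(max_retries):
            candidate = propose(output, attempt)
            if verify(output + [candidate]):
                # Candidate passed verification: keep it and move on.
                output.append(candidate)
                break
            # Otherwise the candidate is discarded (rolled back) and we retry.
        else:
            # No candidate passed within the retry budget; fall back to the
            # first proposal so decoding always makes progress.
            output.append(propose(output, 0))
    return output
```

A caller would plug in a real model's sampling step for `propose` and some hallucination signal for `verify`; both are assumptions of this sketch rather than components described in the paper.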
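[NLP-48] Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability, and Generalizability
【Quick read】: This paper investigates the theoretical basis of multi-round reasoning for complex tasks and the mechanism by which it improves problem-solving ability. Specifically, it studies the approximation, learnability, and generalization properties of multi-round auto-regressive models, and uses theoretical analysis to show how these properties support effective sequence learning and reasoning. The key result is a proof that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. The paper also extends PAC learning to sequence generation, showing that multi-round generation remains learnable even when the sequence length exceeds the model's context window. Finally, it examines how generalization error propagates across rounds and shows that methods such as Chain-of-Thought, debating, and self-refinement can constrain error accumulation so that outputs stay within an expected bound. Together these results provide systematic theoretical support for multi-round sequence learning and reasoning and highlight its role in inference complexity.
Link: https://arxiv.org/abs/2503.03128
Authors: Chenhui Xu,Dancheng Liu,Jiajie Li,Amir Nassereldine,Zhaohui Li,Jinjun Xiong
Affiliations: Department of Computer Science and Engineering, University at Buffalo
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: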
Abstract:Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model’s context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.
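[NLP-49] The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
【Quick read】: This paper addresses the poor generalization of existing Multimodal Reward Models (MM-RMs) to out-of-distribution data. The problem stems from over-reliance on unimodal spurious correlations, in particular text-only shortcuts in the training data, which prevents the models from leveraging true multimodal reward functions. The key solution is a Shortcut-aware MM-RM learning algorithm that dynamically reweights training samples, shifting the training distribution toward better multimodal understanding and reducing dependence on unimodal spurious correlations. Experiments show significant improvements in generalization, downstream task performance, and scalability, yielding a more robust framework for multimodal reward modeling.
Link: https://arxiv.org/abs/2503.03122
Authors: Zichao Li,Xueru Wen,Jie Lou,Yuqiu Ji,Yaojie Lu,Xianpei Han,Debing Zhang,Le Sun
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: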
Abstract:Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
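The sample reweighting described above can be illustrated with a small sketch. The weighting rule here is an assumption, not the paper's formula: it upweights samples on which a hypothetical text-only proxy model has high loss, on the reasoning that those are the samples where the text-only shortcut fails. The function names are illustrative.

```python
def shortcut_aware_weights(text_only_losses):
    """Weight each sample in proportion to a text-only proxy's loss on it.

    Assumed rule for illustration: samples the text-only shortcut cannot
    solve (high loss) get larger weights. Weights are scaled to average 1.
    """
    n = len(text_only_losses)
    total = sum(text_only_losses)
    if total == 0:
        # The shortcut solves everything; fall back to uniform weights.
        return [1.0] * n
    return [n * loss / total for loss in text_only_losses]


def weighted_mean_loss(losses, weights):
    """Average of per-sample losses after applying the sample weights."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

In a training loop the weights would be recomputed as the proxy's losses change, which is one way to read "dynamically reweighting"; the abstract does not specify the schedule.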
[NLP-50] External Reliable Information-enhanced Multimodal Contrastive Learning for Fake News Detection AAAI’25
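【Quick read】: This paper addresses two main challenges in multimodal fake news detection: multimodal information is not fully and effectively used for detection, and the external information that is introduced has low credibility or is static, which limits dynamic updates. To address them, the paper proposes ERIC-FND, an external reliable information-enhanced multimodal contrastive learning framework. Its key ideas are to strengthen news content representations with entity-enriched external information, to let the representations of different modalities learn from each other through a multimodal semantic interaction method based on multimodal contrastive learning, and to integrate news representations from different dimensions with an adaptive fusion method for the final classification. Experiments show that ERIC-FND outperforms existing state-of-the-art fake news detection methods on two commonly used datasets, X (Twitter) and Weibo.
Link: https://arxiv.org/abs/2503.03107
Authors: Biwei Cao,Qihang Wu,Jiuxin Cao,Bo Liu,Jie Gui
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted by AAAI’25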
Abstract:With the rapid development of the Internet, the information dissemination paradigm has changed and its efficiency has improved greatly. However, this also enables the rapid spread of fake news, with negative impacts on cyberspace. Information presentation formats have also evolved gradually, with news shifting from text to multimodal content. As a result, detecting multimodal fake news has become a research hotspot. However, the multimodal fake news detection field still faces two main challenges: the inability to fully and effectively utilize multimodal information for detection, and the low credibility or static nature of the introduced external information, which limits dynamic updates. To bridge these gaps, we propose ERIC-FND, an external reliable information-enhanced multimodal contrastive learning framework for fake news detection. ERIC-FND strengthens the representation of news content with an entity-enriched external information enhancement method. It also enriches the multimodal news information via a multimodal semantic interaction method in which multimodal contrastive learning is employed to make different modality representations learn from each other. Moreover, an adaptive fusion method is used to integrate the news representations from different dimensions for the eventual classification. Experiments are conducted on two commonly used datasets in different languages, X (Twitter) and Weibo. Experimental results demonstrate that our proposed model ERIC-FND outperforms existing state-of-the-art fake news detection methods under the same settings.
[NLP-51] Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation
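【Quick read】: This paper addresses hallucinations in large language models, i.e., generated content that is plausible but factually incorrect. Existing mitigation methods typically rely on sampling multiple full-length generations, which adds significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. The key solution is a new framework called Monitoring Decoding (MD). MD dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising the crucial tokens responsible for hallucinations. It uses a monitor function to identify hallucination-prone tokens during generation and then refines those tokens with a tree-based decoding strategy, improving factual accuracy and coherence while maintaining efficiency. Experiments show that MD outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy with significantly lower computational overhead.
Link: https://arxiv.org/abs/2503.03106
Authors: Yurui Chang,Bochuan Cao,Lu Lin
Affiliations: College of Information Sciences and Technology, Pennsylvania State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: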
Abstract:While large language models have demonstrated exceptional performance across a wide range of tasks, they remain susceptible to hallucinations, that is, generating plausible yet factually incorrect content. Existing methods for mitigating this risk often rely on sampling multiple full-length generations, which introduces significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. To address these limitations, we introduce Monitoring Decoding (MD), a novel framework that dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising crucial tokens responsible for hallucinations. Instead of waiting until completion of multiple full-length generations, we identify hallucination-prone tokens during generation using a monitor function, and further refine these tokens through a tree-based decoding strategy. This approach ensures enhanced factual accuracy and coherence in the generated output while maintaining efficiency. Experimental results demonstrate that MD consistently outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy while significantly reducing computational overhead.
[NLP-52] MuCo-KGC: Multi-Context-Aware Knowledge Graph Completion
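【Quick read】: This paper addresses tail entity prediction in knowledge graph completion (KGC), particularly poor generalization to entities unseen during testing. Traditional embedding-based methods such as TransE and ComplEx have improved tail entity prediction but struggle with unseen entities. Text-based models mitigate this by leveraging additional semantic context, but their reliance on negative triplet sampling introduces high computational overhead, semantic inconsistencies, and data imbalance. Existing methods also overlook valuable structural information in the knowledge graph related to entities and relations. To address these challenges, the paper proposes Multi-Context-Aware Knowledge Graph Completion (MuCo-KGC). The key idea is to predict tail entities from the contextual information of linked entities and relations within the graph, removing the dependence on entity descriptions and negative triplet sampling, which significantly reduces computational complexity while improving performance. Experiments on the standard datasets FB15k-237, WN18RR, CoDEx-S, and CoDEx-M show that MuCo-KGC outperforms state-of-the-art methods on three of them, with notable MRR gains on WN18RR, CoDEx-S, and CoDEx-M.
Link: https://arxiv.org/abs/2503.03091
Authors: Haji Gul,Ajaz Ahmad Bhat,Abdul Ghani Haji Naim
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: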
Abstract:Knowledge graph completion (KGC) seeks to predict missing entities (e.g., heads or tails) or relationships in knowledge graphs (KGs), which often contain incomplete data. Traditional embedding-based methods, such as TransE and ComplEx, have improved tail entity prediction but struggle to generalize to unseen entities during testing. Textual-based models mitigate this issue by leveraging additional semantic context; however, their reliance on negative triplet sampling introduces high computational overhead, semantic inconsistencies, and data imbalance. Recent approaches, like KG-BERT, show promise but depend heavily on entity descriptions, which are often unavailable in KGs. Critically, existing methods overlook valuable structural information in the KG related to the entities and relationships. To address these challenges, we propose Multi-Context-Aware Knowledge Graph Completion (MuCo-KGC), a novel model that utilizes contextual information from linked entities and relations within the graph to predict tail entities. MuCo-KGC eliminates the need for entity descriptions and negative triplet sampling, significantly reducing computational complexity while enhancing performance. Our experiments on standard datasets, including FB15k-237, WN18RR, CoDEx-S, and CoDEx-M, demonstrate that MuCo-KGC outperforms state-of-the-art methods on three datasets. Notably, MuCo-KGC improves MRR on the WN18RR, CoDEx-S, and CoDEx-M datasets by 1.63%, 3.77%, and 20.15%, respectively, demonstrating its effectiveness for KGC tasks.
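Since the results above are reported in MRR (mean reciprocal rank), a short sketch of how that metric is commonly computed for tail entity prediction may help. The function names and the toy rankings are illustrative and are not taken from the paper; a query whose gold entity is missing from the ranking is assumed to contribute 0.

```python
def reciprocal_rank(ranked_candidates, gold):
    """Return 1/rank of the gold entity (1-indexed), or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predictions):
    """Average the reciprocal rank over (ranked_candidates, gold) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in predictions) / len(predictions)


# Toy example: the gold tail is ranked 1st, 2nd, and not at all.
toy = [
    (["paris", "lyon", "nice"], "paris"),
    (["paris", "lyon", "nice"], "lyon"),
    (["paris", "lyon", "nice"], "rome"),
]
print(mean_reciprocal_rank(toy))  # (1 + 1/2 + 0) / 3 = 0.5
```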
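[NLP-53] Improving LLM-as-a-Judge Inference with the Judgment Distribution
【Quick read】: This paper studies how to extract human preferences more effectively when language models are used as judges of text quality (LLM-as-a-judge). The usual approach takes the judgment from the single most likely prediction (the mode) via greedy decoding, which may not make full use of the probability distribution the model provides. The key contribution is to explore how the judge's output distribution can be used to extract finer-grained preferences: taking the mean of the distribution consistently outperforms taking the mode, and methods that incorporate risk aversion often improve performance further. The paper also shows that chain-of-thought prompting can collapse the spread of the judgment distribution, which often harms performance. Overall, the paper argues for using the judge's distributional output rather than its text output alone.
Link: https://arxiv.org/abs/2503.03064
Authors: Victor Wang,Michael J.Q. Zhang,Eunsol Choi
Affiliations: The University of Texas at Austin; New York University
Categories: Computation and Language (cs.CL)
Comments: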
Abstract:Using language models to scalably approximate human preferences on text quality (LLM-as-a-judge) has become a standard practice applicable to many tasks. A judgment is often extracted from the judge’s textual output alone, typically with greedy decoding. However, LLM judges naturally provide distributions over judgment tokens, inviting a breadth of inference methods for extracting fine-grained preferences. We find that taking the mean of the judgment distribution consistently outperforms taking the mode (i.e. greedy decoding) in all evaluation settings (i.e. pointwise, pairwise, and listwise). We further explore novel methods of deriving preferences from judgment distributions, and find that methods incorporating risk aversion often improve performance. Lastly, we analyze LLM-as-a-judge paired with chain-of-thought (CoT) prompting, showing that CoT can collapse the spread of the judgment distribution, often harming performance. Our findings suggest leveraging distributional output can improve LLM-as-a-judge, as opposed to using the text interface alone.
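The difference between taking the mode and taking the mean of a judgment distribution can be seen in a small pointwise example. The two distributions below are made-up numbers over a 1 to 5 rating scale, not data from the paper, and the function names are illustrative.

```python
def judgment_mode(dist):
    """Score with the highest probability (what greedy decoding would emit)."""
    return max(dist, key=dist.get)


def judgment_mean(dist):
    """Probability-weighted average score, normalized in case of rounding."""
    total = sum(dist.values())
    return sum(score * p for score, p in dist.items()) / total


# Hypothetical judge distributions over scores 1..5 for two responses.
response_a = {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.30, 5: 0.15}
response_b = {1: 0.30, 2: 0.25, 3: 0.35, 4: 0.05, 5: 0.05}

# Both modes are 3, so the mode cannot separate the two responses,
# while the means (about 3.4 and 2.3) still express a preference for A.
print(judgment_mode(response_a), judgment_mode(response_b))
print(judgment_mean(response_a), judgment_mean(response_b))
```

In practice the distribution would come from the judge model's token probabilities over the score tokens; how those are obtained depends on the model API and is outside this sketch.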
[NLP-54] Semi-Supervised In-Context Learning: A Baseline Study
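【Quick read】: This paper addresses the heavy reliance on ground-truth annotations in data selection for In-Context Learning (ICL). It proposes a three-step semi-supervised framework that improves performance with self-generated annotations (pseudo-demonstrations). A simple baseline that builds prompts from high-confidence self-generated demonstrations (Naive-SemiICL) outperforms a 16-shot baseline by an average of 9.94% across 16 datasets. The paper further introduces IterPSD, an iterative pseudo-demonstration refinement method that progressively improves the quality of self-generated annotations and yields up to 6.8% additional gains on classification tasks. The study also reveals a scaling law for semi-supervised ICL, with models performing best with more than 1,000 demonstrations.
Link: https://arxiv.org/abs/2503.03062
Authors: Zhengyao Gu,Henry Peng Zou,Yankai Chen,Aiwei Liu,Weizhi Zhang,Philip S. Yu
Affiliations: University of Illinois Chicago; Cornell University; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: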
Abstract:Most existing work in data selection for In-Context Learning (ICL) has focused on constructing demonstrations from ground truth annotations, with limited attention given to selecting reliable self-generated annotations. In this work, we propose a three-step semi-supervised ICL framework: annotation generation, demonstration selection, and semi-supervised inference. Our baseline, Naive-SemiICL, which selects high-confidence self-generated demonstrations for ICL prompting, outperforms a 16-shot baseline by an average of 9.94% across 16 datasets. We further introduce IterPSD, an annotation approach that refines pseudo-demonstrations iteratively, achieving up to 6.8% additional gains in classification tasks. Lastly, we reveal a scaling law for semi-supervised ICL, where models achieve optimal performance with over 1,000 demonstrations.
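The demonstration selection step can be sketched as a confidence filter followed by prompt construction. This is a minimal illustration under assumptions: the `PseudoDemo` type, the 0.9 threshold, and the prompt layout are hypothetical, and how confidence is estimated is left to the caller.

```python
from typing import NamedTuple


class PseudoDemo(NamedTuple):
    text: str          # unlabeled input
    label: str         # self-generated (pseudo) label
    confidence: float  # model confidence in the label, in [0, 1]


def select_confident(demos, threshold=0.9, max_demos=None):
    """Keep pseudo-demonstrations at or above the threshold, most confident first."""
    kept = sorted(
        (d for d in demos if d.confidence >= threshold),
        key=lambda d: d.confidence,
        reverse=True,
    )
    return kept if max_demos is None else kept[:max_demos]


def build_prompt(demos, query):
    """Format the selected demonstrations followed by the query to label."""
    blocks = [f"Input: {d.text}\nLabel: {d.label}" for d in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)
```

An iterative variant in the spirit of IterPSD would feed the selected demonstrations back into the annotation step for the remaining unlabeled examples; that loop is not shown here.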
[NLP-55] QE4PE: Word-level Quality Estimation for Human Post-Editing
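【Quick read】: This paper studies the practical usability of word-level quality estimation (QE) systems in human post-editing (PE) of machine translation (MT), and their effect on post-editing efficiency, quality, and editing choices. It compares several error-span highlight modalities, including supervised and uncertainty-based word-level QE, and analyzes their effect on professional post-editors in a realistic setting. Post-editing effort and productivity are estimated from behavioral logs, and quality improvements are assessed with human annotation. The study finds that domain, language, and editors' speed are critical factors in the effectiveness of highlights, and that the differences between human-made and automated QE highlights are modest, which points to a gap between accuracy and usability in professional workflows.
Link: https://arxiv.org/abs/2503.03044
Authors: Gabriele Sarti,Vilém Zouhar,Grzegorz Chrupała,Ana Guerberof-Arenas,Malvina Nissim,Arianna Bisazza
Affiliations: CLCG, University of Groningen; ETH Zürich; CSAI, Tilburg University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Code: this https URL . Dataset: this https URL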
Abstract:Word-level quality estimation (QE) detects erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. Our QE4PE study investigates the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated by behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
[NLP-56] SAGE: Steering and Refining Dialog Generation with State-Action Augmentation
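【Quick read】: This paper addresses the challenge of building chatbots that show emotional intelligence in natural conversation. Large language models are strong in task-oriented applications but still fall short in emotionally intelligent dialogue. The paper proposes SAGE, which uses latent variables to control long-horizon behavior in dialogue generation. At its core is the State-Action Chain (SAC), which augments standard language model fine-tuning with latent variables that encapsulate emotional states and conversational strategies between dialogue turns. At inference time these variables are generated before each response, giving coarse-grained control over dialogue progression while keeping interaction patterns natural. The paper also introduces a self-improvement pipeline that uses dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Experiments show that models trained this way improve on emotional intelligence metrics while maintaining strong performance on LLM benchmarks. Because the latent variables are discrete, they facilitate search-based strategies and lay a foundation for applying reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level.
Link: https://arxiv.org/abs/2503.03040
Authors: Yizhe Zhang,Navdeep Jaitly
Affiliations: Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: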
Abstract:Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level.
[NLP-57] SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
【Quick read】: This paper addresses the degraded performance of Large Language Models (LLMs) in practical applications caused by hallucinations, i.e., generated information that is factually wrong or unsupported, which is especially serious in critical applications. It proposes SAFE, a method that uses Sparse Autoencoders (SAEs) to detect and mitigate hallucinations. By using SAEs for hallucination-aware query enrichment, SAFE improves query generation accuracy and reduces hallucinations. Experiments on three cross-domain datasets show consistent improvements, with query generation accuracy gains of up to 29.45%.
Link: https://arxiv.org/abs/2503.03032
Authors: Samir Abdaljalil,Filippo Pallucchini,Andrea Seveso,Hasan Kurban,Fabio Mercorio,Erchin Serpedin
Affiliations: Electrical and Computer Engineering, Texas A&M University, College Station, TX USA; Dept of Statistics and Quantitative Methods, University of Milano-Bicocca, Italy; CRISP Research Centre, University of Milano-Bicocca, Italy; College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Despite the state-of-the-art performance of Large Language Models (LLMs), these models often suffer from hallucinations, which can undermine their performance in critical applications. In this work, we propose SAFE, a novel method for detecting and mitigating hallucinations by leveraging Sparse Autoencoders (SAEs). While hallucination detection techniques and SAEs have been explored independently, their synergistic application in a comprehensive system, particularly for hallucination-aware query enrichment, has not been fully investigated. To validate the effectiveness of SAFE, we evaluate it on two models with available SAEs across three diverse cross-domain datasets designed to assess hallucination problems. Empirical results demonstrate that SAFE consistently improves query generation accuracy and mitigates hallucinations across all datasets, achieving accuracy improvements of up to 29.45%.
[NLP-58] One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings
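【Quick read】: This paper addresses the trade-off between model size and performance when deploying language models under downstream latency constraints. Conventional model distillation reduces model size but is inefficient because it involves multiple training steps. The paper introduces MODULARSTARENCODER, a modular multi-exit encoder with 1B parameters for code retrieval tasks. Its core is a novel self-distillation mechanism that significantly improves lower-layer representations, so that different portions of the model can be used with a good performance trade-off. Specific encoder layers are targeted as exit heads, allowing higher layers to guide earlier layers during training. This self-distillation effect improves intermediate representations and increases retrieval recall at no extra training cost. The paper also integrates a repository-level contextual loss that makes full use of the training context window, and releases a new dataset built via code translation that extends traditional text-to-code benchmarks with code-to-code pairs across multiple programming languages. Experimental results support the effectiveness of self-distillation through multi-exit supervision.
Link: https://arxiv.org/abs/2503.03008
Authors: Andrea Gurioli,Federico Pennino,João Monteiro,Maurizio Gabbrielli
Affiliations: University of Bologna; Autodesk; University of Bologna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: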
Abstract:Deploying language models often requires handling model size vs. performance trade-offs to satisfy downstream latency constraints while preserving the model’s usefulness. Model distillation is commonly employed to reduce model size while maintaining acceptable performance. However, distillation can be inefficient since it involves multiple training steps. In this work, we introduce MODULARSTARENCODER, a modular multi-exit encoder with 1B parameters, useful for multiple tasks within the scope of code retrieval. MODULARSTARENCODER is trained with a novel self-distillation mechanism that significantly improves lower-layer representations-allowing different portions of the model to be used while still maintaining a good trade-off in terms of performance. Our architecture focuses on enhancing text-to-code and code-to-code search by systematically capturing syntactic and semantic structures across multiple levels of representation. Specific encoder layers are targeted as exit heads, allowing higher layers to guide earlier layers during training. This self-distillation effect improves intermediate representations, increasing retrieval recall at no extra training cost. In addition to the multi-exit scheme, our approach integrates a repository-level contextual loss that maximally utilizes the training context window, further enhancing the learned representations. We also release a new dataset constructed via code translation, seamlessly expanding traditional text-to-code benchmarks with code-to-code pairs across diverse programming languages. Experimental results highlight the benefits of self-distillation through multi-exit supervision.
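[NLP-59] Will I Get Hate Speech? Predicting the Volume of Abusive Replies before Posting in Social Media
【Quick read】: This paper addresses the problem of predicting the volume of abusive replies a tweet will receive once posted. Existing research is mostly reactive, determining whether content that has already been posted is abusive, and predictive approaches are lacking. To fill this gap, the paper takes the perspective of a social media user asking whether the volume of abusive replies to a message can be predicted before posting it. It builds a feature set covering text, text metadata, tweet metadata, and account features, and finds that features derived from the content to be posted suffice to build a competitive prediction model. Features derived from the user's identity have little effect on model performance, which suggests that abusive replies are triggered by the content of a post rather than by who posts it.
Link: https://arxiv.org/abs/2503.03005
Authors: Raneem Alharthi,Rajwa Alharthi,Ravi Shekhar,Aiqi Jiang,Arkaitz Zubiaga
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: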
Abstract:Despite the growing body of research tackling offensive language in social media, this research is predominantly reactive, determining if content already posted in social media is abusive. There is a gap in predictive approaches, which we address in our study by predicting the volume of abusive replies a tweet will receive after being posted. We formulate the problem from the perspective of a social media user asking: “if I post a certain message on social media, is it possible to predict the volume of abusive replies it might receive?” We look at four types of features, namely text, text metadata, tweet metadata, and account features, which also help us understand the extent to which the user or the content helps predict the number of abusive replies. This, in turn, helps us develop a model to support social media users in finding the best way to post content. One of our objectives is also to determine the extent to which the volume of abusive replies that a tweet will get is motivated by the content of the tweet or by the identity of the user posting it. Our study finds that one can build a model that performs competitively by developing a comprehensive set of features derived from the content of the message that is going to be posted. In addition, our study suggests that features derived from the user’s identity do not impact model performance, hence suggesting that it is especially the content of a post that triggers abusive replies rather than who the user is.
[NLP-60] Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders
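【Quick read】: This paper addresses the particular challenges Bangla poses for natural language processing (NLP) because of its complex morphology and limited resources, and in particular evaluates large decoder-based models (such as GPT, LLaMA, and DeepSeek) on Zero-Shot Multi-Label Classification (Zero-Shot-MLC). It establishes the first benchmark comparing decoder-based large language models (LLMs) with classic encoder-based models on this task, and finds that even strong existing encoders and decoders struggle to reach high accuracy on Bangla Zero-Shot-MLC, underlining the need for more research and resources for Bangla NLP.
Link: https://arxiv.org/abs/2503.02993
Authors: Souvika Sarkar,Md. Najib Hasan,Santu Karmaker
Affiliations: unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: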
Abstract:Bangla, a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide, presents unique challenges in natural language processing (NLP) due to its complex morphological characteristics and limited resources. While recent Large Decoder Based models (LLMs), such as GPT, LLaMA, and DeepSeek, have demonstrated excellent performance across many NLP tasks, their effectiveness in Bangla remains largely unexplored. In this paper, we establish the first benchmark comparing decoder-based LLMs with classic encoder-based models for the Zero-Shot Multi-Label Classification (Zero-Shot-MLC) task in Bangla. Our evaluation of 32 state-of-the-art models reveals that existing so-called powerful encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task, suggesting a need for more research and resources for Bangla NLP.
[NLP-61] Effectively Steer LLM To Follow Preference via Building Confident Directions
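【Quick read】: This paper addresses the limitations of existing methods for aligning large language models (LLMs) with human preferences. Most current alignment methods rely on fine-tuning or prompting, which can be costly or hard to control. Existing model steering algorithms are easy to implement and optimization-free, but they are typically limited to steering between two directions (bidirectional steering) and lack theoretical guarantees on their performance.
The key contribution is a new model steering method called CONFST (confident direction steering). It builds a confident direction that is closely aligned with a user's preferences and adds this direction to the model's activations at inference time to steer the output. Compared with popular bidirectional steering methods, CONFST has three advantages: 1) it is more powerful, since more than two users' preferences can be aligned simultaneously; 2) it is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) it requires no explicit user instruction. The method is validated on GPT-2 XL, Mistral, and Gemma-it on tasks that shift model output across topics and styles, where it outperforms competing methods.
Link: https://arxiv.org/abs/2503.02989
Authors: Bingqing Song,Boran Han,Shuai Zhang,Hao Wang,Haoyang Fang,Bonan Min,Yuyang Wang,Mingyi Hong
Affiliations: University of Minnesota, Twin Cities; AWS; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: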
Abstract:Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model into one of the two directions (i.e., bidirectional steering), and there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users’ preferences, and this direction is then added to the activations of the LLMs to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e. more than two) users’ preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
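The basic mechanics of activation steering, adding a direction vector to a hidden activation at inference time, can be shown with plain Python lists. This is a generic sketch, not CONFST: the abstract does not say how the confident direction is constructed, so building it as the normalized mean of a few example activations is an assumption, as are the function names and the `alpha` scale.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = sum(x * x for x in vector) ** 0.5
    return [x / norm for x in vector] if norm > 0 else list(vector)


def steer(activation, direction, alpha=1.0):
    """Add alpha * direction to an activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Hypothetical activations collected from text matching a user's preference.
preferred = [[1.0, 0.0], [3.0, 0.0]]
direction = normalize(mean_vector(preferred))   # [1.0, 0.0]
print(steer([0.5, 0.5], direction, alpha=2.0))  # [2.5, 0.5]
```

With a real model the same addition would be applied to a layer's hidden states through a forward hook or equivalent; the choice of layer and scale is exactly what steering methods differ on.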
[NLP-62] LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation
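【Quick read】: This paper addresses the overestimation of reasoning ability in evaluations of Large Language Models (LLMs), which arises because models may exploit evaluation benchmarks through memorisation. It introduces a framework for generating linguistic reasoning problems that reduces the effect of memorisation and uses it to build LINGOLY-TOO, a challenging evaluation benchmark. Orthographic templates are used to dynamically obfuscate the writing systems of real languages and generate many question variations that preserve the reasoning steps required for each solution while lowering the likelihood that a specific problem instance appears in training data. The experiments show that LLM accuracy varies noticeably across permutations of the same problem and that models perform better on average in the original orthography, which points to the opaque nature of response generation in LLMs and to prior data exposure as a source of overestimated reasoning ability.
Link: https://arxiv.org/abs/2503.02972
Authors: Jude Khouja,Karolina Korgul,Simi Hellsten,Lingyi Yang,Vlad Neacs,Harry Mayne,Ryan Kearns,Andrew Bean,Adam Mahdi
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: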
Abstract:Evaluations of the reasoning capabilities of large language models (LLMs) are susceptible to overestimation due to data exposure of evaluation benchmarks. We introduce a framework for producing linguistic reasoning problems that reduces the effect of memorisation in model performance estimates and apply this framework to develop LINGOLY-TOO, a challenging evaluation benchmark for linguistic reasoning. By developing orthographic templates, we dynamically obfuscate the writing systems of real languages to generate numerous question variations. These variations preserve the reasoning steps required for each solution while reducing the likelihood of specific problem instances appearing in model training data. Our experiments demonstrate that frontier models, including OpenAI o1-preview and DeepSeek R1, struggle with advanced reasoning. Our analysis also shows that LLMs exhibit noticeable variance in accuracy across permutations of the same problem, and on average perform better on questions appearing in their original orthography. Our findings highlight the opaque nature of response generation in LLMs and provide evidence that prior data exposure contributes to overestimating the reasoning capabilities of frontier models.
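A much-simplified picture of orthographic obfuscation is a consistent, seeded permutation of an alphabet applied to every word in a problem. This sketch is an assumption-laden stand-in: the benchmark's orthographic templates presumably encode linguistic constraints that a flat character permutation ignores, and the function names are illustrative.

```python
import random


def make_obfuscation_map(alphabet, seed):
    """Build a seeded one-to-one mapping from each character to another."""
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def obfuscate(text, mapping):
    """Apply the mapping; characters outside the alphabet are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


alphabet = "abcdefghijklmnopqrstuvwxyz"
mapping = make_obfuscation_map(alphabet, seed=0)
inverse = {v: k for k, v in mapping.items()}

question = "kala means fish, kalak means fishes"
variant = obfuscate(question, mapping)
# The mapping is a bijection, so the pattern of the puzzle is preserved
# and the original text can be recovered with the inverse mapping.
assert obfuscate(variant, inverse) == question
```

Different seeds give different surface forms of the same problem, which is the property that lets accuracy be compared across permutations.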
[NLP-63] Multilingual Relative Clause Attachment Ambiguity Resolution in Large Language Models ACL
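[Quick read]: This paper investigates how large language models (LLMs) resolve relative clause (RC) attachment ambiguities and compares their behaviour with human sentence processing. It focuses on two linguistic factors, the length of the RC and the syntactic position of the complex determiner phrase (DP), and asks whether LLMs reach human-like interpretations. The key contribution is a multilingual experiment (English, Spanish, French, German, Japanese, and Korean) that exposes cross-language differences: the models perform well in the Indo-European languages but struggle in the Asian languages, often defaulting to incorrect English translations. The results underline the variability in how LLMs handle linguistic ambiguity and the need for improvements on non-European languages, which gives direction for future LLM design.
Link: https://arxiv.org/abs/2503.02971
Authors: So Young Lee,Russell Scheinberg,Amber Shore,Ameeta Agrawal
Affiliations: Miami University, USA; Portland State University, USA
Categories: Computation and Language (cs.CL)
Comments: Accepted at PACLIC 2024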
Abstract:This study examines how large language models (LLMs) resolve relative clause (RC) attachment ambiguities and compares their performance to human sentence processing. Focusing on two linguistic factors, namely the length of RCs and the syntactic position of complex determiner phrases (DPs), we assess whether LLMs can achieve human-like interpretations amid the complexities of language. In this study, we evaluated several LLMs, including Claude, Gemini and Llama, in multiple languages: English, Spanish, French, German, Japanese, and Korean. While these models performed well in Indo-European languages (English, Spanish, French, and German), they encountered difficulties in Asian languages (Japanese and Korean), often defaulting to incorrect English translations. The findings underscore the variability in LLMs’ handling of linguistic ambiguities and highlight the need for model improvements, particularly for non-European languages. This research informs future enhancements in LLM design to improve accuracy and human-like processing in diverse linguistic environments.
[NLP-64] InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model
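[Quick read]: This paper tackles simultaneous translation of unbounded streaming speech (SST), in particular how to process the speech history and past translations so that translation quality and latency (including computation overhead) stay balanced. The key solution, InfiniSST, formulates SST as a multi-turn dialogue task. It builds translation trajectories and robust segments from MuST-C with multi-latency augmentation during training, and adds a key-value (KV) cache management strategy for efficient inference. Experiments show that InfiniSST cuts computation-aware latency by 0.5 to 1 second while matching the translation quality of the baselines.
Link: https://arxiv.org/abs/2503.02969
Authors: Siqi Ouyang,Xi Xu,Lei Li
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review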
Abstract:Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the history speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. We release the code at this https URL
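[NLP-65] KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
[Quick read]: This paper addresses the lack of high-quality, verifiable training data for LLM code generation, and in particular the difficulty of achieving broad coverage and correctness at the same time. Existing code resources rarely offer both a wide range of task difficulty (from simple coding tasks to advanced algorithmic problems) and verifiable correctness (such as unit tests). The proposed KodCode dataset consists of question-solution-test triplets validated by a self-verification procedure and produced by a multi-stage pipeline: synthesize diverse questions, generate solutions and test cases with extra attempts for hard problems, then rewrite questions into diverse formats and generate responses from a reasoning model under test-based reject sampling. The resulting dataset is large, robust, and diverse, and suits both supervised fine-tuning and RL tuning; the abstract reports state-of-the-art results for KodCode-tuned models on coding benchmarks.
Link: https://arxiv.org/abs/2503.02951
Authors: Zhangchen Xu,Yang Liu,Yueqin Yin,Mingyuan Zhou,Radha Poovendran
Affiliations: University of Washington; The University of Texas at Austin; Microsoft GenAI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Codes and Data: this https URL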
Abstract:We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
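The self-verification and test-based reject sampling steps come down to running each candidate solution against its unit tests and keeping only the ones that pass. A minimal sketch of that filter follows. The function names and the toy candidates are hypothetical, and a real pipeline would execute model-generated code in a sandbox rather than call Python functions directly.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (arguments, expected result)


def passes_all(candidate: Callable, tests: List[TestCase]) -> bool:
    """Return True only if the candidate matches every expected result without raising."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def reject_sample(candidates: List[Callable], tests: List[TestCase]) -> List[Callable]:
    """Keep the candidates that pass the unit tests; discard the rest."""
    return [c for c in candidates if passes_all(c, tests)]


# Toy stand-ins for model-generated solutions to "add two integers".
def good(a, b):
    return a + b


def wrong(a, b):
    return a - b


def crashes(a, b):
    raise ValueError("bad solution")


tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
kept = reject_sample([good, wrong, crashes], tests)
```

Note that `wrong` passes the `(0, 0)` case by accident, which is why the filter requires every test to pass rather than any single one.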
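[NLP-66] LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
[Quick read]: This paper aims to fill a key gap in the ecosystem of web-agent applications built on vision-language models (VLMs). Its solution is LiteWebAgent, a production-ready framework that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. The core agent framework implements a simple yet effective baseline using recursive function calling, with action generation decoupled from action grounding, and integrates advanced research components (agent planning, agent workflow memory, and tree search) in a modular way. The paper also presents two deployed forms: a production Vercel-based web application, and a Chrome extension that controls an existing browser via CDP (Chrome DevTools Protocol).
Link: https://arxiv.org/abs/2503.02950
Authors: Danqing Zhang,Balaji Rama,Jingyi Ni,Shiying He,Fu Zhao,Kunyu Chen,Arnold Chen,Junyu Cao
Affiliations: PathOnAI.org; Rutgers University, NJ, USA; The University of Texas at Austin, TX, USA
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: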
Abstract:We introduce LiteWebAgent, an open-source suite for VLM-based web agent applications. Our framework addresses a critical gap in the web agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, with decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, (2) a Chrome extension leveraging LiteWebAgent’s API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at this https URL, with deployed frontend at this https URL.
[NLP-67] ExpertGenQA: Open-ended QA generation in Specialized Domains
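[Quick read]: This paper addresses the challenge of generating high-quality question-answer (QA) pairs in specialized technical domains, where existing methods struggle to balance the use of expert examples against topical diversity. It proposes the ExpertGenQA protocol, which combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs. Tested on U.S. Federal Railroad Administration documents, ExpertGenQA is twice as efficient as baseline few-shot approaches while maintaining 94.4% topic coverage. The study also finds that current LLM-based judges and reward models are biased toward superficial writing style over content quality, that ExpertGenQA better preserves the cognitive complexity distribution of expert-written questions, and that its generated queries raise top-1 accuracy by 13.02% when used to train retrieval models.
Link: https://arxiv.org/abs/2503.02948
Authors: Haz Sameen Shahgir,Chansong Lim,Jia Chen,Evangelos E. Papalexakis,Yue Dong
Affiliations: University of California Riverside
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: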
Abstract:Generating high-quality question-answer pairs for specialized technical domains remains challenging, with existing approaches facing a tradeoff between leveraging expert examples and achieving topical diversity. We present ExpertGenQA, a protocol that combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs. Using U.S. Federal Railroad Administration documents as a test bed, we demonstrate that ExpertGenQA achieves twice the efficiency of baseline few-shot approaches while maintaining 94.4% topic coverage. Through systematic evaluation, we show that current LLM-based judges and reward models exhibit strong bias toward superficial writing styles rather than content quality. Our analysis using Bloom’s Taxonomy reveals that ExpertGenQA better preserves the cognitive complexity distribution of expert-written questions compared to template-based approaches. When used to train retrieval models, our generated queries improve top-1 accuracy by 13.02% over baseline performance, demonstrating their effectiveness for downstream applications in technical domains.
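[NLP-68] Text2Scenario: Text-Driven Scenario Generation for Autonomous Driving Test
[Quick read]: This paper addresses the time-consuming, labor-intensive manual configuration involved in creating test scenarios for autonomous driving (AD). The key solution is Text2Scenario, a framework in which an LLM with a carefully designed input prompt scheme parses test scenario requirements from a user's natural language description and extracts the best-matching components from a hierarchically organized scenario repository. Scenario representations are then matched and linked within a Domain Specific Language (DSL) corpus to produce executable simulation test scenarios. Experiments indicate that this prompt engineering extracts scenario details accurately from varied description formats and that most generated scenarios closely match the user's initial expectations, so different AD stacks can be evaluated efficiently without manual scenario configuration.
Link: https://arxiv.org/abs/2503.02911
Authors: Xuan Cai,Xuesong Bai,Zhiyong Cui,Danmu Xie,Daocheng Fu,Haiyang Yu,Yilong Ren
Affiliations: Beihang University; Institute of Physics, Chinese Academy of Sciences
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: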
Abstract:Autonomous driving (AD) testing constitutes a critical methodology for assessing performance benchmarks prior to product deployment. The creation of segmented scenarios within a simulated environment is acknowledged as a robust and effective strategy; however, the process of tailoring these scenarios often necessitates laborious and time-consuming manual efforts, thereby hindering the development and implementation of AD technologies. In response to this challenge, we introduce Text2Scenario, a framework that leverages a Large Language Model (LLM) to autonomously generate simulation test scenarios that closely align with user specifications, derived from their natural language inputs. Specifically, an LLM, equipped with a meticulously engineered input prompt scheme functions as a text parser for test scenario descriptions, extracting from a hierarchically organized scenario repository the components that most accurately reflect the user’s preferences. Subsequently, by exploiting the precedence of scenario components, the process involves sequentially matching and linking scenario representations within a Domain Specific Language corpus, ultimately fabricating executable test scenarios. The experimental results demonstrate that such prompt engineering can meticulously extract the nuanced details of scenario elements embedded within various descriptive formats, with the majority of generated scenarios aligning closely with the user’s initial expectations, allowing for the efficient and precise evaluation of diverse AD stacks void of the labor-intensive need for manual scenario configuration. Project page: this https URL.
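[NLP-69] “Would You Want an AI Tutor?” Understanding Stakeholder Perceptions of LLM-based Chatbots in the Classroom
[Quick read]: This paper addresses the lack of a systematic mechanism for collecting feedback on the use of large language models (LLMs) in education. Neither technology companies nor educational institutions have set up a formal system for gathering feedback from stakeholders (students, teachers, parents, and school staff) as LLMs are integrated into teaching. The paper argues that understanding how these directly and indirectly affected groups perceive tools such as LLM-based chatbots is essential for the responsible use of AI in education.
The key to the solution is two-fold. First, a literature review identifies gaps in existing research on perceptions of LLMs, such as neglecting key educational agents (for example parents or administrators) and the learning contexts of deployment, and yields a taxonomy of stakeholder perceptions. Second, the paper proposes the Contextualized Perceptions for the Adoption of Chatbots in Education (Co-PACE) framework for systematically eliciting feedback to inform decisions on designing, developing, and deploying LLM-based chatbots in the classroom.
Link: https://arxiv.org/abs/2503.02885
Authors: Caterina Fuligni,Daniel Dominguez Figaredo,Julia Stoyanovich
Affiliations: New York University; Universidad Nacional de Educación a Distancia
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: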
Abstract:In recent years, Large Language Models (LLMs) have rapidly gained popularity across all parts of society, including education. After initial skepticism and bans, many schools have chosen to embrace this new technology by integrating it into their curricula in the form of virtual tutors and teaching assistants. However, neither the companies developing this technology nor the public institutions involved in its implementation have set up a formal system to collect feedback from the stakeholders impacted by them. In this paper, we argue that understanding the perceptions of those directly affected by LLMs in the classroom, such as students and teachers, as well as those indirectly impacted, like parents and school staff, is essential for ensuring responsible use of AI in this critical domain. Our contributions are two-fold. First, we present results of a literature review focusing on the perceptions of LLM-based chatbots in education. We highlight important gaps in the literature, such as the exclusion of key educational agents (e.g., parents or school administrators) when analyzing the role of stakeholders, and the frequent omission of the learning contexts in which the AI systems are implemented. Thus, we present a taxonomy that organizes existing literature on stakeholder perceptions. Second, we propose the Contextualized Perceptions for the Adoption of Chatbots in Education (Co-PACE) framework, which can be used to systematically elicit perceptions and inform whether and how LLM-based chatbots should be designed, developed, and deployed in the classroom.
[NLP-70] OkraLong: A Flexible Retrieval-Augmented Framework for Long-Text Query Processing
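[Quick read]: This paper addresses the efficiency and cost challenges that large language models (LLMs) face on long-text queries, in settings such as enterprise document analysis and financial report comprehension. Conventional long-context processing and Retrieval-Augmented Generation (RAG) suffer from prohibitive input costs or incomplete information, and more recent context compression and dynamic retrieval loops still lose critical details or incur iterative overhead. The key contribution is OkraLong, a framework that optimizes the whole processing workflow through fine-grained orchestration of three cooperating components: an analyzer that characterizes the task state, an organizer that uses this to schedule the workflow dynamically, and an executor that carries out the execution and produces the final answer. Experiments show that OkraLong improves answer accuracy and is cost-effective across a variety of datasets.
Link: https://arxiv.org/abs/2503.02603
Authors: Yulong Hui,Yihao Liu,Yao Lu,Huanchen Zhang
Affiliations: Tsinghua University; National University of Singapore
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: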
Abstract:Large Language Models (LLMs) encounter challenges in efficiently processing long-text queries, as seen in applications like enterprise document analysis and financial report comprehension. While conventional solutions employ long-context processing or Retrieval-Augmented Generation (RAG), they suffer from prohibitive input expenses or incomplete information. Recent advancements adopt context compression and dynamic retrieval loops, but still sacrifice critical details or incur iterative costs. To address these limitations, we propose OkraLong, a novel framework that flexibly optimizes the entire processing workflow. Unlike prior static or coarse-grained adaptive strategies, OkraLong adopts fine-grained orchestration through three synergistic components: analyzer, organizer and executor. The analyzer characterizes the task states, which guide the organizer in dynamically scheduling the workflow. The executor carries out the execution and generates the final answer. Experimental results demonstrate that OkraLong not only enhances answer accuracy but also achieves cost-effectiveness across a variety of datasets.
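[NLP-71] MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs
[Quick read]: This paper addresses hallucination in multimodal large language models (MLLMs) and notes that existing work on citation generation focuses on text-only content, overlooking the challenges and opportunities of multimodal contexts. To fill this gap it introduces MCiteBench, the first benchmark for evaluating and analyzing the multimodal citation text generation ability of MLLMs. Its data comes from academic papers and review-rebuttal interactions and covers diverse information sources and multimodal content. A comprehensive evaluation across citation quality, source reliability, and answer accuracy shows that MLLMs struggle with multimodal citation text generation. Further analysis shows that the bottleneck lies in attributing the correct sources rather than in understanding the multimodal content, so the key to progress is more accurate source attribution.
Link: https://arxiv.org/abs/2503.02589
Authors: Caiyu Hu,Yikai Zhang,Tinghui Zhu,Yiwei Ye,Yanghua Xiao
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Computer Engineering and Science, Shanghai University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: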
Abstract:Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, overlooking the challenges and opportunities of multimodal contexts. To address this gap, we introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. We comprehensively evaluate models from multiple dimensions, including citation quality, source reliability, and answer accuracy. Through extensive experiments, we observe that MLLMs struggle with multimodal citation text generation. We also conduct deep analyses of models’ performance, revealing that the bottleneck lies in attributing the correct sources rather than understanding the multimodal content.
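[NLP-72] Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
[Quick read]: This paper asks why some language models improve substantially when trained to self-improve with reinforcement learning (RL) while others quickly plateau. The core question is which intrinsic properties enable effective self-improvement. The key contribution is an analysis framework built on four cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) shared by expert human problem solvers and successful language models. Qwen exhibits these behaviors naturally, whereas Llama initially lacks them. Priming Llama with examples that contain these behaviors yields substantial gains during RL. The presence of reasoning behaviors matters more than answer correctness: priming with incorrect solutions that contain proper reasoning patterns performs comparably to priming with correct ones. Finally, continued pretraining on OpenWebMath data filtered to amplify reasoning behaviors lets Llama match Qwen's self-improvement trajectory. The conclusion is that initial reasoning behaviors fundamentally determine whether a model can make effective use of additional computation.
Link: https://arxiv.org/abs/2503.01307
Authors: Kanishk Gandhi,Ayush Chakravarthy,Anikait Singh,Nathan Lile,Noah D. Goodman
Affiliations: Stanford University; SynthLabs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: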
Abstract:Test-time inference has emerged as a powerful paradigm for enabling language models to “think” longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors – verification, backtracking, subgoal setting, and backward chaining – that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen’s performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor – models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen’s self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
[NLP-73] Evaluating Intelligence via Trial and Error
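[Quick read]: This paper addresses how to systematically evaluate the intelligence of AI systems, especially on complex tasks. It proposes the Survival Game framework, which assesses intelligence through the expectation and variance of the number of failures in a trial-and-error process, and defines the Autonomous Level as the case where both are finite, signalling the ability to consistently find solutions to new challenges. The key finding is that human tasks have a criticality property, so reaching the Autonomous Level requires a deep understanding of a task's underlying mechanisms, whereas current AI systems rely mainly on superficial mimicry. The paper further argues that scaling current technology would come at an astronomical cost in resources and time, which highlights the limits of existing AI on complex tasks.
Link: https://arxiv.org/abs/2502.18858
Authors: Jingtao Zhan,Jiahao Zhao,Jiayu Li,Yiqun Liu,Bo Zhang,Qingyao Ai,Jiaxin Mao,Hongning Wang,Min Zhang,Shaoping Ma
Affiliations: Tsinghua University; Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: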
Abstract:Intelligence is a crucial trait for species to find solutions within a limited number of trial-and-error attempts. Building on this idea, we introduce Survival Game as a framework to evaluate intelligence based on the number of failed attempts in a trial-and-error process. Fewer failures indicate higher intelligence. When the expectation and variance of failure counts are both finite, it signals the ability to consistently find solutions to new challenges, which we define as the Autonomous Level of intelligence. Using Survival Game, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks, such as vision, search, recommendation, and language. While scaling current AI technologies might help, this would come at an astronomical cost. Projections suggest that achieving the Autonomous Level for general tasks would require 10^26 parameters. To put this into perspective, loading such a massive model requires so many H100 GPUs that their total value is 10^7 times that of Apple Inc.'s market value. Even with Moore’s Law, supporting such a parameter scale would take 70 years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI technologies. To further investigate this phenomenon, we conduct a theoretical analysis of Survival Game and its experimental results. Our findings suggest that human tasks possess a criticality property. As a result, Autonomous Level requires a deep understanding of the task’s underlying mechanisms. Current AI systems, however, do not fully grasp these mechanisms and instead rely on superficial mimicry, making it difficult for them to reach an autonomous level. We believe Survival Game can not only guide the future development of AI but also offer profound insights into human intelligence.
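The measurement at the heart of Survival Game is the number of failed attempts before a solution is found, summarized by its expectation and variance. A minimal sketch of that bookkeeping is shown below. The function names and the toy trials are invented, and an empirical mean and variance over a finite sample are always finite, so this only illustrates the quantities involved and cannot by itself decide whether a system reaches the Autonomous Level.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def failure_count(attempts: Sequence[bool]) -> int:
    """Number of failed attempts before the first success in one trial."""
    for i, ok in enumerate(attempts):
        if ok:
            return i
    raise ValueError("trial contains no successful attempt")


def summarize(trials: List[Sequence[bool]]) -> dict:
    """Empirical mean and population variance of failure counts over trials."""
    counts = [failure_count(t) for t in trials]
    return {"counts": counts, "mean": mean(counts), "variance": pvariance(counts)}


# Each inner list is one task; True marks the attempt that succeeded.
trials = [
    [True],
    [False, True],
    [False, False, False, True],
]
stats = summarize(trials)
```

Judging finiteness in practice would require looking at how the tail of the failure-count distribution behaves as more trials are collected, which is beyond this sketch.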
Computer Vision
[CV-0] GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control CVPR2025
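[Quick read]: This paper addresses two problems in video generation models: inconsistencies (such as objects popping in and out of existence) caused by limited use of 3D information, and imprecise camera control. The key solution is a 3D cache: point clouds built by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on 2D renderings of the 3D cache under the new camera trajectory provided by the user. The model therefore neither has to remember what it previously generated nor infer image structure from the camera pose, and can focus on previously unobserved regions and on advancing the scene state to the next frame. This gives more precise camera control and state-of-the-art results in sparse-view novel view synthesis.
Link: https://arxiv.org/abs/2503.03751
Authors: Xuanchi Ren,Tianchang Shen,Jiahui Huang,Huan Ling,Yifan Lu,Merlin Nimier-David,Thomas Müller,Alexander Keller,Sanja Fidler,Jun Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: To appear in CVPR 2025. Website: this https URL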
Abstract:We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! this https URL
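The 3D cache rests on a standard step: lifting each pixel with a predicted depth to a 3D point, which can then be viewed from a new camera. A minimal sketch using the pinhole camera model follows. The intrinsics `fx`, `fy`, `cx`, `cy`, the toy depth map, and the translation-only camera move are assumptions for illustration, and this is not GEN3C's implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject(depth: List[List[float]], fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift every pixel (u, v) with depth z to a 3D point in the camera frame."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def project(point: Point, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def translate(points: List[Point], t: Point) -> List[Point]:
    """Express points in a camera frame shifted by t (rotation omitted for brevity)."""
    return [(x - t[0], y - t[1], z - t[2]) for x, y, z in points]


depth = [[2.0, 2.0], [4.0, 4.0]]  # toy 2x2 depth map
fx = fy = 100.0
cx = cy = 0.5
cloud = unproject(depth, fx, fy, cx, cy)
same_view = [project(p, fx, fy, cx, cy) for p in cloud]
moved = translate(cloud, (0.0, 0.0, -1.0))  # camera pulled back one unit along z
new_view = [project(p, fx, fy, cx, cy) for p in moved]
```

Projecting the cloud with the original camera recovers the original pixel grid, while projecting the moved cloud gives the pixel positions a renderer would fill for the new viewpoint; regions with no projected points are the previously unobserved areas the abstract says the model focuses on.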
[CV-1] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
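[Quick read]: This paper addresses the fact that existing Vision-Language-Action (VLA) models must fine-tune pre-trained vision-language models (VLMs) when processing visual and language features, which degrades the semantic alignments established during pre-training. It proposes a new architecture, OTTER, which exploits those alignments through explicit, text-aware visual feature extraction: instead of processing all visual features, it selectively extracts only the task-relevant visual features that are semantically aligned with the language instruction and passes them to the policy transformer. This lets OTTER keep the pre-trained vision-language encoders frozen, preserving the semantic understanding learned in large-scale pre-training and enabling strong zero-shot generalization. In simulation and real-world experiments OTTER significantly outperforms existing VLA models, particularly on novel objects and environments.
Link: https://arxiv.org/abs/2503.03734
Authors: Huang Huang,Fangchen Liu,Letian Fu,Tingfan Wu,Mustafa Mukadam,Jitendra Malik,Ken Goldberg,Pieter Abbeel
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: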
Abstract:Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: this https URL.
[CV-2] Rethinking Deep Clustering Paradigms: Self-Supervision Is All You Need
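[Quick read]: This paper addresses the limitations of existing deep clustering paradigms with respect to Feature Randomness, Feature Drift, and Feature Twist. These stem from the trade-off between self-supervision and pseudo-supervision: joint training causes Feature Randomness and Feature Drift, while independent training causes Feature Randomness and Feature Twist. Pseudo-labels generate random, unreliable features; combining self-supervision with pseudo-supervision makes reliable clustering-oriented features drift; and moving from self-supervision to pseudo-supervision can twist the latent manifolds.
To address these issues, the paper proposes a new deep clustering paradigm, R-DC (Rethinking of the Deep Clustering Paradigms), whose key strategy is to replace pseudo-supervision with a second round of self-supervision training. This smooths the transition between instance-level and neighborhood-level self-supervision, prevents the drift caused by strong competition between instance-level self-supervision and clustering-level pseudo-supervision, and removes the risk of random features that pseudo-supervision brings. Experiments on six datasets show that the two-level self-supervision training yields substantial improvements.
Link: https://arxiv.org/abs/2503.03733
Authors: Amal Shaheen,Nairouz Mrabah,Riadh Ksantini,Abdulla Alqaddoumi
Affiliations: Computer Science, College of IT, UOB, Kingdom of Bahrain; Computer Science, Université du Québec à Montréal, Montréal, QC, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: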
Abstract:The recent advances in deep clustering have been made possible by significant progress in self-supervised and pseudo-supervised learning. However, the trade-off between self-supervision and pseudo-supervision can give rise to three primary issues. The joint training causes Feature Randomness and Feature Drift, whereas the independent training causes Feature Randomness and Feature Twist. In essence, using pseudo-labels generates random and unreliable features. The combination of pseudo-supervision and self-supervision drifts the reliable clustering-oriented features. Moreover, moving from self-supervision to pseudo-supervision can twist the curved latent manifolds. This paper addresses the limitations of existing deep clustering paradigms concerning Feature Randomness, Feature Drift, and Feature Twist. We propose a new paradigm with a new strategy that replaces pseudo-supervision with a second round of self-supervision training. The new strategy makes the transition between instance-level self-supervision and neighborhood-level self-supervision smoother and less abrupt. Moreover, it prevents the drifting effect that is caused by the strong competition between instance-level self-supervision and clustering-level pseudo-supervision. Moreover, the absence of the pseudo-supervision prevents the risk of generating random features. With this novel approach, our paper introduces a Rethinking of the Deep Clustering Paradigms, denoted by R-DC. Our model is specifically designed to address three primary challenges encountered in Deep Clustering: Feature Randomness, Feature Drift, and Feature Twist. Experimental results conducted on six datasets have shown that the two-level self-supervision training yields substantial improvements.
[CV-3] Active 6D Pose Estimation for Textureless Objects using Multi-View RGB Frames
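[Quick read]: This paper addresses 6D pose estimation of objects with little or no texture from RGB images, an important problem in robotics. Because of appearance ambiguities, rotational symmetries, and severe occlusions, single-view 6D pose estimators cannot handle a wide range of objects, which motivates multi-view pose estimation and next-best-view prediction. The key solution is a comprehensive active perception framework that uses only RGB images. Its core idea is to decouple 6D pose estimation into two sequential steps: first estimate the 3D translation, resolving the scale and depth ambiguities inherent to RGB images, then use that estimate to simplify 3D orientation estimation through canonical scale template matching. On top of this, an active perception strategy predicts the next best camera viewpoint for capturing a new RGB image, reducing pose uncertainty and improving accuracy. On the public ROBI dataset and a transparent object dataset created by the authors, the multi-view method significantly outperforms state-of-the-art approaches, and the next-best-view strategy reaches high pose accuracy with substantially fewer viewpoints than heuristic-based policies.
Link: https://arxiv.org/abs/2503.03726
Authors: Jun Yang,Wenjie Xue,Sahar Ghavidel,Steven L. Waslander
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: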
Abstract:Estimating the 6D pose of textureless objects from RGB images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view based 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that addresses these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a sequential two-step process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI dataset as well as on a transparent object dataset that we created. When evaluated using the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our method achieves high object pose accuracy with substantially fewer viewpoints than heuristic-based policies.
[CV-4] Rethinking Video Tokenization: A Conditioned Diffusion-based Approach
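Quick read: This paper revisits video tokenizers built on the VAE architecture, in which an encoder compresses a video into compact latents and a deterministic decoder reconstructs the original video from them. It proposes a Conditioned Diffusion-based video Tokenizer (CDT), whose key innovation is replacing the deterministic decoder with a 3D causal diffusion model. The decoder's reverse diffusion process is conditioned on the latent representations produced by the encoder, and feature caching together with sampling acceleration lets the framework efficiently reconstruct high-fidelity videos of arbitrary length. Experiments show that CDT achieves state-of-the-art video reconstruction with just single-step sampling, and that even a smaller version performs on par with the top two baselines.
Link: https://arxiv.org/abs/2503.03708
Authors: Nianzu Yang,Pandeng Li,Liming Zhao,Yang Li,Chen-Wei Xie,Yehui Tang,Xudong Lu,Zhihang Liu,Yun Zheng,Yu Liu,Junchi Yan
Affiliations: School of Artificial Intelligence & School of Computer Science, Shanghai Jiao Tong University; Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: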
Abstract:Video tokenizers, which transform videos into compact latent representations, are key to video generation. Existing video tokenizers are based on the VAE architecture and follow a paradigm where an encoder compresses videos into compact latents, and a deterministic decoder reconstructs the original videos from these latents. In this paper, we propose a novel Conditioned Diffusion-based video Tokenizer entitled CDT, which departs from previous methods by replacing the deterministic decoder with a 3D causal diffusion model. The reverse diffusion generative process of the decoder is conditioned on the latent representations derived via the encoder. With feature caching and sampling acceleration, the framework efficiently reconstructs high-fidelity videos of arbitrary lengths. Results show that CDT achieves state-of-the-art performance in video reconstruction tasks using just single-step sampling. Even a smaller version of CDT still achieves reconstruction results on par with the top two baselines. Furthermore, the latent video generation model trained using CDT also shows superior performance.
[CV-5] DualDiff: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance
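Quick read: This paper addresses the lack of accuracy and fidelity in driving scene reconstruction. Existing methods rely mainly on 3D bounding boxes and bird's-eye-view (BEV) road maps to control foreground and background, which neither capture the full complexity of driving scenes nor integrate multimodal information adequately. The paper proposes DualDiff, a dual-branch conditional diffusion model for driving scene generation across multiple views and video sequences. Its key components are: Occupancy Ray-shape Sampling (ORS) as a conditional input, which supplies rich foreground and background semantics together with 3D spatial geometry for precise control of both; a Foreground-Aware Mask (FGM) denoising loss that improves the synthesis of fine-grained foreground objects; a Semantic Fusion Attention (SFA) mechanism that dynamically prioritizes relevant information and suppresses noise; and a Reward-Guided Diffusion (RGD) framework that maintains global consistency and semantic coherence in generated videos. Experiments show that DualDiff achieves state-of-the-art performance on multiple datasets.
Link: https://arxiv.org/abs/2503.03689
Authors: Zhao Yang,Zezhong Qian,Xiaofan Li,Weixiang Xu,Gongpeng Zhao,Ruohong Yu,Lingsi Zhu,Longjun Liu
Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; College of Optical Science and Engineering, Zhejiang University; Institute of Automation, Chinese Academy of Sciences; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: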
Abstract:Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at this https URL.
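The abstract names a Foreground-Aware Mask (FGM) denoising loss but does not give its formula. As a rough illustration only, the sketch below assumes it behaves like a mean squared error on the predicted noise with extra weight on foreground pixels. The function name and the `fg_weight` parameter are hypothetical.

```python
def foreground_weighted_mse(pred_noise, true_noise, fg_mask, fg_weight=2.0):
    """Weighted MSE over flat pixel lists; fg_mask entries are 1 (foreground) or 0.

    Assumption: foreground pixels simply receive a larger weight. The actual
    FGM loss in the paper may be defined differently.
    """
    if not pred_noise or not (len(pred_noise) == len(true_noise) == len(fg_mask)):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = [fg_weight if m else 1.0 for m in fg_mask]
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred_noise, true_noise))
    return total / sum(weights)

# Two pixels are wrong by 1.0; only the second one lies on a foreground object.
loss = foreground_weighted_mse(
    pred_noise=[0.0, 1.0, 0.0, 1.0],
    true_noise=[0.0, 0.0, 0.0, 0.0],
    fg_mask=[0, 1, 0, 0],
)
```

With an all-zero mask the function reduces to the ordinary MSE, so the mask only changes how errors on foreground objects are traded off against background errors.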
[CV-6] A Generative Approach to High Fidelity 3D Reconstruction from Text Data
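Quick read: This paper addresses the challenge of generating high-quality 3D models from text descriptions, focusing on maintaining semantic coherence, managing geometric complexity, and preserving detailed visual information. It proposes a fully automated pipeline that combines text-to-image generation, several image processing techniques, and deep learning methods for reflection removal and 3D reconstruction. The pipeline uses generative models such as Stable Diffusion in a multi-stage workflow: high-quality images are generated from text prompts, enhanced by a reinforcement learning agent, and cleaned of reflections with the Stable Delight model; image upscaling and background removal then improve visual fidelity; finally, machine learning algorithms convert the refined 2D representations into volumetric 3D models, aiming for output that is both semantically accurate and geometrically precise.
Link: https://arxiv.org/abs/2503.03664
Authors: Venkat Kumar R,Deepak Saravanan
Affiliations: BITS Pilani WILP
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: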
Abstract:The convergence of generative artificial intelligence and advanced computer vision technologies introduces a groundbreaking approach to transforming textual descriptions into three-dimensional representations. This research proposes a fully automated pipeline that seamlessly integrates text-to-image generation, various image processing techniques, and deep learning methods for reflection removal and 3D reconstruction. By leveraging state-of-the-art generative models like Stable Diffusion, the methodology translates natural language inputs into detailed 3D models through a multi-stage workflow. The reconstruction process begins with the generation of high-quality images from textual prompts, followed by enhancement by a reinforcement learning agent and reflection removal using the Stable Delight model. Advanced image upscaling and background removal techniques are then applied to further enhance visual fidelity. These refined two-dimensional representations are subsequently transformed into volumetric 3D models using sophisticated machine learning algorithms, capturing intricate spatial relationships and geometric characteristics. This process achieves a highly structured and detailed output, ensuring that the final 3D models reflect both semantic accuracy and geometric precision. This approach addresses key challenges in generative reconstruction, such as maintaining semantic coherence, managing geometric complexity, and preserving detailed visual information. Comprehensive experimental evaluations will assess reconstruction quality, semantic accuracy, and geometric fidelity across diverse domains and varying levels of complexity. By demonstrating the potential of AI-driven 3D reconstruction techniques, this research offers significant implications for fields such as augmented reality (AR), virtual reality (VR), and digital content creation. 
[CV-7] LION-FS: Fast Slow Video-Language Thinker as Online Video Assistant CVPR2025
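Quick read: This paper addresses the tendency of existing online video assistants to sacrifice response efficacy for real-time efficiency. Its proposed solution is the "Fast Slow Video-Language Thinker" (LION-FS), a video-language assistant for online video tasks.
LION-FS adopts a two-stage optimization strategy. 1) Fast Path: Routing-Based Response Determination evaluates frame by frame whether an immediate response is needed; Token Aggregation Routing dynamically fuses spatiotemporal features so that higher-frame-rate inputs are handled efficiently, while Token Dropping Routing eliminates redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation, extracting fine-grained spatial features and human-environment interaction features and integrating them into a carefully designed multimodal Thinking Template to guide more precise responses. The approach delivers real-time, proactive, temporally accurate, and contextually precise responses, and comprehensive evaluations on online video tasks show state-of-the-art efficacy and efficiency.
Link: https://arxiv.org/abs/2503.03663
Authors: Wei Li,Bing Hu,Rui Shao,Leyang Shen,Liqiang Nie
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025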
Abstract:First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose “Fast Slow Video-Language Thinker” as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
[CV-8] Improving 6D Object Pose Estimation of metallic Household and Industry Objects
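Quick read: This paper addresses the drop in accuracy of 6D object pose estimation on metallic objects, in particular the degradation caused by reflections and specular highlights in industrial applications. Its key contribution is a new BOP-compatible dataset containing a variety of metallic objects (cans, household items, and industrial items) under different lighting and background conditions, which provides additional geometric and visual cues. In addition, the GDRNPP algorithm is extended with a keypoint prediction head and a material estimator head to improve spatial scene understanding. Evaluations show improved pose estimation accuracy on metallic objects, supporting the usefulness of the additional cues.
Link: https://arxiv.org/abs/2503.03655
Authors: Thomas Pöllabauer,Michael Gasser,Tristan Wirth,Sarah Berkei,Volker Knauthe,Arjan Kuijper
Affiliations: Technical University Darmstadt, Germany; Fraunhofer Institute for Computer Graphics Research IGD, Germany; Threedy GmbH, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: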
Abstract:6D object pose estimation suffers from reduced accuracy when applied to metallic objects. We set out to improve the state-of-the-art by addressing challenges such as reflections and specular highlights in industrial applications. Our novel BOP-compatible dataset, featuring a diverse set of metallic objects (cans, household, and industrial items) under various lighting and background conditions, provides additional geometric and visual cues. We demonstrate that these cues can be effectively leveraged to enhance overall performance. To illustrate the usefulness of the additional features, we improve upon the GDRNPP algorithm by introducing an additional keypoint prediction and material estimator head in order to improve spatial scene understanding. Evaluations on the new dataset show improved accuracy for metallic objects, supporting the hypothesis that additional geometric and visual cues can improve learning.
[CV-9] DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles CVPR2025
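Quick read: This paper addresses the difficulty of adapting generative models to specific complex domains, especially when large amounts of paired data are needed to capture the target distribution. Conventional approaches depend on abundant paired text-image samples, which are hard to obtain in many practical settings, whereas unpaired single-modality data is far more plentiful. The key idea is to exploit the bidirectional mappings between vision and language learned by a unified generative model so that training can proceed on unpaired data. The proposed DoraCycle integrates two multimodal cycles, text-to-image-to-text and image-to-text-to-image, and is optimized with a cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This lets the model evolve without annotated text-image pairs. For tasks independent of paired knowledge, such as stylization, unpaired data alone is enough; for tasks that need new paired knowledge, a small set of paired examples combined with larger-scale unpaired data suffices for effective domain-oriented adaptation.
Link: https://arxiv.org/abs/2503.03651
Authors: Rui Zhao,Weijia Mao,Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025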
Abstract:Adapting generative models to specific domains presents an effective solution for satisfying specialized requirements. However, adapting to some complex domains remains challenging, especially when these domains require substantial paired data to capture the targeted distributions. Since unpaired data from a single modality, such as vision or language, is more readily available, we utilize the bidirectional mappings between vision and language learned by the unified generative model to enable training on unpaired data for domain adaptation. Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. The model is optimized through cross-entropy loss computed at the cycle endpoints, where both endpoints share the same modality. This facilitates self-evolution of the model without reliance on annotated text-image pairs. Experimental results demonstrate that for tasks independent of paired knowledge, such as stylization, DoraCycle can effectively adapt the unified model using only unpaired data. For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation. The code will be released at this https URL.
[CV-10] DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms
Quick read: This paper addresses the difficulty of advancing research on the semantic understanding of Dongba pictographs caused by the lack of relevant datasets. It proposes DongbaMIE, the first multimodal dataset for semantic understanding and extraction of Dongba pictographs. The dataset consists of Dongba pictograph images and their corresponding Chinese semantic annotations, with 23,530 sentence-level and 2,539 paragraph-level images covering four semantic dimensions: objects, actions, relations, and attributes. The key contribution is building this comprehensive, clearly annotated dataset, which makes it possible to measure and improve how well large multimodal models recognize the diverse semantic information in Dongba pictographs.
Link: https://arxiv.org/abs/2503.03644
Authors: Xiaojun Bi,Shuo Li,Ziyue Wang,Fuwen Luo,Weizheng Qiao,Lu Han,Ziwei Sun,Peng Li,Yang Liu
Affiliations: College of Information and Engineering, Minzu University of China, Beijing, China; Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing, China; College of Information and Communication Engineering, Harbin Engineering University, Harbin, China; Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Dongba pictographs are the only pictographs still in use in the world. They have pictorial ideographic features, and their symbols carry rich cultural and contextual information. Due to the lack of relevant datasets, existing research has difficulty in advancing the study of semantic understanding of Dongba pictographs. To this end, we propose DongbaMIE, the first multimodal dataset for semantic understanding and extraction of Dongba pictographs. The dataset consists of Dongba pictograph images and their corresponding Chinese semantic annotations. It contains 23,530 sentence-level and 2,539 paragraph-level images, covering four semantic dimensions: objects, actions, relations, and attributes. We systematically evaluate the GPT-4o, Gemini-2.0, and Qwen2-VL models. Experimental results show that the F1 scores of GPT-4o and Gemini in the best object extraction are only 3.16 and 3.11 respectively. The F1 score of Qwen2-VL after supervised fine-tuning is only 11.49. These results suggest that current large multimodal models still face significant challenges in accurately recognizing the diverse semantic information in Dongba pictographs. The dataset can be obtained from this URL.
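The F1 scores reported in this abstract compare extracted items against the annotations. The sketch below shows how such a score is typically computed, assuming exact-match, set-based scoring for a single image. The paper's actual matching rules are not given in the abstract, and the example items are hypothetical.

```python
def extraction_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one image's extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical object-extraction output versus annotation for one pictograph image.
p, r, f1 = extraction_f1(
    predicted=["sun", "man", "tree"],
    gold=["sun", "moon", "man", "house"],
)
```

Here two of three predictions are correct and two of four annotated objects are found, giving a precision of 2/3, a recall of 1/2, and an F1 of 4/7.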
[CV-11] An Adaptive Underwater Image Enhancement Framework via Multi-Domain Fusion and Color Compensation
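Quick read: This paper addresses the degradation of underwater optical imaging by light absorption, scattering, and color distortion, which reduces visibility and hinders accurate image analysis. The key contribution is an adaptive enhancement framework that integrates hybrid illumination compensation, two-stage filtering, and adaptive color compensation (ACC). The illumination compensation strategy combines CLAHE, Gamma correction, and Retinex to improve visibility. The filtering stage uses spatial-domain (Gaussian, Bilateral, Guided) and frequency-domain (Fourier, Wavelet) methods to reduce noise while preserving details. The color correction model estimates spectral attenuation and water type to dynamically combine RCP, DCP, and MUDCP, and a perceptually guided color balance mechanism ensures natural color restoration. Experiments show superior contrast enhancement, color correction, and structural preservation, making the framework suitable for underwater imaging applications.
Link: https://arxiv.org/abs/2503.03640
Authors: Yuezhe Tian,Kangchen Yao,Xiaoyang Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: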
Abstract:Underwater optical imaging is severely degraded by light absorption, scattering, and color distortion, hindering visibility and accurate image analysis. This paper presents an adaptive enhancement framework integrating illumination compensation, multi-domain filtering, and dynamic color correction. A hybrid illumination compensation strategy combining CLAHE, Gamma correction, and Retinex enhances visibility. A two-stage filtering process, including spatial-domain (Gaussian, Bilateral, Guided) and frequency-domain (Fourier, Wavelet) methods, effectively reduces noise while preserving details. To correct color distortion, an adaptive color compensation (ACC) model estimates spectral attenuation and water type to combine RCP, DCP, and MUDCP dynamically. Finally, a perceptually guided color balance mechanism ensures natural color restoration. Experimental results on benchmark datasets demonstrate superior performance over state-of-the-art methods in contrast enhancement, color correction, and structural preservation, making the framework robust for underwater imaging applications.
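Two building blocks of this kind of pipeline are easy to state concretely. The sketch below shows Gamma correction on 8-bit intensities and a gray-world color balance. Gray-world is used here only as a simple stand-in: the abstract does not specify how the paper's perceptually guided color balance works, and the pixel values are made up.

```python
def gamma_correct(value, gamma):
    """Map an 8-bit intensity through out = 255 * (in / 255) ** gamma."""
    return round(255.0 * (value / 255.0) ** gamma)

def gray_world_balance(pixels):
    """Scale each RGB channel so its mean matches the overall mean intensity."""
    n = len(pixels)
    means = [sum(px[c] for px in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(min(255, round(px[c] * gains[c])) for c in range(3)) for px in pixels]

# Bluish-green cast typical of underwater scenes: the red channel is weak.
image = [(20, 100, 120), (40, 120, 140)]
balanced = gray_world_balance(image)

# A gamma below 1 brightens dark regions.
brightened = gamma_correct(64, 0.5)
```

After balancing, the red channel is amplified and green and blue are attenuated so that all three channel means are roughly equal, which removes the cast in this toy example.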
[CV-12] 4D Radar Ground Truth Augmentation with LiDAR-to-4D Radar Data Synthesis
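Quick read: This paper addresses a limitation of directly applying LiDAR-style ground truth augmentation (GT-Aug) to 4D Radar tensor data: it ignores important measurements outside the GT bounding boxes (GT bboxes), such as sidelobes, so the resulting synthetic distributions deviate from real-world 4D Radar data. The paper proposes 4D Radar Ground Truth Augmentation (4DR GT-Aug). Its key component is a LiDAR-to-4D Radar data synthesis (L2RDaS) module: LiDAR data is augmented first and then converted to 4D Radar data, explicitly accounting for measurements both inside and outside the GT bboxes. This yields 4D Radar data distributions closer to real-world measurements and improves object detection accuracy.
Link: https://arxiv.org/abs/2503.03637
Authors: Woo-Jin Jung,Dong-Hee Paek,Seung-Hyun Kong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 24 pages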
Abstract:Ground truth augmentation (GT-Aug) is a common method for LiDAR-based object detection, as it enhances object density by leveraging ground truth bounding boxes (GT bboxes). However, directly applying GT-Aug to 4D Radar tensor data overlooks important measurements outside the GT bboxes, such as sidelobes, leading to synthetic distributions that deviate from real-world 4D Radar data. To address this limitation, we propose 4D Radar Ground Truth Augmentation (4DR GT-Aug). Our approach first augments LiDAR data and then converts it to 4D Radar data via a LiDAR-to-4D Radar data synthesis (L2RDaS) module, which explicitly accounts for measurements both inside and outside GT bboxes. In doing so, it produces 4D Radar data distributions that more closely resemble real-world measurements, thereby improving object detection accuracy. Experiments on the K-Radar dataset show that the proposed method achieves improved performance compared to conventional GT-Aug in object detection for 4D Radar. The implementation code is available at this https URL.
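For context, conventional GT-Aug on point data can be sketched as follows: objects sampled from a database are pasted into a scene only when their boxes do not collide with boxes already present. Axis-aligned 2D boxes and all names below are simplifying assumptions. The L2RDaS conversion to 4D Radar data is not shown because the abstract does not describe how it works.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes given as (x_min, y_min, x_max, y_max)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def gt_augment(scene_boxes, scene_points, database):
    """Paste database objects (box, points) into the scene when collision-free."""
    boxes = list(scene_boxes)
    points = list(scene_points)
    for box, obj_points in database:
        if any(boxes_overlap(box, existing) for existing in boxes):
            continue  # skip objects that would intersect something already placed
        boxes.append(box)
        points.extend(obj_points)
    return boxes, points

scene_boxes = [(0.0, 0.0, 2.0, 4.0)]
scene_points = [(1.0, 1.0), (1.5, 3.0)]
database = [
    ((1.0, 1.0, 3.0, 5.0), [(2.0, 2.0)]),              # collides with the scene box
    ((5.0, 0.0, 7.0, 4.0), [(6.0, 2.0), (6.5, 3.0)]),  # lies in free space
]
aug_boxes, aug_points = gt_augment(scene_boxes, scene_points, database)
```

Only points inside the pasted boxes are added, which is exactly the behavior the abstract identifies as a problem for radar: measurements outside the boxes, such as sidelobes, never appear in the augmented data.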
[CV-13] CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP CVPR2025
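Quick read: This paper addresses CLIP's high vulnerability to adversarial perturbations in zero-shot image-text matching, in particular its lack of robustness against adversarial attacks at test time. Existing methods mainly fine-tune CLIP's vision encoder to improve zero-shot adversarial robustness; this paper instead proposes a training-free approach that is orthogonal to them. The key observation is that malicious perturbations lead to "falsely stable" images, and the method uses CLIP's pre-trained vision encoder to counterattack such adversarial images during inference. The approach is simple and efficient: it shows stable and consistent gains across 16 classification datasets and can also be applied on top of adversarially fine-tuned CLIP models to further improve their test-time robustness.
Link: https://arxiv.org/abs/2503.03613
Authors: Songlong Xing,Zhengyu Zhao,Nicu Sebe
Affiliations: University of Trento (Italy); Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025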
Abstract:Despite its prevalent use in image-text matching tasks in a zero-shot manner, CLIP has been shown to be highly vulnerable to adversarial perturbations added onto images. Recent studies propose to finetune the vision encoder of CLIP with adversarial samples generated on the fly, and show improved robustness against adversarial attacks on a spectrum of downstream datasets, a property termed as zero-shot robustness. In this paper, we show that malicious perturbations that seek to maximise the classification loss lead to ‘falsely stable’ images, and propose to leverage the pre-trained vision encoder of CLIP to counterattack such adversarial images during inference to achieve robustness. Our paradigm is simple and training-free, providing the first method to defend CLIP from adversarial attacks at test time, which is orthogonal to existing methods aiming to boost zero-shot adversarial robustness of CLIP. We conduct experiments across 16 classification datasets, and demonstrate stable and consistent gains compared to test-time defence methods adapted from existing adversarial robustness studies that do not rely on external networks, without noticeably impairing performance on clean images. We also show that our paradigm can be employed on CLIP models that have been adversarially finetuned to further enhance their robustness at test time. Our code is available at this https URL.
[CV-14] REGRACE: A Robust and Efficient Graph-based Re-localization Algorithm using Consistency Evaluation IROS2025
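Quick read: This paper addresses two challenges for loop closure detection in large-scale navigation: methods based on dense point clouds scale poorly because scan-to-scan comparisons are computationally expensive, and object-centric methods are efficient but sensitive to viewpoint variation. The proposed REGRACE uses LiDAR-based submaps, introduces rotation-invariant features for each labeled object, and enhances them with neighborhood context through a graph neural network. A scalable bag-of-words approach pools one learned global feature per submap to identify potential revisits, and revisits are defined by geometrical consistency cues rather than embedding distance, which allows far-away loop closures to be recognized. Experiments show that REGRACE achieves results similar to state-of-the-art place recognition and registration baselines while being twice as fast.
Link: https://arxiv.org/abs/2503.03599
Authors: Débora N.P. Oliveira,Joshua Knights,Sebastián Barbas Laina,Simon Boche,Wolfram Burgard,Stefan Leutenegger
Affiliations: Artificial Intelligence and Robotics Lab, Department of Computer Science and Artificial Intelligence, University of Technology of Nuremberg (UTN), Germany; CSIRO Robotics, DATA61, and Queensland University of Technology (QUT), Brisbane, Australia; Smart Robotics Lab, School of Computation, Information and Technology, Technical University of Munich (TUM), Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to IROS2025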
Abstract:Loop closures are essential for correcting odometry drift and creating consistent maps, especially in the context of large-scale navigation. Current methods using dense point clouds for accurate place recognition do not scale well due to computationally expensive scan-to-scan comparisons. Alternative object-centric approaches are more efficient but often struggle with sensitivity to viewpoint variation. In this work, we introduce REGRACE, a novel approach that addresses these challenges of scalability and perspective difference in re-localization by using LiDAR-based submaps. We introduce rotation-invariant features for each labeled object and enhance them with neighborhood context through a graph neural network. To identify potential revisits, we employ a scalable bag-of-words approach, pooling one learned global feature per submap. Additionally, we define a revisit with geometrical consistency cues rather than embedding distance, allowing us to recognize far-away loop closures. Our evaluations demonstrate that REGRACE achieves similar results compared to state-of-the-art place recognition and registration baselines while being twice as fast.
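The idea of judging a revisit by geometric consistency rather than embedding distance can be illustrated with a generic check: a rigid motion preserves distances, so matched object centroids in two submaps should have the same pairwise distances. This is a sketch under that assumption, not REGRACE's actual criterion, and the function name and tolerance are hypothetical.

```python
import math
from itertools import combinations

def consistency_ratio(matches, tolerance=0.2):
    """Fraction of match pairs whose inter-object distance agrees in both submaps.

    matches is a list of (centroid_in_submap_a, centroid_in_submap_b) tuples.
    Rigid motions preserve distances, so true revisits should score near 1.
    """
    pairs = list(combinations(matches, 2))
    if not pairs:
        return 0.0
    consistent = sum(
        1
        for (a1, b1), (a2, b2) in pairs
        if abs(math.dist(a1, a2) - math.dist(b1, b2)) <= tolerance
    )
    return consistent / len(pairs)

# Submap B is submap A rotated by 90 degrees about z and shifted by (10, 0, 0).
submap_a = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
submap_b = [(10.0, 0.0, 0.0), (10.0, 4.0, 0.0), (7.0, 0.0, 0.0)]
revisit_score = consistency_ratio(list(zip(submap_a, submap_b)))
```

Because the score depends only on distances between objects within each submap, it is unaffected by how far apart or how differently oriented the two submaps are, which is what makes this kind of cue usable for far-away loop closures.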
[CV-15] Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection CVPR2025
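Quick read: This paper addresses the difficulty that existing Industrial Anomaly Detection (IAD) algorithms have with physical understanding and reasoning. Current IAD algorithms are mostly developed and tested on static, semantically simple datasets that do not reflect real-world scenarios requiring physical knowledge and reasoning. To bridge this gap, the paper introduces the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Phys-AD contains more than 6400 videos across 22 real-world object categories and exhibits 47 types of anomalies, emphasizing dynamic, semantically rich scenarios. Detecting anomalies in it requires combining visual reasoning with physical knowledge. The paper also proposes the Physics Anomaly Explanation (PAEval) metric, which assesses whether visual-language foundation models can explain the physical causes of the anomalies they detect.
Link: https://arxiv.org/abs/2503.03562
Authors: Wenqiao Li,Yao Gu,Xintao Chen,Xiaohao Xu,Ming Hu,Xiaonan Huang,Yingna Wu
Affiliations: ShanghaiTech University; University of Michigan, Ann Arbor; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025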
Abstract:Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our dataset and benchmark will be publicly available.
[CV-16] High-Quality Virtual Single-Viewpoint Surgical Video: Geometric Autocalibration of Multiple Cameras in Surgical Lights MICCAI2023
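Quick read: This paper addresses occlusion in surgical video caused by surgeons blocking the camera's field of view. Prior work installed multiple cameras on a surgical light in the hope that some of them would see the surgical field with less occlusion. That setup creates a new imaging problem: the camera configuration changes every time surgeons move the light, and the images must then be aligned manually. The key contribution is an algorithm that automates this alignment. It detects the frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion, producing a stabilized video with less occlusion. Quantitative results show that the method outperforms conventional approaches, and a user study with medical doctors confirmed its superiority.
Link: https://arxiv.org/abs/2503.03558
Authors: Yuna Kato,Mariko Isogawa,Shohei Mori,Hideo Saito,Hiroki Kajita,Yoshifumi Takatsume
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI2023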
Abstract:Occlusion-free video generation is challenging due to surgeons’ obstructions in the camera field of view. Prior work has addressed this issue by installing multiple cameras on a surgical light, hoping some cameras will observe the surgical field with less occlusion. However, this special camera setup poses a new imaging challenge since camera configurations can change every time surgeons move the light, and manual image alignment is required. This paper proposes an algorithm to automate this alignment task. The proposed method detects frames where the lighting system moves, realigns them, and selects the camera with the least occlusion. This algorithm results in a stabilized video with less occlusion. Quantitative results show that our method outperforms conventional approaches. A user study involving medical doctors also confirmed the superiority of our method.
[CV-17] Afford-X: Generalizable and Slim Affordance Reasoning for Task-oriented Manipulation
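Quick read: This paper addresses two problems: perception-based affordance reasoning models lack generalizability, and large language models (LLMs) are hard to deploy on local devices for task-oriented manipulation. The paper introduces LVIS-Aff, a large-scale dataset, and uses it to develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. The model delivers a clear performance gain while keeping a compact parameter count and much faster inference, showing that efficient, generalizable affordance reasoning can run on local devices and support practical task-oriented manipulation.
Link: https://arxiv.org/abs/2503.03556
Authors: Xiaomeng Zhu,Yuyang Li,Leiyao Cui,Pengfei Li,Huan-ang Gao,Yixin Zhu,Hao Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: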
Abstract:Object affordance reasoning, the ability to infer object functionalities based on physical properties, is fundamental for task-oriented planning and activities in both humans and Artificial Intelligence (AI). This capability, required for planning and executing daily activities in a task-oriented manner, relies on commonsense knowledge of object physics and functionalities, extending beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, limiting their applicability in novel scenarios. Meanwhile, comprehensive Large Language Models (LLMs) with emerging reasoning capabilities are challenging to deploy on local devices for task-oriented manipulations. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 119k images, designed to enhance the generalizability of affordance reasoning from perception. Utilizing this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. This model achieves up to a 12.1% performance improvement over the best-reported results from non-LLM methods, while also demonstrating a 1.2% enhancement compared to our previous conference paper. Additionally, it maintains a compact 187M parameter size and infers nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulations. We showcase Afford-X’s effectiveness in enabling task-oriented manipulations for robots across various tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.
[CV-18] Simulation-Based Performance Evaluation of 3D Object Detection Methods with Deep Learning for a LiDAR Point Cloud Dataset in a SOTIF-related Use Case
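Quick read: This paper concerns the Safety of the Intended Functionality (SOTIF) of Automated Driving Systems (ADS), which deals with sensor performance limitations and the insufficiencies of deep learning-based object detection. Its main contributions are defining and modelling a SOTIF-related Use Case with 21 diverse weather conditions and generating a LiDAR point cloud dataset suitable for 3D object detection methods. The dataset contains 547 frames covering clear, cloudy, and rainy weather at different times of day (noon, sunset, and night). Using the MMDetection3D and OpenPCDET toolkits, the paper evaluates and compares state-of-the-art 3D object detection methods by testing pre-trained models on the generated dataset with Average Precision (AP) and Recall metrics. The emphasis is on a comprehensive dataset design and a standardized evaluation procedure for examining how well these methods adapt to difficult conditions.
Link: https://arxiv.org/abs/2503.03548
Authors: Milin Patel,Rolf Jung
Affiliations: Institute for Advanced Driver Assistance Systems and Connected Mobility, Kempten University of Applied Sciences, Benningen, Germany; Faculty of Computer Science, Kempten University of Applied Sciences, Kempten, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: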
Abstract:Safety of the Intended Functionality (SOTIF) addresses sensor performance limitations and deep learning-based object detection insufficiencies to ensure the intended functionality of Automated Driving Systems (ADS). This paper presents a methodology examining the adaptability and performance evaluation of the 3D object detection methods on a LiDAR point cloud dataset generated by simulating a SOTIF-related Use Case. The major contributions of this paper include defining and modelling a SOTIF-related Use Case with 21 diverse weather conditions and generating a LiDAR point cloud dataset suitable for application of 3D object detection methods. The dataset consists of 547 frames, encompassing clear, cloudy, rainy weather conditions, corresponding to different times of the day, including noon, sunset, and night. Employing MMDetection3D and OpenPCDET toolkits, the performance of State-of-the-Art (SOTA) 3D object detection methods is evaluated and compared by testing the pre-trained Deep Learning (DL) models on the generated dataset using Average Precision (AP) and Recall metrics.
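The AP and Recall metrics named in this abstract can be computed from scored detections that have already been marked as true or false positives. The sketch below uses all-point interpolation of the precision-recall curve. The matching step (for example an IoU threshold) and the exact interpolation used by MMDetection3D and OpenPCDET are not reproduced here, and the example detections are made up.

```python
def average_precision(detections, num_ground_truth):
    """All-point interpolated AP and final recall.

    detections: list of (confidence, is_true_positive) tuples.
    num_ground_truth: number of annotated objects in the evaluated frames.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += 1 if is_tp else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)
    # Interpolation envelope: use the best precision reachable at any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p  # area of one step under the PR curve
        prev_recall = r
    return ap, (recalls[-1] if recalls else 0.0)

# Four detections for a class with three annotated objects; one object is missed.
ap, recall = average_precision(
    detections=[(0.9, True), (0.8, False), (0.7, True), (0.6, False)],
    num_ground_truth=3,
)
```

In this example the detector finds two of three objects, so recall is 2/3, and the area under the interpolated precision-recall curve is 5/9. Computing these per weather condition would show where detection degrades.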
[CV-19] A self-supervised cyclic neural-analytic approach for novel view synthesis and 3D reconstruction BMVC2024
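[Quick read]: This paper addresses generating novel views from recorded videos to support autonomous UAV navigation. Existing neural rendering methods can quickly render new trajectories, but without an optimized flight path they often generalize poorly to regions far from the training data, which degrades reconstruction quality. The key solution is a self-supervised cyclic neural-analytic pipeline that combines high-quality neural rendering outputs with precise geometric insights from analytical methods. Using a transformer-based architecture for image reconstruction, the method improves RGB and mesh reconstructions, especially in undersampled areas and in regions completely different from the training data. It also handles novel, unseen poses without relying on extensive labeled datasets, improving both novel view rendering and 3D reconstruction for autonomous navigation in complex outdoor environments.
Link: https://arxiv.org/abs/2503.03543
Authors: Dragos Costea,Alina Marcu,Marius Leordeanu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in BMVC 2024, 10 pages, 4 figures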
Abstract:Generating novel views from recorded videos is crucial for enabling autonomous UAV navigation. Recent advancements in neural rendering have facilitated the rapid development of methods capable of rendering new trajectories. However, these methods often fail to generalize well to regions far from the training data without an optimized flight path, leading to suboptimal reconstructions. We propose a self-supervised cyclic neural-analytic pipeline that combines high-quality neural rendering outputs with precise geometric insights from analytical methods. Our solution improves RGB and mesh reconstructions for novel view synthesis, especially in undersampled areas and regions that are completely different from the training dataset. We use an effective transformer-based architecture for image reconstruction to refine and adapt the synthesis process, enabling effective handling of novel, unseen poses without relying on extensive labeled datasets. Our findings demonstrate substantial improvements in both novel view rendering and 3D reconstruction, which to the best of our knowledge is a first, setting a new standard for autonomous navigation in complex outdoor environments.
[CV-20] Unified Human Localization and Trajectory Prediction with Monocular Vision ICRA2025
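[Quick read]: This paper addresses the reliance of conventional human trajectory prediction models on clean curated data, which requires specialized equipment or manual labeling and is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, which hurts their robustness on noisy inputs. The paper proposes MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction. MT has two main modules: Bird's Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates a person's position from 2D human poses and uses a novel directional loss for smoother sequential localizations; the trajectory prediction module predicts future motion from these estimates. Jointly training both tasks makes the method more robust in real-world scenarios with noisy inputs. On the curated dataset, MT improves on baseline models by around 12% for BEV localization and trajectory prediction, and it maintains similar performance on a real-world non-curated dataset.
Link: https://arxiv.org/abs/2503.03535
Authors: Po-Chien Luan,Yang Gao,Celine Demonsant,Alexandre Alahi
Affiliations: EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICRA 2025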
Abstract:Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. Existing predictors tend to overfit to clean observations, which affects their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird’s Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios with noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On a real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at this https URL.
[CV-21] AdaSin: Enhancing Hard Sample Metrics with Dual Adaptive Penalty for Face Recognition
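[Quick read]: This paper addresses the difficulty that existing face recognition loss functions have in quantifying how hard a sample is, and therefore in penalizing hard samples precisely. It proposes the Adaptive Sine (AdaSin) loss function, whose key idea is to use the sine of the angle between a sample's embedding feature and its ground-truth class center as a new difficulty metric. This metric enables precise and effective penalization of hard samples and, combined with curriculum learning, lets the model adjust classification boundaries dynamically across training stages. Unlike previous adaptive-margin loss functions, AdaSin applies a dual adaptive penalty to both the positive and negative cosine similarities of hard samples, imposing stronger constraints that improve intra-class compactness and inter-class separability. Guided by the difficulty metric, the combination of the dual adaptive penalty and curriculum learning lets the model focus more on hard samples in later training stages and extract highly discriminative face features. Experiments on eight benchmarks show that AdaSin achieves higher accuracy than other state-of-the-art methods.
Link: https://arxiv.org/abs/2503.03528
Authors: Qiqi Guo,Zhuowen Zheng,Guanghua Yang,Zhiquan Liu,Xiaofan Li,Jianqing Li,Jinyu Tian,Xueyuan Gong
Affiliations: School of Intelligent Systems Science and Engineering, Jinan University, Guangdong, China; College of Cyber Security, Jinan University, Guangdong, China; School of Computer Science and Engineering, Macau University of Science and Technology, Macau, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: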
Abstract:In recent years, the emergence of deep convolutional neural networks has positioned face recognition as a prominent research focus in computer vision. Traditional loss functions, such as margin-based, hard-sample mining-based, and hybrid approaches, have achieved notable performance improvements, with some leveraging curriculum learning to optimize training. However, these methods often fall short in effectively quantifying the difficulty of hard samples. To address this, we propose the Adaptive Sine (AdaSin) loss function, which introduces the sine of the angle between a sample’s embedding feature and its ground-truth class center as a novel difficulty metric. This metric enables precise and effective penalization of hard samples. By incorporating curriculum learning, the model dynamically adjusts classification boundaries across different training stages. Unlike previous adaptive-margin loss functions, AdaSin introduces a dual adaptive penalty, applied to both the positive and negative cosine similarities of hard samples. This design imposes stronger constraints, enhancing intra-class compactness and inter-class separability. The combination of the dual adaptive penalty and curriculum learning is guided by a well-designed difficulty metric. It enables the model to focus more effectively on hard samples in later training stages, leading to the extraction of highly discriminative face features. Extensive experiments across eight benchmarks demonstrate that AdaSin achieves superior accuracy compared to other state-of-the-art methods.
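The difficulty metric named in the abstract, the sine of the angle between an embedding and its ground-truth class center, can be sketched in a few lines. This is only an illustration of that one quantity: the function name `sine_difficulty` and the toy vectors are assumptions, and how AdaSin converts the value into its dual adaptive penalty is not described here.

```python
import math

def sine_difficulty(embedding, class_center):
    """Return sin(theta), where theta is the angle between the two vectors."""
    dot = sum(e * c for e, c in zip(embedding, class_center))
    norm_e = math.sqrt(sum(e * e for e in embedding))
    norm_c = math.sqrt(sum(c * c for c in class_center))
    cos_theta = dot / (norm_e * norm_c)
    # Clamp to [-1, 1] so rounding error cannot make the square root invalid.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.sqrt(1.0 - cos_theta * cos_theta)

# Hypothetical 2-D embeddings, chosen only to show the two extremes.
easy = sine_difficulty([1.0, 0.0], [2.0, 0.0])  # aligned with the class center
hard = sine_difficulty([0.0, 1.0], [1.0, 0.0])  # orthogonal to the class center
```

A sample aligned with its class center gets a value of 0, and the value grows toward 1 as the angle approaches 90 degrees.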
[CV-22] Do ImageNet-trained models learn shortcuts? The impact of frequency shortcuts on generalization CVPR2025
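[Quick read]: This paper studies frequency shortcuts, specific frequency patterns that models rely on heavily for correct classification and that can impair generalization on out-of-distribution (OOD) data. Existing methods for identifying frequency shortcuts are computationally expensive and impractical for models trained on large datasets. The paper proposes a more efficient approach for analyzing frequency shortcuts at a larger scale and shows that both CNN and transformer models learn frequency shortcuts on ImageNet. These shortcuts can yield good performance on OOD test sets that largely retain texture information, but they hinder generalization on rendition-based OOD test sets. The findings suggest that current OOD evaluations often overlook the impact of frequency shortcuts, and that future benchmarks could benefit from explicitly assessing and accounting for them.
Link: https://arxiv.org/abs/2503.03519
Authors: Shunxin Wang,Raymond Veldhuis,Nicola Strisciuglio
Affiliations: University of Twente
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: received at CVPR2025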
Abstract:Frequency shortcuts refer to specific frequency patterns that models heavily rely on for correct classification. Previous studies have shown that models trained on small image datasets often exploit such shortcuts, potentially impairing their generalization performance. However, existing methods for identifying frequency shortcuts require expensive computations and become impractical for analyzing models trained on large datasets. In this work, we propose the first approach to more efficiently analyze frequency shortcuts at a larger scale. We show that both CNN and transformer models learn frequency shortcuts on ImageNet. We also expose that frequency shortcut solutions can yield good performance on out-of-distribution (OOD) test sets which largely retain texture information. However, these shortcuts, mostly aligned with texture patterns, hinder model generalization on rendition-based OOD test sets. These observations suggest that current OOD evaluations often overlook the impact of frequency shortcuts on model generalization. Future benchmarks could thus benefit from explicitly assessing and accounting for these shortcuts to build models that generalize across a broader range of OOD scenarios.
[CV-23] Mineral segmentation using electron microscope images and spectral sampling through multimodal graph neural networks
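[Quick read]: This paper addresses the insufficient information in single-modality Scanning Electron Microscope (SEM) Backscattered Electron (BSE) images for mineral segmentation. Imaging is usually complemented with point-wise Energy-Dispersive X-ray Spectroscopy (EDS) spectral measurements, which are highly accurate but time-consuming to acquire. The paper proposes a Graph Neural Network (GNN)-based data fusion method that combines sparse EDS spectral data with BSE images, fusing the two modalities and segmenting the mineral phases simultaneously. The key is using a GNN to handle the unstructured nature of the EDS data. Providing EDS data for as few as 1% of BSE pixels produces accurate segmentation, enabling rapid analysis of mineral samples.
Link: https://arxiv.org/abs/2503.03507
Authors: Samuel Repka,Bořek Reich,Fedor Zolotarev,Tuomas Eerola,Pavel Zemčík
Affiliations: Lappeenranta-Lahti University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: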
Abstract:We propose a novel Graph Neural Network-based method for segmentation based on data fusion of multimodal Scanning Electron Microscope (SEM) images. In most cases, Backscattered Electron (BSE) images obtained using SEM do not contain sufficient information for mineral segmentation. Therefore, imaging is often complemented with point-wise Energy-Dispersive X-ray Spectroscopy (EDS) spectral measurements that provide highly accurate information about the chemical composition but that are time-consuming to acquire. This motivates the use of sparse spectral data in conjunction with BSE images for mineral segmentation. The unstructured nature of the spectral data makes most traditional image fusion techniques unsuitable for BSE-EDS fusion. We propose using graph neural networks to fuse the two modalities and segment the mineral phases simultaneously. Our results demonstrate that providing EDS data for as few as 1% of BSE pixels produces accurate segmentation, enabling rapid analysis of mineral samples. The proposed data fusion pipeline is versatile and can be adapted to other domains that involve image data and point-wise measurements.
[CV-24] CarGait: Cross-Attention based Re-ranking for Gait recognition
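[Quick read]: This paper addresses the weaker performance of single-stage gait recognition models at the highest ranks (e.g., Rank-1) when hard negatives appear in the top short-list. It proposes CarGait, a cross-attention re-ranking method that re-orders the initial top-K candidate list using fine-grained correlations between pairs of gait sequences, captured through cross-attention between gait strips. The re-ranking scheme can be adapted to existing single-stage models to improve their final results. Experiments on three datasets (Gait3D, GREW, and OU-MVLP) and seven gait models show consistent improvements in Rank-1 and Rank-5 accuracy and better results than existing re-ranking methods.
Link: https://arxiv.org/abs/2503.03501
Authors: Gavriel Habib,Noa Barzilay,Or Shimshi,Rami Ben-Ari,Nir Darshan
Affiliations: OriginAI, Israel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: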
Abstract:Gait recognition is a computer vision task that identifies individuals based on their walking patterns. Gait recognition performance is commonly evaluated by ranking a gallery of candidates and measuring the accuracy at the top Rank-K. Existing models are typically single-staged, i.e. searching for the probe’s nearest neighbors in a gallery using a single global feature representation. Although these models typically excel at retrieving the correct identity within the top-K predictions, they struggle when hard negatives appear in the top short-list, leading to relatively low performance at the highest ranks (e.g., Rank-1). In this paper, we introduce CarGait, a Cross-Attention Re-ranking method for gait recognition, that involves re-ordering the top-K list leveraging the fine-grained correlations between pairs of gait sequences through cross-attention between gait strips. This re-ranking scheme can be adapted to existing single-stage models to enhance their final results. We demonstrate the capabilities of CarGait by extensive experiments on three common gait datasets, Gait3D, GREW, and OU-MVLP, and seven different gait models, showing consistent improvements in Rank-1 and Rank-5 accuracy, superior results over existing re-ranking methods, and strong baselines.
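The two-stage idea (rank the gallery with a global feature, then re-order only the top-K short-list with a finer pairwise score) can be sketched generically as follows. This is not CarGait itself: `pairwise_score` is a hypothetical stand-in for the cross-attention model, and the toy features are invented for illustration.

```python
def rerank_top_k(probe, gallery, global_distance, pairwise_score, k):
    """Rank by global distance, then re-order only the first k candidates."""
    # Stage 1: single-stage retrieval over the whole gallery (smaller is better).
    initial = sorted(gallery, key=lambda g: global_distance(probe, g))
    shortlist, rest = initial[:k], initial[k:]
    # Stage 2: finer pairwise score on the short-list only (larger is better).
    reranked = sorted(shortlist, key=lambda g: pairwise_score(probe, g), reverse=True)
    return reranked + rest

# Toy data: "g" is a 1-D global feature, "fine" a 2-D fine-grained feature.
probe = {"id": "p", "g": 0.0, "fine": (1.0, 0.0)}
gallery = [
    {"id": "A", "g": 0.1, "fine": (0.0, 1.0)},  # hard negative: close globally
    {"id": "B", "g": 0.2, "fine": (1.0, 0.0)},  # true match
    {"id": "C", "g": 0.9, "fine": (1.0, 0.0)},  # outside the short-list
]

def global_distance(p, g):
    return abs(p["g"] - g["g"])

def pairwise_score(p, g):
    return sum(a * b for a, b in zip(p["fine"], g["fine"]))

ranked_ids = [g["id"] for g in rerank_top_k(probe, gallery, global_distance, pairwise_score, k=2)]
```

Candidates outside the top-K keep their original order, so the extra cost is bounded by K pairwise comparisons per probe.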
[CV-25] Find First Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
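[Quick read]: This paper addresses ambiguous target identification and inconsistent mask propagation across frames in referring video object segmentation. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. They struggle to identify the target in scenes with multiple similar objects and do not ensure consistent masks across frames. The paper proposes FindTrack, a framework that decouples target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object; a dedicated propagation module then uses this reference to track and segment the object across the entire video. Decoupling the two steps reduces ambiguity in target association and improves segmentation consistency, and FindTrack outperforms existing methods on public benchmarks.
Link: https://arxiv.org/abs/2503.03492
Authors: Suhwan Cho,Seunghoon Lee,Minhyeok Lee,Jungho Lee,Sangyoun Lee
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: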
Abstract:Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, a novel decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. We demonstrate that FindTrack outperforms existing methods on public benchmarks.
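Key-frame selection that balances two per-frame scores can be illustrated with a simple weighted sum. The abstract does not state how FindTrack combines segmentation confidence and vision-text alignment, so the linear combination, the `weight` parameter, and the scores below are all assumptions made for this sketch.

```python
def select_key_frame(seg_confidence, text_alignment, weight=0.5):
    """Return the index of the frame with the best combined score."""
    scores = [
        weight * conf + (1.0 - weight) * align
        for conf, align in zip(seg_confidence, text_alignment)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-frame scores in [0, 1].
seg_confidence = [0.9, 0.6, 0.8]
text_alignment = [0.2, 0.7, 0.8]

balanced_choice = select_key_frame(seg_confidence, text_alignment)
confidence_only = select_key_frame(seg_confidence, text_alignment, weight=1.0)
```

In this toy case the frame with the highest segmentation confidence (index 0) matches the text poorly, so the balanced score prefers index 2.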
[CV-26] Feature Point Extraction for Extra-Affine Image
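[Quick read]: This paper addresses the significant decline in feature extraction stability for images subjected to large-angle affine transformations, where the angle exceeds 50 degrees. ASIFT, which builds on SIFT and simulates a large number of affine transformations, is time-consuming and memory-intensive, and its stability drops rapidly under large-view affine transformations. The paper proposes an improvement over ASIFT that raises precision while maintaining affine invariance and is, to the authors' knowledge, the fastest feature extraction method for extra-affine images. It also pushes the stability of feature extraction close to its limit.
The core idea is to obtain an optimal parameter set by simulating affine transformations with the reference image, reproduce the simulated affine transformation using Lanczos interpolation based on that parameter set, and combine the result with ORB, which offers excellent real-time performance. A scale parameter simulation is additionally introduced to improve operational efficiency.
Link: https://arxiv.org/abs/2503.03479
Authors: Tao Wang,Yinghui Wang,Yanxing Liang,Liangyi Huang,Jinlong Yang,Wei Li,Xiaojuan Ning
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: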
Abstract:The significant decline in the stability of feature extraction for images subjected to large-angle affine transformations, where the angle exceeds 50 degrees, still awaits a satisfactory solution. Even ASIFT, which is built upon SIFT and entails a considerable number of image comparisons simulated by affine transformations, is time-consuming and imposes high demands on memory usage, and its stability of feature extraction drops rapidly under large-view affine transformations. Consequently, we propose a method that improves on ASIFT. While improving precision and maintaining affine invariance, it is the fastest feature extraction method for extra-affine images that we know of at present. At the same time, the stability of feature extraction on affine-transformed images approaches its maximum limits: both the angle between the shooting direction and the normal direction of the photographed object (absolute tilt angle) and the shooting transformation angle between two images (transition tilt angle) are close to 90 degrees. The central idea of the method is to obtain the optimal parameter set by simulating affine transformations with the reference image. The simulated affine transformation is then reproduced using Lanczos interpolation based on the optimal parameter set. Subsequently, it is combined with ORB, a fast oriented binary descriptor with excellent real-time performance. Moreover, a scale parameter simulation is introduced to further improve operational efficiency.
[CV-27] DTU-Net: A Multi-Scale Dilated Transformer Network for Nonlinear Hyperspectral Unmixing
[Quick read]: This paper addresses two challenges for Transformer-based methods in hyperspectral unmixing (HU): they struggle to capture multi-scale and long-range spatial correlations, and they rely on the linear mixing model, which lacks the flexibility to handle significant nonlinear effects. It proposes DTU-Net, a multi-scale Dilated Transformer-based unmixing network for nonlinear HU. The encoder has two branches: the first uses Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, varying dilation rates across attention heads to capture long-range and multi-scale spatial correlations; the second extracts spectral features using 3D-CNNs with channel attention. The branch outputs are fused and then transformed to estimate the abundances. The decoder accommodates both linear and nonlinear mixing scenarios, and its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients according to the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of DTU-Net.
Link: https://arxiv.org/abs/2503.03465
Authors: ChenTong Wang,Jincheng Gao,Fei Zhu,Abderrahim Halimi,Cédric Richard
Affiliations: Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China; School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom; Université Côte d’Azur, CNRS, OCA, F-06108, Nice, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:Transformers have shown significant success in hyperspectral unmixing (HU). However, challenges remain. While multi-scale and long-range spatial correlations are essential in unmixing tasks, current Transformer-based unmixing networks, built on Vision Transformer (ViT) or Swin-Transformer, struggle to capture them effectively. Additionally, current Transformer-based unmixing networks rely on the linear mixing model, which lacks the flexibility to accommodate scenarios where nonlinear effects are significant. To address these limitations, we propose a multi-scale Dilated Transformer-based unmixing network for nonlinear HU (DTU-Net). The encoder employs two branches. The first one performs multi-scale spatial feature extraction using Multi-Scale Dilated Attention (MSDA) in the Dilated Transformer, which varies dilation rates across attention heads to capture long-range and multi-scale spatial correlations. The second one performs spectral feature extraction utilizing 3D-CNNs with channel attention. The outputs from both branches are then fused to integrate multi-scale spatial and spectral information, which is subsequently transformed to estimate the abundances. The decoder is designed to accommodate both linear and nonlinear mixing scenarios. Its interpretability is enhanced by explicitly modeling the relationships between endmembers, abundances, and nonlinear coefficients in accordance with the polynomial post-nonlinear mixing model (PPNMM). Experiments on synthetic and real datasets validate the effectiveness of the proposed DTU-Net compared to PPNMM-derived methods and several advanced unmixing networks.
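For readers unfamiliar with PPNMM, the sketch below shows the noise-free forward model in its commonly cited form: a linear mixture of endmember spectra plus a per-band quadratic term scaled by a nonlinearity coefficient b. The function name and toy spectra are assumptions, and this is not the DTU-Net decoder, only the mixing model that the abstract says the decoder follows.

```python
def ppnmm_pixel(endmembers, abundances, b):
    """Noise-free PPNMM: y = M a + b * (M a) * (M a), applied band by band."""
    num_bands = len(endmembers[0])
    # Linear mixing: abundance-weighted sum of the endmember spectra.
    linear = [
        sum(a * spectrum[band] for a, spectrum in zip(abundances, endmembers))
        for band in range(num_bands)
    ]
    # Post-nonlinear term: element-wise square of the linear mixture, scaled by b.
    return [x + b * x * x for x in linear]

# Two hypothetical endmembers with two spectral bands each.
endmembers = [[0.2, 0.8], [0.6, 0.4]]
abundances = [0.5, 0.5]  # non-negative and summing to one

linear_pixel = ppnmm_pixel(endmembers, abundances, b=0.0)
nonlinear_pixel = ppnmm_pixel(endmembers, abundances, b=0.5)
```

Setting b to 0 recovers the linear mixing model, which is one way a single decoder can cover both linear and nonlinear scenarios.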
[CV-28] Active Learning for Deep Learning-Based Hemodynamic Parameter Estimation
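[Quick read]: This paper addresses the high cost of the time-consuming reference computational fluid dynamics (CFD) simulations needed to train data-driven surrogate models that rapidly estimate CFD outcomes. It introduces an active learning framework that reduces the number of CFD simulations required for training, lowering the barrier to deploying deep learning surrogates in new applications. The key is three querying strategies, based on geometrical variance, ensemble uncertainty, and adherence to the physics governing fluid dynamics, that determine for which unlabeled samples CFD simulations should be obtained. Benchmarked on velocity field estimation in synthetic coronary artery bifurcations, the strategies reduce the number of required samples by up to 50% and make the trained models more robust to difficult cases.
Link: https://arxiv.org/abs/2503.03453
Authors: Patryk Rygiel,Julian Suk,Kak Khee Yeung,Christoph Brune,Jelmer M. Wolterink
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: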
Abstract:Hemodynamic parameters such as pressure and wall shear stress play an important role in diagnosis, prognosis, and treatment planning in cardiovascular diseases. These parameters can be accurately computed using computational fluid dynamics (CFD), but CFD is computationally intensive. Hence, deep learning methods have been adopted as a surrogate to rapidly estimate CFD outcomes. A drawback of such data-driven models is the need for time-consuming reference CFD simulations for training. In this work, we introduce an active learning framework to reduce the number of CFD simulations required for the training of surrogate models, lowering the barriers to their deployment in new applications. We propose three distinct querying strategies to determine for which unlabeled samples CFD simulations should be obtained. These querying strategies are based on geometrical variance, ensemble uncertainty, and adherence to the physics governing fluid dynamics. We benchmark these methods on velocity field estimation in synthetic coronary artery bifurcations and find that they allow for substantial reductions in annotation cost. Notably, we find that our strategies reduce the number of samples required by up to 50% and make the trained models more robust to difficult cases. Our results show that active learning is a feasible strategy to increase the potential of deep learning-based CFD surrogates.
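Of the three querying strategies, ensemble uncertainty is the easiest to illustrate: label the samples on which ensemble members disagree most. The sketch below is a generic version of that idea, not the paper's implementation. Each prediction is reduced to a single scalar for brevity (a real surrogate would predict a velocity field), and the sample names and values are invented.

```python
from statistics import pvariance

def query_by_ensemble_uncertainty(ensemble_predictions, budget):
    """Pick the `budget` unlabeled samples with the highest ensemble variance."""
    ranked = sorted(
        ensemble_predictions,
        key=lambda sample_id: pvariance(ensemble_predictions[sample_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical scalar predictions from a three-member ensemble.
ensemble_predictions = {
    "geom_a": [1.0, 1.0, 1.0],  # members agree: low value in labeling it
    "geom_b": [0.5, 1.5, 1.0],  # strong disagreement
    "geom_c": [0.9, 1.1, 1.0],  # mild disagreement
}

to_simulate = query_by_ensemble_uncertainty(ensemble_predictions, budget=2)
```

The selected geometries would then be sent to the CFD solver and added to the training set before the next round.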
[CV-29] Biased Heritage: How Datasets Shape Models in Facial Expression Recognition
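[Quick read]: This paper studies how dataset bias propagates to trained models in image-based Facial Expression Recognition (FER) systems. It presents a comprehensive framework for analyzing bias propagation from datasets to trained models and introduces new bias metrics designed for multiclass problems with multiple demographic groups. The methodology induces controlled biases in FER datasets, trains models on these biased datasets, and analyzes the correlation between dataset bias metrics and model fairness notions. The findings show that stereotypical biases propagate more strongly to model predictions than representational biases, suggesting that preventing emotion-specific demographic patterns should be prioritized over general demographic balance in FER datasets. Biased datasets are also observed to reduce model accuracy, challenging the assumed fairness-accuracy trade-off.
Link: https://arxiv.org/abs/2503.03446
Authors: Iris Dominguez-Catena,Daniel Paternain,Mikel Galar,MaryBeth Defrance,Maarten Buyl,Tijl De Bie
Affiliations: Institute of Smart Cities (ISC), Public University of Navarre, Pamplona, Spain; Ghent University, Ghent, Belgium
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 17 pages, 7 figures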
Abstract:In recent years, the rapid development of artificial intelligence (AI) systems has raised concerns about our ability to ensure their fairness, that is, how to avoid discrimination based on protected characteristics such as gender, race, or age. While algorithmic fairness is well-studied in simple binary classification tasks on tabular data, its application to complex, real-world scenarios, such as Facial Expression Recognition (FER), remains underexplored. FER presents unique challenges: it is inherently multiclass, and biases emerge across intersecting demographic variables, each potentially comprising multiple protected groups. We present a comprehensive framework to analyze bias propagation from datasets to trained models in image-based FER systems, while introducing new bias metrics specifically designed for multiclass problems with multiple demographic groups. Our methodology studies bias propagation by (1) inducing controlled biases in FER datasets, (2) training models on these biased datasets, and (3) analyzing the correlation between dataset bias metrics and model fairness notions. Our findings reveal that stereotypical biases propagate more strongly to model predictions than representational biases, suggesting that preventing emotion-specific demographic patterns should be prioritized over general demographic balance in FER datasets. Additionally, we observe that biased datasets lead to reduced model accuracy, challenging the assumed fairness-accuracy trade-off.
[CV-30] JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba CVPR2025
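[Quick read]: This paper addresses the high spatial complexity of state-of-the-art feature matchers that use Transformers to capture long-range dependencies, which leads to demanding training and high-latency inference. It proposes JamMa, an ultra-lightweight Mamba-based matcher built on Joint Mamba with a scan-merge strategy named JEGO, to balance performance and efficiency. JEGO provides a joint scan of the two images for high-frequency mutual interaction, an efficient scan with skip steps to reduce sequence length, a global receptive field, and omnidirectional feature representation. With these properties, JEGO significantly outperforms the scan-merge strategies of VMamba and EVMamba on feature matching, and JamMa delivers better performance than attention-based sparse and semi-dense matchers with less than 50% of the parameters and FLOPs.
Link: https://arxiv.org/abs/2503.03437
Authors: Xiaoyong Lu,Songlin Du
Affiliations: School of Automation, Southeast University, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, Project page: this https URL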
Abstract:Existing state-of-the-art feature matchers capture long-range dependencies with Transformers but are hindered by high spatial complexity, leading to demanding training and high-latency inference. Striking a better balance between performance and efficiency remains a challenge in feature matching. Inspired by the linear complexity O(N) of Mamba, we propose an ultra-lightweight Mamba-based matcher, named JamMa, which converges on a single GPU and achieves an impressive performance-efficiency balance in inference. To unlock the potential of Mamba for feature matching, we propose Joint Mamba with a scan-merge strategy named JEGO, which enables: (1) Joint scan of two images to achieve high-frequency mutual interaction, (2) Efficient scan with skip steps to reduce sequence length, (3) Global receptive field, and (4) Omnidirectional feature representation. With the above properties, the JEGO strategy significantly outperforms the scan-merge strategies proposed in VMamba and EVMamba in the feature matching task. Compared to attention-based sparse and semi-dense matchers, JamMa demonstrates a superior balance between performance and efficiency, delivering better performance with less than 50% of the parameters and FLOPs.
[CV-31] CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization CVPR2025
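[Quick read]: This paper addresses weak single-vehicle perception in autonomous driving and the dilemma that existing multi-agent collaborative perception methods face between communication efficiency and perception accuracy. It proposes CoSDH, a communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. It also introduces an intermediate-late hybrid collaboration mode in which late-stage collaboration compensates for the performance degradation of collaborative perception under low communication bandwidth. Experiments on simulated and real-world datasets show state-of-the-art detection accuracy and favorable bandwidth trade-offs under real communication bandwidths.
Link: https://arxiv.org/abs/2503.03430
Authors: Junhao Xu,Yanan Zhang,Zhi Cai,Di Huang
Affiliations: State Key Laboratory of Complex and Critical Software Environment, Beihang University, Beijing, China; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025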
Abstract:Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed CoSDH. By modeling the supply-demand relationship between agents, the framework refines the selection of collaboration regions, reducing unnecessary communication cost while maintaining accuracy. In addition, we innovatively introduce the intermediate-late hybrid collaboration mode, where late-stage collaboration compensates for the performance degradation in collaborative perception under low communication bandwidth. Extensive experiments on multiple datasets, including both simulated and real-world scenarios, demonstrate that CoSDH achieves state-of-the-art detection accuracy and optimal bandwidth trade-offs, delivering superior detection precision under real communication bandwidths, thus proving its effectiveness and practical applicability. The code will be released at this https URL.
[CV-32] Automatic Drywall Analysis for Progress Tracking and Quality Control in Construction
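[Quick read]: This paper addresses progress tracking and quality assessment of drywall work on construction sites. It proposes an image-based automated drywall analysis method that uses on-site camera systems. The solution combines a deep learning-based instance segmentation model, which detects and classifies various drywall elements, with an analysis module that clusters individual wall segments, estimates camera perspective distortions, and applies the corresponding corrections. The authors improve instance segmentation accuracy through architecture modifications and targeted data augmentation, and develop a novel algorithm to extract important information from the segmentation results, enabling more accurate progress tracking and quality assessment.
Link: https://arxiv.org/abs/2503.03422
Authors: Mariusz Trzeciakiewicz,Aleixo Cambeiro Barreiro,Niklas Gard,Anna Hilsmann,Peter Eisert
Affiliations: Fraunhofer HHI, Berlin, Germany; Humboldt University of Berlin, Berlin, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: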
Abstract:Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource-efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image-based automated drywall analysis enabling construction progress and quality assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.
[CV-33] AI-Driven Multi-Stage Computer Vision System for Defect Detection in Laser-Engraved Industrial Nameplates
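[Quick read]: This paper addresses defect detection for laser-engraved nameplates in industrial manufacturing, in particular verifying logos and alphanumeric strings to catch misprints and missing characters. The solution integrates several computer vision techniques for inspection at multiple stages: object detection with YOLOv7, optical character recognition (OCR) with Tesseract, and anomaly detection with a residual variational autoencoder (ResVAE), along with other computer vision methods. Experiments show 91.33% accuracy with 100% recall, so defective nameplates are consistently detected. The result points to the potential of AI-driven visual inspection to improve quality control, reduce manual inspection effort, and raise overall manufacturing efficiency.
Link: https://arxiv.org/abs/2503.03395
Authors: Adhish Anitha Vilasan,Stephan Jäger,Noah Klarmann
Affiliations: Technische Hochschule Rosenheim; Knorr-Bremse Systeme für Nutzfahrzeuge GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: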
Abstract:Automated defect detection in industrial manufacturing is essential for maintaining product quality and minimizing production errors. In air disc brake manufacturing, ensuring the precision of laser-engraved nameplates is crucial for accurate product identification and quality control. Engraving errors, such as misprints or missing characters, can compromise both aesthetics and functionality, leading to material waste and production delays. This paper presents a proof of concept for an AI-driven computer vision system that inspects and verifies laser-engraved nameplates, detecting defects in logos and alphanumeric strings. The system integrates object detection using YOLOv7, optical character recognition (OCR) with Tesseract, and anomaly detection through a residual variational autoencoder (ResVAE) along with other computer vision methods to enable comprehensive inspections at multiple stages. Experimental results demonstrate the system’s effectiveness, achieving 91.33% accuracy and 100% recall, ensuring that defective nameplates are consistently detected and addressed. This solution highlights the potential of AI-driven visual inspection to enhance quality control, reduce manual inspection efforts, and improve overall manufacturing efficiency.
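A short calculation shows how accuracy and recall relate when recall is 100%. The counts below are invented for illustration and are not the paper's data, and treating "defective" as the positive class is an assumption.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Compute accuracy and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    return accuracy, recall

# Hypothetical counts: every defective nameplate is caught (fn == 0),
# while two good nameplates are flagged by mistake (fp == 2).
accuracy, recall = accuracy_and_recall(tp=8, fp=2, tn=10, fn=0)
```

With no false negatives, every remaining error is a false alarm on a good part, which costs a re-inspection but does not let a defect through.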
[CV-34] MIAdapt: Source-free Few-shot Domain Adaptive Object Detection for Microscopic Images
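[Quick read]: This paper addresses Source-free Few-shot Domain Adaptive Object Detection (SF-FSDA). Generic unsupervised domain adaptation approaches require access to both a large labeled source dataset and a sufficient unlabeled target dataset, but collecting a large dataset, even an unlabeled one, is challenging and expensive in fields such as medical imaging, and privacy constraints can make source data unavailable. The paper proposes MIAdapt, an approach for microscopic imagery adaptation that uses no images from the source domain, and defines two competitive baselines, Faster-FreeShot and MT-FreeShot. In experiments on the M5-Malaria and Raabin-WBC datasets, MIAdapt surpasses state-of-the-art source-free UDA (SF-UDA) methods by +21.3% mAP and few-shot domain adaptation (FSDA) approaches by +4.7% mAP on Raabin-WBC.
Link: https://arxiv.org/abs/2503.03370
Authors: Nimra Dilawar,Sara Nadeem,Javed Iqbal,Waqas Sultani,Mohsen Ali
Affiliations: Intelligent Machines Lab, Department of Artificial Intelligence, Information Technology University, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review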
Abstract:Existing generic unsupervised domain adaptation approaches require access to both a large labeled source dataset and a sufficient unlabeled target dataset during adaptation. However, collecting a large dataset, even if unlabeled, is a challenging and expensive endeavor, especially in medical imaging. In addition, constraints such as privacy issues can result in cases where source data is unavailable. Taking in consideration these challenges, we propose MIAdapt, an adaptive approach for Microscopic Imagery Adaptation as a solution for Source-free Few-shot Domain Adaptive Object detection (SF-FSDA). We also define two competitive baselines (1) Faster-FreeShot and (2) MT-FreeShot. Extensive experiments on the challenging M5-Malaria and Raabin-WBC datasets validate the effectiveness of MIAdapt. Without using any image from the source domain MIAdapt surpasses state-of-the-art source-free UDA (SF-UDA) methods by +21.3% mAP and few-shot domain adaptation (FSDA) approaches by +4.7% mAP on Raabin-WBC. Our code and models will be publicly available.
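[CV-35] Top-K Maximum Intensity Projection Priors for 3D Liver Vessel Segmentation
[Quick Read]: This paper addresses the failure of liver-vessel segmentation methods to preserve the global liver-vessel topology. Conventional 2D or 3D convolution-based methods focus on local segmentation in CT cross-sectional views and ignore the coherence of the overall anatomy. The key idea is the top-k maximum intensity projection, which mimics the CT reconstruction process by replacing the integral along each projection direction with the k largest values along that direction, thereby capturing global topological information. These projections condition a diffusion model that generates 3D liver-vessel trees. On the 3D-ircadb-01 dataset, the method achieves the highest Dice coefficient, intersection-over-union (IoU), and sensitivity scores compared to prior work.
Link: https://arxiv.org/abs/2503.03367
Authors: Xiaotong Zhang,Alexander Broersen,Gonnie CM van Erp,Silvia L. Pintea,Jouke Dijkstra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 2025 IEEE International Symposium on Biomedical Imaging (ISBI 2025)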
Abstract:Liver-vessel segmentation is an essential task in the pre-operative planning of liver resection. State-of-the-art 2D or 3D convolution-based methods focus on liver-vessel segmentation on 2D CT cross-sectional views and do not take into account the global liver-vessel topology. To maintain this global vessel topology, we rely on the underlying physics used in the CT reconstruction process, and apply this to liver-vessel segmentation. Concretely, we introduce the concept of top-k maximum intensity projections, which mimics the CT reconstruction by replacing the integral along each projection direction with the top-k maxima along that direction. We use these top-k maximum projections to condition a diffusion model and generate 3D liver-vessel trees. We evaluate our 3D liver-vessel segmentation on the 3D-ircadb-01 dataset, and achieve the highest Dice coefficient, intersection-over-union (IoU), and Sensitivity scores compared to prior work.
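The top-k maximum intensity projection described in the abstract can be illustrated with a small sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `top_k_mip_along_z`, the nested-list volume layout `volume[z][y][x]`, and the restriction to a single axis-aligned projection direction are all hypothetical choices made here.

```python
from typing import List

Volume = List[List[List[float]]]  # assumed layout: volume[z][y][x]


def top_k_mip_along_z(volume: Volume, k: int) -> List[List[List[float]]]:
    """Return a [k][y][x] stack holding the k largest intensities along z.

    For k == 1 this reduces to the ordinary maximum intensity projection.
    """
    depth = len(volume)
    if not 1 <= k <= depth:
        raise ValueError("k must be between 1 and the number of slices")
    height, width = len(volume[0]), len(volume[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(k)]
    for y in range(height):
        for x in range(width):
            # Collect the intensities along one projection ray, largest first.
            ray = sorted((volume[z][y][x] for z in range(depth)), reverse=True)
            for rank in range(k):
                out[rank][y][x] = ray[rank]
    return out
```

Keeping k values per ray rather than one retains some depth structure that a plain maximum projection discards, which is presumably why it is useful as a conditioning signal.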
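[CV-36] TopoMortar: A dataset to evaluate image segmentation methods focused on topology accuracy
[Quick Read]: This paper asks how to evaluate topology-focused image segmentation methods, in particular how well topology loss functions preserve topological accuracy. Existing datasets make this hard because small training sets, noisy labels, and out-of-distribution test images all affect how well topology losses work. The paper proposes the TopoMortar dataset, which provides three label types (accurate, noisy, and pseudo-labels), two fixed training sets (large and small), and both in-distribution and out-of-distribution test images. This design separates dataset-related effects from the effect of incorporating prior topological knowledge, so that an improvement in topology accuracy can be attributed to actually using topology information. Experiments show that the skeletonization-based clDice and Skeleton Recall losses do well in both topology accuracy and training speed, which makes this type of loss function a promising research direction.
Link: https://arxiv.org/abs/2503.03365
Authors: Juan Miguel Valverde,Motoya Koga,Nijihiko Otsuka,Anders Bjorholm Dahl
Affiliations: Department of Applied Mathematics and Computer Science, Technical University of Denmark; A.I. Virtanen Institute, University of Eastern Finland; Department of Architecture, Faculty of Engineering, Sojo University, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: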
Abstract:We present TopoMortar, a brick wall dataset that is the first dataset specifically designed to evaluate topology-focused image segmentation methods, such as topology loss functions. TopoMortar enables to investigate in two ways whether methods incorporate prior topological knowledge. First, by eliminating challenges seen in real-world data, such as small training set, noisy labels, and out-of-distribution test-set images, that, as we show, impact the effectiveness of topology losses. Second, by allowing to assess in the same dataset topology accuracy across dataset challenges, isolating dataset-related effects from the effect of incorporating prior topological knowledge. In these two experiments, it is deliberately difficult to improve topology accuracy without actually using topology information, thus, permitting to attribute an improvement in topology accuracy to the incorporation of prior topological knowledge. To this end, TopoMortar includes three types of labels (accurate, noisy, pseudo-labels), two fixed training sets (large and small), and in-distribution and out-of-distribution test-set images. We compared eight loss functions on TopoMortar, and we found that clDice achieved the most topologically accurate segmentations, Skeleton Recall loss performed best particularly with noisy labels, and the relative advantageousness of the other loss functions depended on the experimental setting. Additionally, we show that simple methods, such as data augmentation and self-distillation, can elevate Cross entropy Dice loss to surpass most topology loss functions, and that those simple methods can enhance topology loss functions as well. clDice and Skeleton Recall loss, both skeletonization-based loss functions, were also the fastest to train, making this type of loss function a promising research direction. TopoMortar and our code can be found at this https URL
[CV-37] Video Super-Resolution: All You Need is a Video Diffusion Model
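[Quick Read]: This paper presents a generic video super-resolution algorithm based on the Diffusion Posterior Sampling framework, targeting the modeling of complex motion patterns and the adaptation to different sampling conditions. The key component is an unconditional video generation model in latent space, a diffusion transformer that acts as a space-time model. By learning real-world physics, it treats diverse motion patterns as prior knowledge, which removes the need to explicitly estimate optical flow or motion parameters for pixel alignment. A single model instance also adapts to different sampling conditions without re-training. Because computational resources and training data were limited, the experiments demonstrate the algorithm's super-resolution capability on synthetic data.
Link: https://arxiv.org/abs/2503.03355
Authors: Zhihao Zhan,Wang Pang,Xiang Zhu,Yechao Bai
Affiliations: TopXGun Robotics; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: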
Abstract:We present a generic video super-resolution algorithm in this paper, based on the Diffusion Posterior Sampling framework with an unconditional video generation model in latent space. The video generation model, a diffusion transformer, functions as a space-time model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Due to limited computational resources and training data, our experiments provide empirical evidence of the algorithm’s strong super-resolution capabilities using synthetic data.
[CV-38] Automated Attendee Recognition System for Large-Scale Social Events or Conference Gathering
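[Quick Read]: This paper addresses manual attendance tracking at large events such as weddings or conferences, which is inefficient and error-prone. It proposes an automated, cloud-based attendance system in which cameras mounted at the entrance and exit gates capture video and stream it to cloud services for real-time face detection and recognition. Unlike existing solutions, the system identifies attendees even when they are not looking directly at the camera and are moving naturally, for example looking around or talking while walking; the authors state that, to their knowledge, it is the first system to achieve high recognition rates under such dynamic conditions. The reported overall accuracy is 90%, with each video frame processed in 5 seconds and no frame loss, and notifications reach security personnel within the same latency. For individuals without facial obstructions the reported accuracy is 100%, and the system recognizes all attendees who appear in the camera's field of view.
Link: https://arxiv.org/abs/2503.03330
Authors: Dhruv Motwani,Ankush Tyagi,Vipul Dabhi,Harshadkumar Prajapati
Affiliations: Avahi; Ericsson; Department of Information Technology, Dharmsinh Desai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: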
Abstract:Manual attendance tracking at large-scale events, such as marriage functions or conferences, is often inefficient and prone to human error. To address this challenge, we propose an automated, cloud-based attendance tracking system that uses cameras mounted at the entrance and exit gates. The mounted cameras continuously capture video and send the video data to cloud services to perform real-time face detection and recognition. Unlike existing solutions, our system accurately identifies attendees even when they are not looking directly at the camera, allowing natural movements, such as looking around or talking while walking. To the best of our knowledge, this is the first system to achieve high recognition rates under such dynamic conditions. Our system demonstrates overall 90% accuracy, with each video frame processed in 5 seconds, ensuring real time operation without frame loss. In addition, notifications are sent promptly to security personnel within the same latency. This system achieves 100% accuracy for individuals without facial obstructions and successfully recognizes all attendees appearing within the camera’s field of view, providing a robust solution for attendee recognition in large-scale social events.
[CV-39] Deep Learning-Based Diffusion MRI Tractography: Integrating Spatial and Anatomical Information
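[Quick Read]: This paper addresses the limited accuracy of tractograms reconstructed by diffusion MRI tractography, in particular the excessive false-positive connections that result from relying on local information to predict long-range streamlines. It proposes a deep learning framework that combines image-domain spatial information with anatomical information along tracts: the former is extracted by convolutional layers, the latter is modeled by a Transformer decoder, and a weighted loss function mitigates the fiber class imbalance encountered during training. The method improves tractogram accuracy and white matter coverage, as validated on multiple datasets.
Link: https://arxiv.org/abs/2503.03329
Authors: Yiqiong Yang,Yitian Yuan,Baoxing Ren,Ye Wu,Yanqiu Feng,Xinyuan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: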
Abstract:The diffusion MRI tractography technique enables non-invasive visualization of the white matter pathways in the brain. It plays a crucial role in neuroscience and clinical fields by facilitating the study of brain connectivity and neurological disorders. However, the accuracy of reconstructed tractograms has been a longstanding challenge. Recently, deep learning methods have been applied to improve tractograms for better white matter coverage, but this often comes at the expense of generating excessive false-positive connections. This is largely due to their reliance on local information to predict long-range streamlines. To improve the accuracy of streamline propagation predictions, we introduce a novel deep learning framework that integrates image-domain spatial information and anatomical information along tracts, with the former extracted through convolutional layers and the latter modeled via a Transformer-decoder. Additionally, we employ a weighted loss function to address fiber class imbalance encountered during training. We evaluate the proposed method on the simulated ISMRM 2015 Tractography Challenge dataset, achieving a valid streamline rate of 66.2%, white matter coverage of 63.8%, and successfully reconstructing 24 out of 25 bundles. Furthermore, on the multi-site Tractoinferno dataset, the proposed method demonstrates its ability to handle various diffusion MRI acquisition schemes, achieving a 5.7% increase in white matter coverage and a 4.1% decrease in overreach compared to RNN-based methods.
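The abstract mentions a weighted loss to counter fiber class imbalance but does not say how the weights are chosen. One common choice, shown here purely as an assumption, is inverse-frequency weighting of a cross-entropy loss. The function names and the normalisation are hypothetical and may differ from what the paper actually uses.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probs: List[List[float]], labels: Sequence[int],
                           weights: Dict[int, float]) -> float:
    """Weighted mean of -log p(true class), normalised by the sum of the weights."""
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den
```

With three samples of class 0 and one of class 1, the single rare sample receives three times the weight of each common one, so errors on under-represented bundles are not drowned out.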
[CV-40] Golden Cudgel Network for Real-Time Semantic Segmentation
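[Quick Read]: This paper addresses the performance and speed limits of existing real-time semantic segmentation models. Whether single-branch or multi-branch, their speed is limited by multi-path blocks, and some depend on a high-performance teacher model for training. The paper proposes the Golden Cudgel Network (GCNet), which trains with vertical multi-convolutions and horizontal multi-paths and then reparameterizes them into a single convolution for inference, optimizing both performance and speed. GCNet thus self-enlarges during training and self-contracts during inference, gaining strong learning capacity without an external teacher model. Experiments show that GCNet outperforms existing state-of-the-art models on the Cityscapes, CamVid, and Pascal VOC 2012 datasets.
Link: https://arxiv.org/abs/2503.03325
Authors: Guoyu Yang,Yuan Wang,Daming Shi,Yanzhong Wang
Affiliations: Shenzhen University; Zhejiang University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: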
Abstract:Recent real-time semantic segmentation models, whether single-branch or multi-branch, achieve good performance and speed. However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi-convolutions and horizontal multi-paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self-enlarge during training and self-contract during inference, effectively becoming a “teacher model” without needing external ones. Experimental results show that GCNet outperforms existing state-of-the-art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at this https URL.
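Reparameterization of this kind rests on the linearity of convolution: the sum of several parallel convolutions over the same input equals one convolution with the summed kernel. The sketch below demonstrates only that identity in 1D. It is not GCNet's block; it assumes equal kernel sizes and no nonlinearity or normalization between branches, and the helper names are hypothetical.

```python
from typing import List


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Plain 'valid' 1D cross-correlation."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]


def merge_parallel_kernels(kernels: List[List[float]]) -> List[float]:
    """Sum same-sized kernels of parallel linear branches into one kernel."""
    size = len(kernels[0])
    if any(len(k) != size for k in kernels):
        raise ValueError("all kernels must have the same size")
    return [sum(k[j] for k in kernels) for j in range(size)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 2.0, 0.0]]
    # Training-time view: three branches evaluated separately, then added.
    multi_path = [sum(v) for v in zip(*(conv1d_valid(signal, k) for k in kernels))]
    # Inference-time view: one convolution with the merged kernel.
    single = conv1d_valid(signal, merge_parallel_kernels(kernels))
    print(multi_path, single)
```

The merged form does one pass over the input instead of three, which is the source of the inference speed-up.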
[CV-41] See What You Are Told: Visual Attention Sink in Large Multimodal Models
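[Quick Read]: This paper addresses the tendency of large multimodal models (LMMs) to allocate high attention weights to visual tokens that are irrelevant to the text, a phenomenon the authors call the visual attention sink. It arises from the massive activation of certain hidden-state dimensions. The proposed solution is Visual Attention Redistribution (VAR), which treats the attention given to sink tokens as surplus and redistributes it, within image-centric attention heads, to more important visual information, strengthening the model's focus on the image without additional training, models, or inference steps. Experiments show that VAR improves LMM performance on general vision-language tasks, visual hallucination tasks, and vision-centric tasks.
Link: https://arxiv.org/abs/2503.03321
Authors: Seil Kang,Jinyeong Kim,Junhyeok Kim,Seong Jae Hwang
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: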
Abstract:Large multimodal models (LMMs) “see” images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However, recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the property behind the appearance of these irrelevant visual tokens and examine their characteristics. Our findings show that this behavior arises due to the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, despite receiving high attention weights. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction to enhancing the multimodal capabilities of LMMs.
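The idea of recycling sink attention as a budget can be sketched for a single attention row. This is a simplified illustration and not the paper's VAR rule: it assumes the sink tokens are already identified, moves all of their attention mass, and spreads it over the other visual tokens in proportion to their current weights. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def redistribute_sink_attention(attn: List[float], sink_idx: Sequence[int],
                                visual_idx: Sequence[int]) -> List[float]:
    """Move the attention mass on sink tokens to the remaining visual tokens."""
    sinks = set(sink_idx)
    targets = [i for i in visual_idx if i not in sinks]
    budget = sum(attn[i] for i in sinks)          # surplus attention to recycle
    target_mass = sum(attn[i] for i in targets)
    if not targets or target_mass == 0.0:
        return list(attn)                         # nothing sensible to redistribute to
    out = list(attn)
    for i in sinks:
        out[i] = 0.0
    for i in targets:
        # Proportional reallocation keeps the row summing to its original total.
        out[i] += budget * attn[i] / target_mass
    return out
```

Text tokens are left untouched in this sketch, so the row still sums to one and only the split among visual tokens changes.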
[CV-42] Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers CVPR
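[Quick Read]: This paper addresses full-DoF egomotion estimation for event cameras. Existing sparse geometric solvers assume the rotational displacement is known (for example, supplied by an IMU) and therefore recover only the translational motion parameters. The paper proposes several solvers within a unified framework that estimate angular and linear velocities together. The method leverages event manifolds induced by line segments, and the problem is formulated either through an incidence relation for lines or through a novel coplanarity relation for normal vectors. For efficient optimization, it uses the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on synthetic and real-world data validate the method.
Link: https://arxiv.org/abs/2503.03307
Authors: Ji Zhao,Banglei Guan,Zibin Liu,Laurent Kneip
Affiliations: Independent Researcher, Beijing, China; College of Aerospace Science and Engineering, National University of Defense Technology, China; Mobile Perception Lab, ShanghaiTech University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025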
Abstract:For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code is available at this https URL.
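A "first-order approximation of rotations" usually means truncating the exponential map after its linear term. The abstract does not give the paper's exact formulation, so the block below is the standard identity, written under the assumption of a constant angular velocity over a short time interval. The symbols R(t), ω, and t are notation chosen here.

```latex
R(t) = \exp\!\big(t\,[\boldsymbol{\omega}]_\times\big)
     = I + t\,[\boldsymbol{\omega}]_\times
       + \tfrac{t^2}{2}\,[\boldsymbol{\omega}]_\times^2 + \cdots
     \approx I + t\,[\boldsymbol{\omega}]_\times,
\qquad
[\boldsymbol{\omega}]_\times =
\begin{pmatrix}
0 & -\omega_z & \omega_y \\
\omega_z & 0 & -\omega_x \\
-\omega_y & \omega_x & 0
\end{pmatrix}
```

The approximation is linear in ω, which makes it cheap to solve for an initial estimate before a nonlinear refinement. The error grows with the rotation angle t‖ω‖, so it suits short event time windows.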
[CV-43] Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters
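[Quick Read]: This paper addresses two obstacles in LiDAR semantic segmentation: universal pre-training is hindered by the lack of large, diverse datasets, and the custom network layers in point cloud segmentation architectures limit the transfer of advances from vision-based architectures. It proposes BALViT, which uses frozen vision models as amodal feature encoders for learning strong LiDAR encoders. BALViT combines range-view and bird's-eye-view LiDAR encodings through a novel 2D-3D adapter: range-view features pass through a frozen image backbone, while the bird's-eye-view branch enhances them through multiple cross-attention interactions. This continuously injects domain-dependent knowledge into the vision network and yields a label-efficient LiDAR encoder. Evaluations on the SemanticKITTI and nuScenes benchmarks show that BALViT outperforms state-of-the-art methods in small-data regimes.
Link: https://arxiv.org/abs/2503.03299
Authors: Julia Hindel,Rohit Mohan,Jelena Bratulic,Daniele Cattaneo,Thomas Brox,Abhinav Valada
Affiliations: Department of Computer Science, University of Freiburg; Deutsche Forschungsgemeinschaft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: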
Abstract:LiDAR semantic segmentation models are typically trained from random initialization as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird’s-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird’s-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods on small data regimes. We make the code and models publicly available at: this http URL.
[CV-44] Deep Understanding of Sign Language for Sign to Subtitle Alignment
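[Quick Read]: This paper addresses the alignment of asynchronous subtitles in sign language videos when labelled data is limited. Its key contributions are: (1) pre-processing the input subtitles using fundamental grammatical rules of British Sign Language (BSL); (2) a selective alignment loss that optimises the model to predict the temporal location of a sign only when the queried sign actually occurs in the scene; and (3) self-training with refined pseudo-labels that are more accurate than heuristic audio-aligned labels. The approach improves the model's understanding of the correlation between text and signs and shows potential for sign language translation, particularly where manual labelling of large-scale sign data is impractical. Experiments show that it surpasses previous baselines by substantial margins in both frame-level accuracy and F1-score.
Link: https://arxiv.org/abs/2503.03287
Authors: Youngjoon Jang,Jeongsoo Choi,Junseok Ahn,Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: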
Abstract:The objective of this work is to align asynchronous subtitles in sign language videos with limited labelled data. To achieve this goal, we propose a novel framework with the following contributions: (1) we leverage fundamental grammatical rules of British Sign Language (BSL) to pre-process the input subtitles, (2) we design a selective alignment loss to optimise the model for predicting the temporal location of signs only when the queried sign actually occurs in a scene, and (3) we conduct self-training with refined pseudo-labels which are more accurate than the heuristic audio-aligned labels. From this, our model not only better understands the correlation between the text and the signs, but also holds potential for application in the translation of sign languages, particularly in scenarios where manual labelling of large-scale sign data is impractical or challenging. Extensive experimental results demonstrate that our approach achieves state-of-the-art results, surpassing previous baselines by substantial margins in terms of both frame-level accuracy and F1-score. This highlights the effectiveness and practicality of our framework in advancing the field of sign language video alignment and translation.
[CV-45] Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning ICASSP2025
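[Quick Read]: This paper addresses precise synchronization in Visual Forced Alignment (VFA), which aligns utterances with the corresponding lip movements without relying on audio cues. The proposed approach integrates a local context-aware feature extractor and uses multi-task learning to refine both global and local context features, increasing sensitivity to subtle lip movements for word-level and phoneme-level alignment. An improved Viterbi algorithm is applied as post-processing to further reduce misalignments. On the LRS2 dataset, the method improves accuracy by 6% at the word level and 27% at the phoneme level over existing methods.
Link: https://arxiv.org/abs/2503.03286
Authors: Yi He,Lei Yang,Shilin Wang
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP2025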
Abstract:This paper introduces a novel approach to Visual Forced Alignment (VFA), aiming to accurately synchronize utterances with corresponding lip movements, without relying on audio cues. We propose a novel VFA approach that integrates a local context-aware feature extractor and employs multi-task learning to refine both global and local context features, enhancing sensitivity to subtle lip movements for precise word-level and phoneme-level alignment. Incorporating the improved Viterbi algorithm for post-processing, our method significantly reduces misalignments. Experimental results show our approach outperforms existing methods, achieving a 6% accuracy improvement at the word-level and 27% improvement at the phoneme-level in LRS2 dataset. These improvements offer new potential for applications in automatically subtitling TV shows or user-generated content platforms like TikTok and YouTube Shorts.
[CV-46] Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations AAAI-25
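[Quick Read]: This paper addresses Vietnamese Visual Question Answering (VQA), where the low-resource setting brings linguistic variability and a lack of high-quality datasets. Traditional methods rely on extensive annotated datasets, computationally expensive pipelines, and large pre-trained models, which limits their applicability to Vietnamese.
To address these limitations, the paper proposes a training framework that combines a paraphrase-based feature augmentation module with a dynamic curriculum learning strategy. Augmented (paraphrased) samples are treated as "easy" and raw samples as "hard", and a mechanism dynamically adjusts the ratio of easy to hard samples during training so that the same dataset becomes progressively harder. This gradual adaptation to task complexity helps the model generalize and improves overall performance. Experiments show consistent improvements on the OpenViVQA dataset and mixed outcomes on the ViVQA dataset.
Link: https://arxiv.org/abs/2503.03285
Authors: Khoi Anh Nguyen,Linh Yen Vu,Thang Dinh Duong,Thuan Nguyen Duong,Huy Thanh Nguyen,Vinh Quang Dinh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, AAAI-25 Workshop on Document Understanding and Intelligence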
Abstract:Visual Question Answering (VQA) is a multimodal task requiring reasoning across textual and visual inputs, which becomes particularly challenging in low-resource languages like Vietnamese due to linguistic variability and the lack of high-quality datasets. Traditional methods often rely heavily on extensive annotated datasets, computationally expensive pipelines, and large pre-trained models, specifically in the domain of Vietnamese VQA, limiting their applicability in such scenarios. To address these limitations, we propose a training framework that combines a paraphrase-based feature augmentation module with a dynamic curriculum learning strategy. Explicitly, augmented samples are considered “easy” while raw samples are regarded as “hard”. The framework then utilizes a mechanism that dynamically adjusts the ratio of easy to hard samples during training, progressively modifying the same dataset to increase its difficulty level. By enabling gradual adaptation to task complexity, this approach helps the Vietnamese VQA model generalize well, thus improving overall performance. Experimental results show consistent improvements on the OpenViVQA dataset and mixed outcomes on the ViVQA dataset, highlighting both the potential and challenges of our approach in advancing VQA for Vietnamese language.
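A dynamic easy-to-hard ratio can be sketched as a schedule plus a sampler. The abstract does not state the actual schedule, so the linear decay, the 0.8 and 0.2 endpoints, and the names `easy_fraction` and `build_epoch` are all hypothetical; this is only one plausible way to realise the idea.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def easy_fraction(epoch: int, num_epochs: int, start: float = 0.8, end: float = 0.2) -> float:
    """Linearly move the share of easy (augmented) samples from `start` to `end`."""
    if num_epochs <= 1:
        return end
    t = min(max(epoch, 0), num_epochs - 1) / (num_epochs - 1)
    return start + t * (end - start)


def build_epoch(easy: Sequence[T], hard: Sequence[T], epoch: int, num_epochs: int,
                size: int, rng: random.Random) -> List[T]:
    """Draw `size` samples with replacement, mixing easy and hard per the schedule."""
    n_easy = round(size * easy_fraction(epoch, num_epochs))
    batch = [rng.choice(easy) for _ in range(n_easy)]
    batch += [rng.choice(hard) for _ in range(size - n_easy)]
    rng.shuffle(batch)
    return batch
```

Passing a seeded `random.Random` keeps the mix reproducible across runs, which helps when comparing schedules.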
[CV-47] Gaussian highpass guided image filtering
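[Quick Read]: This paper aims to clarify the structure-transfer mechanism of guided image filtering (GIF) and to improve its performance across image processing tasks. The original GIF and several of its improvements are derived from a two-parameter local affine model (LAM), in which the output is a local affine transformation of the guidance image and the input image does not appear in the formulation, leaving the structure-transfer mechanism unclear. The paper introduces a single-parameter Prior Model based on Gaussian (highpass/lowpass) Filtering (PM-GF), in which the output is the sum of a weighted portion of the Gaussian highpass-filtered guidance image and the Gaussian-smoothed input image. Because the highpass term is explicit, the guidance structure it carries is visibly transferred to the output. Building on PM-GF, the paper proposes several Gaussian highpass GIFs (GH-GIFs) by substituting PM-GF for the LAM in the original GIF and some of its improvements. Experiments show that the proposed filters outperform their counterparts in several image processing applications.
Link: https://arxiv.org/abs/2503.03284
Authors: Lei Zhao,Chuanjiang He
Affiliations: College of Mathematics and Statistics, Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: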
Abstract:Guided image filtering (GIF) is a popular smoothing technique, in which an additional image is used as a structure guidance for noise removal with edge preservation. The original GIF and some of its subsequent improvements are derived from a two-parameter local affine model (LAM), where the filtering output is a local affine transformation of the guidance image, but the input image is not taken into account in the LAM formulation. In this paper, we first introduce a single-parameter Prior Model based on Gaussian (highpass/lowpass) Filtering (PM-GF), in which the filtering output is the sum of a weighted portion of Gaussian highpass filtering of the guidance image and Gaussian smoothing of the input image. In the PM-GF, the guidance structure determined by Gaussian highpass filtering is obviously transferred to the filtering output, thereby better revealing the structure transfer mechanism of guided filtering. Then we propose several Gaussian highpass GIFs (GH-GIFs) based on the PM-GF by emulating the original GIF and some improvements, i.e., using PM-GF instead of LAM in these GIFs. Experimental results illustrate that the proposed GIFs outperform their counterparts in several image processing applications.
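The PM-GF form described in the abstract, output = weight × highpass(guidance) + lowpass(input), can be sketched on 1D signals. This is a minimal sketch under assumptions, not the paper's filter: the weight `a` is taken as a single global scalar, the highpass is computed as the guidance minus its Gaussian-smoothed version, borders repeat the edge sample, and all function names are hypothetical.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalised 1D Gaussian weights on [-radius, radius]."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(signal: List[float], sigma: float = 1.0, radius: int = 3) -> List[float]:
    """Gaussian lowpass with edge samples repeated at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    return [
        sum(kernel[j + radius] * signal[min(max(i + j, 0), n - 1)]
            for j in range(-radius, radius + 1))
        for i in range(n)
    ]


def pm_gf(input_signal: List[float], guidance: List[float], a: float,
          sigma: float = 1.0, radius: int = 3) -> List[float]:
    """Weighted Gaussian highpass of the guidance plus Gaussian lowpass of the input."""
    low_input = gaussian_smooth(input_signal, sigma, radius)
    low_guide = gaussian_smooth(guidance, sigma, radius)
    # Highpass of the guidance = guidance minus its smoothed version.
    return [a * (g - lg) + li for g, lg, li in zip(guidance, low_guide, low_input)]
```

With a flat guidance signal the highpass term vanishes and the result is just the smoothed input, while edges in the guidance reappear in the output scaled by `a`.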
[CV-48] BEVMOSNet: Multimodal Fusion for BEV Moving Object Segmentation
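[Quick Read]: This paper addresses accurate motion understanding of dynamic objects in bird's-eye view (BEV), which is critical for reliable obstacle avoidance and smooth path planning in autonomous vehicles. The task has received less attention than object detection and segmentation, and the few existing vision-based approaches deteriorate significantly in low-light, nighttime, and adverse weather conditions such as rain. LiDAR and radar sensors are almost unaffected in these scenarios, and radar additionally provides object velocity. The paper proposes BEVMOSNet, which the authors describe as, to their knowledge, the first end-to-end multimodal fusion of cameras, LiDAR, and radar for predicting moving objects in BEV. The key is a deformable cross-attention-guided sensor fusion strategy for cross-sensor knowledge sharing in BEV, with an analysis to identify the best fusion strategy. On the nuScenes dataset, BEVMOSNet improves the IoU score by 36.59% over the vision-based unimodal baseline BEV-MoSeg and by 2.35% over the multimodal SimpleBEV extended for motion segmentation.
Link: https://arxiv.org/abs/2503.03280
Authors: Hiep Truong Cong,Ajay Kumar Sigatapu,Arindam Das,Yashwanth Sharma,Venkatesh Satagopan,Ganesh Sistu,Ciaran Eising
Affiliations: DSW, Valeo Kronach, Germany; DSW, Valeo India; University of Limerick, Ireland; Valeo Vision Systems, Ireland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2025)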
Abstract:Accurate motion understanding of the dynamic objects within the scene in bird’s-eye-view (BEV) is critical to ensure a reliable obstacle avoidance system and smooth path planning for autonomous vehicles. However, this task has received relatively limited exploration when compared to object detection and segmentation, with only a few recent vision-based approaches presenting preliminary findings that significantly deteriorate in low-light, nighttime, and adverse weather conditions such as rain. Conversely, LiDAR and radar sensors remain almost unaffected in these scenarios, and radar provides key velocity information of the objects. Therefore, we introduce BEVMOSNet, to our knowledge, the first end-to-end multimodal fusion leveraging cameras, LiDAR, and radar to precisely predict the moving objects in BEV. In addition, we perform a deeper analysis to find out the optimal strategy for deformable cross-attention-guided sensor fusion for cross-sensor knowledge sharing in BEV. While evaluating BEVMOSNet on the nuScenes dataset, we show an overall improvement in IoU score of 36.59% compared to the vision-based unimodal baseline BEV-MoSeg (Sigatapu et al., 2023), and 2.35% compared to the multimodal SimpleBEV (Harley et al., 2022), extended for the motion segmentation task, establishing this method as the state-of-the-art in BEV motion segmentation.
[CV-49] Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients CVPR2025
[Quick Read]: This paper addresses two problems in adversarial attacks on spiking neural networks (SNNs): surrogate gradients (SGs) may be invisible for an inference-only model, and existing attacks on the binary dynamic images captured by event-based sensors are not imperceptible enough. The two challenges are (1) how to give the surrogate gradient a strong correlation with the victim model and (2) how to optimize the sparsity of adversarial perturbations so that the attack is harder to perceive.
The key contributions are a potential-dependent surrogate gradient (PDSG) method, which ties the surrogate gradient to the model and so keeps attacks effective when the SGs are invisible, and a sparse dynamic attack (SDA), which uses a generation-reduction paradigm to improve the sparsity of the perturbations. Experiments show that PDSG and SDA outperform existing SNN-based attacks across various models and datasets, with high attack success rates.
Link: https://arxiv.org/abs/2503.03272
Authors: Li Lun,Kunyu Feng,Qinglong Ni,Ling Liang,Yuan Wang,Ying Li,Dunshan Yu,Xiaoxin Cui
Affiliations: School of Integrated Circuits, Peking University; School of Software and Microelectronics, Peking University; Institute of Microelectronics, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025
Abstract:Spiking neural networks (SNNs) have shown their competence in handling spatial-temporal event-based data with low energy consumption. Similar to conventional artificial neural networks (ANNs), SNNs are also vulnerable to gradient-based adversarial attacks, wherein gradients are calculated by spatial-temporal back-propagation (STBP) and surrogate gradients (SGs). However, the SGs may be invisible for an inference-only model as they do not influence the inference results, and current gradient-based attacks are ineffective for binary dynamic images captured by the dynamic vision sensor (DVS). While some approaches addressed the issue of invisible SGs through universal SGs, their SGs lack a correlation with the victim model, resulting in sub-optimal performance. Moreover, the imperceptibility of existing SNN-based binary attacks is still insufficient. In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. Additionally, we propose the sparse dynamic attack (SDA) to effectively attack binary dynamic images. Utilizing a generation-reduction paradigm, SDA can fully optimize the sparsity of adversarial perturbations. Experimental results demonstrate that our PDSG and SDA outperform state-of-the-art SNN-based attacks across various models and datasets. Specifically, our PDSG achieves 100% attack success rate on ImageNet, and our SDA obtains 82% attack success rate by modifying only 0.24% of the pixels on CIFAR10DVS. The code is available at this https URL .
[CV-50] Reduced Spatial Dependency for More General Video-level Deepfake Detection ICASSP2025
[Quick Read]: This paper addresses a weakness of CNN-based Deepfake detection: CNNs introduce spatial bias, which hinders the extraction of intrinsic temporal features. It proposes Spatial Dependency Reduction (SDR). SDR uses multiple Spatial Perturbation Branches (SPBs) to construct spatially-perturbed feature clusters, then applies a Task-Relevant Feature Integration (TRFI) module, designed using mutual information theory, to capture the temporal features that lie in a similar latent space across these clusters. The integrated feature is finally fed into a temporal transformer to capture long-range dependencies.
Link: https://arxiv.org/abs/2503.03270
Authors: Beilin Chu,Xuan Xu,Yufei Zhang,Weike You,Linna Zhou
Affiliations: School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 5 pages, 2 figures. Accepted to ICASSP 2025
Abstract:As one of the most prominent forms of AI-generated content, Deepfake has raised significant safety concerns. Although it has been demonstrated that temporal consistency cues offer better generalization capability, existing methods based on CNNs inevitably introduce spatial bias, which hinders the extraction of intrinsic temporal features. To address this issue, we propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially-perturbed clusters to reduce the dependency of the model on spatial information. Specifically, we design multiple Spatial Perturbation Branches (SPB) to construct spatially-perturbed feature clusters. Subsequently, we utilize the theory of mutual information and propose a Task-Relevant Feature Integration (TRFI) module to capture temporal features residing in a similar latent space from these clusters. Finally, the integrated feature is fed into a temporal transformer to capture long-range dependencies. Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
[CV-51] Optimizing for the Shortest Path in Denoising Diffusion Model CVPR2025
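Quick read: This paper addresses the limitations of denoising diffusion models in efficiency and generation quality. It proposes a new denoising diffusion model based on shortest-path modeling, the Shortest Path Diffusion Model (ShortDF), whose key idea is to optimize residual propagation to improve both denoising efficiency and sample quality. The core of the solution is to treat the denoising process as a shortest-path problem that minimizes reconstruction error. Optimizing the initial residuals makes the reverse diffusion process more efficient and improves the visual fidelity of generated samples. Experiments show that, compared with existing methods, ShortDF significantly reduces diffusion time (or steps) while improving the quality of generated samples.
Link: https://arxiv.org/abs/2503.03265
Authors: Ping Chen,Xingpeng Zhang,Zhaoxiang Liu,Huan Hu,Xiang Liu,Kai Wang,Min Wang,Yanlin Qian,Shiguo Lian
Affiliations: Data Science & Artificial Intelligence Research Institute, China Unicom; Unicom Data Intelligence, China Unicom; School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu, China; DJI Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025 (10 pages, 6 figures)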
Abstract:In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. Drawing on Denoising Diffusion Implicit Models (DDIM) and insights from graph theory, our model, termed the Shortest Path Diffusion Model (ShortDF), treats the denoising process as a shortest-path problem aimed at minimizing reconstruction error. By optimizing the initial residuals, we improve the efficiency of the reverse diffusion process and the quality of the generated samples. Extensive experiments on multiple standard benchmarks demonstrate that ShortDF significantly reduces diffusion time (or steps) while enhancing the visual fidelity of generated samples compared to prior methods. This work, we suppose, paves the way for interactive diffusion-based applications and establishes a foundation for rapid data generation. Code is available at this https URL.
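For readers unfamiliar with the DDIM sampler that this abstract builds on, here is a minimal sketch of one deterministic DDIM update (the eta = 0 case) together with a simple reconstruction-error measure. It is not the ShortDF shortest-path optimization, which the abstract does not specify. The function names `ddim_step`, `reconstruction_error`, and `forward_noise`, and the use of plain Python lists, are illustrative assumptions.

```python
import math

def forward_noise(x0, eps, alpha_bar_t):
    """Closed-form forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of floats.

    alpha_bar_t and alpha_bar_prev are the cumulative noise-schedule products
    at the current and the next (less noisy) timestep, with 0 < alpha_bar_t <= 1.
    """
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_1m_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Clean-sample estimate implied by the predicted noise.
    x0_pred = [(x - sqrt_1m_a_t * e) / sqrt_a_t for x, e in zip(x_t, eps_pred)]
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_1m_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    # Re-noise the estimate to the noise level of the previous timestep.
    x_prev = [sqrt_a_prev * x0 + sqrt_1m_a_prev * e
              for x0, e in zip(x0_pred, eps_pred)]
    return x_prev, x0_pred

def reconstruction_error(x0_pred, x0_true):
    """Euclidean distance between the estimate and a reference clean sample."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(x0_pred, x0_true)))
```

With a perfect noise estimate the clean sample is recovered in a single step. In practice the estimate is imperfect, which is why both the number of steps and the per-step residual matter.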
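[CV-52] Trajectory Prediction for Autonomous Driving: Progress, Limitations, and Future Directions
Quick read: This paper addresses safe navigation for autonomous vehicles deployed at scale in dynamic traffic, in particular how to accurately predict the trajectories of surrounding traffic agents to avoid collisions. Its key contribution is a review of a large body of recent trajectory prediction methods and a taxonomy that organizes existing solutions, together with an overview of the core elements of the prediction pipeline: input and output modalities, modeling features, and prediction paradigms. The paper also discusses active research areas, open research questions, and the remaining gaps and challenges in trajectory prediction. Its overall goal is to analyze the current state of the art comprehensively and to identify directions for future research.
Link: https://arxiv.org/abs/2503.03262
Authors: Nadya Abdel Madjid,Abdulrahman Ahmad,Murad Mebrahtu,Yousef Babaa,Abdelmoamen Nasser,Sumbal Malik,Bilal Hassan,Naoufel Werghi,Jorge Dias,Majid Khonji
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: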
Abstract:As the potential for autonomous vehicles to be integrated on a large scale into modern traffic systems continues to grow, ensuring safe navigation in dynamic environments is crucial for smooth integration. To guarantee safety and prevent collisions, autonomous vehicles must be capable of accurately predicting the trajectories of surrounding traffic agents. Over the past decade, significant efforts from both academia and industry have been dedicated to designing solutions for precise trajectory forecasting. These efforts have produced a diverse range of approaches, raising questions about the differences between these methods and whether trajectory prediction challenges have been fully addressed. This paper reviews a substantial portion of recent trajectory prediction methods and devises a taxonomy to classify existing solutions. A general overview of the prediction pipeline is also provided, covering input and output modalities, modeling features, and prediction paradigms discussed in the literature. In addition, the paper discusses active research areas within trajectory prediction, addresses the posed research questions, and highlights the remaining research gaps and challenges.
[CV-53] BANet: Bilateral Aggregation Network for Mobile Stereo Matching
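Quick read: This paper targets stereo matching on mobile devices, where cost-volume aggregation with 3D convolutions is too expensive, while aggregating the cost volume directly with 2D convolutions tends to blur edges, lose detail, and produce mismatches in textureless regions. To address this, the paper proposes a novel Bilateral Aggregation Network (BANet). Its key idea is to use a spatial attention map to separate the full cost volume into a detailed volume and a smooth volume, aggregate each one separately, and fuse the two to obtain the final disparity map. To identify high-frequency detailed regions and low-frequency smooth or textureless regions more accurately, the paper also proposes a new scale-aware spatial attention module. Experiments show that BANet-2D runs faster on mobile devices and achieves 35.3% higher accuracy than MobileStereoNet-2D on the KITTI 2015 leaderboard, while the extended BANet-3D achieves the highest accuracy among real-time methods on high-end GPUs.
Link: https://arxiv.org/abs/2503.03259
Authors: Gangwei Xu,Jiaxin Liu,Xianqi Wang,Junda Cheng,Yong Deng,Jinliang Zang,Yurui Chen,Xin Yang
Affiliations: Huazhong University of Science and Technology; Autel Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages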
Abstract:State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Additionally, to accurately identify high-frequency detailed regions and low-frequency smooth/textureless regions, we propose a new scale-aware spatial attention module. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. The extended 3D version, BANet-3D, achieves the highest accuracy among all real-time methods on high-end GPUs. Code: this https URL.
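The split-and-fuse step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the BANet implementation: the cost volume is a nested list indexed `[disparity][row][column]`, the smooth branch is weighted by `1 - attention` (an assumption; the abstract only says a spatial attention map is used), and the two aggregation branches are left out, so fusion reduces to element-wise addition. The names `split_cost_volume` and `fuse` are hypothetical.

```python
def split_cost_volume(cost, attention):
    """Split a cost volume [D][H][W] into detailed and smooth parts.

    attention is an [H][W] map with values in [0, 1]; high values mark
    high-frequency (detailed) pixels. Weighting the smooth branch by
    (1 - attention) is an assumption of this sketch.
    """
    detailed = [[[a * c for a, c in zip(a_row, c_row)]
                 for a_row, c_row in zip(attention, plane)]
                for plane in cost]
    smooth = [[[(1.0 - a) * c for a, c in zip(a_row, c_row)]
               for a_row, c_row in zip(attention, plane)]
              for plane in cost]
    return detailed, smooth

def fuse(detailed, smooth):
    """Fuse the two branches by element-wise addition (placeholder fusion)."""
    return [[[d + s for d, s in zip(d_row, s_row)]
             for d_row, s_row in zip(d_plane, s_plane)]
            for d_plane, s_plane in zip(detailed, smooth)]
```

In a real network each branch would pass through its own 2D aggregation layers before fusion. With those layers omitted, fusing the two parts simply reproduces the original volume, which is a convenient sanity check.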
[CV-54] BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation
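Quick read: This paper addresses optical flow estimation with event cameras under complex lighting and fast motion, and in particular the fact that advanced image-based optical flow frameworks are limited by the spatial sparsity of event data. It proposes BAT, a framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT has three key designs: 1) a bidirectional temporal correlation that turns temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense flow estimation; 2) an adaptive temporal sampling strategy that keeps the correlation temporally consistent; 3) a spatially adaptive temporal motion aggregation that efficiently and adaptively merges consistent target motion features into adjacent motion features while suppressing inconsistent ones. These designs improve the accuracy and detail quality of the estimated flow and allow accurate prediction of future optical flow, outperforming existing state-of-the-art methods and E-RAFT's warm-start approach.
Link: https://arxiv.org/abs/2503.03256
Authors: Gangwei Xu,Haotong Lin,Zhaoxing Zhang,Hongcheng Luo,Haiyang Sun,Xin Yang
Affiliations: Huazhong University of Science and Technology; Xiaomi EV; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages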
Abstract:Event cameras deliver visual information characterized by a high dynamic range and high temporal resolution, offering significant advantages in estimating optical flow for complex lighting conditions and fast-moving objects. Current advanced optical flow methods for event cameras largely adopt established image-based frameworks. However, the spatial sparsity of event data limits their performance. In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. BAT includes three novel designs: 1) a bidirectional temporal correlation that transforms bidirectional temporally dense motion cues into spatially dense ones, enabling accurate and spatially dense optical flow estimation; 2) an adaptive temporal sampling strategy for maintaining temporal consistency in correlation; 3) spatially adaptive temporal motion aggregation to efficiently and adaptively aggregate consistent target motion features into adjacent motion features while suppressing inconsistent ones. Our results rank 1st on the DSEC-Flow benchmark, outperforming existing state-of-the-art methods by a large margin while also exhibiting sharp edges and high-quality details. Notably, our BAT can accurately predict future optical flow using only past events, significantly outperforming E-RAFT’s warm-start approach. Code: this https URL.
[CV-55] Computational Analysis of Degradation Modeling in Blind Panoramic Image Quality Assessment
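Quick read: This paper addresses the "easy-database issue" in blind panoramic image quality assessment (BPIQA): existing datasets have limited content and few samples, which makes evaluation results fragile and hinders progress in BPIQA. The paper presents a computational analysis of degradation modeling in BPIQA built on three types of experiments, which examine the gap between BPIQA and blind image quality assessment (BIQA), the necessity of BPIQA-specific model designs, and the generalization ability of BPIQA models. The study finds that easy databases narrow the performance gap between BPIQA and BIQA models, prevent specific designs from being properly validated, and hurt generalization. It also finds that BPIQA models trained on the authors' recently proposed databases with complicated degradations generalize better, and it calls for further work on BPIQA from both the subjective and objective viewpoints.
Link: https://arxiv.org/abs/2503.03255
Authors: Jiebin Yan,Ziwen Tan,Jiale Rao,Lei Wu,Yifan Zuo,Yuming Fang
Affiliations: Jiangxi University of Finance and Economics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: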
Abstract:Blind panoramic image quality assessment (BPIQA) has recently brought new challenges to the visual quality community, due to the complex interaction between immersive content and human behavior. Although many efforts have been made to advance BPIQA through both psychophysical experiments and performance-driven objective algorithms, the limited content and few samples in those closed sets inevitably lead to shaky conclusions, thereby hindering the development of BPIQA; we refer to this as the easy-database issue. In this paper, we present a sufficient computational analysis of degradation modeling in BPIQA to thoroughly explore the easy-database issue, where we carefully design three types of experiments investigating the gap between BPIQA and blind image quality assessment (BIQA), the necessity of specific design in BPIQA models, and the generalization ability of BPIQA models. From extensive experiments, we find that easy databases narrow the gap between the performance of BPIQA and BIQA models, which is not conducive to the development of BPIQA. Easy databases also bring BPIQA models close to saturation, so the effectiveness of the associated specific designs cannot be well verified. Besides, the BPIQA models trained on our recently proposed databases with complicated degradation show better generalization ability. Thus, we believe that much more effort should be put into BPIQA from both the subjective and objective viewpoints.
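[CV-56] Two-Stream Thermal Imaging Fusion for Enhanced Time of Birth Detection in Neonatal Care
Quick read: This paper addresses inaccurate documentation of the Time of Birth (ToB) of newborns, which current clinical practice mostly records manually and therefore with errors. To detect ToB more precisely, the paper proposes a novel two-stream fusion system that combines image and video analysis to extract ToB from thermal recordings in the delivery room and operating theater. The key is to integrate a static and a dynamic stream so that richer birth-related spatiotemporal features are captured, which makes ToB estimation more robust and precise. Experiments show that the combined approach outperforms single-stream approaches, reaching 95.7% precision and 84.8% recall on short video clips. With a score aggregation module it identifies ToB in all test cases, with a median absolute error of 2 seconds and an absolute mean deviation of 4.5 seconds relative to manual annotations.
Link: https://arxiv.org/abs/2503.03244
Authors: Jorge García-Torres,Øyvind Meinich-Bache,Sara Brunner,Siren Rettedal,Vilde Kolstad,Kjersti Engan
Affiliations: Department of Electrical Engineering and Computer Science, University of Stavanger, Norway; Strategic Research, Laerdal Medical, Stavanger, Norway; Faculty of Health Sciences, University of Stavanger, Norway; Department for Simulation-based Learning, Stavanger University Hospital, Norway
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE 25th International Conference on Digital Signal Processing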
Abstract:Around 10% of newborns require some help to initiate breathing, and 5% need ventilation assistance. Accurate Time of Birth (ToB) documentation is essential for optimizing neonatal care, as timely interventions are vital for proper resuscitation. However, current clinical methods for recording ToB often rely on manual processes, which can be prone to inaccuracies. In this study, we present a novel two-stream fusion system that combines the power of image and video analysis to accurately detect the ToB from thermal recordings in the delivery room and operating theater. By integrating static and dynamic streams, our approach captures richer birth-related spatiotemporal features, leading to more robust and precise ToB estimation. We demonstrate that this synergy between data modalities enhances performance over single-stream approaches. Our system achieves 95.7% precision and 84.8% recall in detecting birth within short video clips. Additionally, with the help of a score aggregation module, it successfully identifies ToB in 100% of test cases, with a median absolute error of 2 seconds and an absolute mean deviation of 4.5 seconds compared to manual annotations.
[CV-57] GenColor: Generative Color-Concept Association in Visual Design
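Quick read: This paper addresses two weaknesses of existing color-concept association methods: they work poorly for uncommon concepts, and they are sensitive to unstable image referencing and varying image conditions. A formative study with designers shows that design work needs primary-accent color compositions and context-dependent colors (for example, a "clear" versus a "polluted" sky), which existing methods struggle to provide. In response, the paper proposes a generative approach for mining semantically resonant colors, whose key idea is to replace conventional image referencing with images produced by text-to-image models. The framework has three stages: concept instancing generates samples with diffusion models, text-guided image segmentation identifies the concept-relevant regions in each image, and color association extracts primary and accent colors. Quantitative comparisons with expert designs validate the approach, and case studies demonstrate its applicability across design scenarios.
Link: https://arxiv.org/abs/2503.03236
Authors: Yihan Hou,Xingchen Zeng,Yusong Wang,Manling Yang,Xiaojiao Chen,Wei Zeng
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Zhejiang University (Hangzhou); The Hong Kong University of Science and Technology (Hong Kong)
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 16 figures. Accepted at CHI Conference on Human Factors in Computing Systems (CHI’25), April 26-May 1, 2025, Yokohama, Japan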
Abstract:Existing approaches for color-concept association typically rely on query-based image referencing and color extraction from image references. However, these approaches are effective only for common concepts, and are vulnerable to unstable image referencing and varying image conditions. Our formative study with designers underscores the need for primary-accent color compositions and context-dependent colors (e.g., ‘clear’ vs. ‘polluted’ sky) in design. In response, we introduce a generative approach for mining semantically resonant colors leveraging images generated by text-to-image models. Our insight is that contemporary text-to-image models can resemble visual patterns from large-scale real-world data. The framework comprises three stages: concept instancing produces generative samples using diffusion models, text-guided image segmentation identifies concept-relevant regions within the image, and color association extracts primary colors accompanied by accent colors. Quantitative comparisons with expert designs validate our approach’s effectiveness, and we demonstrate the applicability through cases in various design scenarios and a gallery.
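The third stage above, extracting a primary color and accent colors from a concept-relevant region, can be illustrated with a small sketch. The abstract does not say how GenColor performs this extraction, so the technique here is a swapped-in stand-in: coarse RGB quantization followed by frequency ranking. The function name `primary_and_accent_colors`, the bucket size `step`, and the flat pixel and mask lists are all illustrative assumptions.

```python
from collections import Counter

def primary_and_accent_colors(pixels, mask, n_accents=2, step=32):
    """Return (primary, accents) for the masked pixels.

    pixels: list of (r, g, b) tuples with integer channels in 0-255.
    mask:   list of booleans, True where the pixel belongs to the
            concept-relevant region (e.g. from a segmentation model).
    Colors are quantized into cubes of side `step` and ranked by frequency;
    this ranking rule is an assumption, not the paper's method.
    """
    def quantize(value):
        # Map a channel value to the center of its bucket.
        return (value // step) * step + step // 2

    counts = Counter(
        (quantize(r), quantize(g), quantize(b))
        for (r, g, b), keep in zip(pixels, mask) if keep
    )
    if not counts:
        raise ValueError("mask selects no pixels")
    ranked = [color for color, _ in counts.most_common(1 + n_accents)]
    return ranked[0], ranked[1:]
```

Restricting the count to masked pixels is what keeps background colors, such as grass under a sky, out of the palette for the concept.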
[CV-58] Path-Adaptive Matting for Efficient Inference Under Various Computational Cost Constraints AAAI2025
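Quick read: This paper addresses efficient image matting inference under varying computational cost constraints, specifically limits on floating-point operations (FLOPs). Existing methods have not explored scalable architectures or path-learning strategies and therefore cannot handle this setting. The key solution is Path-Adaptive Matting (PAM), a framework that dynamically adjusts network paths according to image context and the computational cost constraint. PAM formulates the training of the cost-constrained matting network as a bilevel optimization problem and designs a unified architecture with path selection layers and learnable connect layers that estimates optimal paths and performs efficient inference. The paper also proposes a performance-aware path-learning strategy that generates path labels online by evaluating a few paths sampled from the prior distribution of optimal paths and from network estimations, which makes online path learning robust and efficient.
Link: https://arxiv.org/abs/2503.03228
Authors: Qinglin Liu,Zonglin Li,Xiaoqian Lv,Xin Sun,Ru Li,Shengping Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025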
Abstract:In this paper, we explore a novel image matting task aimed at achieving efficient inference under various computational cost constraints, specifically FLOP limitations, using a single matting network. Existing matting methods, which have not explored scalable architectures or path-learning strategies, fail to tackle this challenge. To overcome these limitations, we introduce Path-Adaptive Matting (PAM), a framework that dynamically adjusts network paths based on image contexts and computational cost constraints. We formulate the training of the computational cost-constrained matting network as a bilevel optimization problem, jointly optimizing the matting network and the path estimator. Building on this formalization, we design a path-adaptive matting architecture by incorporating path selection layers and learnable connect layers to estimate optimal paths and perform efficient inference within a unified network. Furthermore, we propose a performance-aware path-learning strategy to generate path labels online by evaluating a few paths sampled from the prior distribution of optimal paths and network estimations, enabling robust and efficient online path learning. Experiments on five image matting datasets demonstrate that the proposed PAM framework achieves competitive performance across a range of computational cost constraints.
[CV-59] Mocap-2-to-3: Lifting 2D Diffusion-Based Pretrained Models for 3D Motion Capture
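Quick read: This paper addresses the recovery of absolute poses in the world coordinate system from monocular views. There are two main challenges. First, existing methods train on 3D motion data collected in limited environments, and acquiring 3D labels for new actions is slow and impractical, which restricts generalization. Second, estimating a person's absolute position in metric space from a single viewpoint is inherently harder. To address both, the paper proposes Mocap-2-to-3, a framework whose key idea is to decompose complex 3D motion into 2D poses, use large-scale 2D data to strengthen 3D motion reconstruction, and accurately predict absolute positions in the world coordinate system. The framework first pretrains a single-view diffusion model on extensive 2D data, then fine-tunes a multi-view diffusion model on publicly available 3D data to enforce view consistency, which makes effective use of the large-scale 2D data. It also introduces a human motion representation that decouples local actions from global movement and encodes geometric priors of the ground, so that the generative model learns accurate motion priors from 2D data. At inference time this allows global movement to be recovered gradually, giving more plausible positioning. Experiments on real-world datasets show higher accuracy in motion and absolute human positioning than existing methods, along with better generalization and scalability.
Link: https://arxiv.org/abs/2503.03222
Authors: Zhumei Wang,Zechen Hu,Ruoxi Guo,Huaijin Pi,Ziyong Feng,Sida Peng,Xiaowei Zhou
Affiliations: Deep Glint; Zhejiang University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: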
Abstract:Recovering absolute poses in the world coordinate system from monocular views presents significant challenges. Two primary issues arise in this context. Firstly, existing methods rely on 3D motion data for training, which requires collection in limited environments. Acquiring such 3D labels for new actions in a timely manner is impractical, severely restricting the model’s generalization capabilities. In contrast, 2D poses are far more accessible and easier to obtain. Secondly, estimating a person’s absolute position in metric space from a single viewpoint is inherently more complex. To address these challenges, we introduce Mocap-2-to-3, a novel framework that decomposes intricate 3D motions into 2D poses, leveraging 2D data to enhance 3D motion reconstruction in diverse scenarios and accurately predict absolute positions in the world coordinate system. We initially pretrain a single-view diffusion model with extensive 2D data, followed by fine-tuning a multi-view diffusion model for view consistency using publicly available 3D data. This strategy facilitates the effective use of large-scale 2D data. Additionally, we propose an innovative human motion representation that decouples local actions from global movements and encodes geometric priors of the ground, ensuring the generative model learns accurate motion priors from 2D data. During inference, this allows for the gradual recovery of global movements, resulting in more plausible positioning. We evaluate our model’s performance on real-world datasets, demonstrating superior accuracy in motion and absolute human positioning compared to state-of-the-art methods, along with enhanced generalization and scalability. Our code will be made publicly available.
[CV-60] An Analytical Theory of Power Law Spectral Bias in the Learning Dynamics of Diffusion Models
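Quick read: This paper aims to understand how the learned distribution evolves during diffusion model training. Building on the Gaussian equivalence principle, the author derives exact solutions for the gradient-flow dynamics of the weights of one- or two-layer linear denoisers with arbitrary data. The key point is that these analytical results yield the generated distribution, and its KL divergence throughout training, in closed form. They reveal a pronounced power-law spectral bias: for both weights and distributions, the convergence time of a mode follows an inverse power law of its variance. Experiments show that this power-law spectral bias remains robust with deeper or convolutional architectures. The results underline the role of the data covariance in determining the order and rate at which diffusion models learn different modes of the data, and they offer a potential explanation for why stopping training early can lead to incorrect details in image generative models.
Link: https://arxiv.org/abs/2503.03206
Authors: Binxu Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 50 pages, 10 figures. Preprint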
Abstract:We developed an analytical framework for understanding how the learned distribution evolves during diffusion model training. Leveraging the Gaussian equivalence principle, we derived exact solutions for the gradient-flow dynamics of weights in one- or two-layer linear denoiser settings with arbitrary data. Remarkably, these solutions allowed us to derive the generated distribution in closed form and its KL divergence through training. These analytical results expose a pronounced power-law spectral bias, i.e., for weights and distributions, the convergence time of a mode follows an inverse power law of its variance. Empirical experiments on both Gaussian and image datasets demonstrate that the power-law spectral bias remains robust even when using deeper or convolutional architectures. Our results underscore the importance of the data covariance in dictating the order and rate at which diffusion models learn different modes of the data, providing potential explanations for why earlier stopping could lead to incorrect details in image generative models.
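To make the phrase "the convergence time of a mode follows an inverse power law of its variance" concrete, here is a toy calculation. It is not the paper's denoiser solution: it assumes plain gradient flow on a quadratic loss whose curvature along mode k equals the data variance λ_k, so the error along that mode decays as exp(-λ_k t) and the time to reach a tolerance is ln(1/tol)/λ_k. In this toy case the exponent is exactly -1; the exponents the paper derives for its denoisers are not reproduced here. The names `convergence_time` and `loglog_slope` are illustrative.

```python
import math

def convergence_time(variance, tol=1e-3):
    """Time at which the mode error exp(-variance * t) first drops to tol."""
    return math.log(1.0 / tol) / variance

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Modes ordered from high to low variance: high-variance modes converge first.
variances = [4.0, 1.0, 0.25, 0.0625]
times = [convergence_time(v) for v in variances]
slope = loglog_slope(variances, times)  # exponent of the power law in this toy case
```

The ordering of `times` shows why low-variance modes, which often carry fine detail, are the last to be learned and the first to suffer when training stops early.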
[CV-61] Find Matching Faces Based On Face Parameters
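Quick read: This paper addresses the problem of finding matching faces from user-selected face parameters. In the proposed solution, a Gradio-based user interface converts the selected parameters into a text prompt, and a Text-To-Image generation model uses the prompt to generate a realistic face image. The generated image and images downloaded from an external site are then passed through a face detection and feature extraction model, producing 512-dimensional vector embeddings. The embeddings of the downloaded images are stored in a vector database, and a similarity search between the embedding of the generated image and the stored embeddings returns the five faces that best match the selected parameters. The authors see significant potential for developing this into a high-quality personalized face matching tool.
Link: https://arxiv.org/abs/2503.03204
Authors: Setu A. Bhatt,Harshadkumar B. Prajapati,Vipul K. Dabhi,Ankush Tyagi
Affiliations: Department of Information Technology, Dharmsinh Desai University, Nadiad, India; Software Development Manager, Ericsson, Austin, Texas, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: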
Abstract:This paper presents an innovative approach that enables the user to find matching faces based on user-selected face parameters. Through a Gradio-based user interface, users can interactively select the face parameters they want in their desired partner. These user-selected face parameters are transformed into a text prompt, which is used by the Text-To-Image generation model to generate a realistic face image. Further, the generated image, along with the images downloaded from the this http URL, is processed through a face detection and feature extraction model, which results in a high-dimensional vector embedding of 512 dimensions. The vector embeddings generated from the downloaded images are stored in a vector database. A similarity search is then carried out between the vector embedding of the generated image and the stored vector embeddings. As a result, the system displays the top five similar faces based on the user-selected face parameters. This contribution holds significant potential to turn into a high-quality personalized face matching tool.
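The final retrieval step described in this abstract, a similarity search over stored 512-dimensional embeddings that returns the top five matches, can be sketched as a brute-force search. The abstract names neither the vector database nor the similarity metric, so cosine similarity and an in-memory list are assumptions, and the names `cosine_similarity`, `top_k_matches`, and `random_embedding` are hypothetical. The random vectors merely stand in for real face embeddings.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_matches(query, database, k=5):
    """Indices of the k stored embeddings most similar to the query."""
    scored = [(cosine_similarity(query, emb), idx)
              for idx, emb in enumerate(database)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

def random_embedding(rng, dim=512):
    """Stand-in for a face embedding; real ones come from a feature extractor."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

A dedicated vector database replaces the linear scan with an index, but the ranking it returns plays the same role as `top_k_matches` here.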
[CV-62] Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings
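Quick read: This paper addresses the difficulty of training vision-language models for image-text alignment in low-data settings, where standard contrastive learning aligns the modalities poorly because of overfitting and unstable training dynamics. It proposes a variance-aware loss scheduling approach whose key idea is to adjust the weight of the contrastive loss dynamically according to the statistical variability (uncertainty) of the model's alignment predictions. In experiments on a subset of Flickr8k, the approach improves image-text retrieval accuracy over a fixed-weight baseline and offers the best overall trade-off among the adaptive weighting strategies compared. t-SNE visualizations show more distinct multimodal embeddings, and in a noise-injection stress test the method maintains higher recall. These results indicate that adaptive loss weighting benefits multimodal alignment in low-data regimes.
Link: https://arxiv.org/abs/2503.03202
Authors: Sneh Pillai
Affiliations: University of Massachusetts Dartmouth
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures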
Abstract:Training vision-language models for image-text alignment typically requires large datasets to achieve robust performance. In low-data scenarios, standard contrastive learning can struggle to align modalities effectively due to overfitting and unstable training dynamics. In this paper, we propose a variance-aware loss scheduling approach that dynamically adjusts the weighting of the contrastive loss based on the statistical variability (uncertainty) in the model’s alignment predictions. Using a subset of the Flickr8k image-caption dataset to simulate limited data conditions, we demonstrate that our approach improves image-text retrieval accuracy compared to a fixed-weight baseline. We also compare against other adaptive weighting strategies (using output entropy and cosine similarity spread) and find that variance-aware scheduling provides the best overall trade-off. Qualitatively, our method yields more distinct multimodal embeddings as shown by t-SNE visualizations. Moreover, in a stress test with noise-injected captions and images, the variance-guided loss proves more robust, maintaining higher recall when random perturbations are introduced. These results highlight the benefit of adaptive loss weighting for multimodal alignment in low-data regimes.
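[CV-63] Transformer-Based Spatio-Temporal Association of Apple Fruitlets
Quick read: This paper addresses the spatio-temporal association of apple fruitlets across stereo images collected at different times, a setting in which high-resolution point clouds and temporally stable features are hard to obtain for small fruit in the field. The key solution is a transformer-based architecture that encodes the shape and position of each fruitlet, then propagates and refines these features through a series of transformer encoder layers with alternating self-attention and cross-attention. On data collected in a commercial apple orchard the method achieves an F1-score of 92.4% and outperforms all baselines and ablations.
Link: https://arxiv.org/abs/2503.03200
Authors: Harry Freeman,George Kantor
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: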
Abstract:In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo-images collected on different days and from different camera poses. State-of-the-art association methods in agriculture are dedicated towards matching larger crops using either high-resolution point clouds or temporally stable features, which are both difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, and propagates and refines these features through a series of transformer encoder layers with alternating self and cross-attention. We demonstrate that our method is able to achieve an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations.
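[CV-64] SpiritSight Agent: Advanced GUI Agent with One Look CVPR2025
Quick read: This paper addresses the low accuracy of existing vision-based graphical user interface (GUI) agents, which stems from their limitations in element grounding. These methods generally meet the requirements of compatibility and low latency, but their low accuracy limits practical use. To address this, the paper proposes SpiritSight, a vision-based, end-to-end GUI agent for navigation tasks across a variety of GUI platforms.
The solution has two key parts. First, the authors build GUI-Lasagne, a multi-level, large-scale, high-quality GUI dataset created with scalable methods, to strengthen SpiritSight's GUI understanding and grounding. Second, they introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, which further improves SpiritSight's ability to ground GUI objects. With these improvements SpiritSight outperforms other advanced methods on multiple GUI benchmarks, demonstrating its capability and compatibility.
Link: https://arxiv.org/abs/2503.03196
Authors: Zhiyuan Huang,Ziming Cheng,Junting Pan,Zhaohui Hou,Mingjie Zhan
Affiliations: SenseTime Research; Beijing University of Posts and Telecommunications; MMLab, CUHK
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: Paper accepted to CVPR 2025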
Abstract:Graphical User Interface (GUI) agents show amazing abilities in assisting human-computer interaction, automating human user’s navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility for different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, empowering SpiritSight with robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution of visual inputs, further enhancing SpiritSight’s ability to ground GUI objects. Through these efforts, SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models are available at this https URL.
[CV-65] DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
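Quick read: This paper addresses 3D Question Answering (3D QA), which requires a model to understand the 3D scene it is situated in as described by the text. Existing methods usually rely on global scene perception from pure 3D point clouds and overlook the rich local texture details in multi-view images. In addition, inherent noise in camera poses and complex occlusions cause significant feature degradation and reduced robustness when 3D point clouds are aligned with multi-view images. To address these issues, the paper proposes a Dual-vision Scene Perception Network (DSPNet) built on three modules. The Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. The Adaptive Dual-vision Perception (ADVP) module adaptively fuses back-projected multi-view image features with point cloud features to improve 3D scene comprehension. The Multimodal Context-guided Reasoning (MCGR) module integrates contextual information across the visual and linguistic modalities for more robust reasoning. Experiments show that DSPNet outperforms existing methods on the SQA3D and ScanQA datasets.
Link: https://arxiv.org/abs/2503.03190
Authors: Jingzhou Luo,Yang Liu,Weixing Chen,Zhen Li,Yaowei Wang,Guanbin Li,Liang Lin
Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: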
Abstract:3D Question Answering (3D QA) requires the model to comprehensively understand its situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation. However, existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images. Moreover, due to the inherent noise in camera poses and complex occlusions, there exists significant feature degradation and reduced feature robustness problems when aligning 3D point cloud with multi-view images. In this paper, we propose a Dual-vision Scene Perception Network (DSPNet), to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA. Our Text-guided Multi-view Fusion (TGMF) module prioritizes image views that closely match the semantic content of the text. To adaptively fuse back-projected multi-view images with point cloud features, we design the Adaptive Dual-vision Perception (ADVP) module, enhancing 3D scene comprehension. Additionally, our Multimodal Context-guided Reasoning (MCGR) module facilitates robust reasoning by integrating contextual information across visual and linguistic modalities. Experimental results on SQA3D and ScanQA datasets demonstrate the superiority of our DSPNet. Codes will be available at this https URL.
[CV-66] Partial Convolution Meets Visual Attention
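Quick read: This paper addresses two problems in efficient neural network design. First, depthwise convolution (DWConv) needs frequent memory access during inference, which lowers throughput. Second, partial convolution (PConv), introduced as an alternative to DWConv, improves efficiency but loses accuracy because its channels are underutilized. To address both, the paper proposes a Partial visual ATtention mechanism (PAT) that combines PConv with visual attention and reduces model parameters and floating-point operations (FLOPs). PAT yields three block types: PAT_ch, PAT_sp, and PAT_sf. PAT_ch uses an enhanced Gaussian channel attention to inject global distribution information into the untouched channels of PConv, spatial attention is applied to the MLP layer to further improve accuracy, and self-attention is used in the last stage to extend the global receptive field. Based on PAT, the paper designs a new hybrid network family, PATNet, which achieves better top-1 accuracy and inference speed than FasterNet on ImageNet-1K classification and performs well on detection and segmentation on the COCO dataset. The key is that PAT balances efficiency and accuracy by combining attention with partial convolution.
Link: https://arxiv.org/abs/2503.03148
Authors: Haiduo Huang,Fuwei Yang,Dong Li,Ji Liu,Lu Tian,Jinzhang Peng,Pengju Ren,Emad Barsoum
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.01303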
Abstract:Designing an efficient and effective neural network has remained a prominent topic in computer vision research. Depthwise convolution (DWConv) is widely used in efficient CNNs or ViTs, but it needs frequent memory access during inference, which leads to low throughput. FasterNet attempts to introduce partial convolution (PConv) as an alternative to DWConv but compromises the accuracy due to underutilized channels. To remedy this shortcoming and consider the redundancy between feature map channels, we introduce a novel Partial visual ATtention mechanism (PAT) that can efficiently combine PConv with visual attention. Our exploration indicates that the partial attention mechanism can completely replace the full attention mechanism and reduce model parameters and FLOPs. Our PAT can derive three types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp) and Partial Self-Attention block (PAT_sf). First, PAT_ch integrates the enhanced Gaussian channel attention mechanism to infuse global distribution information into the untouched channels of PConv. Second, we introduce the spatial-wise attention to the MLP layer to further improve model accuracy. Finally, we replace PAT_ch in the last stage with the self-attention mechanism to extend the global receptive field. Building upon PAT, we propose a novel hybrid network family, named PATNet, which achieves superior top-1 accuracy and inference speed compared to FasterNet on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. Particularly, our PATNet-T2 achieves 1.3% higher accuracy than FasterNet-T2, while exhibiting 25% higher GPU throughput and 24% lower CPU latency.
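To make the partial convolution idea concrete, here is a minimal pure-Python sketch in 1D. It is not the PATNet or FasterNet code: the function name `partial_conv1d`, the list-of-lists tensor layout, and the zero padding are assumptions made for illustration. It only shows the core behaviour the abstract refers to, namely that a regular convolution is applied to the first few channels while the remaining channels pass through untouched.

```python
def partial_conv1d(x, weight, n_conv):
    """Illustrative partial convolution (hypothetical helper, 1D for brevity).

    x:       list of C channels, each a list of L floats
    weight:  weight[o][i][k], a regular conv kernel over the first n_conv channels
    n_conv:  number of leading channels that are convolved
    Channels n_conv..C-1 are copied through unchanged.
    """
    k = len(weight[0][0])
    pad = k // 2
    length = len(x[0])
    out = []
    for o in range(n_conv):
        row = []
        for t in range(length):
            acc = 0.0
            for i in range(n_conv):
                for j in range(k):
                    s = t + j - pad
                    if 0 <= s < length:  # zero padding at the borders
                        acc += weight[o][i][j] * x[i][s]
            row.append(acc)
        out.append(row)
    # The untouched channels are an identity mapping.
    out.extend(list(ch) for ch in x[n_conv:])
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
y = partial_conv1d(x, [[[1.0, 1.0, 1.0]]], n_conv=1)
print(y[0])   # convolved channel
print(y[1:])  # untouched channels
```

Because the convolution cost scales with the square of the number of convolved channels, convolving a quarter of the channels costs roughly one sixteenth of a full convolution. The untouched channels are the ones that PAT_ch is described as enriching with channel attention.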
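[CV-67] Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks CVPR2025
[Quick Read]: This paper addresses the performance gap that remains when existing knowledge distillation (KD) methods transfer knowledge from Artificial Neural Networks (ANNs) to Spiking Neural Networks (SNNs). Traditional distillation techniques usually overlook the distinctive spatiotemporal properties of SNNs and therefore fail to exploit their advantages. To overcome this, the paper proposes a novel logit distillation method whose key ideas are temporal separation and entropy regularization. The method distills logits at different time steps instead of relying only on aggregated output features, which better captures the spatiotemporal properties of SNNs. The entropy regularization stabilizes optimization and further improves performance. Experiments show that the method outperforms prior SNN distillation strategies, whether based on logit distillation, feature distillation, or a combination of both.
Link: https://arxiv.org/abs/2503.03144
Authors: Kairong Yu,Chengting Yu,Tianqing Zhang,Xiaochen Zhao,Shu Yang,Hongwei Wang,Qiang Zhang,Qi Xu
Affiliations: Zhejiang University; Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025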
Abstract:Spiking Neural Networks (SNNs), inspired by the human brain, offer significant computational efficiency through discrete spike-based information transfer. Despite their potential to reduce inference energy consumption, a performance gap persists between SNNs and Artificial Neural Networks (ANNs), primarily due to current training methods and inherent model limitations. While recent research has aimed to enhance SNN learning by employing knowledge distillation (KD) from ANN teacher networks, traditional distillation techniques often overlook the distinctive spatiotemporal properties of SNNs, thus failing to fully leverage their advantages. To overcome this challenge, we propose a novel logit distillation method characterized by temporal separation and entropy regularization. This approach improves existing SNN distillation techniques by performing distillation learning on logits across different time steps, rather than merely on aggregated output features. Furthermore, the integration of entropy regularization stabilizes model optimization and further boosts the performance. Extensive experimental results indicate that our method surpasses prior SNN distillation strategies, whether based on logit distillation, feature distillation, or a combination of both. The code will be available on GitHub.
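The difference between distilling per time step and distilling an aggregated output can be sketched in a few lines. The following is an illustrative sketch only: the abstract does not give the loss formula, so the function `temporal_kd_loss`, the temperature `tau`, the weight `lam`, and in particular the form and sign of the entropy term are assumptions, not the authors' definition.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def temporal_kd_loss(teacher_logits, student_logits_per_step, tau=2.0, lam=0.1):
    """Average the KD loss over SNN time steps instead of over one aggregated output.

    The entropy term on the time-averaged student distribution is a
    placeholder regularizer (assumed form and sign).
    """
    p_teacher = softmax(teacher_logits, tau)
    step_probs = [softmax(z, tau) for z in student_logits_per_step]
    n = len(step_probs)
    kd = sum(kl(p_teacher, q) for q in step_probs) / n
    mean_q = [sum(q[c] for q in step_probs) / n for c in range(len(p_teacher))]
    return kd + lam * entropy(mean_q)


teacher = [2.0, 0.5, -1.0]
student_steps = [[0.0, 0.0, 0.0], [2.0, 0.5, -1.0], [1.0, 1.0, -0.5]]
print(temporal_kd_loss(teacher, student_steps))
```

Since KL divergence is convex in its second argument, the per-step average is never smaller than the KL computed on the time-averaged distribution, so a per-step loss also penalizes individual time steps that drift from the teacher even when their average looks fine.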
[CV-68] Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
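[Quick Read]: This paper addresses the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. The problem is challenging because these surfaces have arbitrary parameterizations and varying deformation speeds, so effective spatiotemporal registration is required. Traditional approaches first discretize the 4D surfaces in space and time and then compute spatiotemporal registrations, geodesics, and statistics; this can yield suboptimal solutions and, as the paper shows, is unnecessary. The key idea is to treat 4D surfaces as functions that are continuous in space and time, represented by Dynamic Spherical Neural Surfaces (D-SNS), an efficient, smooth, and continuous spatiotemporal representation. By combining neural representations with classical Riemannian geometry and statistical shape analysis, the paper performs core 4D shape analysis tasks directly on these continuous representations, without upfront discretization or meshing, providing the building blocks for full functional shape analysis. Experiments show the framework is efficient on 4D human and face datasets.
Link: https://arxiv.org/abs/2503.03132
Authors: Awais Nizamani,Hamid Laga,Guanjin Wang,Farid Boussaid,Mohammed Bennamoun,Anuj Srivastava
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 23 figures, conference paper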
Abstract:We propose a novel framework for the statistical analysis of genus-zero 4D surfaces, i.e., 3D surfaces that deform and evolve over time. This problem is particularly challenging due to the arbitrary parameterizations of these surfaces and their varying deformation speeds, necessitating effective spatiotemporal registration. Traditionally, 4D surfaces are discretized, in space and time, before computing their spatiotemporal registrations, geodesics, and statistics. However, this approach may result in suboptimal solutions and, as we demonstrate in this paper, is not necessary. In contrast, we treat 4D surfaces as continuous functions in both space and time. We introduce Dynamic Spherical Neural Surfaces (D-SNS), an efficient smooth and continuous spatiotemporal representation for genus-0 4D surfaces. We then demonstrate how to perform core 4D shape analysis tasks such as spatiotemporal registration, geodesics computation, and mean 4D shape estimation, directly on these continuous representations without upfront discretization and meshing. By integrating neural representations with classical Riemannian geometry and statistical shape analysis techniques, we provide the building blocks for enabling full functional shape analysis. We demonstrate the efficiency of the framework on 4D human and face datasets. The source code and additional results are available at this https URL.
[CV-69] NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics
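[Quick Read]: This paper addresses three limitations of existing thermal infrared 3D reconstruction methods: they handle only static scenes, ignore the effect of environmental factors on thermal radiation, and cannot predict or analyze how the temperature distribution changes over time. The proposed method, NTR-Gaussian, treats temperature as a form of thermal radiation, incorporates elements such as convective heat transfer and radiative heat dissipation, and uses neural networks to predict thermodynamic parameters (emissivity, convective heat transfer coefficient, and heat capacity), enabling accurate temperature prediction at different times in a nighttime scene. The paper also builds a dynamic dataset dedicated to nighttime thermal imagery. Evaluations show that NTR-Gaussian significantly outperforms comparison methods in thermal reconstruction, with a predicted temperature error within 1 degree Celsius.
Link: https://arxiv.org/abs/2503.03115
Authors: Kun Yang,Yuxiang Liu,Zeyu Cui,Yu Liu,Maojun Zhang,Shen Yan,Qing Wang
Affiliations: Northwestern Polytechnical University; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Conference on Computer Vision and Pattern Recognition 2025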
Abstract:Thermal infrared imaging offers the advantage of all-weather capability, enabling non-intrusive measurement of an object’s surface temperature. Consequently, thermal infrared images are employed to reconstruct 3D models that accurately reflect the temperature distribution of a scene, aiding in applications such as building monitoring and energy management. However, existing approaches predominantly focus on static 3D reconstruction for a single time period, overlooking the impact of environmental factors on thermal radiation and failing to predict or analyze temperature variations over time. To address these challenges, we propose the NTR-Gaussian method, which treats temperature as a form of thermal radiation, incorporating elements like convective heat transfer and radiative heat dissipation. Our approach utilizes neural networks to predict thermodynamic parameters such as emissivity, convective heat transfer coefficient, and heat capacity. By integrating these predictions, we can accurately forecast thermal temperatures at various times throughout a nighttime scene. Furthermore, we introduce a dynamic dataset specifically for nighttime thermal imagery. Extensive experiments and evaluations demonstrate that NTR-Gaussian significantly outperforms comparison methods in thermal reconstruction, achieving a predicted temperature error within 1 degree Celsius.
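The three thermodynamic parameters named in the abstract (emissivity, convective heat transfer coefficient, heat capacity) are exactly those that appear in a textbook lumped energy balance for a cooling surface. The sketch below integrates that balance with explicit Euler steps. It is a generic physics illustration and not the NTR-Gaussian model: the function `simulate_night_cooling`, the per-unit-area formulation, and all numeric values are assumptions.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def simulate_night_cooling(t0_k, t_air_k, t_sky_k, emissivity, h_conv,
                           heat_capacity, dt_s, steps):
    """Explicit Euler integration of a lumped surface energy balance:

        C * dT/dt = -h * (T - T_air) - eps * sigma * (T^4 - T_sky^4)

    All quantities are per unit area; temperatures are in kelvin.
    """
    temps = [t0_k]
    t = t0_k
    for _ in range(steps):
        q_conv = h_conv * (t - t_air_k)                     # convective loss
        q_rad = emissivity * SIGMA * (t ** 4 - t_sky_k ** 4)  # radiative loss
        t = t - dt_s * (q_conv + q_rad) / heat_capacity
        temps.append(t)
    return temps


# Hypothetical wall surface cooling for one hour after sunset.
curve = simulate_night_cooling(
    t0_k=300.0, t_air_k=285.0, t_sky_k=270.0,
    emissivity=0.9, h_conv=10.0, heat_capacity=2.0e5,
    dt_s=60.0, steps=60,
)
print(round(curve[0], 2), round(curve[-1], 2))
```

In a learned setting, per-point values of the emissivity, convection coefficient, and heat capacity would come from a network rather than being fixed constants, which is the role the abstract assigns to the neural networks.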
[CV-70] An Improved Pure Fully Connected Neural Network for Rice Grain Classification
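[Quick Read]: This paper addresses misclassification in deep-learning-based rice classification, where classical models struggle to distinguish rice varieties with similar external characteristics. The key solution is a two-stage training mode together with an improved preprocessing method (replacing random tilting with horizontal or vertical position correction). With these two changes, classification accuracy rises from 97% to 99%.
Link: https://arxiv.org/abs/2503.03111
Authors: Wanke Xia,Ruoxin Peng,Haoqi Chu,Xinlei Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: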
Abstract:Rice is a staple food for a significant portion of the world’s population, providing essential nutrients and serving as a versatile ingredient in a wide range of culinary traditions. Recently, the use of deep learning has enabled automated classification of rice, improving accuracy and efficiency. However, classical models based on one-stage training may face difficulties in distinguishing between rice varieties with similar external characteristics, thus leading to misclassifications. Considering the transparency and feasibility of the model, we selected and gradually improved a pure fully connected neural network to classify rice grains. The dataset we used contains both global and domestic rice images obtained from websites and laboratories respectively. First, the training mode was changed from one-stage training to two-stage training, which significantly contributes to distinguishing two similar types of rice. Second, the preprocessing method was changed from random tilting to horizontal or vertical position correction. After those two enhancements, the accuracy of our model increased notably from 97% to 99%. In summary, the two subtle methods proposed in this study can remarkably enhance the ability of deep learning models to classify rice grains.
[CV-71] WarmFed: Federated Learning with Warm-Start for Globalization and Personalization Via Personalized Diffusion Models
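[Quick Read]: This paper addresses a long-standing dilemma in Federated Learning (FL) frameworks: whether to develop a single global model on the server (strengthening globalization) or personalized models on the clients (serving local needs). Instead of trading one off against the other, the paper proposes to improve both at once. The key is to start from the pre-trained initialization: the proposed WarmFed method builds a warm start from personalized diffusion models generated by efficient local fine-tuning (LoRA), applies a server-side fine-tuning strategy on top of it to derive a robust global model, and uses dynamic self-distillation (DSD) to further improve the personalized models. The gains for both global and personalized models are reported within one-shot and five-round communication.
Link: https://arxiv.org/abs/2503.03110
Authors: Tao Feng,Jie Zhang,Xiangjian Li,Rong Huang,Huashan Liu,Zhijie Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: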
Abstract:Federated Learning (FL) stands as a prominent distributed learning paradigm among multiple clients to achieve a unified global model without privacy leakage. In contrast to FL, personalized federated learning aims to provide each client with a personalized model. However, previous FL frameworks have grappled with a dilemma: the choice between developing a singular global model at the server to bolster globalization or nurturing personalized models at the client to accommodate personalization. Instead of making trade-offs, this paper commences its discourse from the pre-trained initialization, obtaining resilient global information and facilitating the development of both global and personalized models. Specifically, we propose a novel method called WarmFed to achieve this. WarmFed customizes Warm-start through personalized diffusion models, which are generated by local efficient fine-tuning (LoRA). Building upon the Warm-Start, we advance a server-side fine-tuning strategy to derive the global model, and propose a dynamic self-distillation (DSD) to procure more resilient personalized models simultaneously. Comprehensive experiments underscore the substantial gains of our approach across both global and personalized models, achieved within just one-shot and five communication(s).
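[CV-72] RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition
[Quick Read]: This paper addresses Handwritten Paragraph Text Recognition (HPTR), i.e., converting an image rich in handwritten text into a text encoding sequence. One of the most advanced models for this task is the Vertical Attention Network (VAN), whose Vertical Attention Module (VAM) implicitly segments a paragraph image into text lines and thereby makes recognition easier. From a network structure perspective, however, VAM is a single-branch module, which learns less effectively than multi-branch modules. The paper therefore proposes a new module, the Re-parameterizing Vertical Attention Fusion Module (RVAFM). Its key idea is structural re-parameterization: a multi-branch structure is used during training for more effective learning, and a single-branch structure is used at inference for faster processing. A dedicated fusion method, Re-parameterization Fusion (RF), merges the features learned by the multi-branch structure into the single-branch structure without any loss of information. The method achieves a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set, with inference slightly faster than VAN.
Link: https://arxiv.org/abs/2503.03104
Authors: Jinhui Zheng,Zhiquan Liu,Yain-Whar Si,Jianqing Li,Xinyuan Zhang,Xiaofan Li,Haozhi Huang,Xueyuan Gong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: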
Abstract:Handwritten Paragraph Text Recognition (HPTR) is a challenging task in Computer Vision, requiring the transformation of a paragraph text image, rich in handwritten text, into text encoding sequences. One of the most advanced models for this task is Vertical Attention Network (VAN), which utilizes a Vertical Attention Module (VAM) to implicitly segment paragraph text images into text lines, thereby reducing the difficulty of the recognition task. However, from a network structure perspective, VAM is a single-branch module, which is less effective in learning compared to multi-branch modules. In this paper, we propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM), which incorporates structural re-parameterization techniques. RVAFM decouples the structure of the module during training and inference stages. During training, it uses a multi-branch structure for more effective learning, and during inference, it uses a single-branch structure for faster processing. The features learned by the multi-branch structure are fused into the single-branch structure through a special fusion method named Re-parameterization Fusion (RF) without any loss of information. As a result, we achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set. Additionally, the inference speed is slightly faster than VAN.
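Structural re-parameterization works because parallel linear branches can be folded into one equivalent kernel. The sketch below shows the generic, RepVGG-style version of this idea in 1D with a 3-tap branch, a 1-tap branch, and an identity branch. It is not the paper's Re-parameterization Fusion (RF) procedure, whose details the abstract does not give; the helper names `conv1d`, `multi_branch`, and `fuse` are hypothetical.

```python
def conv1d(x, kernel):
    """'Same' 1D correlation with zero padding; kernel length must be odd."""
    pad = len(kernel) // 2
    n = len(x)
    return [
        sum(kernel[j] * x[t + j - pad]
            for j in range(len(kernel)) if 0 <= t + j - pad < n)
        for t in range(n)
    ]


def multi_branch(x, k3, k1):
    """Training-time structure: 3-tap branch + 1-tap branch + identity."""
    a = conv1d(x, k3)
    b = conv1d(x, k1)
    return [ai + bi + xi for ai, bi, xi in zip(a, b, x)]


def fuse(k3, k1):
    """Inference-time structure: fold all three branches into one 3-tap kernel.

    The 1-tap kernel and the identity (a 1-tap kernel of weight 1) both
    contribute only to the centre tap.
    """
    fused = list(k3)
    fused[len(k3) // 2] += k1[0] + 1.0
    return fused


x = [1.0, -2.0, 0.5, 3.0]
k3, k1 = [0.2, -0.5, 0.1], [0.7]
print(multi_branch(x, k3, k1))
print(conv1d(x, fuse(k3, k1)))  # same values from a single branch
```

The two printed lists agree up to floating-point rounding, which is the sense in which the fusion loses no information: the single-branch model computes the same function with fewer operations.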
[CV-73] AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
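[Quick Read]: This paper addresses the practical deployment of the Segment Anything Model (SAM), which is hindered by large storage requirements and high computational cost. In particular, existing Post-Training Quantization (PTQ) methods face two difficulties on SAM: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. The paper proposes AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. Its key components are hardware-compatible Hybrid Log-Uniform Quantization (HLUQ), which applies log2 quantization to dense small values and uniform quantization to sparse large values to increase quantization resolution, and Channel-Aware Grouping (CAG), which progressively clusters activation channels with similar distributions so that they share quantization parameters, mitigating inter-channel variation. The combination improves quantization quality while remaining compatible with efficient hardware execution.
Link: https://arxiv.org/abs/2503.03088
Authors: Wenlun Zhang,Shimpei Ando,Kentaro Yoshioka
Affiliations: Keio University
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: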
Abstract:The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks. However, its large storage requirements and high computational cost pose challenges for practical deployment. Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment, but we identify two key challenges in SAM that hinder the effectiveness of existing PTQ methods: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. To address these challenges, we propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. AHCPTQ introduces hardware-compatible Hybrid Log-Uniform Quantization (HLUQ) to manage post-GELU activations, employing log2 quantization for dense small values and uniform quantization for sparse large values to enhance quantization resolution. Additionally, AHCPTQ incorporates Channel-Aware Grouping (CAG) to mitigate inter-channel variation by progressively clustering activation channels with similar distributions, enabling them to share quantization parameters and improving hardware efficiency. The combination of HLUQ and CAG not only enhances quantization effectiveness but also ensures compatibility with efficient hardware execution. For instance, under the W4A4 configuration on the SAM-L model, AHCPTQ achieves 36.6% mAP on instance segmentation with the DINO detector, while achieving a 7.89x speedup and 8.64x energy efficiency over its floating-point counterpart in FPGA implementation.
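The idea of mixing a log2 grid for small values with a uniform grid for large values can be illustrated with a scalar quantize-dequantize function. This is a rough sketch under stated assumptions, not the HLUQ algorithm from the paper: the function `hybrid_quantize`, the fixed `threshold`, the level counts, and the decision to map non-positive inputs to zero are all simplifications chosen for illustration.

```python
import math


def hybrid_quantize(x, threshold, x_max, log_levels, uni_levels):
    """Quantize-dequantize one activation value (illustrative only).

    Below `threshold`: snap to threshold * 2**-k, k in [0, log_levels - 1],
    which gives fine resolution where small values are dense.
    At or above `threshold`: snap to `uni_levels` evenly spaced points on
    [threshold, x_max], which covers the sparse large values.
    Non-positive inputs map to 0.0 (a simplification).
    """
    if x <= 0:
        return 0.0
    if x < threshold:
        k = round(-math.log2(x / threshold))
        k = min(max(k, 0), log_levels - 1)
        return threshold * 2.0 ** (-k)
    step = (x_max - threshold) / (uni_levels - 1)
    idx = round((min(x, x_max) - threshold) / step)
    return threshold + idx * step


for value in [0.003, 0.26, 0.9, 4.9, 100.0]:
    print(value, hybrid_quantize(value, threshold=1.0, x_max=9.0,
                                 log_levels=8, uni_levels=5))
```

Power-of-two levels are attractive for hardware because multiplying by them reduces to a bit shift, which is one plausible reading of the "hardware-compatible" claim in the abstract.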
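[CV-74] BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving
[Quick Read]: This paper addresses how to combine Large Language Models (LLMs) with 3D spatial perception and decision planning in autonomous driving, with the goal of safe, reliable, and interpretable driving. Current approaches struggle to combine 3D spatial grounding with the reasoning and language capabilities of LLMs. The paper proposes BEVDriver, an LLM-based model for end-to-end closed-loop driving in the CARLA simulator. Its key idea is to use latent bird's-eye-view (BEV) features as perception input, with a BEV encoder that efficiently processes multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features pass through a Q-Former to align with natural language instructions and are then fed to the LLM, which predicts and plans precise future trajectories while taking navigation instructions and critical scenarios into account. On the LangAuto benchmark, the model improves the Driving Score by up to 18.9% over state-of-the-art (SoTA) methods.
Link: https://arxiv.org/abs/2503.03074
Authors: Katharina Winter,Mark Azer,Fabian B. Flohr
Affiliations: Munich University of Applied Sciences, Intelligent Vehicles Lab (IVL)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication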
Abstract:Autonomous driving has the potential to set the stage for more efficient future mobility, requiring the research domain to establish trust through safe, reliable and transparent driving. Large Language Models (LLMs) possess reasoning capabilities and natural language understanding, presenting the potential to serve as generalized decision-makers for ego-motion planning that can interact with humans and navigate environments designed for human drivers. While this research avenue is promising, current autonomous driving approaches are challenged by combining 3D spatial grounding and the reasoning and language capabilities of LLMs. We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA that utilizes latent BEV features as perception input. BEVDriver includes a BEV encoder to efficiently process multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features are propagated through a Q-Former to align with natural language instructions and passed to the LLM that predicts and plans precise future trajectories while considering navigation instructions and critical scenarios. On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.
[CV-75] Multi-View Depth Consistent Image Generation Using Generative AI Models: Application on Architectural Design of University Buildings
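[Quick Read]: This paper addresses the generation of detailed, multi-view consistent architectural images from shoebox models in the early stages of architectural design. Turning a shoebox model into a detailed design traditionally takes extensive manual work; generative AI offers a way to automate this, but multi-view consistency remains a major challenge. The paper proposes a three-stage consistent image generation framework. First, a diffusion model with a ControlNet backbone is optimized to handle multi-view inputs of shoebox models. Second, an image space loss module (style loss, structural loss, and angle alignment loss) enforces stylistic and structural consistency across views. Third, a depth-aware 3D attention module uses paired architectural images and depth maps to further improve multi-view consistency.
Link: https://arxiv.org/abs/2503.03068
Authors: Xusheng Du,Ruihan Gui,Zhengyang Wang,Ye Zhang,Haoran Xie
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, in Proceedings of CAADRIA2025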
Abstract:In the early stages of architectural design, shoebox models are typically used as a simplified representation of building structures but require extensive operations to transform them into detailed designs. Generative artificial intelligence (AI) provides a promising solution to automate this transformation, but ensuring multi-view consistency remains a significant challenge. To solve this issue, we propose a novel three-stage consistent image generation framework using generative AI models to generate architectural designs from shoebox model representations. The proposed method enhances state-of-the-art image generation diffusion models to generate multi-view consistent architectural images. We employ ControlNet as the backbone and optimize it to accommodate multi-view inputs of architectural shoebox models captured from predefined perspectives. To ensure stylistic and structural consistency across multi-view images, we propose an image space loss module that incorporates style loss, structural loss and angle alignment loss. We then use a depth estimation method to extract depth maps from the generated multi-view images. Finally, we use the paired data of the architectural images and depth maps as inputs to improve the multi-view consistency via the depth-aware 3D attention module. Experimental results demonstrate that the proposed framework can generate multi-view architectural images with consistent style and structural coherence from shoebox model inputs.
[CV-76] Learning from Noisy Labels with Contrastive Co-Transformer
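[Quick Read]: This paper addresses overfitting in deep learning when training data carry noisy labels, a challenge in weakly supervised learning. The key solution is a framework called Contrastive Co-Transformer, which trains transformers with a combination of contrastive loss and classification loss and relies on the robustness of transformers to label noise. The method uses all samples in the dataset, whether clean or noisy, builds on the Co-Training framework, and reports large gains over existing state-of-the-art methods.
Link: https://arxiv.org/abs/2503.03042
Authors: Yan Han,Soumava Kumar Roy,Mehrtash Harandi,Lars Petersson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: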
Abstract:Deep learning with noisy labels is an interesting challenge in weakly supervised learning. Despite their significant learning capacity, CNNs have a tendency to overfit in the presence of samples with noisy labels. Alleviating this issue, the well known Co-Training framework is used as a fundamental basis for our work. In this paper, we introduce a Contrastive Co-Transformer framework, which is simple and fast, yet able to improve the performance by a large margin compared to the state-of-the-art approaches. We argue the robustness of transformers when dealing with label noise. Our Contrastive Co-Transformer approach is able to utilize all samples in the dataset, irrespective of whether they are clean or noisy. Transformers are trained by a combination of contrastive loss and classification loss. Extensive experimental results on corrupted data from six standard benchmark datasets including Clothing1M, demonstrate that our Contrastive Co-Transformer is superior to existing state-of-the-art methods.
[CV-77] Revolutionizing Traffic Management with AI-Powered Machine Vision: A Step Toward Smart Cities
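[Quick Read]: This paper addresses the traffic management and safety challenges caused by rapid urbanization and growing vehicle congestion, and explores the potential of artificial intelligence (AI) and machine vision for traffic systems. It proposes a real-time system based on advanced surveillance cameras and deep learning algorithms that detects vehicles, traffic anomalies, and driver behaviors, and that integrates geospatial and weather data to adapt dynamically to environmental conditions. The key to the solution is the use of YOLOv8 and YOLOv11 models for high-accuracy vehicle detection and anomaly recognition, with the aim of optimizing traffic flow and improving road safety. The results support the development of intelligent traffic management solutions for smart cities with sustainable and efficient infrastructure.
Link: https://arxiv.org/abs/2503.02967
Authors: Seyed Hossein Hosseini DolatAbadi,Sayyed Mohammad Hossein Hashemi,Mohammad Hosseini,Moein-Aldin AliHosseini
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 1 figure, 2 tables, accepted to 1th AITC conference in University Of Isfahan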
Abstract:The rapid urbanization of cities and increasing vehicular congestion have posed significant challenges to traffic management and safety. This study explores the transformative potential of artificial intelligence (AI) and machine vision technologies in revolutionizing traffic systems. By leveraging advanced surveillance cameras and deep learning algorithms, this research proposes a system for real-time detection of vehicles, traffic anomalies, and driver behaviors. The system integrates geospatial and weather data to adapt dynamically to environmental conditions, ensuring robust performance in diverse scenarios. Using YOLOv8 and YOLOv11 models, the study achieves high accuracy in vehicle detection and anomaly recognition, optimizing traffic flow and enhancing road safety. These findings contribute to the development of intelligent traffic management solutions and align with the vision of creating smart cities with sustainable and efficient urban infrastructure.
[CV-78] Monocular Person Localization under Camera Ego-motion
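[Quick Read]: This paper addresses person localization from a moving monocular camera, which is critical for Human-Robot Interaction (HRI). Existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets with little camera ego-motion, so they become inaccurate under strong camera ego-motion. The key idea is to treat person localization as part of a pose estimation problem: the human is represented by a four-point model, and the 2D camera attitude and the 3D person location are estimated jointly through optimization. This improves localization accuracy, especially under significant camera ego-motion.
Link: https://arxiv.org/abs/2503.02916
Authors: Yu Zhan,Hanjing Ye,Hong Zhang
Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology (SUSTech), and the Department of Electronic and Electrical Engineering, SUSTech
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Under review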
Abstract:Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to fierce camera ego-motion, resulting in inaccurate person localization. We consider person localization as a part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person’s 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate our method outperforms baselines in person localization accuracy. Our method is further implemented into a person-following system and deployed on an agile quadruped robot.
[CV-79] LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset
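[Quick Read]: This paper addresses gas leak detection, where traditional human inspection is slow and labour-intensive and existing machine learning methods are held back by the lack of high-quality public datasets. The key contributions are a synthetic dataset with diverse backgrounds, interfering foreground objects, diverse leak locations, and precise segmentation annotations, and a zero-shot method that combines background subtraction, zero-shot object detection, filtering, and segmentation. Experiments show that the method significantly outperforms baselines based solely on background subtraction and zero-shot object detection with segmentation, reaching an overall IoU of 69%. The paper also analyzes various prompt configurations and threshold settings.
Link: https://arxiv.org/abs/2503.02910
Authors: Wenqi Guo,Yiyang Du,Shan Du
Affiliations: University of British Columbia; Group of Methane Emission Observation & Warning (MEOW), Weathon Software; University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: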
Abstract:Gas leakage poses a significant hazard that requires prevention. Traditionally, human inspection has been used for detection, a slow and labour-intensive process. Recent research has applied machine learning techniques to this problem, yet there remains a shortage of high-quality, publicly available datasets. This paper introduces a synthetic dataset featuring diverse backgrounds, interfering foreground objects, diverse leak locations, and precise segmentation ground truth. We propose a zero-shot method that combines background subtraction, zero-shot object detection, filtering, and segmentation to leverage this dataset. Experimental results indicate that our approach significantly outperforms baseline methods based solely on background subtraction and zero-shot object detection with segmentation, reaching an IoU of 69% overall. We also present an analysis of various prompt configurations and threshold settings to provide deeper insights into the performance of our method. The code and dataset will be released after publication.
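Two of the building blocks named in the abstract, background subtraction and the IoU metric, are simple enough to show directly. The sketch below is a toy illustration on tiny grayscale grids, not the LangGas pipeline: the helpers `background_subtract` and `mask_iou`, the fixed threshold, and the sample values are assumptions, and the zero-shot detection, filtering, and segmentation stages are omitted.

```python
def background_subtract(frame, background, thresh):
    """Mark pixels whose absolute difference from the background exceeds thresh."""
    return [
        [1 if abs(f - b) > thresh else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def mask_iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    return inter / union if union else 1.0


background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 50, 52], [10, 11, 60]]
truth = [[0, 1, 1], [0, 1, 0]]

pred = background_subtract(frame, background, thresh=5)
print(pred)
print(mask_iou(pred, truth))
```

A plain threshold like this also fires on any moving foreground object, which is why the abstract pairs background subtraction with object detection and filtering before segmentation.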
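[CV-80] ClipGrader: Leveraging Vision-Language Models for Robust Label Quality Assessment in Object Detection
[Quick Read]: This paper addresses the difficulty of guaranteeing accurate object detection annotations, in particular the accuracy of bounding boxes, which is both challenging and costly to verify. The proposed solution, ClipGrader, uses a vision-language model to assess bounding box annotations automatically. Its key idea is to adapt CLIP (Contrastive Language-Image Pre-training) to evaluate both class label correctness and the spatial precision of the bounding box, yielding a scalable tool for annotation quality control on large-scale object detection datasets. It also improves pseudo-label quality in semi-supervised object detection (SSOD).
Link: https://arxiv.org/abs/2503.02897
Authors: Hong Lu,Yali Bian,Rahul C. Shah
Affiliations: Intel Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: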
Abstract:High-quality annotations are essential for object detection models, but ensuring label accuracy - especially for bounding boxes - remains both challenging and costly. This paper introduces ClipGrader, a novel approach that leverages vision-language models to automatically assess the accuracy of bounding box annotations. By adapting CLIP (Contrastive Language-Image Pre-training) to evaluate both class label correctness and spatial precision of bounding box, ClipGrader offers an effective solution for grading object detection labels. Tested on modified object detection datasets with artificially disturbed bounding boxes, ClipGrader achieves 91% accuracy on COCO with a 1.8% false positive rate. Moreover, it maintains 87% accuracy with a 2.1% false positive rate when trained on just 10% of the COCO data. ClipGrader also scales effectively to larger datasets such as LVIS, achieving 79% accuracy across 1,203 classes. Our experiments demonstrate ClipGrader’s ability to identify errors in existing COCO annotations, highlighting its potential for dataset refinement. When integrated into a semi-supervised object detection (SSOD) model, ClipGrader readily improves the pseudo label quality, helping achieve higher mAP (mean Average Precision) throughout the training process. ClipGrader thus provides a scalable AI-assisted tool for enhancing annotation quality control and verifying annotations in large-scale object detection datasets.
[CV-81] Vision Transformers on the Edge: A Comprehensive Survey of Model Compression and Acceleration Strategies
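[Quick Read]: This survey addresses the high computational complexity and memory demands that make Vision Transformers (ViTs) hard to deploy on resource-constrained edge devices. It systematically categorizes and analyzes model compression techniques, software tools for edge inference, and hardware acceleration strategies, and evaluates their trade-offs in accuracy, efficiency, and hardware adaptability. The structured analysis highlights key open challenges and points to research directions for efficient ViT deployment on edge platforms, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs).
Link: https://arxiv.org/abs/2503.02891
Authors: Shaibal Saha,Lanyu Xu
Affiliations: Oakland University
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Comments: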
Abstract:In recent years, vision transformers (ViTs) have emerged as powerful and promising techniques for computer vision tasks such as image classification, object detection, and segmentation. Unlike convolutional neural networks (CNNs), which rely on hierarchical feature extraction, ViTs treat images as sequences of patches and leverage self-attention mechanisms. However, their high computational complexity and memory demands pose significant challenges for deployment on resource-constrained edge devices. To address these limitations, extensive research has focused on model compression techniques and hardware-aware acceleration strategies. Nonetheless, a comprehensive review that systematically categorizes these techniques and their trade-offs in accuracy, efficiency, and hardware adaptability for edge deployment remains lacking. This survey bridges this gap by providing a structured analysis of model compression techniques, software tools for inference on edge, and hardware acceleration strategies for ViTs. We discuss their impact on accuracy, efficiency, and hardware adaptability, highlighting key challenges and emerging research directions to advance ViT deployment on edge platforms, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). The goal is to inspire further research with a contemporary guide on optimizing ViTs for efficient deployment on edge devices.
[CV-82] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding
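[Quick Read]: This paper addresses the loss of inference efficiency in Vision-Language-Action (VLA) models that integrate action chunking: the action dimension grows linearly with the chunk size. The paper proposes PD-VLA, a parallel decoding framework whose key idea is to reformulate autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This preserves model performance while significantly increasing decoding speed, enables training-free acceleration, and combines with existing acceleration techniques. Experiments also validate its applicability across different tasks.
Link: https://arxiv.org/abs/2503.02310
Authors: Wenxuan Song,Jiayi Chen,Pengxiang Ding,Han Zhao,Wei Zhao,Zhide Zhong,Zongyuan Ge,Jun Ma,Haoang Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: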
Abstract:Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The performance of VLA models can be improved by integrating with action chunking, a critical technique for effective control. However, action chunking linearly scales up action dimensions in VLA models with increased chunking sizes. This reduces the inference efficiency. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that our PD-VLA maintains competitive success rates while achieving 2.52 times execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
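The fixed-point view of autoregressive decoding can be illustrated with a toy Jacobi-style iteration. This is a minimal sketch under stated assumptions, not PD-VLA itself: `next_token` is a hypothetical deterministic stand-in for a greedy decoder, and the names `sequential_decode` and `jacobi_decode` are illustrative only.

```python
from typing import Callable, List, Sequence, Tuple

Step = Callable[[Sequence[int]], int]


def next_token(prefix: Sequence[int]) -> int:
    """Hypothetical greedy decoder: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 97


def sequential_decode(prompt: Sequence[int], n: int, step: Step = next_token) -> List[int]:
    """Ordinary autoregressive decoding: one token per step."""
    out = list(prompt)
    for _ in range(n):
        out.append(step(out))
    return out[len(prompt):]


def jacobi_decode(prompt: Sequence[int], n: int, step: Step = next_token) -> Tuple[List[int], int]:
    """Solve y_i = step(prompt + y_<i) for all i by fixed-point iteration.

    Every sweep recomputes all n positions from the previous guess. With a
    real model, the n evaluations in one sweep would be a single batched
    forward pass (an assumption about deployment, not shown here).
    """
    guess = [0] * n  # arbitrary initial guess
    sweeps = 0
    for sweeps in range(1, n + 1):
        new = [step(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached early
            return new, sweeps
        guess = new
    # After n sweeps the first n positions are guaranteed to be correct.
    return guess, sweeps


tokens, sweeps = jacobi_decode([5, 11, 2], 8)
print(tokens == sequential_decode([5, 11, 2], 8), sweeps)
```

After sweep k the first k positions match the sequential output, so the iteration needs at most n sweeps and its fixed point is exactly the greedy sequence. Any speedup depends on several positions converging within the same sweep, which this toy decoder does not try to demonstrate.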
[CV-83] NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation
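[Quick read]: This paper addresses the challenges of point cloud interpolation: point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. The proposed NeuroGauss4D-PCI proceeds in five steps. An iterative Gaussian cloud soft clustering module first produces a structured spatiotemporal point cloud representation. A temporal radial basis function Gaussian residual then interpolates the Gaussian parameters over time and captures their spatiotemporal residuals. A 4D Gaussian deformation field tracks how these parameters evolve, building a continuous spatiotemporal deformation field. A 4D neural field maps the low-dimensional coordinates (x, y, z, t) into a high-dimensional latent space. Finally, latent features from the neural field are adaptively fused with geometric features from the Gaussian deformation field. The method performs strongly on point cloud frame interpolation for both object-level (DHB) and large-scale autonomous driving (NL-Drive) datasets, and can extend to auto-labeling and point cloud densification.
Link: https://arxiv.org/abs/2405.14241
Authors: Chaokang Jiang, Dalong Du, Jiuming Liu, Siting Zhu, Zhenqiang Liu, Zhuang Ma, Zhujin Liang, Jie Zhou
Affiliation: PhiGent Robotics; Shanghai Jiaotong University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review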
Abstract:Point Cloud Interpolation confronts challenges from point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI, which excels at modeling complex non-rigid deformations across varied dynamic scenes. The method begins with an iterative Gaussian cloud soft clustering module, offering structured temporal point cloud representations. The proposed temporal radial basis function Gaussian residual utilizes Gaussian parameter interpolation over time, enabling smooth parameter transitions and capturing temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian deformation field tracks the evolution of these parameters, creating continuous spatiotemporal deformation fields. A 4D neural field transforms low-dimensional spatiotemporal coordinates ( x,y,z,t ) into a high-dimensional latent space. Finally, we adaptively and efficiently fuse the latent features from neural fields and the geometric features from Gaussian deformation fields. NeuroGauss4D-PCI outperforms existing methods in point cloud frame interpolation, delivering leading performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive), with scalability to auto-labeling and point cloud densification tasks. The source code is released at this https URL.
[CV-84] Bridging Synthetic-to-Real Gaps: Frequency-Aware Perturbation and Selection for Single-shot Multi-Parametric Mapping Reconstruction
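[Quick read]: This paper addresses the synthetic-to-real domain gap that arises when data-driven AI methods for medical imaging turn to synthetic data to offset data scarcity. It also targets the quality degradation that limits multiple overlapping-echo detachment (MOLED) in real clinical scenarios: insufficient mitigation of the domain gap, difficulty maintaining structural integrity, and inadequate mapping accuracy. The key solution is frequency-aware perturbation and selection (FPS), which comprises Wasserstein distance-modulated frequency-aware perturbation (WDFP) and a hierarchical frequency-aware selection network (HFSNet) that integrates frequency-aware adaptive selection (FAS), compact FAS (cFAS), and feature-aware architecture integration (FAI). Perturbation activates domain-invariant feature learning within uncertainty, while selection refines optimal solutions within the perturbation, forming a robust closed-loop learning pathway. Extensive experiments on synthetic data and on diverse real clinical cases from healthy volunteers, ischemic stroke patients, and meningioma patients demonstrate the superiority and clinical applicability of FPS. The method is also applied to diffusion tensor imaging (DTI), showing its versatility and potential for broader medical use.
Link: https://arxiv.org/abs/2503.03475
Authors: Linyu Fan, Che Wang, Ming Ye, Qizhi Yang, Zejun Wu, Xinghao Ding, Yue Huang, Jianfeng Bao, Shuhui Cai, Congbo Cai
Affiliation: Department of Electronic Science, Fujian Provincial Key Laboratory of Plasma and Magnetic Resonance, Xiamen University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Department of Magnetic Resonance Imaging, The First Affiliated Hospital of Zhengzhou University, Zhengzhou University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work will be submitted to the IEEE for possible publication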
Abstract:Data-centric artificial intelligence (AI) has remarkably advanced medical imaging, with emerging methods using synthetic data to address data scarcity while introducing synthetic-to-real gaps. Unsupervised domain adaptation (UDA) shows promise in ground truth-scarce tasks, but its application in reconstruction remains underexplored. Although multiple overlapping-echo detachment (MOLED) achieves ultra-fast multi-parametric reconstruction, extending its application to various clinical scenarios, the quality suffers from deficiency in mitigating the domain gap, difficulty in maintaining structural integrity, and inadequacy in ensuring mapping accuracy. To resolve these issues, we proposed frequency-aware perturbation and selection (FPS), comprising Wasserstein distance-modulated frequency-aware perturbation (WDFP) and hierarchical frequency-aware selection network (HFSNet), which integrates frequency-aware adaptive selection (FAS), compact FAS (cFAS) and feature-aware architecture integration (FAI). Specifically, perturbation activates domain-invariant feature learning within uncertainty, while selection refines optimal solutions within perturbation, establishing a robust and closed-loop learning pathway. Extensive experiments on synthetic data, along with diverse real clinical cases from 5 healthy volunteers, 94 ischemic stroke patients, and 46 meningioma patients, demonstrate the superiority and clinical applicability of FPS. Furthermore, FPS is applied to diffusion tensor imaging (DTI), underscoring its versatility and potential for broader medical applications. The code is available at this https URL.
[CV-85] Augmentation-Based Deep Learning for Identification of Circulating Tumor Cells
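[Quick read]: This paper addresses the difficulty of identifying circulating tumor cells (CTCs) in liquid biopsy, which stems from their rarity and heterogeneity. Conventional fluorescence-based methods depend on sample labeling and generalize poorly across hospital datasets. Single-cell image analysis reveals detailed morphology and phenotype, but manual review of DEPArray digital images is time-consuming and variable. The paper proposes a deep learning classification pipeline built on a ResNet-based convolutional neural network (CNN) that combines bright-field imaging with data augmentation and DAPI fluorescence channel images during training to distinguish CTCs from leukocytes in blood samples. The key point is that testing uses bright-field images only, so the model can identify CTCs without fluorescence markers, with the aim of improving diagnostic accuracy and streamlining clinical workflows.
Link: https://arxiv.org/abs/2503.03410
Authors: Martina Russo, Giulia Bertolini, Vera Cappelletti, Cinzia De Marco, Serena Di Cosimo, Petra Paiè, Nadia Brancati
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 4 figures, 3 tables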
Abstract:Circulating tumor cells (CTCs) are crucial biomarkers in liquid biopsy, offering a noninvasive tool for cancer patient management. However, their identification remains particularly challenging due to their limited number and heterogeneity. Labeling samples for contrast limits the generalization of fluorescence-based methods across different hospital datasets. Analyzing single-cell images enables detailed assessment of cell morphology, subcellular structures, and phenotypic variations, often hidden in clustered images. Developing a method based on bright-field single-cell analysis could overcome these limitations. CTCs can be isolated using an unbiased workflow combining Parsortix technology, which selects cells based on size and deformability, with DEPArray technology, enabling precise visualization and selection of single cells. Traditionally, DEPArray-acquired digital images are manually analyzed, making the process time-consuming and prone to variability. In this study, we present a Deep Learning-based classification pipeline designed to distinguish CTCs from leukocytes in blood samples, aimed to enhance diagnostic accuracy and optimize clinical workflows. Our approach employs images from the bright-field channel acquired through DEPArray technology leveraging a ResNet-based CNN. To improve model generalization, we applied three types of data augmentation techniques and incorporated fluorescence (DAPI) channel images into the training phase, allowing the network to learn additional CTC-specific features. Notably, only bright-field images have been used for testing, ensuring the model’s ability to identify CTCs without relying on fluorescence markers. The proposed model achieved an F1-score of 0.798, demonstrating its capability to distinguish CTCs from leukocytes. These findings highlight the potential of DL in refining CTC analysis and advancing liquid biopsy applications.
[CV-86] ScaleFusionNet: Transformer-Guided Multi-Scale Feature Fusion for Skin Lesion Segmentation
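[Quick read]: This paper addresses accurate skin lesion segmentation, a challenging task that is essential for quantitative medical analysis. Its key contribution is ScaleFusionNet, a model that integrates a Cross-Attention Transformer Module (CATM) and an AdaptiveFusionBlock to strengthen feature extraction and fusion. CATM uses Swin Transformer Blocks and Cross Attention Fusion (CAF) to adaptively refine encoder-decoder feature fusion, narrowing semantic gaps and improving segmentation accuracy. The improved AdaptiveFusionBlock combines Swin Transformer-based attention with deformable convolution-based multi-scale feature extraction, further refining lesion boundaries and preserving fine detail. Together these reach Dice scores of 92.94% on ISIC-2016 and 91.65% on ISIC-2018.
Link: https://arxiv.org/abs/2503.03327
Authors: Saqib Qamar, Syed Furqan Qadri, Roobaea Alroobaea, Majed Alsafyani, Abdullah M. Baqasah
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: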
Abstract:Melanoma is a malignant tumor originating from skin cell lesions. Accurate and efficient segmentation of skin lesions is essential for quantitative medical analysis but remains challenging. To address this, we propose ScaleFusionNet, a segmentation model that integrates Cross-Attention Transformer Module (CATM) and AdaptiveFusionBlock to enhance feature extraction and fusion. The model employs a hybrid architecture encoder that effectively captures both local and global features. We introduce CATM, which utilizes Swin Transformer Blocks and Cross Attention Fusion (CAF) to adaptively refine encoder-decoder feature fusion, reducing semantic gaps and improving segmentation accuracy. Additionally, the AdaptiveFusionBlock is improved by integrating adaptive multi-scale fusion, where Swin Transformer-based attention complements deformable convolution-based multi-scale feature extraction. This enhancement refines lesion boundaries and preserves fine-grained details. ScaleFusionNet achieves Dice scores of 92.94% and 91.65% on ISIC-2016 and ISIC-2018 datasets, respectively, demonstrating its effectiveness in skin lesion analysis. Our code implementation is publicly available at GitHub.
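For readers unfamiliar with the Dice score quoted above, the following is a minimal sketch of how it is commonly computed for binary masks. The function name `dice_score`, the flat-list mask format, and the convention for two empty masks are assumptions for illustration and are not taken from the paper's code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# One of the two predicted lesion pixels overlaps the single true lesion pixel.
print(round(dice_score([1, 1, 0, 0], [1, 0, 0, 0]), 4))  # 0.6667
```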
[CV-87] Interactive Segmentation and Report Generation for CT Images
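[Quick read]: This paper addresses two shortcomings of existing automated CT report generation: limited interpretability, which hinders patient and clinician understanding, and a static design that prevents radiologists from adjusting assessments dynamically during image review. It proposes a novel interactive framework for 3D lesion morphology reporting that generates segmentation masks together with comprehensive attribute descriptions, so clinicians can build detailed lesion profiles to support diagnostic assessment. The key to the solution is that it is the first to integrate interactive segmentation with structured reporting for 3D CT medical images, yielding a more comprehensive and reliable reporting system for lesion segmentation and capturing.
Link: https://arxiv.org/abs/2503.03294
Authors: Yannian Gu, Wenhui Lei, Hanyu Chen, Xiaofan Zhang, Shaoting Zhang
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: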
Abstract:Automated CT report generation plays a crucial role in improving diagnostic accuracy and clinical workflow efficiency. However, existing methods lack interpretability and impede patient-clinician understanding, while their static nature restricts radiologists from dynamically adjusting assessments during image review. Inspired by interactive segmentation techniques, we propose a novel interactive framework for 3D lesion morphology reporting that seamlessly generates segmentation masks with comprehensive attribute descriptions, enabling clinicians to generate detailed lesion profiles for enhanced diagnostic assessment. To our best knowledge, we are the first to integrate the interactive segmentation and structured reports in 3D CT medical images. Experimental results across 15 lesion types demonstrate the effectiveness of our approach in providing a more comprehensive and reliable reporting system for lesion segmentation and capturing. The source code will be made publicly available following paper acceptance.
[CV-88] Rice Grain Size Measurement using Image Processing
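[Quick read]: This paper addresses the inefficiency and inconsistency of manual inspection when measuring rice grain size, one of the two factors (with chalkiness) that determine grain quality. The key is an image processing approach that takes an image of rice grains as input and outputs the number of grains and the size of each one, automating the measurement through region-of-interest extraction, grain segmentation, and sub-contour removal. In experiments the approach detected 95% of the grains and achieved 90% accuracy for length and width measurement.
Link: https://arxiv.org/abs/2503.03214
Authors: Ankush Tyagi, Dhruv Motwani, Vipul K. Dabhi, Harshadkumar B. Prajapati
Affiliation: Ericsson; Avahi; Department of Information Technology, Dharmsinh Desai University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: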
Abstract:The rice grain quality can be determined from its size and chalkiness. The traditional approach to measure the rice grain size involves manual inspection, which is inefficient and leads to inconsistent results. To address this issue, an image processing based approach is proposed and developed in this research. The approach takes image of rice grains as input and outputs the number of rice grains and size of each rice grain. The different steps, such as extraction of region of interest, segmentation of rice grains, and sub-contours removal, involved in the proposed approach are discussed. The approach was tested on rice grain images captured from different height using mobile phone camera. The obtained results show that the proposed approach successfully detected 95% of the rice grains and achieved 90% accuracy for length and width measurement.
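The counting and sizing step can be sketched on an already-binarized image. This is an illustrative stand-in, not the authors' pipeline: it assumes the segmentation has been done, uses 4-connected component labeling, substitutes a hypothetical `min_area` filter for the paper's sub-contour removal, and reports sizes in pixels from an axis-aligned bounding box.

```python
from collections import deque
from typing import Dict, List


def measure_grains(mask: List[List[int]], min_area: int = 3) -> List[Dict[str, int]]:
    """Count foreground blobs in a binary mask and measure each one in pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    grains = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over the 4-connected neighbourhood.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) < min_area:
                continue  # treat tiny blobs as noise (assumed threshold)
            height = max(y for y, _ in pixels) - min(y for y, _ in pixels) + 1
            width = max(x for _, x in pixels) - min(x for _, x in pixels) + 1
            grains.append({
                "area": len(pixels),
                "length": max(height, width),
                "width": min(height, width),
            })
    return grains


example_mask = [
    [1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0],
]
print(measure_grains(example_mask))
```

Converting pixel sizes to millimetres would need a calibration factor that depends on camera height, which is why the abstract notes that images were captured from different heights.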
[CV-89] Implicit U-KAN2.0: Dynamic, Efficient and Interpretable Medical Image Segmentation
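[Quick read]: This paper addresses limits on the performance, interpretability, and expressiveness of existing image segmentation methods, in particular the weaknesses of U-Net-based models in handling intrinsic noise and discrete layer structures. The key solution is Implicit U-KAN 2.0, a novel U-Net variant with a two-phase encoder-decoder structure. The first phase (SONO) introduces second-order neural ordinary differential equation blocks (SONO blocks) for more efficient, expressive, and theoretically grounded modeling. The second phase (SONO-MultiKAN) uses second-order NODEs together with MultiKAN layers as the core computational block, improving interpretability and representation power. The contributions are threefold: an implicit deep neural network combining MultiKAN and second-order NODEs that improves performance while reducing computational cost; a theoretical analysis showing that the approximation ability of the MultiKAN block is independent of the input dimension; and extensive experiments on several 2D datasets and one 3D dataset showing that the model consistently outperforms existing segmentation networks.
Link: https://arxiv.org/abs/2503.03141
Authors: Chun-Wun Cheng, Yining Zhao, Yanqi Cheng, Javier Montoya, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Affiliation: University of Cambridge
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: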
Abstract:Image segmentation is a fundamental task in both image analysis and medical applications. State-of-the-art methods predominantly rely on encoder-decoder architectures with a U-shaped design, commonly referred to as U-Net. Recent advancements integrating transformers and MLPs improve performance but still face key limitations, such as poor interpretability, difficulty handling intrinsic noise, and constrained expressiveness due to discrete layer structures, often lacking a solid theoretical foundation. In this work, we introduce Implicit U-KAN 2.0, a novel U-Net variant that adopts a two-phase encoder-decoder structure. In the SONO phase, we use a second-order neural ordinary differential equation (NODEs), called the SONO block, for a more efficient, expressive, and theoretically grounded modeling approach. In the SONO-MultiKAN phase, we integrate the second-order NODEs and MultiKAN layer as the core computational block to enhance interpretability and representation power. Our contributions are threefold. First, U-KAN 2.0 is an implicit deep neural network incorporating MultiKAN and second order NODEs, improving interpretability and performance while reducing computational costs. Second, we provide a theoretical analysis demonstrating that the approximation ability of the MultiKAN block is independent of the input dimension. Third, we conduct extensive experiments on a variety of 2D and a single 3D dataset, demonstrating that our model consistently outperforms existing segmentation networks.
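A second-order ODE x'' = f(x, x') is usually integrated by rewriting it as the first-order system x' = v, v' = f(x, v). The sketch below shows that reduction on a scalar state. It is an assumption-laden illustration: in a second-order neural ODE `f` would be a learned network, whereas here it is a fixed function, and the classical RK4 solver and the name `integrate_second_order` are choices made for this example, not details from the paper.

```python
import math
from typing import Callable, Tuple


def integrate_second_order(
    f: Callable[[float, float], float],
    x0: float,
    v0: float,
    t_end: float,
    steps: int = 1000,
) -> Tuple[float, float]:
    """Integrate x'' = f(x, v) from t = 0 to t_end as x' = v, v' = f(x, v) with RK4."""
    h = t_end / steps
    x, v = x0, v0
    for _ in range(steps):
        k1x, k1v = v, f(x, v)
        k2x, k2v = v + 0.5 * h * k1v, f(x + 0.5 * h * k1x, v + 0.5 * h * k1v)
        k3x, k3v = v + 0.5 * h * k2v, f(x + 0.5 * h * k2x, v + 0.5 * h * k2v)
        k4x, k4v = v + h * k3v, f(x + h * k3x, v + h * k3v)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return x, v


# Harmonic oscillator x'' = -x with x(0) = 1, v(0) = 0, so x(t) = cos(t).
x_pi, v_pi = integrate_second_order(lambda x, v: -x, 1.0, 0.0, math.pi)
print(x_pi, v_pi)
```

The velocity variable `v` is what distinguishes this from a first-order neural ODE: the state carries momentum-like information between layers rather than position alone.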
[CV-90] Interpretable Few-Shot Retinal Disease Diagnosis with Concept-Guided Prompting of Vision-Language Models
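[Quick read]: This paper addresses the fact that existing deep learning methods for retinal disease classification rely mainly on image data, lack interpretability, and treat medical professionals only as annotators. The key solution has two parts: extracting interpretable concepts of retinal diseases from the knowledge base of GPT models, and incorporating those concepts as a language component in prompt-learning to train vision-language (VL) models on fundus images together with their associated concepts. The method improves retinal disease classification, strengthens few-shot and zero-shot detection, and provides concept-based model interpretability. Evaluation on two different fundus image datasets shows that concept integration raises mean average precision by about 5.8% for 16-shot learning and 2.7% for zero-shot (novel class) detection. The approach is a step toward interpretable and efficient retinal disease recognition in real clinical use.
Link: https://arxiv.org/abs/2503.02917
Authors: Deval Mehta, Yiwen Jiang, Catherine L Jan, Mingguang He, Kshitij Jadhav, Zongyuan Ge
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Information Processing in Medical Imaging (IPMI) 2025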
Abstract:Recent advancements in deep learning have shown significant potential for classifying retinal diseases using color fundus images. However, existing works predominantly rely exclusively on image data, lack interpretability in their diagnostic decisions, and treat medical professionals primarily as annotators for ground truth labeling. To fill this gap, we implement two key strategies: extracting interpretable concepts of retinal diseases using the knowledge base of GPT models and incorporating these concepts as a language component in prompt-learning to train vision-language (VL) models with both fundus images and their associated concepts. Our method not only improves retinal disease classification but also enriches few-shot and zero-shot detection (novel disease detection), while offering the added benefit of concept-based model interpretability. Our extensive evaluation across two diverse retinal fundus image datasets illustrates substantial performance gains in VL-model based few-shot methodologies through our concept integration approach, demonstrating an average improvement of approximately 5.8% and 2.7% mean average precision for 16-shot learning and zero-shot (novel class) detection respectively. Our method marks a pivotal step towards interpretable and efficient retinal disease recognition for real-world clinical applications.
[CV-91] Computer-aided shape features extraction and regression models for predicting the ascending aortic aneurysm growth rate
[Quick read]: This paper addresses the clinical challenge of predicting ascending aortic aneurysm growth. Its key is to evaluate and compare how well local and global shape features predict that growth. Three local shape features are computed (the ratio of maximum diameter to centerline length, the ratio of external to internal line length, and the tortuosity of the ascending tract), and the aneurysm growth rate is derived from longitudinal data. Statistical shape analysis through unsupervised principal component analysis (PCA) and supervised partial least squares (PLS) yields two types of global shape features, which feed regression models based on a Gaussian support vector machine and on PLS linear regression. The results suggest that global shape features may contribute substantially to predicting aneurysm growth, and that aneurysms close to the aortic root with a large initial diameter grow faster.
Link: https://arxiv.org/abs/2503.02915
Authors: Leonardo Geronzi, Antonio Martinez, Michel Rochette, Kexin Yan, Aline Bel-Brunon, Pascal Haigron, Pierre Escrig, Jacques Tomasi, Morgan Daniel, Alain Lalande, Siyu Lin, Diana Marcela Marin-Castrillon, Olivier Bouchot, Jean Porterie, Pier Paolo Valentini, Marco Evangelos Biancolini
Affiliation: University of Rome Tor Vergata, Department of Enterprise Engineering “Mario Lucertini”; Ansys France; University of Lyon, INSA Lyon, CNRS, LaMCoS, UMR5259; University of Rennes, CHU Rennes, Inserm, LTSI – UMR 1099; ICMUB Laboratory, CNRS 6302, University of Burgundy; Medical Imaging Department, University Hospital of Dijon; Department of Cardio-Vascular and Thoracic Surgery, University Hospital of Dijon; Cardiac Surgery Department, Rangueil University Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA); Medical Physics (physics.med-ph)
Comments:
Abstract:Objective: ascending aortic aneurysm growth prediction is still challenging in clinics. In this study, we evaluate and compare the ability of local and global shape features to predict ascending aortic aneurysm growth. Material and methods: 70 patients with aneurysm, for which two 3D acquisitions were available, are included. Following segmentation, three local shape features are computed: (1) the ratio between maximum diameter and length of the ascending aorta centerline, (2) the ratio between the length of external and internal lines on the ascending aorta and (3) the tortuosity of the ascending tract. By exploiting longitudinal data, the aneurysm growth rate is derived. Using radial basis function mesh morphing, iso-topological surface meshes are created. Statistical shape analysis is performed through unsupervised principal component analysis (PCA) and supervised partial least squares (PLS). Two types of global shape features are identified: three PCA-derived and three PLS-based shape modes. Three regression models are set for growth prediction: two based on gaussian support vector machine using local and PCA-derived global shape features; the third is a PLS linear regression model based on the related global shape features. The prediction results are assessed and the aortic shapes most prone to growth are identified. Results: the prediction root mean square error from leave-one-out cross-validation is: 0.112 mm/month, 0.083 mm/month and 0.066 mm/month for local, PCA-based and PLS-derived shape features, respectively. Aneurysms close to the root with a large initial diameter report faster growth. Conclusion: global shape features might provide an important contribution for predicting the aneurysm growth. 
Journal reference: Computers in Biology and Medicine, Volume 162, August 2023, 107052. Related DOI: https://doi.org/10.1016/j.compbiomed.2023.107052
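Two of the local shape features named in the abstract can be sketched from a sampled centerline. The definitions here are assumptions for illustration: tortuosity is taken as centerline arc length divided by the straight-line distance between its endpoints (one common convention, which the paper may not follow exactly), and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def centerline_length(points: Sequence[Point]) -> float:
    """Arc length of a polyline through the sampled centerline points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def tortuosity(points: Sequence[Point]) -> float:
    """Assumed definition: arc length divided by endpoint-to-endpoint distance."""
    return centerline_length(points) / math.dist(points[0], points[-1])


def diameter_to_length_ratio(max_diameter: float, points: Sequence[Point]) -> float:
    """Ratio between maximum diameter and centerline length (same units)."""
    return max_diameter / centerline_length(points)


# Toy centerline: two 5-unit segments spanning a 6-unit chord.
toy = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 0.0, 0.0)]
print(centerline_length(toy), tortuosity(toy), diameter_to_length_ratio(5.0, toy))
```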
[CV-92] Hyperspectral Image Restoration and Super-resolution with Physics-Aware Deep Learning for Biomedical Applications
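[Quick read]: This paper addresses the system complexity that hyperspectral imaging faces in bioimaging, in particular the inherent trade-off between spatial resolution, spectral resolution, and imaging speed. It proposes a deep learning approach that restores and enhances pixel resolution after acquisition without any prior knowledge, while also increasing imaging speed. The key is fine-tuning with metrics aligned with the imaging model, which makes the method physics-aware and yields a 16X pixel super-resolution enhancement and a 12X imaging speedup without additional training data for transfer learning. Validated on data from multiple sample types, the method preserves biological integrity and can reveal metabolic changes associated with Down syndrome. It also offers physical insight into the model's inner workings, laying groundwork for future refinements that could surpass instrumental limits.
Link: https://arxiv.org/abs/2503.02908
Authors: Yuchen Xiang, Zhaolu Liu, Monica Emili Garcia-Segura, Daniel Simon, Boxuan Cao, Vincen Wu, Kenneth Robinson, Yu Wang, Ronan Battle, Robert T. Murray, Xavier Altafaj, Luca Peruzzotti-Jametti, Zoltan Takats
Affiliation: Department of Metabolism, Digestion and Reproduction, Imperial College London, London, UK; Department of Mathematics, Imperial College London, London, UK; Department of Physics, Imperial College London, London, UK; Department of Clinical Neurosciences and NIHR Biomedical Research Centre, University of Cambridge, Cambridge, UK; Department of Biomedicine, University of Barcelona, Barcelona, Spain
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: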
Abstract:Hyperspectral imaging is a powerful bioimaging tool which can uncover novel insights, thanks to its sensitivity to the intrinsic properties of materials. However, this enhanced contrast comes at the cost of system complexity, constrained by an inherent trade-off between spatial resolution, spectral resolution, and imaging speed. To overcome this limitation, we present a deep learning-based approach that restores and enhances pixel resolution post-acquisition without any a priori knowledge. Fine-tuned using metrics aligned with the imaging model, our physics-aware method achieves a 16X pixel super-resolution enhancement and a 12X imaging speedup without the need of additional training data for transfer learning. Applied to both synthetic and experimental data from five different sample types, we demonstrate that the model preserves biological integrity, ensuring no features are lost or hallucinated. We also concretely demonstrate the model’s ability to reveal disease-associated metabolic changes in Downs syndrome that would otherwise remain undetectable. Furthermore, we provide physical insights into the inner workings of the model, paving the way for future refinements that could potentially surpass instrumental limits in an explainable manner. All methods are available as open-source software on GitHub.
[CV-93] Diagnosis of Patients with Viral, Bacterial, and Non-Pneumonia Based on Chest X-Ray Images Using Convolutional Neural Networks
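[Quick read]: This paper addresses the high mortality caused by pneumonia by developing a decision support system that classifies patients as having no pneumonia, viral pneumonia, or bacterial pneumonia. The key to the solution is transfer learning (TL) with pre-trained convolutional neural network (CNN) models applied to chest X-ray (CXR) images, combined with Relief and Chi-square methods for dimensionality reduction and support vector machines (SVM) for classification. For separating patients without pneumonia from those with pneumonia, the system reports an accuracy of 91.02% and an F1 score of 97.88%; it also performs well when separating viral from bacterial pneumonia.
Link: https://arxiv.org/abs/2503.02906
Authors: Carlos Arizmendi, Jorge Pinto, Alejandro Arboleda, Hernando González
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO)
Comments: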
Abstract:According to the World Health Organization (WHO), pneumonia is a disease that causes a significant number of deaths each year. In response to this issue, the development of a decision support system for the classification of patients into those without pneumonia and those with viral or bacterial pneumonia is proposed. This is achieved by implementing transfer learning (TL) using pre-trained convolutional neural network (CNN) models on chest x-ray (CXR) images. The system is further enhanced by integrating Relief and Chi-square methods as dimensionality reduction techniques, along with support vector machines (SVM) for classification. The performance of a series of experiments was evaluated to build a model capable of distinguishing between patients without pneumonia and those with viral or bacterial pneumonia. The obtained results include an accuracy of 91.02%, precision of 97.73%, recall of 98.03%, and an F1 Score of 97.88% for discriminating between patients without pneumonia and those with pneumonia. In addition, accuracy of 93.66%, precision of 94.26%, recall of 92.66%, and an F1 Score of 93.45% were achieved for discriminating between patients with viral pneumonia and those with bacterial pneumonia.
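The reported F1 scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic using only the figures quoted in the abstract; the helper name `f1_score` is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# No pneumonia vs. pneumonia: precision 97.73%, recall 98.03%.
print(round(100 * f1_score(0.9773, 0.9803), 2))  # 97.88

# Viral vs. bacterial pneumonia: precision 94.26%, recall 92.66%.
print(round(100 * f1_score(0.9426, 0.9266), 2))  # 93.45
```

Both values match the F1 scores stated in the abstract, so the three metrics are internally consistent for each task.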
[CV-94] Machine Learning Applications to Diffuse Reflectance Spectroscopy in Optical Diagnosis; A Systematic Review
[Quick read]: This paper reviews the current state of diffuse reflectance spectroscopy (DRS) combined with machine learning for tissue differentiation, summarizing progress, identifying gaps in existing research, and proposing future directions. Its focus is how algorithmic processing of DRS signals, which are typically broadband and smooth, can deliver accurate tissue classification and diagnosis. It stresses that future work needs more rigorous sample stratification, in-vivo validation, and explainable algorithm development.
Link: https://arxiv.org/abs/2503.02905
Authors: Nicola Rossberg, Celina L. Li, Simone Innocente, Stefan Andersson-Engels, Katarzyna Komolibus, Barry O’Sullivan, Andrea Visentin
Affiliation: Research Ireland Center for Research Training in Artificial Intelligence, School of Computer Science & Information Technology, University College Cork; Biophotonics@Tyndall, IPIC, Tyndall National Institute, Ireland; School of Physics, University College Cork; Research Ireland Insight Centre for Data Analytics, School of Computer Science & IT, University College Cork
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 52 pages, Preprint, Systematic Review
Abstract:Diffuse Reflectance Spectroscopy has demonstrated a strong aptitude for identifying and differentiating biological tissues. However, the broadband and smooth nature of these signals require algorithmic processing, as they are often difficult for the human eye to distinguish. The implementation of machine learning models for this task has demonstrated high levels of diagnostic accuracies and led to a wide range of proposed methodologies for applications in various illnesses and conditions. In this systematic review, we summarise the state of the art of these applications, highlight current gaps in research and identify future directions. This review was conducted in accordance with the PRISMA guidelines. 77 studies were retrieved and in-depth analysis was conducted. It is concluded that diffuse reflectance spectroscopy and machine learning have strong potential for tissue differentiation in clinical applications, but more rigorous sample stratification in tandem with in-vivo validation and explainable algorithm development is required going forward.
[CV-95] Surgical Vision World Model
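[Quick read]: This paper addresses the lack of realistic simulation in the surgical domain and the difficulty existing world models have in using unlabeled surgical data. Its key is a surgical vision world model, validated on the unlabeled SurgToolLoc-2022 dataset, that can generate action-controllable surgical data. The core idea draws on Genie's success in inferring latent actions from unlabeled video game data to enable action-controlled data generation, which removes the dependence on action-labeled data and opens new possibilities for surgical simulation and for training autonomous surgical agents.
Link: https://arxiv.org/abs/2503.02904
Authors: Saurabh Koju, Saurav Bastola, Prashant Shrestha, Sanskar Amgain, Yash Raj Shrestha, Rudra P. K. Poudel, Binod Bhattarai
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: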
Abstract:Realistic and interactive surgical simulation has the potential to facilitate crucial applications, such as medical professional training and autonomous surgical agent training. In the natural visual domain, world models have enabled action-controlled data generation, demonstrating the potential to train autonomous agents in interactive simulated environments when large-scale real data acquisition is infeasible. However, such works in the surgical domain have been limited to simplified computer simulations, and lack realism. Furthermore, existing literature in world models has predominantly dealt with action-labeled data, limiting their applicability to real-world surgical data, where obtaining action annotation is prohibitively expensive. Inspired by the recent success of Genie in leveraging unlabeled video game data to infer latent actions and enable action-controlled data generation, we propose the first surgical vision world model. The proposed model can generate action-controllable surgical data and the architecture design is verified with extensive experiments on the unlabeled SurgToolLoc-2022 dataset. Codes and implementation details are available at this https URL
[CV-96] OCL: Ordinal Contrastive Learning for Imputating Features with Progressive Labels MICCAI2024
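[Quick read]: This paper addresses the reduced sample size and analytical precision caused by missing multimodal imaging data when discriminating stages of Alzheimer’s Disease (AD). Its key is a holistic imaging feature imputation method that uses diverse imaging features while retaining all subjects. The method comprises two networks: (1) an encoder that extracts modality-independent embeddings and aligns sample embeddings according to AD progression through a novel ordinal contrastive loss; and (2) a decoder that reconstructs the original measures conditioned on a given imaging modality. In addition, modality-wise coherence of embeddings within each subject is maximized and combined with domain adversarial training to further align different imaging modalities. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study show that the method outperforms imputation baselines in statistical analysis and classification.
Link: https://arxiv.org/abs/2503.02899
Authors: Seunghun Baek, Jaeyoon Sim, Guorong Wu, Won Hwa Kim
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI 2024 (Provisional Accept)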
Abstract: Accurately discriminating progressive stages of Alzheimer’s Disease (AD) is crucial for early diagnosis and prevention. It often involves multiple imaging modalities to understand the complex pathology of AD, however, acquiring a complete set of images is challenging due to high cost and burden for subjects. In the end, missing data become inevitable which lead to limited sample-size and decrease in precision in downstream analyses. To tackle this challenge, we introduce a holistic imaging feature imputation method that enables to leverage diverse imaging features while retaining all subjects. The proposed method comprises two networks: 1) An encoder to extract modality-independent embeddings and 2) A decoder to reconstruct the original measures conditioned on their imaging modalities. The encoder includes a novel ordinal contrastive loss, which aligns samples in the embedding space according to the progression of AD. We also maximize modality-wise coherence of embeddings within each subject, in conjunction with domain adversarial training algorithms, to further enhance alignment between different imaging modalities. The proposed method promotes our holistic imaging feature imputation across various modalities in the shared embedding space. In the experiments, we show that our networks deliver favorable results for statistical analysis and classification against imputation baselines with Alzheimer’s Disease Neuroimaging Initiative (ADNI) study.
[CV-97] Modality-Agnostic Style Transfer for Holistic Feature Imputation
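[Quick Read]: This paper addresses the difficulty of characterizing the preclinical stage of Alzheimer’s Disease (AD) with a single imaging technique, particularly when early symptoms are subtle. Neuroimaging studies often combine several imaging modalities (e.g., MRI and PET), but not every modality can be acquired for every subject, so missing values are inevitable. To reduce the need for additional examinations, the paper proposes a framework that generates a subject's unobserved imaging measures from their existing ones. The key is domain adversarial training, which preserves modality-agnostic but AD-specific information, together with a Generative Adversarial Network (GAN) that adds an indistinguishable modality-specific style, thereby transferring style between modalities while preserving AD-specific content. The framework is evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, and an assessment of generated data quality indicates that the synthetic data are practically usable regardless of modality.
Link: https://arxiv.org/abs/2503.02898
Authors: Seunghun Baek,Jaeyoon Sim,Mustafa Dere,Minjeong Kim,Guorong Wu,Won Hwa Kim
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ISBI 2024 (oral)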
Abstract: Characterizing a preclinical stage of Alzheimer’s Disease (AD) via single imaging is difficult as its early symptoms are quite subtle. Therefore, many neuroimaging studies are curated with various imaging modalities, e.g., MRI and PET; however, it is often challenging to acquire all of them from all subjects, and missing data become inevitable. In this regard, we propose a framework that generates unobserved imaging measures for specific subjects using their existing measures, thereby reducing the need for additional examinations. Our framework transfers modality-specific style while preserving AD-specific content. This is done by domain adversarial training that preserves modality-agnostic but AD-specific information, while a generative adversarial network adds an indistinguishable modality-specific style. Our proposed framework is evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study and compared with other imputation methods in terms of generated data quality. A small average Cohen’s d of 0.19 between our generated measures and real ones suggests that the synthetic data are practically usable regardless of their modality type.
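For readers unfamiliar with the effect-size measure quoted in this abstract, the following is a minimal sketch of Cohen's d between a set of real measures and a set of generated ones. It assumes the common pooled-standard-deviation form; the abstract does not say which variant the authors used, and the function name `cohens_d` and the toy values are hypothetical, not data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(real, generated):
    """Cohen's d using a pooled standard deviation (one common convention)."""
    n1, n2 = len(real), len(generated)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n1 - 1) * variance(real) + (n2 - 1) * variance(generated)) / (n1 + n2 - 2)
    return (mean(generated) - mean(real)) / math.sqrt(pooled_var)


# Hypothetical toy values for illustration only.
real = [1.0, 2.0, 3.0, 4.0, 5.0]
generated = [1.5, 2.5, 3.5, 4.5, 5.5]
d = cohens_d(real, generated)
print(f"Cohen's d = {d:.3f}")
```

By the usual rule of thumb, values of d near 0.2 are read as a small difference between two groups, which is the sense in which the abstract interprets its 0.19.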
[CV-98] Segmenting Bi-Atrial Structures Using ResNext Based Framework
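[Quick Read]: This paper addresses the limited effectiveness of pulmonary vein isolation in patients with persistent atrial fibrillation (AF), where fibrotic regions identified with late gadolinium-enhanced MRI (LGE-MRI) enable more precise atrial targeting. Existing manual segmentation is time-consuming and prone to variability, and although automated segmentation with deep learning, particularly convolutional neural networks (CNNs), shows promise, most studies cover only the left atrium (LA) and rely on small datasets, which limits generalizability. To address these challenges, the paper proposes a novel two-stage framework that combines ResNeXt encoders with a cyclic learning rate to segment the walls and cavities of both the right atrium (RA) and the LA. The framework aims to improve segmentation of challenging small structures such as the atrial walls while maintaining high performance on larger regions such as the atrial cavities, and it reports superior accuracy and robustness, particularly for imbalanced class structures.
Link: https://arxiv.org/abs/2503.02892
Authors: Malitha Gunawardhana,Fangqiang Xu,Jichao Zhao
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: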
Abstract:Atrial fibrillation (AF) is the most common cardiac arrhythmia, significantly contributing to mortality, particularly in older populations. While pulmonary vein isolation is a standard treatment, its effectiveness is limited in patients with persistent AF. Recent research highlights the importance of targeting additional atrial regions, particularly fibrotic areas identified via late gadolinium-enhanced MRI (LGE-MRI). However, existing manual segmentation methods are time-consuming and prone to variability. Deep learning techniques, particularly convolutional neural networks (CNNs), have shown promise in automating segmentation. However, most studies focus solely on the left atrium (LA) and rely on small datasets, limiting generalizability. In this paper, we propose a novel two-stage framework incorporating ResNeXt encoders and a cyclic learning rate to segment both the right atrium (RA) and LA walls and cavities in LGE-MRIs. Our method aims to improve the segmentation of challenging small structures, such as atrial walls while maintaining high performance in larger regions like the atrial cavities. The results demonstrate that our approach offers superior segmentation accuracy and robustness compared to traditional architectures, particularly for imbalanced class structures.
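The abstract mentions a cyclic learning rate without giving its shape. As a rough illustration only, the sketch below implements the widely used triangular schedule, in which the rate ramps linearly from a base value to a peak and back. The function name `triangular_lr` and the values of `base_lr`, `max_lr`, and `half_cycle` are assumptions for this example, not settings from the paper.

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100):
    """Triangular cyclic learning rate: linear ramp up, then linear ramp down."""
    # Position within the current cycle, which lasts 2 * half_cycle steps.
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle        # rising half: 0 -> 1
    else:
        frac = 2 - cycle_pos / half_cycle    # falling half: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac


# Print the schedule at a few steps across one and a half cycles.
for step in (0, 50, 100, 150, 200, 250):
    print(step, triangular_lr(step))
```

Periodically raising the rate is often motivated as a way to escape poor local minima without hand-tuning a decay schedule; whether that is the authors' motivation is not stated in the abstract.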
Artificial Intelligence
[AI-0] CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning
Link: https://arxiv.org/abs/2503.03743
Authors: Yuqi Zhou,Shuai Wang,Sunhao Dai,Qinglin Jia,Zhaocheng Du,Zhenhua Dong,Jun Xu
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:The advancement of visual language models (VLMs) has enhanced mobile device operations, allowing simulated human-like actions to address user requirements. Current VLM-based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, linking high-level goals with low-level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that lower-level agent cannot execute and inefficient subtasks that fail to contribute to the completion of the higher-level task. These challenges stem from VLM’s lack of experience in decomposing subtasks within GUI scenarios in multi-agent architecture. To address these, we propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP). Our approach overcomes the VLM’s deficiency in GUI scenarios planning by using human-planned subtasks as the basis vector. We evaluate our architecture in both English and Chinese contexts across 20 Apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code is available at this https URL
[AI-1] Machine Learning in Biomechanics: Key Applications and Limitations in Walking Running and Sports Movements
Link: https://arxiv.org/abs/2503.03717
Authors: Carlo Dindorf,Fabian Horst,Djordje Slijepčević,Bernhard Dumphart,Jonas Dully,Matthias Zeppelzauer,Brian Horsak,Michael Fröhlich
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:This chapter provides an overview of recent and promising Machine Learning applications, i.e. pose estimation, feature estimation, event detection, data exploration clustering, and automated classification, in gait (walking and running) and sports biomechanics. It explores the potential of Machine Learning methods to address challenges in biomechanical workflows, highlights central limitations, i.e. data and annotation availability and explainability, that need to be addressed, and emphasises the importance of interdisciplinary approaches for fully harnessing the potential of Machine Learning in gait and sports biomechanics.
[AI-2] Curating Demonstrations using Online Experience
Link: https://arxiv.org/abs/2503.03707
Authors: Annie S. Chen,Alec M. Lessing,Yuejiang Liu,Chelsea Finn
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Many robot demonstration datasets contain heterogeneous demonstrations of varying quality. This heterogeneity may benefit policy pre-training, but can hinder robot performance when used with a final imitation learning objective. In particular, some strategies in the data may be less reliable than others or may be underrepresented in the data, leading to poor performance when such strategies are sampled at test time. Moreover, such unreliable or underrepresented strategies can be difficult even for people to discern, and sifting through demonstration datasets is time-consuming and costly. On the other hand, policy performance when trained on such demonstrations can reflect the reliability of different strategies. We thus propose for robots to self-curate based on online robot experience (Demo-SCORE). More specifically, we train and cross-validate a classifier to discern successful policy roll-outs from unsuccessful ones and use the classifier to filter heterogeneous demonstration datasets. Our experiments in simulation and the real world show that Demo-SCORE can effectively identify suboptimal demonstrations without manual curation. Notably, Demo-SCORE achieves over 15-35% higher absolute success rate in the resulting policy compared to the base policy trained with all original demonstrations.
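The filtering idea in this abstract (train a classifier on successful versus unsuccessful policy roll-outs, then apply it to the demonstrations) can be sketched in a few lines. This is a toy illustration, not the Demo-SCORE implementation: it swaps in a nearest-centroid classifier over hypothetical 2-D feature vectors, omits the cross-validation step the abstract mentions, and every name and number in it is an assumption.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def fit_nearest_centroid(rollouts):
    """rollouts: list of (feature_vector, succeeded) pairs from policy roll-outs."""
    good = centroid([x for x, ok in rollouts if ok])
    bad = centroid([x for x, ok in rollouts if not ok])
    return good, bad


def predicts_success(model, x):
    """Classify x by whichever class centroid is closer (ties count as success)."""
    good, bad = model
    return math.dist(x, good) <= math.dist(x, bad)


def filter_demonstrations(model, demos):
    """Keep only demonstrations the roll-out classifier labels as successful."""
    return [d for d in demos if predicts_success(model, d)]


# Hypothetical features; a real system would use learned trajectory features.
rollouts = [([0.9, 0.1], True), ([0.8, 0.2], True),
            ([0.1, 0.9], False), ([0.2, 0.8], False)]
demos = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]]

model = fit_nearest_centroid(rollouts)
kept = filter_demonstrations(model, demos)
print(kept)
```

The point of the sketch is the data flow: labels come for free from online roll-out outcomes, so no human has to annotate which demonstrations are unreliable.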
[AI-3] ILLC: Iterative Layer-by-Layer Compression for Enhancing Structural Faithfulness in SpArX
Link: https://arxiv.org/abs/2503.03693
Authors: Ungsik Kim
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures
Abstract: In the field of Explainable Artificial Intelligence (XAI), argumentative XAI approaches have been proposed to represent the internal reasoning process of deep neural networks in a more transparent way by interpreting hidden nodes as arguments. However, as the number of layers increases, existing compression methods simplify all layers at once, which leads to high accumulative information loss. To compensate for this, we propose an iterative layer-by-layer compression technique in which each layer is compressed separately and the reduction error in the next layer is immediately compensated for, thereby improving the overall input-output and structural fidelity of the model. Experiments on the Breast Cancer Diagnosis dataset show that, compared to traditional compression, the method reduces input-output and structural unfaithfulness, and maintains a more consistent attack-support relationship in the Argumentative Explanation scheme. This is significant because it provides a new way to make complex MLP models more compact while still conveying their internal inference logic without distortion.
[AI-4] Decoupled Recommender Systems: Exploring Alternative Recommender Ecosystem Designs
Link: https://arxiv.org/abs/2503.03606
Authors: Anas Buhayh,Elizabeth McKinnie,Robin Burke
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Recommender ecosystems are an emerging subject of research. Such research examines how the characteristics of algorithms, recommendation consumers, and item providers influence system dynamics and long-term outcomes. One architectural possibility that has not yet been widely explored in this line of research is the consequences of a configuration in which recommendation algorithms are decoupled from the platforms they serve. This is sometimes called “the friendly neighborhood algorithm store” or “middleware” model. We are particularly interested in how such architectures might offer a range of different distributions of utility across consumers, providers, and recommendation platforms. In this paper, we create a model of a recommendation ecosystem that incorporates algorithm choice and examine the outcomes of such a design.
[AI-5] Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias
Link: https://arxiv.org/abs/2503.03595
Authors: Rui Lu,Runzhe Wang,Kaifeng Lyu,Xitai Jiang,Gao Huang,Mengdi Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract: Score-based diffusion models have achieved incredible performance in generating realistic images, audio, and video data. While these models produce high-quality samples with impressive details, they often introduce unrealistic artifacts, such as distorted fingers or hallucinated texts with no meaning. This paper focuses on textual hallucinations, where diffusion models correctly generate individual symbols but assemble them in a nonsensical manner. Through experimental probing, we consistently observe that this phenomenon is attributable to the network’s local generation bias. Denoising networks tend to produce outputs that rely heavily on highly correlated local regions, particularly when different dimensions of the data distribution are nearly pairwise independent. This behavior leads to a generation process that decomposes the global distribution into separate, independent distributions for each symbol, ultimately failing to capture the global structure, including underlying grammar. Intriguingly, this bias persists across various denoising network architectures including MLP and transformers which have the structure to model global dependency. These findings also provide insights into understanding other types of hallucinations, extending beyond text, as a result of implicit biases in the denoising models. Additionally, we theoretically analyze the training dynamics for a specific case involving a two-layer MLP learning parity points on a hypercube, offering an explanation of its underlying mechanism.
[AI-6] A Conceptual Model for Attributions in Event-Centric Knowledge Graphs
Link: https://arxiv.org/abs/2503.03563
Authors: Florian Plötzky,Katarina Britz,Wolf-Tilo Balke
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Submitted to Data Knowledge Engineering, 22 pages, 9 figures
Abstract:The use of narratives as a means of fusing information from knowledge graphs (KGs) into a coherent line of argumentation has been the subject of recent investigation. Narratives are especially useful in event-centric knowledge graphs in that they provide a means to connect different real-world events and categorize them by well-known narrations. However, specifically for controversial events, a problem in information fusion arises, namely, multiple viewpoints regarding the validity of certain event aspects, e.g., regarding the role a participant takes in an event, may exist. Expressing those viewpoints in KGs is challenging because disputed information provided by different viewpoints may introduce inconsistencies. Hence, most KGs only feature a single view on the contained information, hampering the effectiveness of narrative information access. This paper is an extension of our original work and introduces attributions, i.e., parameterized predicates that allow for the representation of facts that are only valid in a specific viewpoint. For this, we develop a conceptual model that allows for the representation of viewpoint-dependent information. As an extension, we enhance the model by a conception of viewpoint-compatibility. Based on this, we deepen our original deliberations on the model’s effects on information fusion and provide additional grounding in the literature.
[AI-7] AI-Enabled Conversational Journaling for Advancing Parkinson’s Disease Symptom Tracking
Link: https://arxiv.org/abs/2503.03532
Authors: Mashrur Rashik,Shilpa Sweth,Nishtha Agrawal,Saiyyam Kochar,Kara M Smith,Fateme Rajabiyazdi,Vidya Setlur,Narges Mahyar,Ali Sarvghad
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To appear in the ACM CHI conference on Human Factors in Computing Systems (CHI), 2025
Abstract: Journaling plays a crucial role in managing chronic conditions by allowing patients to document symptoms and medication intake, providing essential data for long-term care. While valuable, traditional journaling methods often rely on static, self-directed entries, lacking interactive feedback and real-time guidance. This gap can result in incomplete or imprecise information, limiting its usefulness for effective treatment. To address this gap, we introduce PATRIKA, an AI-enabled prototype designed specifically for people with Parkinson’s disease (PwPD). The system incorporates cooperative conversation principles, clinical interview simulations, and personalization to create a more effective and user-friendly journaling experience. Through two user studies with PwPD and iterative refinement of PATRIKA, we demonstrate conversational journaling’s significant potential in patient engagement and collecting clinically valuable information. Our results showed that, by generating probing questions, PATRIKA turned journaling into a bi-directional interaction. Additionally, we offer insights for designing journaling systems for healthcare and future directions for promoting sustained journaling.
[AI-8] NeuGrasp: Generalizable Neural Surface Reconstruction with Background Priors for Material-Agnostic Object Grasp Detection ICRA
Link: https://arxiv.org/abs/2503.03511
Authors: Qingyu Fan,Yinghao Cai,Chao Li,Wenzhe He,Xudong Zheng,Tao Lu,Bin Liang,Shuo Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures. IEEE International Conference on Robotics and Automation (ICRA) 2025
Abstract:Robotic grasping in scenes with transparent and specular objects presents great challenges for methods relying on accurate depth information. In this paper, we introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encoding, enabling robust surface reconstruction in narrow and sparse viewing conditions. By focusing on foreground objects through residual feature enhancement and refining spatial perception with an occupancy-prior volume, NeuGrasp excels in handling objects with transparent and specular surfaces. Extensive experiments in both simulated and real-world scenarios show that NeuGrasp outperforms state-of-the-art methods in grasping while maintaining comparable reconstruction quality. More details are available at this https URL.
[AI-9] Rethinking Synthetic Data definitions: A privacy driven approach
Link: https://arxiv.org/abs/2503.03506
Authors: Vibeke Binz Vallevik,Serena Elizabeth Marshall,Aleksandar Babic,Jan Franz Nygaard
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Synthetic data is gaining traction as a cost-effective solution for the increasing data demands of AI development and can be generated either from existing knowledge or derived data captured from real-world events. The source of the synthetic data generation and the technique used significantly impacts its residual privacy risk and therefore its opportunity for sharing. Traditional classification of synthetic data types no longer fit the newer generation techniques and there is a need to better align the classification with practical needs. We suggest a new way of grouping synthetic data types that better supports privacy evaluations to aid regulatory policymaking. Our novel classification provides flexibility to new advancements like deep generative methods and offers a more practical framework for future applications.
[AI-10] Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems
Link: https://arxiv.org/abs/2503.03505
Authors: Yaoru Li,Shunyu Liu,Tongya Zheng,Mingli Song
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract: Recent advancements in Large Language Model (LLM)-based Multi-Agent Systems (MAS) have demonstrated remarkable potential for tackling complex decision-making tasks. However, existing frameworks inevitably rely on serialized execution paradigms, where agents must complete sequential LLM planning before taking action. This fundamental constraint severely limits real-time responsiveness and adaptation, which is crucial in dynamic environments with ever-changing scenarios. In this paper, we propose a novel parallelized planning-acting framework for LLM-based MAS, featuring a dual-thread architecture with interruptible execution to enable concurrent planning and acting. Specifically, our framework comprises two core threads: (1) a planning thread driven by a centralized memory system, maintaining synchronization of environmental states and agent communication to support dynamic decision-making; and (2) an acting thread equipped with a comprehensive skill library, enabling automated task execution through recursive decomposition. Extensive experiments on challenging Minecraft demonstrate the effectiveness of the proposed framework.
[AI-11] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning
Link: https://arxiv.org/abs/2503.03480
Authors: Borong Zhang,Yuhao Zhang,Jiaming Ji,Yingshan Lei,Josef Dai,Yuanpei Chen,Yaodong Yang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures
Abstract:Vision-language-action models (VLAs) have shown great potential as generalist robot policies. However, these models pose urgent safety challenges during deployment, including the risk of physical harm to the environment, the robot itself, and humans. How can safety be explicitly incorporated into VLAs? In this work, we propose SafeVLA, a novel algorithm designed to integrate safety into VLAs, ensuring the protection of the environment, robot hardware and humans in real-world settings. SafeVLA effectively balances safety and task performance by employing large-scale constrained learning within simulated environments. We demonstrate that SafeVLA outperforms the current state-of-the-art method in both safety and task performance, achieving average improvements of 83.58% and 3.85%, respectively, in simulation. By prioritizing safety, our approach eliminates high-risk behaviors and reduces the upper bound of unsafe behaviors to 1/35 of that in the current state-of-the-art, thereby significantly mitigating long-tail risks. Furthermore, the learned safety constraints generalize to diverse, unseen scenarios, including multiple out-of-distribution perturbations and tasks. Our data, models and newly proposed benchmark environment are available at this https URL.
[AI-12] Conceptualizing Uncertainty
Link: https://arxiv.org/abs/2503.03443
Authors: Isaac Roberts,Alexander Schulz,Sarah Schroeder,Fabian Hinder,Barbara Hammer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Uncertainty in machine learning refers to the degree of confidence or lack thereof in a model’s predictions. While uncertainty quantification methods exist, explanations of uncertainty, especially in high-dimensional settings, remain an open challenge. Existing work focuses on feature attribution approaches which are restricted to local explanations. Understanding uncertainty, its origins, and characteristics on a global scale is crucial for enhancing interpretability and trust in a model’s predictions. In this work, we propose to explain the uncertainty in high-dimensional data classification settings by means of concept activation vectors which give rise to local and global explanations of uncertainty. We demonstrate the utility of the generated explanations by leveraging them to refine and improve our model.
[AI-13] Privacy is All You Need: Revolutionizing Wearable Health Data with Advanced PETs
Link: https://arxiv.org/abs/2503.03428
Authors: Karthik Barma,Seshu Babu Barma
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments:
Abstract:In a world where data is the new currency, wearable health devices offer unprecedented insights into daily life, continuously monitoring vital signs and metrics. However, this convenience raises privacy concerns, as these devices collect sensitive data that can be misused or breached. Traditional measures often fail due to real-time data processing needs and limited device power. Users also lack awareness and control over data sharing and usage. We propose a Privacy-Enhancing Technology (PET) framework for wearable devices, integrating federated learning, lightweight cryptographic methods, and selectively deployed blockchain technology. The blockchain acts as a secure ledger triggered only upon data transfer requests, granting users real-time notifications and control. By dismantling data monopolies, this approach returns data sovereignty to individuals. Through real-world applications like secure medical data sharing, privacy-preserving fitness tracking, and continuous health monitoring, our framework reduces privacy risks by up to 70 percent while preserving data utility and performance. This innovation sets a new benchmark for wearable privacy and can scale to broader IoT ecosystems, including smart homes and industry. As data continues to shape our digital landscape, our research underscores the critical need to maintain privacy and user control at the forefront of technological progress.
[AI-14] Simplicial SMOTE: Oversampling Solution to the Imbalanced Learning Problem KDD2025
Link: https://arxiv.org/abs/2503.03418
Authors: Oleg Kachan,Andrey Savchenko,Gleb Gusev
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at KDD 2025 (research track)
Abstract:SMOTE (Synthetic Minority Oversampling Technique) is the established geometric approach to random oversampling to balance classes in the imbalanced learning problem, followed by many extensions. Its idea is to introduce synthetic data points of the minor class, with each new point being the convex combination of an existing data point and one of its k-nearest neighbors. In this paper, by viewing SMOTE as sampling from the edges of a geometric neighborhood graph and borrowing tools from the topological data analysis, we propose a novel technique, Simplicial SMOTE, that samples from the simplices of a geometric neighborhood simplicial complex. A new synthetic point is defined by the barycentric coordinates w.r.t. a simplex spanned by an arbitrary number of data points being sufficiently close rather than a pair. Such a replacement of the geometric data model results in better coverage of the underlying data distribution compared to existing geometric sampling methods and allows the generation of synthetic points of the minority class closer to the majority class on the decision boundary. We experimentally demonstrate that our Simplicial SMOTE outperforms several popular geometric sampling methods, including the original SMOTE. Moreover, we show that simplicial sampling can be easily integrated into existing SMOTE extensions. We generalize and evaluate simplicial extensions of the classic Borderline SMOTE, Safe-level SMOTE, and ADASYN algorithms, all of which outperform their graph-based counterparts.
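The contrast this abstract draws between sampling on an edge (classic SMOTE) and sampling inside a simplex can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the paper: the simplex is simply a point together with its k nearest neighbors rather than a simplex of a neighborhood simplicial complex, barycentric weights are drawn uniformly (a Dirichlet(1, ..., 1) draw obtained by normalizing exponential variates), and all function names and data are hypothetical.

```python
import math
import random


def smote_point(x, neighbor, rng):
    """Classic SMOTE: a random convex combination of a point and one neighbor."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


def barycentric_weights(k, rng):
    """Uniform weights on the simplex: normalized Exp(1) draws are Dirichlet(1, ..., 1)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def simplex_point(vertices, rng):
    """Simplicial-style sample: a random convex combination of all the vertices."""
    w = barycentric_weights(len(vertices), rng)
    dim = len(vertices[0])
    return [sum(wi * v[j] for wi, v in zip(w, vertices)) for j in range(dim)]


def nearest_neighbors(points, index, k):
    """The k points closest to points[index], excluding the point itself."""
    others = [p for i, p in enumerate(points) if i != index]
    return sorted(others, key=lambda p: math.dist(points[index], p))[:k]


rng = random.Random(0)
# Hypothetical minority-class points in 2-D.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
neighbors = nearest_neighbors(minority, 0, 2)

edge_sample = smote_point(minority[0], neighbors[0], rng)           # lies on a segment
triangle_sample = simplex_point([minority[0]] + neighbors, rng)     # lies in a triangle
print(edge_sample, triangle_sample)
```

With two vertices `simplex_point` reduces to the edge case, which is one way to see the simplicial version as a generalization: a pair of points spans a 1-simplex, while three or more nearby points span a region with interior that edge sampling can never reach.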
[AI-15] Multi-Agent DRL for Queue-Aware Task Offloading in Hierarchical MEC-Enabled Air-Ground Networks
Link: https://arxiv.org/abs/2503.03391
Authors: Muhammet Hevesli,Abegaz Mohammed Seid,Aiman Erbad,Mohamed Abdallah
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mobile edge computing (MEC)-enabled air-ground networks are a key component of 6G, employing aerial base stations (ABSs) such as unmanned aerial vehicles (UAVs) and high-altitude platform stations (HAPS) to provide dynamic services to ground IoT devices (IoTDs). These IoTDs support real-time applications (e.g., multimedia and Metaverse services) that demand high computational resources and strict quality of service (QoS) guarantees in terms of latency and task queue management. Given their limited energy and processing capabilities, IoTDs rely on UAVs and HAPS to offload tasks for distributed processing, forming a multi-tier MEC system. This paper tackles the overall energy minimization problem in MEC-enabled air-ground integrated networks (MAGIN) by jointly optimizing UAV trajectories, computing resource allocation, and queue-aware task offloading decisions. The optimization is challenging due to the nonconvex, nonlinear nature of this hierarchical system, which renders traditional methods ineffective. We reformulate the problem as a multi-agent Markov decision process (MDP) with continuous action spaces and heterogeneous agents, and propose a novel variant of multi-agent proximal policy optimization with a Beta distribution (MAPPO-BD) to solve it. Extensive simulations show that MAPPO-BD outperforms baseline schemes, achieving superior energy savings and efficient resource management in MAGIN while meeting queue delay and edge computing constraints.
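One general property of a Beta-distributed policy head, as named in this abstract, is that its support is bounded, so sampled actions can be mapped onto a box-constrained continuous action range without clipping. The sketch below illustrates only that property; it is not the MAPPO-BD algorithm, and the function names, shape parameters, and action range are assumptions chosen for the example.

```python
import random


def sample_bounded_action(alpha, beta, low, high, rng):
    """Draw u ~ Beta(alpha, beta) on [0, 1] and rescale it to [low, high]."""
    u = rng.betavariate(alpha, beta)
    return low + (high - low) * u


def beta_mean_action(alpha, beta, low, high):
    """Mean action: Beta(alpha, beta) has mean alpha / (alpha + beta)."""
    return low + (high - low) * alpha / (alpha + beta)


rng = random.Random(42)
# Hypothetical shape parameters; in a policy network these would be outputs
# of the actor, constrained to be positive.
action = sample_bounded_action(2.0, 5.0, low=-1.0, high=1.0, rng=rng)
print(action, beta_mean_action(2.0, 5.0, -1.0, 1.0))
```

A Gaussian policy, by comparison, has unbounded support, so actions outside the valid range have to be clipped or squashed, which distorts the sampled distribution near the boundaries.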
[AI-16] From Infants to AI: Incorporating Infant-like Learning in Models Boosts Efficiency and Generalization in Learning Social Prediction Tasks
Link: https://arxiv.org/abs/2503.03361
Authors: Shify Treger,Shimon Ullman
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Early in development, infants learn a range of useful concepts, which can be challenging from a computational standpoint. This early learning comes together with an initial understanding of aspects of the meaning of concepts, e.g., their implications, causality, and using them to predict likely future events. All this is accomplished in many cases with little or no supervision, and from relatively few examples, compared with current network models. In learning about objects and human-object interactions, early acquired and possibly innate concepts are often used in the process of learning additional, more complex concepts. In the current work, we model how early-acquired concepts are used in the learning of subsequent concepts, and compare the results with standard deep network modeling. We focused in particular on the use of the concepts of animacy and goal attribution in learning to predict future events. We show that the use of early concepts in the learning of new concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data). We further show that this integration of early and new concepts shapes the representation of the concepts acquired by the model. The results show that when the concepts were learned in a human-like manner, the emerging representation was more useful, as measured in terms of generalization to novel data and tasks. On a more general level, the results suggest that there are likely to be basic differences in the conceptual structures acquired by current network models compared to human learning.
[AI-17] Leveraging Large Language Models to Develop Heuristics for Emerging Optimization Problems
Link: https://arxiv.org/abs/2503.03350
Authors: Thomas Bömer,Nico Koltermann,Max Disselnmeyer,Laura Dörr,Anne Meyer
Categories: Artificial Intelligence (cs.AI)
Comments: Under review LION19: The 19th Learning and Intelligent OptimizatioN Conference
Abstract:Combinatorial optimization problems often rely on heuristic algorithms to generate efficient solutions. However, the manual design of heuristics is resource-intensive and constrained by the designer’s expertise. Recent advances in artificial intelligence, particularly large language models (LLMs), have demonstrated the potential to automate heuristic generation through evolutionary frameworks. Recent works focus only on well-known combinatorial optimization problems like the traveling salesman problem and online bin packing problem when designing constructive heuristics. This study investigates whether LLMs can effectively generate heuristics for niche, not yet broadly researched optimization problems, using the unit-load pre-marshalling problem as an example case. We propose the Contextual Evolution of Heuristics (CEoH) framework, an extension of the Evolution of Heuristics (EoH) framework, which incorporates problem-specific descriptions to enhance in-context learning during heuristic generation. Through computational experiments, we evaluate CEoH and EoH and compare the results. Results indicate that CEoH enables smaller LLMs to generate high-quality heuristics more consistently and even outperform larger models. Larger models demonstrate robust performance with or without contextualized prompts. The generated heuristics exhibit scalability to diverse instance configurations.
[AI-18] Navigating Intelligence: A Survey of Google OR-Tools and Machine Learning for Global Path Planning in Autonomous Vehicles
Link: https://arxiv.org/abs/2503.03338
Authors: Alexandre Benoit,Pedram Asef
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
Abstract:We offer a new in-depth investigation of global path planning (GPP) for unmanned ground vehicles, an autonomous mining sampling robot named ROMIE. GPP is essential for ROMIE’s optimal performance, which is translated into solving the traveling salesman problem, a complex graph theory challenge that is crucial for determining the most effective route to cover all sampling locations in a mining field. This problem is central to enhancing ROMIE’s operational efficiency and competitiveness against human labor by optimizing cost and time. The primary aim of this research is to advance GPP by developing, evaluating, and improving a cost-efficient software and web application. We delve into an extensive comparison and analysis of Google operations research (OR)-Tools optimization algorithms. Our study is driven by the goal of applying and testing the limits of OR-Tools capabilities by integrating Reinforcement Learning techniques for the first time. This enables us to compare these methods with OR-Tools, assessing their computational effectiveness and real-world application efficiency. Our analysis seeks to provide insights into the effectiveness and practical application of each technique. Our findings indicate that Q-Learning stands out as the optimal strategy, demonstrating superior efficiency by deviating only 1.2% on average from the optimal solutions across our datasets.
[AI-19] Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems
Link: https://arxiv.org/abs/2503.03274
Authors: Alfreds Lapkovskis,Boris Sedlak,Sindri Magnússon,Schahram Dustdar,Praveen Kumar Donta
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
Abstract:Ensuring Service Level Objectives (SLOs) in large-scale architectures, such as Distributed Computing Continuum Systems (DCCS), is challenging due to their heterogeneous nature and varying service requirements across different devices and applications. Additionally, unpredictable workloads and resource limitations lead to fluctuating performance and violated SLOs. To improve SLO compliance in DCCS, one possibility is to apply machine learning; however, the design choices are often left to the developer. To that extent, we provide a benchmark of Active Inference – an emerging method from neuroscience – against three established reinforcement learning algorithms (Deep Q-Network, Advantage Actor-Critic, and Proximal Policy Optimization). We consider a realistic DCCS use case: an edge device running a video conferencing application alongside a WebSocket server streaming videos. Using one of the respective algorithms, we continuously monitor key performance metrics, such as latency and bandwidth usage, to dynamically adjust parameters – including the number of streams, frame rate, and resolution – to optimize service quality and user experience. To test algorithms’ adaptability to constant system changes, we simulate dynamically changing SLOs and both instant and gradual data-shift scenarios, such as network bandwidth limitations and fluctuating device thermal states. Although the evaluated algorithms all showed advantages and limitations, our findings demonstrate that Active Inference is a promising approach for ensuring SLO compliance in DCCS, offering lower memory usage, stable CPU utilization, and fast convergence.
[AI-20] Conformal Transformations for Symmetric Power Transformers ICLR2025
Link: https://arxiv.org/abs/2503.03269
Authors: Saurabh Kumar,Jacob Buckman,Carles Gelada,Sean Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: SCOPE Workshop at ICLR 2025
Abstract:Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, addresses some of this performance gap by leveraging symmetric tensor embeddings, achieving comparable performance to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.
[AI-21] Exploring the Potential of Large Language Models as Predictors in Dynamic Text-Attributed Graphs
Link: https://arxiv.org/abs/2503.03258
Authors: Runlin Lei,Jiarui Ji,Haipeng Ding,Lu Yi,Zhewei Wei,Yongchao Liu,Chuntao Hong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:With the rise of large language models (LLMs), there has been growing interest in Graph Foundation Models (GFMs) for graph-based tasks. By leveraging LLMs as predictors, GFMs have demonstrated impressive generalizability across various tasks and datasets. However, existing research on LLMs as predictors has predominantly focused on static graphs, leaving their potential in dynamic graph prediction unexplored. In this work, we pioneer using LLMs for predictive tasks on dynamic graphs. We identify two key challenges: the constraints imposed by context length when processing large-scale historical data and the significant variability in domain characteristics, both of which complicate the development of a unified predictor. To address these challenges, we propose the GraphAgent-Dynamic (GAD) Framework, a multi-agent system that leverages collaborative LLMs. In contrast to using a single LLM as the predictor, GAD incorporates global and local summary agents to generate domain-specific knowledge, enhancing its transferability across domains. Additionally, knowledge reflection agents enable adaptive updates to GAD’s knowledge, maintaining a unified and self-consistent architecture. In experiments, GAD performs comparably to, or even better than, fully supervised graph neural networks without dataset-specific training. Finally, to enhance the task-specific performance of LLM-based predictors, we discuss potential improvements, such as dataset-specific fine-tuning of LLMs. By developing tailored strategies for different tasks, we provide new insights for the future design of LLM-based predictors.
[AI-22] Less is more? Rewards in RL for Cyber Defence
Link: https://arxiv.org/abs/2503.03245
Authors: Elizabeth Bates,Chris Hicks,Vasilios Mavroudis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 4 Pages
Abstract:The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all, cyber gyms provide dense “scaffolded” reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps, they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. Sparse rewards could offer preferable or more effective solutions and have been overlooked by cyber gyms to date. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.
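To make the dense versus sparse distinction in this abstract concrete, here is a minimal sketch of the two reward styles. The state dictionary, the `compromised_hosts` key, and the penalty weights are hypothetical and chosen only for illustration; they are not taken from the paper or from any particular cyber gym.

```python
def dense_reward(state, action_cost):
    """Scaffolded reward: a penalty per compromised host plus the action's cost."""
    # Hypothetical weighting: each compromised host costs 1.0
    return -1.0 * state["compromised_hosts"] - action_cost


def sparse_reward(state):
    """Sparse positive reinforcement: pay out only for an uncompromised network."""
    return 1.0 if state["compromised_hosts"] == 0 else 0.0


clean = {"compromised_hosts": 0}
breached = {"compromised_hosts": 2}

dense_value = dense_reward(breached, action_cost=0.5)
sparse_clean = sparse_reward(clean)
sparse_breached = sparse_reward(breached)
```

The dense function gives the agent a signal at every step but bakes in the designer's weighting of penalties, while the sparse one only states the goal and leaves the strategy to the agent.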
[AI-23] COSINT-Agent: A Knowledge-Driven Multimodal Agent for Chinese Open Source Intelligence
Link: https://arxiv.org/abs/2503.03215
Authors: Wentao Li,Congcong Wang,Xiaoxiao Cui,Zhi Liu,Wei Guo,Lizhen Cui
Subjects: Artificial Intelligence (cs.AI)
Abstract:Open Source Intelligence (OSINT) requires the integration and reasoning of diverse multimodal data, presenting significant challenges in deriving actionable insights. Traditional approaches, including multimodal large language models (MLLMs), often struggle to infer complex contextual relationships or deliver comprehensive intelligence from unstructured data sources. In this paper, we introduce COSINT-Agent, a knowledge-driven multimodal agent tailored to address the challenges of OSINT in the Chinese domain. COSINT-Agent seamlessly integrates the perceptual capabilities of fine-tuned MLLMs with the structured reasoning power of the Entity-Event-Scene Knowledge Graph (EES-KG). Central to COSINT-Agent is the innovative EES-Match framework, which bridges COSINT-MLLM and EES-KG, enabling systematic extraction, reasoning, and contextualization of multimodal insights. This integration facilitates precise entity recognition, event interpretation, and context retrieval, effectively transforming raw multimodal data into actionable intelligence. Extensive experiments validate the superior performance of COSINT-Agent across core OSINT tasks, including entity recognition, EES generation, and context matching. These results underscore its potential as a robust and scalable solution for advancing automated multimodal reasoning and enhancing the effectiveness of OSINT methodologies.
[AI-24] NodeReg: Mitigating the Imbalance and Distribution Shift Effects in Semi-Supervised Node Classification via Norm Consistency
Link: https://arxiv.org/abs/2503.03211
Authors: Shenzhi Yang,Jun Xia,Jingbo Zhou,Xingkai Yao,Xiaofang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Aggregating information from neighboring nodes benefits graph neural networks (GNNs) in semi-supervised node classification tasks. Nevertheless, this mechanism also renders nodes susceptible to the influence of their neighbors. For instance, this will occur when the neighboring nodes are imbalanced or the neighboring nodes contain noise, which can even affect the GNN’s ability to generalize out of distribution. We find that ensuring the consistency of the norm for node representations can significantly reduce the impact of these two issues on GNNs. To this end, we propose a regularized optimization method called NodeReg that enforces the consistency of node representation norms. This method is simple but effective and satisfies Lipschitz continuity, thus facilitating stable optimization and significantly improving semi-supervised node classification performance under the above two scenarios. To illustrate, in the imbalance scenario, when training a GCN with an imbalance ratio of 0.1, NodeReg outperforms the most competitive baselines by 1.4%-25.9% in F1 score across five public datasets. Similarly, in the distribution shift scenario, NodeReg outperforms the most competitive baseline by 1.4%-3.1% in accuracy.
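One plausible reading of "consistency of the norm for node representations" is a penalty on how much the L2 norms of node embeddings differ from one another. The sketch below uses the variance of the norms as that penalty. This is an assumption made for illustration and is not necessarily the regularizer defined in the NodeReg paper.

```python
import math


def norm_consistency_penalty(reps):
    """Variance of the L2 norms of node representations (0 when all norms match)."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in reps]
    mean = sum(norms) / len(norms)
    return sum((v - mean) ** 2 for v in norms) / len(norms)


# Two nodes whose representations both have norm 5: no penalty
equal = norm_consistency_penalty([[3.0, 4.0], [5.0, 0.0]])
# Norms 5 and 1: the penalty grows with the spread
uneven = norm_consistency_penalty([[3.0, 4.0], [0.0, 1.0]])
```

In training, a term like this would be added to the classification loss with a small weight, so that nodes with many or noisy neighbors cannot dominate through unusually large representations.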
[AI-25] Directly Follows Graphs Go Predictive Process Monitoring With Graph Neural Networks
Link: https://arxiv.org/abs/2503.03197
Authors: Attila Lischka,Simon Rauch,Oliver Stritzel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, 3 tables
Abstract:In recent years, predictive process monitoring (PPM) techniques based on artificial neural networks have evolved as a method to monitor the future behavior of business processes. Existing approaches mostly focus on interpreting the processes as sequences, so-called traces, and feeding them to neural architectures designed to operate on sequential data such as recurrent neural networks (RNNs) or transformers. In this study, we investigate an alternative way to perform PPM: by transforming each process into its directly-follows-graph (DFG) representation we are able to apply graph neural networks (GNNs) for the prediction tasks. By this, we aim to develop models that are more suitable for complex processes that are long and contain an abundance of loops. In particular, we present different ways to create DFG representations depending on the particular GNN we use. The tested GNNs range from classical node-based to novel edge-based architectures. Further, we investigate the possibility of using multi-graphs. By these steps, we aim to design graph representations that minimize the information loss when transforming traces into graphs.
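For readers new to process mining, a directly-follows graph has one node per activity and a weighted edge from a to b for every time b occurs immediately after a in some trace. The sketch below builds such edge counts from a toy event log. The activity names are made up, and the paper's own DFG variants (for example its multi-graphs) may carry more information than this.

```python
from collections import Counter


def directly_follows_graph(traces):
    """Count how often activity a is immediately followed by activity b."""
    edges = Counter()
    for trace in traces:
        # zip pairs each event with its direct successor in the same trace
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += 1
    return edges


# Hypothetical event log: each trace lists the ordered activities of one case
traces = [
    ["register", "check", "approve"],
    ["register", "check", "check", "reject"],
]
dfg = directly_follows_graph(traces)
```

Note that the repeated "check" in the second trace becomes a self-loop, which is how loops in long processes collapse into a compact graph.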
[AI-26] AttackSeqBench: Benchmarking Large Language Models’ Understanding of Sequential Patterns in Cyber Attacks
Link: https://arxiv.org/abs/2503.03170
Authors: Javier Yong,Haokai Ma,Yunshan Ma,Anis Yusof,Zhenkai Liang,Ee-Chien Chang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:The observations documented in Cyber Threat Intelligence (CTI) reports play a critical role in describing adversarial behaviors, providing valuable insights for security practitioners to respond to evolving threats. Recent advancements of Large Language Models (LLMs) have demonstrated significant potential in various cybersecurity applications, including CTI report understanding and attack knowledge graph construction. While previous works have proposed benchmarks that focus on the CTI extraction ability of LLMs, the sequential characteristic of adversarial behaviors within CTI reports remains largely unexplored, which holds considerable significance in developing a comprehensive understanding of how adversaries operate. To address this gap, we introduce AttackSeqBench, a benchmark tailored to systematically evaluate LLMs’ capability to understand and reason attack sequences in CTI reports. Our benchmark encompasses three distinct Question Answering (QA) tasks, each task focuses on the varying granularity in adversarial behavior. To alleviate the laborious effort of QA construction, we carefully design an automated dataset construction pipeline to create scalable and well-formulated QA datasets based on real-world CTI reports. To ensure the quality of our dataset, we adopt a hybrid approach of combining human evaluation and systematic evaluation metrics. We conduct extensive experiments and analysis with both fast-thinking and slow-thinking LLMs, while highlighting their strengths and limitations in analyzing the sequential patterns in cyber attacks. The overarching goal of this work is to provide a benchmark that advances LLM-driven CTI report understanding and fosters its application in real-world cybersecurity operations. Our dataset and code are available at this https URL .
[AI-27] DiRe-JAX: A JAX based Dimensionality Reduction Algorithm for Large-scale Data
Link: https://arxiv.org/abs/2503.03156
Authors: Alexander Kolpakov,Igor Rivin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
Comments: 22 pages, 12 figures; Github repository available at this https URL package available on PyPi this https URL
Abstract:DiRe-JAX is a new dimensionality reduction toolkit designed to address some of the challenges faced by traditional methods like UMAP and tSNE such as loss of global structure and computational efficiency. Built on the JAX framework, DiRe leverages modern hardware acceleration to provide an efficient, scalable, and interpretable solution for visualizing complex data structures, and for quantitative analysis of lower-dimensional embeddings. The toolkit shows considerable promise in preserving both local and global structures within the data as compared to state-of-the-art UMAP and tSNE implementations. This makes it suitable for a wide range of applications in machine learning, bioinformatics, and data science.
[AI-28] Position: Model Collapse Does Not Mean What You Think
Link: https://arxiv.org/abs/2503.03150
Authors: Rylan Schaeffer,Joshua Kazdan,Alvan Caleb Arulandu,Sanmi Koyejo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract:The proliferation of AI-generated content online has fueled concerns over model collapse, a degradation in future generative models’ performance when trained on synthetic data generated by earlier models. Industry leaders, premier research journals and popular science publications alike have prophesied catastrophic societal consequences stemming from model collapse. In this position piece, we contend this widespread narrative fundamentally misunderstands the scientific evidence. We highlight that research on model collapse actually encompasses eight distinct and at times conflicting definitions of model collapse, and argue that inconsistent terminology within and between papers has hindered building a comprehensive understanding of model collapse. To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature’s methodologies through this lens. While we leave room for reasonable disagreement, our analysis of research studies, weighted by how faithfully each study matches real-world conditions, leads us to conclude that certain predicted claims of model collapse rely on assumptions and conditions that poorly match real-world conditions, and in fact several prominent collapse scenarios are readily avoidable. Altogether, this position paper argues that model collapse has been warped from a nuanced multifaceted consideration into an oversimplified threat, and that the evidence suggests specific harms more likely under society’s current trajectory have received disproportionately less attention.
[AI-29] Knowledge Augmentation in Federation: Rethinking What Collaborative Learning Can Bring Back to Decentralized Data
Link: https://arxiv.org/abs/2503.03140
Authors: Wentai Wu,Yingliang Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint
Abstract:Data, as an observable form of knowledge, has become one of the most important factors of production for the development of Artificial Intelligence (AI). Meanwhile, increasing legislation and regulations on private and proprietary information result in scattered data sources also known as “data islands”. Although some collaborative learning paradigms such as Federated Learning (FL) can enable privacy-preserving training over decentralized data, they have inherent deficiencies in fairness, costs and reproducibility because of being learning-centric, which greatly limits how participants cooperate with each other. In light of this, we present a knowledge-centric paradigm termed Knowledge Augmentation in Federation (KAF), with a focus on how to enhance local knowledge through collaborative effort. We provide the suggested system architecture, formulate the prototypical optimization objective, and review emerging studies that employ methodologies suitable for KAF. On our roadmap, with a three-way categorization we describe the methods for knowledge expansion, knowledge filtering, and label and feature space correction in the federation. Further, we highlight several challenges and open questions that deserve more attention from the community. With our investigation, we intend to offer new insights for what collaborative learning can bring back to decentralized data.
[AI-30] Convergence Analysis of Federated Learning Methods Using Backward Error Analysis
Link: https://arxiv.org/abs/2503.03139
Authors: Jinwoo Lim,Suhyun Kim,Soo-Mook Moon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Backward error analysis allows finding a modified loss function, which the parameter updates really follow under the influence of an optimization method. The additional loss terms in this modified function are called the implicit regularizer. In this paper, we attempt to find the implicit regularizer for various federated learning algorithms on non-IID data distributions, and explain why each method shows different convergence behavior. We first show that the implicit regularizer of FedAvg disperses the gradient of each client from the average gradient, thus increasing the gradient variance. We also empirically show that the implicit regularizer hampers its convergence. Similarly, we compute the implicit regularizers of FedSAM and SCAFFOLD, and explain why they converge better. While existing convergence analyses focus on pointing out the advantages of FedSAM and SCAFFOLD, our approach can explain their limitations in complex non-convex settings. Specifically, we demonstrate that FedSAM can partially remove the bias in the first-order term of the implicit regularizer in FedAvg, whereas SCAFFOLD can fully eliminate the bias in the first-order term, but not in the second-order term. Consequently, the implicit regularizer can provide useful insight into the convergence behavior of federated learning from a different theoretical perspective.
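As background for the FedAvg discussion, the server-side step of FedAvg is a weighted average of the clients' parameters, with weights proportional to local dataset sizes. The sketch below shows only that aggregation step on plain Python lists. It does not reproduce the paper's backward error analysis, and the example numbers are invented.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients: the second holds three times as much data
global_w = fedavg([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

Because each client takes several local steps on its own non-IID data before this average is formed, the averaged update differs from a plain gradient step on the global loss, which is the gap the implicit regularizer is meant to describe.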
[AI-31] L2R: Learning to Reduce Search Space for Generalizable Neural Routing Solver
Link: https://arxiv.org/abs/2503.03137
Authors: Changliang Zhou,Xi Lin,Zhenkun Wang,Qingfu Zhang
Subjects: Artificial Intelligence (cs.AI)
Comments: 23 pages, 10 figures
Abstract:Constructive neural combinatorial optimization (NCO) has attracted growing research attention due to its ability to solve complex routing problems without relying on handcrafted rules. However, existing NCO methods face significant challenges in generalizing to large-scale problems due to high computational complexity and inefficient capture of structural patterns. To address this issue, we propose a novel learning-based search space reduction method that adaptively selects a small set of promising candidate nodes at each step of the constructive NCO process. Unlike traditional methods that rely on fixed heuristics, our selection model dynamically prioritizes nodes based on learned patterns, significantly reducing the search space while maintaining solution quality. Experimental results demonstrate that our method, trained solely on 100-node instances from uniform distribution, generalizes remarkably well to large-scale Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) instances with up to 1 million nodes from the uniform distribution and over 80K nodes from other distributions.
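The idea of shrinking the candidate set at each construction step can be sketched as follows. Here Euclidean distance stands in for the learned selection model described in the abstract, and the final choice is simply the closest candidate. Both are assumptions for illustration, since the paper's selection and scoring networks are not specified in this listing.

```python
import math


def reduced_greedy_tour(coords, k=3):
    """Build a tour, considering only k candidate nodes at each step."""
    n = len(coords)
    tour, visited = [0], {0}
    while len(tour) < n:
        cur = tour[-1]
        unvisited = [j for j in range(n) if j not in visited]
        # Search-space reduction: keep only k candidates. Distance is a
        # stand-in for the learned selection model.
        candidates = sorted(
            unvisited, key=lambda j: math.dist(coords[cur], coords[j])
        )[:k]
        # A neural policy would score the candidates; here we take the closest.
        nxt = candidates[0]
        tour.append(nxt)
        visited.add(nxt)
    return tour


# Four hypothetical nodes on a line
tour = reduced_greedy_tour([(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)], k=2)
```

The per-step scoring cost then depends on k rather than on the total number of nodes, which is what makes very large instances tractable for a constructive solver.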
[AI-32] Exploring Neural Ordinary Differential Equations as Interpretable Healthcare classifiers ACL
Link: https://arxiv.org/abs/2503.03129
Authors: Shi Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ACL SRW Submission
Abstract:Deep Learning has emerged as one of the most significant innovations in machine learning. However, a notable limitation of this field lies in the “black box” decision-making processes, which have led to skepticism within groups like healthcare and scientific communities regarding its applicability. In response, this study introduces an interpretable approach using Neural Ordinary Differential Equations (NODEs), a category of neural network models that exploit the dynamics of differential equations for representation learning. Leveraging their foundation in differential equations, we illustrate the capability of these models to continuously process textual data, marking the first such model of its kind, and thereby proposing a promising direction for future research in this domain. The primary objective of this research is to propose a novel architecture for groups like healthcare that require the predictive capabilities of deep learning while emphasizing the importance of model transparency demonstrated in NODEs.
[AI-33] A Multimodal Framework for Topic Propagation Classification in Social Networks
Link: https://arxiv.org/abs/2503.03112
Authors: Yuchuan Jiang,Chaolong Jia,Yunyi Qin,Wei Cai,Yongsen Qian
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Abstract:The rapid proliferation of the Internet and the widespread adoption of social networks have significantly accelerated information dissemination. However, this transformation has introduced complexities in information capture and processing, posing substantial challenges for researchers and practitioners. Predicting the dissemination of topic-related information within social networks has thus become a critical research focus. This paper proposes a predictive model for topic dissemination in social networks by integrating multidimensional features derived from key dissemination characteristics. Specifically, we introduce two novel indicators, user relationship breadth and user authority, into the PageRank algorithm to quantify user influence more effectively. Additionally, we employ a Text-CNN model for sentiment classification, extracting sentiment features from textual content. Temporal embeddings of nodes are encoded using a Bi-LSTM model to capture temporal dynamics. Furthermore, we refine the measurement of user interaction traces with topics, replacing traditional topic view metrics with a more precise communication characteristics measure. Finally, we integrate the extracted multidimensional features using a Transformer model, significantly enhancing predictive performance. Experimental results demonstrate that our proposed model outperforms traditional machine learning and unimodal deep learning models in terms of F1-Score, AUC, and Recall, validating its effectiveness in predicting topic propagation within social networks.
[AI-34] SoK: Knowledge is All You Need: Last Mile Delivery for Automated Provenance-based Intrusion Detection with LLMs
Link: https://arxiv.org/abs/2503.03108
Authors: Wenrui Cheng,Tiantian Zhu,Chunlin Xiong,Haofei Sun,Zijun Wang,Shunan Jing,Mingqi Lv,Yan Chen
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:Recently, provenance-based intrusion detection systems (PIDSes) have been widely proposed for endpoint threat analysis. However, due to the lack of systematic integration and utilization of knowledge, existing PIDSes still require significant manual intervention for practical deployment, making full automation challenging. This paper presents a disruptive innovation by categorizing PIDSes according to the types of knowledge they utilize. In response to the prevalent “knowledge silos” problem in existing research, we introduce a novel knowledge-driven provenance-based intrusion detection framework, powered by large language models (LLMs). We also present OmniSec, a best practice system built upon this framework. By integrating attack representation knowledge, threat intelligence knowledge, and benign behavior knowledge, OmniSec outperforms the state-of-the-art approaches on public benchmark datasets. OmniSec is available online at this https URL.
[AI-35] Hopfield Networks Meet Big Data: A Brain-Inspired Deep Learning Framework for Semantic Data Linking
Link: https://arxiv.org/abs/2503.03084
Authors: Ashwin Viswanathan Kannan,Johnson P Thomas,Abhimanyu Mukerji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE)
Comments: 7 pages
Abstract:The exponential rise in data generation has led to vast, heterogeneous datasets crucial for predictive analytics and decision-making. Ensuring data quality and semantic integrity remains a challenge. This paper presents a brain-inspired distributed cognitive framework that integrates deep learning with Hopfield networks to identify and link semantically related attributes across datasets. Modeled on the dual-hemisphere functionality of the human brain, the right hemisphere assimilates new information while the left retrieves learned representations for association. Our architecture, implemented on MapReduce with Hadoop Distributed File System (HDFS), leverages deep Hopfield networks as an associative memory mechanism to enhance recall of frequently co-occurring attributes and dynamically adjust relationships based on evolving data patterns. Experiments show that associative imprints in Hopfield memory are reinforced over time, ensuring linked datasets remain contextually meaningful and improving data disambiguation and integration accuracy. Our results indicate that combining deep Hopfield networks with distributed cognitive processing offers a scalable, biologically inspired approach to managing complex data relationships in large-scale environments.
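The associative-memory behavior this abstract relies on can be illustrated with a classical binary Hopfield network: patterns are stored with a Hebbian rule and a corrupted input is pulled back to the nearest stored pattern. This is a textbook sketch under that assumption, not the deep Hopfield architecture or the MapReduce pipeline from the paper, and the pattern is arbitrary.

```python
def train_hopfield(patterns):
    """Hebbian weights: W[i][j] sums p[i] * p[j] over patterns, zero diagonal."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j]
    return W


def recall(W, state, steps=5):
    """Synchronously set every unit to the sign of its weighted input."""
    for _ in range(steps):
        state = [
            1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
            for row in W
        ]
    return state


stored = [1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]  # last bit flipped
restored = recall(W, noisy)
```

In the data-linking setting, the stored patterns would play the role of attribute signatures that frequently co-occur, so a partial or noisy signature retrieves its learned association.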
[AI-36] ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation
Link: https://arxiv.org/abs/2503.03045
Authors: Yufei Wang,Ziyu Wang,Mino Nakura,Pratik Bhowal,Chia-Liang Kuo,Yi-Ting Chen,Zackory Erickson,David Held
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:This paper presents ArticuBot, in which a single learned policy enables a robotics system to open diverse categories of unseen articulated objects in the real world. This task has long been challenging for robotics due to the large variations in the geometry, size, and articulation types of such objects. Our system, ArticuBot, consists of three parts: generating a large number of demonstrations in physics-based simulation, distilling all generated demonstrations into a point cloud-based neural policy via imitation learning, and performing zero-shot sim2real transfer to real robotics systems. Utilizing sampling-based grasping and motion planning, our demonstration generalization pipeline is fast and effective, generating a total of 42.3k demonstrations over 322 training articulated objects. For policy learning, we propose a novel hierarchical policy representation, in which the high-level policy learns the sub-goal for the end-effector, and the low-level policy learns how to move the end-effector conditioned on the predicted goal. We demonstrate that this hierarchical approach achieves much better object-level generalization compared to the non-hierarchical version. We further propose a novel weighted displacement model for the high-level policy that grounds the prediction into the existing 3D structure of the scene, outperforming alternative policy representations. We show that our learned policy can zero-shot transfer to three different real robot settings: a fixed table-top Franka arm across two different labs, and an X-Arm on a mobile base, opening multiple unseen articulated objects across two labs, real lounges, and kitchens. Videos and code can be found on our project website: this https URL.
[AI-37] LLM Misalignment via Adversarial RLHF Platforms
Link: https://arxiv.org/abs/2503.03039
Authors: Erfan Entezami,Ali Naseh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Reinforcement learning has shown remarkable performance in aligning language models with human preferences, leading to growing interest in developing RLHF platforms. These platforms enable users to fine-tune models without requiring any expertise in developing complex machine learning algorithms. While these platforms offer useful features such as reward modeling and RLHF fine-tuning, their security and reliability remain largely unexplored. Given the growing adoption of RLHF and open-source RLHF frameworks, we investigate the trustworthiness of these systems and their potential impact on the behavior of LLMs. In this paper, we present an attack targeting publicly available RLHF tools. In our proposed attack, an adversarial RLHF platform corrupts the LLM alignment process by selectively manipulating data samples in the preference dataset. In this scenario, when a user’s task aligns with the attacker’s objective, the platform manipulates a subset of the preference dataset that contains samples related to the attacker’s target. This manipulation results in a corrupted reward model, which ultimately leads to the misalignment of the language model. Our results demonstrate that such an attack can effectively steer LLMs toward undesirable behaviors within the targeted domains. Our work highlights the critical need to explore the vulnerabilities of RLHF platforms and their potential to cause misalignment in LLMs during the RLHF fine-tuning process.
[AI-38] RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding Across Different Environments and Tasks
Link: https://arxiv.org/abs/2503.02992
Authors: Yimin Tang,Xiao Xiong,Jingyi Xi,Jiaoyang Li,Erdem Bıyık,Sven Koenig
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages
Abstract:Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard, so learning-based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community’s continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for the MAPF problem, called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy. By leveraging a CNN-based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps and agent numbers that were not seen in the training dataset.
[AI-39] Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment
Link: https://arxiv.org/abs/2503.02976
Authors: Matthew DosSantos DiSorbo,Harang Ju,Sinan Aral
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when models are handling exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning, specifically with human explanations, yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs’ shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.
[AI-40] Monocular visual simultaneous localization and mapping: (r)evolution from geometry to deep learning-based pipelines
Link: https://arxiv.org/abs/2503.02955
Authors: Olaya Alvarez-Tunon,Yury Brodskiy,Erdal Kayacan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the rise of deep learning, there is a fundamental change in visual SLAM algorithms toward developing different modules trained as end-to-end pipelines. However, regardless of the implementation domain, visual SLAM’s performance is subject to diverse environmental challenges, such as dynamic elements in outdoor environments, harsh imaging conditions in underwater environments, or blurriness in high-speed setups. These environmental challenges need to be identified to study the real-world viability of SLAM implementations. Motivated by the aforementioned challenges, this paper surveys the current state of visual SLAM algorithms according to the two main frameworks: geometry-based and learning-based SLAM. First, we introduce a general formulation of the SLAM pipeline that includes most of the implementations in the literature. Second, those implementations are classified and surveyed for geometry and learning-based SLAM. After that, environment-specific challenges are formulated to enable experimental evaluation of the resilience of different visual SLAM classes to varying imaging conditions. We address two significant issues in surveying visual SLAM, providing (1) a consistent classification of visual SLAM pipelines and (2) a robust evaluation of their performance under different deployment conditions. Finally, we give our take on future opportunities for visual SLAM implementations.
[AI-41] Reliable and Efficient Multi-Agent Coordination via Graph Neural Network Variational Autoencoders ICRA2025
Link: https://arxiv.org/abs/2503.02954
Authors: Yue Meng,Nathalie Majcherczyk,Wenliang Liu,Scott Kiesel,Chuchu Fan,Federico Pecora
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted by 2025 International Conference on Robotics and Automation (ICRA 2025)
Abstract:Multi-agent coordination is crucial for reliable multi-robot navigation in shared spaces such as automated warehouses. In regions of dense robot traffic, local coordination methods may fail to find a deadlock-free solution. In these scenarios, it is appropriate to let a central unit generate a global schedule that decides the passing order of robots. However, the runtime of such centralized coordination methods increases significantly with the problem scale. In this paper, we propose to leverage Graph Neural Network Variational Autoencoders (GNN-VAE) to solve the multi-agent coordination problem at scale faster than through centralized optimization. We formulate the coordination problem as a graph problem and collect ground truth data using a Mixed-Integer Linear Program (MILP) solver. During training, our learning framework encodes good quality solutions of the graph problem into a latent space. At inference time, solution samples are decoded from the sampled latent variables, and the lowest-cost sample is selected for coordination. Finally, the feasible proposal with the highest performance index is selected for the deployment. By construction, our GNN-VAE framework returns solutions that always respect the constraints of the considered coordination problem. Numerical results show that our approach trained on small-scale problems can achieve high-quality solutions even for large-scale problems with 250 robots, being much faster than other baselines. Project page: this https URL
[AI-42] Diverse Controllable Diffusion Policy with Signal Temporal Logic
Link: https://arxiv.org/abs/2503.02924
Authors: Yue Meng,Chuchu Fan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: Accepted by IEEE Robotics and Automation Letters (RA-L), October 2024
Abstract:Generating realistic simulations is critical for autonomous system applications such as self-driving and human-robot interactions. However, driving simulators nowadays still have difficulty in generating controllable, diverse, and rule-compliant behaviors for road participants: Rule-based models cannot produce diverse behaviors and require careful tuning, whereas learning-based methods imitate the policy from data but are not designed to follow the rules explicitly. Besides, the real-world datasets are by nature “single-outcome”, making it hard for the learning method to generate diverse behaviors. In this paper, we leverage Signal Temporal Logic (STL) and Diffusion Models to learn a controllable, diverse, and rule-aware policy. We first calibrate the STL on the real-world data, then generate diverse synthetic data using trajectory optimization, and finally learn the rectified diffusion policy on the augmented dataset. We test on the NuScenes dataset and our approach can achieve the most diverse rule-compliant trajectories compared to other baselines, with a runtime 1/17 that of the second-best approach. In the closed-loop testing, our approach reaches the highest diversity, rule satisfaction rate, and the least collision rate. Our method can generate varied characteristics conditional on different STL parameters in testing. A case study on human-robot encounter scenarios shows our approach can generate diverse and close-to-oracle trajectories. The annotation tool, augmented dataset, and code are available at this https URL.
[AI-43] Straight-Line Diffusion Model for Efficient 3D Molecular Generation
Link: https://arxiv.org/abs/2503.02918
Authors: Yuyan Ni,Shikun Feng,Haohan Chi,Bowen Zheng,Huan-ang Gao,Wei-Ying Ma,Zhi-Ming Ma,Yanyan Lan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion-based models have shown great promise in molecular generation but often require a large number of sampling steps to generate valid samples. In this paper, we introduce a novel Straight-Line Diffusion Model (SLDM) to tackle this problem, by formulating the diffusion process to follow a linear trajectory. The proposed process aligns well with the noise sensitivity characteristic of molecular structures and uniformly distributes reconstruction effort across the generative process, thus enhancing learning efficiency and efficacy. Consequently, SLDM achieves state-of-the-art performance on 3D molecule generation benchmarks, delivering a 100-fold improvement in sampling efficiency. Furthermore, experiments on toy data and image generation tasks validate the generality and robustness of SLDM, showcasing its potential across diverse generative modeling domains.
[AI-44] Towards Robust Multi-UAV Collaboration: MARL with Noise-Resilient Communication and Attention Mechanisms
Link: https://arxiv.org/abs/2503.02913
Authors: Zilin Zhao,Chishui Chen,Haotian Shi,Jiale Chen,Xuanlin Yue,Zhejian Yang,Yang Liu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Efficient path planning for unmanned aerial vehicles (UAVs) is crucial in remote sensing and information collection. As task scales expand, the cooperative deployment of multiple UAVs significantly improves information collection efficiency. However, collaborative communication and decision-making for multiple UAVs remain major challenges in path planning, especially in noisy environments. To efficiently accomplish complex information collection tasks in 3D space and address robust communication issues, we propose a multi-agent reinforcement learning (MARL) framework for UAV path planning based on the Counterfactual Multi-Agent Policy Gradients (COMA) algorithm. The framework incorporates an attention mechanism-based UAV communication protocol and a training-deployment system, significantly improving communication robustness and individual decision-making capabilities in noisy conditions. Experiments conducted on both synthetic and real-world datasets demonstrate that our method outperforms existing algorithms in terms of path planning efficiency and robustness, especially in noisy environments, achieving a 78% improvement in entropy reduction.
[AI-45] Predicting Cascade Failures in Interdependent Urban Infrastructure Networks
Link: https://arxiv.org/abs/2503.02890
Authors: Yinzhou Tang,Jinghua Piao,Huandong Wang,Shaw Rajib,Yong Li
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments:
Abstract:Cascading failures (CF) entail component breakdowns spreading through infrastructure networks, causing system-wide collapse. Predicting CFs is of great importance for infrastructure stability and urban function. Despite extensive research on CFs in single networks such as electricity and road networks, interdependencies among diverse infrastructures remain overlooked, and capturing intra-infrastructure CF dynamics amid complex evolutions poses challenges. To address these gaps, we introduce the Integrated Interdependent Infrastructure CF model (I^3), designed to capture CF dynamics both within and across infrastructures. I^3 employs a dual GAE with global pooling for intra-infrastructure dynamics and a heterogeneous graph for inter-infrastructure interactions. An initial node enhancement pre-training strategy mitigates GCN-induced over-smoothing. Experiments demonstrate that I^3 achieves boosts of 31.94% in AUC, 18.03% in Precision, 29.17% in Recall, and 22.73% in F1-score in predicting infrastructure failures, and a 28.52% reduction in RMSE for cascade volume forecasts compared to leading models. It accurately pinpoints phase transitions in interconnected and singular networks, rectifying biases in models tailored for singular networks. Access the code at this https URL.
[AI-46] Interactive Debugging and Steering of Multi-Agent AI Systems
Link: https://arxiv.org/abs/2503.02068
Authors: Will Epperson,Gagan Bansal,Victor Dibia,Adam Fourney,Jack Gerrits,Erkang Zhu,Saleema Amershi
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published at CHI 25
Abstract:Fully autonomous teams of LLM-powered AI agents are emerging that collaborate to perform complex tasks for users. What challenges do developers face when trying to build and debug these AI agent teams? In formative interviews with five AI agent developers, we identify core challenges: difficulty reviewing long agent conversations to localize errors, lack of support in current tools for interactive debugging, and the need for tool support to iterate on agent configuration. Based on these needs, we developed an interactive multi-agent debugging tool, AGDebugger, with a UI for browsing and sending messages, the ability to edit and reset prior agent messages, and an overview visualization for navigating complex message histories. In a two-part user study with 14 participants, we identify common user strategies for steering agents and highlight the importance of interactive message resets for debugging. Our studies deepen understanding of interfaces for debugging increasingly important agentic workflows.
[AI-47] Deep Causal Behavioral Policy Learning: Applications to Healthcare
Link: https://arxiv.org/abs/2503.03724
Authors: Jonas Knecht,Anna Zink,Jonathan Kolstad,Maya Petersen
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present a deep learning-based approach to studying dynamic clinical behavioral regimes in diverse non-randomized healthcare settings. Our proposed methodology - deep causal behavioral policy learning (DC-BPL) - uses deep learning algorithms to learn the distribution of high-dimensional clinical action paths, and identifies the causal link between these action paths and patient outcomes. Specifically, our approach: (1) identifies the causal effects of provider assignment on clinical outcomes; (2) learns the distribution of clinical actions a given provider would take given evolving patient information; and (3) combines these steps to identify the optimal provider for a given patient type and emulate that provider’s care decisions. Underlying this strategy, we train a large clinical behavioral model (LCBM) on electronic health records data using a transformer architecture, and demonstrate its ability to estimate clinical behavioral policies. We propose a novel interpretation of a behavioral policy learned using the LCBM: that it is an efficient encoding of complex, often implicit, knowledge used to treat a patient. This allows us to learn a space of policies that are critical to a wide range of healthcare applications, in which the vast majority of clinical knowledge is acquired tacitly through years of practice and only a tiny fraction of information relevant to patient care is written down (e.g. in textbooks, studies or standardized guidelines).
[AI-48] Collaborative Expert LLMs Guided Multi-Objective Molecular Optimization
Link: https://arxiv.org/abs/2503.03503
Authors: Jiajun Yu,Yizhen Zheng,Huan Yee Koh,Shirui Pan,Tianyue Wang,Haishuai Wang
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Molecular optimization is a crucial yet complex and time-intensive process that often acts as a bottleneck for drug development. Traditional methods rely heavily on trial and error, making multi-objective optimization both time-consuming and resource-intensive. Current AI-based methods have shown limited success in handling multi-objective optimization tasks, hampering their practical utilization. To address this challenge, we present MultiMol, a collaborative large language model (LLM) system designed to guide multi-objective molecular optimization. MultiMol comprises two agents: a data-driven worker agent and a literature-guided research agent. The data-driven worker agent is a large language model fine-tuned to generate optimized molecules considering multiple objectives, while the literature-guided research agent is responsible for searching task-related literature to find useful prior knowledge that facilitates identifying the most promising optimized candidates. In evaluations across six multi-objective optimization tasks, MultiMol significantly outperforms existing methods, achieving an 82.30% success rate, in sharp contrast to the 27.50% success rate of the strongest current methods. To further validate its practical impact, we tested MultiMol on two real-world challenges. First, we enhanced the selectivity of Xanthine Amine Congener (XAC), a promiscuous ligand that binds both A1R and A2AR, successfully biasing it towards A1R. Second, we improved the bioavailability of Saquinavir, an HIV-1 protease inhibitor with known bioavailability limitations. Overall, these results indicate that MultiMol represents a highly promising approach for multi-objective molecular optimization, holding great potential to accelerate the drug development process and contribute to the advancement of pharmaceutical research.
[AI-49] Exploring specialization and sensitivity of convolutional neural networks in the context of simultaneous image augmentations
Link: https://arxiv.org/abs/2503.03283
Authors: Pavel Kharyuk,Sergey Matveev,Ivan Oseledets
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 26 pages; main text: 5 figures, 4 tables; appendix: 4 sections, 3 tables; supplementary: 7 files (figures S1-S6: packed as 7z archive, S7: single pdf file)
Abstract:Drawing parallels with the way biological networks are studied, we adapt the treatment–control paradigm to explainable artificial intelligence research and enrich it through multi-parametric input alterations. In this study, we propose a framework for investigating the internal inference impacted by input data augmentations. The internal changes in network operation are reflected in activation changes measured by variance, which can be decomposed into components related to each augmentation, employing Sobol indices and Shapley values. These quantities enable one to visualize sensitivity to different variables and use them for guided masking of activations. In addition, we introduce a way of single-class sensitivity analysis where the candidates are filtered according to their matching to prediction bias generated by targeted damaging of the activations. Relying on the observed parallels, we assume that the developed framework can potentially be transferred to studying biological neural networks in complex environments.
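To make the variance decomposition mentioned in this abstract concrete, here is a minimal sketch of first-order Sobol indices on a toy problem. The `activation` function and the two on/off augmentation switches (`rotate`, `blur`) are hypothetical stand-ins, not the authors' setup. The index is computed from its textbook definition, S_i = Var(E[Y | X_i]) / Var(Y), over a full factorial design with independent, equally likely inputs.

```python
from itertools import product
from statistics import mean, pvariance


def activation(rotate: int, blur: int) -> float:
    """Hypothetical activation response to two on/off augmentations."""
    return 1.0 * rotate + 2.0 * blur


def first_order_sobol(func, n_inputs: int) -> list[float]:
    """First-order Sobol indices over all 2**n_inputs binary settings."""
    grid = list(product([0, 1], repeat=n_inputs))
    outputs = {x: func(*x) for x in grid}
    total_var = pvariance(list(outputs.values()))
    indices = []
    for i in range(n_inputs):
        # Conditional mean of the output for each value of input i.
        cond_means = [
            mean([y for x, y in outputs.items() if x[i] == v]) for v in (0, 1)
        ]
        indices.append(pvariance(cond_means) / total_var)
    return indices


sobol = first_order_sobol(activation, 2)
print(sobol)
```

For this additive toy function the two indices come out as 0.2 and 0.8 and sum to 1. With interacting augmentations the first-order indices would sum to less than 1, and the remainder would be attributed to interaction terms.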
[AI-50] Adaptive Entanglement Routing with Deep Q-Networks in Quantum Networks
Link: https://arxiv.org/abs/2503.02895
Authors: Lamarana Jallow,Majid Iqbal Khan
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 images. To be submitted to the Quantum journal; this is to fulfill the requirements
Abstract:The quantum internet holds transformative potential for global communication by harnessing the principles of quantum information processing. Despite significant advancements in quantum communication technologies, the efficient distribution of critical resources, such as qubits, remains a persistent and unresolved challenge. Conventional approaches often fall short of achieving optimal resource allocation, underscoring the necessity for more effective solutions. This study proposes a novel reinforcement learning-based adaptive entanglement routing framework designed to enable resource allocation tailored to the specific demands of quantum applications. The introduced QuDQN model utilizes reinforcement learning to optimize the management of quantum networks, allocate resources efficiently, and enhance entanglement routing. The model integrates key considerations, including fidelity requirements, network topology, qubit capacity, and request demands.
[AI-51] Function-Coherent Gambles with Non-Additive Sequential Dynamics
Link: https://arxiv.org/abs/2503.02889
Authors: Gregory Wheeler
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Probability (math.PR)
Comments: 10 pages, 1 figure
Abstract:The desirable gambles framework provides a rigorous foundation for imprecise probability theory but relies heavily on linear utility via its coherence axioms. In our related work, we introduced function-coherent gambles to accommodate non-linear utility. However, when repeated gambles are played over time – especially in intertemporal choice where rewards compound multiplicatively – the standard additive combination axiom fails to capture the appropriate long-run evaluation. In this paper we extend the framework by relaxing the additive combination axiom and introducing a nonlinear combination operator that effectively aggregates repeated gambles in the log-domain. This operator preserves the time-average (geometric) growth rate and addresses the ergodicity problem. We prove the key algebraic properties of the operator, discuss its impact on coherence, risk assessment, and representation, and provide a series of illustrative examples. Our approach bridges the gap between expectation values and time averages and unifies normative theory with empirically observed non-stationary reward dynamics.
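The gap between expectation values and time averages that this abstract refers to can be seen in a small worked example. The gamble below (wealth multiplied by 1.5 or 0.6 with equal probability) is an illustrative textbook-style choice, not taken from the paper, and the code is not the paper's combination operator. It only shows how aggregating in the log-domain yields the geometric growth rate.

```python
import math

# Hypothetical multiplicative gamble: wealth is multiplied by one of these
# factors, each with the given probability (illustrative numbers only).
factors = [1.5, 0.6]
probs = [0.5, 0.5]

# Ensemble (expectation-value) growth factor per round.
expected_factor = sum(p * f for p, f in zip(probs, factors))

# Time-average growth factor: aggregate in the log-domain, then exponentiate.
log_growth = sum(p * math.log(f) for p, f in zip(probs, factors))
time_avg_factor = math.exp(log_growth)

print(f"expectation value per round: {expected_factor:.4f}")
print(f"time-average growth per round: {time_avg_factor:.4f}")
```

The expectation value is above 1 while the time-average factor, the square root of 0.9, is below 1. The gamble therefore looks favorable under additive averaging but shrinks wealth in the long run when played repeatedly.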
Machine Learning
[LG-0] PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning
Link: https://arxiv.org/abs/2503.03747
Authors: Ryozo Masukawa,Sanggeon Yun,Sungheon Jeong,Wenjun Huang,Yang Ni,Ian Bryant,Nathaniel D. Bastian,Mohsen Imani
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 7 pages, 7 figures
Abstract:Traffic classification is vital for cybersecurity, yet encrypted traffic poses significant challenges. We present PacketCLIP, a multi-modal framework combining packet data with natural language semantics through contrastive pretraining and hierarchical Graph Neural Network (GNN) reasoning. PacketCLIP integrates semantic reasoning with efficient classification, enabling robust detection of anomalies in encrypted network flows. By aligning textual descriptions with packet behaviors, it offers enhanced interpretability, scalability, and practical applicability across diverse security scenarios. PacketCLIP achieves a 95% mean AUC, outperforms baselines by 11.6%, and reduces model size by 92%, making it ideal for real-time anomaly detection. By bridging advanced machine learning techniques and practical cybersecurity needs, PacketCLIP provides a foundation for scalable, efficient, and interpretable solutions to tackle encrypted traffic classification and network intrusion detection challenges in resource-constrained environments.
[LG-1] Constrained Gaussian Wasserstein Optimal Transport with Commutative Covariance Matrices
Link: https://arxiv.org/abs/2503.03744
Authors: Jun Chen,Jia Wang,Ruibin Li,Han Zhou,Wei Dong,Huan Liu,Yuanhao Yu
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Optimal transport has found widespread applications in signal processing and machine learning. Among its many equivalent formulations, optimal transport seeks to reconstruct a random variable/vector with a prescribed distribution at the destination while minimizing the expected distortion relative to a given random variable/vector at the source. However, in practice, certain constraints may render the optimal transport plan infeasible. In this work, we consider three types of constraints: rate constraints, dimension constraints, and channel constraints, motivated by perception-aware lossy compression, generative principal component analysis, and deep joint source-channel coding, respectively. Special attention is given to the setting termed Gaussian Wasserstein optimal transport, where both the source and reconstruction variables are multivariate Gaussian, and the end-to-end distortion is measured by the mean squared error. We derive explicit results for the minimum achievable mean squared error under the three aforementioned constraints when the covariance matrices of the source and reconstruction variables commute.
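For context, the unconstrained Gaussian case has a well-known closed form, which shows why commuting covariance matrices simplify the problem. The notation below (means m_1, m_2, covariances \Sigma_1, \Sigma_2, eigenvalues \lambda_i, \mu_i) is introduced here for illustration and is not taken from the paper. The paper's constrained results are not reproduced.

```latex
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\bigr)^{1/2}\Bigr).

% If \Sigma_1 \Sigma_2 = \Sigma_2 \Sigma_1, both matrices share an eigenbasis
% with paired eigenvalues \lambda_i and \mu_i, and the trace term reduces to
\operatorname{tr}\!\Bigl(\Sigma_1 + \Sigma_2 - 2\bigl(\Sigma_1\Sigma_2\bigr)^{1/2}\Bigr)
  = \sum_{i} \bigl(\sqrt{\lambda_i} - \sqrt{\mu_i}\bigr)^2 .
```

In the commuting case the minimum mean squared error therefore decouples into independent scalar terms, one per shared eigendirection.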
[LG-2] Towards Understanding Distilled Reasoning Models: A Representational Approach
Link: https://arxiv.org/abs/2503.03730
Authors: David D. Baek,Max Tegmark
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 11 figures
Abstract:In this paper, we investigate how model distillation impacts the development of reasoning features in large language models (LLMs). To explore this, we train a crosscoder on Qwen-series models and their fine-tuned variants. Our results suggest that the crosscoder learns features corresponding to various types of reasoning, including self-reflection and computation verification. Moreover, we observe that distilled models contain unique reasoning feature directions, which could be used to steer the model into over-thinking or incisive-thinking mode. In particular, we perform analysis on four specific reasoning categories: (a) self-reflection, (b) deductive reasoning, (c) alternative reasoning, and (d) contrastive reasoning. Finally, we examine the changes in feature geometry resulting from the distillation process and find indications that larger distilled models may develop more structured representations, which correlate with enhanced distillation performance. By providing insights into how distillation modifies the model, our study contributes to enhancing the transparency and reliability of AI systems.
[LG-3] Graph-Augmented LSTM for Forecasting Sparse Anomalies in Graph-Structured Time Series
Link: https://arxiv.org/abs/2503.03729
Authors: Sneh Pillai
Subjects: Machine Learning (cs.LG)
Comments: 12 pages
Abstract:Detecting anomalies in time series data is a critical task across many domains. The challenge intensifies when anomalies are sparse and the data are multivariate with relational dependencies across sensors or nodes. Traditional univariate anomaly detectors struggle to capture such cross-node dependencies, particularly in sparse anomaly settings. To address this, we propose a graph-augmented time series forecasting approach that explicitly integrates the graph of relationships among time series into an LSTM forecasting model. This enables the model to detect rare anomalies that might otherwise go unnoticed in purely univariate approaches. We evaluate the approach on two benchmark datasets - the Yahoo Webscope S5 anomaly dataset and the METR-LA traffic sensor network - and compare the performance of the Graph-Augmented LSTM against LSTM-only, ARIMA, and Prophet baselines. Results demonstrate that the graph-augmented model achieves significantly higher precision and recall, improving F1-score by up to 10% over the best baseline.
[LG-4] Handling Uncertainty in Health Data using Generative Algorithms
Link: https://arxiv.org/abs/2503.03715
Authors: Mahdi Arab Loodaricheh,Neh Majmudar,Anita Raja,Ansaf Salleb-Aouissi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Understanding and managing uncertainty is crucial in machine learning, especially in high-stakes domains like healthcare, where class imbalance can impact predictions. This paper introduces RIGA, a novel pipeline that mitigates class imbalance using generative AI. By converting tabular healthcare data into images, RIGA leverages models like cGAN, VQVAE, and VQGAN to generate balanced samples, improving classification performance. These representations are processed by CNNs and later transformed back into tabular format for seamless integration. This approach enhances traditional classifiers like XGBoost, improves Bayesian structure learning, and strengthens ML model robustness by generating realistic synthetic data for underrepresented classes.
[LG-5] A Practical Memory Injection Attack against LLM Agents
Link: https://arxiv.org/abs/2503.03704
Authors: Shen Dong,Shaocheng Xu,Pengfei He,Yige Li,Jiliang Tang,Tianming Liu,Hui Liu,Zhen Xiang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Agents based on large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, that enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps leading to undesirable agent actions when executing the victim user’s query. Specifically, we introduce a sequence of bridging steps to link the victim query to the malicious reasoning steps. During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps. We also propose a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when the victim query is processed later. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
[LG-6] Towards Trustworthy Federated Learning
链接: https://arxiv.org/abs/2503.03684
作者: Alina Basharat,Yijun Bian,Ping Xu,Zhi Tian
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:This paper develops a comprehensive framework to address three critical trustworthiness challenges in federated learning (FL): robustness against Byzantine attacks, fairness, and privacy preservation. To improve the system’s defense against Byzantine attacks that send malicious information to bias the system’s performance, we develop a Two-sided Norm Based Screening (TNBS) mechanism, which allows the central server to crop the gradients that have the l lowest norms and the h highest norms. TNBS functions as a screening tool to filter out potentially malicious participants whose gradients are far from the honest ones. To promote egalitarian fairness, we adopt q-fair federated learning (q-FFL). Furthermore, we adopt a differential privacy-based scheme to prevent raw data at local clients from being inferred by curious parties. Convergence guarantees are provided for the proposed framework under different scenarios. Experimental results on real datasets demonstrate that the proposed framework effectively improves robustness and fairness while managing the trade-off between privacy and accuracy. This work appears to be the first study that experimentally and theoretically addresses fairness, privacy, and robustness in trustworthy FL.
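The two-sided screening step described above can be illustrated with a minimal sketch. It assumes gradients arrive as plain lists of floats and that the server simply averages whatever survives the screening; the function names `tnbs_screen` and `mean_gradient` are hypothetical, and the averaging step is an assumption, since the abstract only states that the l lowest-norm and h highest-norm gradients are cropped.

```python
import math


def tnbs_screen(gradients, l, h):
    """Return indices of gradients kept after dropping the l lowest-norm
    and h highest-norm ones (indices are returned in original order)."""
    if l < 0 or h < 0 or l + h >= len(gradients):
        raise ValueError("need l, h >= 0 and l + h < number of gradients")
    # Euclidean norm of each client's gradient.
    norms = [math.sqrt(sum(x * x for x in g)) for g in gradients]
    # Client indices sorted by ascending norm.
    order = sorted(range(len(gradients)), key=lambda i: norms[i])
    kept = order[l:len(order) - h]
    return sorted(kept)


def mean_gradient(gradients, kept):
    """Average the kept gradients coordinate-wise (assumed aggregation rule)."""
    dim = len(gradients[0])
    return [sum(gradients[i][d] for i in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    grads = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
    kept = tnbs_screen(grads, l=1, h=1)
    print(kept, mean_gradient(grads, kept))
```

In this toy input the all-zero gradient and the very large gradient are both cropped, so only the two unit-norm gradients contribute to the average.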
[LG-7] Optimally Installing Strict Equilibria
链接: https://arxiv.org/abs/2503.03676
作者: Jeremy McMahan,Young Wu,Yudong Chen,Xiaojin Zhu,Qiaomin Xie
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Abstract:In this work, we develop a reward design framework for installing a desired behavior as a strict equilibrium across standard solution concepts: dominant strategy equilibrium, Nash equilibrium, correlated equilibrium, and coarse correlated equilibrium. We also extend our framework to capture the Markov-perfect equivalents of each solution concept. Central to our framework is a comprehensive mathematical characterization of strict installability, based on the desired solution concept and the behavior’s structure. These characterizations lead to efficient iterative algorithms, which we generalize to handle optimization objectives through linear programming. Finally, we explore how our results generalize to bounded rational agents.
[LG-8] Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns
链接: https://arxiv.org/abs/2503.03660
作者: Dong Tian,Ge Li,Hongyi Zhou,Onur Celik,Gerhard Neumann
类目: Machine Learning (cs.LG)
备注: 11 pages, 5 figures
Abstract:Soft Actor-Critic (SAC) critically depends on its critic network, which typically evaluates a single state-action pair to guide policy updates. Using N-step returns is a common practice to reduce the bias in the target values of the critic. However, using N-step returns can in turn introduce high variance and necessitate importance sampling, often destabilizing training. Recent algorithms have also explored action chunking (such as direct action repetition and movement primitives) to enhance exploration. In this paper, we propose a Transformer-based critic network for SAC that integrates the N-step returns framework in a stable and efficient manner. Unlike approaches that perform chunking in the actor network, we feed chunked actions into the critic network to explore potential performance gains. Our architecture leverages the Transformer’s ability to process sequential information, facilitating more robust value estimation. Empirical results show that this method not only achieves efficient, stable training but also excels in sparse-reward and multi-phase environments, which are traditionally a challenge for step-based methods. These findings underscore the promise of combining Transformer-based critics with N-step returns to advance reinforcement learning performance.
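For readers unfamiliar with N-step returns, the following minimal sketch shows the standard target computation that the abstract builds on. It is the textbook formula only, not the paper's Transformer critic: the entropy term used by SAC and any importance-sampling correction are deliberately left out, and the function name `n_step_target` is hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Compute r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V(s_n).

    rewards: the n rewards observed along the trajectory segment.
    bootstrap_value: the critic's value estimate at the state reached after n steps.
    gamma: discount factor in [0, 1].
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


if __name__ == "__main__":
    print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

With a longer reward list the target relies more on observed rewards and less on the bootstrapped estimate, which is the bias-versus-variance trade-off the abstract refers to.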
[LG-9] Robust Learning of Diverse Code Edits
链接: https://arxiv.org/abs/2503.03656
作者: Tushar Aggarwal,Swayam Singh,Abhijeet Awasthi,Aditya Kanade,Nagarajan Natarajan
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
Abstract:Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today’s code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that achieves strong results on five code-editing benchmarks, outperforming comparable size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and QwenCoder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation abilities post adaptation.
[LG-10] It’s My Data Too: Private ML for Datasets with Multi-User Training Examples
链接: https://arxiv.org/abs/2503.03622
作者: Arun Ganesh,Ryan McKenna,Brendan McMahan,Adam Smith,Fan Wu
类目: Machine Learning (cs.LG)
Abstract:We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example may be attributed to multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Training in the multi-attribution model is facilitated by solving the contribution bounding problem, i.e. the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding problem. We then empirically study this algorithm for a synthetic logistic regression task and a transformer training task, including studying variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria. We find that the baseline algorithm remains competitive with its variants in most settings, and build a better understanding of the practical importance of a bias-variance tradeoff inherent in solutions to the contribution bounding problem.
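The contribution bounding problem described above can be made concrete with a small sketch. It assumes each example is represented by the set of user IDs it is attributed to, and it uses one plausible greedy rule (scan the examples in order and keep an example only if none of its users has reached the cap). The abstract does not spell out the authors' baseline, so this rule and the name `greedy_contribution_bounding` are assumptions for illustration.

```python
from collections import Counter


def greedy_contribution_bounding(examples, k):
    """Select example indices so that every user is attributed to at most k
    selected examples.

    examples: list of sets of user IDs (one set per training example).
    k: maximum number of selected examples per user.
    """
    counts = Counter()  # selected examples attributed to each user so far
    selected = []
    for idx, users in enumerate(examples):
        # Keep the example only if all of its users still have budget left.
        if all(counts[u] < k for u in users):
            selected.append(idx)
            counts.update(users)
    return selected


if __name__ == "__main__":
    data = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
    print(greedy_contribution_bounding(data, k=1))
```

The result depends on the scan order, which is one reason variants that optimize the chosen subset under different criteria are worth comparing.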
[LG-11] A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery
链接: https://arxiv.org/abs/2503.03579
作者: Hanxin Zhang,Abdulqader Dhafer,Zhou Daniel Hao,Hongbiao Dong
类目: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:We propose a novel system for robot-to-human object handover that emulates human coworker interactions. Unlike most existing studies, which focus primarily on grasping strategies and motion planning, our system focuses on (1) inferring human handover intents and (2) imagining the spatial handover configuration. The first component integrates multimodal perception, combining visual and verbal cues, to infer human intent. The second uses a diffusion-based model to generate the handover configuration, which involves the spatial relationship among the robot’s gripper, the object, and the human hand, thereby mimicking the cognitive process of motor imagery. Experimental results demonstrate that our approach effectively interprets human cues and achieves fluent, human-like handovers, offering a promising solution for collaborative robotics. Code, videos, and data are available at: this https URL.
[LG-12] Optimal Decision Tree Pruning Revisited: Algorithms and Complexity
链接: https://arxiv.org/abs/2503.03576
作者: Juha Harviainen,Frank Sommer,Manuel Sorge,Stefan Szeider
类目: Machine Learning (cs.LG)
Abstract:We present a comprehensive classical and parameterized complexity analysis of decision tree pruning operations, extending recent research on the complexity of learning small decision trees. Thereby, we offer new insights into the computational challenges of decision tree simplification, a crucial aspect of developing interpretable and efficient machine learning models. We focus on fundamental pruning operations of subtree replacement and raising, which are used in heuristics. Surprisingly, while optimal pruning can be performed in polynomial time for subtree replacement, the problem is NP-complete for subtree raising. Therefore, we identify parameters and combinations thereof that lead to fixed-parameter tractability or hardness, establishing a precise borderline between these complexity classes. For example, while subtree raising is hard for small domain size $D$ or number $d$ of features, it can be solved in $D^{2d} \cdot |I|^{O(1)}$ time, where $|I|$ is the input size. We complement our theoretical findings with preliminary experimental results, demonstrating the practical implications of our analysis.
[LG-13] Olympus: A Jumping Quadruped for Planetary Exploration Utilizing Reinforcement Learning for In-Flight Attitude Control ICRA
链接: https://arxiv.org/abs/2503.03574
作者: Jørgen Anker Olsen,Grzegorz Malczyk,Kostas Alexis
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注: 7 pages, 6 figures, Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2025
Abstract:Exploring planetary bodies with lower gravity, such as the Moon and Mars, allows legged robots to utilize jumping as an efficient form of locomotion, giving them a valuable advantage over traditional rovers for exploration. Motivated by this fact, this paper presents the design, simulation, and learning-based “in-flight” attitude control of Olympus, a jumping legged robot tailored to the gravity of Mars. First, the design requirements are outlined, followed by a description of how simulation enabled optimizing the robot’s design, from its legs to the overall configuration, towards high vertical jumping, forward jumping distance, and in-flight attitude reorientation. Subsequently, the reinforcement learning policy used to track desired in-flight attitude maneuvers is presented. Finally, extensive experimental attitude reorientation tests demonstrate that the policy successfully crosses the sim2real gap.
[LG-14] Domain Consistent Industrial Decarbonisation of Global Coal Power Plants
链接: https://arxiv.org/abs/2503.03571
作者: Waqar Muhammad Ashraf,Vivek Dua,Ramit Debnath
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 6 figures. 17 pages
Abstract:Machine learning and optimisation techniques (MLOPT) hold significant potential to accelerate the decarbonisation of industrial systems by enabling data-driven operational improvements. However, the practical application of MLOPT in industrial settings is often hindered by a lack of domain compliance and system-specific consistency, resulting in suboptimal solutions with limited real-world applicability. To address this challenge, we propose a novel human-in-the-loop (HITL) constraint-based optimisation framework that integrates domain expertise with data-driven methods, ensuring solutions are both technically sound and operationally feasible. We demonstrate the efficacy of this framework through a case study focused on enhancing the thermal efficiency and reducing the turbine heat rate of a 660 MW supercritical coal-fired power plant. By embedding domain knowledge as constraints within the optimisation process, our approach yields solutions that align with the plant’s operational patterns and are seamlessly integrated into its control systems. Empirical validation confirms a mean improvement in thermal efficiency of 0.64% and a mean reduction in turbine heat rate of 93 kJ/kWh. Scaling our analysis to 59 global coal power plants with comparable capacity and fuel type, we estimate a cumulative lifetime reduction of 156.4 million tons of carbon emissions. These results underscore the transformative potential of our HITL-MLOPT framework in delivering domain-compliant, implementable solutions for industrial decarbonisation, offering a scalable pathway to mitigate the environmental impact of coal-based power generation worldwide.
[LG-15] Transformer-Based Power Optimization for Max-Min Fairness in Cell-Free Massive MIMO
链接: https://arxiv.org/abs/2503.03561
作者: Irched Chafaa,Giacomo Bacci,Luca Sanguinetti
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注: 5 pages, IEEE WCL, 4 figures
Abstract:Power allocation is an important task in wireless communication networks. Classical optimization algorithms and deep learning methods, while effective in small and static scenarios, become either computationally demanding or unsuitable for large and dynamic networks with varying user loads. This letter explores the potential of transformer-based deep learning models to address these challenges. We propose a transformer neural network to jointly predict optimal uplink and downlink power using only user and access point positions. The max-min fairness problem in cell-free massive multiple input multiple output systems is considered. Numerical results show that the trained model provides near-optimal performance and adapts to varying numbers of users and access points without retraining, additional processing, or updating its neural network architecture. This demonstrates the effectiveness of the proposed model in achieving robust and flexible power allocation for dynamic networks.
[LG-16] Revisiting the Role of Relearning in Semantic Dementia
链接: https://arxiv.org/abs/2503.03545
作者: Devon Jarvis,Verena Klar,Richard Klein,Benjamin Rosman,Andrew Saxe
类目: Machine Learning (cs.LG)
备注: 3 pages, 2 figures, presented at the Cognitive Computational Neuroscience Conference (CCN) 2023
Abstract:Patients with semantic dementia (SD) present with remarkably consistent atrophy of neurons in the anterior temporal lobe and behavioural impairments, such as graded loss of category knowledge. While relearning of lost knowledge has been shown in acute brain injuries such as stroke, it has not been widely supported in chronic cognitive diseases such as SD. Previous research has shown that deep linear artificial neural networks exhibit stages of semantic learning akin to humans. Here, we use a deep linear network to test the hypothesis that relearning during disease progression, rather than particular atrophy, causes the specific behavioural patterns associated with SD. After training the network to generate the common semantic features of various hierarchically organised objects, neurons are successively deleted to mimic atrophy while retraining the model. The model with relearning and deleted neurons reproduced errors specific to SD, including prototyping errors and cross-category confusions. This suggests that relearning is necessary for artificial neural networks to reproduce the behavioural patterns associated with SD in the absence of output non-linearities. Our results support a theory of SD progression that results from continuous relearning of lost information. Future research should revisit the role of relearning as a contributing factor to cognitive diseases.
[LG-17] Intrinsic and Extrinsic Factor Disentanglement for Recommendation in Various Context Scenarios
链接: https://arxiv.org/abs/2503.03524
作者: Yixin Su,Wei Jiang,Fangquan Lin,Cheng Yang,Sarah M. Erfani,Junhao Gan,Yunxiang Zhao,Ruixuan Li,Rui Zhang
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 32 pages, 13 figures, 11 tables. Accepted by Transactions of Information Systems
Abstract:In recommender systems, the patterns of user behaviors (e.g., purchase, click) may vary greatly in different contexts (e.g., time and location). This is because user behavior is jointly determined by two types of factors: intrinsic factors, which reflect consistent user preference, and extrinsic factors, which reflect external incentives that may vary in different contexts. Differentiating between intrinsic and extrinsic factors helps learn user behaviors better. However, existing studies have only considered differentiating them from a single, pre-defined context (e.g., time or location), ignoring the fact that a user’s extrinsic factors may be influenced by the interplay of various contexts at the same time. In this paper, we propose the Intrinsic-Extrinsic Disentangled Recommendation (IEDR) model, a generic framework that differentiates intrinsic from extrinsic factors considering various contexts simultaneously, enabling more accurate differentiation of factors and hence the improvement of recommendation accuracy. IEDR contains a context-invariant contrastive learning component to capture intrinsic factors, and a disentanglement component to extract extrinsic factors under the interplay of various contexts. The two components work together to achieve effective factor learning. Extensive experiments on real-world datasets demonstrate IEDR’s effectiveness in learning disentangled factors and significantly improving recommendation accuracy by up to 4% in NDCG.
[LG-18] O-RAN xApps Conflict Management using Graph Convolutional Networks
链接: https://arxiv.org/abs/2503.03523
作者: Maryam Al Shami,Jun Yan,Emmanuel Thepie Fapi
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注: 9 pages, 10 figures
Abstract:Open Radio Access Network (O-RAN) adopts a flexible, open, and virtualized structure with standardized interfaces, reducing dependency on a single supplier. Conflict management in O-RAN refers to the process of identifying and resolving conflicts between network applications. xApps are applications deployed at the RAN Intelligent Controller (RIC) that leverage advanced AI/ML algorithms to make dynamic decisions for network optimization. The lack of a unified mechanism to coordinate and prioritize the actions of different applications can create three types of conflicts (direct, indirect, and implicit). In this paper, we introduce a novel data-driven method based on Graph Convolutional Networks (GCNs), called the Graph-based xApps Conflict and Root Cause Analysis Engine (GRACE). It detects all three types of conflicts and pinpoints their root causes (xApps). GRACE captures the complex and hidden dependencies among the xApps, the controlled parameters, and the KPIs in O-RAN to detect possible conflicts. Then, it identifies the root causes (xApps) contributing to the detected conflicts. The proposed method was tested on highly imbalanced datasets in which the proportion of conflict instances ranges from 40% down to 10%. This setting simulates real-world scenarios where conflicts are rare, in order to assess the model’s performance and generalizability. Experimental results demonstrate strong performance, with an F1-score greater than 98% in all the case studies.
[LG-19] State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models
链接: https://arxiv.org/abs/2503.03499
作者: Wonjun Kang,Kevin Galim,Yuchen Zeng,Minjae Lee,Hyung Il Koo,Nam Ik Cho
类目: Machine Learning (cs.LG)
备注: Code is available at this https URL
Abstract:State Space Models (SSMs) have emerged as efficient alternatives to Transformers, mitigating their quadratic computational cost. However, the application of Parameter-Efficient Fine-Tuning (PEFT) methods to SSMs remains largely unexplored. In particular, prompt-based methods like Prompt Tuning and Prefix-Tuning, which are widely used in Transformers, do not perform well on SSMs. To address this, we propose state-based methods as a superior alternative to prompt-based methods. This new family of methods naturally stems from the architectural characteristics of SSMs. State-based methods adjust state-related features directly instead of depending on external prompts. Furthermore, we introduce a novel state-based PEFT method: State-offset Tuning. At every timestep, our method directly affects the state at the current step, leading to more effective adaptation. Through extensive experiments across diverse datasets, we demonstrate the effectiveness of our method. Code is available at this https URL.
[LG-20] Federated Learning for Predicting Mild Cognitive Impairment to Dementia Conversion
链接: https://arxiv.org/abs/2503.03489
作者: Gaurang Sharma,Elaheh Moradi,Juha Pajula,Mika Hilvo,Jussi Tohka (for the Alzheimer’s Disease Neuroimaging Initiative)
类目: Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Dementia is a progressive condition that impairs an individual’s cognitive health and daily functioning, with mild cognitive impairment (MCI) often serving as its precursor. The prediction of MCI to dementia conversion has been well studied, but previous studies have almost always focused on traditional Machine Learning (ML) based methods that require sharing sensitive clinical information to train predictive models. This study proposes a privacy-enhancing solution using Federated Learning (FL) to train predictive models for MCI to dementia conversion without sharing sensitive data, leveraging sociodemographic and cognitive measures. We simulated and compared two network architectures, peer-to-peer (P2P) and client-server, to enable collaborative learning. Our results demonstrated that FL had comparable predictive performance to centralized ML, and each clinical site showed similar performance without sharing local data. Moreover, the predictive performance of FL models was superior to that of site-specific models trained without collaboration. This work highlights that FL can eliminate the need for data sharing without compromising model efficacy.
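As a rough illustration of how a client-server setup can train a shared model without moving raw data, the sketch below shows one aggregation round in which each site sends only its locally trained parameters and its sample count. The abstract does not state which aggregation rule the study used, so the sample-weighted averaging here (in the style of FedAvg) is an assumption, and `federated_average` is a hypothetical name.

```python
def federated_average(client_weights, client_sizes):
    """Sample-weighted average of client parameter vectors.

    client_weights: list of parameter vectors (lists of floats), one per site.
    client_sizes: number of local training samples at each site.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client and at least one client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    # Each coordinate is averaged with weights proportional to local data size.
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only parameter vectors and counts cross site boundaries in this scheme; patient records never leave the clinical site.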
[LG-21] Differentially Private Learners for Heterogeneous Treatment Effects ICLR2025
链接: https://arxiv.org/abs/2503.03486
作者: Maresa Schröder,Valentyn Melnychuk,Stefan Feuerriegel
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注: Published at ICLR 2025
Abstract:Patient data is widely used to estimate heterogeneous treatment effects and thus understand the effectiveness and safety of drugs. Yet, patient data includes highly sensitive information that must be kept private. In this work, we aim to estimate the conditional average treatment effect (CATE) from observational data under differential privacy. Specifically, we present DP-CATE, a novel framework for CATE estimation that is Neyman-orthogonal and further ensures differential privacy of the estimates. Our framework is highly general: it applies to any two-stage CATE meta-learner with a Neyman-orthogonal loss function, and any machine learning model can be used for nuisance estimation. We further provide an extension of our DP-CATE, where we employ RKHS regression to release the complete CATE function while ensuring differential privacy. We demonstrate our DP-CATE across various experiments using synthetic and real-world datasets. To the best of our knowledge, we are the first to provide a framework for CATE estimation that is Neyman-orthogonal and differentially private.
[LG-22] TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology
链接: https://arxiv.org/abs/2503.03485
作者: Alexis Chevalier,Soumya Ghosh,Urvi Awasthi,James Watkins,Julia Bieniewska,Nichita Mitrea,Olga Kotova,Kirill Shkura,Andrew Noble,Michael Steinbaugh,Julien Delile,Christoph Meier,Leonid Zhukov,Iya Khalil,Srayanta Mukherjee,Judith Mueller
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Abstract:Understanding the biological mechanism of disease is critical for medicine, and in particular drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models either do not improve or only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving the state-of-the-art. First, we scaled the pre-training dataset to 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on two downstream evaluation tasks: identifying the underlying disease state of held-out donors not seen during training, and distinguishing healthy cells from diseased ones for disease conditions and donors not seen during training. Scaling experiments showed that performance improved predictably with both data volume and parameter count. Our models showed substantial improvement over existing work on the first task and more muted improvements on the second.
[LG-23] Data Poisoning Attacks to Locally Differentially Private Range Query Protocols
链接: https://arxiv.org/abs/2503.03454
作者: I-Jung Hsu,Chih-Hsun Lin,Chia-Mu Yu,Sy-Yen Kuo,Chun-Ying Huang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Trajectory data, which tracks movements through geographic locations, is crucial for improving real-world applications. However, collecting such sensitive data raises considerable privacy concerns. Local differential privacy (LDP) offers a solution by allowing individuals to locally perturb their trajectory data before sharing it. Despite its privacy benefits, LDP protocols are vulnerable to data poisoning attacks, where attackers inject fake data to manipulate aggregated results. In this work, we make the first attempt to analyze vulnerabilities in several representative LDP trajectory protocols. We propose TraP, a heuristic algorithm for data Poisoning attacks using a prefix-suffix method to optimize fake Trajectory selection, significantly reducing computational complexity. Our experimental results demonstrate that our attack can substantially increase target pattern occurrences in the perturbed trajectory dataset with few fake users. This study underscores the urgent need for robust defenses and better protocol designs to safeguard LDP trajectory data against malicious manipulation.
[LG-24] Gradient Deconfliction via Orthogonal Projections onto Subspaces For Multi-task Learning WSDM2025
链接: https://arxiv.org/abs/2503.03438
作者: Shijie Zhu,Hui Zhao,Tianshu Wu,Pengjie Wang,Hongbo Deng,Jian Xu,Bo Zheng
类目: Machine Learning (cs.LG)
备注: WSDM 2025
Abstract:Although multi-task learning (MTL) has been a preferred approach and successfully applied in many real-world scenarios, MTL models are not guaranteed to outperform single-task models on all tasks, mainly due to the negative effects of conflicting gradients among the tasks. In this paper, we fully examine the influence of conflicting gradients and further emphasize the importance and advantages of achieving non-conflicting gradients, which allow simple but effective trade-off strategies among the tasks with stable performance. Based on our findings, we propose Gradient Deconfliction via Orthogonal Projections onto Subspaces (GradOPS), where the subspaces are spanned by the other task-specific gradients. Our method not only resolves all conflicts among the tasks, but can also effectively search for diverse solutions towards different trade-off preferences among the tasks. Theoretical analysis on convergence is provided, and the performance of our algorithm is thoroughly evaluated on multiple benchmarks in various domains. Results demonstrate that our method can effectively find multiple state-of-the-art solutions with different trade-off strategies among the tasks on multiple datasets.
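To show what projecting with respect to a subspace spanned by other task gradients can look like, here is a minimal sketch. It takes one reading of the abstract: the component of a task gradient lying in the span of the other tasks' gradients is removed, so the result has zero inner product with each of them and therefore cannot conflict. When to apply the projection and how the projected gradients are recombined are not described in the abstract, so those details are omitted; `project_out` and `orthonormal_basis` are hypothetical helper names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(g, others):
    """Remove from g its component in span(others)."""
    result = list(g)
    for b in orthonormal_basis(others):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


if __name__ == "__main__":
    g_task = [1.0, 1.0, 1.0]
    g_others = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
    print(project_out(g_task, g_others))
```

Classical Gram-Schmidt is used here for brevity; a production version would more likely rely on a QR or SVD routine for numerical stability.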
[LG-25] Early-Stopped Mirror Descent for Linear Regression over Convex Bodies
链接: https://arxiv.org/abs/2503.03426
作者: Tobias Wegel,Gil Kur,Patrick Rebeschini
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Abstract:Early-stopped iterative optimization methods are widely used as alternatives to explicit regularization, and direct comparisons between early-stopping and explicit regularization have been established for many optimization geometries. However, most analyses depend heavily on the specific properties of the optimization geometry or strong convexity of the empirical objective, and it remains unclear whether early-stopping could ever be less statistically efficient than explicit regularization for some particular shape constraint, especially in the overparameterized regime. To address this question, we study the setting of high-dimensional linear regression under additive Gaussian noise when the ground truth is assumed to lie in a known convex body and the task is to minimize the in-sample mean squared error. Our main result shows that for any convex body and any design matrix, up to an absolute constant factor, the worst-case risk of unconstrained early-stopped mirror descent with an appropriate potential is at most that of the least squares estimator constrained to the convex body. We achieve this by constructing algorithmic regularizers based on the Minkowski functional of the convex body.
[LG-26] Evolutionary Prediction Games
链接: https://arxiv.org/abs/2503.03401
作者: Eden Saig,Nir Rosenfeld
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注: Comments are welcome
Abstract:When users decide whether to use a system based on the quality of predictions they receive, learning has the capacity to shape the population of users it serves - for better or worse. This work aims to study the long-term implications of this process through the lens of evolutionary game theory. We introduce and study evolutionary prediction games, designed to capture the role of learning as a driver of natural selection between groups of users, and hence a determinant of evolutionary outcomes. Our main theoretical results show that: (i) in settings with unlimited data and compute, learning tends to reinforce the survival of the fittest, and (ii) in more realistic settings, opportunities for coexistence emerge. We analyze these opportunities in terms of their stability and feasibility, present several mechanisms that can sustain their existence, and empirically demonstrate our findings using real and synthetic data.
[LG-27] Predicting Practically? Domain Generalization for Predictive Analytics in Real-world Environments
链接: https://arxiv.org/abs/2503.03399
作者: Hanyu Duan,Yi Yang,Ahmed Abbasi,Kar Yan Tam
类目: Machine Learning (cs.LG)
Abstract:Predictive machine learning models are widely used in customer relationship management (CRM) to forecast customer behaviors and support decision-making. However, the dynamic nature of customer behaviors often results in significant distribution shifts between training data and serving data, leading to performance degradation in predictive models. Domain generalization, which aims to train models that can generalize to unseen environments without prior knowledge of their distributions, has become a critical area of research. In this work, we propose a novel domain generalization method tailored to handle complex distribution shifts, encompassing both covariate and concept shifts. Our method builds upon the Distributionally Robust Optimization framework, optimizing model performance over a set of hypothetical worst-case distributions rather than relying solely on the training data. Through simulation experiments, we demonstrate the working mechanism of the proposed method. We also conduct experiments on a real-world customer churn dataset, and validate its effectiveness in both temporal and spatial generalization settings. Finally, we discuss the broader implications of our method for advancing Information Systems (IS) design research, particularly in building robust predictive models for dynamic managerial environments.
[LG-28] GNNMerge: Merging of GNN Models Without Accessing Training Data
Link: https://arxiv.org/abs/2503.03384
Authors: Vipul Garg,Ishita Thakre,Sayan Ranu
Subjects: Machine Learning (cs.LG)
Abstract:Model merging has gained prominence in machine learning as a method to integrate multiple trained models into a single model without accessing the original training data. While existing approaches have demonstrated success in domains such as computer vision and NLP, their application to Graph Neural Networks (GNNs) remains unexplored. These methods often rely on the assumption of shared initialization, which is seldom applicable to GNNs. In this work, we undertake the first benchmarking study of model merging algorithms for GNNs, revealing their limited effectiveness in this context. To address these challenges, we propose GNNMerge, which utilizes a task-agnostic node embedding alignment strategy to merge GNNs. Furthermore, we establish that under a mild relaxation, the proposed optimization objective admits direct analytical solutions for widely used GNN architectures, significantly enhancing its computational efficiency. Empirical evaluations across diverse datasets, tasks, and architectures establish GNNMerge to be up to 24% more accurate than existing methods while delivering over 2 orders of magnitude speed-up compared to training from scratch.
[LG-29] Paths and Ambient Spaces in Neural Loss Landscapes AISTATS2025
Link: https://arxiv.org/abs/2503.03382
Authors: Daniel Dold,Julius Kobialka,Nicolai Palm,Emanuel Sommer,David Rügamer,Oliver Dürr
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 9 pages, Accepted at AISTATS 2025
Abstract:Understanding the structure of neural network loss surfaces, particularly the emergence of low-loss tunnels, is critical for advancing neural network theory and practice. In this paper, we propose a novel approach to directly embed loss tunnels into the loss landscape of neural networks. Exploring the properties of these loss tunnels offers new insights into their length and structure and sheds light on some common misconceptions. We then apply our approach to Bayesian neural networks, where we improve subspace inference by identifying pitfalls and proposing a more natural prior that better guides the sampling procedure.
[LG-30] A Novel Multi-Criteria Local Latin Hypercube Refinement System for Commutation Angle Improvement in IPMSMs
Link: https://arxiv.org/abs/2503.03372
Authors: Pedram Asef,Mouloud Denai,Johannes J. H. Paulides,Bruno Ricardo Marques,Andrew Lapthorn
Subjects: Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:The commutation angle is defined as the angle between the fundamental of the motor phase current and the fundamental of the back-EMF. It can be utilised to provide a compensating effect in IPMSMs. This is due to the reluctance torque component being dependent on the commutation angle of the phase current even before entering the extended speed range. A real-time maximum torque per current and voltage strategy is demonstrated to find the trajectory and optimum commutation angles, gamma, where the level of accuracy depends on the application and available computational speed. A magnet volume reduction using a novel multi-criteria local Latin hypercube refinement (MLHR) sampling system is also presented to improve the optimisation process. The proposed new technique minimises the magnet mass to motor torque density whilst maintaining a similar phase current level. A mapping of gamma allows the determination of the optimum angles, as shown in this paper. The 3rd generation Toyota Prius IPMSM is considered as the reference motor, where the rotor configuration is altered to allow for an individual assessment.
[LG-31] Leap: Inductive Link Prediction via Learnable Topology Augmentation
Link: https://arxiv.org/abs/2503.03331
Authors: Ahmed E. Samy,Zekarias T. Kefato,Sarunas Girdzijauskas
Subjects: Machine Learning (cs.LG)
Comments: published in Machine Learning, Optimization, and Data Science, Springer Nature Switzerland
Abstract:Link prediction is a crucial task in many downstream applications of graph machine learning. To this end, the Graph Neural Network (GNN) is a widely used technique for link prediction, mainly in transductive settings, where the goal is to predict missing links between existing nodes. However, many real-life applications require an inductive setting that accommodates new nodes joining an existing graph. Thus, inductive link prediction has recently attracted considerable attention, and a multi-layer perceptron (MLP) is the popular choice of most studies to learn node representations. However, these approaches have limited expressivity and do not fully capture the graph’s structural signal. Therefore, in this work we propose LEAP, an inductive link prediction method based on LEArnable toPology augmentation. Unlike previous methods, LEAP models the inductive bias from both the structure and node features, and hence is more expressive. To the best of our knowledge, this is the first attempt to provide structural contexts for new nodes via learnable augmentation in inductive settings. Extensive experiments on seven real-world homogeneous and heterogeneous graphs demonstrate that LEAP significantly surpasses SOTA methods. The improvements are up to 22% and 17% in terms of AUC and average precision, respectively. The code and datasets are available on GitHub.
[LG-32] Differential Machine Learning for Time Series Prediction
Link: https://arxiv.org/abs/2503.03302
Authors: Akash Yadav,Eulalia Nualart
Subjects: Machine Learning (cs.LG)
Abstract:Accurate time series prediction is challenging due to the inherent nonlinearity and sensitivity to initial conditions. We propose a novel approach that enhances neural network predictions through differential learning, which involves training models on both the original time series and its differential series. Specifically, we develop a differential long short-term memory (Diff-LSTM) network that uses a shared LSTM cell to simultaneously process both data streams, effectively capturing intrinsic patterns and temporal dynamics. Evaluated on the Mackey-Glass, Lorenz, and Rössler chaotic time series, as well as a real-world financial dataset from ACI Worldwide Inc., our results demonstrate that the Diff-LSTM network outperforms prevalent models such as recurrent neural networks, convolutional neural networks, and bidirectional and encoder-decoder LSTM networks in both short-term and long-term predictions. This framework offers a promising solution for enhancing time series prediction, even when comprehensive knowledge of the underlying dynamics of the time series is not available.
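To make the idea of training on a series and its differential series concrete, here is a minimal data-preparation sketch. It assumes the differential series is the first difference of the original and that both streams are cut into aligned sliding windows. The function name `make_differential_pairs` and the windowing scheme are illustrative choices, not details taken from the paper, and the shared LSTM cell itself is not shown.

```python
import numpy as np

def make_differential_pairs(series, window):
    """Build aligned sliding windows over a series and its first differences.

    Hypothetical helper for illustration: returns (x_windows, d_windows, targets),
    where each row of x_windows holds `window` consecutive values, the matching
    row of d_windows holds the differences ending at those same time steps, and
    targets holds the value that follows each window.
    """
    series = np.asarray(series, dtype=float)
    diff = np.diff(series)   # diff[i] = series[i + 1] - series[i]
    aligned = series[1:]     # drop the first value so both streams match in length
    n = len(aligned) - window
    if n <= 0:
        raise ValueError("series is too short for the requested window")
    x_windows = np.stack([aligned[i:i + window] for i in range(n)])
    d_windows = np.stack([diff[i:i + window] for i in range(n)])
    targets = aligned[window:window + n]
    return x_windows, d_windows, targets

# Usage: a toy quadratic series with window length 2.
x, d, y = make_differential_pairs([0, 1, 4, 9, 16, 25], window=2)
```

A model in this spirit would consume `x` and `d` as two input streams and be trained to predict `y`.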
[LG-33] TrafficKAN-GCN: Graph Convolutional-based Kolmogorov-Arnold Network for Traffic Flow Optimization
Link: https://arxiv.org/abs/2503.03276
Authors: Jiayi Zhang,Yiming Zhang,Yuan Zheng,Yuchen Wang,Jinjiang You,Yuchen Xu,Wenxing Jiang,Soumyabrata Dev
Subjects: Machine Learning (cs.LG)
Comments: 21 pages, 14 figures
Abstract:Urban traffic optimization is critical for improving transportation efficiency and alleviating congestion, particularly in large-scale dynamic networks. Traditional methods, such as Dijkstra’s and Floyd’s algorithms, provide effective solutions in static settings, but they struggle with the spatial-temporal complexity of real-world traffic flows. In this work, we propose TrafficKAN-GCN, a hybrid deep learning framework combining Kolmogorov-Arnold Networks (KAN) with Graph Convolutional Networks (GCN), designed to enhance urban traffic flow optimization. By integrating KAN’s adaptive nonlinear function approximation with GCN’s spatial graph learning capabilities, TrafficKAN-GCN captures both complex traffic patterns and topological dependencies. We evaluate the proposed framework using real-world traffic data from the Baltimore Metropolitan area. Compared with baseline models such as MLP-GCN, standard GCN, and Transformer-based approaches, TrafficKAN-GCN achieves competitive prediction accuracy while demonstrating improved robustness in handling noisy and irregular traffic data. Our experiments further highlight the framework’s ability to redistribute traffic flow, mitigate congestion, and adapt to disruptive events, such as the Francis Scott Key Bridge collapse. This study contributes to the growing body of work on hybrid graph learning for intelligent transportation systems, highlighting the potential of combining KAN and GCN for real-time traffic optimization. Future work will focus on reducing computational overhead and integrating Transformer-based temporal modeling for enhanced long-term traffic prediction. The proposed TrafficKAN-GCN framework offers a promising direction for data-driven urban mobility management, balancing predictive accuracy, robustness, and computational efficiency.
[LG-34] Structural Entropy Guided Unsupervised Graph Out-Of-Distribution Detection AAAI2025
Link: https://arxiv.org/abs/2503.03241
Authors: Yue Hou,He Zhu,Ruomei Liu,Yingke Su,Jinxiang Xia,Junran Wu,Ke Xu
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025 (The 39th Annual AAAI Conference on Artificial Intelligence)
Abstract:With the emergence of huge amounts of unlabeled data, unsupervised out-of-distribution (OOD) detection is vital for ensuring the reliability of graph neural networks (GNNs) by identifying OOD samples from in-distribution (ID) ones during testing, where encountering novel or unknown data is inevitable. Existing methods often suffer from compromised performance due to redundant information in graph structures, which impairs their ability to effectively differentiate between ID and OOD data. To address this challenge, we propose SEGO, an unsupervised framework that integrates structural entropy into OOD detection for graph classification. Specifically, within the architecture of contrastive learning, SEGO introduces an anchor view in the form of a coding tree by minimizing structural entropy. The obtained coding tree effectively removes redundant information from graphs while preserving essential structural information, enabling the capture of distinct graph patterns between ID and OOD samples. Furthermore, we present a multi-grained contrastive learning scheme at local, global, and tree levels using triplet views, where coding trees with essential information serve as the anchor view. Extensive experiments on real-world datasets validate the effectiveness of SEGO, demonstrating superior performance over state-of-the-art baselines in OOD detection. Specifically, our method achieves the best performance on 9 out of 10 dataset pairs, with an average improvement of 3.7% on OOD detection datasets, significantly surpassing the best competitor by 10.8% on the FreeSolv/ToxCast dataset pair.
[LG-35] PAIR: A Novel Large Language Model-Guided Selection Strategy for Evolutionary Algorithms
Link: https://arxiv.org/abs/2503.03239
Authors: Shady Ali,Mahmoud Ashraf,Seif Hegazy,Fatty Salem,Hoda Mokhtar,Mohamed Medhat Gaber,Mohamed Taher Alrefaie
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Abstract:Evolutionary Algorithms (EAs) employ random or simplistic selection methods, limiting their exploration of solution spaces and convergence to optimal solutions. The randomness in performing crossover or mutations may limit the model’s ability to evolve efficiently. This paper introduces Preference-Aligned Individual Reciprocity (PAIR), a novel selection approach leveraging Large Language Models to emulate human-like mate selection, thereby introducing intelligence to the pairing process in EAs. PAIR prompts an LLM to evaluate individuals within a population based on genetic diversity, fitness level, and crossover compatibility, guiding more informed pairing decisions. We evaluated PAIR against a baseline method called LLM-driven EA (LMEA), published recently. Results indicate that PAIR significantly outperforms LMEA across various TSP instances, achieving lower optimality gaps and improved convergence. This performance is especially noticeable when combined with the flash thinking model, demonstrating increased population diversity to escape local optima. In general, PAIR provides a new strategy in the area of in-context learning for LLM-driven selection in EAs via sophisticated preference modelling, paving the way for improved solutions and further studies into LLM-guided optimization.
[LG-36] Online Bidding under RoS Constraints without Knowing the Value
Link: https://arxiv.org/abs/2503.03195
Authors: Sushant Vijayan,Zhe Feng,Swati Padmanabhan,Karthikeyan Shanmugam,Arun Suggala,Di Wang
Subjects: Machine Learning (cs.LG)
Abstract:We consider the problem of bidding in online advertising, where an advertiser aims to maximize value while adhering to budget and Return-on-Spend (RoS) constraints. Unlike prior work that assumes knowledge of the value generated by winning each impression (e.g., conversions), we address the more realistic setting where the advertiser must simultaneously learn the optimal bidding strategy and the value of each impression opportunity. This introduces a challenging exploration-exploitation dilemma: the advertiser must balance exploring different bids to estimate impression values with exploiting current knowledge to bid effectively. To address this, we propose a novel Upper Confidence Bound (UCB)-style algorithm that carefully manages this trade-off. Via a rigorous theoretical analysis, we prove that our algorithm achieves $\widetilde{O}(\sqrt{T\log(|\mathcal{B}|T)})$ regret and constraint violation, where $T$ is the number of bidding rounds and $\mathcal{B}$ is the domain of possible bids. This establishes the first optimal regret and constraint violation bounds for bidding in the online setting with unknown impression values. Moreover, our algorithm is computationally efficient and simple to implement. We validate our theoretical findings through experiments on synthetic data, demonstrating that our algorithm exhibits strong empirical performance compared to existing approaches.
[LG-37] Active operator learning with predictive uncertainty quantification for partial differential equations
Link: https://arxiv.org/abs/2503.03178
Authors: Nick Winovich,Mitchell Daneker,Lu Lu,Guang Lin
Subjects: Machine Learning (cs.LG); Probability (math.PR)
Comments: Submitted to the Journal of Computational Physics
Abstract:In this work, we develop a method for uncertainty quantification in deep operator networks (DeepONets) using predictive uncertainty estimates calibrated to model errors observed during training. The uncertainty framework operates using a single network, in contrast to existing ensemble approaches, and introduces minimal overhead during training and inference. We also introduce an optimized implementation for DeepONet inference (reducing evaluation times by a factor of five) to provide models well-suited for real-time applications. We evaluate the uncertainty-equipped models on a series of partial differential equation (PDE) problems, and show that the model predictions are unbiased, non-skewed, and accurately reproduce solutions to the PDEs. To assess how well the models generalize, we evaluate the network predictions and uncertainty estimates on in-distribution and out-of-distribution test datasets. We find the predictive uncertainties accurately reflect the observed model errors over a range of problems with varying complexity; simpler out-of-distribution examples are assigned low uncertainty estimates, consistent with the observed errors, while more complex out-of-distribution examples are properly assigned higher uncertainties. We also provide a statistical analysis of the predictive uncertainties and verify that these estimates are well-aligned with the observed error distributions at the tail-end of training. Finally, we demonstrate how predictive uncertainties can be used within an active learning framework to yield improvements in accuracy and data-efficiency for outer-loop optimization procedures.
[LG-38] A Predict-Then-Optimize Customer Allocation Framework for Online Fund Recommendation DASFAA2025
Link: https://arxiv.org/abs/2503.03165
Authors: Xing Tang,Yunpeng Weng,Fuyuan Lyu,Dugang Liu,Xiuqiang He
Subjects: Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by DASFAA 2025
Abstract:With the rapid growth of online investment platforms, funds can be distributed to individual customers online. The central issue is to match funds with potential customers under constraints. Most mainstream platforms adopt the recommendation formulation to tackle the problem. However, the traditional recommendation regime has its inherent drawbacks when applying the fund-matching problem with multiple constraints. In this paper, we model the fund matching under the allocation formulation. We design PTOFA, a Predict-Then-Optimize Fund Allocation framework. This data-driven framework consists of two stages, i.e., prediction and optimization, which aim to predict expected revenue based on customer behavior and optimize the impression allocation to achieve the maximum revenue under the necessary constraints, respectively. Extensive experiments on real-world datasets from an industrial online investment platform validate the effectiveness and efficiency of our solution. Additionally, the online A/B tests demonstrate PTOFA’s effectiveness in the real-world fund recommendation scenario.
[LG-39] SpinML: Customized Synthetic Data Generation for Private Training of Specialized ML Models
Link: https://arxiv.org/abs/2503.03160
Authors: Jiang Zhang,Rohan Xavier Sequeira,Konstantinos Psounis
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 17 pages (with appendix), 6 figures, Accepted at The 25th Privacy Enhancing Technologies Symposium (PETS2025)
Abstract:Specialized machine learning (ML) models tailored to users' needs and requests are increasingly being deployed on smart devices with cameras, to provide personalized intelligent services taking advantage of camera data. However, two primary challenges hinder the training of such models: the lack of publicly available labeled data suitable for specialized tasks and the inaccessibility of labeled private data due to concerns about user privacy. To address these challenges, we propose a novel system, SpinML, where the server generates customized Synthetic image data to Privately traIN a specialized ML model tailored to the user request, using only a few sanitized reference images from the user. SpinML offers users fine-grained, object-level control over the reference images, which allows users to trade off the privacy and utility of the generated synthetic data according to their privacy preferences. Through experiments on three specialized model training tasks, we demonstrate that our proposed system can enhance the performance of specialized models without compromising users' privacy preferences.
[LG-40] A Survey of Foundation Models for Environmental Science
Link: https://arxiv.org/abs/2503.03142
Authors: Runlong Yu,Shengyu Chen,Yiqun Xie,Xiaowei Jia
Subjects: Machine Learning (cs.LG)
Abstract:Modeling environmental ecosystems is essential for effective resource management, sustainable development, and understanding complex ecological processes. However, traditional methods frequently struggle with the inherent complexity, interconnectedness, and limited data of such systems. Foundation models, with their large-scale pre-training and universal representations, offer transformative opportunities by integrating diverse data sources, capturing spatiotemporal dependencies, and adapting to a broad range of tasks. This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in forward prediction, data generation, data assimilation, downscaling, model ensembling, and decision-making across domains. We also detail the development process of these models, covering data collection, architecture design, training, tuning, and evaluation. By showcasing these emerging methods, we aim to foster interdisciplinary collaboration and advance the integration of cutting-edge machine learning for sustainable solutions in environmental science.
[LG-41] Bridging Molecular Graphs and Large Language Models
Link: https://arxiv.org/abs/2503.03135
Authors: Runze Wang,Mingqi Yang,Yanming Shen
Subjects: Machine Learning (cs.LG)
Abstract:While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without fine-tuning the LLM backbone. To achieve this goal, we first construct a molecule-text paired dataset from multiple sources, including CHEBI and HMDB, to train a graph structure encoder, which reduces the distance between graph and text representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers, which are incorporated into the LLM prompts. By aligning molecular graphs as special tokens, we can activate the LLM's generalization ability for molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of our proposed Graph2Token.
[LG-42] Predicting Space Tourism Demand Using Explainable AI
Link: https://arxiv.org/abs/2503.03113
Authors: Tan-Hanh Pham,Jingchen Bi,Rodrigo Mesa-Arangom,Kim-Doang Nguyen
Subjects: Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Comprehensive forecasts of space tourism demand are crucial for businesses to optimize strategies and customer experiences in this burgeoning industry. Traditional methods struggle to capture the complex factors influencing an individual’s decision to travel to space. In this paper, we propose an explainable and trustworthy artificial intelligence framework to address the challenge of predicting space tourism demand by following the National Institute of Standards and Technology guidelines. We develop a novel machine learning network, called SpaceNet, capable of learning wide-range dependencies in data and allowing us to analyze the relationships between various factors such as age, income, and risk tolerance. We investigate space travel demand in the US, categorizing it into four types: no travel, moon travel, suborbital, and orbital travel. To this end, we collected 1860 data points across many states, cities, and age groups and then conducted our experiments with the data. In our experiments, SpaceNet achieves an average ROC-AUC of 0.82 ± 0.088, indicating strong classification performance. Our investigation demonstrated that travel price, age, annual income, gender, and fatality probability are important features in deciding whether a person wants to travel or not. Beyond demand forecasting, we use explainable AI to provide interpretation for the travel-type decisions of an individual, offering insights into the factors driving interest in space travel, which is not possible with traditional classification methods. This knowledge enables businesses to tailor marketing strategies and optimize service offerings in this rapidly evolving market. To the best of our knowledge, this is the first work to implement an explainable and interpretable AI framework for investigating the factors influencing space tourism.
[LG-43] A Linear Theory of Multi-Winner Voting
Link: https://arxiv.org/abs/2503.03082
Authors: Lirong Xia
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Abstract:We introduce a general linear framework that unifies the study of multi-winner voting rules and proportionality axioms, demonstrating that many prominent multi-winner voting rules (including Thiele methods, their sequential variants, and approval-based committee scoring rules) are linear. Similarly, key proportionality axioms such as Justified Representation (JR), Extended JR (EJR), and their strengthened variants (PJR+, EJR+), along with core stability, can fit within this linear structure as well. Leveraging PAC learning theory, we establish general and novel upper bounds on the sample complexity of learning linear mappings. Our approach yields near-optimal guarantees for diverse classes of rules, including Thiele methods and ordered weighted average rules, and can be applied to analyze the sample complexity of learning proportionality axioms such as approximate core stability. Furthermore, the linear structure allows us to leverage prior work to extend our analysis beyond worst-case scenarios to study the likelihood of various properties of linear rules and axioms. We introduce a broad class of distributions that extend Impartial Culture for approval preferences, and show that under these distributions, with high probability, any Thiele method is resolute, CORE is non-empty, and any Thiele method satisfies CORE, among other observations on the likelihood of commonly-studied properties in social choice. We believe that this linear theory offers a new perspective and powerful new tools for designing and analyzing multi-winner rules in modern social choice applications.
[LG-44] A2Perf: Real-World Autonomous Agents Benchmark
Link: https://arxiv.org/abs/2503.03056
Authors: Ikechukwu Uchendu,Jason Jabbour,Korneel Van den Berghe,Joel Runevic,Matthew Stewart,Jeffrey Ma,Srivatsan Krishnan,Izzeddin Gur,Austin Huang,Colton Bishop,Paige Bailey,Wenjie Jiang,Ebrahim M. Songhori,Sergio Guadarrama,Jie Tan,Jordan K. Terry,Aleksandra Faust,Vijay Janapa Reddi
Subjects: Machine Learning (cs.LG)
Comments: 32 pages, 12 figures, preprint
Abstract:Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and inference, among other requirements. Several methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is a lack of benchmarking suites that define the environments, datasets, and metrics which can be used to provide a meaningful way for the community to compare progress on applying these methods to real-world problems. We introduce A2Perf–a benchmark with three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. Using A2Perf, we demonstrate that web navigation agents can achieve latencies comparable to human reaction times on consumer hardware, reveal reliability trade-offs between algorithms for quadruped locomotion, and quantify the energy costs of different learning approaches for computer chip-design. In addition, we propose a data cost metric to account for the cost incurred acquiring offline data for imitation learning and hybrid algorithms, which allows us to better compare these approaches. A2Perf also contains several standard baselines, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source benchmark, A2Perf is designed to remain accessible, up-to-date, and useful to the research community over the long term.
[LG-45] Graph Transformer with Disease Subgraph Positional Encoding for Improved Comorbidity Prediction
Link: https://arxiv.org/abs/2503.03046
Authors: Xihan Qin,Li Liao
Subjects: Machine Learning (cs.LG)
Abstract:Comorbidity, the co-occurrence of multiple medical conditions in a single patient, profoundly impacts disease management and outcomes. Understanding these complex interconnections is crucial, especially in contexts where comorbidities exacerbate outcomes. Leveraging insights from the human interactome (HI) and advancements in graph-based methodologies, this study introduces Transformer with Subgraph Positional Encoding (TSPE) for disease comorbidity prediction. Inspired by Biologically Supervised Embedding (BSE), TSPE employs Transformer’s attention mechanisms and Subgraph Positional Encoding (SPE) to capture interactions between nodes and disease associations. Our proposed SPE proves more effective than LPE, as used in Dwivedi et al.'s Graph Transformer, underscoring the importance of integrating clustering and disease-specific information for improved predictive accuracy. Evaluated on real clinical benchmark datasets (RR0 and RR1), TSPE demonstrates substantial performance enhancements over the state-of-the-art method, achieving up to 28.24% higher ROC AUC and 4.93% higher accuracy. This method shows promise for adaptation to other complex graph-based tasks and applications. The source code is available in the GitHub repository at: this https URL.
[LG-46] Leveraging Randomness in Model and Data Partitioning for Privacy Amplification
Link: https://arxiv.org/abs/2503.03043
Authors: Andy Dong,Wei-Ning Chen,Ayfer Ozgur
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:We study how inherent randomness in the training process – where each sample (or client in federated learning) contributes only to a randomly selected portion of training – can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the model parameters. We apply our framework to model parallelism in federated learning, where each client updates a randomly selected subnetwork to reduce memory and computational overhead, and show that existing methods, e.g. model splitting or dropout, provide a significant privacy amplification gain not captured by previous privacy analysis techniques. Additionally, we introduce Balanced Iteration Subsampling, a new data partitioning method where each sample (or client) participates in a fixed number of training iterations. We show that this method yields stronger privacy amplification than Poisson (i.i.d.) sampling of data (or clients). Our results demonstrate that randomness in the training process, which is structured rather than i.i.d. and interacts with data in complex ways, can be systematically leveraged for significant privacy amplification.
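The contrast between Poisson (i.i.d.) sampling and a scheme with a fixed number of participations per sample can be sketched as follows. This is only an illustration: the abstract states that each sample participates in a fixed number of iterations, while choosing those iterations uniformly at random without replacement is an assumption made here. The function names and the boolean-mask representation are hypothetical.

```python
import numpy as np

def poisson_participation(num_samples, num_iters, rate, rng):
    """Poisson (i.i.d.) sampling: each sample joins each iteration
    independently with probability `rate`, so per-sample counts vary."""
    return rng.random((num_samples, num_iters)) < rate

def balanced_participation(num_samples, num_iters, k, rng):
    """Fixed-count sampling (assumed form): each sample joins exactly `k`
    iterations, chosen here uniformly at random without replacement."""
    mask = np.zeros((num_samples, num_iters), dtype=bool)
    for i in range(num_samples):
        chosen = rng.choice(num_iters, size=k, replace=False)
        mask[i, chosen] = True
    return mask

# Usage: compare per-sample participation counts under both schemes.
rng = np.random.default_rng(0)
poisson_mask = poisson_participation(8, 20, rate=5 / 20, rng=rng)
balanced_mask = balanced_participation(8, 20, k=5, rng=rng)
poisson_counts = poisson_mask.sum(axis=1)    # varies from sample to sample
balanced_counts = balanced_mask.sum(axis=1)  # exactly 5 for every sample
```

Both schemes have the same expected participation count per sample. Only the fixed-count variant removes the variance in that count.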
[LG-47] Generative assimilation and prediction for weather and climate
Link: https://arxiv.org/abs/2503.03038
Authors: Shangshang Yang,Congyi Nai,Xinyan Liu,Weidong Li,Jie Chao,Jingnan Wang,Leyi Wang,Xichen Li,Xi Chen,Bo Lu,Ziniu Xiao,Niklas Boers,Huiling Yuan,Baoxiang Pan
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Abstract:Machine learning models have shown great success in predicting weather up to two weeks ahead, outperforming process-based benchmarks. However, existing approaches mostly focus on the prediction task, and do not incorporate the necessary data assimilation. Moreover, these models suffer from error accumulation in long roll-outs, limiting their applicability to seasonal predictions or climate projections. Here, we introduce Generative Assimilation and Prediction (GAP), a unified deep generative framework for assimilation and prediction of both weather and climate. By learning to quantify the probabilistic distribution of atmospheric states under observational, predictive, and external forcing constraints, GAP excels in a broad range of weather-climate related tasks, including data assimilation, seamless prediction, and climate simulation. In particular, GAP is competitive with state-of-the-art ensemble assimilation, probabilistic weather forecast and seasonal prediction, yields stable millennial simulations, and reproduces climate variability from daily to decadal time scales.
[LG-48] Intrusion Detection in IoT Networks Using Hyperdimensional Computing: A Case Study on the NSL-KDD Dataset
链接: https://arxiv.org/abs/2503.03037
作者: Ghazal Ghajari,Elaheh Ghajari,Hossein Mohammadi,Fathi Amsaad
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The rapid expansion of Internet of Things (IoT) networks has introduced new security challenges, necessitating efficient and reliable methods for intrusion detection. In this study, a detection framework based on hyperdimensional computing (HDC) is proposed to identify and classify network intrusions using the NSL-KDD dataset, a standard benchmark for intrusion detection systems. By leveraging the capabilities of HDC, including high-dimensional representation and efficient computation, the proposed approach effectively distinguishes various attack categories such as DoS, probe, R2L, and U2R, while accurately identifying normal traffic patterns. Comprehensive evaluations demonstrate that the proposed method achieves an accuracy of 99.54%, significantly outperforming conventional intrusion detection techniques, making it a promising solution for IoT network security. This work emphasizes the critical role of robust and precise intrusion detection in safeguarding IoT systems against evolving cyber threats.
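To make the idea of hyperdimensional classification concrete, here is a minimal sketch of one common HDC recipe: bind a feature-ID hypervector to a level hypervector, bundle the results by summation, and classify by cosine similarity to class prototypes. The dimensionality, the encoding, and the four-feature toy records are all assumptions for illustration; the abstract does not describe the authors' encoding, and this is not NSL-KDD data.

```python
import math
import random

DIM = 2000  # hypervector dimensionality (assumed value)

def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def encode(features, id_hvs, level_hvs):
    """Bind each feature's ID vector to its level vector, then bundle by summation."""
    total = [0] * DIM
    for idx, level in enumerate(features):
        fid, lvl = id_hvs[idx], level_hvs[level]
        for d in range(DIM):
            total[d] += fid[d] * lvl[d]
    return total

def train(samples, labels, id_hvs, level_hvs):
    """Build one prototype per class by summing the encodings of its samples."""
    prototypes = {}
    for x, y in zip(samples, labels):
        proto = prototypes.setdefault(y, [0] * DIM)
        for d, value in enumerate(encode(x, id_hvs, level_hvs)):
            proto[d] += value
    return prototypes

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(x, prototypes, id_hvs, level_hvs):
    """Return the label whose prototype is most similar to the encoded query."""
    query = encode(x, id_hvs, level_hvs)
    return max(prototypes, key=lambda y: cosine(query, prototypes[y]))

rng = random.Random(0)
id_hvs = [random_hv(rng) for _ in range(4)]     # one ID vector per feature
level_hvs = [random_hv(rng) for _ in range(4)]  # one vector per quantized level
# Toy records with four quantized features each (hypothetical data).
samples = [[0, 0, 1, 0], [0, 1, 0, 0], [3, 3, 2, 3], [3, 2, 3, 3]]
labels = ["normal", "normal", "attack", "attack"]
prototypes = train(samples, labels, id_hvs, level_hvs)
print(classify([0, 0, 0, 0], prototypes, id_hvs, level_hvs))
```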
[LG-49] Network Anomaly Detection for IoT Using Hyperdimensional Computing on NSL-KDD
链接: https://arxiv.org/abs/2503.03031
作者: Ghazal Ghajari,Ashutosh Ghimire,Elaheh Ghajari,Fathi Amsaad
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:With the rapid growth of IoT devices, ensuring robust network security has become a critical challenge. Traditional intrusion detection systems (IDSs) often face limitations in detecting sophisticated attacks within high-dimensional and complex data environments. This paper presents a novel approach to network anomaly detection using hyperdimensional computing (HDC) techniques, specifically applied to the NSL-KDD dataset. The proposed method leverages the efficiency of HDC in processing large-scale data to identify both known and unknown attack patterns. The model achieved an accuracy of 91.55% on the KDDTrain+ subset, outperforming traditional approaches. These comparative evaluations underscore the model’s superior performance, highlighting its potential in advancing anomaly detection for IoT networks and contributing to more secure and intelligent cybersecurity solutions.
[LG-50] Hierarchical Refinement: Optimal Transport to Infinity and Beyond
链接: https://arxiv.org/abs/2503.03025
作者: Peter Halmos,Julian Gold,Xinhao Liu,Benjamin J. Raphael
类目: Machine Learning (cs.LG)
*备注: 32 pages, 9 figures
Abstract:Optimal transport (OT) has enjoyed great success in machine-learning as a principled way to align datasets via a least-cost correspondence. This success was driven in large part by the runtime efficiency of the Sinkhorn algorithm [Cuturi 2013], which computes a coupling between points from two datasets. However, Sinkhorn has quadratic space complexity in the number of points, limiting the scalability to larger datasets. Low-rank OT achieves linear-space complexity, but by definition, cannot compute a one-to-one correspondence between points. When the optimal transport problem is an assignment problem between datasets then the optimal mapping, known as the Monge map, is guaranteed to be a bijection. In this setting, we show that the factors of an optimal low-rank coupling co-cluster each point with its image under the Monge map. We leverage this invariant to derive an algorithm, Hierarchical Refinement (HiRef), that dynamically constructs a multiscale partition of a dataset using low-rank OT subproblems, culminating in a bijective coupling. Hierarchical Refinement uses linear space and has log-linear runtime, retaining the space advantage of low-rank OT while overcoming its limited resolution. We demonstrate the advantages of Hierarchical Refinement on several datasets, including ones containing over a million points, scaling full-rank OT to problems previously beyond Sinkhorn’s reach.
[LG-51] Quantum Non-Linear Bandit Optimization
链接: https://arxiv.org/abs/2503.03023
作者: Zakaria Shams Siam,Chaowen Guan,Chong Liu
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:We study non-linear bandit optimization where the learner maximizes a black-box function with a zeroth-order function oracle, which has been successfully applied in many critical applications such as drug discovery and hyperparameter tuning. Existing works have shown that with the aid of quantum computing, it is possible to break the $\Omega(\sqrt{T})$ regret lower bound in classical settings and achieve the new $O(\mathrm{poly}\log T)$ upper bound. However, they usually assume that the objective function sits within the reproducing kernel Hilbert space and their algorithms suffer from the curse of dimensionality. In this paper, we propose the new Q-NLB-UCB algorithm which uses the novel parametric function approximation technique and enjoys performance improvement due to quantum fast-forward and quantum Monte Carlo mean estimation. We prove that the regret bound of Q-NLB-UCB is not only $O(\mathrm{poly}\log T)$ but also input dimension-free, making it applicable for high-dimensional tasks. At the heart of our analyses are a new quantum regression oracle and a careful construction of parameter uncertainty region. Our algorithm is also validated for its efficiency on both synthetic and real-world tasks.
[LG-52] Generative Active Adaptation for Drifting and Imbalanced Network Intrusion Detection
链接: https://arxiv.org/abs/2503.03022
作者: Ragini Gupta,Shinan Liu,Ruixiao Zhang,Xinyue Hu,Pranav Kommaraju,Xiaoyang Wang,Hadjer Benkraouda,Nick Feamster,Klara Nahrstedt
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning has shown promise in network intrusion detection systems, yet its performance often degrades due to concept drift and imbalanced data. These challenges are compounded by the labor-intensive process of labeling network traffic, especially when dealing with evolving and rare attack types, which makes selecting the right data for adaptation difficult. To address these issues, we propose a generative active adaptation framework that minimizes labeling effort while enhancing model robustness. Our approach employs density-aware active sampling to identify the most informative samples for annotation and leverages deep generative models to synthesize diverse samples, thereby augmenting the training set and mitigating the effects of concept drift. We evaluate our end-to-end framework on both simulated IDS data and a real-world ISP dataset, demonstrating significant improvements in intrusion detection performance. Our method boosts the overall F1-score from 0.60 (without adaptation) to 0.86. Rare attacks such as Infiltration, Web Attack, and FTP-BruteForce, which originally achieve F1 scores of 0.001, 0.04, and 0.00, improve to 0.30, 0.50, and 0.71, respectively, with generative active adaptation in the CIC-IDS 2018 dataset. Our framework effectively enhances rare attack detection while reducing labeling costs, making it a scalable and adaptive solution for real-world intrusion detection.
[LG-53] Classifying States of the Hopfield Network with Improved Accuracy Generalization and Interpretability
链接: https://arxiv.org/abs/2503.03018
作者: Hayden McAlister,Anthony Robins,Lech Szymanski
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:We extend the existing work on Hopfield network state classification, employing more complex models that remain interpretable, such as densely-connected feed-forward deep neural networks and support vector machines. The states of the Hopfield network can be grouped into several classes, including learned (those presented during training), spurious (stable states that were not learned), and prototype (stable states that were not learned but are representative for a subset of learned states). It is often useful to determine which class a given state belongs to; for example, to ignore spurious states when retrieving from the network. Previous research has approached the state classification task with simple linear methods, most notably the stability ratio. We deepen the research on classifying states from prototype-regime Hopfield networks, investigating how varying the factors strengthening prototypes influences the state classification task. We study the generalizability of different classification models when trained on states derived from different prototype tasks – for example, can a network trained on a Hopfield network with 10 prototypes classify states from a network with 20 prototypes? We find that simple models often outperform the stability ratio while remaining interpretable. These models require surprisingly little training data and generalize exceptionally well to states generated by a range of Hopfield networks, even those that were trained on exceedingly different datasets.
[LG-54] Multi-Step Deep Koopman Network (MDK-Net) for Vehicle Control in Frenet Frame IROS2025
链接: https://arxiv.org/abs/2503.03002
作者: Mohammad Abtahi,Mahdis Rabbani,Armin Abdolmohammadi,Shima Nazari
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: This work has been submitted for IROS 2025 conference
Abstract:The highly nonlinear dynamics of vehicles present a major challenge for the practical implementation of optimal and Model Predictive Control (MPC) approaches in path planning and following. Koopman operator theory offers a global linear representation of nonlinear dynamical systems, making it a promising framework for optimization-based vehicle control. This paper introduces a novel deep learning-based Koopman modeling approach that employs deep neural networks to capture the full vehicle dynamics, from pedal and steering inputs to chassis states, within a curvilinear Frenet frame. The superior accuracy of the Koopman model compared to identified linear models is shown for a double lane change maneuver. Furthermore, it is shown that an MPC controller deploying the Koopman model provides significantly improved performance while maintaining computational efficiency comparable to a linear MPC.
[LG-55] Out-of-Distribution Generalization on Graphs via Progressive Inference AAAI2025
链接: https://arxiv.org/abs/2503.02988
作者: Yiming Xu,Bin Shi,Zhen Peng,Huixiang Liu,Bo Dong,Chen Chen
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI2025
Abstract:The development and evaluation of graph neural networks (GNNs) generally follow the independent and identically distributed (i.i.d.) assumption. Yet this assumption is often untenable in practice due to the uncontrollable data generation mechanism. In particular, when the data distribution shows a significant shift, most GNNs would fail to produce reliable predictions and may even make decisions randomly. One of the most promising solutions to improve the model generalization is to pick out causal invariant parts in the input graph. Nonetheless, we observe a significant distribution gap between the causal parts learned by existing methods and the ground truth, leading to undesirable performance. In response to the above issues, this paper presents GPro, a model that learns graph causal invariance with progressive inference. Specifically, the complicated graph causal invariant learning is decomposed into multiple intermediate inference steps from easy to hard, and the perception of GPro is continuously strengthened through a progressive inference process to extract causal features that are stable to distribution shifts. We also enlarge the training distribution by creating counterfactual samples to enhance the capability of the GPro in capturing the causal invariant parts. Extensive experiments demonstrate that our proposed GPro outperforms the state-of-the-art methods by 4.91% on average. For datasets with more severe distribution shifts, the performance improvement can be up to 6.86%.
[LG-56] Integrating Predictive and Generative Capabilities by Latent Space Design via the DKL-VAE Model
链接: https://arxiv.org/abs/2503.02978
作者: Boris N. Slautin,Utkarsh Pratiush,Doru C. Lupascu,Maxim A. Ziatdinov,Sergei V. Kalinin
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 25 pages, 15 figures
Abstract:We introduce a Deep Kernel Learning Variational Autoencoder (VAE-DKL) framework that integrates the generative power of a Variational Autoencoder (VAE) with the predictive nature of Deep Kernel Learning (DKL). The VAE learns a latent representation of high-dimensional data, enabling the generation of novel structures, while DKL refines this latent space by structuring it in alignment with target properties through Gaussian Process (GP) regression. This approach preserves the generative capabilities of the VAE while enhancing its latent space for GP-based property prediction. We evaluate the framework on two datasets: a structured card dataset with predefined variational factors and the QM9 molecular dataset, where enthalpy serves as the target function for optimization. The model demonstrates high-precision property prediction and enables the generation of novel out-of-training subset structures with desired characteristics. The VAE-DKL framework offers a promising approach for high-throughput material discovery and molecular design, balancing structured latent space organization with generative flexibility.
[LG-57] Privacy-Preserving Fair Synthetic Tabular Data
链接: https://arxiv.org/abs/2503.02968
作者: Fatima J. Sarmin,Atiquer R. Rahman,Christopher J. Henry,Noman Mohammed
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Sharing of tabular data containing valuable but private information is limited due to legal and ethical issues. Synthetic data could be an alternative solution to this sharing problem, as it is artificially generated by machine learning algorithms and tries to capture the underlying data distribution. However, machine learning models are not free from memorization and may introduce biases, as they rely on training data. Producing synthetic data that preserves privacy and fairness while maintaining utility close to the real data is a challenging task. This research simultaneously addresses both the privacy and fairness aspects of synthetic data, an area not explored by other studies. In this work, we present PF-WGAN, a privacy-preserving, fair synthetic tabular data generator based on the WGAN-GP model. We have modified the original WGAN-GP by adding privacy and fairness constraints forcing it to produce privacy-preserving fair data. This approach will enable the publication of datasets that protect individual’s privacy and remain unbiased toward any particular group. We compared the results with three state-of-the-art synthetic data generator models in terms of utility, privacy, and fairness across four different datasets. We found that the proposed model exhibits a more balanced trade-off among utility, privacy, and fairness.
[LG-58] Koopman-Based Generalization of Deep Reinforcement Learning With Application to Wireless Communications
链接: https://arxiv.org/abs/2503.02961
作者: Atefeh Termehchi,Ekram Hossain,Isaac Woungang
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Deep Reinforcement Learning (DRL) is a key machine learning technology driving progress across various scientific and engineering fields, including wireless communication. However, its limited interpretability and generalizability remain major challenges. In supervised learning, generalizability is commonly evaluated through the generalization error using information-theoretic methods. In DRL, the training data is sequential and not independent and identically distributed (i.i.d.), rendering traditional information-theoretic methods unsuitable for generalizability analysis. To address this challenge, this paper proposes a novel analytical method for evaluating the generalizability of DRL. Specifically, we first model the evolution of states and actions in trained DRL algorithms as unknown discrete, stochastic, and nonlinear dynamical functions. Then, we employ a data-driven identification method, the Koopman operator, to approximate these functions, and propose two interpretable representations. Based on these interpretable representations, we develop a rigorous mathematical approach to evaluate the generalizability of DRL algorithms. This approach is formulated using the spectral feature analysis of the Koopman operator, leveraging the H_\infty norm. Finally, we apply this generalization analysis to compare the soft actor-critic method, widely recognized as a robust DRL approach, against the proximal policy optimization algorithm for an unmanned aerial vehicle-assisted mmWave wireless communication scenario.
[LG-59] Deal: Distributed End-to-End GNN Inference for All Nodes
链接: https://arxiv.org/abs/2503.02960
作者: Shiyang Chen,Xiang Song,Vasiloudis Theodore,Hang Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Graph Neural Networks (GNNs) are a new research frontier with various applications and successes. End-to-end inference for all nodes is common for GNN embedding models, which are widely adopted in applications like recommendation and advertising. While sharing opportunities arise in GNN tasks (i.e., inference for a few nodes and training), the potential for sharing in full graph end-to-end inference is largely underutilized because traditional efforts fail to fully extract sharing benefits due to overwhelming overheads or excessive memory usage. This paper introduces Deal, a distributed GNN inference system that is dedicated to end-to-end inference for all nodes for graphs with multi-billion edges. First, we unveil and exploit an untapped sharing opportunity during sampling, and maximize the benefits from sharing during subsequent GNN computation. Second, we introduce memory-saving and communication-efficient distributed primitives for lightweight 1-D graph and feature tensor collaborative partitioning-based distributed inference. Third, we introduce partitioned, pipelined communication and fusing feature preparation with the first GNN primitive for end-to-end inference. With Deal, the end-to-end inference time on real-world benchmark datasets is reduced up to 7.70x and the graph construction time is reduced up to 21.05x, compared to the state-of-the-art.
[LG-60] Node-level Contrastive Unlearning on Graph Neural Networks
链接: https://arxiv.org/abs/2503.02959
作者: Hong kyu Lee,Qiuchen Zhang,Carl Yang,Li Xiong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Graph unlearning aims to remove a subset of graph entities (i.e. nodes and edges) from a graph neural network (GNN) trained on the graph. Unlike machine unlearning for models trained on Euclidean-structured data, effectively unlearning a model trained on non-Euclidean-structured data, such as graphs, is challenging because graph entities exhibit mutual dependencies. Existing works utilize graph partitioning, influence function, or additional layers to achieve graph unlearning. However, none of them can achieve high scalability and effectiveness without additional constraints. In this paper, we achieve more effective graph unlearning by utilizing the embedding space. The primary training objective of a GNN is to generate proper embeddings for each node that encapsulate both structural information and node feature representations. Thus, directly optimizing the embedding space can effectively remove the target nodes’ information from the model. Based on this intuition, we propose node-level contrastive unlearning (Node-CUL). It removes the influence of the target nodes (unlearning nodes) by contrasting the embeddings of remaining nodes and neighbors of unlearning nodes. Through iterative updates, the embeddings of unlearning nodes gradually become similar to those of unseen nodes, effectively removing the learned information without directly incorporating unseen data. In addition, we introduce a neighborhood reconstruction method that optimizes the embeddings of the neighbors in order to remove the influence of unlearning nodes while maintaining the utility of the GNN model. Experiments on various graph data and models show that our Node-CUL achieves the best unlearning efficacy and enhanced model utility while requiring computing resources comparable to existing frameworks.
[LG-61] Robust time series generation via Schrödinger Bridge: a comprehensive evaluation
链接: https://arxiv.org/abs/2503.02943
作者: Alexandre Alouadi,Baptiste Barreau,Laurent Carlier,Huyên Pham
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 11 pages
Abstract:We investigate the generative capabilities of the Schrödinger Bridge (SB) approach for time series. The SB framework formulates time series synthesis as an entropic optimal interpolation transport problem between a reference probability measure on path space and a target joint distribution. This results in a stochastic differential equation over a finite horizon that accurately captures the temporal dynamics of the target time series. While the SB approach has been largely explored in fields like image generation, there is a scarcity of studies for its application to time series. In this work, we bridge this gap by conducting a comprehensive evaluation of the SB method’s robustness and generative performance. We benchmark it against state-of-the-art (SOTA) time series generation methods across diverse datasets, assessing its strengths, limitations, and capacity to model complex temporal dependencies. Our results offer valuable insights into the SB framework’s potential as a versatile and robust tool for time series generation.
[LG-62] Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis ISCA
链接: https://arxiv.org/abs/2503.02907
作者: Samuel S. Sohn,Sten Knutsen,Karin Stromswold
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Appears in Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:Prosody plays a crucial role in speech perception, influencing both human understanding and automatic speech recognition (ASR) systems. Despite its importance, prosodic stress remains under-studied due to the challenge of efficiently analyzing it. This study explores fine-tuning OpenAI’s Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. Using a dataset of 66 native English speakers, including male, female, neurotypical, and neurodivergent individuals, we assess the model’s ability to generalize stress patterns and classify speakers by neurotype and gender based on brief speech samples. Our results highlight near-human accuracy in ASR performance across all three stress types and near-perfect precision in classifying gender and neurotype. By improving prosody-aware ASR, this work contributes to equitable and robust transcription technologies for diverse populations.
[LG-63] Opportunistic Routing in Wireless Communications via Learnable State-Augmented Policies
链接: https://arxiv.org/abs/2503.03736
作者: Sourajit Das,Navid NaderiAlizadeh,Rahul Mangharam,Alejandro Ribeiro
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper addresses the challenge of packet-based information routing in large-scale wireless communication networks. The problem is framed as a constrained statistical learning task, where each network node operates using only local information. Opportunistic routing exploits the broadcast nature of wireless communication to dynamically select optimal forwarding nodes, enabling the information to reach the destination through multiple relay nodes simultaneously. To solve this, we propose a State-Augmentation (SA) based distributed optimization approach aimed at maximizing the total information handled by the source nodes in the network. The problem formulation leverages Graph Neural Networks (GNNs), which perform graph convolutions based on the topological connections between network nodes. Using an unsupervised learning paradigm, we extract routing policies from the GNN architecture, enabling optimal decisions for source nodes across various flows. Numerical experiments demonstrate that the proposed method achieves superior performance when training a GNN-parameterized model, particularly when compared to baseline algorithms. Additionally, applying the method to real-world network topologies and wireless ad-hoc network test beds validates its effectiveness, highlighting the robustness and transferability of GNNs.
[LG-64] Finite-sample valid prediction of future insurance claims in the regression problem
链接: https://arxiv.org/abs/2503.03659
作者: Liang Hong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:In the current insurance literature, prediction of insurance claims in the regression problem is often performed with a statistical model. This model-based approach may suffer from several drawbacks: (i) model misspecification, (ii) selection effect, and (iii) lack of finite-sample validity. This article addresses these three issues simultaneously by employing conformal prediction, a general machine learning strategy for valid predictions. The proposed method is both model-free and tuning-parameter-free. It also guarantees finite-sample validity at a pre-assigned coverage probability level.
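For readers unfamiliar with how conformal prediction obtains finite-sample coverage, the sketch below shows split conformal prediction, a common variant. The abstract does not say which conformal procedure the article uses, so this is a swapped-in illustration rather than the author's method. It assumes absolute residuals from a held-out calibration set and exchangeable data; the function name and the numbers are hypothetical.

```python
import math

def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Return an interval with at least 1 - alpha marginal coverage under exchangeability."""
    n = len(cal_residuals)
    # Finite-sample correction: use the ceil((n + 1)(1 - alpha))-th smallest residual.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return (-math.inf, math.inf)
    q = sorted(cal_residuals)[k - 1]
    return (prediction - q, prediction + q)

# Hypothetical absolute residuals |y - y_hat| from ten calibration claims.
residuals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
interval = split_conformal_interval(residuals, prediction=50.0, alpha=0.1)
```

The `(n + 1)` factor in the rank is what turns an empirical quantile into a guarantee that holds for any sample size, not only asymptotically.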
[LG-65] Limits of nonlinear and dispersive fiber propagation for photonic extreme learning
链接: https://arxiv.org/abs/2503.03649
作者: Andrei V. Ermolaev,Mathilde Hary,Lev Leybov,Piotr Ryczkowski,Anas Skalli,Daniel Brunner,Goëry Genty,John M. Dudley
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注:
Abstract:We report a generalized nonlinear Schrödinger equation simulation model of an extreme learning machine based on optical fiber propagation. Using handwritten digit classification as a benchmark, we study how accuracy depends on propagation dynamics, as well as parameters governing spectral encoding, readout, and noise. Test accuracies of over 91% and 93% are found for propagation in the anomalous and normal dispersion regimes respectively. Our simulation results also suggest that quantum noise on the input pulses introduces an intrinsic penalty to ELM performance.
[LG-66] Feature Matching Intervention: Leveraging Observational Data for Causal Representation Learning
链接: https://arxiv.org/abs/2503.03634
作者: Haoze Li,Jun Xie
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:A major challenge in causal discovery from observational data is the absence of perfect interventions, making it difficult to distinguish causal features from spurious ones. We propose an innovative approach, Feature Matching Intervention (FMI), which uses a matching procedure to mimic perfect interventions. We define causal latent graphs, extending structural causal models to latent feature space, providing a framework that connects FMI with causal graph learning. Our feature matching procedure emulates perfect interventions within these causal latent graphs. Theoretical results demonstrate that FMI exhibits strong out-of-distribution (OOD) generalizability. Experiments further highlight FMI’s superior performance in effectively identifying causal features solely from observational data.
[LG-67] Deterministic Global Optimization of the Acquisition Function in Bayesian Optimization: To Do or Not To Do?
链接: https://arxiv.org/abs/2503.03625
作者: Anastasia Georgiou,Daniel Jungen,Luise Kaven,Verena Hunstig,Constantine Frangakis,Ioannis Kevrekidis,Alexander Mitsos
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 32 pages, 7 figures, 7 tables
Abstract:Bayesian Optimization (BO) with Gaussian Processes relies on optimizing an acquisition function to determine sampling. We investigate the advantages and disadvantages of using a deterministic global solver (MAiNGO) compared to conventional local and stochastic global solvers (L-BFGS-B and multi-start, respectively) for the optimization of the acquisition function. For CPU efficiency, we set a time limit for MAiNGO, taking the best point as optimal. We perform repeated numerical experiments, initially using the Muller-Brown potential as a benchmark function, utilizing the lower confidence bound acquisition function; we further validate our findings with three alternative benchmark functions. Statistical analysis reveals that when the acquisition function is more exploitative (as opposed to exploratory), BO with MAiNGO converges in fewer iterations than with the local solvers. However, when the dataset lacks diversity, or when the acquisition function is overly exploitative, BO with MAiNGO, compared to the local solvers, is more likely to converge to a local rather than a globally near-optimal solution of the black-box function. L-BFGS-B and multi-start mitigate this risk in BO by introducing stochasticity in the selection of the next sampling point, which enhances the exploration of uncharted regions in the search space and reduces dependence on acquisition function hyperparameters. Ultimately, suboptimal optimization of poorly chosen acquisition functions may be preferable to their optimal solution. When the acquisition function is more exploratory, BO with MAiNGO, multi-start, and L-BFGS-B achieve comparable probabilities of convergence to a globally near-optimal solution (although BO with MAiNGO may require more iterations to converge under these conditions).
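The exploitative versus exploratory distinction in the abstract above can be seen directly in the lower confidence bound (LCB) acquisition function for minimization, which in its standard form is the posterior mean minus a multiple of the posterior standard deviation. The sketch below replaces the solvers discussed in the paper with a simple argmin over a fixed candidate list; the weight name `kappa` and all numbers are assumptions for illustration.

```python
def lower_confidence_bound(mean, std, kappa):
    """LCB for minimization: small kappa is exploitative, large kappa is exploratory."""
    return mean - kappa * std

def next_sample(candidates, means, stds, kappa):
    """Pick the candidate with the smallest LCB (grid argmin, not a global solver)."""
    scores = [lower_confidence_bound(m, s, kappa) for m, s in zip(means, stds)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best]

# Hypothetical Gaussian-process posterior at three candidate points.
candidates = [0.0, 0.5, 1.0]
means = [1.0, 0.8, 1.2]
stds = [0.1, 0.1, 0.6]

exploit_choice = next_sample(candidates, means, stds, kappa=0.5)
explore_choice = next_sample(candidates, means, stds, kappa=2.0)
```

With the small weight the point of lowest predicted mean is selected, while the larger weight moves the choice to the most uncertain candidate.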
[LG-68] Probabilistic Insights for Efficient Exploration Strategies in Reinforcement Learning
链接: https://arxiv.org/abs/2503.03565
作者: Ernesto Garcia,Paola Bermolen,Matthieu Jonckheere,Seva Shneer
类目: Probability (math.PR); Machine Learning (cs.LG)
*备注:
Abstract:We investigate efficient exploration strategies of environments with unknown stochastic dynamics and sparse rewards. Specifically, we analyze first the impact of parallel simulations on the probability of reaching rare states within a finite time budget. Using simplified models based on random walks and Lévy processes, we provide analytical results that demonstrate a phase transition in reaching probabilities as a function of the number of parallel simulations. We identify an optimal number of parallel simulations that balances exploration diversity and time allocation. Additionally, we analyze a restarting mechanism that exponentially enhances the probability of success by redirecting efforts toward more promising regions of the state space. Our findings contribute to a more qualitative and quantitative theory of some exploration schemes in reinforcement learning, offering insights into developing more efficient strategies for environments characterized by rare events.
[LG-69] DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions
链接: https://arxiv.org/abs/2503.03515
作者: Anna Kuchko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We consider the Inverse Optimal Stopping (IOS) problem where, based on stopped expert trajectories, one aims to recover the optimal stopping region through continuation and stopping gain functions approximation. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. While current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, a proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and a lack of a quality evaluation metric. These challenges are addressed with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function without querying the environment. Moreover, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data including an optimal intervention for critical events problem.
[LG-70] Prediction of Halo Coronal Mass Ejections Using SDO/HMI Vector Magnetic Data Products and a Transformer Model
Link: https://arxiv.org/abs/2503.03237
Authors: Hongyang Zhang,Ju Jing,Jason T. L. Wang,Haimin Wang,Yasser Abduallah,Yan Xu,Khalid A. Alobaid,Hameedullah Farooki,Vasyl Yurchyshyn
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Comments: 13 pages, 8 figures
Abstract:We present a transformer model, named DeepHalo, to predict the occurrence of halo coronal mass ejections (CMEs). Our model takes as input an active region (AR) and a profile, where the profile contains a time series of data samples in the AR that are collected 24 hours before the beginning of a day, and predicts whether the AR would produce a halo CME during that day. Each data sample contains physical parameters, or features, derived from photospheric vector magnetic field data taken by the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO). We survey and match CME events in the Space Weather Database Of Notification, Knowledge, Information (DONKI) and Large Angle and Spectrometric Coronagraph (LASCO) CME Catalog, and compile a list of CMEs including halo CMEs and non-halo CMEs associated with ARs in the period between November 2010 and August 2023. We use the information gathered above to build the labels (positive versus negative) of the data samples and profiles at hand, where the labels are needed for machine learning. Experimental results show that DeepHalo with a true skill statistics (TSS) score of 0.907 outperforms a closely related long short-term memory network with a TSS score of 0.821. To our knowledge, this is the first time that the transformer model has been used for halo CME prediction.
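For readers unfamiliar with the metric quoted above, the true skill statistic is commonly defined as recall minus the false-alarm rate, TSS = TP/(TP+FN) - FP/(FP+TN). The following is a minimal sketch of that definition; the function name and the confusion-matrix counts are hypothetical illustrations, not values from the paper.

```python
def true_skill_statistic(tp: int, fn: int, fp: int, tn: int) -> float:
    """TSS = TP/(TP+FN) - FP/(FP+TN); ranges from -1 to 1."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative sample")
    recall = tp / (tp + fn)              # fraction of real events that were predicted
    false_alarm_rate = fp / (fp + tn)    # fraction of non-events flagged as events
    return recall - false_alarm_rate


# Hypothetical confusion-matrix counts (not taken from the paper).
example_tss = true_skill_statistic(tp=45, fn=5, fp=10, tn=90)
print(f"TSS = {example_tss:.3f}")
```

Because TSS normalizes positives and negatives separately, it is insensitive to class imbalance, which is one reason it is often preferred for rare-event forecasting.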
[LG-71] Convergence Rates for Softmax Gating Mixture of Experts ICML2024
Link: https://arxiv.org/abs/2503.03213
Authors: Huy Nguyen,Nhat Ho,Alessandro Rinaldo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Section 2 of this work comes from our previous paper, “On Least Square Estimation in Softmax Gating Mixture of Experts”, published at ICML 2024
Abstract:Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed strong identifiability condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.
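To make the gating mechanism described in the abstract concrete, here is a minimal sketch of a standard softmax-gated mixture: each expert gets an affine score, the scores are normalized with softmax, and the prediction is the weighted sum of expert outputs. The function names, the two toy experts, and all parameter values are assumptions for illustration; this is not the estimator analyzed in the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_predict(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    gate_biases: Sequence[float],
    experts: Sequence[Callable[[Sequence[float]], float]],
) -> Tuple[float, List[float]]:
    """Return the gated prediction and the per-expert mixing weights."""
    # One affine gating score per expert: w_i . x + b_i
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(gate_weights, gate_biases)
    ]
    mixing = softmax(scores)
    prediction = sum(p * expert(x) for p, expert in zip(mixing, experts))
    return prediction, mixing


# Hypothetical toy setup: one linear expert and one tanh expert.
experts = [
    lambda x: x[0] + x[1],
    lambda x: math.tanh(x[0] - x[1]),
]
prediction, mixing = moe_predict(
    x=[1.0, 2.0],
    gate_weights=[[0.5, -0.5], [-0.5, 0.5]],
    gate_biases=[0.0, 0.0],
    experts=experts,
)
print(mixing, prediction)
```

For this input the second expert receives the larger gating score, so it dominates the mixture; the mixing weights always sum to one.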
[LG-72] PAC Learning with Improvements
Link: https://arxiv.org/abs/2503.03184
Authors: Idan Attias,Avrim Blum,Keziah Naggita,Donya Saless,Dravyansh Sharma,Matthew Walter
Categories: Machine Learning (stat.ML); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 40 pages, 13 figures
Abstract:One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes at least $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have the ability to improve by a small amount if doing so will allow them to receive a (desired) positive classification. In that case, we may actually be able to achieve zero error by just being “close enough”. For example, imagine a hiring test used to measure an agent’s skill at some job such that for some threshold $\theta$, agents who score above $\theta$ will be successful and those who score below $\theta$ will not (i.e., learning a threshold on the line). Suppose also that by putting in effort, agents can improve their skill level by some small amount $r$. In that case, if we learn an approximation $\hat{\theta}$ of $\theta$ such that $\theta \leq \hat{\theta} \leq \theta + r$ and use it for hiring, we can actually achieve error zero, in the sense that (a) any agent classified as positive is truly qualified, and (b) any agent who truly is qualified can be classified as positive by putting in effort. Thus, the ability for agents to improve has the potential to allow for a goal one could not hope to achieve in standard models, namely zero error. In this paper, we explore this phenomenon more broadly, giving general results and examining under what conditions the ability of agents to improve can allow for a reduction in the sample complexity of learning, or alternatively, can make learning harder. We also examine both theoretically and empirically what kinds of improvement-aware algorithms can take into account agents who have the ability to improve to a limited extent when it is in their interest to do so.
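The hiring-threshold example in the abstract can be checked with a small simulation. The sketch below assumes a simple agent model that is not spelled out in the abstract: an agent below the learned threshold raises its skill exactly to that threshold when this is within its improvement budget, and otherwise does nothing. Skills are integers to avoid floating-point edge cases, and all names and numbers are hypothetical.

```python
from typing import Iterable


def respond(skill: int, theta_hat: int, r: int) -> int:
    """Skill after the agent's response: improve only if that earns acceptance."""
    if skill >= theta_hat:
        return skill            # already accepted, no effort needed
    if skill + r >= theta_hat:
        return theta_hat        # minimal genuine improvement that gets accepted
    return skill                # acceptance is out of reach, so no effort


def count_errors(skills: Iterable[int], theta: int, theta_hat: int, r: int) -> int:
    """Count agents whose classification disagrees with their true qualification."""
    errors = 0
    for skill in skills:
        new_skill = respond(skill, theta_hat, r)
        predicted_positive = new_skill >= theta_hat
        truly_qualified = new_skill >= theta
        if predicted_positive != truly_qualified:
            errors += 1
    return errors


population = range(101)          # hypothetical skills 0..100
theta, r = 50, 10                # true threshold and improvement budget

print(count_errors(population, theta, theta_hat=55, r=r))  # theta <= theta_hat <= theta + r
print(count_errors(population, theta, theta_hat=45, r=r))  # threshold set too low
print(count_errors(population, theta, theta_hat=70, r=r))  # threshold set too high
```

Under this agent model, any learned threshold inside the interval from θ to θ + r yields no errors, while a threshold below θ admits unqualified agents and one above θ + r rejects qualified agents who cannot close the gap.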
[LG-73] Fast Jet Tagging with MLP-Mixers on FPGAs
Link: https://arxiv.org/abs/2503.03103
Authors: Chang Sun,Jennifer Ngadiuba,Maurizio Pierini,Maria Spiropulu
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG)
Abstract:We explore the innovative use of MLP-Mixer models for real-time jet tagging and establish their feasibility on resource-constrained hardware like FPGAs. MLP-Mixers excel in processing sequences of jet constituents, achieving state-of-the-art performance on datasets mimicking Large Hadron Collider conditions. By using advanced optimization techniques such as High-Granularity Quantization and Distributed Arithmetic, we achieve unprecedented efficiency. These models match or surpass the accuracy of previous architectures, reduce hardware resource usage by up to 97%, double the throughput, and halve the latency. Additionally, non-permutation-invariant architectures enable smart feature prioritization and efficient FPGA deployment, setting a new benchmark for machine learning in real-time data processing at particle colliders.
[LG-74] Learning finite symmetry groups of dynamical systems via equivariance detection
Link: https://arxiv.org/abs/2503.03014
Authors: Pablo Calvo-Barlés,Sergio G. Rodrigo,Luis Martín-Moreno
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Abstract:In this work, we introduce the Equivariance Seeker Model (ESM), a data-driven method for discovering the underlying finite equivariant symmetry group of an arbitrary function. ESM achieves this by optimizing a loss function that balances equivariance preservation with the penalization of redundant solutions, ensuring the complete and accurate identification of all symmetry transformations. We apply this framework specifically to dynamical systems, identifying their symmetry groups directly from observed trajectory data. To demonstrate its versatility, we test ESM on multiple systems in two distinct scenarios: (i) when the governing equations are known theoretically and (ii) when they are unknown, and the equivariance finding relies solely on observed data. The latter case highlights ESM’s fully data-driven capability, as it requires no prior knowledge of the system’s equations to operate.
[LG-75] Can Diffusion Models Provide Rigorous Uncertainty Quantification for Bayesian Inverse Problems?
Link: https://arxiv.org/abs/2503.03007
Authors: Evan Scope Crafts,Umberto Villa
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Abstract:In recent years, the ascendance of diffusion modeling as a state-of-the-art generative modeling approach has spurred significant interest in their use as priors in Bayesian inverse problems. However, it is unclear how to optimally integrate a diffusion model trained on the prior distribution with a given likelihood function to obtain posterior samples. While algorithms that have been developed for this purpose can produce high-quality, diverse point estimates of the unknown parameters of interest, they are often tested on problems where the prior distribution is analytically unknown, making it difficult to assess their performance in providing rigorous uncertainty quantification. In this work, we introduce a new framework, Bayesian Inverse Problem Solvers through Diffusion Annealing (BIPSDA), for diffusion model based posterior sampling. The framework unifies several recently proposed diffusion model based posterior sampling algorithms and contains novel algorithms that can be realized through flexible combinations of design choices. Algorithms within our framework were tested on model problems with a Gaussian mixture prior and likelihood functions inspired by problems in image inpainting, x-ray tomography, and phase retrieval. In this setting, approximate ground-truth posterior samples can be obtained, enabling principled evaluation of the performance of the algorithms. The results demonstrate that BIPSDA algorithms can provide strong performance on the image inpainting and x-ray tomography based problems, while the challenging phase retrieval problem, which is difficult to sample from even when the posterior density is known, remains outside the reach of the diffusion model based samplers.
[LG-76] LAPD: Langevin-Assisted Bayesian Active Learning for Physical Discovery
Link: https://arxiv.org/abs/2503.02983
Authors: Cindy Xiangrui Kong,Haoyang Zheng,Guang Lin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Discovering physical laws from data is a fundamental challenge in scientific research, particularly when high-quality data are scarce or costly to obtain. Traditional methods for identifying dynamical systems often struggle with noise sensitivity, inefficiency in data usage, and the inability to quantify uncertainty effectively. To address these challenges, we propose Langevin-Assisted Active Physical Discovery (LAPD), a Bayesian framework that integrates replica-exchange stochastic gradient Langevin Monte Carlo to simultaneously enable efficient system identification and robust uncertainty quantification (UQ). By balancing gradient-driven exploration in coefficient space and generating an ensemble of candidate models during exploitation, LAPD achieves reliable, uncertainty-aware identification with noisy data. In the face of data scarcity, the probabilistic foundation of LAPD further promotes the integration of active learning (AL) via a hybrid uncertainty-space-filling acquisition function. This strategy sequentially selects informative data to reduce data collection costs while maintaining accuracy. We evaluate LAPD on diverse nonlinear systems such as the Lotka-Volterra, Lorenz, Burgers, and Convection-Diffusion equations, demonstrating its robustness with noisy and limited data as well as superior uncertainty calibration compared to existing methods. The AL extension reduces the required measurements by around 60% for the Lotka-Volterra system and by around 40% for Burgers’ equation compared to random data sampling, highlighting its potential for resource-constrained experiments. Our framework establishes a scalable, uncertainty-aware methodology for data-efficient discovery of dynamical systems, with broad applicability to problems where high-fidelity data acquisition is prohibitively expensive.
[LG-77] HARP 2.0: Expanding Hosted Asynchronous Remote Processing for Deep Learning in the DAW
Link: https://arxiv.org/abs/2503.02977
Authors: Christodoulos Benetatos,Frank Cwitkowitz,Nathan Pruyne,Hugo Flores Garcia,Patrick O’Reilly,Zhiyao Duan,Bryan Pardo
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: ISMIR 2024 Late-Breaking Demo
Abstract:HARP 2.0 brings deep learning models to digital audio workstation (DAW) software through hosted, asynchronous, remote processing, allowing users to route audio from a plug-in interface through any compatible Gradio endpoint to perform arbitrary transformations. HARP renders endpoint-defined controls and processed audio in-plugin, meaning users can explore a variety of cutting-edge deep learning models without ever leaving the DAW. In the 2.0 release we introduce support for MIDI-based models and audio/MIDI labeling models, provide a streamlined pyharp Python API for model developers, and implement numerous interface and stability improvements. Through this work, we hope to bridge the gap between model developers and creatives, improving access to deep learning models by seamlessly integrating them into DAW workflows.
[LG-78] Markets for Models
Link: https://arxiv.org/abs/2503.02946
Authors: Krishna Dasaratha,Juan Ortner,Chengyang Zhu
Categories: Theoretical Economics (econ.TH); Machine Learning (cs.LG)
Abstract:Motivated by the prevalence of prediction problems in the economy, we study markets in which firms sell models to a consumer to help improve their prediction. Firms decide whether to enter, choose models to train on their data, and set prices. The consumer can purchase multiple models and use a weighted average of the models bought. Market outcomes can be expressed in terms of the bias-variance decompositions of the models that firms sell. We show that market structure can depend in subtle and nonmonotonic ways on the statistical properties of available models. Moreover, firms may choose inefficiently biased models to deter entry by competitors or to obtain larger profits.
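The role of the bias-variance decomposition for a consumer who averages purchased models can be illustrated with a short calculation. The sketch below assumes that model errors are independent across firms, that the weights sum to one, and that the target itself is noise-free, so the mean squared error of the weighted average is (Σ wᵢbᵢ)² + Σ wᵢ²vᵢ. These assumptions, the function name, and the numbers are illustrative and may differ from the paper's setup.

```python
from typing import Sequence


def combined_mse(
    weights: Sequence[float],
    biases: Sequence[float],
    variances: Sequence[float],
) -> float:
    """MSE of a weighted average of models with independent errors.

    Assumes weights sum to 1; returns squared combined bias plus combined variance.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    bias = sum(w * b for w, b in zip(weights, biases))
    variance = sum(w * w * v for w, v in zip(weights, variances))
    return bias ** 2 + variance


# Hypothetical market with two models: A is unbiased but noisy, B is biased but stable.
biases = [0.0, 1.0]
variances = [4.0, 1.0]

print(combined_mse([1.0, 0.0], biases, variances))  # buy only model A
print(combined_mse([0.0, 1.0], biases, variances))  # buy only model B
print(combined_mse([0.5, 0.5], biases, variances))  # buy both, average equally
```

In this toy case the equal-weight average beats either model alone, which hints at why a consumer may value buying several models and why a firm's choice of bias affects what competitors can profitably add.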
[LG-79] Applications of Entropy in Data Analysis and Machine Learning: A Review
Link: https://arxiv.org/abs/2503.02921
Authors: Salomé A. Sepúveda Fontaine,José M. Amigó
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments: 39 pages, 3 figures, 282 references
Abstract:Since its origin in the thermodynamics of the 19th century, the concept of entropy has also permeated other fields of physics and mathematics, such as Classical and Quantum Statistical Mechanics, Information Theory, Probability Theory, Ergodic Theory and the Theory of Dynamical Systems. Specifically, we are referring to the classical entropies: the Boltzmann-Gibbs, von Neumann, Shannon, Kolmogorov-Sinai and topological entropies. In addition to their common name, which is historically justified (as we briefly describe in this review), another commonality of the classical entropies is the important role that they have played and are still playing in the theory and applications of their respective fields and beyond. Therefore, it is not surprising that, over time, many other instances of the overarching concept of entropy have been proposed, most of them tailored to specific purposes. Following current usage, we will refer to all of them, whether classical or new, simply as entropies. The subject of this review is precisely their applications in data analysis and machine learning. The reason for these particular applications is that entropies are very well suited to characterize probability mass distributions, typically generated by finite-state processes or symbolized signals. Therefore, we will focus on entropies defined as positive functionals on probability mass distributions and provide an axiomatic characterization that goes back to Shannon and Khinchin. Given the plethora of entropies in the literature, we have selected a representative group, including the classical ones. The applications summarized in this review illustrate the power and versatility of entropy in data analysis and machine learning.
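As a concrete instance of an entropy defined on a probability mass distribution, the sketch below computes the Shannon entropy, H(p) = -Σ pᵢ log₂ pᵢ, of the empirical symbol distribution of a symbolized signal. The helper names and the example signal are hypothetical; the review itself covers many other entropies beyond this one.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def empirical_pmf(symbols: Sequence[Hashable]) -> List[float]:
    """Relative frequency of each distinct symbol in the sequence."""
    if not symbols:
        raise ValueError("need at least one symbol")
    counts = Counter(symbols)
    return [count / len(symbols) for count in counts.values()]


def shannon_entropy_bits(pmf: Sequence[float]) -> float:
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    if any(p < 0 for p in pmf) or abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("pmf must be non-negative and sum to 1")
    return -sum(p * math.log2(p) for p in pmf if p > 0)


# Hypothetical symbolized signal with two equally frequent symbols.
signal = "abababab"
print(shannon_entropy_bits(empirical_pmf(signal)))
```

A uniform distribution over n symbols attains the maximum value log₂ n, and a distribution concentrated on a single symbol has entropy zero.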
Information Retrieval
[IR-0] Addressing Overprescribing Challenges: Fine-Tuning Large Language Models for Medication Recommendation Tasks
Link: https://arxiv.org/abs/2503.03687
Authors: Zihao Zhao,Chenxiao Fan,Chongming Gao,Fuli Feng,Xiangnan He
Categories: Information Retrieval (cs.IR)
Abstract:Medication recommendation systems have garnered attention within healthcare for their potential to deliver personalized and efficacious drug combinations based on patients’ clinical data. However, existing methodologies encounter challenges in adapting to diverse Electronic Health Records (EHR) systems and effectively utilizing unstructured data, resulting in limited generalization capabilities and suboptimal performance. Recently, interest has grown in harnessing Large Language Models (LLMs) in the medical domain to support healthcare professionals and enhance patient care. Despite the emergence of medical LLMs and their promising results in tasks like medical question answering, their practical applicability in clinical settings, particularly in medication recommendation, remains underexplored. In this study, we evaluate both general-purpose and medical-specific LLMs for medication recommendation tasks. Our findings reveal that LLMs frequently encounter the challenge of overprescribing, leading to heightened clinical risks and diminished medication recommendation accuracy. To address this issue, we propose Language-Assisted Medication Recommendation (LAMO), which employs a parameter-efficient fine-tuning approach to tailor open-source LLMs for optimal performance in medication recommendation scenarios. LAMO leverages the wealth of clinical information within clinical notes, a resource often underutilized in traditional methodologies. As a result of our approach, LAMO outperforms previous state-of-the-art methods by over 10% in internal validation accuracy. Furthermore, temporal and external validations demonstrate LAMO’s robust generalization capabilities across various temporal and hospital contexts. Additionally, an out-of-distribution medication recommendation experiment demonstrates LAMO’s remarkable accuracy even with medications outside the training data.
[IR-1] Optimizing open-domain question answering with graph-based retrieval augmented generation
Link: https://arxiv.org/abs/2503.02922
Authors: Joyce Cahoon,Prerna Singh,Nick Litombe,Jonathan Larson,Ha Trinh,Yiwen Zhu,Andreas Mueller,Fotis Psallidas,Carlo Curino
Categories: Information Retrieval (cs.IR)
Comments: ACM classes: H.3.3; I.2.7
Abstract:In this work, we benchmark various graph-based retrieval-augmented generation (RAG) systems across a broad spectrum of query types, including OLTP-style (fact-based) and OLAP-style (thematic) queries, to address the complex demands of open-domain question answering (QA). Traditional RAG methods often fall short in handling nuanced, multi-document synthesis tasks. By structuring knowledge as graphs, we can facilitate the retrieval of context that captures greater semantic depth and enhances language model operations. We explore graph-based RAG methodologies and introduce TREX, a novel, cost-effective alternative that combines graph-based and vector-based retrieval techniques. Our benchmarking across four diverse datasets highlights the strengths of different RAG methodologies, demonstrates TREX’s ability to handle multiple open-domain QA types, and reveals the limitations of current evaluation methods. In a real-world technical support case study, we demonstrate how TREX solutions can surpass conventional vector-based RAG in efficiently synthesizing data from heterogeneous sources. Our findings underscore the potential of augmenting large language models with advanced retrieval and orchestration capabilities, advancing scalable, graph-based AI solutions.