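This post lists the latest papers retrieved from arXiv.org on 2025-07-01. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every day.
Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.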
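Contents
Overview (2025-07-01)
938 papers were updated today, including:
- Natural Language Processing: 115 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 255 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 281 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 236 papers (Machine Learning (cs.LG))
Natural Language Processing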
[NLP-0] SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
[Quick Read]: This paper addresses the reliance of conventional reinforcement learning methods for training language models on human-curated problem-answer pairs and domain-specific reward engineering. The key to its solution is SPIRAL, a framework in which a model learns without human supervision by playing multi-turn, zero-sum games against continuously evolving versions of itself. Self-play yields an unbounded curriculum of progressively harder problems, and role-conditioned advantage estimation (RAE) stabilizes multi-agent training.
Link: https://arxiv.org/abs/2506.24119
Authors: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Affiliations: National University of Singapore; Centre for Frontier AI Research (CFAR); A*STAR; Northeastern University; Sea AI Lab; Plastic Labs; University of Washington
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in Progress
Abstract:Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
[NLP-1] Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models
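[Quick Read]: This paper tackles the identification of parallel passages in Biblical Hebrew, a task that is foundational for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and error-prone. The key to the proposed approach is evaluating whether pre-trained Transformer-based language models (E5, AlephBERT, MPNet, and LaBSE) can generate word embeddings that separate parallel from non-parallel passages. E5 performs best at detecting parallels, while AlephBERT is better at distinguishing non-parallel passages.
Link: https://arxiv.org/abs/2506.24117
Authors: David M. Smiley
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments: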
Abstract:Identifying parallel passages in biblical Hebrew is foundational in biblical scholarship for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings and Chronicles, I assessed each model’s capability to generate word embeddings that delineate parallel from non-parallel passages. Utilizing cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show significant promise, with E5 excelling in parallel detection and AlephBERT demonstrating stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
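The two measures named in the abstract can be illustrated with a small sketch. This is not the author's pipeline: the toy vectors, the function names, and the choice to compare the two groups of similarity scores with a 1-D Wasserstein distance are all assumptions made for illustration, since the abstract does not say exactly how the measures were combined.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size samples this equals the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical passage embeddings (real ones would come from a model such as E5).
parallel_pairs = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.1, 0.9])]
non_parallel_pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, -1.0])]

parallel_sims = [cosine_similarity(u, v) for u, v in parallel_pairs]
non_parallel_sims = [cosine_similarity(u, v) for u, v in non_parallel_pairs]

# A larger distance means the two groups of similarity scores are easier to tell apart.
separation = wasserstein_1d(parallel_sims, non_parallel_sims)
print(f"separation between score distributions: {separation:.3f}")
```

For samples of unequal size, `scipy.stats.wasserstein_distance` is a more general option than the equal-size shortcut used here.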
[NLP-2] On the Predictive Power of Representation Dispersion in Language Models
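[Quick Read]: This paper investigates why language models perform differently across domains and tasks; its core finding is a tight link between a model's ability to predict text and the breadth of its embedding space. Representation dispersion, the average pairwise cosine distance among hidden vectors, correlates strongly and negatively with perplexity: the more widely a model spreads its contextual representations, the better it predicts text. The key to the solution is using dispersion as a label-free signal for evaluating and improving models: it supports model selection, locating the best representation layer, and, through a simple push-away training objective, increasing dispersion and improving perplexity.
Link: https://arxiv.org/abs/2506.24106
Authors: Yanhong Li, Ming Li, Karen Livescu, Jiawei Zhou
Affiliations: University of Chicago; University of Maryland; Toyota Technological Institute at Chicago; Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: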
Abstract:We show that a language model’s ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
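The abstract defines representation dispersion as the average pairwise cosine distance among hidden vectors. A minimal sketch of that definition follows; the function names and the toy 2-D vectors are hypothetical, and a real measurement would use hidden states extracted from a model layer rather than hand-written lists.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy "hidden vectors": one tightly clustered set and one spread-out set.
clustered = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(f"clustered: {representation_dispersion(clustered):.4f}")
print(f"spread:    {representation_dispersion(spread):.4f}")
```

The pairwise loop is quadratic in the number of vectors, so for large batches a sampled subset of pairs would likely be needed.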
[NLP-3] MotionGPT 3: Human Motion as a Second Modality
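[Quick Read]: This paper addresses two core challenges that unified motion-language models face in high-fidelity human motion generation and understanding: the reconstruction gap between the continuous motion modality and discrete representations under autoregressive modeling, and the degradation of language intelligence during unified training. The key to its solution is MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decouples motion modeling through separate model parameters, and enables effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch keeps the original structure and parameters of the pretrained language model, while the motion branch is integrated through a shared attention mechanism that allows bidirectional information flow between the two modalities.
Link: https://arxiv.org/abs/2506.24086
Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Affiliations: Zhejiang University; Fudan University; ByteDance; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 21 pages, 8 figures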
Abstract:Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework within an autoregressive manner.
[NLP-4] STACK: Adversarial Attacks on LLM Safeguard Pipelines
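[Quick Read]: This paper examines the security of the defense pipelines that protect frontier AI systems against catastrophic misuse. Its approach is to develop and red-team an open-source defense pipeline. A novel few-shot-prompted input and output classifier outperforms the existing ShieldGemma model across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the ClearHarm catastrophic-misuse dataset. The paper also introduces the STaged AttaCK (STACK) procedure, which reaches 71% ASR in a black-box attack, showing that such defenses remain vulnerable.
Link: https://arxiv.org/abs/2506.24068
Authors: Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Affiliations: FAR.AI; UK AISI; OATML, University of Oxford
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: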
Abstract:Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
[NLP-5] Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models
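[Quick Read]: This paper studies the gap between refusal and affirmation in language models aligned with reinforcement learning from human feedback (RLHF). Its key contribution is logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap as a single pass over the vocabulary. A forward-computable score combines gap reduction with lightweight proxies for the KL penalty and reward shift, enabling a "sort-sum-stop" sweep that produces a short suffix in under a second, uses far fewer model calls, and raises one-shot attack success to 80-100% while preserving topical coherence.
Link: https://arxiv.org/abs/2506.24056
Authors: Tung-Ling Li, Hongliang Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: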
Abstract:We introduce logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap of RLHF-aligned language models as a single pass over the vocabulary. A forward-computable score blends gap reduction with lightweight proxies for KL penalty and reward shift, allowing a “sort-sum-stop” sweep to complete in under a second and return a short suffix–two orders of magnitude fewer model calls than beam or gradient attacks. The same suffix generalises to unseen prompts and scales from 0.5 B to 70 B checkpoints, lifting one-shot attack success from baseline levels to 80-100% while preserving topical coherence. Beyond efficiency, these suffixes expose sentence-boundary reward cliffs and other alignment artefacts, offering a lightweight probe into how safety tuning reshapes internal representations.
[NLP-6] Ella: Embodied Social Agents with Lifelong Memory
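[Quick Read]: This paper asks how an agent in an open 3D environment can accumulate long-term memory and make autonomous decisions through continual learning and social interaction. The key to its solution is a structured, long-term multimodal memory system, consisting of a name-centric semantic memory and a spatiotemporal episodic memory, that stores, updates, and retrieves information effectively. Integrated with foundation models, it lets the agent make decisions, plan daily activities, build social relationships, and evolve autonomously in the open world.
Link: https://arxiv.org/abs/2506.24019
Authors: Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan
Affiliations: University of Massachusetts Amherst; Johns Hopkins University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: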
Abstract:We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella’s capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at this https URL.
[NLP-7] EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations ACL2025
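[Quick Read]: This paper addresses two gaps in explainable evaluation metrics for image captioning: the lack of standardized criteria and the unverified quality of the generated explanations. The key to its solution is EXPERT, a reference-free metric that provides structured explanations based on three fundamental criteria (fluency, relevance, and descriptiveness). The authors construct large-scale datasets of high-quality structured explanations and develop a two-stage evaluation template to supervise a vision-language model for both scoring and explanation generation.
Link: https://arxiv.org/abs/2506.24016
Authors: Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Affiliations: Seoul National University; Coxwave
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACL 2025 Findings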
Abstract:Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at this https URL.
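[NLP-8] Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
[Quick Read]: This paper examines how large language models (LLMs) can be integrated into mathematics education, focusing on their potential and limits in solving word problems. It analyzes LLM performance through a technical overview, a literature review, and an empirical evaluation. The most recent LLMs solve s-problems with near-perfect accuracy but still show clear weaknesses when the real-world context is problematic or nonsensical, which suggests that they do not make sense of the real-world context and limits their value as instructional tools.
Link: https://arxiv.org/abs/2506.24006
Authors: Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); History and Overview (math.HO)
Comments: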
Abstract:The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.
[NLP-9] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning ACL2025
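[Quick Read]: This paper addresses two problems: the complex, lifelong challenges of congenital heart disease (CHD) patients and caregivers are underrepresented in traditional clinical metrics, and manual thematic analysis (TA) of unstructured clinical narratives is labor-intensive and does not scale. The key to its solution is a fully automated large language model (LLM) pipeline that performs end-to-end TA without manual coding or full transcript review. A novel multi-agent framework improves theme quality and alignment with human analysis, and reinforcement learning from human feedback (RLHF) can optionally be integrated to improve thematic relevance.
Link: https://arxiv.org/abs/2506.23998
Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Presented at ACL 2025 SRW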
Abstract:Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.
[NLP-10] Machine Understanding of Scientific Language
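[Quick Read]: This thesis addresses the problem of automatically determining whether a given piece of scientific text is faithful to the underlying science. The key to its solution is building datasets, methods, and tools for machine understanding of scientific language, enabling analysis of science communication at scale. The work spans three areas of natural language processing and machine learning (automatic fact checking, learning with limited data, and scientific text processing) and contributes methods for identifying check-worthy claims, adversarial claim generation, multi-source domain adaptation, learning from crowd-sourced labels, cite-worthiness detection, zero-shot scientific fact checking, detecting exaggerated scientific claims, and modeling degrees of information change in science communication. Together these form a basis for learning effectively from limited scientific text and identifying misinformative scientific statements.
Link: https://arxiv.org/abs/2506.23990
Authors: Dustin Wright
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: PhD Thesis, 210 pages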
Abstract:Scientific information expresses human understanding of nature. This knowledge is largely disseminated in different forms of text, including scientific papers, news articles, and discourse among people on social media. While important for accelerating our pursuit of knowledge, not all scientific text is faithful to the underlying science. As the volume of this text has burgeoned online in recent years, it has become a problem of societal importance to be able to identify the faithfulness of a given piece of scientific text automatically. This thesis is concerned with the cultivation of datasets, methods, and tools for machine understanding of scientific language, in order to analyze and understand science communication at scale. To arrive at this, I present several contributions in three areas of natural language processing and machine learning: automatic fact checking, learning with limited data, and scientific text processing. These contributions include new methods and resources for identifying check-worthy claims, adversarial claim generation, multi-source domain adaptation, learning from crowd-sourced labels, cite-worthiness detection, zero-shot scientific fact checking, detecting exaggerated scientific claims, and modeling degrees of information change in science communication. Critically, I demonstrate how the research outputs of this thesis are useful for effectively learning from limited amounts of scientific text in order to identify misinformative scientific statements and generate new insights into the science communication process.
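[NLP-11] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
[Quick Read]: This paper addresses the cost of building the high-quality datasets needed for supervised and preference fine-tuning of large language models (LLMs): construction is resource-intensive, and most existing datasets are in English. The key to its solution is the Taxonomy-Guided Preference Data Generation (TaP) framework, which uses a structured taxonomy for fine-grained control over dataset composition, ensuring diversity and comprehensive coverage and supporting automated, scalable dataset construction across languages.
Link: https://arxiv.org/abs/2506.23979
Authors: Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Affiliations: Tianjin University; Xiaomi AI Lab
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 15 tables, 11 figures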
Abstract:Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
[NLP-12] LLM Agents Are the Antidote to Walled Gardens
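[Quick Read]: This paper addresses the dominance of closed, proprietary platforms at the application layer, which restricts data exchange and locks users in. The key to its solution is LLM-based agents, which can automatically translate between data formats and interact with interfaces designed for humans, making interoperability dramatically cheaper and effectively unavoidable. The authors call this shift "universal interoperability": any two digital services can exchange data seamlessly through AI-mediated adapters, which undermines monopolistic behavior and promotes data portability.
Link: https://arxiv.org/abs/2506.23978
Authors: Samuele Marro, Philip Torr
Affiliations: University of Oxford
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: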
Abstract:While the Internet’s core infrastructure was designed to be open and universal, today’s application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
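[NLP-13] Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders
[Quick Read]: This paper investigates how effectively explainability methods based on Sparse Autoencoders (SAEs) can extract causal, interpretable features for sentence classification, a domain where such methods have not been extensively explored. It proposes a novel SAE architecture tailored to text classification; the key is a specialized classifier head combined with an activation rate sparsity loss, which improves the causality and interpretability of the extracted features.
Link: https://arxiv.org/abs/2506.23951
Authors: Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Affiliations: LIX (École Polytechnique, IP Paris, CNRS); LIPN (Sorbonne Paris Nord)
Categories: Computation and Language (cs.CL)
Comments: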
Abstract:Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.
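[NLP-14] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs
[Quick Read]: This paper addresses insufficient knowledge sharing among domain-specific Multimodal Large Language Models (MLLMs), in particular the degradation that task-specific fine-tuned models show when confronted with diverse data inputs. The key to its solution is a unified parameter integration framework built on a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which combines local functional attribution with global information-theoretic signals to guide selective parameter fusion, enabling modular composition of expert capabilities.
Link: https://arxiv.org/abs/2506.23940
Authors: Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: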
Abstract:Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs–such as those trained for mathematics or code–remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.
[NLP-15] Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages
[Quick Read]: This paper addresses the challenge of detecting hate speech in low-resource languages, where large-scale, high-quality datasets are lacking. The key to its solution is applying prompt engineering to large language models (LLMs), in particular a novel metaphor prompting method that circumvents the built-in safety mechanisms of LLMs so that hate speech in low-resource languages can be detected effectively.
Link: https://arxiv.org/abs/2506.23930
Authors: Ruhina Tabasshum Prome (Bangladesh Institute of Governance and Management), Tarikul Islam Tamiti (George Mason University), Anomadarshi Barua (George Mason University)
Affiliations: Bangladesh Institute of Governance and Management; George Mason University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid expansion of social media leads to a marked increase in hate speech, which threatens personal lives and results in numerous hate crimes. Detecting hate speech presents several challenges: diverse dialects, frequent code-mixing, and the prevalence of misspelled words in user-generated content on social media platforms. Recent progress in hate speech detection is typically concentrated on high-resource languages. However, low-resource languages still face significant challenges due to the lack of large-scale, high-quality datasets. This paper investigates how we can overcome this limitation via prompt engineering on large language models (LLMs) focusing on low-resource Bengali language. We investigate six prompting strategies - zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting to detect hate speech effectively in low-resource languages. We pioneer the metaphor prompting to circumvent the built-in safety mechanisms of LLMs that marks a significant departure from existing jailbreaking methods. We investigate all six different prompting strategies on the Llama2-7B model and compare the results extensively with three pre-trained word embeddings - GloVe, Word2Vec, and FastText for three different deep learning models - multilayer perceptron (MLP), convolutional neural network (CNN), and bidirectional gated recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language - Hindi, and two high-resource languages - English and German. The performance of all prompting techniques is evaluated using the F1 score, and environmental impact factor (IF), which measures CO2 emissions, electricity usage, and computational time.
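For readers unfamiliar with the F1 score mentioned in the abstract, the sketch below computes it for a binary hate/non-hate labeling. The label encoding (1 for hate speech), the toy predictions, and the use of plain binary F1 are assumptions; the abstract does not say which averaging scheme the authors report.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and model predictions (1 = hate speech, 0 = not).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(f"F1 = {f1_score(y_true, y_pred):.3f}")
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so F1 is also 2/3.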
[NLP-16] IMPACT: Inflectional Morphology Probes Across Complex Typologies
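[Quick Read]: This paper addresses the limited grasp that Large Language Models (LLMs) have of linguistic complexity in non-English languages, particularly inflectional morphology. Despite strong English performance, the models struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. The key to its solution is IMPACT, a synthetically generated evaluation framework that tests LLMs on five morphologically rich languages (Arabic, Russian, Finnish, Turkish, and Hebrew) with unit-test-style cases covering both shared and language-specific phenomena.
Link: https://arxiv.org/abs/2506.23929
Authors: Mohammed J. Saeed, Tommi Vehvilainen, Evgeny Fedoseev, Sevil Caliskan, Tatiana Vodolazova
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: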
Abstract:Large Language Models (LLMs) have shown significant progress on various multilingual benchmarks and are increasingly used to generate and evaluate text in non-English languages. However, while they may produce fluent outputs, it remains unclear to what extent these models truly grasp the underlying linguistic complexity of those languages, particularly in morphology. To investigate this, we introduce IMPACT, a synthetically generated evaluation framework focused on inflectional morphology, which we publicly release, designed to evaluate LLM performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes unit-test-style cases covering both shared and language-specific phenomena, from basic verb inflections (e.g., tense, number, gender) to unique features like Arabic’s reverse gender agreement and vowel harmony in Finnish and Turkish. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. We also show that Chain of Thought and Thinking Models can degrade performance. Our work exposes gaps in LLMs’ handling of linguistic complexity, pointing to clear room for improvement. To support further research, we publicly release the IMPACT framework.
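[NLP-17] The Trilemma of Truth in Large Language Models
[Quick Read]: This paper addresses how to assess the veracity of the internal probabilistic knowledge of large language models (LLMs). Existing probing methods rest on several flawed assumptions, so a more reliable way to separate true statements from false ones is needed. The proposed solution is sAwMIL (Sparse Aware Multiple-Instance Learning), which uses the internal activations of LLMs, together with multiple-instance learning and conformal prediction, to classify statements as true, false, or neither. The method allows a more accurate check of what LLMs "know" and how certain they are of their internal probabilistic knowledge.
Link: https://arxiv.org/abs/2506.23921
Authors: Germans Savcisens,Tina Eliassi-Rad
Affiliations: Northeastern University; Santa Fe Institute
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: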
Abstract:We often attribute human characteristics to large language models (LLMs) and claim that they “know” certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM’s depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs “know” and how certain they are of their probabilistic internal knowledge.
[NLP-18] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting ECML KDD2025
[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on complex multi-step reasoning tasks. The key to the solution is a framework called Multi-Layered Self-Reflection with Auto-Prompting (MAPS), which integrates Chain of Thought (CoT), self-reflection, and auto-prompting in an iterative refinement process to improve multi-step mathematical reasoning. When an error is detected, MAPS uses an adaptive self-reflection mechanism to generate a tailored prompt that guides the correction, so the reasoning is adjusted dynamically from one attempt to the next.
Link: https://arxiv.org/abs/2506.23888
Authors: André de Souza Loureiro,Jorge Valverde-Rebaza,Julieta Noguez,David Escarcega,Ricardo Marcacini
Affiliations: Luiz de Queiroz College of Agriculture; University of São Paulo; Tecnologico de Monterrey; Institute of Mathematics and Computer Sciences
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication in: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2025). Research Track
Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
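The iterative refinement loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `solve`, `check`, and `reflect` are hypothetical callables standing in for the CoT call, the error detector, and the self-reflection prompt generator, and `max_depth` is an assumed name for the cap on reflection depth.

```python
from typing import Callable, Tuple


def refine(
    question: str,
    solve: Callable[[str], str],         # hypothetical: prompt -> answer
    check: Callable[[str, str], bool],   # hypothetical: (question, answer) -> acceptable?
    reflect: Callable[[str, str], str],  # hypothetical: (question, rejected answer) -> new prompt
    max_depth: int = 3,                  # assumed name for the reflection-depth limit
) -> Tuple[str, int]:
    """Return the final answer and the number of reflection layers used."""
    # First attempt: plain prompting on the original question.
    answer = solve(question)
    depth = 0
    # Reflect only while the answer is rejected and the depth budget allows it.
    while depth < max_depth and not check(question, answer):
        prompt = reflect(question, answer)  # tailored prompt built from the failed attempt
        answer = solve(prompt)
        depth += 1
    return answer, depth
```

Capping `max_depth` is what bounds the extra token cost: every additional layer costs one `reflect` call and one `solve` call.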
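[NLP-19] Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It
[Quick Read]: This paper addresses systematic flaws in widely used reasoning benchmarks (SocialIQa, FauxPas-EAI, and ToMi), in both the design of benchmark items and the evaluation methodology. The key to the approach is to use five large language models (LLMs) as diagnostic tools to identify structural, semantic, and pragmatic problems in benchmark design, as well as scoring procedures that favor output form over the reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, the study finds that score improvements often stem from surface variations in the input rather than from better reasoning, which suggests that high scores may reflect alignment with format-specific cues rather than consistent inference from the input.
Link: https://arxiv.org/abs/2506.23864
Authors: Seyed Mahed Mousavi,Edoardo Cecchinato,Lucia Hornikova,Giuseppe Riccardi
Affiliations: Masaryk University; Signals and Interactive Systems Lab, University of Trento
Subjects: Computation and Language (cs.CL)
Comments: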
Abstract: We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-3, 3.5, 4, o1, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations and not due to improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.
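[NLP-20] Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts
[Quick Read]: This paper addresses the dispute over the usefulness of sparse autoencoders (SAEs), namely why they produce negative results in some settings and positive results in others. The key is a conceptual distinction: SAEs may be less effective for acting on known concepts, but they are powerful tools for discovering unknown concepts. This distinction reconciles the existing contradictory results and points to classes of SAE applications in machine learning interpretability, explainability, fairness, auditing, and safety, as well as in the social and health sciences.
Link: https://arxiv.org/abs/2506.23845
Authors: Kenny Peng,Rajiv Movva,Jon Kleinberg,Emma Pierson,Nikhil Garg
Affiliations: Cornell Tech; UC Berkeley; Cornell University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: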
Abstract:While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
[NLP-21] Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model
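[Quick Read]: This paper addresses the "overthinking" problem of Large Reasoning Models (LRMs) on simple tasks: the models produce verbose responses loaded with thinking tokens, which trigger unnecessary high-level reasoning behaviors such as reflection and backtracking and reduce efficiency. The key to the solution is a new algorithm called Dual Policy Preference Optimization (DuP-PO), which consists of (1) a rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) a fine-grained advantage control technique that dynamically regulates the prediction of target tokens; and (3) a policy shaping method that keeps the gradient contributions from thinking tokens stable. Experiments show that DuP-PO improves token efficiency during reasoning while retaining the strong performance of the base model.
Link: https://arxiv.org/abs/2506.23840
Authors: Bowen Ding,Yuhan Chen,Futing Wang,Lingfeng Ming,Tao Lin
Affiliations: Zhejiang University; Boston University; ByteDance; School of Engineering, Westlake University; Research Center for Industries of the Future, Westlake University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures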
Abstract:Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) A rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) A fine-grained advantage control technique to dynamically regulate the prediction of target tokens; (3) A policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on the popular LRM, which significantly improves their token efficiency during reasoning, while achieving superior performance of the base model.
[NLP-22] Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences
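[Quick Read]: This paper addresses positional bias in binary question answering, where a model lacking sufficient information may favor one answer purely because of where it appears among the options. The key to the approach is to vary the contextual uncertainty of the dataset and to introduce options of differing quality or persuasiveness, so that positional bias can be evaluated and quantified systematically under different levels of uncertainty. Preference Fairness and Position Consistency are computed by swapping the position of the correct option, which reveals how strongly models depend on option position under high uncertainty.
Link: https://arxiv.org/abs/2506.23743
Authors: Tiziano Labruna,Simone Gallo,Giovanni Da San Martino
Affiliations: University of Padova; CNR-ISTI
Subjects: Computation and Language (cs.CL)
Comments: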
Abstract:Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality scores, and (2) Winning Arguments - where models predict the more persuasive argument in Reddit’s r/ChangeMyView exchanges. Across each dataset, the order of the “correct” (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially when it becomes doubtful to decide which option is correct.
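The flip-and-compare protocol in this abstract can be illustrated with a small sketch. It assumes each item is queried twice, once with the correct option in position 1 and once in position 2. The abstract does not define Preference Fairness or Position Consistency, so the consistency measure below is only one plausible reading (same underlying option chosen in both orders), and `first_position_rate` is a simplified stand-in, not the paper's Preference Fairness metric. The function name is hypothetical.

```python
from typing import List, Tuple


def position_stats(choices: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Summarise positional behaviour over flipped option orders.

    choices[i] = (pick_a, pick_b), where
      pick_a is the position (1 or 2) chosen when the correct option is in position 1,
      pick_b is the position chosen when the correct option is in position 2.
    Returns (consistency, first_position_rate).
    """
    if not choices:
        raise ValueError("need at least one item")
    n = len(choices)
    # The same underlying option is chosen in both orders exactly when the chosen
    # position changes with the flip: (1, 2) = correct twice, (2, 1) = incorrect twice.
    consistent = sum(1 for a, b in choices if a != b)
    # Share of all 2n decisions that landed on position 1 (0.5 means no preference).
    first = sum((a == 1) + (b == 1) for a, b in choices)
    return consistent / n, first / (2 * n)
```

A model that always answers with position 1 yields `(1, 1)` for every item, giving a consistency of 0.0 and a first-position rate of 1.0.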
[NLP-23] AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data
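[Quick Read]: This paper addresses the fact that existing evaluation benchmarks are static and insufficient for assessing the robustness and generalization of large language models (LLMs). To close this gap, the authors propose AutoEvoEval, an evolution-based evaluation framework. Its key element is a set of 22 interpretable atomic evolution operations that can be composed over multiple rounds, allowing controlled generation of diverse, challenging, and realistic test samples for a fuller assessment of model robustness.
Link: https://arxiv.org/abs/2506.23735
Authors: JiaRu Wu,Mingwei Liu
Affiliations: Sun Yat-sen University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: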
Abstract:Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: this https URL.
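[NLP-24] Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
[Quick Read]: This paper addresses the need for effective summarization created by the growing volume of video content in educational, professional, and social domains, where traditional unimodal summarization is no longer sufficient. The key to the solution is a behaviour-aware multimodal video summarization framework that fuses textual, audio, and visual cues to produce timestamp-aligned summaries. By extracting prosodic features, textual cues, and visual indicators, the framework identifies semantically and emotionally important moments. A central contribution is the identification of "bonus words", terms emphasized across multiple modalities, which improve the semantic relevance and expressive clarity of the summaries.
Link: https://arxiv.org/abs/2506.23714
Authors: Md Moinul Islam,Sofoklis Kakouros,Janne Heikkilä,Mourad Oussalah
Affiliations: University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted to HHAI WS 2025: Workshops at the Fourth International Conference on Hybrid Human-Artificial Intelligence (HHAI)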
Abstract: The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
[NLP-25] Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments ICML2024
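[Quick Read]: This paper addresses the verification of safety and compliance when AI models are evaluated at scale. Existing benchmarks typically do not offer verifiable results and do not keep model intellectual property or benchmark datasets confidential. The key to the solution is Attestable Audits, which run inside Trusted Execution Environments and let users verify that they are interacting with a compliant AI model, protecting sensitive data even when the model provider and the auditor do not trust each other.
Link: https://arxiv.org/abs/2506.23706
Authors: Christoph Schnabl,Daniel Hugenroth,Bill Marino,Alastair R. Beresford
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: ICML 2024 Workshop TAIG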
Abstract:Benchmarks are important measures to evaluate safety and compliance of AI models at scale. However, they typically do not offer verifiable results and lack confidentiality for model IP and benchmark datasets. We propose Attestable Audits, which run inside Trusted Execution Environments and enable users to verify interaction with a compliant AI model. Our work protects sensitive data even when model provider and auditor do not trust each other. This addresses verification challenges raised in recent AI governance frameworks. We build a prototype demonstrating feasibility on typical audit benchmarks against Llama-3.1.
[NLP-26] Efficient Interleaved Speech Modeling through Knowledge Distillation
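[Quick Read]: This paper addresses the fact that current speech language models exceed the size and latency constraints of many deployment environments. The key to the solution is layer-aligned knowledge distillation, which matches hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. The result is TinyWave, a family of compact, expressive speech generation models suited to real-time conversational agents, assistive technologies, and low-resource environments.
Link: https://arxiv.org/abs/2506.23670
Authors: Mohammadmahdi Nouriborji,Morteza Rohanian
Affiliations: Nlpie Research; University of Zurich
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: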
Abstract:Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher’s performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
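Of the three signals this abstract says are matched, the softened-logits term is the easiest to show in isolation. The sketch below uses the common temperature-scaled KL formulation of knowledge distillation. Whether the paper uses exactly this form, this temperature, or the `T^2` scaling is an assumption, and the hidden-state and attention-map terms are omitted. The function names are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float) -> List[float]:
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def soft_logit_kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # assumed value; higher softens both distributions
) -> float:
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    # Conventional T^2 factor keeps the loss scale comparable across temperatures.
    return kl * temperature ** 2
```

The loss is zero when student and teacher logits agree up to an additive constant and positive otherwise, so minimising it pulls the student's output distribution toward the teacher's.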
[NLP-27] L0: Reinforcement Learning to Become General Agents
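[Quick Read]: This paper addresses the scalability and training-efficiency challenges of training large language models (LLMs) to act as autonomous agents on multi-turn, long-horizon tasks. The key to the solution is L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. L0 features a low-cost, extensible, sandboxed pool of concurrent agent workers, together with NB-Agent, an agent scaffold that operates in a "code-as-action" fashion inside a Read-Eval-Print-Loop (REPL), lowering the barrier to applying reinforcement learning in complex environments. The paper also shows that a base model can develop robust problem-solving skills using only Reinforcement Learning with Verifiable Rewards (RLVR).
Link: https://arxiv.org/abs/2506.23667
Authors: Junjie Zhang,Jingyi Xi,Zhuoyang Song,Junyu Lu,Yuhua Ke,Ting Sun,Yukun Yang,Jiaxing Zhang,Songxin Zhang,Zejian Xie
Affiliations: Lionrock AI Lab; China Merchants Research Institute of Advanced Technology
Subjects: Computation and Language (cs.CL)
Comments: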
Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a “code-as-action” fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes on (this https URL).
[NLP-28] Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation
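[Quick Read]: This paper addresses the practical barrier that conventional context-aware embedding methods face in privacy-sensitive or resource-constrained settings, because they need access to the target corpus or domain-specific finetuning. The key to the solution is the ZEST framework, which replaces real corpus access with a one-time offline synthesis of a compact proxy corpus. From only a handful of representative example documents, it generates a synthetic context corpus of several hundred documents that emulates key distributions of the target domain. At inference, a frozen context-aware encoder uses this proxy corpus to produce domain-adapted embeddings, with no finetuning and no access to the target corpus.
Link: https://arxiv.org/abs/2506.23662
Authors: Philip Lippmann,Jie Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: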
Abstract: Context-aware embedding methods boost retrieval accuracy by conditioning on corpus statistics (e.g., term co-occurrence and topical patterns) extracted from neighboring documents. However, this context-aware approach requires access to the target corpus or requires domain-specific finetuning, posing practical barriers in privacy-sensitive or resource-constrained settings. We present ZEST, a zero-shot contextual adaptation framework that replaces real corpus access with a one-time offline synthesis of a compact proxy. Given only a handful of exemplar documents representative of the general target domain, we use a multi-step hierarchical procedure to generate a synthetic context corpus of several hundred documents that aims to emulate key domain-specific distributions. At inference, the frozen context-aware encoder uses this proxy corpus, without any finetuning or target corpus access, to produce domain-adapted embeddings. Across the MTEB benchmark, ZEST’s zero-shot synthetic context adaptation using only five example documents performs within 0.5% of models leveraging full target corpus access, demonstrating remarkable efficacy without any retraining. ZEST thus provides a practical method for deploying high-performance, adaptable embeddings in constrained environments.
[NLP-29] Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
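[Quick Read]: This paper addresses how to evaluate the robustness of text classification systems against adversarial examples, by generating minimal word-level modifications that change the model's prediction. The key to the solution is an extension of the BeamAttack algorithm that adds support for word deletions and an option to skip substitutions, and integrates LIME to better prioritize word replacements, achieving a high attack success rate while preserving the semantic and lexical similarity of the original text.
Link: https://arxiv.org/abs/2506.23661
Authors: Arnisa Fazla,Lucas Krauter,David Guzman Piedrahita,Andrianos Michail
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 12 pages main text, 27 pages total including references and appendices. 13 figures, 10 tables. Accepted for publication in the LNCS proceedings of CLEF 2025 (Best-of-Labs track)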
Abstract:We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack’s effectiveness and its limitations. Our implementation is available at this https URL
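A beam search over word-level edits with substitute, delete, and skip moves, as this abstract describes, can be sketched as below. This is a simplified illustration and not the BeamAttack implementation. It visits words left to right instead of ranking them with LIME. `score` is a hypothetical callable returning the victim model's probability for the original label, `candidates` is a hypothetical substitution table, and `threshold` is an assumed success criterion.

```python
from typing import Callable, Dict, List, Optional, Sequence

Hypothesis = List[Optional[str]]  # None marks a deleted word, so positions stay aligned


def render(hyp: Hypothesis) -> List[str]:
    """Drop deleted slots to obtain the text that is actually scored."""
    return [w for w in hyp if w is not None]


def beam_attack(
    words: Sequence[str],
    candidates: Dict[str, List[str]],     # hypothetical: word -> replacement options
    score: Callable[[List[str]], float],  # hypothetical: P(original label | text)
    beam_width: int = 3,
    threshold: float = 0.5,               # assumed: below this the prediction has flipped
) -> List[str]:
    beams: List[Hypothesis] = [list(words)]
    for i, word in enumerate(words):
        expanded: List[Hypothesis] = []
        for hyp in beams:
            expanded.append(hyp)                            # skip: leave word i unchanged
            for sub in candidates.get(word, []):
                expanded.append(hyp[:i] + [sub] + hyp[i + 1:])   # substitute word i
            expanded.append(hyp[:i] + [None] + hyp[i + 1:])      # delete word i
        # Keep the hypotheses that most reduce confidence in the original label.
        # The sort is stable, so on ties the less-modified hypothesis listed first wins.
        expanded.sort(key=lambda h: score(render(h)))
        beams = expanded[:beam_width]
        if score(render(beams[0])) < threshold:
            break  # stop at the first successful edit to keep the change small
    return render(beams[0])
```

Listing the skip move before substitutions and deletions is the design choice that biases the search toward minimal modifications when several hypotheses score the same.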
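[NLP-30] Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs
[Quick Read]: This paper addresses whether synthetic behavioural data generated by large language models (LLMs) can faithfully capture psychological differences driven by personality traits. The key to the approach is to condition LLM agents on Big-Five personality profiles and evaluate their susceptibility to misinformation, specifically news discernment: the ability to judge true headlines as true and false headlines as false. Comparing the response patterns of LLM agents with those of human participants with known personality profiles shows that some personality-misinformation associations can be replicated, and reveals systematic biases in how LLMs internalize and express personality.
Link: https://arxiv.org/abs/2506.23610
Authors: Manuel Pratelli,Marinella Petrocchi
Affiliations: IIT-CNR (Istituto di Informatica e Telematica del Consiglio Nazionale delle Ricerche); IMT School for Advanced Studies Lucca
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: pre-print version - paper actually under submission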
Abstract:Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low-cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality-aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.
[NLP-31] Semantic-guided Diverse Decoding for Large Language Model
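[Quick Read]: This paper addresses the lack of semantic diversity when large language models generate multiple responses: existing methods mainly achieve lexical rather than semantic diversity. The key to the solution is Semantic-guided Diverse Decoding (SemDiD), which operates in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. The method reconciles these competing objectives with adaptive gain functions and constraint optimization, so that quality thresholds are met while semantic differentiation is maximized.
Link: https://arxiv.org/abs/2506.23601
Authors: Weijie Shi,Yue Cui,Yaguang Wu,Jingzhi Fang,Shibo Zhang,Mengze Li,Sirui Han,Jia Zhu,Jiajie Xu,Xiaofang Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: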
Abstract:Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), operating directly in embedding space that balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.
[NLP-32] Reachability in symmetric VASS
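[Quick Read]: This paper addresses the reachability problem in symmetric vector addition systems with states (VASS), where transitions are invariant under a group of permutations of coordinates. The key is to analyze how different group structures affect the complexity of reachability. In particular, for the symmetric groups the problem is shown to be solvable in PSPACE regardless of the dimension of the input VASS, in contrast to the Ackermannian complexity of general VASS. The study also covers other groups, such as alternating and cyclic ones, and estimates the gain in complexity when the group arises as a combination of the trivial and symmetric groups.
Link: https://arxiv.org/abs/2506.23578
Authors: Łukasz Kamiński,Sławomir Lasota
Affiliations: Unknown
Subjects: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL)
Comments: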
Abstract:We investigate the reachability problem in symmetric vector addition systems with states (VASS), where transitions are invariant under a group of permutations of coordinates. One extremal case, the trivial groups, yields general VASS. In another extremal case, the symmetric groups, we show that the reachability problem can be solved in PSPACE, regardless of the dimension of input VASS (to be contrasted with Ackermannian complexity in general VASS). We also consider other groups, in particular alternating and cyclic ones. Furthermore, motivated by the open status of the reachability problem in data VASS, we estimate the gain in complexity when the group arises as a combination of the trivial and symmetric groups.
[NLP-33] MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
[Quick Read]: This paper addresses the shortcomings of existing benchmarks for Multimodal Large Language Models (MLLMs) in evaluating long-chain reasoning: (1) insufficient difficulty and diversity, (2) susceptibility to guessing and memorization, and (3) inadequate assessment of intermediate reasoning steps. The key to the solution is the MMReason benchmark, which curates questions requiring multi-step reasoning from multiple fields and difficulty levels, uses a multi-model voting technique to remove shortcut cases related to guessing and memorization, and applies a reference-based ternary scoring mechanism to assess intermediate reasoning steps reliably.
Link: https://arxiv.org/abs/2506.23563
Authors: Huanjin Yao,Jiaxing Huang,Yawen Qiu,Michael K. Chen,Wenzheng Liu,Wei Zhang,Wenjie Zeng,Xikun Zhang,Jingyi Zhang,Yuxin Song,Wenhao Wu,Dacheng Tao
Affiliations: Nanyang Technological University; Tsinghua University; Baidu Inc.; University of California; University of Science and Technology of China
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report
Abstract:Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at this https URL.
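[NLP-34] On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?
[Quick Read]: This paper addresses the memorization, creativity, and nonsense that large language models (LLMs) exhibit when generating cooking recipes. It aims to determine which parts of the generated content are copied directly from online sources and which are genuine creative synthesis or outright nonsense. The key to the solution is an "LLM-as-judge" pipeline that automates recipe generation, nonsense detection, parsing of ingredients and steps, and their annotation, making it possible to quantify memorization, creativity, and nonsense at scale.
Link: https://arxiv.org/abs/2506.23527
Authors: Jan Kvapil,Martin Fajcik
Affiliations: Brno University of Technology
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures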
Abstract: This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated from Large Language Models (LLMs). Precisely, we aim (i) to analyze memorization, creativity, and nonsense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automate such a human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation on 20 preselected recipes generated by an LLM (Mixtral), extracting each recipe’s ingredients and step-by-step actions to assess which elements are memorized (i.e., directly traceable to online sources possibly seen during training) and which arise from genuine creative synthesis or outright nonsense. We find that Mixtral consistently reuses ingredients that can be found in online documents, potentially seen during model training, suggesting strong reliance on memorized content. To achieve aim (ii) and scale our analysis beyond small sample sizes and single LLM validation, we design an “LLM-as-judge” pipeline that automates recipe generation, nonsense detection, parsing ingredients and recipe steps, and their annotation. For instance, comparing its output against human annotations, the best ingredient extractor and annotator is Llama 3.1+Gemma 2 9B, achieving up to 78% accuracy on ingredient matching. This automated framework enables large-scale quantification of memorization, creativity, and nonsense in generated recipes, providing rigorous evidence of the models’ creative capacities.
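[NLP-35] NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning
Quick read: This paper addresses the challenge of understanding student opinions in Vietnamese-language education, in particular the lack of domain relevance and student slang in existing educational datasets. The key to its approach is NEU-ESC, a new Vietnamese dataset curated from university forums that offers more samples, richer class diversity, longer texts, and a broader vocabulary. The study also explores multitask learning with encoder-only language models (such as BERT), reaching up to 83.7% accuracy on sentiment classification and 79.8% on topic classification.
Link: https://arxiv.org/abs/2506.23524
Authors: Phan Quoc Hung Mai, Quang Hung Nguyen, Phuong Giang Duong, Hong Hanh Nguyen, Nguyen Tuan Long
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: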
Abstract:In the field of education, understanding students’ opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), in which we showed that it achieves performance up to 83.7% and 79.8% accuracy for sentiment and topic classification tasks. We also benchmark our dataset and model with other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: this https URL.
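[NLP-36] Assessing GPTZero's Accuracy in Identifying AI vs. Human-Written Essays
Quick read: This paper examines how reliably the AI detection tools used in education distinguish AI-generated text from human-written text, focusing on the accuracy of GPTZero, a widely used detector. The key to its approach is an experiment that submits randomly selected essays of different lengths (short, medium, long) to GPTZero and measures both its success rate on AI-generated text and its misclassification of human writing.
Link: https://arxiv.org/abs/2506.23517
Authors: Selin Dik, Osman Erdem, Mehmet Dik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: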
Abstract:As the use of AI tools by students has become more prevalent, instructors have started using AI detection tools like GPTZero and QuillBot to detect AI written text. However, the reliability of these detectors remains uncertain. In our study, we focused mostly on the success rate of GPTZero, the most-used AI detector, in identifying AI-generated texts based on different lengths of randomly submitted essays: short (40-100 word count), medium (100-350 word count), and long (350-800 word count). We gathered a data set consisting of twenty-eight AI-generated papers and fifty human-written papers. With this randomized essay data, papers were individually plugged into GPTZero and measured for percentage of AI generation and confidence. A vast majority of the AI-generated papers were detected accurately (ranging from 91-100% AI believed generation), while the human generated essays fluctuated; there were a handful of false positives. These findings suggest that although GPTZero is effective at detecting purely AI-generated content, its reliability in distinguishing human-authored texts is limited. Educators should therefore exercise caution when relying solely on AI detection tools.
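[NLP-37] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably
Quick read: This paper addresses the forgetting of prior knowledge when multimodal large language models are post-trained. Using jigsaw puzzles as a novel downstream task, it systematically analyzes how supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) affect knowledge retention. The key finding is that RFT preserves prior knowledge better while learning a new task, because it reinforces correct samples that are naturally aligned with the base model's probability distribution and so interferes less with existing knowledge; SFT acquires the task faster but leads to catastrophic forgetting.
Link: https://arxiv.org/abs/2506.23508
Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages (Preprint. Work in progress)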
Abstract:Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model’s probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT’s potential for stable continual learning in multimodal large language models.
[NLP-38] Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent
Quick read: This paper addresses the limitations of existing large language model (LLM)-based interactive recommender agents in handling diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. The key to its approach is TAIRA (Thought-Augmented Interactive Recommender Agent), a system that uses Thought Pattern Distillation (TPD) to extract high-level thought patterns and strengthen the agent's planning, with a manager agent coordinating task decomposition and subtask planning to better handle complex user needs.
Link: https://arxiv.org/abs/2506.23485
Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Affiliations: University of Science and Technology of China; Huawei Noah’s Ark Lab; Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users’ real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent’s and human experts’ experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at:this https URL.
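[NLP-39] What to Keep and What to Drop: Adaptive Table Filtering Framework
Quick read: This paper addresses the difficulty large language models (LLMs) have with table-based reasoning over large tables because of input length limits. The key to its approach is the Adaptive Table Filtering Framework (ATF), a modular, question-aware filtering pipeline that prunes uninformative rows and columns using LLM-generated column descriptions, clustering, and sparse-dense alignment scores, with the aim of improving performance without retraining existing models such as TAPAS and TAPEX.
Link: https://arxiv.org/abs/2506.23463
Authors: Jang Won June
Affiliations: Myongji University
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 9 figures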
Abstract:Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF’s ability to adaptively balance informativeness and minimalism across tasks.
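[NLP-40] Pipelined Decoder for Efficient Context-Aware Text Generation
Quick read: This paper addresses the efficiency bottleneck of autoregressive models in generative AI, which generate text one token at a time; this preserves quality but significantly limits generation speed. The proposed solution is a new decoder architecture, the pipelined decoder, whose key idea is to start generating multiple subsequences at the same time and to produce a new token for each subsequence at every time step, enabling efficient parallel text generation.
Link: https://arxiv.org/abs/2506.23431
Authors: Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
Affiliations: Nanjing University; The Ohio State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: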
Abstract:As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.
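[NLP-41] TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs ICML2025
Quick read: This paper addresses how to quantify the effect of fine-tuning on individual outputs of large language models (LLMs); existing approaches can only make coarse assessments by comparing the final outputs of the pre-trained and fine-tuned models and lack a fine-grained understanding of how fine-tuning acts. The key to its approach is a new method that tracks the model's intermediate hidden states to decompose a fine-tuned LLM exactly into a pre-training component and a fine-tuning component, and defines the Tuning Contribution (TuCo) as the ratio of the magnitude of the fine-tuning component to that of the pre-training component, enabling quantitative analysis of the influence of fine-tuning.
Link: https://arxiv.org/abs/2506.23423
Authors: Felipe Nuti, Tim Franzmeyer, João Henriques
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2025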
Abstract:Past work has studied the effects of fine-tuning on large language models’ (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model’s intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
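The ratio described in this abstract can be illustrated with a small sketch. It assumes the two components are already available as per-layer vectors (the names `ptc_layers` and `ftc_layers` are hypothetical) and takes a plain ratio of summed Euclidean norms; the paper's exact definition and normalization may differ.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def tuning_contribution_ratio(ptc_layers, ftc_layers):
    """Ratio of fine-tuning magnitude to pre-training magnitude.

    Both arguments are lists of per-layer vectors (hypothetical inputs).
    This is an illustrative ratio of norms, not the paper's exact formula.
    """
    ptc = sum(l2_norm(v) for v in ptc_layers)
    ftc = sum(l2_norm(v) for v in ftc_layers)
    if ptc == 0:
        raise ValueError("pre-training component has zero magnitude")
    return ftc / ptc


# Toy example: one layer, pre-training norm 5.0, fine-tuning norm 1.0.
ratio = tuning_contribution_ratio([[3.0, 4.0]], [[0.0, 1.0]])
```

Under this reading, a lower ratio on a prompt would mean the fine-tuning component contributes less to the forward pass, which is the pattern the abstract reports for successful attacks.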
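[NLP-42] Datasets for Fairness in Language Models: An In-Depth Survey
Quick read: This paper addresses the lack of systematic analysis of the datasets that language model fairness evaluation relies on; the datasets behind existing fairness benchmarks have received little close scrutiny. The key to its approach is a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods; applied to 24 common benchmarks, it identifies biases that may otherwise be overlooked and offers practical guidance for selecting, combining, and interpreting these datasets.
Link: https://arxiv.org/abs/2506.23411
Authors: Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: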
Abstract:Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research, characterizing them along several key dimensions including their origin, scope, content, and intended use to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at this https URL to promote transparency and reproducibility across the research community.
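[NLP-43] Teaching a Language Model to Speak the Language of Tools
Quick read: This paper addresses the lack of reliable tool-use capabilities of multilingual models in non-English languages, especially lower-resource languages, where models tend to exhibit language confusion and fail to produce correctly structured function-call outputs. The key to its approach is continued training of the BgGPT model series on a novel bilingual dataset to achieve robust tool use in the target language; the resulting TUCAN (Tool-Using Capable Assistant Navigator) models significantly improve function-calling accuracy and produce well-formed responses.
Link: https://arxiv.org/abs/2506.23394
Authors: Simeon Emanuilov
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: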
Abstract:External tool integration through function-calling is essential for practical language model applications, yet most multilingual models lack reliable tool-use capabilities in non-English languages. Even state-of-the-art multilingual models struggle with determining when to use tools and generating the structured outputs required for function calls, often exhibiting language confusion when prompted in lower-resource languages. This work presents a methodology for adapting existing language models to enable robust tool use in any target language, using Bulgarian as a case study. The approach involves continued training of the BgGPT model series (2.6B, 9B, 27B parameters) on a novel bilingual dataset of 10,035 function-calling examples designed to support standardized protocols like MCP (Model Context Protocol). The research introduces TUCAN (Tool-Using Capable Assistant Navigator), which achieves up to 28.75% improvement in function-calling accuracy over base models while preserving core language understanding, as verified on established Bulgarian benchmarks. Beyond accuracy gains, TUCAN models demonstrate production-ready response formatting with clean, parsable function calls, contrasting with the verbose and inconsistent outputs of base models. The models, evaluation framework, and dataset are released to enable replication for other languages. This work demonstrates a practical approach for extending tool-augmented capabilities beyond English-centric systems.
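[NLP-44] Hierarchical Memory Organization for Wikipedia Generation ACL2025
Quick read: This paper addresses the challenge of generating Wikipedia articles autonomously, that is, integrating accurate, comprehensive, and well-structured content from diverse sources. The key to its approach is the Memory Organization-based Generation (MOG) framework, which uses a hierarchical memory architecture to extract fine-grained memory units and recursively organize them into a Wikipedia-style hierarchy that guides generation, keeping memory aligned with the article outline, improving informativeness and verifiability, and reducing hallucinations.
Link: https://arxiv.org/abs/2506.23393
Authors: Eugene J. Yu, Dawei Zhu, Yifan Song, Xiangyu Wong, Jiebin Zhang, Wenxuan Shi, Xiaoguang Li, Qun Liu, Sujian Li
Affiliations: Peking University; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main Conference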
Abstract:Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
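[NLP-45] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs
Quick read: This paper addresses the lack of a quantifiable understanding of bias and perspective in large language model (LLM) output. The key to its approach is a metric space called Perspective Space, which supports quantitative analysis of different perspectives on a topic, combined with Systematic Prompt Engineering, which uses greedy coordinate descent to steer the perspective of LLM output based on feedback from Perspective Space. The approach can quantify and adjust outputs across a variety of topics without requiring a principled understanding of perspective or bias.
Link: https://arxiv.org/abs/2506.23377
Authors: Taejin Kim, Siun-Chuon Mau, Konrad Vesey
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 main pages of text, 5 figures, 2 tables. Research work performed at CACI INTL INC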
Abstract:Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large language model (LLM) output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering, which uses greedy coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias, effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking, and mitigation of LLM bias; narrative detection, sense making, and tracking in public discourse; and debate bots advocating a given perspective.
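As a rough illustration of greedy coordinate descent over prompt choices, the sketch below treats a prompt as a list of slots, each with a few candidate options, and repeatedly changes one slot at a time whenever that lowers a loss. The slot structure and the `loss` callable are assumptions made for this example; in the paper's setting the loss would presumably come from measuring LLM output in Perspective Space, which is not modeled here.

```python
def greedy_coordinate_descent(options, loss, max_sweeps=10):
    """Minimize `loss` by changing one slot (coordinate) at a time.

    options: list of candidate lists, one per prompt slot (hypothetical).
    loss: callable taking a tuple of choices and returning a float.
    """
    current = [opts[0] for opts in options]
    best = loss(tuple(current))
    for _ in range(max_sweeps):
        improved = False
        for i, opts in enumerate(options):
            for cand in opts:
                trial = current[:i] + [cand] + current[i + 1:]
                value = loss(tuple(trial))
                if value < best:  # accept only strict improvements
                    best, current, improved = value, trial, True
        if not improved:
            break  # a full sweep changed nothing: local optimum
    return tuple(current), best


# Toy stand-in for a perspective measurement: each option is a number,
# and the loss is the distance of their sum from a target value of 7.
toy_options = [[0, 1, 2], [0, 5, 10]]
choice, final_loss = greedy_coordinate_descent(
    toy_options, lambda c: abs(sum(c) - 7)
)
```

The procedure only finds a local optimum, since each step looks at a single slot while holding the others fixed.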
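[NLP-46] You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties ISCA
Quick read: This paper addresses the intelligibility of text-to-speech (TTS) systems for second language (L2) listeners. The key to its approach is to exploit the duration difference between tense (longer) and lax (shorter) vowels in American English to build a "clarity mode" that improves L2 listeners' recognition accuracy. Experiments show that this mode reduces transcription errors for French-L1, English-L2 listeners and is rated as more encouraging and respectful, whereas conventionally slowing down all speech did not achieve the same effect.
Link: https://arxiv.org/abs/2506.23367
Authors: Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim
Affiliations: SUPMICROTECH, CNRS, institut FEMTO-ST
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted to ISCA Speech Synthesis Workshop, 2025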
Abstract:We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a “clarity mode” for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
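[NLP-47] Density asymmetry and citation dynamics in scientific literature
Quick read: This paper investigates the relationship between the semantic similarity of scientific papers and their citation rates, specifically whether the tension in scientific behavior between building on established knowledge and introducing new ideas is reflected in the relationship between a paper's similarity to previous research and its eventual citation rate. The key to its approach is two complementary metrics, density (ρ) and asymmetry (α), which characterize the local geometry of a paper in a semantic embedding space and thereby quantify its similarity to prior work; Bayesian hierarchical regression is then used to test how much these metrics contribute to predicting citation rates.
Link: https://arxiv.org/abs/2506.23366
Authors: Nathaniel Imel, Zachary Hafen
Affiliations: New York University; Northwestern University
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: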
Abstract:Scientific behavior is often characterized by a tension between building upon established knowledge and introducing novel ideas. Here, we investigate whether this tension is reflected in the relationship between the similarity of a scientific paper to previous research and its eventual citation rate. To operationalize similarity to previous research, we introduce two complementary metrics to characterize the local geometry of a publication’s semantic neighborhood: (1) density (ρ), defined as the ratio between a fixed number of previously-published papers and the minimum distance enclosing those papers in a semantic embedding space, and (2) asymmetry (α), defined as the average directional difference between a paper and its nearest neighbors. We tested the predictive relationship between these two metrics and its subsequent citation rate using a Bayesian hierarchical regression approach, surveying ~53,000 publications across nine academic disciplines and five different document embeddings. While the individual effects of ρ on citation count are small and variable, incorporating density-based predictors consistently improves out-of-sample prediction when added to baseline models. These results suggest that the density of a paper’s surrounding scientific literature may carry modest but informative signals about its eventual impact. Meanwhile, we find no evidence that publication asymmetry improves model predictions of citation rates. Our work provides a scalable framework for linking document embeddings to scientometric outcomes and highlights new questions regarding the role that semantic similarity plays in shaping the dynamics of scientific reward.
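One way to read the density definition in this abstract is ρ = k / r_k, where r_k is the radius of the smallest ball around a paper that contains its k nearest previously published papers. The sketch below computes that quantity with Euclidean distance on toy 2-D embeddings; the choice of distance and the function name are assumptions for illustration, and the authors' implementation may differ.

```python
import math


def density(paper, prior_papers, k):
    """k divided by the distance to the k-th nearest prior paper.

    paper: embedding vector of the focal paper.
    prior_papers: embedding vectors of previously published papers.
    Euclidean distance is an assumption made for this sketch.
    """
    if k < 1 or k > len(prior_papers):
        raise ValueError("k must be between 1 and the number of prior papers")
    distances = sorted(math.dist(paper, p) for p in prior_papers)
    radius = distances[k - 1]  # smallest radius enclosing k prior papers
    if radius == 0:
        raise ValueError("k-th neighbor coincides with the focal paper")
    return k / radius


# Toy embeddings: prior papers at distances 1, 2, and 5 from the origin.
priors = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
rho_2 = density((0.0, 0.0), priors, k=2)
rho_3 = density((0.0, 0.0), priors, k=3)
```

A paper in a crowded region of embedding space reaches its k-th neighbor at a small radius and so gets a high density, while a paper far from prior work gets a low one.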
[NLP-48] ATGen: A Framework for Active Text Generation ACL2025
Quick read: This paper addresses how to apply active learning (AL) effectively to natural language generation (NLG) tasks in order to reduce annotation cost. The key to its approach is ATGen, a comprehensive framework that combines AL with text generation tasks, supports both automatic annotation agents based on large language models (LLMs) and human annotators, and provides a unified platform for implementing and benchmarking new AL strategies for NLG tasks.
Link: https://arxiv.org/abs/2506.23342
Authors: Akim Tsvigun, Daniil Vasilev, Ivan Tsvigun, Ivan Lysenko, Talgat Bektleuov, Aleksandr Medvedev, Uliana Vinogradova, Nikita Severin, Mikhail Mozikov, Andrey Savchenko, Rostislav Grigorev, Ramil Kuleev, Fedor Zhdanov, Artem Shelmanov, Ilya Makarov
Affiliations: Research Center of the Artificial Intelligence Institute, Innopolis University; HSE University; T-Technologies; Robotics Center; AIRI; SB-AI-Lab; Royal Holloway University of London; MBZUAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2025 System Demonstrations
Abstract:Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at this http URL
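[NLP-49] Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family
Quick read: This paper addresses information loss in multilingual translation, especially for languages with limited training data or significant divergence from English. The key to its approach is a systematic analysis of how training data size, language proximity, and language family affect translation quality, evaluating GPT-4 and Llama 2 on round-trip translation to reveal the relationship between linguistic structure and translation performance. It finds that abundant training data can mitigate the effects of linguistic divergence, yet languages structurally closer to English still maintain higher translation quality in low-resource conditions.
Link: https://arxiv.org/abs/2506.23340
Authors: Yumeng Lin, Xufeng Duan, David Haslett, Yige Chen, Zhenguang G. Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: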
Abstract:Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs, particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality was assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.
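[NLP-50] GaussMaster: An LLM-based Database Copilot System
Quick read: This paper addresses the heavy workload that database administrators (DBAs) in the financial industry face from tasks such as SQL tuning, database deployment, diagnosis, and service repair. Existing autonomous database platforms are limited, mainly targeting single-point problems such as NL2SQL, anomaly detection, and SQL tuning, and still require manual intervention for comprehensive maintenance. The key to its approach is GaussMaster, an LLM-based database copilot that analyzes hundreds of metrics and logs, uses a Tree-of-thought approach to identify root causes, and invokes appropriate tools to carry out the entire maintenance process automatically.
Link: https://arxiv.org/abs/2506.23322
Authors: Wei Zhou, Ji Sun, Xuanhe Zhou, Guoliang Li, Luyang Liu, Hao Wu, Tianyuan Wang
Affiliations: Huawei Technologies Co., Ltd.; Tsinghua University; Shanghai Jiaotong University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: We welcome contributions from the community. For reference, please see the code at: this https URL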
Abstract:In the financial industry, data is the lifeblood of operations, and DBAs shoulder significant responsibilities for SQL tuning, database deployment, diagnosis, and service repair. In recent years, both database vendors and customers have increasingly turned to autonomous database platforms in an effort to alleviate the heavy workload of DBAs. However, existing autonomous database platforms are limited in their capabilities, primarily addressing single-point issues such as NL2SQL, anomaly detection, and SQL tuning. Manual intervention remains a necessity for comprehensive database maintenance. GaussMaster aims to revolutionize this landscape by introducing an LLM-based database copilot system. This innovative solution is designed not only to assist developers in writing efficient SQL queries but also to provide comprehensive care for database services. When database instances exhibit abnormal behavior, GaussMaster is capable of orchestrating the entire maintenance process automatically. It achieves this by analyzing hundreds of metrics and logs, employing a Tree-of-thought approach to identify root causes, and invoking appropriate tools to resolve issues. We have successfully implemented GaussMaster in real-world scenarios, such as the banking industry, where it has achieved zero human intervention for over 34 database maintenance scenarios. In this paper, we present significant improvements in these tasks with code at this https URL.
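[NLP-51] Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs)
Quick read: This paper addresses the detection and classification of medication events in clinical notes, a key challenge in natural language processing for electronic health records (EHRs). The key to its approach is a BERT-based ensemble: BERT models are pretrained on different types of large corpora, fine-tuned on the CMED training data, and their multiple predictions are combined with voting strategies to produce the final prediction, improving the strict Micro-F and Macro-F scores.
Link: https://arxiv.org/abs/2506.23315
Authors: Shouvon Sarker, Xishuang Dong, Lijun Qian
Affiliations: Prairie View A&M University, Texas A&M University System, CREDIT Center
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: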
Abstract:Identification of key variables such as medications, diseases, relations from health records and clinical notes has a wide range of applications in the clinical domain. n2c2 2022 provided shared tasks on challenges in natural language processing for clinical data analytics on electronic health records (EHR), where it built a comprehensive annotated clinical data Contextualized Medication Event Dataset (CMED). This study focuses on subtask 2 in Track 1 of this challenge that is to detect and classify medication events from clinical notes through building a novel BERT-based ensemble model. It started with pretraining BERT models on different types of big data such as Wikipedia and MIMIC. Afterwards, these pretrained BERT models were fine-tuned on CMED training data. These fine-tuned BERT models were employed to accomplish medication event classification on CMED testing data with multiple predictions. These multiple predictions generated by these fine-tuned BERT models were integrated to build final prediction with voting strategies. Experimental results demonstrated that BERT-based ensemble models can effectively improve strict Micro-F score by about 5% and strict Macro-F score by about 6%, respectively.
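The voting step mentioned in this abstract can be sketched as plain majority voting over the label each fine-tuned model assigns to the same example. The abstract does not say which voting strategies were used, so hard majority voting is an assumption here, and the labels "A", "B", and "C" are placeholders rather than CMED classes.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: one list of labels per model, all the same
    length and aligned by example. Ties fall to the label seen first,
    i.e. the one from the earliest model in the list.
    """
    if not predictions_per_model:
        raise ValueError("need at least one model's predictions")
    n_examples = len(predictions_per_model[0])
    if any(len(p) != n_examples for p in predictions_per_model):
        raise ValueError("all models must predict the same examples")
    final = []
    for labels in zip(*predictions_per_model):
        final.append(Counter(labels).most_common(1)[0][0])
    return final


# Three hypothetical models voting on three examples.
model_outputs = [
    ["A", "B", "C"],
    ["A", "C", "C"],
    ["B", "B", "C"],
]
ensemble = majority_vote(model_outputs)
```

Weighted voting or averaging of class probabilities are common alternatives when the models' confidence scores are available.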
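[NLP-52] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines
Quick read: This paper addresses limitations of conventional generative language models in symbolic structure, interpretability, and generalization, in particular how symbolic structure can emerge from local neural learning. The key to its approach is a neuro-symbolic framework driven by local events, built around a hierarchical Hopfield memory chain that serves as a compositional short-term memory and dynamic tokenizer (retokenizer). It self-organizes multi-scale representations, using projection tensors to bind co-occurring features into hierarchical tokens and introducing redundancy so that local activations are compressed into long-range dependencies. Without predefined tokens or supervision, the model can filter natural language patterns from noise and generate synthetic languages with coherent internal morphology, while local Hebbian learning retains new information, yielding a kind of plasticity and generalization not found in conventional models.
Link: https://arxiv.org/abs/2506.23293
Authors: P. Myles Eugenio
Affiliations: Indiana University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 22 pages, 7 figures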
Abstract:We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e., an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology, quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class, even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems, where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.
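For readers unfamiliar with the building blocks named in this abstract, the sketch below shows a classical single-layer Hopfield network trained with the Hebbian outer-product rule. It is textbook background only: the paper's hierarchical memory chain, projection tensors, and retokenizer are not modeled, and the function names are chosen for this example.

```python
def hebbian_weights(patterns):
    """Hebbian outer-product rule for patterns with entries in {-1, +1}."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its local field."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if field >= 0 else -1
    return state


# Store one pattern, then recover it from a copy with one flipped unit.
stored = [1, -1, 1, -1, 1, 1]
noisy = [1, 1, 1, -1, 1, 1]
recovered = recall(hebbian_weights([stored]), noisy)
```

The learning rule is local in the sense that each weight depends only on the activity of the two units it connects, with no global objective being optimized.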
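[NLP-53] Two Spelling Normalization Approaches Based on Large Language Models
[Quick Read]: This paper addresses the linguistic challenge that historical documents pose because of the absence of standardized spelling conventions and the natural evolution of language; specifically, spelling normalization aligns a document's orthography with modern standards. The key to the solution is two new approaches based on large language models: one trained without supervision and one trained for machine translation. Results show that although both approaches are encouraging, statistical machine translation still appears to be the most suitable technology for this task.
Link: https://arxiv.org/abs/2506.23288
Authors: Miguel Domingo,Francisco Casacuberta
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: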
Abstract:The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document’s orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one which has been trained without supervised training, and a second one which has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
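[NLP-54] Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games
[Quick Read]: This paper examines cooperation and social mechanisms in multi-agent large language model (LLM) systems, in particular how agents balance self-interest against collective well-being, which matters for alignment, robustness, and safe deployment. The key to the approach is adapting a public goods game with institutional choice from behavioral economics to observe how different LLMs handle social dilemmas over repeated interactions, revealing the varied cooperative behavior of the models.
Link: https://arxiv.org/abs/2506.23276
Authors: David Guzman Piedrahita,Yongjin Yang,Mrinmaya Sachan,Giorgia Ramponi,Bernhard Schölkopf,Zhijing Jin
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: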
Abstract:As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at this https URL
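For readers unfamiliar with the setting, the payoff rule of a standard linear public goods game can be sketched as follows. The endowment, the multiplier, and the function name are illustrative assumptions; the abstract does not give the paper's parameters, and the sanctioning institution it studies is left out here.

```python
def public_goods_payoffs(contributions, endowment=20.0, multiplier=1.6):
    """Payoffs in a standard linear public goods game (illustrative values only)."""
    n = len(contributions)
    # Pooled contributions are multiplied and then split equally among all players.
    share = multiplier * sum(contributions) / n
    # Each player keeps whatever they did not contribute, plus the equal share.
    return [endowment - c + share for c in contributions]


# Three full contributors and one free-rider who contributes nothing.
payoffs = public_goods_payoffs([20.0, 20.0, 20.0, 0.0])
```

With these assumed values the free-rider ends up with a higher payoff than each contributor, which is the social dilemma the game is designed to expose.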
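[NLP-55] Generalist Reward Models: Found Inside Large Language Models
[Quick Read]: This paper addresses the reliance of large language model (LLM) alignment on reward models trained with costly human preference data. The key to the solution is the finding that any LLM trained with standard next-token prediction already contains a powerful generalist reward model. This endogenous reward is theoretically equivalent to a reward function learned through offline inverse reinforcement learning, so a high-quality reward signal can be elicited directly without further training, and reinforcement learning with that signal yields a policy with a provably better error bound.
Link: https://arxiv.org/abs/2506.23235
Authors: Yi-Chen Li,Tian Xu,Yang Yu,Xuqin Zhang,Xiong-Hui Chen,Zhongxiang Ling,Ningjing Chao,Lei Yuan,Zhi-Hua Zhou
Institutions: Nanjing University
Categories: Computation and Language (cs.CL)
Comments: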
Abstract:The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To our best knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for LLMs alignment as well as multi-modal models.
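[NLP-56] Masked Gated Linear Unit
[Quick Read]: This paper addresses the extra memory reads that gated linear units (GLUs) incur in LLM feed-forward networks because they use separate weight matrices for the gate and value streams, which hurts efficiency. The key to the solution is a new GLU family called Masked Gated Linear Units (MGLUs), whose core contributions are: (1) the Mixture of Element-wise Gating (MoEG) architecture, which learns multiple binary masks that assign each element of a single shared weight matrix to either the gate or the value, reducing memory transfer; and (2) FlashMGLU, a hardware-friendly kernel that is up to 19.7× faster at inference than a naive PyTorch MGLU implementation on an RTX5090 GPU and is more memory-efficient and faster than standard GLUs.
Link: https://arxiv.org/abs/2506.23225
Authors: Yukito Tajima,Nakamasa Inoue,Yusuke Sekikawa,Ikuro Sato,Rio Yokota
Institutions: Institute of Science Tokyo; Denso IT Laboratory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: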
Abstract:Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7× inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs despite added architectural complexity on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching, or even surpassing, the downstream accuracy of the SwiGLU baseline.
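One plausible reading of element-level gate/value assignment on a shared weight matrix is sketched below in plain Python. This is an assumption-laden illustration, not the paper's implementation: it uses a single fixed binary mask `M` rather than a learned mixture of masks, and the function names are hypothetical.

```python
import math


def swish(z):
    """Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def masked_glu(x, W, M):
    """Gated unit that reads one shared matrix W for both streams.

    x: input vector of length d_in
    W: shared weights, d_in x d_hidden
    M: binary mask, same shape as W; 1 sends the weight to the gate stream,
       0 sends it to the value stream (assumed interpretation).
    """
    d_in, d_hidden = len(W), len(W[0])
    out = []
    for j in range(d_hidden):
        gate = sum(x[i] * W[i][j] * M[i][j] for i in range(d_in))
        value = sum(x[i] * W[i][j] * (1 - M[i][j]) for i in range(d_in))
        out.append(swish(gate) * value)
    return out


x = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 2.0]]
M = [[1, 0], [0, 1]]
h = masked_glu(x, W, M)
```

The point of the sketch is that only `W` and a compact mask are read, where a standard GLU would read two separate weight matrices.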
[NLP-57] UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding ICCV2025
[Quick Read]: This paper addresses the lack of a unified framework for processing multi-modal data in urban research; current methods usually focus on specific data types and struggle to cover diverse urban scenarios and tasks. The key to the solution is UrbanLLaVA, a multi-modal large language model (MLLM) that processes four types of data simultaneously and uses a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, improving compatibility and downstream performance across urban tasks.
Link: https://arxiv.org/abs/2506.23219
Authors: Jie Feng,Shengyuan Wang,Tianhui Liu,Yanxin Xi,Yong Li
Institutions: Tsinghua University; Beijing Jiaotong University; University of Helsinki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICCV 2025
Abstract:Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from location view to global view of urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via this https URL.
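[NLP-58] RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams SIGIR’23
[Quick Read]: This paper addresses the poor adaptability of traditional static word embedding models to evolving language patterns, such as new hashtags or brand names on social media and the web. The key solution is incremental word embedding algorithms, which update word representations dynamically in response to new language patterns and process continuous data streams. To this end, the authors present RiverText, a Python library for training and evaluating incremental word embeddings from text data streams; it supports several incremental word embedding techniques and uses PyTorch as the backend for neural network training.
Link: https://arxiv.org/abs/2506.23192
Authors: Gabriel Iturra-Bocaz,Felipe Bravo-Marquez
Institutions: University of Chile; National Center for Artificial Intelligence; Millennium Institute for Foundational Research on Data
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at SIGIR’23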
Abstract:Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at this https URL. Related DOI: https://doi.org/10.1145/3539618.3591908
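The idea of updating word representations one text at a time can be illustrated with a streaming word-context count matrix. The class below is a hypothetical sketch and is not RiverText's API; the class name, the `learn_one` method, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


class StreamingCooccurrence:
    """Word-context counts updated incrementally, one text at a time."""

    def __init__(self, window=2):
        self.window = window
        # counts[word][context_word] -> number of co-occurrences seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def learn_one(self, text):
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.counts[word][tokens[j]] += 1


model = StreamingCooccurrence(window=1)
for text in ["new phone launch", "new phone review"]:
    model.learn_one(text)  # the representation changes after every text
```

Because nothing is retrained from scratch, a word that first appears late in the stream simply gains a new row of counts when it is seen.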
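[NLP-59] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy
[Quick Read]: This paper addresses the high labeling cost of in-context learning (ICL) demonstrations, which motivates synthesizing them with large language models (LLMs); existing synthesis methods, however, are mainly task-specific or depend on pre-existing demonstrations. The key to the solution is V-Score, a consistency metric with higher performance and lower computation cost than metrics based on n-grams or embedding vectors, together with V-Synthesis, which uses V-Score for proportional sampling to ensure both consistency and diversity of the synthesized demonstrations.
Link: https://arxiv.org/abs/2506.23149
Authors: Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
Institutions: Harbin Institute of Technology; Singapore Management University
Categories: Computation and Language (cs.CL)
Comments: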
Abstract:High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. So this paper focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods, confirming the effectiveness of V-Synthesis.
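Proportional sampling in general means drawing items with probability proportional to a score, so high-scoring candidates are favored while lower-scoring ones can still appear. The sketch below shows that generic idea only; the scores are placeholders, the abstract does not say how V-Score is computed, and sampling with replacement is an assumption.

```python
import random


def proportional_sample(candidates, scores, k, seed=0):
    """Draw k candidates with probability proportional to their scores.

    Sampling is with replacement (an assumption; the abstract does not specify).
    """
    rng = random.Random(seed)
    return rng.choices(candidates, weights=scores, k=k)


demos = ["demo A", "demo B", "demo C"]
placeholder_scores = [0.7, 0.2, 0.1]  # stand-ins for a consistency score
picked = proportional_sample(demos, placeholder_scores, k=4)
```

Compared with always taking the top-scoring candidates, this keeps some diversity in the selected set.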
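[NLP-60] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
[Quick Read]: This paper addresses the difficulty of reliably evaluating in-context learning (ICL), whose effectiveness varies widely across models and tasks. Existing evaluations based on performance change are unreliable and hard to attribute, and they are impractical when data is scarce. The key to the solution is a new metric, the Learning-to-Context Slope (LCS), which quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance), improving reliability and attribution while reducing reliance on labeled data.
Link: https://arxiv.org/abs/2506.23146
Authors: Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: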
Abstract:In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
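A slope between two measured quantities can be estimated with ordinary least squares, as in the minimal sketch below. Using OLS here is an assumption, since the abstract does not give the exact LCS formulation, and the relevance and gain values are made-up numbers for illustration.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical measurements: relevance of demonstrations to the input,
# and the loss decrease observed when those demonstrations are added.
relevance = [0.0, 1.0, 2.0, 3.0]
learning_gain = [0.5, 1.0, 1.5, 2.0]
slope = least_squares_slope(relevance, learning_gain)
```

A larger positive slope would indicate that more relevant demonstrations yield larger loss reductions, which matches the intuition described in the abstract.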
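[NLP-61] Benchmarking Deep Search over Heterogeneous Enterprise Data
[Quick Read]: This paper addresses the evaluation of Deep Search, a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. The key to the solution is a benchmark built with a synthetic data pipeline that simulates business workflows across product planning, development, and support, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. The benchmark includes a retrieval pool of 39,190 enterprise artifacts and supports fine-grained evaluation of long-context LLMs and RAG systems.
Link: https://arxiv.org/abs/2506.23139
Authors: Prafulla Kumar Choubey,Xiangyu Peng,Shilpa Bhagavath,Kung-Hsiang Huang,Caiming Xiong,Chien-Sheng Wu
Institutions: Salesforce AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: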
Abstract:We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.
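[NLP-62] Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion
[Quick Read]: This paper addresses the modeling of multifaceted relations in knowledge graph completion (KGC), where most existing methods rely on static, embedding-based scoring and struggle to capture contextual dependencies and relational dynamics. The key to the solution is the Flow-Modulated Scoring (FMS) framework, which has two core components: a semantic context learning module that encodes context-sensitive entity representations, and a conditional flow-matching module that learns the context-dependent dynamic transformation from a head embedding to a tail embedding. By combining context-aware static representations with conditioned dynamic information, FMS models relational semantics in greater depth.
Link: https://arxiv.org/abs/2506.23137
Authors: Siyuan Li,Ruitong Liu,Yan Wen,Te Sun
Institutions: Dalian University of Technology; Beijing Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages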
Abstract:Effective modeling of multifaceted relations is pivotal for Knowledge Graph Completion (KGC). However, a majority of existing approaches are predicated on static, embedding-based scoring, exhibiting inherent limitations in capturing contextual dependencies and relational dynamics. Addressing this gap, we propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal components: (1) a semantic context learning module that encodes context-sensitive entity representations, and (2) a conditional flow-matching module designed to learn the dynamic transformation from a head to a tail embedding, governed by the aforementioned context. The resultant predictive vector field, representing the context-informed relational path, serves to dynamically refine the initial static score of an entity pair. Through this synergy of context-aware static representations and conditioned dynamic information, FMS facilitates a more profound modeling of relational semantics. Comprehensive evaluations on several standard benchmarks demonstrate that our proposed method surpasses prior state-of-the-art results.
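[NLP-63] LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation
[Quick Read]: This paper addresses the limitations of traditional retrieval-augmented generation (RAG) pipelines on technical documents that contain structured data such as tables and images. The key to the solution is a RAG pipeline that handles tables and images and combines vector similarity search with a reranker fine-tuned from Gemma-2-9b-it; the reranker is trained with RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset to improve context identification for question answering.
Link: https://arxiv.org/abs/2506.23136
Authors: Shadman Sobhan,Mohammad Ariful Haque
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 29 Pages, 11 Tables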
Abstract:Large Language Models (LLMs) are capable of natural language understanding and generation. But they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient solution by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle with retrieving information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline, capable of handling tables and images in documents, for technical documents that support both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture is superior to general RAG pipelines in terms of table-based questions and handling questions outside context.
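The two-stage retrieval described here, a vector similarity shortlist followed by reranking, can be sketched as follows. The toy vectors, the chunk texts, and the keyword-based `toy_reranker` are assumptions for illustration; in the paper the second stage is a fine-tuned Gemma-2-9b-it reranker, which this sketch does not reproduce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, chunks, rerank_score, top_k=2):
    """chunks is a list of (text, vector) pairs; rerank_score maps text to a float."""
    # Stage 1: keep the top_k chunks by vector similarity to the query.
    shortlist = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:top_k]
    # Stage 2: reorder the shortlist with the (placeholder) reranker.
    return sorted(shortlist, key=lambda c: rerank_score(c[0]), reverse=True)


def toy_reranker(text):
    # Placeholder standing in for a learned reranker.
    return 1.0 if "wiring" in text else 0.0


chunks = [
    ("Table 3: torque limits", [1.0, 0.0]),
    ("Figure 2: wiring diagram", [0.9, 0.1]),
    ("Revision history", [0.0, 1.0]),
]
ranked = retrieve_then_rerank([1.0, 0.0], chunks, toy_reranker)
```

The cheap similarity search narrows the pool so that the more expensive reranker only scores a handful of candidates.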
[NLP-64] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format
[Quick Read]: This paper addresses reasoning inconsistency in large language models (LLMs), which is commonly mitigated by generating multiple answers and voting over them. Previous multi-format methods rely on human-labeled reasoning formats, which may not suit every task and are costly to label. The key to the solution is Format-Adapter, which uses LLMs to generate and select reasoning formats suited to the task by minimizing a proposed reasoning error measure, improving performance on math and commonsense reasoning tasks.
Link: https://arxiv.org/abs/2506.23133
Authors: Dingzirui Wang,Xuanliang Zhang,Rongyu Cao,Longxu Dou,Xianzhen Luo,Yingwei Ma,Qingfu Zhu,Wanxiang Che,Binhua Li,Fei Huang,Yongbin Li
Institutions: Harbin Institute of Technology; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Generating and voting multiple answers is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which could be unsuitable for all tasks and have high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating the effectiveness.
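The "generate several answers, then vote" step that this line of work builds on can be sketched as a simple majority vote. The format names and answers below are invented for illustration, and the format generation and selection that Format-Adapter contributes are not shown.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical answers produced under different reasoning formats.
answers_by_format = {
    "step-by-step": "42",
    "equation-first": "41",
    "table": "42",
}
final_answer = majority_vote(list(answers_by_format.values()))
```

Voting only helps when the formats make different mistakes, which is one reason the choice of formats matters.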
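[NLP-65] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning
[Quick Read]: This paper addresses the challenges that large language models (LLMs) face in embodied task planning, which requires continuous environmental understanding and action generation; in particular, it is hard to learn causal relationships between actions and environmental feedback in partially observable environments. The key to the solution is the Embodied Planner-R1 framework, which combines pure reinforcement learning with group rollout, a completion-driven sparse reward, and Interactive Policy Optimization (IPO), so that LLMs develop interactive capabilities through autonomous exploration with minimal supervision. It achieves substantially higher completion rates on two text-based embodied planning benchmarks.
Link: https://arxiv.org/abs/2506.23127
Authors: Zhaoye Fei,Li Ji,Siyin Wang,Junhao Shi,Jingjing Gong,Xipeng Qiu
Institutions: Fudan University; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: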
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) Without human annotations, we employ pure reinforcement learning with group rollout, incorporating in-environment interaction through parallel exploration; (2) completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a 3.66% drop in previously unseen environments, evidencing strong generalization.
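To show how a sparse completion reward over a group of rollouts can become a learning signal, the sketch below uses group-relative advantage normalization in the style of GRPO. This is a swapped-in, well-known technique chosen for illustration; the abstract does not specify IPO's objective, and the reward values are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Sparse completion reward for four rollouts of the same task:
# 1.0 if the task was completed, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that completed the task receive positive advantages and the others negative ones, without needing a separate value model.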
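[NLP-66] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models
[Quick Read]: This paper addresses the identification of narrative roles (Hero, Villain, Victim, and Other) in Internet memes across several test sets spanning English and code-mixed English-Hindi. The key to the approach is a more balanced and linguistically diverse dataset, together with lexical and structural analyses showing that real memes use culture-specific, context-rich language, in contrast to the explicit and repetitive lexical markers of synthetically curated hateful content. The study also evaluates a range of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models, and explores prompt design strategies to improve multimodal models.
Link: https://arxiv.org/abs/2506.23122
Authors: Shivam Sharma,Tanmoy Chakraborty
Institutions: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This work has been submitted to the IEEE for possible publication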
Abstract:This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the ‘Other’ class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the ‘Victim’ class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
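The precision, recall, and F1 metrics mentioned in the abstract can be computed per class as in the sketch below. The gold and predicted labels are invented for illustration and do not come from the paper's data.

```python
def precision_recall_f1(gold, predicted, label):
    """Per-class precision, recall, and F1 for one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["victim", "hero", "victim", "other"]
pred = ["victim", "victim", "other", "other"]
p, r, f1 = precision_recall_f1(gold, pred, "victim")
```

Reporting these per class makes it visible when one role, such as the "Victim" class discussed in the abstract, is much harder than the others.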
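[NLP-67] MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
[Quick Read]: This paper addresses three key limitations of multimodal embedding models built on causal vision language models (VLMs): causal attention is suboptimal for embedding tasks, reliance on high-quality labeled paired data for contrastive learning limits scalability, and training objectives and data lack diversity. The key to the solution is MoCa, a two-stage framework: the first stage, modality-aware continual pre-training, introduces a joint reconstruction objective to strengthen bidirectional context-aware reasoning; the second stage, heterogeneous contrastive fine-tuning, uses diverse, semantically rich multimodal data to improve generalization and alignment.
Link: https://arxiv.org/abs/2506.23115
Authors: Haonan Chen,Hong Liu,Yuping Luo,Liang Wang,Nan Yang,Furu Wei,Zhicheng Dou
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; Stanford University; Microsoft Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Homepage: this https URL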
Abstract:Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
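[NLP-68] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes ACL2025
[Quick Read]: This paper addresses the fact that existing fairness studies focus largely on Western contexts and are inadequate for culturally diverse countries such as India. The key to the solution is INDIC-BIAS, a comprehensive India-centric benchmark for evaluating the fairness of large language models (LLMs) across 85 identity groups covering different castes, religions, regions, and tribes. Domain experts helped curate over 1,800 socio-cultural topics, from which 20,000 real-world scenario templates were generated and manually validated; these are organized into three evaluation tasks (plausibility, judgment, and generation) to probe models for bias and stereotypes.
Link: https://arxiv.org/abs/2506.23111
Authors: Janki Atul Nawale,Mohammed Safi Ur Rahman Khan,Janani D,Mansi Gupta,Danish Pruthi,Mitesh M. Khapra
Institutions: Nilekani Centre at AI4Bharat; Indian Institute of Technology, Madras; Indian Institute of Science, Bangalore
Categories: Computation and Language (cs.CL)
Comments: Accepted in ACL 2025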
Abstract:Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
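[NLP-69] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship
[Quick Read]: This paper addresses the potential of multimodal large language models (MLLMs) to encode and amplify gender bias in tasks that combine visual and textual modalities, with a focus on relational and contextual gender bias in socially sensitive applications. The key to the solution is the Genres benchmark, which evaluates gender bias in MLLMs through the lens of social relationships: a dual-character profile and narrative generation task captures rich interpersonal dynamics and supports fine-grained, multi-dimensional bias evaluation, revealing context-sensitive gender biases that are not evident in single-character settings.
Link: https://arxiv.org/abs/2506.23101
Authors: Yue Xu,Wenjie Wang
Institutions: ShanghaiTech University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: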
Abstract:Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.
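[NLP-70] Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries
[Quick Read]: This paper addresses the limited expressiveness of conventional Text-to-SQL when handling unstructured data or ambiguous queries, as well as the reliance of existing VectorSQL implementations on manual crafting and their lack of tailored evaluation frameworks. The key to its solution is the Text2VectorSQL framework, which unifies Text-to-SQL and vector search and, through semantic filtering, multi-modal matching, and retrieval acceleration, supports more diverse and holistic natural language queries. For evaluation, the authors build vector indexes, extend user queries, and annotate ground truths automatically, and the results confirm the framework's significant performance advantage.
Link: https://arxiv.org/abs/2506.23071
Authors: Zhengren Wang, Bozhou Li, Dongwen Yao, Wentao Zhang
Affiliations: Peking University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: Work in progress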
Abstract:While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at this https URL.
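[NLP-71] Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning ACL2025
[Quick Read]: This paper addresses the challenges that large language models (LLMs) face in molecular structure elucidation, which stem mainly from their limited grasp of specialized chemical knowledge. The key to its solution is K-MSE, a knowledge-enhanced reasoning framework that uses Monte Carlo Tree Search for test-time scaling, builds an external molecular substructure knowledge base to extend LLMs' coverage of the chemical structure space, and introduces a specialized molecule-spectrum scorer as a reward model for the reasoning process to counter LLMs' inaccurate solution evaluation.
Link: https://arxiv.org/abs/2506.23056
Authors: Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Affiliations: Zhejiang University; ZJU-Hangzhou Global Scientific and Technological Innovation Center; University College London
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Main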
Abstract:Molecular structure elucidation involves deducing a molecule’s structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs’ limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs’ coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at this https URL.
[NLP-72] MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition
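[Quick Read]: This paper addresses the lack of high-quality gold-standard datasets for named entity recognition (NER) in Brazilian Portuguese, especially in specific domains such as historical text analysis. The key to its solution is MariNER, the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences, built to support related research and model evaluation.
Link: https://arxiv.org/abs/2506.23051
Authors: João Lucas Luz Lima Sarcinelli, Marina Lages Gonçalves Teixeira, Jade Bortot de Paiva, Diego Furtado Silva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: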
Abstract:Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks a sufficient quantity of gold-standard NER datasets, especially when considering specific domains. In particular, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models for the dataset.
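[NLP-73] AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks
[Quick Read]: This paper addresses the inability of existing open-source systems to provide end-to-end speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. The key to its solution is AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant that completes complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight automatic speech recognition (ASR), text-to-speech (TTS), and large language models (LLMs) in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email; its modular design lets new tools be integrated easily through natural language prompts and action classes.
Link: https://arxiv.org/abs/2506.23049
Authors: Leander Melroy Maben, Gayathri Ganesh Lakshmy, Srijith Radhakrishnan, Siddhant Arora, Shinji Watanabe
Affiliations: Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: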
Abstract:Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA, outperforming all open-weight systems and nearing GPT-4o, and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.
[NLP-74] SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
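[Quick Read]: This paper addresses the shortcomings of existing Theory of Mind (ToM) benchmarks for dynamic, real-world social interactions: most evaluate only static, text-based scenarios, leaving a significant gap from real interactions. The key to its solution is the SoMi-ToM benchmark, built on rich multimodal interaction data generated by the interaction environment SoMi. It evaluates multi-perspective ToM in embodied, multi-agent, complex social interactions and, through first-person and third-person evaluation, examines a model's ToM capabilities from both subjective immediate experience and objective global observation.
Link: https://arxiv.org/abs/2506.23046
Authors: Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
Affiliations: Tsinghua University; Carnegie Mellon University; Johns Hopkins University; University of California Irvine; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 23 pages, 6 figures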
Abstract:Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model’s ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
[NLP-75] MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
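[Quick Read]: This paper addresses the weakness of multimodal reasoning, particularly step-by-step reasoning in complex multimodal environments. Existing benchmarks focus on text-only reasoning or on multimodal questions answerable by directly retrieving information from a non-text modality, and they do not evaluate truly complex reasoning. To fill this gap, the authors propose MARBLE, a challenging multimodal reasoning benchmark comprising two tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. The key to the solution is designing tasks that rigorously test the step-by-step reasoning of multimodal language models (MLLMs), thereby exposing the limits of current models in complex reasoning and perception.
Link: https://arxiv.org/abs/2506.22992
Authors: Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor
Affiliations: EPFL; ETH Zurich
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: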
Abstract:The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
[NLP-76] A Systematic Study of Compositional Syntactic Transformer Language Models
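[Quick Read]: This paper addresses the lack of explicit syntactic information in standard Transformer models when processing linguistic structure, introducing syntactic biases to strengthen the model. The key to its solution is compositional syntactic language models (SLMs) based on constituency parse trees, which explicitly compose constituent representations bottom-up to improve performance on syntactic generalization, language modeling, and downstream tasks.
Link: https://arxiv.org/abs/2506.22978
Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Affiliations: ShanghaiTech University; Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: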
Abstract:Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at this https URL.
[NLP-77] On the Generalizability of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals”
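[Quick Read]: This paper studies the competition of mechanisms in language models when handling factual versus counterfactual information, centering on how attention heads divide labor and which dominate across information types. The key to its approach is attention head ablation, used to analyze the localization of factual and counterfactual information, the dominance of attention blocks, and the degree of attention head specialization in handling competing information. The study further examines how model scale, prompt structure, and domain affect mechanism competition, in order to test the applicability and validity of the original conclusions.
Link: https://arxiv.org/abs/2506.22977
Authors: Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Affiliations: University of Amsterdam
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 22 pages, 25 figures. For an interactive dashboard with all figures, see this https URL. For the accompanying code, see this https URL. To be published in proceedings of the 2025 Machine Learning Reproducibility Challenge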
Abstract:We present a reproduction study of “Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals” (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors’ claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
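[NLP-78] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
[Quick Read]: This paper addresses the underexplored question of how aware large language models (LLMs) in multi-agent and human-AI systems are of their own context and of the identity and characteristics of their conversational partners, focusing on the overlooked ability to identify and adapt to a dialogue partner. The key to its solution is to formalize and systematically evaluate the concept of interlocutor awareness across three dimensions (reasoning patterns, linguistic style, and alignment preferences), showing that LLMs reliably identify same-family peers and certain prominent model families such as GPT and Claude, and exploring both its practical value in multi-LLM collaboration and its potential safety risks.
Link: https://arxiv.org/abs/2506.22957
Authors: Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: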
Abstract:As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness, which refers to an LLM’s ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions (reasoning patterns, linguistic style, and alignment preferences) and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at this https URL.
[NLP-79] MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering
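[Quick Read]: This paper addresses the problem of vision-language models (VLMs) generating factually incorrect answers in medical visual question answering (MedVQA). Existing methods supplement external information through retrieval-augmented generation but risk retrieving irrelevant context, which degrades VLM reasoning. The key to the proposed solution, MOTOR, is a multimodal retrieval and re-ranking approach that uses grounded captions and optimal transport, combining textual and visual information to capture the underlying relationships between the query and the retrieved context and thus identify more clinically relevant context to augment the VLM input.
Link: https://arxiv.org/abs/2506.22900
Authors: Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: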
Abstract:Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at this https URL.
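To make the optimal-transport idea concrete, here is a minimal sketch of re-ranking candidates by entropic optimal-transport cost using Sinkhorn iterations. It is not MOTOR's actual formulation, which the abstract does not spell out: the cost matrices, the uniform marginals, and the names `sinkhorn_cost` and `rerank` are all assumptions for illustration.

```python
import math

def sinkhorn_cost(C, reg=0.1, n_iter=200):
    """Entropic OT cost between two uniformly weighted sets, given cost matrix C.

    C[i][j] is the (assumed) dissimilarity between query element i and
    candidate element j, e.g. 1 - cosine similarity of their embeddings.
    """
    n, m = len(C), len(C[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m            # uniform marginals (assumption)
    K = [[math.exp(-c / reg) for c in row] for row in C]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):                          # Sinkhorn scaling updates
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is P[i][j] = u[i] * K[i][j] * v[j]; cost is <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))

def rerank(cost_matrices):
    """Order candidate names by ascending OT cost (lower cost = better match)."""
    return sorted(cost_matrices, key=lambda name: sinkhorn_cost(cost_matrices[name]))

# Hypothetical candidates: one whose elements align with the query, one that does not.
candidates = {
    "off_topic": [[1.0, 1.0], [1.0, 1.0]],
    "relevant": [[0.0, 1.0], [1.0, 0.0]],
}
print(rerank(candidates))
```

In a multimodal setting the rows and columns would mix textual and visual elements, so that a candidate is rewarded only when its pieces can be matched to the query's pieces at low total cost.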
[NLP-80] Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval ICMR2025
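[Quick Read]: This paper addresses the lack of interpretability in text-to-image retrieval (TIR), which relies on whole-image captions, and the high computational cost of referring expression segmentation (RES) over large image collections. The key to its solution is a new task, Mask-aware TIR (MaTIR), which unifies TIR and RES to require both efficient image search and accurate object segmentation. The core method is a two-stage framework: the first stage uses SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings for effective online retrieval; the second stage uses a multimodal large language model (MLLM) for reranking and object grounding, improving retrieval accuracy and segmentation quality.
Link: https://arxiv.org/abs/2506.22864
Authors: Li-Cheng Shen, Jih-Kang Hsieh, Wei-Hua Li, Chu-Song Chen
Affiliations: National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICMR 2025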
Abstract:Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). First, we leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling effective and scalable online retrieval. Second, an MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on the COCO and D^3 datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.
[NLP-81] Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions
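[Quick Read]: This paper addresses the weakness of automatic speech recognition (ASR) systems in recognizing and formatting named entities and numerical data, which raises word error rate (WER) and impairs semantic understanding in critical domains such as legal, financial, and medical applications. The key to its solution is to extend the semantic context of ASR models by adding overlapping context windows during training: 5 seconds of overlap are added on each side of every 30-second chunk, forming a 40-second "effective semantic window" that improves entity recognition and formatting, while entities spanning chunk boundaries are reassigned entirely to the right-hand chunk to ensure proper formatting.
Link: https://arxiv.org/abs/2506.22858
Authors: Duygu Altinok
Affiliations: Independent Researcher, Germany
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This is the accepted version of an article accepted to the TSD 2025 conference, published in Springer Lecture Notes in Artificial Intelligence (LNAI). The final authenticated version is available online at SpringerLink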
Abstract:Automatic Speech Recognition (ASR) systems, such as Whisper, achieve high transcription accuracy but struggle with named entities and numerical data, especially when proper formatting is required. These issues increase word error rate (WER) and impair semantic understanding in critical domains like legal, financial, and medical applications. We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second “effective semantic window,” improving entity recognition and formatting while focusing predictions on the central 30 seconds. To address entities spanning chunk boundaries, we reassign such entities entirely to the right-hand chunk, ensuring proper formatting. Additionally, enriched training data with embedded entity labels enables the model to learn both recognition and type-specific formatting. Evaluated on the Spoken Wikipedia dataset, our method improves performance across semantic tasks, including named entity recognition (NER) and entity formatting. These results highlight the effectiveness of context-aware training in addressing ASR limitations for long-form transcription and complex entity recognition tasks.
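The windowing arithmetic in this abstract (30-second chunks, 5 seconds of context on each side, boundary-spanning entities moved to the right-hand chunk) can be sketched as below. This is only an illustration of the arithmetic under stated assumptions: the function names are hypothetical, and clipping the window at the start and end of the recording is my choice, not something the abstract specifies.

```python
import math
from typing import List, Tuple

Span = Tuple[float, float]

def context_windows(total_s: float, chunk_s: float = 30.0,
                    overlap_s: float = 5.0) -> List[Tuple[Span, Span]]:
    """Return (window, core) spans in seconds for each chunk of a recording.

    `core` is the 30 s region the model is trained to predict; `window`
    adds up to `overlap_s` of context on each side (40 s for interior chunks).
    """
    spans = []
    start = 0.0
    while start < total_s:
        core = (start, min(start + chunk_s, total_s))
        # Clip the context at the recording boundaries (assumption).
        window = (max(0.0, core[0] - overlap_s), min(total_s, core[1] + overlap_s))
        spans.append((window, core))
        start += chunk_s
    return spans

def owning_chunk(entity_start_s: float, entity_end_s: float,
                 chunk_s: float = 30.0) -> int:
    """Index of the chunk that owns an entity spanning [start, end).

    An entity that crosses a chunk boundary is assigned entirely to the
    right-hand chunk, as the abstract describes.
    """
    first = int(entity_start_s // chunk_s)
    last = math.ceil(entity_end_s / chunk_s) - 1   # end is treated as exclusive
    return max(first, last)

for window, core in context_windows(90.0):
    print(f"core={core} window={window}")
```

For a 90-second recording the interior chunk gets the full 40-second window, while the first and last chunks get 35 seconds because there is no audio to borrow on one side.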
[NLP-82] DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round Multi-Party Dialogues ACL2025
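[Quick Read]: This paper addresses the failure of existing function-calling benchmarks to capture real-world complexity: they focus mainly on single-turn interactions and do not adequately reflect the multi-round dialogue and tool dependencies of practical applications. The key to its solution is the DICE-BENCH framework, which synthesizes dialogue datasets using a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas, making dialogues more natural and dispersing tool-related information to produce high-DICE-SCORE instances.
Link: https://arxiv.org/abs/2506.22853
Authors: Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh
Affiliations: Seoul National University; Korea University; AIGEN Sciences; Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, ACL 2025 Vienna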
Abstract:Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: this https URL.
[NLP-83] Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems
[Quick Read]: This paper addresses the tendency of large language models (LLMs) to make factual errors in knowledge-intensive scenarios. The key to its solution is knowledge augmented finetuning (KAFT): in systems based on retrieval augmented generation (RAG) and on agents, the LLMs are finetuned with domain-specific data together with external knowledge, improving their factual accuracy in the target domain.
Link: https://arxiv.org/abs/2506.22852
Authors: Yucheng Cai, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou
Affiliations: Tsinghua University; China Mobile Research Institute
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agents have emerged to improve the factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by prompting the LLMs with instructions, examples and the retrieved knowledge. However, LLMs may have difficulty using the retrieved knowledge effectively for response generation, because they are not well trained to do such generation for specific domains. To mitigate this problem, we propose to finetune the LLMs in the RAG-based and agent-based systems with domain-specific data, together with domain-specific external knowledge, which is called knowledge augmented finetuning (KAFT). We base our study on the MobileCS2 dataset, a real-life customer service dialog dataset that features intensive knowledge interactions, to systematically compare the prompting and KAFT techniques in the RAG-based and agent-based systems. Experimental results show that KAFT substantially surpasses prompting in both RAG and agent systems, particularly in terms of factual accuracy. To the best of our knowledge, this paper represents the first solid empirical work to investigate the KAFT idea.
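[NLP-84] Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization
[Quick Read]: This paper addresses the weakness of CTC (Connectionist Temporal Classification)-based automatic speech recognition (ASR) systems in modeling linguistic dependencies while preserving the efficiency of their non-autoregressive decoding. The key to its solution is an auxiliary loss framework called Language-Aware Intermediate Loss (LAIL), which attaches connector layers to intermediate encoder layers, maps their outputs to the embedding space of a large language model (LLM), and computes a causal language modeling loss during training, strengthening linguistic modeling while keeping the computational efficiency of CTC decoding.
Link: https://arxiv.org/abs/2506.22846
Authors: Duygu Altinok
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This is the accepted version of an article accepted to the TSD 2025 conference, published in Springer Lecture Notes in Artificial Intelligence (LNAI). The final authenticated version is available online at SpringerLink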
Abstract:End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.
[NLP-85] Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models
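[Quick Read]: This paper addresses the high cost of annotating fine-grained labels and training domain-specific models when applying large language models (LLMs) to information extraction (IE) tasks such as named entity recognition (NER). Existing methods typically train one unified model across multiple domains, which limits adaptability and scalability. The key to the proposed SaM framework is to dynamically select and merge expert models at inference time: domain-specific experts are chosen by domain similarity to the target domain and by performance on sampled instances, then merged into task-specific models for the target domain, improving cross-domain generalization without extra training while remaining scalable.
Link: https://arxiv.org/abs/2506.22813
Authors: Zhuojun Ding, Wei Wei, Chenghao Fan
Affiliations: Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: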
Abstract:Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. Extensive experiments on multiple benchmarks demonstrate our framework’s effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.
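The select-then-merge step can be illustrated with a small sketch. The abstract does not say how similarity is measured or how experts are merged, so cosine similarity over domain embeddings and uniform parameter averaging are stand-ins chosen here; the expert names, embeddings, and weights are made up, and only the first selection criterion (domain similarity) is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_experts(target_emb, experts, k=2):
    """Pick the k experts whose domain embedding is closest to the target domain."""
    ranked = sorted(experts,
                    key=lambda name: cosine(target_emb, experts[name]["domain_emb"]),
                    reverse=True)
    return ranked[:k]

def merge_weights(weight_dicts):
    """Merge experts by uniform parameter averaging (an assumed merging rule)."""
    n = len(weight_dicts)
    return {key: [sum(vals) / n for vals in zip(*(w[key] for w in weight_dicts))]
            for key in weight_dicts[0]}

# Hypothetical expert pool: each expert has a domain embedding and adapter weights.
experts = {
    "news":   {"domain_emb": [1.0, 0.0], "weights": {"w": [1.0, 2.0]}},
    "bio":    {"domain_emb": [0.0, 1.0], "weights": {"w": [3.0, 4.0]}},
    "social": {"domain_emb": [0.9, 0.1], "weights": {"w": [5.0, 6.0]}},
}
chosen = select_experts([1.0, 0.05], experts, k=2)
merged = merge_weights([experts[name]["weights"] for name in chosen])
print(chosen, merged)
```

Because selection and merging happen at inference time, adding or removing an expert only changes the contents of the pool; nothing is retrained.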
[NLP-86] BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters
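[Quick Read]: This paper addresses reliable uncertainty quantification for model outputs in downstream tasks, particularly in agentic decision workflows where general-purpose Transformer uncertainty methods fall short. The key to its solution is the BayesLoRA framework, which integrates MC-Dropout into Low-Rank Adapters (LoRA); because LoRA adapters exhibit amplified variance outside the fine-tuning distribution, the framework yields trustworthy confidence estimates for agentic decision-making.
Link: https://arxiv.org/abs/2506.22809
Authors: Cooper Doyle
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 3 figures, 1 table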
Abstract:We propose BayesLoRA, a task-specific uncertainty quantification framework that integrates MC-Dropout into Low-Rank Adapters (LoRA). Unlike general-purpose transformer uncertainty methods, BayesLoRA provides guardrails tailored to downstream workflows, enabling agents to introspect and modulate behavior under uncertainty. We demonstrate mathematically and empirically that LoRA adapters exhibit amplified variance outside fine-tuning distributions, yielding reliable confidence estimates for agentic decision-making.
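As a rough picture of MC-Dropout inside a LoRA adapter, the sketch below keeps the base weight `W` deterministic and applies dropout only to the rank-r activations of the adapter (`B · dropout(A · x)`), then reads the spread over repeated stochastic passes as an uncertainty signal. Where exactly BayesLoRA places the dropout is not stated in the abstract, so that placement, the toy matrices, and the function names are assumptions.

```python
import random
from statistics import fmean, pvariance

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_mc_dropout(x, W, A, B, p=0.1, n_samples=50, seed=0):
    """Mean and variance of W x + B dropout(A x) over stochastic forward passes.

    W: frozen base weight (d_out x d_in); A: (r x d_in) and B: (d_out x r)
    are the low-rank adapter factors; p is the dropout probability (p < 1).
    """
    rng = random.Random(seed)
    base = matvec(W, x)      # deterministic base-model path
    low = matvec(A, x)       # rank-r adapter activations
    samples = []
    for _ in range(n_samples):
        # Inverted dropout on the rank-r activations (assumed placement).
        h = [a / (1.0 - p) if rng.random() >= p else 0.0 for a in low]
        delta = matvec(B, h)
        samples.append([b + d for b, d in zip(base, delta)])
    columns = list(zip(*samples))
    return [fmean(c) for c in columns], [pvariance(c) for c in columns]

# Toy example with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
mean, var = lora_mc_dropout([1.0, 2.0], W, A, B, p=0.5, n_samples=200)
print(mean, var)
```

In this toy setup all of the variance comes from the adapter path, so an input on which the adapter is inactive (`A x = 0`) produces zero variance. The sketch shows only the sampling mechanics; it does not reproduce the paper's result about amplified variance outside the fine-tuning distribution.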
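[NLP-87] MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs
[Quick Read]: This paper addresses the insufficient study of the ethical safety of Medical Large Language Models (MedLLMs), especially their deficiencies on medical ethics tasks. The key to its solution is MedEthicsQA, a comprehensive benchmark of 5,623 multiple-choice questions and 5,351 open-ended questions that covers global medical ethical standards; multi-stage filtering and multi-faceted expert validation ensure the dataset's reliability, providing a basis for evaluating and improving the ethical alignment of MedLLMs.
Link: https://arxiv.org/abs/2506.22808
Authors: Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
Affiliations: Zhejiang University; Angelalign Technology Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages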
Abstract:While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces MedEthicsQA, a comprehensive benchmark comprising 5,623 multiple-choice questions and 5,351 open-ended questions for evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate (2.72%). Evaluation of state-of-the-art MedLLMs shows declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, registered under CC BY-NC 4.0 license, is available at this https URL.
[NLP-88] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models
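TL;DR: This paper addresses incorrect cache hits in conventional semantic caching systems for multi-turn dialogue: lacking context awareness, such systems cannot tell apart similar queries that occur in different conversational settings, which degrades cache effectiveness. The key to the solution is ContextCache, which uses a two-stage retrieval architecture: it first runs vector-based retrieval on the current query to find candidate matches, then integrates current and historical dialogue representations through self-attention for precise contextual matching.
Link: https://arxiv.org/abs/2506.22791
Authors: Jianxin Yan,Wangze Ni,Lei Chen,Xuemin Lin,Peng Cheng,Zhan Qin,Kui Ren
Affiliations: Zhejiang University; HKUST (GZ); Shanghai Jiaotong University; Tongji University
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: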
Abstract:Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
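The two-stage lookup described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions: embeddings are plain NumPy vectors, the learned self-attention module is replaced by a parameter-free self-attention pooling (`attend_pool`), and the thresholds and the cache layout are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def attend_pool(turns):
    """Parameter-free self-attention over turn embeddings, then mean-pool.

    Stand-in for a learned context encoder (assumption).
    """
    X = np.asarray(turns, dtype=float)
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ X).mean(axis=0)

def lookup(cache, query_vec, history_vecs, q_thresh=0.9, ctx_thresh=0.9):
    """cache: list of dicts with 'query', 'context', 'response' entries."""
    ctx = attend_pool(list(history_vecs) + [query_vec])
    # Stage 1: candidates whose cached query is close to the current query.
    candidates = [e for e in cache if cosine(e["query"], query_vec) >= q_thresh]
    # Stage 2: accept the best candidate only if its dialogue context matches too.
    best = max(candidates, key=lambda e: cosine(e["context"], ctx), default=None)
    if best is not None and cosine(best["context"], ctx) >= ctx_thresh:
        return best["response"]
    return None  # cache miss: the caller falls back to the LLM

q = np.array([1.0, 0.0, 0.0])
hist_a = [np.array([0.0, 1.0, 0.0])]
hist_b = [np.array([0.0, 0.0, 1.0])]
cache = [{"query": q, "context": attend_pool(hist_a + [q]), "response": "answer A"}]
print(lookup(cache, q, hist_a))  # same query, same history
print(lookup(cache, q, hist_b))  # same query, different history
```

In this toy example both lookups pass stage 1 because the query is identical, but only the first passes stage 2; the second is treated as a miss because its history differs.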
[NLP-89] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection INTERSPEECH2025
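TL;DR: This paper addresses the fact that existing deepfake (DF) attack datasets fail to deceive human perception, whereas real-world DF attacks significantly influence public discourse, so more realistic DF attack vectors are needed. The key to the solution is PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning and reduces human perception by up to 42% and benchmark accuracies by up to 94%. The work also releases an easy-to-use PF dataset and a bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions for efficient, precise detection.
Link: https://arxiv.org/abs/2506.22783
Authors: Oguzhan Baser,Ahmet Ege Tanriverdi,Sriram Vishwanath,Sandeep P. Chinchali
Affiliations: Department of Electrical and Computer Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 3 figures, Published at Proceedings of Interspeech 2025, for the dataset see this https URL, for the code see this https URL PhonemeFake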
Abstract:Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. This highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.
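[NLP-90] Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
TL;DR: This paper addresses reward hacking in language models trained with reinforcement learning (RL): models exploit unintended strategies to obtain high reward without revealing it in their chain-of-thought, which makes detection hard and poses risks for high-stakes applications. The key to the solution is verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues, making reward hacking easier to detect. Experiments show that VFT substantially increases how often models verbalize cue influence after RL and lowers the rate of undetected reward hacks.
Link: https://arxiv.org/abs/2506.22777
Authors: Miles Turpin,Andy Arditi,Marvin Li,Joe Benton,Julian Michael
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: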
Abstract:Language models trained with RL can engage in reward hacking–exploiting unintended strategies for high reward–without revealing this behavior in their chain-of-thought reasoning, making detection difficult and posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues–hints which point to incorrect answers (e.g., “a Stanford professor thinks the answer is A”). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to reward hack by exploiting cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model’s responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues–from 8% to 42% after VFT, and up to 94% after RL–while baselines remain low even after RL (10% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
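The headline metric in the abstract, the share of responses that are undetected reward hacks, can be illustrated with a small sketch. The `Response` record and its two boolean labels are hypothetical; how such labels are obtained (for example, by a judge model) is not covered here.

```python
from dataclasses import dataclass

@dataclass
class Response:
    exploits_cue: bool      # answer follows the cued (incorrect) option
    verbalizes_cue: bool    # chain-of-thought acknowledges the cue

def undetected_hack_rate(responses):
    """Share of all responses that exploit a cue without saying so."""
    if not responses:
        return 0.0
    undetected = sum(r.exploits_cue and not r.verbalizes_cue for r in responses)
    return undetected / len(responses)

# Toy data: one of four responses exploits the cue silently.
sample = [
    Response(exploits_cue=True, verbalizes_cue=True),
    Response(exploits_cue=True, verbalizes_cue=False),
    Response(exploits_cue=False, verbalizes_cue=False),
    Response(exploits_cue=False, verbalizes_cue=True),
]
print(undetected_hack_rate(sample))
```

The denominator is all responses, matching the abstract's phrasing that a percentage "of the model's responses" are undetected reward hacks.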
[NLP-91] Jan-nano Technical Report
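TL;DR: This paper addresses the fundamental tradeoff in language models between powerful capabilities and computational cost. The key to the solution is Jan-nano, a 4B-parameter language model that redefines efficiency through radical specialization: rather than aiming for broad knowledge coverage, it focuses on finding information quickly. The model is fine-tuned from Qwen3-4B with a novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), retaining strong performance at low resource cost.
Link: https://arxiv.org/abs/2506.22760
Authors: Alan Dao(Gia Tuan Dao),Dinh Bach Vu
Affiliations: Menlo Research
Categories: Computation and Language (cs.CL)
Comments: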
Abstract:Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn’t about scale, it’s about strategy.
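[NLP-92] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
TL;DR: This paper addresses the poor quality of multilingual generation by large language models (LLMs) for mid- to low-resource languages. The key contribution is revealing an implicit pipeline of task solving followed by translation: the model first solves the task in a largely target-language-agnostic way, then translates the answer concepts into the target language. The paper proposes the translation barrier hypothesis, which holds that failure of the translation stage is an important cause of low-quality final outputs, and tests it with logit lens on a word translation task across 108 language pairs. Translation failure accounts for a significant portion of overall failures, especially for low-resource target languages.
Link: https://arxiv.org/abs/2506.22724
Authors: Niyati Bafna,Tianjian Li,Kenton Murray,David R. Mortensen,David Yarowsky,Hale Sirin,Daniel Khashabi
Affiliations: Johns Hopkins University, Center for Language and Speech Processing; Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: 23 pages incl. appendix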
Abstract:Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving–translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model’s inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.
[NLP-93] BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute ICML2025
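TL;DR: This paper addresses the high cost of deploying large language models (LLMs) at scale. Prior query routing methods generate only one response from the selected model, and a single response from a small model often cannot beat a large model, so these methods over-rely on the expensive large model and miss potential savings. The key to the solution is the BEST-Route framework, which chooses both a model and the number of responses to sample from it based on query difficulty and quality thresholds; sampling multiple responses from a small model and selecting the best balances quality against cost. Experiments show cost reductions of up to 60% with less than 1% performance drop.
Link: https://arxiv.org/abs/2506.22716
Authors: Dujian Ding,Ankur Mallick,Shaokun Zhang,Chi Wang,Daniel Madrigal,Mirian Del Carmen Hipolito Garcia,Menglin Xia,Laks V.S. Lakshmanan,Qingyun Wu,Victor Rühle
Affiliations: Microsoft; The University of British Columbia; Pennsylvania State University; Google DeepMind; AG2AI, Inc.
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted to ICML 2025 (main conference)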
Abstract:Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired trade-off. Prior query routing approaches generate only one response from the selected model. Because a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model, they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
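As a rough illustration of routing over (model, number of samples) pairs, the sketch below picks the cheapest option whose predicted quality meets a threshold and then applies best-of-n selection. The option list, the costs, the `predicted_quality` table, and both function names are hypothetical; the abstract does not specify how BEST-Route predicts quality or scores candidates.

```python
def route(options, predicted_quality, threshold):
    """Pick the cheapest (model, n_samples, cost) predicted to meet the threshold.

    Falls back to the option with the highest predicted quality (assumption).
    """
    feasible = [o for o in options if predicted_quality[(o[0], o[1])] >= threshold]
    if feasible:
        return min(feasible, key=lambda o: o[2])
    return max(options, key=lambda o: predicted_quality[(o[0], o[1])])

def best_of_n(candidates, score):
    """Return the candidate response with the highest score."""
    return max(candidates, key=score)

# Hypothetical options: (model name, number of sampled responses, cost).
options = [("small", 1, 1.0), ("small", 4, 4.0), ("large", 1, 10.0)]

# Hypothetical per-query quality predictions for an easy and a hard query.
easy = {("small", 1): 0.90, ("small", 4): 0.95, ("large", 1): 0.97}
hard = {("small", 1): 0.40, ("small", 4): 0.80, ("large", 1): 0.90}

print(route(options, easy, threshold=0.85))
print(route(options, hard, threshold=0.75))
print(route(options, hard, threshold=0.85))
print(best_of_n(["ok", "better answer", "hm"], score=len))
```

With these made-up numbers, the easy query goes to a single small-model sample, the hard query at a moderate threshold goes to four small-model samples, and only the strict threshold escalates to the large model.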
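[NLP-94] Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report
TL;DR: This report addresses a critical knowledge gap about the relationship between AI language models and human cognitive processes in text comprehension and production. The key to its approach is interdisciplinary collaboration across cognitive psychology, linguistics, and AI to examine the processes underlying human text production and comprehension, and to explore how AI can both inform our understanding of these processes and augment human capabilities. It highlights the potential of large language models (LLMs) for language processing and their increasing alignment with human language processing behavior, while noting their limitations in fully replicating human-like language understanding and generation.
Link: https://arxiv.org/abs/2506.22698
Authors: Emily Dux Speltz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: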
Abstract:This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.
[NLP-95] Residual Matrix Transformers: Scaling the Size of the Residual Stream ICML2025
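TL;DR: This paper addresses the efficiency of storing and retrieving information in the residual stream of standard transformers, and the scaling limits that come from the residual stream's size being tied to compute and model size. The key to the solution is replacing the residual stream with an outer product memory matrix, yielding the Residual Matrix Transformer (RMT). This lets the residual stream be scaled independently of compute and model size, improving performance while reducing compute requirements.
Link: https://arxiv.org/abs/2506.22696
Authors: Brian Mak,Jeffrey Flanigan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to ICML 2025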
Abstract:The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at this https URL.
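For readers unfamiliar with outer product memories, the sketch below shows the classic store-and-retrieve mechanism the abstract refers to. It is a toy illustration only, not the RMT layer itself: the dimensions are arbitrary and the keys are assumed orthonormal so that retrieval is exact.

```python
import numpy as np

d_k, d_v = 4, 3                      # key and value sizes (arbitrary)
keys = np.eye(d_k)                   # orthonormal keys give exact retrieval
rng = np.random.default_rng(0)
values = rng.normal(size=(d_k, d_v))

# Store: accumulate one outer product v k^T per (key, value) pair.
M = np.zeros((d_v, d_k))
for k, v in zip(keys, values):
    M += np.outer(v, k)

# Retrieve: multiplying the memory by a key returns the associated value.
retrieved = M @ keys[2]
assert np.allclose(retrieved, values[2])
print(M.shape)
```

The memory is a `d_v x d_k` matrix, so its capacity can be changed through the key dimension without changing the size of the stored values.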
[NLP-96] VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs ICML2025
TL;DR: This paper addresses the excess inference overhead of drafter-based speculative decoding (SpD) when the target LLM has a very large vocabulary. The key to the solution is VocabTrim, a simple technique that reconstructs the drafter's language modeling head (LM head) to contain only a limited set of tokens, namely those most frequently sampled from the target model's vocabulary, which reduces drafting latency in memory-bound settings and raises memory-bound speed-up (MBSU).
Link: https://arxiv.org/abs/2506.22694
Authors: Raghavv Goel,Sudhanshu Agrawal,Mukul Gagrani,Junyoung Park,Yifan Zao,He Zhang,Tian Liu,Yiping Yang,Xin Yuan,Jiuyan Lu,Chris Lott,Mingu Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 4 figures, 5 tables, accepted at ICML 2025 workshop on Efficient Systems for Foundational Models
Abstract:In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, the target model, which accepts a subset as its valid generation. As speculative decoding is usually considered to require a one-to-one mapping between the vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead and improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in the memory-bound process, which is often the case on edge devices, resulting in higher memory-bound speed-up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
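The core operation, keeping only the LM-head rows of the most frequently sampled tokens and mapping drafted indices back to full-vocabulary ids, can be sketched as follows. The function names, the toy sizes, and the token counts are assumptions for illustration; the paper's actual token selection procedure may differ.

```python
import numpy as np

def trim_lm_head(lm_head, token_counts, k):
    """Keep the k LM-head rows for the most frequently sampled tokens."""
    kept_ids = np.argsort(-token_counts, kind="stable")[:k]
    return lm_head[kept_ids], kept_ids

def draft_token(hidden, trimmed_head, kept_ids):
    """Greedy draft over the trimmed vocabulary, mapped back to a full-vocab id."""
    logits = trimmed_head @ hidden
    return int(kept_ids[np.argmax(logits)])

# Toy drafter head: vocabulary of 6 tokens, hidden size 2.
lm_head = np.arange(12, dtype=float).reshape(6, 2)
token_counts = np.array([5, 0, 9, 1, 7, 0])   # hypothetical sampling frequencies

trimmed_head, kept_ids = trim_lm_head(lm_head, token_counts, k=3)
hidden = np.array([1.0, 0.0])
print(kept_ids.tolist(), draft_token(hidden, trimmed_head, kept_ids))
```

Here the trimmed head is half the size of the original, so the drafting matmul reads half as many weights. The drafted token (id 4) differs from the full-vocabulary argmax (id 5, which was trimmed away), a small-scale picture of why the abstract reports a slightly lower acceptance rate.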
[NLP-97] Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions INTERSPEECH2025
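TL;DR: This paper addresses the detection of subtle expressions of micro-behaviors in team conversations, using transcripts collected during simulated space missions. The key to the approach is applying large language models (LLMs) for classification and text generation, comparing encoder-only models (such as RoBERTa and DistilBERT) with decoder-only models (such as Llama-3.1). The instruction fine-tuned decoder-only model detected micro-behaviors better, notably on underrepresented classes where the encoder-only models struggled.
Link: https://arxiv.org/abs/2506.22679
Authors: Ankush Raut,Projna Paromita,Sydney Begerowski,Suzanne Bell,Theodora Chaspari
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 4 figures. Accepted to Interspeech 2025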
Abstract:We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions. Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs, to predict the micro-behavior associated with each conversational turn (i.e., dialogue). Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning. In contrast, the instruction fine-tuned version of Llama-3.1, a decoder-only LLM, demonstrated superior performance, with the best models achieving macro F1-scores of 44% for 3-way classification and 68% for binary classification. These results have implications for the development of speech technologies aimed at analyzing team communication dynamics and enhancing training interventions in high-stakes environments such as space missions, particularly in scenarios where text is the only accessible data.
[NLP-98] VERA: Variational Inference Framework for Jailbreaking Large Language Models
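TL;DR: This paper addresses the need for effective black-box jailbreak methods to identify vulnerabilities of state-of-the-art large language models (LLMs) in real-world settings. Most existing methods rely on genetic algorithms, which are limited by their initialization and by manually curated prompt pools, and they must optimize each prompt individually, so they cannot characterize model vulnerabilities comprehensively. The key to the solution is VERA, a variational inference framework for jailbreaking that casts black-box jailbreak prompting as a variational inference problem and trains a small attacker LLM to approximate the target LLM's posterior over adversarial prompts, producing diverse, fluent jailbreak prompts without re-optimization.
Link: https://arxiv.org/abs/2506.22666
Authors: Anamika Lochab,Lu Yan,Patrick Pynadath,Xiangyu Zhang,Ruqi Zhang
Affiliations: Purdue University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: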
Abstract:The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM’s posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.
[NLP-99] Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge SIGIR
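TL;DR: This paper addresses the performance of retrieval-augmented generation (RAG) systems on dynamic test sets, evaluated on the FineWeb-10BT corpus. The key to the solution is a hybrid retrieval approach that combines sparse (BM25) and dense (E5) retrieval and generates relevant, faithful answers with Falcon3-10B-Instruct. The study also examines neural re-ranking (RankLLaMA) and DSPy-optimized prompting: the former improves metrics at prohibitive computational cost, while the latter achieves higher semantic similarity but raises concerns about over-confidence and generalizability. The submitted hybrid system without re-ranking placed 4th in faithfulness and 11th in correctness.
Link: https://arxiv.org/abs/2506.22644
Authors: Chase Fensore,Kaustubh Dhole,Joyce C Ho,Eugene Agichtein
Affiliations: Emory University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 4 pages, 3 tables, 2 figures. Accepted at the SIGIR LiveRAG Workshop 2025 (Submission 2664)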
Abstract:We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
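The trade-off quoted in the abstract can be checked with a few lines of arithmetic. The snippet only reuses the reported numbers (MAP 0.523 to 0.797, 84s vs 1.74s per question); the variable names are ours.

```python
map_before, map_after = 0.523, 0.797
latency_rerank, latency_base = 84.0, 1.74   # seconds per question

# Relative MAP improvement from re-ranking.
relative_gain = (map_after - map_before) / map_before

# How many times slower each question becomes with re-ranking.
slowdown = latency_rerank / latency_base

print(f"relative MAP gain: {relative_gain:.1%}")
print(f"latency slowdown: {slowdown:.0f}x")
```

The relative gain matches the 52% stated in the abstract, and the latency ratio is roughly 48x, which puts the "prohibitive computational costs" remark in concrete terms.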
[NLP-100] Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks
TL;DR: This paper addresses potential misuse of generative AI text generation by developing a new method for detecting synthetic text, with the goal of ensuring the ethical application of Large Language Models (LLMs). The key to the solution is a novel watermarking approach that embeds markers in machine-generated text for algorithmic identification; its robustness is evaluated on paraphrased generated text, and experiments indicate higher robustness than an existing method.
Link: https://arxiv.org/abs/2506.22623
Authors: Badr Youbi Idrissi,Monica Millunzi,Amelia Sorrenti,Lorenzo Baraldi,Daryna Dementieva
Affiliations: University of Modena and Reggio Emilia; University of Catania; University of Pisa; Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the present-day scenario, Large Language Models (LLMs) are establishing their presence as powerful instruments permeating various sectors of society. While their utility offers valuable support to individuals, there are multiple concerns over potential misuse. Consequently, some academic endeavors have sought to introduce watermarking techniques, characterized by the inclusion of markers within machine-generated text, to facilitate algorithmic identification. This research project is focused on the development of a novel methodology for the detection of synthetic text, with the overarching goal of ensuring the ethical application of LLMs in AI-driven text generation. The investigation commences with replicating findings from a previous baseline study, thereby underscoring its susceptibility to variations in the underlying generation model. Subsequently, we propose an innovative watermarking approach and subject it to rigorous evaluation, employing paraphrased generated text to assess its robustness. Experimental results highlight the robustness of our proposal compared to the \cite{aarson} watermarking method.
[NLP-101] RExBench: Can coding agents autonomously implement AI research extensions?
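TL;DR: This paper addresses the limited ability of agents based on Large Language Models (LLMs) to carry out research extension tasks autonomously. It introduces RExBench, a benchmark of 12 realistic research experiment implementation tasks that assess whether agents can extend prior work to investigate hypotheses that have not previously been implemented. RExBench is robust to data contamination and supports automatic evaluation that executes agent outputs to check whether success criteria are met, providing a systematic tool for measuring LLM agents on research extension.
Link: https://arxiv.org/abs/2506.22598
Authors: Nicholas Edwards,Yukyung Lee,Yujun(Audrey)Mao,Yulu Qin,Sebastian Schuster,Najoung Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: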
Abstract:Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.
[NLP-102] MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages
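TL;DR: This paper addresses misinformation detection on poorly moderated platforms such as Telegram, focusing on message propagation and connectivity in the German-language context. The key to the solution is Misinfo-TeleGraph, the first German-language Telegram-based graph dataset, which contains over 5 million messages from public channels with metadata, channel relationships, and weak and strong labels obtained through semantic similarity (using M3-embeddings) and manual annotation. The study also evaluates graph neural networks (GNNs) that use message forwarding as network structure; GraphSAGE with LSTM aggregation significantly outperforms text-only models.
Link: https://arxiv.org/abs/2506.22529
Authors: Lu Kalkbrenner,Veronika Solopova,Steffen Zeiler,Robert Nickel,Dorothea Kolossa
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: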
Abstract:Connectivity and message propagation are central, yet often underutilized, sources of information in misinformation detection – especially on poorly moderated platforms such as Telegram, which has become a critical channel for misinformation dissemination, namely in the German electoral context. In this paper, we introduce Misinfo-TeleGraph, the first German-language Telegram-based graph dataset for misinformation detection. It includes over 5 million messages from public channels, enriched with metadata, channel relationships, and both weak and strong labels. These labels are derived via semantic similarity to fact-checks and news articles using M3-embeddings, as well as manual annotation. To establish reproducible baselines, we evaluate both text-only models and graph neural networks (GNNs) that incorporate message forwarding as a network structure. Our results show that GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in terms of Matthews Correlation Coefficient (MCC) and F1-score. We further evaluate the impact of subscribers, view counts, and automatically versus human-created labels on performance, and highlight both the potential and challenges of weak supervision in this domain. This work provides a reproducible benchmark and open dataset for future research on misinformation detection in German-language Telegram networks and other low-moderation social platforms.
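Since the abstract reports results in terms of the Matthews Correlation Coefficient (MCC), here is a short reference sketch of the standard binary MCC formula computed from confusion-matrix counts. The function name and the zero-denominator convention (returning 0.0) are our choices, and the example counts are made up.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

print(mcc(tp=50, tn=40, fp=10, fn=0))
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction) and uses all four cells of the confusion matrix, which makes it a reasonable choice for imbalanced label distributions.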
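[NLP-103] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation
TL;DR: This paper addresses the problems caused by weak retrievers in graph-based retrieval-augmented generation (RAG): spurious signals introduced by weak supervision, which is used because ground truth is lacking, and unorganized retrieved knowledge caused by the abstraction of graph data. The key to the solution is ReG, a refined graph-based RAG framework that incorporates large language model (LLM) feedback to remove spurious signals and improve supervision quality, and adds a structure-aware reorganization module that refactors retrieval results into logically coherent evidence chains.
Link: https://arxiv.org/abs/2506.22518
Authors: Deyu Zou,Yongqiang Chen,Mufei Li,Siqi Miao,Chenxi Liu,Bo Han,James Cheng,Pan Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: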
Abstract:Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.
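[NLP-104] Can “consciousness” be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
TL;DR: This paper asks whether Integrated Information Theory (IIT), in its latest versions IIT 3.0 and IIT 4.0, can reveal identifiable “consciousness” phenomena in sequences of large language model (LLM) representations. The key to the approach is using IIT's quantitative metrics, namely Φ^max (IIT 3.0), Φ (IIT 4.0), Conceptual Information (IIT 3.0), and Φ-structure (IIT 4.0), to analyze how differences in Theory of Mind (ToM) test performance appear in LLM representations, and comparing them with Span Representations that are independent of any consciousness estimate, in order to separate potential “consciousness” phenomena from inherent separations in the LLM representational space.
Link: https://arxiv.org/abs/2506.22516
Authors: Jingkai Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
Comments: Published as a journal paper at: this https URL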
Abstract:Integrated Information Theory (IIT) provides a quantitative framework for explaining the phenomenon of consciousness, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 – the latest iterations of this framework – to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., Φ^max (IIT 3.0), Φ (IIT 4.0), Conceptual Information (IIT 3.0), and Φ-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential “consciousness” phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed “consciousness” phenomena but exhibit intriguing patterns under spatio-permutational analyses. The Appendix and code are available as Supplementary Materials at: this https URL.
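[NLP-105] Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning
[Quick read]: This paper addresses the problem that existing pre-training methods for graph data ignore the large differences in semantics and properties across domains and therefore fail to integrate multi-domain knowledge effectively. The key to its solution is a multi-domain pre-training and cross-domain transfer framework: a contrastive learning strategy is designed to recognize and capture domain differences, and domain tokens are introduced to encode domain-level global information; in downstream tasks, a domain attention mechanism enables fine-grained transfer of domain knowledge, improving the model's generalization and performance.
Link: https://arxiv.org/abs/2506.22510
Authors: Zihao Zhao, Xinlong Zhai, Jinyu Yang, Chuan Shi
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures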
Abstract:Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Considering that graph data, especially graphs without textual features, is ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed in the single-domain scenario, which regard contrastive samples from different domains as equivalent. From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely this http URL. In the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method significantly outperforms the state of the art, with a maximum improvement of 19.33% on accuracy and 19.13% on Macro-F1 score.
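[NLP-106] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text NEURIPS2025
[Quick read]: This paper addresses the leakage of sensitive personal attributes in user-generated content, aiming to protect individual privacy through effective text anonymization. Existing methods either rely on rigid replacements that damage utility or use cloud-based large language models (LLMs) that are costly and pose privacy risks. The key to the solution is AgentStealth, a self-reinforcing LLM anonymization framework with three core parts: an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control; supervised fine-tuning of small language models (SLMs) on high-quality data collected from the workflow; and online reinforcement learning in which the model uses its internal adversarial feedback to iteratively improve anonymization performance.
Link: https://arxiv.org/abs/2506.22508
Authors: Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to NeurIPS 2025. Under review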
Abstract:In today’s digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at this https URL.
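[NLP-107] Mitigating Gambling-Like Risk-Taking Behaviors in Large Language Models: A Behavioral Economics Approach to AI Safety
[Quick read]: This paper addresses behavior patterns in large language models (LLMs) that resemble gambling psychology, including overconfidence bias, loss-chasing tendencies, and probability misjudgment. The key to its solution is the Risk-Aware Response Generation (RARG) framework, which mitigates these behavioral biases through risk-calibrated training, loss-aversion mechanisms, and uncertainty-aware decision making, and introduces evaluation paradigms based on classic gambling psychology experiments, such as an adapted Iowa Gambling Task and probability learning assessments.
Link: https://arxiv.org/abs/2506.22496
Authors: Y. Du
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages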
Abstract:Large Language Models (LLMs) exhibit systematic risk-taking behaviors analogous to those observed in gambling psychology, including overconfidence bias, loss-chasing tendencies, and probability misjudgment. Drawing from behavioral economics and prospect theory, we identify and formalize these “gambling-like” patterns where models sacrifice accuracy for high-reward outputs, exhibit escalating risk-taking after errors, and systematically miscalibrate uncertainty. We propose the Risk-Aware Response Generation (RARG) framework, incorporating insights from gambling research to address these behavioral biases through risk-calibrated training, loss-aversion mechanisms, and uncertainty-aware decision making. Our approach introduces novel evaluation paradigms based on established gambling psychology experiments, including AI adaptations of the Iowa Gambling Task and probability learning assessments. Experimental results demonstrate measurable reductions in gambling-like behaviors: 18.7% decrease in overconfidence bias, 24.3% reduction in loss-chasing tendencies, and improved risk calibration across diverse scenarios. This work establishes the first systematic framework for understanding and mitigating gambling psychology patterns in AI systems.
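[NLP-108] A Detailed Factor Analysis for the Political Compass Test: Navigating Ideologies of Large Language Models
[Quick read]: This paper addresses the quantification of political leanings in generative AI, focusing on the validity of questionnaires such as the Political Compass Test (PCT) for assessing the political stance of large language models (LLMs). The study finds that varying standard generation parameters does not significantly affect a model's PCT scores, whereas external factors such as prompt variations and fine-tuning affect the scores individually and in combination. In addition, fine-tuning models on text datasets with more political content does not differentially affect PCT scores. The key contribution is to expose how external variables influence PCT results and to call for a deeper investigation into the validity of the PCT and the mechanism by which political leanings are encoded in LLMs.
Link: https://arxiv.org/abs/2506.22493
Authors: Sadia Kamal, Lalu Prasad Yadav Prakash, S M Rafiuddin, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen, Sagnik Ray Choudhury
Affiliations: Oklahoma State University; University of North Texas
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: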
Abstract:Political Compass Test (PCT) or similar questionnaires have been used to quantify LLM’s political leanings. Building on a recent line of work that examines the validity of PCT tests, we demonstrate that variation in standard generation parameters does not significantly impact the models’ PCT scores. However, external factors such as prompt variations and fine-tuning individually and in combination affect the same. Finally, we demonstrate that when models are fine-tuned on text datasets with higher political content than others, the PCT scores are not differentially affected. This calls for a thorough investigation into the validity of PCT and similar tests, as well as the mechanism by which political leanings are encoded in LLMs.
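[NLP-109] PromptAug: Fine-grained Conflict Classification Using Data Augmentation
[Quick read]: This paper addresses the scarcity of high-quality labelled data for detecting harmful behaviour on social media, especially for complex tasks such as identifying conflict behaviours, where data is expensive and difficult to obtain. The key to its solution is PromptAug, a data augmentation method based on large language models (LLMs) that generates additional training data to alleviate the shortage and achieves significant improvements in accuracy and F1-score on conflict and emotion datasets.
Link: https://arxiv.org/abs/2506.22491
Authors: Oliver Warke, Joemon M. Jose, Faegheh Hasibi, Jan Breitsohl
Affiliations: University of Glasgow; Radboud University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: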
Abstract:Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.
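[NLP-110] Hallucination Detection with Small Language Models
[Quick read]: This paper addresses hallucinations in responses generated by large language models (LLMs), which reduce their reliability in practical applications and are hard to detect when no ground truth is available. The key to its solution is a framework that integrates multiple small language models: each response is broken into individual sentences, and the probability that each model generates a “Yes” token, given the question, the response, and the relevant context, is used to detect hallucinations.
Link: https://arxiv.org/abs/2506.22486
Authors: Ming Cheung
Affiliations: dBeta Labs; The Lane Crawford Joyce Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: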
Abstract:Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answer scenarios. This paper proposes a framework that integrates multiple small language models to verify responses generated by LLMs using the retrieved context from a vectorized database. By breaking down the responses into individual sentences and utilizing the probability of generating “Yes” tokens from the outputs of multiple models for a given set of questions, responses, and relevant context, hallucinations can be detected. The proposed framework is validated through experiments with real datasets comprising over 100 sets of questions, answers, and contexts, including responses with fully and partially correct sentences. The results demonstrate a 10% improvement in F1 scores for detecting correct responses compared to hallucinations, indicating that multiple small language models can be effectively employed for answer verification, providing a scalable and efficient solution for both academic and practical applications.
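As a rough illustration of the sentence-level “Yes”-probability idea described in this abstract, the Python sketch below splits a response into sentences, averages a per-sentence probability over several scorers, and flags the response if any sentence falls below a threshold. Everything here is an assumption made for illustration rather than the paper's implementation: the `YesScorer` signature, the mean-then-minimum aggregation rule, the 0.5 threshold, and `overlap_scorer`, a toy word-overlap function standing in for a real small language model.

```python
import re
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical scorer signature: (question, sentence, context) -> P("Yes").
# In practice each scorer would wrap one small language model.
YesScorer = Callable[[str, str, str], float]


def split_sentences(response: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def sentence_scores(question: str, response: str, context: str,
                    scorers: Sequence[YesScorer]) -> List[float]:
    """Average P("Yes") across all scorers, one value per sentence."""
    return [
        mean(scorer(question, sentence, context) for scorer in scorers)
        for sentence in split_sentences(response)
    ]


def is_hallucinated(question: str, response: str, context: str,
                    scorers: Sequence[YesScorer],
                    threshold: float = 0.5) -> bool:
    """Assumed rule: flag the response if any sentence scores below threshold."""
    scores = sentence_scores(question, response, context, scorers)
    return not scores or min(scores) < threshold


def overlap_scorer(question: str, sentence: str, context: str) -> float:
    """Toy stand-in for a model: share of sentence words found in the context."""
    words = re.findall(r"\w+", sentence.lower())
    known = set(re.findall(r"\w+", context.lower()))
    return sum(w in known for w in words) / len(words) if words else 0.0
```

With a context of "The store opens at nine and closes at five.", a response consisting only of "The store opens at nine." passes, while appending "Parking is free on weekends." causes the response to be flagged, because the second sentence has no support in the context.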
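[NLP-111] AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness, and Clarity for Enterprise Documents
[Quick read]: This paper addresses the automated review of highly structured enterprise business documents. Whereas earlier approaches are typically limited to unstructured text or narrow compliance checks, it proposes a modular multi-agent system that uses modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to evaluate documents section by section for accuracy, consistency, completeness, and clarity. The key to the solution is deploying specialized agents, each responsible for a specific review criterion such as template compliance or factual correctness, that run in parallel or in sequence as required; evaluation outputs are enforced to a standardized, machine-readable schema to support downstream analytics and auditing, and continuous monitoring with a feedback loop to human reviewers enables iterative improvement and bias mitigation.
Link: https://arxiv.org/abs/2506.22485
Authors: Sudip Dasgupta, Himanshu Shankar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 system diagrams, 1 table, no prior conference publication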
Abstract:This study presents a modular, multi-agent system for the automated review of highly structured enterprise business documents using AI agents. Unlike prior solutions focused on unstructured texts or limited compliance checks, this framework leverages modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section-by-section evaluation of documents for accuracy, consistency, completeness, and clarity. Specialized agents, each responsible for discrete review criteria such as template compliance or factual correctness, operate in parallel or sequence as required. Evaluation outputs are enforced to a standardized, machine-readable schema, supporting downstream analytics and auditability. Continuous monitoring and a feedback loop with human reviewers allow for iterative system improvement and bias mitigation. Quantitative evaluation demonstrates that the AI Agent-as-Judge system approaches or exceeds human performance in key areas: achieving 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document, with a 95% agreement rate between AI and expert human judgment. While promising for a wide range of industries, the study also discusses current limitations, including the need for human oversight in highly specialized domains and the operational cost of large-scale LLM usage. The proposed system serves as a flexible, auditable, and scalable foundation for AI-driven document quality assurance in the enterprise context.
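[NLP-112] Theories of “Sexuality” in Natural Language Processing Bias Research
[Quick read]: This paper addresses the lack of systematic analysis, in current Natural Language Processing (NLP) research, of how queer sexualities are encoded and (mis)represented in models and in practice. Its key contribution is a survey and analysis of 55 articles that quantify sexuality-based NLP bias, which shows that sexuality is often not clearly defined, that the literature relies on assumed or normative conceptions of sexual/romantic practices and identities, and that methods for extracting biased outputs conflate gender and sexual identities, producing monolithic conceptions of queerness; it closes with recommendations for closer engagement with queer communities and interdisciplinary literature to improve such bias analyses.
Link: https://arxiv.org/abs/2506.22481
Authors: Jacob Hobbs
Affiliations: University of Virginia
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 17 pages, 9 tables, undergraduate senior thesis, submitted to The Spectra: The Virginia Engineering and Science Research Journal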
Abstract:In recent years, significant advancements in the field of Natural Language Processing (NLP) have positioned commercialized language models as wide-reaching, highly useful tools. In tandem, there has been an explosion of multidisciplinary research examining how NLP tasks reflect, perpetuate, and amplify social biases such as gender and racial bias. A significant gap in this scholarship is a detailed analysis of how queer sexualities are encoded and (mis)represented by both NLP systems and practitioners. Following previous work in the field of AI fairness, we document how sexuality is defined and operationalized via a survey and analysis of 55 articles that quantify sexuality-based NLP bias. We find that sexuality is not clearly defined in a majority of the literature surveyed, indicating a reliance on assumed or normative conceptions of sexual/romantic practices and identities. Further, we find that methods for extracting biased outputs from NLP technologies often conflate gender and sexual identities, leading to monolithic conceptions of queerness and thus improper quantifications of bias. With the goal of improving sexuality-based NLP bias analyses, we conclude with recommendations that encourage more thorough engagement with both queer communities and interdisciplinary literature.
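[NLP-113] Computational Analysis of Climate Policy
[Quick read]: This paper addresses how to assess the relationship between local government climate policy responses and Climate Emergency Declarations (CEDs), and how to use large language models (LLMs) for large-scale policy analysis. The key to its solution is building and validating a system named PALLM (Policy Analysis with a Large Language Model), based on GPT-4, which applies a conceptual framework for climate emergency response plans to climate policy documents, enabling efficient and accurate assessment of policy texts.
Link: https://arxiv.org/abs/2506.22449
Authors: Carolyn Hicks
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Master’s thesis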
Abstract:This thesis explores the impact of the Climate Emergency movement on local government climate policy, using computational methods. The Climate Emergency movement sought to accelerate climate action at local government level through the mechanism of Climate Emergency Declarations (CEDs), resulting in a series of commitments from councils to treat climate change as an emergency. With the aim of assessing the potential of current large language models to answer complex policy questions, I first built and configured a system named PALLM (Policy Analysis with a Large Language Model), using the OpenAI model GPT-4. This system is designed to apply a conceptual framework for climate emergency response plans to a dataset of climate policy documents. I validated the performance of this system with the help of local government policymakers, by generating analyses of the climate policies of 11 local governments in Victoria and assessing the policymakers’ level of agreement with PALLM’s responses. Having established that PALLM’s performance is satisfactory, I used it to conduct a large-scale analysis of current policy documents from local governments in the state of Victoria, Australia. This thesis presents the methodology and results of this analysis, comparing the results for councils which have passed a CED to those which did not. This study finds that GPT-4 is capable of high-level policy analysis, with limitations including a lack of reliable attribution, and can also enable more nuanced analysis by researchers. Its use in this research shows that councils which have passed a CED are more likely to have a recent and climate-specific policy, and show more attention to urgency, prioritisation, and equity and social justice, than councils which have not. It concludes that the ability to assess policy documents at scale opens up exciting new opportunities for policy researchers.
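[NLP-114] Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans ACL2025
[Quick read]: This paper addresses how to evaluate the extent to which current large language models (LLMs) align with human ratings on non-task linguistic features of words, such as arousal, concreteness, and sensory associations. Existing research focuses mainly on how well LLMs perform tasks and overlooks these subtler features. The key to the solution is to use existing psycholinguistic datasets (the Glasgow and Lancaster norms) to evaluate LLM alignment across multiple word features, revealing potential limitations in how LLMs model human sensory associations.
Link: https://arxiv.org/abs/2506.22439
Authors: Javier Conde, Miguel González, María Grandury, Gonzalo Martínez, Pedro Reviriego, Mar Brysbaert
Affiliations: Universidad Politécnica de Madrid; Ghent University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for the GEM2 workshop at ACL 2025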
Abstract:The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified. For example, arousal, concreteness, or gender associated with a given word, as well as the extent to which we experience words with senses and relate them to a specific sense. Those features have been studied for many years by psycholinguistics, conducting large-scale experiments with humans to produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is generally better in the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of embodied cognition present in humans and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.
Computer Vision
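[CV-0] How to Design and Train Your Implicit Neural Representation for Video Compression
[Quick read]: This paper addresses the slow encoding speed of implicit neural representations (INRs) for video compression, which limits their practicality. The key to its solution is introducing hyper-networks that predict INR weights, decoupling training from encoding and enabling real-time encoding. The study also proposes masking the predicted INR weights during training to support variable-quality compression, further improving compression performance.
Link: https://arxiv.org/abs/2506.24127
Authors: Matthew Gwilliam, Roy Zhang, Namitha Padmanabhan, Hongyang Du, Abhinav Shrivastava
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 41 figures, 5 tables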
Abstract:Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library to allow us to disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs, but also impacts on training time. We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, to disentangle training from encoding to allow for real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds. Our project website is available at this https URL and our code is available at this https URL.
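[CV-1] FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation
[Quick read]: This paper addresses vanishing data information and low computational efficiency in dataset distillation, particularly the goal of reducing training time and GPU memory consumption while preserving model performance. The key to its solution is the concept of Data Residual Matching, which uses data-level skip connections to facilitate data generation and retain the core local information in the raw data modalities, combined with optimization-level refinements that improve computational efficiency, so that resource consumption drops substantially while accuracy stays high.
Link: https://arxiv.org/abs/2506.24125
Authors: Jiacheng Cui, Xinyue Bi, Yaxin Luo, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
Affiliations: VILA Lab, MBZUAI; University of Ottawa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code at: this https URL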
Abstract:Residual connection has been extensively studied and widely applied at the model architecture level. However, its potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating optimization-level refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM) establishes a new state-of-the-art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing RDED by +5.7% and outperforming state-of-the-art multi-model approaches, EDC and CV-DD, by +1.4% and +4.0%. Code is available at: this https URL.
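[CV-2] Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
[Quick read]: This paper addresses the difficulty traditional time series forecasting methods have in capturing high-level semantic patterns, which usually stems from the dense, unstructured nature of the unimodal numerical inputs they rely on. The key to its solution is a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives and aligns these views in a shared semantic space through contrastive learning, capturing richer and complementary representations. In addition, a variate selection module uses the aligned representations to identify the most informative variables for multivariate forecasting, improving forecasting performance.
Link: https://arxiv.org/abs/2506.24124
Authors: Dong Sixun, Fan Wei, Teresa Wu, Fu Yanjie
Affiliations: Arizona State University; University of Oxford
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL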
Abstract:Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: this https URL.
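To make “aligning two views in a shared space via contrastive learning” concrete, here is a minimal pure-Python sketch of a symmetric InfoNCE-style loss between paired embeddings. The abstract does not state which contrastive objective the authors use, so the InfoNCE form, the function names, and the temperature value are all assumptions chosen for illustration. Row i of each view is treated as a positive pair and every other row as a negative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(view_a: Sequence[Vector], view_b: Sequence[Vector],
                       temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss; view_a[i] and view_b[i] form a positive pair."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    # Cosine similarity between every a_i and b_j, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
              for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(transposed))
```

The loss is small when matching rows of the two views point in the same direction and large when the pairing is shuffled, which is the pressure that pulls the two views of the same series together in the shared space.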
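[CV-3] Calligrapher: Freestyle Text Image Customization
[Quick read]: This paper addresses precise style control and data dependency in typographic customization for digital calligraphy and design applications. Its solution is a diffusion-based framework with three core technical contributions: first, a self-distillation mechanism that uses the pre-trained text-to-image generative model together with a large language model to automatically construct a style-centric typography benchmark; second, a localized style injection framework with a trainable style encoder, combining Qformer and linear layers, that extracts robust style features from reference images; and finally, an in-context generation mechanism that embeds reference images directly into the denoising process, improving the alignment of target styles.
Link: https://arxiv.org/abs/2506.24123
Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Affiliations: Hong Kong University of Science and Technology; Ant Group; InstantX
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL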
Abstract:We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher’s accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.
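[CV-4] TextMesh4D: High-Quality Text-to-4D Mesh Generation
[Quick read]: This paper addresses the generation of dynamic 3D content from text prompts (text-to-4D), which remains challenging under diffusion guidance. The key to its solution is the TextMesh4D framework, which uses per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. A flexibility-rigidity regularization term further stabilizes Jacobian optimization under video diffusion priors, ensuring robust geometric performance.
Link: https://arxiv.org/abs/2506.24121
Authors: Sisi Dai, Xinxin Su, Boyan Wan, Ruizhen Hu, Kai Xu
Affiliations: National University of Defense Technology; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: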
Abstract:Recent advancements in diffusion generative models have significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with a low GPU memory overhead, requiring only a single 24GB GPU, and offers a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.
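[CV-5] Epona: Autoregressive Diffusion World Model for Autonomous Driving ICCV2025
[Quick read]: This paper addresses the problems that video diffusion models face in autonomous driving world modeling: flexible-length, long-horizon prediction and integration of trajectory planning. Existing methods model a global joint distribution over fixed-length frame sequences rather than constructing localized distributions sequentially at each timestep, which limits performance. The key to the solution is Epona, which achieves localized spatiotemporal distribution modeling through two innovations: 1) decoupled spatiotemporal factorization, which separates temporal dynamics modeling from fine-grained future world generation; 2) modular trajectory and video prediction, which integrates motion planning with visual modeling end to end. The architecture supports high-resolution, long-duration generation and introduces a chain-of-forward training strategy to mitigate error accumulation in autoregressive loops.
Link: https://arxiv.org/abs/2506.24113
Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Affiliations: Horizon Robotics; Tsinghua University; Peking University; Nanjing University; The Hong Kong University of Science and Technology; Nanyang Technological University; Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV2025, Project Page: this https URL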
Abstract:Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at this https URL.
[CV-6] Navigating with Annealing Guidance Scale in Diffusion Space
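[Quick read]: This paper addresses the strong dependence of image quality and prompt alignment on the choice of guidance scale in Classifier-Free Guidance (CFG) for text-to-image generation. The key to the solution is an annealing guidance scheduler that dynamically adjusts the guidance scale over time based on the conditional noisy signal, following a learned scheduling policy. This improves convergence during sampling and raises both image quality and consistency with the text prompt, while requiring no additional activations or memory consumption.
Link: https://arxiv.org/abs/2506.24108
Authors: Shai Yehezkel,Omer Dahary,Andrey Voynov,Daniel Cohen-Or
Affiliations: Tel Aviv University; Google DeepMind
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL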
Abstract:Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
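For readers unfamiliar with CFG, the standard guidance rule mixes the unconditional and conditional noise predictions using a scalar scale. The sketch below shows that rule together with a simple linearly decaying scale. The decay is only a hypothetical stand-in for a time-varying schedule: the paper learns its scheduling policy from the conditional noisy signal, and that learned policy is not reproduced here. The function names and default values are assumptions made for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_scale(step, num_steps, w_start=7.5, w_end=1.0):
    """Hypothetical schedule (not the paper's learned policy): interpolate
    the guidance scale linearly from w_start to w_end over the sampling steps."""
    if num_steps <= 1:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)
```

With a scale of 1.0 the rule returns the conditional prediction unchanged; larger scales push further along the direction from the unconditional to the conditional prediction, which is why the choice of scale affects both prompt alignment and image quality.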
[CV-7] DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World
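[Quick read]: This paper addresses the lack of grounded locations and relations for visual entities in existing caption datasets, and in particular the missing detailed descriptions, relations, and large numbers of object descriptions on high-resolution images in existing grounded caption datasets. The key to the solution is the DenseWorld-1M dataset, built with a three-stage labeling pipeline (open-world perception, detailed object caption generation, and dense caption merging) and two vision-language models (the Detailed Region Caption model and the Spatial Caption Merging model) that accelerate labeling and improve caption quality, yielding a massive, detailed, dense grounded caption dataset.
Link: https://arxiv.org/abs/2506.24102
Authors: Xiangtai Li,Tao Zhang,Yanwei Li,Haobo Yuan,Shihao Chen,Yikang Zhou,Jiahao Meng,Yueyi Sun,Shilin Xu,Lu Qi,Tianheng Cheng,Yi Lin,Zilong Huang,Wenhao Huang,Jiashi Feng,Guang Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Datasets and Models: this https URL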
Abstract:Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates the object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.
[CV-8] MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction
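[Quick read]: This paper addresses the problem of extracting accurate surface meshes when reconstructing high-quality 3D scenes from images with Gaussian Splatting. Existing methods extract the surface through costly post-processing, which either loses fine geometric detail or takes significant time and produces dense meshes with millions of vertices. The proposed solution is MILo, whose key idea is to differentiably extract the mesh, including vertex locations and connectivity, directly from the 3D Gaussians at every training iteration, so that both representations stay geometrically consistent during training. The method makes three technical contributions: a bidirectional consistency framework, an adaptive mesh extraction process, and a method for computing signed distance values from the 3D Gaussians, which together improve the accuracy and efficiency of surface reconstruction.
Link: https://arxiv.org/abs/2506.24096
Authors: Antoine Guédon,Diego Gomez,Nissim Maruani,Bingchen Gong,George Drettakis,Maks Ovsjanikov
Affiliations: École Polytechnique; Inria, Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages. A presentation video of our approach is available at this https URL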
Abstract:While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh, including both vertex locations and connectivity, at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations (Gaussians and the extracted mesh) capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.
[CV-9] WaRA: Wavelet Low Rank Adaptation BMVC2025
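[Quick read]: This paper addresses a limitation of existing parameter-efficient fine-tuning (PEFT) methods: they rely on global low-rank factorizations that overlook local or multi-scale structure and therefore fail to capture complex patterns in the weight updates. The key to the solution is WaRA, which uses wavelet transforms to decompose the weight update matrix into a multi-resolution representation, performs low-rank factorization in the wavelet domain, and reconstructs the update through an inverse transform. The resulting compressed adaptation parameters capture both coarse and fine-grained features and offer greater flexibility and sparser representations than standard LoRA.
Link: https://arxiv.org/abs/2506.24092
Authors: Moein Heidari,Yasamin Medghalchi,Mahdi Khoursha,Reza Rezaeian,Ilker Hacihaliloglu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to BMVC 2025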
Abstract:Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure, failing to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA achieves superior performance on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available on GitHub at this https URL.
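The idea of factorizing in the wavelet domain and then inverting the transform can be illustrated with a small, dependency-free sketch. It assumes a single-level orthonormal Haar transform and plain nested lists; the abstract does not say which wavelet family, how many decomposition levels, or what parameterization WaRA actually uses, so the helper names and these choices are illustrative assumptions rather than the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def haar_inverse_1d(coeffs):
    """Invert a single-level orthonormal Haar transform.
    Input layout is [approximation..., detail...] with even length."""
    half = len(coeffs) // 2
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(coeffs[:half], coeffs[half:]):
        out.extend([(a + d) * s, (a - d) * s])
    return out


def haar_inverse_2d(coeffs):
    """Apply the 1D inverse along rows, then along columns."""
    rows = [haar_inverse_1d(r) for r in coeffs]
    cols = [haar_inverse_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def wavelet_lowrank_update(B, A):
    """Assumed form of the update: the low-rank product B @ A is treated as
    wavelet-domain coefficients and mapped back to weight space."""
    return haar_inverse_2d(matmul(B, A))
```

In this sketch a rank-1 pair `B` (4x1) and `A` (1x4) with a single nonzero coefficient in the coarsest band yields a weight update that is constant over a 2x2 block, which shows how a few wavelet-domain parameters can describe a spatially localized, coarse-scale change.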
[CV-10] Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
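[Quick read]: This paper addresses the fact that cross-modal conceptual blending by humans is prone to cognitive biases such as design fixation, which lead to local minima in the design space. The key to the solution is IT-Blender, a text-to-image diffusion adapter that uses pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with a novel blended attention mechanism, it encodes the real reference image without loss of detail and blends the visual concept with the text-specified object in a disentangled way.
Link: https://arxiv.org/abs/2506.24085
Authors: Wonwoong Cho,Yanxia Zhang,Yan-Ying Chen,David I. Inouye
Affiliations: Purdue University; Toyota Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project website is available at this https URL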
Abstract:Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter “IT-Blender” that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.
[CV-11] Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios
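[Quick read]: This paper addresses the poor test-time generalization of object detectors trained under a closed-set assumption when the environment changes dynamically. The key to the solution is a new mechanism that converts the fine-tuning process into specific-parameter generation: a dual-path LoRA-based domain-aware adapter disentangles features into domain-invariant and domain-specific components, a conditional diffusion-based mechanism synthesizes the adapter's parameters for the current environment, and a class-centered optimal transport alignment method mitigates catastrophic forgetting, improving performance on continuous domain adaptive object detection tasks.
Link: https://arxiv.org/abs/2506.24063
Authors: Deng Li,Aming Wu,Yang Li,Yaowei Wang,Yahong Han
Affiliations: Tianjin University; Xidian University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: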
Abstract:In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors’ generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter’s parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.
[CV-12] A Survey on Vision-Language-Action Models for Autonomous Driving
[Quick read]: This survey addresses how to build interpretable, socially aligned autonomous vehicles, where the central challenge is effectively integrating and optimizing Vision-Language-Action (VLA) models. Its key contributions are to formalize the architectural building blocks shared across recent work, trace the evolution from early explainer models to reasoning-centric VLA models, and compare over 20 representative models according to VLA's progress in the autonomous driving domain, providing a systematic reference for future research.
Link: https://arxiv.org/abs/2506.24044
Authors: Sicong Jiang,Zilin Huang,Kangan Qian,Ziang Luo,Tianze Zhu,Yang Zhong,Yihong Tang,Menglin Kong,Yunlong Wang,Siwen Jiao,Hao Ye,Zihao Sheng,Xin Zhao,Tuopu Wen,Zheng Fu,Sikai Chen,Kun Jiang,Diange Yang,Seongjin Choi,Lijun Sun
Affiliations: McGill University, Canada; Tsinghua University, China; Xiaomi Corporation; University of Wisconsin–Madison, USA; University of Minnesota–Twin Cities, USA; State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA’s progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable socially aligned autonomous vehicles. The GitHub repo is available at SicongJiang/Awesome-VLA4AD.
[CV-13] Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data
[Quick read]: This paper addresses the difficulty zero-shot and prompt-based techniques have with visual reasoning tasks on scarce scientific image data. The key to the solution is Zenesis, a no-code interactive platform that uses lightweight multi-modal adaptation techniques to enable zero-shot operation on raw scientific data, combined with human-in-the-loop refinement and heuristic-based temporal enhancement options, thereby lowering the barrier of data readiness and improving analysis accuracy.
Link: https://arxiv.org/abs/2506.24039
Authors: Shubhabrata Mukherjee,Jack Lang,Obeen Kwon,Iryna Zenyuk,Valerie Brogden,Adam Weber,Daniela Ushizima
Affiliations: Lawrence Berkeley National Laboratory; University of California, Irvine; University of Oregon; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: This manuscript is a draft on arXiv. A final version has been submitted to the 59th ICPP 2025, DRAI workshop
Abstract:Zero-shot and prompt-based technologies capitalized on using frequently occurring images to transform visual reasoning tasks, which explains why such technologies struggle with valuable yet scarce scientific image sets. In this work, we propose Zenesis, a comprehensive no-code interactive platform designed to minimize barriers posed by data readiness for scientific images. We develop lightweight multi-modal adaptation techniques that enable zero-shot operation on raw scientific data, along with human-in-the-loop refinement and heuristic-based temporal enhancement options. We demonstrate the performance of our approach through comprehensive comparison and validation on challenging Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) data of catalyst-loaded membranes. Zenesis significantly outperforms baseline methods, achieving an average accuracy of 0.947, an Intersection over Union (IOU) of 0.858, and a Dice score of 0.923 for amorphous catalyst samples and accuracy of 0.987, an IOU of 0.857, and a Dice score of 0.923 for crystalline samples. These results mark a substantial improvement over traditional methods like Otsu thresholding and even advanced models like Segment Anything Model (SAM) when used in isolation. Our results demonstrate that Zenesis is a powerful tool for scientific applications, particularly in fields where high-quality annotated datasets are unavailable, accelerating accurate analysis of experimental imaging.
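The IoU and Dice scores quoted in the abstract are standard overlap measures for segmentation masks. A minimal sketch of how they are commonly computed for one pair of binary masks is shown below; it assumes masks flattened to 0/1 sequences and treats two empty masks as a perfect match, neither of which is stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_area + target_area)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU), so they always rank predictions the same way; averages over many images need not satisfy that identity exactly.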
[CV-14] The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models
[Quick read]: This paper addresses shortcomings in how test-time adaptation (TTA) methods for vision-language models (VLMs) are currently evaluated, including duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. The key to the solution is TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. It implements 8 episodic TTA methods and 7 online TTA methods, evaluates them on 15 widely used datasets, extends the evaluation to SigLIP and to training-time tuning methods, and adds metrics covering robustness, calibration, out-of-distribution detection, and stability for a more holistic assessment.
Link: https://arxiv.org/abs/2506.24000
Authors: Lijun Sheng,Jian Liang,Ran He,Zilei Wang,Tieniu Tan
Affiliations: University of Science and Technology of China; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Nanjing University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Github link: this https URL
Abstract:Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP, a model trained with a Sigmoid loss, and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.
[CV-15] StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving
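[Quick read]: This paper addresses the lack of systematic research and data support for personalized behavior in end-to-end autonomous driving (E2EAD), in particular for modeling user preferences over driving behavior. The key to the solution is the first large-scale real-world dataset annotated with driving preferences. It builds fine-grained scenarios by extracting static environmental features and inferring dynamic contextual cues with a fine-tuned visual language model (VLM), combines behavioral distribution analysis and rule-based heuristics with VLM-generated subjective annotations, and obtains high-quality labels through a human-in-the-loop verification process, providing a basis for developing and evaluating personalized E2EAD models.
Link: https://arxiv.org/abs/2506.23982
Authors: Ruiyang Hao,Bowen Jing,Haibao Yu,Zaiqing Nie
Affiliations: Tsinghua University; The University of Manchester; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 14 pages, 4 figures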
Abstract:While personalization has been explored in traditional autonomous driving systems, it remains largely overlooked in end-to-end autonomous driving (E2EAD), despite its growing prominence. This gap is critical, as user-aligned behavior is essential for trust, comfort, and widespread adoption of autonomous vehicles. A core challenge is the lack of large-scale real-world datasets annotated with diverse and fine-grained driving preferences, hindering the development and evaluation of personalized E2EAD models. In this work, we present the first large-scale real-world dataset enriched with annotations capturing diverse driving preferences, establishing a foundation for personalization in E2EAD. We extract static environmental features from real-world road topology and infer dynamic contextual cues using a fine-tuned visual language model (VLM), enabling consistent and fine-grained scenario construction. Based on these scenarios, we derive objective preference annotations through behavioral distribution analysis and rule-based heuristics. To address the inherent subjectivity of driving style, we further employ the VLM to generate subjective annotations by jointly modeling scene semantics and driver behavior. Final high-quality labels are obtained through a human-in-the-loop verification process that fuses both perspectives. Building on this dataset, we propose the first benchmark for evaluating personalized E2EAD models. We assess several state-of-the-art models with and without preference conditioning, demonstrating that incorporating personalized preferences results in behavior more aligned with human driving. Our work lays the foundation for personalized E2EAD by providing a standardized platform to systematically integrate human preferences into data-driven E2EAD systems, catalyzing future research in human-centric autonomy.
[CV-16] Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance
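[Quick read]: This paper addresses contrastive explanation for image classification, that is, explaining why a model prefers one class over another for a given input instance. The key to the solution is to leverage the similarity of instance embeddings and the relevance of human-understandable concepts: concepts and their relevance scores are extracted from a fine-tuned deep learning model, and contrasts are computed for similar instances to produce concept-based contrastive explanations. The resulting explanations are evaluated by their complexity, and their robustness is tested under different image augmentations.
Link: https://arxiv.org/abs/2506.23975
Authors: Yuliia Kaidashova,Bettina Finzel,Ute Schmid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 6 figures, KI2025 - 48th German Conference on Artificial Intelligence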
Abstract:Understanding why a classification model prefers one class over another for an input instance is the challenge of contrastive explanation. This work implements concept-based contrastive explanations for image classification by leveraging the similarity of instance embeddings and relevance of human-understandable concepts used by a fine-tuned deep learning model. Our approach extracts concepts with their relevance score, computes contrasts for similar instances, and evaluates the resulting contrastive explanations based on explanation complexity. Robustness is tested for different image augmentations. Two research questions are addressed: (1) whether explanation complexity varies across different relevance ranges, and (2) whether explanation complexity remains consistent under image augmentations such as rotation and noise. The results confirm that for our experiments higher concept relevance leads to shorter, less complex explanations, while lower relevance results in longer, more diffuse explanations. Additionally, explanations show varying degrees of robustness. The discussion of these findings offers insights into the potential of building more interpretable and robust AI systems.
[CV-17] Visual and Memory Dual Adapter for Multi-Modal Object Tracking
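[Quick read]: This paper addresses the difficulty of learning reliable prompts in multi-modal tracking, which stems from limited exploitation of critical cues across the frequency and temporal domains. The key to the solution is a visual and memory dual adapter (VMDA). The visual adapter adaptively transfers discriminative cues from the auxiliary modality to the dominant modality by jointly modeling frequency, spatial, and channel-wise features. The memory adapter, inspired by the human memory mechanism, stores global temporal cues and performs dynamic update and retrieval operations to ensure consistent propagation of reliable temporal information across video sequences.
Link: https://arxiv.org/abs/2506.23972
Authors: Boyue Xu,Ruichao Hou,Tongwei Ren,Gangshan Wu
Affiliations: Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: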
Abstract:Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from auxiliary modality to dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at this https URL.
[CV-18] Evaluating the Impact of Khmer Font Types on Text Recognition
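[Quick read]: This paper addresses text recognition accuracy in optical character recognition (OCR) for complex scripts such as Khmer, and in particular how font type affects recognition. By evaluating 19 randomly selected Khmer fonts with Pytesseract, the study shows that font choice has a decisive effect on accuracy: Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. The study underscores the importance of font selection for Khmer text recognition and provides data to support the development of more robust OCR systems.
Link: https://arxiv.org/abs/2506.23963
Authors: Vannkinh Nom,Souhail Bakkali,Muhammad Muzzamil Luqman,Mickael Coustaty,Jean-Marc Ogier
Affiliations: La Rochelle University; Cambodia Academy of Digital Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: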
Abstract:Text recognition is significantly influenced by font types, especially for complex scripts like Khmer. The variety of Khmer fonts, each with its unique character structure, presents challenges for optical character recognition (OCR) systems. In this study, we evaluate the impact of 19 randomly selected Khmer font types on text recognition accuracy using Pytesseract. The fonts include Angkor, Battambang, Bayon, Bokor, Chenla, Dangrek, Freehand, Kh Kompong Chhnang, Kh SN Kampongsom, Khmer, Khmer CN Stueng Songke, Khmer Savuth Pen, Metal, Moul, Odor MeanChey, Preah Vihear, Siemreap, Sithi Manuss, and iSeth First. Our comparison of OCR performance across these fonts reveals that Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the critical importance of font selection in optimizing Khmer text recognition and provides valuable insights for developing more robust OCR systems.
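The abstract does not say how recognition accuracy was scored. One common choice for OCR is character-level accuracy derived from edit distance, sketched below under that assumption. The function names are hypothetical, the recognized text is assumed to come from the OCR engine, and characters are counted as Unicode code points, which for Khmer means that subscript signs and vowel marks are each counted separately.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, hypothesis):
    """Assumed metric: 1 minus the character error rate, clipped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Comparing fonts would then amount to rendering the same reference text in each font, running OCR on each rendering, and averaging `char_accuracy` per font.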
[CV-19] GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering SIGGRAPH2025
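[Quick read]: This paper addresses geometric distortions, excessive cropping, and poor generalization in video stabilization, all of which degrade the user experience. The key to the solution is GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally consistent “local reconstruction and rendering” paradigm. Given 3D camera poses, it augments a reconstruction model to predict Gaussian Splatting primitives and fine-tunes it at test time with multi-view dynamics-aware photometric supervision and cross-frame regularization, producing temporally consistent local reconstructions from which stabilized frames are rendered.
Link: https://arxiv.org/abs/2506.23957
Authors: Zinuo You,Stamatios Georgoulis,Anpei Chen,Siyu Tang,Dengxin Dai
Affiliations: ETH Zürich; Huawei Research Zürich; Westlake University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025, project website: this https URL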
Abstract:Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain they operate in, suffer from several issues (e.g. geometric distortions, excessive cropping, poor generalization) that degrade the user experience. To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent “local reconstruction and rendering” paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives, and finetune it at test-time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally-consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping. Our method is evaluated on a repurposed dataset, instilled with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of conventional task metrics and new geometry consistency. Qualitatively, our method produces noticeably better results compared to alternatives, validated by the user study.
[CV-20] Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
[Quick read]: This survey addresses the “semantic gap” in multimodal reasoning caused by text-centric Chain-of-Thought (CoT) methods, which treat vision as a static, initial context that is never effectively integrated with discrete symbolic thought. The key idea is the emerging “think with image” paradigm, in which visual information serves as intermediate steps in the thought process, turning vision from a passive input into a dynamic, manipulable cognitive workspace and moving multimodal AI closer to human cognition.
Link: https://arxiv.org/abs/2506.23918
Authors: Zhaochen Su,Peng Xia,Hangyu Guo,Zhenhua Liu,Yan Ma,Xiaoye Qu,Jiaqi Liu,Yanshu Li,Kaide Zeng,Zhengyuan Yang,Linjie Li,Yu Cheng,Heng Ji,Junxian He,Yi R. (May) Fung
Affiliations: The Hong Kong University of Science and Technology; UNC-Chapel Hill; Microsoft; The Chinese University of Hong Kong; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: We maintain a real-time GitHub repository tracking progress at: this https URL
Abstract:Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental “semantic gap” between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
[CV-21] Three-dimensional end-to-end deep learning for brain MRI analysis
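[Quick read]: This paper addresses the inadequately assessed generalizability of deep learning methods across diverse brain imaging cohorts, using age and sex prediction as the evaluation tasks. The study compares three existing three-dimensional architectures (Simple Fully Connected Network, DenseNet, and Shifted Window Transformers) across several independent cohorts and finds that the Simple Fully Connected Network (SFCN) generalizes better and predicts more accurately than the more complex, attention-based models, suggesting that simpler convolutional networks may be preferable for brain image analysis.
Link: https://arxiv.org/abs/2506.23916
Authors: Radhika Juglan,Marta Ligero,Zunamys I. Carrero,Asier Rabasco,Tim Lenz,Leo Misera,Gregory Patrick Veldhuizen,Paul Kuntke,Hagen H. Kitzler,Sven Nebelung,Daniel Truhn,Jakob Nikolas Kather
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: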
Abstract:Deep learning (DL) methods are increasingly outperforming classical approaches in brain imaging, yet their generalizability across diverse imaging cohorts remains inadequately assessed. As age and sex are key neurobiological markers in clinical neuroscience, influencing brain structure and disease risk, this study evaluates three of the existing three-dimensional architectures, namely Simple Fully Connected Network (SFCN), DenseNet, and Shifted Window (Swin) Transformers, for age and sex prediction using T1-weighted MRI from four independent cohorts: UK Biobank (UKB, n=47,390), Dallas Lifespan Brain Study (DLBS, n=132), Parkinson’s Progression Markers Initiative (PPMI, n=108 healthy controls), and Information eXtraction from Images (IXI, n=319). We found that SFCN consistently outperformed more complex architectures with AUC of 1.00 [1.00-1.00] in UKB (internal test set) and 0.85-0.91 in external test sets for sex classification. For the age prediction task, SFCN demonstrated a mean absolute error (MAE) of 2.66 (r=0.89) in UKB and 4.98-5.81 (r=0.55-0.70) across external datasets. Pairwise DeLong and Wilcoxon signed-rank tests with Bonferroni corrections confirmed SFCN’s superiority over Swin Transformer across most cohorts (p<0.017, for three comparisons). Explainability analysis further demonstrates the regional consistency of model attention across cohorts and specific to each task. Our findings reveal that simpler convolutional networks outperform the denser and more complex attention-based DL architectures in brain image analysis by demonstrating better generalizability across different datasets.
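The 0.017 threshold quoted for three comparisons is what a Bonferroni correction would give if the family-wise significance level were 0.05. The abstract does not state that level, so 0.05 is an assumption here, and the function name is only illustrative. A minimal sketch of the arithmetic:

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


# Assumed family-wise level of 0.05 (not stated in the abstract),
# split across the three pairwise comparisons it mentions.
threshold = bonferroni_threshold(0.05, 3)
print(round(threshold, 3))
```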
[CV-22] GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models
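Quick read: This paper tackles accurate and generalizable object segmentation in ultrasound imaging, which is hampered by anatomical variability, diverse imaging protocols, and limited annotated data. The key to its solution is a prompt-driven vision-language model (VLM) that combines Grounding DINO with SAM2 to segment objects across multiple ultrasound organs. Using 18 public ultrasound datasets (15 for fine-tuning and validation, 3 held out entirely for testing), with Grounding DINO adapted to the ultrasound domain via Low-Rank Adaptation (LoRA), the method outperforms state-of-the-art segmentation methods on most seen datasets and maintains strong performance on unseen datasets without additional fine-tuning.
Link: https://arxiv.org/abs/2506.23903
Authors: Hamza Rasaee,Taha Koleilat,Hassan Rivaz
Affiliations: Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 6 figures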
Abstract:Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on this http URL after acceptance.
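For readers unfamiliar with Low Rank Adaptation, the general idea is that a frozen weight matrix W is augmented with a trainable low-rank product, giving an effective weight W + (alpha / r) · B · A. The sketch below illustrates only that generic update in plain Python. It is not the authors' code, and the matrix sizes, rank, and `alpha` value are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight W.

    w:      d_out x d_in frozen pretrained weight
    lora_a: r x d_in trainable down-projection
    lora_b: d_out x r trainable up-projection
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # d_out x d_in update of rank at most r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical toy sizes: d_out=2, d_in=3, rank r=1.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
lora_a = [[0.1, 0.2, 0.3]]
lora_b_zero = [[0.0], [0.0]]  # B starts at zero, so the update starts at zero
unchanged = lora_effective_weight(w, lora_a, lora_b_zero, alpha=8.0)
```

Because B is initialised to zero in the standard formulation, the adapted layer initially reproduces the pretrained one, and only the small matrices A and B need to be trained.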
[CV-23] PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
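Quick read: This paper addresses the performance drop of conventional perspective-based optical flow methods on panoramic imagery, caused by the severe distortion of sphere-to-plane projections such as the equirectangular projection (ERP), especially in polar regions. The key to its solution is PriOr-Flow, a dual-branch framework that exploits the low-distortion nature of the orthogonal view. Its Dual-Cost Collaborative Lookup (DCCL) operator jointly retrieves correlation information from the primitive and orthogonal cost volumes, mitigating distortion noise during cost volume construction, while the Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches to further suppress polar distortions.
Link: https://arxiv.org/abs/2506.23897
Authors: Longliang Liu,Miaojie Feng,Junda Cheng,Jijun Xiang,Xuan Zhu,Xin Yang
Affiliations: School of EIC, Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages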
Abstract:Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: this https URL.
[CV-24] Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection
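Quick read: This paper addresses the degradation of out-of-distribution (OOD) detection caused by spurious correlations when machine learning models face unseen data distributions in real-world applications. The key to its solution is SPROD, a prototype-based OOD detection method that refines class prototypes post hoc to mitigate bias from spurious features, without additional data or hyperparameter tuning, improving detection performance across multiple benchmark datasets.
Link: https://arxiv.org/abs/2506.23881
Authors: Reihaneh Zohrabi,Hosein Hasani,Mahdieh Soleymani Baghshah,Anna Rohrbach,Marcus Rohrbach,Mohammad Hossein Rohban
Affiliations: TU Darmstadt; Sharif University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: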
Abstract:Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.7% and FPR@95 by 9.3% over the second best.
[CV-25] Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction
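Quick read: This paper addresses the limited diversity and scale of training data in multi-view 3D reconstruction, which constrains the performance of methods such as DUST3R. The key to its solution is Puzzles, a data augmentation strategy that synthesizes an unbounded volume of high-quality posed video-depth data from a single image or video clip, using targeted image transformations to simulate diverse camera trajectories and realistic scene geometry and thereby significantly increasing data variety. Experiments show that integrating Puzzles into existing video-based 3D reconstruction pipelines improves performance without modifying the underlying network architecture.
Link: https://arxiv.org/abs/2506.23863
Authors: Jiahao Ma,Lei Wang,Miaomiao liu,David Ahmedt-Aristizabal,Chuong Nguyen
Affiliations: Australian National University; Griffith University; CSIRO’s Data61
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Feed-forward 3D reconstruction, Data Augmentation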
Abstract:Multi-view 3D reconstruction remains a core challenge in computer vision. Recent methods, such as DUST3R and its successors, directly regress pointmaps from image pairs without relying on known scene geometry or camera parameters. However, the performance of these models is constrained by the diversity and scale of available training data. In this work, we introduce Puzzles, a data augmentation strategy that synthesizes an unbounded volume of high-quality posed video-depth data from a single image or video clip. By simulating diverse camera trajectories and realistic scene geometry through targeted image transformations, Puzzles significantly enhances data variety. Extensive experiments show that integrating Puzzles into existing video-based 3D reconstruction pipelines consistently boosts performance without modifying the underlying network architecture. Notably, models trained on only ten percent of the original data augmented with Puzzles still achieve accuracy comparable to those trained on the full dataset. Code is available at this https URL.
[CV-26] VMoBA: Mixture-of-Block Attention for Video Diffusion Models
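Quick read: This paper addresses the quadratic complexity of full attention in Video Diffusion Models (VDMs), which limits the efficiency of generating long-duration, high-resolution videos. The key to its solution is Video Mixture of Block Attention (VMoBA), a sparse attention mechanism designed for VDMs. Through three core modifications, namely a layer-wise recurrent block partition scheme (1D-2D-3D), global block selection, and threshold-based block selection, it adapts dynamically to diverse spatio-temporal attention patterns, improving computational efficiency while preserving generation quality.
Link: https://arxiv.org/abs/2506.23858
Authors: Jianzong Wu,Liang Hou,Haotian Yang,Xin Tao,Ye Tian,Pengfei Wan,Di Zhang,Yunhai Tong
Affiliations: Peking University; Kling Team, Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is at this https URL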
Abstract:The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
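The threshold-based block selection in point (3) can be pictured as a "keep adding blocks until enough similarity mass is covered" rule. The sketch below is one plausible reading, not the paper's implementation: it assumes block scores are softmax-normalised and that blocks are kept in descending order until their cumulative weight reaches a threshold `tau`. The function name and both assumptions are hypothetical.

```python
import math


def select_blocks_by_threshold(scores, tau):
    """Return indices of the highest-scoring key blocks whose softmax
    weights cumulatively reach at least `tau` (0 < tau <= 1)."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    if not scores:
        return []
    # Numerically stable softmax over the block-level similarity scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Visit blocks from most to least similar and stop once the
    # accumulated weight reaches the threshold.
    order = sorted(range(len(scores)), key=lambda i: weights[i], reverse=True)
    chosen, cumulative = [], 0.0
    for idx in order:
        chosen.append(idx)
        cumulative += weights[idx]
        if cumulative >= tau:
            break
    return chosen


# A sharply peaked head needs one block; a flat head needs all of them.
peaked = select_blocks_by_threshold([4.0, 0.0, 0.0, 0.0], tau=0.9)
flat = select_blocks_by_threshold([1.0, 1.0, 1.0, 1.0], tau=0.9)
```

Under this reading, the number of attended blocks varies per query or head rather than being a fixed top-k, which matches the abstract's description of determining the count dynamically.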
[CV-27] A Closer Look at Conditional Prompt Tuning for Vision-Language Models
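Quick read: This paper addresses the Base-New Tradeoff (BNT) dilemma that vision-language pretrained models (VLPMs) face under Prompt Tuning (PT): the better a model is tuned to a base task, the weaker its generalization to new tasks. Existing conditional PT methods try to alleviate this by using Visual Image Information (VII) as the "condition" of the prompts, but the paper finds that this yields suboptimal performance and that even prompts conditioned on random noise can outperform them. Further analysis indicates that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem. Motivated by this, the authors propose Class-adaptive Prompt Tuning (CaPT), which learns TCI-conditioned prompts from base classes to enable fast adaptation to new classes and can be used as a plugin to mitigate the BNT problem of existing unconditional PT schemes.
Link: https://arxiv.org/abs/2506.23856
Authors: Ji Zhang,Shihan Wu,Lianli Gao,Jingkuan Song,Nicu Sebe,Heng Tao Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages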
Abstract:Despite the great promise of Prompt Tuning (PT) in adapting large Vision-Language Pretrained Models (VLPMs) to downstream tasks, they often struggle to overcome the Base-New Tradeoff (BNT) dilemma: as VLPMs are better tuned to a base task, their ability to generalize to new tasks diminishes. Recent work on conditional PT addresses this problem by replacing static prompts with dynamic Visual Image Information (VII)-conditioned prompts, improving the model’s generalization to new tasks to some extent. In this work, we first identify a critical issue with existing conditional PT methods: using VII as the “condition” of prompts yields suboptimal performance, and even random noise-conditioned prompts can outperform the VII-conditioned counterparts. On further analysis, we find that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem. Motivated by this, we then propose Class-adaptive Prompt Tuning (CaPT), which enables fast adaptation of tuned models to new classes by learning TCI-conditioned prompts from base classes. Remarkably, CaPT can be used as a plugin to mitigate the BNT problem for existing unconditional PT schemes. Extensive experiments on 11 datasets show that CaPT consistently improves the performance of five strong unconditional PT baselines with negligible additional computational cost. Additionally, by integrating CaPT with our recently proposed DePT framework, we devise a new conditional PT approach, termed DeCaPT, which outperforms the H ACC of the state-of-the-art conditional PT scheme by 3.49%, averaged over the 11 datasets. Code: this https URL.
[CV-28] HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity ICCV
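Quick read: This paper addresses the difficulty of reconciling geometric fidelity with photometric consistency in neural surface reconstruction under complex scene conditions. The key to its solution is HiNeuS, a unified framework with three core components: differential visibility verification through SDF-guided ray tracing to resolve reflection ambiguities; planar-conformal regularization via ray-aligned geometry patches to maintain local surface coherence while preserving sharp edges; and physically grounded Eikonal relaxation that dynamically modulates geometric constraints to preserve detail without sacrificing global regularity. Appearance and geometry constraints evolve together during joint optimization, which improves reconstruction quality.
Link: https://arxiv.org/abs/2506.23854
Authors: Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Xianpeng Lang
Affiliations: Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Published in International Conference on Computer Vision (ICCV) 2025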
Abstract:Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. To resolve these issues through a unified pipeline, we introduce: 1) Differential visibility verification through SDF-guided ray tracing, resolving reflection ambiguities via continuous occlusion modeling; 2) Planar-conformal regularization via ray-aligned geometry patches that enforce local surface coherence while preserving sharp edges through adaptive appearance weighting; and 3) Physically-grounded Eikonal relaxation that dynamically modulates geometric constraints based on local radiance gradients, enabling detail preservation without sacrificing global regularity. Unlike prior methods that handle these aspects through sequential optimizations or isolated modules, our approach achieves cohesive integration where appearance-geometry constraints evolve synergistically throughout training. Comprehensive evaluations across synthetic and real-world datasets demonstrate state-of-the-art performance, including a 21.4% reduction in Chamfer distance over reflection-aware baselines and 2.32 dB PSNR improvement against neural rendering counterparts. Qualitative analyses reveal superior capability in recovering specular instruments, urban layouts with centimeter-scale infrastructure, and low-textured surfaces without local patch collapse. The method’s generalizability is further validated through successful application to inverse rendering tasks, including material decomposition and view-consistent relighting.
[CV-29] RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment
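Quick read: This paper addresses visual quality assessment of Robotic-Generated Content (RGC), a topic on which dedicated research is still lacking. The key to its solution is the first Robotic-Generated Content Database (RGCD), which contains 2,100 videos from three robot categories and multiple platforms. A subjective video quality assessment (VQA) experiment and a benchmark of existing VQA models expose the limitations of those models on RGC, highlighting the need for RGC-specific quality assessment models.
Link: https://arxiv.org/abs/2506.23852
Authors: Jianing Jin,Jiangyong Ying,Huiyu Duan,Liu Yang,Sijing Wu,Yunhao Li,Yushuo Zheng,Xiongkuo Min,Guangtao Zhai
Affiliations: Shanghai Jiaotong University; China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: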
Abstract:As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We innovatively propose the concept of Robotic-Generated Content (RGC) to term these videos generated from egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is conducted subsequently to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: this https URL.
[CV-30] Refine Any Object in Any Scene
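Quick read: This paper addresses missing object viewpoints in scene reconstruction, which arise because camera paths prioritize the overall scene structure over individual objects and make it very hard to achieve high-fidelity object-level modeling while keeping an accurate scene-level representation. The key to its solution is RAISE (Refine Any object In any ScenE), a 3D enhancement framework that uses 3D generative priors to recover fine-grained object geometry and appearance under missing views. It substitutes degraded objects with proxies produced by a 3D generative model, aligns each proxy to its degraded counterpart in 7-DOF pose, and then corrects spatial and appearance inconsistencies through registration-constrained enhancement, yielding high-fidelity geometry and appearance in unseen views while keeping spatial positioning, observed geometry, and appearance consistent.
Link: https://arxiv.org/abs/2506.23835
Authors: Ziwei Chen,Ziling Liu,Zitong Huang,Mingqi Gao,Feng Zheng
Affiliations: Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages with 6 figures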
Abstract:Viewpoint missing of objects is common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object understanding and appearance modeling. In this paper, we introduce Refine Any object In any ScenE (RAISE), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting from substituting degraded objects with proxies, via a 3D generative model with strong 3D understanding, RAISE progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DOF pose, followed by correcting spatial and appearance inconsistencies via registration-constrained enhancement. This two-stage refinement ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Extensive experiments on challenging benchmarks show that RAISE significantly outperforms state-of-the-art methods in both novel view synthesis and geometry completion tasks. RAISE is made publicly available at this https URL.
[CV-31] PointSSIM: A novel low dimensional resolution invariant image-to-image comparison metric
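Quick read: This paper addresses robust comparison of binary images of differing resolutions, where conventional methods can fail because of the resolution mismatch. The key to its solution is to transform binary images into marked point pattern representations, extract the key image features (anchor points) by identifying locally adaptive maxima of the minimal distance transform, and then compare images using a summary vector that captures intensity, connectivity, complexity, and structural attributes.
Link: https://arxiv.org/abs/2506.23833
Authors: Oscar Ovanger,Ragnar Hauge,Jacob Skauvold,Michael J. Pyrcz,Jo Eidsvik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 20 figures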
Abstract:This paper presents PointSSIM, a novel low-dimensional image-to-image comparison metric that is resolution invariant. Drawing inspiration from the structural similarity index measure and mathematical morphology, PointSSIM enables robust comparison across binary images of varying resolutions by transforming them into marked point pattern representations. The key features of the image, referred to as anchor points, are extracted from binary images by identifying locally adaptive maxima from the minimal distance transform. Image comparisons are then performed using a summary vector, capturing intensity, connectivity, complexity, and structural attributes. Results show that this approach provides an efficient and reliable method for image comparison, particularly suited to applications requiring structural analysis across different resolutions.
[CV-32] Low-latency vision transformers via large-scale multi-head attention
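Quick read: This paper addresses the reliance of conventional convolutional neural networks (CNNs) on large numbers of filters to reach high classification accuracy, and explores how the multi-head attention (MHA) mechanism in vision transformer (ViT) architectures can be optimized. The key to its solution is quantifying single-nodal performance (SNP) and single-head performance (SHP) to reveal spontaneous symmetry breaking in MHA, and building on this to construct large-scale multi-head attention (LS-MHA), in which each SHP matrix comprises multiple unit clusters, raising the signal-to-noise ratio (SNR) and improving classification accuracy. In addition, replacing the initial transformer blocks with convolutional layers significantly reduces latency without affecting accuracy, and the mechanism shows potential to generalize to other tasks.
Link: https://arxiv.org/abs/2506.23832
Authors: Ronit D. Gross,Tal Halevi,Ella Koresh,Yarden Tzach,Ido Kanter
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 4 figures, 7 tables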
Abstract:The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label being explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting the accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers. The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNNs and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.
[CV-33] Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning MICCAI2025
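Quick read: This paper addresses the high cost and complexity that limit spatial transcriptomics (ST), as well as the failure of existing methods for predicting gene expression from pathology whole slide images (WSI) to handle the complex spatial and molecular interactions between target and neighboring information. The key to its solution is NH2ST, a framework that integrates spatial context with both pathology and gene modalities. A query branch and a neighbor branch process the paired target patch and gene data together with their neighboring regions, and cross-attention and contrastive learning capture intrinsic associations and ensure alignment between pathology and gene expression.
Link: https://arxiv.org/abs/2506.23827
Authors: Mingcheng Qu,Yuncong Wu,Donglin Di,Yue Gao,Tonghua Su,Yang Song,Lei Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our paper has been accepted by MICCAI 2025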
Abstract:Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited to its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactions between target and neighboring information (e.g., gene co-expression). This leads to a failure in establishing connections among adjacent regions and capturing intricate cross-modal relationships. To address these issues, we propose NH2ST, a framework that integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model comprises a query branch and a neighbor branch to process paired target patch and gene data and their neighboring regions, where cross-attention and contrastive learning are employed to capture intrinsic associations and ensure alignments between pathology and gene expression. Extensive experiments on six datasets demonstrate that our model consistently outperforms existing methods, achieving over 20% in PCC metrics. Codes are available at this https URL
[CV-34] Flash-VStream: Efficient Real-Time Understanding for Long Video Streams ICCV2025
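Quick read: This paper addresses the heavy computational and memory overhead of long video understanding; existing methods treat long videos the same way as short ones, which is inefficient and hard to extend to even longer videos. The key to its solution is Flash-VStream, whose core is a Flash Memory module comprising a low-capacity context memory and a high-capacity augmentation memory, used to efficiently aggregate long-context temporal information and retrieve detailed spatial information, significantly reducing inference latency and improving the efficiency of long video processing.
Link: https://arxiv.org/abs/2506.23825
Authors: Haoji Zhang,Yiqin Wang,Yansong Tang,Yong Liu,Jiashi Feng,Xiaojie Jin
Affiliations: Shenzhen International Graduate School, Tsinghua University; Beijing Jiaotong University; ByteDance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025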
Abstract:Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at this https URL.
[CV-35] Supercm: Revisiting Clustering for Semi-Supervised Learning
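Quick read: This paper addresses the complex training strategies in semi-supervised learning (SSL) that result from relying on consistency regularization or entropy minimization. The key to its solution is to explicitly incorporate the underlying clustering assumption of SSL by extending a recently proposed differentiable clustering module, using annotated data to guide the cluster centroids, which yields a simple, end-to-end trainable deep SSL method.
Link: https://arxiv.org/abs/2506.23824
Authors: Durgesh Singh,Ahcene Boubekki,Robert Jenssen,Michael C. Kampffmeyer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: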
Abstract:The development of semi-supervised learning (SSL) has in recent years largely focused on the development of new consistency regularization or entropy minimization approaches, often resulting in models with complex training strategies to obtain the desired results. In this work, we instead propose a novel approach that explicitly incorporates the underlying clustering assumption in SSL through extending a recently proposed differentiable clustering module. Leveraging annotated data to guide the cluster centroids results in a simple end-to-end trainable deep SSL approach. We demonstrate that the proposed model improves the performance over the supervised-only baseline and show that our framework can be used in conjunction with other SSL methods to further boost their performance.
[CV-36] Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model ICCV’25
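Quick read: This paper addresses the lack of interpretability of large-scale vision-language models (VLMs) in zero-shot learning (ZSL): existing methods predict by computing the similarity between the entire query image and embedded category words, which makes their predictions hard to explain. The key to its solution is LaZSL, a locally-aligned vision-language model that uses optimal transport to achieve local visual-semantic alignment between visual regions and their associated attributes, providing interpretable similarity without additional training.
Link: https://arxiv.org/abs/2506.23822
Authors: Shiming Chen,Bowen Duan,Salman Khan,Fahad Shahbaz Khan
Affiliations: Mohamed bin Zayed University of AI; Huazhong University of Science and Technology; Australian National University; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV’25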
Abstract:Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Codes available at: this https URL.
[CV-37] MadCLIP: Few-shot Medical Anomaly Detection with CLIP MICCAI2025 MICCAI
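Quick read: This paper addresses anomaly detection in medical images, covering image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). The key to its solution is to build on the pre-trained CLIP model with a dual-branch design that captures normal and abnormal features separately through learnable adapters, together with learnable text prompts that improve semantic alignment. It also applies the SigLIP loss in the medical domain for the first time, to handle the many-to-one relationship between images and unpaired text prompts and thereby improve detection performance.
Link: https://arxiv.org/abs/2506.23810
Authors: Mahshid Shiri,Cigdem Beyan,Vittorio Murino
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025 (this version is not peer-reviewed; it is the submitted version). MICCAI proceedings DOI will appear here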
Abstract:An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at this https URL.
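The SigLIP loss mentioned above treats every image-text pair as an independent binary decision, which is why several images can share one prompt as a positive. The sketch below shows the generic pairwise sigmoid loss in plain Python. It is not the paper's implementation, and the temperature `t`, bias `b`, and toy similarity values are illustrative assumptions.

```python
import math


def sigmoid_pairwise_loss(sims, labels, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over an image-by-prompt similarity matrix.

    sims[i][j]:   cosine similarity between image i and text prompt j
    labels[i][j]: +1 if prompt j describes image i, otherwise -1
    t, b:         temperature and bias (illustrative values, assumed here)
    """
    n_images = len(sims)
    total = 0.0
    for sim_row, label_row in zip(sims, labels):
        for sim, z in zip(sim_row, label_row):
            logit = z * (t * sim + b)
            # -log(sigmoid(logit)) written as a numerically stable softplus.
            total += max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return total / n_images


# Hypothetical many-to-one setup: images 0 and 1 both match prompt 0.
labels = [[1, -1], [1, -1], [-1, 1]]
aligned = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
misaligned = [[0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]
loss_aligned = sigmoid_pairwise_loss(aligned, labels)
loss_misaligned = sigmoid_pairwise_loss(misaligned, labels)
```

Unlike a softmax contrastive loss, no row or column is forced to sum to one, so a prompt with several matching images does not make those images compete with each other.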
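[CV-38] Towards Initialization-free Calibrated Bundle Adjustment
Quick read: This paper addresses initialization-free bundle adjustment (BA) for image-based 3D reconstruction. Existing approaches rely on the pseudo Object Space Error (pOSE) as a surrogate objective, but this optimization only recovers the scene up to a projective transformation and cannot exploit known camera calibration, so it requires more data for successful reconstruction. The key to its solution is to introduce pairwise relative rotation estimates, which carry information about camera calibration and are invariant only to similarity transformations, encouraging solutions that preserve the metric features of the real scene and yielding near-metric reconstructions. By integrating rotation averaging into the pOSE framework, the method moves towards initialization-free calibrated SfM (Structure from Motion).
Link: https://arxiv.org/abs/2506.23808
Authors: Carl Olsson,Amanda Nilsson
Affiliations: Lund University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: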
Abstract:A recent series of works has shown that initialization-free BA can be achieved using pseudo Object Space Error (pOSE) as a surrogate objective. The initial reconstruction-step optimizes an objective where all terms are projectively invariant and it cannot incorporate knowledge of the camera calibration. As a result, the solution is only determined up to a projective transformation of the scene and the process requires more data for successful reconstruction. In contrast, we present a method that is able to use the known camera calibration thereby producing near metric solutions, that is, reconstructions that are accurate up to a similarity transformation. To achieve this we introduce pairwise relative rotation estimates that carry information about camera calibration. These are only invariant to similarity transformations, thus encouraging solutions that preserve metric features of the real scene. Our method can be seen as integrating rotation averaging into the pOSE framework striving towards initialization-free calibrated SfM. Our experimental evaluation shows that we are able to reliably optimize our objective, achieving convergence to the global minimum with high probability from random starting solutions, resulting in accurate near metric reconstructions.
[CV-39] Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors
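[Quick read]: This paper targets the challenges of real-world remote sensing image super-resolution (SR), including cross-sensor resolution gaps and significant land cover changes, which cause existing reference-based SR (RefSR) methods to under-generate or over-rely on the reference image. The key to its solution is CRefDiff, a controllable reference-based diffusion model built on the pretrained Stable Diffusion model, whose strong generative prior is used to produce accurate structures and textures. A dual-branch fusion mechanism adaptively integrates local and global information from the reference image and allows the reference strength to be controlled at inference time, which improves interactivity and flexibility. A strategy named Better Start is also proposed to reduce the number of denoising steps and accelerate inference.
Link: https://arxiv.org/abs/2506.23801
Authors: Ce Wang,Wanjie Sun
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: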
Abstract:Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gap and significant land cover changes, often leading to under-generation or over-reliance on reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this novel dual-branch design enables reference strength control during inference, enhancing interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.
[CV-40] Visual Textualization for Image Prompted Object Detection ICCV2025
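[Quick read]: This paper addresses the weak performance of Object-level Vision-Language Models (OVLMs) on rare categories that are hard to describe in text and nearly absent from pre-training data. The key to the solution is visual textualization: a few visual exemplars are projected into the text feature space to produce textualized visual tokens, which strengthens the detection capability of OVLMs while preserving their original object-text alignment. The method uses multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information, and improves few-shot performance without changing the original OVLM architecture.
Link: https://arxiv.org/abs/2506.23785
Authors: Yongjian Wu,Yang Zhou,Jiya Saiyin,Bingzheng Wei,Yan Xu
Affiliations: Beihang University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025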
Abstract:We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization – a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models’ (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of OVLM, maintaining its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM’s pre-training data and achieves state-of-the-art results on few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at this https URL.
[CV-41] Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking
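[Quick read]: This paper addresses the high computational overhead and limited cross-modal interaction that result from relying on high-complexity Vision Transformer architectures in multimodal object tracking. The key to its solution is Mamba-FETrack V2, an efficient RGB-Event tracking framework based on the linear-complexity Vision Mamba network. A lightweight Prompt Generator produces modality-specific learnable prompt vectors, and the Vision Mamba backbone performs prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner, improving both tracking performance and efficiency.
Link: https://arxiv.org/abs/2506.23783
Authors: Shiao Wang,Ju Huang,Qingchuan Ma,Jinfeng Gao,Chunyi Xu,Xiao Wang,Lan Chen,Bo Jiang
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Journal extension of Mamba-FETrack, which was published at Pattern Recognition and Computer Vision (PRCV) 2024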
Abstract:Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including short-term COESOT dataset and long-term datasets, i.e., FE108 and FELT V2, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released on this https URL
[CV-42] Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
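[Quick read]: This paper addresses the difficulty of pinpointing the limitations of open-vocabulary object detectors in safety-critical applications, and of discovering their failure modes systematically. The key to the solution is two automated pipelines that use Stable Diffusion to inpaint semantically diverse unusual objects that challenge the detectors. Multiple open-vocabulary detectors and a classical detector are evaluated on the synthetic data, revealing a strong dependence on object location rather than object semantics.
Link: https://arxiv.org/abs/2506.23751
Authors: Annika Mütze,Sadia Ilyas,Christian Dörpelkus,Matthias Rottmann
Affiliations: University of Wuppertal, Germany; Aptiv Services Deutschland GmbH, Wuppertal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: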
Abstract:Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. Because of this, it is unclear where their limitations lie, which is of major concern when they are used in safety-critical applications. Real-world data does not provide the control required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows systematic exploration of the boundaries of model competence and generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using stable diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets, namely LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors in terms of overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location, rather than on object semantics. This provides a systematic approach to challenge open-vocabulary models and gives valuable insights on how data could be acquired to effectively improve these models.
[CV-43] Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models
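[Quick read]: This paper addresses the difficulty of tracing the provenance of generative image models' outputs when they are used without authorization, in particular when generated images serve as training data for a new model and conventional watermarks no longer remain detectable. The key to the solution is a watermarking method with radioactivity, meaning that the watermark remains identifiable after the images have passed through training, which enables provenance tracking and helps prevent unauthorized use. The method is designed for image autoregressive models (IARs) and draws on techniques from large language models (LLMs).
Link: https://arxiv.org/abs/2506.23731
Authors: Michel Meintz,Jan Dubiński,Franziska Boenisch,Adam Dziedzic
Affiliations: CISPA Helmholtz Center for Information Security; Warsaw University of Technology; NASK-National Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: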
Abstract:Image generative models have become increasingly popular, but training them requires large datasets that are costly to collect and curate. To circumvent these costs, some parties may exploit existing models by using the generated images as training data for their own models. In general, watermarking is a valuable tool for detecting unauthorized use of generated images. However, when these images are used to train a new model, watermarking can only enable detection if the watermark persists through training and remains identifiable in the outputs of the newly trained model - a property known as radioactivity. We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs). We find that existing watermarking methods for DMs fail to retain radioactivity, as watermarks are either erased during encoding into the latent space or lost in the noising-denoising process (during the training in the latent space). Meanwhile, despite IARs having recently surpassed DMs in image generation quality and efficiency, no radioactive watermarking methods have been proposed for them. To overcome this limitation, we propose the first watermarking method tailored for IARs and with radioactivity in mind - drawing inspiration from techniques in large language models (LLMs), which share IARs’ autoregressive paradigm. Our extensive experimental evaluation highlights our method’s effectiveness in preserving radioactivity within IARs, enabling robust provenance tracking, and preventing unauthorized use of their generated images.
[CV-44] Proteus-ID: ID-Consistent and Motion-Coherent Video Customization
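[Quick read]: This paper targets two core problems in video identity customization: keeping identity consistent while aligning with the described appearance and actions, and generating natural, fluid motion without unrealistic stiffness. The key to its solution is the Proteus-ID framework, which has three core components. The Multimodal Identity Fusion (MIF) module uses a Q-Former to unify visual and textual cues into a joint identity representation that provides coherent guidance. The Time-Aware Identity Injection (TAII) mechanism dynamically modulates identity conditioning across denoising steps to improve detail reconstruction. The Adaptive Motion Learning (AML) strategy reweights the training loss using motion heatmaps derived from optical flow to improve motion realism.
Link: https://arxiv.org/abs/2506.23729
Authors: Guiyu Zhang,Chen Shi,Zijian Jiang,Xunzhi Xiang,Jingjing Qian,Shaoshuai Shi,Li Jiang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Nanjing University; Voyager Research, Didi Chuxing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Work in progress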
Abstract:Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at this https URL.
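The abstract describes AML only at a high level. The sketch below shows one plausible way to reweight a per-pixel loss with a motion heatmap derived from optical flow. The normalisation, the `alpha` parameter, and both function names are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def motion_weights(flow, alpha=1.0):
    """Turn an (H, W, 2) optical-flow field into (H, W) loss weights with mean 1."""
    mag = np.linalg.norm(flow, axis=-1)   # per-pixel motion magnitude
    heat = mag / (mag.max() + 1e-8)       # heatmap scaled to [0, 1]
    w = 1.0 + alpha * heat                # moving pixels get more weight
    return w / w.mean()                   # keep the overall loss scale unchanged

def motion_weighted_mse(pred, target, flow, alpha=1.0):
    """Mean squared error where high-motion pixels contribute more."""
    w = motion_weights(flow, alpha)
    return float((w * (pred - target) ** 2).mean())
```

With zero flow the weights are all 1 and the loss reduces to plain MSE, so a weighting of this kind only changes training where motion is actually present.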
[CV-45] When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation
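[Quick read]: This paper addresses the fact that existing Test-time Adaptation (TTA) methods focus on single-model adaptation and overlook the potential influence of cross-model knowledge. The key to its solution is the COCA framework, which performs cross-model co-learning through two core strategies. Co-adaptation adaptively integrates complementary knowledge from other models during TTA to reduce individual model biases. Self-adaptation strengthens each model's unique strengths via unsupervised learning, enabling diverse adaptation to the target domain.
Link: https://arxiv.org/abs/2506.23724
Authors: Chang’an Yi,Xiaohui Deng,Guohao Chen,Yan Zhou,Qinghua Lu,Shuaicheng Niu
Affiliations: Foshan University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures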
Abstract:Test-time Adaptation (TTA) adapts a given model to testing domain data with potential domain shifts through online unsupervised learning, yielding impressive performance. However, to date, existing TTA methods primarily focus on single-model adaptation. In this work, we investigate an intriguing question: how does cross-model knowledge influence the TTA process? Our findings reveal that, in TTA’s unsupervised online setting, each model can provide complementary, confident knowledge to the others, even when there are substantial differences in model size. For instance, a smaller model like MobileViT (10.6M parameters) can effectively guide a larger model like ViT-Base (86.6M parameters). In light of this, we propose COCA, a Cross-Model Co-Learning framework for TTA, which consists of two main strategies. 1) Co-adaptation adaptively integrates complementary knowledge from other models throughout the TTA process, reducing individual model biases. 2) Self-adaptation enhances each model’s unique strengths via unsupervised learning, enabling diverse adaptation to the target domain. Extensive experiments show that COCA, which can also serve as a plug-and-play module, significantly boosts existing SOTAs on models of various sizes (including ResNets, ViTs, and Mobile-ViTs) via cross-model co-learned TTA. For example, with Mobile-ViT’s guidance, COCA raises ViT-Base’s average adaptation accuracy on ImageNet-C from 51.7% to 64.5%. The code is publicly available at this https URL.
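[CV-46] Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation
[Quick read]: This paper addresses the problem that adding bits to multi-bit spiking neural networks (SNNs) inflates memory and computation demands out of proportion to the performance gain. The key to its solution is an adaptive bit allocation strategy: the temporal lengths and the bit widths of weights and spikes are parametrized and made learnable and controllable through gradients, giving fine-grained layer-wise allocation of memory and computation resources. To handle variable bit widths and temporal lengths, the paper also proposes a refined spiking neuron that supports different temporal lengths and gradient derivation for them, and a step-size renewal mechanism that mitigates the step-size mismatch caused by learnable bit widths.
Link: https://arxiv.org/abs/2506.23717
Authors: Xingting Yao,Qinghao Hu,Fei Zhou,Tielong Liu,Gang Li,Peisong Wang,Jian Cheng
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: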
Abstract:Multi-bit spiking neural networks (SNNs) have recently become a hot research topic, pursuing energy-efficient and highly accurate AI. However, with more bits involved, the associated memory and computation demands escalate to the point where the performance improvements become disproportionate. Based on the insight that different layers demonstrate different importance and that extra bits can be wasted or even interfere, this paper presents an adaptive bit allocation strategy for direct-trained SNNs, achieving fine-grained layer-wise allocation of memory and computation resources. Thus, SNN’s efficiency and accuracy can be improved. Specifically, we parametrize the temporal lengths and the bit widths of weights and spikes, and make them learnable and controllable through gradients. To address the challenges caused by changeable bit widths and temporal lengths, we propose the refined spiking neuron, which can handle different temporal lengths, enable the derivation of gradients for temporal lengths, and suit spike quantization better. In addition, we theoretically formulate the step-size mismatch problem of learnable bit widths, which may incur severe quantization errors to SNN, and accordingly propose the step-size renewal mechanism to alleviate this issue. Experiments on various datasets, including the static CIFAR and ImageNet and the dynamic CIFAR-DVS and DVS-GESTURE, demonstrate that our methods can reduce the overall memory and computation cost while achieving higher accuracy. Particularly, our SEWResNet-34 can achieve a 2.69% accuracy gain and 4.16× lower bit budgets over the advanced baseline work on ImageNet. This work will be fully open-sourced.
[CV-47] Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion
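[Quick read]: This paper addresses the reconstruction of real-world scenes from subjective perception. Specifically, it aims to overcome the dual limitations of language ambiguity and sketch abstraction, as well as the shortcomings of existing methods: user-specific input biases, the modality gap between planar sketches and 3D priors, and performance degradation that is sensitive to sketch quality. The key to its solution is a concept-sequential generation framework. Text-reward optimization establishes robust appearance priors, and sequence-aware disentangled generation processes concepts in sketching order, so the method adapts to the user's subjective expectations without training. Latent optimization bridges the modality gap between planar sketches and the 3D priors in diffusion, and a hierarchical reward-guided framework allows rough sketches to be used without artistic expertise.
Link: https://arxiv.org/abs/2506.23711
Authors: Haoyang Chen,Dongfang Sun,Caoyuan Ma,Shiqin Wang,Kewei Zhang,Zheng Wang,Zhixiang Wang
Affiliations: Wuhan University; Wuhan University of Science and Technology; National Institute of Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: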
Abstract:We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes dual limitations of language ambiguity and sketch abstraction by treating the user’s drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) huge modality gap between planar sketch and 3D priors in diffusion, and (3) sketch quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectation in a train-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.
[CV-48] Single Image Test-Time Adaptation via Multi-View Co-Training MICCAI2025
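[Quick read]: This paper addresses the difficulty of applying test-time adaptation in clinical settings, in particular adapting to a single test image in real time and per patient when large amounts of target-domain data are unavailable. Existing methods usually depend on large target-domain datasets and mostly target 2D images, leaving the volumetric information in medical imaging unused. The proposed solution is Patch-Based Multi-View Co-Training, whose key idea is to enforce feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain from only a single test image.
Link: https://arxiv.org/abs/2506.23705
Authors: Smriti Joshi,Richard Osuala,Lidia Garrucho,Kaisar Kushibar,Dimitri Kessler,Oliver Diaz,Karim Lekadir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025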
Abstract:Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target domain datasets, which are often impractical and unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper bound supervised benchmark while also outperforming all existing state-of-the-art methods, on average by a Dice Similarity Coefficient of 3.75%. We publicly share our accessible codebase, readily integrable with the popular nnUNet framework, at this https URL.
[CV-49] SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
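[Quick read]: This paper addresses human motion customization in diffusion-based video generation: acquiring a motion representation from a few video samples and transferring it to arbitrary subjects through precise textual conditioning. Existing methods rely mainly on semantic-level alignment and overlook the complex spatio-temporal patterns in video data, so either visual complexity is neglected or semantic confusion arises. The key to the solution is SynMotion, a model that jointly uses semantic guidance and visual adaptation. At the semantic level, a dual-embedding semantic comprehension mechanism disentangles subject and motion representations. At the visual level, parameter-efficient motion adapters improve motion fidelity and temporal coherence. A new embedding-specific training strategy optimizes the subject and motion embeddings, increasing motion specificity while preserving generalization across diverse subjects.
Link: https://arxiv.org/abs/2506.23690
Authors: Shuai Tan,Biao Gong,Yujie Wei,Shiwei Zhang,Zhuoxin Liu,Dandan Zheng,Jingdong Chen,Yan Wang,Hao Ouyang,Kecheng Zheng,Yujun Shen
Affiliations: Ant Group; Tongyi Lab; University of Wisconsin-Madison; University of North Carolina at Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL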
Abstract:Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: this https URL
[CV-50] A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement
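[Quick read]: This paper addresses the limited generalization of diffusion-based adversarial example generation, especially its restricted applicability to tasks beyond conventional image classification, such as Deepfake detection. The key to its solution is a unified framework that seamlessly integrates traditional transferability enhancement strategies into diffusion-based adversarial example generation via image editing, extending its applicability to a wider range of downstream tasks.
Link: https://arxiv.org/abs/2506.23676
Authors: Gaozheng Pei,Ke Ma,Dongpeng Zhang,Chengzhi Sun,Qianqian Xu,Qingming Huang
Affiliations: UCAS; ICT, CAS; SCST, UCAS; BDKM, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: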
Abstract:Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification tasks, such as in Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the “1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media” competition at ACM MM25, which validates the effectiveness of our approach.
[CV-51] Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation ICCV’25
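[Quick read]: This paper addresses the high computational cost of running Vision Transformers on resource-limited hardware, and the misjudged weight importance that results from pruning a model on an unseen data domain. The key to its solution is Pruning by Block Benefit (P3B), a method that allocates parameter resources globally according to the relative contribution of each block. It identifies low-impact components and reduces their parameter allocation while preserving critical ones, and it sets layer-wise keep ratios from global performance metrics, which ensures that late-converging blocks can be reactivated and yields efficient pruning.
Link: https://arxiv.org/abs/2506.23675
Authors: Patrick Glandorf,Bodo Rosenhahn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV’25 Workshops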
Abstract:Vision Transformers have set new benchmarks in several tasks, but these models come with the drawback of high computational costs, which makes them impractical for resource-limited hardware. Network pruning reduces the computational complexity by removing less important operations while maintaining performance. However, pruning a model on an unseen data domain leads to a misevaluation of weight significance, resulting in suboptimal resource assignment. In this work, we find that task-sensitive layers initially fail to improve the feature representation on downstream tasks, leading to performance loss for early pruning decisions. To address this problem, we introduce Pruning by Block Benefit (P3B), a pruning method that utilizes the relative contribution on block level to globally assign parameter resources. P3B identifies low-impact components to reduce parameter allocation while preserving critical ones. Classical pruning mask optimization struggles to reactivate zero-mask-elements. In contrast, P3B sets a layerwise keep ratio based on global performance metrics, ensuring the reactivation of late-converging blocks. We show in extensive experiments that P3B is a state-of-the-art pruning method with the most noticeable gains in transfer learning tasks. Notably, P3B is able to conserve high performance, even in high sparsity regimes of 70% parameter reduction while only losing 0.64% in accuracy.
[CV-52] Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration ICCV2025
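[Quick read]: This paper addresses the excessive computational cost caused by large-scale training datasets while maintaining model generalization. Existing data pruning methods usually rely on gradients or proxy models, which adds computational overhead. The key to its solution is Partial Forward Blocking (PFB), a new framework that assesses sample importance from features extracted by the shallow layers of the target model and dynamically prunes low-importance samples. This reduces the cost of deep-layer forward passes and back-propagation, and requires neither auxiliary backward computations nor proxy model training.
Link: https://arxiv.org/abs/2506.23674
Authors: Dongyue Wu,Zilin Guo,Jialong Zuo,Nong Sang,Changxin Gao
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025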
Abstract:The ever-growing size of training datasets enhances the generalization capability of modern machine learning models but also incurs exorbitant computational costs. Existing data pruning approaches aim to accelerate training by removing those less important samples. However, they often rely on gradients or proxy models, leading to prohibitive additional costs of gradient back-propagation and proxy model training. In this paper, we propose Partial Forward Blocking (PFB), a novel framework for lossless training acceleration. The efficiency of PFB stems from its unique adaptive pruning pipeline: sample importance is assessed based on features extracted from the shallow layers of the target model. Less important samples are then pruned, allowing only the retained ones to proceed with the subsequent forward pass and loss back-propagation. This mechanism significantly reduces the computational overhead of deep-layer forward passes and back-propagation for pruned samples, while also eliminating the need for auxiliary backward computations and proxy model training. Moreover, PFB introduces probability density as an indicator of sample importance. Combined with an adaptive distribution estimation module, our method dynamically prioritizes relatively rare samples, aligning with the constantly evolving training state. Extensive experiments demonstrate the significant superiority of PFB in performance and speed. On ImageNet, PFB achieves a 0.5% accuracy improvement and 33% training time reduction with 40% data pruned.
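As a rough sketch of the density-based selection idea, the code below fits a diagonal Gaussian to shallow-layer features and keeps the samples with the lowest estimated density, that is, the relatively rare ones. The Gaussian model, the `keep_ratio` parameter, and the function name are assumptions for illustration. The paper's adaptive distribution estimation module is not described in the abstract and is not reproduced here.

```python
import numpy as np

def select_by_rarity(shallow_feats, keep_ratio=0.6):
    """Return sorted indices of the rarest samples under a diagonal Gaussian fit.

    shallow_feats: (n_samples, d) features from the model's early layers.
    """
    mu = shallow_feats.mean(axis=0)
    var = shallow_feats.var(axis=0) + 1e-6   # avoid division by zero
    # Log-density of each sample under the fitted diagonal Gaussian
    log_density = -0.5 * (
        (shallow_feats - mu) ** 2 / var + np.log(2.0 * np.pi * var)
    ).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(shallow_feats))))
    keep = np.argsort(log_density)[:n_keep]  # lowest density first
    return np.sort(keep)
```

In a training loop, only the returned indices would continue through the deeper layers and the backward pass, so pruned samples cost just the shallow forward computation.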
[CV-53] On the Domain Robustness of Contrastive Vision-Language Models
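[Quick Read]: This paper tackles the performance drop of Vision-Language Models (VLMs) under domain-specific distribution shifts, in particular the limited domain adaptability seen in practice when training data and processes are opaque. The key to the solution is the Deepbench framework, which uses a Large Language Model (LLM) to generate realistic, context-aware image corruptions tailored to a specific deployment domain, so that a model's domain-specific robustness can be evaluated without labeled data.
Link: https://arxiv.org/abs/2506.23663
Authors: Mario Koddenbrock,Rudolf Hoffmann,David Brodmann,Erik Rodner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Deepbench is available at this https URL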
Abstract:In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.
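[CV-54] Towards Markerless Intraoperative Tracking of Deformable Spine Tissue
[Quick Read]: This paper addresses the challenge of intraoperative spine tissue tracking, in particular markerless tracking with consumer-grade RGB-D imaging to reduce operating time and complexity. The key to the solution is building the first real-world clinical RGB-D dataset for spine surgery and developing SpineAlign, a system that captures deformation between preoperative and intraoperative spine states, together with CorrespondNet, a multi-task framework that predicts key regions in intraoperative and preoperative scenes to support registration.
Link: https://arxiv.org/abs/2506.23657
Authors: Connor Daly,Elettra Marconi,Marco Riva,Jinendra Ekanayake,Daniel S. Elson,Ferdinando Rodriguez y Baena
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint of paper, submitted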
Abstract:Consumer-grade RGB-D imaging for intraoperative orthopedic tissue tracking is a promising method with high translational potential. Unlike bone-mounted tracking devices, markerless tracking can reduce operating time and complexity. However, its use has been limited to cadaveric studies. This paper introduces the first real-world clinical RGB-D dataset for spine surgery and develops SpineAlign, a system for capturing deformation between preoperative and intraoperative spine states. We also present an intraoperative segmentation network trained on this data and introduce CorrespondNet, a multi-task framework for predicting key regions for registration in both intraoperative and preoperative scenes.
[CV-55] MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis MICCAI2025
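[Quick Read]: This paper addresses the operator dependence, limited accuracy, and poor fit with clinical workflow of conventional color Doppler echocardiography in diagnosing Mitral Regurgitation (MR). The key to the solution is an automated MR diagnosis model (MReg) trained on 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It formulates diagnosis as a regression task to capture the continuity and ordinal relationships between categories, introduces a feature selection and amplification mechanism that imitates the sonographer's diagnostic logic, and, drawing on the Mixture-of-Experts idea, adds a feature summary module to improve grading accuracy and interpretability.
Link: https://arxiv.org/abs/2506.23648
Authors: Zhe Liu,Yuhao Huang,Lian Liu,Chengrui Zhang,Haotian Lin,Tong Han,Zhiyuan Zhu,Yanlin Chen,Yuerui Chen,Dong Ni,Zhongshan Gou,Xin Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, accepted by MICCAI 2025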
Abstract:Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer’s diagnostic logic for accurate MR grading. Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: this https URL.
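To illustrate why a regression formulation respects the ordering of severity grades, here is a minimal sketch. It assumes the three grades are encoded as indices 0, 1, 2 and that a squared-error loss is used; the abstract specifies neither, and the helper names are hypothetical.

```python
import math

def score_to_grade(score, num_grades=3):
    """Round a continuous severity score to the nearest grade index in [0, num_grades - 1]."""
    nearest = math.floor(score + 0.5)
    return min(num_grades - 1, max(0, nearest))

def squared_error(pred_score, true_grade):
    """Regression loss: grows with the distance between grades, unlike a 0/1 classification error."""
    return (pred_score - true_grade) ** 2

grades = [score_to_grade(s) for s in (-0.3, 0.4, 1.4, 2.7, 3.9)]
near_miss = squared_error(1.0, 2)  # prediction one grade away from the label
far_miss = squared_error(0.0, 2)   # prediction two grades away from the label
```

A plain classifier would penalize both misses equally, whereas here the prediction two grades away costs four times as much.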
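[CV-56] VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation
[Quick Read]: This paper tackles the insufficient quality and diversity of generated medical images caused by the lack of detailed attribute information. The key to the solution is Visual Attribute Prompts (VAP)-Diffusion, a framework that uses pre-trained Multi-modal Large Language Models (MLLMs) to generate high-quality attribute descriptions, designs Chain-of-Thoughts prompting strategies to avoid hallucination, and introduces a prototype condition mechanism to make the generator robust to unseen attribute combinations.
Link: https://arxiv.org/abs/2506.23641
Authors: Peng Huang,Junhu Fu,Bowen Guo,Zeju Li,Yuanyuan Wang,Yi Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: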
Abstract:As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combinations of descriptions at test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.
[CV-57] Unified Multimodal Understanding via Byte-Pair Visual Encoding
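[Quick Read]: This paper tackles the fundamental challenge of effectively aligning different modalities in Multimodal Large Language Models (MLLMs). The key to the solution is unifying multimodal understanding through byte-pair encoding, embedding structural information directly into visual tokens rather than relying on modality-specific encoders. It also introduces a priority-guided encoding scheme that combines frequency and spatial consistency, and a multi-stage training procedure based on curriculum-driven data composition, which improves the model's ability to capture cross-modal relationships and process visual information.
Link: https://arxiv.org/abs/2506.23639
Authors: Wanpeng Zhang,Yicheng Feng,Hao Luo,Yijiang Li,Zihao Yue,Sipeng Zheng,Zongqing Lu
Affiliations: Peking University; UC San Diego; Renmin University of China; BeingBeyond
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: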
Abstract:Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.
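The idea of applying byte-pair encoding to visual tokens can be sketched as one merge step over rows of quantized visual token ids. This is a simplified illustration under stated assumptions: it ranks pairs by frequency only and looks only at horizontal neighbors, whereas the paper's priority also uses spatial consistency, which the abstract does not define. The token ids and function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(rows):
    """Return the most common horizontally adjacent token pair and its count."""
    counts = Counter()
    for row in rows:
        counts.update(zip(row, row[1:]))
    return counts.most_common(1)[0]

def merge_pair(row, pair, new_id):
    """Replace each non-overlapping occurrence of pair in row with new_id."""
    merged, i = [], 0
    while i < len(row):
        if i + 1 < len(row) and (row[i], row[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(row[i])
            i += 1
    return merged

# Two rows of toy visual token ids; id 100 is the newly created merged token.
rows = [[7, 3, 7, 3, 5], [7, 3, 5, 5, 7]]
pair, freq = most_frequent_pair(rows)
new_rows = [merge_pair(row, pair, new_id=100) for row in rows]
```

Repeating the step grows a vocabulary of composite visual tokens, in the same way text BPE grows subword units.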
[CV-58] Blending Concepts with Text-to-Image Diffusion Models
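[Quick Read]: This paper asks how diffusion models can blend distinct concepts, from concrete objects to abstract ideas, into coherent new visual entities under a zero-shot framework. The key to the solution is exploring methods that exploit different aspects of the diffusion pipeline, such as prompt scheduling, embedding interpolation, or layer-wise conditioning, to blend concepts creatively without further training or fine-tuning. The experiments show that modern diffusion models have strong compositional potential, but their results are sensitive to input details such as prompt ordering, conceptual distance, and random seed.
Link: https://arxiv.org/abs/2506.23630
Authors: Lorenzo Olearo,Giorgio Longari,Alessandro Raganato,Rafael Peñaloza,Simone Melzi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Currently under review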
Abstract:Diffusion models have dramatically advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease. In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework. Specifically, concept blending merges the key attributes of multiple concepts (expressed as textual prompts) into a single, novel image that captures the essence of each concept. We investigate four blending methods, each exploiting different aspects of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning). Through systematic experimentation across diverse concept categories, such as merging concrete concepts, synthesizing compound words, transferring artistic styles, and blending architectural landmarks, we show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning. Our extensive user study, involving 100 participants, reveals that no single approach dominates in all scenarios: each blending technique excels under certain conditions, with factors like prompt ordering, conceptual distance, and random seed affecting the outcome. These findings highlight the remarkable compositional potential of diffusion models while exposing their sensitivity to seemingly minor input variations.
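Two of the blending strategies named in the abstract, embedding interpolation and prompt scheduling, can be sketched in a few lines. This is an illustration only: the vectors are toy stand-ins for text-encoder outputs, the function names are hypothetical, and the paper's exact formulations may differ.

```python
def lerp_embeddings(emb_a, emb_b, alpha):
    """Linear interpolation: alpha=0 returns emb_a, alpha=1 returns emb_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]

def prompt_schedule(num_steps, switch_step):
    """Index of the prompt (0 or 1) that conditions each denoising step."""
    return [0 if step < switch_step else 1 for step in range(num_steps)]

# Toy embeddings for two concepts.
concept_a = [1.0, 0.0, 2.0]
concept_b = [0.0, 1.0, 4.0]

blend = lerp_embeddings(concept_a, concept_b, alpha=0.5)
schedule = prompt_schedule(num_steps=10, switch_step=3)
```

With interpolation, a single blended embedding conditions every step. With scheduling, the first prompt shapes the early steps and the second takes over afterwards, which is one reason prompt ordering can change the outcome.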
[CV-59] Brain Tumor Detection through Thermal Imaging and MobileNET
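[Quick Read]: This paper tackles the shortcomings of traditional brain tumor detection methods in cost, reliance on specialized medical expertise, and efficiency. The key to the solution is using the MobileNET model for efficient tumor detection: it reduces computing resource use and run time, and is combined with image processing techniques to improve decision accuracy, making detection more accessible and efficient.
Link: https://arxiv.org/abs/2506.23627
Authors: Roham Maiti,Debasmita Bhoumik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: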
Abstract:The brain plays a crucial role in regulating body functions and cognitive processes, with brain tumors posing significant risks to human health. Precise and prompt detection is a key factor in proper treatment and better patient outcomes. Traditional methods for detecting brain tumors, which include biopsies, MRI, and CT scans, often face challenges due to their high costs and the need for specialized medical expertise. Recent developments in machine learning (ML) and deep learning (DL) have exhibited strong capabilities in automating the identification and categorization of brain tumors from medical images, especially MRI scans. However, these classical ML models have limitations, such as high computational demands, the need for large datasets, and long training times, which hinder their accessibility and efficiency. Our research uses the MobileNET model for efficient detection of these tumors. The novelty of this project lies in building an accurate tumor detection model that uses fewer computing resources and runs in less time, followed by efficient decision making through the use of image processing techniques for accurate results. The suggested method attained an average accuracy of 98.5%.
[CV-60] Revisiting Audio-Visual Segmentation with Vision-Centric Transformer CVPR2025
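[Quick Read]: This paper addresses two problems in Audio-Visual Segmentation (AVS): perception ambiguity caused by the mixed nature of audio signals, and the degraded dense-prediction performance of conventional audio-centric Transformer architectures due to visual detail loss. The key to the solution is a new Vision-Centric Transformer (VCT) framework, in which vision-derived queries iteratively fetch the corresponding audio and visual information, so that different sounding objects in mixed audio are distinguished more accurately and their contours are delineated precisely. In addition, the Prototype Prompted Query Generation (PPQG) module within the VCT framework makes the queries more semantically aware and visually rich, improving the aggregation of audio-visual information.
Link: https://arxiv.org/abs/2506.23623
Authors: Shaofei Huang,Rui Ling,Tianrui Hui,Hongyu Li,Xu Zhou,Shifeng Zhang,Si Liu,Richang Hong,Meng Wang
Affiliations: Hefei University of Technology; Chinese Academy of Sciences; Beihang University; Sangfor Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025; Code: this https URL Models: this https URL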
Abstract:Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at this https URL.
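[CV-61] TurboVSR: Fantastic Video Upscalers and Where to Find Them ICCV
[Quick Read]: This paper addresses the low computational efficiency of diffusion-based video super-resolution (VSR); existing methods take a long time even on short videos. The solution rests on three core designs. First, an autoencoder with a high compression ratio of 32×32×8 reduces the number of tokens. Second, factorized conditioning lowers training complexity: the initial frame is super-resolved first, and the remaining frames are then super-resolved conditioned on the high-resolution initial frame and the low-resolution subsequent frames. Third, the pre-trained diffusion model is converted into a shortcut model to cut the number of sampling steps and accelerate inference.
Link: https://arxiv.org/abs/2506.23618
Authors: Zhongdao Wang,Guodongfang Zhao,Jingjing Ren,Bailan Feng,Shifeng Zhang,Wenbo Li
Affiliations: Huawei Noah’s Ark Lab; HKUST (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV, 2025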
Abstract:Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32 × 32 × 8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes SR beyond 1080p possible; results on 4K (3648 × 2048) image SR show surprisingly fine details.
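The effect of the 32 × 32 × 8 compression ratio on sequence length can be worked out with simple arithmetic. The sketch below assumes one token per latent position, ceiling padding at the borders, and a 24 fps clip; the abstract states none of these. The 8 × 8 × 4 comparison setting is a hypothetical baseline chosen only for illustration.

```python
import math

def latent_grid(frames, height, width, spatial=32, temporal=8):
    """Latent (t, h, w) grid after spatial x spatial x temporal compression."""
    return (
        math.ceil(frames / temporal),
        math.ceil(height / spatial),
        math.ceil(width / spatial),
    )

def token_count(frames, height, width, spatial=32, temporal=8):
    """Number of latent positions, assuming one token per position."""
    t, h, w = latent_grid(frames, height, width, spatial, temporal)
    return t * h * w

# Assumed clip: 2 seconds at 24 fps, 1080p.
frames, height, width = 48, 1080, 1920

turbo_tokens = token_count(frames, height, width)                             # 32 x 32 x 8
baseline_tokens = token_count(frames, height, width, spatial=8, temporal=4)   # assumed 8 x 8 x 4
reduction = baseline_tokens / turbo_tokens
```

Under these assumptions the higher compression ratio shortens the token sequence by a factor of roughly 32. Because self-attention cost grows quadratically with sequence length, the saving in attention compute would be larger still.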
[CV-62] AttentionGS: Towards Initialization-Free 3D Gaussian Splatting via Structural Attention
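[Quick Read]: This paper addresses the heavy dependence of 3D Gaussian Splatting (3DGS) on high-quality point clouds: in texture-deficient or constrained-view scenes, Structure-from-Motion (SfM) cannot produce reliable point clouds, and 3DGS reconstruction degrades severely. The key to the solution is AttentionGS, a framework that uses structural attention to reconstruct 3D scenes directly from random initialization, removing the dependence on an initial point cloud. Early in training, geometric attention quickly recovers the global scene structure; later, texture attention refines details and improves rendering quality. Opacity-weighted gradients guide Gaussian densification, which improves surface reconstruction.
Link: https://arxiv.org/abs/2506.23611
Authors: Ziao Liu,Zhenjia Li,Yifeng Shi,Xiangang Li
Affiliations: Wuhan University; BEKE.inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: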
Abstract:3D Gaussian Splatting (3DGS) is a powerful alternative to Neural Radiance Fields (NeRF), excelling in complex scene reconstruction and efficient rendering. However, it relies on high-quality point clouds from Structure-from-Motion (SfM), limiting its applicability. SfM also fails in texture-deficient or constrained-view scenarios, causing severe degradation in 3DGS reconstruction. To address this limitation, we propose AttentionGS, a novel framework that eliminates the dependency on high-quality initial point clouds by leveraging structural attention for direct 3D reconstruction from random initialization. In the early training stage, we introduce geometric attention to rapidly recover the global scene structure. As training progresses, we incorporate texture attention to refine fine-grained details and enhance rendering quality. Furthermore, we employ opacity-weighted gradients to guide Gaussian densification, leading to improved surface reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that AttentionGS significantly outperforms state-of-the-art methods, particularly in scenarios where point cloud initialization is unreliable. Our approach paves the way for more robust and flexible 3D Gaussian Splatting in real-world applications.
[CV-63] PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum
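[Quick Read]: This paper addresses a limitation of existing open-vocabulary 3D semantic segmentation methods: multi-view images serve only as intermediaries for transferring open-vocabulary information, and their rich semantic content and cross-view correspondences go unexploited, which limits model effectiveness. The key to the solution is a Partial-to-Global curriculum learning framework whose core innovation is a two-stage training strategy. The first stage pre-trains on partial scenes that offer dense semantic information but relatively simple geometry, using a multi-modal large language model and a 2D segmentation foundation model to generate open-vocabulary labels as rich, aligned supervision. The second stage fine-tunes on complete scene-level point clouds, aggregating the partial vocabularies of each scene and generating pseudo labels, which bridges the semantic gap between dense partial observations and large-scale 3D environments.
Link: https://arxiv.org/abs/2506.23607
Authors: Shiqi Zhang,Sha Zhang,Jiajun Deng,Yedong Shen,Mingxiao MA,Yanyong Zhang
Affiliations: University of Science and Technology of China; The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: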
Abstract:Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.
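The "pixel-wise depth projection" that turns an RGB-D view into a partial point cloud can be sketched with the standard pinhole camera model. This is a minimal illustration: the intrinsics, the toy depth map, and the per-pixel labels are made up, and the abstract does not describe the paper's actual projection or labeling code.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (row-major list of lists) to camera-frame 3D points.

    Returns (row, col, (x, y, z)) tuples; pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # invalid or missing depth reading
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((v, u, (x, y, z)))
    return points

def label_points(points, label_map):
    """Attach the 2D per-pixel label of each source pixel to its 3D point."""
    return [(xyz, label_map[v][u]) for v, u, xyz in points]

# Toy 2 x 2 view: one pixel has no valid depth.
depth = [[2.0, 0.0], [2.0, 4.0]]
labels_2d = [["chair", "wall"], ["chair", "table"]]

points = backproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
labeled = label_points(points, labels_2d)
```

Because every valid pixel yields one labeled 3D point, the partial point cloud inherits the dense 2D supervision directly, in line with the abstract's description of the first training stage.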
[CV-64] SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion
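[Quick Read]: This paper addresses the practical limitations of existing LiDAR point cloud generation methods, in particular their weak support for semantically guided synthesis. Prior methods focus on unconditional generation and do not use semantic information to improve generation quality and practical value. The proposed solution is SG-LDM (Semantic-Guided LiDAR Diffusion Model). Its key is latent alignment, which enables robust semantic-to-LiDAR synthesis by operating directly in the native LiDAR space with explicit semantic conditioning, producing high-quality LiDAR point clouds. In addition, the first diffusion-based LiDAR translation framework, built on SG-LDM, further improves data augmentation for downstream perception tasks.
Link: https://arxiv.org/abs/2506.23606
Authors: Zhengkang Xiang,Zizhao Li,Amir Khodabandeh,Kourosh Khoshelham
Affiliations: The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: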
Abstract:Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.
[CV-65] AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval ICDAR2025
[Quick Read]: This paper tackles lecture slide element detection and retrieval, which are key tasks in slide understanding. Training effective models usually depends on large amounts of manually annotated data, and annotating lecture slides at scale for supervised training is time-consuming and requires domain expertise. To address this, the paper proposes SynLecSlideGen, a synthetic lecture slide generation pipeline guided by a Large Language Model (LLM). Its key is generating high-quality, coherent, and realistic slides, which reduces the dependence on real annotated data.
Link: https://arxiv.org/abs/2506.23605
Authors: Suyash Maniyar,Vishvesh Trivedi,Ajoy Mondal,Anand Mishra,C.V. Jawahar
Affiliations: Indian Institute of Technology, Jodhpur, India; Sardar Vallabhbhai National Institute of Technology, Surat, India; CVIT, International Institute of Information Technology, Hyderabad, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 40 pages including supplementary, accepted at ICDAR 2025
Abstract:Lecture slide element detection and retrieval are key problems in slide understanding. Training effective models for these tasks often depends on extensive manual annotation. However, annotating large volumes of lecture slides for supervised training is labor intensive and requires domain expertise. To address this, we propose a large language model (LLM)-guided synthetic lecture slide generation pipeline, SynLecSlideGen, which produces high-quality, coherent and realistic slides. We also create an evaluation benchmark, namely RealSlide by manually annotating 1,050 real lecture slides. To assess the utility of our synthetic slides, we perform few-shot transfer learning on real data using models pre-trained on them. Experimental results show that few-shot transfer learning with pretraining on synthetic slides significantly improves performance compared to training only on real data. This demonstrates that synthetic data can effectively compensate for limited labeled lecture slides. The code and resources of our work are publicly available on our project website: this https URL.
[CV-66] CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models
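[Quick Read]: This paper tackles object hallucination in Large Vision-Language Models (LVLMs), where the generated content is inconsistent with the visual input. The key to the solution is a training-free, plug-and-play hallucination mitigation method, Caption-sensitive Attention Intervention (CAI). It exploits the stronger attention activation toward visual information that the model shows when answering caption queries to strengthen the visual perception of LVLMs, reducing hallucination at minimal additional inference cost.
Link: https://arxiv.org/abs/2506.23590
Authors: Qiming Li,Zekai Ye,Xiaocheng Feng,Weihong Zhong,Libo Qin,Ruihan Chen,Baohang Li,Kui Jiang,Yaowei Wang,Ting Liu,Bing Qin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: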
Abstract:Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs’ attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs’ visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks, demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigating performance only with minimal additional inference cost.
[CV-67] PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection ICCV2025
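[Quick Read]: This paper tackles the insufficient robustness of object detectors against physically realizable adversarial attacks (such as adversarial patches and adversarial textures), which pose realistic and urgent threats to security-sensitive applications. The key to the solution is a unified adversarial training method, Patch-Based Composite Adversarial Training (PBCAT), which optimizes the model by combining small-area gradient-guided adversarial patches with imperceptible global adversarial perturbations covering the entire image, improving the defense against a range of physically realizable attacks.
Link: https://arxiv.org/abs/2506.23581
Authors: Xiao Li,Yiming Zhu,Yifan Huang,Wei Zhang,Yingzhe He,Jie Shi,Xiaolin Hu
Affiliations: Tsinghua University; University of Science and Technology Beijing; Huawei Technologies; Chinese Institute for Brain Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025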
Abstract:Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in the ℓ∞ attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. Notably, it improved the detection accuracy by 29.7% over previous defense methods under one recent adversarial texture attack.
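One way to picture the composite perturbation described in the abstract is a single sign-gradient step with two budgets: a small one over the whole image and a large one inside a gradient-selected patch. The sketch below is a simplification under stated assumptions. It takes the loss gradient as given, uses a single-channel image in [0, 1], and places the patch where the absolute gradient sum is largest. The budgets, the selection rule, and all names are hypothetical, not the paper's implementation.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def best_patch_corner(grad, patch):
    """Top-left corner of the patch x patch window with the largest absolute gradient sum."""
    h, w = len(grad), len(grad[0])
    best, corner = -1.0, (0, 0)
    for r in range(h - patch + 1):
        for c in range(w - patch + 1):
            score = sum(
                abs(grad[r + i][c + j]) for i in range(patch) for j in range(patch)
            )
            if score > best:
                best, corner = score, (r, c)
    return corner

def composite_step(image, grad, patch, eps_global, eps_patch):
    """Apply a small perturbation everywhere and a large one inside the selected patch."""
    r0, c0 = best_patch_corner(grad, patch)
    out = []
    for r, row in enumerate(image):
        new_row = []
        for c, pixel in enumerate(row):
            in_patch = r0 <= r < r0 + patch and c0 <= c < c0 + patch
            eps = eps_patch if in_patch else eps_global
            # Clamp to the valid pixel range after the step.
            new_row.append(min(1.0, max(0.0, pixel + eps * sign(grad[r][c]))))
        out.append(new_row)
    return out, (r0, c0)

image = [[0.5] * 4 for _ in range(4)]
grad = [
    [0.1, -0.1, 0.1, 0.1],
    [0.1, 0.1, -0.1, 0.1],
    [0.1, 0.1, 2.0, -3.0],
    [0.1, 0.1, 1.5, 2.5],
]
adv, corner = composite_step(image, grad, patch=2, eps_global=0.01, eps_patch=0.5)
```

In adversarial training, examples such as `adv` would replace or accompany the clean inputs when the detector's weights are updated.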
[CV-68] Dataset Distillation via Vision-Language Category Prototype ICCV2025
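[Quick Read]: This paper addresses the fact that traditional Dataset Distillation (DD) methods rely mainly on image information and overlook semantic information, which limits generalization on complex tasks and can lead to illogical outputs or missing key objects. The key to the solution is introducing text prototypes: using a vision-language approach, language information is extracted from descriptive text generated by an open-source large language model and used together with image prototypes to synthesize data, improving distillation performance. The method also applies to datasets without pre-existing text descriptions, generates logically coherent images that contain the target objects, and achieves state-of-the-art validation performance.
Link: https://arxiv.org/abs/2506.23580
Authors: Yawen Zou,Guang Li,Duo Su,Zi Wang,Jun Yu,Chao Zhang
Affiliations: University of Toyama; Hokkaido University; Tsinghua University; Niigata University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCV2025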
Abstract:Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model’s generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at this https URL
[CV-69] StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection
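[Quick Read]: This paper addresses the insufficient alignment of text and image features in zero-shot industrial anomaly detection. Existing methods mainly rely on category-specific prompts during pretraining, which causes overfitting to the training categories and limits generalization. The key to the solution is creating stacked prompts by stacking multiple category names, which forms the basis of the StackCLIP model. It has two core components. The Clustering-Driven Stacked Prompts (CSP) module builds generic prompts by stacking semantically similar categories and uses multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers for each stack cluster and adaptively integrates them according to the attributes of the test categories, improving training speed, stability, and convergence.
Link: https://arxiv.org/abs/2506.23577
Authors: Yanning Hou,Yanran Ruan,Junfa Li,Shanshan Wang,Jianfeng Qiu,Ke Xu
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: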
Abstract:Enhancing the alignment between text and image features in the CLIP model is a critical challenge in zero-shot industrial anomaly detection tasks. Recent studies predominantly utilize specific category prompts during pretraining, which can cause overfitting to the training categories and limit model generalization. To address this, we propose a method that transforms category names through multicategory name stacking to create stacked prompts, forming the basis of our StackCLIP model. Our approach introduces two key components. The Clustering-Driven Stacked Prompts (CSP) module constructs generic prompts by stacking semantically analogous categories, while utilizing multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers tailored for each stack cluster and adaptively integrates them based on the attributes of test categories. These modules work together to deliver superior training speed, stability, and convergence, significantly boosting anomaly segmentation performance. Additionally, our stacked prompt framework offers robust generalization across classification tasks. To further improve performance, we introduce the Regulating Prompt Learning (RPL) module, which leverages the generalization power of stacked prompts to refine prompt learning, elevating results in anomaly detection classification tasks. Extensive testing on seven industrial anomaly detection datasets demonstrates that our method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks.
[CV-70] Event-based Tiny Object Detection: A Benchmark Dataset and Baseline
[Quick read]: This paper addresses small object detection (SOD) in anti-UAV tasks, which is hard because UAVs are small and backgrounds are complex. Traditional frame-based cameras struggle to detect small objects in complex environments because of low frame rates, limited dynamic range, and data redundancy. Event cameras offer microsecond temporal resolution and high dynamic range, but existing event-based object detection datasets fall short in scale, target size, and background diversity, so they cannot serve as SOD benchmarks. To address this, the authors introduce EV-UAV, a large-scale, highly diverse Event-based Small object detection (EVSOD) dataset, and design the Event based Sparse Segmentation Network (EV-SpSegNet), which operates in event point cloud space, together with a Spatiotemporal Correlation (STC) loss. The key is to exploit the continuity of small-target motion in spatiotemporal event point clouds to improve the retention of target events.
Link: https://arxiv.org/abs/2506.23575
Authors: Nuo Chen, Chao Xiao, Yimian Dai, Shiman He, Miao Li, Wei An
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Small object detection (SOD) in anti-UAV tasks is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small object detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 × 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose Event based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at this https URL.
[CV-71] Metadata Wavelet and Time Aware Diffusion Models for Satellite Image Super Resolution ICLR2025
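[Quick read]: This paper addresses the limited availability of high-resolution satellite remote sensing imagery, which stems from the spatial and temporal resolution limits of satellite sensors and the high cost of frequent observations, and which hampers applications such as environmental monitoring, disaster response, and agricultural management that need fine-grained, high-resolution data. The key to its solution is MWT-Diff, a framework that combines latent diffusion models with wavelet transforms. Its core component is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder) that generates embeddings capturing metadata attributes, multi-scale frequency information, and temporal relationships. These embeddings guide a hierarchical diffusion process that progressively reconstructs high-resolution images from low-resolution inputs while preserving critical spatial characteristics.
Link: https://arxiv.org/abs/2506.23566
Authors: Luigi Sigillo, Renato Giamba, Danilo Comminiello
Affiliations: Sapienza University of Rome
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICLR 2025 Workshop on Machine Learning for Remote Sensing (ML4RS)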
Abstract:The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.
[CV-72] OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving ICCV2025
[Quick read]: This paper addresses the limited detection performance of current multi-view 3D object detection methods, which transfer 2D features into 3D space in a data-driven, implicit manner. The key to its solution is Object-centric Radiance Fields (OcRF), which enhance 3D voxel features through an auxiliary task of rendering foreground objects. The opacity produced during rendering is then used, through Height-aware Opacity-based Attention (HOA), to enhance the 2D bird's-eye-view (BEV) features, which suppresses background noise and improves detection performance.
Link: https://arxiv.org/abs/2506.23565
Authors: Mingqian Ji, Jian Yang, Shanshan Zhang
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025
Abstract:Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or a 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's 3D geometry estimation ability. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, the by-product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2% mAP and 64.8% NDS on the nuScenes test benchmark. Code will be available at this https URL.
[CV-73] LH2Face: Loss function for Hard High-quality Face
[Quick read]: This paper addresses the weak performance on hard samples of current face recognition (FR) algorithms based on cosine similarity with softmax classification. The key to its solution is a new loss function, LH2Face, which optimizes the feature representation space through a similarity measure based on the von Mises-Fisher (vMF) distribution, an adaptive-margin multi-class classification method, and proxy-based loss functions, combined with a renderer in which face reconstruction and FR optimize each other, thereby improving recognition accuracy on hard high-quality samples.
Link: https://arxiv.org/abs/2506.23555
Authors: Fan Xie, Pan Cao
Affiliations: JiBot
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In current practical face authentication systems, most face recognition (FR) algorithms are based on cosine similarity with softmax classification. Despite its reliable classification performance, this method struggles with hard samples. A popular strategy to improve FR performance is incorporating angular or cosine margins. However, it does not take face quality or recognition hardness into account, simply increasing the margin value and thus causing an overly uniform training strategy. To address this problem, a novel loss function is proposed, named Loss function for Hard High-quality Face (LH2Face). Firstly, a similarity measure based on the von Mises-Fisher (vMF) distribution is stated, specifically focusing on the logarithm of the Probability Density Function (PDF), which represents the distance between a probability distribution and a vector. Then, an adaptive margin-based multi-classification method using softmax, called the Uncertainty-Aware Margin Function, is implemented in the article. Furthermore, proxy-based loss functions are used to apply extra constraints between the proxy and sample to optimize their representation space distribution. Finally, a renderer is constructed that optimizes FR through face reconstruction and vice versa. Our LH2Face is superior to similar schemes on hard high-quality face datasets, achieving 49.39% accuracy on the IJB-B dataset, which surpasses the second-place method by 2.37%.
[CV-74] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching
[Quick read]: This paper addresses the intrinsic link between facial motion and speech, which generative modeling often overlooks because talking head synthesis and text-to-speech (TTS) are traditionally treated as separate tasks. The key to its solution is JAM-Flow, a unified framework that can both synthesize and condition on facial motion and speech. It uses flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture that integrates specialized Motion-DiT and Audio-DiT modules, coupled through selective joint attention layers, with key architectural choices such as temporally aligned positional embeddings and localized joint attention masking to enable effective cross-modal interaction while preserving modality-specific strengths.
Link: https://arxiv.org/abs/2506.23552
Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: project page: this https URL Under review. Preprint published on arXiv
Abstract:The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: this https URL
[CV-75] Oneta: Multi-Style Image Enhancement Using Eigentransformation Functions
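[Quick read]: This paper addresses the new task of multi-style image enhancement, that is, producing several different enhancement styles within one unified framework. The key to its solution is the Oneta algorithm, which applies two point operators in sequence, intensity enhancement with a transformation function (TF) and color correction with a color correction matrix (CCM), forming a simple but effective two-step enhancement model. It also introduces the eigentransformation function (eigenTF) to represent the TF compactly, predicts the eigenTF and CCM parameters with Y-Net and C-Net respectively, and uses K learnable style tokens to support multiple styles, selecting the corresponding style token at test time.
Link: https://arxiv.org/abs/2506.23547
Authors: Jiwon Kim, Soohyun Hwang, Dong-O Kim, Changsu Han, Min Kyu Park, Chang-Su Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: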
Abstract:The first algorithm, called Oneta, for a novel task of multi-style image enhancement is proposed in this work. Oneta uses two point operators sequentially: intensity enhancement with a transformation function (TF) and color correction with a color correction matrix (CCM). This two-step enhancement model, though simple, achieves a high performance upper bound. Also, we introduce eigentransformation function (eigenTF) to represent TF compactly. The Oneta network comprises Y-Net and C-Net to predict eigenTF and CCM parameters, respectively. To support K styles, Oneta employs K learnable tokens. During training, each style token is learned using image pairs from the corresponding dataset. In testing, Oneta selects one of the K style tokens to enhance an image accordingly. Extensive experiments show that the single Oneta network can effectively undertake six enhancement tasks – retouching, image signal processing, low-light image enhancement, dehazing, underwater image enhancement, and white balancing – across 30 datasets.
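The two-step model described in the abstract (a transformation function followed by a color correction matrix) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the gamma-shaped TF, the example CCM values, and the choice to apply the TF to each RGB channel are assumptions made here; in Oneta the TF (through eigenTF coefficients) and the CCM are predicted by networks.

```python
def make_gamma_tf(gamma, levels=256):
    """Hypothetical transformation function (TF): a 1-D lookup table on [0, 1]."""
    return [(i / (levels - 1)) ** gamma for i in range(levels)]


def apply_tf(value, tf):
    """Point operator 1: map an intensity in [0, 1] through the lookup table."""
    index = round(min(max(value, 0.0), 1.0) * (len(tf) - 1))
    return tf[index]


def apply_ccm(rgb, ccm):
    """Point operator 2: multiply an RGB triple by a 3x3 color correction matrix."""
    corrected = [sum(ccm[r][c] * rgb[c] for c in range(3)) for r in range(3)]
    return [min(max(v, 0.0), 1.0) for v in corrected]


def enhance_pixel(rgb, tf, ccm):
    """Apply the TF to each channel (an assumption), then the CCM."""
    return apply_ccm([apply_tf(v, tf) for v in rgb], ccm)


# Assumed example parameters: a brightening curve and a mild warm-tone matrix.
tf = make_gamma_tf(gamma=0.5)
ccm = [[1.1, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.9]]
dark_pixel = [0.25, 0.25, 0.25]
enhanced = enhance_pixel(dark_pixel, tf, ccm)
```

Both steps are point operators: each output pixel depends only on the same input pixel, which is what keeps the model simple.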
[CV-76] Pyramidal Patchification Flow for Visual Generation
[Quick read]: This paper addresses the trade-off between computation cost and generation quality that a fixed patch size imposes on diffusion models. The key to its solution is Pyramidal Patchification Flow (PPFlow), which uses different patch sizes at different noise timesteps: large patches at high-noise timesteps to reduce computation and small patches at low-noise timesteps to improve generation quality. A linear projection is learned for each patch size, and Unpatchify is modified accordingly. Unlike Pyramidal Flow, PPFlow operates on full latent representations rather than pyramid representations and uses the normal denoising process without the renoising trick.
Link: https://arxiv.org/abs/2506.23543
Authors: Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures
Abstract:Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training regimes. Training from scratch achieves a 1.6× (2.0×) inference speedup over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at this https URL.
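The effect of a noise-dependent patch size on token count can be illustrated with a short sketch. The noise-level boundary, the patch sizes, and the 32×32 latent grid are hypothetical values chosen for illustration, not settings taken from the paper; the point is only that a larger patch yields fewer tokens for the DiT blocks.

```python
def patch_size_for_noise(noise_level, boundaries=(0.5,), patch_sizes=(2, 4)):
    """Pick a patch size from a noise level in [0, 1] (1 = pure noise).

    `patch_sizes` is ordered from low-noise to high-noise stages, so larger
    patches are used when the input is noisier. All values are hypothetical.
    """
    stage = sum(noise_level >= b for b in boundaries)
    return patch_sizes[stage]


def token_count(latent_height, latent_width, patch_size):
    """Number of tokens after patchifying a latent grid into square patches."""
    if latent_height % patch_size or latent_width % patch_size:
        raise ValueError("latent size must be divisible by the patch size")
    return (latent_height // patch_size) * (latent_width // patch_size)


# Two-level example on an assumed 32x32 latent grid.
for noise_level in (0.9, 0.1):
    p = patch_size_for_noise(noise_level)
    print(noise_level, p, token_count(32, 32, p))
```

In this toy setting, doubling the patch side cuts the token count by a factor of four, which is where the compute saving at high-noise steps would come from.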
[CV-77] Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention ICCV
[Quick read]: This paper addresses noise in depth images captured by Time-of-Flight (ToF) sensors, so that downstream applications remain reliable. Previous methods either process single frames or process multiple frames without accounting for depth variations at corresponding pixels across frames, which leads to temporal inconsistency and spatial ambiguity. The key to its solution is motion-invariant graph fusion: by capturing the temporal self-similarity of graph structures, it enables cross-frame geometric attention, improving temporal stability and spatial sharpness at the same time.
Link: https://arxiv.org/abs/2506.23542
Authors: Weida Wang, Changyong He, Jin Zeng, Di Qiu
Affiliations: Tongji University; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for publication at the International Conference on Computer Vision (ICCV) 2025
Abstract:Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code will be released at this https URL.
[CV-78] Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound MICCAI2025
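[Quick read]: This paper addresses automated detection and localization of congenital uterine anomalies (CUAs), in particular accurate assessment of uterine morphology in 3D ultrasound (3D US) images. The key to its solution is an intelligent system with three parts. It combines a denoising diffusion model with local and global guidance, using an adaptive weighting strategy to optimize attention allocation. It introduces a reinforcement learning-based framework that uses unsupervised rewards to extract a key slice summary from redundant sequences and integrates information across multiple planes to reduce learning difficulty. It also applies text-driven uncertainty modeling to adjust coarse predictions, improving overall performance.
Link: https://arxiv.org/abs/2506.23538
Authors: Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by MICCAI 2025; 10 pages, 3 figures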
Abstract:Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at this https URL.
[CV-79] GViT: Representing Images as Gaussians for Visual Recognition
[Quick read]: This paper addresses the limits on computational efficiency and expressiveness that come from the pixel or patch grid input representations used by conventional Vision Transformers (ViT). The key to its solution is the GVIT framework, which uses learnable 2D Gaussians as a compact image representation and jointly optimizes their positions, scales, orientations, colors, and opacities together with a ViT classifier, yielding efficient feature representation and classification performance.
Link: https://arxiv.org/abs/2506.23532
Authors: Jefferson Hernandez, Ruozhen He, Guha Balakrishnan, Alexander C. Berg, Vicente Ordonez
Affiliations: Rice University; University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We introduce GVIT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations, coupled with our GVIT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k with a ViT-B architecture.
[CV-80] When Test-Time Adaptation Meets Self-Supervised Models
[Quick read]: This paper asks how test-time adaptation (TTA) methods can continuously improve self-supervised learning (SSL) models without relying on a source-pretrained model. The key to its solution is a collaborative learning framework that combines SSL and TTA, using contrastive learning and knowledge distillation for stepwise representation refinement, so that the model adapts to the target domain effectively without source pretraining.
Link: https://arxiv.org/abs/2506.23529
Authors: Jisu Han, Jihee Park, Dongyoon Han, Wonjun Hwang
Affiliations: Korea University; Naver AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures
Abstract:Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising, but it remains highly reliant on the performance of the source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-supervised learning (SSL) without relying on source pretraining. We introduce a self-supervised TTA protocol after observing that existing TTA approaches struggle when directly applied to self-supervised models with low accuracy on the source domain. Furthermore, we propose a collaborative learning framework that integrates SSL and TTA models, leveraging contrastive learning and knowledge distillation for stepwise representation refinement. We validate our method on diverse self-supervised models, including DINO, MoCo, and iBOT, across TTA benchmarks. Extensive experiments validate the effectiveness of our approach in SSL, showing that it achieves competitive performance even without source pretraining.
[CV-81] Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving IROS2025
[Quick read]: This paper addresses two problems: traditional vision-based autonomous driving systems have difficulty navigating complex environments when they rely on single-image inputs, and existing high-performance methods depend on resource-intensive fusion networks, making them hard to train and unsuitable for federated learning. The key to its solution is lightweight temporal transformer decomposition, which processes sequential image frames and temporal steering data by breaking large attention maps into smaller matrices, reducing model complexity and enabling efficient weight updates and real-time predictions while using temporal information to improve driving performance.
Link: https://arxiv.org/abs/2506.23523
Authors: Tuong Do, Binh X. Nguyen, Quang D. Tran, Erman Tjiputra, Te-Chuan Chiu, Anh Nguyen
Affiliations: AIOZ; NTHU; University of Liverpool
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IROS 2025
Abstract:Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Extensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method.
[CV-82] From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection
[Quick read]: This paper addresses how to use fixation information effectively to improve video salient object detection (VSOD) under weak supervision. The key to its solution is a Position and Semantic Embedding (PSE) module that provides location and semantic guidance during feature learning, a Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints, and an Intra-Inter Mixed Contrastive (IIMC) model that performs intra-video and inter-video contrastive learning, which together strengthen spatiotemporal feature modeling.
Link: https://arxiv.org/abs/2506.23519
Authors: Qi Qin, Runmin Cong, Gen Zhan, Yiting Liao, Sam Kwong
Affiliations: Beijing Jiaotong University; Shandong University; ByteDance China; ByteDance USA; Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 Pages, 9 Figures
Abstract:The eye-tracking video saliency prediction (VSP) task and video salient object detection (VSOD) task both focus on the most attractive objects in video and show the result in the form of predictive heatmaps and pixel-level saliency masks, respectively. In practical applications, eye tracker annotations are more readily obtainable and align closely with the authentic visual patterns of human eyes. Therefore, this paper aims to introduce fixation information to assist the detection of video salient objects under weak supervision. On the one hand, we ponder how to better explore and utilize the information provided by fixation, and then propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. On the other hand, we achieve spatiotemporal feature modeling under weak supervision from the aspects of feature selection and feature contrast. A Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints is designed to effectively select the most matching and accurate object query for spatiotemporal modeling. In addition, an Intra-Inter Mixed Contrastive (IIMC) model improves the spatiotemporal modeling capabilities under weak supervision by forming an intra-video and inter-video contrastive learning paradigm. Experimental results on five popular VSOD benchmarks indicate that our model outperforms other competitors on various evaluation metrics.
[CV-83] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image
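[Quick read]: This paper addresses view consistency, that is, maintaining structural coherence across views, when generating high-quality novel views from a single image. Existing methods combine diffusion models with 3D models for novel view synthesis but are inefficient because of complex multi-step pipelines. The key to the proposed solution is a training-free method that needs no additional modules: view-guided warping enables adaptive attention manipulation and noise reinitialization, which ensures view consistency.
Link: https://arxiv.org/abs/2506.23518
Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: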
Abstract:Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.
[CV-84] FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
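[Quick read]: This paper addresses the performance degradation in federated learning (FL) caused by data heterogeneity and communication constraints. The key to its solution is a new FL framework, FedWSQ, which combines weight standardization (WS) with the proposed distribution-aware non-uniform quantization (DANUQ). WS filters out biased components in local updates during training, making the model more robust to data heterogeneity and unstable client participation, while DANUQ minimizes quantization error by exploiting the statistical properties of local model updates, reducing communication overhead while maintaining high accuracy.
Link: https://arxiv.org/abs/2506.23516
Authors: Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee
Affiliations: Pukyong National University; Konkuk University; Jeonbuk National University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: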
Abstract:Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
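Weight standardization on its own is a known technique: the weights of each output unit are shifted to zero mean and scaled to unit variance. A minimal pure-Python sketch follows. The per-row grouping and the `eps` value are assumptions made here, and how FedWSQ applies WS to local updates or pairs it with DANUQ is not specified in the abstract, so neither is modeled.

```python
import math


def standardize_weights(weight_rows, eps=1e-5):
    """Standardize each row (one output unit) to zero mean and unit variance."""
    standardized = []
    for row in weight_rows:
        mean = sum(row) / len(row)
        variance = sum((w - mean) ** 2 for w in row) / len(row)
        scale = math.sqrt(variance + eps)  # eps guards against zero variance
        standardized.append([(w - mean) / scale for w in row])
    return standardized


# A row with a large shared offset keeps its shape but loses the offset.
weights = [[10.0, 11.0, 12.0, 13.0], [-0.5, 0.5, -0.5, 0.5]]
ws = standardize_weights(weights)
```

After standardization, both rows have comparable statistics regardless of their original offset or scale.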
[CV-85] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
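[Quick read]: This paper addresses the difficulty of synthesizing high-quality panoramic videos, which arises from the inherent modality gap between panoramic data and perspective data. The key to its solution is a new panorama representation, the ViewPoint map, which offers both global spatial continuity and fine-grained visual detail, combined with a Pano-Perspective attention mechanism that lets the model exploit pretrained perspective priors and capture the panoramic spatial correlations of the ViewPoint map.
Link: https://arxiv.org/abs/2506.23513
Authors: Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Affiliations: USTC; TongYi Lab; HKU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL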
Abstract:Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.
[CV-86] Improve Underwater Object Detection through YOLOv12 Architecture and Physics-informed Augmentation
[Quick read]: This paper addresses the degradation of underwater object detection caused by light attenuation, turbidity, and occlusion. The key to its solution is combining physics-informed augmentation with the YOLOv12 architecture: Residual ELAN blocks preserve structural features in turbid water, and Area Attention maintains large receptive fields while reducing computational complexity. Domain-specific augmentations, including turbulence-adaptive blurring, biologically grounded occlusion simulation, and spectral HSV transformations, handle the challenges posed by underwater optical properties.
Link: https://arxiv.org/abs/2506.23505
Authors: Tinh Nguyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Underwater object detection is crucial for autonomous navigation, environmental monitoring, and marine exploration, but it is severely hampered by light attenuation, turbidity, and occlusion. Current methods balance accuracy and computational efficiency, but they are difficult to deploy in real time under low-visibility conditions. Through the integration of physics-informed augmentation techniques with the YOLOv12 architecture, this study advances underwater detection. It uses Residual ELAN blocks to preserve structural features in turbid waters and Area Attention to maintain large receptive fields for occluded objects while reducing computational complexity. Underwater optical properties are addressed by domain-specific augmentations such as turbulence adaptive blurring, biologically grounded occlusion simulation, and spectral HSV transformations for color distortion. Extensive tests on four difficult datasets show state-of-the-art performance, with Brackish data registering 98.30% mAP at 142 FPS. YOLOv12 improves occlusion robustness by 18.9%, small-object recall by 22.4%, and detection precision by up to 7.94% compared to previous models. The crucial role of augmentation strategy is validated by ablation studies. This work offers a precise and effective solution for conservation and underwater robotics applications.
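[CV-87] LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching ICCV2025
[Quick read]: This paper addresses the weak fine-grained, action-level understanding of existing contrastive vision-language pre-trained models such as CLIP in image-text matching, in particular their inability to perceive actions, which are crucial for describing the states of objects or the relationships between them. The key to its solution is an LLM-enhanced action-aware multi-modal prompt-tuning method: an action triplet prompt and an action state prompt exploit the compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs, and an adaptive interaction module aggregates attentive visual features conditioned on the action-aware prompted knowledge, producing more discriminative, action-aware visual representations.
Link: https://arxiv.org/abs/2506.23502
Authors: Mengxiao Tian, Xinxiao Wu, Shuo Yang
Affiliations: Beijing Institute of Technology; Shenzhen MSU-BIT University; Beijing Research Center of Intelligent Equipment for Agriculture
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCV 2025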
Abstract:Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by large language models (LLMs). Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.
[CV-88] Sample Margin-Aware Recalibration of Temperature Scaling
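Quick read: This paper addresses the systematic overconfidence of modern neural networks deployed in safety-critical scenarios, and the fundamental trade-off in existing post-hoc calibration between global adjustments (high bias) and methods that operate on full high-dimensional logit distributions (high variance). The key to the solution is Sample Margin-Aware Recalibration of Temperature (SMART), a lightweight, data-efficient calibration method that scales logits according to the margin between the top two logits (the logit gap). The gap provides a denoised scalar signal tied directly to decision-boundary uncertainty while leaving the model's predictions unchanged, and a new soft-binned Expected Calibration Error (SoftECE) objective balances model bias and variance.
Link: https://arxiv.org/abs/2506.23492
Authors: Haolan Guo, Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu
Affiliations: University of Sydney; City University of Hong Kong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: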
Abstract:Recent advances in deep learning have significantly improved predictive accuracy. However, modern neural networks remain systematically overconfident, posing risks for deployment in safety-critical scenarios. Current post-hoc calibration methods face a fundamental dilemma: global approaches like Temperature Scaling apply uniform adjustments across all samples, introducing high bias despite computational efficiency, while more expressive methods that operate on full logit distributions suffer from high variance due to noisy high-dimensional inputs and insufficient validation data. To address these challenges, we propose Sample Margin-Aware Recalibration of Temperature (SMART), a lightweight, data-efficient recalibration method that precisely scales logits based on the margin between the top two logits – termed the logit gap. Specifically, the logit gap serves as a denoised, scalar signal directly tied to decision boundary uncertainty, providing a robust indicator that avoids the noise inherent in high-dimensional logit spaces while preserving model prediction invariance. Meanwhile, SMART employs a novel soft-binned Expected Calibration Error (SoftECE) objective that balances model bias and variance through adaptive binning, enabling stable parameter updates even with extremely limited calibration data. Extensive evaluations across diverse datasets and architectures demonstrate that SMART achieves state-of-the-art calibration performance even with substantially fewer parameters compared to existing parametric methods, offering a principled, robust, and highly efficient solution for practical uncertainty quantification in neural network predictions. The source code is available at: this https URL.
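The logit-gap idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the gap is mapped to a temperature, so the softplus mapping `gap_temperature` and its parameters `a` and `b` are assumptions made only for illustration. What the sketch does show is why any positive per-sample temperature leaves the predicted class unchanged.

```python
import math


def logit_gap(logits):
    """Margin between the largest and second-largest logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gap_temperature(gap, a=0.5, b=1.0):
    """Hypothetical gap-to-temperature mapping (an assumption, not from the paper).

    softplus(a * gap + b) is used only because it is always positive.
    """
    return math.log1p(math.exp(a * gap + b))


def recalibrate(logits, a=0.5, b=1.0):
    """Divide all logits of one sample by a temperature derived from its logit gap."""
    t = gap_temperature(logit_gap(logits), a, b)
    return softmax([z / t for z in logits])


logits = [2.0, 0.5, -1.0]
print(logit_gap(logits))    # margin between the top two logits
print(recalibrate(logits))  # rescaled probabilities, same predicted class
```

Because every logit of a sample is divided by the same positive number, the ordering of the logits, and therefore the argmax, is preserved; only the confidence changes.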
[CV-89] Qwen-GUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding
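Quick read: This paper addresses data scarcity and high model compute cost in Graphical User Interface (GUI) grounding, particularly in high-resolution desktop environments. The key components of the solution are: a cross-platform, multi-resolution dataset of 24K examples that alleviates the shortage of high-resolution data; a two-stage fine-tuning strategy in which cross-platform training first establishes robust GUI understanding and specialized fine-tuning on high-resolution data then improves adaptability; and data curation and redundancy reduction, showing that a randomly sampled smaller subset with reduced redundancy matches the performance of larger datasets, which underlines the importance of data diversity.
Link: https://arxiv.org/abs/2506.23491
Authors: ZongHan Hsieh, Tzer-Jen Wei
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: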
Abstract:This paper introduces Qwen-GUI-3B, a lightweight Vision-Language Model (VLM) specifically designed for Graphical User Interface grounding tasks, achieving performance competitive with significantly larger models. Unlike large-scale VLMs (7B parameters) that are computationally intensive and impractical for consumer-grade hardware, Qwen-GUI-3B delivers strong grounding accuracy while being fully trainable on a single GPU (RTX 4090). The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots, which effectively addresses data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights Qwen-GUI-3B’s exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. Qwen-GUI-3B is available at: this https URL
[CV-90] TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity ICCV2025
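Quick read: This paper addresses copyright protection and authenticity verification for AI-generated content (AIGC), specifically the limited tamper robustness of current watermarks that preserve visual quality losslessly and the inadequacy of passive tamper detection against high-quality image editing tools. The key to the solution is TAG-WM, a tamper-aware generative image watermarking method built from four core modules: a dual-mark joint sampling (DMJS) algorithm that embeds copyright and localization watermarks while preserving generation quality; watermark latent reconstruction (WLR), which recovers the watermark using reversed DMJS; a dense variation region detector (DVRD), which identifies tampered regions through statistical deviation analysis; and tamper-aware decoding (TAD) guided by the localization results. Together they provide high robustness and accurate tamper localization.
Link: https://arxiv.org/abs/2506.23484
Authors: Yuzhuo Chen, Zehua Ma, Han Fang, Weiming Zhang, Nenghai Yu
Affiliations: University of Science and Technology of China; National University of Singapore
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ICCV 2025 (2025 IEEE/CVF International Conference on Computer Vision)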
Abstract:AI-generated content (AIGC) enables efficient visual creation but raises copyright and authenticity risks. As a common technique for integrity verification and source tracing, digital image watermarking is regarded as a potential solution to the above issues. Among these, watermarking methods capable of preserving the generation quality are receiving increased attention. However, the proliferation and high performance of generative image editing applications have elevated the risks of malicious tampering, creating new demands. 1) The tamper robustness of current lossless visual quality watermarks remains constrained by the modification-sensitive diffusion inversion process, necessitating enhanced robustness. 2) The improved tampering quality and rapid iteration cycles render passive tampering detection methods inadequate, making proactive tampering localization capability a desired feature for watermarks. To address these requirements, this paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM. The proposed method comprises four key modules: a dual-mark joint sampling (DMJS) algorithm for embedding copyright and localization watermarks into the latent space while preserving generative quality, the watermark latent reconstruction (WLR) utilizing reversed DMJS, a dense variation region detector (DVRD) leveraging diffusion inversion sensitivity to identify tampered areas via statistical deviation analysis, and the tamper-aware decoding (TAD) guided by localization results. The experimental results indicate that TAG-WM achieves SOTA tampering robustness and tampering localization capability with distortions while maintaining lossless generation quality and a considerable capacity of 256 bits.
[CV-91] MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting CVPR2025
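Quick read: This paper addresses semantic misalignment, structural distortion, and style inconsistency, which are common in image inpainting. The key to the solution is MTADiffusion: MTAPipeline annotates masks automatically and is used to build MTADataset, comprising 5 million images and 25 million mask-text pairs; a multi-task training strategy combines inpainting with edge prediction to improve structural stability; and a new style-consistency loss, built on a pre-trained VGG network and the Gram matrix, improves the semantic coherence and visual consistency of the results.
Link: https://arxiv.org/abs/2506.23482
Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu
Affiliations: Meitu Inc; National University of Singapore; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025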
Abstract:Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
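A Gram-matrix style loss of the kind the abstract mentions can be sketched as follows. This is a generic version, assuming feature maps are given as plain lists of channels; in the paper the features come from a pre-trained VGG network, and the exact layers, normalization, and weighting are not stated in the abstract, so those choices here are assumptions.

```python
def gram_matrix(features):
    """Channel-by-channel inner products of a feature map.

    `features` is a list of C channels, each a flat list of N spatial activations.
    Dividing by N is a common normalization, assumed here.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
        for ci in features
    ]


def style_loss(feat_a, feat_b):
    """Mean squared difference between the Gram matrices of two feature maps."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    c = len(ga)
    return sum(
        (ga[i][j] - gb[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy stand-ins for VGG activations: 2 channels, 2 spatial positions.
inpainted = [[1.0, 2.0], [3.0, 4.0]]
reference = [[2.0, 1.0], [4.0, 3.0]]  # same activations, positions swapped
print(gram_matrix(inpainted))
print(style_loss(inpainted, reference))
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why the two toy feature maps above, identical up to a swap of positions, have the same style statistics.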
[CV-92] Evaluation of Geolocation Capabilities of Multimodal Large Language Models and Analysis of Associated Privacy Risks
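Quick read: This paper addresses the privacy and ethical concerns raised by Multimodal Large Language Models (MLLMs) in geolocation tasks, in particular the privacy risk created by their ability to infer where an image was taken from visual content alone. The approach is to systematically analyze existing MLLM-based geolocation techniques, evaluate advanced visual reasoning models on identifying the origin of street view imagery, and identify the visual elements that drive localization accuracy, such as text, architectural styles, and environmental features. The most advanced visual large models reach up to 49% accuracy within a 1-kilometer radius, which shows how effectively they extract fine-grained geographic cues from visual data and provides a basis for technical and policy countermeasures.
Link: https://arxiv.org/abs/2506.23481
Authors: Xian Zhang, Xiang Cheng
Affiliations: Wuhan University School of Information
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: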
Abstract:Objectives: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly enhanced their reasoning capabilities, enabling a wide range of intelligent applications. However, these advancements also raise critical concerns regarding privacy and ethics. MLLMs are now capable of inferring the geographic location of images – such as those shared on social media or captured from street views – based solely on visual content, thereby posing serious risks of privacy invasion, including doxxing, surveillance, and other security threats. Methods: This study provides a comprehensive analysis of existing geolocation techniques based on MLLMs. It systematically reviews relevant literature and evaluates the performance of state-of-the-art visual reasoning models on geolocation tasks, particularly in identifying the origins of street view imagery. Results: Empirical evaluation reveals that the most advanced visual large models can successfully localize the origin of street-level imagery with up to 49% accuracy within a 1-kilometer radius. This performance underscores the models’ powerful capacity to extract and utilize fine-grained geographic cues from visual data. Conclusions: Building on these findings, the study identifies key visual elements that contribute to successful geolocation, such as text, architectural styles, and environmental features. Furthermore, it discusses the potential privacy implications associated with MLLM-enabled geolocation and several technical and policy-based countermeasures to mitigate associated risks. Our code and dataset are available at this https URL.
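The headline metric, accuracy within a 1-kilometer radius, can be computed from predicted and true coordinates with the haversine great-circle distance. The sketch below is an assumption about how such a metric is typically evaluated, not the authors' evaluation code; the function names and the mean Earth radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(predictions, ground_truth, radius_km=1.0):
    """Fraction of predictions that fall within `radius_km` of the true location."""
    hits = sum(
        haversine_km(p[0], p[1], g[0], g[1]) <= radius_km
        for p, g in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


# Toy coordinates (hypothetical, not from the paper's dataset).
truth = [(30.5400, 114.3600), (48.8584, 2.2945)]
preds = [(30.5405, 114.3605), (49.8584, 2.2945)]  # second guess is one degree off
print(accuracy_within(preds, truth))
```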
[CV-93] Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting
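Quick read: This paper addresses the high GPU demand of Implicit Neural Representation (INR) for image representation, together with the slow training of the Gaussian Splatting based GaussianImage and its fixed number of Gaussians per image, which limits adaptability. The key to the solution is a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting: a network quickly generates a coarse Gaussian representation, which is refined with a small number of fine-tuning steps, significantly reducing training time, while the number of Gaussian points is adjusted dynamically according to image complexity for greater flexibility and efficiency.
Link: https://arxiv.org/abs/2506.23479
Authors: Zhaojie Zeng, Yuesong Wang, Chao Yang, Tao Guan, Lili Ju
Affiliations: Huazhong University of Science and Technology; China University of Geosciences (Wuhan); University of South Carolina
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: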
Abstract:Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost, however, the slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To address these issues, we propose in this paper a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting. Our method employs a network to quickly generate a coarse Gaussian representation, followed by minimal fine-tuning steps, achieving comparable rendering quality of GaussianImage while significantly reducing training time. Moreover, our approach dynamically adjusts the number of Gaussian points based on image complexity to further enhance flexibility and efficiency in practice. Experiments on DIV2K and Kodak datasets show that our method matches or exceeds GaussianImage’s rendering performance with far fewer iterations and shorter training times. Specifically, our method reduces the training time by up to one order of magnitude while achieving superior rendering performance with the same number of Gaussians.
[CV-94] GeoCD: A Differential Local Approximation for Geodesic Chamfer Distance
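Quick read: This paper addresses a limitation of the standard Chamfer Distance (CD) in 3D point cloud learning: because it relies solely on Euclidean distances, it fails to capture the intrinsic geometry of 3D shapes. The key to the solution is GeoCD, a topology-aware and fully differentiable approximation of geodesic distance intended as a more effective metric for 3D point cloud learning.
Link: https://arxiv.org/abs/2506.23478
Authors: Pedro Alonso, Tianrui Li, Chongshou Li
Affiliations: Southwest Jiaotong University; Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: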
Abstract:Chamfer Distance (CD) is a widely adopted metric in 3D point cloud learning due to its simplicity and efficiency. However, it suffers from a fundamental limitation: it relies solely on Euclidean distances, which often fail to capture the intrinsic geometry of 3D shapes. To address this limitation, we propose GeoCD, a topology-aware and fully differentiable approximation of geodesic distance designed to serve as a metric for 3D point cloud learning. Our experiments show that GeoCD consistently improves reconstruction quality over standard CD across various architectures and datasets. We demonstrate this by fine-tuning several models, initially trained with standard CD, using GeoCD. Remarkably, fine-tuning for a single epoch with GeoCD yields significant gains across multiple evaluation metrics.
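For reference, the standard Euclidean Chamfer Distance that GeoCD is compared against can be sketched as below. Conventions vary between papers (squared or unsquared distances, sums or means); this sketch assumes squared distances averaged per point set. It does not implement GeoCD itself, whose geodesic approximation is not detailed in the abstract.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_way_chamfer(source, target):
    """Average squared distance from each source point to its nearest target point."""
    return sum(min(squared_distance(s, t) for t in target) for s in source) / len(source)


def chamfer_distance(p, q):
    """Symmetric Chamfer Distance: nearest-neighbor terms in both directions."""
    return one_way_chamfer(p, q) + one_way_chamfer(q, p)


p = [(0.0, 0.0, 0.0)]
q = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(chamfer_distance(p, q))
```

Every term above is a straight-line distance in space, so two points on opposite sides of a thin gap count as close even if they are far apart along the surface. That is the limitation the abstract attributes to relying solely on Euclidean distances.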
[CV-95] KiseKloset: Comprehensive System For Outfit Retrieval Recommendation And Try-On
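Quick read: This paper addresses insufficient personalization in online shopping and the limited efficiency and realism of virtual try-on. The key to the solution is KiseKloset, a comprehensive system that integrates outfit retrieval, recommendation, and virtual try-on. Its core contributions are a new transformer architecture for recommending complementary items across categories, approximate algorithms that speed up the search pipeline, and a lightweight yet efficient virtual try-on framework that runs in real time, is memory-efficient, and produces realistic outputs, improving the shopping experience and reducing retailers' costs from damaged items.
Link: https://arxiv.org/abs/2506.23471
Authors: Thanh-Tung Phan-Nguyen, Khoi-Nguyen Nguyen-Ngoc, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM; Vietnam National University; Department of Computer Science, University of Dayton
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: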
Abstract:The global fashion e-commerce industry has become integral to people’s daily lives, leveraging technological advancements to offer personalized shopping experiences, primarily through recommendation systems that enhance customer engagement through personalized suggestions. To improve customers’ experience in online shopping, we propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We explore two approaches for outfit retrieval: similar item retrieval and text feedback-guided item retrieval. Notably, we introduce a novel transformer architecture designed to recommend complementary items from diverse categories. Furthermore, we enhance the overall performance of the search pipeline by integrating approximate algorithms to optimize the search process. Additionally, addressing the crucial needs of online shoppers, we employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs compared to its predecessors. This virtual try-on module empowers users to visualize specific garments on themselves, enhancing the customers’ experience and reducing costs associated with damaged items for retailers. We deployed our end-to-end system for online users to test and provide feedback, enabling us to measure their satisfaction levels. The results of our user study revealed that 84% of participants found our comprehensive system highly useful, significantly improving their online shopping experience.
[CV-96] Interactive Interface For Semantic Segmentation Dataset Synthesis
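Quick read: This paper addresses the high cost, long turnaround, and privacy risks of creating high-quality annotated datasets, especially for semantic segmentation. The key to the solution is SynthLab, a modular platform for visual data synthesis that is easy to maintain, scales through centralized updates, and integrates new features seamlessly, paired with an interactive interface that lets users without a technical background customize data pipelines quickly through drag-and-drop actions.
Link: https://arxiv.org/abs/2506.23470
Authors: Ngoc-Do Tran, Minh-Tuan Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Affiliations: University of Science, VNU-HCM, Vietnam; University of Dayton, Ohio, US
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: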
Abstract:The rapid advancement of AI and computer vision has significantly increased the demand for high-quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource-intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real-world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user-friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user-friendly interface allows users to quickly customize their data pipelines through drag-and-drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels, have demonstrated flexible usage, and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real-world applications.
[CV-97] NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments ICCV2025
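Quick read: This paper addresses the poor generalization and weak dynamic adaptation of agents that execute sequential navigation actions in complex environments in Vision-and-Language Navigation in Continuous Environments (VLN-CE). The key to the solution is NavMorph, a framework that models environmental dynamics with compact latent representations, giving the agent foresight for adaptive planning and policy refinement, combined with a novel Contextual Evolution Memory that uses scene-contextual information to support effective navigation while maintaining online adaptability.
Link: https://arxiv.org/abs/2506.23468
Authors: Xuan Yao, Junyu Gao, Changsheng Xu
Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025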
Abstract:Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available at this https URL.
[CV-98] AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays MICCAI2025
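Quick read: This paper addresses fairness problems of Contrastive Language-Image Pre-training (CLIP) models in medical image classification, in particular biases related to race and gender that lead to disparities in diagnostic outcomes and reduced reliability for underrepresented groups. The key to the solution is AdFair-CLIP, a framework that uses adversarial feature intervention to suppress sensitive attributes, mitigating spurious correlations and improving prediction fairness.
Link: https://arxiv.org/abs/2506.23467
Authors: Chenlang Yi, Zizhan Xiong, Qi Qi, Xiyuan Wei, Girish Bathla, Ching-Long Lin, Bobak Jack Mortazavi, Tianbao Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This preprint has been accepted by MICCAI 2025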
Abstract:Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
[CV-99] Sanitizing Manufacturing Dataset Labels Using Vision-Language Models
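Quick read: This paper addresses the performance loss that poor label quality causes for machine learning models in industrial applications; in manufacturing, where high-quality labels are costly and time-consuming to obtain, label noise, inconsistencies, and errors are especially pronounced. The key to the solution is Vision-Language Sanitization and Refinement (VLSR), a framework that uses the CLIP model to embed images and their textual labels into a shared semantic space and computes cosine similarity between embeddings to sanitize and cluster labels, identifying irrelevant or semantically weak labels and merging semantically similar ones to improve label consistency and dataset quality.
Link: https://arxiv.org/abs/2506.23465
Authors: Nazanin Mahjourian, Vinh Nguyen
Affiliations: Michigan Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: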
Abstract:The success of machine learning models in industrial applications is heavily dependent on the quality of the datasets used to train the models. However, large-scale datasets, especially those constructed from crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in manufacturing domains, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitization and Refinement (VLSR), which is a vision-language-based framework for label sanitization and refinement in multi-label manufacturing image datasets. This method embeds both images and their associated textual labels into a shared semantic space leveraging the CLIP vision-language model. Then two key tasks are addressed in this process by computing the cosine similarity between embeddings. First, label sanitization is performed to identify irrelevant, misspelled, or semantically weak labels, and surface the most semantically aligned label for each image by comparing image-label pairs using cosine similarity between image and label embeddings. Second, the method applies density-based clustering on text embeddings, followed by iterative cluster merging, to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is employed to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the VLSR framework successfully identifies problematic labels and improves label consistency. This method enables a significant reduction in label vocabulary through clustering, which ultimately enhances the dataset’s quality for training robust machine learning models in industrial applications with minimal human intervention.
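The label sanitization step can be sketched with plain cosine similarity. In this minimal example, short hand-written vectors stand in for CLIP image and text embeddings, and the `threshold` for flagging weak labels is a hypothetical value; the abstract does not state how the cutoff is chosen.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sanitize_labels(image_embedding, label_embeddings, threshold=0.2):
    """Score each label against the image; return (best label, weak labels, scores).

    `threshold` is an assumed cutoff for "semantically weak" labels.
    """
    scores = {
        label: cosine_similarity(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    weak = sorted(label for label, s in scores.items() if s < threshold)
    return best, weak, scores


# Toy 2-D vectors standing in for real CLIP embeddings (hypothetical values).
image = [1.0, 0.0]
labels = {"gear": [0.9, 0.1], "geer": [0.0, 1.0], "metal part": [0.6, 0.4]}
best, weak, scores = sanitize_labels(image, labels)
print(best, weak)
```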
[CV-100] Time-variant Image Inpainting via Interactive Distribution Transition Estimation
Quick read: This paper addresses Time-vAriant iMage inPainting (TAMP): restoring a damaged target image using complementary information from a reference image that captures the same scene but was taken at a significantly different time. Unlike conventional reference-guided inpainting, the reference image in TAMP differs substantially in content from the target and may itself be damaged. To tackle this ill-posed problem, the paper proposes the Interactive Distribution Transition Estimation (InDiTE) module, which complements the time-variant images with adaptive semantics through interaction and thereby facilitates restoration of the damaged regions. It further proposes InDiTE-Diff, which integrates InDiTE with a state-of-the-art diffusion model and performs latent cross-reference during sampling to improve inpainting performance.
Link: https://arxiv.org/abs/2506.23461
Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma
Affiliations: IHPC and CFAR, ASTAR, Singapore; University of Alberta; University of South Carolina; Nanyang Technological University; School of Artificial Intelligence, Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, Jilin University; College of Intelligence and Computing, Tianjin University; University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images capture the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under the TAMP setup presents significant content distinction from the target image and potentially also suffers from damage. Such an application frequently arises in daily life, where a damaged image is restored by referring to another reference image with no guarantee of the reference image’s source and quality. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics, thus facilitating the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for the TAMP task, we newly assembled a dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.
[CV-101] Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation
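Quick read: This paper addresses the weak object localization and boundary precision of class-label-based weakly supervised semantic segmentation (WSSS), in particular the partial activations and imprecise boundaries that traditional class activation map (CAM) methods produce because classification and segmentation optimize different objectives. The key to the solution is Contrastive Learning with Diffusion Features (CLDF), which uses contrastive learning to train a pixel decoder that maps diffusion features from a frozen conditional diffusion model (CDM) into a low-dimensional embedding space for segmentation. Gradient maps generated from the CDM's external classifier are combined with CAMs to identify foreground and background pixels more accurately, enabling robust pixel embedding learning.
Link: https://arxiv.org/abs/2506.23460
Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: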
Abstract:Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from the CDM's external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.
[CV-102] PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions ICCV2025
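Quick read: This paper addresses data scarcity in histopathology image generation caused by privacy constraints, and the lack of paired text and mask data in public datasets, which limits the joint use of both modalities in image generation. The key to the solution is PathDiff, a framework that integrates unpaired mask and text data into a unified conditioning space, allowing precise control over structural and contextual features and generating high-quality, semantically accurate images.
Link: https://arxiv.org/abs/2506.23440
Authors: Mahesh Bhosale, Abdul Wasi, Yuanhao Zhai, Yunjie Tian, Samuel Border, Nan Xi, Pinaki Sarder, Junsong Yuan, David Doermann, Xuan Gong
Affiliations: University at Buffalo; University of Florida; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025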
Abstract:Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.
[CV-103] Towards foundational LiDAR world models with efficient latent flow matching
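Quick read: This paper addresses the poor cross-domain transferability of LiDAR world models: existing models excel only in the domain for which they were built. The key to the solution is a framework based on latent conditional flow matching (CFM), which raises the data compression ratio and uses a more efficient training objective, substantially improving transfer performance and computational efficiency while reducing reliance on manually annotated data.
Link: https://arxiv.org/abs/2506.23434
Authors: Tianran Liu, Shengwen Zhao, Nicholas Rhinehart
Affiliations: University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 25 pages, 13 figures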
Abstract:LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. Can we develop LiDAR world models that exhibit strong transferability across multiple domains? We conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse-beam & dense-beam adaptation, and (iii) non-semantic to semantic transfer. Given different amounts of fine-tuning data, our experiments show that a single pre-trained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability of dynamic learning significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds the previous semantic occupancy forecasting models with only 5% of the labeled training data required by prior models. We also observed inefficiencies of current LiDAR world models, mainly through their under-compression of LiDAR data and inefficient training objectives. To address this, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model achieves SOTA performance on future-trajectory-conditioned semantic occupancy forecasting while being 23x more computationally efficient (a 28x FPS speedup); and achieves SOTA performance on semantic occupancy forecasting while being 2x more computationally efficient (a 1.1x FPS speedup).
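The flow matching objective can be illustrated in a few lines. This sketch assumes the common linear interpolation path between a noise sample `x0` and a data latent `x1`, for which the regression target is the constant velocity `x1 - x0`; the abstract does not specify which path or conditioning the paper uses, so treat this as a generic example rather than the authors' objective.

```python
def cfm_training_pair(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 and its target velocity.

    x0: noise sample, x1: data latent, t: time in [0, 1]. The linear path is an
    assumed choice; other probability paths give different targets.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between the model's velocity and the target velocity."""
    return sum(
        (p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)
    ) / len(target_velocity)


# Toy latents (hypothetical values); a real model would predict the velocity
# from (x_t, t) and any conditioning signal.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
x_t, target = cfm_training_pair(x0, x1, t=0.25)
print(x_t, target)
print(cfm_loss([0.0, 0.0], target))  # loss of a model that predicts zero velocity
```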
[CV-104] Detecting What Matters: A Novel Approach for Out-of-Distribution 3D Object Detection in Autonomous Vehicles
[Quick Read]: This paper addresses the limited ability of autonomous vehicles (AVs) to detect and respond to Out-of-Distribution (OOD) objects, a shortcoming that can create safety hazards. Conventional object detection relies on classification into known classes and therefore struggles to recognize and handle unknown objects. The key to the proposed solution is to shift the focus of detection from class-based classification to object harmfulness: whether an object threatens the vehicle is judged from its position relative to the AV and from its trajectory, which enables effective detection of unknown objects and safer decisions.
Link: https://arxiv.org/abs/2506.23426
Authors: Menna Taha(1), Aya Ahmed(2), Mohammed Karmoose(1 and 3), Yasser Gadallah(2) ((1) Faculty of Engineering at Alexandria University, Alexandria, Egypt, (2) Department of Electronics and Communications Engineering at The American University in Cairo, Egypt, (3) The Wireless Intelligent Networks Center (WINC), School of Engineering and Applied Sciences (EAS), Nile University, Giza, Egypt)
Affiliations: Alexandria University; The American University in Cairo; Nile University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Autonomous vehicles (AVs) use object detection models to recognize their surroundings and make driving decisions accordingly. Conventional object detection approaches classify objects into known classes, which limits the AV’s ability to detect and appropriately respond to Out-of-Distribution (OOD) objects. This problem is a significant safety concern since the AV may fail to detect objects or misclassify them, which can potentially lead to hazardous situations such as accidents. Consequently, we propose a novel object detection approach that shifts the emphasis from conventional class-based classification to object harmfulness determination. Instead of object detection by their specific class, our method identifies them as either ‘harmful’ or ‘harmless’ based on whether they pose a danger to the AV. This is done based on the object position relative to the AV and its trajectory. With this metric, our model can effectively detect previously unseen objects to enable the AV to make safer real-time decisions. Our results demonstrate that the proposed model effectively detects OOD objects, evaluates their harmfulness, and classifies them accordingly, thus enhancing the AV decision-making effectiveness in dynamic environments.
[CV-105] Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models
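[Quick Read]: This paper addresses the weak modeling of spatial relationships in compositional generation by text-to-image (T2I) models, which often fail to reproduce the spatial configuration between objects specified in the input prompt. The key to its solution is a new probabilistic framework built on the Probability of Superiority (PoS), with two methods: PoS-based Evaluation (PSE) for assessing spatial-relationship alignment and PoS-based Generation (PSG) for improving it. PSE is an evaluation metric that tracks human judgment more closely, while PSG is an inference-time method that needs no fine-tuning and optimizes the output with a PoS-based reward function, applied either as gradient-based guidance or as a search strategy.
Link: https://arxiv.org/abs/2506.23418
Authors: Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Affiliations: Sharif University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 main pages, 18 figures, and 16 tables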
Abstract:Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
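To make the Probability of Superiority concrete, here is a minimal sketch that estimates P(X > Y) empirically from two sets of samples. Applying it to the x-coordinates of pixels from two object masks, to score a relation such as "the dog is to the right of the cat", is an assumption made for illustration; the abstract does not give the exact PSE formulation, and the function name and sample values below are hypothetical.

```python
from itertools import product


def probability_of_superiority(xs, ys):
    """Empirical P(X > Y) over all sample pairs, counting ties as one half."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for x, y in product(xs, ys):
        if x > y:
            wins += 1.0
        elif x == y:
            wins += 0.5
    return wins / (len(xs) * len(ys))


# Hypothetical x-coordinates of pixels belonging to two detected objects.
cat_x = [10, 12, 14, 15]
dog_x = [11, 20, 22, 25]

# Score for the relation "dog is to the right of cat":
# 13 of the 16 pairs have the dog pixel further right, so the score is 0.8125.
score = probability_of_superiority(dog_x, cat_x)
print(score)
```

Unlike a comparison of two object centers, this kind of score uses every pair of samples, so it degrades gradually when the objects overlap rather than flipping between right and wrong.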
[CV-106] A High-Throughput Platform to Bench Test Smartphone-Based Heart Rate Measurements Derived From Video
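[Quick Read]: This paper addresses the difficulty of evaluating performance and device compatibility for smartphone-based photoplethysmography (PPG) heart rate (HR) monitoring apps, which stems mainly from device variability and fragmentation. The key to the solution is a high-throughput bench-testing platform comprising a test rig that holds 12 smartphones for parallel testing, a method for generating synthetic PPG test videos with controllable HR and signal quality, and a host machine that coordinates video playback and data logging, giving an efficient, standardized test workflow.
Link: https://arxiv.org/abs/2506.23414
Authors: Ming-Zher Poh, Jonathan Wang, Jonathan Hsu, Lawrence Cai, Eric Teasley, James A. Taylor, Jameson K. Rogers, Anupam Pathak, Shwetak Patel
Affiliations: Google Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: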
Abstract:Smartphone-based heart rate (HR) monitoring apps using finger-over-camera photoplethysmography (PPG) face significant challenges in performance evaluation and device compatibility due to device variability and fragmentation. Manual testing is impractical, and standardized methods are lacking. This paper presents a novel, high-throughput bench-testing platform to address this critical need. We designed a system comprising a test rig capable of holding 12 smartphones for parallel testing, a method for generating synthetic PPG test videos with controllable HR and signal quality, and a host machine for coordinating video playback and data logging. The system achieved a mean absolute percentage error (MAPE) of 0.11% +/- 0.001% between input and measured HR, and a correlation coefficient of 0.92 +/- 0.008 between input and measured PPG signals using a clinically-validated smartphone-based HR app. Bench-testing results of 20 different smartphone models correctly classified all the devices as meeting the ANSI/CTA accuracy standards for HR monitors (MAPE < 10%) when compared to a prospective clinical study with 80 participants, demonstrating high positive predictive value. This platform offers a scalable solution for pre-deployment testing of smartphone HR apps to improve app performance, ensure device compatibility, and advance the field of mobile health.
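The accuracy figures above are stated as mean absolute percentage error. As a rough illustration of how a bench test might score one device, the sketch below computes MAPE between the HR encoded in the test videos and the HR the app reports, then checks it against a 10% limit, assuming that limit is an upper bound. The sample readings and the names `input_hr` and `measured_hr` are hypothetical, not data from the paper.

```python
def mape(reference, measured):
    """Mean absolute percentage error between two equal-length series, in percent."""
    if len(reference) != len(measured) or not reference:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs(m - r) / abs(r) for r, m in zip(reference, measured))
    return 100.0 * total / len(reference)


# Hypothetical bench-test readings in beats per minute.
input_hr = [60.0, 80.0, 100.0, 120.0]     # HR encoded in the synthetic videos
measured_hr = [60.0, 80.8, 99.0, 120.0]   # HR reported by the app

error = mape(input_hr, measured_hr)

# Assumed pass criterion: MAPE below 10 percent.
meets_threshold = error < 10.0
print(round(error, 3), meets_threshold)
```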
[CV-107] SIEDD: Shared-Implicit Encoder with Discrete Decoders KR
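[Quick Read]: This paper addresses the impractically slow encoding that keeps Implicit Neural Representations (INRs) from practical use in video compression. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or the coordinate-level control that adaptive streaming and transcoding depend on. The key to the solution is SIEDD (Shared-Implicit Encoder with Discrete Decoders): a shared coordinate-based encoder is first trained quickly on sparse anchor frames to capture global, low-frequency video features; the encoder is then frozen and lightweight discrete decoders are trained in parallel, with aggressive coordinate-space sampling for further acceleration. This substantially speeds up encoding while preserving reconstruction quality and compression ratio.
Link: https://arxiv.org/abs/2506.23382
Authors: Vikram Rangarajan, Shishira Maiya, Max Ehrlich, Abhinav Shrivastava
Affiliations: University of Maryland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page at this https URL . Project code at this https URL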
Abstract:Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at this https URL .
[CV-108] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions
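[Quick Read]: This paper addresses two problems in multi-subject video customization: the difficulty of constructing training data, and how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video. The key to its solution is VideoCus-Factory, a data construction pipeline that produces video customization training data from raw videos without labels, together with an Image-Video Transfer Mixed (IVTM) training scheme and OmniVCus, a diffusion Transformer framework with two embedding mechanisms. Lottery Embedding (LE) uses the training subjects to activate more frame embeddings so that inference can handle more subjects, while Temporally Aligned Embedding (TAE) guides generation by assigning the same frame embeddings to the control and noise tokens.
Link: https://arxiv.org/abs/2506.23361
Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Affiliations: Johns Hopkins University; Adobe Research; The University of Hong Kong; The Chinese University of Hong Kong; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A data construction pipeline and a diffusion Transformer framework for controllable subject-driven video customization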
Abstract:Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: this https URL. Our code will be released at this https URL
[CV-109] Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement
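[Quick Read]: This paper addresses the low contrast of infrared images under complex weather conditions, especially for non-heat-emitting targets such as bicycles, which significantly degrades downstream high-level vision tasks. Enhancing contrast without amplifying noise or losing important information also remains a challenge. The key to the solution is a task-oriented infrared image enhancement method with two core parts: layer decomposition and saliency information extraction. The layer decomposition enhances scene details while preserving dark-region features, supplying more information for the subsequent saliency extraction; the saliency extraction, based on morphological reconstruction, extracts and enhances target information without amplifying noise.
Link: https://arxiv.org/abs/2506.23353
Authors: Siyuan Chai, Xiaodong Guo, Tong Liu
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: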
Abstract:Infrared images help improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared images often suffer from low contrast, especially for non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise or losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design a layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.
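For readers unfamiliar with morphological reconstruction, the sketch below shows the textbook grayscale reconstruction by dilation: a marker image is repeatedly dilated and clipped under a mask image until it stops changing, so bright regions connected to the marker survive while isolated bright pixels do not. This is the generic operator only; how the paper chooses its marker and mask is not stated in the abstract, and the tiny images here are made up for illustration.

```python
def dilate(img):
    """3x3 grayscale dilation: each pixel becomes the max of its neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = max(
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
    return out


def clip_under(img, mask):
    """Pixel-wise minimum, so the result never exceeds the mask."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(img, mask)]


def reconstruct_by_dilation(marker, mask):
    """Grow `marker` under `mask` until it reaches a fixed point."""
    current = clip_under(marker, mask)
    while True:
        grown = clip_under(dilate(current), mask)
        if grown == current:
            return current
        current = grown


# Hypothetical image: a 2x2 bright target plus one isolated noisy pixel (value 9).
mask = [
    [0, 0, 0, 0, 0],
    [0, 7, 7, 0, 9],
    [0, 7, 7, 0, 0],
]
# The marker seeds a single pixel inside the target.
marker = [
    [0, 0, 0, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

result = reconstruct_by_dilation(marker, mask)
for row in result:
    print(row)
```

In this toy case the whole 2x2 target is recovered from one seed pixel, while the isolated pixel is left at zero because nothing connects it to the marker.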
[CV-110] GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields ICCV2025
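[Quick Read]: This paper addresses the limited scalability and missing compositional reasoning of existing 3D language fields when handling large, complex urban scenes. The key to its solution is the GeoProg3D framework, which has two core components: a Geography-aware City-scale 3D Language Field (GCLF) and Geographical Vision APIs (GV-APIs). By combining a memory-efficient hierarchical 3D model with geographic information, it enables natural language-driven interaction with city-scale, high-fidelity 3D scenes, and it uses large language models as reasoning engines that dynamically combine GV-APIs and operate GCLF to support a range of geographic vision tasks.
Link: https://arxiv.org/abs/2506.23352
Authors: Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki, Yutaka Matsuo
Affiliations: Rikkyo University; University of Tokyo; ATR; Institute of Science Tokyo; National Institute of Informatics; NII LLMC; Sony Semiconductor Solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025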
Abstract:The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at this https URL.
[CV-111] CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation
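[Quick Read]: This paper addresses the limited use of conditional autoregressive image generation in unsupervised image translation, where no explicit cross-domain correspondences are available and the potential of these methods remains largely unexplored. The core obstacle is that the discrete quantization in traditional Vector Quantization-based frameworks breaks the gradient flow between the Variational Autoencoder decoder and the causal Transformer, which blocks end-to-end optimization during adversarial training. The proposed solution is Softmax Relaxed Quantization, which reformulates codebook selection as a continuous probability mixing process via Softmax and so preserves gradient propagation. On this differentiable foundation, the paper further proposes CycleVAR, which recasts image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, and which generates the target image tokens in one of two modes.
Link: https://arxiv.org/abs/2506.23347
Authors: Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu
Affiliations: Beihang University; University of Chinese Academy of Sciences; Tianjin University; CreateAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: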
Abstract:The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
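The idea of replacing a hard codebook lookup with a softmax-weighted mixture can be sketched in a few lines. The version below is only an illustration of the general technique: scoring codes by negative squared distance and dividing by a `temperature` are assumptions, since the abstract says only that codebook selection becomes a continuous probability mixing process via Softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(feature, codebook, temperature=1.0):
    """Return a softmax-weighted mixture of codebook vectors and the weights.

    Hard vector quantization would pick the single nearest code (an argmax,
    which has no useful gradient). Here every code contributes in proportion
    to its softmax weight, so the output varies smoothly with the input.
    """
    # Assumed scoring rule: nearer codes get larger logits.
    logits = [
        -sum((f - c) ** 2 for f, c in zip(feature, code)) / temperature
        for code in codebook
    ]
    weights = softmax(logits)
    mixed = [
        sum(w * code[d] for w, code in zip(weights, codebook))
        for d in range(len(feature))
    ]
    return mixed, weights


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]

mixed, weights = soft_quantize(feature, codebook, temperature=0.1)
print(mixed, weights)
```

With a low temperature the mixture is dominated by the nearest code and approaches hard quantization; a higher temperature blends more codes together.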
[CV-112] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
[Quick Read]: This paper asks whether Vision-Language Models (VLMs) truly understand scenes, testing them through active creation rather than passive recognition. The key to its solution is the IR3D-Bench benchmark, which requires Vision-Language Agents (VLAs) to actively use programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach evaluates the tool-using generative capacity of VLAs, going beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks.
Link: https://arxiv.org/abs/2506.23329
Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
Affiliations: CUHK; TJU; EPFL; HKUST; XMU; MIT
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This “understanding-by-creating” approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
[CV-113] FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method
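[Quick Read]: This paper addresses two problems in open-vocabulary semantic segmentation (OVSS): the loss of pixel-level spatial precision caused by global representation bias, and the difficulty diffusion-based segmentation methods have in balancing the number of iterations against segmentation quality. The key to its solution is FastSeg, an efficient training-free framework that uses only a (1+1)-step reverse process of a pretrained diffusion model and segments all classes in a single run. FastSeg also introduces three key components: a dual-prompt mechanism for extracting discriminative, class-aware attention; a Hierarchical Attention Refinement Method (HARD) that enhances the fused cross-attention; and a Test-Time Flipping (TTF) scheme that improves spatial consistency. Together these deliver state-of-the-art training-free segmentation performance at high inference efficiency.
Link: https://arxiv.org/abs/2506.23323
Authors: Quang-Huy Che, Vinh-Tiep Nguyen
Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: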
Abstract:Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive-learning-based models enable zero-shot segmentation, they often lose fine spatial precision at the pixel level due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the number of iterations with the quality of the segmentation. In this work, we propose FastSeg, a novel and efficient training-free framework that uses only a (1+1)-step reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance the segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned self-attention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency.
[CV-114] InfGen: Scenario Generation as Next Token Group Prediction
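[Quick Read]: This paper addresses the limited ability of existing data-driven traffic simulation methods to model dynamic, long-horizon scenarios with evolving agent populations. The key to its solution is the InfGen framework, which generates agent states and trajectories autoregressively, represents the entire scene as a sequence of tokens covering traffic light signals, agent states, and motion vectors, and uses a Transformer model to simulate traffic over time, so that new agents can be inserted continuously and scenes can be generated indefinitely.
Link: https://arxiv.org/abs/2506.23316
Authors: Zhenghao Peng, Yuxin Liu, Bolei Zhou
Affiliations: University of California, Los Angeles
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: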
Abstract:Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose InfGen, a scenario generation framework that outputs agent states and trajectories in an autoregressive manner. InfGen represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and uses a transformer model to simulate traffic over time. This design enables InfGen to continuously insert new agents into traffic, supporting infinite scene generation. Experiments demonstrate that InfGen produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies trained in InfGen-generated scenarios achieve superior robustness and generalization, validating its utility as a high-fidelity simulation environment for autonomous driving. More information is available at this https URL.
[CV-115] Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting MICCAI-2025 MICCAI2025
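[Quick Read]: This paper addresses the optimization difficulties and degraded rendering quality that 3D Gaussian Splatting (3DGS)-based soft-tissue reconstruction suffers under varying illumination, such as low light and over-exposure. The key to its solution is Endo-4DGX, an illumination-adaptive Gaussian Splatting method. By introducing illumination embeddings, a region-aware enhancement module, and a spatial-aware adjustment module, it models view-dependent brightness variations, and it adds an exposure control loss to restore appearance under adverse exposure, improving rendering under difficult lighting while maintaining geometric accuracy.
Link: https://arxiv.org/abs/2506.23308
Authors: Yiming Huang, Long Bai, Beilei Cui, Yanheng Li, Tong Chen, Jie Wang, Jinlin Wu, Zhen Lei, Hongbin Liu, Hongliang Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025. Project Page: this https URL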
Abstract:Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) techniques and their variants, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3D-GS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3D-GS in such extreme light conditions leads to severe optimization problems and devastating rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications. Our code is available at this https URL.
[CV-116] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On
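[Quick Read]: This paper addresses several key challenges in virtual try-on (VTON): preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing across diverse poses and clothing styles. The key to its solution is DiffFit, a two-stage latent diffusion framework that handles geometric alignment and appearance refinement in separate stages, which reduces task complexity and improves generation stability and visual realism. The first stage aligns the garment with the target body through fine-grained deformation and pose adaptation; the second stage improves texture fidelity with a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image.
Link: https://arxiv.org/abs/2506.23295
Authors: Xiang Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: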
Abstract:Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.
[CV-117] DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios
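[Quick Read]: This paper addresses the lack of interpretability in current deepfake detection methods: in critical domains such as law, existing models usually return only a binary classification and can neither localize nor explain the forged regions. The key to the solution is DDL, a large-scale deepfake detection and localization dataset with over 1.8M forged samples covering up to 75 distinct deepfake methods. Its design incorporates four core innovations (diverse forgery scenarios, comprehensive deepfake methods, varied manipulation modes, and fine-grained forgery annotations), giving stronger support for deepfake detection, localization, and interpretability methods in complex real-world scenarios.
Link: https://arxiv.org/abs/2506.23292
Authors: Changtao Miao, Yi Zhang, Weize Gao, Man Luo, Weiwei Feng, Zhiya Tan, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou
Affiliations: AntGroup; Institute of Automation, Chinese Academy of Sciences; Hefei University of Technology; Anhui Province Key Laboratory of Digital Security; A*STAR Centre for Frontier AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is a preliminary version, with an extended and comprehensive version currently under development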
Abstract:Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. In critical domains such as law, interpretability is crucial for enhancing the credibility and authority of decisions. Recent studies attempt to improve the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, the practical effectiveness of these methods remains suboptimal due to limitations of the forgery data. Most current deepfake datasets predominantly offer binary labels, and only a few provide localization annotations. However, these suffer from restricted forgery scenarios, limited diversity in deepfake types, and insufficient data scale, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization (DDL) dataset containing over 1.8M forged samples and encompassing up to 75 distinct deepfake methods. The DDL design incorporates four key innovations: (1) Diverse Forgery Scenarios, (2) Comprehensive Deepfake Methods, (3) Varied Manipulation Modes, and (4) Fine-grained Forgery Annotations. Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries, but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods. The DDL dataset project page is on this https URL.
[CV-118] Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification ICCV2025
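[Quick Read]: This paper addresses the limited gains of existing distillation-based optimization strategies for cooperatively training multiple networks, which stem from a poor understanding of how learning directions among networks affect training across iterations. The key to its solution is a competitive distillation strategy in which each network in a group can potentially act as the teacher, depending on its performance. Competitive optimization improves the parameter updating process, and stochastic perturbation is introduced to push networks to induce mutations, with the aim of reaching better visual representations and the global optimum.
Link: https://arxiv.org/abs/2506.23285
Authors: Daqian Shi, Xiaolei Diao, Xu Chen, Cédric M. John
Affiliations: Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025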
Abstract:Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve DNN training process, knowledge distillation methods demonstrate their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies are proposed, e.g., deep mutual learning and self-distillation, as an attempt to achieve generic training performance enhancement through the cooperative training of multiple networks. However, such strategies achieve limited improvements due to the poor understanding of the impact of learning directions among networks across different iterations. In this paper, we propose a novel competitive distillation strategy that allows each network in a group to potentially act as a teacher based on its performance, enhancing the overall learning performance. Competitive distillation organizes a group of networks to perform a shared task and engage in competition, where competitive optimization is proposed to improve the parameter updating process. We further introduce stochastic perturbation in competitive distillation, aiming to motivate networks to induce mutations to achieve better visual representations and global optimum. The experimental results show that competitive distillation achieves promising performance in diverse tasks and datasets.
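The core idea, that whichever network currently performs best acts as the teacher for that round, can be illustrated with a deliberately tiny toy. The sketch below is not the paper's algorithm: each "network" is a single scalar parameter, the pull toward the teacher stands in for a distillation loss, and the Gaussian noise stands in for the stochastic perturbation. The function name `competitive_step` and all hyperparameters are hypothetical.

```python
import random


def competitive_step(params, loss_fn, grad_fn, lr=0.1, pull=0.5, noise=0.01, rng=random):
    """One round of a toy competitive update; returns (new_params, teacher_index)."""
    losses = [loss_fn(p) for p in params]
    teacher = losses.index(min(losses))  # the best performer teaches this round
    updated = []
    for i, p in enumerate(params):
        step = -lr * grad_fn(p)          # ordinary task gradient
        if i != teacher:
            # Students are additionally pulled toward the current teacher.
            step += lr * pull * (params[teacher] - p)
        # Small random perturbation applied to every network.
        step += rng.gauss(0.0, noise)
        updated.append(p + step)
    return updated, teacher


def loss(p):
    """Toy task: the optimum is p = 3."""
    return (p - 3.0) ** 2


def grad(p):
    return 2.0 * (p - 3.0)


params = [0.0, 2.5, 10.0]
rng = random.Random(0)
first_teacher = None
for round_index in range(50):
    params, teacher = competitive_step(params, loss, grad, rng=rng)
    if round_index == 0:
        first_teacher = teacher

print(first_teacher, params)
```

Because the teacher is re-selected every round, the role can move between networks as their losses change, which is the contrast with a fixed teacher-student setup.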
[CV-119] MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition ICML2025
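[Quick Read]: This paper addresses insufficient modeling of spatial-temporal dynamics in video understanding: existing methods usually process spatial and temporal information separately and struggle to capture the full complexity of video. The key to its solution is the MoMa framework, which integrates Mamba's selective state space modeling into image foundation models (IFMs) and introduces a new SeqMod operation that injects spatial-temporal information without disrupting the original features, enabling efficient full spatial-temporal modeling.
Link: https://arxiv.org/abs/2506.23283
Authors: Yuhuan Yang,Chaofan Ma,Zhenjie Mao,Jiangchao Yao,Ya Zhang,Yanfeng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 paper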
Abstract:Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba’s selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.
[CV-120] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector
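[Quick Read]: This paper addresses a blind spot in video anomaly detection (VAD): anomalies located in local modes near the learned distribution, which conventional likelihood-based methods struggle to identify. The key to its solution is an analysis of three gaps specific to VAD (scene, motion, and appearance) and three corresponding mechanisms. First, a noise-conditioned score transformer is built for denoising score matching. Second, a scene-dependent and motion-aware score function is introduced by embedding the scene condition of the input sequence into the model and assigning motion weights based on key-frame differences. Third, a novel autoregressive denoising score matching mechanism integrates unaffected visual information at inference, enhancing appearance perception and accumulating abnormal context to yield a more comprehensive anomaly indicator.
Link: https://arxiv.org/abs/2506.23282
Authors: Hanwen Zhang,Congqi Cao,Qinyi Lv,Lingtong Min,Yanning Zhang
Affiliations: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: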
Abstract:Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is attracting growing interest, as it can model the normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these "unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. By autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function, which enhances appearance perception and accumulates the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.
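For readers unfamiliar with the term, the standard noise-conditioned denoising score matching objective is reproduced below. This is the textbook form, not necessarily the exact loss used in the paper; the symbols $s_\theta$ (score network), $\sigma$ (noise level), and $\tilde{x}$ (noised sample) are notation introduced here for illustration.

```latex
\mathcal{L}(\theta;\sigma)
  = \frac{1}{2}\,
    \mathbb{E}_{x \sim p_{\text{data}}}\,
    \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\,\sigma^{2} I)}
    \left[
      \left\lVert
        s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^{2}}
      \right\rVert_2^{2}
    \right]
```

The target $-(\tilde{x} - x)/\sigma^{2}$ is the score of the Gaussian perturbation kernel, so minimizing this loss trains $s_\theta$ to point from a noised sample back toward clean data.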
[CV-121] Why Settle for One? Text-to-ImageSet Generation and Evaluation
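[Quick Read]: This paper addresses Text-to-ImageSet (T2IS) generation: producing sets of images that satisfy diverse consistency requirements specified by user instructions. Existing methods are usually limited to a specific domain or a specific aspect of consistency, which restricts their generalizability. The key to its solution is AutoT2IS, a training-free framework that exploits the in-context capabilities of pretrained Diffusion Transformers to harmonize visual elements, satisfying both image-level prompt alignment and set-level visual consistency.
Link: https://arxiv.org/abs/2506.23275
Authors: Chengyou Jia,Xin Shen,Zhuohang Dang,Zhuohang Dang,Changliang Xia,Weijia Wu,Xinyu Zhang,Hangwei Qian,Ivor W. Tsang,Minnan Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: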
Abstract:Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project in this https URL.
[CV-122] Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation
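[Quick Read]: This paper addresses the inefficiency of adapting large-scale pretrained transformer models to downstream audio-visual tasks, particularly in memory consumption and training time. The key to its solution is Meta-Token Learning (Mettle), whose core is a lightweight Layer-Centric Distillation (LCD) module that distills, in parallel, the intact audio or visual features of each transformer layer into compact meta-tokens, balancing preservation of pretrained knowledge with task-specific adaptation. To support fine-grained segmentation tasks, a Meta-Token Injection (MTI) module uses the meta-tokens distilled from the top layer to guide feature adaptation in earlier layers.
Link: https://arxiv.org/abs/2506.23271
Authors: Jinxing Zhou,Zhihui Li,Yongqiang Yu,Yanghao Zhou,Ruohao Guo,Guangyao Li,Yuxin Mao,Mingfei Han,Xiaojun Chang,Meng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report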
Abstract:We present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audiovisual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.
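[CV-123] Token Activation Map to Visually Explain Multimodal LLMs ICCV2025
[Quick Read]: This paper addresses the limited explainability of multimodal large language models (MLLMs), which hinders deeper understanding, model credibility, and effective visualization. Conventional vision models (e.g., CNNs, ViTs, CLIP) typically produce a single output, whereas MLLMs generate token sequences progressively, with each token depending on the preceding context. Earlier context tokens can therefore introduce redundant activations that interfere with the explanation of later tokens. Existing studies often overlook this issue, but the paper observes that these redundant correlations significantly reduce the reliability of explanations. It proposes an estimated causal inference method to mitigate context interference, together with a novel rank Gaussian filter to further reduce activation noise. The method, called Token Activation Map (TAM), accounts for interactions between tokens, which makes it better suited to multi-token explanation than the conventional Class Activation Map (CAM).
Link: https://arxiv.org/abs/2506.23270
Authors: Yi Li,Hualiang Wang,Xinpeng Ding,Haonan Wang,Xiaomeng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICCV2025 Accepted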
Abstract:Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noise. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. The name also indicates that TAM excels at explaining multiple tokens of an MLLM, which differs from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, visual comparison of MLLMs, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc.). The code is available at this http URL.
[CV-124] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis ICCV2025
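[Quick Read]: This paper addresses how to faithfully incorporate real-world causal relations into synthetic videos, with the aim of improving the ability of self-driving cars to respond to unforeseen accidents. The core challenge is reflecting the causal entities and behaviors of real traffic accidents in the synthesized video. The key to its solution is Causal-VidSyn, a novel diffusion model that combines accident reason answering and gaze-conditioned selection modules, using accident descriptions and driver fixations to identify accident participants and their related behaviors and thereby ground causal entities in video generation.
Link: https://arxiv.org/abs/2506.23263
Authors: Lei-lei Li,Jianwu Fang,Junbin Xiao,Shanmin Pang,Hongkai Yu,Chen Lv,Jianru Xue,Tat-Seng Chua
Affiliations: Xi’an Jiaotong University; National University of Singapore; Nanyang Technological University; Cleveland State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025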
Abstract:Egocentric comprehension of the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate testing of the ability to respond to accidents that would be unaffordable to stage in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.
[CV-125] PCLVis: Visual Analytics of Process Communication Latency in Large-Scale Simulation
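[Quick Read]: This paper addresses the scalability problem caused by inter-process communication overhead in large-scale parallel simulations. The key to its solution is the PCLVis framework, which analyzes process communication latency (PCL) events using MPI process communication data rather than physical link layer information. It locates PCL events spatially by constructing a process-correlation tree, analyzes their propagation paths with a communication-dependency-based directed acyclic graph (DAG), and uses a sliding window algorithm to generate PCL event abstractions along with a communication state glyph (CS-Glyph), helping users interactively explore and optimize simulation efficiency.
Link: https://arxiv.org/abs/2506.23257
Authors: Chongke Bi,Xin Gao,Baofeng Fu,Yuheng Zhao,Siming Chen,Ying Zhao,Yunhai Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: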
Abstract:Large-scale simulations on supercomputers have become important tools for users. However, their scalability remains a problem due to the huge communication cost among parallel processes. Most of the existing communication latency analysis methods rely on the physical link layer information, which is only available to administrators. In this paper, a framework called PCLVis is proposed to help general users analyze process communication latency (PCL) events. Instead of the physical link layer information, the PCLVis uses the MPI process communication data for the analysis. First, a spatial PCL event locating method is developed. All processes with high correlation are classified into a single cluster by constructing a process-correlation tree. Second, the propagation path of PCL events is analyzed by constructing a communication-dependency-based directed acyclic graph (DAG), which can help users interactively explore a PCL event from the temporal evolution of a located PCL event cluster. In this graph, a sliding window algorithm is designed to generate the PCL events abstraction. Meanwhile, a new glyph called the communication state glyph (CS-Glyph) is designed for each process to show its communication states, including its in/out messages and load balance. Each leaf node can be further unfolded to view additional information. Third, a PCL event attribution strategy is formulated to help users optimize their simulations. The effectiveness of the PCLVis framework is demonstrated by analyzing the PCL events of several simulations running on the TH-1A supercomputer. By using the proposed framework, users can greatly improve the efficiency of their simulations.
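[CV-126] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution
[Quick Read]: This paper addresses the trade-off between realistic image generation and computational efficiency in diffusion-model-based image super-resolution, especially the unrealistic, hazy images produced when inference time is shortened by reducing sampling steps. The key to its solution is PixelBoost, a new diffusion model that emphasizes exploiting the stochastic nature of Brownian motion in image super-resolution. By integrating controlled stochasticity into training, it avoids convergence to local optima and captures and reproduces the inherent uncertainty of image textures and patterns, improving realism, detail, and edge reconstruction.
Link: https://arxiv.org/abs/2506.23254
Authors: Aradhana Mishra,Bumshik Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: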
Abstract:Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly in texture and edge definition. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM), as well as visual quality. To assess edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.
[CV-127] DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection
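[Quick Read]: This paper addresses small-object detection from unmanned aerial vehicles (UAVs) in complex aerial scenes, in particular the performance degradation that existing methods suffer on multi-modal inputs because they prioritize inference speed. The key to its solution is the DGE-YOLO framework, which introduces a dual-branch structure for modality-specific feature extraction, adds an Efficient Multi-scale Attention (EMA) mechanism to enrich semantic representation, and replaces the conventional neck with a Gather-and-Distribute module to reduce information loss during feature aggregation.
Link: https://arxiv.org/abs/2506.23252
Authors: Kunwei Lv,Ping Lan
Affiliations: Xizang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures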
Abstract:The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. Existing approaches often prioritize inference speed, leading to degraded performance when handling multi-modal inputs. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.
[CV-128] Aggregating Local Saliency Maps for Semi-Global Explainable Image Classification
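[Quick Read]: This paper addresses the difficulty of interpreting how deep learning models reach predictions in image classification: existing local explanation methods (such as saliency maps) cannot efficiently reveal recurring patterns in model decisions, while global methods oversimplify and may miss important local behaviours. The key to its solution is Segment Attribution Tables (SATs), which summarise local saliency explanations into (semi-)global insights by taking image segments (such as "eyes" in Chihuahuas) and using saliency maps to quantify their influence, revealing the concepts the model relies on as well as potential spurious correlations.
Link: https://arxiv.org/abs/2506.23247
Authors: James Hinns,David Martens
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: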
Abstract:Deep learning dominates image classification tasks, yet understanding how models arrive at predictions remains a challenge. Much research focuses on local explanations of individual predictions, such as saliency maps, which visualise the influence of specific pixels on a model’s prediction. However, reviewing many of these explanations to identify recurring patterns is infeasible, while global methods often oversimplify and miss important local behaviours. To address this, we propose Segment Attribution Tables (SATs), a method for summarising local saliency explanations into (semi-)global insights. SATs take image segments (such as “eyes” in Chihuahuas) and leverage saliency maps to quantify their influence. These segments highlight concepts the model relies on across instances and reveal spurious correlations, such as reliance on backgrounds or watermarks, even when out-of-distribution test performance sees little change. SATs can explain any classifier for which a form of saliency map can be produced, using segmentation maps that provide named segments. SATs bridge the gap between oversimplified global summaries and overly detailed local explanations, offering a practical tool for analysing and debugging image classifiers.
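To make the aggregation idea concrete, here is a minimal sketch of how per-image saliency could be rolled up by named segment. The abstract does not specify the aggregation statistic, so the choice here (each segment's share of total absolute saliency, averaged over the images in which the segment appears) is an assumption, as are the function name and the toy data.

```python
import numpy as np

def segment_attribution_table(saliency_maps, segment_maps, segment_names):
    # For every image, compute each named segment's share of total |saliency|,
    # then average the shares over the images where that segment appears.
    shares = {name: [] for name in segment_names.values()}
    for saliency, segments in zip(saliency_maps, segment_maps):
        magnitude = np.abs(saliency)
        total = magnitude.sum()
        if total == 0:
            continue  # no attribution to distribute for this image
        for seg_id, name in segment_names.items():
            mask = segments == seg_id
            if mask.any():
                shares[name].append(magnitude[mask].sum() / total)
    return {name: float(np.mean(v)) if v else 0.0 for name, v in shares.items()}

# Toy data: two 2x2 "images" with hypothetical segment ids.
saliency_maps = [np.array([[0.0, 1.0], [1.0, 2.0]]),
                 np.array([[3.0, 0.0], [0.0, 1.0]])]
segment_maps = [np.array([[0, 0], [1, 1]]),
                np.array([[1, 0], [0, 0]])]
segment_names = {0: "background", 1: "eyes"}

table = segment_attribution_table(saliency_maps, segment_maps, segment_names)
```

In this toy case the "eyes" segment receives three quarters of the saliency in both images, so a table like this would flag it as a concept the classifier leans on. A large share for "background" would instead hint at a spurious correlation.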
[CV-129] VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions Contacts and Collisions ICCV2025
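[Quick Read]: This paper addresses the inefficiency and high computational cost of conventional parametric human body models when handling interactions with other geometric entities such as objects and scenes. Existing approaches are either insufficiently robust for complex human articulations or impose high computational and memory costs. The key to its solution is VolumetricSMPL, a neural volumetric body model that uses Neural Blend Weights (NBW) to generate compact yet efficient MLP decoders. By dynamically blending a small set of learned weight matrices, it significantly improves computational efficiency while preserving expressiveness.
Link: https://arxiv.org/abs/2506.23236
Authors: Marko Mihajlovic,Siwei Zhang,Gen Li,Kaifeng Zhao,Lea Müller,Siyu Tang
Affiliations: ETH Zürich; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: [ICCV 2025] this https URL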
Abstract:Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10x faster inference, 6x lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL’s strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
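The blending step described in the abstract (a small set of learned weight matrices combined with predicted coefficients) can be sketched in a few lines. This is an illustration only: the shapes, the ReLU layer, and the fixed coefficient vector are assumptions, and in the paper the coefficients are predicted from shape and pose rather than hard-coded.

```python
import numpy as np

def blend_weights(basis, coeffs):
    # basis: (K, out_dim, in_dim) learned matrices; coeffs: (K,) coefficients.
    # Returns W = sum_k coeffs[k] * basis[k] with shape (out_dim, in_dim).
    return np.tensordot(coeffs, basis, axes=1)

def blended_layer(x, basis, coeffs):
    # One MLP layer whose weight matrix is assembled on the fly (ReLU assumed).
    return np.maximum(blend_weights(basis, coeffs) @ x, 0.0)

rng = np.random.default_rng(0)
basis = rng.standard_normal((4, 8, 3))    # K=4 hypothetical basis matrices
coeffs = np.array([0.1, 0.2, 0.3, 0.4])   # stand-in for predicted coefficients
query_point = np.array([0.5, -0.2, 1.0])  # e.g. a 3D query location

W = blend_weights(basis, coeffs)
features = blended_layer(query_point, basis, coeffs)
```

The appeal of this design is that only K small matrices are stored, while the effective decoder weights still vary with the body's shape and pose through the coefficients.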
[CV-130] High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation
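[Quick Read]: This paper addresses indoor point cloud semantic segmentation under scene-level annotation, a setting less explored than methods relying on sparse point-level labels. Lacking precise point-level labels, current methods first generate point-level pseudo-labels and then train a segmentation model on them, but generating accurate pseudo-labels from scene-level annotation alone is difficult and substantially affects segmentation performance. The key to its solution is a high-quality pseudo-label generation framework that improves accuracy by exploiting multi-modal information and region-point semantic consistency. Specifically, a cross-modal feature guidance module uses 2D-3D correspondences to align point cloud features with the corresponding 2D image pixels, assisting point cloud feature learning. A region-point semantic consistency module produces regional semantics through a region-voting strategy and uses them to guide point-level semantic predictions, rectifying inaccurate predictions during training and yielding high-quality pseudo-labels.
Link: https://arxiv.org/abs/2506.23227
Authors: Lunhao Duan,Shanshan Zhao,Xingxing Weng,Jing Zhang,Gui-Song Xia
Affiliations: Wuhan University; JD Explore Academy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TPAMI. Code: this https URL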
Abstract:This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on ScanNet v2 and S3DIS datasets under scene-level annotation can demonstrate the effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach’s individual components. The code is available at this https URL .
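A region-voting step of the kind mentioned in the abstract can be sketched as a per-region majority vote over point-level predictions. The abstract does not give the exact voting rule, so plain majority voting is an assumption here, and the function name and toy arrays are hypothetical.

```python
import numpy as np

def region_vote(point_labels, region_ids, num_classes):
    # Assign every point the most frequent predicted class within its region.
    voted = np.empty_like(point_labels)
    for region in np.unique(region_ids):
        mask = region_ids == region
        counts = np.bincount(point_labels[mask], minlength=num_classes)
        voted[mask] = counts.argmax()
    return voted

# Seven points in two regions; one noisy prediction per region.
point_labels = np.array([0, 0, 1, 2, 2, 2, 1])
region_ids = np.array([0, 0, 0, 1, 1, 1, 1])

voted = region_vote(point_labels, region_ids, num_classes=3)
```

Here the stray label in each region is overwritten by the regional majority, which is how regional semantics can rectify isolated point-level errors before the labels are reused for training.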
[CV-131] Single Image Inpainting and Super-Resolution with Simultaneous Uncertainty Guarantees by Universal Reproducing Kernels
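[Quick Read]: This paper addresses the estimation of missing pixels in images, which is central to image inpainting and super-resolution. The key to its solution is a statistical learning method, Simultaneously Guaranteed Kernel Interpolation (SGKI), which not only estimates the missing pixels but also provides uncertainty quantification. Its core assumption is that the data-generating function comes from a Reproducing Kernel Hilbert Space (RKHS), with special emphasis on band-limited functions, which are central to signal processing and form Paley-Wiener type RKHSs.
Link: https://arxiv.org/abs/2506.23221
Authors: Bálint Horváth,Balázs Csanád Csáji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 8 figures, 6 tables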
Abstract:The paper proposes a statistical learning approach to the problem of estimating missing pixels of images, crucial for image inpainting and super-resolution problems. One of the main novelties of the method is that it also provides uncertainty quantifications together with the estimated values. Our core assumption is that the underlying data-generating function comes from a Reproducing Kernel Hilbert Space (RKHS). A special emphasis is put on band-limited functions, central to signal processing, which form Paley-Wiener type RKHSs. The proposed method, which we call Simultaneously Guaranteed Kernel Interpolation (SGKI), is an extension and refinement of a recently developed kernel method. An advantage of SGKI is that it not only estimates the missing pixels, but also builds non-asymptotic confidence bands for the unobserved values, which are simultaneously guaranteed for all missing pixels. We also show how to compute these bands efficiently using Schur complements, we discuss a generalization to vector-valued functions, and we present a series of numerical experiments on various datasets containing synthetically generated and benchmark images, as well.
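As background for the RKHS assumption, the sketch below shows plain minimum-norm kernel interpolation of missing samples along one image row with a band-limited sinc kernel. It is not SGKI: the confidence bands and the Schur-complement computation from the paper are not reproduced, and the bandwidth value, function names, and pixel data are all assumptions made for illustration.

```python
import numpy as np

def sinc_kernel(a, b, bandwidth=0.5):
    # Scaled band-limited kernel k(x, y) = sinc(bandwidth * (x - y)),
    # where np.sinc(t) = sin(pi * t) / (pi * t).
    return np.sinc(bandwidth * (a[:, None] - b[None, :]))

def kernel_interpolate(x_obs, y_obs, x_new, bandwidth=0.5):
    # Minimum-norm interpolant: f(x) = k(x, X) K^{-1} y.
    gram = sinc_kernel(x_obs, x_obs, bandwidth)
    weights = np.linalg.solve(gram, y_obs)
    return sinc_kernel(x_new, x_obs, bandwidth) @ weights

# One image row with pixels 2 and 5 missing (hypothetical intensities).
x_obs = np.array([0.0, 1.0, 3.0, 4.0, 6.0])
y_obs = np.array([0.2, 0.5, 0.9, 0.7, 0.1])
x_missing = np.array([2.0, 5.0])

estimates = kernel_interpolate(x_obs, y_obs, x_missing)
refit = kernel_interpolate(x_obs, y_obs, x_obs)  # should reproduce y_obs
```

The interpolant passes exactly through the observed pixels. What SGKI adds on top, according to the abstract, is a non-asymptotic band around each estimate that holds simultaneously for all missing pixels.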
[CV-132] A Hierarchical Slice Attention Network for Appendicitis Classification in 3D CT Scans
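[Quick Read]: This paper addresses timely and accurate diagnosis of acute appendicitis to prevent serious complications. Although CT imaging remains the standard diagnostic tool, the growing number of cases can overwhelm radiologists and delay diagnosis. The proposed solution is a deep learning model that classifies appendicitis from 3D CT scans and incorporates a Slice Attention mechanism guided by external 2D datasets to improve detection of small lesions. It also introduces a hierarchical classification framework that uses pre-trained 2D models to distinguish simple from complicated appendicitis. The key lies in combining 3D imaging with attention guided by 2D datasets and in using pre-trained models to improve classification, which raises AUC by 3% for appendicitis and 5.9% for complicated appendicitis.
Link: https://arxiv.org/abs/2506.23209
Authors: Chia-Wen Huang,Haw Hwai,Chien-Chang Lee,Pei-Yuan Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure, 3 tables. Published in IEEE ISBI 2025. This version corrects citation numbering errors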
Abstract:Timely and accurate diagnosis of appendicitis is critical in clinical settings to prevent serious complications. While CT imaging remains the standard diagnostic tool, the growing number of cases can overwhelm radiologists, potentially causing delays. In this paper, we propose a deep learning model that leverages 3D CT scans for appendicitis classification, incorporating Slice Attention mechanisms guided by external 2D datasets to enhance small lesion detection. Additionally, we introduce a hierarchical classification framework using pre-trained 2D models to differentiate between simple and complicated appendicitis. Our approach improves AUC by 3% for appendicitis and 5.9% for complicated appendicitis, offering a more efficient and reliable diagnostic solution compared to previous work.
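[CV-133] TVG-SLAM: Robust Gaussian Splatting SLAM with Tri-view Geometric Constraints
[Quick Read]: This paper addresses the limited robustness of RGB-only 3D Gaussian Splatting (3DGS) SLAM systems in unbounded outdoor environments, which results from their reliance on photometric rendering loss for camera tracking. The key to its solution is TVG-SLAM, which introduces a novel tri-view geometry paradigm: a dense tri-view matching module produces consistent tri-view correspondences, from which complementary geometric constraints are built and combined with photometric loss for stable pose estimation. A probabilistic initialization strategy and a Dynamic Attenuation of Rendering Trust mechanism further improve map quality and mitigate tracking drift.
Link: https://arxiv.org/abs/2506.23207
Authors: Zhen Tan,Xieyuanli Chen,Lei Feng,Yangbing Ge,Shuaifeng Zhi,Jiaxiong Liu,Dewen Hu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: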
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled RGB-only SLAM systems to achieve high-fidelity scene representation. However, the heavy reliance of existing systems on photometric rendering loss for camera tracking undermines their robustness, especially in unbounded outdoor environments with severe viewpoint and illumination changes. To address these challenges, we propose TVG-SLAM, a robust RGB-only 3DGS SLAM system that leverages a novel tri-view geometry paradigm to ensure consistent tracking and high-quality mapping. We introduce a dense tri-view matching module that aggregates reliable pairwise correspondences into consistent tri-view matches, forming robust geometric constraints across frames. For tracking, we propose Hybrid Geometric Constraints, which leverage tri-view matches to construct complementary geometric cues alongside photometric loss, ensuring accurate and stable pose estimation even under drastic viewpoint shifts and lighting variations. For mapping, we propose a new probabilistic initialization strategy that encodes geometric uncertainty from tri-view correspondences into newly initialized Gaussians. Additionally, we design a Dynamic Attenuation of Rendering Trust mechanism to mitigate tracking drift caused by mapping latency. Experiments on multiple public outdoor datasets show that our TVG-SLAM outperforms prior RGB-only 3DGS-based SLAM systems. Notably, in the most challenging dataset, our method improves tracking robustness, reducing the average Absolute Trajectory Error (ATE) by 69.0% while achieving state-of-the-art rendering quality. The implementation of our method will be released as open-source.
[CV-134] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion
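[Quick Read]: This paper addresses the shortcomings of existing diffusion-based 3D shape completion methods in modeling the optimal global transport path and generating fine-grained geometric details. The key to its solution is the BridgeShape framework, which formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure global coherence, and introduces a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) that encodes 3D shapes into a compact, structurally informative latent space, mitigating resolution constraints and improving completion quality.
Link: https://arxiv.org/abs/2506.23205
Authors: Dequan Kong,Zhe Zhu,Honghua Chen,Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: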
Abstract:Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.
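[CV-135] Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing
[Quick Read]: This paper addresses two main problems facing transformer-based person search: the self-attention mechanism suppresses high-frequency components in the features, which severely impacts performance, and transformers are computationally expensive. The key to its solution is the High-frequency Augmentation and Multi-Wave mixing (HAMW) method, which improves perception of high-frequency features by learning from augmented inputs containing additional high-frequency components, and replaces the self-attention layers with a multi-level Haar wavelet fusion strategy to lower computational complexity and make better use of multi-scale features.
Link: https://arxiv.org/abs/2506.23202
Authors: Qilin Shu,Qixian Zhang,Qi Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: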
Abstract:The person search task aims to locate a target person within a set of scene images. In recent years, transformer-based models in this field have made some progress. However, they still face two primary challenges: 1) the self-attention mechanism tends to suppress high-frequency components in the features, which severely impacts model performance; 2) the computational cost of transformers is relatively high. To address these issues, we propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers while reducing computational overhead and improving efficiency. Specifically, we develop a three-stage framework that progressively optimizes both detection and re-identification performance. Our model enhances the perception of high-frequency features by learning from augmented inputs containing additional high-frequency components. Furthermore, we replace the self-attention layers in the transformer with a strategy based on multi-level Haar wavelet fusion to capture multi-scale features. This not only lowers the computational complexity but also alleviates the suppression of high-frequency features and enhances the ability to exploit multi-scale information. Extensive experiments demonstrate that HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.
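To make the phrase "multi-level Haar wavelet" concrete, here is a minimal 1D sketch of the decomposition itself. It is not the HAMW fusion module: the function names are hypothetical, and it assumes the averaging convention (divide by 2) rather than the orthonormal one. Each level splits a signal into a low-frequency band (pairwise averages) and a high-frequency band (pairwise half-differences), and the high-frequency bands are kept explicitly rather than smoothed away.

```python
def haar_1d(signal):
    """One level of the 1D Haar transform (averaging convention).

    Returns (low, high): pairwise averages and pairwise half-differences.
    The input length must be even.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_multilevel(signal, levels):
    """Apply haar_1d repeatedly to the low-pass band.

    Returns (final_low, [high_level_1, high_level_2, ...]).
    """
    highs = []
    low = list(signal)
    for _ in range(levels):
        low, high = haar_1d(low)
        highs.append(high)
    return low, highs


def haar_inverse(low, highs):
    """Invert haar_multilevel and reconstruct the original signal."""
    for high in reversed(highs):
        rebuilt = []
        for average, difference in zip(low, high):
            rebuilt.extend([average + difference, average - difference])
        low = rebuilt
    return low


low, highs = haar_multilevel([4, 6, 10, 12, 8, 8, 0, 2], levels=2)
print(low, highs)
```

Each level costs time linear in the signal length, which is the usual reason wavelet mixing is cheaper than self-attention over all token pairs.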
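[CV-136] DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding
[Quick read]: This paper addresses the difficulty of modeling multimodal interactions in long videos with overlapping events and complex temporal dependencies, in particular dense semantic action localization at fine-grained temporal resolution. The key to the solution is the DEL framework, which has two core modules: an audio-visual feature alignment module that uses masked self-attention to strengthen intra-modal consistency, and a multimodal interaction refinement module that models cross-modal dependencies at multiple scales to improve the interplay between high-level semantics and fine-grained details.
Link: https://arxiv.org/abs/2506.23196
Authors: Mona Ahmadian, Amir Shirian, Frank Guerin, Andrew Gilbert
Affiliations: University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: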
Abstract:Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization, aiming to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: the alignment of audio and visual features that leverage masked self-attention to enhance intra-mode consistency and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, enabling high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100, surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, +1.7% (verb), and +1.4% (noun), respectively.
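[CV-137] Trident: Detecting Face Forgeries with Adversarial Triplet Learning
[Quick read]: This paper addresses the problem of maintaining digital media authenticity and countering visual disinformation as face forgeries generated by deep neural networks grow more sophisticated. Existing detection models rely mainly on supervised training with domain-specific data and struggle with forgery techniques they have not seen. The key to the proposed solution is the Trident framework, which uses triplet learning with a Siamese network architecture: it is trained on carefully curated triplets to capture subtle differences among forged samples and extract fine-grained features that separate pristine samples from manipulated ones, applies domain-adversarial training to improve generalization to unseen manipulation methods, and blocks gradient flow from the classifier head to the embedding model to avoid overfitting.
Link: https://arxiv.org/abs/2506.23189
Authors: Mustafa Hakan Kara, Aysegul Dundar, Uğur Güdükbay
Affiliations: Bilkent University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, and 7 tables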
Abstract:As face forgeries generated by deep neural networks become increasingly sophisticated, detecting face manipulations in digital media has posed a significant challenge, underscoring the importance of maintaining digital media integrity and combating visual disinformation. Current detection models, predominantly based on supervised training with domain-specific data, often falter against forgeries generated by unencountered techniques. In response to this challenge, we introduce Trident, a face forgery detection framework that employs triplet learning with a Siamese network architecture for enhanced adaptability across diverse forgery methods. Trident is trained on curated triplets to isolate nuanced differences of forgeries, capturing fine-grained features that distinguish pristine samples from manipulated ones while controlling for other variables. To further enhance generalizability, we incorporate domain-adversarial training with a forgery discriminator. This adversarial component guides our embedding model towards forgery-agnostic representations, improving its robustness to unseen manipulations. In addition, we prevent gradient flow from the classifier head to the embedding model, avoiding overfitting induced by artifacts peculiar to certain forgeries. Comprehensive evaluations across multiple benchmarks and ablation studies demonstrate the effectiveness of our framework. We will release our code in a GitHub repository.
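For readers unfamiliar with triplet learning, the sketch below shows the standard triplet margin loss on toy embedding vectors. The abstract does not state which distance or margin Trident uses, so the Euclidean distance, the margin of 1.0, and the function names here are assumptions made for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2D embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]       # distance 5 from the anchor
far_negative = [6.0, 8.0]   # distance 10 from the anchor
near_negative = [0.0, 1.0]  # distance 1 from the anchor

print(triplet_margin_loss(anchor, positive, far_negative))
print(triplet_margin_loss(anchor, positive, near_negative))
```

Only the second triplet produces a nonzero loss, because its negative sits closer to the anchor than the positive does.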
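[CV-138] STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene
[Quick read]: This paper addresses the spatiotemporal feature mismatch between static background and dynamic objects in high-dynamic scene reconstruction. Conventional unified representation models (e.g., Gaussians) struggle with the potential temporal discontinuity caused by frame imaging and with the heterogeneous spatial features of background and objects. The key to the solution is to introduce an event camera to compensate for the frame camera and to propose a spatiotemporal-disentangled Gaussian splatting framework that decomposes spatiotemporal features into different latent representations, alleviating the mismatch between background and objects. In addition, the spatiotemporal features of background and objects are distinguished via clustering, and the consistent spatiotemporal characteristics shared by Gaussian representations and event data serve as a prior to guide the spatiotemporal disentanglement of object Gaussians, which improves the spatiotemporal discrimination between background and objects and enables time-continuous dynamic scene reconstruction.
Link: https://arxiv.org/abs/2506.23157
Authors: Hanyu Zhou, Haonan Wang, Haoyue Liu, Yuxing Duan, Luxin Yan, Gim Hee Lee
Affiliations: Huazhong University of Science and Technology; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: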
Abstract:High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of the dynamic scene from a frame camera. However, this unified paradigm fails to handle the potentially discontinuous temporal features of objects caused by frame imaging and the heterogeneous spatial features of background and objects. To address this issue, we disentangle the spatiotemporal features into various latent representations to alleviate the spatiotemporal mismatch between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we find that background and objects have an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features of background and objects via clustering. For dynamic objects, we discover that Gaussian representations and event data share consistent spatiotemporal characteristics, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement can improve the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments have been performed to verify the superiority of the proposed method.
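[CV-139] Self-Supervised Contrastive Learning for Multi-Label Images
[Quick read]: This paper addresses two problems with mainstream self-supervised learning (SSL) methods: their reliance on high-volume single-label datasets such as ImageNet leads to excessive pre-training overhead, and multi-label images receive too little attention. The key to the solution is a block-wise augmentation module that extracts additional positive view pairs from multi-label images, together with an image-aware contrastive loss that links these views, which improves the learning of semantically consistent representations.
Link: https://arxiv.org/abs/2506.23156
Authors: Jiale Chen
Affiliations: Hohai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: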
Abstract:Self-supervised learning (SSL) has demonstrated its effectiveness in learning representations through comparison methods that align with human intuition. However, mainstream SSL methods heavily rely on large-scale single-label datasets, such as ImageNet, resulting in intolerable pre-training overhead. Besides, more general multi-label images are frequently overlooked in SSL, despite their potential for richer semantic information and broader applicability in downstream scenarios. Therefore, we tailor the mainstream SSL approach to guarantee excellent representation learning capabilities using fewer multi-label images. Firstly, we propose a block-wise augmentation module aimed at extracting additional potential positive view pairs from multi-label images. Subsequently, an image-aware contrastive loss is devised to establish connections between these views, thereby facilitating the extraction of semantically consistent representations. Comprehensive linear fine-tuning and transfer learning validate the competitiveness of our approach despite challenging sample quality and quantity.
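[CV-140] Dynamic View Synthesis from Small Camera Motion Videos
[Quick read]: This paper addresses novel view synthesis for dynamic 3D scenes, in particular the difficulties existing methods face in representing scene geometry and estimating camera parameters when the camera motion range in the input images or videos is limited or the camera is even stationary (i.e., small camera motion). The key to the solution is Distribution-based Depth Regularization (DDR): Gumbel-softmax is used to differentiably sample from the discrete rendering weight distribution so that the expectation of the error can be computed, which aligns the rendering weight distribution with the true distribution, and constraints push the volume density of spatial points along the ray before the object boundary toward zero so that the correct scene geometry is learned. In addition, camera parameter learning is incorporated during training, making the model more robust to camera parameters.
Link: https://arxiv.org/abs/2506.23153
Authors: Huiqiang Sun, Xingyi Li, Juewen Peng, Liao Shen, Zhiguo Cao, Ke Xian, Guosheng Lin
Affiliations: Huazhong University of Science and Technology; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TVCG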
Abstract:Novel view synthesis for dynamic 3D scenes poses a significant challenge. Many notable efforts use NeRF-based approaches to address this task and yield impressive results. However, these methods rely heavily on sufficient motion parallax in the input images or videos. When the camera motion range becomes limited or even stationary (i.e., small camera motion), existing methods encounter two primary challenges: incorrect representation of scene geometry and inaccurate estimation of camera parameters. These challenges make prior methods struggle to produce satisfactory results or even become invalid. To address the first challenge, we propose a novel Distribution-based Depth Regularization (DDR) that ensures that the rendering weight distribution aligns with the true distribution. Specifically, unlike previous methods that use depth loss to calculate the error of the expectation, we calculate the expectation of the error by using Gumbel-softmax to differentiably sample points from the discrete rendering weight distribution. Additionally, we introduce constraints that enforce the volume density of spatial points before the object boundary along the ray to be near zero, ensuring that our model learns the correct geometry of the scene. To demystify the DDR, we further propose a visualization tool that enables observing the scene geometry representation at the rendering weight level. For the second challenge, we incorporate camera parameter learning during training to enhance the robustness of our model to camera parameters. We conduct extensive experiments to demonstrate the effectiveness of our approach in representing scenes with small camera motion input, and our results compare favorably to state-of-the-art methods.
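The distinction between "the error of the expectation" and "the expectation of the error" can be seen on a toy ray. The sketch below is an illustration under stated assumptions, not the authors' DDR loss: the absolute-error metric, the three-sample ray, the temperature, and all function names are hypothetical. A bimodal weight distribution can have exactly the right expected depth (so the error of the expectation is zero) while almost no weight sits at the true surface, which the expectation of the error exposes. A standard Gumbel-softmax relaxation is included to show how a discrete sample can be drawn as a smooth function of the weights.

```python
import math
import random


def error_of_expectation(weights, depths, target):
    """|E[depth] - target|: compares the mean rendered depth to the target."""
    expected_depth = sum(w * d for w, d in zip(weights, depths))
    return abs(expected_depth - target)


def expectation_of_error(weights, depths, target):
    """E[|depth - target|]: penalizes every sample that is off the surface."""
    return sum(w * abs(d - target) for w, d in zip(weights, depths))


def gumbel_softmax_sample(weights, tau=0.5, rng=random, eps=1e-10):
    """Relaxed one-hot sample from a discrete distribution (Gumbel-softmax).

    Adds Gumbel noise to the log-weights and applies a temperature softmax.
    """
    logits = []
    for w in weights:
        u = rng.random()
        gumbel = -math.log(-math.log(u + eps) + eps)
        logits.append((math.log(w + eps) + gumbel) / tau)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy ray with three sample points; the true surface is at depth 2.0.
depths = [1.0, 2.0, 3.0]
bimodal_weights = [0.5, 0.0, 0.5]  # no weight at the true surface

print(error_of_expectation(bimodal_weights, depths, 2.0))
print(expectation_of_error(bimodal_weights, depths, 2.0))
print(gumbel_softmax_sample(bimodal_weights, rng=random.Random(0)))
```

In this toy case the first quantity is 0 and the second is 1, so only the second would push the weights toward the true surface.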
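[CV-141] MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation ICCV2025
[Quick read]: This paper addresses the tension between accuracy and GPU memory consumption in optical flow estimation for high-resolution (e.g., FullHD) inputs. The key to the solution is MEMFOF, which systematically revisits the design choices of RAFT-like architectures and combines reduced correlation volumes, a high-resolution training protocol, and multi-frame estimation to substantially cut memory overhead while maintaining state-of-the-art performance, enabling training at native 1080p resolution without cropping or downsampling.
Link: https://arxiv.org/abs/2506.23151
Authors: Vladislav Bargatin, Egor Chistov, Alexander Yakovenko, Dmitriy Vatolin
Affiliations: Lomonosov Moscow State University; MSU Institute for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted at ICCV 2025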
Abstract:Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at this https URL.
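A back-of-the-envelope calculation suggests why "reduced correlation volumes" matter at 1080p. An all-pairs correlation volume, as used in RAFT-style methods, stores one value for every pair of feature-map positions, so its size grows with the square of the feature-map area. The downsampling factors (8 and 16), the 4-byte values, and the helper name below are illustrative assumptions, not figures taken from the abstract.

```python
import math


def correlation_volume_bytes(height, width, downsample, bytes_per_value=4):
    """Memory for one all-pairs correlation volume between two feature maps.

    The feature map has ceil(height / downsample) * ceil(width / downsample)
    positions, and the volume holds one value per ordered pair of positions.
    """
    feature_h = math.ceil(height / downsample)
    feature_w = math.ceil(width / downsample)
    positions = feature_h * feature_w
    return positions * positions * bytes_per_value


for factor in (8, 16):
    size = correlation_volume_bytes(1080, 1920, factor)
    print(f"1/{factor} resolution: {size / 1e9:.2f} GB")
```

Under these assumptions, halving the feature resolution shrinks a single volume by roughly a factor of 16 (about 4.2 GB versus about 0.27 GB for a 1080p frame pair).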
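[CV-142] AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation
[Quick read]: This paper addresses the lack of cross-view consistency (CVC) in the multi-view images synthesized by pre-trained generation models during single-image-to-3D generation, which significantly degrades 3D reconstruction performance. The key to the solution is the AlignCVC framework, which re-frames the single-image-to-3D pipeline through distribution alignment rather than strict regression losses; the core idea is to align both the generated and the reconstructed multi-view distributions with the ground-truth multi-view distribution, establishing a principled foundation for improving CVC.
Link: https://arxiv.org/abs/2506.23150
Authors: Xinyue Liang, Zhiyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: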
Abstract:Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.
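[CV-143] maneuverRecognition – A Python package for Timeseries Classification in the domain of Vehicle Telematics
[Quick read]: This paper addresses the automated recognition of driving maneuvers in vehicle telematics, which matters for classifying and evaluating driving behavior and, in turn, for personalizing insurance policies, improving road safety, reducing accidents and their associated costs, lowering fuel consumption, and supporting environmentally friendly driving. The key to the solution is a Python package, maneuverRecognition, which provides the functions needed for data preprocessing, modeling, and evaluation and includes a modifiable LSTM-based network structure, to meet the particular challenges that time series classification poses for data transfer, preprocessing, storage, model training, and prediction.
Link: https://arxiv.org/abs/2506.23147
Authors: Jonathan Schuster, Fabian Transchel
Affiliations: Harz University of Applied Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures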
Abstract:In the domain of vehicle telematics the automated recognition of driving maneuvers is used to classify and evaluate driving behaviour. This not only serves as a component to enhance the personalization of insurance policies, but also to increase road safety, reduce accidents and the associated costs as well as to reduce fuel consumption and support environmentally friendly driving. In this context maneuver recognition technically requires a continuous application of time series classification which poses special challenges to the transfer, preprocessing and storage of telematic sensor data, the training of predictive models, and the prediction itself. Although much research has been done in the field of gathering relevant data or regarding the methods to build predictive models for the task of maneuver recognition, there is a practical need for python packages and functions that allow to quickly transform data into the required structure as well as to build and evaluate such models. The maneuverRecognition package was therefore developed to provide the necessary functions for preprocessing, modelling and evaluation and also includes a ready to use LSTM based network structure that can be modified. The implementation of the package is demonstrated using real driving data of three different persons recorded via smartphone sensors.
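[CV-144] Forget-MI: Machine Unlearning for Forgetting Multimodal Information in Healthcare Settings
[Quick read]: This paper addresses how to effectively remove specific patient data from trained multimodal AI models in healthcare, a process known as machine unlearning. Existing methods struggle to fully remove data from multimodal architectures. The proposed Forget-MI method establishes loss functions and perturbation techniques to unlearn both the unimodal and the joint representations of the data to be forgotten, while preserving knowledge from the remaining data and maintaining performance comparable to the original model. The key to the solution is designing effective loss functions and perturbation mechanisms that balance forgetting against model performance.
Link: https://arxiv.org/abs/2506.23145
Authors: Shahad Hardan, Darya Taratynova, Abdelmajid Essofi, Karthik Nandakumar, Mohammad Yaqub
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: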
Abstract:Privacy preservation in AI is crucial, especially in healthcare, where models rely on sensitive patient data. In the emerging field of machine unlearning, existing methodologies struggle to remove patient data from trained multimodal architectures, which are widely used in healthcare. We propose Forget-MI, a novel machine unlearning method for multimodal medical data, by establishing loss functions and perturbation techniques. Our approach unlearns unimodal and joint representations of the data requested to be forgotten while preserving knowledge from the remaining data and maintaining comparable performance to the original model. We evaluate our results using performance on the forget dataset, performance on the test dataset, and Membership Inference Attack (MIA), which measures the attacker’s ability to distinguish the forget dataset from the training dataset. Our model outperforms the existing approaches that aim to reduce MIA and the performance on the forget dataset while keeping an equivalent performance on the test set. Specifically, our approach reduces MIA by 0.202 and decreases AUC and F1 scores on the forget set by 0.221 and 0.305, respectively. Additionally, our performance on the test set matches that of the retrained model, while allowing forgetting. Code is available at this https URL
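[CV-145] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis
[Quick read]: This paper tries to address insufficient semantic alignment between user descriptions and generated images in text-to-image generation: existing text-to-image prompt engineering methods can improve the style and aesthetics of images but often neglect the semantic consistency between the generated image and the user's description, producing outputs that are visually appealing but do not deliver the requested content. The key to the solution is VisualPrompter, a training-free prompt engineering framework whose core is an automatic self-reflection module that identifies concepts missing from the generated image, combined with a target-specific prompt optimization mechanism that revises prompts in a fine-grained manner, improving text-image alignment.
Link: https://arxiv.org/abs/2506.23138
Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
Affiliations: Institute of Automation, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Beijing Academy of Artificial Intelligence, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures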
Abstract:Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.
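[CV-146] RoboScape: Physics-informed Embodied World Model
[Quick read]: This paper addresses the limited physical awareness of current embodied world models, especially in modeling 3D geometry and motion dynamics, which leads to unrealistic video generation in contact-rich scenarios. The key to the solution is RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within one framework and introduces two key physics-informed joint training tasks: temporal depth prediction, which strengthens 3D geometric consistency in video rendering, and keypoint dynamics learning, which implicitly encodes physical properties and improves complex motion modeling.
Link: https://arxiv.org/abs/2506.23135
Authors: Yu Shang, Xin Zhang, Yinzhou Tang, Lei Jin, Chen Gao, Wei Wu, Yong Li
Affiliations: Tsinghua University; Manifold AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 17 pages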
Abstract:World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: this https URL.
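[CV-147] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval
[Quick read]: This paper addresses art plagiarism detection: recognizing plagiarized paintings and explaining the detected plagiarism by retrieving visually similar authentic artworks. The key to the solution is a dataset of authentic painting images and plagiarized versions synthesized with generative AI, on which image retrieval and classification are performed with the visual foundation model DINOv2. The study first uses off-the-shelf DINOv2 features for image retrieval; this non-learned approach achieves high recognition accuracy (97.2%) but low retrieval precision (29.0% average precision). To improve retrieval quality, DINOv2 is then fine-tuned with a metric learning loss, which clearly improves retrieval performance (AP up by 12%) but unexpectedly lowers recognition accuracy to 92.7%.
Link: https://arxiv.org/abs/2506.23132
Authors: Sophie Zhou, Shu Kong
Affiliations: Cheyenne Mountain High School; University of Macau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: to appear at AVSS’25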
Abstract:Art plagiarism detection plays a crucial role in protecting artists’ copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagiarisms by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists’ styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2% but suffers from low retrieval precision (29.0% average precision, AP). To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled in the database. The finetuned model greatly improves retrieval performance by 12% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7%). We conclude with insightful discussions and outline directions for future research.
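The baseline described in the abstract (retrieve the most similar database image, then classify by a similarity threshold) can be sketched in a few lines. This is an illustration only: the 2D vectors stand in for DINOv2 features, and the cosine similarity, the 0.9 threshold, and the function names are assumptions rather than details from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_and_classify(query, database, threshold=0.9):
    """Return (best_id, best_similarity, is_flagged).

    database maps an artwork id to its feature vector. The query is
    flagged as a likely plagiarism when its best match reaches the threshold.
    """
    best_id, best_similarity = None, -1.0
    for item_id, feature in database.items():
        similarity = cosine_similarity(query, feature)
        if similarity > best_similarity:
            best_id, best_similarity = item_id, similarity
    return best_id, best_similarity, best_similarity >= threshold


# Hypothetical toy features for two authentic artworks.
database = {"artwork_a": [1.0, 0.0], "artwork_b": [0.0, 1.0]}

print(retrieve_and_classify([1.0, 0.1], database))  # close to artwork_a
print(retrieve_and_classify([1.0, 1.0], database))  # equally far from both
```

Recognition depends only on whether the top similarity clears the threshold, while retrieval quality depends on ranking the right original first, which is one way the two metrics can diverge as the abstract reports.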
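[CV-148] Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
[Quick read]: This paper tries to address insufficient spatial reasoning when handling complex instructions in 3D point cloud perception, even though point cloud data provides detailed spatial cues such as size and position. The key to the solution is a reasoning-based segmentation framework, Relevant Reasoning Segmentation (R^2S), which emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying the relevant elements, then processing the instruction guided by their associated visual priors.
Link: https://arxiv.org/abs/2506.23120
Authors: Zhenhua Ning, Zhuotao Tian, Shaoshuai Shi, Guangming Lu, Daojing He, Wenjie Pei, Li Jiang
Affiliations: Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; The Chinese University of Hong Kong, Shenzhen; Voyager Research, Didi Chuxing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: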
Abstract:Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R^2S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R^2S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.
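[CV-149] Hierarchical Corpus-View-Category Refinement for Carotid Plaque Risk Grading in Ultrasound MICCAI2025
[Quick read]: This paper addresses the classification difficulty in carotid plaque grading (CPG) caused by the small size and high intra-class variability of plaques. Existing deep learning-based multi-view classification methods usually focus only on feature fusion across views and overlook the importance of representation learning and the differences among class features. The key to the solution is a novel Corpus-View-Category Refinement Framework (CVC-RF) that processes information at the corpus, view, and category levels to improve model performance. Specifically, it introduces a center-memory contrastive loss to strengthen global modeling, a cascaded down-sampling attention module to fuse multi-scale information, and a parameter-free mixture-of-experts weighting strategy to decouple features at the category level.
Link: https://arxiv.org/abs/2506.23108
Authors: Zhiyuan Zhu, Jian Wang, Yong Jiang, Tong Han, Yuhao Huang, Ang Zhang, Kaiwen Yang, Mingyuan Luo, Zhe Liu, Yaofei Duan, Dong Ni, Tianhong Tang, Xin Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025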
Abstract:Accurate carotid plaque grading (CPG) is vital to assess the risk of cardiovascular and cerebrovascular diseases. Due to the small size and high intra-class variability of plaque, CPG is commonly evaluated using a combination of transverse and longitudinal ultrasound views in clinical practice. However, most existing deep learning-based multi-view classification methods focus on feature fusion across different views, neglecting the importance of representation learning and the difference in class features. To address these issues, we propose a novel Corpus-View-Category Refinement Framework (CVC-RF) that processes information from Corpus-, View-, and Category-levels, enhancing model performance. Our contribution is four-fold. First, to the best of our knowledge, we are the foremost deep learning-based method for CPG according to the latest Carotid Plaque-RADS guidelines. Second, we propose a novel center-memory contrastive loss, which enhances the network’s global modeling capability by comparing with representative cluster centers and diverse negative samples at the Corpus level. Third, we design a cascaded down-sampling attention module to fuse multi-scale information and achieve implicit feature interaction at the View level. Finally, a parameter-free mixture-of-experts weighting strategy is introduced to leverage class clustering knowledge to weight different experts, enabling feature decoupling at the Category level. Experimental results indicate that CVC-RF effectively models global features via multi-level refinement, achieving state-of-the-art performance in the challenging CPG task.
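[CV-150] Computer-Aided Multi-Stroke Character Simplification by Stroke Removal ICDAR2025
[Quick read]: This paper addresses the recognition and learning difficulties that complex multi-stroke characters in scripts such as Chinese and Japanese pose for native speakers and non-native learners, simplifying characters to lower the learning barrier and to support more legible font design. The key to the solution is to use a highly accurate character recognition model to assess legibility and to selectively remove the strokes that affect legibility least, achieving systematic simplification while keeping the character recognizable as a whole.
Link: https://arxiv.org/abs/2506.23106
Authors: Ryo Ishiyama, Shinnosuke Matsuo, Seiichi Uchida
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICDAR2025 (Oral)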
Abstract:Multi-stroke characters in scripts such as Chinese and Japanese can be highly complex, posing significant challenges for both native speakers and, especially, non-native learners. If these characters can be simplified without degrading their legibility, it could reduce learning barriers for non-native speakers, facilitate simpler and legible font designs, and contribute to efficient character-based communication systems. In this paper, we propose a framework to systematically simplify multi-stroke characters by selectively removing strokes while preserving their overall legibility. More specifically, we use a highly accurate character recognition model to assess legibility and remove those strokes that minimally impact it. Experimental results on 1,256 character classes with 5, 10, 15, and 20 strokes reveal several key findings, including the observation that even after removing multiple strokes, many characters remain distinguishable. These findings suggest the potential for more formalized simplification strategies.
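The idea of removing the strokes that "minimally impact" legibility lends itself to a greedy loop, sketched below. The abstract does not specify the search procedure, so the greedy strategy, the stopping threshold, and every name here are assumptions; the toy scorer simply sums hypothetical per-stroke importances, whereas in the paper's setting the score would come from a character recognition model.

```python
def simplify_character(strokes, legibility, min_score):
    """Greedily remove strokes while legibility stays at or above min_score.

    strokes: list of stroke identifiers.
    legibility: callable mapping a list of strokes to a score.
    """
    remaining = list(strokes)
    while len(remaining) > 1:
        # Score every candidate obtained by dropping exactly one stroke.
        candidates = [
            (legibility(remaining[:i] + remaining[i + 1:]), i)
            for i in range(len(remaining))
        ]
        best_score, best_index = max(candidates)
        if best_score < min_score:
            break  # every further removal would hurt legibility too much
        del remaining[best_index]
    return remaining


# Hypothetical per-stroke importances standing in for a recognizer.
importance = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}


def toy_legibility(strokes):
    return sum(importance[s] for s in strokes)


print(simplify_character(["a", "b", "c", "d"], toy_legibility, min_score=0.75))
```

With these toy numbers the loop drops the two least important strokes and stops, since removing either remaining stroke would push the score below the threshold.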
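[CV-151] DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation
[Quick read]: This paper addresses the poor performance of interactive segmentation (IS) models in specialized domains or complex scenarios, especially with camouflaged or multi-part objects. The key to the solution is DC-TTA, a per-sample test-time adaptation (TTA) framework that uses user interactions as supervision: it partitions the user clicks into more coherent subsets, each processed by TTA with a separate model, which reduces conflicts among different cues and enables more localized updates, and finally merges the adapted models into a unified predictor.
Link: https://arxiv.org/abs/2506.23104
Authors: Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Wooseong Jeong, Kuk-Jin Yoon
Affiliations: KAIST; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: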
Abstract:Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM’s zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.
[CV-152] Where What Why: Towards Explainable Driver Attention Prediction ICCV2025
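[Quick read]: This paper addresses the modeling of task-driven attention, a problem shared by autonomous driving and cognitive science. Existing methods only predict where drivers look by generating spatial heatmaps and fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. The key to the solution is a new task paradigm, Explainable Driver Attention Prediction, which jointly predicts spatial attention regions, parses the attended semantics, and provides cognitive reasoning for attention allocation. The authors also build W3DA, the first large-scale explainable driver attention dataset, and propose LLada, a large-language-model-driven framework that unifies pixel modeling, semantic parsing, and cognitive reasoning.
Link: https://arxiv.org/abs/2506.23088
Authors: Yuchen Zhou,Jiayu Tang,Xiaoyan Xiao,Yueyao Lin,Linkai Liu,Zipeng Guo,Hao Fei,Xiaobo Xia,Chao Gou
Affiliations: Sun Yat-sen University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025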
Abstract:Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W3DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.
[CV-153] Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation MICCAI2025
[Quick read]: This paper addresses automated and accurate segmentation of individual vertebrae in 3D CT and MRI images, in particular the difficulties caused by image blurring and by distinguishing similar vertebrae. The key to the solution is the Frequency-enhanced Multi-granularity Context Network (FMC-Net), which uses a wavelet transform for lossless downsampling to reduce feature distortion and then processes the high- and low-frequency components separately. The high-frequency components pass through High-frequency Feature Refinement (HFR), which amplifies key features and filters out noise, while the low-frequency components pass through a Multi-granularity State Space Model (MG-SSM), which aggregates feature representations with different receptive fields to extract spatially varying context and capture long-range dependencies with linear complexity.
Link: https://arxiv.org/abs/2506.23086
Authors: Jian Shi,Tianqi You,Pingping Zhang,Hongli Zhang,Rui Xu,Haojie Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025. More modifications may be performed
Abstract:Automated and accurate segmentation of individual vertebrae in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) to amplify the prominence of key features and filter out noise, restoring fine-grained details in blurred images. For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at this https URL.
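To make "wavelet transform for lossless downsampling" concrete, the sketch below splits a 2D array into four half-resolution sub-bands and reconstructs the input exactly from them. It is an illustration under assumptions: the Haar wavelet and the 2D setting are chosen here for brevity, whereas the abstract does not name the wavelet and FMC-Net operates on 3D volumes.

```python
import numpy as np


def haar_dwt2(x: np.ndarray):
    """One-level 2D Haar transform of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    low = (a + b + c + d) / 2.0           # low-frequency approximation
    col_diff = (a - b + c - d) / 2.0      # detail: difference across columns
    row_diff = (a + b - c - d) / 2.0      # detail: difference across rows
    diag = (a - b - c + d) / 2.0          # detail: diagonal difference
    return low, col_diff, row_diff, diag


def haar_idwt2(low, col_diff, row_diff, diag):
    """Exact inverse of haar_dwt2."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + col_diff + row_diff + diag) / 2.0
    x[0::2, 1::2] = (low - col_diff + row_diff - diag) / 2.0
    x[1::2, 0::2] = (low + col_diff - row_diff - diag) / 2.0
    x[1::2, 1::2] = (low - col_diff - row_diff + diag) / 2.0
    return x


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
print(bands[0].shape, np.allclose(restored, image))
```

Each sub-band has half the spatial resolution, yet the four together carry the full input, which is what distinguishes this from strided convolution or pooling. The high- and low-frequency branches described in the abstract would then operate on the detail and approximation bands respectively.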
[CV-154] Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization
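[Quick read]: This paper addresses distance awareness in cross-view geo-localization, that is, how to make a model capture the context around the target comprehensively and minimize the cost of localization errors. The key to the solution is Dynamic Contrastive Learning (DyCL), a framework that progressively aligns feature representations according to hierarchical spatial margins, which lets it handle the inherently complex spatial relationships among buildings better than conventional metric learning.
Link: https://arxiv.org/abs/2506.23077
Authors: Suofei Zhang,Xinxin Wang,Xiaofu Wu,Quan Zhou,Haifeng Hu
Affiliations: Nanjing University of Posts and Telecommunications; National Engineering Research Center of Communications and Networking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: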
Abstract:Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Distance-Aware Campus (DA-Campus), the first benchmark that pairs multi-view imagery with precise distance annotations across three spatial resolutions. Based on DA-Campus, we formulate DACVGL as a hierarchical retrieval problem across different domains. Our study further reveals that, due to the inherent complexity of spatial relationships among buildings, this problem can only be addressed via a contrastive learning paradigm, rather than conventional metric learning. To tackle this challenge, we propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins. Extensive experiments demonstrate that DyCL is highly complementary to existing multi-scale metric learning methods and yields substantial improvements in both hierarchical retrieval performance and overall cross-view geo-localization accuracy. Our code and benchmark are publicly available at this https URL.
[CV-155] Learning Counterfactually Decoupled Attention for Open-World Model Attribution ICCV2025
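[Quick read]: This paper addresses open-world model attribution, where existing methods rely on handcrafted region partitioning or feature spaces, are easily confounded by spurious statistical correlations, and perform poorly on novel attacks. The key to the solution is Counterfactually Decoupled Attention Learning (CDAL), which explicitly models the causal relationship between attentional visual traces and source model attribution and counterfactually decouples discriminative model-specific artifacts from confounding source biases for comparison.
Link: https://arxiv.org/abs/2506.23074
Authors: Yu Zheng,Boyang Gong,Fanye Kong,Yueqi Duan,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jiwen Lu,Jie Zhou
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025. Code: this https URL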
Abstract:In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: this https URL.
[CV-156] Unsupervised 3D Braided Hair Reconstruction from a Single-View Image
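[Quick read]: This paper addresses the reconstruction of 3D braided hairstyles from single-view images, a task made difficult by the complex topology and fine-grained geometry of braids. Existing strand-based hair reconstruction methods typically target loose hairstyles and struggle to capture the details of braided hair. The proposed solution is a novel unsupervised pipeline whose key component is a synthetic braid model inspired by braid theory, which captures the intricate interwoven structure of braids and yields more accurate, realistic, and efficient 3D braided hair reconstruction.
Link: https://arxiv.org/abs/2506.23072
Authors: Jing Gao
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, accepted to the 2025 International Conference on Machine Vision Applications (MVA 2025)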
Abstract:Reconstructing 3D braided hairstyles from single-view images remains a challenging task due to the intricate interwoven structure and complex topologies of braids. Existing strand-based hair reconstruction methods typically focus on loose hairstyles and often struggle to capture the fine-grained geometry of braided hair. In this paper, we propose a novel unsupervised pipeline for efficiently reconstructing 3D braided hair from single-view RGB images. Leveraging a synthetic braid model inspired by braid theory, our approach effectively captures the complex intertwined structures of braids. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, providing superior accuracy, realism, and efficiency in reconstructing 3D braided hairstyles, supporting expressive hairstyle modeling in digital humans.
[CV-157] CoreMark: Toward Robust and Universal Text Watermarking Technique
[Quick read]: This paper addresses the central challenge of achieving robustness, generalizability, and imperceptibility simultaneously in text watermarking. The core of the solution is an embedding paradigm called CORE, which consists of several consecutively aligned black pixel segments; its key innovation is inherent noise resistance during transmission and broad applicability across languages and fonts. Building on CORE, the authors construct a text watermarking framework named CoreMark, which dynamically extracts COREs, selects the more robust characters according to CORE length, and embeds hidden data by adjusting CORE thickness without causing significant visual distortion.
Link: https://arxiv.org/abs/2506.23066
Authors: Jiale Meng,Yiming Li,Zheming Lu,Zewei He,Hao Luo,Tianwei Zhang
Affiliations: Zhejiang University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: 10 pages, 16 figures
Abstract:Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm, termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Based on the CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions. Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance the robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print-camera attacks, while maintaining satisfactory imperceptibility.
[CV-158] Empowering Small VLMs to Think with Dynamic Memorization and Exploration
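[Quick read]: This paper addresses the challenge of giving Small-scale Vision-Language Models (SVLMs) reliable thinking capabilities. Because SVLMs have limited parameter capacity and weak instruction-following ability, existing training paradigms such as Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR) demand more than the base VLM can deliver; applied directly to SVLMs, they produce severe pseudo thinking traces and advantage collapse, which undermine both thinking reliability and task performance. The key to the solution is DyME, a training paradigm that dynamically selects between memorization (via SFT) and exploration (via RLVR) at each optimization step so that every update contributes to the trade-off, balancing performance and reliability.
Link: https://arxiv.org/abs/2506.23061
Authors: Jiazhen Liu,Yuchuan Deng,Long Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: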
Abstract:Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: this https URL
[CV-159] Ovis-U1 Technical Report
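[Quick read]: This paper aims to improve performance on multimodal understanding, text-to-image generation, and image editing by building a single unified model for all three tasks. The key to the solution is a new unified training approach that starts from a language model and trains understanding and generation together, rather than relying on a frozen multimodal large language model (MLLM) for generation; the resulting model outperforms recent state-of-the-art models on several benchmarks.
Link: https://arxiv.org/abs/2506.23044
Authors: Guo-Hua Wang,Shanshan Zhao,Xinjie Zhang,Liangfu Cao,Pengxin Zhan,Lunhao Duan,Shiyin Lu,Minghao Fu,Xiaohao Chen,Jianshan Zhao,Yang Li,Qing-Guo Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: A unified model for multimodal understanding, text-to-image generation, and image editing. GitHub: this https URL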
Abstract:In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.
[CV-160] From Coarse to Fine: Learnable Discrete Wavelet Transforms for Efficient 3D Gaussian Splatting ICCV
[Quick read]: This paper addresses the memory and bandwidth pressure that 3D Gaussian Splatting (3DGS) incurs in novel view synthesis as the number of Gaussian primitives keeps growing. The key to the solution is AutoOpti3DGS, a framework that feeds the input images through a sequence of learnable forward and inverse discrete wavelet transforms in which the low-pass filters are fixed, the high-pass filters are learnable and initialized to zero, and an auxiliary orthogonality loss gradually activates the fine frequencies, automatically restraining Gaussian proliferation without sacrificing visual fidelity.
Link: https://arxiv.org/abs/2506.23042
Authors: Hung Nguyen,An Le,Runfa Li,Truong Nguyen
Affiliations: UC San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV Workshop
Abstract:3D Gaussian Splatting has emerged as a powerful approach in novel view synthesis, delivering rapid training and rendering but at the cost of an ever-growing set of Gaussian primitives that strains memory and bandwidth. We introduce AutoOpti3DGS, a training-time framework that automatically restrains Gaussian proliferation without sacrificing visual fidelity. The key idea is to feed the input images to a sequence of learnable Forward and Inverse Discrete Wavelet Transforms, where low-pass filters are kept fixed, high-pass filters are learnable and initialized to zero, and an auxiliary orthogonality loss gradually activates fine frequencies. This wavelet-driven, coarse-to-fine process delays the formation of redundant fine Gaussians, allowing 3DGS to capture global structure first and refine detail only when necessary. Extensive experiments show that AutoOpti3DGS requires just a single filter learning-rate hyper-parameter, integrates seamlessly with existing efficient 3DGS frameworks, and consistently produces sparser scene representations more compatible with memory- or storage-constrained hardware.
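Why a zero-initialized high-pass filter yields a coarse input at the start of training can be seen in a one-dimensional toy. The sketch below is only an illustration under assumptions: it uses Haar filters on a 1D signal, applies the same filters for the forward and inverse transform, and compares just the two end points (high-pass at zero versus the orthogonal Haar high-pass); the learning of the filter and the orthogonality loss from the paper are not modeled.

```python
import numpy as np

LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)          # fixed low-pass filter (Haar, assumed)
HIGH_HAAR = np.array([1.0, -1.0]) / np.sqrt(2.0)   # orthogonal high-pass end point (assumed)


def round_trip(signal: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Forward then inverse one-level DWT on non-overlapping pairs of samples."""
    pairs = signal.reshape(-1, 2)
    approx = pairs @ LOW     # coarse coefficients
    detail = pairs @ high    # fine coefficients (all zero when `high` is zero)
    recon = np.outer(approx, LOW) + np.outer(detail, high)
    return recon.reshape(-1)


signal = np.array([4.0, 2.0, 1.0, 7.0])
coarse = round_trip(signal, np.zeros(2))   # high-pass at its zero initialization
full = round_trip(signal, HIGH_HAAR)       # high-pass fully activated
print(coarse, full)
```

With the high-pass filter at zero, each pair of samples collapses to its mean, so the downstream model sees only coarse structure; as the filter approaches an orthogonal high-pass, the round trip returns the original signal and fine detail becomes available.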
[CV-161] ReMem: Mutual Information-Aware Fine-tuning of Pretrained Vision Transformers for Effective Knowledge Distillation
[Quick read]: This paper addresses the sharp drop in the benefit that small task-specific models gain from knowledge distillation when the teacher is a visual representation model, such as a Vision Transformer (ViT), pretrained at large scale. The key to the solution is mutual information-aware optimization during fine-tuning, which makes knowledge transfer more effective. For small or highly imbalanced downstream datasets, the authors add a heuristic that reweights the multi-layer perceptron (MLP) blocks, motivated by the observation that the top MLP blocks are primarily responsible for mutual information loss.
Link: https://arxiv.org/abs/2506.23041
Authors: Chengyu Dong,Huan Gui,Noveen Sachdeva,Long Jin,Ke Yin,Jingbo Shang,Lichan Hong,Ed H.Chi,Zhe Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Knowledge distillation from pretrained visual representation models offers an effective approach to improve small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models that are pretrained at a large scale. In this paper, we address this challenge for pretrained Vision Transformers (ViTs) by exploring methods to fine-tune them for more effective knowledge transfer. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during finetuning. For small or highly imbalanced downstream datasets where such optimization becomes less effective, we introduce a simple yet effective heuristic of reweighting MLP blocks. This approach is inspired by our observation that top MLP blocks are primarily responsible for mutual information loss. Our method enables small student models to benefit from even the strongest pretrained models.
[CV-162] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation
[Quick read]: This paper addresses the limited segmentation performance caused by the scarcity of pixel-level labels in medical image segmentation. The key to the solution is AugPaint, a data augmentation framework that uses inpainting to generate image-label pairs from limited labeled data. AugPaint adapts the sampling process of a latent diffusion model to perform inpainting without retraining, which ensures that the generated images match their label masks accurately and sets it apart from existing dataset generation methods.
Link: https://arxiv.org/abs/2506.23038
Authors: Xinrong Hu,Yiyu Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without the need for retraining. Specifically, given an image and its label mask, we crop the area labeled with the foreground and condition on it during the reverse denoising process for every noise level. The masked background area is gradually filled in, and all generated images are paired with the label mask. This approach ensures an accurate match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.
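The conditioning step described in the abstract (re-imposing the labeled foreground at every noise level while the background is filled in) resembles the mask blending commonly used for diffusion inpainting. The sketch below shows that blending under clearly stated assumptions: it works in pixel space rather than a latent space, `denoise_step` is a hypothetical stand-in for one reverse step of a pretrained model, the noise schedule is invented, and the final clean paste is this sketch's choice rather than something the abstract specifies.

```python
import numpy as np


def inpaint_keep_foreground(image, mask, denoise_step, alphas_bar, rng):
    """Reverse diffusion in which the known foreground is re-imposed at every noise level.

    image: clean reference; mask: 1.0 on labeled foreground, 0.0 on background.
    alphas_bar[t]: cumulative signal level at step t (close to 1.0 means little noise).
    """
    x = rng.standard_normal(image.shape)               # start from pure noise
    for t in range(len(alphas_bar) - 1, -1, -1):        # high noise -> low noise
        a = alphas_bar[t]
        noise = rng.standard_normal(image.shape)
        known = np.sqrt(a) * image + np.sqrt(1.0 - a) * noise   # foreground noised to level t
        x = mask * known + (1.0 - mask) * x             # keep foreground, let background evolve
        x = denoise_step(x, t)
    # Assumption of this sketch: paste the clean foreground so it matches the mask exactly.
    return mask * image + (1.0 - mask) * x


def toy_denoiser(x, t):
    """Hypothetical placeholder; a real model would predict a less noisy sample."""
    return 0.9 * x


rng = np.random.default_rng(0)
image = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
alphas_bar = np.linspace(0.99, 0.01, 10)
sample = inpaint_keep_foreground(image, mask, toy_denoiser, alphas_bar, rng)
print(np.allclose(sample[mask == 1], image[mask == 1]))
```

Because the foreground pixels are fixed by construction, every generated sample can reuse the original label mask, which is the property the abstract highlights.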
[CV-163] VisionScores – A system-segmented image score dataset for deep learning tasks ICIP
[Quick read]: This paper presents VisionScores, a new system-segmented image score dataset (images of sheet music split by system) that aims to provide structure-rich, high information-density images for machine learning and deep learning tasks. It targets the limited representation of image structure and of the compositional process in existing datasets. The key to the solution is to build the dataset from two-handed piano pieces, taking into account not only graphic similarity but also composition patterns, since this creative process is highly instrument-dependent, and to provide two scenarios (the same composition type by different composers, and different composition types by the same composer) that add diversity and practical value.
Link: https://arxiv.org/abs/2506.23030
Authors: Alejandro Romero Amezcua,Mariano José Juan Rivera Meraz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures. Accepted for presentation at the 2025 IEEE International Conference on Image Processing (ICIP). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for any other use
Abstract:VisionScores is presented as the first system-segmented image score dataset, aiming to offer structure-rich, high information-density images for machine and deep learning tasks. Delimited to two-handed piano pieces, it was built to consider not only certain graphic similarity but also composition patterns, as this creative process is highly instrument-dependent. It provides two scenarios in relation to composer and composition type. The first, formed by 14k samples, considers works from different authors but the same composition type, specifically, Sonatinas. The second, consisting of 10.8k samples, presents the opposite case: various composition types from the same author, Franz Liszt. All of the 24.8k samples are formatted as grayscale jpg images of 128 × 512 pixels. VisionScores supplies the users not only the formatted samples but also the systems’ order and the pieces’ metadata. Moreover, unsegmented full-page scores and the pre-formatted images are included for further analysis.
[CV-164] Deep Learning in Mild Cognitive Impairment Diagnosis using Eye Movements and Image Content in Visual Memory Tasks
[Quick read]: This paper addresses early identification of Mild Cognitive Impairment (MCI), with the goal of providing an early diagnostic tool for dementias such as Alzheimer's disease. The key to the solution is a deep learning model based on VTNet that combines time-series and spatial information from eye-tracking data and adds scan paths, heat maps, and image content as features, improving the separation of Healthy Controls (HC) from MCI.
Link: https://arxiv.org/abs/2506.23016
Authors: Tomás Silva Santos Rocha,Anastasiia Mikhailova,Moreno I. Coco,José Santos-Victor
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures
Abstract:The global prevalence of dementia is projected to double by 2050, highlighting the urgent need for scalable diagnostic tools. This study utilizes digital cognitive tasks with eye-tracking data correlated with memory processes to distinguish between Healthy Controls (HC) and Mild Cognitive Impairment (MCI), a precursor to dementia. A deep learning model based on VTNet was trained using eye-tracking data from 44 participants (24 MCI, 20 HCs) who performed a visual memory task. The model utilizes both time series and spatial data derived from eye-tracking. It was modified to incorporate scan paths, heat maps, and image content. These modifications also enabled testing parameters such as image resolution and task performance, analyzing their impact on model performance. The best model, utilizing 700×700 px resolution heatmaps, achieved 68% sensitivity and 76% specificity. Despite operating under more challenging conditions (e.g., smaller dataset size, shorter task duration, or a less standardized task), the model’s performance is comparable to an Alzheimer’s study using similar methods (70% sensitivity and 73% specificity). These findings contribute to the development of automated diagnostic tools for MCI. Future work should focus on refining the model and using a standardized long-term visual memory task.
[CV-165] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models
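[Quick read]: This paper addresses the limited ability of Multimodal Large Language Models (MLLMs) to understand music sheets: their visual reasoning on natural images, text-rich documents, and graphic designs has advanced markedly, but sheet music interpretation has not been studied systematically. The key to the solution is MusiXQA, the first comprehensive dataset for music sheet understanding, which generates high-quality synthetic sheets with MusiXTeX and provides structured annotations covering note pitch and duration, chords, clefs, key and time signatures, and text, supporting diverse visual question answering tasks. Phi-3-MusiX, a model fine-tuned on this dataset, outperforms GPT-based methods and lays a foundation for future MLLM research on music sheet understanding.
Link: https://arxiv.org/abs/2506.23009
Authors: Jian Chen,Wenye Ma,Penghang Liu,Wei Wang,Tengwei Song,Ming Li,Chenguang Wang,Ruiyi Zhang,Changyou Chen
Affiliations: University at Buffalo; Mohamed bin Zayed University of Artificial Intelligence; King Abdullah University of Science and Technology; University of Maryland; Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: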
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.
[CV-166] A Novel Frame Identification and Synchronization Technique for Smartphone Visible Light Communication Systems Based on Convolutional Neural Networks
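[Quick read]: This paper addresses frame identification and synchronization in screen-to-camera (S2C) visible light communication (VLC) systems, in particular the degraded communication performance caused by real-time challenges such as blurred, cropped, and rotated images in mobility scenarios. The key to the solution is a novel, robust, and lightweight supervised Convolutional Neural Network (CNN) technique that introduces overhead frames for synchronization and is trained on a purpose-built dataset; experiments report an overall accuracy of approximately 98.74%.
Link: https://arxiv.org/abs/2506.23004
Authors: Vaigai Nayaki Yokar,Hoa Le-Minh,Xicong Li,Wai Lok Woo,Luis Nero Alves,Stanislav Zvanovec,Tran The Son,Zabih Ghassemlooy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: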
Abstract:This paper proposes a novel, robust, and lightweight supervised Convolutional Neural Network (CNN)-based technique for frame identification and synchronization, designed to enhance short-link communication performance in a screen-to-camera (S2C) based visible light communication (VLC) system. Developed using Python and the TensorFlow Keras framework, the proposed CNN model was trained through three real-time experimental investigations conducted in Jupyter Notebook. These experiments incorporated a dataset created from scratch to address various real-time challenges in S2C communication, including blurring, cropping, and rotated images in mobility scenarios. Overhead frames were introduced for synchronization, which leads to enhanced system performance. The experimental results demonstrate that the proposed model achieves an overall accuracy of approximately 98.74%, highlighting its effectiveness in identifying and synchronizing frames in S2C VLC systems.
[CV-167] Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models
[Quick read]: This paper addresses the vulnerability of large Vision-Language Models (VLMs) to adversarial attacks, in particular the transferability of cross-prompt attacks. The key to the solution is an improved adversarial attack framework comprising (1) a new initialization strategy that significantly raises the Attack Success Rate (ASR); (2) a study of cross-image transferability through learned universal perturbations; and (3) a new loss function that targets the vision encoder's attention mechanisms to improve generalization. These improvements raise the transferability of adversarial examples and give a firmer basis, in theory and in practice, for understanding the security of VLMs.
Link: https://arxiv.org/abs/2506.22982
Authors: Atharv Mittal,Agam Pandey,Amritanshu Tiwari,Sukrit Jindal,Swadesh Swain
Affiliations: Indian Institute of Technology, Roorkee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MLRC 2025
Abstract:Large Vision-Language Models (VLMs) have revolutionized computer vision, enabling tasks such as image classification, captioning, and visual question answering. However, they remain highly vulnerable to adversarial attacks, particularly in scenarios where both visual and textual modalities can be manipulated. In this study, we conduct a comprehensive reproducibility study of “An Image is Worth 1000 Lies: Adversarial Transferability Across Prompts on Vision-Language Models”, validating the Cross-Prompt Attack (CroPA) and confirming its superior cross-prompt transferability compared to existing baselines. Beyond replication, we propose several key improvements: (1) a novel initialization strategy that significantly improves Attack Success Rate (ASR); (2) an investigation of cross-image transferability by learning universal perturbations; and (3) a novel loss function targeting vision encoder attention mechanisms to improve generalization. Our evaluation across prominent VLMs, including Flamingo, BLIP-2, and InstructBLIP, together with extended experiments on LLaVA, validates the original results and demonstrates that our improvements consistently boost adversarial effectiveness. Our work reinforces the importance of studying adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.
[CV-168] Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation ICCV2025
[Quick read]: This paper addresses the limited generalization of models to novel classes in Generalized Few-Shot Semantic Segmentation (GFSS), where only a few annotated examples are available and performance on base classes must be maintained. Existing prototype-based methods are deterministic, which limits how well the learned prototypes adapt to diverse samples, especially for novel classes with scarce annotations. The key to the solution is FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP model; by introducing a prototype calibration mechanism and distribution regularization, it learns more discriminative and adaptive prototypes, mitigating overfitting to limited novel-class data and improving generalization.
Link: https://arxiv.org/abs/2506.22979
Authors: Jie Liu,Jiayi Shen,Pan Zhou,Jan-Jakob Sonke,Efstratios Gavves
Affiliations: University of Amsterdam; Singapore Management University; The Netherlands Cancer Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Proceedings
Abstract:Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5^i and COCO-20^i datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings. The code is available at this https URL.
[CV-169] Confident Splatting: Confidence-Based Compression of 3D Gaussian Splatting via Learnable Beta Distributions
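[Quick Read]: This paper addresses the excessive storage and computational overhead caused by the millions of splats that 3D Gaussian Splatting produces for real-time rendering. The key to the solution is a lossy compression method based on learnable confidence scores modeled as Beta distributions: each splat's confidence is optimized through reconstruction-aware losses, so that low-confidence splats can be pruned while visual fidelity is preserved.
Link: https://arxiv.org/abs/2506.22973
Authors: AmirHossein Naghi Razlighi,Elaheh Badali Golezani,Shohreh Kasaei
Affiliations: Sharif University of Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: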
Abstract:3D Gaussian Splatting enables high-quality real-time rendering but often produces millions of splats, resulting in excessive storage and computational overhead. We propose a novel lossy compression method based on learnable confidence scores modeled as Beta distributions. Each splat’s confidence is optimized through reconstruction-aware losses, enabling pruning of low-confidence splats while preserving visual fidelity. The proposed approach is architecture-agnostic and can be applied to any Gaussian Splatting variant. In addition, the average confidence values serve as a new metric to assess the quality of the scene. Extensive experiments demonstrate favorable trade-offs between compression and fidelity compared to prior work. Our code and data are publicly available at this https URL
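The pruning step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each splat's confidence is summarized by the mean of its Beta distribution, alpha / (alpha + beta), and that pruning is a fixed threshold on that mean. The `Splat` class, the parameter values, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Splat:
    splat_id: int
    alpha: float  # Beta shape parameter (hypothetical learned value)
    beta: float   # Beta shape parameter (hypothetical learned value)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution, used here as the confidence."""
    return alpha / (alpha + beta)


def prune_by_confidence(splats, threshold):
    """Keep only splats whose mean confidence reaches the threshold (assumed rule)."""
    return [s for s in splats if beta_mean(s.alpha, s.beta) >= threshold]


def average_confidence(splats):
    """Average confidence over a scene, the kind of scene-level score the abstract mentions."""
    return sum(beta_mean(s.alpha, s.beta) for s in splats) / len(splats)


# Toy scene: three splats with made-up Beta parameters.
splats = [Splat(0, 8.0, 2.0), Splat(1, 1.0, 9.0), Splat(2, 5.0, 5.0)]
kept = prune_by_confidence(splats, threshold=0.5)
print([s.splat_id for s in kept], average_confidence(splats))
```

In a real pipeline the Beta parameters would be optimized jointly with the reconstruction losses; here they are fixed only to make the thresholding visible.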
[CV-170] ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
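[Quick Read]: This paper addresses zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. The key to the solution is the ActAlign framework, which formulates video classification as a sequence alignment problem: a large language model generates an ordered sub-action sequence for each class, and Dynamic Time Warping (DTW) aligns that sequence with the video frames in a shared embedding space, capturing the temporal structure of the video.
Link: https://arxiv.org/abs/2506.22967
Authors: Amir Aghdam,Vincent Tao Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Preprint manuscript - Project page: this https URL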
Abstract:We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
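The alignment idea in the abstract above can be illustrated with classical DTW. The sketch below is an assumption-laden toy, not the ActAlign code: the 2-D vectors stand in for embeddings that would come from a vision-language model, the sub-action scripts are invented, and the decision rule (pick the class with the lowest DTW cost under cosine distance) is one plausible reading of "sequence alignment".

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(sub_actions, frames):
    """Classical DTW: minimum cumulative distance of a monotonic alignment."""
    n, m = len(sub_actions), len(frames)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(sub_actions[i - 1], frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(frames, class_scripts):
    """Return the class whose ordered sub-action script aligns most cheaply."""
    return min(class_scripts, key=lambda name: dtw_cost(class_scripts[name], frames))


# Toy embeddings (hypothetical): the video shows "state X" then "state Y".
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
class_scripts = {
    "A": [[1.0, 0.0], [0.0, 1.0]],  # X then Y, matches the frame order
    "B": [[0.0, 1.0], [1.0, 0.0]],  # Y then X, wrong order
}
print(classify(frames, class_scripts))
```

Both scripts contain the same two sub-actions, so mean-pooled similarity could not separate them; only the ordering enforced by DTW does.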
[CV-171] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images
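[Quick Read]: This paper addresses the vulnerability of watermarking for generative AI content to visual paraphrase attacks, which can fully remove a watermark while producing a paraphrase of the original image. The key to the solution is PECCAVI, the first image watermarking technique that is both safe against visual paraphrase attacks and distortion-free. It embeds watermarks in Non-Melting Points (NMPs) and combines multi-channel frequency-domain watermarking with noisy burnishing to strengthen the watermark's robustness and security.
Link: https://arxiv.org/abs/2506.22960
Authors: Shreyas Dixit,Ashhar Aziz,Shashwat Bajpai,Vasu Sharma,Aman Chadha,Vinija Jain,Amitava Das
Affiliations: Meta AI; Stanford University; Amazon GenAI; AI Institute, University of South Carolina
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: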
Abstract:A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that “Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality.” In response, California’s Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.
[CV-172] YM-WML: A new Yolo-based segmentation Model with Weighted Multi-class Loss for medical imaging
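[Quick Read]: This paper addresses the challenges that class imbalance and the complex structure of medical images pose for medical image segmentation. The key to the solution is YM-WML, a new model for cardiac image segmentation that integrates a robust backbone for effective feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head for precise segmentation, and that introduces the Weighted Multi-class Exponential (WME) loss function to mitigate class imbalance.
Link: https://arxiv.org/abs/2506.22955
Authors: Haniyeh Nikkhah,Jafar Tanha,Mahdi Zarrin,SeyedEhsan Roshan,Amin Kazempour
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 7th International Conference on Pattern Recognition and Image Analysis (IPRIA 2025)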
Abstract:Medical image segmentation poses significant challenges due to class imbalance and the complex structure of medical images. To address these challenges, this study proposes YM-WML, a novel model for cardiac image segmentation. The model integrates a robust backbone for effective feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head for precise and accurate segmentation. To address class imbalance, we introduce the Weighted Multi-class Exponential (WME) loss function. On the ACDC dataset, YM-WML achieves a Dice Similarity Coefficient of 91.02, outperforming state-of-the-art methods. The model demonstrates stable training, accurate segmentation, and strong generalization, setting a new benchmark in cardiac segmentation tasks.
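For readers unfamiliar with the metric quoted above, the Dice Similarity Coefficient for one class is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth pixel sets. The sketch below computes it per label on flattened toy masks. The label values and masks are invented for illustration, and the convention of returning 1.0 when a class is absent from both masks is an assumption; the paper's evaluation protocol is not described in the abstract.

```python
def dice_coefficient(pred, target, label):
    """Dice score for one class label over two flattened label masks."""
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0  # assumed convention: class absent from both masks
    return 2.0 * overlap / (pred_count + target_count)


# Toy flattened masks with background 0 and two hypothetical structures 1 and 2.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]

per_class = {label: dice_coefficient(pred, target, label) for label in (1, 2)}
mean_dice = sum(per_class.values()) / len(per_class)
print(per_class, mean_dice)
```

Averaging per-class scores, as done here, weights small structures as heavily as large ones, which is one reason Dice is a common choice under class imbalance.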
[CV-173] Utilizing a Novel Deep Learning Method for Scene Categorization in Remote Sensing Data
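[Quick Read]: This paper addresses scene categorization (SC) in remote sensing images, where achieving high accuracy from remotely observed data has proven difficult. Conventional deep learning models need large, highly varied, and noisy databases to capture the key visual features, which limits their performance. The proposed solution is the Cuttlefish Optimized Bidirectional Recurrent Neural Network (CO-BRNN); its key idea is to use an optimization algorithm to improve the model's ability to recognize scene features in remote sensing data, thereby improving classification accuracy.
Link: https://arxiv.org/abs/2506.22939
Authors: Ghufran A. Omran,Wassan Saad Abduljabbar Hayale,Ahmad AbdulQadir AlRababah,Israa Ibraheem Al-Barazanchi,Ravi Sekhar,Pritesh Shah,Sushma Parihar,Harshavardhan Reddy Penubadi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: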
Abstract:Scene categorization (SC) in remotely acquired images is an important subject with broad consequences in different fields, including catastrophe control, ecological observation, architecture for cities, and more. Despite its many applications, reaching a high degree of accuracy in SC from remote observation data has proven difficult. This is because conventional deep learning models require large databases with high variety and high levels of noise to capture important visual features. To address these problems, this study introduces an innovative technique referred to as the Cuttlefish Optimized Bidirectional Recurrent Neural Network (CO-BRNN) for scene categorization in remote sensing data. The investigation compares the execution of CO-BRNN with current techniques, including Multilayer Perceptron-Convolutional Neural Network (MLP-CNN), Convolutional Neural Network-Long Short Term Memory (CNN-LSTM), and Long Short Term Memory-Conditional Random Field (LSTM-CRF), Graph-Based (GB), Multilabel Image Retrieval Model (MIRM-CF), Convolutional Neural Networks Data Augmentation (CNN-DA). The results demonstrate that CO-BRNN attained the maximum accuracy of 97%, followed by LSTM-CRF with 90%, MLP-CNN with 85%, and CNN-LSTM with 80%. The study highlights the significance of physical confirmation to ensure the efficiency of satellite data.
[CV-174] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization
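[Quick Read]: This paper addresses the difficulty of detecting misinformation in multimodal content, specifically the localized image edits and cross-lingual inconsistencies found in news media where images are paired with bilingual (e.g., Chinese-English) subtitles, which make misinformation more subtle and harder to identify. The key to the solution is the BiMi framework, which jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for a comprehensive analysis of misinformation. BiMi also integrates an online retrieval module to improve the generalization of model reasoning and applies Group Relative Policy Optimization (GRPO) to improve explanation quality, achieving state-of-the-art performance in realistic, multilingual misinformation detection.
Link: https://arxiv.org/abs/2506.22930
Authors: Yiwei He,Xiangtai Li,Zhenglin Huang,Yi Dong,Hao Fei,Jiangning Zhang,Baoyuan Wu,Guangliang Cheng
Affiliations: University of Liverpool; Nanyang Technological University; National University of Singapore; Zhejiang University; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: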
Abstract:The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.
[CV-175] Attention to Burstiness: Low-Rank Bilinear Prompt Tuning ICCV2025
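[Quick Read]: This paper addresses the "burstiness" that arises in Visual Prompt Tuning (VPT) from the interaction between image patch embeddings and the key and query projectors in the Transformer's self-attention module; the resulting non-Gaussian distributions make prompt learning difficult. The key to the solution is to whiten these data, de-correlating them and equalizing their variance so that they become more Gaussian, which improves the efficiency and performance of prompt learning. Specifically, the authors derive a whitening matrix and multiply it with the prompt to be learned in a bilinear manner, which significantly accelerates prompt tuning and improves accuracy. They also propose a low-rank Bilinear Prompt Tuning (BPT) method that further reduces parameter count and computational overhead.
Link: https://arxiv.org/abs/2506.22908
Authors: Yuzhu Wang,Manni Duan,Shu Kong
Affiliations: Zhejiang Lab; University of Macau; Institute of Collaborative Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025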
Abstract:Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover “burstiness” in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distribution, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., 25 accuracy points on the CUB dataset; interestingly, it learns “bursty prompts”. Extending the bilinear model which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
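The parameter saving from the low-rank version mentioned above comes from simple arithmetic: a prompt matrix of shape n x d is replaced by two factors of shapes n x r and r x d. The sketch below shows only that factorization and the resulting counts. It omits the whitening matrix entirely, and the sizes (100 prompts, dimension 768, rank 8) are hypothetical values chosen for illustration, not figures from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x r) and b (r x d)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def param_counts(num_prompts, dim, rank):
    """Learnable parameters for a full prompt matrix vs. its two low-rank factors."""
    full = num_prompts * dim
    low_rank = num_prompts * rank + rank * dim
    return full, low_rank


# Two small factors whose product yields the final prompts (toy values).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 3 x 2
B = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]   # 2 x 4
prompts = matmul(A, B)                              # 3 x 4

print(prompts)
print(param_counts(num_prompts=100, dim=768, rank=8))
```

With these assumed sizes the factored form needs 6,944 parameters instead of 76,800; the saving grows as the rank shrinks relative to n and d.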
[CV-176] MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
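[Quick Read]: This paper addresses orientation estimation errors in sparse inertial motion capture systems under magnetic disturbances. Existing inertial measurement unit (IMU) systems are prone to errors in magnetically disturbed environments, which limits their use in real-world scenarios. The key to the proposed solution, MagShield, is a "detect-then-correct" strategy: it first detects magnetic disturbances through multi-IMU joint analysis and then corrects orientation errors using human motion priors, improving system performance in magnetically disturbed environments.
Link: https://arxiv.org/abs/2506.22907
Authors: Yunzhe Shao,Xinyu Yi,Lu Yin,Shihui Guo,Junhai Yong,Feng Xu
Affiliations: Tsinghua University; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: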
Abstract:This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a “detect-then-correct” strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems.
[CV-177] Point Cloud Compression and Objective Quality Assessment: A Survey
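[Quick Read]: This paper addresses the challenges of compression (Point Cloud Compression, PCC) and quality assessment (Point Cloud Quality Assessment, PCQA) for 3D point cloud data, in particular how to compress and assess quality efficiently given the irregular structure, high data volume, and complex attributes of point clouds. The key to the approach is to analyze and compare a range of handcrafted and learning-based PCC algorithms together with objective PCQA metrics, and to benchmark them on emerging datasets, providing an in-depth analysis of the strengths and limitations of each class of methods and guidance for real-time and perceptually relevant applications.
Link: https://arxiv.org/abs/2506.22902
Authors: Yiling Xu,Yujie Zhang,Shuting Xia,Kaifa Yang,He Huang,Ziyu Shan,Wenjie Huang,Qi Yang,Le Yang
Affiliations: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; University of Missouri–Kansas City; Department of Electrical and Computer Engineering, University of Canterbury
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: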
Abstract:The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to a critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.
[CV-178] Neural Cellular Automata: From Cells to Pixels
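[Quick Read]: This paper addresses the problems Neural Cellular Automata (NCA) face on high-resolution grids: training time and memory requirements that grow too quickly, limited information propagation, and the heavy compute cost of real-time inference. The key to the solution is to pair the NCA with a small, shared implicit decoder inspired by recent advances in implicit neural representations. The NCA evolves on a coarse grid, and a lightweight decoder then renders output images at arbitrary resolution. The paper also proposes new loss functions for morphogenesis and texture synthesis, yielding high-quality, efficient high-resolution output with low memory and computation overhead.
Link: https://arxiv.org/abs/2506.22899
Authors: Ehsan Pajouheshgar,Yitao Xu,Ali Abbasi,Alexander Mordvintsev,Wenzel Jakob,Sabine Süsstrunk
Affiliations: School of Computer and Communication Sciences, EPFL, Switzerland; Sharif University of Technology, Iran; Google Research, Zurich, Switzerland
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Image and Video Processing (eess.IV)
Comments: 6 pages, 5 figures, first draft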
Abstract:Neural Cellular Automata (NCAs) are bio-inspired systems in which identical cells self-organize to form complex and coherent patterns by repeatedly applying simple local rules. NCAs display striking emergent behaviors including self-regeneration, generalization and robustness to unseen situations, and spontaneous motion. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution grids. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information which impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing NCA with a tiny, shared implicit decoder, inspired by recent advances in implicit neural representations. Following NCA evolution on a coarse grid, a lightweight decoder renders output images at arbitrary resolution. We also propose novel loss functions for both morphogenesis and texture synthesis tasks, specifically tailored for high-resolution output with minimal memory and computation overhead. Combining our proposed architecture and loss functions brings substantial improvement in quality, efficiency, and performance. NCAs equipped with our implicit decoder can generate full-HD outputs in real time while preserving their self-organizing, emergent properties. Moreover, because each MLP processes cell states independently, inference remains highly parallelizable and efficient. We demonstrate the applicability of our approach across multiple NCA variants (on 2D, 3D grids, and 3D meshes) and multiple tasks, including texture generation and morphogenesis (growing patterns from a seed), showing that with our proposed framework, NCAs seamlessly scale to high-resolution outputs with minimal computational overhead.
[CV-179] CP-Guard: A Unified Probability-Agnostic and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems
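[Quick Read]: This paper addresses the reliability problem in Collaborative Perception (CP) caused by malicious agents sending false perception information. The key to the solution is to make the CP system reach a consensus rather than a conflict, so that malicious agents can be accurately detected and eliminated. Specifically, it proposes a probability-agnostic sample consensus (PASAC) method and a collaborative consistency loss (CCLoss) to verify consensus within the collaboration network without prior probabilities of malicious agents, combined with an online adaptive threshold mechanism that dynamically adjusts the verification threshold, improving the robustness and reliability of the system.
Link: https://arxiv.org/abs/2506.22890
Authors: Senkang Hu,Yihang Tao,Guowen Xu,Xinyuan Qian,Yiqin Deng,Xianhao Chen,Sam Tak Wu Kwong,Yuguang Fang
Affiliations: Hong Kong JC STEM Lab of Smart City and Department of Computer Science, City University of Hong Kong; School of Computer Science and Engineering, University of Electronic Science and Technology of China; Department of Electrical and Electronic Engineering, The University of Hong Kong; School of Data Science, Lingnan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: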
Abstract:Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-Guard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent’s perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird’s eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at this https URL
[CV-180] How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings
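[Quick Read]: This paper addresses how to quantify the absolute semantic informativeness of images and texts, that is, how to measure within a contrastive learning framework the effect that conditioning on an image or a text has on the distribution of the other modality. The key to the solution is to redefine the concept of Information Gain and compute the semantic informativeness of images and texts from a contrastive learning model. By analyzing distribution shifts in the embedding space, the method assesses conditional dependence, and it uses a norm-based metric of the embedding to estimate Information Gain, giving an efficient computation whose cost is independent of sample size.
Link: https://arxiv.org/abs/2506.22881
Authors: Fumiya Uchiyama,Rintaro Yanagi,Shohei Taniguchi,Shota Takashiro,Masahiro Suzuki,Hirokatsu Kataoka,Yusuke Iwasawa,Yutaka Matsuo
Affiliations: The University of Tokyo; AIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: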
Abstract:Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In OpenCLIP’s empirical results, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as “image not found.” Furthermore, we propose to measure a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embedding. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and the covariance of the sample embedding, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
[CV-181] Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder
[Quick Read]: This paper addresses the loss of segmentation accuracy in existing video segmentation and grounding methods (such as Sa2VA) caused by the entanglement of dynamic visual information and static semantics. The key to the solution is DeSa2VA, a decoupling-enhanced prompting scheme that combines text pre-training with a linear decoupling module: it separates the hidden states generated by a large language model into distinct textual and visual feature subspaces and applies a dynamic mask fusion strategy with triple supervision, strengthening the model's semantic grounding ability.
Link: https://arxiv.org/abs/2506.22880
Authors: Dang Jisheng(1 and 2),Wu Xudong(3),Wang Bimei(4 and 2),Lv Ning(1),Chen Jiayu(1),Jingwen Zhao(3),Yichu liu(5),Jizhao Liu(1),Juncheng Li(6),Teng Wang(7) ((1) Lanzhou University, (2) National University of Singapore, (3) Sun Yat-sen University, (4) Jinan University, (5) South China University of Technology, (6) Zhejiang University, (7) The University of Hong Kong )
Affiliations: Lanzhou University; National University of Singapore; Sun Yat-sen University; Jinan University; South China University of Technology; Zhejiang University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model’s semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at this https URL.
[CV-182] STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
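[Quick Read]: This paper addresses the temporal inconsistency, motion distortion, and limited domain transformation found in text-guided video editing, which it attributes to insufficient modeling of spatiotemporal pixel relevance during editing. The key to the solution is the STR-Match algorithm, which produces visually appealing and spatiotemporally coherent videos through latent optimization guided by a novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames using the 2D spatial attention and 1D temporal modules of text-to-video diffusion models, without computationally expensive 3D attention.
Link: https://arxiv.org/abs/2506.22868
Authors: Junsung Lee,Junoh Kang,Bohyung Han
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures, 3 tables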
Abstract:Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and, most notably, limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.
[CV-183] Region-Aware CAM: High-Resolution Weakly-Supervised Defect Segmentation via Salient Region Perception
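[Quick Read]: This paper addresses the heavy reliance of industrial surface defect detection on large-scale annotated data, which limits the effectiveness of conventional semantic segmentation and object detection models in practice. The key to the solution is a weakly supervised semantic segmentation framework with two core components: a region-aware class activation map (CAM) and pseudo-label training. Filtering-guided backpropagation (FGBP) improves the resolution and detail preservation of the heatmaps, a region-aware weighted module improves spatial precision, and pseudo-label segmentation iteratively refines the model.
Link: https://arxiv.org/abs/2506.22866
Authors: Hang-Cheng Dong,Lu Zou,Bingguo Liu,Dong Ye,Guodong Liu
Affiliations: Harbin Institute of Technology; Harbin Institute of Technology Suzhou Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: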
Abstract:Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly enhanced the automation level of detection processes. However, conventional semantic segmentation and object detection models heavily rely on large-scale annotated datasets, which conflicts with the practical requirements of defect detection tasks. This paper proposes a novel weakly supervised semantic segmentation framework comprising two key components: a region-aware class activation map (CAM) and pseudo-label training. To address the limitations of existing CAM methods, especially low-resolution heatmaps and insufficient detail preservation, we introduce filtering-guided backpropagation (FGBP), which refines target regions by filtering gradient magnitudes to identify areas with higher relevance to defects. Building upon this, we further develop a region-aware weighted module to enhance spatial precision. Finally, pseudo-label segmentation is implemented to refine the model’s performance iteratively. Comprehensive experiments on industrial defect datasets demonstrate the superiority of our method. The proposed framework effectively bridges the gap between weakly supervised learning and high-precision defect segmentation, offering a practical solution for resource-constrained industrial scenarios.
[CV-184] DMD-Net: Deep Mesh Denoising Network
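[Quick Read]: This paper addresses 3D mesh denoising, aiming to recover a high-quality, noise-free mesh from a noisy input mesh. The key to the solution is Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework that combines a Graph Convolutional Neural Network with a Feature Guided Transformer (FGT). DMD-Net aggregates information on both the primal and the dual graph and uses an asymmetric two-stream network with a primal-dual fusion block, which strengthens the capture of local features and the exchange of global information for efficient and robust mesh denoising.
Link: https://arxiv.org/abs/2506.22850
Authors: Aalok Gangopadhyay,Shashikant Verma,Shanmuganathan Raman
Affiliations: CVIG Lab, IIT Gandhinagar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: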
Abstract:We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework, for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal as well as the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication between the primal-stream and the dual-stream. We develop a Feature Guided Transformer (FGT) paradigm, which consists of a feature extractor, a transformer, and a denoiser. The feature extractor estimates the local features, that guide the transformer to compute a transformation, which is applied to the noisy input mesh to obtain a useful intermediate representation. This is further processed by the denoiser to obtain the denoised mesh. Our network is trained on a large scale dataset of 3D objects. We perform exhaustive ablation studies to demonstrate that each component in our network is essential for obtaining the best performance. We show that our method obtains competitive or better results when compared with the state-of-the-art mesh denoising algorithms. We demonstrate that our method is robust to various kinds of noise. We observe that even in the presence of extremely high noise, our method achieves excellent performance.
[CV-185] AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results
[Quick read]: This paper targets person re-identification (ReID) across aerial and ground viewpoints, a problem that matters for large-scale surveillance and public safety. Bridging the aerial-ground domain gap remains a major challenge because of extreme viewpoint differences, large scale variations, and heavy occlusion. The paper's main contribution is the AG-VPReID 2025 Challenge, built on a new dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames; the participating teams used multi-stream architectures, transformer-based temporal reasoning, and physics-informed modeling. The leading approach, X-TFCLIP from UAM, reached 72.28% Rank-1 accuracy in the aerial-to-ground setting and 70.77% in the ground-to-aerial setting, which underlines both the dataset's complexity and the method's effectiveness.
Link: https://arxiv.org/abs/2506.22843
Authors: Kien Nguyen, Clinton Fookes, Sridha Sridharan, Huy Nguyen, Feng Liu, Xiaoming Liu, Arun Ross, Dana Michalski, Tamás Endrei, Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zijing Gong, Yuhao Wang, Xuehu Liu, Pingping Zhang, Md Rashidunnabi, Hugo Proença, Kailash A. Hambarde, Saeid Rezaei
Affiliations: Queensland University of Technology; Drexel University; Michigan State University; Department of Defence; Universidad Autónoma de Madrid; Pázmány Péter Catholic University; Dalian University of Technology; Wuhan University of Technology; IT: Instituto de Telecomunicações, University of Beira Interior; University College Cork
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset’s complexity. For additional details, please refer to the official website at this https URL.
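For readers unfamiliar with the Rank-1 metric quoted above, here is a minimal sketch of how it is usually computed: a query counts as a hit when its nearest gallery sample has the same identity. The distance matrix and identity lists are made-up illustration data, and the sketch omits protocol details such as same-camera filtering, so it is not the challenge's official evaluation code.

```python
def rank1_accuracy(dist, query_ids, gallery_ids):
    """Fraction of queries whose closest gallery entry shares their identity.

    dist[q][g] is the distance between query q and gallery item g
    (smaller means more similar).
    """
    hits = 0
    for q, row in enumerate(dist):
        nearest = min(range(len(row)), key=row.__getitem__)
        hits += gallery_ids[nearest] == query_ids[q]
    return hits / len(dist)


# Hypothetical 3-query, 3-gallery example.
dist = [
    [0.2, 0.9, 0.5],  # nearest gallery item: index 0
    [0.8, 0.3, 0.4],  # nearest gallery item: index 1
    [0.6, 0.1, 0.7],  # nearest gallery item: index 1 (wrong identity)
]
query_ids = [7, 8, 9]
gallery_ids = [7, 8, 9]
accuracy = rank1_accuracy(dist, query_ids, gallery_ids)
print(f"Rank-1 accuracy: {accuracy:.4f}")
```

In the aerial-to-ground setting the queries would be aerial tracklets and the gallery ground tracklets; the ground-to-aerial setting swaps the roles.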
[CV-186] FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition ICME2025
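[Quick read]: This paper addresses the limitations of existing pedestrian attribute recognition (PAR) methods in fine-grained feature extraction and in generalizing to unseen attributes. Conventional methods predict a fixed set of pre-defined attributes from regional features, which falls short in two ways: regional features may sacrifice fine-grained patterns unique to certain attributes in favor of characteristics shared across attributes, and they cannot predict unseen attributes at test time. The key to the proposed FOCUS approach is to adaptively extract fine-grained attribute-level features for each attribute individually, whether or not the attribute was seen during training. Its core components are Multi-Granularity Mix Tokens (MGMT), which capture latent features at different levels of visual granularity; the Attribute-guided Visual Feature Extraction (AVFE) module, which uses textual attributes as queries in a cross-attention mechanism to retrieve the corresponding visual attribute features from the Mix Tokens; and Region-Aware Contrastive Learning (RACL), which ensures that textual attributes attend to the appropriate Mix Tokens.
Link: https://arxiv.org/abs/2506.22836
Authors: Hongyan An, Kuan Zhu, Xin He, Haiyun Guo, Chaoyang Zhao, Ming Tang, Jinqiao Wang
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory; Wuhan AI Research; Objecteye Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2025 Oral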
Abstract:Pedestrian attribute recognition (PAR) is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. However, a regional feature is typically used to predict a fixed set of pre-defined attributes in these methods, which limits the performance and practicality in two aspects: 1) Regional features may compromise fine-grained patterns unique to certain attributes in favor of capturing common characteristics shared across attributes. 2) Regional features cannot generalize to predict unseen attributes at test time. In this paper, we propose the Fine-grained Optimization with semantiC gUided underStanding (FOCUS) approach for PAR, which adaptively extracts fine-grained attribute-level features for each attribute individually, regardless of whether the attributes are seen or not during training. Specifically, we propose the Multi-Granularity Mix Tokens (MGMT) to capture latent features at varying levels of visual granularity, thereby enriching the diversity of the extracted information. Next, we introduce the Attribute-guided Visual Feature Extraction (AVFE) module, which leverages textual attributes as queries to retrieve their corresponding visual attribute features from the Mix Tokens using a cross-attention mechanism. To ensure that textual attributes focus on the appropriate Mix Tokens, we further incorporate a Region-Aware Contrastive Learning (RACL) method, encouraging attributes within the same region to share consistent attention maps. Extensive experiments on PA100K, PETA, and RAPv1 datasets demonstrate the effectiveness and strong generalization ability of our method.
[CV-187] SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds
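[Quick read]: This paper addresses the limited local editing capability of images generated by 3D-aware GAN techniques: existing methods struggle to edit specific semantic regions of an image precisely. The key to its solution is SemFaceEdit, which generates semantic fields on generative radiance manifolds and uses latent codes to disentangle the geometry and appearance associated with different facial semantics in the generated image. Two modules, a Geometry module and an Appearance module, are trained jointly in an adversarial setting to learn semantic-aware geometry and appearance descriptors, which improves the precise editing of particular facial semantics.
Link: https://arxiv.org/abs/2506.22833
Authors: Shashikant Verma, Shanmuganathan Raman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: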
Abstract:Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit’s superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.
[CV-188] Listener-Rewarded Thinking in VLMs for Image Preferences
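[Quick read]: This paper addresses how to train robust, generalizable reward models that align text-to-image and text-to-video generative models with human intent. Current reward models generalize poorly, while supervised fine-tuning leads to memorization and demands complex annotation pipelines. Reinforcement learning (RL) with Group Relative Policy Optimization (GRPO) improves generalization but has a key failure mode: reasoning accuracy drops significantly when the model's reasoning trace contradicts the evaluation of an independent, frozen vision-language model (the "listener"). The key to the solution is a listener-augmented GRPO framework in which the listener re-evaluates the reasoner's chain-of-thought and provides a dense, calibrated confidence score that shapes the RL reward signal, encouraging the reasoner not only to answer correctly but also to produce explanations that persuade an independent model.
Link: https://arxiv.org/abs/2506.22832
Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Affiliations: Artificial Intelligence Research Institute, Moscow, Russia; Skolkovo Institute of Science and Technology, Moscow, Russia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: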
Abstract:Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model’s reasoning trace contradicts that of an independent, frozen vision-language model (“listener”) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: this https URL.
[CV-189] Prompting without Panic: Attribute-aware Zero-shot Test-Time Calibration
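[Quick read]: This paper addresses the degradation in confidence calibration that test-time prompt tuning (TPT) causes while improving the accuracy of vision-language models (VLMs). The key to its solution is to initialize the test-time prompts carefully using prior knowledge about target label attributes from a large language model (LLM), which avoids overfitting to a particular test sample and thereby mitigates miscalibration. It also introduces a new regularization loss that reduces intra-class distance and increases inter-class distance to maintain prompt quality.
Link: https://arxiv.org/abs/2506.22819
Authors: Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Affiliations: IIT Delhi; TCS Research Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages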
Abstract:Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of test time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) To further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intra-class distance, and increase inter-class distance between the learnt [...]. Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR’24), 6.78 for DiffTPT (CVPR’23), and 8.43 for PromptAlign (NeurIPS’23). The code is publicly accessible at: this https URL.
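The expected calibration error (ECE) reported above is typically computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses that standard definition with ten equal-width bins; the bin count and the half-open bin boundaries are assumptions, and the paper's exact evaluation settings are not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (bin share) * |accuracy - confidence|.

    confidences: predicted max-class probabilities in [0, 1].
    correct: 1 if the prediction was right, else 0.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Assumed convention: bins are [k/n_bins, (k+1)/n_bins); 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Made-up predictions: two confident hits, one mid-confidence hit, one miss.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

A well-calibrated model has confidence close to accuracy in every bin, so a lower ECE means the reported probabilities can be trusted more.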
[CV-190] Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding
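[Quick read]: This paper addresses the limited generalization and open-world capability of open-vocabulary 3D scene understanding models, which stems from the limited amount of 3D data. Existing methods rely on contrastive learning or 2D feature distillation and struggle with diverse object categories. The key to its solution is MVOV3D, which uses precise region-level image features and text features encoded by CLIP encoders, together with 3D geometric priors, to optimize multi-view fusion. This reduces the inherent noise of vision-language models without any training, improving open-world performance while preserving generalizability.
Link: https://arxiv.org/abs/2506.22817
Authors: Xingyilang Yin, Jiale Wang, Xi Yang, Mutian Xu, Xu Gu, Nannan Wang
Affiliations: Xidian University; SSE, CUHKSZ
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: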
Abstract:Recent open-vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point-text pairs or by distilling 2D features into 3D models via point-pixel alignment. While these methods show considerable performance in benchmarks with limited vocabularies, they struggle to handle diverse object categories, as the limited amount of 3D data places an upper bound on training strong open-vocabulary 3D models. We observe that 2D multi-view fusion methods take precedence in understanding diverse concepts in 3D scenes. However, inherent noise in vision-language models leads multi-view fusion to sub-optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. We focus on reducing the inherent noise without training, thereby preserving the generalizability while enhancing open-world capabilities. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image features and text features encoded by CLIP encoders and incorporates 3D geometric priors to optimize multi-view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, our MVOV3D achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open-vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.
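The mIoU figures above follow the usual semantic segmentation metric: per-class intersection over union, averaged over classes. The sketch below shows that computation on flat label lists; skipping classes absent from both prediction and ground truth is an assumed convention, and the benchmark's official script may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(p == cls and t == cls for p, t in zip(pred, target))
        union = sum(p == cls or t == cls for p, t in zip(pred, target))
        if union:  # Assumption: ignore classes that never appear.
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Made-up labels for six 3D points and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

Because every class counts equally, rare categories weigh as much as common ones, which is one reason mIoU on a 200-class benchmark such as ScanNet200 is hard to raise.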
[CV-191] Efficient Multi-Crop Saliency Partitioning for Automatic Image Cropping
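[Quick read]: This paper addresses the fact that traditional saliency-aware cropping methods optimize a single bounding box, which makes them ineffective for applications requiring multiple disjoint crops. The key to its solution is extending the Fixed Aspect Ratio Cropping algorithm to extract multiple non-overlapping crops efficiently in linear time, by dynamically adjusting attention thresholds and removing selected crops from consideration without recomputing the entire saliency map.
Link: https://arxiv.org/abs/2506.22814
Authors: Andrew Hamara, Andrew C. Freeman
Affiliations: Baylor University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: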
Abstract:Automatic image cropping aims to extract the most visually salient regions while preserving essential composition elements. Traditional saliency-aware cropping methods optimize a single bounding box, making them ineffective for applications requiring multiple disjoint crops. In this work, we extend the Fixed Aspect Ratio Cropping algorithm to efficiently extract multiple non-overlapping crops in linear time. Our approach dynamically adjusts attention thresholds and removes selected crops from consideration without recomputing the entire saliency map. We discuss qualitative results and introduce the potential for future datasets and benchmarks.
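To make the multi-crop idea concrete, the following sketch greedily picks several non-overlapping fixed-size windows with the highest total saliency, using a summed-area table so each window score is a constant-time lookup. It is a simplified brute-force illustration, not the paper's linear-time extension of Fixed Aspect Ratio Cropping, and it does not model the attention-threshold adjustment; all function names are hypothetical.

```python
def summed_area_table(saliency):
    """Integral image with a zero border, so window sums are O(1)."""
    height, width = len(saliency), len(saliency[0])
    table = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            table[y + 1][x + 1] = (
                saliency[y][x] + table[y][x + 1] + table[y + 1][x] - table[y][x]
            )
    return table


def window_sum(table, top, left, crop_h, crop_w):
    bottom, right = top + crop_h, left + crop_w
    return table[bottom][right] - table[top][right] - table[bottom][left] + table[top][left]


def greedy_multi_crop(saliency, crop_h, crop_w, num_crops):
    """Return up to num_crops (top, left) corners of disjoint crop windows."""
    height, width = len(saliency), len(saliency[0])
    table = summed_area_table(saliency)  # computed once, reused for every crop
    chosen = []
    for _ in range(num_crops):
        best, best_score = None, float("-inf")
        for top in range(height - crop_h + 1):
            for left in range(width - crop_w + 1):
                # Equal-size windows overlap only if close on both axes.
                if any(abs(top - t) < crop_h and abs(left - l) < crop_w for t, l in chosen):
                    continue
                score = window_sum(table, top, left, crop_h, crop_w)
                if score > best_score:
                    best, best_score = (top, left), score
        if best is None:
            break  # no room left for another disjoint crop
        chosen.append(best)
    return chosen


# Made-up saliency map with two bright regions.
saliency = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 5, 0, 0, 0],
    [0, 5, 5, 0, 3, 3],
    [0, 0, 0, 0, 3, 3],
]
crops = greedy_multi_crop(saliency, crop_h=2, crop_w=2, num_crops=2)
print(crops)
```

Already-selected windows are excluded by an overlap test rather than by recomputing the saliency map, which loosely mirrors the "remove selected crops from consideration" step in the abstract.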
[CV-192] FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition
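[Quick read]: This paper addresses poor cross-subject generalization in EEG emotion recognition, which stems mainly from individual variability, cognitive traits, and differences in emotional responses. The key to its solution is FreqDGT, a frequency-adaptive dynamic graph transformer that uses frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands, adaptive dynamic graph learning (ADGL) to capture input-specific brain connectivity patterns, and a multi-scale temporal disentanglement network (MTDN) that models temporal dynamics and performs adversarial feature disentanglement, improving the accuracy and robustness of cross-subject emotion recognition.
Link: https://arxiv.org/abs/2506.22807
Authors: Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok
Affiliations: The Hong Kong Polytechnic University; Shanghai Maritime University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: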
Abstract:Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at this https URL.
[CV-193] Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate
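[Quick read]: This paper addresses the risk that text-to-image diffusion models generate images of inappropriate or trademarked concepts, that is, how to delete target concepts while preserving other concepts as much as possible. The key to its solution is a new framework, Concept Pinpoint Eraser (CPE), which adds nonlinear Residual Attention Gates (ResAGs) to selectively erase target concepts, uses an attention anchoring loss to prevent forgetting of the remaining concepts, and strengthens robustness through adversarial training with learnable text embeddings.
Link: https://arxiv.org/abs/2506.22806
Authors: Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: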
Abstract:Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit content demonstrated that the proposed CPE outperforms prior art by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at this https URL
[CV-194] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding ICCV2025
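[Quick read]: This paper addresses the loss of interpretability caused by the growing complexity of deep learning models, whose black-box decision processes are hard to understand. The key to its solution is the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU), which uses a Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined based on global gradient contributions, and the corrected knowledge is then distilled back into the black-box model, improving both interpretability and accuracy.
Link: https://arxiv.org/abs/2506.22803
Authors: Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Mei Lin, Peiyi Shen, Liang Zhang
Affiliations: Xidian University; KTH Royal Institute of Technology; Donghai Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025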
Abstract:Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: this https URL.
[CV-195] Riemannian-Geometric Fingerprints of Generative Models
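[Quick read]: This paper addresses fingerprinting and model attribution for generative models (GMs). It aims to give service providers reliable ways to authenticate their models and protect their intellectual property, and to help users and law enforcement verify the source of generated content for accountability and trust. In addition, as more model-generated data is fed back into training sources, the risk of model collapse grows, so synthetic data must be distinguished from human data. Existing work leaves a gap in understanding GM fingerprints, mainly because there is no formal framework for defining, representing, and analyzing them. The key to the solution is a geometric approach: a new definition of artifacts and fingerprints based on Riemannian geometry, which learns Riemannian metrics from data and replaces Euclidean distances and nearest-neighbor search with geodesic distances and a kNN-based Riemannian center of mass, giving a more accurate characterization of GM fingerprints.
Link: https://arxiv.org/abs/2506.22802
Authors: Hae Jin Song, Laurent Itti
Affiliations: USC Department of Computer Science
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: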
Abstract:Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training (“regurgitative training”), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models’ fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.
[CV-196] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
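[Quick read]: This paper addresses the incomplete scanning of road structure that results from a single-pass driving clip, which makes expanding the reconstructed scene a critical requirement for sensor simulators to regress driving actions effectively. Existing 3D Gaussian Splatting (3DGS) techniques, when extended directly with diffusion priors, often accumulate physical inconsistencies and lose training efficiency. To address this, the paper proposes the RGE-GS framework, whose key idea is to combine diffusion-based generation with reward-guided Gaussian integration. It has two core innovations: a reward network that learns to identify and prioritize consistently generated patterns before the reconstruction phases, which ensures spatial stability; and a differentiated training strategy during reconstruction that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods.
Link: https://arxiv.org/abs/2506.22800
Authors: Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Affiliations: Alibaba Group; Zhejiang University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: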
Abstract:A single-pass driving clip frequently results in incomplete scanning of the road structure, making expansion of the reconstructed scene a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source code will be made publicly available at this https URL. (Camera-ready version incorporating reviewer suggestions will be updated soon.)
[CV-197] VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding ICCV2025
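[Quick read]: This paper addresses the lack of deeper scene understanding and the high training cost of existing 3D Gaussian Splatting (3DGS) methods for high-fidelity, real-time rendering. The key to its solution is VoteSplat, a framework that integrates Hough voting with 3DGS: the Segment Anything Model (SAM) performs instance segmentation and generates 2D vote maps, spatial offset vectors embedded in the Gaussian primitives construct 3D spatial votes, and depth distortion constraints refine localization. VoteSplat also maps 2D image semantics to 3D point clouds via voting points, which reduces the training cost associated with high-dimensional CLIP features while preserving semantic unambiguity.
Link: https://arxiv.org/abs/2506.22799
Authors: Minchao Jiang, Shunyu Jia, Jiaming Gu, Xiaoyuan Lu, Guangming Zhu, Anqi Dong, Liang Zhang
Affiliations: Xidian University; Qing Yi (Shanghai); Shanghai Pudong Cryptography Research Institute; KTH Royal Institute of Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025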
Abstract:3D Gaussian Splatting (3DGS) has become a workhorse for high-quality, real-time rendering in novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while preserving semantic unambiguity. Extensive experiments demonstrate the effectiveness of VoteSplat in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, hierarchical segmentation, and ablation studies. Our code is available at this https URL
[CV-198] Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching
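[Quick read]: This paper addresses point-pixel registration between LiDAR point clouds and camera images, a fundamental yet challenging task in autonomous driving and robotic perception. The core difficulty is the modality gap between point clouds and images, especially in sparse single-frame LiDAR settings. Conventional methods extract features separately from point clouds and images and rely on hand-crafted or learned matching strategies; this separate encoding fails to bridge the modality gap and handles the sparsity and noise of single-frame LiDAR poorly. The key to the solution is a detector-free framework: the LiDAR intensity map is projected into a 2D view from the LiDAR perspective and fed into an attention-based detector-free matching network, enabling direct point-pixel matching without multi-frame accumulation. A repeatability scoring mechanism serves as a soft visibility prior to make matching more robust under sparse input.
Link: https://arxiv.org/abs/2506.22784
Authors: Yu Han, Zhiwei Huang, Yanting Zhang, Fangjun Ding, Shen Cai, Rui Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: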
Abstract:Point-pixel registration between LiDAR point clouds and camera images is a fundamental yet challenging task in autonomous driving and robotic perception. A key difficulty lies in the modality gap between unstructured point clouds and structured images, especially under sparse single-frame LiDAR settings. Existing methods typically extract features separately from point clouds and images, then rely on hand-crafted or learned matching strategies. This separate encoding fails to bridge the modality gap effectively, and more critically, these methods struggle with the sparsity and noise of single-frame LiDAR, often requiring point cloud accumulation or additional priors to improve reliability. Inspired by recent progress in detector-free matching paradigms (e.g. MatchAnything), we revisit the projection-based approach and introduce the detector-free framework for direct point-pixel matching between LiDAR and camera views. Specifically, we project the LiDAR intensity map into a 2D view from the LiDAR perspective and feed it into an attention-based detector-free matching network, enabling cross-modal correspondence estimation without relying on multi-frame accumulation. To further enhance matching reliability, we introduce a repeatability scoring mechanism that acts as a soft visibility prior. This guides the network to suppress unreliable matches in regions with low intensity variation, improving robustness under sparse input. Extensive experiments on KITTI, nuScenes, and MIAS-LCEC-TF70 benchmarks demonstrate that our method achieves state-of-the-art performance, outperforming prior approaches on nuScenes (even those relying on accumulated point clouds), despite using only single-frame LiDAR.
[CV-199] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution ICCV2025
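[Quick read]: This paper addresses long-range spatio-temporal feature modeling and alignment in video super-resolution (VSR), along with the challenge of preserving high-frequency information and visual quality during reconstruction. The key to its solution is the Mamba architecture: Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks extract long-range spatio-temporal features and enlarge the receptive field; a Deformable Cross-Mamba Alignment module improves alignment between adjacent frames and reduces feature distortion; and a simple yet effective Frequency Charbonnier-like loss narrows the frequency-domain gap between reconstructed and ground-truth frames, improving visual quality.
Link: https://arxiv.org/abs/2506.22762
Authors: Dinh Phu Tran, Dao Duy Hung, Daeyoung Kim
Affiliations: School of Computing, KAIST, Republic of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025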
Abstract:Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
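The abstract does not give the formula for the Frequency Charbonnier-like loss, so the sketch below is only one plausible reading: apply the Charbonnier penalty sqrt(d^2 + eps^2) to the magnitude of the difference between the 2D discrete Fourier transforms of the reconstructed and ground-truth frames. The function names, the eps value, and the naive DFT are all assumptions made for illustration; a real implementation would use an FFT from a tensor library.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform (fine for tiny demo inputs only)."""
    height, width = len(image), len(image[0])
    spectrum = [[0j] * width for _ in range(height)]
    for u in range(height):
        for v in range(width):
            total = 0j
            for y in range(height):
                for x in range(width):
                    angle = -2.0 * math.pi * (u * y / height + v * x / width)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def frequency_charbonnier(pred, target, eps=1e-3):
    """Mean Charbonnier penalty on spectral differences (assumed formulation)."""
    pred_spec, target_spec = dft2(pred), dft2(target)
    height, width = len(pred), len(pred[0])
    total = 0.0
    for u in range(height):
        for v in range(width):
            diff = abs(pred_spec[u][v] - target_spec[u][v])
            total += math.sqrt(diff * diff + eps * eps)
    return total / (height * width)


# Made-up 2x2 frames: identical inputs give the minimum loss, eps.
pred = [[0.0, 1.0], [1.0, 0.0]]
target = [[1.0, 1.0], [1.0, 0.0]]
loss_same = frequency_charbonnier(pred, pred)
loss_diff = frequency_charbonnier(pred, target)
print(loss_same, loss_diff)
```

The Charbonnier form behaves like an L1 penalty for large errors while staying smooth near zero, and comparing spectra rather than pixels penalizes missing high-frequency detail directly.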
[CV-200] RoboPearls: Editable Video Simulation for Robot Manipulation ICCV2025
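[Quick Read]: This paper addresses the limited scalability of data acquisition for generalist robot manipulation policies, caused by the high cost and inefficiency of collecting real-world demonstrations, as well as the sim-to-real gap. The key to its solution is RoboPearls, an editable video simulation framework built on 3D Gaussian Splatting (3DGS) that constructs photo-realistic, view-consistent simulations from demonstration videos and supports a range of object manipulation operators through modules such as Incremental Semantic Distillation (ISD) and the 3D regularized NNFM Loss (3D-NNFM). It also uses large language models (LLMs) to automate simulation production and a vision-language model (VLM) to analyze learning issues and close the simulation loop.
Link: https://arxiv.org/abs/2506.22756
Authors: Tao Tang,Likui Zhang,Youpeng Wen,Kaidong Zhang,Jia-Wang Bian,xia zhou,Tianyi Yan,Kun Zhan,Peng Jia,Hefeng Wu,Liang Lin,Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University; Sun Yat-sen University; Bytedance Seed; Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICCV 2025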
Abstract:The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.
[CV-201] Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
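[Quick Read]: This paper addresses the image-quality problems of metalenses in ultra-compact computational imaging, which stem from complex optical degradation and difficult computational restoration. Existing methods rely on precise optical calibration or massive paired datasets, both hard to obtain for real-world imaging systems, and the lack of control over inference often produces undesirable hallucinated artifacts. The key to the solution is to leverage the strong natural-image priors of pretrained models instead of large datasets: positive, neutral, and negative-prompt paths balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation; a tunable decoder controls the trade-off between fidelity and perceptual quality; and a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation.
Link: https://arxiv.org/abs/2506.22753
Authors: Jianing Zhang,Jiayi Zhu,Feiyu Ji,Xiaokang Yang,Xiaoyun Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: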
Abstract:Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: this https URL.
[CV-202] Deep Learning based Joint Geometry and Attribute Up-sampling for Large-Scale Colored Point Clouds
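[Quick Read]: This paper addresses how to generate large-scale, dense colored point clouds to support more realistic and immersive 3D applications. The key to its solution is a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method that models geometry and attribute patterns together and exploits spatial attribute correlations to improve up-sampling quality. The authors build the large-scale SYSU-PCUD dataset and design geometry and attribute up-sampling networks, combined with coarse attribute up-sampling methods and an attribute enhancement module; JGAU achieves higher Peak Signal-to-Noise Ratio (PSNR) than existing methods at all four tested up-sampling rates.
Link: https://arxiv.org/abs/2506.22749
Authors: Yun Zhang,Feifan Chen,Na Li,Zhiwei Guo,Xu Wang,Fen Miao,Sam Kwong
Affiliations: Sun Yat-sen University; Chinese Academy of Sciences; Shenzhen University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: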
Abstract:Colored point cloud, which includes geometry and attribute components, is a mainstream representation enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method that learns to model both geometry and attribute patterns while leveraging spatial attribute correlations. First, we establish and release a large-scale dataset for colored point cloud up-sampling called SYSU-PCUD, containing 121 large-scale colored point clouds with diverse geometry and attribute complexities across six categories and four sampling rates. Second, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework that jointly up-samples geometry and attributes. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Third, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarse up-sampled attributes for each point. Then, an attribute enhancement module is introduced to refine these up-sampled attributes and produce high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) achieved by the proposed JGAU method is 33.90 decibels, 32.10 decibels, 31.10 decibels, and 30.39 decibels for up-sampling rates of 4 times, 8 times, 12 times, and 16 times, respectively. Compared to state-of-the-art methods, JGAU achieves average PSNR gains of 2.32 decibels, 2.47 decibels, 2.28 decibels, and 2.11 decibels at these four up-sampling rates, demonstrating significant improvement.
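A coarse attribute up-sampling step of the kind named here (Geometric Distance Weighted Attribute Interpolation) can be pictured as inverse-distance weighting over nearest neighbours. The abstract does not define GDWAI's weights, so the sketch below is an assumption that uses plain inverse Euclidean distance over the `k` nearest source points; the function name and parameters are hypothetical.

```python
import math

def interpolate_attributes(new_points, src_points, src_colors, k=3, eps=1e-8):
    """Give each new point the inverse-distance-weighted colour of its k nearest sources."""
    result = []
    for p in new_points:
        # Pair every source point with its distance to p, then keep the k nearest.
        nearest = sorted((math.dist(p, q), i) for i, q in enumerate(src_points))[:k]
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(weights)
        color = tuple(
            sum(w * src_colors[i][c] for w, (_, i) in zip(weights, nearest)) / total
            for c in range(3)
        )
        result.append(color)
    return result
```

A new point midway between two sources receives the average of their colours, while a point coinciding with a source essentially copies that source's colour. A learned enhancement stage, as the abstract describes, would then refine these coarse values.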
[CV-203] UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments ICCV2025
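[Quick Read]: This paper addresses the heavy dependence of current multimodal medical image fusion on high-quality, pixel-aligned source images, and the sharp drop in performance on misaligned or degraded inputs. The key to its solution is the UniFuse framework. A degradation-aware prompt learning module integrates multi-directional information and couples cross-modal alignment with restoration, so the two are optimized jointly. An Omni Unified Feature Representation scheme uses Spatial Mamba to encode multi-directional features and mitigate modality differences. A Universal Feature Restoration Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles, performs restoration and fusion in a single stage, so alignment, restoration, and fusion are all handled within one framework.
Link: https://arxiv.org/abs/2506.22736
Authors: Dayong Su,Yafei Zhang,Huafeng Li,Jinxing Li,Yu Liu
Affiliations: Kunming University of Science and Technology; Harbin Institute of Technology at Shenzhen; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025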
Abstract:Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN’s adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method’s effectiveness and significant advantages over existing approaches.
[CV-204] XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge
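[Quick Read]: This paper addresses the scarcity of sensor data and the resource constraints that hinder training and developing deep learning models for human sensing on edge systems. Existing methods that transfer pre-trained models often suffer from modality shift and high resource demands, leading to accuracy loss, resource overhead, and poor adaptability across sensing applications. The key to the solution is XTransfer, described as a first-of-its-kind method for resource-efficient, modality-agnostic model transfer. It relies on (i) model repairing, which safely repairs modality shift in pre-trained model layers using only a small amount of sensor data, and (ii) layer recombining, which efficiently searches for and recombines layers of interest from source models in a layer-wise manner to build compact models.
Link: https://arxiv.org/abs/2506.22726
Authors: Yu Zhang,Xi Zhang,Hualin zhou,Xinyuan Chen,Shang Gao,Hong Jia,Jianfei Yang,Yuankai Qi,Tao Gu
Affiliations: Macquarie University; Nanyang Technological University; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: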
Abstract:Deep learning for human sensing on edge systems offers significant opportunities for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. Current methods that rely on transferring pre-trained models often encounter issues such as modality shift and high resource demands, resulting in substantial accuracy loss, resource overhead, and poor adaptability across different sensing applications. In this paper, we propose XTransfer, a first-of-its-kind method for resource-efficient, modality-agnostic model transfer. XTransfer freely leverages single or multiple pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely repairs modality shift in pre-trained model layers with only few sensor data, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to create compact models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. Comprehensive results demonstrate that XTransfer achieves state-of-the-art performance on human sensing tasks while significantly reducing the costs of sensor data collection, model training, and edge deployment.
[CV-205] Deterministic Object Pose Confidence Region Estimation ICCV2025
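[Quick Read]: This paper addresses two key problems in 6D pose confidence region estimation: sampling-based methods slow down markedly as the number of samples grows, and the derived confidence regions are often excessively large. The key to its solution is a deterministic and efficient method that uses inductive conformal prediction to calibrate deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions, then applies the implicit function theorem to propagate these regions directly into 6D pose confidence regions, avoiding the inefficiency and inflated regions associated with sampling and ensembling.
Link: https://arxiv.org/abs/2506.22720
Authors: Jinghao Wang,Zhang Li,Zi Wang,Banglei Guan,Yang Shang,Qifeng Yu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025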
Abstract:6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases; 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling. It provides compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
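The calibration step (turning predicted Gaussian keypoints into confidence regions with inductive conformal prediction) can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: it uses the Mahalanobis distance as the nonconformity score, restricts covariances to diagonal form, and all function names are hypothetical. The implicit-function-theorem propagation to 6D pose is not covered.

```python
import math

def mahalanobis_2d(point, mean, var):
    """Mahalanobis distance under an axis-aligned 2D Gaussian (diagonal covariance)."""
    return math.sqrt(sum((p - m) ** 2 / v for p, m, v in zip(point, mean, var)))

def calibrate_radius(scores, alpha=0.1):
    """Conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this confidence level
    return sorted(scores)[rank - 1]

def in_confidence_region(point, mean, var, radius):
    """True if `point` lies inside the calibrated ellipse around the predicted keypoint."""
    return mahalanobis_2d(point, mean, var) <= radius
```

In use, each calibration score would be the distance between a ground-truth keypoint and its predicted Gaussian on held-out data; the returned radius then defines an ellipse around every new prediction that covers the true keypoint with probability at least `1 - alpha` under the usual exchangeability assumption.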
[CV-206] Part Segmentation and Motion Estimation for Articulated Objects with Dynamic 3D Gaussians
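[Quick Read]: This paper addresses two fundamental problems in articulated object motion analysis: part segmentation and motion estimation. In its problem setting, the point clouds are not generated by a fixed set of moving points; each may be an arbitrary sampling of the object surface at that time step, which is common when the object undergoes major occlusions or the data come from asynchronous sensors. Methods that rely on tracking point correspondences do not apply here. The key to the solution is a compact yet effective representation that models the object as a collection of simple building blocks, each a 3D Gaussian parameterized with time-dependent rotations, translations, and scales that are shared across all time steps. Part segmentation is obtained by building correspondences between observed points and Gaussians, and each point's transformation over time follows the pose of its assigned Gaussian.
Link: https://arxiv.org/abs/2506.22718
Authors: Jun-Jee Chao,Qingyuan Jiang,Volkan Isler
Affiliations: University of Minnesota; The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: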
Abstract:Part segmentation and motion estimation are two fundamental problems for articulated object motion analysis. In this paper, we present a method to solve these two problems jointly from a sequence of observed point clouds of a single articulated object. The main challenge in our problem setting is that the point clouds are not assumed to be generated by a fixed set of moving points. Instead, each point cloud in the sequence could be an arbitrary sampling of the object surface at that particular time step. Such scenarios occur when the object undergoes major occlusions, or if the dataset is collected using measurements from multiple sensors asynchronously. In these scenarios, methods that rely on tracking point correspondences are not appropriate. We present an alternative approach based on a compact but effective representation where we represent the object as a collection of simple building blocks modeled as 3D Gaussians. We parameterize the Gaussians with time-dependent rotations, translations, and scales that are shared across all time steps. With our representation, part segmentation can be achieved by building correspondences between the observed points and the Gaussians. Moreover, the transformation of each point across time can be obtained by following the poses of the assigned Gaussian (even when the point is not observed). Experiments show that our method outperforms existing methods that solely rely on finding point correspondences. Additionally, we extend existing datasets to emulate real-world scenarios by considering viewpoint occlusions. We further demonstrate that our method is more robust to missing points as compared to existing approaches on these challenging datasets, even when some parts are completely occluded in some time-steps. Notably, our part segmentation performance outperforms the state-of-the-art method by 13% on point clouds with occlusions.
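The point-to-Gaussian correspondence described in this abstract can be sketched as a maximum-likelihood assignment. This is a simplified illustration under assumptions: Gaussians are axis-aligned (the paper also uses rotations), motion is reduced to translation, and the dictionary layout and function names are hypothetical.

```python
import math

def log_density(point, center, scale):
    """Log-density (up to a constant) of an axis-aligned 3D Gaussian."""
    quad = sum(((p - c) / s) ** 2 for p, c, s in zip(point, center, scale))
    log_norm = sum(math.log(s) for s in scale)
    return -0.5 * quad - log_norm

def segment_parts(points, gaussians):
    """Label each point with the index of the Gaussian that explains it best."""
    labels = []
    for p in points:
        scores = [log_density(p, g["center"], g["scale"]) for g in gaussians]
        labels.append(max(range(len(gaussians)), key=scores.__getitem__))
    return labels

def move_point(point, center_t0, center_t1):
    """Carry a point along with its Gaussian's translation between two time steps."""
    return tuple(p + (c1 - c0) for p, c0, c1 in zip(point, center_t0, center_t1))
```

Because a point's motion is read off its assigned Gaussian, its position at a time step can be predicted even when the point itself was not observed then, which matches the occlusion setting the abstract emphasises.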
[CV-207] LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
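[Quick Read]: This paper addresses the insufficient discriminability of the implicit degradation representation (IDR) in implicit degradation estimation-based blind super-resolution (IDE-BSR); existing methods instead over-complicate the adaptation process to improve results, which substantially increases parameters and computation. The key to its solution is to optimize IDR discriminability and to propose LightBSR, a lightweight BSR model. Within a knowledge distillation framework, a degradation-prior-constrained contrastive learning technique strengthens the teacher's ability to distinguish degradation types, and a feature alignment technique transfers the teacher's degradation-related knowledge to the student, yielding efficient blind super-resolution with strong performance.
Link: https://arxiv.org/abs/2506.22710
Authors: Jiang Yuan,JI Ma,Bo Wang,Guanzhou Ke,Weiming Hu
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; North China Electric Power University; Beijing Jiaotong University; YunQue AGI
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: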
Abstract:Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effect, resulting in a significant increase in the model’s parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inferencing. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: this https URL.
[CV-208] General Autonomous Cybersecurity Defense: Learning Robust Policies for Dynamic Topologies and Diverse Attackers
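[Quick Read]: This paper addresses the limited adaptability of existing autonomous cybersecurity defense (ACD) systems in dynamic network environments, in particular the poor generalization of defense agents when the network topology changes. The key to its solution is to develop agents that learn generalizable policies across dynamic network environments, termed general ACD (GACD).
Link: https://arxiv.org/abs/2506.22706
Authors: Arun Ramamurthy,Neil Dhir
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: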
Abstract:In the face of evolving cyber threats such as malware, ransomware and phishing, autonomous cybersecurity defense (ACD) systems have become essential for real-time threat detection and response with optional human intervention. However, existing ACD systems rely on limiting assumptions, particularly the stationarity of the underlying network dynamics. In real-world scenarios, network topologies can change due to actions taken by attackers or defenders, system failures, or time evolution of networks, leading to failures in the adaptive capabilities of current defense agents. Moreover, many agents are trained on static environments, resulting in overfitting to specific topologies, which hampers their ability to generalize to out-of-distribution network topologies. This work addresses these challenges by exploring methods for developing agents to learn generalizable policies across dynamic network environments – general ACD (GACD).
[CV-209] 3D Shape Generation: A Survey
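[Quick Read]: This survey reviews state-of-the-art methods for 3D shape generation and aims to give researchers a structured, in-depth understanding of the field. It is organized around three core components: shape representations, generative modeling approaches, and evaluation protocols. It categorizes 3D representations as explicit, implicit, or hybrid, reviews a range of generation methods with a focus on feedforward architectures, and summarizes commonly used datasets and evaluation metrics, before outlining directions toward controllable, efficient, and high-quality 3D shape generation.
Link: https://arxiv.org/abs/2506.22678
Authors: Nicolas Caytuiro,Ivan Sipiran
Affiliations: University of Chile, Department of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 5 figures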
Abstract:Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state of the art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.
[CV-210] CaO_2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation ICCV2025 ATC
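[Quick Read]: This paper addresses two key inconsistencies in diffusion-based dataset distillation: Objective Inconsistency, where the distillation process diverges from the evaluation objective, and Condition Inconsistency, where generated images mismatch their corresponding conditions. The key to its solution is Condition-aware Optimization with Objective-guided Sampling (CaO_2), a two-stage diffusion-based framework. The first stage uses a probability-informed sample selection pipeline, and the second refines the corresponding latent representations to improve conditional likelihood, aligning the distillation process with the evaluation objective.
Link: https://arxiv.org/abs/2506.22637
Authors: Haoxuan Wang,Zhenghao Zhao,Junyi Wu,Yuzhang Shang,Gaowen Liu,Yan Yan
Affiliations: University of Illinois Chicago; University of Central Florida; Cisco Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Code is available at this https URL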
Abstract:The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO_2), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO_2 achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.
[CV-211] ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models
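[Quick Read]: This paper addresses hallucination in Vision Language Models (VLMs), i.e., generated text that is not grounded in, or contradicts, the visual input. The key solution is ReCo, a small trainable module based on ideas from geometric algebra and relational compositions that is added on top of any VLM with no other modification. The module mitigates the "fading memory effect" with respect to the visual input during generation, improves performance on multiple benchmarks, and can be combined with other hallucination-reduction methods for further gains.
Link: https://arxiv.org/abs/2506.22636
Authors: Sotirios Panagiotis Chytas,Miso Choi,Hyunwoo J. Kim,Vikas Singh
Affiliations: University of Wisconsin-Madison; Korea University; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: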
Abstract:Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning with both visual and language data. But these models make mistakes. A common finding, similar to LLMs, is their tendency to hallucinate, i.e., generate plausible-sounding text which is not grounded in the visual input, or at worst, is contradictory. A growing consensus attributes this behavior to an over-reliance on language; especially as the generation progresses, the model suffers from a "fading memory effect" with respect to the provided visual input. We study mechanisms by which this behavior can be controlled. Specifically, using ideas from geometric algebra and relational compositions, we propose the addition of a small, trainable module (named ReCo) on top of any VLM, with no other modification needed. We show that such a lightweight module is able to mitigate the fading memory effect on three of the most widely used VLMs (InstructBLIP, LLaVA, MiniGPT4), where we see performance improvements on multiple benchmarks. Additionally, we show that our module can be combined with many of the other approaches for reducing hallucination where we achieve improved results for each one.
[CV-212] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
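[Quick Read]: This paper addresses the limited pixel-level understanding and reasoning of large multimodal models (LMMs), starting with foreground segmentation tasks: camouflaged object detection (COD) and salient object detection (SOD). The key to its solution is a reinforcement learning (RL) framework using Group Relative Policy Optimization (GRPO), in which the LMM generates point and bounding-box prompts in next-token fashion; these prompts guide SAM2 to produce segmentation masks, improving the model's pixel-level perception.
Link: https://arxiv.org/abs/2506.22624
Authors: Zuyao You,Zuxuan Wu
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: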
Abstract:We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving .873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
[CV-213] Pixels-to-Graph: Real-time Integration of Building Information Models and Scene Graphs for Semantic-Geometric Human-Robot Understanding
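[Quick Read]: This paper addresses efficient human-robot cooperation and environment understanding for autonomous robots in unknown environments, in particular real-time exploration and mapping on resource-constrained robot platforms. The key to its solution is Pixels-to-Graph (Pix2G), a lightweight method that generates structured scene graphs from image pixels and LiDAR maps in real time, running entirely on CPU. Its outputs are a de-noised 2D top-down environment map and a structure-segmented 3D point cloud, connected through a multi-layer graph that abstracts information from the object level up to the building level, bridging the human-readable 2D Building Information Model (BIM) and the robot's 3D maps.
Link: https://arxiv.org/abs/2506.22593
Authors: Antonello Longo,Chanyoung Chung,Matteo Palieri,Sung-Kyun Kim,Ali Agha,Cataldo Guaragnella,Shehryar Khattak
Affiliations: NASA Jet Propulsion Laboratory; California Institute of Technology; Polytechnic University of Bari; Field AI; National Aeronautics and Space Administration
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted to 2025 IEEE International Conference on Automation Science and Engineering (CASE)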
Abstract:Autonomous robots are increasingly playing key roles as support platforms for human operators in high-risk, dangerous applications. To accomplish challenging tasks, efficient human-robot cooperation and understanding is required. While robotic planning typically leverages 3D geometric information, human operators are accustomed to a high-level compact representation of the environment, like top-down 2D maps representing the Building Information Model (BIM). 3D scene graphs have emerged as a powerful tool to bridge the gap between the human-readable 2D BIM and the robot's 3D maps. In this work, we introduce Pixels-to-Graph (Pix2G), a novel lightweight method to generate structured scene graphs from image pixels and LiDAR maps in real time for the autonomous exploration of unknown environments on resource-constrained robot platforms. To satisfy onboard compute constraints, the framework is designed to perform all operations on CPU only. The method's outputs are a de-noised 2D top-down environment map and a structure-segmented 3D point cloud, which are seamlessly connected using a multi-layer graph abstracting information from the object level up to the building level. The proposed method is quantitatively and qualitatively evaluated during real-world experiments performed using the NASA JPL NeBula-Spot legged robot to autonomously explore and map cluttered garage and urban office-like environments in real time.
[CV-214] BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data MICCAI2025
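[Quick Read]: This paper addresses the shortcomings of existing methods, mainly based on convolutional neural networks or Transformer architectures, when predicting phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, in particular their inability to capture long-range spatial and temporal dependencies. The key to its solution is BrainMT, a hybrid framework with two stages: first, a bidirectional Mamba block with a temporal-first scanning mechanism captures global temporal interactions in a computationally efficient manner; second, a Transformer block uses self-attention to model global spatial relationships across the deep features processed by the Mamba block.
Link: https://arxiv.org/abs/2506.22591
Authors: Arunkumar Kannan,Martin A. Lindquist,Brian Caffo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025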
Abstract:Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at this https URL.
[CV-215] LIGHT: Multi-Modal Text Linking on Historical Maps ICDAR2025
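[Quick Read]: This paper addresses the difficulty of effectively "linking" text fragments on historical maps, for example determining a multi-word place name. Existing layout analysis methods rely mainly on linguistic features and neglect geometric information, which is essential for map text. The key to its solution is LIGHT, a multi-modal approach that integrates linguistic, image, and geometric features. It includes a geometry-aware embedding module that encodes the polygonal coordinates of text regions to capture their shapes and relative spatial positions, unifies this geometric information with the visual and linguistic token embeddings from LayoutLMv3, and uses the cross-modal information to directly predict the reading-order successor of each text instance, with a bi-directional learning strategy that improves sequence robustness.
Link: https://arxiv.org/abs/2506.22589
Authors: Yijun Lin,Rhett Olson,Junhan Wu,Yao-Yi Chiang,Jerod Weinman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICDAR2025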
Abstract:Text on historical maps provides valuable information for studies in history, economics, geography, and other related fields. Unlike structured or semi-structured documents, text on maps varies significantly in orientation, reading order, shape, and placement. Many modern methods can detect and transcribe text regions, but they struggle to effectively "link" the recognized text fragments, e.g., determining a multi-word place name. Existing layout analysis methods model word relationships to improve text understanding in structured documents, but they primarily rely on linguistic features and neglect geometric information, which is essential for handling map text. To address these challenges, we propose LIGHT, a novel multi-modal approach that integrates linguistic, image, and geometric features for linking text on historical maps. In particular, LIGHT includes a geometry-aware embedding module that encodes the polygonal coordinates of text regions to capture polygon shapes and their relative spatial positions on an image. LIGHT unifies this geometric information with the visual and linguistic token embeddings from LayoutLMv3, a pretrained layout analysis model. LIGHT uses the cross-modal information to predict the reading-order successor of each text instance directly with a bi-directional learning strategy that enhances sequence robustness. Experimental results show that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, demonstrating the effectiveness of multi-modal learning for historical map text linking.
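Once a model has predicted a reading-order successor for every text instance, assembling multi-word place names reduces to following those pointers. The sketch below illustrates only that post-processing step, as an assumption about how successor predictions could be consumed; the function name, the `successor` list format, and the example words are hypothetical.

```python
def link_text_instances(words, successor):
    """Group words into phrases by following predicted reading-order successors.

    `successor[i]` is the index of the word that follows word i, or None if
    word i ends a phrase.
    """
    has_predecessor = {j for j in successor if j is not None}
    phrases = []
    for start in range(len(words)):
        if start in has_predecessor:
            continue  # not the first word of a phrase
        chain, seen, i = [], set(), start
        while i is not None and i not in seen:  # `seen` guards against cycles
            seen.add(i)
            chain.append(words[i])
            i = successor[i]
        phrases.append(" ".join(chain))
    return phrases
```

For example, with words `["Lake", "River", "Superior", "Red"]` and successors `[2, None, None, 1]`, the chains are "Lake Superior" and "Red River". Pure cycles with no starting word are silently dropped in this sketch.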
[CV-216] Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation
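Quick read: This paper targets accurate semantic segmentation of agricultural images, in particular the precise delineation of farmland anomalies, to support agricultural decision-making and proactive intervention. The key to the solution is a new Dual Atrous Separable Convolution (DAS Conv) module combined with a strategic skip connection from the encoder to the decoder, which improves performance while preserving model efficiency. With lower computational complexity, the method achieves segmentation quality comparable to transformer-based state-of-the-art (SOTA) models and improves efficiency by more than 66%.
Link: https://arxiv.org/abs/2506.22570
Authors: Chee Mei Ling,Thangarajah Akilan,Aparna Ravinda Phalke
Affiliations: Lakehead University; University of Alabama
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures, 6 tables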
Abstract:Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision-making and proactive interventions. A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size, thereby enhancing model performance without compromising efficiency. The study also incorporates a strategic skip connection from an optimal stage in the encoder to the decoder to bolster the model’s capacity to capture fine-grained spatial features. Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer-based state-of-the-art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance, compared to the SOTA model. This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high-quality performance in agricultural imagery.
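The abstract's point about balancing dilation rate and padding can be made concrete with standard convolution arithmetic. The sketch below is generic and is not the paper's DAS Conv design: the dilation rates, channel counts, and function names are assumptions chosen for illustration, and bias terms are ignored.

```python
def conv_output_size(size, kernel, dilation, padding, stride=1):
    """Spatial output size of a dilated (atrous) convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def same_padding(kernel, dilation):
    """Padding that preserves spatial size for an odd kernel at stride 1."""
    return dilation * (kernel - 1) // 2


def standard_params(c_in, c_out, kernel):
    """Weight count of a standard convolution."""
    return c_in * c_out * kernel * kernel


def separable_params(c_in, c_out, kernel):
    """Weight count of a depthwise convolution followed by a 1x1 pointwise one."""
    return c_in * kernel * kernel + c_in * c_out


for d in (1, 6, 12):  # illustrative dilation rates
    p = same_padding(3, d)
    print(d, p, conv_output_size(64, 3, d, p))  # output size stays 64

print(standard_params(256, 256, 3))   # 589824
print(separable_params(256, 256, 3))  # 67840
```

Matching the padding to the dilation rate keeps the feature map size fixed, and the separable form needs far fewer weights than a standard 3x3 convolution with the same channels.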
[CV-217] Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation
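Quick read: This paper tackles the challenges of building a unified, generalizable foundation model for biomedicine: the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. The key to the solution is MMKD-CLIP, which uses Multiple Medical CLIP Knowledge Distillation to distill knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models instead of relying on massive raw data. Its two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs spanning 26 image modalities, then performs feature-level distillation using more than 19.2 million feature pairs, yielding a biomedical foundation model with strong performance, robustness, and generalization.
Link: https://arxiv.org/abs/2506.22567
Authors: Shansong Wang,Zhecheng Jin,Mingzhe Hu,Mojtaba Safari,Feng Zhao,Chih-Wei Chang,Richard LJ Qiu,Justin Roper,David S. Yu,Xiaofeng Yang
Affiliations: Emory University School of Medicine; Georgia Institute of Technology; Laney Graduate School, Emory University; School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: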
Abstract:CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.
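To give a feel for feature-level distillation from several teachers, here is a minimal sketch that averages a cosine distance between one student feature and each teacher feature. The abstract does not state the loss MMKD-CLIP uses, so the cosine form, the equal teacher weighting, and the assumption that all features share one dimension and are non-zero are illustrative choices only.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distill_loss(student, teachers):
    """Mean of (1 - cosine similarity) between the student and each teacher feature."""
    s = l2_normalize(student)
    losses = []
    for teacher in teachers:
        t = l2_normalize(teacher)
        cosine = sum(a * b for a, b in zip(s, t))
        losses.append(1.0 - cosine)
    return sum(losses) / len(losses)


# One teacher agrees with the student direction, one is orthogonal to it.
print(cosine_distill_loss([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]]))
```

In a real pipeline the teacher features would be precomputed once per image or text, which is consistent with the abstract's mention of extracted feature pairs.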
[CV-218] Improving Token-based Object Detection with Video
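Quick read: This paper addresses two key problems in video object detection. First, conventional detectors must sample the space of all possible boxes, which leads to loss sparsity during training and heuristic postprocessing during inference. Second, conventional detectors decompose video objects into image-specific 2D boxes that are then linked together, which makes it hard to treat a video object as one integrated entity. The key to the solution is representing objects as variable-length sequences of discrete tokens, which succinctly describes video objects of varying number, shape, and location without injecting localization cues, and conceptualizing video objects as fully integrated, indivisible 3D boxes or tracklets rather than generating 2D boxes separately and linking them.
Link: https://arxiv.org/abs/2506.22562
Authors: Abhineet Singh,Nilanjan Ray
Affiliations: University of Alberta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review for publication in IEEE Access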
Abstract:This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
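The idea of writing a video object as one sequence of discrete tokens can be sketched in the Pix2Seq style of quantizing normalized box coordinates into bins. The bin count, the token layout, and the function names below are assumptions for illustration; the paper's actual vocabulary and its handling of frames where an object is absent are not described in the abstract.

```python
N_BINS = 1000  # hypothetical number of coordinate bins


def quantize(value, n_bins=N_BINS):
    """Map a normalized coordinate in [0, 1] to a discrete token id."""
    return min(n_bins - 1, max(0, int(value * n_bins)))


def tracklet_to_tokens(boxes, class_id, n_bins=N_BINS):
    """Encode one video object as a single token sequence.

    boxes: one (ymin, xmin, ymax, xmax) tuple per frame, normalized to [0, 1].
    class_id: integer label; class tokens are offset past the coordinate bins.
    """
    tokens = []
    for box in boxes:
        tokens.extend(quantize(c, n_bins) for c in box)
    tokens.append(n_bins + class_id)
    return tokens


two_frame_object = [(0.25, 0.25, 0.5, 0.75), (0.25, 0.5, 0.5, 1.0)]
print(tracklet_to_tokens(two_frame_object, class_id=3))
# [250, 250, 500, 750, 250, 500, 500, 999, 1003]
```

Because the whole tracklet is one sequence, feeding a longer video subsequence simply lengthens the sequence per object instead of requiring a separate linking step.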
[CV-219] Recomposed realities: animating still images via patch clustering and randomness
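Quick read: This paper asks how existing image data can be used to turn still images into moving ones, bringing them to life through motion. The key to the solution is a patch-based method: image patches from curated datasets are grouped with k-means clustering, and a new target image is reconstructed by matching and randomly sampling from these clusters. The approach emphasizes reinterpretation over replication, so the source and target domains can differ conceptually while sharing local structures.
Link: https://arxiv.org/abs/2506.22556
Authors: Markus Juvonen,Samuli Siltanen
Affiliations: University of Helsinki
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 22 pages, 19 figures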
Abstract:We present a patch-based image reconstruction and animation method that uses existing image data to bring still images to life through motion. Image patches from curated datasets are grouped using k-means clustering and a new target image is reconstructed by matching and randomly sampling from these clusters. This approach emphasizes reinterpretation over replication, allowing the source and target domains to differ conceptually while sharing local structures.
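The cluster, match, and sample loop described above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: patches are flattened lists of floats, k-means is initialized naively from the first k patches, and all function names are hypothetical rather than taken from the authors' code.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(patches, k, iters=10):
    """Plain k-means; returns the centroids and the last cluster assignment."""
    centroids = [list(p) for p in patches[:k]]  # naive init: first k patches
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in patches:
            j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters


def recompose(target_patches, centroids, clusters, rng):
    """Replace each target patch with a random source patch from its nearest cluster."""
    usable = [j for j, members in enumerate(clusters) if members]
    result = []
    for p in target_patches:
        j = min(usable, key=lambda c: sq_dist(centroids[c], p))
        result.append(rng.choice(clusters[j]))
    return result


source = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]]  # dark and bright patches
centroids, clusters = kmeans(source, k=2)
frame = recompose([[0.2, 0.2], [0.8, 0.7]], centroids, clusters, random.Random(0))
print(frame)
```

Calling `recompose` again with a different random state gives a different but structurally similar reconstruction, which is one plausible way the randomness could yield frame-to-frame motion; the abstract does not spell out the exact procedure.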
[CV-220] Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset
Quick read: This paper aims to give AI technologies the ability to understand and generate the behavioral dynamics of interpersonal interaction, enabling more natural, socially intelligent interaction. The core challenge is modeling and generating dyadic body gestures and facial expressions that are aligned with human speech. The key to the solution is the large-scale Seamless Interaction Dataset, which covers more than 4,000 hours of face-to-face interaction footage, together with a suite of models built on it. These models generate context-appropriate behavior from an interlocutor's speech and visual behavior, support controllable adjustment of emotional responses and expressivity levels, and provide a foundation for more interactive and immersive virtual agents.
Link: https://arxiv.org/abs/2506.22554
Authors: Vasu Agrawal,Akinniyi Akinyemi,Kathryn Alvero,Morteza Behrooz,Julia Buffalini,Fabio Maria Carlucci,Joy Chen,Junming Chen,Zhang Chen,Shiyang Cheng,Praveen Chowdary,Joe Chuang,Antony D’Avirro,Jon Daly,Ning Dong,Mark Duppenthaler,Cynthia Gao,Jeff Girard,Martin Gleize,Sahir Gomez,Hongyu Gong,Srivathsan Govindarajan,Brandon Han,Sen He,Denise Hernandez,Yordan Hristov,Rongjie Huang,Hirofumi Inaguma,Somya Jain,Raj Janardhan,Qingyao Jia,Christopher Klaiber,Dejan Kovachev,Moneish Kumar,Hang Li,Yilei Li,Pavel Litvin,Wei Liu,Guangyao Ma,Jing Ma,Martin Ma,Xutai Ma,Lucas Mantovani,Sagar Miglani,Sreyas Mohan,Louis-Philippe Morency,Evonne Ng,Kam-Woh Ng,Tu Anh Nguyen,Amia Oberai,Benjamin Peloquin,Juan Pino,Jovan Popovic,Omid Poursaeed,Fabian Prada,Alice Rakotoarison,Alexander Richard,Christophe Ropers,Safiyyah Saleem,Vasu Sharma,Alex Shcherbyna,Jia Shen,Jie Shen,Anastasis Stathopoulos,Anna Sun,Paden Tomasello,Tuan Tran,Arina Turkatenko,Bo Wan,Chao Wang,Jeff Wang,Mary Williamson,Carleigh Wood,Tao Xiang,Yilin Yang,Julien Yao,Chen Zhang,Jiemin Zhang,Xinyue Zhang,Jason Zheng,Pavlo Zhyzheria,Jan Zikes,Michael Zollhoefer
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.
[CV-221] Preserve Anything: Controllable Image Synthesis with Object Preservation ICCV2025
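Quick read: This paper addresses key problems in text-to-image (T2I) generation: insufficient fidelity when preserving multiple objects, poor semantic consistency, and a lack of explicit control over scene composition. The core of the solution is a multi-channel ControlNet architecture that integrates object preservation independent of size and placement, color and detail retention, and artifact elimination, while producing high-resolution, semantically consistent backgrounds and giving users explicit control over background layout and lighting conditions. Key components include an object preservation module, a background guidance module, a lighting consistency constraint, and a high-frequency overlay module that retains fine details and reduces unwanted artifacts.
Link: https://arxiv.org/abs/2506.22531
Authors: Prasen Kumar Sharma,Neeraj Matiyali,Siddharth Srivastava,Gaurav Sharma
Affiliations: Typeface Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025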
Abstract:We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail (i) to preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions. Key components of our framework include object preservation and background guidance modules, enforcing lighting consistency and a high-frequency overlay module to retain fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work on unseen benchmark and observed a remarkable improvement of ~25%, ~19%, ~13%, and ~14% in terms of prompt alignment, photorealism, the presence of AI artifacts, and natural aesthetics over existing works.
[CV-222] Container damage detection using advanced computer vision model Yolov12 vs Yolov11 vs RF-DETR A comparative analysis
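Quick read: This paper addresses the detection of damage that containers accumulate over long service from mechanical and natural factors, with the goal of prolonging service life and avoiding safety hazards. The key to the solution is a comparison of three advanced computer vision models (Yolov12, Yolov11, and RF-DETR) on container damage detection: the models are trained, validated, and tested on a dataset of 278 annotated images and evaluated by mAP and precision to determine which is best suited to the task.
Link: https://arxiv.org/abs/2506.22517
Authors: Subhadip Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: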
Abstract:Containers are an integral part of the logistics industry and act as a barrier for cargo. A typical service life for a container is more than 20 years. However, over time, containers suffer various types of damage due to mechanical as well as natural factors. A damaged container is a safety hazard for the employees handling it and a liability for the logistics company. Therefore, timely inspection and detection of damaged containers is key to prolonging service life as well as avoiding safety hazards. In this paper, we compare the damage detection performance of three state-of-the-art computer vision models: Yolov12, Yolov11, and RF-DETR. We use a dataset of 278 annotated images to train, validate, and test the models, and we compare their mAP and precision. The objective of this paper is to identify the model that is best suited for container damage detection. The results are mixed. The mAP@50 score of Yolov11 and Yolov12 was 81.9%, compared with 77.7% for RF-DETR. However, when tested on less common types of damaged containers, the RF-DETR model outperformed the others overall, more accurately detecting both damaged containers and damage occurrences with high confidence.
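For readers unfamiliar with the mAP@50 figure quoted above: a predicted box counts as a correct detection only if its intersection over union (IoU) with a ground-truth box is at least 0.5, and average precision is then averaged over classes. The sketch below shows just the IoU test; the box format and function names are illustrative assumptions, not code from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.5):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


print(iou((0, 0, 10, 10), (0, 0, 10, 5)))                 # 0.5
print(is_true_positive((0, 0, 10, 10), (5, 5, 15, 15)))   # False
```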
[CV-223] Automated Defect Identification and Categorization in NDE 4.0 with the Application of Artificial Intelligence
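Quick read: This paper attempts to build an automated framework for fault detection and classification in contemporary radiography (NDE 4.0), aiming to compensate for insufficient data, optimize virtual defect augmentation, and verify the framework's feasibility. The key to the solution is collecting and categorizing 223 CR images of aircraft welds as the base data source, processing the dataset with data augmentation (virtual defect augmentation and standard augmentation), and using an improved U-net model to produce semantic fault segmentation maps. The model is evaluated with NDE metrics, which show the advantage of the proposed method in defect detection.
Link: https://arxiv.org/abs/2506.22513
Authors: Aditya Sharma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: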
Abstract:This investigation attempts to create an automated framework for fault detection and organization for usage in contemporary radiography, as per NDE 4.0. The review’s goals are to address the lack of information that is sufficiently explained, learn how to make the most of virtual defect increase, and determine whether the framework is viable by using NDE measurements. As its basic information source, the technique consists of compiling and categorizing 223 CR photographs of airplane welds. Information expansion systems, such as virtual defect increase and standard increase, are used to work on the preparation dataset. A modified U-net model is prepared using the improved data to produce semantic fault division veils. To assess the effectiveness of the model, NDE boundaries such as Case, estimating exactness, and misleading call rate are used. Tiny a90/95 characteristics, which provide strong differentiating evidence of flaws, reveal that the suggested approach achieves exceptional awareness in defect detection. Considering a 90/95, size error, and fake call rate in the weld area, the consolidated expansion approach clearly wins. Due to the framework’s fast derivation speed, large images can be broken down efficiently and quickly. Professional controllers evaluate the transmitted system in the field and believe that it has a guarantee as a support device in the testing cycle, irrespective of particular equipment cut-off points and programming resemblance.
[CV-224] Lightning the Night with Generative Artificial Intelligence
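Quick read: This paper addresses the fact that visible light reflectance data cannot be used at night for continuous, all-day meteorological observation. The key to the solution is the first use of generative diffusion models to reconstruct nighttime visible light reflectance. Specifically, using multi-band thermal infrared brightness temperature data from AGRI onboard the FY4B satellite, the authors develop a high-precision visible light reflectance retrieval model, RefDiff, which retrieves nighttime reflectance in the 0.47 μm, 0.65 μm, and 0.825 μm bands, significantly improves accuracy through ensemble averaging, and provides uncertainty estimates.
Link: https://arxiv.org/abs/2506.22511
Authors: Tingting Zhou,Feng Zhang,Haoyang Fu,Baoxiang Pan,Renhe Zhang,Feng Lu,Zhixin Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: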
Abstract:The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables 0.47 μm, 0.65 μm, and 0.825 μm bands visible light reflectance retrieval at night. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model’s nighttime retrieval capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.
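Ensemble averaging with an uncertainty estimate can be illustrated with a generic sketch: draw several samples from a generative model for the same input, take the per-pixel mean as the retrieval, and use the per-pixel spread as a rough uncertainty. The function name and the use of the population standard deviation are assumptions; the abstract does not say how RefDiff computes its uncertainty.

```python
import statistics


def ensemble_summary(samples):
    """Per-pixel mean and standard deviation over ensemble members.

    samples: list of equally sized flat lists, one per sampled retrieval.
    """
    mean = [statistics.fmean(pixel) for pixel in zip(*samples)]
    spread = [statistics.pstdev(pixel) for pixel in zip(*samples)]
    return mean, spread


# Two hypothetical reflectance samples for a two-pixel image.
mean, spread = ensemble_summary([[0.25, 0.5], [0.75, 0.5]])
print(mean)    # [0.5, 0.5]
print(spread)  # [0.25, 0.0]
```

Pixels where the samples disagree get a larger spread, which flags retrievals that deserve less trust.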
[CV-225] FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment ICCV2025
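Quick read: This paper addresses Domain Adaptation (DA) for diffusion models in dense prediction tasks, that is, how to improve model performance on unseen domains. The key to the solution is a training-free Domain Noise Alignment (DNA) method, which adjusts the noise statistics during the diffusion process to reduce the noise-statistic discrepancies caused by domain shift and thereby achieves domain adaptation. The method performs well both when the source domain is available and in the source-free setting. In the source-free case, it progressively uses statistics from high-confidence regions to guide the noise statistic adjustment, further improving adaptation.
Link: https://arxiv.org/abs/2506.22509
Authors: Hang Xu,Jie Huang,Linjiang Huang,Dong Li,Yidi Liu,Feng Zhao
Affiliations: University of Science and Technology of China; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV2025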
Abstract:Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances the dense prediction model’s performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, the exploration of DA designs tailored to this framework is worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at this https URL.
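Aligning noise statistics between domains can be pictured as a simple standardize-then-rescale step, sketched below. Which statistics DNA actually aligns, and at which sampling steps, is not given in the abstract, so the use of a global mean and standard deviation and the `align_noise` name are assumptions made for illustration.

```python
import statistics


def align_noise(noise, src_mean, src_std, eps=1e-8):
    """Shift and rescale predicted noise so its mean and std match source statistics."""
    mu = statistics.fmean(noise)
    sigma = statistics.pstdev(noise)
    return [(x - mu) / (sigma + eps) * src_std + src_mean for x in noise]


# Target-domain noise prediction with drifted statistics (mean 2, std 1).
aligned = align_noise([1.0, 3.0], src_mean=0.0, src_std=1.0)
print(aligned)  # close to [-1.0, 1.0]
```

Because the step only rescales values produced during sampling, it needs no gradient updates, which is the sense in which such an adjustment is training-free.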
[CV-226] Weakly Supervised Object Segmentation by Background Conditional Divergence
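Quick read: This paper addresses automatic object segmentation in specialized image domains that lack large amounts of labeled data, such as synthetic aperture sonar imagery, remote sensing, and biomedical imaging. The key to the solution is training a masking network for binary object segmentation with a weak supervision signal, namely image-level presence or absence of an object. The core step places the segmented objects into background-only images to create realistic object images with counterfactual backgrounds. The model is then optimized by contrasting the original images with the counterfactual-background images, together with a supervised loss on background-only images.
Link: https://arxiv.org/abs/2506.22505
Authors: Hassan Baker,Matthew S. Emigh,Austin J. Brockmeier
Affiliations: University of Delaware; Naval Surface Warfare Center Panama City Division
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: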
Abstract:As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found on GitHub (this https URL).
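The step of placing a segmented object onto a different background is, at its simplest, a per-pixel blend controlled by the mask. The sketch below shows that blend on flat pixel lists; the `composite` name and the soft mask in [0, 1] are illustrative assumptions, and the paper's divergence-based training loss is not reproduced here.

```python
def composite(image, mask, background):
    """Blend a masked object onto a new background, pixel by pixel.

    mask values lie in [0, 1]: 1 keeps the object pixel, 0 keeps the background.
    """
    return [m * x + (1.0 - m) * b for x, m, b in zip(image, mask, background)]


image = [0.8, 0.8, 0.2]       # original image with an object on the left
mask = [1.0, 0.5, 0.0]        # hypothetical output of the masking network
background = [0.0, 0.0, 0.4]  # background-only image from the target cluster
print(composite(image, mask, background))  # [0.8, 0.4, 0.4]
```

If the mask leaks original background pixels into the composite, the result looks less like real object images on the target background, which is the signal a divergence between the two sets can pick up.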
[CV-227] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection
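Quick read: This paper addresses automatic detection of brain lesions in magnetic resonance imaging (MRI), in particular segmenting abnormal tissue without annotated lesion data. The key to the solution is a new unsupervised method, Patch2Loc, which learns the spatial locations of normal image patches from structural MRI. Abnormal patches are detected by the higher error and/or variance of their location predictions, producing a heatmap that enables finer-grained segmentation.
Link: https://arxiv.org/abs/2506.22504
Authors: Hassan Baker,Austin J. Brockmeier
Affiliations: University of Delaware
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: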
Abstract:Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search for abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state of the art in unsupervised segmentation. The codebase for this work can be found on our GitHub page (this https URL).
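The inference rule described above, scoring a patch by how badly its location is predicted, can be sketched as follows. The coordinates, the threshold, and the function names are hypothetical, and the sketch covers only the error term, not the variance term the abstract also mentions.

```python
import math


def anomaly_scores(predicted, actual):
    """Euclidean distance between predicted and true patch locations."""
    return [math.dist(p, a) for p, a in zip(predicted, actual)]


def flag_abnormal(scores, threshold):
    """Mark patches whose location error exceeds a chosen threshold."""
    return [score > threshold for score in scores]


predicted = [(10.0, 10.0), (3.0, 4.0)]  # hypothetical network outputs
actual = [(10.0, 10.0), (0.0, 0.0)]     # where the patches were really cut from
scores = anomaly_scores(predicted, actual)
print(scores)                      # [0.0, 5.0]
print(flag_abnormal(scores, 2.0))  # [False, True]
```

Arranging the scores on the grid of patch positions gives the kind of heatmap the abstract refers to.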
[CV-228] What Makes a Dribble Successful? Insights From 3D Pose Tracking Data
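Quick read: This paper addresses the limitations of traditional 2D positional tracking data for evaluating soccer players' dribbling, since such data cannot capture key factors like balance, orientation, and ball control. The key to the solution is using pose tracking data: novel pose-based features extracted in three dimensions give a fuller picture of dribbling skill and improve the prediction of dribble success.
Link: https://arxiv.org/abs/2506.22503
Authors: Michiel Schepers,Pieter Robberechts,Jan Van Haaren,Jesse Davis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: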
Abstract:Data analysis plays an increasingly important role in soccer, offering new ways to evaluate individual and team performance. One specific application is the evaluation of dribbles: one-on-one situations where an attacker attempts to bypass a defender with the ball. While previous research has primarily relied on 2D positional tracking data, this fails to capture aspects like balance, orientation, and ball control, limiting the depth of current insights. This study explores how pose tracking data (capturing players’ posture and movement in three dimensions) can improve our understanding of dribbling skills. We extract novel pose-based features from 1,736 dribbles in the 2022/23 Champions League season and evaluate their impact on dribble success. Our results indicate that features capturing the attacker’s balance and the alignment of the orientation between the attacker and defender are informative for predicting dribble success. Incorporating these pose-based features on top of features derived from traditional 2D positional data leads to a measurable improvement in model performance.
[CV-229] How Can Multimodal Remote Sensing Datasets Transform Classification via SpatialNet-ViT?
Quick read: This paper addresses the limited generalization of existing remote sensing classification studies, which tend to focus on narrow tasks or datasets. The key to the solution is a new model, SpatialNet-ViT, which combines Vision Transformers and Multi-Task Learning, fusing spatial awareness with contextual understanding to improve classification accuracy and scalability. Data augmentation, transfer learning, and multi-task learning are also used to improve robustness and generalization across datasets.
Link: https://arxiv.org/abs/2506.22501
Authors: Gautam Siddharth Kashyap,Manaswi Kulahara,Nipun Joshi,Usman Naseem
Affiliations: Macquarie University; TERI School Of Advanced Studies; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025), scheduled for 3 - 8 August 2025 in Brisbane, Australia
Abstract:Remote sensing datasets offer significant promise for tackling key classification tasks such as land-use categorization, object presence detection, and rural/urban classification. However, many existing studies tend to focus on narrow tasks or datasets, which limits their ability to generalize across various remote sensing classification challenges. To overcome this, we propose a novel model, SpatialNet-ViT, leveraging the power of Vision Transformers (ViTs) and Multi-Task Learning (MTL). This integrated approach combines spatial awareness with contextual understanding, improving both classification accuracy and scalability. Additionally, techniques like data augmentation, transfer learning, and multi-task learning are employed to enhance model robustness and its ability to generalize across diverse datasets.
[CV-230] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models
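Quick read: This paper addresses the visual-semantic knowledge conflicts (VS-KC) that multimodal large language models (MLLMs) exhibit in operating room (OR) risk identification: the models understand the textual rules but fail to recognize visual safety violations. The key to the solution is OR-VSKC, a large dataset of more than 34,000 synthetic images generated by diffusion models plus 214 human-annotated images, built specifically to expose and study VS-KC. Fine-tuning on this dataset improves MLLMs' detection of trained conflict entities, and generalization to new viewpoints is also evaluated.
Link: https://arxiv.org/abs/2506.22500
Authors: Weiyi Zhao,Xiaoyu Tan,Liang Liu,Sijia Li,Youwei Song,Xihe Qiu
Affiliations: Shanghai University of Engineering Science; INFLY TECH (Shanghai) Co., Ltd.; Zhongshan Hospital of Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures. The dataset and appendix are available at this https URL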
Abstract:Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs’ detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at this https URL.
[CV-231] Scalable Dynamic Origin-Destination Demand Estimation Enhanced by High-Resolution Satellite Imagery Data
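Quick read: This paper addresses dynamic origin-destination demand estimation (DODE) in multi-class mesoscopic network models, in particular how to improve the accuracy and applicability of demand estimates when traditional sensor data are limited. The key to the solution is an integrated framework that combines high-resolution satellite imagery with conventional traffic data. A computer vision pipeline performs class-specific vehicle detection and map matching to generate link-level traffic density observations by vehicle class. On that basis, a computational graph-based DODE model calibrates dynamic network states by jointly matching traffic counts and travel times from local sensors with satellite-derived density measurements.
Link: https://arxiv.org/abs/2506.22499
Authors: Jiachao Liu,Pablo Guarda,Koichiro Niinuma,Sean Qian
Affiliations: Carnegie Mellon University; Fujitsu Research of America
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: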
Abstract:This study presents a novel integrated framework for dynamic origin-destination demand estimation (DODE) in multi-class mesoscopic network models, leveraging high-resolution satellite imagery together with conventional traffic data from local sensors. Unlike sparse local detectors, satellite imagery offers consistent, city-wide road and traffic information of both parking and moving vehicles, overcoming data availability limitations. To extract information from imagery data, we design a computer vision pipeline for class-specific vehicle detection and map matching, generating link-level traffic density observations by vehicle class. Building upon this information, we formulate a computational graph-based DODE model that calibrates dynamic network states by jointly matching observed traffic counts and travel times from local sensors with density measurements derived from satellite imagery. To assess the accuracy and scalability of the proposed framework, we conduct a series of numerical experiments using both synthetic and real-world data. The results of out-of-sample tests demonstrate that supplementing traditional data with satellite-derived density significantly improves estimation performance, especially for links without local sensors. Real-world experiments also confirm the framework’s capability to handle large-scale networks, supporting its potential for practical deployment in cities of varying sizes. Sensitivity analysis further evaluates the impact of data quality related to satellite imagery data.
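The link-level density observation mentioned in this abstract can be illustrated with a minimal sketch: count the map-matched detections per link and vehicle class, then divide by link length. The detection format, the link lengths, and the name `link_density` are hypothetical; the abstract does not specify the paper's actual pipeline.

```python
from collections import Counter

def link_density(detections, link_length_km):
    """Return {(link_id, vehicle_class): vehicles per km}.

    detections: iterable of (link_id, vehicle_class) pairs, one per detected
        vehicle after map matching (hypothetical format).
    link_length_km: dict mapping link_id to link length in km.
    """
    counts = Counter(detections)
    return {
        (link, cls): n / link_length_km[link]
        for (link, cls), n in counts.items()
    }

# Made-up detections on two links.
detections = [("a", "car"), ("a", "car"), ("a", "truck"), ("b", "car")]
density = link_density(detections, {"a": 0.5, "b": 2.0})
```

In a DODE setting, values like these would serve as the density observations to be matched alongside counts and travel times from local sensors.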
[CV-232] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction
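[Quick read]: This paper targets bed-related falls in hospitals and long-term-care facilities, which remain a leading source of injury, while existing commercial alarms typically trigger only after the patient has already left the bed. The key to the solution is to collect load signals from four low-cost load cells mounted under the bed legs and convert them into a compact set of complementary images: an RGB line plot and three texture maps (recurrence plot, Markov transition field, and Gramian angular field) that capture higher-order dynamics. The ViFusionTST model then processes these images in parallel and fuses them, enabling accurate prediction of early bed-exit intent.
Link: https://arxiv.org/abs/2506.22498
Authors: Hao Liu,Yu Hu,Rakiba Rayhana,Ling Bai,Zheng Liu
Affiliations: The University of British Columbia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: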
Abstract:Bed-related falls remain a leading source of injury in hospitals and long-term-care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only four low-cost load cells mounted under the bed legs. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps - recurrence plot, Markov transition field, and Gramian angular field - that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
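Two of the texture maps named in this abstract have simple textbook definitions. The sketch below computes a Gramian angular summation field and a thresholded recurrence plot for a short univariate signal. It is a simplified illustration only: the sample values, the threshold `eps`, and the absence of any time-delay embedding are assumptions, not details taken from the paper.

```python
import math

def rescale(series):
    """Min-max rescale a series to [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [2 * (x - lo) / (hi - lo) - 1 for x in series]

def gramian_angular_field(series):
    """Gramian angular summation field: G[i][j] = cos(phi_i + phi_j)."""
    # Clamp before acos to guard against floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in rescale(series)]
    return [[math.cos(a + b) for b in phi] for a in phi]

def recurrence_plot(series, eps):
    """Binary recurrence plot: R[i][j] = 1 if |x_i - x_j| <= eps."""
    return [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]

# Made-up load-cell readings.
load = [10.0, 12.0, 15.0, 12.0, 10.0]
gaf = gramian_angular_field(load)
rp = recurrence_plot(load, eps=0.5)
```

Both outputs are square matrices indexed by pairs of time steps, which is what lets a time series be handed to an image model.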
[CV-233] DriveBLIP2: Attention-Guided Explanation Generation for Complex Driving Scenarios IROS
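[Quick read]: This paper addresses the limited ability of existing vision-language models to understand complex multi-object environments, particularly in real-time applications such as autonomous driving, where rapid identification of key objects is crucial. The key to the solution is an Attention Map Generator that highlights the salient objects relevant to driving decisions in critical video frames, directing the model to these key regions so that it generates clear, relevant explanations and makes the vehicle's decision process in critical situations more interpretable.
Link: https://arxiv.org/abs/2506.22494
Authors: Shihong Ling,Yue Wan,Xiaowei Jia,Na Du
Affiliations: University of Pittsburgh
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025. 7 pages, 3 figures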
Abstract:This paper introduces a new framework, DriveBLIP2, built upon the BLIP2-OPT architecture, to generate accurate and contextually relevant explanations for emerging driving scenarios. While existing vision-language models perform well in general tasks, they encounter difficulties in understanding complex, multi-object environments, particularly in real-time applications such as autonomous driving, where the rapid identification of key objects is crucial. To address this limitation, an Attention Map Generator is proposed to highlight significant objects relevant to driving decisions within critical video frames. By directing the model’s focus to these key regions, the generated attention map helps produce clear and relevant explanations, enabling drivers to better understand the vehicle’s decision-making process in critical situations. Evaluations on the DRAMA dataset reveal significant improvements in explanation quality, as indicated by higher BLEU, ROUGE, CIDEr, and SPICE scores compared to baseline models. These findings underscore the potential of targeted attention mechanisms in vision-language models for enhancing explainability in real-time autonomous driving.
[CV-234] Wireless Home Automation Using Social Networking Websites
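[Quick read]: This paper addresses the challenges facing wireless home automation systems (WHAS): security, controlling a variety of home appliances through a single interface, and user friendliness. The key to the solution is to use the authentication systems of social networking websites such as Twitter, track the user's activity on the social network, and control the user's home appliances accordingly.
Link: https://arxiv.org/abs/2506.22482
Authors: Divya Alok Gupta,Dwith Chenna,B. Aditya Vighnesh Ramakanth
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20th Annual International Conference on Advanced Computing and Communications (ADCOM) 2014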
Abstract:With the advent of the Internet of Things, Wireless Home Automation Systems (WHAS) are gradually gaining popularity. These systems face multiple challenges, such as security, controlling a variety of home appliances through a single interface, and user friendliness. In this paper we propose a system that uses the secure authentication systems of social networking websites such as Twitter, tracks the end user's activities on the social network, and then controls his or her domestic appliances. Finally, we highlight the applications of the proposed WHAS and compare the advantages of our proposed system over traditional home automation systems.
[CV-235] Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization ICML2025
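[Quick read]: This paper addresses the high computational cost of iterative sampling in diffusion models, which has become a major bottleneck in their application. It proposes Modulated Diffusion (MoDiff), a framework whose key idea is to accelerate generative modeling through modulated quantization and error compensation. MoDiff inherits the advantages of existing caching and quantization methods and serves as a general framework applicable to all diffusion models; its effectiveness is supported by theoretical analysis and experiments.
Link: https://arxiv.org/abs/2506.22463
Authors: Weizhi Gao,Zhichao Hou,Junqi Yin,Feiyi Wang,Linyu Peng,Xiaorui Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages, accepted by ICML 2025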
Abstract:Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at this https URL.
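The abstract does not spell out what "modulated quantization" does. One plausible reading, stated here purely as an assumption, is that low-bit quantization is applied to the change in an activation between consecutive sampling steps rather than to the activation itself, since that change spans a much narrower range. The toy sketch below illustrates only that intuition with made-up numbers and a generic min-max quantizer; it is not the paper's algorithm, and error compensation is not shown.

```python
def quantize(values, bits):
    """Uniform min-max quantization of a list of floats to 2**bits levels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def max_error(a, b):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))

# Activations at two consecutive sampling steps (made-up numbers).
prev = [0.0, 4.0, -8.0, 16.0]
curr = [0.1, 4.2, -7.9, 15.8]

# Baseline: quantize the current activation directly at 3 bits.
direct = quantize(curr, bits=3)

# Residual variant: quantize only the change since the previous step and add
# it to the previous activation (assumed cached at full precision).
residual = [c - p for c, p in zip(curr, prev)]
recon = [p + r for p, r in zip(prev, quantize(residual, bits=3))]

direct_err = max_error(curr, direct)
residual_err = max_error(curr, recon)
```

With these numbers the residual variant has a far smaller worst-case error at the same bit width, because the eight quantization levels cover a range of about 0.4 instead of about 24.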
[CV-236] Counting with Confidence: Accurate Pest Monitoring in Water Traps
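[Quick read]: This paper addresses the fact that existing vision-based pest counting models are not assessed for the reliability of their counts in real-world use: models are typically evaluated on datasets with ground truth, but the accuracy of counting results cannot be verified at deployment. The key to the solution is a method for comprehensively evaluating pest counting confidence that combines information related to the counting result with external environmental conditions: a pest detection network extracts counting information, the images undergo image quality, image complexity, and pest distribution uniformity assessment, and a regression model predicts the final counting confidence.
Link: https://arxiv.org/abs/2506.22438
Authors: Xumin Gao,Mark Stevens,Grzegorz Cielniak
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: © 20XX the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND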
Abstract:Accurate pest population monitoring and tracking their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To this end, this paper proposes a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. The changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. Notably, we design a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods, and we propose an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R^2 by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.
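The average gradient magnitude used as a clarity measure in this abstract can be sketched as follows. The forward-difference operator and the tiny test images are assumptions made for illustration; the abstract does not say which gradient operator the authors use.

```python
import math

def average_gradient_magnitude(img):
    """Mean of sqrt(gx^2 + gy^2) using forward differences.

    img: 2D list of grayscale intensities (rows of equal length).
    The last row and column are skipped so every difference is defined.
    """
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            gx = img[y][x + 1] - img[y][x]
            gy = img[y + 1][x] - img[y][x]
            total += math.hypot(gx, gy)
    return total / ((h - 1) * (w - 1))

# A high-contrast striped texture and a washed-out version of it,
# standing in for a clear frame and one blurred by stirring.
sharp = [[0, 255, 0, 255], [0, 255, 0, 255], [0, 255, 0, 255]]
blurred = [[96, 160, 96, 160], [96, 160, 96, 160], [96, 160, 96, 160]]

sharp_score = average_gradient_magnitude(sharp)
blurred_score = average_gradient_magnitude(blurred)
```

A lower score on the second image reflects the loss of fine detail, which is the kind of change this measure is meant to quantify.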
[CV-237] Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring
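[Quick read]: This paper addresses insufficient image alignment accuracy in structural health monitoring (SHM) under real-world conditions such as perspective distortion, occlusion, and low contrast. Traditional feature detectors such as SIFT and SURF rely on Gaussian scale spaces, which suppress high-frequency edges and make thin cracks hard to localize; lightweight binary methods such as ORB and BRISK show poor keypoint repeatability on textured or shadowed surfaces. The study proposes a physics-informed alignment framework whose key idea is to construct a crack-preserving scale space with nonlinear anisotropic diffusion and combine it with RANSAC-based homography estimation for geometric correction, achieving accurate alignment without training, parameter tuning, or prior calibration.
Link: https://arxiv.org/abs/2506.22437
Authors: Xinxin Sun,Peter Chang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 43 pages, 5 figures, 19 tables. Submitted to NDT&E International. This work may also be of interest to researchers in optical NDE and civil engineering SHM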
Abstract:Accurate image alignment is essential for monitoring crack evolution in structural health monitoring (SHM), particularly under real-world conditions involving perspective distortion, occlusion, and low contrast. However, traditional feature detectors such as SIFT and SURF, which rely on Gaussian-based scale spaces, tend to suppress high-frequency edges, making them unsuitable for thin crack localization. Lightweight binary alternatives like ORB and BRISK, while computationally efficient, often suffer from poor keypoint repeatability on textured or shadowed surfaces. This study presents a physics-informed alignment framework that adapts the open KAZE architecture to SHM-specific challenges. By utilizing nonlinear anisotropic diffusion to construct a crack-preserving scale space, and integrating RANSAC-based homography estimation, the framework enables accurate geometric correction without the need for training, parameter tuning, or prior calibration. The method is validated on time-lapse images of masonry and concrete acquired via handheld smartphone under varied field conditions, including shadow interference, cropping, oblique viewing angles, and surface clutter. Compared to classical detectors, the proposed framework reduces crack area and spine length errors by up to 70 percent and 90 percent, respectively, while maintaining sub-5 percent alignment error in key metrics. Unsupervised, interpretable, and computationally lightweight, this approach supports scalable deployment via UAVs and mobile platforms. By tailoring nonlinear scale-space modeling to SHM image alignment, this work offers a robust and physically grounded alternative to conventional techniques for tracking real-world crack evolution.
[CV-238] ICP-3DGS: SfM-free 3D Gaussian Splatting for Large-scale Unbounded Scenes ICIP2025
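[Quick read]: This paper addresses the reliance of neural rendering methods such as NeRFs and 3D Gaussian Splatting (3DGS) on preprocessed camera poses and 3D structural priors, which are usually obtained from structure-from-motion (SfM) but are hard to obtain in outdoor scenes. The key to the solution is to combine Iterative Closest Point (ICP) with optimization-based refinement for accurate camera pose estimation under large camera movements, and to introduce a voxel-based scene densification approach to guide reconstruction of large-scale scenes.
Link: https://arxiv.org/abs/2506.21629
Authors: Chenhao Zhang,Yezhi Shen,Fengqing Zhu
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages. Source code is available at this https URL. To appear at ICIP 2025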
Abstract:In recent years, neural rendering methods such as NeRFs and 3D Gaussian Splatting (3DGS) have made significant progress in scene reconstruction and novel view synthesis. However, they heavily rely on preprocessed camera poses and 3D structural priors from structure-from-motion (SfM), which are challenging to obtain in outdoor scenarios. To address this challenge, we propose to incorporate Iterative Closest Point (ICP) with optimization-based refinement to achieve accurate camera pose estimation under large camera movements. Additionally, we introduce a voxel-based scene densification approach to guide the reconstruction in large-scale scenes. Experiments demonstrate that our approach ICP-3DGS outperforms existing methods in both camera pose estimation and novel view synthesis across indoor and outdoor scenes of various scales. Source code is available at this https URL.
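For readers unfamiliar with Iterative Closest Point, the sketch below shows the basic loop on 2D points: match each source point to its nearest target point, fit the least-squares rigid transform, apply it, and repeat. This is the generic textbook form, reduced to 2D for brevity. The paper works with 3D scenes and adds optimization-based refinement, neither of which is reproduced here, and the point sets are made up.

```python
import math

def nearest(p, points):
    """Return the point in `points` closest to p (brute force)."""
    return min(points, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    num = den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        num += ax * by - ay * bx  # cross term of centred pairs
        den += ax * bx + ay * by  # dot term of centred pairs
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, theta, tx, ty):
    """Rotate by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, iters=20):
    """Align src to dst by alternating nearest-point matching and fitting."""
    cur = list(src)
    for _ in range(iters):
        matches = [nearest(p, dst) for p in cur]
        cur = transform(cur, *fit_rigid_2d(cur, matches))
    return cur

# Target points, and a slightly rotated and shifted copy to align back.
dst = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
src = transform(dst, 0.1, 0.2, -0.1)
aligned = icp_2d(src, dst)
```

The nearest-point matching only recovers the right correspondences when the initial misalignment is small, which is why ICP-based pipelines usually pair it with a further refinement stage.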
[CV-239] C3VDv2 – Colonoscopy 3D video dataset with enhanced realism
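[Quick read]: This paper addresses the lack of high-quality datasets for training and validating 3D colonoscopy reconstruction algorithms, which limits how much computer vision can improve the diagnostic performance of colonoscopy. The key to the solution is the C3VDv2 dataset: 192 video sequences captured from high-fidelity silicone colon phantom segments, with ground truth including depth, surface normals, optical flow, occlusion, six-degree-of-freedom pose, coverage maps, and 3D models provided for 169 of them, together with a variety of simulated challenging scenarios that enhance realism and support more robust and representative development and evaluation of 3D reconstruction algorithms.
Link: https://arxiv.org/abs/2506.24074
Authors: Mayank V. Golhar,Lucas Sebastian Galeano Fretes,Loren Ayers,Venkata S. Akshintala,Taylor L. Bobrow,Nicholas J. Durr
Affiliations: Johns Hopkins University; Johns Hopkins Medicine
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 7 figures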
Abstract:Computer vision techniques have the potential to improve the diagnostic performance of colonoscopy, but the lack of 3D colonoscopy datasets for training and validation hinders their development. This paper introduces C3VDv2, the second version (v2) of the high-definition Colonoscopy 3D Video Dataset, featuring enhanced realism designed to facilitate the quantitative evaluation of 3D colon reconstruction algorithms. 192 video sequences were captured by imaging 60 unique, high-fidelity silicone colon phantom segments. Ground truth depth, surface normals, optical flow, occlusion, six-degree-of-freedom pose, coverage maps, and 3D models are provided for 169 colonoscopy videos. Eight simulated screening colonoscopy videos acquired by a gastroenterologist are provided with ground truth poses. The dataset includes 15 videos featuring colon deformations for qualitative assessment. C3VDv2 emulates diverse and challenging scenarios for 3D reconstruction algorithms, including fecal debris, mucous pools, blood, debris obscuring the colonoscope lens, en-face views, and fast camera motion. The enhanced realism of C3VDv2 will allow for more robust and representative development and evaluation of 3D reconstruction algorithms.
[CV-240] Supervised Diffusion-Model-Based PET Image Reconstruction MICCAI2025
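[Quick read]: This paper addresses a limitation of PET image reconstruction methods that rely on unsupervised diffusion models (DMs): they cannot explicitly model the interaction between the diffusion prior and the noisy measurement data, which may limit reconstruction accuracy. The key to the solution is a supervised DM-based algorithm that enforces the non-negativity of PET's Poisson likelihood model and accommodates the wide intensity range of PET images, achieving quantitative performance that matches or exceeds existing deep learning methods across dose levels.
Link: https://arxiv.org/abs/2506.24034
Authors: George Webber,Alexander Hammers,Andrew P King,Andrew J Reader
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures. Submitted to MICCAI 2025, not peer-reviewed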
Abstract:Diffusion models (DMs) have recently been introduced as a regularizing prior for PET image reconstruction, integrating DMs trained on high-quality PET images with unsupervised schemes that condition on measured data. While these approaches have potential generalization advantages due to their independence from the scanner geometry and the injected activity level, they forgo the opportunity to explicitly model the interaction between the DM prior and noisy measurement data, potentially limiting reconstruction accuracy. To address this, we propose a supervised DM-based algorithm for PET reconstruction. Our method enforces the non-negativity of PET’s Poisson likelihood model and accommodates the wide intensity range of PET images. Through experiments on realistic brain PET phantoms, we demonstrate that our approach outperforms or matches state-of-the-art deep learning-based methods quantitatively across a range of dose levels. We further conduct ablation studies to demonstrate the benefits of the proposed components in our model, as well as its dependence on training data, parameter count, and number of diffusion steps. Additionally, we show that our approach enables more accurate posterior sampling than unsupervised DM-based methods, suggesting improved uncertainty estimation. Finally, we extend our methodology to a practical approach for fully 3D PET and present example results from real [18F]FDG brain PET data.
[CV-241] ShapeKit
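[Quick read]: This paper addresses insufficient anatomical shape accuracy in whole-body medical segmentation. The key to the solution is ShapeKit, a shape-focused toolkit that refines anatomical shapes without model re-training or fine-tuning; the authors report that it improves segmentation performance by over 8%.
Link: https://arxiv.org/abs/2506.24003
Authors: Junqi Liu,Dongli He,Wenxuan Li,Ningyu Wang,Alan L. Yuille,Zongwei Zhou
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: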
Abstract:In this paper, we present a practical approach to improve anatomical shape accuracy in whole-body medical segmentation. Our analysis shows that a shape-focused toolkit can enhance segmentation performance by over 8%, without the need for model re-training or fine-tuning. In comparison, modifications to model architecture typically lead to marginal gains of less than 3%. Motivated by this observation, we introduce ShapeKit, a flexible and easy-to-integrate toolkit designed to refine anatomical shapes. This work highlights the underappreciated value of shape-based tools and calls attention to their potential impact within the medical segmentation community.
[CV-242] Spatio-Temporal Representation Decoupling and Enhancement for Federated Instrument Segmentation in Surgical Videos
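[Quick read]: This paper addresses the challenges of surgical instrument segmentation under Federated Learning (FL). Existing FL methods overlook inherent characteristics of the surgical domain: different scenarios show diverse anatomical backgrounds but highly similar instrument representations, and surgical simulators can generate large-scale synthetic data. The key to the solution is a personalized FL scheme, Spatio-Temporal Representation Decoupling and Enhancement (FedST), which incorporates surgical domain knowledge in both local-site and global-server training. In local training, a Representation Separation and Cooperation (RSC) mechanism decouples the query embedding layer so that it is trained privately to encode each site's background, while the remaining parameters are optimized globally to capture consistent instrument representations; a text-guided channel selection highlights site-specific features. In global training, Synthesis-based Explicit Representation Quantification (SERQ) defines an explicit representation target from synthetic data to improve model generalization.
Link: https://arxiv.org/abs/2506.23759
Authors: Zheng Fang,Xiaoming Qi,Chun-Mei Feng,Jialun Pei,Weixin Si,Yueming Jin
Affiliations: National University of Singapore (NUS); Institute of High Performance Computing, A*STAR; The Chinese University of Hong Kong; Chinese Academy of Sciences
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: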
Abstract:Surgical instrument segmentation under Federated Learning (FL) is a promising direction, which enables multiple surgical sites to collaboratively train the model without centralizing datasets. However, there exist very limited FL works in surgical data science, and FL methods for other modalities do not consider inherent characteristics in surgical domain: i) different scenarios show diverse anatomical backgrounds while highly similar instrument representation; ii) there exist surgical simulators which promote large-scale synthetic data generation with minimal efforts. In this paper, we propose a novel Personalized FL scheme, Spatio-Temporal Representation Decoupling and Enhancement (FedST), which wisely leverages surgical domain knowledge during both local-site and global-server training to boost segmentation. Concretely, our model embraces a Representation Separation and Cooperation (RSC) mechanism in local-site training, which decouples the query embedding layer to be trained privately, to encode respective backgrounds. Meanwhile, other parameters are optimized globally to capture the consistent representations of instruments, including the temporal layer to capture similar motion patterns. A textual-guided channel selection is further designed to highlight site-specific features, facilitating model adaptation to each site. Moreover, in global-server training, we propose Synthesis-based Explicit Representation Quantification (SERQ), which defines an explicit representation target based on synthetic data to synchronize the model convergence during fusion for improving model generalization.
[CV-243] Deep Learning-Based Semantic Segmentation for Real-Time Kidney Imaging and Measurements with Augmented Reality-Assisted Ultrasound
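[Quick read]: This paper addresses the challenges of ultrasound (US) examination: a steep learning curve caused by its dynamic nature and non-standard imaging planes, and the distraction of constantly shifting attention between the US screen and the patient. The key to the solution is to integrate deep learning (DL)-based semantic segmentation for real-time (RT) automated kidney volumetric measurement, relieving the time cost and fatigue of manual measurement, and to use augmented reality (AR) to project the display directly into the clinician's field of view, improving ergonomics and reducing cognitive load.
Link: https://arxiv.org/abs/2506.23721
Authors: Gijs Luijten,Roberto Maria Scardigno,Lisle Faray de Paiva,Peter Hoyer,Jens Kleesiek,Domenico Buongiorno,Vitoantonio Bevilacqua,Jan Egger
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: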
Abstract:Ultrasound (US) is widely accessible and radiation-free but has a steep learning curve due to its dynamic nature and non-standard imaging planes. Additionally, the constant need to shift focus between the US screen and the patient poses a challenge. To address these issues, we integrate deep learning (DL)-based semantic segmentation for real-time (RT) automated kidney volumetric measurements, which are essential for clinical assessment but are traditionally time-consuming and prone to fatigue. This automation allows clinicians to concentrate on image interpretation rather than manual measurements. Complementing DL, augmented reality (AR) enhances the usability of US by projecting the display directly into the clinician’s field of view, improving ergonomics and reducing the cognitive load associated with screen-to-patient transitions. Two AR-DL-assisted US pipelines on HoloLens-2 are proposed: one streams directly via the application programming interface for a wireless setup, while the other supports any US device with video output for broader accessibility. We evaluate RT feasibility and accuracy using the Open Kidney Dataset and open-source segmentation models (nnU-Net, Segmenter, YOLO with MedSAM and LiteMedSAM). Our open-source GitHub pipeline includes model implementations, measurement algorithms, and a Wi-Fi-based streaming solution, enhancing US training and diagnostics, especially in point-of-care settings.
[CV-244] MDPG: Multi-domain Diffusion Prior Guidance for MRI Reconstruction MICCAI2025
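[Quick read]: This paper addresses insufficient data consistency in MRI reconstruction, in particular the low fidelity of images generated by diffusion models (DMs) due to their stochastic nature in the image domain. The key to the solution is Multi-domain Diffusion Prior Guidance (MDPG) provided by pre-trained latent diffusion models (LDMs): a Visual-Mamba-based backbone, a Latent Guided Attention (LGA) mechanism, and a Dual-domain Fusion Branch (DFB) jointly exploit priors in the latent and image domains, and a k-space regularization strategy based on the non-auto-calibration signal (NACS) set further improves data consistency.
Link: https://arxiv.org/abs/2506.23701
Authors: Lingtong Zhang,Mengdie Song,Xiaohan Hao,Huayu Mai,Bensheng Qiu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025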
Abstract:Magnetic Resonance Imaging (MRI) reconstruction is essential in medical diagnostics. As the latest generative models, diffusion models (DMs) have struggled to produce high-fidelity images due to their stochastic nature in image domains. Latent diffusion models (LDMs) yield both compact and detailed prior knowledge in latent domains, which could effectively guide the model towards more effective learning of the original data distribution. Inspired by this, we propose Multi-domain Diffusion Prior Guidance (MDPG) provided by pre-trained LDMs to enhance data consistency in MRI reconstruction tasks. Specifically, we first construct a Visual-Mamba-based backbone, which enables efficient encoding and reconstruction of under-sampled images. Then pre-trained LDMs are integrated to provide conditional priors in both latent and image domains. A novel Latent Guided Attention (LGA) is proposed for efficient fusion in multi-level latent domains. Simultaneously, to effectively utilize a prior in both the k-space and image domain, under-sampled images are fused with generated full-sampled images by the Dual-domain Fusion Branch (DFB) for self-adaption guidance. Lastly, to further enhance the data consistency, we propose a k-space regularization strategy based on the non-auto-calibration signal (NACS) set. Extensive experiments on two public MRI datasets fully demonstrate the effectiveness of the proposed methodology. The code is available at this https URL.
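"Data consistency" in undersampled MRI reconstruction usually means that the reconstruction must agree with the k-space samples that were actually measured. The sketch below shows the common hard-replacement form of that constraint on a 1D toy signal with a hand-written DFT. It is a generic illustration with made-up values, not the NACS-based regularization proposed in the paper.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a sequence (O(n^2), for illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def data_consistency(estimate, measured_k, mask):
    """Overwrite the estimate's k-space samples wherever a measurement exists."""
    est_k = dft(estimate)
    merged = [m if keep else e for e, m, keep in zip(est_k, measured_k, mask)]
    return idft(merged)

truth = [1.0, 2.0, 4.0, 3.0]
mask = [True, True, False, True]   # k-space positions that were sampled
measured_k = [v if keep else 0 for v, keep in zip(dft(truth), mask)]
estimate = [1.2, 1.8, 3.9, 3.3]    # stand-in for a network's image-domain guess
corrected = data_consistency(estimate, measured_k, mask)
corrected_k = dft(corrected)
```

After the correction, the sampled frequencies match the measurements exactly, while the unsampled ones still come from the estimate, which is where a learned prior contributes.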
[CV-245] MedSAM-CA: A CNN-Augmented ViT with Attention-Enhanced Multi-Scale Fusion for Medical Image Segmentation
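[Quick read]: This paper addresses two challenges in medical image segmentation. First, deep learning methods depend heavily on large-scale annotated data, which is hard to obtain in medical settings because of privacy concerns and high annotation costs. Second, clinically difficult cases, such as low contrast in certain imaging modalities and blurry lesion boundaries caused by malignancy, still hinder precise segmentation. The key to the solution is MedSAM-CA, an architecture-level fine-tuning approach that adapts the pretrained Medical Segment Anything (MedSAM) model and introduces the Convolutional Attention-Enhanced Boundary Refinement Network (CBR-Net) and the Attention-Enhanced Feature Fusion Block (Atte-FFB) to improve boundary accuracy and reduce reliance on manual annotations.
Link: https://arxiv.org/abs/2506.23700
Authors: Peiting Tian,Xi Chen,Haixia Bi,Fan Li
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: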
Abstract:Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning, where accurate boundary delineation is essential for precise lesion localization, organ identification, and quantitative assessment. In recent years, deep learning-based methods have significantly advanced segmentation accuracy. However, two major challenges remain. First, the performance of these methods heavily relies on large-scale annotated datasets, which are often difficult to obtain in medical scenarios due to privacy concerns and high annotation costs. Second, clinically challenging scenarios, such as low contrast in certain imaging modalities and blurry lesion boundaries caused by malignancy, still pose obstacles to precise segmentation. To address these challenges, we propose MedSAM-CA, an architecture-level fine-tuning approach that mitigates reliance on extensive manual annotations by adapting the pretrained foundation model, Medical Segment Anything (MedSAM). MedSAM-CA introduces two key components: the Convolutional Attention-Enhanced Boundary Refinement Network (CBR-Net) and the Attention-Enhanced Feature Fusion Block (Atte-FFB). CBR-Net operates in parallel with the MedSAM encoder to recover boundary information potentially overlooked by long-range attention mechanisms, leveraging hierarchical convolutional processing. Atte-FFB, embedded in the MedSAM decoder, fuses multi-level fine-grained features from skip connections in CBR-Net with global representations upsampled within the decoder to enhance boundary delineation accuracy. Experiments on publicly available datasets covering dermoscopy, CT, and MRI imaging modalities validate the effectiveness of MedSAM-CA. On the dermoscopy dataset, MedSAM-CA achieves 94.43% Dice with only 2% of full training data, reaching 97.25% of full-data training performance, demonstrating strong effectiveness in low-resource clinical settings.
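The Dice score reported in this abstract measures the overlap between a predicted mask and the reference mask as 2|A∩B| / (|A| + |B|). The sketch below computes it for flat binary masks. The example masks are made up, and treating two empty masks as a perfect match is a convention chosen here, not something the abstract states.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this also avoids division by zero.
    return 1.0 if total == 0 else 2.0 * inter / total

pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
score = dice(pred, target)
```

Here two of the three foreground pixels in each mask overlap, so the score is 2·2 / (3 + 3), or about 0.667. A reported 94.43% corresponds to 0.9443 on this scale.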
[CV-246] Diffusion Model-based Data Augmentation Method for Fetal Head Ultrasound Segmentation
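[Quick read]: This paper addresses the difficulty of acquiring medical image data and the high cost of annotation, specifically for fetal head ultrasound segmentation. The key to the solution is a mask-guided generative AI (GenAI) approach based on diffusion models that generates synthetic fetal head ultrasound images paired with segmentation masks, which augment real datasets for supervised fine-tuning of the Segment Anything Model (SAM).
Link: https://arxiv.org/abs/2506.23664
Authors: Fangyijie Wang,Kevin Whelan,Félix Balado,Guénolé Silvestre,Kathleen M. Curran
Affiliations: University College Dublin; Research Ireland Centre for Research Training in Machine Learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: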
Abstract:Medical image data is less accessible than in other domains due to privacy and regulatory constraints. In addition, labeling requires costly, time-intensive manual image annotation by clinical experts. To overcome these challenges, synthetic medical data generation offers a promising solution. Generative AI (GenAI), employing generative deep learning models, has proven effective at producing realistic synthetic images. This study proposes a novel mask-guided GenAI approach using diffusion models to generate synthetic fetal head ultrasound images paired with segmentation masks. These synthetic pairs augment real datasets for supervised fine-tuning of the Segment Anything Model (SAM). Our results show that the synthetic data captures real image features effectively, and this approach reaches state-of-the-art fetal head segmentation, especially when trained with a limited number of real image-mask pairs. In particular, the segmentation reaches Dice Scores of 94.66% and 94.38% using a handful of ultrasound images from the Spanish and African cohorts, respectively. Our code, models, and data are available on GitHub.
[CV-247] A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation
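[Quick read]: This paper addresses the complexity of generating radiology reports from CT scans, particularly in renal imaging, where the nuances of medical images and the variability of clinical documentation make key information hard to capture. The key to the solution is a two-stage framework: the first stage uses a multi-task learning model to extract structured abnormality features (lesion location, size, enhancement, and attenuation); the second stage feeds these features and the corresponding CT image into a fine-tuned vision-language model to generate natural-language report sentences aligned with clinical findings. By combining structured features with visual information, the generated reports capture key clinical content with reasonable textual accuracy.
Link: https://arxiv.org/abs/2506.23584
Authors: Renjie Liang,Zhengkang Fan,Jinqian Pan,Chenkun Sun,Russell Terry,Jie Xu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: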
Abstract:Generating radiology reports from CT scans remains a complex task due to the nuanced nature of medical imaging and the variability in clinical documentation. In this study, we propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes such as location, size, enhancement, and attenuation. These extracted features are subsequently combined with the corresponding CT image and fed into a fine-tuned vision-language model to generate natural language report sentences aligned with clinical findings. We conduct experiments on a curated dataset of renal CT studies with manually annotated sentence-slice-feature triplets and evaluate performance using both classification metrics and natural language generation metrics. Our results demonstrate that the proposed model outperforms random baselines across all abnormality types, and the generated reports capture key clinical content with reasonable textual accuracy. This exploratory work highlights the feasibility of modular, feature-informed report generation for renal imaging. Future efforts will focus on extending this pipeline to 3D CT volumes and further improving clinical fidelity in multimodal medical AI systems.
[CV-248] AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm ICCV
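Quick read: This paper addresses the problem that existing learning-based methods for reconstructing high dynamic range (HDR) images from multi-exposure low dynamic range (LDR) inputs rely on empirical design rather than a theoretical foundation, which may affect their reliability. The key to its solution is the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), which systematically decouples HDR reconstruction into two alternately optimized subtasks, alignment and fusion, and lets the two reinforce each other through alternating refinement. The method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporates spatial correspondence priors across LDR images, and naturally links the alignment and fusion subproblems through joint constraints, yielding a mathematically grounded, end-to-end trainable network.
Link: https://arxiv.org/abs/2506.23537
Authors: Xinyue Li,Zhangkai Ni,Wenhan Yang
Affiliations: Tongji University; Pengcheng Laboratory
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to International Conference on Computer Vision (ICCV) 2025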
Abstract:Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but they rely more on empirical design than on theoretical foundation, which can impact their reliability. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks – alignment and fusion – optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. Our method formulates multi-exposure HDR reconstruction from a Maximum A Posteriori (MAP) estimation perspective, explicitly incorporating spatial correspondence priors across LDR images and naturally bridging the alignment and fusion subproblems through joint constraints. Building on the mathematical foundation, we reimagine traditional iterative optimization through unfolding – transforming the conventional solution process into an end-to-end trainable AFUNet with carefully designed modules that work progressively. Specifically, each iteration of AFUNet incorporates an Alignment-Fusion Module (AFM) that alternates between a Spatial Alignment Module (SAM) for alignment and a Channel Fusion Module (CFM) for adaptive feature fusion, progressively bridging misaligned content and exposure discrepancies. Extensive qualitative and quantitative evaluations demonstrate AFUNet’s superior performance, consistently surpassing state-of-the-art methods. Our code is available at: this https URL
[CV-249] Artificial Intelligence-assisted Pixel-level Lung (APL) Scoring for Fast and Accurate Quantification in Ultra-short Echo-time MRI
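Quick read: This paper addresses the lack of a fast and accurate quantitative scoring system for structural lung magnetic resonance imaging (MRI) in cystic fibrosis (CF). Conventional approaches such as grid-level scoring are slow, and existing structural lung MRI (e.g., UTE-MRI) lacks effective quantification tools. The key to its solution is an Artificial Intelligence-assisted Pixel-level Lung (APL) scoring system with five stages: image loading, AI lung segmentation, lung-bounded slice sampling, pixel-level annotation, and quantification and reporting. It is faster than the conventional method (8.2 minutes per subject) and more accurate (p=0.021), while correlating strongly with grid-level scoring (R=0.973, p=5.85e-9).
Link: https://arxiv.org/abs/2506.23506
Authors: Bowen Xin,Rohan Hickey,Tamara Blake,Jin Jin,Claire E Wainwright,Thomas Benkert,Alto Stemmer,Peter Sly,David Coman,Jason Dowling
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Oral presentation in ISMRM2025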
Abstract:Lung magnetic resonance imaging (MRI) with ultrashort echo-time (UTE) represents a recent breakthrough in lung structure imaging, providing image resolution and quality comparable to computed tomography (CT). Due to the absence of ionising radiation, MRI is often preferred over CT in paediatric diseases such as cystic fibrosis (CF), one of the most common genetic disorders in Caucasians. To assess structural lung damage in CF imaging, CT scoring systems provide valuable quantitative insights for disease diagnosis and progression. However, few quantitative scoring systems are available in structural lung MRI (e.g., UTE-MRI). To provide fast and accurate quantification in lung MRI, we investigated the feasibility of novel Artificial intelligence-assisted Pixel-level Lung (APL) scoring for CF. APL scoring consists of 5 stages, including 1) image loading, 2) AI lung segmentation, 3) lung-bounded slice sampling, 4) pixel-level annotation, and 5) quantification and reporting. The results show that our APL scoring took 8.2 minutes per subject, which was more than twice as fast as the previous grid-level scoring. Additionally, our pixel-level scoring was statistically more accurate (p=0.021), while strongly correlating with grid-level scoring (R=0.973, p=5.85e-9). This tool has great potential to streamline the workflow of UTE lung MRI in clinical settings, and to be extended to other structural lung MRI sequences (e.g., BLADE MRI) and other lung diseases (e.g., bronchopulmonary dysplasia).
[CV-250] UltraTwin: Towards Cardiac Anatomical Twin Generation from Multi-view 2D Ultrasound MICCAI2025
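Quick read: This paper addresses the challenge of constructing a high-precision cardiac anatomical twin from sparse multi-view 2D ultrasound (2D US) images to support precise treatment planning and clinical quantification. The key to its solution is a generative framework named UltraTwin, built on three main contributions: first, a real-world, high-quality dataset containing strictly paired multi-view 2D US and CT data as well as pseudo-paired data; second, a coarse-to-fine scheme for hierarchical reconstruction optimization; and third, an implicit autoencoder with topology-aware constraints, which improves the accuracy and structural plausibility of the reconstruction.
Link: https://arxiv.org/abs/2506.23490
Authors: Junxuan Yu,Yaofei Duan,Yuhao Huang,Yu Wang,Rongbo Ling,Weihao Luo,Ang Zhang,Jingxian Xu,Qiongying Ni,Yongsong Zhou,Binghan Li,Haoran Dou,Liping Liu,Yanfen Chu,Feng Geng,Zhe Sheng,Zhifeng Ding,Dingxin Zhang,Rui Huang,Yuhang Zhang,Xiaowei Xu,Tao Tan,Dong Ni,Zhongshan Gou,Xin Yang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025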
Abstract:Echocardiography is routine for cardiac examination. However, 2D ultrasound (US) struggles with accurate metric calculation and direct observation of 3D cardiac structures. Moreover, 3D US is limited by low resolution, small field of view and scarce availability in practice. Constructing the cardiac anatomical twin from 2D images is promising to provide precise treatment planning and clinical quantification. However, it remains challenging due to the scarcity of paired data, complex structures, and US noise. In this study, we introduce a novel generative framework, UltraTwin, to obtain a cardiac anatomical twin from sparse multi-view 2D US. Our contribution is three-fold. First, we pioneered the construction of a real-world and high-quality dataset containing strictly paired multi-view 2D US and CT, and pseudo-paired data. Second, we propose a coarse-to-fine scheme to achieve hierarchical reconstruction optimization. Last, we introduce an implicit autoencoder for topology-aware constraints. Extensive experiments show that UltraTwin reconstructs high-quality anatomical twins compared with strong competitors. We believe it advances anatomical twin modeling for potential applications in personalized cardiac care.
[CV-251] FD-DiT: Frequency Domain-Directed Diffusion Transformer for Low-Dose CT Reconstruction
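Quick read: This paper addresses the image artifacts and loss of detail that quantum and electronic noise cause in low-dose computed tomography (LDCT) images, which can affect diagnostic accuracy. The key to its solution is a frequency domain-directed diffusion transformer (FD-DiT), which progressively introduces noise until the distribution statistically aligns with that of LDCT data and then denoises, combined with a frequency decoupling technique that concentrates noise in the high-frequency domain so that essential anatomical structures and details are preserved. It also introduces a hybrid denoising network, sliding sparse local attention, and a learnable dynamic fusion strategy to improve high-frequency noise recognition and feature representation.
Link: https://arxiv.org/abs/2506.23466
Authors: Qiqing Liu,Guoquan Wei,Zekun Zhou,Yiyang Wen,Liu Shi,Qiegen Liu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 11 pages, 11 figures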
Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but suffers from image artifacts and loss of detail due to quantum and electronic noise, potentially impacting diagnostic accuracy. Transformers combined with diffusion models have been a promising approach for image generation. Nevertheless, existing methods exhibit limitations in preserving fine-grained image details. To address this issue, a frequency domain-directed diffusion transformer (FD-DiT) is proposed for LDCT reconstruction. FD-DiT centers on a diffusion strategy that progressively introduces noise until the distribution statistically aligns with that of LDCT data, followed by denoising processing. Furthermore, we employ a frequency decoupling technique to concentrate noise primarily in the high-frequency domain, thereby facilitating effective capture of essential anatomical structures and fine details. A hybrid denoising network is then utilized to optimize the overall data reconstruction process. To enhance the capability of recognizing high-frequency noise, we incorporate sliding sparse local attention to leverage the sparsity and locality of shallow-layer information, propagating them via skip connections to improve feature representation. Finally, we propose a learnable dynamic fusion strategy for optimal component integration. Experimental results demonstrate that at identical dose levels, LDCT images reconstructed by FD-DiT exhibit superior noise and artifact suppression compared to state-of-the-art methods.
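The idea of frequency decoupling mentioned above, separating a signal into a low-frequency part (coarse structure) and a high-frequency part (fine detail and much of the noise), can be illustrated on a 1D signal with a plain discrete Fourier transform. This is a generic sketch under stated assumptions, not the FD-DiT implementation: the function names and the hard `cutoff` mask are hypothetical choices made for illustration.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split a real 1D signal into low- and high-frequency parts that sum to the input."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n-k carry the same frequency, so compare min(k, n-k) to the cutoff.
    low_spec = [s if min(k, n - k) <= cutoff else 0 for k, s in enumerate(spectrum)]
    low = idft(low_spec)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# A constant level of 1.0 plus a fast alternating component of amplitude 0.5.
signal = [1.0 + 0.5 * (-1) ** t for t in range(8)]
low, high = split_bands(signal, cutoff=1)
print([round(v, 6) for v in low])
print([round(v, 6) for v in high])
```

In this toy case the constant level ends up in `low` and the alternating component in `high`; a 2D image would be split the same way with a 2D transform.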
[CV-252] Federated Breast Cancer Detection Enhanced by Synthetic Ultrasound Image Augmentation
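Quick read: This paper addresses the drop in model performance and generalization that federated learning (FL) suffers in medical image classification because of limited data availability and non-IID (non-independent and identically distributed) data across clients. The key to its solution is a generative AI based data augmentation framework that trains class-specific Deep Convolutional Generative Adversarial Networks to generate synthetic images and integrates them into the federated training process, improving performance on breast ultrasound image classification.
Link: https://arxiv.org/abs/2506.23334
Authors: Hongyi Pan,Ziliang Hong,Gorkem Durak,Ziyue Xu,Ulas Bagci
Affiliations: Northwestern University; NVIDIA
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: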
Abstract:Federated learning (FL) has emerged as a promising paradigm for collaboratively training deep learning models across institutions without exchanging sensitive medical data. However, its effectiveness is often hindered by limited data availability and non-independent, identically distributed data across participating clients, which can degrade model performance and generalization. To address these challenges, we propose a generative AI based data augmentation framework that integrates synthetic image sharing into the federated training process for breast cancer diagnosis via ultrasound images. Specifically, we train two simple class-specific Deep Convolutional Generative Adversarial Networks: one for benign and one for malignant lesions. We then simulate a realistic FL setting using three publicly available breast ultrasound image datasets: BUSI, BUS-BRA, and UDIAT. FedAvg and FedProx are adopted as baseline FL algorithms. Experimental results show that incorporating a suitable number of synthetic images improved the average AUC from 0.9206 to 0.9237 for FedAvg and from 0.9429 to 0.9538 for FedProx. We also note that excessive use of synthetic data reduced performance, underscoring the importance of maintaining a balanced ratio of real and synthetic samples. Our findings highlight the potential of generative AI based data augmentation to enhance FL results in the breast ultrasound image classification task.
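FedAvg, one of the two baseline algorithms named in the abstract above, aggregates client models by averaging their parameters, weighted by the size of each client's local dataset. A minimal sketch of that aggregation step follows; the function name `fedavg`, the flat parameter lists, and the client sizes are illustrative assumptions and do not reflect the paper's setup.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Three hypothetical clients with two parameters each.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
print(fedavg(clients, sizes))
```

The largest client pulls the global parameters toward its own values, which is one reason skewed or non-IID client data can hurt the global model.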
[CV-253] SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting MICCAI2025
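Quick read: This paper addresses the need in current surgical research and practice for accurate, text-promptable understanding of 3D surgical scenes, which matters especially for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is essential. Existing work treats surgical vision-language models (VLMs), 3D reconstruction, and segmentation separately and lacks support for real-time text-promptable 3D queries. The proposed solution is SurgTPGS, whose key is a 3D semantic feature learning strategy that combines the Segment Anything model with state-of-the-art vision-language models, together with semantic-aware deformation tracking and semantic region-aware optimization, to improve the precision and semantic coherence of 3D surgical scene reconstruction.
Link: https://arxiv.org/abs/2506.23309
Authors: Yiming Huang,Long Bai,Beilei Cui,Kun Yuan,Guankun Wang,Mobarakol Islam,Nicolas Padoy,Nassir Navab,Hongliang Ren
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025. Project Page: this https URL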
Abstract:In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works focus on surgical vision-language model (VLM), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantics feature learning strategy incorporating the Segment Anything model and state-of-the-art vision-language models. We extract the segmented language features for 3D surgical scene reconstruction, enabling a more in-depth understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features. Furthermore, we present semantic region-aware optimization, which utilizes regional-based semantic information to supervise the training, particularly promoting the reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices. SurgTPGS paves the way for developing next-generation intelligent surgical systems by enhancing surgical precision and safety. Our code is available at: this https URL.
[CV-254] BPD-Neo: An MRI Dataset for Lung-Trachea Segmentation with Clinical Data for Neonatal Bronchopulmonary Dysplasia
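Quick read: This paper addresses the diagnosis of bronchopulmonary dysplasia (BPD) in preterm infants and the identification of its etiology. Conventional practice relies on portable X-ray imaging, whereas lung magnetic resonance imaging (MRI) is a non-invasive alternative that avoids sedation and radiation and gives more detailed insight into the mechanisms of BPD. The key to its solution is to use high-resolution 3D MRI data with advanced image processing and semantic segmentation algorithms to help clinicians identify the etiology of BPD. The paper provides MRI scans of 40 neonates with corresponding semantic segmentations of the lungs and trachea, along with baseline segmentation models validated against clinical assessments, to support further research and development in neonatal lung imaging.
Link: https://arxiv.org/abs/2506.23305
Authors: Rachit Saluja,Arzu Kovanlikaya,Candace Chien,Lauren Kathryn Blatt,Jeffrey M. Perlman,Stefan Worgall,Mert R. Sabuncu,Jonathan P. Dyke
Affiliations: Cornell University & Cornell Tech; Weill Cornell Medicine
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: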
Abstract:Bronchopulmonary dysplasia (BPD) is a common complication among preterm neonates, with portable X-ray imaging serving as the standard diagnostic modality in neonatal intensive care units (NICUs). However, lung magnetic resonance imaging (MRI) offers a non-invasive alternative that avoids sedation and radiation while providing detailed insights into the underlying mechanisms of BPD. Leveraging high-resolution 3D MRI data, advanced image processing and semantic segmentation algorithms can be developed to assist clinicians in identifying the etiology of BPD. In this dataset, we present MRI scans paired with corresponding semantic segmentations of the lungs and trachea for 40 neonates, the majority of whom are diagnosed with BPD. The imaging data consist of free-breathing 3D stack-of-stars radial gradient echo acquisitions, known as the StarVIBE series. Additionally, we provide comprehensive clinical data and baseline segmentation models, validated against clinical assessments, to support further research and development in neonatal lung imaging.
[CV-255] Improving Myocardial Infarction Detection via Synthetic ECG Pretraining
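Quick read: This paper addresses the fact that accurate early diagnosis of myocardial infarction (MI) depends on high-quality labeled electrocardiogram (ECG) data, which is often scarce in practice. The key to its solution is a physiology-aware pipeline that synthesizes 12-lead ECGs with tunable MI morphology and realistic noise, and pre-trains recurrent and transformer classifiers with self-supervised masked autoencoding plus a joint reconstruction-classification objective, improving classification performance in low-data settings.
Link: https://arxiv.org/abs/2506.23259
Authors: Lachin Naghashyar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: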
Abstract:Myocardial infarction is a major cause of death globally, and accurate early diagnosis from electrocardiograms (ECGs) remains a clinical priority. Deep learning models have shown promise for automated ECG interpretation, but require large amounts of labeled data, which are often scarce in practice. We propose a physiology-aware pipeline that (i) synthesizes 12-lead ECGs with tunable MI morphology and realistic noise, and (ii) pre-trains recurrent and transformer classifiers with self-supervised masked-autoencoding plus a joint reconstruction-classification objective. We validate the realism of synthetic ECGs via statistical and visual analysis, confirming that key morphological features are preserved. Pretraining on synthetic data consistently improved classification performance, particularly in low-data settings, with AUC gains of up to 4 percentage points. These results show that controlled synthetic ECGs can help improve MI detection when real clinical data is limited.
[CV-256] Multi-Source COVID-19 Detection via Variance Risk Extrapolation
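Quick read: This paper addresses multi-source COVID-19 detection, specifically classifying chest CT scans from four different hospitals and medical centers into COVID and Non-COVID categories. The main challenge is the domain shift caused by differences in imaging protocols, scanners, and patient populations across institutions. The key to its solution is Variance Risk Extrapolation (VREx), which explicitly minimizes the variance of empirical risks across environments so that the model performs consistently across multiple source domains and generalizes better across domains. It also applies Mixup data augmentation, which interpolates inputs and labels, to improve generalization and robustness.
Link: https://arxiv.org/abs/2506.23208
Authors: Runtian Yuan,Qingqiu Li,Junlin Hou,Jilan Xu,Yuejie Zhang,Rui Feng,Hao Chen
Affiliations: Fudan University; The Hong Kong University of Science and Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: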
Abstract:We present our solution for the Multi-Source COVID-19 Detection Challenge, which aims to classify chest CT scans into COVID and Non-COVID categories across data collected from four distinct hospitals and medical centers. A major challenge in this task lies in the domain shift caused by variations in imaging protocols, scanners, and patient populations across institutions. To enhance the cross-domain generalization of our model, we incorporate Variance Risk Extrapolation (VREx) into the training process. VREx encourages the model to maintain consistent performance across multiple source domains by explicitly minimizing the variance of empirical risks across environments. This regularization strategy reduces overfitting to center-specific features and promotes learning of domain-invariant representations. We further apply Mixup data augmentation to improve generalization and robustness. Mixup interpolates both the inputs and labels of randomly selected pairs of training samples, encouraging the model to behave linearly between examples and enhancing its resilience to noise and limited data. Our method achieves an average macro F1 score of 0.96 across the four sources on the validation set, demonstrating strong generalization.
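The two ingredients described in the abstract above can be sketched in a few lines. The VREx objective below is written as the mean per-environment risk plus a penalty `beta` times the variance of those risks; some formulations use the sum of risks instead of the mean, which only rescales the first term. Mixup forms a convex combination of two samples and their labels. The function names, the `beta` value, and the Beta(0.4, 0.4) parameters are assumptions for illustration, not the authors' settings.

```python
import random


def vrex_objective(env_risks, beta=1.0):
    """Mean of per-environment risks plus beta times their (population) variance."""
    n = len(env_risks)
    mean_risk = sum(env_risks) / n
    variance = sum((r - mean_risk) ** 2 for r in env_risks) / n
    return mean_risk + beta * variance


def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two inputs and of their one-hot labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# Hypothetical average losses on four hospitals (environments).
risks = [0.30, 0.50, 0.40, 0.40]
print(vrex_objective(risks, beta=10.0))

# Mixup usually draws the mixing weight from a Beta(alpha, alpha) distribution.
lam = random.betavariate(0.4, 0.4)
mixed_x, mixed_y = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam)
print(mixed_x, mixed_y)
```

When every environment has the same risk the variance term vanishes, so the penalty only acts when the model does better at some centers than at others.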
[CV-257] Score-based Diffusion Model for Unpaired Virtual Histology Staining
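Quick read: This paper addresses key problems in virtual staining: effectively disentangling staining style from tissue structure, making the staining process controllable and adaptable to different tissues and protein types, and modeling strict structural consistency between HE and IHC images that are not pixel-aligned. The key to its solution is a mutual information (MI)-guided score-based diffusion model comprising: 1) a global MI-guided energy function that disentangles tissue structure and staining characteristics across modalities; 2) a timestep-customized reverse diffusion process for precise control of staining intensity and structural reconstruction; and 3) a local MI-driven contrastive learning strategy that ensures cellular-level structural consistency between HE and IHC images.
Link: https://arxiv.org/abs/2506.23184
Authors: Anran Liu,Xiaofei Wang,Jing Cai,Chao Li
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures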
Abstract:Hematoxylin and eosin (HE) staining visualizes histology but lacks specificity for diagnostic markers. Immunohistochemistry (IHC) staining provides protein-targeted staining but is restricted by tissue availability and antibody specificity. Virtual staining, i.e., computationally translating the HE image to its IHC counterpart while preserving the tissue structure, is promising for efficient IHC generation. Existing virtual staining methods still face key challenges: 1) effective decomposition of staining style and tissue structure, 2) controllable staining process adaptable to diverse tissue and proteins, and 3) rigorous structural consistency modelling to handle the non-pixel-aligned nature of paired HE and IHC images. This study proposes a mutual-information (MI)-guided score-based diffusion model for unpaired virtual staining. Specifically, we design 1) a global MI-guided energy function that disentangles the tissue structure and staining characteristics across modalities, 2) a novel timestep-customized reverse diffusion process for precise control of the staining intensity and structural reconstruction, and 3) a local MI-driven contrastive learning strategy to ensure the cellular level structural consistency between HE-IHC images. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches, highlighting its biomedical potential. Codes will be open-sourced upon acceptance.
[CV-258] CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation
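Quick read: This paper addresses several key problems in multi-organ medical segmentation, including inaccurate segmentation details, dependence on geometric prompts, and loss of spatial information. The key to its solution is a new model named CRISP-SAM2, built on the SAM2 architecture with cross-modal interaction and semantic prompting. A progressive cross-attention interaction mechanism converts visual and textual inputs into cross-modal contextualized semantics, which are injected into the image encoder to improve its detailed understanding of visual information; a semantic prompting strategy replaces the original prompt encoder to reduce reliance on geometric prompts and sharpen the perception of challenging targets. A similarity-sorting self-updating memory strategy and a mask-refining process further adapt the model to medical imaging and enhance local details.
Link: https://arxiv.org/abs/2506.23121
Authors: Xinlei Yu,Chanmiao Wang,Hui Jin,Ahmed Elazab,Gangyong Jia,Xiang Wan,Changqing Zou,Ruiquan Ge
Affiliations: Hangzhou Dianzi University; Shenzhen Research Institute of Big Data; Shenzhen University; Zhejiang University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures, 10 tables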
Abstract:Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: this https URL.
[CV-259] MedRegion-CT: Region-Focused Multimodal LLM for Comprehensive 3D CT Report Generation ICCV2025
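Quick read: This paper addresses the weakness of existing CT-based report generation methods in capturing region-specific details, which can cause some abnormalities to be missed. The key to its solution is the MedRegion-CT framework, which focuses on regions of interest through three core innovations: first, Region Representative (R²) Token Pooling, which uses a pretrained 2D vision model to extract 3D CT features efficiently; second, a universal segmentation model that generates pseudo-masks from which region-centric features are extracted; and third, patient-specific attributes extracted from the segmentation results and converted into text prompts to enrich the model's understanding of patient-specific context.
Link: https://arxiv.org/abs/2506.23102
Authors: Sunggu Kyung,Jinyoung Seo,Hyunseok Lim,Dongyeong Kim,Hyungbin Park,Jimin Sung,Jihyun Kim,Wooyoung Jo,Yoojin Nam,Namkug Kim
Affiliations: Department of Bioengineering, University of Ulsan College of Medicine, Asan Medical Center
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, submitted to ICCV 2025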
Abstract:The recent release of RadGenome-Chest CT has significantly advanced CT-based report generation. However, existing methods primarily focus on global features, making it challenging to capture region-specific details, which may cause certain abnormalities to go unnoticed. To address this, we propose MedRegion-CT, a region-focused Multi-Modal Large Language Model (MLLM) framework, featuring three key innovations. First, we introduce Region Representative (R²) Token Pooling, which utilizes a 2D-wise pretrained vision model to efficiently extract 3D CT features. This approach generates global tokens representing overall slice features and region tokens highlighting target areas, enabling the MLLM to process comprehensive information effectively. Second, a universal segmentation model generates pseudo-masks, which are then processed by a mask encoder to extract region-centric features. This allows the MLLM to focus on clinically relevant regions, using six predefined region masks. Third, we leverage segmentation results to extract patient-specific attributes, including organ size, diameter, and locations. These are converted into text prompts, enriching the MLLM’s understanding of patient-specific contexts. To ensure rigorous evaluation, we conducted benchmark experiments on report generation using the RadGenome-Chest CT. MedRegion-CT achieved state-of-the-art performance, outperforming existing methods in natural language generation quality and clinical relevance while maintaining interpretability. The code for our framework is publicly available.
[CV-260] Hierarchical Characterization of Brain Dynamics via State Space-based Vector Quantization
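Quick read: This paper tackles how to accurately quantify the brain's dynamic transitions between functional states, given that existing methods ignore brain transition dependencies and do not quantize brain dynamics into representative and stable embeddings. The key to its solution is a Hierarchical State space-based Tokenization network (HST), which quantizes brain states and transitions hierarchically using a refined clustered Vector-Quantization Variational AutoEncoder (VQ-VAE) that adds quantization error feedback and clustering to improve quantization performance and promote representative, stable token representations, thereby capturing the hierarchical structure of brain dynamics.
Link: https://arxiv.org/abs/2506.22952
Authors: Yanwu Yang,Thomas Wolfers
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments: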
Abstract:Understanding brain dynamics through functional Magnetic Resonance Imaging (fMRI) remains a fundamental challenge in neuroscience, particularly in capturing how the brain transitions between various functional states. Recently, metastability, which refers to temporarily stable brain states, has offered a promising paradigm to quantify complex brain signals into interpretable, discretized representations. In particular, compared to cluster-based machine learning approaches, tokenization approaches leveraging vector quantization have shown promise in representation learning with powerful reconstruction and predictive capabilities. However, most existing methods ignore brain transition dependencies and lack a quantification of brain dynamics into representative and stable embeddings. In this study, we propose a Hierarchical State space-based Tokenization network, termed HST, which quantizes brain states and transitions in a hierarchical structure based on a state space-based model. We introduce a refined clustered Vector-Quantization Variational AutoEncoder (VQ-VAE) that incorporates quantization error feedback and clustering to improve quantization performance while facilitating metastability with representative and stable token representations. We validate our HST on two public fMRI datasets, demonstrating its effectiveness in quantifying the hierarchical dynamics of the brain and its potential in disease diagnosis and reconstruction performance. Our method offers a promising framework for the characterization of brain dynamics, facilitating the analysis of metastability.
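The core operation behind vector-quantization tokenizers such as the VQ-VAE mentioned above is a nearest-neighbour lookup: each continuous feature vector is replaced by the index of the closest codebook entry. The sketch below shows only that lookup on toy data; the codebook, the per-frame feature vectors, and the function name `quantize` are hypothetical, and none of the paper's refinements (error feedback, clustering, state-space modelling) are reproduced.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to vector in squared L2."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]


# Toy codebook with three "brain state" prototypes and four time frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [1.1, 0.8]]

tokens = [quantize(frame, codebook)[0] for frame in frames]
# Consecutive token pairs give a crude view of state-to-state transitions.
transitions = list(zip(tokens, tokens[1:]))
print(tokens)
print(transitions)
```

Turning a time series into discrete tokens this way makes transitions between states countable, which is the kind of interpretable, discretized representation the abstract refers to.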
[CV-261] CA-Diff: Collaborative Anatomy Diffusion for Brain Tissue Segmentation ICME2025
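Quick read: This paper addresses the accurate segmentation of brain structures from magnetic resonance imaging (MRI). Existing methods based on convolutional neural networks (CNNs) and transformers struggle with complex structures, and applying diffusion models directly to image segmentation works poorly because anatomical information is neglected. The key to its solution is the Collaborative Anatomy Diffusion (CA-Diff) framework, which introduces a distance field as an auxiliary anatomical condition to provide global spatial context and designs a collaborative diffusion process to model its joint distribution with anatomical structures, so that anatomical features are used effectively to improve segmentation accuracy. It also introduces a consistency loss to refine the relationship between the distance field and anatomical structures, and a time-adapted channel attention module to enhance feature fusion in the U-Net.
Link: https://arxiv.org/abs/2506.22882
Authors: Qilong Xing,Zikai Song,Yuteng Ye,Yuke Chen,Youjia Zhang,Na Feng,Junqing Yu,Wei Yang
Affiliations: Huazhong University of Science and Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICME 2025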
Abstract:Segmentation of brain structures from MRI is crucial for evaluating brain morphology, yet existing CNN and transformer-based methods struggle to delineate complex structures accurately. While current diffusion models have shown promise in image segmentation, they are inadequate when applied directly to brain MRI because they neglect anatomical information. To address this, we propose Collaborative Anatomy Diffusion (CA-Diff), a framework integrating spatial anatomical features to enhance segmentation accuracy of the diffusion model. Specifically, we introduce a distance field as an auxiliary anatomical condition to provide global spatial context, alongside a collaborative diffusion process to model its joint distribution with anatomical structures, enabling effective utilization of anatomical features for segmentation. Furthermore, we introduce a consistency loss to refine relationships between the distance field and anatomical structures and design a time-adapted channel attention module to enhance the U-Net feature fusion procedure. Extensive experiments show that CA-Diff outperforms state-of-the-art (SOTA) methods.
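A distance field, used above as an auxiliary anatomical condition, assigns to every pixel its distance to the nearest point of a structure. The brute-force sketch below computes an unsigned Euclidean distance field from a small binary mask. Whether CA-Diff uses a signed or unsigned field, and how it is normalised, is not stated in the abstract, so this is only an assumed illustrative form, and `distance_field` is a hypothetical name.

```python
import math


def distance_field(mask):
    """For each pixel, Euclidean distance to the nearest foreground (1) pixel."""
    foreground = [
        (r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v
    ]
    if not foreground:
        raise ValueError("mask has no foreground pixels")
    return [
        [
            min(math.hypot(r - fr, c - fc) for fr, fc in foreground)
            for c in range(len(row))
        ]
        for r, row in enumerate(mask)
    ]


# Toy 3x4 mask with a two-pixel structure in the middle row.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
for row in distance_field(mask):
    print([round(v, 3) for v in row])
```

Unlike the binary mask, the field varies smoothly with position, so even pixels far from a structure carry information about where it lies; that is the sense in which it supplies global spatial context.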
[CV-262] Denoising Multi-Color QR Codes and Stiefel-Valued Data by Relaxed Regularizations
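Quick read: This paper addresses the denoising of manifold-valued data, extended to new kinds such as multi-binary and Stiefel-valued data, which matter in applications such as color restoration, the processing of directional information, and image- and video-based recognition. The key to its solution is to embed the data in a Euclidean ambient space, encode the non-convex manifolds by a series of positive semi-definite, fixed-rank matrices, and then relax the rank constraint to obtain a convexification that can be solved with standard algorithms from convex analysis.
Link: https://arxiv.org/abs/2506.22826
Authors: Robert Beinert,Jonas Bresch
Affiliations: Technische Universität Berlin
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: 9 pages, 2 figures, 3 algorithms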
Abstract:The handling of manifold-valued data plays a central role, for instance, in color restoration tasks relying on circle- or sphere-valued color models, in the study of rotational or directional information related to the special orthogonal group, and in Gaussian image processing, where the pixel statistics are interpreted as values on the hyperbolic sheet. To denoise this kind of data in particular, several generalizations of total variation (TV) and Tikhonov-type denoising models incorporating the underlying manifolds have been proposed. Recently, a novel, numerically efficient denoising approach has been introduced, where the data are embedded in a Euclidean ambient space, the non-convex manifolds are encoded by a series of positive semi-definite, fixed-rank matrices, and the rank constraint is relaxed to obtain a convexification that can be solved using standard algorithms from convex analysis. The aim of the present paper is to extend this approach to new kinds of data like multi-binary and Stiefel-valued data. Multi-binary data can, for instance, be used to model multi-color QR codes whereas Stiefel-valued data occur in image and video-based recognition. For both new data types, we propose TV- and Tikhonov-based denoising models together with easy-to-solve convexifications. All derived methods are evaluated on proof-of-concept, synthetic experiments.
[CV-263] ICME 2025 Generalizable HDR and SDR Video Quality Measurement Grand Challenge ICME2025
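Quick read: This paper addresses the insufficient performance of generalizable video quality assessment (VQA) methods for high dynamic range (HDR) and standard dynamic range (SDR) video. Existing VQA models struggle to perform consistently across dynamic ranges, distortion types, and diverse content, so more robust and generalizable methods are needed. The work established the IEEE ICME 2025 Grand Challenge on Generalizable HDR and SDR Video Quality Measurement to promote VQA methods that handle HDR and SDR content jointly. The key lies in model architectures and training strategies that stay accurate across dynamic ranges and content types; in the end, four methods outperformed the VMAF baseline, and the top-performing model achieved state-of-the-art performance, setting a new benchmark for generalizable video quality assessment.
Link: https://arxiv.org/abs/2506.22790
Authors: Yixu Chen,Bowen Chen,Hai Wei,Alan C. Bovik,Baojun Li,Wei Sun,Linhan Cao,Kang Fu,Dandan Zhu,Jun Jia,Menghan Hu,Xiongkuo Min,Guangtao Zhai,Dounia Hammou,Fei Yin,Rafal Mantiuk,Amritha Premkumar,Prajit T Rajendran,Vignesh V Menon
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICME 2025 Grand Challenges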
Abstract:This paper reports IEEE International Conference on Multimedia \ Expo (ICME) 2025 Grand Challenge on Generalizable HDR and SDR Video Quality Measurement. With the rapid development of video technology, especially High Dynamic Range (HDR) and Standard Dynamic Range (SDR) contents, the need for robust and generalizable Video Quality Assessment (VQA) methods has become increasingly demanded. Existing VQA models often struggle to deliver consistent performance across varying dynamic ranges, distortion types, and diverse content. This challenge was established to benchmark and promote VQA approaches capable of jointly handling HDR and SDR content. In the final evaluation phase, five teams submitted seven models along with technical reports to the Full Reference (FR) and No Reference (NR) tracks. Among them, four methods outperformed VMAF baseline, while the top-performing model achieved state-of-the-art performance, setting a new benchmark for generalizable video quality assessment.
[CV-264] FedCLAM: Client Adaptive Momentum with Foreground Intensity Matching for Federated Medical Image Segmentation MICCAI2025
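[Quick Read]: This paper addresses the degradation of global-model performance in federated learning for medical imaging caused by inter-institutional feature discrepancies and heterogeneous image intensity distributions. The key to the solution is FedCLAM, which introduces client-adaptive momentum terms and a personalized dampening factor to improve adaptability and generalization, together with a new intensity alignment loss that handles differing image intensity distributions across institutions and devices.
Link: https://arxiv.org/abs/2506.22580
Authors: Vasilis Siomos,Jonathan Passerat-Palmbach,Giacomo Tarroni
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, Accepted at MICCAI 2025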
Abstract:Federated learning is a decentralized training approach that keeps data under stakeholder control while achieving superior performance over isolated training. While inter-institutional feature discrepancies pose a challenge in all federated settings, medical imaging is particularly affected due to diverse imaging devices and population variances, which can diminish the global model’s effectiveness. Existing aggregation methods generally fail to adapt across varied circumstances. To address this, we propose FedCLAM, which integrates client-adaptive momentum terms derived from each client’s loss reduction during local training, as well as a personalized dampening factor to curb overfitting. We further introduce a novel intensity alignment loss that matches predicted and ground-truth foreground distributions to handle heterogeneous image intensity profiles across institutions and devices. Extensive evaluations on two datasets show that FedCLAM surpasses eight cutting-edge methods in medical segmentation tasks, underscoring its efficacy. The code is available at this https URL.
[CV-265] Maximum Dispersion Maximum Concentration: Enhancing the Quality of MOP Solutions
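[Quick Read]: This paper addresses how to trade off convergence and diversity in the objective space in multi-objective optimization problems (MOPs). The key to the solution is to improve solution quality by optimizing dispersion in the decision space and convergence in a specific region of the objective space. Specifically, the method defines a Region of Interest (ROI) from a cone that represents the decision maker's preferences in the objective space, and uses a uniformity measure to increase the dispersion of solutions in the decision space. This raises diversity while keeping the solutions concentrated, and avoids the bias caused by solutions clustering in the decision space.
Link: https://arxiv.org/abs/2506.22568
Authors: Gladston Moreira,Ivan Meneghini,Elzabeth Wanner
Affiliations: Universidade Federal de Ouro Preto; Federal Institute of Minas Gerais; Aston University
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages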
Abstract:Multi-objective optimization problems (MOPs) often require a trade-off between conflicting objectives, maximizing diversity and convergence in the objective space. This study presents an approach to improve the quality of MOP solutions by optimizing the dispersion in the decision space and the convergence in a specific region of the objective space. Our approach defines a Region of Interest (ROI) based on a cone representing the decision maker’s preferences in the objective space, while enhancing the dispersion of solutions in the decision space using a uniformity measure. Combining solution concentration in the objective space with dispersion in the decision space intensifies the search for Pareto-optimal solutions while increasing solution diversity. When combined, these characteristics improve the quality of solutions and avoid the bias caused by clustering solutions in a specific region of the decision space. Preliminary experiments suggest that this method enhances multi-objective optimization by generating solutions that effectively balance dispersion and concentration, thereby mitigating bias in the decision space.
[CV-266] High Resolution Isotropic 3D Cine imaging with Automated Segmentation using Concatenated 2D Real-time Imaging and Deep Learning
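[Quick Read]: This paper tries to address the low time efficiency and insufficient image resolution of conventional cardiovascular magnetic resonance (CMR) in the assessment of paediatric and congenital heart disease. Conventional methods rely on 2D breath-hold balanced steady-state free precession (bSSFP) cine imaging and static 3D bSSFP whole-heart imaging, but effective respiratory and cardiac gating is difficult to achieve with these methods in children. The key to the proposed solution is to apply a series of deep learning (DL) models to a concatenated stack of 2D free-breathing real-time cine images to generate an isotropic, fully segmented 3D cine dataset, enabling fast, high-quality image reconstruction.
Link: https://arxiv.org/abs/2506.22532
Authors: Mark Wrobel(1),Michele Pascale(1),Tina Yao(1),Ruaraidh Campbell(1),Elena Milano(2),Michael Quail(1 and 2),Jennifer Steeden(1),Vivek Muthurangu(1) ((1) UCL Centre for Translational Cardiovascular Imaging, University College London, (2) Great Ormond Street Hospital)
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: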
Abstract:Background: Conventional cardiovascular magnetic resonance (CMR) in paediatric and congenital heart disease uses 2D, breath-hold, balanced steady state free precession (bSSFP) cine imaging for assessment of function and cardiac-gated, respiratory-navigated, static 3D bSSFP whole-heart imaging for anatomical assessment. Our aim is to concatenate a stack of 2D free-breathing real-time cines and use Deep Learning (DL) to create an isotropic, fully segmented 3D cine dataset from these images. Methods: Four DL models were trained on open-source data that performed: a) Interslice contrast correction; b) Interslice respiratory motion correction; c) Super-resolution (slice direction); and d) Segmentation of right and left atria and ventricles (RA, LA, RV, and LV), thoracic aorta (Ao) and pulmonary arteries (PA). In 10 patients undergoing routine cardiovascular examination, our method was validated on prospectively acquired sagittal stacks of real-time cine images. Quantitative metrics (ventricular volumes and vessel diameters) and image quality of the 3D cines were compared to conventional breath-hold cine and whole-heart imaging. Results: All real-time data were successfully transformed into 3D cines with a total post-processing time of 1 min in all cases. There were no significant biases in any LV or RV metrics, with reasonable limits of agreement and correlation. There was also reasonable agreement for all vessel diameters, although there was a small but significant overestimation of RPA diameter. Conclusion: We have demonstrated the potential of creating 3D cine data from concatenated 2D real-time cine images using a series of DL models. Our method has short acquisition and reconstruction times, with fully segmented data being available within 2 minutes. The good agreement with conventional imaging suggests that our method could help to significantly speed up CMR in clinical practice.
[CV-267] SegmentAnyMuscle: A universal muscle segmentation model across different locations in MRI
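[Quick Read]: This paper tries to address the challenge of precisely quantifying muscle quantity and quality on MRI, which matters for assessing health outcomes. The key to the solution is a publicly available deep learning model for muscle segmentation, validated across different anatomical locations and imaging sequences. The model achieves a high Dice Similarity Coefficient (DSC) on both common and less frequent sequence types, demonstrating its feasibility across diverse settings.
Link: https://arxiv.org/abs/2506.22467
Authors: Roy Colglazier,Jisoo Lee,Haoyu Dong,Hanxue Gu,Yaqian Chen,Joseph Cao,Zafer Yildiz,Zhonghao Liu,Nicholas Konz,Jichen Yang,Jikai Zhang,Yuwen Chen,Lin Li,Adrian Camarena,Maciej A. Mazurowski
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 6 figures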
Abstract:The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locations and imaging sequences. A total of 362 MRIs from 160 patients at a single tertiary center (Duke University Health System, 2016-2020) were included, with 316 MRIs from 114 patients used for model development. The model was tested on two separate sets: one with 28 MRIs representing common sequence types, achieving an average Dice Similarity Coefficient (DSC) of 88.45%, and another with 18 MRIs featuring less frequent sequences and abnormalities such as muscular atrophy, hardware, and significant noise, achieving 86.21% DSC. These results demonstrate the feasibility of a fully automated deep learning algorithm for segmenting muscles on MRI across diverse settings. The public release of this model enables consistent, reproducible research into the relationship between musculature and health.
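The Dice Similarity Coefficient reported in this abstract is a standard overlap measure, DSC = 2|A ∩ B| / (|A| + |B|). The following is a minimal Python sketch of that formula for two binary masks. The function name, the flat-list mask representation, and the convention for two empty masks are illustrative assumptions and are not taken from the paper's code.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    # Count positions where both masks mark foreground.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks (made-up data): 2 overlapping foreground pixels, 3 foreground pixels each.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, truth)
```

For real 2D or 3D segmentations the same computation would typically be applied to flattened arrays, per muscle or per scan, before averaging.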
[CV-268] MedSegNet10: A Publicly Accessible Network Repository for Split Federated Medical Image Segmentation
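[Quick Read]: This paper addresses challenges in medical image segmentation such as data privacy concerns, limited annotated data, and insufficient training data. The key to the solution is "MedSegNet10", a publicly accessible repository based on SplitFed (split federated learning) that enables collaborative training on privately stored, horizontally split data, protecting data privacy and integrity while providing pre-trained neural network architectures for a variety of medical image types.
Link: https://arxiv.org/abs/2503.20830
Authors: Chamani Shiranthika,Zahra Hafezi Kafshgari,Hadi Hadizadeh,Parvaneh Saeedi
Affiliations: Simon Fraser University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 14 figures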
Abstract:Machine Learning (ML) and Deep Learning (DL) have shown significant promise in healthcare, particularly in medical image segmentation, which is crucial for accurate disease diagnosis and treatment planning. Despite their potential, challenges such as data privacy concerns, limited annotated data, and inadequate training data persist. Decentralized learning approaches such as federated learning (FL), split learning (SL), and split federated learning (SplitFed/SFL) address these issues effectively. This paper introduces “MedSegNet10,” a publicly accessible repository designed for medical image segmentation using split-federated learning. MedSegNet10 provides a collection of pre-trained neural network architectures optimized for various medical image types, including microscopic images of human blastocysts, dermatoscopic images of skin lesions, and endoscopic images of lesions, polyps, and ulcers, with applications extending beyond these examples. By leveraging SplitFed’s benefits, MedSegNet10 allows collaborative training on privately stored, horizontally split data, ensuring privacy and integrity. This repository supports researchers, practitioners, trainees, and data scientists, aiming to advance medical image segmentation while maintaining patient data privacy. The repository is available at: this https URL (password upon request to the authors).
Artificial Intelligence
[AI-0] Data Uniformity Improves Training Efficiency and More with a Convergence Framework Beyond the NTK Regime
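[Quick Read]: This paper tries to address the role of data selection in data-driven decision-making, especially in large language models (LLMs): whether quantitative, general principles can improve model performance, particularly for complex tasks with limited prior knowledge. The key to the solution is to show that selecting more uniformly distributed data improves training efficiency and model performance. Specifically, a more uniform (less biased) distribution increases the minimum pairwise distance between data points (h_min); a smaller h_min slows the training dynamics of gradient descent (GD), and the approximation error of neural networks decreases as h_min increases. The work proposes a GD convergence framework beyond the Neural Tangent Kernel (NTK) regime that applies to a broad class of architectures, including transformers, and provides theoretical justification for residual connections and function compositions in deep neural networks.
Link: https://arxiv.org/abs/2506.24120
Authors: Yuqing Wang,Shangding Gu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: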
Abstract:Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complex tasks with limited prior knowledge. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by h_min, and prove that a smaller h_min can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as h_min increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connections and function compositions in deep neural architectures. In the end, we conduct comprehensive experiments for supervised fine-tuning across various settings, including different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets. Code and Datasets are available at the link: this https URL.
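To make the quantity h_min concrete, here is a small Python sketch that computes the minimum pairwise distance of a point set and picks a subset with a greedy farthest-point heuristic. The abstract does not describe the authors' selection procedure; farthest-point selection is a standard stand-in chosen here for illustration, and the function names and the toy 2-D points are assumptions.

```python
import math


def min_pairwise_distance(points):
    """h_min: smallest Euclidean distance between any two distinct points."""
    return min(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


def farthest_point_selection(points, k, start=0):
    """Greedily pick k indices, each time adding the point farthest from those chosen."""
    chosen = [start]
    # dist_to_chosen[i] is the distance from point i to its nearest chosen point.
    dist_to_chosen = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist_to_chosen[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            dist_to_chosen[i] = min(dist_to_chosen[i], math.dist(p, points[nxt]))
    return chosen


# Toy data (made up): four square corners plus two near-duplicates of the origin.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.05, 0.05)]
selected = farthest_point_selection(points, k=4)
subset = [points[i] for i in selected]
h_min_all = min_pairwise_distance(points)
h_min_subset = min_pairwise_distance(subset)
```

On this toy set the heuristic keeps the four corners and drops the near-duplicates, so the subset's h_min is larger than that of the full set. In an LLM setting the points would presumably be embeddings of training examples, which is outside what this sketch covers.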
[AI-1] Development of Hybrid Artificial Intelligence Training on Real and Synthetic Data: Benchmark on Two Mixed Training Strategies
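[Quick Read]: This paper tries to address the domain gap between synthetic and real data, which causes artificial neural networks (ANNs) trained on synthetic data to perform and generalize poorly in real-world use. The key to the solution is mixed training on hybrid datasets, that is, combining synthetic and real data to mitigate the domain gap. The study systematically evaluates the generalizability and robustness of two widely used mixing strategies across architectures and datasets, offering insights for optimizing the use of synthetic data in ANN training.
Link: https://arxiv.org/abs/2506.24093
Authors: Paul Wachter,Lukas Niehaus,Julius Schöning
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 14 figures, 2 tables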
Abstract:Synthetic data has emerged as a cost-effective alternative to real data for training artificial neural networks (ANN). However, the disparity between synthetic and real data results in a domain gap. That gap leads to poor performance and generalization of the trained ANN when applied to real-world scenarios. Several strategies have been developed to bridge this gap, which combine synthetic and real data, known as mixed training using hybrid datasets. While these strategies have been shown to mitigate the domain gap, a systematic evaluation of their generalizability and robustness across various tasks and architectures remains underexplored. To address this challenge, our study comprehensively analyzes two widely used mixing strategies on three prevalent architectures and three distinct hybrid datasets. From these datasets, we sample subsets with varying proportions of synthetic to real data to investigate the impact of synthetic and real components. The findings of this paper provide valuable insights into optimizing the use of synthetic data in the training process of any ANN, contributing to enhancing robustness and efficacy.
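The subset sampling described in this abstract (varying the proportion of synthetic to real data) can be sketched as follows. This is a hedged illustration only: the function `mix_datasets`, its parameters, and the tagged toy samples are hypothetical, and the paper's two mixing strategies are not specified in the abstract.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction, size, seed=0):
    """Draw a hybrid subset of `size` samples with a given share of synthetic data."""
    rng = random.Random(seed)
    n_synthetic = round(size * synthetic_fraction)
    n_real = size - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples for the requested mix")
    # Sample without replacement from each source, then shuffle the union.
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


# Toy datasets (made up): each sample is tagged with its source.
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]
mixed = mix_datasets(real, synthetic, synthetic_fraction=0.25, size=40)
synthetic_count = sum(1 for source, _ in mixed if source == "synthetic")
```

Sweeping `synthetic_fraction` over a grid of values would give the family of hybrid subsets on which a model could then be trained and compared.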
[AI-2] Constructing Non-Markovian Decision Process via History Aggregator
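[Quick Read]: This paper tries to address the challenge that non-Markovian dynamics pose to algorithmic decision-making, especially for paradigms such as Reinforcement Learning (RL), where they strongly affect system effectiveness and progress. Existing benchmarks fall short of comprehensively assessing how well decision algorithms handle non-Markovian dynamics. The key to the solution is a generalized methodology grounded in category theory: it establishes an equivalence between the category of Markov Decision Processes (MDP) and the category of non-Markovian Decision Processes (NMDP), and introduces non-Markovianity into decision-making problem settings via the History Aggregator for State (HAS), giving precise control over the state dependency structure in the time series.
Link: https://arxiv.org/abs/2506.24026
Authors: Yongyi Wang,Wenxin Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: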
Abstract:In the domain of algorithmic decision-making, non-Markovian dynamics manifest as a significant impediment, especially for paradigms such as Reinforcement Learning (RL), thereby exerting far-reaching consequences on the advancement and effectiveness of the associated systems. Nevertheless, the existing benchmarks are deficient in comprehensively assessing the capacity of decision algorithms to handle non-Markovian dynamics. To address this deficiency, we have devised a generalized methodology grounded in category theory. Notably, we established the category of Markov Decision Processes (MDP) and the category of non-Markovian Decision Processes (NMDP), and proved the equivalence relationship between them. This theoretical foundation provides a novel perspective for understanding and addressing non-Markovian dynamics. We further introduced non-Markovianity into decision-making problem settings via the History Aggregator for State (HAS). With HAS, we can precisely control the state dependency structure of decision-making problems in the time series. Our analysis demonstrates the effectiveness of our method in representing a broad range of non-Markovian dynamics. This approach facilitates a more rigorous and flexible evaluation of decision algorithms by testing them in problem settings where non-Markovian dynamics are explicitly constructed.
[AI-3] Bridging Theory and Practice in Link Representation with Graph Neural Networks
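[Quick Read]: This paper tries to address the lack of systematic theoretical analysis of the expressive power of Graph Neural Networks (GNNs) for link representation. Existing work focuses on graph-level expressiveness, while link-level expressiveness is understudied. The key to the solution is a unifying framework, the k_φ-k_ρ-m framework, which subsumes existing message-passing link models and supports formal expressiveness comparisons. With it, the authors derive a hierarchy of state-of-the-art methods and provide theoretical tools for analyzing future architectures. They also propose a synthetic evaluation protocol that includes the first benchmark designed specifically to assess link-level expressiveness.
Link: https://arxiv.org/abs/2506.24018
Authors: Veronica Lachi,Francesco Ferrini,Antonio Longa,Bruno Lepri,Andrea Passerini,Manfred Jaeger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: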
Abstract:Graph Neural Networks (GNNs) are widely used to compute representations of node pairs for downstream tasks such as link prediction. Yet, theoretical understanding of their expressive power has focused almost entirely on graph-level representations. In this work, we shift the focus to links and provide the first comprehensive study of GNN expressiveness in link representation. We introduce a unifying framework, the k_φ-k_ρ-m framework, that subsumes existing message-passing link models and enables formal expressiveness comparisons. Using this framework, we derive a hierarchy of state-of-the-art methods and offer theoretical tools to analyze future architectures. To complement our analysis, we propose a synthetic evaluation protocol comprising the first benchmark specifically designed to assess link-level expressiveness. Finally, we ask: does expressiveness matter in practice? We use a graph symmetry metric that quantifies the difficulty of distinguishing links and show that while expressive models may underperform on standard benchmarks, they significantly outperform simpler ones as symmetry increases, highlighting the need for dataset-aware model selection.
[AI-4] Bridging Physical and Digital Worlds: Embodied Large AI for Future Wireless Systems
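[Quick Read]: This paper tries to address the fact that current large AI models for wireless systems overlook physical interaction and therefore struggle with real-time dynamics and non-stationary environments, and that they lack the capability for active environmental probing. The key to the solution is a fundamental paradigm shift, wireless embodied large AI (WELAI), which moves from passive observation to active embodiment. By exploring design principles and system architecture, it aims at active sensing of and adaptation to the wireless environment, advancing adaptive, robust, and autonomous next-generation wireless systems.
Link: https://arxiv.org/abs/2506.24009
Authors: Xinquan Wang,Fenghao Zhu,Zhaohui Yang,Chongwen Huang,Xiaoming Chen,Zhaoyang Zhang,Sami Muhaidat,Mérouane Debbah
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures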
Abstract:Large artificial intelligence (AI) models offer revolutionary potential for future wireless systems, promising unprecedented capabilities in network optimization and performance. However, current paradigms largely overlook crucial physical interactions. This oversight means they primarily rely on offline datasets, leading to difficulties in handling real-time wireless dynamics and non-stationary environments. Furthermore, these models often lack the capability for active environmental probing. This paper proposes a fundamental paradigm shift towards wireless embodied large AI (WELAI), moving from passive observation to active embodiment. We first identify key challenges faced by existing models, then we explore the design principles and system structure of WELAI. Besides, we outline prospective applications in next-generation wireless. Finally, through an illustrative case study, we demonstrate the effectiveness of WELAI and point out promising research directions for realizing adaptive, robust, and autonomous wireless systems.
[AI-5] STCLocker: Deadlock Avoidance Testing for Autonomous Driving Systems
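[Quick Read]: This paper addresses the insufficient cooperative performance of Autonomous Driving Systems (ADSs) in traffic with multiple autonomous vehicles (AVs), in particular the weak ability to detect and prevent deadlocks, a fundamental coordination failure. The key to the solution is STCLocker, a spatio-temporal conflict-guided deadlock avoidance testing technique whose core components are the Deadlock Oracle, Conflict Feedback, and Conflict-aware Scenario Generation. It actively guides AVs to compete for shared resources in space and time, which effectively generates conflict-prone deadlock scenarios.
Link: https://arxiv.org/abs/2506.23995
Authors: Mingfei Cheng,Renzhi Wang,Xiaofei Xie,Yuan Zhou,Lei Ma
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: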
Abstract:Autonomous Driving System (ADS) testing is essential to ensure the safety and reliability of autonomous vehicles (AVs) before deployment. However, existing techniques primarily focus on evaluating ADS functionalities in single-AV settings. As ADSs are increasingly deployed in multi-AV traffic, it becomes crucial to assess their cooperative performance, particularly regarding deadlocks, a fundamental coordination failure in which multiple AVs enter a circular waiting state indefinitely, resulting in motion planning failures. Despite its importance, the cooperative capability of ADSs to prevent deadlocks remains underexplored. To address this gap, we propose the first dedicated Spatio-Temporal Conflict-Guided Deadlock Avoidance Testing technique, STCLocker, for generating DeadLock Scenarios (DLSs), where a group of AVs controlled by the ADS under test are in a circular wait state. STCLocker consists of three key components: Deadlock Oracle, Conflict Feedback, and Conflict-aware Scenario Generation. Deadlock Oracle provides a reliable black-box mechanism for detecting deadlock cycles among multiple AVs within a given scenario. Conflict Feedback and Conflict-aware Scenario Generation collaborate to actively guide AVs into simultaneous competition over spatial conflict resources (i.e., shared passing regions) and temporal competitive behaviors (i.e., reaching the conflict region at the same time), thereby increasing the effectiveness of generating conflict-prone deadlocks. We evaluate STCLocker on two types of ADSs: Roach, an end-to-end ADS, and OpenCDA, a module-based ADS supporting cooperative communication. Experimental results show that, on average, STCLocker generates more DLSs than the best-performing baseline.
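The circular wait state mentioned in this abstract corresponds to a cycle in a wait-for graph. The sketch below shows generic depth-first cycle detection in Python; it is not the paper's Deadlock Oracle, which is described only as a black-box mechanism. The `waits_for` mapping and the vehicle ids are hypothetical, and how such a graph would be derived from a driving scenario is not covered.

```python
def find_wait_cycle(waits_for):
    """Return one circular-wait cycle as a list of vehicle ids, or None.

    waits_for maps a vehicle id to the ids of the vehicles it is waiting on
    (a hypothetical wait-for graph).
    """
    visiting, done = set(), set()
    path = []

    def visit(vehicle):
        visiting.add(vehicle)
        path.append(vehicle)
        for other in waits_for.get(vehicle, ()):
            if other in visiting:
                # Back edge: the cycle is the part of the path starting at `other`.
                return path[path.index(other):]
            if other not in done:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        visiting.discard(vehicle)
        done.add(vehicle)
        return None

    for vehicle in waits_for:
        if vehicle not in done:
            cycle = visit(vehicle)
            if cycle:
                return cycle
    return None


# Toy scenarios (made up): three AVs waiting on each other in a ring, and a chain.
deadlocked = {"av1": ["av2"], "av2": ["av3"], "av3": ["av1"], "av4": ["av1"]}
flowing = {"av1": ["av2"], "av2": ["av3"], "av3": []}
cycle = find_wait_cycle(deadlocked)
no_cycle = find_wait_cycle(flowing)
```

In the first scenario av4 waits on the ring without being part of it, so only av1, av2, and av3 are reported.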
[AI-6] Harnessing AI Agents to Advance Research on Refugee Child Mental Health
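[Quick Read]: This paper addresses the extreme psychological trauma faced by millions of displaced children as the international refugee crisis deepens. Its core is to use AI to process unstructured refugee health data and distill knowledge about child mental health. The key to the solution is a compact AI-based framework that uses Retrieval-Augmented Generation (RAG) to process complex humanitarian datasets while avoiding hallucination risks. The study compares two RAG pipelines, Zephyr-7B-beta and DeepSeek R1-7B, and finds that DeepSeek R1 performs better, with an answer-relevance accuracy of 0.91.
Link: https://arxiv.org/abs/2506.23992
Authors: Aditya Shrivastava,Komal Gupta,Shraddha Arora
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 14 pages, 2 images, 2 tables; accepted at the 5th International Conference on Innovations in Computational Intelligence and Computer Vision (ICICV-2025)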
Abstract:The international refugee crisis deepens, exposing millions of displaced children to extreme psychological trauma. This research suggests a compact, AI-based framework for processing unstructured refugee health data and distilling knowledge on child mental health. We compare two Retrieval-Augmented Generation (RAG) pipelines, Zephyr-7B-beta and DeepSeek R1-7B, to determine how well they process challenging humanitarian datasets while avoiding hallucination hazards. By combining cutting-edge AI methods with migration research and child psychology, this study presents a scalable strategy to assist policymakers, mental health practitioners, and humanitarian agencies to better assist displaced children and recognize their mental wellbeing. Overall, both models performed adequately, but DeepSeek R1 was significantly superior to Zephyr, reaching an answer-relevance accuracy of 0.91.
[AI-7] ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning
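[Quick Read]: This paper addresses the safety-critical risks that Autonomous Driving Systems (ADSs) face at runtime, in particular the shortcomings of existing online repair methods in generalizability, adaptability, and repair effectiveness. The key to the solution is Adaptive Decision Repair (ADReFT), which identifies safety-critical states through offline learning from failed tests and uses a transformer-based model with two joint heads, State Monitor and Decision Adapter, to capture complex driving-environment interactions and generate adaptive repair actions, improving ADS safety and repair effectiveness.
Link: https://arxiv.org/abs/2506.23960
Authors: Mingfei Cheng,Xiaofei Xie,Renzhi Wang,Yuan Zhou,Ming Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: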
Abstract:Autonomous Driving Systems (ADSs) continue to face safety-critical risks due to the inherent limitations in their design and performance capabilities. Online repair plays a crucial role in mitigating such limitations, ensuring the runtime safety and reliability of ADSs. Existing online repair solutions enforce ADS compliance by transforming unacceptable trajectories into acceptable ones based on predefined specifications, such as rule-based constraints or training datasets. However, these approaches often lack generalizability, adaptability and tend to be overly conservative, resulting in ineffective repairs that not only fail to mitigate safety risks sufficiently but also degrade the overall driving experience. To address this issue, we propose Adaptive Decision Repair (ADReFT), a novel and effective repair method that identifies safety-critical states through offline learning from failed tests and generates appropriate mitigation actions to improve ADS safety. Specifically, ADReFT incorporates a transformer-based model with two joint heads, State Monitor and Decision Adapter, designed to capture complex driving environment interactions to evaluate state safety severity and generate adaptive repair actions. Given the absence of oracles for state safety identification, we first pretrain ADReFT using supervised learning with coarse annotations, i.e., labeling states preceding violations as positive samples and others as negative samples. It establishes ADReFT’s foundational capability to mitigate safety-critical violations, though it may result in somewhat conservative mitigation strategies. Therefore, we subsequently finetune ADReFT using reinforcement learning to improve its initial capability and generate more precise and contextually appropriate repair decisions. Our evaluation results illustrate that ADReFT achieves better repair performance.
[AI-8] Autonomy by Design: Preserving Human Autonomy in AI Decision-Support
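[Quick Read]: This paper tries to address how AI decision-support systems affect domain-specific autonomy, in particular their potential to erode skilled competence and authentic value-formation. The key to the solution is a framework for autonomy-preserving AI support systems built on socio-technical design patterns, including careful role specification, defeater mechanisms, and support for reflective practice, so that human agency within a domain is maintained while AI capabilities are leveraged.
Link: https://arxiv.org/abs/2506.23952
Authors: Stefan Buijsman,Sarah Carter,Juan Pablo Bermúdez
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: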
Abstract:AI systems increasingly support human decision-making across domains of professional, skill-based, and personal activity. While previous work has examined how AI might affect human autonomy globally, the effects of AI on domain-specific autonomy – the capacity for self-governed action within defined realms of skill or expertise – remain understudied. We analyze how AI decision-support systems affect two key components of domain-specific autonomy: skilled competence (the ability to make informed judgments within one’s domain) and authentic value-formation (the capacity to form genuine domain-relevant values and preferences). By engaging with prior investigations and analyzing empirical cases across medical, financial, and educational domains, we demonstrate how the absence of reliable failure indicators and the potential for unconscious value shifts can erode domain-specific autonomy both immediately and over time. We then develop a constructive framework for autonomy-preserving AI support systems. We propose specific socio-technical design patterns – including careful role specification, implementation of defeater mechanisms, and support for reflective practice – that can help maintain domain-specific autonomy while leveraging AI capabilities. This framework provides concrete guidance for developing AI systems that enhance rather than diminish human agency within specialized domains of action.
[AI-9] AI Risk-Management Standards Profile for General-Purpose AI (GPAI) and Foundation Models
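[Quick Read]: This paper addresses the risks that general-purpose AI (GPAI) and foundation models may pose in use, which can lead to adverse events with profound consequences. The key to the solution is a set of risk-management practices or controls for such models, covering the identification, analysis, and mitigation of risks, and adapting and building on existing AI risk-management standards (such as the NIST AI Risk Management Framework and ISO/IEC 23894) to address the unique issues developers face.
Link: https://arxiv.org/abs/2506.23949
Authors: Anthony M. Barrett,Jessica Newman,Brandie Nonnecke,Nada Madkour,Dan Hendrycks,Evan R. Murphy,Krystal Jackson,Deepika Raman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: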
Abstract:Increasingly multi-purpose AI models, such as cutting-edge large language models or other ‘general-purpose AI’ (GPAI) models, ‘foundation models,’ generative AI models, and ‘frontier models’ (typically all referred to hereafter with the umbrella term ‘GPAI/foundation models’ except where greater specificity is needed), can provide many beneficial capabilities but also risks of adverse events with profound consequences. This document provides risk-management practices or controls for identifying, analyzing, and mitigating risks of GPAI/foundation models. We intend this document primarily for developers of large-scale, state-of-the-art GPAI/foundation models; others that can benefit from this guidance include downstream developers of end-use applications that build on a GPAI/foundation model. This document facilitates conformity with or use of leading AI risk management-related standards, adapting and building on the generic voluntary guidance in the NIST AI Risk Management Framework and ISO/IEC 23894, with a focus on the unique issues faced by developers of GPAI/foundation models.
[AI-10] Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning
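[Quick Read]: This paper tries to address the proprioception shift problem in imitation learning: performance degrades because the distribution of proprioceptive states differs substantially between training and deployment. The key to the solution is a domain adaptation framework that uses rollout data collected during deployment, quantifies the discrepancy between expert and rollout proprioceptive states with the Wasserstein distance, and reduces the gap by adding noise proportional to that distance to both sets of states, which improves robustness to proprioception shift.
Link: https://arxiv.org/abs/2506.23944
Authors: Fuhang Kuang,Jiacheng You,Yingdong Hu,Tong Zhang,Chuan Wen,Yang Gao
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: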
Abstract:Imitation learning models for robotic tasks typically rely on multi-modal inputs, such as RGB images, language, and proprioceptive states. While proprioception is intuitively important for decision-making and obstacle avoidance, simply incorporating all proprioceptive states leads to a surprising degradation in imitation learning performance. In this work, we identify the underlying issue as the proprioception shift problem, where the distributions of proprioceptive states diverge significantly between training and deployment. To address this challenge, we propose a domain adaptation framework that bridges the gap by utilizing rollout data collected during deployment. Using Wasserstein distance, we quantify the discrepancy between expert and rollout proprioceptive states and minimize this gap by adding noise to both sets of states, proportional to the Wasserstein distance. This strategy enhances robustness against proprioception shifts by aligning the training and deployment distributions. Experiments on robotic manipulation tasks demonstrate the efficacy of our method, enabling the imitation policy to leverage proprioception while mitigating its adverse effects. Our approach outperforms the naive solution which discards proprioception, and other baselines designed to address distributional shifts.
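As a rough illustration of the idea in this abstract, the Python sketch below computes an empirical 1-Wasserstein distance between two one-dimensional samples and adds noise scaled by it to both. Several things here are assumptions rather than the paper's method: the states are treated as scalars (real proprioceptive states are multi-dimensional), the samples have equal size, the noise is Gaussian, and the `scale` parameter and function names are hypothetical.

```python
import random


def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-sized samples, match sorted values pairwise and average the gaps.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def add_scaled_noise(states, distance, scale=1.0, rng=None):
    """Add zero-mean Gaussian noise whose standard deviation grows with `distance`."""
    if rng is None:
        rng = random.Random(0)
    return [s + rng.gauss(0.0, scale * distance) for s in states]


# Toy 1-D proprioceptive readings (made up), e.g. one joint angle.
expert = [0.0, 0.1, 0.2, 0.3]
rollout = [0.5, 0.6, 0.7, 0.8]
gap = wasserstein_1d(expert, rollout)
noisy_expert = add_scaled_noise(expert, gap)
noisy_rollout = add_scaled_noise(rollout, gap)
```

When the two samples coincide the distance is zero and no noise is added, so in this sketch the perturbation only appears when training and deployment states actually differ.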
[AI-11] QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference
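[Quick Read]: This paper addresses how machine learning inference on edge devices can adapt to differing computational capabilities, hardware, and memory constraints. Conventional approaches fix one pre-trained model for all future inference requests, whereas this work proposes an inference pattern built on request-specific models that handles diverse scenarios more efficiently and robustly. The key to the solution is an accuracy-aware, workload-balanced inference system that responds dynamically through joint model quantization and inference partitioning, optimizing layer-wise quantization bit width and partition points to minimize time and cost while meeting the accuracy requirements of different tasks.
Link: https://arxiv.org/abs/2506.23934
Authors: Xiangchen Li,Saeid Ghafouri,Bo Ji,Hans Vandierendonck,Deepu John,Dimitrios S. Nikolopoulos
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: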
Abstract:As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device’s computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust to diverse scenarios. To this end, we propose an accuracy-aware and workload-balanced inference system that integrates joint model quantization and inference partitioning. In this approach, the server dynamically responds to inference queries by sending a quantized model and adaptively sharing the inference workload with the device. Meanwhile, the device’s computational power, channel capacity, and accuracy requirements are considered when deciding. Furthermore, we introduce a new optimization framework for the inference system, incorporating joint model quantization and partitioning. Our approach optimizes layer-wise quantization bit width and partition points to minimize time consumption and cost while accounting for varying accuracy requirements of tasks through an accuracy degradation metric in our optimization model. To our knowledge, this work represents the first exploration of optimizing quantization layer-wise bit-width in the inference serving system, by introducing theoretical measurement of accuracy degradation. Simulation results demonstrate a substantial reduction in overall time and power consumption, with computation payloads decreasing by over 80% and accuracy degradation kept below 1%. 
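The trade-off behind layer-wise bit-width selection can be illustrated with a small Python sketch: quantizing one layer's weights at different bit widths and comparing reconstruction error against payload size. This is an assumption-laden toy, not the paper's system: uniform min-max quantization and mean squared error are stand-ins (the abstract does not define its accuracy degradation metric), and the weights and function names are made up.

```python
def quantize_uniform(weights, bits):
    """Uniform min-max quantization to 2**bits levels, followed by dequantization."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    levels = (1 << bits) - 1  # number of steps between lo and hi
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy weights for a single layer (made up).
layer = [-0.82, -0.40, -0.05, 0.0, 0.13, 0.37, 0.91]
errors = {
    bits: mean_squared_error(layer, quantize_uniform(layer, bits))
    for bits in (2, 4, 8)
}
# Payload in bits if every weight of the layer is sent at the given width.
payload_bits = {bits: bits * len(layer) for bits in (2, 4, 8)}
```

For these weights, fewer bits mean a smaller payload and a larger error. An optimizer of the kind the abstract describes would have to balance the two per layer, alongside the choice of partition point, which this sketch leaves out.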
[AI-12] Industrial brain: a human-like autonomous neuro-symbolic cognitive decision-making system
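[Quick Read]: This paper addresses how to predict and plan the resilience of industrial chains in the face of failures and errors, particularly in multi-dimensional, complex, and chaotic co-evolution scenarios, where conventional end-to-end deep learning generalizes poorly to reconstructing unseen spatiotemporal co-evolution structures and predicting the resilience of network topologies. The key to the solution is the "industrial brain", a human-like autonomous cognitive decision-making and planning framework that combines a higher-order activity-driven neural network with CT-OODA symbolic reasoning. It generates resilience plans autonomously and directly from observational data of global variables, enabling accurate modeling of node activity dynamics and network co-evolution topology as well as resilience prediction.
Link: https://arxiv.org/abs/2506.23926
Authors: Junping Wang, Bicheng Wang, Yibo Xuea, Yuan Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: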
Abstract:Resilience non-equilibrium measurement, the ability to maintain fundamental functionality amidst failures and errors, is crucial for scientific management and engineering applications of industrial chain. The problem is particularly challenging when the number or types of multiple co-evolution of resilience (for example, randomly placed) are extremely chaos. Existing end-to-end deep learning ordinarily do not generalize well to unseen full-field reconstruction of spatiotemporal co-evolution structure, and predict resilience of network topology, especially in multiple chaos data regimes typically seen in real-world applications. To address this challenge, here we propose industrial brain, a human-like autonomous cognitive decision-making and planning framework integrating higher-order activity-driven neuro network and CT-OODA symbolic reasoning to autonomous plan resilience directly from observational data of global variable. The industrial brain not only understands and model structure of node activity dynamics and network co-evolution topology without simplifying assumptions, and reveal the underlying laws hidden behind complex networks, but also enabling accurate resilience prediction, inference, and planning. Experimental results show that industrial brain significantly outperforms resilience prediction and planning methods, with an accurate improvement of up to 10.8% over GoT and OlaGPT framework and 11.03% over spectral dimension reduction. It also generalizes to unseen topologies and dynamics and maintains robust performance despite observational disturbances. Our findings suggest that industrial brain addresses an important gap in resilience prediction and planning for industrial chain.
[AI-13] Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice
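[Quick Read]: This paper evaluates the ability of Large Language Models (LLMs) to solve problems in Operations Research (OR), specifically stochastic modeling problems. The key to the approach is a manually collected, representative set of graduate-level homework and doctoral qualification-exam problems, combined with simulation-optimization problems and solvers from the open-source SimOpt library, which are used to test how well LLMs make real-world decisions under uncertainty and thus how they perform in both classroom and practical settings.
Link: https://arxiv.org/abs/2506.23924
Authors: Akshit Kumar, Tianyi Peng, Yuhang Wu, Assaf Zeevi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: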
Abstract:Large language models (LLMs) have exhibited expert-level capabilities across various domains. However, their abilities to solve problems in Operations Research (OR) – the analysis and optimization of mathematical models derived from real-world problems or their verbal descriptions – remain underexplored. In this work, we take a first step toward evaluating LLMs’ abilities to solve stochastic modeling problems, a core class of OR problems characterized by uncertainty and typically involving tools from probability, statistics, and stochastic processes. We manually procure a representative set of graduate-level homework and doctoral qualification-exam problems and test LLMs’ abilities to solve them. We further leverage SimOpt, an open-source library of simulation-optimization problems and solvers, to investigate LLMs’ abilities to make real-world decisions under uncertainty. Our results show that, though a nontrivial amount of work is still needed to reliably automate the stochastic modeling pipeline in reality, state-of-the-art LLMs demonstrate proficiency on par with human experts in both classroom and practical settings. These findings highlight the potential of building AI agents that assist OR researchers and amplify the real-world impact of OR through automation.
[AI-14] Reinforcement Learning for Synchronised Flow Control in a Dual-Gate Resin Infusion System
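[Quick Read]: This paper addresses how to control resin flow dynamics in Resin Infusion (RI) and Resin Transfer Moulding (RTM) so that the fibre reinforcement is impregnated uniformly, avoiding residual porosity and dry spots. The key to the solution is a Reinforcement Learning (RL) strategy built on process simulations: using Proximal Policy Optimisation (PPO), it manages the fluid dynamics in a partially observable environment so that the flow fronts from the two resin inlets are synchronised and converge accurately.
Link: https://arxiv.org/abs/2506.23923
Authors: Miguel Camacho-Sánchez, Fernando García-Torres, Jesper John Lisegaard, Rocío del Amor, Sankhya Mohanty, Valery Naranjo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 45th Risø International Symposium on Materials Science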
Abstract:Resin infusion (RI) and resin transfer moulding (RTM) are critical processes for the manufacturing of high-performance fibre-reinforced polymer composites, particularly for large-scale applications such as wind turbine blades. Controlling the resin flow dynamics in these processes is critical to ensure the uniform impregnation of the fibre reinforcements, thereby preventing residual porosities and dry spots that impact the consequent structural integrity of the final component. This paper presents a reinforcement learning (RL) based strategy, established using process simulations, for synchronising the different resin flow fronts in an infusion scenario involving two resin inlets and a single outlet. Using Proximal Policy Optimisation (PPO), our approach addresses the challenge of managing the fluid dynamics in a partially observable environment. The results demonstrate the effectiveness of the RL approach in achieving an accurate flow convergence, highlighting its potential towards improving process control and product quality in composites manufacturing.
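For readers unfamiliar with PPO, the sketch below shows the standard clipped surrogate objective that the algorithm maximises. It is a generic, dependency-free illustration; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are choices made here, and nothing in it reflects the paper's simulator, observation space, or hyperparameters.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised) over a batch.

    logp_new / logp_old are log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages are their estimates.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping removes any incentive to move the probability ratio outside `[1 - clip_eps, 1 + clip_eps]` in the direction that would improve the objective, which keeps each policy update small.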
[AI-15] Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence
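[Quick Read]: This paper addresses the poor performance of current AI systems on deductive reasoning tasks despite their notable progress in areas such as math and science. It points out that even the most advanced models regularly fail on easily solvable deductive reasoning tasks, which makes them unfit to deliver artificial general intelligence capable of sound deductive reasoning. The key to the solution is a shift from conventional statistical learning to the stricter exact learning paradigm, which demands correctness on all inputs rather than optimizing statistical performance over a distribution of reasoning problems.
Link: https://arxiv.org/abs/2506.23908
Authors: András György, Tor Lattimore, Nevena Lazić, Csaba Szepesvári
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: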
Abstract:Sound deductive reasoning – the ability to derive new knowledge from existing facts and rules – is an indisputably desirable aspect of general intelligence. Despite the major advances of AI systems in areas such as math and science, especially since the introduction of transformer architectures, it is well-documented that even the most advanced frontier systems regularly and consistently falter on easily-solvable deductive reasoning tasks. Hence, these systems are unfit to fulfill the dream of achieving artificial general intelligence capable of sound deductive reasoning. We argue that their unsound behavior is a consequence of the statistical learning approach powering their development. To overcome this, we contend that to achieve reliable deductive reasoning in learning-based AI systems, researchers must fundamentally shift from optimizing for statistical performance against distributions on reasoning problems and algorithmic tasks to embracing the more ambitious exact learning paradigm, which demands correctness on all inputs. We argue that exact learning is both essential and possible, and that this ambitious objective should guide algorithm design.
[AI-16] Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
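[Quick Read]: This paper addresses the fact that, when Transformers tackle tasks requiring step-by-step reasoning, the order of the intermediate steps critically affects how hard the reasoning is to learn. The core challenge is finding a learning-friendly ordering of the sequence that improves performance on arithmetic tasks. The key to the solution is a two-stage hierarchical approach that reorders decoder input tokens across and within blocks, so that learning-friendly orders can be identified efficiently in a huge search space. Experiments show that the method finds a learning-friendly order among a few billion candidates and, on the multiplication task, recovers the reverse-digit order reported in prior studies.
Link: https://arxiv.org/abs/2506.23875
Authors: Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures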
Abstract:The chain of thought is fundamental in Transformers, which is to perform step-by-step reasoning. Besides what intermediate steps work, the order of these steps critically affects the difficulty of the reasoning. This study addresses a novel task of unraveling chain of thought - reordering decoder input tokens to a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. As the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.
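To make the "reverse-digit order" concrete, the sketch below formats the product of two integers as decoder target tokens in both the usual order and the reversed one. The helper names are hypothetical and the tokenization (one token per digit) is an assumption; the paper's actual sequence format is not given in the abstract.

```python
def forward_digit_target(a: int, b: int) -> list:
    """Digits of a * b, most significant digit first."""
    return list(str(a * b))


def reverse_digit_target(a: int, b: int) -> list:
    """Digits of a * b, least significant digit first."""
    return forward_digit_target(a, b)[::-1]
```

For `12 * 34 = 408`, the forward target is `['4', '0', '8']` and the reversed target is `['8', '0', '4']`. A plausible intuition, stated here as an assumption rather than the paper's claim, is that carries propagate from the low digits upward, so emitting low digits first lets each token depend only on tokens already generated. As for the search space, a sequence of just 13 tokens already admits 13! = 6,227,020,800 orderings.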
[AI-17] Scaling Self-Supervised Representation Learning for Symbolic Piano Performance
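[Quick Read]: This paper addresses symbolic piano music generation and classification, in particular how pretraining a generative autoregressive Transformer can improve the coherence of musical continuations and generalization across tasks. The key to the solution is pretraining on a large corpus of symbolic solo-piano transcriptions and then finetuning on a smaller, high-quality subset to produce musical continuations, perform symbolic classification, and generate general-purpose contrastive MIDI embeddings. Contrastive learning for symbolic music is obtained by adapting the SimCLR framework, which improves performance on downstream tasks.
Link: https://arxiv.org/abs/2506.23869
Authors: Louis Bradshaw, Honglu Fan, Alexander Spangher, Stella Biderman, Simon Colton
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: ISMIR (2025)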
Abstract:We study the capabilities of generative autoregressive transformer models trained on large amounts of symbolic solo-piano transcriptions. After first pretraining on approximately 60,000 hours of music, we use a comparatively smaller, high-quality subset, to finetune models to produce musical continuations, perform symbolic classification tasks, and produce general-purpose contrastive MIDI embeddings by adapting the SimCLR framework to symbolic music. When evaluating piano continuation coherence, our generative model outperforms leading symbolic generation techniques and remains competitive with proprietary audio generation models. On MIR classification benchmarks, frozen representations from our contrastive model achieve state-of-the-art results in linear probe experiments, while direct finetuning demonstrates the generalizability of pretrained representations, often requiring only a few hundred labeled examples to specialize to downstream tasks.
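The contrastive objective used by SimCLR is the NT-Xent loss; a minimal dependency-free sketch follows. It assumes embeddings arrive as plain Python lists with the two augmented views of item `k` stored at positions `2k` and `2k + 1`; that layout, the function names, and the default temperature are choices made here, and how the paper constructs views of a MIDI sequence is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, temperature=0.5):
    """NT-Xent loss over 2N embeddings; z[2k] and z[2k + 1] form a positive pair."""
    n = len(z)
    loss = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner (0<->1, 2<->3, ...)
        pos = cosine(z[i], z[j]) / temperature
        # The denominator runs over every other embedding, positive included.
        others = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        loss += math.log(sum(math.exp(s) for s in others)) - pos
    return loss / n
```

The loss is low when the two views of the same item are closer to each other than to any other embedding in the batch, which is what makes the frozen embeddings useful for linear probing.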
[AI-18] Differentially Private Synthetic Data Release for Topics API Outputs
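[Quick Read]: This paper addresses an obstacle to empirical study of the privacy properties of Privacy-Preserving Ads APIs: no public dataset of realistic API outputs is available for reliable analysis. The key to the solution is a novel methodology for generating synthetic API outputs that are realistic enough to support accurate study while providing strong privacy protection. The method builds on differential privacy: it computes a large number of differentially private statistics describing how API outputs evolve over time, designs a parameterized distribution over sequences of API traces, optimizes the parameters so that the distribution closely matches those statistics, and finally samples from the distribution to create the synthetic data.
Link: https://arxiv.org/abs/2506.23855
Authors: Travis Dick, Alessandro Epasto, Adel Javanmard, Josh Karlin, Andres Munoz Medina, Vahab Mirrokni, Sergei Vassilvitskii, Peilin Zhong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures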
Abstract:The analysis of the privacy properties of Privacy-Preserving Ads APIs is an area of research that has received strong interest from academics, industry, and regulators. Despite this interest, the empirical study of these methods is hindered by the lack of publicly available data. Reliable empirical analysis of the privacy properties of an API, in fact, requires access to a dataset consisting of realistic API outputs; however, privacy concerns prevent the general release of such data to the public. In this work, we develop a novel methodology to construct synthetic API outputs that are simultaneously realistic enough to enable accurate study and provide strong privacy protections. We focus on one Privacy-Preserving Ads APIs: the Topics API, part of Google Chrome’s Privacy Sandbox. We developed a methodology to generate a differentially-private dataset that closely matches the re-identification risk properties of the real Topics API data. The use of differential privacy provides strong theoretical bounds on the leakage of private user information from this release. Our methodology is based on first computing a large number of differentially-private statistics describing how output API traces evolve over time. Then, we design a parameterized distribution over sequences of API traces and optimize its parameters so that they closely match the statistics obtained. Finally, we create the synthetic data by drawing from this distribution. Our work is complemented by an open-source release of the anonymized dataset obtained by this methodology. We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset. We believe that this work will contribute to fostering transparency regarding the privacy properties of Privacy-Preserving Ads APIs. 
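The first step of the methodology, computing differentially private statistics, can be illustrated with the classic Laplace mechanism applied to a histogram. This is a textbook technique swapped in for illustration: the abstract does not say which mechanism the authors use, and the function name `dp_counts`, the topic labels in the usage note, and the sensitivity of 1 (each user contributes to at most one bucket) are assumptions.

```python
import random


def dp_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return counts with Laplace(0, sensitivity / epsilon) noise added to each bucket.

    The difference of two i.i.d. exponential variables with rate 1 / scale
    follows a Laplace distribution with that scale.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    noisy = {}
    for key, value in counts.items():
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[key] = value + noise
    return noisy
```

For example, `dp_counts({"sports": 120, "news": 75}, epsilon=1.0)` returns the two counts perturbed by noise of scale 1. Smaller `epsilon` means stronger privacy and noisier statistics, which is the trade-off any downstream fitting step has to absorb.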
[AI-19] A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents
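[Quick Read]: This paper addresses the novel security risks that autonomous AI agents introduce as their functional scope expands in dynamic, open-ended environments, including memory poisoning, tool misuse, reward hacking, and emergent misalignment, all of which go beyond the threat models of conventional systems or standalone LLMs. The key to the solution is a unified cognitive framework, the Reflective Risk-Aware Agent Architecture (R2A2), which is grounded in Constrained Markov Decision Processes (CMDPs) and uses risk-aware world modeling, meta-policy adaptation, and joint reward-risk optimization to provide principled, proactive safety across the agent's decision-making loop.
Link: https://arxiv.org/abs/2506.23844
Authors: Hang Su, Jun Luo, Chang Liu, Xiao Yang, Yichi Zhang, Yinpeng Dong, Jun Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages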
Abstract:Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents capable of perceiving, reasoning, and acting in dynamic, open-ended environments. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities. While these capabilities significantly expand the functional scope of AI, they also introduce qualitatively novel security risks - such as memory poisoning, tool misuse, reward hacking, and emergent misalignment - that extend beyond the threat models of conventional systems or standalone LLMs. In this survey, we first examine the structural foundations and key capabilities that underpin increasing levels of agent autonomy, including long-term memory retention, modular tool use, recursive planning, and reflective reasoning. We then analyze the corresponding security vulnerabilities across the agent stack, identifying failure modes such as deferred decision hazards, irreversible tool chains, and deceptive behaviors arising from internal state drift or value misalignment. These risks are traced to architectural fragilities that emerge across perception, cognition, memory, and action modules. To address these challenges, we systematically review recent defense strategies deployed at different autonomy layers, including input sanitization, memory lifecycle control, constrained decision-making, structured tool invocation, and introspective reflection. We introduce the Reflective Risk-Aware Agent Architecture (R2A2), a unified cognitive framework grounded in Constrained Markov Decision Processes (CMDPs), which incorporates risk-aware world modeling, meta-policy adaptation, and joint reward-risk optimization to enable principled, proactive safety across the agent’s decision-making loop.
[AI-20] Towards the “Digital Me”: A vision of authentic Conversational Agents powered by personal Human Digital Twins
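[Quick Read]: This paper addresses the limited interactivity and authenticity of traditional Human Digital Twins (HDTs), that is, how to turn an HDT into an interactive digital counterpart with a person's natural conversational style, memories, and behaviors. The key to the solution is a novel system architecture that integrates large language models with dynamically updated personal data and uses context-aware memory retrieval, neural plasticity-inspired consolidation, and adaptive learning mechanisms to keep mirroring and evolving the individual's conversational style and personality.
Link: https://arxiv.org/abs/2506.23826
Authors: Lluís C. Coll, Martin W. Lauer-Schmaltz, Philip Cash, John P. Hansen, Anja Maier
Affiliations: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 24 pages, 9 figures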
Abstract:Human Digital Twins (HDTs) have traditionally been conceptualized as data-driven models designed to support decision-making across various domains. However, recent advancements in conversational AI open new possibilities for HDTs to function as authentic, interactive digital counterparts of individuals. This paper introduces a novel HDT system architecture that integrates large language models with dynamically updated personal data, enabling it to mirror an individual’s conversational style, memories, and behaviors. To achieve this, our approach implements context-aware memory retrieval, neural plasticity-inspired consolidation, and adaptive learning mechanisms, creating a more natural and evolving digital persona. The resulting system does not only replicate an individual’s unique conversational style depending on who they are speaking with, but also enriches responses with dynamically captured personal experiences, opinions, and memories. While this marks a significant step toward developing authentic virtual counterparts, it also raises critical ethical concerns regarding privacy, accountability, and the long-term implications of persistent digital identities. This study contributes to the field of HDTs by describing our novel system architecture, demonstrating its capabilities, and discussing future directions and emerging challenges to ensure the responsible and ethical development of HDTs.
[AI-21] The Impact of AI on Educational Assessment: A Framework for Constructive Alignment
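[Quick Read]: This paper addresses how Artificial Intelligence (AI), and Large Language Models (LLMs) in particular, affects the validity of educational assessment, that is, whether current forms of assessment still measure student performance and comprehension accurately. The proposed solution revisits learning objectives through Constructive Alignment (CA) theory and Bloom's taxonomy and adapts assessment to the way AI affects objectives at each Bloom level. It also stresses that formative and summative assessment should be aligned on whether AI use is permitted, and that structured guidelines and staff training on the capabilities and limitations of AI tools are needed so that lecturers can adapt their assessment methods.
Link: https://arxiv.org/abs/2506.23815
Authors: Patrick Stokkink
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: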
Abstract:The influence of Artificial Intelligence (AI), and specifically Large Language Models (LLM), on education is continuously increasing. These models are frequently used by students, giving rise to the question whether current forms of assessment are still a valid way to evaluate student performance and comprehension. The theoretical framework developed in this paper is grounded in Constructive Alignment (CA) theory and Bloom’s taxonomy for defining learning objectives. We argue that AI influences learning objectives of different Bloom levels in a different way, and assessment has to be adopted accordingly. Furthermore, in line with Bloom’s vision, formative and summative assessment should be aligned on whether the use of AI is permitted or not. Although lecturers tend to agree that education and assessment need to be adapted to the presence of AI, a strong bias exists on the extent to which lecturers want to allow for AI in assessment. This bias is caused by a lecturer’s familiarity with AI and specifically whether they use it themselves. To avoid this bias, we propose structured guidelines on a university or faculty level, to foster alignment among the staff. Besides that, we argue that teaching staff should be trained on the capabilities and limitations of AI tools. In this way, they are better able to adapt their assessment methods.
[AI-22] Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning
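[Quick Read]: This paper addresses Multi-agent Pathfinding (MAPF), in particular efficient, high-quality path planning in large-scale environments. The key to the solution is MAPF-GPT-DDG, a decentralized suboptimal solver based on imitation learning that fine-tunes the pre-trained MAPF model with centralized expert data and a novel delta-data generation mechanism, accelerating training while significantly improving test-time performance. Experiments show that MAPF-GPT-DDG outperforms all existing learning-based MAPF solvers in solution quality across many test scenarios and handles instances with up to 1 million agents in a single environment, setting a new milestone for scalability in MAPF.
Link: https://arxiv.org/abs/2506.23793
Authors: Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: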
Abstract:Multi-agent pathfinding (MAPF) is a common abstraction of multi-robot trajectory planning problems, where multiple homogeneous robots simultaneously move in the shared environment. While solving MAPF optimally has been proven to be NP-hard, scalable and efficient solvers are vital for real-world applications like logistics, search-and-rescue, etc. To this end, decentralized suboptimal MAPF solvers that leverage machine learning have come on stage. Building on the success of the recently introduced MAPF-GPT, a pure imitation learning solver, we introduce MAPF-GPT-DDG. This novel approach effectively fine-tunes the pre-trained MAPF model using centralized expert data. Leveraging a novel delta-data generation mechanism, MAPF-GPT-DDG accelerates training while significantly improving performance at test time. Our experiments demonstrate that MAPF-GPT-DDG surpasses all existing learning-based MAPF solvers, including the original MAPF-GPT, regarding solution quality across many testing scenarios. Remarkably, it can work with MAPF instances involving up to 1 million agents in a single environment, setting a new milestone for scalability in MAPF domains.
[AI-23] When GNNs Met a Word Equations Solver: Learning to Rank Equations (Extended Technical Report)
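[Quick Read]: This paper addresses the fact that, when solving a conjunction of word equations, solver performance depends heavily on the order in which the equations are processed. The key to the solution is using Graph Neural Networks (GNNs) to rank word equations, with a graph-based representation that preserves global information across conjuncts so that the GNN has a holistic view during ranking. To handle the variable number of conjuncts, the paper proposes three ways of adapting a multi-classification task to the equation-ranking problem, and it trains the GNN with the help of minimum unsatisfiable subsets (MUSes).
Link: https://arxiv.org/abs/2506.23784
Authors: Parosh Aziz Abdulla, Mohamed Faouzi Atig, Julie Cailler, Chencheng Liang, Philipp Rümmer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: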
Abstract:Nielsen transformation is a standard approach for solving word equations: by repeatedly splitting equations and applying simplification steps, equations are rewritten until a solution is reached. When solving a conjunction of word equations in this way, the performance of the solver will depend considerably on the order in which equations are processed. In this work, the use of Graph Neural Networks (GNNs) for ranking word equations before and during the solving process is explored. For this, a novel graph-based representation for word equations is presented, preserving global information across conjuncts, enabling the GNN to have a holistic view during ranking. To handle the variable number of conjuncts, three approaches to adapt a multi-classification task to the problem of ranking equations are proposed. The training of the GNN is done with the help of minimum unsatisfiable subsets (MUSes) of word equations. The experimental results show that, compared to state-of-the-art string solvers, the new framework solves more problems in benchmarks where each variable appears at most once in each equation.
[AI-24] Calibrating Graph Neural Networks with Wavelet-Aware Temperature Scaling
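[Quick Read]: This paper addresses the mismatch between the confidence estimates of Graph Neural Networks (GNNs) on relational data and their actual predictive correctness, which limits deployment in safety-critical settings. The key to the solution is a post-hoc calibration framework, Wavelet-Aware Temperature Scaling (WATS), which assigns each node its own temperature based on tunable heat-kernel graph wavelet features, capturing the structural heterogeneity of the graph at a finer grain and yielding more accurate confidence estimates.
Link: https://arxiv.org/abs/2506.23782
Authors: Xiaoyang Li, Linwei Tao, Haohui Lu, Minjing Dong, Junbin Gao, Chang Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: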
Abstract:Graph Neural Networks (GNNs) have demonstrated strong predictive performance on relational data; however, their confidence estimates often misalign with actual predictive correctness, posing significant limitations for deployment in safety-critical settings. While existing graph-aware calibration methods seek to mitigate this limitation, they primarily depend on coarse one-hop statistics, such as neighbor-predicted confidence, or latent node embeddings, thereby neglecting the fine-grained structural heterogeneity inherent in graph topology. In this work, we propose Wavelet-Aware Temperature Scaling (WATS), a post-hoc calibration framework that assigns node-specific temperatures based on tunable heat-kernel graph wavelet features. Specifically, WATS harnesses the scalability and topology sensitivity of graph wavelets to refine confidence estimates, all without necessitating model retraining or access to neighboring logits or predictions. Extensive evaluations across seven benchmark datasets with varying graph structures and two GNN backbones demonstrate that WATS achieves the lowest Expected Calibration Error (ECE) among all compared methods, outperforming both classical and graph-specific baselines by up to 42.3% in ECE and reducing calibration variance by 17.24% on average compared with graph-specific methods. Moreover, WATS remains computationally efficient, scaling well across graphs of diverse sizes and densities. Code will be released based on publication.
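The two ingredients named in this abstract, node-specific temperature scaling and Expected Calibration Error, can be sketched in a few lines. The sketch takes the per-node temperatures as given inputs: how WATS derives them from heat-kernel graph wavelet features is not reproduced here, and the function names and equal-width binning are choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(node_logits, node_temperatures):
    """Apply one temperature per node; class ranking within a node is unchanged."""
    return [softmax(l, t) for l, t in zip(node_logits, node_temperatures)]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(accuracy - avg_conf)
    return ece
```

Because dividing logits by a positive temperature never changes the arg-max, this kind of post-hoc calibration adjusts confidence without touching accuracy, which is why no retraining is needed.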
[AI-25] BayesL: Towards a Logical Framework for Bayesian Networks
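[Quick Read]: This paper addresses how to specify, query, and verify the behaviour of Bayesian Networks (BNs) efficiently. The key to the solution is BayesL, a novel logical framework. BayesL is a structured language that supports flexible queries over BNs, enables versatile reasoning about causal and evidence-based relationships, and allows comprehensive what-if scenario analysis without manual modifications to the model.
Link: https://arxiv.org/abs/2506.23773
Authors: Stefano M. Nicoletti, Mariëlle Stoelinga
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: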
Abstract:We introduce BayesL, a novel logical framework for specifying, querying, and verifying the behaviour of Bayesian networks (BNs). BayesL (pronounced “Basil”) is a structured language that allows for the creation of queries over BNs. It facilitates versatile reasoning concerning causal and evidence-based relationships, and permits comprehensive what-if scenario evaluations without the need for manual modifications to the model.
[AI-26] Multi-Timescale Hierarchical Reinforcement Learning for Unified Behavior and Control of Autonomous Driving
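[Quick Read]: This paper addresses the insufficient attention to policy structure design in Reinforcement Learning (RL) methods for Autonomous Driving (AD). Existing policies either output only short-timescale vehicle control commands, which leads to fluctuating driving behavior, or output only long-timescale driving goals, which cannot achieve unified optimality of driving behavior and control. The key to the solution is a multi-timescale hierarchical RL approach with a hierarchical policy structure: high- and low-level RL policies are trained jointly to produce long-timescale motion guidance and short-timescale control commands, respectively. Motion guidance is represented explicitly by hybrid actions to capture multimodal driving behaviors on structured roads, and a hierarchical safety mechanism ensures multi-timescale safety.
Link: https://arxiv.org/abs/2506.23771
Authors: Guizhe Jin, Zhuoren Li, Bo Leng, Ran Yu, Lu Xiong
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, Submitted to IEEE Robotics and Automation Letters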
Abstract:Reinforcement Learning (RL) is increasingly used in autonomous driving (AD) and shows clear advantages. However, most RL-based AD methods overlook policy structure design. An RL policy that only outputs short-timescale vehicle control commands results in fluctuating driving behavior due to fluctuations in network outputs, while one that only outputs long-timescale driving goals cannot achieve unified optimality of driving behavior and control. Therefore, we propose a multi-timescale hierarchical reinforcement learning approach. Our approach adopts a hierarchical policy structure, where high- and low-level RL policies are unified-trained to produce long-timescale motion guidance and short-timescale control commands, respectively. Therein, motion guidance is explicitly represented by hybrid actions to capture multimodal driving behaviors on structured road and support incremental low-level extend-state updates. Additionally, a hierarchical safety mechanism is designed to ensure multi-timescale safety. Evaluation in simulator-based and HighD dataset-based highway multi-lane scenarios demonstrates that our approach significantly improves AD performance, effectively increasing driving efficiency, action consistency and safety.
[AI-27] Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead
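[Quick Read]: This paper addresses the increasingly complex challenges that Large Language Models (LLMs) face throughout their development lifecycle, which existing research has not explored systematically from a Software Engineering (SE) perspective. The key to the approach is a systematic analysis of six phases of the LLM development lifecycle (requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution), identifying the key challenges in each phase and proposing potential research directions, thereby offering SE-informed insights for future LLM development.
Link: https://arxiv.org/abs/2506.23762
Authors: Hongzhou Rao, Yanjie Zhao, Xinyi Hou, Shenao Wang, Haoyu Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: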
Abstract:The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective of software engineering (SE) approaches. To fill the gap, we systematically analyze research status throughout the LLM development lifecycle, divided into six phases: requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution. We then conclude by identifying the key challenges for each phase and presenting potential research directions to address these challenges. In general, we provide valuable insights from an SE perspective to facilitate future advances in LLM development.
[AI-28] Marker Gene Method: Identifying Stable Solutions in a Dynamic Environment
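[Quick Read]: This paper addresses the unstable convergence of Competitive Co-evolutionary Algorithms (CCEAs) caused by complex dynamics such as intransitivity and the Red Queen effect. The key to the solution is the Marker Gene Method (MGM), which uses a "marker gene" as a dynamic benchmark together with an adaptive weighting mechanism that balances exploration and exploitation, creating strong attractors near Nash Equilibria within the Strictly Competitive Game framework and improving the stability and robustness of the algorithm.
Link: https://arxiv.org/abs/2506.23734
Authors: Hao Shi, Xi Li, Fangfang Xie
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Submitted to IEEE Transactions on Evolutionary Computation. 13 pages, 10 figures. Supplementary material is included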
Abstract:Competitive Co-evolutionary Algorithms (CCEAs) are often hampered by complex dynamics like intransitivity and the Red Queen effect, leading to unstable convergence. To counter these challenges, this paper introduces the Marker Gene Method (MGM), a framework that establishes stability by using a ‘marker gene’ as a dynamic benchmark and an adaptive weighting mechanism to balance exploration and exploitation. We provide rigorous mathematical proofs demonstrating that MGM creates strong attractors near Nash Equilibria within the Strictly Competitive Game framework. Empirically, MGM demonstrates its efficacy across a spectrum of challenges: it stabilizes the canonical Rock-Paper-Scissors game, significantly improves the performance of C-RMOEA/D on ZDT benchmarks, and, when augmented with a Memory Pool (MP) extension, it successfully tames the notoriously pathological Shapley Biased Game. This work presents a theoretically sound and empirically validated framework that substantially enhances the stability and robustness of CCEAs in complex competitive environments.
[AI-29] System-Embedded Diffusion Bridge Models
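[Quick Read]: This paper addresses inverse problems, that is, recovering signals from incomplete or noisy measurements, which are fundamental in science and engineering. The key to the solution is a new class of supervised bridge methods, System embedded Diffusion Bridge Models (SDBs), which explicitly embed the known linear measurement system into the coefficients of a matrix-valued stochastic differential equation (SDE), yielding consistent improvements across diverse linear inverse problems and robust generalization when the system is misspecified between training and deployment.
Link: https://arxiv.org/abs/2506.23726
Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint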
Abstract:Solving inverse problems – recovering signals from incomplete or noisy measurements – is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.
[AI-30] PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?
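【Quick read】: This paper addresses the insufficient understanding that current Vision-Language Models (VLMs) have of the basic Properties, Affordances, and Constraints (PAC) of the physical world in robot manipulation tasks. Although VLMs are widely used for robot manipulation, their training often neglects low-level physical prerequisites, which leaves significant capability gaps in real task execution. The key to the solution is PAC Bench, a comprehensive benchmark that systematically evaluates how well VLMs understand physical concepts, exposing their limitations and encouraging more reliable, physically grounded models.
Link: https://arxiv.org/abs/2506.23725
Authors: Atharva Gundawar,Som Sagar,Ransalu Senanayake
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: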
Abstract:Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object’s state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, and 1 to 3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation and pointing to key areas for targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating physical reasoning in VLMs and guiding the development of more robust, physically grounded models for robotic applications. 
Project Page: this https URL
[AI-31] DABstep: Data Agent Benchmark for Multi-step Reasoning
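【Quick read】: This paper tackles the evaluation of AI agents on realistic multi-step data analysis tasks, where existing benchmarks fall short in complexity, realism, and multimodal requirements. The key to the solution is DABstep, a benchmark of more than 450 real-world challenges drawn from a financial analytics platform. It requires models to combine code-based data processing with contextual reasoning over heterogeneous documentation and tests them through iterative, multi-step problem solving, covering data manipulation, cross-referencing of multiple sources, and precise result reporting.
Link: https://arxiv.org/abs/2506.23719
Authors: Alex Egg,Martin Iglesias Goyanes,Friso Kingma,Andreu Mora,Leandro von Werra,Thomas Wolf
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures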
Abstract:We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi-step data analysis tasks. DABstep comprises over 450 real-world challenges derived from a financial analytics platform, requiring models to combine code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands an iterative, multi-step problem-solving approach, testing capabilities in data manipulation, cross-referencing multiple sources, and precise result reporting. The benchmark provides a factoid-style answer format with automatic correctness checks for objective scoring at scale. We evaluate leading LLM-based agents, revealing a substantial performance gap: even the best agent achieves only 14.55% accuracy on the hardest tasks. We detail our benchmark’s design, dataset composition, task formulation, evaluation protocol, report baseline results and analyze failure modes. DABstep is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.
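The abstract describes factoid-style answers with automatic correctness checks. A minimal sketch of what such a check might look like follows; it is not DABstep's actual scorer, and `normalize`, `is_correct`, and `accuracy` are hypothetical helpers written for illustration.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim, collapse whitespace, and drop a trailing period."""
    text = " ".join(answer.strip().lower().split())
    return text.rstrip(".")


def is_correct(predicted: str, reference: str, tol: float = 1e-6) -> bool:
    """Compare numerically when both answers parse as floats, else as text."""
    try:
        return abs(float(predicted) - float(reference)) <= tol
    except ValueError:
        # At least one answer is not a number: fall back to normalized text.
        return normalize(predicted) == normalize(reference)


def accuracy(predictions, references) -> float:
    """Fraction of predictions that match their reference answer."""
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Short, checkable answers keep scoring objective: no human judge is needed, so a leaderboard can be re-scored at any time.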
[AI-32] A New Perspective On AI Safety Through Control Theory Methodologies
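【Quick read】: This paper tackles the major problem of safety assurance for Artificial Intelligence (AI) systems: in safety-critical, real-world cyber-physical systems, AI promises a new level of autonomy but lacks effective safety assurance. The key to the solution is a new paradigm called “data control”, grounded in an interdisciplinary perspective. Using system theory and system analysis, it relates the data-generation process to its abstraction by AI systems, so that AI engineering can draw on existing safety analysis and assurance in an interdisciplinary way.
Link: https://arxiv.org/abs/2506.23703
Authors: Lars Ullrich,Walter Zimmer,Ross Greer,Knut Graichen,Alois C. Knoll,Mohan Trivedi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to be published as part of the 2025 IEEE Open Journal of Intelligent Transportation Systems (OJ-ITS)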
Abstract:While artificial intelligence (AI) is advancing rapidly and mastering increasingly complex problems with astonishing performance, the safety assurance of such systems is a major concern. Particularly in the context of safety-critical, real-world cyber-physical systems, AI promises to achieve a new level of autonomy but is hampered by a lack of safety assurance. While data-driven control takes up recent developments in AI to improve control systems, control theory in general could be leveraged to improve AI safety. Therefore, this article outlines a new perspective on AI safety based on an interdisciplinary interpretation of the underlying data-generation process and the respective abstraction by AI systems in a system theory-inspired and system analysis-driven manner. In this context, the new perspective, also referred to as data control, aims to stimulate AI engineering to take advantage of existing safety analysis and assurance in an interdisciplinary way to drive the paradigm of data control. Following a top-down approach, a generic foundation for safety analysis and assurance is outlined at an abstract level that can be refined for specific AI systems and applications and is prepared for future innovation.
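[AI-33] Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models
【Quick read】: This paper tackles the core inefficiency of the current research paradigm, which AI for Science (AI4S), used as an analytical tool, cannot resolve. The key to the solution is “Agent for Science” (Agent4S): using agents driven by large language models (LLMs) to automate the entire research workflow, which the authors present as the true Fifth Scientific Paradigm. A five-level classification lays out the path from simple task automation to fully autonomous, collaborative “AI Scientists”.
Link: https://arxiv.org/abs/2506.23692
Authors: Boyuan Zheng,Zerui Fang,Zhe Xu,Rui Wang,Yiwen Chen,Cunshi Wang,Mengwei Qu,Lei Lei,Zhen Feng,Yan Liu,Yuyang Li,Mingzhou Tan,Jiaji Wu,Jianwei Shuai,Jia Li,Fangfu Ye
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: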
Abstract:While AI for Science (AI4S) serves as an analytical tool in the current research paradigm, it does not resolve that paradigm’s core inefficiency. We propose “Agent for Science” (Agent4S), the use of LLM-driven agents to automate the entire research workflow, as the true Fifth Scientific Paradigm. This paper introduces a five-level classification for Agent4S, outlining a clear roadmap from simple task automation to fully autonomous, collaborative “AI Scientists.” This framework defines the next revolutionary step in scientific discovery.
[AI-34] PokéAI: A Goal-Generating Battle-Optimizing Multi-agent System for Pokemon Red
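【Quick read】: This paper asks how to build a multi-agent large language model (LLM) framework that can play Pokémon Red autonomously; the core challenge is autonomous decision making and strategy execution in a complex game environment. The key to the solution is an architecture of three specialized agents (Planning, Execution, and Critique), each with its own memory bank, role, and skill set. Together they form a closed-loop decision-making system that automates task generation, execution, and evaluation.
Link: https://arxiv.org/abs/2506.23689
Authors: Zihao Liu,Xinhang Sui,Yueran Song,Siwen Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: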
Abstract:We introduce PokéAI, the first text-based, multi-agent large language model (LLM) framework designed to autonomously play and progress through Pokémon Red. Our system consists of three specialized agents (Planning, Execution, and Critique), each with its own memory bank, role, and skill set. The Planning Agent functions as the central brain, generating tasks to progress through the game. These tasks are then delegated to the Execution Agent, which carries them out within the game environment. Upon task completion, the Critique Agent evaluates the outcome to determine whether the objective was successfully achieved. Once verification is complete, control returns to the Planning Agent, forming a closed-loop decision-making system. As a preliminary step, we developed a battle module within the Execution Agent. Our results show that the battle AI achieves an average win rate of 80.8% across 50 wild encounters, only 6% lower than the performance of an experienced human player. Furthermore, we find that a model’s battle performance correlates strongly with its LLM Arena score on language-related tasks, indicating a meaningful link between linguistic ability and strategic reasoning. Finally, our analysis of gameplay logs reveals that each LLM exhibits a unique playstyle, suggesting that individual models develop distinct strategic behaviors.
[AI-35] Learning Modular Exponentiation with Transformers
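【Quick read】: This paper studies, from a mechanistic interpretability standpoint, how a Transformer learns to perform modular exponentiation. The key to the approach is to train a 4-layer encoder-decoder Transformer and to examine how number-theoretic properties are encoded in the model, using principled sampling strategies, PCA-based embedding analysis, and activation patching. Reciprocal operand training yields strong performance gains with sudden generalization across related moduli, suggesting that the model internalizes shared arithmetic structure. A subgraph consisting entirely of final-layer attention heads is sufficient for full performance on regular exponentiation. These results suggest that Transformers learn modular arithmetic through specialized computational circuits, pointing toward more interpretable and efficient neural approaches.
Link: https://arxiv.org/abs/2506.23679
Authors: David Demitri Africa,Sara M. Kapoor,Theo Simon Sorg
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: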
Abstract:Modular exponentiation is crucial to number theory and cryptography, yet remains largely unexplored from a mechanistic interpretability standpoint. We train a 4-layer encoder-decoder Transformer model to perform this operation and investigate the emergence of numerical reasoning during training. Utilizing principled sampling strategies, PCA-based embedding analysis, and activation patching, we examine how number-theoretic properties are encoded within the model. We find that reciprocal operand training leads to strong performance gains, with sudden generalization across related moduli. These synchronized accuracy surges reflect grokking-like dynamics, suggesting the model internalizes shared arithmetic structure. We also find a subgraph consisting entirely of attention heads in the final layer sufficient to achieve full performance on the task of regular exponentiation. These results suggest that transformer models learn modular arithmetic through specialized computational circuits, paving the way for more interpretable and efficient neural approaches to modular exponentiation.
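For readers unfamiliar with the operation the model is trained on, modular exponentiation computes `base ** exponent mod modulus`. The sketch below uses the classical square-and-multiply method as a reference; it is a standard algorithm, not something taken from the paper, and the function name `mod_exp` is an illustrative choice.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by square-and-multiply."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    result = 1 % modulus          # handles modulus == 1, where the answer is 0
    base %= modulus
    while exponent > 0:
        if exponent & 1:          # current lowest bit of the exponent is set
            result = (result * base) % modulus
        base = (base * base) % modulus   # square for the next bit
        exponent >>= 1
    return result
```

The loop runs once per bit of the exponent, so the cost grows with the exponent's bit length rather than its value. Python's built-in three-argument `pow(base, exponent, modulus)` returns the same result and is a convenient cross-check.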
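[AI-36] Interactive Reasoning: Visualizing and Controlling Chain-of-Thought Reasoning in Large Language Models
【Quick read】: This paper addresses the problem that the chain-of-thought (CoT) reasoning generated by Large Language Models (LLMs) is verbose and lacks explicit structure, which makes it hard to review and leaves users no way to give feedback. The key to the solution is Interactive Reasoning, which visualizes CoT output as a hierarchy of topics that users can review and modify, letting them steer the model toward customized responses more efficiently and better understand its reasoning and outputs.
Link: https://arxiv.org/abs/2506.23678
Authors: Rock Yuren Pang,K. J. Kevin Feng,Shangbin Feng,Chu Li,Weijia Shi,Yulia Tsvetkov,Jeffrey Heer,Katharina Reinecke
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: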
Abstract:The output quality of large language models (LLMs) can be improved via “reasoning”: generating segments of chain-of-thought (CoT) content to further condition the model prior to producing user-facing output. While these chains contain valuable information, they are verbose and lack explicit organization, making them tedious to review. Moreover, they lack opportunities for user feedback, such as to remove unwanted considerations, add desired ones, or clarify unclear assumptions. We introduce Interactive Reasoning, an interaction design that visualizes chain-of-thought outputs as a hierarchy of topics and enables user review and modification. We implement interactive reasoning in Hippo, a prototype for AI-assisted decision making in the face of uncertain trade-offs. In a user study with 16 participants, we find that interactive reasoning in Hippo allows users to quickly identify and interrupt erroneous generations, efficiently steer the model towards customized responses, and better understand both model reasoning and model outputs. Our work contributes to a new paradigm that incorporates user oversight into LLM reasoning processes.
[AI-37] HASD: Hierarchical Adaption for pathology Slide-level Domain-shift
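【Quick read】: This paper addresses slide-level domain shift in pathology AI, which arises because pathology data is heavily influenced by center-specific conditions. The key to the solution is a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD), which achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two core components: (1) a hierarchical adaptation framework comprising a Domain-level Alignment Solver, a Slide-level Geometric Invariance Regularization, and a Patch-level Attention Consistency Regularization; and (2) a prototype selection mechanism that reduces computational overhead.
Link: https://arxiv.org/abs/2506.23673
Authors: Jingsong Liu,Han Li,Chen Yang,Michael Deutges,Ario Sadafi,Xin You,Katharina Breininger,Nassir Navab,Peter J. Schüffler
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: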
Abstract:Domain shift is a critical problem for pathology AI as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than WSI, thus failing to capture global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD). HASD achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two key components: (1) a hierarchical adaptation framework that integrates a Domain-level Alignment Solver for feature alignment, a Slide-level Geometric Invariance Regularization to preserve the morphological structure, and a Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues; and (2) a prototype selection mechanism that reduces computational overhead. We validate our method on two slide-level tasks across five datasets, achieving a 4.1% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9% C-index gain in a UCEC survival prediction cohort. Our method provides a practical and reliable slide-level domain adaption solution for pathology institutions, minimizing both computational and annotation costs.
[AI-38] QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration
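【Quick read】: This paper addresses incomplete vulnerability detection in open-source software by combining large language models (LLMs) with static analysis tools to improve detection coverage. The key to the solution is QLPro, a framework that systematically integrates LLMs with static analysis tools. On the JavaTest dataset it detects more of the 62 confirmed vulnerabilities than CodeQL alone (41 versus 24) and finds 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.
Link: https://arxiv.org/abs/2506.23644
Authors: Junze Hu,Xiangyu Jin,Yizhe Zeng,Yuling Liu,Yunpeng Li,Dan Du,Kaiyu Xie,Hongsong Zhu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: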
Abstract:We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects. We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.
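[AI-39] Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
【Quick read】: This paper addresses the cost and scalability challenges of building private Large Language Model (LLM) systems for personal or small-group services. The key to the solution is a cluster of Mac Studio machines with Apple M2 Ultra chips that hosts and accelerates the pretrained DBRX model, which has a Mixture-of-Experts (MoE) architecture, in a cost-efficient way. Executing the experts in parallel across multiple nodes significantly reduces inference time, and the proposed optimization schemes eliminate the memory management overhead of Apple's software stack, improving the system's cost efficiency.
Link: https://arxiv.org/abs/2506.23635
Authors: Mu-Chi Chen,Po-Hsuan Huang,Xiangrui Ke,Chia-Heng Tu,Chun Jason Xue,Shih-Hao Hung
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: International Conference on Research in Adaptive and Convergent Systems (RACS '24), November 5–8, 2024, Pompei, Italy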
Abstract:Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI’s ChatGPT, Meta’s Llama, and Databricks’ DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small group services, as targeted by Apple Intelligence. A Mac Studio cluster with Apple’s M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model’s experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple’s software stack. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.
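To make the idea of expert parallelism concrete, the sketch below shows generic top-k Mixture-of-Experts routing with experts assigned to different nodes. It is an illustration under assumptions, not the paper's system or DBRX's actual router: the gate scores, the toy scalar experts, and the `expert_to_node` placement are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores, top_k):
    """Pick the top_k experts and renormalize their gate weights to sum to 1."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_output(x, gate_scores, experts, expert_to_node, top_k=2):
    """Combine the chosen experts' outputs; report which nodes had to run."""
    weights = route(gate_scores, top_k)
    nodes_used = {expert_to_node[i] for i in weights}
    y = sum(w * experts[i](x) for i, w in weights.items())
    return y, nodes_used
```

When the selected experts live on different nodes, each node computes its share and the outputs must be exchanged before they are combined, which is why the abstract stresses network latency.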
[AI-40] gMBA: Expression Semantic Guided Mixed Boolean-Arithmetic Deobfuscation Using Transformer Architectures
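【Quick read】: This paper addresses the difficulty of detecting malware that abuses Mixed Boolean-Arithmetic (MBA) obfuscation; traditional methods treat MBA expressions as a black box and overlook their internal semantic information. The key to the solution is an automatically constructed truth table that serves as a semantic representation of an expression's behavior without relying on external resources, together with a general and extensible guided MBA deobfuscation framework (gMBA). gMBA modifies a Transformer-based encoder-decoder Seq2Seq architecture to incorporate this semantic guidance, which significantly improves deobfuscation performance.
Link: https://arxiv.org/abs/2506.23634
Authors: Youjeong Noh,Joon-Young Paik,Jingun Kwon,Eun-Sun Cho
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: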
Abstract:Mixed Boolean-Arithmetic (MBA) obfuscation protects intellectual property by converting programs into forms that are more complex to analyze. However, MBA has been increasingly exploited by malware developers to evade detection and cause significant real-world problems. Traditional MBA deobfuscation methods often consider these expressions as part of a black box and overlook their internal semantic information. To bridge this gap, we propose a truth table, which is an automatically constructed semantic representation of an expression’s behavior that does not rely on external resources. The truth table is a mathematical form that represents the output of expression for all possible combinations of input. We also propose a general and extensible guided MBA deobfuscation framework (gMBA) that modifies a Transformer-based neural encoder-decoder Seq2Seq architecture to incorporate this semantic guidance. Experimental results and in-depth analysis show that integrating expression semantics significantly improves performance and highlights the importance of internal semantic expressions in recovering obfuscated code to its original form.
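The truth table described in the abstract lists an expression's output for every combination of inputs. The sketch below builds one for a textbook MBA identity, `x + y == (x ^ y) + 2 * (x & y)`. Evaluating over 1-bit inputs and the helper names are assumptions made for illustration; the paper's exact construction may differ.

```python
from itertools import product


def truth_table(expr, num_vars):
    """Evaluate expr on every combination of 0/1 inputs, in lexicographic order."""
    return [expr(*bits) for bits in product((0, 1), repeat=num_vars)]


def obfuscated(x, y):
    """An MBA-obfuscated form of addition."""
    return (x ^ y) + 2 * (x & y)


def simple(x, y):
    """The plain expression the obfuscated form should reduce to."""
    return x + y


table_obfuscated = truth_table(obfuscated, 2)
table_simple = truth_table(simple, 2)

# Exhaustive cross-check on 8-bit operands (results wrapped to 8 bits).
same_on_bytes = all(
    (obfuscated(x, y) & 0xFF) == (simple(x, y) & 0xFF)
    for x in range(256)
    for y in range(256)
)
```

Two expressions that look very different on the surface share one table, which is the kind of semantic signal a sequence-to-sequence model can be conditioned on.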
[AI-41] A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data
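【Quick read】: This paper addresses the large amount of missing Water Quality Data (WQD) in environmental monitoring caused by sensor failures, communication delays, and similar problems, which makes the data High-Dimensional and Sparse (HDS). Traditional imputation methods struggle to capture the underlying dynamics and deep features of the data, so their imputation performance is unsatisfactory. The proposed solution is a Nonlinear Low-rank Representation model (NLR) based on Convolutional Neural Networks (CNNs). Its key idea is to use CNNs to fuse temporal features, modeling the dependence of data between time slots, and to extract nonlinear interactions and local patterns, mining higher-order relational features and achieving deep fusion of multidimensional information.
Link: https://arxiv.org/abs/2506.23629
Authors: Xin Liao,Bing Yang,Cai Yu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, conference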
Abstract:The integrity of Water Quality Data (WQD) is critical in environmental monitoring for scientific decision-making and ecological protection. However, water quality monitoring systems are often challenged by large amounts of missing data due to unavoidable problems such as sensor failures and communication delays, which in turn make water quality data High-Dimensional and Sparse (HDS). Traditional data imputation methods struggle to depict the potential dynamics and fail to capture the deep data features, resulting in unsatisfactory imputation performance. To effectively address the above issues, this paper proposes a Nonlinear Low-rank Representation model (NLR) with Convolutional Neural Networks (CNN) for imputing missing WQD, which utilizes CNNs to implement two ideas: a) fusing temporal features to model the temporal dependence of data between time slots, and b) extracting nonlinear interactions and local patterns to mine higher-order relationship features and achieve deep fusion of multidimensional information. Experimental studies on three real water quality datasets demonstrate that the proposed model significantly outperforms existing state-of-the-art data imputation models in terms of estimation accuracy. It provides an effective approach for handling water quality monitoring data in complex dynamic environments.
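[AI-42] The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking
【Quick read】: This paper addresses the inability of traditional Kubernetes networking to meet the growing demands of AI/ML and evolving telecommunications (Telco) infrastructure. The key to the solution is Kubernetes Network Drivers (KNDs), a modular and declarative architecture that integrates network resource management into the Kubernetes core through Dynamic Resource Allocation (DRA), Node Resource Interface (NRI) improvements, and upcoming changes to the OCI Runtime Specification. The DraNet implementation demonstrates declarative attachment of network interfaces, including Remote Direct Memory Access (RDMA) devices, which significantly boosts high-performance AI/ML workloads.
Link: https://arxiv.org/abs/2506.23628
Authors: Antonio Ojea
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 6 pages, 9 figures, submitted to IEEE LCN Special Track on Cloud-AI-Native Mobile Networks Powered by eBPF (CAMe 2025)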
Abstract:Traditional Kubernetes networking struggles to meet the escalating demands of AI/ML and evolving Telco infrastructure. This paper introduces Kubernetes Network Drivers (KNDs), a transformative, modular, and declarative architecture designed to overcome current imperative provisioning and API limitations. KNDs integrate network resource management into Kubernetes’ core by utilizing Dynamic Resource Allocation (DRA), Node Resource Interface (NRI) improvements, and upcoming OCI Runtime Specification changes. Our DraNet implementation demonstrates declarative attachment of network interfaces, including Remote Direct Memory Access (RDMA) devices, significantly boosting high-performance AI/ML workloads. This capability enables sophisticated cloud-native applications and lays crucial groundwork for future Telco solutions, fostering a “galaxy” of specialized KNDs for enhanced application delivery and reduced operational complexity.
[AI-43] Self-correcting Reward Shaping via Language Models for Reinforcement Learning Agents in Games
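【Quick read】: This paper addresses two key problems in deploying Reinforcement Learning (RL) agents in production: designing an effective reward function typically requires an RL expert, and previously tuned reward weights may no longer be optimal once a game's content or mechanics change. The key to the solution is an automated approach driven by a user-defined, language-based behavioral goal. At each iteration a Language Model (LM) proposes updated reward function weights based on the target behavior and performance statistics from earlier training rounds. This closed loop lets the LM self-correct and refine its output over time without manual reward engineering.
Link: https://arxiv.org/abs/2506.23626
Authors: António Afonso,Iolanda Leite,Alessandro Sestini,Florian Fuchs,Konrad Tollmar,Linus Gisslén
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages in total, 10 pages of main paper, 5 figures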
Abstract:Reinforcement Learning (RL) in games has gained significant momentum in recent years, enabling the creation of different agent behaviors that can transform a player’s gaming experience. However, deploying RL agents in production environments presents two key challenges: (1) designing an effective reward function typically requires an RL expert, and (2) when a game’s content or mechanics are modified, previously tuned reward weights may no longer be optimal. Towards the latter challenge, we propose an automated approach for iteratively fine-tuning an RL agent’s reward function weights, based on a user-defined language based behavioral goal. A Language Model (LM) proposes updated weights at each iteration based on this target behavior and a summary of performance statistics from prior training rounds. This closed-loop process allows the LM to self-correct and refine its output over time, producing increasingly aligned behavior without the need for manual reward engineering. We evaluate our approach in a racing task and show that it consistently improves agent performance across iterations. The LM-guided agents show a significant increase in performance from 9% to 74% success rate in just one iteration. We compare our LM-guided tuning against a human expert’s manual weight design in the racing task: by the final iteration, the LM-tuned agent achieved an 80% success rate, and completed laps in an average of 855 time steps, a competitive performance against the expert-tuned agent’s peak 94% success, and 850 time steps.
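The closed loop described in the abstract (train, summarize statistics, propose new weights, repeat) can be sketched as follows. Everything here is a hypothetical stand-in: `stub_propose` replaces the language model, `stub_evaluate` replaces RL training, and the feature names and numbers are invented purely so the loop runs.

```python
def shaped_reward(features, weights):
    """Weighted sum of reward features, e.g. progress and collisions."""
    return sum(weights[name] * value for name, value in features.items())


def tune(weights, propose, train_and_evaluate, rounds):
    """Alternate between evaluating the weights and asking for an update."""
    history = []
    for _ in range(rounds):
        stats = train_and_evaluate(weights)
        history.append((dict(weights), stats))
        weights = propose(weights, history)   # the LM's role in the paper
    return weights, history


def stub_evaluate(weights):
    """Hypothetical stand-in for a training round: returns summary statistics."""
    return {"success_rate": min(1.0, weights["progress"] / 2.0)}


def stub_propose(weights, history):
    """Hypothetical stand-in for the LM: push harder on progress if failing."""
    updated = dict(weights)
    if history[-1][1]["success_rate"] < 0.8:
        updated["progress"] *= 2
    return updated


final_weights, history = tune(
    {"progress": 0.25, "collision": -1.0}, stub_propose, stub_evaluate, rounds=4
)
```

The design point is that the proposer only ever sees the goal and past statistics, so it can be swapped for a language model without touching the training code.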
[AI-44] SoK: Semantic Privacy in Large Language Models
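【Quick read】: This paper addresses semantic privacy in Large Language Models (LLMs) deployed in sensitive domains: traditional data privacy measures cannot adequately protect information that is implicit, contextual, or inferable. The key to the solution is a lifecycle-centric framework that analyzes how semantic privacy risks emerge across the input processing, pretraining, fine-tuning, and alignment stages, and assesses how well existing defenses (such as differential privacy, embedding encryption, edge computing, and unlearning) address these threats. The study reveals critical gaps in semantic-level protection, especially against contextual inference and latent representation leakage, and outlines open challenges for future research.
Link: https://arxiv.org/abs/2506.23603
Authors: Baihe Ma,Yanna Jiang,Xu Wang,Guangshen Yu,Qin Wang,Caijun Sun,Chen Li,Xuelei Qi,Ying He,Wei Ni,Ren Ping Liu
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: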
Abstract:As Large Language Models (LLMs) are increasingly deployed in sensitive domains, traditional data privacy measures prove inadequate for protecting information that is implicit, contextual, or inferable - what we define as semantic privacy. This Systematization of Knowledge (SoK) introduces a lifecycle-centric framework to analyze how semantic privacy risks emerge across input processing, pretraining, fine-tuning, and alignment stages of LLMs. We categorize key attack vectors and assess how current defenses, such as differential privacy, embedding encryption, edge computing, and unlearning, address these threats. Our analysis reveals critical gaps in semantic-level protection, especially against contextual inference and latent representation leakage. We conclude by outlining open challenges, including quantifying semantic leakage, protecting multimodal inputs, balancing de-identification with generation quality, and ensuring transparency in privacy enforcement. This work aims to inform future research on designing robust, semantically aware privacy-preserving techniques for LLMs.
[AI-45] When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series ICML2025
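【Quick read】: This paper addresses Anomaly Prediction (AP), the task of predicting the specific future time points at which anomalies will occur. Existing time series methods fail at AP because they focus only on immediate anomalies or cannot give precise predictions of future ones. The key to the solution is a framework called Anomaly to Prompt (A2P), consisting of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns with signal-adaptive prompts, improving the prediction of future anomalies.
Link: https://arxiv.org/abs/2506.23596
Authors: Min-Yeong Park,Won-Jeong Lee,Seong Tae Kim,Gyeong-Moon Park
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, 12 tables, ICML 2025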
Abstract:Recently, forecasting future abnormal events has emerged as an important scenario to tackle real-world necessities. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains under-explored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address the AP task, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal adaptive prompt. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies. Our implementation code is available at this https URL.
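[AI-46] Transition Matching: Scalable and Flexible Generative Modeling
【Quick read】: This paper addresses the constrained design space of generative models and the challenge of unifying text and media generation. The key to the solution is Transition Matching (TM), a new discrete-time, continuous-state generative paradigm that decomposes complex generation tasks into simpler Markov transitions. This permits expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, widening the design options for generative models.
Link: https://arxiv.org/abs/2506.23589
Authors: Neta Shaul,Uriel Singer,Itai Gat,Yaron Lipman
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: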
Abstract:Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on the text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.
[AI-47] Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models
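[Quick read]: This paper addresses jailbreaking attacks on generative AI, i.e., prompts crafted to bypass safety mechanisms. The key to its solution is to use multi-agent large language model (LLM) systems as a defence: by comparing single-agent and multi-agent configurations, it shows that multi-agent systems improve resistance to jailbreaks, particularly by reducing false negatives.
Link: https://arxiv.org/abs/2506.23576
Authors: Maria Carolina Cornelia Wit,Jun Pang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages, 1 figure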
Abstract:Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defence against such attacks. We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, its effectiveness varies by attack type, and it introduces trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defences and suggest directions for improving alignment robustness in future LLM systems.
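The trade-off described in this abstract (fewer false negatives, more false positives) can be made concrete with a minimal sketch. The function name `defence_error_rates` and the toy labels below are hypothetical illustrations, not data or code from the paper; "flagged" means the defence blocked the prompt.

```python
def defence_error_rates(is_harmful, flagged):
    """Return (false_negative_rate, false_positive_rate) for a defence.

    is_harmful[i] is True when prompt i is a real jailbreak attempt;
    flagged[i] is True when the defence blocked prompt i.
    """
    # A false negative is a harmful prompt that slipped through.
    fn = sum(h and not f for h, f in zip(is_harmful, flagged))
    # A false positive is a benign prompt that was blocked.
    fp = sum(f and not h for h, f in zip(is_harmful, flagged))
    harmful = sum(is_harmful)
    benign = len(is_harmful) - harmful
    fnr = fn / harmful if harmful else 0.0
    fpr = fp / benign if benign else 0.0
    return fnr, fpr


# Toy, made-up outcomes: four harmful prompts followed by four benign ones.
truth = [True, True, True, True, False, False, False, False]
single_agent = [True, True, False, False, False, False, False, False]
multi_agent = [True, True, True, False, True, False, False, False]

print(defence_error_rates(truth, single_agent))
print(defence_error_rates(truth, multi_agent))
```

In this toy data the multi-agent row misses fewer harmful prompts but blocks one benign prompt, which is the shape of trade-off the abstract reports; the numbers themselves are invented.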
[AI-48] Online Human Action Detection during Escorting
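[Quick read]: This paper addresses the failure of conventional navigation-focused strategies to support robot assistants that escort people through complex indoor environments. In crowded scenes in particular, the escortee may be unable to keep following for various reasons, so conventional systems struggle to provide effective escorting. The key to its solution is a novel neural network architecture that performs person re-identification and action prediction in real time, so that the robot can continuously detect and interpret the escortee's behavior and adjust its own motion accordingly, improving the stability and effectiveness of escorting.
Link: https://arxiv.org/abs/2506.23573
Authors: Siddhartha Mondal,Avik Mitra,Chayan Sarkar
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in IEEE RO-MAN '25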
Abstract:The deployment of robot assistants in large indoor spaces has seen significant growth, with escorting tasks becoming a key application. However, most current escorting robots primarily rely on navigation-focused strategies, assuming that the person being escorted will follow without issue. In crowded environments, this assumption often falls short, as individuals may struggle to keep pace, become obstructed, get distracted, or need to stop unexpectedly. As a result, conventional robotic systems are often unable to provide effective escorting services due to their limited understanding of human movement dynamics. To address these challenges, an effective escorting robot must continuously detect and interpret human actions during the escorting process and adjust its movement accordingly. However, there is currently no existing dataset designed specifically for human action detection in the context of escorting. Given that escorting often occurs in crowded environments, where other individuals may enter the robot’s camera view, the robot also needs to identify the specific human it is escorting (the subject) before predicting their actions. Since no existing model performs both person re-identification and action prediction in real-time, we propose a novel neural network architecture that can accomplish both tasks. This enables the robot to adjust its speed dynamically based on the escortee’s movements and seamlessly resume escorting after any disruption. In comparative evaluations against strong baselines, our system demonstrates superior efficiency and effectiveness, showcasing its potential to significantly improve robotic escorting services in complex, real-world scenarios.
[AI-49] CooT: Learning to Coordinate In-Context with Coordination Transformers
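[Quick read]: This paper addresses effective coordination among artificial agents in dynamic and uncertain environments in multi-agent systems, where existing approaches such as self-play and population-based methods either generalize poorly to unseen partners or require extensive training. The key to its solution is Coordination Transformers (CooT), a novel in-context coordination framework that uses recent interaction histories to adapt rapidly to unseen partners. By predicting actions aligned with observed partner interactions, it focuses explicitly on adapting to new partner behaviors and learns effective coordination strategies quickly, without explicit supervision or fine-tuning.
Link: https://arxiv.org/abs/2506.23549
Authors: Huai-Chih Wang,Hsiang-Chun Chuang,Hsi-Chun Cheng,Dai-Jie Wu,Shao-Hua Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 23 pages, 10 tables, 8 figures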
Abstract:Effective coordination among artificial agents in dynamic and uncertain environments remains a significant challenge in multi-agent systems. Existing approaches, such as self-play and population-based methods, either generalize poorly to unseen partners or require extensive training. To overcome these limitations, we propose Coordination Transformers (CooT), a novel in-context coordination framework that uses recent interaction histories to adapt to unseen partners rapidly. Unlike previous approaches that primarily aim to increase the diversity of training partners, CooT explicitly focuses on adapting to new partner behaviors by predicting actions aligned with observed partner interactions. Trained on interaction trajectories collected from diverse pairs of agents with complementary behaviors, CooT quickly learns effective coordination strategies without explicit supervision or fine-tuning. Evaluations on the Overcooked benchmark demonstrate that CooT significantly outperforms baseline methods in coordination tasks involving previously unseen partners. Human evaluations further confirm CooT as the most effective collaborative partner, while extensive ablations highlight its robustness, flexibility, and sensitivity to context in multi-agent scenarios.
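[AI-50] ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data
[Quick read]: This paper aims to address the challenge of automatically extracting chemical synthesis steps from the organic chemistry literature, a task made difficult by the inherent ambiguity of chemical language and the high cost of the human annotation needed to build reliable computer-aided extraction protocols. The key to its solution is ChemActor, a fully fine-tuned large language model (LLM) that acts as a chemical executor and converts unstructured experimental procedures into structured action sequences. To cope with insufficient and low-quality annotated data, it uses a sequential LLM-generated data framework that combines a data selection module based on distribution divergence with a general-purpose LLM to generate executable actions from a single molecule input.
Link: https://arxiv.org/abs/2506.23520
Authors: Yu Zhang,Ruijie Yu,Jidong Tian,Feng Zhu,Jiapeng Liu,Xiaokang Yang,Yaohui Jin,Yanyan Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: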
Abstract:With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model’s advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: this https URL.
[AI-51] MGPRL: Distributed Multi-Gaussian Processes for Wi-Fi-based Multi-Robot Relative Localization in Large Indoor Environments IROS2025
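[Quick read]: This paper aims to address relative localization for multi-robot systems in GPS-denied environments. Existing methods usually depend on costly or short-range sensors such as cameras and LiDAR, which leads to high computational overhead and poor adaptability in disjoint environments. The key to its solution is MGPRL, a distributed framework that performs relative localization using the convex hull of multiple Wi-Fi access points (APs). It uses co-regionalized multi-output Gaussian Processes for efficient Radio Signal Strength Indicator (RSSI) field prediction, combined with uncertainty-aware multi-AP localization and weighted convex-hull alignment, to achieve robust relative pose estimation.
Link: https://arxiv.org/abs/2506.23514
Authors: Sai Krishna Ghanta,Ramviyas Parasuraman
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to IROS 2025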
Abstract:Relative localization is a crucial capability for multi-robot systems operating in GPS-denied environments. Existing approaches for multi-robot relative localization often depend on costly or short-range sensors like cameras and LiDARs. Consequently, these approaches face challenges such as high computational overhead (e.g., map merging) and difficulties in disjoint environments. To address this limitation, this paper introduces MGPRL, a novel distributed framework for multi-robot relative localization using convex-hull of multiple Wi-Fi access points (AP). To accomplish this, we employ co-regionalized multi-output Gaussian Processes for efficient Radio Signal Strength Indicator (RSSI) field prediction and perform uncertainty-aware multi-AP localization, which is further coupled with weighted convex hull-based alignment for robust relative pose estimation. Each robot predicts the RSSI field of the environment by an online scan of APs in its environment, which are utilized for position estimation of multiple APs. To perform relative localization, each robot aligns the convex hull of its predicted AP locations with that of the neighbor robots. This approach is well-suited for devices with limited computational resources and operates solely on widely available Wi-Fi RSSI measurements without necessitating any dedicated pre-calibration or offline fingerprinting. We rigorously evaluate the performance of the proposed MGPRL in ROS simulations and demonstrate it with real-world experiments, comparing it against multiple state-of-the-art approaches. The results showcase that MGPRL outperforms existing methods in terms of localization accuracy and computational efficiency. Finally, we open source MGPRL as a ROS package this https URL.
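The abstract relies on the convex hull of estimated AP positions. As a rough illustration of that one step only, the sketch below computes a 2D convex hull with Andrew's monotone chain, a standard algorithm that is swapped in here and is not confirmed to be what MGPRL uses. The AP coordinates are made up, and the paper's weighting and hull alignment are not reproduced.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Hypothetical estimated AP positions (metres) in one robot's local frame.
ap_estimates = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(ap_estimates)
print(hull)
```

The two interior points are discarded, so each robot would only need to exchange the outer vertices with its neighbors before any alignment step.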
[AI-52] Hybrid Approach for Electricity Price Forecasting using AlexNet and LSTM
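[Quick read]: This paper addresses the insufficient accuracy of traditional methods for electricity price forecasting, in particular their poor handling of forex time-series data and their focus on demand and price alone, which ignores other influencing factors. The key to its solution is a hybrid model that combines AlexNet (for efficient feature extraction) with LSTM (for learning time-series patterns) and adds external variables such as temperature, sunlight, and rain to improve forecasting accuracy. Experiments show that the hybrid model is more accurate than standalone RNN and ANN models.
Link: https://arxiv.org/abs/2506.23504
Authors: Bosubabu Sambana,Kotamsetty Geethika Devi,Bandi Rajeswara Reddy,Galeti Mohammad Hussain,Gownivalla Siddartha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 Pages, 7 Figures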
Abstract:The recent development of advanced machine learning methods for hybrid models has greatly addressed the need for the correct prediction of electrical prices. This method combines AlexNet and LSTM algorithms, which are used to introduce a new model with higher accuracy in price forecasting. Despite RNN and ANN being effective, they often fail to deal with forex time sequence data. The traditional methods do not accurately forecast the prices. These traditional methods only focus on demand and price which leads to insufficient analysis of data. To address this issue, using the hybrid approach, which focuses on external variables that also effect the predicted prices. Nevertheless, due to AlexNet’s excellent feature extraction and LSTM’s learning sequential patterns, the prediction accuracy is vastly increased. The model is built on the past data, which has been supplied with the most significant elements like demand, temperature, sunlight, and rain. For example, the model applies methods, such as minimum-maximum scaling and a time window, to predict the electricity prices of the future. The results show that this hybrid model is good than the standalone ones in terms of accuracy. Although we got our accuracy rating of 97.08, it shows higher accompaniments than remaining models RNN and ANN with accuracies of 96.64 and 96.63 respectively.
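The abstract mentions minimum-maximum scaling and a time window as preprocessing. A minimal sketch of those two steps follows; the function names, the window length of 3, and the price values are illustrative assumptions, not details from the paper.

```python
def min_max_scale(values):
    """Linearly rescale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


# Made-up hourly prices, used only to show the shapes involved.
prices = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = min_max_scale(prices)
samples = make_windows(scaled, window=3)
for inputs, target in samples:
    print(inputs, "->", target)
```

Each `(inputs, target)` pair is one supervised training example; a sequence model such as an LSTM would consume `inputs` and be trained to predict `target`.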
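[AI-53] Data Augmentation for Cognitive Behavioral Therapy: Leveraging ERNIE Language Models using Artificial Intelligence
[Quick read]: This paper addresses how, in the digital age, to effectively identify the negative emotions and cognitive distortions that users express on social platforms, so that psychotherapists can deliver timely and targeted interventions. Existing methods fall notably short in analyzing these cognitive pathways. The key to the proposed solution is a Cognitive Behavioral Therapy (CBT) framework that combines acceptance and commitment strategies with data augmentation, using BERT and RoBERTa for sentiment analysis, T5 and PEGASUS for text summarization, and mT5 for multilingual text translation. It thereby detects negative emotions and cognitive distortions in social media data and further predicts other potential mental health problems, such as phobias and eating disorders, giving broader support for psychological intervention.
Link: https://arxiv.org/abs/2506.23503
Authors: Bosubabu Sambana,Kondreddygari Archana,Suram Indhra Sena Reddy,Shaik Meethaigar Jameer Basha,Shaik Karishma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 Pages, 5 Figures, IEEE IDCIoT 2025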
Abstract:Cognitive Behavioral Therapy (CBT) is a proven approach for addressing the irrational thought patterns associated with mental health disorders, but its effectiveness relies on accurately identifying cognitive pathways to provide targeted treatment. In today’s digital age, individuals often express negative emotions on social media, where they may reveal cognitive distortions, and in severe cases, exhibit suicidal tendencies. However, there is a significant gap in methodologies designed to analyze these cognitive pathways, which could be critical for psychotherapists aiming to deliver timely and effective interventions in online environments. Cognitive Behavioral Therapy (CBT) framework leveraging acceptance, commitment and data augmentation to categorize and address both textual and visual content as positive or negative. Specifically, the system employs BERT, RoBERTa for Sentiment Analysis and T5, PEGASUS for Text Summarization, mT5 for Text Translation in Multiple Languages focusing on detecting negative emotions and cognitive distortions within social media data. While existing models are primarily designed to identify negative thoughts, the proposed system goes beyond this by predicting additional negative side effects and other potential mental health disorders likes Phobias, Eating Disorders. This enhancement allows for a more comprehensive understanding and intervention strategy, offering psychotherapists a powerful tool for early detection and treatment of various psychological issues.
[AI-54] The Confidence Paradox: Can LLM Know When It's Wrong
[Quick read]: This paper aims to address the shortcomings of Document Visual Question Answering (DocVQA) systems in ethical responsiveness, in particular the mismatch between model confidence and actual knowledge, which can lead a model to answer ambiguous questions overconfidently or to fail to communicate uncertainty in a trustworthy way. The key to its solution is HonestVQA, a self-supervised honesty calibration framework that quantifies uncertainty to identify knowledge gaps, aligns model confidence with actual correctness through weighted loss functions, and enforces ethical response behavior via contrastive learning, yielding an ethically aligned DocVQA system.
Link: https://arxiv.org/abs/2506.23464
Authors: Sahil Tripathi,Md Tabrez Nafis,Imran Hussain,Jiechao Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Document Visual Question Answering (DocVQA) systems are increasingly deployed in real world applications, yet they remain ethically opaque, often producing overconfident answers to ambiguous questions or failing to communicate uncertainty in a trustworthy manner. This misalignment between model confidence and actual knowledge poses significant risks, particularly in domains requiring ethical accountability. Existing approaches such as LayoutLMv3, UDOP, and DONUT have advanced SOTA performance by focusing on architectural sophistication and accuracy; however, they fall short in ethical responsiveness. To address these limitations, we introduce HonestVQA, a self-supervised honesty calibration framework for ethically aligned DocVQA. Our model-agnostic method quantifies uncertainty to identify knowledge gaps, aligns model confidence with actual correctness using weighted loss functions, and enforces ethical response behavior via contrastive learning. We further introduce two principled evaluation metrics–Honesty Score (H-Score) and Ethical Confidence Index (ECI)–to benchmark alignment between confidence, accuracy, and ethical communication. Empirically, HonestVQA improves DocVQA accuracy by up to 4.3% and F1 by 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets. It reduces overconfidence, lowering H-Score and ECI by 0.072 and 0.078, respectively. In cross domain evaluation, it achieves up to 78.9% accuracy and 76.1% F1-score, demonstrating strong generalization. Ablation shows a 3.8% drop in accuracy without alignment or contrastive loss.
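[AI-55] Can We Predict the Unpredictable? Leveraging DisasterNet-LLM for Multimodal Disaster Classification
[Quick read]: This paper addresses the shortcomings of traditional disaster management methods in integrating multimodal data (such as images, weather records, and textual reports), with the goal of more timely and accurate disaster analysis. The key to its solution is DisasterNet-LLM, a specialized Large Language Model (LLM) that uses advanced pretraining, cross-modal attention mechanisms, and adaptive transformers to improve disaster classification performance substantially.
Link: https://arxiv.org/abs/2506.23462
Authors: Manaswi Kulahara,Gautam Siddharth Kashyap,Nipun Joshi,Arpita Soni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025), scheduled for 3 - 8 August 2025 in Brisbane, Australia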
Abstract:Effective disaster management requires timely and accurate insights, yet traditional methods struggle to integrate multimodal data such as images, weather records, and textual reports. To address this, we propose DisasterNet-LLM, a specialized Large Language Model (LLM) designed for comprehensive disaster analysis. By leveraging advanced pretraining, cross-modal attention mechanisms, and adaptive transformers, DisasterNet-LLM excels in disaster classification. Experimental results demonstrate its superiority over state-of-the-art models, achieving higher accuracy of 89.5%, an F1 score of 88.0%, AUC of 0.92%, and BERTScore of 0.88% in multimodal disaster classification tasks.
[AI-56] From Large-scale Audio Tagging to Real-Time Explainable Emergency Vehicle Sirens Detection
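[Quick read]: This paper aims to address accurate recognition of Emergency Vehicle (EV) sirens in intelligent transportation systems, smart city monitoring systems, and autonomous driving technologies. Existing automatic solutions are limited by the lack of large-scale annotated datasets and by the computational demands of state-of-the-art sound event detection models. The study proposes E2PANNs (Efficient Emergency Pre trained Audio Neural Networks), a lightweight convolutional neural network architecture built on the PANNs framework and optimized for binary EV siren detection. The key is fine-tuning and evaluating on a purpose-built subset of AudioSet (AudioSet EV), verifying feasibility on embedded hardware, and using interpretability analyses to confirm that the model captures the spectrotemporal characteristics of different types of EV sirens.
Link: https://arxiv.org/abs/2506.23437
Authors: Stefano Giacomelli,Marco Giordano,Claudia Rinaldi,Fabio Graziosi
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: pre-print (submitted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing)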
Abstract:Accurate recognition of Emergency Vehicle (EV) sirens is critical for the integration of intelligent transportation systems, smart city monitoring systems, and autonomous driving technologies. Modern automatic solutions are limited by the lack of large scale, curated datasets and by the computational demands of state of the art sound event detection models. This work introduces E2PANNs (Efficient Emergency Pre trained Audio Neural Networks), a lightweight Convolutional Neural Network architecture derived from the PANNs framework, specifically optimized for binary EV siren detection. Leveraging our dedicated subset of AudioSet (AudioSet EV) we fine-tune and evaluate E2PANNs across multiple reference datasets and test its viability on embedded hardware. The experimental campaign includes ablation studies, cross-domain benchmarking, and real-time inference deployment on edge device. Interpretability analyses exploiting Guided Backpropagation and ScoreCAM algorithms provide insights into the model internal representations and validate its ability to capture distinct spectrotemporal patterns associated with different types of EV sirens. Real time performance is assessed through frame wise and event based detection metrics, as well as a detailed analysis of false positive activations. Results demonstrate that E2PANNs establish a new state of the art in this research domain, with high computational efficiency, and suitability for edge-based audio monitoring and safety-critical applications.
[AI-57] Accurate Parameter-Efficient Test-Time Adaptation for Time Series Forecasting ICML2025 DATE
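[Quick read]: This paper aims to address the performance degradation of pre-trained forecasting models caused by the non-stationarity of real-world time series. Existing methods adapt at test time by updating the whole model, which increases memory and compute costs. The key to the proposed solution is PETSA, a parameter-efficient test-time adaptation method that adjusts the forecaster by updating only small calibration modules on the input and output rather than retraining the whole model. PETSA uses low-rank adapters and a dynamic gating mechanism to adjust representations without adding many parameters, and introduces a specialized loss that combines a robust term, a frequency-domain term, and a patch-wise structural term to maintain accuracy.
Link: https://arxiv.org/abs/2506.23424
Authors: Heitor R. Medeiros,Hossein Sharifi-Noghabi,Gabriel L. Oliveira,Saghar Irandoust
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Second Workshop on Test-Time Adaptation: Putting Updates to the Test! at ICML 2025, Vancouver, Canada. 2025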
Abstract:Real-world time series often exhibit a non-stationary nature, degrading the performance of pre-trained forecasting models. Test-Time Adaptation (TTA) addresses this by adjusting models during inference, but existing methods typically update the full model, increasing memory and compute costs. We propose PETSA, a parameter-efficient method that adapts forecasters at test time by only updating small calibration modules on the input and output. PETSA uses low-rank adapters and dynamic gating to adjust representations without retraining. To maintain accuracy despite limited adaptation capacity, we introduce a specialized loss combining three components: (1) a robust term, (2) a frequency-domain term to preserve periodicity, and (3) a patch-wise structural term for structural alignment. PETSA improves the adaptability of various forecasting backbones while requiring fewer parameters than baselines. Experimental results on benchmark datasets show that PETSA achieves competitive or better performance across all horizons. Our code is available at: this https URL
[AI-58] BenchMake: Turn any scientific data set into a reproducible benchmark
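[Quick read]: This paper addresses the scarcity of benchmark data sets in computational science, which stems from the uniqueness of the problems and the pace of change in the associated domains and makes new innovations hard to evaluate. The key to its solution is BenchMake, a tool that uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull, and partitions the matched data instances into a testing set that maximises divergence and statistical significance across tabular, graph, image, signal, and textual modalities.
Link: https://arxiv.org/abs/2506.23419
Authors: Amanda S Barnard
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 10 pages, 15 pages in Appendix, 15 figures, 5 tables, 57 references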
Abstract:Benchmark data sets are a cornerstone of machine learning development and applications, ensuring new methods are robust, reliable and competitive. The relative rarity of benchmark sets in computational science, due to the uniqueness of the problems and the pace of change in the associated domains, makes evaluating new innovations difficult for computational scientists. In this paper a new tool is developed and tested to potentially turn any of the increasing numbers of scientific data sets made openly available into a benchmark accessible to the community. BenchMake uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull (the smallest convex set that contains all existing data instances) and partitions a required fraction of matched data instances into a testing set that maximises divergence and statistical significance, across tabular, graph, image, signal and textual modalities. BenchMake splits are compared to establish splits and random splits using ten publicly available benchmark sets from different areas of science, with different sizes, shapes, distributions.
[AI-59] Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment
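[Quick read]: This paper aims to address the challenge of training generative foundation models across distributed time-series data in Electronic Health Records (EHR) while protecting patient privacy. The key to its solution is the Federated Timeline Synthesis (FTS) framework, which represents patient history as tokenized, language-agnostic Patient Health Timelines (PHTs). Each institution trains an autoregressive transformer on its local PHTs and uploads only model weights to a central server. The server uses these models to generate a large corpus of trajectories and train a Global Generator (GG), enabling zero-shot inference via Monte Carlo simulation.
Link: https://arxiv.org/abs/2506.23358
Authors: Pawel Renc,Michal K. Grzeszczyk,Linglong Qian,Nassim Oufattole,Jeff Rasley,Arkadiusz Sitek
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: conference paper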
Abstract:We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed timeseries data applied to electronic health records (EHR). At its core, FTS represents patient history as tokenized Patient Health Timelines (PHTs), language-agnostic sequences encoding temporal, categorical, and continuous clinical information. Each institution trains an autoregressive transformer on its local PHTs and transmits only model weights to a central server. The server uses the generators to synthesize a large corpus of trajectories and train a Global Generator (GG), enabling zero-shot inference via Monte Carlo simulation of future PHTs. We evaluate FTS on five clinically meaningful prediction tasks using MIMIC-IV data, showing that models trained on synthetic data generated by GG perform comparably to those trained on real data. FTS offers strong privacy guarantees, scalability across institutions, and extensibility to diverse prediction and simulation tasks especially in healthcare, including counterfactual inference, early warning detection, and synthetic trial design.
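The phrase "zero-shot inference via Monte Carlo simulation" can be read as: sample many possible futures from the generator and report the fraction in which an event occurs. The sketch below illustrates that reading only. `toy_generator`, the event names, and the three-step horizon are hypothetical stand-ins for a trained model and are not taken from the paper.

```python
import random


def toy_generator(history, rng):
    """Hypothetical stand-in for a trained generator: appends 3 random future events."""
    events = ["stable", "stable", "stable", "icu_admission"]
    return history + [rng.choice(events) for _ in range(3)]


def estimate_event_probability(generate, history, event, n_samples=1000, seed=0):
    """Estimate P(event occurs in the simulated future) by repeated sampling."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_samples):
        # Keep only the newly generated part of the timeline.
        future = generate(history, rng)[len(history):]
        if event in future:
            hits += 1
    return hits / n_samples


history = ["admission", "lab_test"]
p = estimate_event_probability(toy_generator, history, "icu_admission")
print(f"estimated probability: {p:.3f}")
```

For this toy generator the true value is 1 - 0.75^3, about 0.578, so the printed estimate should land close to that. With a real generator no such closed form exists, which is why sampling is used.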
[AI-60] Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop CVPR-2025
[Quick read]: This paper aims to address the challenge of collaborative multi-arm robots performing fine manipulation tasks in complex physical environments, in particular coordinated manipulation of rigid, deformable, and tactile-sensitive objects. The key to its solution is a dual-arm collaboration challenge spanning simulation and the real world: built on the RoboTwin Simulation platform and the AgileX COBOT-Magic Robot platform, it defines multi-stage competition tasks to advance generalizable bimanual policy learning. Through this platform, researchers can explore efficient and robust dual-arm coordination methods that lay groundwork for more complex autonomous systems.
Link: https://arxiv.org/abs/2506.23351
Authors: Tianxing Chen,Kaixuan Wang,Zhaohui Yang,Yuhao Zhang,Zanxin Chen,Baijun Chen,Wanxi Dong,Ziyuan Liu,Dong Chen,Tianshuo Yang,Haibao Yu,Xiaokang Yang,Yusen Qin,Zhiqiang Xie,Yao Mu,Ping Luo,Tian Nian,Weiliang Deng,Yiheng Ge,Yibin Liu,Zixuan Li,Dehui Wang,Zhixuan Liang,Haohui Xie,Rijie Zeng,Yunfei Ge,Peiqing Cong,Guannan He,Zhaoming Han,Ruocheng Yin,Jingxiang Guo,Lunkai Lin,Tianling Xu,Hongzhe Bi,Xuewu Lin,Tianwei Lin,Shujie Luo,Keyu Li,Ziyan Zhao,Ke Fan,Heyang Xu,Bo Peng,Wenlong Gao,Dongjiang Li,Feng Jin,Hui Shen,Jinming Li,Chaowei Cui,Yuchen,Yaxin Peng,Lingdong Zeng,Wenlong Dong,Tengfei Li,Weijie Ke,Jun Chen,Erdemt Bao,Tian Lan,Tenglong Liu,Jin Yang,Huiping Zhuang,Baozhi Jia,Shuai Zhang,Zhengfeng Zou,Fangheng Guan,Tianyi Jia,Ke Zhou,Hongjiu Zhang,Yating Han,Cheng Fang,Yixian Zou,Chongyang Xu,Qinglun Zhang,Shen Cheng,Xiaohe Wang,Ping Tan,Haoqiang Fan,Shuaicheng Liu,Jiaheng Chen,Chuxuan Huang,Chengliang Lin,Kaijun Luo,Boyu Yue,Yi Liu,Jinyu Chen,Zichang Tan,Liming Deng,Shuo Xu,Zijian Cai,Shilong Yin,Hao Wang,Hongshan Liu,Tianyang Li,Long Shi,Ran Xu,Huilin Xu,Zhengquan Zhang,Congsheng Xu,Jinchang Yang,Feng Xu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Challenge Webpage: this https URL
Abstract:Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To advance this goal, we launched the RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025. Built on the RoboTwin Simulation platform (1.0 and 2.0) and the AgileX COBOT-Magic Robot platform, the competition consisted of three stages: Simulation Round 1, Simulation Round 2, and a final Real-World Round. Participants totally tackled 17 dual-arm manipulation tasks, covering rigid, deformable, and tactile-based scenarios. The challenge attracted 64 global teams and over 400 participants, producing top-performing solutions like SEM and AnchorDP3 and generating valuable insights into generalizable bimanual policy learning. This report outlines the competition setup, task design, evaluation methodology, key findings and future direction, aiming to support future research on robust and generalizable bimanual manipulation policies. The Challenge Webpage is available at this https URL.
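[AI-61] VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design ICSE
[Quick read]: This paper aims to address the problem that, when Large Language Models (LLMs) are applied to molecular design for drug discovery, the generated molecular structures often lack chemical validity and practicality. The key to its solution is the VALID-Mol framework, which systematically integrates methodical prompt engineering, automated chemical validation, and a fine-tuned domain-adapted LLM. It raises the rate of valid chemical structures from 3% to 83% and supports reliable generation of synthesizable molecules with improved properties.
Link: https://arxiv.org/abs/2506.23339
Authors: Malikussaid,Hilal Hudan Nuha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)
Comments: 16 pages, 1 figure, 5 algorithms, 7 tables, to be published in ICSECS Conference 2025, unabridged version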
Abstract:Large Language Models (LLMs) demonstrate remarkable potential for scientific discovery, but their application in domains requiring factual accuracy and domain-specific constraints remains challenging. In molecular design for drug discovery, LLMs can suggest creative molecular modifications but often produce chemically invalid or impractical structures. We present VALID-Mol, a systematic framework for integrating chemical validation with LLM-driven molecular design that increases the rate of generating valid chemical structures from 3% to 83%. Our approach combines methodical prompt engineering, automated chemical validation, and a fine-tuned domain-adapted LLM to ensure reliable generation of synthesizable molecules with improved properties. Beyond the specific implementation, we contribute a generalizable methodology for scientifically-constrained LLM applications, with quantifiable reliability improvements. Computational predictions suggest our framework can generate promising candidates for synthesis with up to 17-fold computationally predicted improvements in target affinity while maintaining synthetic accessibility. We provide a detailed analysis of our prompt engineering process, validation architecture, and fine-tuning approach, offering a reproducible blueprint for applying LLMs to other scientific domains where domain-specific validation is essential.
[AI-62] XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs
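[Quick read]: This paper addresses the difficulty existing speech codecs have in balancing high-quality audio reconstruction against ease of modeling by language models. Conventional codecs usually favor either semantic richness or acoustic fidelity and struggle to satisfy both. The proposed solution is XY-Tokenizer, whose key idea is to mitigate the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning, achieving semantic and acoustic performance comparable to state-of-the-art codecs at similar bitrates.
Link: https://arxiv.org/abs/2506.23325
Authors: Yitian Gong,Luozhijie Jin,Ruifan Deng,Dong Zhang,Xin Zhang,Qinyuan Cheng,Zhaoye Fei,Shimin Li,Xipeng Qiu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: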
Abstract:Speech codecs serve as bridges between speech signals and large language models. An ideal codec for speech language models should not only preserve acoustic information but also capture rich semantic information. However, existing speech codecs struggle to balance high-quality audio reconstruction with ease of modeling by language models. In this study, we analyze the limitations of previous codecs in balancing semantic richness and acoustic fidelity. We propose XY-Tokenizer, a novel codec that mitigates the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning. Experimental results demonstrate that XY-Tokenizer achieves performance in both semantic and acoustic tasks comparable to that of state-of-the-art codecs operating at similar bitrates, even though those existing codecs typically excel in only one aspect. Specifically, XY-Tokenizer achieves strong text alignment, surpassing distillation-based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between reconstructed and original audio. The reconstruction performance of XY-Tokenizer is comparable to that of BigCodec, the current state-of-the-art among acoustic-only codecs, which achieves a speaker similarity score of 0.84 at a similar bitrate. Code and models are available at this https URL.
[AI-63] Interpretable by Design: MH-AutoML for Transparent and Efficient Android Malware Detection without Compromising Performance
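[Quick read]: This paper addresses the lack of transparency, interpretability, and experiment traceability in Automated Machine Learning (AutoML) tools used for Android malware detection. The key to its solution is MH-AutoML, a domain-specific framework for Android malware detection that automates the entire machine learning pipeline and integrates interpretability, debugging, and experiment tracking, improving recall and interpretability while maintaining computational efficiency.
Link: https://arxiv.org/abs/2506.23314
Authors: Joner Assolin,Gabriel Canto,Diego Kreutz,Eduardo Feitosa,Hendrio Bragança,Angelo Nogueira,Vanderson Rocha
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures, 7 tables, paper submitted to JBCS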
Abstract:Malware detection in Android systems requires both cybersecurity expertise and machine learning (ML) techniques. Automated Machine Learning (AutoML) has emerged as an approach to simplify ML development by reducing the need for specialized knowledge. However, current AutoML solutions typically operate as black-box systems with limited transparency, interpretability, and experiment traceability. To address these limitations, we present MH-AutoML, a domain-specific framework for Android malware detection. MH-AutoML automates the entire ML pipeline, including data preprocessing, feature engineering, algorithm selection, and hyperparameter tuning. The framework incorporates capabilities for interpretability, debugging, and experiment tracking that are often missing in general-purpose solutions. In this study, we compare MH-AutoML against seven established AutoML frameworks: Auto-Sklearn, AutoGluon, TPOT, HyperGBM, Auto-PyTorch, LightAutoML, and MLJAR. Results show that MH-AutoML achieves better recall rates while providing more transparency and control. The framework maintains computational efficiency comparable to other solutions, making it suitable for cybersecurity applications where both performance and explainability matter.
[AI-64] GATSim: Urban Mobility Simulation with Generative Agents
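[Quick Read]: This paper addresses the inability of traditional agent-based urban mobility simulations, which rely on rigid rule-based systems, to capture the complexity, adaptability, and behavioral diversity of human travel decision-making. The key to its solution is the introduction of generative agents with reasoning capabilities, persistent memory, and adaptive learning mechanisms, whose mobility decisions are shaped by psychologically informed memory systems, tool usage capabilities, and lifelong learning mechanisms. The paper proposes a comprehensive architecture that combines an urban mobility foundation model with agent cognitive systems, and shows through systematic validation that generative agents produce believable travel behaviors.
Link: https://arxiv.org/abs/2506.23306
Authors: Qi Liu,Can Li,Wanjing Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: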
Abstract:Traditional agent-based urban mobility simulations rely on rigid rule-based systems that fail to capture the complexity, adaptability, and behavioral diversity characteristic of human travel decision-making. Recent advances in large language models and AI agent technology offer opportunities to create agents with reasoning capabilities, persistent memory, and adaptive learning mechanisms. We propose GATSim (Generative-Agent Transport Simulation), a novel framework that leverages these advances to create generative agents with rich behavioral characteristics for urban mobility simulation. Unlike conventional approaches, GATSim agents possess diverse socioeconomic attributes, individual lifestyles, and evolving preferences that shape their mobility decisions through psychologically-informed memory systems, tool usage capabilities, and lifelong learning mechanisms. The main contributions of this study include: (1) a comprehensive architecture combining an urban mobility foundation model with agent cognitive systems and transport simulation environment, (2) a fully functional prototype implementation, and (3) systematic validation demonstrating that generative agents produce believable travel behaviors. Through designed reflection processes, generative agents in this study can transform specific travel experiences into generalized insights, enabling realistic behavioral adaptation over time with specialized mechanisms for activity planning and real-time reactive behaviors tailored to urban mobility contexts. Experiments show that generative agents perform competitively with human annotators in mobility scenarios while naturally producing macroscopic traffic evolution patterns. The code for the prototype system is shared at this https URL.
[AI-65] Securing AI Systems: A Guide to Known Attacks and Impacts
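[Quick Read]: This paper addresses the specific security threats facing artificial intelligence (AI) embedded in information systems, threats that exploit vulnerabilities unique to AI systems. The key to its solution is to identify eleven types of adversarial attack against predictive and generative AI systems and to explicitly link attack techniques to their impacts (information leakage, system compromise, and resource exhaustion), mapped to the confidentiality, integrity, and availability (CIA) security triad, giving researchers, developers, security practitioners, and policymakers foundational knowledge to recognize AI-specific risks and implement effective defenses.
Link: https://arxiv.org/abs/2506.23296
Authors: Naoto Kiribuchi,Kengo Zenitani,Takayuki Semitsu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 34 pages, 16 figures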
Abstract:Embedded into information systems, artificial intelligence (AI) faces security threats that exploit AI-specific vulnerabilities. This paper provides an accessible overview of adversarial attacks unique to predictive and generative AI systems. We identify eleven major attack types and explicitly link attack techniques to their impacts – including information leakage, system compromise, and resource exhaustion – mapped to the confidentiality, integrity, and availability (CIA) security triad. We aim to equip researchers, developers, security practitioners, and policymakers, even those without specialized AI security expertise, with foundational knowledge to recognize AI-specific risks and implement effective defenses, thereby enhancing the overall security posture of AI systems.
[AI-66] Not All Explanations for Deep Learning Phenomena Are Equally Valuable ICML2025
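[Quick Read]: This paper addresses the isolated study of seemingly counterintuitive phenomena in current deep learning research (such as double descent, grokking, and the lottery ticket hypothesis), which are often analyzed case by case using ad hoc hypotheses. The paper argues that, in many prominent cases, there is little evidence that these phenomena occur in real-world applications, and that such research may not effectively advance the field as a whole. The key to its solution is to stop treating these phenomena as independent puzzles requiring bespoke explanations and instead use them as special settings for testing and refining broader deep learning theories, keeping research progress aligned with the overall pragmatic goals of the field.
Link: https://arxiv.org/abs/2506.23286
Authors: Alan Jeffares,Mihaela van der Schaar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICML 2025 for oral presentation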
Abstract:Developing a better understanding of surprising or counterintuitive phenomena has constituted a significant portion of deep learning research in recent years. These include double descent, grokking, and the lottery ticket hypothesis – among many others. Works in this area often develop ad hoc hypotheses attempting to explain these observed phenomena on an isolated, case-by-case basis. This position paper asserts that, in many prominent cases, there is little evidence to suggest that these phenomena appear in real-world applications and these efforts may be inefficient in driving progress in the broader field. Consequently, we argue against viewing them as isolated puzzles that require bespoke resolutions or explanations. However, despite this, we suggest that deep learning phenomena do still offer research value by providing unique settings in which we can refine our broad explanatory theories of more general deep learning principles. This position is reinforced by analyzing the research outcomes of several prominent examples of these phenomena from the recent literature. We revisit the current norms in the research community in approaching these problems and propose practical recommendations for future research, aiming to ensure that progress on deep learning phenomena is well aligned with the ultimate pragmatic goal of progress in the broader field of deep learning.
[AI-67] Predicting thinking time in Reasoning models
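[Quick Read]: This paper addresses the user experience problem caused by the unpredictable “thinking time” of generative AI models on complex reasoning tasks. Existing reasoning models go through a long hidden reasoning process before answering, but users cannot know in advance how long it will last; this uncertainty can frustrate users and is likely to worsen as large language models (LLMs) handle longer tasks. The paper proposes methods for online and offline prediction of model “thinking time,” aiming to build a practical “progress bar for reasoning.” The key is to use effective time prediction to improve users' insight into and control over the model's reasoning process.
Link: https://arxiv.org/abs/2506.23274
Authors: Hans Peter Lynsgøe Raaschou-jensen,Constanza Fierro,Anders Søgaard
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: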
Abstract:Reasoning models that produce long, hidden chains of thought have emerged as powerful tools for complex, reasoning-intensive tasks \citep{deepseekai2025deepseekr1incentivizingreasoningcapability, openai2024openaio1card}. However, this paradigm introduces a new user experience challenge: users have little insight into how much time the model will spend reasoning before returning an answer. This unpredictability can lead to user frustration and is likely to compound as LLMs can produce increasingly long tasks asynchronously \citep{kwa2025measuringaiabilitycomplete}. In this paper, we introduce and evaluate methods for both online and offline prediction of model “thinking time,” aiming to develop a practical “progress bar for reasoning.” We discuss the implications for user interaction and future research directions.
[AI-68] FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis
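[Quick Read]: This paper addresses the challenges of text-to-SQL (text2sql) in the financial domain, particularly for complex and domain-specific queries. Its key solution is FinStat2SQL, a lightweight text2sql pipeline that combines large and small language models in a multi-agent architecture for entity extraction, SQL generation, and self-correction, and is tailored to local standards such as VAS, improving the accuracy and efficiency of natural language queries over financial statements.
Link: https://arxiv.org/abs/2506.23273
Authors: Quang Hung Nguyen,Phuong Anh Trinh,Phan Quoc Hung Mai,Tuan Phong Trinh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: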
Abstract:Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain-specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financial statements. Tailored to local standards like VAS, it combines large and small language models in a multi-agent setup for entity extraction, SQL generation, and self-correction. We build a domain-specific database and evaluate models on a synthetic QA dataset. A fine-tuned 7B model achieves 61.33% accuracy with sub-4-second response times on consumer hardware, outperforming GPT-4o-mini. FinStat2SQL offers a scalable, cost-efficient solution for financial analysis, making AI-powered querying accessible to Vietnamese enterprises.
[AI-69] From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows
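[Quick Read]: This paper addresses the problem that, in the large language model (LLM) agent ecosystem, the explosive growth of plugins, connectors, and inter-agent protocols has outpaced discovery mechanisms and security practices, producing brittle integrations exposed to diverse threats. The key to its solution is the first unified, end-to-end threat model covering host-to-tool and agent-to-agent communications, which formalizes adversary capabilities and objectives and catalogs over thirty attack techniques. The paper also evaluates the attack scenarios and identifies key open challenges and future research directions, such as securing Model Context Protocol (MCP) deployments through dynamic trust management and cryptographic provenance tracking, and designing and hardening Agentic Web Interfaces.
Link: https://arxiv.org/abs/2506.23260
Authors: Mohamed Amine Ferrag,Norbert Tihanyi,Djallel Hamouda,Leandros Maglaras,Merouane Debbah
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 29 pages, 15 figures, 6 tables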
Abstract:Autonomous AI agents powered by large language models (LLMs) with structured function-calling interfaces have dramatically expanded capabilities for real-time data retrieval, complex computation, and multi-step orchestration. Yet, the explosive proliferation of plugins, connectors, and inter-agent protocols has outpaced discovery mechanisms and security practices, resulting in brittle integrations vulnerable to diverse threats. In this survey, we introduce the first unified, end-to-end threat model for LLM-agent ecosystems, spanning host-to-tool and agent-to-agent communications, formalize adversary capabilities and attacker objectives, and catalog over thirty attack techniques. Specifically, we organized the threat model into four domains: Input Manipulation (e.g., prompt injections, long-context hijacks, multimodal adversarial inputs), Model Compromise (e.g., prompt- and parameter-level backdoors, composite and encrypted multi-backdoors, poisoning strategies), System and Privacy Attacks (e.g., speculative side-channels, membership inference, retrieval poisoning, social-engineering simulations), and Protocol Vulnerabilities (e.g., exploits in Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent Network Protocol (ANP), and Agent-to-Agent (A2A) protocol). For each category, we review representative scenarios, assess real-world feasibility, and evaluate existing defenses. Building on our threat taxonomy, we identify key open challenges and future research directions, such as securing MCP deployments through dynamic trust management and cryptographic provenance tracking; designing and hardening Agentic Web Interfaces; and achieving resilience in multi-agent and federated environments. Our work provides a comprehensive reference to guide the design of robust defense mechanisms and establish best practices for resilient LLM-agent workflows.
[AI-70] FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model
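[Quick Read]: This paper addresses insufficient AI model performance in federated learning scenarios and the difficulty of meeting users' diverse needs. The key to its solution is reference model-based federated learning: it introduces a reference model that incorporates previous model parameters and builds on a Bayesian parameter-efficient transfer learning framework with an optimal proximal term, overcoming catastrophic forgetting in each training round and balancing high model performance with low computing cost.
Link: https://arxiv.org/abs/2506.23210
Authors: Taehwan Yoon,Bongjun Choi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 14 equations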
Abstract:Federated learning (FL) is used for distributed scenarios to train artificial intelligence (AI) models while ensuring users’ privacy. In a federated learning scenario, the server generally never knows about users’ data. This type of concept makes the AI training process efficient in terms of data privacy. However, regarding model performance, federated AI models may not sufficiently satisfy AI users’ expectations. Furthermore, AI users have a wide range of different needs. It is not easy to satisfy all users’ needs. These types of issues can be addressed through AI model optimization, fine-tuning, or personalization to achieve optimal model performance. To address model optimization challenges, we propose reference model-based federated learning for optimal fine-tuning, which overcomes catastrophic forgetting in each round. This method is derived from Bayesian parameter-efficient transfer learning, which includes an optimal proximal term and enables overcoming the catastrophic forgetting issue in each round by utilizing a reference model that incorporates previous model parameters. As a result, this method achieves both high model performance and low computing cost.
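The abstract does not give the form of its proximal term, so the following is only a generic sketch of how a proximal penalty toward a reference model can enter a local update. It is not the FedRef objective. The function name `proximal_step`, the quadratic penalty, and the values of `lr` and `mu` are all assumptions made for illustration.

```python
def proximal_step(weights, grads, reference, lr=0.1, mu=0.5):
    """One gradient step on loss(w) + (mu / 2) * ||w - reference||^2.

    `grads` is the gradient of the task loss at `weights`; the second
    term is the gradient of the (assumed) quadratic proximal penalty.
    """
    return [
        w - lr * (g + mu * (w - r))
        for w, g, r in zip(weights, grads, reference)
    ]


# Hypothetical values: with a zero task gradient, the step only pulls
# the local weights toward the reference model.
local = [1.0, -2.0]
reference = [0.0, 0.0]
updated = proximal_step(local, [0.0, 0.0], reference)
print(updated)
```

A larger `mu` keeps the local model closer to the reference, which is the usual intuition for why such a term can limit forgetting between rounds.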
[AI-71] Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data
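[Quick Read]: This paper addresses the unpredictable quality of current generative wireless synthetic data and the resulting unstable gains in task performance. The key to its solution is a pair of tractable and generalizable metrics, affinity and diversity, that quantify quality attributes of synthetic data, together with SynCheck, a quality-guided synthetic data utilization scheme that refines synthetic data quality during task model training and thereby improves overall performance.
Link: https://arxiv.org/abs/2506.23174
Authors: Chen Gong,Bo Liang,Wei Gao,Chenren Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in MobiSys 2025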
Abstract:Generative models have gained significant attention for their ability to produce realistic synthetic data that supplements the quantity of real-world datasets. While recent studies show performance improvements in wireless sensing tasks by incorporating all synthetic data into training sets, the quality of synthetic data remains unpredictable and the resulting performance gains are not guaranteed. To address this gap, we propose tractable and generalizable metrics to quantify quality attributes of synthetic data - affinity and diversity. Our assessment reveals prevalent affinity limitation in current wireless synthetic data, leading to mislabeled data and degraded task performance. We attribute the quality limitation to generative models’ lack of awareness of untrained conditions and domain-specific processing. To mitigate these issues, we introduce SynCheck, a quality-guided synthetic data utilization scheme that refines synthetic data quality during task model training. Our evaluation demonstrates that SynCheck consistently outperforms quality-oblivious utilization of synthetic data, and achieves 4.3% performance improvement even when the previous utilization degrades performance by 13.4%.
[AI-72] Rises for Measuring Local Distributivity in Lattices
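[Quick Read]: This paper addresses the lack of a standardized measure for quantifying the distributivity of lattices in Formal Concept Analysis (FCA). The key to its solution is the notion of “rises,” used to assess a lattice's distributivity. Rises capture how the number of attributes or objects changes between covering concepts, so non-unit rises in the lattice structure can be detected and used to decide whether the lattice is distributive. The paper proves that a lattice is distributive if and only if no non-unit rises occur, and further relates rises to classical meet- and join-distributivity.
Link: https://arxiv.org/abs/2506.23168
Authors: Mohammad Abdulla,Tobias Hille,Dominik Dürrschnabel,Gerd Stumme
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Combinatorics (math.CO); Rings and Algebras (math.RA)
Comments: 16 pages, 2 tables, 5 figures, International Joint Conference on Conceptual Knowledge Structures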
Abstract:Distributivity is a well-established and extensively studied notion in lattice theory. In the context of data analysis, particularly within Formal Concept Analysis (FCA), lattices are often observed to exhibit a high degree of distributivity. However, no standardized measure exists to quantify this property. In this paper, we introduce the notion of rises in (concept) lattices as a means to assess distributivity. Rises capture how the number of attributes or objects in covering concepts changes within the concept lattice. We show that a lattice is distributive if and only if no non-unit rises occur. Furthermore, we relate rises to the classical notions of meet- and join-distributivity. We observe that concept lattices from real-world data are to a high degree join-distributive, but much less meet-distributive. We additionally study how join-distributivity manifests on the level of ordered sets.
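For readers who want a concrete picture of the property being measured, here is a minimal sketch that checks the classical distributive law by brute force on two small lattices. It does not implement the paper's "rises"; the helper `is_distributive` and both example lattices are chosen here purely for illustration.

```python
from itertools import product
from math import gcd


def is_distributive(elements, meet, join):
    """Check x ∧ (y ∨ z) == (x ∧ y) ∨ (x ∧ z) for every triple."""
    return all(
        meet(x, join(y, z)) == join(meet(x, y), meet(x, z))
        for x, y, z in product(elements, repeat=3)
    )


def lcm(a, b):
    return a * b // gcd(a, b)


# Divisors of 12 ordered by divisibility: meet is gcd, join is lcm.
divisors_of_12 = [1, 2, 3, 4, 6, 12]

# Diamond lattice M3: bottom "0", top "1", and three incomparable atoms.
m3 = ["0", "a", "b", "c", "1"]


def m3_meet(x, y):
    if x == y:
        return x
    if x == "1":
        return y
    if y == "1":
        return x
    return "0"  # two distinct atoms, or anything with the bottom


def m3_join(x, y):
    if x == y:
        return x
    if x == "0":
        return y
    if y == "0":
        return x
    return "1"  # two distinct atoms, or anything with the top


print(is_distributive(divisors_of_12, gcd, lcm))  # True
print(is_distributive(m3, m3_meet, m3_join))      # False
```

In M3 the triple (a, b, c) already breaks the law: a ∧ (b ∨ c) is a, while (a ∧ b) ∨ (a ∧ c) is the bottom element.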
[AI-73] Mode Collapse Happens: Evaluating Critical Interactions in Joint Trajectory Prediction Models
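[Quick Read]: This paper addresses mode collapse in the multimodal prediction models used by autonomous driving systems, where a model predicts only the most likely trajectory and ignores other plausible ones, creating safety risks. The key to its solution is a new evaluation framework that introduces metrics for mode collapse, mode correctness, and coverage, focuses on mode collapse in joint trajectory prediction, and emphasizes the sequential dimension of predictions, giving a more complete measure of model behavior in safety-critical interactions.
Link: https://arxiv.org/abs/2506.23164
Authors: Maarten Hugenholtz,Anna Meszaros,Jens Kober,Zlatan Ajanovic
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, submitted to a journal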
Abstract:Autonomous Vehicle decisions rely on multimodal prediction models that account for multiple route options and the inherent uncertainty in human behavior. However, models can suffer from mode collapse, where only the most likely mode is predicted, posing significant safety risks. While existing methods employ various strategies to generate diverse predictions, they often overlook the diversity in interaction modes among agents. Additionally, traditional metrics for evaluating prediction models are dataset-dependent and do not evaluate inter-agent interactions quantitatively. To our knowledge, none of the existing metrics explicitly evaluates mode collapse. In this paper, we propose a novel evaluation framework that assesses mode collapse in joint trajectory predictions, focusing on safety-critical interactions. We introduce metrics for mode collapse, mode correctness, and coverage, emphasizing the sequential dimension of predictions. By testing four multi-agent trajectory prediction models, we demonstrate that mode collapse indeed happens. When looking at the sequential dimension, although prediction accuracy improves closer to interaction events, there are still cases where the models are unable to predict the correct interaction mode, even just before the interaction mode becomes inevitable. We hope that our framework can help researchers gain new insights and advance the development of more consistent and accurate prediction models, thus enhancing the safety of autonomous driving systems.
[AI-74] Context-Driven Knowledge Graph Completion with Semantic-Aware Relational Message Passing
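[Quick Read]: This paper aims to solve the problem that, in Knowledge Graph Completion (KGC), traditional node-based message passing introduces noise and causes information dilution or over-smoothing when aggregating information from neighboring edges. The key to its solution is a semantic-aware relational message passing framework whose core innovation is a semantic-aware Top-K neighbor selection strategy: it evaluates the semantic relevance between a central node and its incident edges in a shared latent space, selects only the Top-K most relevant edges, and uses a multi-head attention aggregator to fuse their information with the central node's representation, producing a semantically focused node message and improving link prediction accuracy.
Link: https://arxiv.org/abs/2506.23141
Authors: Siyuan Li,Ruitong Liu,Yan Wen,Te Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: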
Abstract:Semantic context surrounding a triplet (h, r, t) is crucial for Knowledge Graph Completion (KGC), providing vital cues for prediction. However, traditional node-based message passing mechanisms, when applied to knowledge graphs, often introduce noise and suffer from information dilution or over-smoothing by indiscriminately aggregating information from all neighboring edges. To address this challenge, we propose a semantic-aware relational message passing. A core innovation of this framework is the introduction of a semantic-aware Top-K neighbor selection strategy. Specifically, this strategy first evaluates the semantic relevance between a central node and its incident edges within a shared latent space, selecting only the Top-K most pertinent ones. Subsequently, information from these selected edges is effectively fused with the central node’s own representation using a multi-head attention aggregator to generate a semantically focused node message. In this manner, our model not only leverages the structure and features of edges within the knowledge graph but also more accurately captures and propagates the contextual information most relevant to the specific link prediction task, thereby effectively mitigating interference from irrelevant information. Extensive experiments demonstrate that our method achieves superior performance compared to existing approaches on several established benchmarks.
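The select-then-aggregate idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's model: it assumes plain dot-product relevance and a single softmax-weighted pooling step instead of a learned multi-head attention aggregator, and the function name `top_k_message` and the toy embeddings are made up for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_message(node_vec, edge_vecs, k):
    """Keep the k edges most relevant to the node, then attention-pool them."""
    # Score every incident edge against the central node (assumed: dot product).
    ranked = sorted(
        ((dot(node_vec, e), i) for i, e in enumerate(edge_vecs)),
        reverse=True,
    )[:k]
    kept = [i for _, i in ranked]
    weights = softmax([s for s, _ in ranked])
    # Weighted sum of the selected edge embeddings forms the node message.
    message = [
        sum(w * edge_vecs[i][d] for w, i in zip(weights, kept))
        for d in range(len(node_vec))
    ]
    return kept, message


# Toy embeddings in a shared 2-D latent space (hypothetical values).
node = [1.0, 0.0]
edges = [[0.9, 0.1], [-1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
kept, message = top_k_message(node, edges, k=2)
print(kept, message)
```

With these values the edges at indices 0 and 2 are kept and the two low-relevance edges never contribute to the message, which is the noise-filtering effect the abstract describes.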
[AI-75] Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons
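[Quick Read]: This paper examines how well Large Language Models (LLMs) perform on deep relational reasoning tasks, in particular their logical deduction and relational inference abilities on family tree and general graph reasoning tasks. The key to its approach is a suite of carefully designed benchmark tasks for evaluating and comparing the reasoning capabilities of three advanced LLMs (DeepSeek-R1, DeepSeek-V3, and GPT-4o), together with an analysis of model behavior on complex problems that reveals their reasoning mechanisms and limitations.
Link: https://arxiv.org/abs/2506.23128
Authors: Chi Chiu So,Yueyue Sun,Jun-Min Wang,Siu Pang Yung,Anthony Wai Keung Loh,Chun Pong Chau
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 0 figures, accepted by 2025 IEEE international conference on artificial intelligence testing (AITest)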
Abstract:How far are Large Language Models (LLMs) in performing deep relational reasoning? In this paper, we evaluate and compare the reasoning capabilities of three cutting-edge LLMs, namely, DeepSeek-R1, DeepSeek-V3 and GPT-4o, through a suite of carefully designed benchmark tasks in family tree and general graph reasoning. Our experiments reveal that DeepSeek-R1 consistently achieves the highest F1-scores across multiple tasks and problem sizes, demonstrating strong aptitude in logical deduction and relational inference. However, all evaluated models, including DeepSeek-R1, struggle significantly as problem complexity increases, largely due to token length limitations and incomplete output structures. A detailed analysis of DeepSeek-R1’s long Chain-of-Thought responses uncovers its unique planning and verification strategies, but also highlights instances of incoherent or incomplete reasoning, calling attention to the need for deeper scrutiny into LLMs’ internal inference dynamics. We further discuss key directions for future work, including the role of multimodal reasoning and the systematic examination of reasoning failures. Our findings provide both empirical insights and theoretical implications for advancing LLMs’ reasoning abilities, particularly in tasks that demand structured, multi-step logical inference. Our code repository will be publicly available at this https URL.
[AI-76] The Societal Impact of Foundation Models: Advancing Evidence-based AI Policy
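[Quick Read]: This paper addresses the complex question of how technology and society coevolve in the age of AI, in particular the poor understanding of foundation models and the harms they may cause. The key to its solution is to build the scientific foundations and the research-policy interface: through conceptual framing, empirical insights, and the transition from understanding to action, it deepens understanding of the societal impact of foundation models and so advances evidence-based AI policy for better societal outcomes.
Link: https://arxiv.org/abs/2506.23123
Authors: Rishi Bommasani
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: Stanford University PhD Dissertation of Rishi Bommasani (Department of Computer Science, 2025). Also available at this https URL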
Abstract:Artificial intelligence is humanity’s most promising technology because of the remarkable capabilities offered by foundation models. Yet, the same technology brings confusion and consternation: foundation models are poorly understood and they may precipitate a wide array of harms. This dissertation explains how technology and society coevolve in the age of AI, organized around three themes. First, the conceptual framing: the capabilities, risks, and the supply chain that grounds foundation models in the broader economy. Second, the empirical insights that enrich the conceptual foundations: transparency created via evaluations at the model level and indexes at the organization level. Finally, the transition from understanding to action: superior understanding of the societal impact of foundation models advances evidence-based AI policy. Viewed together, this dissertation makes inroads into achieving better societal outcomes in the age of AI by building the scientific foundations and research-policy interface required for better AI governance.
[AI-77] Can Large Language Models Capture Human Risk Preferences? A Cross-Cultural Study
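[Quick Read]: This paper addresses the reliability of Large Language Models (LLMs) in simulating human risky decision-making, particularly in complex decision scenarios. The core of the study is to compare LLM-generated decisions with actual human responses in lottery-based tasks, assess how accurately the models capture risk preferences, and examine how language and cultural factors affect model performance.
Link: https://arxiv.org/abs/2506.23107
Authors: Bing Song,Jianing Liu,Sisi Jian,Chenyang Wu,Vinayak Dixit
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 1 figure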
Abstract:Large language models (LLMs) have made significant strides, extending their applications to dialogue systems, automated content creation, and domain-specific advisory tasks. However, as their use grows, concerns have emerged regarding their reliability in simulating complex decision-making behavior, such as risky decision-making, where a single choice can lead to multiple outcomes. This study investigates the ability of LLMs to simulate risky decision-making scenarios. We compare model-generated decisions with actual human responses in a series of lottery-based tasks, using transportation stated preference survey data from participants in Sydney, Dhaka, Hong Kong, and Nanjing. Demographic inputs were provided to two LLMs – ChatGPT 4o and ChatGPT o1-mini – which were tasked with predicting individual choices. Risk preferences were analyzed using the Constant Relative Risk Aversion (CRRA) framework. Results show that both models exhibit more risk-averse behavior than human participants, with o1-mini aligning more closely with observed human decisions. Further analysis of multilingual data from Nanjing and Hong Kong indicates that model predictions in Chinese deviate more from actual responses compared to English, suggesting that prompt language may influence simulation performance. These findings highlight both the promise and the current limitations of LLMs in replicating human-like risk behavior, particularly in linguistic and cultural settings.
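Since the abstract names the CRRA framework without spelling it out, here is a minimal sketch of the textbook CRRA utility, u(x) = x^(1-r) / (1-r) for r ≠ 1 and ln(x) for r = 1, applied to a two-outcome lottery. The lottery payoffs and the helper names are hypothetical, and the paper's own estimation procedure is not reproduced here.

```python
import math


def crra_utility(x, r):
    """CRRA utility of a positive payoff x with risk-aversion coefficient r."""
    if x <= 0:
        raise ValueError("payoff must be positive")
    if math.isclose(r, 1.0):
        return math.log(x)  # limiting case of the power form as r -> 1
    return x ** (1.0 - r) / (1.0 - r)


def certainty_equivalent(lottery, r):
    """Sure payoff whose utility equals the lottery's expected utility."""
    expected = sum(p * crra_utility(x, r) for p, x in lottery)
    if math.isclose(r, 1.0):
        return math.exp(expected)
    return ((1.0 - r) * expected) ** (1.0 / (1.0 - r))


# Hypothetical 50/50 lottery over payoffs of 50 and 150.
lottery = [(0.5, 50.0), (0.5, 150.0)]
for r in (0.0, 1.0, 2.0):
    print(r, round(certainty_equivalent(lottery, r), 2))
```

The certainty equivalent falls from 100 at r = 0 (risk neutral) to about 86.6 at r = 1 and 75 at r = 2, so a larger fitted r corresponds to the more risk-averse behavior the study reports for the models.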
[AI-78] TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure
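[Quick Read]: This paper addresses how to create multi-track electronic music with strong structural coherence in deep music generation; the core challenge is to model the hierarchy of musical concepts and to organize musical elements across time and space. The key to its solution is the TOMI (Transforming and Organizing Music Ideas) framework, which models the multi-track composition process in a sparse, four-dimensional space of clips, sections, tracks, and transformations, enabling music ideas to be generated, transformed, and organized into multi-track music with full-song structure.
Link: https://arxiv.org/abs/2506.23094
Authors: Qi He,Gus Xia,Ziyu Wang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 4 figures, 2 tables. To be published in ISMIR 2025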
Abstract:Hierarchical planning is a powerful approach to model long sequences structurally. Aside from considering hierarchies in the temporal structure of music, this paper explores an even more important aspect: concept hierarchy, which involves generating music ideas, transforming them, and ultimately organizing them–across musical time and space–into a complete composition. To this end, we introduce TOMI (Transforming and Organizing Music Ideas) as a novel approach in deep music generation and develop a TOMI-based model via instruction-tuned foundation LLM. Formally, we represent a multi-track composition process via a sparse, four-dimensional space characterized by clips (short audio or MIDI segments), sections (temporal positions), tracks (instrument layers), and transformations (elaboration methods). Our model is capable of generating multi-track electronic music with full-song structure, and we further integrate the TOMI-based model with the REAPER digital audio workstation, enabling interactive human-AI co-creation. Experimental results demonstrate that our approach produces higher-quality electronic music with stronger structural coherence compared to baselines.
[AI-79] Enhancing Live Broadcast Engagement: A Multi-modal Approach to Short Video Recommendations Using MMGCN and User Preferences
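[Quick Read]: This paper aims to solve insufficient user engagement on live broadcast platforms by building a short video recommendation system based on Multi-modal Graph Convolutional Networks (MMGCN) to improve personalized recommendation. The key to its solution is to fuse user interaction data, video content features, and contextual information, and to use a hybrid approach that combines collaborative filtering with content-based filtering to capture the complex relationships among users, video attributes, and engagement patterns, yielding more accurate personalized recommendations.
Link: https://arxiv.org/abs/2506.23085
Authors: Saeid Aghasoleymani Najafabadi
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: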
Abstract:The purpose of this paper is to explore a multi-modal approach to enhancing live broadcast engagement by developing a short video recommendation system that incorporates Multi-modal Graph Convolutional Networks (MMGCN) with user preferences. In order to provide personalized recommendations tailored to individual interests, the proposed system takes into account user interaction data, video content features, and contextual information. With the aid of a hybrid approach combining collaborative filtering and content-based filtering techniques, the system is able to capture nuanced relationships between users, video attributes, and engagement patterns. Three datasets are used to evaluate the effectiveness of the system: Kwai, TikTok, and MovieLens. Compared to baseline models, such as DeepFM, Wide & Deep, LightGBM, and XGBoost, the proposed MMGCN-based model shows superior performance. A notable feature of the proposed model is that it outperforms all baseline methods in capturing diverse user preferences and making accurate, personalized recommendations, resulting in a Kwai F1 score of 0.574, a TikTok F1 score of 0.506, and a MovieLens F1 score of 0.197. We emphasize the importance of multi-modal integration and user-centric approaches in advancing recommender systems, emphasizing the role they play in enhancing content discovery and audience interaction on live broadcast platforms.
[AI-80] AI's Euclid's Elements Moment: From Language Models to Computable Thought
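[Quick Read]: This paper addresses the systematic understanding of the development path of artificial intelligence (AI), aiming to reveal the internal logic and stage-wise character of AI's evolution. Its proposed solution is a five-stage evolutionary framework, the “Geometry of Cognition,” which likens AI's development to the historical evolution of human cognitive technologies and covers every shift from expert systems to the Transformer architecture. The key point is that AI's evolution is not linear but reflexive, with a feedback mechanism: the tools and insights AI produces at each stage in turn reshape its own architecture. The framework explains past shifts and also offers a theoretical foundation and practical guidance for building future AI.
Link: https://arxiv.org/abs/2506.23080
Authors: Xinmin Fang,Lingfeng Tao,Zhengxiong Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: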
Abstract:This paper presents a comprehensive five-stage evolutionary framework for understanding the development of artificial intelligence, arguing that its trajectory mirrors the historical progression of human cognitive technologies. We posit that AI is advancing through distinct epochs, each defined by a revolutionary shift in its capacity for representation and reasoning, analogous to the inventions of cuneiform, the alphabet, grammar and logic, mathematical calculus, and formal logical systems. This “Geometry of Cognition” framework moves beyond mere metaphor to provide a systematic, cross-disciplinary model that not only explains AI’s past architectural shifts-from expert systems to Transformers-but also charts a concrete and prescriptive path forward. Crucially, we demonstrate that this evolution is not merely linear but reflexive: as AI advances through these stages, the tools and insights it develops create a feedback loop that fundamentally reshapes its own underlying architecture. We are currently transitioning into a “Metalinguistic Moment,” characterized by the emergence of self-reflective capabilities like Chain-of-Thought prompting and Constitutional AI. The subsequent stages, the “Mathematical Symbolism Moment” and the “Formal Logic System Moment,” will be defined by the development of a computable calculus of thought, likely through neuro-symbolic architectures and program synthesis, culminating in provably aligned and reliable AI that reconstructs its own foundational representations. This work serves as the methodological capstone to our trilogy, which previously explored the economic drivers (“why”) and cognitive nature (“what”) of AI. Here, we address the “how,” providing a theoretical foundation for future research and offering concrete, actionable strategies for startups and developers aiming to build the next generation of intelligent systems.
[AI-81] Curious Causality-Seeking Agents Learn Meta Causal World
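[Quick Read]: This paper addresses the problem that, when building a world model, the causal mechanisms in the environment may change with subtle shifts in policy or environment state, which can prevent the model from capturing the true causal structure. The key to the solution is the Meta-Causal Graph, a minimal unified representation that efficiently encodes the rules by which causal structure changes across latent world states. A Meta-Causal Graph consists of multiple causal subgraphs, each triggered by a meta state. A causality-seeking agent that identifies meta states, discovers causal relationships, and continually refines the graph structure uses this representation to model and update the world model dynamically.
Link: https://arxiv.org/abs/2506.23068
Authors: Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 33 pages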
Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton’s laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the Meta-Causal Graph as world models, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by a meta state in the latent state space. Building on this representation, we introduce a Causality-Seeking Agent whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships through an agent curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.
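[AI-82] Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis
[Quick Read]: This paper addresses how accurately large language models (LLMs) internally represent the concepts that shape human thought and behavior, that is, how to assess concept alignment between LLMs and human psychological dimensions. The key to the solution is a quantitative framework built on 43 standardized psychological questionnaires: pairwise similarity analysis evaluates how accurately language models reconstruct and classify questionnaire items, and hierarchical clustering compares the resulting cluster structure with the original categorical labels, quantifying the alignment between LLMs and human psychological constructs.
Link: https://arxiv.org/abs/2506.23055
Authors: Hiro Taiyo Hamada, Ippei Fujisawa, Genji Kawakita, Yuki Yamada
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: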
Abstract:Large Language Models (LLMs) such as ChatGPT have shown remarkable abilities in producing human-like text. However, it is unclear how accurately these models internalize concepts that shape human thought and behavior. Here, we developed a quantitative framework to assess concept alignment between LLMs and human psychological dimensions using 43 standardized psychological questionnaires, selected for their established validity in measuring distinct psychological constructs. Our method evaluates how accurately language models reconstruct and classify questionnaire items through pairwise similarity analysis. We compared resulting cluster structures with the original categorical labels using hierarchical clustering. A GPT-4 model achieved superior classification accuracy (66.2%), significantly outperforming GPT-3.5 (55.9%) and BERT (48.1%), all exceeding random baseline performance (31.9%). We also demonstrated that the estimated semantic similarity from GPT-4 is associated with Pearson’s correlation coefficients of human responses in multiple psychological questionnaires. This framework provides a novel approach to evaluate the alignment of the human-LLM concept and identify potential representational biases. Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy, offering insights for developing more interpretable AI systems.
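The pipeline in this abstract (pairwise similarity, then hierarchical clustering, then comparison with the questionnaire labels) can be sketched in a few lines. This is only an illustration under stated assumptions: the 2-D item vectors and the scale names are invented, average-linkage clustering on cosine similarity is one common choice rather than the paper's confirmed setup, and cluster purity stands in for the paper's classification accuracy.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def agglomerative(vectors, k):
    """Average-linkage clustering on cosine similarity until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        def avg_sim(pair):
            a, b = pair
            sims = [cosine(vectors[i], vectors[j])
                    for i in clusters[a] for j in clusters[b]]
            return sum(sims) / len(sims)
        # combinations() yields a < b, so deleting index b keeps index a valid.
        a, b = max(combinations(range(len(clusters)), 2), key=avg_sim)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    correct = 0
    for members in clusters:
        counts = {}
        for i in members:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        correct += max(counts.values())
    return correct / len(labels)

# Hypothetical embeddings for four questionnaire items from two scales.
item_vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
item_labels = ["anxiety", "anxiety", "extraversion", "extraversion"]
clusters = agglomerative(item_vectors, k=2)
score = purity(clusters, item_labels)
```

In a real study the vectors would come from a language model's item embeddings or similarity judgments, and the score would be compared against a random baseline, as the abstract does.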
[AI-83] Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models
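[Quick Read]: This paper aims to address the inference-efficiency challenge of large language models (LLMs), in particular the memory bottleneck that arises because the compute of modern GPU architectures keeps improving while memory bandwidth and capacity have not scaled in step. The key to the solution is ternary language models (TriLMs), which use quantization-aware training to sharply reduce memory requirements, combined with new 2-bit and 1.6-bit weight-packing schemes and a GPU kernel, TriRun, to speed up inference across CPU and GPU architectures.
Link: https://arxiv.org/abs/2506.23025
Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: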
Abstract:Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Also, building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
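To make the "2-bit" and "1.6-bit" figures concrete, here is a minimal sketch of two generic ways to pack ternary weights into bytes. It is not the paper's packing layout, which the abstract does not describe. It assumes weights in {-1, 0, +1} and shows the arithmetic behind the numbers: four weights per byte gives 2 bits each, and five weights per byte (possible because 3^5 = 243 fits in 256) gives 8/5 = 1.6 bits each. All function names are hypothetical.

```python
def pack_2bit(weights):
    """Pack ternary weights (-1, 0, +1) four per byte, 2 bits each."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)      # map -1, 0, +1 to 0, 1, 2
        out.append(byte)
    return bytes(out)

def unpack_2bit(data, n):
    """Recover the first n ternary weights from 2-bit packed bytes."""
    return [((data[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]

def pack_base3(weights):
    """Pack five ternary weights per byte as a base-3 number (max value 242)."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        byte = 0
        for w in reversed(weights[i:i + 5]):
            byte = byte * 3 + (w + 1)       # first weight ends up as lowest digit
        out.append(byte)
    return bytes(out)

def unpack_base3(data, n):
    """Recover the first n ternary weights from base-3 packed bytes."""
    weights = []
    for byte in data:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:n]

weights = [-1, 0, 1, 1, 0, -1, -1, 1, 0, 0] * 2    # 20 ternary weights
packed2, packed3 = pack_2bit(weights), pack_base3(weights)
print(len(packed2), len(packed3))                  # 5 bytes vs 4 bytes
```

The base-3 form is denser but needs division and modulo to decode, whereas the 2-bit form decodes with shifts and masks, which is one plausible reason a GPU kernel would build on the 2-bit layout.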
[AI-84] BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs COLT2025
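[Quick Read]: This paper addresses the limited precision of physics-informed neural networks (PINNs) when solving partial differential equations (PDEs). The key solution is the Barycentric Weight Layer (BWLer), which models the PDE solution through barycentric polynomial interpolation and thereby separates the representation of the solution from the computation of derivatives for the PDE loss. Experiments show that BWLer substantially improves precision and reveals a tradeoff between achievable accuracy and the conditioning of the PDE loss.
Link: https://arxiv.org/abs/2506.23024
Authors: Jerry Liu, Yasa Baig, Denise Hui Jean Lee, Rajat Vadiraj Dwaraknath, Atri Rudra, Chris Ré
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: Workshop for the Theory of AI for Scientific Computing @ COLT 2025 (Best Paper). 39 pages, 24 figures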
Abstract:Physics-informed neural networks (PINNs) offer a flexible way to solve partial differential equations (PDEs) with machine learning, yet they still fall well short of the machine-precision accuracy many scientific tasks demand. In this work, we investigate whether the precision ceiling comes from the ill-conditioning of the PDEs or from the typical multi-layer perceptron (MLP) architecture. We introduce the Barycentric Weight Layer (BWLer), which models the PDE solution through barycentric polynomial interpolation. A BWLer can be added on top of an existing MLP (a BWLer-hat) or replace it completely (explicit BWLer), cleanly separating how we represent the solution from how we take derivatives for the PDE loss. Using BWLer, we identify fundamental precision limitations within the MLP: on a simple 1-D interpolation task, even MLPs with O(1e5) parameters stall around 1e-8 RMSE – about eight orders above float64 machine precision – before any PDE terms are added. In PDE learning, adding a BWLer lifts this ceiling and exposes a tradeoff between achievable accuracy and the conditioning of the PDE loss. For linear PDEs we fully characterize this tradeoff with an explicit error decomposition and navigate it during training with spectral derivatives and preconditioning. Across five benchmark PDEs, adding a BWLer on top of an MLP improves RMSE by up to 30x for convection, 10x for reaction, and 1800x for wave equations while remaining compatible with first-order optimizers. Replacing the MLP entirely lets an explicit BWLer reach near-machine-precision on convection, reaction, and wave problems (up to 10 billion times better than prior results) and match the performance of standard PINNs on stiff Burgers’ and irregular-geometry Poisson problems. Together, these findings point to a practical path for combining the flexibility of PINNs with the precision of classical spectral solvers.
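For readers unfamiliar with barycentric polynomial interpolation, the sketch below shows the classical formula on Chebyshev points of the second kind. It illustrates only the interpolation step that the abstract names, not the BWLer layer itself. The choice of nodes, the test function `exp`, and all function names are assumptions made for the example.

```python
import math

def chebyshev_nodes(n):
    """Chebyshev points of the second kind on [-1, 1]: x_j = cos(j*pi/n)."""
    return [math.cos(j * math.pi / n) for j in range(n + 1)]

def barycentric_weights(n):
    """Closed-form weights for these nodes: (-1)^j, halved at both endpoints."""
    return [(-1) ** j * (0.5 if j in (0, n) else 1.0) for j in range(n + 1)]

def barycentric_eval(x, nodes, weights, values):
    """Evaluate p(x) = sum(w_j f_j / (x - x_j)) / sum(w_j / (x - x_j))."""
    num = den = 0.0
    for xj, wj, fj in zip(nodes, weights, values):
        if x == xj:
            return fj                   # exactly on a node: avoid dividing by zero
        t = wj / (x - xj)
        num += t * fj
        den += t
    return num / den

n = 20
nodes = chebyshev_nodes(n)
weights = barycentric_weights(n)
values = [math.exp(x) for x in nodes]   # function samples at the nodes
approx = barycentric_eval(0.3, nodes, weights, values)
error = abs(approx - math.exp(0.3))
```

With only 21 node values the interpolant of a smooth function is already accurate to near float64 rounding, which gives some intuition for why a polynomial representation can reach precision levels that the abstract reports MLPs stalling short of.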
[AI-85] Scenario-Based Hierarchical Reinforcement Learning for Automated Driving Decision Making
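[Quick Read]: This paper aims to address the difficulty of developing decision-making algorithms for highly automated driving systems, especially safety and generalization in open and complex environments. Existing reinforcement learning (RL) methods perform well on simple driving tasks but generalize poorly and learn inefficiently on more complex ones. The proposed solution is Scenario-based Automated Driving Reinforcement Learning (SAD-RL), whose key idea is to combine a scenario-based environment with a hierarchical policy (hierarchical reinforcement learning, HRL): a high-level policy selects maneuver templates, and a low-level control logic evaluates and executes them, improving adaptability in complex scenarios and training efficiency.
Link: https://arxiv.org/abs/2506.23023
Authors: M. Youssef Abdelhamid, Lennart Vater, Zlatan Ajanovic
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 10 figures, submitted to a conference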
Abstract: Developing decision-making algorithms for highly automated driving systems remains challenging, since these systems have to operate safely in open and complex environments. Reinforcement Learning (RL) approaches can learn comprehensive decision policies directly from experience and already show promising results in simple driving tasks. However, current approaches fail to achieve generalizability for more complex driving tasks and lack learning efficiency. Therefore, we present Scenario-based Automated Driving Reinforcement Learning (SAD-RL), the first framework that integrates Reinforcement Learning (RL) of hierarchical policy in a scenario-based environment. A high-level policy selects maneuver templates that are evaluated and executed by a low-level control logic. The scenario-based environment makes it possible to control the training experience for the agent and to explicitly introduce challenging, but rare situations into the training process. Our experiments show that an agent trained using the SAD-RL framework can achieve safe behaviour in easy as well as challenging situations efficiently. Our ablation studies confirmed that both HRL and scenario diversity are essential for achieving these results.
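The division of labor described here, a high-level policy proposing maneuver templates and a low-level logic evaluating and executing them, can be pictured with a toy sketch. Everything in it is hypothetical: the state fields, the three maneuver names, the Q-values, and the 2-second headway rule are invented for illustration and are not taken from SAD-RL.

```python
from dataclasses import dataclass

@dataclass
class State:
    speed: float            # ego speed in m/s
    gap_ahead: float        # distance to the lead vehicle in m
    left_lane_free: bool    # whether the adjacent left lane is clear

MANEUVERS = ("keep_lane", "change_left", "brake")

def high_level_policy(q_values):
    """Rank maneuver templates by (hypothetical) learned Q-values, best first."""
    return sorted(MANEUVERS, key=lambda m: q_values[m], reverse=True)

def is_feasible(state, maneuver):
    """Low-level check: reject templates that are unsafe in the current state."""
    if maneuver == "change_left":
        return state.left_lane_free
    if maneuver == "keep_lane":
        return state.gap_ahead > 2.0 * state.speed   # assumed 2-second headway
    return True                                      # braking always allowed here

def act(state, q_values):
    """Execute the best-ranked maneuver that passes the low-level check."""
    for maneuver in high_level_policy(q_values):
        if is_feasible(state, maneuver):
            return maneuver
    return "brake"

q = {"keep_lane": 1.0, "change_left": 0.8, "brake": 0.1}
blocked = State(speed=20.0, gap_ahead=30.0, left_lane_free=False)
open_left = State(speed=20.0, gap_ahead=30.0, left_lane_free=True)
print(act(blocked, q), act(open_left, q))   # brake change_left
```

Keeping the safety check outside the learned policy means the RL agent only has to learn which template to prefer, which is one way such a split could improve learning efficiency.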
[AI-86] Generating Privacy Stories From Software Documentation
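[Quick Read]: This paper addresses the identification of privacy behaviors and the generation of privacy requirements during software development. Current approaches mainly extract legal requirements from regulations and assess compliance, neglecting proactive identification of privacy behaviors before and during development. The key to the solution is to use chain-of-thought (CoT) prompting, in-context learning (ICL), and large language models (LLMs) to extract privacy behaviors from software documentation and generate privacy requirements expressed as user stories. Experiments show that widely used LLMs such as GPT-4o and Llama 3 achieve high F1 scores on this task and that parameter tuning can improve performance further.
Link: https://arxiv.org/abs/2506.23014
Authors: Wilder Baldwin, Shashank Chintakuntla, Shreyah Parajuli, Ali Pourghasemi, Ryan Shanz, Sepideh Ghanavati
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to RENext!'25 at the 33rd IEEE International Requirements Engineering 2025 conference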
Abstract:Research shows that analysts and developers consider privacy as a security concept or as an afterthought, which may lead to non-compliance and violation of users’ privacy. Most current approaches, however, focus on extracting legal requirements from the regulations and evaluating the compliance of software and processes with them. In this paper, we develop a novel approach based on chain-of-thought prompting (CoT), in-context-learning (ICL), and Large Language Models (LLMs) to extract privacy behaviors from various software documents prior to and during software development, and then generate privacy requirements in the format of user stories. Our results show that most commonly used LLMs, such as GPT-4o and Llama 3, can identify privacy behaviors and generate privacy user stories with F1 scores exceeding 0.8. We also show that the performance of these models could be improved through parameter-tuning. Our findings provide insight into using and optimizing LLMs for generating privacy requirements given software documents created prior to or throughout the software development lifecycle.
[AI-87] Against softmaxing culture
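[Quick Read]: This paper addresses the problem of cultural homogenization in evaluations of culture in large AI systems: when handling language and culture, generative AI tends to flatten rich linguistic differences into generic expressions, a phenomenon the author calls “softmaxing culture.” The paper argues that current machine learning (ML) and human-computer interaction (HCI) approaches to cultural evaluation are limited and proposes two key shifts: first, moving the starting point of evaluation from “what is culture?” to “when is culture?”; second, not merely describing cultural universals but situating them in relation to their particulars. These conceptual shifts aim to move evaluation beyond technical requirements and toward the complexities of culture.
Link: https://arxiv.org/abs/2506.22968
Authors: Daniel Mwesigwa
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 7 pages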
Abstract:AI is flattening culture. Evaluations of “culture” are showing the myriad ways in which large AI models are homogenizing language and culture, averaging out rich linguistic differences into generic expressions. I call this phenomenon “softmaxing culture,” and it is one of the fundamental challenges facing AI evaluations today. Efforts to improve and strengthen evaluations of culture are central to the project of cultural alignment in large AI systems. This position paper argues that machine learning (ML) and human-computer interaction (HCI) approaches to evaluation are limited. I propose two key shifts. First, instead of asking “what is culture?” at the start of system evaluations, I propose beginning with the question: “when is culture?” Second, while I acknowledge the philosophical claim that cultural universals exist, the challenge is not simply to describe them, but to situate them in relation to their particulars. Taken together, these conceptual shifts invite evaluation approaches that move beyond technical requirements, toward perspectives more responsive to the complexities of culture.
[AI-88] A Study on Semi-Supervised Detection of DDoS Attacks under Class Imbalance CEC
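[Quick Read]: This paper addresses the difficulty of automating Distributed Denial of Service (DDoS) attack detection caused by class imbalance and a lack of sufficient labeled samples. The key to the solution is semi-supervised learning (SSL), which uses partially labeled and unlabeled data to improve detection performance; the paper evaluates the effectiveness and limitations of 13 state-of-the-art SSL algorithms across several scenarios.
Link: https://arxiv.org/abs/2506.22949
Authors: Ehsan Hallaji, Vaishnavi Shanmugam, Roozbeh Razavi-Far, Mehrdad Saif
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in IEEE CCECE 2025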
Abstract:One of the most difficult challenges in cybersecurity is eliminating Distributed Denial of Service (DDoS) attacks. Automating this task using artificial intelligence is a complex process due to the inherent class imbalance and lack of sufficient labeled samples of real-world datasets. This research investigates the use of Semi-Supervised Learning (SSL) techniques to improve DDoS attack detection when data is imbalanced and partially labeled. In this process, 13 state-of-the-art SSL algorithms are evaluated for detecting DDoS attacks in several scenarios. We evaluate their practical efficacy and shortcomings, including the extent to which they work in extreme environments. The results will offer insight into designing intelligent Intrusion Detection Systems (IDSs) that are robust against class imbalance and handle partially labeled data.
[AI-89] Positioning AI Tools to Support Online Harm Reduction Practice: Applications and Design Directions
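[Quick Read]: This paper addresses the difficulty that People Who Use Drugs (PWUD) face in obtaining accurate and actionable harm reduction information, a need that existing online channels fail to meet because of limitations in adaptability, accessibility, and the impact of stigma. The key to the solution is to use responsibly designed large language models (LLMs) to enhance information provision while addressing challenges of ethical alignment, contextual understanding, effective communication, and clearly defined operational boundaries, with an emphasis on co-design with experts and PWUD so that LLM systems are effective, safe, and accountable within the harm reduction ecosystem.
Link: https://arxiv.org/abs/2506.22941
Authors: Kaixuan Wang, Jason T. Jacques, Chenxin Diao
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, with appendix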
Abstract:Access to accurate and actionable harm reduction information can directly impact the health outcomes of People Who Use Drugs (PWUD), yet existing online channels often fail to meet their diverse and dynamic needs due to limitations in adaptability, accessibility, and the pervasive impact of stigma. Large Language Models (LLMs) present a novel opportunity to enhance information provision, but their application in such a high-stakes domain is under-explored and presents socio-technical challenges. This paper investigates how LLMs can be responsibly designed to support the information needs of PWUD. Through a qualitative workshop involving diverse stakeholder groups (academics, harm reduction practitioners, and an online community moderator), we explored LLM capabilities, identified potential use cases, and delineated core design considerations. Our findings reveal that while LLMs can address some existing information barriers (e.g., by offering responsive, multilingual, and potentially less stigmatising interactions), their effectiveness is contingent upon overcoming challenges related to ethical alignment with harm reduction principles, nuanced contextual understanding, effective communication, and clearly defined operational boundaries. We articulate design pathways emphasising collaborative co-design with experts and PWUD to develop LLM systems that are helpful, safe, and responsibly governed. This work contributes empirically grounded insights and actionable design considerations for the responsible development of LLMs as supportive tools within the harm reduction ecosystem.
[AI-90] Mathematical Computation on High-dimensional Data via Array Programming and Parallel Acceleration
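[Quick Read]: This paper addresses the computational challenges of applying deep learning to high-dimensional data, especially those caused by the curse of dimensionality. Existing large-scale data tools focus on business-oriented descriptive statistics and lack the mathematical statistics needed for advanced analysis. The proposed solution is a parallel computation architecture based on space completeness. Its key idea is to decompose high-dimensional data into dimension-independent structures for distributed processing, which supports seamless integration of data mining with parallel-optimized machine learning methods and enables scientific computation across diverse data types (such as medical and natural images) in a unified system.
Link: https://arxiv.org/abs/2506.22929
Authors: Chen Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: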
Abstract:While deep learning excels in natural image and language processing, its application to high-dimensional data faces computational challenges due to the dimensionality curse. Current large-scale data tools focus on business-oriented descriptive statistics, lacking mathematical statistics support for advanced analysis. We propose a parallel computation architecture based on space completeness, decomposing high-dimensional data into dimension-independent structures for distributed processing. This framework enables seamless integration of data mining and parallel-optimized machine learning methods, supporting scientific computations across diverse data types like medical and natural images within a unified system.
[AI-91] Improving Rationality in the Reasoning Process of Language Models through Self-playing Game ICML2025
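[Quick Read]: This paper addresses the problem that large language models (LLMs) lack a true understanding of their own reasoning processes. The key to the solution is a self-play mechanism that needs no supervision from humans or superior models, the Critic-Discernment Game (CDG): the prover must correct its errors in response to constructive feedback and maintain the correct answer when faced with misleading critiques, which improves the rationality of the model's reasoning and its self-understanding on mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning tasks.
Link: https://arxiv.org/abs/2506.22920
Authors: Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2025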
Abstract: Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a Critic-Discernment Game (CDG) in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.
[AI-92] Hecto: Modular Sparse Experts for Adaptive and Interpretable Reasoning
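[Quick Read]: This paper addresses two problems of conventional Mixture-of-Experts (MoE) models: experts rely on identical inductive biases, which limits representational diversity, and the static computation pathway is inefficient and hard to interpret across different types of reasoning. The key to the solution is the Hecto architecture, which introduces architectural heterogeneity by combining a GRU expert for temporal reasoning with an FFNN expert for static abstraction under a sparse Top-1 gating mechanism, yielding expert specialization and task adaptation and improving stability, interpretability, and performance across reasoning tasks.
Link: https://arxiv.org/abs/2506.22919
Authors: Sanskar Pandey, Ruhaan Chopra, Saad Murtaza Bhat, Ark Abhyudaya
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: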
Abstract:Mixture-of-Experts (MoE) models enable conditional computation by routing inputs to specialized experts, but these experts rely on identical inductive biases, thus limiting representational diversity. This static computation pathway is inefficient for inputs that require different types of reasoning and limits specialization and interpretability. We propose Hecto, a lightweight MoE architecture that leverages architectural heterogeneity by combining a GRU expert for temporal reasoning and an FFNN expert for static abstraction under a sparse Top-1 gating mechanism. Evaluated on three reasoning benchmarks (AG News, SST-2, HotpotQA) and a regression task (STS-B), Hecto matches or closely trails homogeneous baselines in performance despite receiving isolated input representations, while achieving clear expert specialization, with each expert aligning to distinct reasoning types (temporal vs static). At larger batch sizes, Hecto exhibits improved performance, benefiting from relaxed computational constraints that allow its heterogeneous architecture to optimize more effectively. Ablation results isolate architectural diversity as the source of Hecto’s stability and interpretability across diverse reasoning tasks. Overall, Hecto establishes itself as a new benchmark for conditional computation, offering a principled framework for specialized reasoning in low-resource regimes with its model strength derived from principled specialization.
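Sparse Top-1 gating over heterogeneous experts can be illustrated with a toy router. This is a sketch under loud assumptions, not Hecto's implementation: the "recurrent" expert is an exponential moving average standing in for a GRU, the "feedforward" expert is a ReLU of the mean standing in for an FFNN, and the gate logits are supplied by hand rather than learned. Scaling the output by the gate probability follows common MoE practice and is also an assumption here.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def recurrent_expert(sequence):
    """Stand-in for a GRU: an exponential moving average, so order matters."""
    h = 0.0
    for x in sequence:
        h = 0.5 * h + 0.5 * x
    return h

def feedforward_expert(sequence):
    """Stand-in for an FFNN: a ReLU of the mean, so order does not matter."""
    return max(0.0, sum(sequence) / len(sequence))

EXPERTS = (recurrent_expert, feedforward_expert)

def top1_route(sequence, gate_logits):
    """Run only the expert with the highest gate probability."""
    probs = softmax(gate_logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index] * EXPERTS[index](sequence)

seq = [1.0, 2.0, 3.0]
chosen_a, out_a = top1_route(seq, gate_logits=[2.0, 0.0])   # routes to expert 0
chosen_b, out_b = top1_route(seq, gate_logits=[0.0, 2.0])   # routes to expert 1
```

Because only the selected expert is evaluated, the cost per input stays that of a single expert, while the two experts respond differently to reordering of the same inputs.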
[AI-93] Learning Truthful Mechanisms without Discretization
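[Quick Read]: This paper addresses how to learn truthful, utility-maximizing mechanisms in mechanism design. Existing learning-based approaches usually rely on discretizing the outcome space to guarantee truthfulness, which becomes inefficient as problem size grows. The key to the solution is TEDI (Truthful, Expressive, and Dimension-Insensitive approach), a discretization-free algorithm whose core is to parameterize pricing rules with a Partial GroupMax Network that universally approximates partial convex functions. Combined with new training techniques such as the covariance trick and continuous sampling, it obtains unbiased gradient estimators compatible with first-order optimization and guarantees truthfulness, full expressiveness, and dimension-insensitivity.
Link: https://arxiv.org/abs/2506.22911
Authors: Yunxuan Ma, Siqiang Wang, Zhijian Duan, Yukun Cheng, Xiaotie Deng
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 66 pages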
Abstract: This paper introduces TEDI (Truthful, Expressive, and Dimension-Insensitive approach), a discretization-free algorithm to learn truthful and utility-maximizing mechanisms. Existing learning-based approaches often rely on discretization of outcome spaces to ensure truthfulness, which leads to inefficiency with increasing problem size. To address this limitation, we formalize the concept of pricing rules, defined as functions that map outcomes to prices. Based on this concept, we propose a novel menu mechanism, which can be equivalent to a truthful direct mechanism under specific conditions. The core idea of TEDI lies in its parameterization of pricing rules using Partial GroupMax Network, a new network architecture designed to universally approximate partial convex functions. To learn optimal pricing rules, we develop novel training techniques, including covariance trick and continuous sampling, to derive unbiased gradient estimators compatible with first-order optimization. Theoretical analysis establishes that TEDI guarantees truthfulness, full expressiveness, and dimension-insensitivity. Experimental evaluation in the studied auction setting demonstrates that TEDI achieves strong performance, competitive with or exceeding state-of-the-art methods. This work presents the first approach to learn truthful mechanisms without outcome discretization, thereby enhancing algorithmic efficiency. The proposed concepts, network architecture, and learning techniques might offer potential value and provide new insights for automated mechanism design and differentiable economics.
[AI-94] Missing-Modality-Aware Graph Neural Network for Cancer Classification
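[Quick Read]: This paper aims to address missing modalities in learning from multimodal biological data, where some patients lack all data for certain modalities. Conventional fusion methods exclude patients with missing modalities, impute the missing modalities, or predict directly from partial modalities, but they struggle with diverse missing-modality patterns and with the exponential growth in the number of patterns as modalities are added. The key to the solution is MAGNET (Missing-modality-Aware Graph neural NETwork), which introduces a patient-modality multi-head attention mechanism that fuses lower-dimensional modality embeddings according to their importance and missingness, builds a patient graph whose node features are the fused multimodal embeddings and whose connectivity is determined by modality missingness, and then applies a conventional graph neural network for prediction. The approach adapts to diverse missing patterns with complexity that grows linearly in the number of modalities.
Link: https://arxiv.org/abs/2506.22901
Authors: Sina Tabakhi, Haiping Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
Comments: 15 pages, 7 figures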
Abstract:A key challenge in learning from multimodal biological data is missing modalities, where all data from some modalities are missing for some patients. Current fusion methods address this by excluding patients with missing modalities, imputing missing modalities, or making predictions directly with partial modalities. However, they often struggle with diverse missing-modality patterns and the exponential growth of the number of such patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) for direct prediction with partial modalities, which introduces a patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their importance and missingness. MAGNET’s complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and the connectivity determined by the modality missingness, followed by a conventional graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world instead of artificial missingness, show that MAGNET outperforms the state-of-the-art fusion methods. The data and code are available at this https URL.
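One way to fuse modality embeddings "based on their importance and missingness" is a softmax attention in which missing modalities are masked out. The sketch below shows that idea for a single head and a single patient. It is an assumption-laden illustration, not MAGNET's patient-modality multi-head attention: the attention scores are given by hand rather than learned, and the function name is hypothetical.

```python
import math

def fuse_modalities(embeddings, scores, present):
    """Softmax-weighted average over the modalities that are present.

    embeddings: one equal-length vector per modality
    scores:     one attention logit per modality (hypothetical, hand-set here)
    present:    booleans, False where the modality is missing for this patient
    """
    if not any(present):
        raise ValueError("at least one modality must be present")
    top = max(s for s, p in zip(scores, present) if p)
    # Missing modalities get weight exactly 0, whatever their score is.
    exps = [math.exp(s - top) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    fused = [sum(w * emb[d] for w, emb in zip(weights, embeddings))
             for d in range(dim)]
    return fused, weights

# Three modalities; the third is missing, so its placeholder values are ignored.
embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
fused, weights = fuse_modalities(embeddings, scores=[0.0, 0.0, 10.0],
                                 present=[True, True, False])
print(fused, weights)   # [0.5, 0.5] [0.5, 0.5, 0.0]
```

The cost of this fusion is linear in the number of modalities, and no separate model is needed per missing-modality pattern, which matches the scaling property the abstract highlights.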
[AI-95] Interpretable Time Series Autoregression for Periodicity Quantification
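[Quick Read]: This paper aims to quantify and model periodic and seasonal patterns in time series, with a focus on interpretability and computational efficiency for time-varying and multidimensional data. The key to the solution is a sparse autoregression framework with ℓ0-norm-induced sparsity constraints, solved by mixed-integer optimization (MIO), together with a subspace-pursuit-based decision variable pruning (DVP) strategy that accelerates the optimization and a two-stage optimization scheme for multidimensional time series, making the model scalable to large problems while remaining interpretable.
Link: https://arxiv.org/abs/2506.22895
Authors: Xinyu Chen, Vassilis Digalakis Jr, Lijun Ding, Dingyi Zhuang, Jinhua Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: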
Abstract: Time series autoregression is a classical statistical model for capturing auto-correlations and identifying temporal patterns such as periodicity and seasonality. In this work, we propose a novel sparse autoregression framework from an interpretable machine learning perspective and the model interpretability for periodicity quantification is reinforced by $\ell_0$-norm induced sparsity constraints. On the time-varying time series data, we reformulate the sparse autoregression and convert the involved optimization problem into a mixed-integer optimization (MIO). To accelerate it, we develop a subspace pursuit based decision variable pruning (DVP) strategy to reduce the search space. On the multidimensional time series that involves complicated spatial and temporal dimensions, we propose a spatially- and time-varying sparse autoregression model and resolve the corresponding MIO problem by developing a two-stage optimization scheme. In particular, the proposed scheme makes the model scalable to large problems even with millions of decision variables. Empirically, we conduct extensive experiments to evaluate the proposed models on real-world time series data. First, we demonstrate that the MIO solver can be drastically accelerated through the DVP strategy, while maintaining the same solution quality as a full MIO solver. Applying the time-varying sparse autoregression model to ridesharing trip data, we uncover both daily and weekly periodicities and reveal long-term changes in regularity of human mobility. Second, we demonstrate the spatial patterns of yearly seasonality in climate variable time series such as temperature and precipitation across the past four decades, and our model allows us to discover dynamic climate patterns and identify climate phenomena such as El Niño in sea surface temperature.
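The link between an $\ell_0$ sparsity constraint and periodicity can be seen in the smallest possible case: an autoregression allowed to use exactly one lag. The sketch below brute-forces that case by fitting a least-squares coefficient for each candidate lag and keeping the lag with the lowest residual. It is a simplification for intuition only. The paper handles general sparsity levels with mixed-integer optimization, and the function name and the synthetic weekly series are assumptions.

```python
def best_single_lag(series, max_lag):
    """Sparsity-1 autoregression: for each lag k, solve
    min_w sum_t (x_t - w * x_{t-k})^2 in closed form, then keep the best k."""
    best = None
    target = series[max_lag:]                         # x_t for t >= max_lag
    for k in range(1, max_lag + 1):
        lagged = series[max_lag - k:len(series) - k]  # x_{t-k}, aligned with target
        denom = sum(v * v for v in lagged)
        w = sum(a * b for a, b in zip(target, lagged)) / denom if denom else 0.0
        residual = sum((a - w * b) ** 2 for a, b in zip(target, lagged))
        if best is None or residual < best[2]:
            best = (k, w, residual)
    return best

# Synthetic series with an exact 7-step (weekly) pattern, repeated 8 times.
weekly = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0] * 8
lag, coefficient, residual = best_single_lag(weekly, max_lag=10)
print(lag, coefficient, residual)   # 7 1.0 0.0
```

The selected lag index is directly readable as the dominant period, which is the sense in which sparsity makes the autoregression interpretable. Enumerating lag subsets grows combinatorially with the sparsity level, which motivates the MIO solver and the pruning strategy described in the abstract.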
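[AI-96] Agentic Enterprise: AI-Centric User to User-Centric AI
[Quick Read]: This paper addresses the shortcomings of the current AI-Centric User paradigm for enterprise decision-making, especially its limits in improving decision efficiency and productivity. The key to the solution is a shift to User-Centric AI, with six tenets proposed to make Agents effective in enterprise settings and better serve the needs of enterprise decision-making.
Link: https://arxiv.org/abs/2506.22893
Authors: Arpit Narechania, Alex Endert, Atanu R Sinha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 12 pages, 1 figure, 2 sidebars; Preprint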
Abstract:After a very long winter, the Artificial Intelligence (AI) spring is here. Or, so it seems over the last three years. AI has the potential to impact many areas of human life - personal, social, health, education, professional. In this paper, we take a closer look at the potential of AI for Enterprises, where decision-making plays a crucial and repeated role across functions, tasks, and operations. We consider Agents imbued with AI as means to increase decision-productivity of enterprises. We highlight six tenets for Agentic success in enterprises, by drawing attention to what the current, AI-Centric User paradigm misses, in the face of persistent needs of and usefulness for Enterprise Decision-Making. In underscoring a shift to User-Centric AI, we offer six tenets and promote market mechanisms for platforms, aligning the design of AI and its delivery by Agents to the cause of enterprise users.
[AI-97] Performance Measurements in the AI-Centric Computing Continuum Systems
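[Quick Read]: This paper addresses the inadequacy of performance metrics in Distributed Computing Continuum (DCC) and Internet of Things (IoT) environments: with generative AI and large language models driving demand, traditional metrics no longer fully cover changing computational demands and application requirements. The key to the solution is to revisit and extend traditional metrics, introduce emerging performance dimensions such as sustainability, energy efficiency, and system observability, and provide criteria and considerations for selecting appropriate metrics to support system design optimization and goal alignment.
Link: https://arxiv.org/abs/2506.22884
Authors: Praveen Kumar Donta, Qiyang Zhang, Schahram Dustdar
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Comments: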
Abstract: Over the past eight decades, computing paradigms have shifted from large, centralized systems to compact, distributed architectures, leading to the rise of the Distributed Computing Continuum (DCC). In this model, multiple layers such as cloud, edge, Internet of Things (IoT), and mobile platforms work together to support a wide range of applications. Recently, the emergence of Generative AI and large language models has further intensified the demand for computational resources across this continuum. Although traditional performance metrics have provided a solid foundation, they need to be revisited and expanded to keep pace with changing computational demands and application requirements. Accurate performance measurements benefit both system designers and users by supporting improvements in efficiency and promoting alignment with system goals. In this context, we review commonly used metrics in DCC and IoT environments. We also discuss emerging performance dimensions that address evolving computing needs, such as sustainability, energy efficiency, and system observability. We also outline criteria and considerations for selecting appropriate metrics, aiming to inspire future research and development in this critical area.
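[AI-98] ReasonBridge: Efficient Reasoning Transfer from Closed to Open-Source Language Models
[Quick read]: This paper addresses the significant performance gap between closed-source and open-source large language models (LLMs) on tasks requiring complex reasoning and precise instruction following. The key to its solution is ReasonBridge, a method that efficiently transfers the reasoning capabilities of closed-source models to open-source models through a novel hierarchical knowledge distillation framework. Its core components are a hierarchical distillation process that captures both strategic abstraction and tactical implementation patterns, a sparse reasoning-focused adapter architecture requiring only 0.3% additional trainable parameters, and a test-time compute scaling mechanism that uses guided inference interventions.
Link: https://arxiv.org/abs/2506.22865
Authors: Ziqi Zhong,Xunzhu Tang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: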
Abstract:Recent advancements in Large Language Models (LLMs) have revealed a significant performance gap between closed-source and open-source models, particularly in tasks requiring complex reasoning and precise instruction following. This paper introduces ReasonBridge, a methodology that efficiently transfers reasoning capabilities from powerful closed-source to open-source models through a novel hierarchical knowledge distillation framework. We develop a tailored dataset Reason1K with only 1,000 carefully curated reasoning traces emphasizing difficulty, diversity, and quality. These traces are filtered from across multiple domains using a structured multi-criteria selection algorithm. Our transfer learning approach incorporates: (1) a hierarchical distillation process capturing both strategic abstraction and tactical implementation patterns, (2) a sparse reasoning-focused adapter architecture requiring only 0.3% additional trainable parameters, and (3) a test-time compute scaling mechanism using guided inference interventions. Comprehensive evaluations demonstrate that ReasonBridge improves reasoning capabilities in open-source models by up to 23% on benchmark tasks, significantly narrowing the gap with closed-source models. Notably, the enhanced Qwen2.5-14B outperforms Claude-Sonnet3.5 on MATH500 and matches its performance on competition-level AIME problems. Our methodology generalizes effectively across diverse reasoning domains and model architectures, establishing a sample-efficient approach to reasoning enhancement for instruction following.
[AI-99] Scalable Structure Learning of Bayesian Networks by Learning Algorithm Ensembles
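[Quick read]: This paper addresses the challenge of learning the structure of Bayesian networks (BNs) from data, in particular the unstable learning accuracy of existing divide-and-conquer (D&D) strategies on datasets with a large number of variables. The key to its solution is the structure learning ensemble (SLE), which combines multiple BN structure learning algorithms to achieve consistently high learning accuracy, together with an automatic approach, Auto-SLE, that learns near-optimal SLEs and so avoids the difficulty of designing high-quality SLEs by hand.
Link: https://arxiv.org/abs/2506.22848
Authors: Shengcai Liu,Hui Ou-yang,Zhiyuan Wang,Cheng Chen,Qijun Cai,Yew-Soon Ong,Ke Tang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: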
Abstract:Learning the structure of Bayesian networks (BNs) from data is challenging, especially for datasets involving a large number of variables. The recently proposed divide-and-conquer (D&D) strategies present a promising approach for learning large BNs. However, they still face a main issue of unstable learning accuracy across subproblems. In this work, we introduce the idea of employing structure learning ensemble (SLE), which combines multiple BN structure learning algorithms, to consistently achieve high learning accuracy. We further propose an automatic approach called Auto-SLE for learning near-optimal SLEs, addressing the challenge of manually designing high-quality SLEs. The learned SLE is then integrated into a D&D method. Extensive experiments firmly show the superiority of our method over D&D methods with a single BN structure learning algorithm in learning large BNs, achieving accuracy improvement usually by 30% to 225% on datasets involving 10,000 variables. Furthermore, our method generalizes well to datasets with many more (e.g., 30,000) variables and different network characteristics than those present in the training data for learning the SLE. These results indicate the significant potential of employing (automatic learning of) SLEs for scalable BN structure learning.
[AI-100] Quantum Neural Networks for Wind Energy Forecasting: A Comparative Study of Performance and Scalability with Classical Models
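[Quick read]: This paper investigates the feasibility of applying Quantum Neural Networks (QNNs) to predicting the power output of a wind turbine, in the context of energy-domain tasks such as power demand prediction and system disturbance detection. The key to its solution is to encode data with the Z Feature Map and build six QNN configurations from different ansatz structures, then evaluate their predictive performance and simulation time through cross-validation and tests on an unseen hold-out dataset, thereby assessing how QNNs compare with classical approaches on this task.
Link: https://arxiv.org/abs/2506.22845
Authors: Batuhan Hangun,Oguz Altun,Onder Eyecioglu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: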
Abstract:Quantum Neural Networks (QNNs), a prominent approach in Quantum Machine Learning (QML), are emerging as a powerful alternative to classical machine learning methods. Recent studies have focused on the applicability of QNNs to various tasks, such as time-series forecasting, prediction, and classification, across a wide range of applications, including cybersecurity and medical imaging. With the increased use of smart grids driven by the integration of renewable energy systems, machine learning plays an important role in predicting power demand and detecting system disturbances. This study provides an in-depth investigation of QNNs for predicting the power output of a wind turbine. We assess the predictive performance and simulation time of six QNN configurations that are based on the Z Feature Map for data encoding and varying ansatz structures. Through detailed cross-validation experiments and tests on an unseen hold-out dataset, we experimentally demonstrate that QNNs can achieve predictive performance that is competitive with, and in some cases marginally better than, the benchmarked classical approaches. Our results also reveal the effects of dataset size and circuit complexity on predictive performance and simulation time. We believe our findings will offer valuable insights for researchers in the energy domain who wish to incorporate quantum machine learning into their work.
[AI-101] xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection
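[Quick read]: This paper addresses anomaly detection in multivariate time series, a task for which xLSTM had not previously been explored. The key to its solution is xLSTMAD, an anomaly detection method built on a full encoder-decoder xLSTM architecture: the encoder captures historical context, and the decoder produces anomaly scores through either forecasting or reconstruction. The method is studied with two loss functions, Mean Squared Error (MSE) and Soft Dynamic Time Warping (SoftDTW), which account for local reconstruction fidelity and global sequence alignment, respectively.
Link: https://arxiv.org/abs/2506.22837
Authors: Kamil Faber,Marcin Pietroń,Dominik Żurek,Roberto Corizzo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: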
Abstract:The recently proposed xLSTM is a powerful model that leverages expressive multiplicative gating and residual connections, providing the temporal capacity needed for long-horizon forecasting and representation learning. This architecture has demonstrated success in time series forecasting, lossless compression, and even large-scale language modeling tasks, where its linear memory footprint and fast inference make it a viable alternative to Transformers. Despite its growing popularity, no prior work has explored xLSTM for anomaly detection. In this work, we fill this gap by proposing xLSTMAD, the first anomaly detection method that integrates a full encoder-decoder xLSTM architecture, purpose-built for multivariate time series data. Our encoder processes input sequences to capture historical context, while the decoder is devised in two separate variants of the method. In the forecasting approach (xLSTMAD-F), the decoder iteratively generates forecasted future values, while the reconstruction approach (xLSTMAD-R) reconstructs the input time series from its encoded counterpart. We investigate the performance of two loss functions: Mean Squared Error (MSE), and Soft Dynamic Time Warping (SoftDTW) to consider local reconstruction fidelity and global sequence alignment, respectively. We evaluate our method on the comprehensive TSB-AD-M benchmark, which spans 17 real-world datasets, using state-of-the-art challenging metrics such as VUS-PR. In our results, xLSTM showcases state-of-the-art accuracy, outperforming 23 popular anomaly detection baselines. Our paper is the first work revealing the powerful modeling capabilities of xLSTM for anomaly detection, paving the way for exciting new developments on this subject. Our code is available at: this https URL
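To make the contrast between the two losses concrete, here is a minimal NumPy sketch of soft-DTW for 1-D sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names `soft_min` and `soft_dtw`, the squared-error local cost, and the toy sequences are all hypothetical choices made for this example.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    lowest = values.min()
    return lowest - gamma * np.log(np.sum(np.exp(-(values - lowest) / gamma)))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW between two 1-D sequences (hypothetical helper for illustration)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = (x[:, None] - y[None, :]) ** 2      # squared-error local cost (an assumption)
    acc = np.full((n + 1, m + 1), np.inf)      # accumulated alignment cost
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + soft_min(prev, gamma)
    return acc[n, m]

signal = [0.0, 1.0, 2.0, 1.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0]            # same shape, shifted by one step
flat = [5.0, 5.0, 5.0, 5.0]

near_zero = soft_dtw(signal, delayed, gamma=0.01)   # shift is absorbed by the alignment
large = soft_dtw(signal, flat, gamma=0.01)          # different shape stays expensive
```

A pointwise MSE cannot even be computed between `signal` and `delayed` without truncation or padding, whereas the alignment-based loss treats the delayed copy as nearly identical. Larger `gamma` values smooth the minimum further and can make the value negative.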
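[AI-102] TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations
[Quick read]: This paper addresses the high computational and memory demands of high-dimensional tensor operations in high-performance computing (HPC) and artificial intelligence (AI) workloads, as well as the increased energy consumption that comes with scaling up the number of parallel processing units, especially for sparse data. The key to its solution is the Trilinear Algorithm and the Device Architecture isomorphic to it (TriADA), which brings several innovations: a massively parallel, low-rank algorithm for computing trilinear (3D) discrete orthogonal transformations (3D-DXTs); an outer-product-based GEMM kernel with decoupled streaming active memory that accelerates the 3D-GEMT operation; a fully distributed 3D network of mesh-interconnected processing elements with coordinate-free, data-driven local processing; and an elastic sparse outer-product (ESOP) method that avoids computing and communication operations on zero-valued operands, improving energy efficiency, computational accuracy, and stability.
Link: https://arxiv.org/abs/2506.22818
Authors: Stanislav Sedukhin(1),Yoichi Tomioka(1),Kazuya Matsumoto(1),Yuichi Okuyama(1) ((1) The University of Aizu, Japan)
Affiliations: The University of Aizu, Japan
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
Comments: 19 pages, 5 figures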
Abstract:Multilinear transformations are key in high-performance computing (HPC) and artificial intelligence (AI) workloads, where data is represented as tensors. However, their high computational and memory demands, which grow with dimensionality, often slow down critical tasks. Moreover, scaling computation by enlarging the number of parallel processing units substantially increases energy consumption, limiting widespread adoption, especially for sparse data, which is common in HPC and AI applications. This paper introduces the Trilinear Algorithm and isomorphic to algorithm Device Architecture (TriADA) to address these challenges with the following innovations: (1) a massively parallel, low-rank algorithm for computing a family of trilinear (3D) discrete orthogonal transformations (3D-DXTs), which is a special case of the more general 3-mode matrix-by-tensor multiplication (3D-GEMT); (2) a new outer-product-based GEMM kernel with decoupled streaming active memory, specially designed to accelerate 3D-GEMT operation; (3) an isomorphic to the proposed algorithm, fully distributed 3D network of mesh interconnected processing elements or cells with a coordinate-free, data-driven local processing activity, which is independent of problem size; (4) an elastic sparse outer-product (ESOP) method that avoids unnecessary computing and communication operations with zero-valued operands, thereby enhancing energy efficiency, computational accuracy, and stability. TriADA is capable of performing a variety of trilinear transformations with hypercubic arithmetic complexity in a linear number of time-steps. The massively parallel, scalable, and energy-efficient architecture of TriADA is ideal for accelerating multilinear tensor operations, which are the most demanding parts of AI and HPC workloads.
[AI-103] Offline Reinforcement Learning for Mobility Robustness Optimization
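[Quick read]: This paper addresses the tuning of the Cell Individual Offset parameter in Mobility Robustness Optimisation (MRO), aiming to replace the traditional rule-based approach with offline Reinforcement Learning in order to improve handover performance. The key to its solution is learning the optimal policy from already collected offline datasets without further online exploration, using the sequence-based Decision Transformers and the value-based Conservative Q-Learning for the same target reward as rule-based MRO, which yields up to 7% improvement over it. In addition, offline RL can be trained for different objective functions from the same dataset, offering greater operational flexibility.
Link: https://arxiv.org/abs/2506.22793
Authors: Pegah Alizadeh,Anastasios Giovanidis,Pradeepa Ramachandra,Vasileios Koutsoukis,Osama Arouk
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 7 pages, double column, 4 figures, 6 tables, conference submission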
Abstract:In this work we revisit the Mobility Robustness Optimisation (MRO) algorithm and study the possibility of learning the optimal Cell Individual Offset tuning using offline Reinforcement Learning. Such methods make use of collected offline datasets to learn the optimal policy, without further exploration. We adapt and apply a sequence-based method called Decision Transformers as well as a value-based method called Conservative Q-Learning to learn the optimal policy for the same target reward as the vanilla rule-based MRO. The same input features related to failures, ping-pongs, and other handover issues are used. Evaluation for realistic New Radio networks with 3500 MHz carrier frequency on a traffic mix including diverse user service types and a specific tunable cell-pair shows that offline-RL methods outperform rule-based MRO, offering up to 7% improvement. Furthermore, offline-RL can be trained for diverse objective functions using the same available dataset, thus offering operational flexibility compared to rule-based methods.
[AI-104] WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing WWW INTERSPEECH2025
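[Quick read]: This paper addresses the biased model training and privacy leakage that arise when speech embeddings retain sensitive attributes such as speaker identity, accent, or demographic information. The key to its solution is WavShape, an information-theoretic framework that uses mutual information (MI) estimation based on the Donsker-Varadhan formulation to guide an MI-based encoder, which systematically filters out sensitive attributes while preserving the speech content needed for downstream tasks.
Link: https://arxiv.org/abs/2506.22789
Authors: Oguzhan Baser,Ahmet Ege Tanriverdi,Kaan Kale,Sandeep P. Chinchali,Sriram Vishwanath
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 4 figures, Published at The Proceedings of Interspeech 2025, code is available at this http URL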
Abstract:Speech embeddings often retain sensitive attributes such as speaker identity, accent, or demographic information, posing risks in biased model training and privacy leakage. We propose WavShape, an information-theoretic speech representation learning framework that optimizes embeddings for fairness and privacy while preserving task-relevant information. We leverage mutual information (MI) estimation using the Donsker-Varadhan formulation to guide an MI-based encoder that systematically filters sensitive attributes while maintaining speech content essential for downstream tasks. Experimental results on three known datasets show that WavShape reduces MI between embeddings and sensitive attributes by up to 81% while retaining 97% of task-relevant information. By integrating information theory with self-supervised speech models, this work advances the development of fair, privacy-aware, and resource-efficient speech systems.
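The Donsker-Varadhan bound mentioned above states that I(X; Z) >= E_joint[T(x, z)] - log E_marginals[exp(T(x, z))] for any critic function T. The sketch below evaluates that bound with NumPy on synthetic correlated Gaussians, where the optimal critic is known in closed form. It only illustrates the estimator: in the paper's setting the critic would be a learned network over speech embeddings, and the names `dv_lower_bound` and `gaussian_critic` are hypothetical.

```python
import numpy as np

def dv_lower_bound(critic, x, z, rng):
    """Donsker-Varadhan estimate: mean critic score on paired samples
    minus the log-mean-exp of critic scores on shuffled (unpaired) samples."""
    joint_term = np.mean(critic(x, z))
    shuffled_scores = critic(x, rng.permutation(z))  # shuffling approximates p(x)p(z)
    peak = shuffled_scores.max()                     # stabilises the exponential
    log_mean_exp = peak + np.log(np.mean(np.exp(shuffled_scores - peak)))
    return joint_term - log_mean_exp

rho = 0.8                                            # assumed correlation for the toy data
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(20000)

def gaussian_critic(a, b):
    """Closed-form log density ratio for a bivariate Gaussian with correlation rho."""
    quad = (rho**2 * a**2 - 2 * rho * a * b + rho**2 * b**2) / (2 * (1 - rho**2))
    return -0.5 * np.log(1 - rho**2) - quad

estimate = dv_lower_bound(gaussian_critic, x, z, rng)
true_mi = -0.5 * np.log(1 - rho**2)                  # analytic MI of the toy data, in nats
print(f"DV estimate: {estimate:.3f}  analytic MI: {true_mi:.3f}")
```

An encoder trained to minimise such an estimate with respect to a sensitive attribute, while maximising it for task labels, is one plausible reading of "MI-based encoder"; the exact objective is defined in the paper, not here.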
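[AI-105] Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation
[Quick read]: This paper examines how quantization affects the robustness of Large Language Models (LLMs) in code generation tasks. Conventional wisdom holds that quantization may degrade model behaviour, but the paper's systematic experiments find that quantized LLMs often show higher robustness under adversarial attacks and noise perturbations. The key to its solution is extensive experimentation showing that quantization not only reduces computational requirements but can also improve the reliability of LLMs in code generation, offering insights for building more efficient and robust LLM deployment strategies.
Link: https://arxiv.org/abs/2506.22776
Authors: Sen Fang,Weiyuan Ding,Antonio Mastropaolo,Bowen Xu
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 13 pages, 6 figures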
Abstract:Quantization has emerged as a mainstream method for compressing Large Language Models (LLMs), reducing memory requirements and accelerating inference without architectural modifications. While existing research primarily focuses on evaluating the effectiveness of quantized LLMs compared to their original counterparts, the impact on robustness remains largely unexplored. In this paper, we present the first systematic investigation of how quantization affects the robustness of LLMs in code generation tasks. Through extensive experiments across four prominent LLM families (LLaMA, DeepSeek, CodeGen, and StarCoder) with parameter scales ranging from 350M to 33B, we evaluate robustness from dual perspectives: adversarial attacks on input prompts and noise perturbations on model architecture. Our findings challenge conventional wisdom by demonstrating that quantized LLMs often exhibit superior robustness compared to their full-precision counterparts, with 51.59% versus 42.86% of our adversarial experiments showing better resilience in quantized LLMs. Similarly, our noise perturbation experiments also confirm that LLMs after quantization generally withstand higher levels of weight disturbances. These results suggest that quantization not only reduces computational requirements but can actually enhance LLMs’ reliability in code generation tasks, providing valuable insights for developing more robust and efficient LLM deployment strategies.
[AI-106] Bridging Ethical Principles and Algorithmic Methods: An Alternative Approach for Assessing Trustworthiness in AI Systems
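[Quick read]: This paper addresses the excessive subjectivity of current Trustworthy AI assessments, in which existing self-assessment techniques lack objective, quantitative criteria. The key to its solution is to combine the ethical components of Trustworthy AI with the PageRank and TrustRank algorithms, building an assessment framework that provides quantitative insights while taking the theoretical content of the relevant guidelines into account.
Link: https://arxiv.org/abs/2506.22774
Authors: Michael Papademas,Xenia Ziouvelou,Antonis Troumpoukis,Vangelis Karkaletsis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: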
Abstract:Artificial Intelligence (AI) technology epitomizes the complex challenges posed by human-made artifacts, particularly those that are widely integrated into society and exert significant influence, highlighting potential benefits and their negative consequences. While other technologies may also pose substantial risks, AI’s pervasive reach makes its societal effects especially profound. The complexity of AI systems, coupled with their remarkable capabilities, can lead to a reliance on technologies that operate beyond direct human oversight or understanding. To mitigate the risks that arise, several theoretical tools and guidelines have been developed, alongside efforts to create technological tools aimed at safeguarding Trustworthy AI. The guidelines take a more holistic view of the issue but fail to provide techniques for quantifying trustworthiness. Conversely, while technological tools are better at achieving such quantification, they lack a holistic perspective, focusing instead on specific aspects of Trustworthy AI. This paper aims to introduce an assessment method that combines the ethical components of Trustworthy AI with the algorithmic processes of PageRank and TrustRank. The goal is to establish an assessment framework that minimizes the subjectivity inherent in the self-assessment techniques prevalent in the field by introducing algorithmic criteria. The application of our approach indicates that a holistic assessment of an AI system’s trustworthiness can be achieved by providing quantitative insights while considering the theoretical content of relevant guidelines.
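For readers unfamiliar with the two algorithms named in the abstract, the sketch below shows PageRank as a power iteration and TrustRank as the same iteration with the teleport distribution concentrated on trusted seed nodes. How the paper maps ethical components onto graph nodes is not described here, so the three-node graph and the function name `pagerank` are purely illustrative assumptions.

```python
import numpy as np

def pagerank(adj, damping=0.85, teleport=None, iters=100):
    """Power-iteration PageRank; adj[i][j] > 0 means an edge from node i to node j."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    if teleport is None:
        teleport = np.full(n, 1.0 / n)           # uniform jumps: classic PageRank
    else:
        teleport = np.asarray(teleport, dtype=float)
        teleport = teleport / teleport.sum()     # biased jumps: TrustRank-style seeds
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalise; nodes without out-links follow the teleport distribution instead
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1.0), teleport[None, :])
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = damping * (rank @ trans) + (1 - damping) * teleport
    return rank

# Toy graph (hypothetical): 0 -> 1 -> 2 -> 0
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]

uniform_scores = pagerank(cycle)                      # symmetric graph, equal scores
trust_scores = pagerank(cycle, teleport=[1, 0, 0])    # node 0 is the only trusted seed
```

With a single trusted seed, score decays along the chain away from that seed, which is the property TrustRank relies on to propagate trust.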
[AI-107] FF-INT8: Efficient Forward-Forward DNN Training on Edge Devices with INT8 Precision
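[Quick read]: This paper addresses the time and energy inefficiency of conventional backpropagation, which limits its suitability for resource-constrained edge devices. The key to its solution is an INT8 quantized training approach that leverages the layer-by-layer strategy of the Forward-Forward (FF) algorithm to stabilize gradient quantization. The paper also proposes a "look-ahead" scheme to overcome limitations of FF, speeding up training and reducing energy and memory usage while maintaining competitive accuracy.
Link: https://arxiv.org/abs/2506.22771
Authors: Jingxiao Ma,Priyadarshini Panda,Sherief Reda
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: To be published in the 62nd Design Automation Conference (DAC), 2025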
Abstract:Backpropagation has been the cornerstone of neural network training for decades, yet its inefficiencies in time and energy consumption limit its suitability for resource-constrained edge devices. While low-precision neural network quantization has been extensively researched to speed up model inference, its application in training has been less explored. Recently, the Forward-Forward (FF) algorithm has emerged as a promising alternative to backpropagation, replacing the backward pass with an additional forward pass. By avoiding the need to store intermediate activations for backpropagation, FF can reduce memory footprint, making it well-suited for embedded devices. This paper presents an INT8 quantized training approach that leverages FF’s layer-by-layer strategy to stabilize gradient quantization. Furthermore, we propose a novel “look-ahead” scheme to address limitations of FF and improve model accuracy. Experiments conducted on NVIDIA Jetson Orin Nano board demonstrate 4.6% faster training, 8.3% energy savings, and 27.0% reduction in memory usage, while maintaining competitive accuracy compared to the state-of-the-art.
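As background for why gradient quantization needs stabilizing, the following sketch shows plain symmetric per-tensor INT8 quantization in NumPy. This is a generic scheme assumed for illustration, not the quantizer used in FF-INT8, and the helper names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    x = np.asarray(x, dtype=np.float32)
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from INT8 codes."""
    return q.astype(np.float32) * scale

# Toy "gradient" tensor with one large and one very small entry
grads = np.array([0.02, -0.5, 0.13, 0.0004], dtype=np.float32)
codes, scale = quantize_int8(grads)
restored = dequantize_int8(codes, scale)
max_error = float(np.max(np.abs(restored - grads)))   # bounded by scale / 2
```

Because a single scale is set by the largest magnitude, the smallest entry rounds to zero. Gradients that span many orders of magnitude across a deep backward pass therefore lose information, which is one reason a layer-local training rule can be easier to quantize.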
[AI-108] RAILS: Retrieval-Augmented Intelligence for Learning Software Development
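[Quick read]: This paper addresses the incomplete code and incorrect imports that Large Language Models (LLMs) produce when assisting software development, especially when they lack access to external or project-specific documentation. The key to its solution is the RAILS (Retrieval-Augmented Intelligence for Learning Software Development) framework, which augments LLM prompts with semantically relevant context retrieved from curated Java resources using FAISS and OpenAI embeddings, and adds an iterative validation loop guided by compiler feedback to refine suggestions.
Link: https://arxiv.org/abs/2506.22742
Authors: Wali Mohammad Abdullah,Md. Morshedul Islam,Devraj Parmar,Happy Hasmukhbhai Patel,Sindhuja Prabhakaran,Baidya Saha
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: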
Abstract:Large Language Models (LLMs) like GPT-3.5-Turbo are increasingly used to assist software development, yet they often produce incomplete code or incorrect imports, especially when lacking access to external or project-specific documentation. We introduce RAILS (Retrieval-Augmented Intelligence for Learning Software Development), a framework that augments LLM prompts with semantically retrieved context from curated Java resources using FAISS and OpenAI embeddings. RAILS incorporates an iterative validation loop guided by compiler feedback to refine suggestions. We evaluated RAILS on 78 real-world Java import error cases spanning standard libraries, GUI APIs, external tools, and custom utilities. Despite using the same LLM, RAILS outperforms baseline prompting by preserving intent, avoiding hallucinations, and surfacing correct imports even when libraries are unavailable locally. Future work will integrate symbolic filtering via PostgreSQL and extend support to other languages and IDEs.
[AI-109] Explanations are a means to an end
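[Quick read]: This paper addresses the lack of a clearly stated purpose in current explainable machine learning methods: they mostly describe how a model maps inputs to outputs without adequately considering how the explanations will be used in practice. The key to its solution is to place the design and evaluation of explanations in a framework based on statistical decision theory, formalizing the goal of an explanation by specifying a concrete use case. This functionally grounded approach applies across diverse use cases such as clinical decision support, providing recourse, or debugging, and it contributes definitions that combine theoretical and empirical perspectives on the value of an explanation.
Link: https://arxiv.org/abs/2506.22740
Authors: Jessica Hullman,Ziyang Guo,Berk Ustun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: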
Abstract:Modern methods for explainable machine learning are designed to describe how models map inputs to outputs–without deep consideration of how these explanations will be used in practice. This paper argues that explanations should be designed and evaluated with a specific end in mind. We describe how to formalize this end in a framework based in statistical decision theory. We show how this functionally-grounded approach can be applied across diverse use cases, such as clinical decision support, providing recourse, or debugging. We demonstrate its use to characterize the maximum “boost” in performance on a particular task that an explanation could provide an idealized decision-maker, preventing misuse due to ambiguity by forcing researchers to specify concrete use cases that can be analyzed in light of models of expected explanation use. We argue that evaluation should meld theoretical and empirical perspectives on the value of explanation, and contribute definitions that span these perspectives.
[AI-110] Kill Two Birds with One Stone! Trajectory enabled Unified Online Detection of Adversarial Examples and Backdoor Attacks
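[Quick read]: This paper addresses the problem of defending deep learning models against adversarial examples (AE) and backdoor attacks at the same time, and proposes UniGuard, a unified online detection framework. Its solution rests on two key insights. First, both AE and backdoor attacks must compromise the inference phase, so they can be detected together at run-time. Second, an adversarial input follows a trajectory through the model's layers during forward inference that differs from that of a benign sample. UniGuard treats the propagation trajectory as a time-series signal and uses LSTM and spectrum transformation to amplify differences that are subtle in the time domain, enabling effective detection of these trajectory signatures.
Link: https://arxiv.org/abs/2506.22722
Authors: Anmin Fu,Fanyu Meng,Huaibing Peng,Hua Ma,Zhi Zhang,Yifeng Zheng,Willy Susilo,Yansong Gao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: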
Abstract:The proposed UniGuard is the first unified online detection framework capable of simultaneously addressing adversarial examples and backdoor attacks. UniGuard builds upon two key insights: first, both AE and backdoor attacks have to compromise the inference phase, making it possible to tackle them simultaneously during run-time via online detection. Second, an adversarial input, whether a perturbed sample in AE attacks or a trigger-carrying sample in backdoor attacks, exhibits distinctive trajectory signatures from a benign sample as it propagates through the layers of a DL model in forward inference. The propagation trajectory of the adversarial sample must deviate from that of its benign counterpart; otherwise, the adversarial objective cannot be fulfilled. Detecting these trajectory signatures is inherently challenging due to their subtlety; UniGuard overcomes this by treating the propagation trajectory as a time-series signal, leveraging LSTM and spectrum transformation to amplify differences between adversarial and benign trajectories that are subtle in the time domain. UniGuard's exceptional efficiency and effectiveness have been extensively validated across various modalities (image, text, and audio) and tasks (classification and regression), spanning diverse model architectures and a wide range of AE attacks and backdoor attacks, including challenging partial backdoors and dynamic triggers. When compared to SOTA methods, including ContraNet (NDSS 22) specific for AE detection and TED (IEEE SP 24) specific for backdoor detection, UniGuard consistently demonstrates superior performance, even when matched against each method’s strengths in addressing their respective threats: each SOTA method fails against some attack strategies, while UniGuard succeeds against all.
[AI-111] P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code
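[Quick read]: This paper addresses the automatic transformation of serial C/C++ code into parallel code with correct OpenMP annotations, in particular ensuring that the generated code is syntactically correct and compiles without model fine-tuning or compiler instrumentation. The key to its solution is the P4OMP framework, which uses Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to improve the reliability of prompt-driven code generation. By grounding generation in the retrieved context, P4OMP improves syntactic correctness over the baseline of GPT-3.5-Turbo without retrieval.
Link: https://arxiv.org/abs/2506.22703
Authors: Wali Mohammad Abdullah,Azmain Kabir
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: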
Abstract:We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to improve the reliability of prompt-driven code generation. By grounding generation in the retrieved context, P4OMP improves syntactic correctness compared to baseline prompting with GPT-3.5-Turbo. We evaluate P4OMP against a baseline, GPT-3.5-Turbo without retrieval, on a comprehensive benchmark of 108 real-world C++ programs drawn from Stack Overflow, PolyBench, and NAS benchmark suites. P4OMP achieves 100% compilation success on all parallelizable cases, while the baseline fails to compile in 20 out of 108 cases. Six cases that rely on non-random-access iterators or thread-unsafe constructs are excluded due to fundamental OpenMP limitations. A detailed analysis demonstrates how P4OMP consistently avoids scoping errors, syntactic misuse, and invalid directive combinations that commonly affect baseline-generated code. We further demonstrate strong runtime scaling across seven compute-intensive benchmarks on an HPC cluster. P4OMP offers a robust, modular pipeline that significantly improves the reliability and applicability of LLM-generated OpenMP code.
[AI-112] DistShap: Scalable GNN Explanations with Distributed Shapley Values
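[Quick read]: This paper addresses the high computational cost of explaining Graph Neural Network (GNN) predictions, in particular the challenge of attributing predictions to specific edges or features on large graphs. The key to its solution is DistShap, a parallel algorithm that computes edge importance scores by sampling subgraphs in a distributed setting across multiple GPUs, running GNN inference in parallel, and solving a distributed least squares problem, which substantially improves efficiency and scales to GNN models with millions of features.
Link: https://arxiv.org/abs/2506.22668
Authors: Selahattin Akkas,Aditya Devarakonda,Ariful Azad
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Comments: 12 pages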
Abstract:With the growing adoption of graph neural networks (GNNs), explaining their predictions has become increasingly important. However, attributing predictions to specific edges or features remains computationally expensive. For example, classifying a node with 100 neighbors using a 3-layer GNN may involve identifying important edges from millions of candidates contributing to the prediction. To address this challenge, we propose DistShap, a parallel algorithm that distributes Shapley value-based explanations across multiple GPUs. DistShap operates by sampling subgraphs in a distributed setting, executing GNN inference in parallel across GPUs, and solving a distributed least squares problem to compute edge importance scores. DistShap outperforms most existing GNN explanation methods in accuracy and is the first to scale to GNN models with millions of features by using up to 128 GPUs on the NERSC Perlmutter supercomputer.
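To illustrate what a Shapley-value edge attribution computes, the sketch below enumerates every ordering of three edges and averages each edge's marginal contribution. This exact enumeration only works for tiny inputs and is not DistShap's method, which samples subgraphs and solves a least squares problem. The toy value function `toy_model_output` and its edge weights are hypothetical stand-ins for a GNN prediction.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical stand-in for "model output when only these edges are kept"
EDGE_WEIGHT = {"a": 1.0, "b": 2.0, "c": 3.0}

def toy_model_output(edges):
    bonus = 3.0 if {"a", "b"} <= edges else 0.0   # edges a and b interact
    return sum(EDGE_WEIGHT[e] for e in edges) + bonus

phi = shapley_values(["a", "b", "c"], toy_model_output)
```

The interaction bonus is split evenly between edges `a` and `b`, and the attributions sum to the full model output. The number of orderings grows factorially with the number of edges, which is why sampling and distributed computation are needed at the scale the abstract describes.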
[AI-113] Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision
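[Quick read]: This paper addresses the insufficient attention paid to requirements development in current automation systems for software engineering (SE), which tend to prioritize code development and overlook the complexities of requirements tasks. The key to its solution is a Knowledge-Guided Multi-Agent Framework (KGMAF) composed of six specialized agents and an artifact pool. The framework specifies the functionality, actions, and knowledge of each agent and provides a conceptual design of the artifact pool, with the aim of improving the efficiency and accuracy of requirements development.
Link: https://arxiv.org/abs/2506.22656
Authors: Jiangping Huang,Dongming Jin,Weisong Sun,Yang Liu,Zhi Jin
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: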
Abstract:This paper envisions a knowledge-guided multi-agent framework named KGMAF for automated requirements development. KGMAF aims to address gaps in current automation systems for SE, which prioritize code development and overlook the complexities of requirements tasks. KGMAF is composed of six specialized agents and an artifact pool to improve efficiency and accuracy. Specifically, KGMAF outlines the functionality, actions, and knowledge of each agent and provides the conceptual design of the artifact pool. Our case study highlights the potential of KGMAF in real-world scenarios. Finally, we outline several research opportunities for implementing and enhancing automated requirements development using multi-agent systems. We believe that KGMAF will play a pivotal role in shaping the future of automated requirements development in the era of LLMs.
[AI-114] URSA: The Universal Research and Scientific Agent
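[Quick read]: This paper addresses inefficiencies and bottlenecks in modern scientific research by introducing "agentic" systems based on generative AI to accelerate research tasks. The key to its solution is URSA, a scientific agent ecosystem made up of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to tackle scientific problems of varied complexity and impact.
Link: https://arxiv.org/abs/2506.22653
Authors: Michael Grosskopf,Russell Bent,Rahul Somasundaram,Isaac Michaud,Arthur Lui,Nathan Debardeleben,Earl Lawrence
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages, 9 figures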
Abstract:Large language models (LLMs) have moved far beyond their initial form as simple chatbots, now carrying out complex reasoning, planning, writing, coding, and research tasks. These skills overlap significantly with those that human scientists use day-to-day to solve complex problems that drive the cutting edge of research. Using LLMs in “agentic” AI has the potential to revolutionize modern science and remove bottlenecks to progress. In this work, we present URSA, a scientific agent ecosystem for accelerating research tasks. URSA consists of a set of modular agents and tools, including coupling to advanced physics simulation codes, that can be combined to address scientific problems of varied complexity and impact. This work highlights the architecture of URSA, as well as examples that highlight the potential of the system.
[AI-115] Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
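[Quick read]: This paper asks whether the improved mathematical reasoning that large language models show after post-training with instruction tuning, reinforcement learning, or knowledge distillation comes from major changes in the transformer layers, or from minor adjustments that leave the base model's relative layer importance structure largely unchanged. The key to its solution is a set of systematic layer-wise ablation experiments comparing models from different post-training paradigms on mathematical reasoning benchmarks. These show that mathematical reasoning gives rise to a specific layer importance structure, that the critical layers stay the same across all post-training methods, and that removing them causes accuracy drops of up to 80%.
Link: https://arxiv.org/abs/2506.22638
Authors: Aadim Nepal,Safal Shrestha,Anubhav Shrestha,Minwu Kim,Keith Ross
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: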
Abstract:Large language models can exhibit improved mathematical reasoning capabilities following post-training with instruction tuning, reinforcement learning, or knowledge distillation. However, it remains unclear whether these improvements are driven by major changes in transformer layers or from minor adjustments that leave the relative layer importance structures of the base model largely unchanged. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, and this structure persists across all post-training paradigms. Removal of such layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no critical layers. This distinction suggests that mathematical reasoning requires specialized layers that emerge during pre-training, while other non-reasoning tasks do not. From an information-theoretic perspective, we also observe that these critical layers are the same layers where major representational transformation occurs.
[AI-116] Ludax: A GPU-Accelerated Domain Specific Language for Board Games
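[Quick Read]: This paper addresses the inefficiency of building game environments for AI research, especially reinforcement learning (RL), and the difficulty of generalizing across games. The key to its solution is a domain-specific language (DSL) for board games that compiles automatically into hardware-accelerated code, which speeds up simulation and makes it easier to apply algorithms across games. The framework, Ludax, combines the generality of game description languages with the speed of modern parallel processing hardware and aims to accelerate games research.
Link: https://arxiv.org/abs/2506.22609
Authors: Graham Todd,Alexander G. Padula,Dennis J.N.J. Soemers,Julian Togelius
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures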
Abstract:Games have long been used as benchmarks and testing environments for research in artificial intelligence. A key step in supporting this research was the development of game description languages: frameworks that compile domain-specific code into playable and simulatable game environments, allowing researchers to generalize their algorithms and approaches across multiple games without having to manually implement each one. More recently, progress in reinforcement learning (RL) has been largely driven by advances in hardware acceleration. Libraries like JAX allow practitioners to take full advantage of cutting-edge computing hardware, often speeding up training and testing by orders of magnitude. Here, we present a synthesis of these strands of research: a domain-specific language for board games which automatically compiles into hardware-accelerated code. Our framework, Ludax, combines the generality of game description languages with the speed of modern parallel processing hardware and is designed to fit neatly into existing deep learning pipelines. We envision Ludax as a tool to help accelerate games research generally, from RL to cognitive science, by enabling rapid simulation and providing a flexible representation scheme. We present a detailed breakdown of Ludax’s description language and technical notes on the compilation process, along with speed benchmarking and a demonstration of training RL agents. The Ludax framework, along with implementations of existing board games, is open-source and freely available.
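[AI-117] Bootstrapping Human-Like Planning via LLMs
[Quick Read]: This paper addresses how robot end users can specify robot tasks in an accessible way, and in particular how two common end-user programming paradigms, natural language programming and drag-and-drop interfaces, can be combined. The key to its solution is a large language model (LLM)-based pipeline that accepts natural language as input and outputs human-like action sequences at the level of granularity a human would produce.
Link: https://arxiv.org/abs/2506.22604
Authors: David Porfirio,Vincent Hsiao,Morgan Fine-Morris,Leslie Smith,Laura M. Hiatt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)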
Abstract:Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot’s task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.
[AI-118] The Hidden Link Between RLHF and Contrastive Learning
[Quick Read]: This paper addresses the alignment of large language models (LLMs) with human values, focusing on the limitations of existing methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) during training. The key to its solution is to reinterpret RLHF and DPO as mutual information (MI) maximization, which exposes a deep connection to contrastive learning, and then to replace the Donsker-Varadhan/MINE bound with the Jensen-Shannon MI estimator, yielding a new method called Mutual Information Optimization (MIO). MIO mitigates the late-stage decline in chosen-likelihood seen in DPO and achieves competitive or superior performance on several challenging reasoning and mathematical benchmarks.
Link: https://arxiv.org/abs/2506.22578
Authors: Xufei Lv,Haoyuan Sun,Xuefeng Bai,Min Zhang,Houde Liu,Kehai Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:
Abstract:Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.
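For readers who want the quantities named in the abstract in concrete form, here is a minimal sketch of the standard per-pair DPO loss, the Donsker-Varadhan (DV) lower bound, and the Jensen-Shannon MI estimator, each computed from plain lists of critic scores. The function names and the sample-average form are assumptions for illustration. The paper's actual MIO objective is not reproduced here.

```python
import math
from typing import Sequence

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def softplus(z: float) -> float:
    """log(1 + exp(z)), written via log_sigmoid for stability."""
    return -log_sigmoid(-z)

def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def dv_bound(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """DV lower bound: E_joint[T] - log E_marginals[exp(T)], using sample means."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    log_mean_exp = math.log(sum(math.exp(t) for t in neg_scores) / len(neg_scores))
    return mean_pos - log_mean_exp

def jsd_estimate(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Jensen-Shannon MI estimator: E_joint[-softplus(-T)] - E_marginals[softplus(T)]."""
    pos = sum(-softplus(-t) for t in pos_scores) / len(pos_scores)
    neg = sum(softplus(t) for t in neg_scores) / len(neg_scores)
    return pos - neg

# With no preference margin the DPO loss equals log 2.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

The DV bound takes a log of an average of exponentials over negative samples, whereas the Jensen-Shannon estimator applies a bounded softplus term to each sample separately, which is the structural difference the abstract's substitution relies on.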
[AI-119] Exploration Behavior of Untrained Policies ICML-2025
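[Quick Read]: This paper addresses exploration, a core challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. The key to its approach is to analyze how the architecture of a deep neural network policy implicitly shapes exploration before any training. Using theory and experiments, it shows how to generate ballistic or diffusive trajectories from untrained policies, and that untrained policies produce correlated actions and therefore non-trivial state-visitation distributions. The work provides a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.
Link: https://arxiv.org/abs/2506.22566
Authors: Jacob Adamczyk
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: High-dimensional Learning Dynamics Workshop at ICML-2025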
Abstract:Exploration remains a fundamental challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. In this work, we study how the architecture of deep neural policies implicitly shapes exploration before training. We theoretically and empirically demonstrate strategies for generating ballistic or diffusive trajectories from untrained policies in a toy model. Using the theory of infinite-width networks and a continuous-time limit, we show that untrained policies return correlated actions and result in non-trivial state-visitation distributions. We discuss the distributions of the corresponding trajectories for a standard architecture, revealing insights into inductive biases for tackling exploration. Our results establish a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.
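The distinction between ballistic and diffusive trajectories can be illustrated with a one-dimensional random walk. This is a toy sketch under simplifying assumptions (a walker that either repeats one action forever or redraws it every step); it is not the paper's infinite-width or continuous-time analysis, and `mean_squared_displacement` is a hypothetical helper.

```python
import random

def mean_squared_displacement(n_walkers: int, n_steps: int,
                              correlated: bool, seed: int = 0) -> float:
    """Average squared final position over many 1-D walkers with +/-1 steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walkers):
        x = 0
        step = rng.choice((-1, 1))          # initial action
        for _ in range(n_steps):
            if not correlated:
                step = rng.choice((-1, 1))  # fresh, uncorrelated action each step
            x += step                       # correlated case keeps repeating `step`
        total += x * x
    return total / n_walkers

T = 100
msd_ballistic = mean_squared_displacement(200, T, correlated=True)
msd_diffusive = mean_squared_displacement(200, T, correlated=False)
print(msd_ballistic, msd_diffusive)
```

Fully correlated actions give a squared displacement that grows like T squared (ballistic), while independent actions give growth on the order of T (diffusive), which is why action correlations at initialization matter for state coverage.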
[AI-120] Red Teaming for Generative AI Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
[Quick Read]: This paper addresses the possibility that generative AI training data contains copyrighted content, and the legal and ethical risks that follow. The key to its approach is a red teaming event that tested whether GPT4DFCI could reproduce copyrighted text, with a mitigation strategy developed from the results. The model reproduced exact quotes from famous books in isolated cases but could not reproduce the target news article, scientific article, or electronic health records, and it sometimes fabricated content. A mitigation is deployed in GPT4DFCI v2.8.2 to reduce legal and ethical risk.
Link: https://arxiv.org/abs/2506.22523
Authors: James Wen,Sahil Nalawade,Zhiwei Liang,Catherine Bielick,Marisa Ferrara Boston,Alexander Chowdhury,Adele Collin,Luigi De Angelis,Jacob Ellen,Heather Frase,Rodrigo R. Gameiro,Juan Manuel Gutierrez,Pooja Kadam,Murat Keceli,Srikanth Krishnamurthy,Anne Kwok,Yanan Lance Lu,Heather Mattie,Liam G. McCoy,Katherine Miller,Allison C. Morgan,Marlene Louisa Moerig,Trang Nguyen,Alexander Owen-Post,Alex D. Ruiz,Sreekar Reddy Puchala,Soujanya Samineni,Takeshi Tohyama,Varun Ullanat,Carmine Valenza,Camilo Velez,Pengcheng Wang,Anna Wuest,Yuxiang Zhou,Yingde Zhu,Jason M. Johnson,Jennifer Willcox,Francis J. Vitiello,Leo Anthony G. Celi,Renato Umeton
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Generative AI is present in multiple industries. Dana-Farber Cancer Institute, in partnership with Microsoft, has created an internal AI tool, GPT4DFCI. Together we hosted a red teaming event to assess whether the underlying GPT models that support the tool would output copyrighted data. Our teams focused on reproducing content from books, news articles, scientific articles, and electronic health records. We found isolated instances where GPT4DFCI was able to identify copyrighted material and reproduce exact quotes from famous books, which indicates that copyrighted material was in the training data. The model was not able to reproduce content from our target news article, scientific article, or electronic health records. However, there were instances of fabrication. As a result of this event, a mitigation strategy is in production in GPT4DFCI v2.8.2, deployed on January 21, 2025. We hope this report leads to similar events in which AI software tools are stress-tested to assess the perimeter of their legal and ethical usage.
[AI-121] A Survey on Model Extraction Attacks and Defenses for Large Language Models
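[Quick Read]: This paper addresses model extraction attacks on deployed language models, which can threaten intellectual property and user privacy. The key to its contribution is a comprehensive taxonomy specific to large language models (LLMs) that covers functionality extraction, training data extraction, and prompt-targeted attacks, with a systematic analysis of attack methods including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing. The paper also surveys defenses for model protection, data privacy protection, and prompt-targeted strategies, and proposes specialized metrics for evaluating attack effectiveness and defense performance that account for the specific challenges of generative language models.
Link: https://arxiv.org/abs/2506.22521
Authors: Kaixiang Zhao,Lincan Li,Kaize Ding,Neil Zhenqiang Gong,Yue Zhao,Yushun Dong
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: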
Abstract:Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.
[AI-122] Exploring Artificial Intelligence Tutor Teammate Adaptability to Harness Discovery Curiosity and Promote Learning in the Context of Interactive Molecular Dynamics
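[Quick Read]: This paper asks how an AI tutor teammate can improve students' curiosity-driven engagement and learning effectiveness in Interactive Molecular Dynamics (IMD) tasks. The key to its approach is to use the AI's curiosity-triggering and response behaviors to stimulate and sustain student curiosity, which in turn affects the frequency and complexity of student-initiated questions. Using a Wizard-of-Oz paradigm in which a human experimenter dynamically adjusts the AI's behavior through a large language model, together with a mixed-methods exploratory design, the study examines the AI's effectiveness in fostering teamwork and discovery curiosity and in improving team performance.
Link: https://arxiv.org/abs/2506.22520
Authors: Mustafa Demir,Jacob Miratsky,Jonathan Nguyen,Chun Kit Chan,Punya Mishra,Abhishek Singharoy
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: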
Abstract:This study examines the impact of an Artificial Intelligence tutor teammate (AI) on student curiosity-driven engagement and learning effectiveness during Interactive Molecular Dynamics (IMD) tasks on the Visual Molecular Dynamics platform. It explores the role of the AI’s curiosity-triggering and response behaviors in stimulating and sustaining student curiosity, affecting the frequency and complexity of student-initiated questions. The study further assesses how AI interventions shape student engagement, foster discovery curiosity, and enhance team performance within the IMD learning environment. Using a Wizard-of-Oz paradigm, a human experimenter dynamically adjusts the AI tutor teammate’s behavior through a large language model. By employing a mixed-methods exploratory design, a total of 11 high school students participated in four IMD tasks that involved molecular visualization and calculations, which increased in complexity over a 60-minute period. Team performance was evaluated through real-time observation and recordings, whereas team communication was measured by question complexity and AI’s curiosity-triggering and response behaviors. Cross Recurrence Quantification Analysis (CRQA) metrics reflected structural alignment in coordination and were linked to communication behaviors. High-performing teams exhibited superior task completion, deeper understanding, and increased engagement. Advanced questions were associated with AI curiosity-triggering, indicating heightened engagement and cognitive complexity. CRQA metrics highlighted dynamic synchronization in student-AI interactions, emphasizing structured yet adaptive engagement to promote curiosity. These proof-of-concept findings suggest that the AI’s dual role as a teammate and educator indicates its capacity to provide adaptive feedback, sustaining engagement and epistemic curiosity.
[AI-123] In-context learning for the classification of manipulation techniques in phishing emails
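[Quick Read]: This paper addresses the tendency of traditional phishing detection to overlook psychological manipulation, and proposes using large language model (LLM) in-context learning (ICL) for fine-grained classification of phishing emails. The key to its approach is a taxonomy of 40 manipulation techniques, applied through few-shot examples with GPT-4o-mini and evaluated on real-world French phishing emails from SignalSpam. The approach identifies prevalent techniques such as Baiting, Curiosity Appeal, and Request For Minor Favor, reaching an accuracy of 0.76.
Link: https://arxiv.org/abs/2506.22515
Authors: Antony Dalmiere(LAAS-TRUST, LAAS),Guillaume Auriol(LAAS-TRUST, INSA Toulouse),Vincent Nicomette(LAAS-TSF, LAAS),Pascal Marchand(LERASS)
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: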
Abstract:Traditional phishing detection often overlooks psychological manipulation. This study investigates using Large Language Model (LLM) In-Context Learning (ICL) for fine-grained classification of phishing emails based on a taxonomy of 40 manipulation techniques. Using few-shot examples with GPT-4o-mini on real-world French phishing emails (SignalSpam), we evaluated performance against a human-annotated test set (100 emails). The approach effectively identifies prevalent techniques (e.g., Baiting, Curiosity Appeal, Request For Minor Favor) with a promising accuracy of 0.76. This work demonstrates ICL’s potential for nuanced phishing analysis and provides insights into attacker strategies.
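In-context learning here means the model is shown a handful of labeled examples inside the prompt rather than being fine-tuned. A minimal sketch of assembling such a few-shot prompt follows. The prompt wording, the `build_few_shot_prompt` helper, and the example emails are hypothetical placeholders; only the three label names come from the abstract, and no model API is called.

```python
from typing import List, Sequence, Tuple

def build_few_shot_prompt(labels: Sequence[str],
                          examples: Sequence[Tuple[str, str]],
                          email: str) -> str:
    """Assemble a few-shot classification prompt: instructions, labeled demos, then the query."""
    lines: List[str] = [
        "Classify the manipulation technique used in the email.",
        "Allowed labels: " + ", ".join(labels),
        "",
    ]
    for text, label in examples:
        lines += [f"Email: {text}", f"Technique: {label}", ""]
    # The query is left open-ended so the model completes the label.
    lines += [f"Email: {email}", "Technique:"]
    return "\n".join(lines)

# Label names are taken from the abstract; the email texts are invented placeholders.
labels = ["Baiting", "Curiosity Appeal", "Request For Minor Favor"]
examples = [
    ("You have won a gift card, click to claim it.", "Baiting"),
    ("Could you quickly confirm your address for me?", "Request For Minor Favor"),
]
prompt = build_few_shot_prompt(labels, examples, "You will not believe what we found about your account.")
print(prompt)
```

The returned string would then be sent to whichever chat or completion endpoint is in use, and the text generated after the final "Technique:" is read back as the predicted label.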
[AI-124] Ask before you Build: Rethinking AI-for-Good in Human Trafficking Interventions
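[Quick Read]: This paper addresses the ethical problems that can arise from over-reliance on AI technical solutions in anti-human-trafficking (HT) work, including oversimplified views of exploitation, power imbalances, and potential harm to the communities such tools target. Its proposed solution is the Radical Questioning (RQ) framework, a five-step, pre-project ethical assessment tool that critically examines, before any design work, whether an AI system should be built at all, especially in domains involving marginalized populations and structural injustice. RQ emphasizes deep reflection and power mapping ahead of technical design, steering away from surveillance-based interventions and toward tools that support survivor empowerment.
Link: https://arxiv.org/abs/2506.22512
Authors: Pratheeksha Nair,Gabriel Lefebvre,Sophia Garrel,Maryam Molamohammadi,Reihaneh Rabbany
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: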
Abstract:AI-for-good initiatives often rely on the assumption that technical interventions can resolve complex social problems. In the context of human trafficking (HT), such techno-solutionism risks oversimplifying exploitation, reinforcing power imbalances and causing harm to the very communities AI claims to support. In this paper, we introduce the Radical Questioning (RQ) framework as a five-step, pre-project ethical assessment tool to critically evaluate whether AI should be built at all, especially in domains involving marginalized populations and entrenched systemic injustice. RQ does not replace principles-based ethics but precedes it, offering an upstream, deliberative space to confront assumptions, map power, and consider harms before design. Using a case study in AI for HT, we demonstrate how RQ reveals overlooked sociocultural complexities and guides us away from surveillance-based interventions toward survivor empowerment tools. While developed in the context of HT, RQ’s five-step structure can generalize to other domains, though the specific questions must be contextual. This paper situates RQ within a broader AI ethics philosophy that challenges instrumentalist norms and centers relational, reflexive responsibility.
[AI-125] SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning
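[Quick Read]: This paper addresses backdoor attacks in Federated Prompt Learning, where malicious clients inject visually imperceptible, learnable noise triggers so that the global prompt learner misclassifies triggered inputs into a target class while keeping high accuracy on clean inputs. The key to its solution is SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. It requires no access to raw client data or labels while identifying and filtering malicious clients.
Link: https://arxiv.org/abs/2506.22506
Authors: Momin Ahmad Khan,Yasra Chandio,Fatima Muhammad Anwar
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: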
Abstract:Federated Prompt Learning has emerged as a communication-efficient and privacy-preserving paradigm for adapting large vision-language models like CLIP across decentralized clients. However, the security implications of this setup remain underexplored. In this work, we present the first study of backdoor attacks in Federated Prompt Learning. We show that when malicious clients inject visually imperceptible, learnable noise triggers into input images, the global prompt learner becomes vulnerable to targeted misclassification while still maintaining high accuracy on clean inputs. Motivated by this vulnerability, we propose SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. SABRE-FL requires no access to raw client data or labels and generalizes across diverse datasets. We show, both theoretically and empirically, that malicious clients can be reliably identified and filtered using an embedding-based detector. Across five diverse datasets and four baseline defenses, SABRE-FL outperforms all baselines by significantly reducing backdoor accuracy while preserving clean accuracy, demonstrating strong empirical performance and underscoring the need for robust prompt learning in future federated systems.
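The general shape of embedding-space filtering can be sketched as follows. This is a deliberately simplified stand-in: SABRE-FL uses a detector trained offline, whereas this sketch assumes a fixed distance threshold around the centroid of reference embeddings. `filter_updates`, the threshold value, and all vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a set of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filter_updates(client_embeddings: Dict[str, Sequence[float]],
                   reference_embeddings: Sequence[Sequence[float]],
                   threshold: float) -> List[str]:
    """Keep clients whose embedding lies within `threshold` of the reference centroid."""
    center = centroid(reference_embeddings)
    return [cid for cid, emb in client_embeddings.items()
            if distance(emb, center) <= threshold]

# Hypothetical data: clients "a" and "b" look like the reference set, "c" does not.
reference = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
clients = {"a": [0.1, 0.1], "b": [0.3, 0.0], "c": [3.0, 3.0]}
kept = filter_updates(clients, reference, threshold=1.0)
print(kept)
```

Only the updates from the kept clients would then be aggregated into the global prompt. Note that the server works on embeddings alone, which matches the abstract's point that no raw client data or labels are needed.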
[AI-126] Peer Review as Structured Commentary: Immutable Identity Public Dialogue and Reproducible Scholarship
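[Quick Read]: This paper addresses the anonymity, latency, and gatekeeping that hinder traditional academic validation. The key to its solution is a transparent, identity-linked, and reproducible system of scholarly evaluation built on open commentary, using blockchain for immutable audit trails and AI for iterative synthesis, in order to incentivize intellectual contribution, capture epistemic evolution, and enable traceable reputational dynamics.
Link: https://arxiv.org/abs/2506.22497
Authors: Craig Steven Wright
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Social and Information Networks (cs.SI); History and Philosophy of Physics (physics.hist-ph)
Comments: 66 pages, 0 figures, interdisciplinary framework, includes proposed architecture and metadata layer structures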
Abstract:This paper reconceptualises peer review as structured public commentary. Traditional academic validation is hindered by anonymity, latency, and gatekeeping. We propose a transparent, identity-linked, and reproducible system of scholarly evaluation anchored in open commentary. Leveraging blockchain for immutable audit trails and AI for iterative synthesis, we design a framework that incentivises intellectual contribution, captures epistemic evolution, and enables traceable reputational dynamics. This model empowers fields from computational science to the humanities, reframing academic knowledge as a living process rather than a static credential.
[AI-127] Report on NSF Workshop on Science of Safe AI
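[Quick Read]: This report addresses the scientific challenge of developing AI systems that are not only accurate and performant but also safe and trustworthy, given that today's complex AI models lack transparency and offer no safety guarantees for their predictions. The key to its proposal is a research agenda for developing the theory, methods, and tools that will underpin the next generation of AI-enabled systems.
Link: https://arxiv.org/abs/2506.22492
Authors: Rajeev Alur,Greg Durrett,Hadas Kress-Gazit,Corina Păsăreanu,René Vidal
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: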
Abstract:Recent advances in machine learning, particularly the emergence of foundation models, are leading to new opportunities to develop technology-based solutions to societal problems. However, the reasoning and inner workings of today’s complex AI models are not transparent to the user, and there are no safety guarantees regarding their predictions. Consequently, to fulfill the promise of AI, we must address the following scientific challenge: how to develop AI-based systems that are not only accurate and performant but also safe and trustworthy? The criticality of safe operation is particularly evident for autonomous systems for control and robotics, and was the catalyst for the Safe Learning Enabled Systems (SLES) program at NSF. For the broader class of AI applications, such as users interacting with chatbots and clinicians receiving treatment recommendations, safety is, while no less important, less well-defined with context-dependent interpretations. This motivated the organization of a day-long workshop, held at University of Pennsylvania on February 26, 2025, to bring together investigators funded by the NSF SLES program with a broader pool of researchers studying AI safety. This report is the result of the discussions in the working groups that addressed different aspects of safety at the workshop. The report articulates a new research agenda focused on developing theory, methods, and tools that will provide the foundations of the next generation of AI-enabled systems.
[AI-128] AGI Enabled Solutions For IoX Layers Bottlenecks In Cyber-Physical-Social-Thinking Space
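[Quick Read]: This survey addresses critical bottlenecks across the sensing, network, and application layers of Cyber-Physical-Social Thinking (CPST) ecosystems. The key to its approach is Internet of Everything (IoX) technology enhanced by Artificial General Intelligence (AGI): adaptive sensor fusion, edge preprocessing, and selective attention mechanisms at the sensing layer to relieve data overload; neuro-symbolic reasoning, active inference, and causal reasoning at the network layer to handle protocol heterogeneity and dynamic spectrum management; and decision-making frameworks at the application layer for managing identity and relationship explosion. The survey also stresses cross-layer integration, quantum-enabled communication, and ethical governance for future AGI-enhanced IoX systems.
Link: https://arxiv.org/abs/2506.22487
Authors: Amar Khelloufi,Huansheng Ning,Sahraoui Dhelim,Jianguo Ding
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 31 pages, 5 figures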
Abstract:The integration of the Internet of Everything (IoX) and Artificial General Intelligence (AGI) has given rise to a transformative paradigm aimed at addressing critical bottlenecks across sensing, network, and application layers in Cyber-Physical-Social Thinking (CPST) ecosystems. In this survey, we provide a systematic and comprehensive review of AGI-enhanced IoX research, focusing on three key components: sensing-layer data management, network-layer protocol optimization, and application-layer decision-making frameworks. Specifically, this survey explores how AGI can mitigate IoX bottlenecks by leveraging adaptive sensor fusion, edge preprocessing, and selective attention mechanisms at the sensing layer, while resolving network-layer issues such as protocol heterogeneity and dynamic spectrum management through neuro-symbolic reasoning, active inference, and causal reasoning. Furthermore, the survey examines AGI-enabled frameworks for managing identity and relationship explosion. Key findings suggest that AGI-driven strategies, such as adaptive sensor fusion, edge preprocessing, and semantic modeling, offer novel solutions to sensing-layer data overload, network-layer protocol heterogeneity, and application-layer identity explosion. The survey underscores the importance of cross-layer integration, quantum-enabled communication, and ethical governance frameworks for future AGI-enabled IoX systems. Finally, the survey identifies unresolved challenges, such as computational requirements, scalability, and real-world validation, calling for further research to fully realize AGI’s potential in addressing IoX bottlenecks. We believe AGI-enhanced IoX is emerging as a critical research field at the intersection of interconnected systems and advanced AI.
[AI-129] Innovative Research on IoT Architecture and Robotic Operating Platforms: Applications of Large Language Models and Generative AI
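[Quick Read]: This paper addresses the limited intelligence and autonomy of traditional Internet of Things (IoT) systems and robots, aiming to improve their real-time decision-making and adaptation to dynamic environments. The key to its solution is to integrate large language models (LLMs), generative AI, edge computing, and 5G networks into a new robotic operating platform, advancing the joint development of intelligent robotics and IoT.
Link: https://arxiv.org/abs/2506.22477
Authors: Huiwen Han
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Robotics (cs.RO)
Comments: Published in: 2024 6th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), IEEE Xplore, DOI: https://doi.org/10.1109/RICAI64321.2024.10911316 . © 2024 IEEE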
Abstract:This paper introduces an innovative design for robotic operating platforms, underpinned by a transformative Internet of Things (IoT) architecture, seamlessly integrating cutting-edge technologies such as large language models (LLMs), generative AI, edge computing, and 5G networks. The proposed platform aims to elevate the intelligence and autonomy of IoT systems and robotics, enabling them to make real-time decisions and adapt dynamically to changing environments. Through a series of compelling case studies across industries including smart manufacturing, healthcare, and service sectors, this paper demonstrates the substantial potential of IoT-enabled robotics to optimize operational workflows, enhance productivity, and deliver innovative, scalable solutions. By emphasizing the roles of LLMs and generative AI, the research highlights how these technologies drive the evolution of intelligent robotics and IoT, shaping the future of industry-specific advancements. The findings not only showcase the transformative power of these technologies but also offer a forward-looking perspective on their broader societal and industrial implications, positioning them as catalysts for next-generation automation and technological convergence.
[AI-130] Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture
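[Quick Read]: This paper addresses two problems: traditional Regional Climate Models (RCMs) are limited in regional studies by high computational cost and limited flexibility, and existing deep learning downscaling methods handle one variable at a time, which leads to limited contextual awareness, redundant computation, and no cross-variable interaction. The key to its solution is a multi-task, multi-variable Vision Transformer (ViT) architecture with a shared encoder and variable-specific decoders (1EMD) that jointly predicts three key climate variables: surface temperature (tas), wind speed (sfcWind), and 500 hPa geopotential height (zg500). This enables cross-variable knowledge transfer and improves computational efficiency.
Link: https://arxiv.org/abs/2506.22447
Authors: Fabio Merizzi,Harilaos Loukos
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: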
Abstract:Global Climate Models (GCMs) are critical for simulating large-scale climate dynamics, but their coarse spatial resolution limits their applicability in regional studies. Regional Climate Models (RCMs) refine this through dynamic downscaling, albeit at considerable computational cost and with limited flexibility. While deep learning has emerged as an efficient data-driven alternative, most existing studies have focused on single-variable models that downscale one variable at a time. This approach can lead to limited contextual awareness, redundant computation, and lack of cross-variable interaction. Our study addresses these limitations by proposing a multi-task, multi-variable Vision Transformer (ViT) architecture with a shared encoder and variable-specific decoders (1EMD). The proposed architecture jointly predicts three key climate variables: surface temperature (tas), wind speed (sfcWind), and 500 hPa geopotential height (zg500), directly from GCM-resolution inputs, emulating RCM-scale downscaling over Europe. We show that our multi-variable approach achieves positive cross-variable knowledge transfer and consistently outperforms single-variable baselines trained under identical conditions, while also improving computational efficiency. These results demonstrate the effectiveness of multi-variable modeling for high-resolution climate downscaling.
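The shared-encoder, multi-decoder layout can be shown structurally with plain linear maps. This is a sketch of the wiring only: the matrices below are hypothetical stand-ins for ViT encoder and decoder blocks, and only the variable names `tas`, `sfcWind`, and `zg500` come from the abstract.

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]

def linear(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Apply a weight matrix to a vector (one output per row)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def predict_all(x: Sequence[float], encoder: Matrix,
                decoders: Dict[str, Matrix]) -> Dict[str, List[float]]:
    """Encode the coarse input once, then run one decoder head per climate variable."""
    latent = linear(encoder, x)  # shared representation, computed a single time
    return {name: linear(head, latent) for name, head in decoders.items()}

# Hypothetical weights; a real model would learn these jointly across all three tasks.
encoder = [[1.0, 0.0], [0.0, 1.0]]
decoders = {
    "tas": [[1.0, 1.0]],
    "sfcWind": [[1.0, -1.0]],
    "zg500": [[2.0, 0.0]],
}
outputs = predict_all([3.0, 4.0], encoder, decoders)
print(outputs)
```

Because the encoder runs once per input instead of once per variable, adding a variable costs only one extra head. During training, gradients from every head also flow into the same encoder, which is the mechanism the abstract credits for cross-variable knowledge transfer.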
[AI-131] EAGLE: Efficient Alignment of Generalized Latent Embeddings for Multimodal Survival Prediction with Interpretable Attribution Analysis
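[Quick Read]: This paper addresses the challenges of multimodal data fusion in cancer survival prediction, including the simplistic fusion strategies, high computational requirements, and lack of interpretability of existing methods. The key to its solution is the EAGLE (Efficient Alignment of Generalized Latent Embeddings) framework, which improves performance and interpretability through attention-based multimodal fusion and comprehensive attribution analysis. Its components are dynamic cross-modal attention mechanisms, massive dimensionality reduction, three complementary attribution methods, and a unified adaptation pipeline.
Link: https://arxiv.org/abs/2506.22446
Authors: Aakash Tripathi,Asim Waqas,Matthew B. Schabath,Yasin Yilmaz,Ghulam Rasool
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: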
Abstract:Accurate cancer survival prediction requires integration of diverse data modalities that reflect the complex interplay between imaging, clinical parameters, and textual reports. However, existing multimodal approaches suffer from simplistic fusion strategies, massive computational requirements, and lack of interpretability, all critical barriers to clinical adoption. We present EAGLE (Efficient Alignment of Generalized Latent Embeddings), a novel deep learning framework that addresses these limitations through attention-based multimodal fusion with comprehensive attribution analysis. EAGLE introduces four key innovations: (1) dynamic cross-modal attention mechanisms that learn hierarchical relationships between modalities, (2) massive dimensionality reduction (99.96%) while maintaining predictive performance, (3) three complementary attribution methods providing patient-level interpretability, and (4) a unified pipeline enabling seamless adaptation across cancer types. We evaluated EAGLE on 911 patients across three distinct malignancies: glioblastoma (GBM, n=160), intraductal papillary mucinous neoplasms (IPMN, n=171), and non-small cell lung cancer (NSCLC, n=580). Patient-level analysis showed high-risk individuals relied more heavily on adverse imaging features, while low-risk patients demonstrated balanced modality contributions. Risk stratification identified clinically meaningful groups with 4-fold (GBM) to 5-fold (NSCLC) differences in median survival, directly informing treatment intensity decisions. By combining state-of-the-art performance with clinical interpretability, EAGLE bridges the gap between advanced AI capabilities and practical healthcare deployment, offering a scalable solution for multimodal survival prediction that enhances both prognostic accuracy and physician trust in automated predictions.
[AI-132] Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security
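[Quick Read]: This paper addresses the failure of traditional security methods, such as rule-based intrusion detection and single-agent reinforcement learning, to protect increasingly connected Cyber-Physical Systems (CPS) against sophisticated threats such as adaptive and zero-day attacks. The key to its solution is a Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning (HAMARL) framework, in which local agents secure individual subsystems, a global coordinator oversees system-wide defense strategy, and an adversarial training loop simulates and anticipates evolving threats to enable proactive defense adaptation.
Link: https://arxiv.org/abs/2506.22445
Authors: Saad Alqithami
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments: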
Abstract:Cyber-Physical Systems play a critical role in the infrastructure of various sectors, including manufacturing, energy distribution, and autonomous transportation systems. However, their increasing connectivity renders them highly vulnerable to sophisticated cyber threats, such as adaptive and zero-day attacks, against which traditional security methods like rule-based intrusion detection and single-agent reinforcement learning prove insufficient. To overcome these challenges, this paper introduces a novel Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning (HAMARL) framework. HAMARL employs a hierarchical structure consisting of local agents dedicated to subsystem security and a global coordinator that oversees and optimizes comprehensive, system-wide defense strategies. Furthermore, the framework incorporates an adversarial training loop designed to simulate and anticipate evolving cyber threats, enabling proactive defense adaptation. Extensive experimental evaluations conducted on a simulated industrial IoT testbed indicate that HAMARL substantially outperforms traditional multi-agent reinforcement learning approaches, significantly improving attack detection accuracy, reducing response times, and ensuring operational continuity. The results underscore the effectiveness of combining hierarchical multi-agent coordination with adversarially-aware training to enhance the resilience and security of next-generation CPS.
[AI-133] Latent Factorization of Tensors with Threshold Distance Weighted Loss for Traffic Data Estimation
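[Quick read]: This paper addresses incomplete or corrupted spatiotemporal traffic data in Intelligent Transportation Systems (ITS), caused by communication failures and sensor malfunctions, which seriously degrades ITS performance. It proposes an improved model based on the Latent Factorization of Tensors (LFT): the Threshold Distance Weighted (TDW) loss-incorporated LFT (TDWLFT) model. The key to the solution is a new loss function that assigns different weights to individual samples, reducing the model's sensitivity to outliers and thereby improving imputation accuracy and computational efficiency.
Link: https://arxiv.org/abs/2506.22441
Authors: Lei Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: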
Abstract:Intelligent transportation systems (ITS) rely heavily on complete and high-quality spatiotemporal traffic data to achieve optimal performance. Nevertheless, in real-world traffic data collection processes, issues such as communication failures and sensor malfunctions often lead to incomplete or corrupted datasets, thereby posing significant challenges to the advancement of ITS. Among various methods for imputing missing spatiotemporal traffic data, the latent factorization of tensors (LFT) model has emerged as a widely adopted and effective solution. However, conventional LFT models typically employ the standard L2-norm in their learning objective, which makes them vulnerable to the influence of outliers. To overcome this limitation, this paper proposes a threshold distance weighted (TDW) loss-incorporated Latent Factorization of Tensors (TDWLFT) model. The proposed loss function effectively reduces the model’s sensitivity to outliers by assigning differentiated weights to individual samples. Extensive experiments conducted on two traffic speed datasets sourced from diverse urban environments confirm that the proposed TDWLFT model consistently outperforms state-of-the-art approaches in terms of both prediction accuracy and computational efficiency.
[AI-134] Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling
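[Quick read]: This paper addresses how to build, at scale, a high-quality dataset of MIDI files transcribed from audio recordings of piano performances, in support of research on music generation and analysis. The key to the solution is a multi-stage data pipeline: a language model first autonomously crawls and scores audio recordings based on their metadata, then an audio classifier prunes and segments them, yielding over one million distinct MIDI files covering roughly 100,000 hours of transcribed audio.
Link: https://arxiv.org/abs/2504.15071
Authors: Louis Bradshaw,Simon Colton
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: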
Abstract:We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at this https URL.
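[AI-135] ResQuNNs: Towards Enabling Deep Learning in Quantum Convolution Neural Networks
[Quick read]: This paper addresses the limited adaptability of quanvolutional layers in Quanvolutional Neural Networks (QuNNs) and the gradient-optimization difficulties that arise once those layers are made trainable. Traditional quanvolutional layers are mostly static, which limits flexibility and performance. The work makes these layers trainable to increase adaptability, but stacking several trainable layers makes it hard to access gradients across layers. The key to the solution is Residual Quanvolutional Neural Networks (ResQuNNs), which add residual blocks and skip connections to improve gradient flow and thereby training efficiency and network performance.
Link: https://arxiv.org/abs/2402.09146
Authors: Muhammad Kashif,Muhammad Shafique
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: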
Abstract:In this paper, we present a novel framework for enhancing the performance of Quanvolutional Neural Networks (QuNNs) by introducing trainable quanvolutional layers and addressing the critical challenges associated with them. Traditional quanvolutional layers, although beneficial for feature extraction, have largely been static, offering limited adaptability. Unlike state-of-the-art, our research overcomes this limitation by enabling training within these layers, significantly increasing the flexibility and potential of QuNNs. However, the introduction of multiple trainable quanvolutional layers induces complexities in gradient-based optimization, primarily due to the difficulty in accessing gradients across these layers. To resolve this, we propose a novel architecture, Residual Quanvolutional Neural Networks (ResQuNNs), leveraging the concept of residual learning, which facilitates the flow of gradients by adding skip connections between layers. By inserting residual blocks between quanvolutional layers, we ensure enhanced gradient access throughout the network, leading to improved training performance. Moreover, we provide empirical evidence on the strategic placement of these residual blocks within QuNNs. Through extensive experimentation, we identify an efficient configuration of residual blocks, which enables gradients across all the layers in the network that eventually results in efficient training. Our findings suggest that the precise location of residual blocks plays a crucial role in maximizing the performance gains in QuNNs. Our results mark a substantial step forward in the evolution of quantum deep learning, offering new avenues for both theoretical development and practical quantum computing applications.
[AI-136] SQUASH: A SWAP-Based Quantum Attack to Sabotage Hybrid Quantum Neural Networks
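[Quick read]: This paper examines the security of Hybrid Quantum Neural Networks (HQNNs) on classification tasks, specifically their exposure to circuit-level attacks. The key contribution is an attack named SQUASH, which inserts SWAP gates into the variational quantum circuit of the victim HQNN, directly manipulating the circuit structure, misaligning qubits, and disrupting quantum state evolution, which significantly degrades classification performance. The attack is highly stealthy: it needs no access to training data and introduces no detectable perturbation in the input states.
Link: https://arxiv.org/abs/2506.24081
Authors: Rahul Kumar,Wenqi Wei,Ying Mao,Junaid Farooq,Ying Wang,Juntao Chen
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Keywords: Quantum Machine Learning, Hybrid Quantum Neural Networks, SWAP Test, Fidelity, Circuit-level Attack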
Abstract:We propose a circuit-level attack, SQUASH, a SWAP-Based Quantum Attack to sabotage Hybrid Quantum Neural Networks (HQNNs) for classification tasks. SQUASH is executed by inserting SWAP gate(s) into the variational quantum circuit of the victim HQNN. Unlike conventional noise-based or adversarial input attacks, SQUASH directly manipulates the circuit structure, leading to qubit misalignment and disrupting quantum state evolution. This attack is highly stealthy, as it does not require access to training data or introduce detectable perturbations in input states. Our results demonstrate that SQUASH significantly degrades classification performance, with untargeted SWAP attacks reducing accuracy by up to 74.08% and targeted SWAP attacks reducing target class accuracy by up to 79.78%. These findings reveal a critical vulnerability in HQNN implementations, underscoring the need for more resilient architectures against circuit-level adversarial interventions.
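To make the "qubit misalignment" idea concrete, here is a minimal sketch of what a single SWAP gate does to a two-qubit state. It is only an illustration of the gate itself, written with plain Python lists; it is not the authors' attack code, and the helper name `apply_gate` is hypothetical.

```python
# Two-qubit statevector in the basis order |00>, |01>, |10>, |11>.
SWAP = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]


def apply_gate(gate, state):
    """Multiply a 4x4 gate matrix by a length-4 statevector."""
    return [sum(gate[r][c] * state[c] for c in range(4)) for r in range(4)]


state_01 = [0, 1, 0, 0]               # first qubit |0>, second qubit |1>
swapped = apply_gate(SWAP, state_01)  # amplitude moves to |10>
restored = apply_gate(SWAP, swapped)  # SWAP is its own inverse
```

Any later gate that expects the excitation on the second qubit now acts on the wrong wire, which is the kind of structural disruption the abstract describes.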
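[AI-137] Tensor Train Quantum State Tomography using Compressed Sensing
[Quick read]: This paper addresses the problem that standard estimation methods for Quantum State Tomography (QST) become impractical because the number of parameters in the state representation grows exponentially. The key to the solution is parameterizing the state with a low-rank block tensor train decomposition, which is efficient in both memory and computation. The framework applies to a broad class of quantum states that are well approximated by low-rank decompositions, including pure states, nearly pure states, and ground states of Hamiltonians.
Link: https://arxiv.org/abs/2506.23560
Authors: Shakir Showkat Sofi,Charlotte Vermeylen,Lieven De Lathauwer
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Optimization and Control (math.OC)
Comments: Accepted for publication in EUSIPCO 2025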
Abstract:Quantum state tomography (QST) is a fundamental technique for estimating the state of a quantum system from measured data and plays a crucial role in evaluating the performance of quantum devices. However, standard estimation methods become impractical due to the exponential growth of parameters in the state representation. In this work, we address this challenge by parameterizing the state using a low-rank block tensor train decomposition and demonstrate that our approach is both memory- and computationally efficient. This framework applies to a broad class of quantum states that can be well approximated by low-rank decompositions, including pure states, nearly pure states, and ground states of Hamiltonians.
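The memory argument can be illustrated with a rough parameter count. The sketch below assumes a plain tensor train with one core per qubit, a physical dimension of 4 per site (a density-matrix site), and a uniform internal rank; the paper's block tensor train may differ in detail, and the function names are hypothetical.

```python
def full_params(n_qubits):
    """Entries in a dense 2^n x 2^n density matrix."""
    return 4 ** n_qubits


def tt_params(n_sites, phys_dim, rank):
    """Entries in a tensor train with uniform internal rank and boundary ranks of 1."""
    ranks = [1] + [rank] * (n_sites - 1) + [1]
    return sum(ranks[k] * phys_dim * ranks[k + 1] for k in range(n_sites))


dense = full_params(10)           # grows exponentially with the qubit count
compressed = tt_params(10, 4, 8)  # grows linearly with the qubit count at fixed rank
```

At fixed rank the tensor train count grows linearly in the number of qubits, while the dense count grows as 4^n.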
[AI-138] Multi-Branch DNN and CRLB-Ratio-Weight Fusion for Enhanced DOA Sensing via a Massive H2AD MIMO Receiver
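[Quick read]: This paper addresses how to design, for a massive H$^2$AD structure, a low-complexity, high-performance method for fusing the target direction values sensed by different sub-array groups while relying less on prior knowledge. The key to the solution is a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusion (WF) method that approximates each subarray's inverse CRLB using antenna number reciprocals, avoiding real-time CRLB computation and reducing both complexity and dependence on prior knowledge. In addition, a multi-branch deep neural network (MBDNN) further improves direction-of-arrival (DOA) sensing by fusing candidate angles from multiple subarrays; its subarray-specific branch networks, combined with a shared regression module, eliminate pseudo-solutions and fuse the true angles.
Link: https://arxiv.org/abs/2506.23203
Authors: Feng Shu,Jiatong Bai,Di Wu,Wei Zhu,Bin Deng,Fuhui Zhou,Jiangzhou Wang
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: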
Abstract:As a green MIMO structure, massive H$^2$AD is viewed as a potential technology for the future 6G wireless network. For such a structure, it is a challenging task to design a low-complexity and high-performance fusion of target direction values sensed by different sub-array groups with fewer use of prior knowledge. To address this issue, a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusion (WF) method is proposed, which approximates inverse CRLB of each subarray using antenna number reciprocals to eliminate real-time CRLB computation. This reduces complexity and prior knowledge dependence while preserving fusion performance. Moreover, a multi-branch deep neural network (MBDNN) is constructed to further enhance direction-of-arrival (DOA) sensing by leveraging candidate angles from multiple subarrays. The subarray-specific branch networks are integrated with a shared regression module to effectively eliminate pseudo-solutions and fuse true angles. Simulation results show that the proposed CRLB-ratio-WF method achieves DOA sensing performance comparable to CRLB-based methods, while significantly reducing the reliance on prior knowledge. More notably, the proposed MBDNN has superior performance in low-SNR ranges. At SNR = -15 dB, it achieves an order-of-magnitude improvement in estimation accuracy compared to CRLB-ratio-WF method.
[AI-139] Deep Learning for Optical Misalignment Diagnostics in Multi-Lens Imaging Systems
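[Quick read]: This paper addresses the difficulty of precisely aligning multi-lens imaging systems, a critical problem in optical engineering because even minor misalignments can significantly degrade performance. Traditional alignment methods depend on specialized equipment and are time-consuming, so automated and scalable solutions are needed. The paper proposes two deep learning-based inverse-design methods that diagnose misalignments in multi-element lens systems using only optical measurements. The key is combining ray-traced spot diagrams and a physics-based simulation pipeline with deep learning models to estimate 5-DOF and 4-DOF errors with high accuracy.
Link: https://arxiv.org/abs/2506.23173
Authors: Tomer Slor,Dean Oren,Shira Baneth,Tom Coen,Haim Suchowski
Affiliations: Unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: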
Abstract:In the rapidly evolving field of optical engineering, precise alignment of multi-lens imaging systems is critical yet challenging, as even minor misalignments can significantly degrade performance. Traditional alignment methods rely on specialized equipment and are time-consuming processes, highlighting the need for automated and scalable solutions. We present two complementary deep learning-based inverse-design methods for diagnosing misalignments in multi-element lens systems using only optical measurements. First, we use ray-traced spot diagrams to predict five-degree-of-freedom (5-DOF) errors in a 6-lens photographic prime, achieving a mean absolute error of 0.031 mm in lateral translation and 0.011$^\circ$ in tilt. We also introduce a physics-based simulation pipeline that utilizes grayscale synthetic camera images, enabling a deep learning model to estimate 4-DOF, decenter and tilt errors in both two- and six-lens multi-lens systems. These results show the potential to reshape manufacturing and quality control in precision imaging.
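[AI-140] Treatment, evidence, imitation, and chat
[Quick read]: This paper addresses the treatment problem in medical decision making, the patient's core decision-making task, which is usually solved in collaboration with a healthcare provider. It discusses approaches to solving it within evidence-based medicine, including trials and observational data. The key is its analysis of how large language models might be applied to the treatment problem and of the challenges that emerge, which are closely tied to evidence-based medicine and point to directions for follow-up work.
Link: https://arxiv.org/abs/2506.23040
Authors: Samuel J. Weisenthal
Affiliations: Unknown
Categories: Other Statistics (stat.OT); Artificial Intelligence (cs.AI)
Comments: 12 pages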
Abstract:Large language models are thought to have potential to aid in medical decision making. We investigate this here. We start with the treatment problem, the patient’s core medical decision-making task, which is solved in collaboration with a healthcare provider. We discuss approaches to solving the treatment problem, including – within evidence-based medicine – trials and observational data. We then discuss the chat problem, and how this differs from the treatment problem – in particular as it relates to imitation. We then discuss how a large language model might be used to solve the treatment problem and highlight some of the challenges that emerge. We finally discuss how these challenges relate to evidence-based medicine, and how this might inform next steps.
[AI-141] Beyond Code: The Multidimensional Impacts of Large Language Models in Software Development
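[Quick read]: This paper investigates how Large Language Models (LLMs) affect Open-Source Software (OSS) developers and through which mechanisms. The key to the approach is an empirical study that exploits Italy's temporary ChatGPT ban as a natural experiment and applies a Difference-in-Differences framework with two-way fixed effects to analyze LLM effects on OSS developers in three areas: code development, knowledge sharing, and skill acquisition. This reveals the specific effects of LLMs on developer productivity, knowledge sharing, and skill acquisition, and how those effects vary.
Link: https://arxiv.org/abs/2506.22704
Authors: Sardar Fatooreh Bonabi,Sarah Bana,Tingting Nian,Vijay Gurbaxani
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments: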
Abstract:Large language models (LLMs) are poised to significantly impact software development, especially in the Open-Source Software (OSS) sector. To understand this impact, we first outline the mechanisms through which LLMs may influence OSS through code development, collaborative knowledge transfer, and skill development. We then empirically examine how LLMs affect OSS developers’ work in these three key areas. Leveraging a natural experiment from a temporary ChatGPT ban in Italy, we employ a Difference-in-Differences framework with two-way fixed effects to analyze data from all OSS developers on GitHub in three similar countries, Italy, France, and Portugal, totaling 88,022 users. We find that access to ChatGPT increases developer productivity by 6.4%, knowledge sharing by 9.6%, and skill acquisition by 8.4%. These benefits vary significantly by user experience level: novice developers primarily experience productivity gains, whereas more experienced developers benefit more from improved knowledge sharing and accelerated skill acquisition. In addition, we find that LLM-assisted learning is highly context-dependent, with the greatest benefits observed in technically complex, fragmented, or rapidly evolving contexts. We show that the productivity effects of LLMs extend beyond direct code generation to include enhanced collaborative learning and knowledge exchange among developers; dynamics that are essential for gaining a holistic understanding of LLMs’ impact in OSS. Our findings offer critical managerial implications: strategically deploying LLMs can accelerate novice developers’ onboarding and productivity, empower intermediate developers to foster knowledge sharing and collaboration, and support rapid skill acquisition, together enhancing long-term organizational productivity and agility.
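For readers unfamiliar with Difference-in-Differences, the simplest 2x2 version can be sketched as below. The paper uses a regression with two-way fixed effects, which this sketch does not reproduce, and every number here is made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly output per developer, before and after a ban.
effect_of_ban = did_estimate(
    treated_pre=[10, 12], treated_post=[9, 11],   # group that lost access
    control_pre=[10, 14], control_post=[11, 15],  # comparison group
)
```

The control group's change stands in for what would have happened to the treated group without the ban, so the estimate is only as credible as that parallel-trends assumption.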
[AI-142] Correlated Mutations for Integer Programming
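[Quick read]: This paper addresses how to improve randomized search heuristics for Integer Programming (IP). Although recent theoretical advances have dramatically reduced the complexity of IP, heuristics remain the dominant solvers for this hard problem class. The paper lays the groundwork for Integer Evolution Strategies (IESs); its core innovation is adopting the ℓ1-norm in place of the conventional ℓ2-norm and considering the Truncated Normal (TN) and Double Geometric (DG) distributions as mutation distributions. The study shows that the DG distribution is better suited than the TN distribution to unbounded integer search, and that the crucial change is the move from the ℓ2-norm to the ℓ1-norm.
Link: https://arxiv.org/abs/2506.22526
Authors: Ofer M. Shir,Michael Emmerich
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: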
Abstract:Even with the recent theoretical advancements that dramatically reduced the complexity of Integer Programming (IP), heuristics remain the dominant problem-solvers for this difficult category. This study seeks to establish the groundwork for Integer Evolution Strategies (IESs), a class of randomized search heuristics inherently designed for continuous spaces. IESs already excel in treating IP in practice, but accomplish it via discretization and by applying sophisticated patches to their continuous operators, while persistently using the $\ell_2$-norm as their operation pillar. We lay foundations for discrete search, by adopting the $\ell_1$-norm, accounting for the suitable step-size, and questioning alternative measures to quantify correlations over the integer lattice. We focus on mutation distributions for unbounded integer decision variables. We briefly discuss a couple of candidate discrete probabilities induced by the uniform and binomial distributions, which we show to possess less appealing theoretical properties, and then narrow down to the Truncated Normal (TN) and Double Geometric (DG) distributions. We explore their theoretical properties, including entropy functions, and propose a procedure to generate scalable correlated mutation distributions. Our investigations are accompanied by extensive numerical simulations, which consistently support the claim that the DG distribution is better suited for unbounded integer search. We link our theoretical perspective to empirical evidence indicating that an IES with correlated DG mutations outperformed other strategies over non-separable quadratic IP. We conclude that while the replacement of the default TN distribution by the DG is theoretically justified and practically beneficial, the truly crucial change lies in adopting the $\ell_1$-norm over the $\ell_2$-norm.
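A double geometric step can be sampled as the difference of two independent geometric variables, which gives a symmetric distribution over all integers. The sketch below assumes that construction and applies it independently per coordinate; it does not implement the correlated mutations the paper proposes, and the parameter `p` and function names are illustrative.

```python
import math
import random


def geometric(p, rng):
    """Failures before the first success, support {0, 1, 2, ...}; requires 0 < p < 1."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(1.0 - p))


def double_geometric(p, rng):
    """Difference of two i.i.d. geometric variables: symmetric around zero."""
    return geometric(p, rng) - geometric(p, rng)


def mutate(x, p, rng):
    """Add an independent double-geometric step to each integer coordinate."""
    return [xi + double_geometric(p, rng) for xi in x]


rng = random.Random(0)
child = mutate([5, -3, 0], 0.3, rng)
```

Smaller values of `p` give heavier tails and therefore larger typical steps, which plays the role of a step-size parameter in this uncorrelated version.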
[AI-143] Masked Autoencoders that Feel the Heart: Unveiling Simplicity Bias for ECG Analyses
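[Quick read]: This paper addresses Simplicity Bias (SB) in electrocardiogram (ECG) analysis, where supervised models overfit dominant patterns and overlook subtle but clinically critical features. The key to the solution is self-supervised learning (SSL) together with two core components that mitigate SB: Temporal-Frequency aware Filters, which capture temporal-frequency features reflecting the dynamic characteristics of ECG signals, and Multi-Grained Prototype Reconstruction, which learns coarse and fine representations across the time and frequency domains.
Link: https://arxiv.org/abs/2506.22495
Authors: He-Yang Xu,Hongxiang Gao,Yuwen Li,Xiu-Shen Wei,Chengyu Liu
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: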
Abstract:The diagnostic value of electrocardiogram (ECG) lies in its dynamic characteristics, ranging from rhythm fluctuations to subtle waveform deformations that evolve across time and frequency domains. However, supervised ECG models tend to overfit dominant and repetitive patterns, overlooking fine-grained but clinically critical cues, a phenomenon known as Simplicity Bias (SB), where models favor easily learnable signals over subtle but informative ones. In this work, we first empirically demonstrate the presence of SB in ECG analyses and its negative impact on diagnostic performance, while simultaneously discovering that self-supervised learning (SSL) can alleviate it, providing a promising direction for tackling the bias. Following the SSL paradigm, we propose a novel method comprising two key components: 1) Temporal-Frequency aware Filters to capture temporal-frequency features reflecting the dynamic characteristics of ECG signals, and 2) building on this, Multi-Grained Prototype Reconstruction for coarse and fine representation learning across dual domains, further mitigating SB. To advance SSL in ECG analyses, we curate a large-scale multi-site ECG dataset with 1.53 million recordings from over 300 clinical centers. Experiments on three downstream tasks across six ECG datasets demonstrate that our method effectively reduces SB and achieves state-of-the-art performance. Code and dataset will be released publicly.
[AI-144] Hindsight-Guided Momentum (HGM) Optimizer: An Approach to Adaptive Learning Rate
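[Quick read]: This paper addresses the fact that traditional adaptive optimizers such as Adam or RMSprop adjust learning rates using only gradient magnitude and ignore directional consistency. These methods overlook important geometric information, such as the alignment between the current gradient and past update directions, which reflects the local curvature and consistency of the optimization path. The key to the solution is a hindsight mechanism that measures directional consistency as the cosine similarity between the current gradient and the accumulated momentum, raising the learning rate when update directions align and lowering it in oscillating or noisy regions, for more efficient optimization.
Link: https://arxiv.org/abs/2506.22479
Authors: Krisanu Sarkar
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: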
Abstract:We introduce Hindsight-Guided Momentum (HGM), a first-order optimization algorithm that adaptively scales learning rates based on the directional consistency of recent updates. Traditional adaptive methods, such as Adam or RMSprop, adapt learning dynamics using only the magnitude of gradients, often overlooking important geometric cues. These cues refer to directional information, such as the alignment between current gradients and past updates, which reflects the local curvature and consistency of the optimization path. HGM addresses this by incorporating a hindsight mechanism that evaluates the cosine similarity between the current gradient and accumulated momentum. This allows it to distinguish between coherent and conflicting gradient directions, increasing the learning rate when updates align and reducing it in regions of oscillation or noise. The result is a more responsive optimizer that accelerates convergence in smooth regions of the loss surface while maintaining stability in sharper or more erratic areas. Despite this added adaptability, the method preserves the computational and memory efficiency of existing methods. By more intelligently responding to the structure of the optimization landscape, HGM provides a simple yet effective improvement over existing approaches, particularly in non-convex settings like that of deep neural network training.
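The hindsight idea can be sketched in a few lines. The abstract does not give the exact scaling rule, so the multiplier `1 + gamma * cos` below is an assumption chosen for illustration, as are the hyperparameter names `beta` and `gamma`; treat this as a toy under those assumptions, not the published optimizer.

```python
import math


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors; returns 0 when either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def hindsight_step(params, grad, momentum, lr=0.1, beta=0.9, gamma=0.5):
    """One update with lr scaled by (1 + gamma * cos(grad, previous momentum))."""
    cos = cosine_similarity(grad, momentum)
    scaled_lr = lr * (1.0 + gamma * cos)  # assumed scaling rule, not from the paper
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grad)]
    new_params = [p - scaled_lr * m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum, scaled_lr


# Gradient aligned with past momentum versus gradient opposing it.
_, _, lr_aligned = hindsight_step([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
_, _, lr_opposed = hindsight_step([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
```

With `gamma` below 1 the scaled learning rate stays positive even when the gradient fully opposes the momentum.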
[AI-145] Dimensionality Reduction on IoT Monitoring Data of Smart Building for Energy Consumption Forecasting
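[Quick read]: This paper addresses how to maintain data analysis accuracy in resource-constrained edge computing environments, especially when handling the large volumes of data produced by Internet of Things (IoT) systems. The key to the solution is correlation analysis that identifies the environmental variables with strong or semi-strong correlation to energy consumption, so that weakly correlated variables can be excluded without sacrificing prediction accuracy, improving data processing efficiency.
Link: https://arxiv.org/abs/2506.22468
Authors: Konstantinos Koutras,Agorakis Bompotas,Constantinos Halkiopoulos,Athanasios Kalogeras,Christos Alexakos
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: Version of submitted paper on 2023 IEEE International Smart Cities Conference (ISC2), 1-6, 2023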
Abstract:The Internet of Things (IoT) plays a major role today in smart building infrastructures, from simple smart-home applications, to more sophisticated industrial type installations. The vast amounts of data generated from relevant systems can be processed in different ways revealing important information. This is especially true in the era of edge computing, when advanced data analysis and decision-making is gradually moving to the edge of the network where devices are generally characterised by low computing resources. In this context, one of the emerging main challenges is related to maintaining data analysis accuracy even with less data that can be efficiently handled by low resource devices. The present work focuses on correlation analysis of data retrieved from a pilot IoT network installation monitoring a small smart office by means of environmental and energy consumption sensors. The research motivation was to find statistical correlation between the monitoring variables that will allow the use of machine learning (ML) prediction algorithms for energy consumption reducing input parameters. For this to happen, a series of hypothesis tests for the correlation of three different environmental variables with the energy consumption were carried out. A total of ninety tests were performed, thirty for each pair of variables. In these tests, p-values showed the existence of strong or semi-strong correlation with two environmental variables, and of a weak correlation with a third one. Using the proposed methodology, we manage without examining the entire data set to exclude weak correlated variables while keeping the same score of accuracy.
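A correlation-based screening step of this kind might look like the sketch below. It uses the Pearson coefficient with a fixed cutoff instead of the hypothesis tests and p-values described in the abstract, and the variable names, readings, and the 0.5 threshold are all hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_features(features, target, threshold=0.5):
    """Keep the variables whose absolute correlation with the target meets the cutoff."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) >= threshold]


# Made-up sensor readings for a small office.
energy = [1, 2, 3, 4, 5, 6]
sensors = {
    "temperature": [2, 4, 6, 8, 10, 12],
    "humidity": [6, 5, 4, 3, 2, 1],
    "light": [1, -1, 1, -1, 1, -1],
}
kept = select_features(sensors, energy)
```

Only the retained variables would then be passed to the forecasting model, which reduces the input size on a low-resource edge device.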
[AI-146] Privacy-aware IoT Fall Detection Services For Aging in Place
[Quick read]: This paper addresses fall detection for the elderly, to help a growing elderly population live independently and safely. Existing methods often suffer from data scarcity or compromise privacy. The paper proposes an IoT-based Fall Detection as a Service (FDaaS) framework. The key is a service-oriented architecture that uses Ultra-wideband (UWB) radar sensors as an IoT health-sensing service, achieving accurate fall detection while ensuring privacy and minimal intrusion. In addition, a Fall Detection Generative Pre-trained Transformer (FD-GPT) combined with data augmentation mitigates data scarcity.
Link: https://arxiv.org/abs/2506.22462
Authors: Abdallah Lakhdari,Jiajie Li,Amani Abusafia,Athman Bouguettaya
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 12 figures, This paper is accepted in the 2025 IEEE International Conference on Web Services (ICWS 2025)
Abstract:Fall detection is critical to support the growing elderly population, projected to reach 2.1 billion by 2050. However, existing methods often face data scarcity challenges or compromise privacy. We propose a novel IoT-based Fall Detection as a Service (FDaaS) framework to assist the elderly in living independently and safely by accurately detecting falls. We design a service-oriented architecture that leverages Ultra-wideband (UWB) radar sensors as an IoT health-sensing service, ensuring privacy and minimal intrusion. We address the challenges of data scarcity by utilizing a Fall Detection Generative Pre-trained Transformer (FD-GPT) that uses augmentation techniques. We developed a protocol to collect a comprehensive dataset of the elderly daily activities and fall events. This resulted in a real dataset that carefully mimics the elderly’s routine. We rigorously evaluate and compare various models using this dataset. Experimental results show our approach achieves 90.72% accuracy and 89.33% precision in distinguishing between fall events and regular activities of daily living.
[AI-147] Machine Learning for Proactive Groundwater Management: Early Warning and Resource Allocation
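[Quick read]: This paper addresses the limited effectiveness of groundwater monitoring, in particular sparse data, computational constraints, and delayed outputs from traditional approaches. The key to the solution is a machine learning pipeline that uses climate data, hydro-meteorological records, and physiographic attributes, processed through AutoGluon's automated ensemble framework, and combines geospatial preprocessing, domain-driven feature engineering, and automated model selection to overcome the limitations of conventional monitoring.
Link: https://arxiv.org/abs/2506.22461
Authors: Chuan Li,Ruoxuan Yang
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: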
Abstract:Groundwater supports ecosystems, agriculture, and drinking water supplies worldwide, yet effective monitoring remains challenging due to sparse data, computational constraints, and delayed outputs from traditional approaches. We develop a machine learning pipeline that predicts groundwater level categories using climate data, hydro-meteorological records, and physiographic attributes processed through AutoGluon’s automated ensemble framework. Our approach integrates geospatial preprocessing, domain-driven feature engineering, and automated model selection to overcome conventional monitoring limitations. Applied to a large-scale French dataset (n 3,440,000 observations from 1,500+ wells), the model achieves weighted $F_1$ scores of 0.927 on validation data and 0.67 on temporally distinct test data. Scenario-based evaluations demonstrate practical utility for early warning systems and water allocation decisions under changing climate conditions. The open-source implementation provides a scalable framework for integrating machine learning into national groundwater monitoring networks, enabling more responsive and data-driven water management strategies.
[AI-148] Heart rate and respiratory rate prediction from noisy real-world smartphone based on Deep Learning methods
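[Quick read]: This paper addresses the accuracy of estimating heart rate (HR) and respiratory rate (RR) from mobile phone video in daily life, in particular the gap between algorithm performance in laboratory settings and in real-world use. The study finds that traditional algorithms perform markedly worse on daily-life fingertip videos than previously reported (7 times and 13 times worse for RR and HR, respectively). The key to the proposed solution is a novel 3D deep convolutional neural network (3D deep CNN), a deep learning approach that substantially improves estimation accuracy, reducing HR error by 68% and RR error by 75%.
Link: https://arxiv.org/abs/2506.22460
Authors: Ibne Farabi Shihab
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: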
Abstract:Using mobile phone video of the fingertip as a data source for estimating vital signs such as heart rate (HR) and respiratory rate (RR) during daily life has long been suggested. While existing literature indicates that these estimates are accurate to within several beats or breaths per minute, the data used to draw these conclusions are typically collected in laboratory environments under careful experimental control, and yet the results are assumed to generalize to daily life. In an effort to test it, a team of researchers collected a large dataset of mobile phone video recordings made during daily life and annotated with ground truth HR and RR labels from N=111 participants. They found that traditional algorithm performance on the fingerprint videos is worse than previously reported (7 times and 13 times worse for RR and HR, respectively). Fortunately, recent advancements in deep learning, especially in convolutional neural networks (CNNs), offer a promising solution to improve this performance. This study proposes a new method for estimating HR and RR using a novel 3D deep CNN, demonstrating a reduced error in estimated HR by 68% and RR by 75%. These promising results suggest that regressor-based deep learning approaches should be used in estimating HR and RR.
[AI-149] A Complex UNet Approach for Non-Invasive Fetal ECG Extraction Using Single-Channel Dry Textile Electrodes
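[Quick read]: This paper addresses the difficulty of extracting the fetal electrocardiogram (fECG) in non-invasive pregnancy monitoring when single-channel recordings from dry textile electrodes are contaminated by noise and motion artefacts. The key to the solution is a pipeline built on a complex-valued denoising network, Complex UNet, which processes both the real and imaginary components of the spectrogram rather than the signal magnitude alone, thereby accounting for phase information and avoiding inconsistent predictions. It achieves the best fECG extraction and R-peak detection performance on both simulated and real data.
Link: https://arxiv.org/abs/2506.22457
Authors: Iulia Orvas,Andrei Radu,Alessandra Galli,Ana Neacsu,Elisabetta Peri
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: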
Abstract:Continuous, non-invasive pregnancy monitoring is crucial for minimising potential complications. The fetal electrocardiogram (fECG) represents a promising tool for assessing fetal health beyond clinical environments. Home-based monitoring necessitates the use of a minimal number of comfortable and durable electrodes, such as dry textile electrodes. However, this setup presents many challenges, including increased noise and motion artefacts, which complicate the accurate extraction of fECG signals. To overcome these challenges, we introduce a pioneering method for extracting fECG from single-channel recordings obtained using dry textile electrodes using AI techniques. We created a new dataset by simulating abdominal recordings, including noise closely resembling real-world characteristics of in-vivo recordings through dry textile electrodes, alongside mECG and fECG. To ensure the reliability of the extracted fECG, we propose an innovative pipeline based on a complex-valued denoising network, Complex UNet. Unlike previous approaches that focused solely on signal magnitude, our method processes both real and imaginary components of the spectrogram, addressing phase information and preventing incongruous predictions. We evaluated our novel pipeline against traditional, well-established approaches, on both simulated and real data in terms of fECG extraction and R-peak detection. The results showcase that our suggested method achieves new state-of-the-art results, enabling an accurate extraction of fECG morphology across all evaluated settings. This method is the first to effectively extract fECG signals from single-channel recordings using dry textile electrodes, making a significant advancement towards a fully non-invasive and self-administered fECG extraction solution.
[AI-150] Unsupervised Learning-Based Joint Resource Allocation and Beamforming Design for RIS-Assisted MISO-OFDMA Systems
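[Quick read]: This paper addresses resource allocation for downlink transmission in reconfigurable intelligent surface (RIS)-assisted multiple-input single-output orthogonal frequency-division multiple access (MISO-OFDMA) systems. The key to the solution is a two-stage unsupervised learning framework consisting of BeamNet, which predicts RIS phase shifts, and AllocationNet, which allocates resource blocks (RBs), with active beamforming implemented via maximum ratio transmission and water-filling. To handle discrete constraints while preserving differentiability, it adopts quantization and the Gumbel-softmax trick, and a customized loss function with phased training improves performance under quality-of-service (QoS) constraints.
Link: https://arxiv.org/abs/2506.22448
Authors: Yu Ma,Xingyu Zhou,Xiao Li,Le Liang,Shi Jin
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file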
Abstract:Reconfigurable intelligent surfaces (RIS) are key enablers for 6G wireless systems. This paper studies downlink transmission in an RIS-assisted MISO-OFDMA system, addressing resource allocation challenges. A two-stage unsupervised learning-based framework is proposed to jointly design RIS phase shifts, BS beamforming, and resource block (RB) allocation. The framework includes BeamNet, which predicts RIS phase shifts from CSI, and AllocationNet, which allocates RBs using equivalent CSI derived from BeamNet outputs. Active beamforming is implemented via maximum ratio transmission and water-filling. To handle discrete constraints while ensuring differentiability, quantization and the Gumbel-softmax trick are adopted. A customized loss and phased training enhance performance under QoS constraints. Simulations show the method achieves 99.93% of the sum rate of the SCA baseline with only 0.036% of its runtime, and it remains robust across varying channel and user conditions.
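The Gumbel-softmax trick mentioned above replaces a hard, non-differentiable choice (for example, which user gets a resource block) with a smooth sample that approaches one-hot as the temperature drops. The sketch below shows only the forward sampling step in plain Python; a real training setup would use an autodiff framework, and the logits and temperature here are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Hypothetical scores for assigning one resource block among three users.
soft_assignment = gumbel_softmax([2.0, 0.5, 0.1], tau=0.5, rng=rng)
```

Lower values of `tau` push the output closer to a one-hot vector at the cost of noisier gradients, which is why the temperature is often annealed during training.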
[AI-151] Attention acts to suppress goal-based conflict under high competition
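Summary: This paper examines how top-down attention modulates neural signals for task-relevant and task-irrelevant stimuli in the visual cortex under high competition. The conventional view holds that top-down attention selectively enhances the neural signal for task-relevant stimuli. The study finds that under high competition, where two stimuli with opposing modulatory goals share a receptive field, top-down attention non-selectively suppresses both task-relevant and task-irrelevant neural signals within 100 ms of stimulus onset. The key point is that this non-selective engagement of attentional resources helps reduce the feedforward signal representing irrelevant stimuli.
Link: https://arxiv.org/abs/1610.09431
Authors: Omar Claflin
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 25 pages, 3 figures, 3 tables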
Abstract:It is known that when multiple stimuli are present, top-down attention selectively enhances the neural signal in the visual cortex for task-relevant stimuli, but this has been tested only under conditions of minimal competition of visual attention. Here we show during high competition, that is, two stimuli in a shared receptive field possessing opposing modulatory goals, top-down attention suppresses both task-relevant and irrelevant neural signals within 100 ms of stimuli onset. This non-selective engagement of top-down attentional resources serves to reduce the feedforward signal representing irrelevant stimuli.
Machine Learning
[LG-0] Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
Link: https://arxiv.org/abs/2506.24045
Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6× lower latency for reactive tasks and sustains 1.6× to 6.8× higher throughput for proactive tasks compared to state-of-the-art inference engines.
[LG-1] Faster Diffusion Models via Higher-Order Approximation
Link: https://arxiv.org/abs/2506.24042
Authors: Gen Li, Yuchen Zhou, Yuting Wei, Yuxin Chen
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:In this paper, we explore provable acceleration of diffusion models without any additional retraining. Focusing on the task of approximating a target data distribution in $\mathbb{R}^d$ to within $\varepsilon$ total-variation distance, we propose a principled, training-free sampling algorithm that requires only the order of $d^{1+2/K}\varepsilon^{-1/K}$ score function evaluations (up to log factor) in the presence of accurate scores, where $K$ is an arbitrarily large fixed integer. This result applies to a broad class of target data distributions, without the need for assumptions such as smoothness or log-concavity. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases – without demanding higher-order smoothness on the score estimates as assumed in previous work. The proposed algorithm draws insight from high-order ODE solvers, leveraging high-order Lagrange interpolation and successive refinement to approximate the integral derived from the probability flow ODE.
[LG-2] Unsupervised Sparse Coding-based Spiking Neural Network for Real-time Spike Sorting
Link: https://arxiv.org/abs/2506.24041
Authors: Alexis Melot, Sean U.N. Wood, Yannick Coffinier, Pierre Yger, Fabien Alibart
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: Main article: 16 pages, 7 figures and 4 tables. Supplementary Material starts at page 17 with 7 figures
Abstract:Spike sorting is a crucial step in decoding multichannel extracellular neural signals, enabling the identification of individual neuronal activity. A key challenge in brain-machine interfaces (BMIs) is achieving real-time, low-power spike sorting at the edge while keeping high neural decoding performance. This study introduces the Neuromorphic Sparse Sorter (NSS), a compact two-layer spiking neural network optimized for efficient spike sorting. NSS leverages the Locally Competitive Algorithm (LCA) for sparse coding to extract relevant features from noisy events with reduced computational demands. NSS learns to sort detected spike waveforms in an online fashion and operates entirely unsupervised. To exploit multi-bit spike coding capabilities of neuromorphic platforms like Intel’s Loihi 2, a custom neuron model was implemented, enabling flexible power-performance trade-offs via adjustable spike bit-widths. Evaluations on simulated and real-world tetrode signals with biological drift showed NSS outperformed established pipelines such as WaveClus3 and PCA+KMeans. With 2-bit graded spikes, NSS on Loihi 2 outperformed NSS implemented with leaky integrate-and-fire neuron and achieved an F1-score of 77% (+10% improvement) while consuming 8.6mW (+1.65mW) when tested on a drifting recording, with a computational processing time of 0.25ms (+60 us) per inference.
[LG-3] Provably Efficient and Agile Randomized Q-Learning
Link: https://arxiv.org/abs/2506.24005
Authors: He Wang, Xingyu Xu, Yuejie Chi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates which reduce responsiveness and slow down the learning process. In this paper, we propose a novel variant of the Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise policy updates for episodic tabular RL. We establish an $\widetilde{O}(\sqrt{H^5SAT})$ regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $T$ is the total number of episodes. In addition, we present a logarithmic regret bound under a mild positive sub-optimality condition on the optimal Q-function. Empirically, RandomizedQ exhibits outstanding performance compared to existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks.
[LG-4] The Jacobian and Hessian of the Kullback-Leibler Divergence between Multivariate Gaussian Distributions (Technical Report)
Link: https://arxiv.org/abs/2506.23996
Authors: Juan Maroñas
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:This document shows how to obtain the Jacobian and Hessian matrices of the Kullback-Leibler divergence between two multivariate Gaussian distributions, using the first and second-order differentials. The presented derivations are based on the theory presented by \cite{magnus99}. I’ve also got great inspiration from some of the derivations in \cite{minka}. Since I pretend to be at most didactic, the document is split into a summary of results and detailed derivations on each of the elements involved, with specific references to the tricks used in the derivations, and to many of the underlying concepts.
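For orientation, the quantity being differentiated has a well-known closed form. The block below states it together with the derivatives with respect to the first mean only; the symbols ($\mu_0, \Sigma_0, \mu_1, \Sigma_1$, dimension $d$) are notation assumed here and may differ from the report's, and the derivatives with respect to the covariances are left to the report itself.

```latex
\mathrm{KL}\big(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big)
  = \tfrac{1}{2}\Big[\operatorname{tr}\!\big(\Sigma_1^{-1}\Sigma_0\big)
  + (\mu_1-\mu_0)^\top \Sigma_1^{-1} (\mu_1-\mu_0)
  - d + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\Big]

\frac{\partial\,\mathrm{KL}}{\partial \mu_0} = \Sigma_1^{-1}(\mu_0-\mu_1),
\qquad
\frac{\partial^2\,\mathrm{KL}}{\partial \mu_0\,\partial \mu_0^\top} = \Sigma_1^{-1}
```

Only the quadratic term depends on $\mu_0$, so its gradient is linear in $\mu_0 - \mu_1$ and its Hessian is the constant matrix $\Sigma_1^{-1}$.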
[LG-5] A Scalable Approach for Safe and Robust Learning via Lipschitz-Constrained Networks
Link: https://arxiv.org/abs/2506.23977
Authors: Zain ul Abdeen, Vassilis Kekatos, Ming Jin
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Certified robustness is a critical property for deploying neural networks (NN) in safety-critical applications. A principled approach to achieving such guarantees is to constrain the global Lipschitz constant of the network. However, accurate methods for Lipschitz-constrained training often suffer from non-convex formulations and poor scalability due to reliance on global semidefinite programs (SDPs). In this letter, we propose a convex training framework that enforces global Lipschitz constraints via semidefinite relaxation. By reparameterizing the NN using loop transformation, we derive a convex admissibility condition that enables tractable and certifiable training. While the resulting formulation guarantees robustness, its scalability is limited by the size of global SDP. To overcome this, we develop a randomized subspace linear matrix inequalities (RS-LMI) approach that decomposes the global constraints into sketched layerwise constraints projected onto low-dimensional subspaces, yielding a smooth and memory-efficient training objective. Empirical results on MNIST, CIFAR-10, and ImageNet demonstrate that the proposed framework achieves competitive accuracy with significantly improved Lipschitz bounds and runtime performance.
[LG-6] UMA: A Family of Universal Models for Atoms
Link: https://arxiv.org/abs/2506.23971
Authors: Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick
Subjects: Machine Learning (cs.LG)
Comments: 29 pages, 5 figures
Abstract:The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.
[LG-7] Learning Constraints Directly from Network Data
Link: https://arxiv.org/abs/2506.23964
Authors: Hongyu Hè, Minhao Jin, Maria Apostolaki
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 13 pages, 15 figures
Abstract:Network data conforms to a wide range of rules that arise from protocols, design principles, and deployment decisions (e.g., a packet’s queuing delay must be less than its end-to-end delay). Formalizing such rules as logic constraints can (i) improve the quality of synthetic data, (ii) reduce the brittleness of machine learning (ML) models, and (iii) improve semantic understanding of network measurements. However, these benefits remain out of reach if rule extraction is manual or solely reliant on ML, as both approaches yield incomplete, unreliable, and/or inaccurate rules. This paper formulates rule extraction as a constraint modeling problem and introduces NetNomos that learns propositional logic constraints directly from raw network measurements. Constraint modeling in this domain is uniquely challenging due to the scale of the data, the inherent learning complexity and passive environment, and the lack of ground truth supervision. NetNomos addresses these challenges via a lattice-based search structured by constraint specificity and succinctness. Our approach reduces learning complexity from superquadratic to logarithmic and enables efficient traversal in combinatorial search space. Our evaluations on diverse network datasets show that NetNomos learns all benchmark rules, including those associated with as little as 0.01% of data points, in under three hours. In contrast, baseline methods discover less than 25% of the rules and require several days to run. Through three case studies, we show that: NetNomos (i) finds rule violations in the outputs of all seven synthetic traffic generators, hence can be used to assess and guide their generation process; (ii) detects semantic differences in traffic, hence can be used for anomaly detection; and (iii) automatically finds rules used for telemetry imputation, hence can support monitoring through inference. 
[LG-8] Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages
Link: https://arxiv.org/abs/2506.23958
Authors: Ikechukwu Ogbonna, Lesley Davidson, Soumya Banerjee, Abhishek Dasgupta, Laurence Kenney, Vikranth Harthikote Nagaraja
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 0 figures, 0 tables
Abstract:Millions of people in African countries face barriers to accessing healthcare due to language and literacy gaps. This research tackles this challenge by transforming complex medical documents – in this case, prosthetic device user manuals – into accessible formats for underserved populations. This case study in cross-cultural translation is particularly pertinent/relevant for communities that receive donated prosthetic devices but may not receive the accompanying user documentation. Or, if available online, may only be available in formats (e.g., language and readability) that are inaccessible to local populations (e.g., English-language, high resource settings/cultural context). The approach is demonstrated using the widely spoken Pidgin dialect, but our open-source framework has been designed to enable rapid and easy extension to other languages/dialects. This work presents an AI-powered framework designed to process and translate complex medical documents, e.g., user manuals for prosthetic devices, into marginalised languages. The system enables users – such as healthcare workers or patients – to upload English-language medical equipment manuals, pose questions in their native language, and receive accurate, localised answers in real time. Technically, the system integrates a Retrieval-Augmented Generation (RAG) pipeline for processing and semantic understanding of the uploaded manuals. It then employs advanced Natural Language Processing (NLP) models for generative question-answering and multilingual translation. Beyond simple translation, it ensures accessibility to device instructions, treatment protocols, and safety information, empowering patients and clinicians to make informed healthcare decisions.
[LG-9] RawMal-TF: Raw Malware Dataset Labeled by Type and Family
Link: https://arxiv.org/abs/2506.23909
Authors: David Bálik, Martin Jureček, Mark Stamp
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels. Raw binaries were collected from sources such as VirusShare, VX Underground, and MalwareBazaar, and subsequently labeled with family information parsed from binary names and type-level labels integrated from ClarAVy. The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline based on static analysis, particularly extracting features from Portable Executable headers, to support advanced classification tasks. The evaluation was focused on three key classification tasks. In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets, reaching 98.5% for type-based detection and 98.98% for family-based detection. When using truncated datasets of 1,000 samples to assess performance under limited data conditions, both models still performed strongly, achieving 97.6% for type-based detection and 98.66% for family-based detection. For interclass classification, which distinguishes between malware types or families, the models reached up to 97.5% accuracy on type-level tasks and up to 93.7% on family-level tasks. In the multiclass classification setting, which assigns samples to the correct type or family, SVM achieved 81.1% accuracy on type labels, while Random Forest and XGBoost reached approximately 73.4% on family labels. The results highlight practical trade-offs between accuracy and computational cost, and demonstrate that labeling at both the type and family levels enables more fine-grained and insightful malware classification. The work establishes a robust foundation for future research on advanced malware detection and classification.
[LG-10] Emergent musical properties of a transformer under contrastive self-supervised learning
Link: https://arxiv.org/abs/2506.23873
Authors: Yuexuan Kong, Gabriel Meseguer-Brocal, Vincent Lostanlen, Mathieu Lagrange, Romain Hennequin
Subjects: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025
Abstract:In music information retrieval (MIR), contrastive self-supervised learning for general-purpose representation models is effective for global tasks such as automatic tagging. However, for local tasks such as chord estimation, it is widely assumed that contrastively trained general-purpose self-supervised models are inadequate and that more sophisticated SSL is necessary; e.g., masked modeling. Our paper challenges this assumption by revealing the potential of contrastive SSL paired with a transformer in local MIR tasks. We consider a lightweight vision transformer with one-dimensional patches in the time–frequency domain (ViT-1D) and train it with simple contrastive SSL through normalized temperature-scaled cross-entropy loss (NT-Xent). Although NT-Xent operates only over the class token, we observe that, potentially thanks to weight sharing, informative musical properties emerge in ViT-1D’s sequence tokens. On global tasks, the temporal average of class and sequence tokens offers a performance increase compared to the class token alone, showing useful properties in the sequence tokens. On local tasks, sequence tokens perform unexpectedly well, despite not being specifically trained for. Furthermore, high-level musical features such as onsets emerge from layer-wise attention maps and self-similarity matrices show different layers capture different musical dimensions. Our paper does not focus on improving performance but advances the musical interpretation of transformers and sheds light on some overlooked abilities of contrastive SSL paired with transformers for sequence modeling in MIR.
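The NT-Xent loss named in the abstract can be sketched in a few lines. This is a minimal plain-Python version of the commonly used formulation (cosine similarity, temperature scaling, cross-entropy over all other views in the batch); the function names, the default temperature, and the assumption that embeddings are non-zero vectors are illustrative and not taken from the paper.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Average NT-Xent loss over a batch.

    views_a[i] and views_b[i] are embeddings of two augmentations of item i
    (for example, class-token outputs). Every other embedding in the batch
    acts as a negative for item i.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same item
        logits = {
            k: _cosine(z[i], z[k]) / temperature
            for k in range(2 * n)
            if k != i  # a view is never compared with itself
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)
```

The loss is small when the two views of each item are closer to each other than to any other embedding in the batch, and grows when the pairing is scrambled.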
[LG-11] When Plants Respond: Electrophysiology and Machine Learning for Green Monitoring Systems
Link: https://arxiv.org/abs/2506.23872
Authors: Eduard Buss, Till Aust, Heiko Hamann
Subjects: Machine Learning (cs.LG)
Comments: Submitted and Accepted at the 14th international conference on biomimetic and biohybrid systems (Living Machines)
Abstract:Living plants, while contributing to ecological balance and climate regulation, also function as natural sensors capable of transmitting information about their internal physiological states and surrounding conditions. This rich source of data provides potential for applications in environmental monitoring and precision agriculture. With integration into biohybrid systems, we establish novel channels of physiological signal flow between living plants and artificial devices. We equipped Hedera helix with a plant-wearable device called PhytoNode to continuously record the plant’s electrophysiological activity. We deployed plants in an uncontrolled outdoor environment to map electrophysiological patterns to environmental conditions. Over five months, we collected data that we analyzed using state-of-the-art and automated machine learning (AutoML). Our classification models achieve high performance, reaching macro F1 scores of up to 95 percent in binary tasks. AutoML approaches outperformed manual tuning, and selecting subsets of statistical features further improved accuracy. Our biohybrid living system monitors the electrophysiology of plants in harsh, real-world conditions. This work advances scalable, self-sustaining, and plant-integrated living biohybrid systems for sustainable environmental monitoring.
[LG-12] EFPI: Elastic Formation and Position Identification in Football (Soccer) using Template Matching and Linear Assignment
Link: https://arxiv.org/abs/2506.23843
Authors: Joris Bekkers
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Understanding team formations and player positioning is crucial for tactical analysis in football (soccer). This paper presents a flexible method for formation recognition and player position assignment in football using predefined static formation templates and cost minimization from spatiotemporal tracking data, called EFPI. Our approach employs linear sum assignment to optimally match players to positions within a set of template formations by minimizing the total distance between actual player locations and template positions, subsequently selecting the formation with the lowest assignment cost. To improve accuracy, we scale actual player positions to match the dimensions of these formation templates in both width and length. While the method functions effectively on individual frames, it extends naturally to larger game segments such as complete periods, possession sequences or specific intervals (e.g. 10 second intervals, 5 minute intervals etc.). Additionally, we incorporate an optional stability parameter that prevents unnecessary formation changes when assignment costs differ only marginally between time segments. EFPI is available as open-source code through the unravelsports Python package.
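The matching step described in the abstract (assign players to template positions at minimum total distance, then pick the cheapest template) can be illustrated with a toy sketch. Everything below is an assumption for illustration: the two four-player templates are made up, both players and templates are min-max scaled to the unit square as a stand-in for the paper's scaling, and the assignment is solved by brute force over permutations rather than by a linear sum assignment solver. It does not use or reproduce the unravelsports API.

```python
import itertools
import math

# Hypothetical four-player templates (x = pitch length, y = pitch width).
TEMPLATES = {
    "diamond": [(0.2, 0.5), (0.5, 0.2), (0.5, 0.8), (0.8, 0.5)],
    "box": [(0.3, 0.3), (0.3, 0.7), (0.7, 0.3), (0.7, 0.7)],
}


def rescale(points):
    """Min-max scale coordinates to the unit square in both dimensions."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    span_x = (max(xs) - min(xs)) or 1.0
    span_y = (max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / span_x, (y - min(ys)) / span_y) for x, y in points]


def best_assignment(players, template):
    """Return (cost, perm) where player i is matched to template slot perm[i].

    Brute force over all permutations; only practical for very small inputs.
    """
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(len(template))):
        cost = sum(math.dist(players[i], template[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


def identify_formation(players, templates=TEMPLATES):
    """Pick the template with the lowest total assignment cost."""
    scaled = rescale(players)
    results = {name: best_assignment(scaled, rescale(t)) for name, t in templates.items()}
    name = min(results, key=lambda key: results[key][0])
    return name, results[name][1]
```

For eleven players per side the brute-force search is far too slow, which is where a polynomial-time linear sum assignment solver becomes necessary.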
[LG-13] SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration
Link: https://arxiv.org/abs/2506.23803
Authors: Dmitry Kovalev
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix smoothness and noise assumptions. This allows us to recover state-of-the-art convergence results for several popular adaptive gradient methods, including AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo. In addition, we establish the fundamental connection between two recently proposed algorithms, Scion and DASGO, and provide the first theoretical guarantees for the latter. Second, we show that the convergence of methods like AdaGrad and DASGO can be provably accelerated beyond the best-known rates using Nesterov momentum. Consequently, we obtain the first theoretical justification that AdaGrad-type algorithms can simultaneously benefit from both diagonal preconditioning and momentum, which may provide an ultimate explanation for the practical efficiency of Adam.
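To make the AdaGrad-type preconditioning concrete, the sketch below contrasts the standard AdaGrad-Norm update (one scalar step size from the accumulated gradient norms) with diagonal AdaGrad (one step size per coordinate). It uses exact gradients for simplicity, whereas the paper analyses stochastic gradients; the function names, learning rate, and step count are assumptions, and the momentum acceleration studied in the paper is not shown.

```python
import math


def adagrad_norm(grad, x0, lr=1.0, steps=200, eps=1e-12):
    """AdaGrad-Norm: a single step size shared by all coordinates."""
    x = list(x0)
    accum = 0.0  # running sum of squared gradient norms
    for _ in range(steps):
        g = grad(x)
        accum += sum(gi * gi for gi in g)
        scale = lr / (math.sqrt(accum) + eps)
        x = [xi - scale * gi for xi, gi in zip(x, g)]
    return x


def adagrad_diagonal(grad, x0, lr=1.0, steps=200, eps=1e-12):
    """Diagonal AdaGrad: one running sum, and so one step size, per coordinate."""
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        x = [xi - lr * gi / (math.sqrt(a) + eps) for xi, gi, a in zip(x, g, accum)]
    return x
```

On a badly scaled quadratic, the diagonal variant normalises each coordinate separately, while AdaGrad-Norm is limited by the steepest direction. This is the kind of anisotropy the unified analysis is meant to capture.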
[LG-14] Adaptive Out-of-Control Point Pattern Detection in Sequential Random Finite Set Observations
Link: https://arxiv.org/abs/2506.23802
Authors: Konstantinos Bourazas, Savvas Papaioannou, Panayiotis Kolios
Subjects: Machine Learning (cs.LG)
Comments: 23rd European Control Conference (ECC 2025), Thessaloniki, Greece, 24-27 June 2025
Abstract:In this work we introduce a novel adaptive anomaly detection framework specifically designed for monitoring sequential random finite set (RFS) observations. Our approach effectively distinguishes between In-Control data (normal) and Out-Of-Control data (anomalies) by detecting deviations from the expected statistical behavior of the process. The primary contributions of this study include the development of an innovative RFS-based framework that not only learns the normal behavior of the data-generating process online but also dynamically adapts to behavioral shifts to accurately identify abnormal point patterns. To achieve this, we introduce a new class of RFS-based posterior distributions, named Power Discounting Posteriors (PD), which facilitate adaptation to systematic changes in data while enabling anomaly detection of point pattern data through a novel predictive posterior density function. The effectiveness of the proposed approach is demonstrated by extensive qualitative and quantitative simulation experiments.
[LG-15] Towards the Training of Deeper Predictive Coding Neural Networks
Link: https://arxiv.org/abs/2506.23800
Authors: Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 7 figures
Abstract:Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy minimization process. Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers. In this work, we show that the reason behind this degradation is due to exponentially imbalanced errors between layers during weight updates, and predictions from the previous layer not being effective in guiding updates in deeper layers. We address the first issue by introducing two novel methods to optimize the latent variables that use precision-weighting to re-balance the distribution of energy among layers during the 'relaxation phase', and the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers. Empirically, we test our methods on a large number of image classification tasks, resulting in large improvements in test accuracy across networks with more than seven layers, with performances comparable to those of backprop on similar models. These findings suggest that a better understanding of the relaxation phase is important to train models using equilibrium propagation at scale, and open new possibilities for their application in complex tasks.
[LG-16] KAIROS: Scalable Model-Agnostic Data Valuation
Link: https://arxiv.org/abs/2506.23799
Authors: Jiongli Zhu, Parjanya Prajakta Prashant, Alex Cloninger, Babak Salimi
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 9 figures
Abstract:Training data increasingly shapes not only model accuracy but also regulatory compliance and market valuation of AI assets. Yet existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches such as Data Shapley require costly retrainings at web scale. Recent Wasserstein-based model-agnostic methods rely on approximations that misrank examples relative to their true leave-one-out (LOO) utility. We introduce KAIROS, a scalable, model-agnostic valuation framework that assigns each example a distributional influence score: its contribution to the Maximum Mean Discrepancy (MMD) between the empirical training distribution and a clean reference set. Unlike Wasserstein surrogates, our MMD-based influence admits a closed-form solution that faithfully approximates the exact LOO ranking within O(1/N^2) error, requires no retraining, and naturally extends to conditional kernels for unified label- and feature-error detection. Moreover, KAIROS supports efficient online updates: when a new batch of size m arrives, all scores can be updated in O(mN) time, delivering up to 50x speedup without compromising ranking quality. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art model-, Shapley-, and Wasserstein-based baselines in both accuracy and runtime. We provide rigorous theoretical guarantees, including symmetry for reproducible rankings and density-separation for interpretable thresholds.
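As a point of reference for the leave-one-out (LOO) ranking that the abstract says KAIROS approximates, the sketch below computes the exact LOO change in squared MMD by brute-force recomputation. It is not the paper's closed-form score: the one-dimensional data, the RBF kernel, the bandwidth, the biased MMD estimator, and all function names are assumptions chosen to keep the example short.

```python
import math


def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(sample, reference, bandwidth=1.0):
    """Biased estimate of squared MMD between two one-dimensional samples."""
    n, m = len(sample), len(reference)
    k_xx = sum(rbf(a, b, bandwidth) for a in sample for b in sample) / (n * n)
    k_yy = sum(rbf(a, b, bandwidth) for a in reference for b in reference) / (m * m)
    k_xy = sum(rbf(a, b, bandwidth) for a in sample for b in reference) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy


def loo_influence(train, reference, bandwidth=1.0):
    """Score each training point by how much removing it lowers the MMD.

    A positive score means the point pulls the training distribution away
    from the clean reference set.
    """
    full = mmd_squared(train, reference, bandwidth)
    return [
        full - mmd_squared(train[:i] + train[i + 1:], reference, bandwidth)
        for i in range(len(train))
    ]
```

Recomputing the MMD for every removed point costs roughly N times more than a single evaluation, which is the kind of overhead a closed-form influence score avoids.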
[LG-17] Model-driven Stochastic Trace Clustering
Link: https://arxiv.org/abs/2506.23776
Authors: Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Process discovery algorithms automatically extract process models from event logs, but high variability often results in complex and hard-to-understand models. To mitigate this issue, trace clustering techniques group process executions into clusters, each represented by a simpler and more understandable process model. Model-driven trace clustering improves on this by assigning traces to clusters based on their conformity to cluster-specific process models. However, most existing clustering techniques rely on either no process model discovery, or non-stochastic models, neglecting the frequency or probability of activities and transitions, thereby limiting their capability to capture real-world execution dynamics. We propose a novel model-driven trace clustering method that optimizes stochastic process models within each cluster. Our approach uses entropic relevance, a stochastic conformance metric based on directly-follows probabilities, to guide trace assignment. This allows clustering decisions to consider both structural alignment with a cluster’s process model and the likelihood that a trace originates from a given stochastic process model. The method is computationally efficient, scales linearly with input size, and improves model interpretability by producing clusters with clearer control-flow patterns. Extensive experiments on public real-life datasets show that our method outperforms existing alternatives in representing process behavior and reveals how clustering performance rankings can shift when stochasticity is considered.
[LG-18] Training of Spiking Neural Networks with Expectation-Propagation
Link: https://arxiv.org/abs/2506.23757
Authors: Dan Yao, Steve McLaughlin, Yoann Altmann
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 10 pages
Abstract:In this paper, we propose a unifying message-passing framework for training spiking neural networks (SNNs) using Expectation-Propagation. Our gradient-free method is capable of learning the marginal distributions of network parameters and simultaneously marginalizes nuisance parameters, such as the outputs of hidden layers. This framework allows for the first time, training of discrete and continuous weights, for deterministic and stochastic spiking networks, using batches of training samples. Although its convergence is not ensured, the algorithm converges in practice faster than gradient-based methods, without requiring a large number of passes through the training data. The classification and regression results presented pave the way for new efficient training methods for deep Bayesian networks.
[LG-19] Geminet: Learning the Duality-based Iterative Process for Lightweight Traffic Engineering in Changing Topologies
Link: https://arxiv.org/abs/2506.23640
Authors: Ximeng Liu, Shizhen Zhao, Xinbing Wang
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:
Abstract:Recently, researchers have explored ML-based Traffic Engineering (TE), leveraging neural networks to solve TE problems traditionally addressed by optimization. However, existing ML-based TE schemes remain impractical: they either fail to handle topology changes or suffer from poor scalability due to excessive computational and memory overhead. To overcome these limitations, we propose Geminet, a lightweight and scalable ML-based TE framework that can handle changing topologies. Geminet is built upon two key insights: (i) a methodology that decouples neural networks from topology by learning an iterative gradient-descent-based adjustment process, as the update rule of gradient descent is topology-agnostic, relying only on a few gradient-related quantities; (ii) shifting optimization from path-level routing weights to edge-level dual variables, reducing memory consumption by leveraging the fact that edges are far fewer than paths. Evaluations on WAN and data center datasets show that Geminet significantly improves scalability. Its neural network size is only 0.04% to 7% of existing schemes, while handling topology variations as effectively as HARP, a state-of-the-art ML-based TE approach, without performance degradation. When trained on large-scale topologies, Geminet consumes under 10 GiB of memory, more than eight times less than the 80-plus GiB required by HARP, while achieving 5.45 times faster convergence speed, demonstrating its potential for large-scale deployment.
[LG-20] Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning ASIACCS25
Link: https://arxiv.org/abs/2506.23583
Authors: Marvin Xhemrishi, Alexandre Graell i Amat, Balázs Pejó
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: The shorter version is accepted at FL-AsiaCCS 25
Abstract:Federated learning with secure aggregation enables private and collaborative learning from decentralised data without leaking sensitive client information. However, secure aggregation also complicates the detection of malicious client behaviour and the evaluation of individual client contributions to the learning. To address these challenges, QI (Pejo et al.) and FedGT (Xhemrishi et al.) were proposed for contribution evaluation (CE) and misbehaviour detection (MD), respectively. QI, however, lacks adequate MD accuracy due to its reliance on the random selection of clients in each training round, while FedGT lacks the CE ability. In this work, we combine the strengths of QI and FedGT to achieve both robust MD and accurate CE. Our experiments demonstrate superior performance compared to using either method independently.
[LG-21] A unified framework on the universal approximation of transformer-type architectures
Link: https://arxiv.org/abs/2506.23551
Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.
[LG-22] Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size
Link: https://arxiv.org/abs/2506.23544
Authors: Kento Imaizumi, Hideaki Iiduka
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:Momentum methods were originally introduced for their superiority to stochastic gradient descent (SGD) in deterministic settings with convex objective functions. However, despite their widespread application to deep neural networks – a representative case of stochastic nonconvex optimization – the theoretical justification for their effectiveness in such settings remains limited. Quasi-hyperbolic momentum (QHM) is an algorithm that generalizes various momentum methods and has been studied to better understand the class of momentum-based algorithms as a whole. In this paper, we provide both asymptotic and non-asymptotic convergence results for mini-batch QHM with an increasing batch size. We show that achieving asymptotic convergence requires either a decaying learning rate or an increasing batch size. Since a decaying learning rate adversely affects non-asymptotic convergence, we demonstrate that using mini-batch QHM with an increasing batch size – without decaying the learning rate – can be a more effective strategy. Our experiments show that even a finite increase in batch size can provide benefits for training neural networks.
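As a rough illustration of the setting described in this abstract, the sketch below runs mini-batch QHM with a constant learning rate and a growing batch size. It uses the commonly stated QHM update (a momentum buffer with decay `beta`, then a `nu`-weighted average of the plain gradient and the buffer). The toy objective f(θ) = ½θ², the noise model, the batch-doubling schedule, and all hyperparameter values are assumptions made for illustration and are not taken from the paper.

```python
import random


def qhm_step(theta, momentum, grad, lr=0.1, beta=0.9, nu=0.7):
    """One QHM update: nu = 0 recovers plain SGD, nu = 1 recovers normalized momentum."""
    momentum = beta * momentum + (1.0 - beta) * grad
    theta = theta - lr * ((1.0 - nu) * grad + nu * momentum)
    return theta, momentum


def minibatch_grad(theta, batch_size, rng):
    """Noisy mini-batch gradient of the toy objective f(theta) = 0.5 * theta**2.

    Each sample contributes the true gradient (theta) plus unit-variance noise,
    so the noise in the averaged gradient shrinks as the batch grows.
    """
    return sum(theta + rng.gauss(0.0, 1.0) for _ in range(batch_size)) / batch_size


def run_qhm(steps=200, initial_batch=4, growth_every=50, seed=0):
    """Run QHM with a constant learning rate and a batch size doubled periodically."""
    rng = random.Random(seed)
    theta, momentum, batch = 5.0, 0.0, initial_batch
    for t in range(steps):
        if t > 0 and t % growth_every == 0:
            batch *= 2  # hypothetical schedule, chosen only for this sketch
        grad = minibatch_grad(theta, batch, rng)
        theta, momentum = qhm_step(theta, momentum, grad)
    return theta, batch
```

Calling `run_qhm()` returns the final iterate and the final batch size. The point of the sketch is only that the gradient noise is reduced by enlarging the batch rather than by decaying `lr`.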
[LG-23] Reconciling Attribute and Structural Anomalies for Improved Graph Anomaly Detection
Link: https://arxiv.org/abs/2506.23469
Authors: Chunjing Xiao, Jiahui Lu, Xovee Xu, Fan Zhou, Tianshu Xie, Wei Lu, Lifeng Xu
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS); DOI: this https URL
Abstract:Graph anomaly detection is critical in domains such as healthcare and economics, where identifying deviations can prevent substantial losses. Existing unsupervised approaches strive to learn a single model capable of detecting both attribute and structural anomalies. However, they confront the tug-of-war problem between two distinct types of anomalies, resulting in suboptimal performance. This work presents TripleAD, a mutual distillation-based triple-channel graph anomaly detection framework. It includes three estimation modules to identify the attribute, structural, and mixed anomalies while mitigating the interference between different types of anomalies. In the first channel, we design a multiscale attribute estimation module to capture extensive node interactions and ameliorate the over-smoothing issue. To better identify structural anomalies, we introduce a link-enhanced structure estimation module in the second channel that facilitates information flow to topologically isolated nodes. The third channel is powered by an attribute-mixed curvature, a new indicator that encapsulates both attribute and structural information for discriminating mixed anomalies. Moreover, a mutual distillation strategy is introduced to encourage communication and collaboration between the three channels. Extensive experiments demonstrate the effectiveness of the proposed TripleAD model against strong baselines.
[LG-24] Neuro-Informed Joint Learning Enhances Cognitive Workload Decoding in Portable BCIs
Link: https://arxiv.org/abs/2506.23458
Authors: Xiaoxiao Yang, Chan Feng, Jiancheng Chen
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 2 pages short paper
Abstract:Portable and wearable consumer-grade electroencephalography (EEG) devices, like Muse headbands, offer unprecedented mobility for daily brain-computer interface (BCI) applications, including cognitive load detection. However, the exacerbated non-stationarity in portable EEG signals constrains data fidelity and decoding accuracy, creating a fundamental trade-off between portability and performance. To mitigate this limitation, we propose MuseCogNet (Muse-based Cognitive Network), a unified joint learning framework integrating self-supervised and supervised training paradigms. In particular, we introduce an EEG-grounded self-supervised reconstruction loss based on average pooling to capture robust neurophysiological patterns, while cross-entropy loss refines task-specific cognitive discriminants. This joint learning framework resembles the bottom-up and top-down attention in humans, enabling MuseCogNet to significantly outperform state-of-the-art methods on a publicly available Muse dataset and establish an implementable pathway for neurocognitive monitoring in ecological settings.
[LG-25] Enhancing Insider Threat Detection Using User-Based Sequencing and Transformer Encoders
Link: https://arxiv.org/abs/2506.23446
Authors: Mohamed Elbasheer, Adewale Akinfaderin
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Insider threat detection presents unique challenges due to the authorized status of malicious actors and the subtlety of anomalous behaviors. Existing machine learning methods often treat user activity as isolated events, thereby failing to leverage sequential dependencies in user behavior. In this study, we propose a User-Based Sequencing (UBS) methodology, transforming the CERT insider threat dataset into structured temporal sequences suitable for deep sequential modeling. We deploy a Transformer Encoder architecture to model benign user activity and employ its reconstruction errors as anomaly scores. These scores are subsequently evaluated using three unsupervised outlier detection algorithms: One-Class SVM (OCSVM), Local Outlier Factor (LOF), and Isolation Forest (iForest). Across four rigorously designed test sets, including combinations of multiple CERT dataset releases, our UBS-Transformer pipeline consistently achieves state-of-the-art performance - notably 96.61% accuracy, 99.43% recall, 96.38% F1-score, 95.00% AUROC, and exceptionally low false negative (0.0057) and false positive (0.0571) rates. Comparative analyses demonstrate that our approach substantially outperforms tabular and conventional autoencoder baselines, underscoring the efficacy of sequential user modeling and advanced anomaly detection in the insider threat domain.
[LG-26] Do LLMs Dream of Discrete Algorithms?
Link: https://arxiv.org/abs/2506.23408
Authors: Claudionor Coelho Jr, Yanen Li, Philip Tee
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:
Abstract:Large Language Models (LLMs) have rapidly transformed the landscape of artificial intelligence, enabling natural language interfaces and dynamic orchestration of software components. However, their reliance on probabilistic inference limits their effectiveness in domains requiring strict logical reasoning, discrete decision-making, and robust interpretability. This paper investigates these limitations and proposes a neurosymbolic approach that augments LLMs with logic-based reasoning modules, particularly leveraging Prolog predicates and composable toolsets. By integrating first-order logic and explicit rule systems, our framework enables LLMs to decompose complex queries into verifiable sub-tasks, orchestrate reliable solutions, and mitigate common failure modes such as hallucination and incorrect step decomposition. We demonstrate the practical benefits of this hybrid architecture through experiments on the DABStep benchmark, showing improved precision, coverage, and system documentation in multi-step reasoning tasks. Our results indicate that combining LLMs with modular logic reasoning restores engineering rigor, enhances system reliability, and offers a scalable path toward trustworthy, interpretable AI agents across complex domains.
[LG-27] When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery
Link: https://arxiv.org/abs/2506.23374
Authors: Dominik Meier, Sujai Hiremath, Promit Ghosal, Kyra Gan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Distinguishing cause and effect from bivariate observational data is a foundational problem in many disciplines, but challenging without additional assumptions. Additive noise models (ANMs) are widely used to enable sample-efficient bivariate causal discovery. However, conventional ANM-based methods fail when unobserved mediators corrupt the causal relationship between variables. This paper makes three key contributions: first, we rigorously characterize why standard ANM approaches break down in the presence of unmeasured mediators. Second, we demonstrate that prior solutions for hidden mediation are brittle in finite sample settings, limiting their practical utility. To address these gaps, we propose Bivariate Denoising Diffusion (BiDD) for causal discovery, a method designed to handle latent noise introduced by unmeasured mediators. Unlike prior methods that infer directionality through mean squared error loss comparisons, our approach introduces a novel independence test statistic: during the noising and denoising processes for each variable, we condition on the other variable as input and evaluate the independence of the predicted noise relative to this input. We prove asymptotic consistency of BiDD under the ANM, and conjecture that it performs well under hidden mediation. Experiments on synthetic and real-world data demonstrate consistent performance, outperforming existing methods in mediator-corrupted settings while maintaining strong performance in mediator-free settings.
[LG-28] A case for data valuation transparency via DValCards
Link: https://arxiv.org/abs/2506.23349
Authors: Keziah Naggita, Julienne LaChance
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Following the rise in popularity of data-centric machine learning (ML), various data valuation methods have been proposed to quantify the contribution of each datapoint to desired ML model performance metrics (e.g., accuracy). Beyond the technical applications of data valuation methods (e.g., data cleaning, data acquisition, etc.), it has been suggested that within the context of data markets, data buyers might utilize such methods to fairly compensate data owners. Here we demonstrate that data valuation metrics are inherently biased and unstable under simple algorithmic design choices, resulting in both technical and ethical implications. By analyzing 9 tabular classification datasets and 6 data valuation methods, we illustrate how (1) common and inexpensive data pre-processing techniques can drastically alter estimated data values; (2) subsampling via data valuation metrics may increase class imbalance; and (3) data valuation metrics may undervalue underrepresented group data. Consequently, we argue in favor of increased transparency associated with data valuation in-the-wild and introduce the novel Data Valuation Cards (DValCards) framework towards this aim. The proliferation of DValCards will reduce misuse of data valuation metrics, including in data pricing, and build trust in responsible ML systems.
[LG-29] Data-Driven Self-Supervised Learning for the Discovery of Solution Singularity for Partial Differential Equations
Link: https://arxiv.org/abs/2506.23344
Authors: Difeng Cai, Paulina Sepúlveda
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The appearance of singularities in the function of interest constitutes a fundamental challenge in scientific computing. It can significantly undermine the effectiveness of numerical schemes for function approximation, numerical integration, and the solution of partial differential equations (PDEs), etc. The problem becomes more sophisticated if the location of the singularity is unknown, which is often encountered in solving PDEs. Detecting the singularity is therefore critical for developing efficient adaptive methods to reduce computational costs in various applications. In this paper, we consider singularity detection in a purely data-driven setting. Namely, the input only contains given data, such as the vertex set from a mesh. To overcome the limitation of the raw unlabeled data, we propose a self-supervised learning (SSL) framework for estimating the location of the singularity. A key component is a filtering procedure as the pretext task in SSL, where two filtering methods are presented, based on k nearest neighbors and kernel density estimation, respectively. We provide numerical examples to illustrate the potential pathological or inaccurate results due to the use of raw data without filtering. Various experiments are presented to demonstrate the ability of the proposed approach to deal with input perturbation, label corruption, and different kinds of singularities such as an interior circle, a boundary layer, concentric semicircles, etc.
[LG-30] Learning to Rank with Variable Result Presentation Lengths SIGIR2025
Link: https://arxiv.org/abs/2506.23319
Authors: Norman Knyazev, Harrie Oosterhuis
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: SIGIR 2025
Abstract:Learning to Rank (LTR) methods generally assume that each document in a top-K ranking is presented in an equal format. However, previous work has shown that users’ perceptions of relevance can be changed by varying presentations, i.e., allocating more vertical space to some documents to provide additional textual or image information. Furthermore, presentation length can also redirect attention, as users are more likely to notice longer presentations when scrolling through results. Deciding on the document presentation lengths in a fixed vertical space ranking is an important problem that has not been addressed by existing LTR methods. We address this gap by introducing the variable presentation length ranking task, where the ordering of documents and their presentation lengths are decided simultaneously. Despite being a generalization of standard ranking, we show that this setting brings significant new challenges: Firstly, the probability ranking principle no longer applies to this setting, and secondly, the problem cannot be divided into separate ordering and length selection tasks. We therefore propose VLPL - a new family of Plackett-Luce list-wise gradient estimation methods for the joint optimization of document ordering and lengths. Our semi-synthetic experiments show that VLPL can effectively balance the expected exposure and attractiveness of all documents, achieving the best performance across different ranking settings. Furthermore, we observe that even simple length-aware methods can achieve significant performance improvements over fixed-length models. Altogether, our theoretical and empirical results highlight the importance and difficulties of combining document presentation with LTR.
Cite as: arXiv:2506.23319 [cs.IR]; DOI: https://doi.org/10.48550/arXiv.2506.23319; Related DOI: https://doi.org/10.1145/3726302.3730020
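The abstract builds on Plackett-Luce (PL) ranking models. As a minimal sketch of the two basic ingredients that PL-based gradient estimators typically rely on, the code below samples a top-k ranking from a PL model and computes the log-probability of a given ranking. It covers plain PL over document order only. It does not model presentation lengths and is not the VLPL method itself. The function names and example scores are hypothetical.

```python
import math
import random


def sample_pl_ranking(scores, k, rng):
    """Sample a top-k ranking (k <= len(scores)) from a Plackett-Luce model.

    At each position, one of the remaining documents is drawn with
    probability proportional to exp(score).
    """
    remaining = list(range(len(scores)))
    ranking = []
    for _ in range(k):
        top = max(scores[i] for i in remaining)  # subtract max for numerical stability
        weights = [math.exp(scores[i] - top) for i in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(chosen)
        remaining.remove(chosen)
    return ranking


def pl_log_prob(scores, ranking):
    """Log-probability of a (possibly partial) ranking under Plackett-Luce."""
    remaining = set(range(len(scores)))
    log_prob = 0.0
    for doc in ranking:
        top = max(scores[i] for i in remaining)
        log_norm = top + math.log(sum(math.exp(scores[i] - top) for i in remaining))
        log_prob += scores[doc] - log_norm
        remaining.remove(doc)
    return log_prob
```

A score-function (REINFORCE-style) estimator would weight the gradient of `pl_log_prob` by the reward of each sampled ranking. That step is omitted here because it depends on the reward definition.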
[LG-31] Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis
Link: https://arxiv.org/abs/2506.23287
Authors: Zelin Zang, WenZhe Li, Fei Chen, Yongjie Xu, Chang Yu, Zhen Lei, Stan Z. Li
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 9 pages, 6 figures, under review
Abstract:In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding complex biological processes. Key to this is the modeling and generation of hierarchical data that represents the intrinsic structure within datasets. Traditional methods face limitations in terms of computational cost, performance, generative capacity, and stability. Recent VAE-based approaches have made strides in addressing these challenges but still require specialized network modules for each tree branch, limiting their stability and ability to capture deep hierarchical relationships. To overcome these challenges, we introduce a diffusion-based approach called HDTree. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes to model tree node transitions. This method improves stability by eliminating branch-specific modules and enhances generative capacity through gradual hierarchical changes simulated by the diffusion process. HDTree’s effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in terms of accuracy and performance. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths and offering insights for downstream biological tasks. The code of HDTree is available at anonymous link this https URL.
[LG-32] BAPE: Learning an Explicit Bayes Classifier for Long-tailed Visual Recognition
Link: https://arxiv.org/abs/2506.23280
Authors: Chaoqun Du, Yulin Wang, Shiji Song, Gao Huang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Bayesian decision theory advocates the Bayes classifier as the optimal approach for minimizing the risk in machine learning problems. Current deep learning algorithms usually solve for the optimal classifier by implicitly estimating the posterior probabilities, e.g., by minimizing the Softmax cross-entropy loss. This simple methodology has been proven effective for meticulously balanced academic benchmark datasets. However, it is not applicable to the long-tailed data distributions in the real world, where it leads to the gradient imbalance issue and fails to ensure the Bayes optimal decision rule. To address these challenges, this paper presents a novel approach (BAPE) that provides a more precise theoretical estimation of the data distributions by explicitly modeling the parameters of the posterior probabilities and solving them with point estimation. Consequently, our method directly learns the Bayes classifier without gradient descent based on Bayes’ theorem, simultaneously alleviating the gradient imbalance and ensuring the Bayes optimal decision rule. Furthermore, we propose a straightforward yet effective distribution adjustment technique. This method enables the Bayes classifier trained from the long-tailed training set to effectively adapt to the test data distribution with an arbitrary imbalance factor, thereby enhancing performance without incurring additional computational costs. In addition, we demonstrate the gains of our method are orthogonal to existing learning approaches for long-tailed scenarios, as they are mostly designed under the principle of implicitly estimating the posterior probabilities. Extensive empirical evaluations on CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist demonstrate that our method significantly improves the generalization performance of popular deep networks, despite its simplicity.
[LG-33] Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
Link: https://arxiv.org/abs/2506.23266
Authors: Lujun Li, Zhu Qiyuan, Jiacheng Wang, Wei Li, Hao Gu, Sirui Han, Yike Guo
Categories: Machine Learning (cs.LG)
Comments: Work in progress, revisions ongoing
Abstract:Mixture of Experts (MoE) LLMs face significant obstacles due to their massive parameter scale, which imposes memory, storage, and deployment challenges. Although recent expert merging methods promise greater efficiency by consolidating multiple experts, they are fundamentally hindered by parameter conflicts arising from expert specialization. In this paper, we present Sub-MoE, a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on concatenated expert weights, reducing conflicting parameters by extracting shared U-matrices while enabling effective merging of the expert-specific V components. Specifically, Sub-MoE consists of two innovative phases: (1) Adaptive Expert Clustering, which groups functionally coherent experts via K-means clustering based on cosine similarity of expert outputs; and (2) Subspace Expert Merging, which first enforces Experts Union Decomposition to derive the shared U-matrix across experts in the same group, then pursues frequency-based merging for individual V-matrices, and finalizes expert reconstruction using the merged V-matrix. In this way, we align and fuse experts in a shared subspace, and the approach can be extended with intra-expert compression for further inference optimization. Extensive experiments on Mixtral, DeepSeek, and Qwen-1.5|3 MoE LLMs demonstrate that our Sub-MoE significantly outperforms existing expert pruning and merging methods. Notably, our Sub-MoE maintains 96%|86% of original performance with 25%|50% expert reduction on Mixtral-8x7B in zero-shot benchmarks. Code will be released at this https URL.
[LG-34] External Data-Enhanced Meta-Representation for Adaptive Probabilistic Load Forecasting
Link: https://arxiv.org/abs/2506.23201
Authors: Haoran Li, Muhao Guo, Marija Ilic, Yang Weng, Guangchun Ruan
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 10 pages
Abstract:Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra input, ignoring their heterogeneity, and thus limiting the extraction of useful external information. We propose a paradigm shift: external data should serve as meta-knowledge to dynamically adapt the forecasting model itself. Based on this idea, we design a meta-representation framework using hypernetworks that modulate selected parameters of a base Deep Learning (DL) model in response to external conditions. This provides both expressivity and adaptability. We further integrate a Mixture-of-Experts (MoE) mechanism to enhance efficiency through selective expert activation, while improving robustness by filtering redundant external inputs. The resulting model, dubbed as a Meta Mixture of Experts for External data (M2oE2), achieves substantial improvements in accuracy and robustness with limited additional overhead, outperforming existing state-of-the-art methods in diverse load datasets. The dataset and source code are publicly available at this https URL.
[LG-35] Efficient Algorithms for Learning and Compressing Monophonic Halfspaces in Graphs
Link: https://arxiv.org/abs/2506.23186
Authors: Marco Bressan, Victor Chepoi, Emmanuel Esposito, Maximilian Thiessen
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Combinatorics (math.CO); Machine Learning (stat.ML)
Comments:
Abstract:Abstract notions of convexity over the vertices of a graph, and corresponding notions of halfspaces, have recently gained attention from the machine learning community. In this work we study monophonic halfspaces, a notion of graph halfspaces defined through closure under induced paths. Our main result is a 2-satisfiability based decomposition theorem, which allows one to represent monophonic halfspaces as a disjoint union of certain vertex subsets. Using this decomposition, we achieve efficient and (nearly) optimal algorithms for various learning problems, such as teaching, active, and online learning. Most notably, we obtain a polynomial-time algorithm for empirical risk minimization. Independently of the decomposition theorem, we obtain an efficient, stable, and proper sample compression scheme. This makes monophonic halfspaces efficiently learnable with proper learners and linear error rate $1/\varepsilon$ in the realizable PAC setting. Our results answer open questions from the literature, and show a stark contrast with geodesic halfspaces, for which most of the said learning problems are NP-hard.
[LG-36] Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data
Link: https://arxiv.org/abs/2506.23182
Authors: Robert Frank, Michael Widrich, Rahmad Akbar, Günter Klambauer, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.
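GAMA is described as building on Integrated Gradients (IG). The sketch below shows only the underlying IG computation on a toy model, approximating the path integral from a baseline to the input with a midpoint Riemann sum. The logistic model, its weights, and the all-zeros baseline are assumptions chosen for illustration. The paper's adaptation to autoregressive sequence models is not reproduced here.

```python
import math

# Hypothetical toy model: logistic regression with fixed weights.
WEIGHTS = [1.5, -2.0, 0.0]


def model(x):
    """Sigmoid of a linear score."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate IG attributions with a midpoint Riemann sum along the straight path."""
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            totals[i] += g
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]
```

A useful sanity check is the completeness property. The attributions returned by `integrated_gradients(model_grad, x, baseline)` should sum approximately to `model(x) - model(baseline)`, and a feature with zero weight should receive zero attribution.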
[LG-37] Compositions of Variant Experts for Integrating Short-Term and Long-Term Preferences
Link: https://arxiv.org/abs/2506.23170
Authors: Jaime Hieu Do, Trung-Hoang Le, Hady W. Lauw
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:In the online digital realm, recommendation systems are ubiquitous and play a crucial role in enhancing user experience. These systems leverage user preferences to provide personalized recommendations, thereby helping users navigate through the paradox of choice. This work focuses on personalized sequential recommendation, where the system considers not only a user’s immediate, evolving session context, but also their cumulative historical behavior to provide highly relevant and timely recommendations. Through an empirical study conducted on diverse real-world datasets, we have observed and quantified the existence and impact of both short-term (immediate and transient) and long-term (enduring and stable) preferences on users’ historical interactions. Building on these insights, we propose a framework that combines short- and long-term preferences to enhance recommendation performance, namely Compositions of Variant Experts (CoVE). This novel framework dynamically integrates short- and long-term preferences through the use of different specialized recommendation models (i.e., experts). Extensive experiments showcase the effectiveness of the proposed methods and ablation studies further investigate the impact of variant expert types.
[LG-38] Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes
Link: https://arxiv.org/abs/2506.23165
Authors: David Bossens, Atsushi Nitanda
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes (RCMDPs), making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained MDP. In the oracle-based RCMDP setting, we obtain an $\mathcal{O}\left(\frac{1}{T}\right)$ convergence rate for the squared distance as a Bregman divergence, and an $\mathcal{O}\left(e^{-T}\right)$ convergence rate for entropy-regularised objectives. In the sample-based RCMDP setting, we obtain an $\tilde{\mathcal{O}}\left(\frac{1}{T^{1/3}}\right)$ convergence rate. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.
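For readers unfamiliar with mirror descent on policies, the sketch below shows the textbook single-state update when the Bregman divergence is the KL divergence: the new policy is proportional to the old policy times exp(step size × action value). This is only the generic building block. The Lagrangian, the constraints, and the adversarial transition-kernel update from the paper are not modelled, and the action values and step size are hypothetical.

```python
import math


def mirror_descent_step(policy, q_values, step_size):
    """KL-regularised mirror ascent step on the probability simplex.

    Computes new_policy(a) proportional to policy(a) * exp(step_size * q(a)),
    working in log space for numerical stability. All entries of `policy`
    must be strictly positive.
    """
    logits = [math.log(p) + step_size * q for p, q in zip(policy, q_values)]
    top = max(logits)
    unnormalised = [math.exp(l - top) for l in logits]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def run_mirror_descent(q_values, steps=50, step_size=0.5):
    """Iterate the update from a uniform policy with fixed action values."""
    policy = [1.0 / len(q_values)] * len(q_values)
    for _ in range(steps):
        policy = mirror_descent_step(policy, q_values, step_size)
    return policy
```

With fixed action values, repeated updates concentrate the policy on the highest-valued action. In an actual MDP the action values would be re-estimated after every policy change.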
[LG-39] Multi-task Offline Reinforcement Learning for Online Advertising in Recommender Systems KDD2025
Link: https://arxiv.org/abs/2506.23090
Authors: Langming Liu, Wanyu Wang, Chi Zhang, Bo Li, Hongzhi Yin, Xuetao Wei, Wenbo Su, Bo Zheng, Xiangyu Zhao
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: KDD 2025
Abstract:Online advertising in recommendation platforms has gained significant attention, with a predominant focus on channel recommendation and budget allocation strategies. However, current offline reinforcement learning (RL) methods face substantial challenges when applied to sparse advertising scenarios, primarily due to severe overestimation, distributional shifts, and overlooking budget constraints. To address these issues, we propose MTORL, a novel multi-task offline RL model that targets two key objectives. First, we establish a Markov Decision Process (MDP) framework specific to the nuances of advertising. Then, we develop a causal state encoder to capture dynamic user interests and temporal dependencies, facilitating offline RL through conditional sequence modeling. Causal attention mechanisms are introduced to enhance user sequence representations by identifying correlations among causal states. We employ multi-task learning to decode actions and rewards, simultaneously addressing channel recommendation and budget allocation. Notably, our framework includes an automated system for integrating these tasks into online advertising. Extensive experiments on offline and online environments demonstrate MTORL’s superiority over state-of-the-art methods.
[LG-40] CSBrain: A Cross-scale Spatiotemporal Brain Foundation Model for EEG Decoding
Link: https://arxiv.org/abs/2506.23075
Authors: Yuchen Zhou, Jiamin Wu, Zichen Ren, Zhouheng Yao, Weiheng Lu, Kunyu Peng, Qihao Zheng, Chunfeng Song, Wanli Ouyang, Chao Gou
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
Abstract:Understanding and decoding brain activity from electroencephalography (EEG) signals is a fundamental challenge in neuroscience and AI, with applications in cognition, emotion recognition, diagnosis, and brain-computer interfaces. While recent EEG foundation models advance generalized decoding via unified architectures and large-scale pretraining, they adopt a scale-agnostic dense modeling paradigm inherited from NLP and vision. This design neglects a core property of neural activity: cross-scale spatiotemporal structure. EEG task patterns span a wide range of temporal and spatial scales, from short bursts to slow rhythms, and from localized cortical responses to distributed interactions. Ignoring this diversity leads to suboptimal representations and weak generalization. We propose CSBrain, a Cross-scale Spatiotemporal Brain foundation model for generalized EEG decoding. CSBrain introduces: (i) Cross-scale Spatiotemporal Tokenization (CST), which aggregates multi-scale features from localized temporal windows and anatomical brain regions into compact scale-aware tokens; and (ii) Structured Sparse Attention (SSA), which captures cross-window and cross-region dependencies, enhancing scale diversity while removing spurious correlations. CST and SSA are alternately stacked to progressively integrate multi-scale dependencies. Experiments on 11 EEG tasks across 16 datasets show that CSBrain consistently outperforms task-specific and foundation model baselines. These results establish cross-scale modeling as a key inductive bias and position CSBrain as a robust backbone for future brain-AI research.
[LG-41] Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction
Link: https://arxiv.org/abs/2506.23053
Authors: Hanlin Dong, Arian Prabowo, Hao Xue, Flora D. Salim
Subjects: Machine Learning (cs.LG)
Abstract:Air quality prediction is a challenging forecasting task due to its spatio-temporal complexity and the inherent dynamics as well as uncertainty. Most of the current models handle these two challenges by applying Graph Neural Networks or known physics principles, and quantifying stochasticity through probabilistic networks like Diffusion models. Nevertheless, finding the right balancing point between the certainties and uncertainties remains an open question. Therefore, we propose Double-Diffusion, a novel diffusion probabilistic model that harnesses the power of known physics to guide air quality forecasting with stochasticity. To the best of our knowledge, while precedents have been made of using conditional diffusion models to predict air pollution, this is the first attempt to use physics as a conditional generative approach for air quality prediction. Along with a sampling strategy adopted from image restoration and a new denoiser architecture, Double-Diffusion ranks first in most evaluation scenarios across two real-life datasets compared with other probabilistic models. It also cuts inference time by 50% to 30% while enjoying an increase of 3-12% in Continuous Ranked Probability Score (CRPS).
[LG-42] Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress
Link: https://arxiv.org/abs/2506.23036
Authors: Zain ul Abdeen, Ming Jin
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:This paper explores reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. Inspired by synaptic plasticity in neuroscience, synaptic filtering introduces internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as fragile, robust, or antifragile, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on PPO-trained agents in MuJoCo continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.
[LG-43] Feature-Wise Mixing for Mitigating Contextual Bias in Predictive Supervised Learning
Link: https://arxiv.org/abs/2506.23033
Authors: Yash Vardhan Tomar
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Bias in predictive machine learning (ML) models is a fundamental challenge due to the skewed or unfair outcomes produced by biased models. Existing mitigation strategies rely on either post-hoc corrections or rigid constraints. However, emerging research claims that these techniques can limit scalability and reduce generalizability. To address this, this paper introduces a feature-wise mixing framework to mitigate contextual bias. This was done by redistributing feature representations across multiple contextual datasets. To assess feature-wise mixing’s effectiveness, four ML classifiers were trained using cross-validation and evaluated with bias-sensitive loss functions, including disparity metrics and mean squared error (MSE), which served as a standard measure of predictive performance. The proposed method achieved an average bias reduction of 43.35% and a statistically significant decrease in MSE across all classifiers trained on mixed datasets. Additionally, benchmarking against established bias mitigation techniques found that feature-wise mixing consistently outperformed SMOTE oversampling and demonstrated competitive effectiveness without requiring explicit bias attribute identification. Feature-wise mixing efficiently avoids the computational overhead typically associated with fairness-aware learning algorithms. Future work could explore applying feature-wise mixing for real-world fields where accurate predictions are necessary.
[LG-44] A Reinforcement Learning Approach for Optimal Control in Microgrids
Link: https://arxiv.org/abs/2506.22995
Authors: Davide Salaorni, Federico Bianchi, Francesco Trovò, Marcello Restelli
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, accepted to International Joint Conference on Neural Networks 2025
Abstract:The increasing integration of renewable energy sources (RESs) is transforming traditional power grid networks, which require new approaches for managing decentralized energy production and consumption. Microgrids (MGs) provide a promising solution by enabling localized control over energy generation, storage, and distribution. This paper presents a novel reinforcement learning (RL)-based methodology for optimizing microgrid energy management. Specifically, we propose an RL agent that learns optimal energy trading and storage policies by leveraging historical data on energy production, consumption, and market prices. A digital twin (DT) is used to simulate the energy storage system dynamics, incorporating degradation factors to ensure a realistic emulation of the analysed setting. Our approach is validated through an experimental campaign using real-world data from a power grid located in the Italian territory. The results indicate that the proposed RL-based strategy outperforms rule-based methods and existing RL benchmarks, offering a robust solution for intelligent microgrid management.
[LG-45] Kernel Outlier Detection
Link: https://arxiv.org/abs/2506.22994
Authors: Can Hakan Dağıdır, Mia Hubert, Peter J. Rousseeuw
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:A new anomaly detection method called kernel outlier detection (KOD) is proposed. It is designed to address challenges of outlier detection in high-dimensional settings. The aim is to overcome limitations of existing methods, such as dependence on distributional assumptions or on hyperparameters that are hard to tune. KOD starts with a kernel transformation, followed by a projection pursuit approach. Its novelties include a new ensemble of directions to search over, and a new way to combine results of different direction types. This provides a flexible and lightweight approach for outlier detection. Our empirical evaluations illustrate the effectiveness of KOD on three small datasets with challenging structures, and on four large benchmark datasets.
[LG-46] Cybersecurity-Focused Anomaly Detection in Connected Autonomous Vehicles Using Machine Learning
Link: https://arxiv.org/abs/2506.22984
Authors: Prathyush Kumar Reddy Lebaku, Lu Gao, Yunpeng Zhang, Zhixia Li, Yongxin Liu, Tanvir Arafin
Subjects: Machine Learning (cs.LG)
Abstract:Anomaly detection in connected autonomous vehicles (CAVs) is crucial for maintaining safe and reliable transportation networks, as CAVs can be susceptible to sensor malfunctions, cyber-attacks, and unexpected environmental disruptions. This study explores an anomaly detection approach by simulating vehicle behavior, generating a dataset that represents typical and atypical vehicular interactions. The dataset includes time-series data of position, speed, and acceleration for multiple connected autonomous vehicles. We utilized machine learning models to effectively identify abnormal driving patterns. First, we applied a stacked Long Short-Term Memory (LSTM) model to capture temporal dependencies and sequence-based anomalies. The stacked LSTM model processed the sequential data to learn standard driving behaviors. Additionally, we deployed a Random Forest model to support anomaly detection by offering ensemble-based predictions, which enhanced model interpretability and performance. The Random Forest model achieved an $R^2$ of 0.9830, MAE of 5.746, and a 95th percentile anomaly threshold of 14.18, while the stacked LSTM model attained an $R^2$ of 0.9998, MAE of 82.425, and a 95th percentile anomaly threshold of 265.63. These results demonstrate the models’ effectiveness in accurately predicting vehicle trajectories and detecting anomalies in autonomous driving scenarios.
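The "95th percentile anomaly threshold" can be pictured with the sketch below, which assumes the threshold is taken over absolute prediction errors and that any sample above it is flagged. The abstract does not spell out this procedure, so the helper names `percentile` and `flag_anomalies`, the linear-interpolation percentile rule, and the toy data are all assumptions.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def flag_anomalies(actual, predicted, q=95.0):
    """Return (threshold, flags); a sample is flagged when its absolute error exceeds the threshold."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    threshold = percentile(errors, q)
    return threshold, [e > threshold for e in errors]

# Toy trajectory: predictions match everywhere except one corrupted sample.
actual = [float(i) for i in range(20)]
predicted = list(actual)
predicted[7] += 10.0

threshold, flags = flag_anomalies(actual, predicted)
```

In practice the threshold would be fitted on errors from normal driving data and then applied to new samples, rather than computed on the same batch as in this toy example.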
[LG-47] Hierarchical Decentralized Stochastic Control for Cyber-Physical Systems
Link: https://arxiv.org/abs/2506.22971
Authors: Kesav Kazam Ramachandran Anantharaman, Rahul Meshram
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: 6 pages, 2 figures
Abstract:This paper presents a two-timescale hierarchical decentralized architecture for control of Cyber-Physical Systems. The architecture consists of N independent sub-processes, a global controller, and N local controllers, each formulated as a Markov Decision Process (MDP). The global controller, operating at a slower timescale, optimizes the infinite-horizon discounted cumulative reward under budget constraints. For the local controllers, operating at a faster timescale, we propose two different optimization frameworks, namely COpt and FOpt. In the COpt framework, the local controller also optimizes an infinite-horizon MDP, while in the FOpt framework, the local controller optimizes a finite-horizon MDP. The FOpt framework mimics a federal structure, where the local controllers have more autonomy in their decision making. First, the existence of stationary deterministic optimal policies for both these frameworks is established. Then, various relationships between the two frameworks are studied, including a bound on the difference between the two optimal value functions. Additionally, sufficient conditions are provided under which the two frameworks lead to the same optimal values.
[LG-48] Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models
Link: https://arxiv.org/abs/2506.22950
Authors: Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, Di Wang
Subjects: Machine Learning (cs.LG)
Abstract:Group-based reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage. It consists of: (1) micro sampling groups that decompose large groups into memory-feasible rounds; (2) continuous sampling that interleaves generation across groups to improve utilization; and (3) a length-aware scheduler combining token-conditioned sequence length prediction with a two-stage plan: global grouping via FPTAS and runtime refill via SJF. Experiments show that our Micro Sampling Groups reduce peak memory usage by over 50% compared to full-group decoding (e.g., from 21.55 GB to 10.64 GB on Qwen3-1.7B). Building on this, Infinite Sampling improves throughput by over 25% compared to the naive micro sampling group method, reducing decoding steps while maintaining full-length completions and memory usage. Our hybrid scheduling ensures efficient and stable GRPO training with larger groups under realistic GPU memory constraints.
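The micro sampling group idea can be sketched as follows: a large group is generated in several small rounds, and the group-normalised advantages are computed once all rewards are in. This is only a minimal illustration under simplifying assumptions. It tracks rewards alone (a real trainer also keeps tokens and log-probabilities), `generate_fn` is a hypothetical stand-in for the decoder plus reward model, and the continuous sampling and length-aware scheduling components are not shown.

```python
import statistics

def sample_in_micro_groups(generate_fn, prompt, group_size, micro_size):
    """Collect rewards for `group_size` responses, at most `micro_size` per round."""
    rewards = []
    for start in range(0, group_size, micro_size):
        count = min(micro_size, group_size - start)
        # Hypothetical callback: returns one reward per sampled response.
        rewards.extend(generate_fn(prompt, count))
    return rewards

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards by the mean and standard deviation of the whole group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy generator: records each round's size and returns a round-dependent reward.
round_sizes = []

def toy_generate(prompt, count):
    round_sizes.append(count)
    return [float(len(round_sizes))] * count

rewards = sample_in_micro_groups(toy_generate, "example prompt", group_size=8, micro_size=3)
advantages = group_relative_advantages(rewards)
```

Because only `micro_size` responses are decoded per round, peak decoding memory depends on the round size rather than on the full group size, while the advantage statistics still cover the whole group.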
[LG-49] Efficient Cybersecurity Assessment Using SVM and Fuzzy Evidential Reasoning for Resilient Infrastructure
Link: https://arxiv.org/abs/2506.22938
Authors: Zaydon L. Ali, Wassan Saad Abduljabbar Hayale, Israa Ibraheem Al_Barazanchi, Ravi Sekhar, Pritesh Shah, Sushma Parihar
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:With current advances in hypermedia knowledge, the privacy of digital information has become a critical problem. To overcome the vulnerabilities of present security protocols, scholars tend to focus mainly on altering current protocols. Over the past decade, various proposed encoding models have been shown to be insecure, leading to major threats against significant data. Using a suitable encryption model is a vital means of guarding against such threats, but the algorithm must be selected based on the data that need to be secured. Moreover, testing the candidate security assessments one by one to identify the best choice can take considerable processing time. For faster and more precise identification of the assessment algorithm, we suggest a security phase exposure model for cipher encryption techniques that invokes a Support Vector Machine (SVM). In this work, we form a dataset using common security components such as contrast and homogeneity. To overcome the uncertainty in analysing security and the inability to feed processed data into a risk assessment mechanism, this paper proposes an assessment model for security issues using fuzzy evidential reasoning (ER) approaches. Significantly, the model can be utilised to process and assemble risk assessment data on various aspects in systematic ways. To estimate the performance of our framework, we report various analyses, including recall, F1 score, and accuracy.
[LG-50] Towards Time Series Generation Conditioned on Unstructured Natural Language
Link: https://arxiv.org/abs/2506.22927
Authors: Jaeyun Woo, Jiseok Lee, Brian Kenji Iwana
Subjects: Machine Learning (cs.LG)
Abstract:Generative Artificial Intelligence (AI) has rapidly become a powerful tool, capable of generating various types of data, such as images and text. However, despite the significant advancement of generative AI, time series generative AI remains underdeveloped, even though the application of time series is essential in finance, climate, and numerous fields. In this research, we propose a novel method of generating time series conditioned on unstructured natural language descriptions. We use a diffusion model combined with a language model to generate time series from the text. Through the proposed method, we demonstrate that time series generation based on natural language is possible. The proposed method can provide various applications such as custom forecasting, time series manipulation, data augmentation, and transfer learning. Furthermore, we construct and propose a new public dataset for time series generation, consisting of 63,010 time series-description pairs.
[LG-51] P2U: Progressive Precision Update For Efficient Model Distribution
Link: https://arxiv.org/abs/2506.22871
Authors: Homayun Afrabandpey, Hamed Rezazadegan Tavakoli
Subjects: Machine Learning (cs.LG); Multimedia (cs.MM)
Abstract:Efficient model distribution is becoming increasingly critical in bandwidth-constrained environments. In this paper, we propose a simple yet effective approach called Progressive Precision Update (P$^2$U) to address this problem. Instead of transmitting the original high-precision model, P$^2$U transmits a lower-bit precision model, coupled with a model update representing the difference between the original high-precision model and the transmitted low precision version. With extensive experiments on various model architectures, ranging from small models (1-6 million parameters) to a large model (more than 100 million parameters) and using three different data sets, e.g., chest X-Ray, PASCAL-VOC, and CIFAR-100, we demonstrate that P$^2$U consistently achieves better tradeoff between accuracy, bandwidth usage and latency. Moreover, we show that when bandwidth or startup time is the priority, aggressive quantization (e.g., 4-bit) can be used without severely compromising performance. These results establish P$^2$U as an effective and practical solution for scalable and efficient model distribution in low-resource settings, including federated learning, edge computing, and IoT deployments. Given that P$^2$U complements existing compression techniques and can be implemented alongside any compression method, e.g., sparsification, quantization, pruning, etc., the potential for improvement is even greater.
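The two-part transmission described above (a low-precision model first, then an update equal to the difference from the original) can be sketched like this. The uniform min-max quantizer and the function names are assumptions chosen for illustration; the abstract does not specify which quantization scheme the authors use.

```python
def quantize(weights, bits):
    """Uniform min-max quantization to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    low, high = min(weights), max(weights)
    scale = (high - low) / levels if high > low else 1.0
    codes = [round((w - low) / scale) for w in weights]
    return codes, low, scale

def dequantize(codes, low, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [low + c * scale for c in codes]

# Sender side: ship the 4-bit model first, then the residual update.
weights = [0.12, -0.53, 0.98, -0.07, 0.44]
codes, low, scale = quantize(weights, bits=4)
low_precision = dequantize(codes, low, scale)
update = [w - q for w, q in zip(weights, low_precision)]

# Receiver side: usable immediately with `low_precision`, exact once `update` arrives.
restored = [q + u for q, u in zip(low_precision, update)]
```

The receiver can start inference with the coarse weights and later recover the original model (up to floating-point rounding) by adding the update, which is the trade-off between startup latency and accuracy that the abstract refers to.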
[LG-52] Deep learning 40 years of human migration
Link: https://arxiv.org/abs/2506.22821
Authors: Thomas Gaskin, Guy J. Abel
Subjects: Machine Learning (cs.LG)
Abstract:We present a novel and detailed dataset on origin-destination annual migration flows and stocks between 230 countries and regions, spanning the period from 1990 to the present. Our flow estimates are further disaggregated by country of birth, providing a comprehensive picture of migration over the last 43 years. The estimates are obtained by training a deep recurrent neural network to learn flow patterns from 18 covariates for all countries, including geographic, economic, cultural, societal, and political information. The recurrent architecture of the neural network means that the entire past can influence current migration patterns, allowing us to learn long-range temporal correlations. By training an ensemble of neural networks and additionally pushing uncertainty on the covariates through the trained network, we obtain confidence bounds for all our estimates, allowing researchers to pinpoint the geographic regions most in need of additional data collection. We validate our approach on various test sets of unseen data, demonstrating that it significantly outperforms traditional methods estimating five-year flows while delivering a significant increase in temporal resolution. The model is fully open source: all training data, neural network weights, and training code are made public alongside the migration estimates, providing a valuable resource for future studies of human migration.
[LG-53] Multimodal Atmospheric Super-Resolution With Deep Generative Models
Link: https://arxiv.org/abs/2506.22780
Authors: Dibyajyoti Chakraborty, Haiwen Guan, Jason Stock, Troy Arcomano, Guido Cervone, Romit Maulik
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Abstract:Score-based diffusion models are generative machine learning algorithms that can be used to sample from complex distributions. They achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and using it to reverse a noising process. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.
[LG-54] Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing
Link: https://arxiv.org/abs/2506.22773
Authors: Yanran Wu, Inez Hua, Yi Ding
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 7 pages, 9 figures, HotCarbon '25: Proceedings of the 4th Workshop on Sustainable Computer Systems, Cambridge, Massachusetts (USA), July 10-11th, 2025
Abstract:Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill in this gap, we present SCARF, the first general framework that evaluates water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at this https URL.
[LG-55] Robust Tensor Completion via Gradient Tensor Nuclear L1-L2 Norm for Traffic Data Recovery
Link: https://arxiv.org/abs/2506.22732
Authors: Hao Shu, Jicheng Li, Tianyv Lei, Lijun Sun
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Abstract:In real-world scenarios, spatiotemporal traffic data frequently experiences dual degradation from missing values and noise caused by sensor malfunctions and communication failures. Therefore, effective data recovery methods are essential to ensure the reliability of downstream data-driven applications. While classical tensor completion methods have been widely adopted, they are incapable of modeling noise, making them unsuitable for complex scenarios involving simultaneous data missingness and noise interference. Existing Robust Tensor Completion (RTC) approaches offer potential solutions by separately modeling the actual tensor data and noise. However, their effectiveness is often constrained by the over-relaxation of convex rank surrogates and the suboptimal utilization of local consistency, leading to inadequate model accuracy. To address these limitations, we first introduce the tensor L1-L2 norm, a novel non-convex tensor rank surrogate that functions as an effective low-rank representation tool. Leveraging an advanced feature fusion strategy, we further develop the gradient tensor L1-L2 norm by incorporating the tensor L1-L2 norm in the gradient domain. By integrating the gradient tensor nuclear L1-L2 norm into the RTC framework, we propose the Robust Tensor Completion via Gradient Tensor Nuclear L1-L2 Norm (RTC-GTNLN) model, which not only fully exploits both global low-rankness and local consistency without a trade-off parameter, but also effectively handles the dual degradation challenges of missing data and noise in traffic data. Extensive experiments conducted on multiple real-world traffic datasets demonstrate that the RTC-GTNLN model consistently outperforms existing state-of-the-art methods in complex recovery scenarios involving simultaneous missing values and noise.
[LG-56] Persistence Paradox in Dynamic Science
Link: https://arxiv.org/abs/2506.22729
Authors: Honglin Bao, Kai Li
Subjects: Digital Libraries (cs.DL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Abstract:Persistence is often regarded as a virtue in science. In this paper, however, we challenge this conventional view by highlighting its contextual nature, particularly how persistence can become a liability during periods of paradigm shift. We focus on the deep learning revolution catalyzed by AlexNet in 2012. Analyzing the 20-year career trajectories of over 5,000 scientists who were active in top machine learning venues during the preceding decade, we examine how their research focus and output evolved. We first uncover a dynamic period in which leading venues increasingly prioritized cutting-edge deep learning developments that displaced relatively traditional statistical learning methods. Scientists responded to these changes in markedly different ways. Those who were previously successful or affiliated with old teams adapted more slowly, experiencing what we term a rigidity penalty - a reluctance to embrace new directions leading to a decline in scientific impact, as measured by citation percentile rank. In contrast, scientists who pursued strategic adaptation - selectively pivoting toward emerging trends while preserving weak connections to prior expertise - reaped the greatest benefits. Taken together, our macro- and micro-level findings show that scientific breakthroughs act as mechanisms that reconfigure power structures within a field.
[LG-57] Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
Link: https://arxiv.org/abs/2506.22714
Authors: Jinliang Shi, Shigang Li, Youxuan Xu, Xueying Wang, Rongtian Fu, Zhi Ma, Tong Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Abstract:Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming flexibility. In this work, we discover that utilizing one resource alone leads to inferior performance for sparse matrix multiplication, due to their respective limitations. To this end, we propose Libra, a systematic approach that enables synergistic computation between CUDA and Tensor cores to achieve the best performance for sparse matrix multiplication. Specifically, we propose a 2D-aware workload distribution strategy to find the sweet spot of task mapping for different sparse operators, leveraging both the high performance of Tensor cores and the low computational redundancy on CUDA cores. In addition, Libra incorporates systematic optimizations for heterogeneous computing, including hybrid load-balancing, finely optimized kernel implementations, and GPU-accelerated preprocessing. Extensive experimental results on H100 and RTX 4090 GPUs show that Libra outperforms the state-of-the-art, with an average speedup of 3.1x (up to 9.23x) over DTC-SpMM and 2.9x (up to 3.9x) for end-to-end GNN applications. Libra opens up a new perspective for sparse operator acceleration by fully exploiting the heterogeneous computing resources on GPUs.
[LG-58] Generalized Linear Mode Connectivity for Transformers
Link: https://arxiv.org/abs/2506.22712
Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths, despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space – such as neuron permutations – which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, orthogonal transformations, and general invertible maps – broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.
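To make the role of parameter-space symmetries concrete, the sketch below uses a tiny two-layer ReLU network and only the simplest symmetry class, neuron permutations. It shows that a permuted copy computes the same function, that naive linear interpolation between the two copies breaks the function, and that aligning first removes the problem. The network, weights, and helper names are assumptions for illustration; the richer classes named in the abstract (semi-permutations, orthogonal and invertible maps) are not covered.

```python
def mlp(x, w1, w2):
    """Two-layer ReLU network with scalar output: w2 . relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(v * h for v, h in zip(w2, hidden))

def permute_hidden(w1, w2, perm):
    """Reorder hidden units; the computed function is unchanged."""
    return [w1[i] for i in perm], [w2[i] for i in perm]

def invert(perm):
    """Inverse permutation, so that permuting by perm and then by invert(perm) is the identity."""
    inverse = [0] * len(perm)
    for position, index in enumerate(perm):
        inverse[index] = position
    return inverse

def midpoint(model_a, model_b):
    """Parameter-wise average of two (w1, w2) models."""
    (a1, a2), (b1, b2) = model_a, model_b
    w1 = [[(p + q) / 2 for p, q in zip(ra, rb)] for ra, rb in zip(a1, b1)]
    w2 = [(p + q) / 2 for p, q in zip(a2, b2)]
    return w1, w2

x = [2.0, 1.0]
model_a = ([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0])
perm = [1, 0]
model_b = permute_hidden(*model_a, perm)           # same function, different parameters

out_a = mlp(x, *model_a)
out_b = mlp(x, *model_b)
out_naive = mlp(x, *midpoint(model_a, model_b))    # interpolate without alignment
aligned_b = permute_hidden(*model_b, invert(perm))
out_aligned = mlp(x, *midpoint(model_a, aligned_b))  # interpolate after alignment
```

In this toy case the alignment is known exactly; for independently trained networks it has to be estimated, which is where the choice of symmetry class matters.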
[LG-59] FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets
Link: https://arxiv.org/abs/2506.22708
Authors: Shrenik Jadhav, Birva Sevak, Srijita Das, Akhtar Hussain, Wencong Su, Van-Hai Bui
Subjects: Machine Learning (cs.LG); General Economics (econ.GN); Systems and Control (eess.SY)
Abstract:Peer-to-peer (P2P) trading is increasingly recognized as a key mechanism for decentralized market regulation, yet existing approaches often lack robust frameworks to ensure fairness. This paper presents FairMarket-RL, a novel hybrid framework that combines Large Language Models (LLMs) with Reinforcement Learning (RL) to enable fairness-aware trading agents. In a simulated P2P microgrid with multiple sellers and buyers, the LLM acts as a real-time fairness critic, evaluating each trading episode using two metrics: Fairness-To-Buyer (FTB) and Fairness-Between-Sellers (FBS). These fairness scores are integrated into agent rewards through scheduled $\lambda$-coefficients, forming an adaptive LLM-guided reward shaping loop that replaces brittle, rule-based fairness constraints. Agents are trained using Independent Proximal Policy Optimization (IPPO) and achieve equitable outcomes, fulfilling over 90% of buyer demand, maintaining fair seller margins, and consistently reaching FTB and FBS scores above 0.80. The training process demonstrates that fairness feedback improves convergence, reduces buyer shortfalls, and narrows profit disparities between sellers. With its language-based critic, the framework scales naturally, and its extension to a large power distribution system with household prosumers illustrates its practical applicability. FairMarket-RL thus offers a scalable, equity-driven solution for autonomous trading in decentralized energy systems.
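A rough sketch of how fairness scores might enter a reward through scheduled coefficients follows. The abstract does not give the formula, so the additive form, the linear ramp, and the names `lambda_schedule` and `shaped_reward` are all assumptions made for illustration only.

```python
def lambda_schedule(episode, total_episodes, lam_max=0.5):
    """Hypothetical schedule: ramp the coefficient linearly from 0 to lam_max."""
    frac = min(max(episode / total_episodes, 0.0), 1.0)
    return lam_max * frac


def shaped_reward(profit, ftb, fbs, lam_ftb, lam_fbs):
    """Assumed additive shaping: base profit plus weighted fairness scores.

    ftb and fbs stand for the Fairness-To-Buyer and Fairness-Between-Sellers
    scores that the LLM critic would return for the episode.
    """
    return profit + lam_ftb * ftb + lam_fbs * fbs


# Early in training the fairness terms carry no weight; later they do.
early = shaped_reward(1.0, 0.8, 0.9,
                      lambda_schedule(0, 100), lambda_schedule(0, 100))
late = shaped_reward(1.0, 0.8, 0.9,
                     lambda_schedule(100, 100), lambda_schedule(100, 100))
print(early, late)
```

Under these assumptions the agent first optimizes profit alone and is gradually pushed toward outcomes the critic scores as fair.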
[LG-60] Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment
Link: https://arxiv.org/abs/2506.22685
Authors: Anh Bui,Trang Vu,Trung Le,Junae Kim,Tamas Abraham,Rollin Omari,Amar Kaur,Dinh Phung
Categories: Machine Learning (cs.LG); Graphics (cs.GR)
Comments:
Abstract:In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V^*$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V^*$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V^*$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V^*$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of the pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at this https URL.
[LG-61] Learning Stochastic Multiscale Models NEURIPS2025
Link: https://arxiv.org/abs/2506.22655
Authors: Andrew F. Ilersich,Prasanth B. Nair
Categories: Machine Learning (cs.LG)
Comments: Body is 9 pages, 13 including acknowledgements and references, 35 including appendix. 21 figures and 6 tables. Submitted to NeurIPS 2025
Abstract:The physical sciences are replete with dynamical systems that require the resolution of a wide range of length and time scales. This presents significant computational challenges since direct numerical simulation requires discretization at the finest relevant scales, leading to a high-dimensional state space. In this work, we propose an approach to learn stochastic multiscale models in the form of stochastic differential equations directly from observational data. Our method resolves the state on a coarse mesh while introducing an auxiliary state to capture the effects of unresolved scales. We learn the parameters of the multiscale model using a modern forward-solver-free amortized variational inference method. Our approach draws inspiration from physics-based multiscale modeling approaches, such as large-eddy simulation in fluid dynamics, while learning directly from data. We present numerical studies to demonstrate that our learned multiscale models achieve superior predictive accuracy compared to direct numerical simulation and closure-type models at equivalent resolution.
[LG-62] Interact2Vec – An efficient neural network-based model for simultaneously learning users and items embeddings in recommender systems
Link: https://arxiv.org/abs/2506.22648
Authors: Pedro R. Pires,Tiago A. Almeida
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for publication in Applied Soft Computing (ASOC), 49 pages, 14 figures
Abstract:Over the past decade, recommender systems have experienced a surge in popularity. Despite notable progress, they grapple with challenging issues, such as high data dimensionality and sparseness. Representing users and items as low-dimensional embeddings learned via neural networks has become a leading solution. However, while recent studies show promising results, many approaches rely on complex architectures or require content data, which may not always be available. This paper presents Interact2Vec, a novel neural network-based model that simultaneously learns distributed embeddings for users and items while demanding only implicit feedback. The model employs state-of-the-art strategies that natural language processing models commonly use to optimize the training phase and enhance the final embeddings. Two types of experiments were conducted regarding the extrinsic and intrinsic quality of the model. In the former, we benchmarked the recommendations generated by Interact2Vec’s embeddings in a top-$N$ ranking problem, comparing them with six other recommender algorithms. The model achieved the second or third-best results in 30% of the datasets, being competitive with other recommenders, and has proven to be very efficient with an average training time reduction of 274% compared to other embedding-based models. In the latter, we analyzed the intrinsic quality of the embeddings through similarity tables. Our findings suggest that Interact2Vec can achieve promising results, especially on the extrinsic task, and is an excellent embedding-generator model for scenarios of scarce computing resources, enabling the learning of item and user embeddings simultaneously and efficiently.
[LG-63] Cost-effective Reduced-Order Modeling via Bayesian Active Learning
Link: https://arxiv.org/abs/2506.22645
Authors: Amir Hossein Rahmati,Nathan M. Urban,Byung-Jun Yoon,Xiaoning Qian
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Machine Learning surrogates have been developed to accelerate solving systems dynamics of complex processes in different science and engineering applications. To faithfully capture governing systems dynamics, these methods rely on large training datasets, hence restricting their applicability in real-world problems. In this work, we propose BayPOD-AL, an active learning framework based on an uncertainty-aware Bayesian proper orthogonal decomposition (POD) approach, which aims to effectively learn reduced-order models from high-fidelity full-order models representing complex systems. Experimental results on predicting the temperature evolution over a rod demonstrate BayPOD-AL’s effectiveness in suggesting the informative data and reducing computational cost related to constructing a training dataset compared to other uncertainty-guided active learning strategies. Furthermore, we demonstrate BayPOD-AL’s generalizability and efficiency by evaluating its performance on a dataset of higher temporal resolution than the training dataset.
[LG-64] A hierarchical Vovk-Azoury-Warmuth forecaster with discounting for online regression in RKHS
Link: https://arxiv.org/abs/2506.22631
Authors: Dmitry B. Rokhlin
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We study the problem of online regression with the unconstrained quadratic loss against a time-varying sequence of functions from a Reproducing Kernel Hilbert Space (RKHS). Recently, Jacobsen and Cutkosky (2024) introduced a discounted Vovk-Azoury-Warmuth (DVAW) forecaster that achieves optimal dynamic regret in the finite-dimensional case. In this work, we lift their approach to the non-parametric domain by synthesizing the DVAW framework with a random feature approximation. We propose a fully adaptive, hierarchical algorithm, which we call H-VAW-D (Hierarchical Vovk-Azoury-Warmuth with Discounting), that learns both the discount factor and the number of random features. We prove that this algorithm, which has a per-iteration computational complexity of $O(T\ln T)$, achieves an expected dynamic regret of $O(T^{2/3}P_T^{1/3} + \sqrt{T}\ln T)$, where $P_T$ is the functional path length of a comparator sequence.
[LG-65] Hierarchical Modeling and Architecture Optimization: Review and Unified Framework
Link: https://arxiv.org/abs/2506.22621
Authors: Paul Saves,Edward Hallé-Hannan,Jasper Bussemaker,Youssef Diouane,Nathalie Bartoli
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Simulation-based problems involving mixed-variable inputs frequently feature domains that are hierarchical, conditional, heterogeneous, or tree-structured. These characteristics pose challenges for data representation, modeling, and optimization. This paper reviews extensive literature on these structured input spaces and proposes a unified framework that generalizes existing approaches. In this framework, input variables may be continuous, integer, or categorical. A variable is described as meta if its value governs the presence of other decreed variables, enabling the modeling of conditional and hierarchical structures. We further introduce the concept of partially-decreed variables, whose activation depends on contextual conditions. To capture these inter-variable hierarchical relationships, we introduce design space graphs, combining principles from feature modeling and graph theory. This allows the definition of general hierarchical domains suitable for describing complex system architectures. The framework supports the use of surrogate models over such domains and integrates hierarchical kernels and distances for efficient modeling and optimization. The proposed methods are implemented in the open-source Surrogate Modeling Toolbox (SMT 2.0), and their capabilities are demonstrated through applications in Bayesian optimization for complex system design, including a case study in green aircraft architecture.
[LG-66] A User-Centric Privacy-Preserving and Verifiable Ecosystem for Personal Data Management and Utilization
Link: https://arxiv.org/abs/2506.22606
Authors: Osama Zafar,Mina Namazi,Yuqiao Xu,Youngjin Yoo,Erman Ayday
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:In the current paradigm of digital personalized services, the centralized management of personal data raises significant privacy concerns, security vulnerabilities, and diminished individual autonomy over sensitive information. Despite their efficiency, traditional centralized architectures frequently fail to satisfy rigorous privacy requirements and expose users to data breaches and unauthorized access risks. This pressing challenge calls for a fundamental paradigm shift in methodologies for collecting, storing, and utilizing personal data across diverse sectors, including education, healthcare, and finance. This paper introduces a novel decentralized, privacy-preserving architecture that handles heterogeneous personal information, ranging from educational credentials to health records and financial data. Unlike traditional models, our system grants users complete data ownership and control, allowing them to selectively share information without compromising privacy. The architecture’s foundation comprises advanced privacy-enhancing technologies, including secure enclaves and federated learning, enabling secure computation, verification, and data sharing. The system supports diverse functionalities, including local computation, model training, and privacy-preserving data sharing, while ensuring data credibility and robust user privacy.
[LG-67] Are Fast Methods Stable in Adversarially Robust Transfer Learning?
Link: https://arxiv.org/abs/2506.22602
Authors: Joshua C. Zhao,Saurabh Bagchi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages
Abstract:Transfer learning is often used to decrease the computational cost of model training, as fine-tuning a model allows a downstream task to leverage the features learned from the pre-training dataset and quickly adapt them to a new task. This is particularly useful for achieving adversarial robustness, as adversarially training models from scratch is very computationally expensive. However, high robustness in transfer learning still requires adversarial training during the fine-tuning phase, which requires up to an order of magnitude more time than standard fine-tuning. In this work, we revisit the use of the fast gradient sign method (FGSM) in robust transfer learning to improve the computational cost of adversarial fine-tuning. We surprisingly find that FGSM is much more stable in adversarial fine-tuning than when training from scratch. In particular, FGSM fine-tuning does not suffer from any issues with catastrophic overfitting at standard perturbation budgets of $\varepsilon=4$ or $\varepsilon=8$. This stability is further enhanced with parameter-efficient fine-tuning methods, where FGSM remains stable even up to $\varepsilon=32$ for linear probing. We demonstrate how this stability translates into performance across multiple datasets. Compared to fine-tuning with the more commonly used method of projected gradient descent (PGD), on average, FGSM only loses 0.39% and 1.39% test robustness for $\varepsilon=4$ and $\varepsilon=8$ while using $4\times$ less training time. Surprisingly, FGSM may not only be a significantly more efficient alternative to PGD in adversarially robust transfer learning but also a well-performing one.
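For readers unfamiliar with FGSM, the single-step perturbation it uses is $x_{adv} = x + \varepsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L})$. The sketch below applies it to a toy logistic-regression classifier rather than the fine-tuned networks studied in the paper; the weights, input, and step size are made-up values, and the helper names are hypothetical.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def logistic_loss(w, b, x, y):
    """Loss log(1 + exp(-y * (w.x + b))) for a label y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log1p(math.exp(-margin))


def logistic_loss_grad_x(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the loss-increasing direction."""
    grad = logistic_loss_grad_x(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [1.0, -2.0, 0.0], 0.0      # toy classifier
x, y = [0.5, 0.5, 0.5], 1         # toy input and label
x_adv = fgsm(w, b, x, y, eps=0.1)  # eps is arbitrary here

clean_loss = logistic_loss(w, b, x, y)
adv_loss = logistic_loss(w, b, x_adv, y)
print(x_adv, clean_loss, adv_loss)
```

PGD repeats a step of this kind several times with projection back onto the $\varepsilon$-ball, which is where its extra training cost comes from.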
[LG-68] MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs
Link: https://arxiv.org/abs/2506.22557
Authors: Boyuan Chen,Minghao Shao,Abdul Basit,Siddharth Garg,Muhammad Shafique
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The growing capabilities of large language models (LLMs) have exposed them to increasingly sophisticated jailbreak attacks. Among these, obfuscation-based attacks – which encrypt malicious content to evade detection – remain highly effective. By leveraging the reasoning ability of advanced LLMs to interpret encrypted prompts, such attacks circumvent conventional defenses that rely on keyword detection or context filtering. These methods are very difficult to defend against, as existing safety mechanisms are not designed to interpret or decode ciphered content. In this work, we propose MetaCipher, a novel obfuscation-based jailbreak framework, along with a reinforcement learning-based dynamic cipher selection mechanism that adaptively chooses optimal encryption strategies from a cipher pool. This approach enhances jailbreak effectiveness and generalizability across diverse task types, victim LLMs, and safety guardrails. Our framework is modular and extensible by design, supporting arbitrary cipher families and accommodating evolving adversarial strategies. We complement our method with a large-scale empirical analysis of cipher performance across multiple victim LLMs. Within as few as 10 queries, MetaCipher achieves over 92% attack success rate (ASR) on most recent standard malicious prompt benchmarks against state-of-the-art non-reasoning LLMs, and over 74% ASR against reasoning-capable LLMs, outperforming all existing obfuscation-based jailbreak methods. These results highlight the long-term robustness and adaptability of our approach, making it more resilient than prior methods in the face of advancing safety measures.
[LG-69] Task-Agnostic Contrastive Pretraining for Relational Deep Learning
Link: https://arxiv.org/abs/2506.22530
Authors: Jakub Peleška,Gustav Šír
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: arXiv admin note: text overlap with arXiv:2506.22199
Abstract:Relational Deep Learning (RDL) is an emerging paradigm that leverages Graph Neural Network principles to learn directly from relational databases by representing them as heterogeneous graphs. However, existing RDL models typically rely on task-specific supervised learning, requiring training separate models for each predictive task, which may hamper scalability and reuse. In this work, we propose a novel task-agnostic contrastive pretraining approach for RDL that enables database-wide representation learning. For that aim, we introduce three levels of contrastive objectives - row-level, link-level, and context-level - designed to capture the structural and semantic heterogeneity inherent to relational data. We implement the respective pretraining approach through a modular RDL architecture and an efficient sampling strategy tailored to the heterogeneous database setting. Our preliminary results on standard RDL benchmarks demonstrate that fine-tuning the pretrained models measurably outperforms training from scratch, validating the promise of the proposed methodology in learning transferable representations for relational data.
[LG-70] Stabilization of industrial processes with time series machine learning
Link: https://arxiv.org/abs/2506.22502
Authors: Matvei Anoshin,Olga Tsurkan,Vadim Lopatkin,Leonid Fedichkin
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:
Abstract:The stabilization of time series processes is a crucial problem that is ubiquitous in various industrial fields. Applying machine learning to this problem can have a decisive impact, improving the quality of the resulting stabilization while requiring fewer computational resources. In this work, we present a simple pipeline consisting of two neural networks, the oracle predictor and the optimizer, which replaces point-wise value optimization with a neural network training problem. This improves stability in temperature control by about 3 times compared to ordinary solvers.
[LG-71] Service Placement in Small Cell Networks Using Distributed Best Arm Identification in Linear Bandits
Link: https://arxiv.org/abs/2506.22480
Authors: Mariam Yahya,Aydin Sezgin,Setareh Maghsudi
Categories: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:As users in small cell networks increasingly rely on computation-intensive services, cloud-based access often results in high latency. Multi-access edge computing (MEC) mitigates this by bringing computational resources closer to end users, with small base stations (SBSs) serving as edge servers to enable low-latency service delivery. However, limited edge capacity makes it challenging to decide which services to deploy locally versus in the cloud, especially under unknown service demand and dynamic network conditions. To tackle this problem, we model service demand as a linear function of service attributes and formulate the service placement task as a linear bandit problem, where SBSs act as agents and services as arms. The goal is to identify the service that, when placed at the edge, offers the greatest reduction in total user delay compared to cloud deployment. We propose a distributed and adaptive multi-agent best-arm identification (BAI) algorithm under a fixed-confidence setting, where SBSs collaborate to accelerate learning. Simulations show that our algorithm identifies the optimal service with the desired confidence and achieves near-optimal speedup, as the number of learning rounds decreases proportionally with the number of SBSs. We also provide theoretical analysis of the algorithm’s sample complexity and communication overhead.
[LG-72] Active Learning for Forecasting Severity among Patients with Post Acute Sequelae of SARS-CoV-2
Link: https://arxiv.org/abs/2506.22444
Authors: Jing Wang,Amar Sra,Jeremy C. Weiss
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:The long-term effects of Postacute Sequelae of SARS-CoV-2, known as PASC, pose a significant challenge to healthcare systems worldwide. Accurate identification of progression events, such as hospitalization and reinfection, is essential for effective patient management and resource allocation. However, traditional models trained on structured data struggle to capture the nuanced progression of PASC. In this study, we introduce the first publicly available cohort of 18 PASC patients, with text time series features based on the large language model Llama-3.1-70B-Instruct and clinical risk annotated by a clinical expert. We propose an Active Attention Network to predict the clinical risk and identify progression events related to the risk. By integrating human expertise with active learning, we aim to enhance clinical risk prediction accuracy and enable the identification of progression events with fewer annotations. The ultimate goal is to improve patient care and decision-making for SARS-CoV-2 patients.
[LG-73] Learning Interpretable Rules from Neural Networks: Neurosymbolic AI for Radar Hand Gesture Recognition
Link: https://arxiv.org/abs/2506.22443
Authors: Sarah Seifi,Tobias Sukianto,Cecilia Carbonelli,Lorenzo Servadei,Robert Wille
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 3 figures, accepted at the late-breaking work track at the XAI-2025 third World Conference of Explainable AI
Abstract:Rule-based models offer interpretability but struggle with complex data, while deep neural networks excel in performance yet lack transparency. This work investigates a neuro-symbolic rule learning neural network named RL-Net that learns interpretable rule lists through neural optimization, applied for the first time to radar-based hand gesture recognition (HGR). We benchmark RL-Net against a fully transparent rule-based system (MIRA) and an explainable black-box model (XentricAI), evaluating accuracy, interpretability, and user adaptability via transfer learning. Our results show that RL-Net achieves a favorable trade-off, maintaining strong performance (93.03% F1) while significantly reducing rule complexity. We identify optimization challenges specific to rule pruning and hierarchy bias and propose stability-enhancing modifications. Compared to MIRA and XentricAI, RL-Net emerges as a practical middle ground between transparency and performance. This study highlights the real-world feasibility of neuro-symbolic models for interpretable HGR and offers insights for extending explainable AI to edge-deployable sensing systems.
[LG-74] Features-based embedding or Feature-grounding
Link: https://arxiv.org/abs/2506.22442
Authors: Piotr Makarevich
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 12 figures
Abstract:In everyday reasoning, when we think about a particular object, we associate it with a unique set of expected properties such as weight, size, or more abstract attributes like density or horsepower. These expectations are shaped by our prior knowledge and the conceptual categories we have formed through experience. This paper investigates how such knowledge-based structured thinking can be reproduced in deep learning models using feature-based embeddings. Specifically, it introduces an approach to building feature-grounded embeddings, aiming to align shareable representations of an operable dictionary with interpretable domain-specific conceptual features.
[LG-75] From Model Design to Organizational Design: Complexity Redistribution and Trade-Offs in Generative AI
Link: https://arxiv.org/abs/2506.22440
Authors: Sharique Hasan,Alexander Oettl,Sampsa Samila
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments:
Abstract:This paper introduces the Generality-Accuracy-Simplicity (GAS) framework to analyze how large language models (LLMs) are reshaping organizations and competitive strategy. We argue that viewing AI as a simple reduction in input costs overlooks two critical dynamics: (a) the inherent trade-offs among generality, accuracy, and simplicity, and (b) the redistribution of complexity across stakeholders. While LLMs appear to defy the traditional trade-off by offering high generality and accuracy through simple interfaces, this user-facing simplicity masks a significant shift of complexity to infrastructure, compliance, and specialized personnel. The GAS trade-off, therefore, does not disappear but is relocated from the user to the organization, creating new managerial challenges, particularly around accuracy in high-stakes applications. We contend that competitive advantage no longer stems from mere AI adoption, but from mastering this redistributed complexity through the design of abstraction layers, workflow alignment, and complementary expertise. This study advances AI strategy by clarifying how scalable cognition relocates complexity and redefines the conditions for technology integration.
[LG-76] Consensus-based optimization for closed-box adversarial attacks and a connection to evolution strategies
Link: https://arxiv.org/abs/2506.24048
Authors: Tim Roith,Leon Bungert,Philipp Wacker
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Consensus-based optimization (CBO) has established itself as an efficient gradient-free optimization scheme, with attractive mathematical properties, such as mean-field convergence results for non-convex loss functions. In this work, we study CBO in the context of closed-box adversarial attacks, which are imperceptible input perturbations that aim to fool a classifier, without accessing its gradient. Our contribution is to establish a connection between the so-called consensus hopping as introduced by Riedl et al. and natural evolution strategies (NES) commonly applied in the context of adversarial attacks and to rigorously relate both methods to gradient-based optimization schemes. Beyond that, we provide a comprehensive experimental study that shows that despite the conceptual similarities, CBO can outperform NES and other evolutionary strategies in certain scenarios.
[LG-77] Post-processing of EEG-based Auditory Attention Decoding Decisions via Hidden Markov Models
Link: https://arxiv.org/abs/2506.24024
Authors: Nicolas Heintz,Tom Francart,Alexander Bertrand
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:Auditory attention decoding (AAD) algorithms exploit brain signals, such as electroencephalography (EEG), to identify which speaker a listener is focusing on in a multi-speaker environment. While state-of-the-art AAD algorithms can identify the attended speaker on short time windows, their predictions are often too inaccurate for practical use. In this work, we propose augmenting AAD with a hidden Markov model (HMM) that models the temporal structure of attention. More specifically, the HMM relies on the fact that a subject is much less likely to switch attention than to keep attending the same speaker at any moment in time. We show how an HMM can significantly improve existing AAD algorithms in both causal (real-time) and non-causal (offline) settings. We further demonstrate that HMMs outperform existing postprocessing approaches in both accuracy and responsiveness, and explore how various factors such as window length, switching frequency, and AAD accuracy influence overall performance. The proposed method is computationally efficient, intuitive to use and applicable in both real-time and offline settings.
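The idea of smoothing per-window AAD decisions with an HMM can be sketched with a two-state forward filter. This is an illustration under stated assumptions, not the authors' implementation: the states are the two candidate speakers, the switch probability `p_switch` and the per-window decoder accuracy `accuracy` are made-up values, and `hmm_filter` is a hypothetical name.

```python
def hmm_filter(decisions, p_switch=0.05, accuracy=0.7):
    """Causal forward filter: P(attended speaker = 1 | decisions so far).

    decisions: per-window AAD outputs, each 0 or 1.
    p_switch:  assumed probability that attention switches between windows.
    accuracy:  assumed probability that a single-window decision is correct.
    """
    belief = [0.5, 0.5]  # uniform prior over the two speakers
    posteriors = []
    for d in decisions:
        # Prediction step: attention mostly stays on the same speaker.
        pred = [
            (1 - p_switch) * belief[0] + p_switch * belief[1],
            p_switch * belief[0] + (1 - p_switch) * belief[1],
        ]
        # Update step: weight each state by the likelihood of the decision.
        like = [accuracy if d == s else 1 - accuracy for s in (0, 1)]
        unnorm = [l * p for l, p in zip(like, pred)]
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
        posteriors.append(belief[1])
    return posteriors


raw = [1, 1, 0, 1, 1, 1, 0, 1]  # noisy window decisions with isolated flips
posteriors = hmm_filter(raw)
smoothed = [int(p > 0.5) for p in posteriors]
print(smoothed)
```

With a low switch probability, isolated contradictory windows lower the posterior without flipping the decision. A non-causal (offline) variant would add a backward pass over the same sequence.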
[LG-78] Minimax and Bayes Optimal Best-arm Identification: Adaptive Experimental Design for Treatment Choice
Link: https://arxiv.org/abs/2506.24007
Authors: Masahiro Kato
Categories: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:This study investigates adaptive experimental design for treatment choice, also known as fixed-budget best-arm identification. We consider an adaptive procedure consisting of a treatment-allocation phase followed by a treatment-choice phase, and we design an adaptive experiment for this setup to efficiently identify the best treatment arm, defined as the one with the highest expected outcome. In our designed experiment, the treatment-allocation phase consists of two stages. The first stage is a pilot phase, where we allocate each treatment arm uniformly with equal proportions to eliminate clearly suboptimal arms and estimate outcome variances. In the second stage, we allocate treatment arms in proportion to the variances estimated in the first stage. After the treatment-allocation phase, the procedure enters the treatment-choice phase, where we choose the treatment arm with the highest sample mean as our estimate of the best treatment arm. We prove that this single design is simultaneously asymptotically minimax and Bayes optimal for the simple regret, with upper bounds that match our lower bounds up to exact constants. Therefore, our designed experiment achieves the sharp efficiency limits without requiring separate tuning for minimax and Bayesian objectives.
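The two-stage allocation described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: it omits the elimination of clearly suboptimal arms, follows the abstract's wording in allocating the second stage in proportion to the estimated variances, and uses an arbitrary pilot fraction and toy Gaussian arms. The function name `two_stage_experiment` is hypothetical.

```python
import random
import statistics


def two_stage_experiment(arms, budget, pilot_frac=0.2, seed=0):
    """Pilot stage, variance-proportional stage, then pick the best sample mean.

    arms: list of callables; each takes a random.Random and returns one outcome.
    """
    rng = random.Random(seed)
    k = len(arms)
    samples = [[] for _ in range(k)]

    # Stage 1 (pilot): allocate uniformly and estimate outcome variances.
    n_pilot = max(2, int(pilot_frac * budget / k))
    for a in range(k):
        for _ in range(n_pilot):
            samples[a].append(arms[a](rng))
    variances = [statistics.variance(s) for s in samples]

    # Stage 2: split the remaining budget in proportion to those variances.
    remaining = budget - k * n_pilot
    total_var = sum(variances)
    for a in range(k):
        for _ in range(int(remaining * variances[a] / total_var)):
            samples[a].append(arms[a](rng))

    # Treatment choice: the arm with the highest sample mean.
    means = [statistics.fmean(s) for s in samples]
    best = max(range(k), key=lambda a: means[a])
    return best, [len(s) for s in samples]


# Toy arms with (mean, standard deviation) chosen only for illustration.
arms = [
    lambda r: r.gauss(0.0, 1.0),
    lambda r: r.gauss(0.5, 2.0),
    lambda r: r.gauss(1.0, 0.5),
]
best, counts = two_stage_experiment(arms, budget=3000)
print(best, counts)
```

The noisiest arm receives the most second-stage samples, so the sample means being compared at the end have more comparable precision.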
[LG-79] Learning robust parameter inference and density reconstruction in flyer plate impact experiments
Link: https://arxiv.org/abs/2506.23914
Authors: Evan Bell,Daniel A. Serino,Ben S. Southworth,Trevor Wilcox,Marc L. Klasky
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments: 24 pages, 21 figures
Abstract:Estimating physical parameters or material properties from experimental observations is a common objective in many areas of physics and material science. In many experiments, especially in shock physics, radiography is the primary means of observing the system of interest. However, radiography does not provide direct access to key state variables, such as density, which prevents the application of traditional parameter estimation approaches. Here we focus on flyer plate impact experiments on porous materials, and resolving the underlying parameterized equation of state (EoS) and crush porosity model parameters given radiographic observation(s). We use machine learning as a tool to demonstrate with high confidence that using only high impact velocity data does not provide sufficient information to accurately infer both EoS and crush model parameters, even with fully resolved density fields or a dynamic sequence of images. We thus propose an observable data set consisting of low and high impact velocity experiments/simulations that capture different regimes of compaction and shock propagation, and proceed to introduce a generative machine learning approach which produces a posterior distribution of physical parameters directly from radiographs. We demonstrate the effectiveness of the approach in estimating parameters from simulated flyer plate impact experiments, and show that the obtained estimates of EoS and crush model parameters can then be used in hydrodynamic simulations to obtain accurate and physically admissible density reconstructions. Finally, we examine the robustness of the approach to model mismatches, and find that the learned approach can provide useful parameter estimates in the presence of out-of-distribution radiographic noise and previously unseen physics, thereby promoting a potential breakthrough in estimating material properties from experimental radiographic images.
[LG-80] Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
链接: https://arxiv.org/abs/2506.23836
作者: Alexander Tyurin
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_s$ and $\tau_w$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $\tau_s$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_s d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new “worst-case” function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
[LG-81] Explainable AI for Comprehensive Risk Assessment for Financial Reports: A Lightweight Hierarchical Transformer Network Approach
链接: https://arxiv.org/abs/2506.23767
作者: Xue Wen Tan,Stanley Kok
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG)
*备注:
Abstract:Every publicly traded U.S. company files an annual 10-K report containing critical insights into financial health and risk. We propose Tiny eXplainable Risk Assessor (TinyXRA), a lightweight and explainable transformer-based model that automatically assesses company risk from these reports. Unlike prior work that relies solely on the standard deviation of excess returns (adjusted for the Fama-French model), which indiscriminately penalizes both upside and downside risk, TinyXRA incorporates skewness, kurtosis, and the Sortino ratio for more comprehensive risk assessment. We leverage TinyBERT as our encoder to efficiently process lengthy financial documents, coupled with a novel dynamic, attention-based word cloud mechanism that provides intuitive risk visualization while filtering irrelevant terms. This lightweight design ensures scalable deployment across diverse computing environments with real-time processing capabilities for thousands of financial documents which is essential for production systems with constrained computational resources. We employ triplet loss for risk quartile classification, improving over pairwise loss approaches in existing literature by capturing both the direction and magnitude of risk differences. Our TinyXRA achieves state-of-the-art predictive accuracy across seven test years on a dataset spanning 2013-2024, while providing transparent and interpretable risk assessments. We conduct comprehensive ablation studies to evaluate our contributions and assess model explanations both quantitatively by systematically removing highly attended words and sentences, and qualitatively by examining explanation coherence. The paper concludes with findings, practical implications, limitations, and future research directions.
[LG-82] Overparametrized models with posterior drift
链接: https://arxiv.org/abs/2506.23619
作者: Guillaume Coqueret,Martial Laguerre
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*备注:
Abstract:This paper investigates the impact of posterior drift on out-of-sample forecasting accuracy in overparametrized machine learning models. We document the loss in performance when the loadings of the data generating process change between the training and testing samples. This matters crucially in settings in which regime changes are likely to occur, for instance, in financial markets. Applied to equity premium forecasting, our results underline the sensitivity of a market timing strategy to sub-periods and to the bandwidth parameters that control the complexity of the model. For the average investor, we find that focusing on holding periods of 15 years can generate very heterogeneous returns, especially for small bandwidths. Large bandwidths yield much more consistent outcomes, but are far less appealing from a risk-adjusted return standpoint. All in all, our findings tend to recommend cautiousness when resorting to large linear models for stock market predictions.
[LG-83] Seeding neural network quantum states with tensor network states
链接: https://arxiv.org/abs/2506.23550
作者: Ryui Kaneko,Shimpei Goto
类目: Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Numerical Analysis (math.NA); Quantum Physics (quant-ph)
*备注: 13 pages, 13 figures
Abstract:We find an efficient approach to approximately convert matrix product states (MPSs) into restricted Boltzmann machine wave functions consisting of a multinomial hidden unit through a canonical polyadic (CP) decomposition of the MPSs. This method allows us to generate well-behaved initial neural network quantum states for quantum many-body ground-state calculations in time polynomial in the number of variational parameters, and to systematically shorten the distance between the initial states and the ground states by increasing the rank of the CP decomposition. We demonstrate the efficiency of our method by taking the transverse-field Ising model as an example and discuss possible applications of our method to more general quantum many-body systems in which the ground-state wave functions possess complex nodal structures.
[LG-84] Neural Langevin Machine: a local asymmetric learning rule can be creative
链接: https://arxiv.org/abs/2506.23546
作者: Zhendong Yu,Weizhong Huang,Haiping Huang
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 3 figures, with Github link in the paper
Abstract:Fixed points of recurrent neural networks can be leveraged to store and generate information. These fixed points can be captured by the Boltzmann-Gibbs measure, which leads to neural Langevin dynamics that can be used for sampling and learning a real dataset. We call this type of generative model a neural Langevin machine; it is interpretable due to its analytic form of distribution and is simple to train. Moreover, the learning process is derived as a local asymmetric plasticity rule, bearing biological relevance. Therefore, one can realize a continuous sampling of creative dynamics in a neural network, mimicking an imagination process in brain circuits. This neural Langevin machine may be another promising generative model, at least for its strength in circuit-based sampling and its biologically plausible learning rule.
[LG-85] Test of partial effects for Frechet regression on Bures-Wasserstein manifolds
链接: https://arxiv.org/abs/2506.23487
作者: Haoshu Xu,Hongzhe Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose a novel test for assessing partial effects in Frechet regression on Bures-Wasserstein manifolds. Our approach employs a sample splitting strategy: the first subsample is used to fit the Frechet regression model, yielding estimates of the covariance matrices and their associated optimal transport maps, while the second subsample is used to construct the test statistic. We prove that this statistic converges in distribution to a weighted mixture of chi-squared components, where the weights correspond to the eigenvalues of an integral operator defined by an appropriate RKHS kernel. We establish that our procedure achieves the nominal asymptotic size and demonstrate that its worst-case power converges uniformly to one. Through extensive simulations and a real data application, we illustrate the test’s finite-sample accuracy and practical utility.
[LG-86] Sampling and Identity-Testing Without Approximate Tensorization of Entropy
链接: https://arxiv.org/abs/2506.23456
作者: William Gay,William He,Nicholas Kocurek,Ryan O’Donnell
类目: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Certain tasks in high-dimensional statistics become easier when the underlying distribution satisfies a local-to-global property called approximate tensorization of entropy (ATE). For example, the Glauber dynamics Markov chain of an ATE distribution mixes fast and can produce approximate samples in a small amount of time, since such a distribution satisfies a modified log-Sobolev inequality. Moreover, identity-testing for an ATE distribution requires few samples if the tester is given coordinate conditional access to the unknown distribution, as shown by Blanca, Chen, Štefankovič, and Vigoda (COLT 2023). A natural class of distributions that do not satisfy ATE consists of mixtures of (few) distributions that do satisfy ATE. We study the complexity of identity-testing and sampling for these distributions. Our main results are the following: 1. We show fast mixing of Glauber dynamics from a data-based initialization, with optimal sample complexity, for mixtures of distributions satisfying modified log-Sobolev inequalities. This extends work of Huang, Koehler, Lee, Mohanty, Rajaraman, Vuong, and Wu (STOC 2025, COLT 2025) for mixtures of distributions satisfying Poincaré inequalities. 2. Answering an open question posed by Blanca et al., we give efficient identity-testers for mixtures of ATE distributions in the coordinate-conditional sampling access model. We also give some simplifications and improvements to the original algorithm of Blanca et al.
[LG-87] Minimax Optimal Two-Stage Algorithm For Moment Estimation Under Covariate Shift
链接: https://arxiv.org/abs/2506.23453
作者: Zhen Zhang,Xin Liu,Shaoli Wang,Jiaye Teng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Covariate shift occurs when the distribution of input features differs between the training and testing phases. In covariate shift, estimating an unknown function’s moment is a classical problem that remains under-explored, despite its common occurrence in real-world scenarios. In this paper, we investigate the minimax lower bound of the problem when the source and target distributions are known. To achieve the minimax optimal bound (up to a logarithmic factor), we propose a two-stage algorithm. Specifically, it first trains an optimal estimator for the function under the source distribution, and then uses a likelihood ratio reweighting procedure to calibrate the moment estimator. In practice, the source and target distributions are typically unknown, and estimating the likelihood ratio may be unstable. To solve this problem, we propose a truncated version of the estimator that ensures double robustness and provide the corresponding upper bound. Extensive numerical studies on synthetic examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.
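The two-stage idea in this abstract (fit the function on source data, then calibrate with a likelihood-ratio reweighting, optionally truncated) can be illustrated with a minimal sketch. This is not the authors' estimator: it assumes the target quantity is the first moment of the function under the target distribution, uses the textbook doubly robust form, and the names `truncated_dr_moment`, `f_hat`, `ratio`, and `cap` are hypothetical.

```python
def truncated_dr_moment(source_x, source_y, target_x, f_hat, ratio, cap):
    """Estimate E_target[f(X)] from labelled source data and unlabelled target data.

    f_hat : fitted regression function (stage one, trained on the source).
    ratio : estimated likelihood ratio p_target(x) / p_source(x).
    cap   : truncation level applied to the ratio to limit variance.
    All three are assumptions of this sketch, not the paper's definitions.
    """
    if not source_x or not target_x or len(source_x) != len(source_y):
        raise ValueError("need non-empty, aligned source data and non-empty target data")

    # Plug-in term: average the fitted function over target inputs.
    plug_in = sum(f_hat(x) for x in target_x) / len(target_x)

    # Calibration term: reweighted source residuals with a truncated ratio.
    correction = sum(
        min(ratio(x), cap) * (y - f_hat(x)) for x, y in zip(source_x, source_y)
    ) / len(source_x)

    return plug_in + correction


# Usage: with an exact f_hat the residuals vanish and only the plug-in term remains.
xs_source = [0.0, 1.0, 2.0, 3.0]
ys_source = [x ** 2 for x in xs_source]
xs_target = [1.0, 2.0, 3.0, 4.0]
estimate = truncated_dr_moment(
    xs_source, ys_source, xs_target, lambda x: x ** 2, lambda x: 1.0, cap=10.0
)
```

The combination is called doubly robust because the estimate stays reasonable if either `f_hat` or `ratio` is accurate. Truncating the ratio trades a small bias for lower variance when the estimated ratio is unstable.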
[LG-88] DPOT: A DeepParticle method for Computation of Optimal Transport with convergence guarantee
链接: https://arxiv.org/abs/2506.23429
作者: Yingyuan Li,Aokun Wang,Zhongjian Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this work, we propose a novel machine learning approach to compute the optimal transport map between two continuous distributions from their unpaired samples, based on the DeepParticle methods. The proposed method leads to a min-min optimization during training and does not impose any restriction on the network structure. Theoretically we establish a weak convergence guarantee and a quantitative error bound between the learned map and the optimal transport map. Our numerical experiments validate the theoretical results and the effectiveness of the new approach, particularly on real-world tasks.
[LG-89] AICO: Feature Significance Tests for Supervised Learning
链接: https://arxiv.org/abs/2506.23396
作者: Kay Giesecke,Enguerrand Horel,Chartsiri Jirachotkulthorn
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The opacity of many supervised learning algorithms remains a key challenge, hindering scientific discovery and limiting broader deployment – particularly in high-stakes domains. This paper develops model- and distribution-agnostic significance tests to assess the influence of input features in any regression or classification algorithm. Our method evaluates a feature’s incremental contribution to model performance by masking its values across samples. Under the null hypothesis, the distribution of performance differences across a test set has a non-positive median. We construct a uniformly most powerful, randomized sign test for this median, yielding exact p-values for assessing feature significance and confidence intervals with exact coverage for estimating population-level feature importance. The approach requires minimal assumptions, avoids model retraining or auxiliary models, and remains computationally efficient even for large-scale, high-dimensional settings. Experiments on synthetic tasks validate its statistical and computational advantages, and applications to real-world data illustrate its practical utility.
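A minimal sketch of the masking-plus-sign-test idea described in this abstract follows. It uses a plain one-sided sign test with zero differences dropped, not the randomized, uniformly most powerful test the paper constructs. The helper names `feature_deltas` and `sign_test_pvalue`, the dictionary row format, and the choice of mask value are all assumptions made for illustration.

```python
from math import comb


def feature_deltas(loss_fn, model, rows, targets, feature, mask_value):
    """Per-sample loss increase when `feature` is replaced by `mask_value`."""
    deltas = []
    for x, y in zip(rows, targets):
        masked = dict(x)
        masked[feature] = mask_value
        deltas.append(loss_fn(model(masked), y) - loss_fn(model(x), y))
    return deltas


def sign_test_pvalue(deltas):
    """One-sided sign test of H0: median(delta) <= 0 against H1: median > 0.

    Zero differences are dropped, a common convention that differs from the
    randomized treatment of ties described in the abstract.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for d in nonzero if d > 0)
    # Exact tail probability P(Binomial(n, 1/2) >= positives).
    tail = sum(comb(n, k) for k in range(positives, n + 1))
    return tail / 2 ** n


# Usage: a toy model that depends on feature "a" but ignores feature "b".
rows = [{"a": float(i), "b": 1.0} for i in range(1, 11)]
targets = [2.0 * r["a"] for r in rows]
model = lambda x: 2.0 * x["a"]
squared_error = lambda pred, y: (pred - y) ** 2

p_a = sign_test_pvalue(feature_deltas(squared_error, model, rows, targets, "a", 0.0))
p_b = sign_test_pvalue(feature_deltas(squared_error, model, rows, targets, "b", 0.0))
```

Masking the relevant feature raises the loss on every sample, so its p-value is tiny, while masking the ignored feature changes nothing and gives a p-value of 1. No retraining is needed, which matches the abstract's claim that the approach avoids auxiliary models.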
[LG-90] Investigating an Overfitting and Degeneration Phenomenon in Self-Supervised Multi-Pitch Estimation
链接: https://arxiv.org/abs/2506.23371
作者: Frank Cwitkowitz,Zhiyao Duan
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to ISMIR 2025
Abstract:Multi-Pitch Estimation (MPE) continues to be a sought after capability of Music Information Retrieval (MIR) systems, and is critical for many applications and downstream tasks involving pitch, including music transcription. However, existing methods are largely based on supervised learning, and there are significant challenges in collecting annotated data for the task. Recently, self-supervised techniques exploiting intrinsic properties of pitch and harmonic signals have shown promise for both monophonic and polyphonic pitch estimation, but these still remain inferior to supervised methods. In this work, we extend the classic supervised MPE paradigm by incorporating several self-supervised objectives based on pitch-invariant and pitch-equivariant properties. This joint training results in a substantial improvement under closed training conditions, which naturally suggests that applying the same objectives to a broader collection of data will yield further improvements. However, in doing so we uncover a phenomenon whereby our model simultaneously overfits to the supervised data while degenerating on data used for self-supervision only. We demonstrate and investigate this and offer our insights on the underlying problem.
[LG-91] Physics informed guided diffusion for accelerated multi-parametric MRI reconstruction MICCAI2025
链接: https://arxiv.org/abs/2506.23311
作者: Perla Mayo,Carolin M. Pirkl,Alin Achim,Bjoern Menze,Mohammad Golbabaee
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 11 pages, 1 figure, 1 algorithm, 3 tables. Accepted to MICCAI 2025. This is a version prior to peer review
Abstract:We introduce MRF-DiPh, a novel physics-informed denoising diffusion approach for multiparametric tissue mapping from highly accelerated, transient-state quantitative MRI acquisitions like Magnetic Resonance Fingerprinting (MRF). Our method is derived from a proximal splitting formulation, incorporating a pretrained denoising diffusion model as an effective image prior to regularize the MRF inverse problem. Further, during reconstruction it simultaneously enforces two key physical constraints: (1) k-space measurement consistency and (2) adherence to the Bloch response model. Numerical experiments on in-vivo brain scan data show that MRF-DiPh outperforms deep learning and compressed sensing MRF baselines, providing more accurate parameter maps while better preserving measurement fidelity and physical model consistency, both of which are critical for reliably solving inverse problems in medical imaging.
[LG-92] On Universality of Non-Separable Approximate Message Passing Algorithms
链接: https://arxiv.org/abs/2506.23010
作者: Max Lovig,Tianhao Wang,Zhou Fan
类目: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)
*备注:
Abstract:Mean-field characterizations of first-order iterative algorithms – including Approximate Message Passing (AMP), stochastic and proximal gradient descent, and Langevin diffusions – have enabled a precise understanding of learning dynamics in many statistical applications. For algorithms whose non-linearities have a coordinate-separable form, it is known that such characterizations enjoy a degree of universality with respect to the underlying data distribution. However, mean-field characterizations of non-separable algorithm dynamics have largely remained restricted to i.i.d. Gaussian or rotationally-invariant data. In this work, we initiate a study of universality for non-separable AMP algorithms. We identify a general condition for AMP with polynomial non-linearities, in terms of a Bounded Composition Property (BCP) for their representing tensors, to admit a state evolution that holds universally for matrices with non-Gaussian entries. We then formalize a condition of BCP-approximability for Lipschitz AMP algorithms to enjoy a similar universal guarantee. We demonstrate that many common classes of non-separable non-linearities are BCP-approximable, including local denoisers, spectral denoisers for generic signals, and compositions of separable functions with generic linear maps, implying the universality of state evolution for AMP algorithms employing these non-linearities.
[LG-93] CN-SBM: Categorical Block Modelling For Primary and Residual Copy Number Variation
链接: https://arxiv.org/abs/2506.22963
作者: Kevin Lam,William Daniels,J Maxwell Douglas,Daniel Lai,Samuel Aparicio,Benjamin Bloem-Reddy,Yongjin Park
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 8 pages, 4 figures
Abstract:Cancer is a genetic disorder whose clonal evolution can be monitored by tracking noisy genome-wide copy number variants. We introduce the Copy Number Stochastic Block Model (CN-SBM), a probabilistic framework that jointly clusters samples and genomic regions based on discrete copy number states using a bipartite categorical block model. Unlike models relying on Gaussian or Poisson assumptions, CN-SBM respects the discrete nature of CNV calls and captures subpopulation-specific patterns through block-wise structure. Using a two-stage approach, CN-SBM decomposes CNV data into primary and residual components, enabling detection of both large-scale chromosomal alterations and finer aberrations. We derive a scalable variational inference algorithm for application to large cohorts and high-resolution data. Benchmarks on simulated and real datasets show improved model fit over existing methods. Applied to TCGA low-grade glioma data, CN-SBM reveals clinically relevant subtypes and structured residual variation, aiding patient stratification in survival analysis. These results establish CN-SBM as an interpretable, scalable framework for CNV analysis with direct relevance for tumor heterogeneity and prognosis.
[LG-94] Differentiable Radar Ambiguity Functions: Mathematical Formulation and Computational Implementation
链接: https://arxiv.org/abs/2506.22935
作者: Marc Bara Iniesta
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 16 pages, 4 figures, source code available at this https URL (DOI: https://doi.org/10.5281/zenodo.15763301 )
Abstract:The ambiguity function is fundamental to radar waveform design, characterizing range and Doppler resolution capabilities. However, its traditional formulation involves non-differentiable operations, preventing integration with gradient-based optimization methods and modern machine learning frameworks. This paper presents the first complete mathematical framework and computational implementation for differentiable radar ambiguity functions. Our approach addresses the fundamental technical challenges that have prevented the radar community from leveraging automatic differentiation: proper handling of complex-valued gradients using Wirtinger calculus, efficient computation through parallelized FFT operations, numerical stability throughout cascaded operations, and composability with arbitrary differentiable operations. We term this approach GRAF (Gradient-based Radar Ambiguity Functions), which reformulates the ambiguity function computation to maintain mathematical equivalence while enabling gradient flow through the entire pipeline. The resulting implementation provides a general-purpose differentiable ambiguity function compatible with modern automatic differentiation frameworks, enabling new research directions including neural network-based waveform generation with ambiguity constraints, end-to-end optimization of radar systems, and integration of classical radar theory with modern deep learning. We provide complete implementation details and demonstrate computational efficiency suitable for practical applications. This work establishes the mathematical and computational foundation for applying modern machine learning techniques to radar waveform design, bridging classical radar signal processing with automatic differentiation frameworks.
[LG-95] Deep neural networks can provably solve Bellman equations for Markov decision processes without the curse of dimensionality
链接: https://arxiv.org/abs/2506.22851
作者: Arnulf Jentzen,Konrad Kleinberg,Thomas Kruse
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:Discrete time stochastic optimal control problems and Markov decision processes (MDPs) are fundamental models for sequential decision-making under uncertainty and as such provide the mathematical framework underlying reinforcement learning theory. A central tool for solving MDPs is the Bellman equation and its solution, the so-called $Q$-function. In this article, we construct deep neural network (DNN) approximations for $Q$-functions associated to MDPs with infinite time horizon and finite control set $A$. More specifically, we show that if the payoff function and the random transition dynamics of the MDP can be suitably approximated by DNNs with leaky rectified linear unit (ReLU) activation, then the solutions $Q_d\colon \mathbb{R}^d\to \mathbb{R}^{|A|}$, $d\in \mathbb{N}$, of the associated Bellman equations can also be approximated in the $L^2$-sense by DNNs with leaky ReLU activation whose numbers of parameters grow at most polynomially in both the dimension $d\in \mathbb{N}$ of the state space and the reciprocal $1/\varepsilon$ of the prescribed error $\varepsilon\in (0,1)$. Our proof relies on the recently introduced full-history recursive multilevel fixed-point (MLFP) approximation scheme.
[LG-96] Can We Reliably Predict the Fed's Next Move? A Multi-Modal Approach to U.S. Monetary Policy Forecasting
链接: https://arxiv.org/abs/2506.22763
作者: Fiona Xiao Jingyi,Lili Liu
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 9 pages, 15 figures
Abstract:Forecasting central bank policy decisions remains a persistent challenge for investors, financial institutions, and policymakers due to the wide-reaching impact of monetary actions. In particular, anticipating shifts in the U.S. federal funds rate is vital for risk management and trading strategies. Traditional methods relying only on structured macroeconomic indicators often fall short in capturing the forward-looking cues embedded in central bank communications. This study examines whether predictive accuracy can be enhanced by integrating structured data with unstructured textual signals from Federal Reserve communications. We adopt a multi-modal framework, comparing traditional machine learning models, transformer-based language models, and deep learning architectures in both unimodal and hybrid settings. Our results show that hybrid models consistently outperform unimodal baselines. The best performance is achieved by combining TF-IDF features of FOMC texts with economic indicators in an XGBoost classifier, reaching a test AUC of 0.83. FinBERT-based sentiment features marginally improve ranking but perform worse in classification, especially under class imbalance. SHAP analysis reveals that sparse, interpretable features align more closely with policy-relevant signals. These findings underscore the importance of integrating textual and structured signals transparently. For monetary policy forecasting, simpler hybrid models can offer both accuracy and interpretability, delivering actionable insights for researchers and decision-makers. 
[LG-97] Lower bounds for trace estimation via Block Krylov and other methods
链接: https://arxiv.org/abs/2506.22701
作者: Shi Jie Yu
类目: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This paper studies theoretical lower bounds for estimating the trace of a matrix function, $\text{tr}(f(A))$, focusing on methods that use Hutchinson’s method along with Block Krylov techniques. These methods work by approximating matrix-vector products like $f(A)V$ using a Block Krylov subspace. This is closely related to approximating functions with polynomials. We derive theoretical upper bounds on how many Krylov steps are needed for functions such as $A^{-1/2}$ and $A^{-1}$ by analyzing the upper bounds from the polynomial approximation of their scalar equivalent. In addition, we also develop lower limits on the number of queries needed for trace estimation, specifically for $\text{tr}(W^{-p})$ where $W$ is a Wishart matrix. Our study clarifies the connection between the number of steps in Block Krylov methods and the degree of the polynomial used for approximation. This links the total cost of trace estimation to basic limits in polynomial approximation and how much information is needed for the computation.
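For readers unfamiliar with the Hutchinson estimator mentioned in this abstract, a minimal sketch follows. It covers only the stochastic trace estimate with exact matrix-vector products for the identity function, and leaves out the Block Krylov approximation of a general matrix function that the paper analyzes. The function names and the use of Rademacher (plus or minus one) probe vectors are choices made for this illustration.

```python
import random


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def hutchinson_trace(matvec_fn, dim, num_probes, rng):
    """Estimate the trace of a matrix using only matrix-vector products.

    Averages z^T (M z) over random Rademacher probe vectors z, which is an
    unbiased estimator of the trace of M.
    """
    if num_probes <= 0:
        raise ValueError("num_probes must be positive")
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        mz = matvec_fn(z)
        total += sum(zi * yi for zi, yi in zip(z, mz))
    return total / num_probes


# Usage: for a diagonal matrix every Rademacher probe recovers the trace exactly,
# because each z_i squared equals one and there are no off-diagonal terms.
diagonal = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 5.0]]
estimate = hutchinson_trace(lambda v: matvec(diagonal, v), 3, 7, random.Random(0))
```

For matrices with off-diagonal entries the estimate is noisy, and its variance shrinks as the number of probes grows. Each probe costs one matrix-vector product, and that product count is the query budget the lower bounds in the abstract refer to.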
[LG-98] Bayesian Invariance Modeling of Multi-Environment Data
链接: https://arxiv.org/abs/2506.22675
作者: Luhuan Wu,Mingzhang Yin,Yixin Wang,John P. Cunningham,David M. Blei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Invariant prediction [Peters et al., 2016] analyzes feature/outcome data from multiple environments to identify invariant features - those with a stable predictive relationship to the outcome. Such features support generalization to new environments and help reveal causal mechanisms. Previous methods have primarily tackled this problem through hypothesis testing or regularized optimization. Here we develop Bayesian Invariant Prediction (BIP), a probabilistic model for invariant prediction. BIP encodes the indices of invariant features as a latent variable and recovers them by posterior inference. Under the assumptions of Peters et al. [2016], the BIP posterior targets the true invariant features. We prove that the posterior is consistent and that greater environment heterogeneity leads to faster posterior contraction. To handle many features, we design an efficient variational approximation called VI-BIP. In simulations and real data, we find that BIP and VI-BIP are more accurate and scalable than existing methods for invariant prediction.
[LG-99] Diversity by Design: Addressing Mode Collapse Improves scRNA-seq Perturbation Modeling on Well-Calibrated Metrics
链接: https://arxiv.org/abs/2506.22641
作者: Gabriel M. Mejia,Henry E. Miller,Francis J. A. Leblanc,Bo Wang,Brendan Swain,Lucas Paulo de Lima Camillo
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Molecular Networks (q-bio.MN); Machine Learning (stat.ML)
*备注:
Abstract:Recent benchmarks reveal that models for single-cell perturbation response are often outperformed by simply predicting the dataset mean. We trace this anomaly to a metric artifact: control-referenced deltas and unweighted error metrics reward mode collapse whenever the control is biased or the biological signal is sparse. Large-scale in silico simulations and analysis of two real-world perturbation datasets confirm that shared reference shifts, not genuine biological change, drive high performance in these evaluations. We introduce differentially expressed gene (DEG)-aware metrics, weighted mean-squared error (WMSE) and weighted delta $R^2$ ($R^2_w(\Delta)$) with respect to all perturbations, that measure error in niche signals with high sensitivity. We further introduce negative and positive performance baselines to calibrate these metrics. With these improvements, the mean baseline sinks to null performance while genuine predictors are correctly rewarded. Finally, we show that using WMSE as a loss function reduces mode collapse and improves model performance.
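The effect of weighting errors toward differentially expressed genes can be seen in a small sketch. The abstract does not give the paper's weighting scheme, so the per-gene weights below are purely hypothetical, and `weighted_mse` is a generic weighted mean-squared error rather than the authors' exact WMSE.

```python
def weighted_mse(predicted, observed, weights):
    """Weighted mean-squared error, normalized by the total weight."""
    if not (len(predicted) == len(observed) == len(weights)):
        raise ValueError("inputs must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = sum(w * (p - o) ** 2 for p, o, w in zip(predicted, observed, weights))
    return squared / total_weight


# Usage: three unchanged genes and one strongly perturbed gene (hypothetical values).
observed = [0.0, 0.0, 0.0, 4.0]
mean_like = [1.0, 1.0, 1.0, 1.0]  # a collapsed, mean-style prediction

uniform = weighted_mse(mean_like, observed, [1.0, 1.0, 1.0, 1.0])
deg_focused = weighted_mse(mean_like, observed, [0.0, 0.0, 0.0, 1.0])
```

With uniform weights the collapsed prediction looks acceptable because most genes barely change. Concentrating the weight on the perturbed gene exposes its error, which is the intuition behind DEG-aware metrics in the abstract.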
[LG-100] Deep Hedging to Manage Tail Risk
链接: https://arxiv.org/abs/2506.22611
作者: Yuming Ma
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Risk Management (q-fin.RM)
*备注: 59 pages
Abstract:Extending Buehler et al.'s 2019 Deep Hedging paradigm, we innovatively employ deep neural networks to parameterize convex-risk minimization (CVaR/ES) for the portfolio tail-risk hedging problem. Through comprehensive numerical experiments on crisis-era bootstrap market simulators – customizable with transaction costs, risk budgets, liquidity constraints, and market impact – our end-to-end framework not only achieves significant one-day 99% CVaR reduction but also yields practical insights into friction-aware strategy adaptation, demonstrating robustness and operational viability in realistic markets.
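The tail-risk objective named in this abstract, CVaR (also called expected shortfall), can be estimated from simulated losses as the average of the worst outcomes. The sketch below uses one common empirical convention and is not taken from the paper. The function name and the rounding guard against floating-point error are assumptions of this illustration.

```python
import math


def empirical_cvar(losses, alpha=0.99):
    """Average of the worst (1 - alpha) fraction of losses, larger meaning worse.

    Uses the top ceil((1 - alpha) * n) losses; other empirical conventions
    interpolate at the quantile and can give slightly different values.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(losses, reverse=True)
    # Round before ceil so that, for example, (1 - 0.99) * 100 counts as 1, not 2.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / tail_size


# Usage: losses 1..100, so the worst 1% is the single loss 100.
losses = [float(i) for i in range(1, 101)]
cvar_99 = empirical_cvar(losses, alpha=0.99)
cvar_80 = empirical_cvar([float(i) for i in range(1, 11)], alpha=0.8)
```

A hedging strategy trained to minimize this quantity is penalized only for outcomes in the tail, which is why CVaR suits crisis-era scenarios better than variance-based objectives.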
[LG-101] Learning Individual Reproductive Behavior from Aggregate Fertility Rates via Neural Posterior Estimation
链接: https://arxiv.org/abs/2506.22607
作者: Daniel Ciganda,Ignacio Campón,Iñaki Permanyer,Jakob H Macke
类目: Applications (stat.AP); Machine Learning (cs.LG)
备注:
Abstract:While age-specific fertility rates (ASFRs) provide the most extensive record of reproductive change, their aggregate nature masks the underlying behavioral mechanisms that ultimately drive fertility trends. To recover these mechanisms, we develop a likelihood-free Bayesian framework that couples an individual-level model of the reproductive process with Sequential Neural Posterior Estimation (SNPE). This allows us to infer eight behavioral and biological parameters from just two aggregate series: ASFRs and the age-profile of planned versus unplanned births. Applied to U.S. National Survey of Family Growth cohorts and to Demographic and Health Survey cohorts from Colombia, the Dominican Republic, and Peru, the method reproduces observed fertility schedules and, critically, predicts out-of-sample micro-level distributions of age at first sex, inter-birth intervals, and family-size ideals, none of which inform the estimation step. Because the fitted model yields complete synthetic life histories, it enables behaviorally explicit population forecasts and supports the construction of demographic digital twins.
[LG-102] Adjoint Schrödinger Bridge Sampler
链接: https://arxiv.org/abs/2506.22565
作者: Guan-Horng Liu,Jaemoo Choi,Yongxin Chen,Benjamin Kurt Miller,Ricky T. Q. Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
Abstract:Computational methods for learning to sample from the Boltzmann distribution – where the target distribution is known only up to an unnormalized energy function – have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as diffusion samplers, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose Adjoint Schrödinger Bridge Sampler (ASBS), a new diffusion sampler that employs simple and scalable matching-based objectives yet without the need to estimate target samples during training. ASBS is grounded on a mathematical model – the Schrödinger Bridge – which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions.
[LG-103] Spectral Bias in Variational Quantum Machine Learning
链接: https://arxiv.org/abs/2506.22555
作者: Callum Duffy,Marcin Jastrzebski
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
备注: 12 pages, 8 figures
Abstract:In this work, we investigate the phenomenon of spectral bias in quantum machine learning. In classical settings, spectral bias means that models tend to fit low-frequency components of a target function earlier during training than high-frequency ones, demonstrating a frequency-dependent rate of convergence. We study this effect specifically in parameterised quantum circuits (PQCs). Leveraging the established formulation of PQCs as Fourier series, we prove that spectral bias in this setting arises from the "redundancy" of the Fourier coefficients, which denotes the number of terms in the analytical form of the model contributing to the same frequency component. The choice of data encoding scheme dictates the degree of redundancy for a Fourier coefficient. We find that the magnitude of the Fourier coefficients’ gradients during training strongly correlates with the coefficients’ redundancy. We then further demonstrate this empirically with three different encoding schemes. Additionally, we demonstrate that PQCs with greater redundancy exhibit increased robustness to random perturbations in their parameters at the corresponding frequencies. We investigate how design choices affect the ability of PQCs to learn Fourier sums, focusing on parameter initialization scale and entanglement structure, finding that large initializations and low-entanglement schemes tend to slow convergence.
[LG-104] Neural models of multiscale systems: conceptual limitations, stochastic parametrizations, and a climate application
链接: https://arxiv.org/abs/2506.22552
作者: Fabrizio Falasca
类目: Chaotic Dynamics (nlin.CD); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:This work explores key conceptual limitations in data-driven modeling of multiscale dynamical systems, focusing on neural emulators and stochastic climate modeling. A skillful climate model should capture both stationary statistics and responses to external perturbations. While current autoregressive neural models often reproduce the former, they typically struggle with the latter. We begin by analyzing a low-dimensional dynamical system to expose, by analogy, fundamental limitations that persist in high-dimensional settings. Specifically, we construct neural stochastic models under two scenarios: one where the full state vector is observed, and another with only partial observations (i.e. a subset of variables). In the first case, the models accurately capture both equilibrium statistics and forced responses in ensemble mean and variance. In the more realistic case of partial observations, two key challenges emerge: (i) identifying the proper variables to model, and (ii) parameterizing the influence of unobserved degrees of freedom. These issues are not specific to neural networks but reflect fundamental limitations of data-driven modeling and the need to target the slow dynamics of the system. We argue that physically grounded strategies – such as coarse-graining and stochastic parameterizations – are critical, both conceptually and practically, for the skillful emulation of complex systems like the coupled climate system. Building on these insights, we turn to a more realistic application: a stochastic reduced neural model of the sea surface temperature field and the net radiative flux at the top of the atmosphere, assessing its stationary statistics, response to temperature forcing, and interpretability.
[LG-105] Strategic A/B testing via Maximum Probability-driven Two-armed Bandit
链接: https://arxiv.org/abs/2506.22536
作者: Yu Zhang,Shanshan Zhao,Bokui Wan,Jinjuan Wang,Xiaodong Yan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
备注: 25 pages, 14 figures
Abstract:Detecting a minor average treatment effect is a major challenge in large-scale applications, where even minimal improvements can have a significant economic impact. Traditional methods, reliant on normal distribution-based or expanded statistics, often fail to identify such minor effects because they cannot handle small discrepancies with sufficient sensitivity. This work leverages a counterfactual outcome framework and proposes a maximum probability-driven two-armed bandit (TAB) process by weighting the mean volatility statistic, which controls Type I error. The implementation of permutation methods further enhances robustness and efficacy. The established strategic central limit theorem (SCLT) demonstrates that our approach yields a more concentrated distribution under the null hypothesis and a less concentrated one under the alternative hypothesis, greatly improving statistical power. The experimental results indicate a significant improvement in A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power.
[LG-106] MENGLAN: Multiscale Enhanced Nonparametric Gas Analyzer with Lightweight Architecture and Networks
链接: https://arxiv.org/abs/2506.22490
作者: Zhenke Duan,Jiqun Pan,Jiani Tu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注:
Abstract:Accurate detection of ethylene concentrations in mixed gases is crucial in chemical production for safety and health purposes. Traditional methods are hindered by high cost and complexity, limiting their practical application. This study proposes MENGLAN, a Multiscale Enhanced Nonparametric Gas Analyzer that integrates a dual-stream structure, a Hybrid Multi-Head Attention mechanism, and a Feature Reactivation Module to enable real-time, lightweight, and high-precision ethylene concentration prediction. Results show that MENGLAN achieves superior performance, reduced computational demand, and enhanced deployability compared to existing methods.
[LG-107] Zero-Shot EEG-to-Gait Decoding via Phase-Aware Representation Learning
链接: https://arxiv.org/abs/2506.22488
作者: Xi Fu,Weibang Jiang,Rui Liu,Gernot R. Müller-Putz,Cuntai Guan
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注:
Abstract:Accurate decoding of lower-limb motion from EEG signals is essential for advancing brain-computer interface (BCI) applications in movement intent recognition and control. However, challenges persist in achieving causal, phase-consistent predictions and in modeling both inter- and intra-subject variability. To address these issues, we propose NeuroDyGait, a domain-generalizable EEG-to-motion decoding framework that leverages structured contrastive representation learning and relational domain modeling. The proposed method employs relative contrastive learning to achieve semantic alignment between EEG and motion embeddings. Furthermore, a multi-cycle gait reconstruction objective is introduced to enforce temporal coherence and maintain biomechanical consistency. To promote inter-session generalization, during fine-tuning, a domain dynamic decoding mechanism adaptively assigns session-specific prediction heads and learns to mix their outputs based on inter-session relationships. NeuroDyGait enables zero-shot motion prediction for unseen individuals without requiring adaptation and achieves superior performance in cross-subject gait decoding on benchmark datasets. Additionally, it demonstrates strong phase-detection capabilities even without explicit phase supervision during training. These findings highlight the potential of relational domain learning in enabling scalable, target-free deployment of BCIs.
[LG-108] An Interpretable Transformer-Based Foundation Model for Cross-Procedural Skill Assessment Using Raw fNIRS Signals
链接: https://arxiv.org/abs/2506.22476
作者: A. Subedi,S. De,L. Cavuoto,S. Schwaitzberg,M. Hackett,J. Norfleet
类目: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Objective skill assessment in high-stakes procedural environments requires models that not only decode underlying cognitive and motor processes but also generalize across tasks, individuals, and experimental contexts. While prior work has demonstrated the potential of functional near-infrared spectroscopy (fNIRS) for evaluating cognitive-motor performance, existing approaches are often task-specific, rely on extensive preprocessing, and lack robustness to new procedures or conditions. Here, we introduce an interpretable transformer-based foundation model trained on minimally processed fNIRS signals for cross-procedural skill assessment. Pretrained using self-supervised learning on data from laparoscopic surgical tasks and endotracheal intubation (ETI), the model achieves greater than 88% classification accuracy on all tasks, with Matthews Correlation Coefficient exceeding 0.91 on ETI. It generalizes to a novel emergency airway procedure–cricothyrotomy–using fewer than 30 labeled samples and a lightweight (less than 2k parameter) adapter module, attaining an AUC greater than 87%. Interpretability is achieved via a novel channel attention mechanism–developed specifically for fNIRS–that identifies functionally coherent prefrontal sub-networks validated through ablation studies. Temporal attention patterns align with task-critical phases and capture stress-induced changes in neural variability, offering insight into dynamic cognitive states.
[LG-109] Physics-Embedded Neural Networks for sEMG-based Continuous Motion Estimation IROS
链接: https://arxiv.org/abs/2506.22459
作者: Wending Heng,Chaoyuan Liang,Yihui Zhao,Zhiqiang Zhang,Glen Cooper,Zhenhong Li
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注: Accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract:Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological consistency. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while achieving accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and R^2 metrics.
[LG-110] Data Normalization Strategies for EEG Deep Learning
链接: https://arxiv.org/abs/2506.22455
作者: Dung Truong,Arnaud Delorme
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注:
Abstract:Normalization is a critical yet often overlooked component in the preprocessing pipeline for EEG deep learning applications. The rise of large-scale pretraining paradigms such as self-supervised learning (SSL) introduces a new set of tasks whose nature is substantially different from supervised training common in EEG deep learning applications. This raises new questions about optimal normalization strategies for the applicable task. In this study, we systematically evaluate the impact of normalization granularity (recording vs. window level) and scope (cross-channel vs. within-channel) on both supervised (age and gender prediction) and self-supervised (Contrastive Predictive Coding) tasks. Using high-density resting-state EEG from 2,836 subjects in the Healthy Brain Network dataset, we show that optimal normalization strategies differ significantly between training paradigms. Window-level within-channel normalization yields the best performance in supervised tasks, while minimal or cross-channel normalization at the window level is more effective for SSL. These results underscore the necessity of task-specific normalization choices and challenge the assumption that a universal normalization strategy can generalize across learning settings. Our findings provide practical insights for developing robust EEG deep learning pipelines as the field shifts toward large-scale, foundation model training.
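The two axes compared in the abstract, granularity (recording vs. window) and scope (within-channel vs. cross-channel), can be sketched as a single z-scoring helper. This is an illustrative sketch only: the function names, the use of z-scoring, and the toy recording are assumptions, not the authors' pipeline.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def normalize(recording, window, within_channel=True):
    """Z-score a recording given as a list of channels (lists of samples).

    window=None    -> statistics computed over the whole recording
    window=int     -> statistics computed separately for each window
    within_channel -> one mean/std per channel; otherwise shared across channels
    """
    n_samples = len(recording[0])
    step = n_samples if window is None else window
    out = [[] for _ in recording]
    for start in range(0, n_samples, step):
        chunk = [ch[start:start + step] for ch in recording]
        if within_channel:
            normed = [zscore(c) for c in chunk]
        else:
            # Pool all channels, normalise once, then split back per channel.
            flat = zscore([v for c in chunk for v in c])
            normed, pos = [], 0
            for c in chunk:
                normed.append(flat[pos:pos + len(c)])
                pos += len(c)
        for dst, src in zip(out, normed):
            dst.extend(src)
    return out


# Two hypothetical channels, four samples each, windows of two samples.
recording = [[1.0, 3.0, 10.0, 14.0], [100.0, 300.0, 100.0, 300.0]]
window_within = normalize(recording, window=2, within_channel=True)
recording_cross = normalize(recording, window=None, within_channel=False)
```

Window-level within-channel normalization removes both slow drifts and amplitude differences between channels, whereas cross-channel normalization keeps relative channel amplitudes intact.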
[LG-111] Microelectrode Signal Dynamics as Biomarkers of Subthalamic Nucleus Entry on Deep Brain Stimulation: A Nonlinear Feature Approach
链接: https://arxiv.org/abs/2506.22454
作者: Ana Luiza S. Tavares,Artur Pedro M. Neto,Francinaldo L. Gomes,Paul Rodrigo dos Reis,Arthur G. da Silva,Antonio P. Junior,Bruno D. Gomes
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
备注: 8 pages, 5 figures
Abstract:Accurate intraoperative localization of the subthalamic nucleus (STN) is essential for the efficacy of Deep Brain Stimulation (DBS) in patients with Parkinson’s disease. While microelectrode recordings (MERs) provide rich electrophysiological information during DBS electrode implantation, current localization practices often rely on subjective interpretation of signal features. In this study, we propose a quantitative framework that leverages nonlinear dynamics and entropy-based metrics to classify neural activity recorded inside versus outside the STN. MER data from three patients were preprocessed using a robust artifact correction pipeline, segmented, and labelled based on surgical annotations. A comprehensive set of recurrence quantification analysis, nonlinear, and entropy features was extracted from each segment. Multiple supervised classifiers were trained on every combination of feature domains using stratified 10-fold cross-validation, followed by statistical comparison using paired Wilcoxon signed-rank tests with Holm-Bonferroni correction. The combination of entropy and nonlinear features yielded the highest discriminative power, and the Extra Trees classifier emerged as the best model with a cross-validated F1-score of 0.902 ± 0.027 and ROC AUC of 0.887 ± 0.055. Final evaluation on a 20% hold-out test set confirmed robust generalization (F1 = 0.922, ROC AUC = 0.941). These results highlight the potential of nonlinear and entropy signal descriptors in supporting real-time, data-driven decision-making during DBS surgeries.
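The Holm-Bonferroni step mentioned in the abstract is a standard step-down procedure and can be sketched in a few lines. The p-values below are hypothetical placeholders, not results from the paper, and the Wilcoxon test that would produce them is not shown.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # ascending p-values
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank + 1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break   # step-down: stop at the first non-rejection
    return reject


# Hypothetical p-values from four pairwise classifier comparisons.
p_vals = [0.010, 0.040, 0.030, 0.005]
decisions = holm_bonferroni(p_vals)
```

Because the threshold loosens as hypotheses are rejected, the procedure is never less powerful than a plain Bonferroni correction while still controlling the family-wise error rate.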
[LG-112] Arnoldi Singular Vector perturbations for machine learning weather prediction
链接: https://arxiv.org/abs/2506.22450
作者: Jens Winkler,Michael Denhard
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
备注: dynamical systems, atmospheric physics, machine learning weather prediction, forecast uncertainty, 42 pages with 29 figures (incl. appendix)
Abstract:Since weather forecasts are fundamentally uncertain, reliable decision making requires information on the likelihoods of future weather scenarios. We explore the sensitivity of machine learning weather prediction (MLWP), using Huawei's 24h Pangu Weather ML model, to errors in the initial conditions with a specific kind of Singular Vector (SV) perturbation. Our Arnoldi-SV (A-SV) method needs neither linear nor adjoint model versions and is applicable to numerical weather prediction (NWP) as well as MLWP. It observes error growth within a given optimization time window by iteratively applying a forecast model to perturbed model states. This creates a Krylov subspace, implicitly based on a matrix operator, which approximates the local error growth. Each iteration adds new dimensions to the Krylov space, and its leading right SVs are expected to turn into directions of growing errors. We show that A-SV indeed finds dynamically meaningful perturbation patterns for the 24h Pangu Weather model, which grow right from the beginning of the forecast rollout. These perturbations describe local unstable modes and could be a basis to initialize MLWP ensembles. Since we start A-SV from random noise perturbations, the algorithm transforms noise into perturbations conditioned on a given reference state. This process is akin to the denoising process of GenCast's generic diffusion-based ML model, so we briefly discuss similarities and differences.
信息检索
[IR-0] Act-With-Think: Chunk Auto-Regressive Modeling for Generative Recommendation
链接: https://arxiv.org/abs/2506.23643
作者: Yifan Wang,Weinan Gan,Longtao Xiao,Jieming Zhu,Heng Chang,Haozhao Wang,Rui Zhang,Zhenhua Dong,Ruiming Tang,Ruixuan Li
类目: Information Retrieval (cs.IR)
备注: 9 pages, 2 figures
Abstract:Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantics usually provide some reasonable explainability (the "why") for the behavior (the "what"), which may constrain the full potential of GR. To this end, we present Chunk AutoRegressive Modeling (CAR), a new generation paradigm following the decision pattern that users usually think about semantic aspects of items (e.g. brand) and then take actions on target items (e.g. purchase). Our CAR, for the first time, incorporates semantics (SIDs) and behavior (UID) into a single autoregressive transformer from an "act-with-think" dual perspective via chunk-level autoregression. Specifically, CAR packs SIDs and UID into a conceptual chunk for unified item representation, allowing each decoding step to make a holistic prediction. Experiments show that our CAR significantly outperforms existing methods based on traditional AR, improving Recall@5 by 7.93% to 22.30%. Furthermore, we verify the scaling effect between model performance and SIDs bit number, demonstrating that CAR preliminarily emulates a kind of slow-thinking style mechanism akin to the reasoning processes observed in large language models (LLMs).
[IR-1] NaviX: A Native Vector Index Design for Graph DBMSs With Robust Predicate-Agnostic Search Performance
链接: https://arxiv.org/abs/2506.23397
作者: Gaurav Sehgal,Semih Salihoglu
类目: Information Retrieval (cs.IR); Databases (cs.DB)
备注:
Abstract:There is an increasing demand for extending existing DBMSs with vector indices so that they become unified systems capable of supporting modern predictive applications, which require joint querying of vector embeddings together with the structured properties and connections of objects. We present NaviX, a native vector index for graph DBMSs (GDBMSs) that has two main design goals. First, we aim to implement a disk-based vector index that leverages the core storage and query-processing capabilities of the underlying GDBMS. To this end, NaviX is built on the Hierarchical Navigable Small-World (HNSW) graph, which itself is a graph-based structure. Second, we aim to support predicate-agnostic filtered vector search queries, in which the k nearest neighbors (kNNs) of a query vector vQ are searched only within an arbitrary subset S of vectors defined by an ad-hoc selection sub-query QS. We adopt a prefiltering approach that evaluates QS first and passes the full description of subset S to the kNN search operator. We study how to design a prefiltering search algorithm that remains robust under varying selectivities and under different correlations between subset S and query vector vQ. We propose an adaptive algorithm that uses the local selectivity of each vector in the HNSW graph to choose an appropriate heuristic at every iteration of the kNN search. Finally, we demonstrate NaviX’s robustness and efficiency through extensive experiments against both existing prefiltering- and postfiltering-based baselines.
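The prefiltering idea (evaluate the selection sub-query first, then search only inside the resulting subset S) can be sketched with a brute-force scan. This is a deliberately simplified stand-in: it does not use an HNSW graph and is not NaviX's adaptive algorithm, and all names and data below are hypothetical.

```python
import heapq
import math


def prefiltered_knn(vectors, properties, predicate, query, k):
    """Brute-force kNN restricted to the objects that satisfy `predicate`."""
    # Step 1: evaluate the selection sub-query to obtain the subset S.
    selected = [i for i, props in enumerate(properties) if predicate(props)]
    # Step 2: search for the k nearest neighbours only inside S.
    return heapq.nsmallest(k, selected, key=lambda i: math.dist(vectors[i], query))


vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
properties = [{"year": 2019}, {"year": 2024}, {"year": 2023}, {"year": 2025}]
hits = prefiltered_knn(
    vectors, properties, lambda p: p["year"] >= 2023, (0.0, 0.0), k=2
)
```

Here the closest vector overall (index 0) is excluded by the predicate, so the result contains indices 1 and 2. A postfiltering approach would instead run the kNN search first and discard non-matching results afterwards, which can return fewer than k hits when the predicate is selective.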
[IR-2] Impact of Shallow vs. Deep Relevance Judgments on BERT-based Reranking Models ICTIR’25
链接: https://arxiv.org/abs/2506.23191
作者: Gabriel Iturra-Bocaz,Danny Vo,Petra Galuscakova
类目: Information Retrieval (cs.IR)
备注: Accepted at ICTIR’25
Abstract:This paper investigates the impact of shallow versus deep relevance judgments on the performance of BERT-based reranking models in neural Information Retrieval. Shallow-judged datasets, characterized by numerous queries each with few relevance judgments, and deep-judged datasets, involving fewer queries with extensive relevance judgments, are compared. The research assesses how these datasets affect the performance of BERT-based reranking models trained on them. The experiments are run on the MS MARCO and LongEval collections. Results indicate that shallow-judged datasets generally enhance generalization and effectiveness of reranking models due to a broader range of available contexts. The disadvantage of the deep-judged datasets might be mitigated by a larger number of negative training examples.
[IR-3] Synergizing Implicit and Explicit User Interests: A Multi-Embedding Retrieval Framework at Pinterest KDD2025
链接: https://arxiv.org/abs/2506.23060
作者: Zhibo Fan,Hongtao Lin,Haoyu Chen,Bowen Deng,Hedi Xia,Yuke Yan,James Li
类目: Information Retrieval (cs.IR)
备注: KDD 2025
Abstract:Industrial recommendation systems are typically composed of multiple stages, including retrieval, ranking, and blending. The retrieval stage plays a critical role in generating a high-recall set of candidate items that covers a wide range of diverse user interests. Effectively covering the diverse and long-tail user interests within this stage poses a significant challenge: traditional two-tower models struggle in this regard due to limited user-item feature interaction and are often biased towards top use cases. To address these issues, we propose a novel multi-embedding retrieval framework designed to enhance user interest representation by generating multiple user embeddings conditioned on both implicit and explicit user interests. Implicit interests are captured from user history through a Differentiable Clustering Module (DCM), whereas explicit interests, such as topics that the user has followed, are modeled via Conditional Retrieval (CR). These methodologies represent a form of conditioned user representation learning that involves condition representation construction and associating the target item with the relevant conditions. Synergizing implicit and explicit user interests serves as a complementary approach to achieve more effective and comprehensive candidate retrieval, as they benefit different user segments and extract conditions from different but supplementary sources. Extensive experiments and A/B testing reveal significant improvements in user engagement and feed diversity metrics. Our proposed framework has been successfully deployed on the Pinterest home feed.
[IR-4] Machine Assistant with Reliable Knowledge: Enhancing Student Learning via RAG-based Retrieval
链接: https://arxiv.org/abs/2506.23026
作者: Yongsheng Lian
类目: Information Retrieval (cs.IR)
备注:
Abstract:We present Machine Assistant with Reliable Knowledge (MARK), a retrieval-augmented question-answering system designed to support student learning through accurate and contextually grounded responses. The system is built on a retrieval-augmented generation (RAG) framework, which integrates a curated knowledge base to ensure factual consistency. To enhance retrieval effectiveness across diverse question types, we implement a hybrid search strategy that combines dense vector similarity with sparse keyword-based retrieval. This dual-retrieval mechanism improves robustness for both general and domain-specific queries. The system includes a feedback loop in which students can rate responses and instructors can review and revise them. Instructor corrections are incorporated into the retrieval corpus, enabling adaptive refinement over time. The system was deployed in a classroom setting as a substitute for traditional office hours, where it successfully addressed a broad range of student queries. It was also used to provide technical support by integrating with a customer-specific knowledge base, demonstrating its ability to handle routine, context-sensitive tasks in applied domains. MARK is publicly accessible at this https URL.
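A hybrid search of the kind described, combining dense vector similarity with sparse keyword matching, might look like the sketch below. The abstract does not say how the two scores are fused, so the weighted sum, the `weight` parameter, the keyword-overlap score, and the toy embeddings are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query terms that appear in the document text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def hybrid_rank(query, query_vec, docs, weight=0.5):
    """Rank docs by weight * dense score + (1 - weight) * sparse score."""
    scored = []
    for d in docs:
        dense = cosine(query_vec, d["vec"])
        sparse = keyword_overlap(query, d["text"])
        scored.append((weight * dense + (1 - weight) * sparse, d["id"]))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical two-document corpus with hand-written 2-d "embeddings".
docs = [
    {"id": "a", "text": "gradient descent step size", "vec": [1.0, 0.0]},
    {"id": "b", "text": "office hours schedule", "vec": [0.0, 1.0]},
]
ranking = hybrid_rank("gradient descent", [0.9, 0.1], docs)
```

The sparse term helps with exact domain vocabulary that an embedding may blur, while the dense term covers paraphrased questions; tuning `weight` trades one off against the other.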