arXiv Daily Papers | 2025-06-13

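This post lists the latest papers retrieved from arXiv.org on 2025-06-13. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.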

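[Quick Read]: This paper examines how effectively reasoning models can re-evaluate their own reasoning and correct flawed thoughts, specifically how well they identify and recover from four types of unhelpful thoughts. The study experimentally analyzes the models' ability to identify and recover from uninformative rambling, irrelevant thoughts, misdirecting thoughts, and thoughts that lead to incorrect answers. It finds that models identify most unhelpful thoughts effectively but recover from them poorly. In particular, larger models struggle more than smaller ones to recover from short irrelevant thoughts, which indicates that their self-reevaluation abilities fall short of genuine "meta-cognitive" awareness.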

Link: https://arxiv.org/abs/2506.10979
Authors: Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, Mor Geva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent reasoning models show the ability to reflect, backtrack, and self-validate their reasoning, which is crucial in spotting mistakes and arriving at accurate solutions. A natural question that arises is how effectively models can perform such self-reevaluation. We tackle this question by investigating how well reasoning models identify and recover from four types of unhelpful thoughts: uninformative rambling thoughts, thoughts irrelevant to the question, thoughts misdirecting the question as a slightly different question, and thoughts that lead to incorrect answers. We show that models are effective at identifying most unhelpful thoughts but struggle to recover from the same thoughts when these are injected into their thinking process, causing significant performance drops. Models tend to naively continue the line of reasoning of the injected irrelevant thoughts, which showcases that their self-reevaluation abilities are far from a general “meta-cognitive” awareness. Moreover, we observe non/inverse-scaling trends, where larger models struggle more than smaller ones to recover from short irrelevant thoughts, even when instructed to reevaluate their reasoning. We demonstrate the implications of these findings with a jailbreak experiment using irrelevant thought injection, showing that the smallest models are the least distracted by harmful-response-triggering thoughts. Overall, our findings call for improvement in self-reevaluation of reasoning models to develop better reasoning and safer systems.

[NLP-1] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

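[Quick Read]: This paper addresses the limited real-world effectiveness of data science agents driven by large language models (LLMs), in particular their weak handling of complex, innovative tasks. Existing frameworks rely on rigid, pre-defined workflows and inflexible coding strategies, so they work only on relatively simple, classical tasks. The proposed solution, AutoMind, overcomes these shortcomings through three key advances: (1) a curated expert knowledge base that lets the agent ground its decisions in domain expert knowledge; (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions; and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity.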

Link: https://arxiv.org/abs/2506.10974
Authors: Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Ongoing work. Code is at this https URL

Abstract:Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

[NLP-2] MMMG: A Massive Multidisciplinary Multi-Tier Generation Benchmark for Text-to-Image Reasoning

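[Quick Read]: This paper addresses the weak reasoning ability of image generation models on the knowledge image generation task, which shows up as low entity fidelity, weak relations, and visual clutter. The key to its solution is MMMG, a multidisciplinary, multi-tier knowledge image benchmark that adopts a unified Knowledge Graph (KG) representation to remove confounding factors from evaluation, together with MMMG-Score, a metric that combines factual fidelity, measured by graph-edit distance, with a visual clarity assessment. The paper also releases FLUX-Reason, a baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs, to spur progress in this area.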

Link: https://arxiv.org/abs/2506.10963
Authors: Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.

[NLP-3] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark

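[Quick Read]: This paper addresses the scarcity and limited scope of Chinese harmful content detection datasets, as well as the weak performance of existing models on this task. The key to its solution is a comprehensive, professionally annotated benchmark for Chinese content harm detection; the annotation process also yields a knowledge rule base that captures explicit expert knowledge. In addition, the paper proposes a knowledge-augmented baseline that integrates human-annotated knowledge rules with the implicit knowledge of large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.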

Link: https://arxiv.org/abs/2506.10960
Authors: Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Affiliations: Zhejiang University; Tencent; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at this https URL.

[NLP-4] Build the web for agents, not agents for the web

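[Quick Read]: This paper addresses the fundamental mismatch that current web agents face when interacting with interfaces designed for humans, a mismatch that limits their efficiency and reliability in complex web operations. The key to the proposed solution is a new interaction paradigm optimized for agentic capabilities, the Agentic Web Interface (AWI): rather than forcing agents to adapt to traditional interfaces, interfaces are redesigned around the needs of agents, improving the safety, efficiency, and standardization of web agents.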

Link: https://arxiv.org/abs/2506.10953
Authors: Xing Han Lù, Gaurav Kamath, Marius Mosbach, Siva Reddy
Affiliations: McGill University; Mila – Quebec AI Institute
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in Large Language Models (LLMs) and multimodal counterparts have spurred significant interest in developing web agents – AI systems capable of autonomously navigating and completing tasks within web environments. While holding tremendous promise for automating complex web interactions, current approaches face substantial challenges due to the fundamental mismatch between human-designed interfaces and LLM capabilities. Current methods struggle with the inherent complexity of web inputs, whether processing massive DOM trees, relying on screenshots augmented with additional information, or bypassing the user interface entirely through API interactions. This position paper advocates for a paradigm shift in web agent research: rather than forcing web agents to adapt to interfaces designed for humans, we should develop a new interaction paradigm specifically optimized for agentic capabilities. To this end, we introduce the concept of an Agentic Web Interface (AWI), an interface specifically designed for agents to navigate a website. We establish six guiding principles for AWI design, emphasizing safety, efficiency, and standardization, to account for the interests of all primary stakeholders. This reframing aims to overcome fundamental limitations of existing interfaces, paving the way for more efficient, reliable, and transparent web agent design, which will be a collaborative effort involving the broader ML community.

[NLP-5] Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training ICML2025

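[Quick Read]: This paper addresses how to efficiently select the optimal data mixture for language model (LM) pretraining in order to improve downstream task performance. The key to its solution is Domain2Vec, a new method that decomposes a dataset into a linear combination of several meta-domains: it maintains a vocabulary of meta-domains and uses a classifier to decompose any dataset into a domain vector corresponding to a distribution over that vocabulary. Under the Distribution Alignment Assumption (DA²), these domain vectors identify the optimal data mixture without training, which substantially reduces computational overhead and improves model performance.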

Link: https://arxiv.org/abs/2506.10952
Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICML2025

Abstract:We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of previous methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
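
The alignment idea in this abstract can be illustrated with a small sketch. It picks, from a handful of candidate mixtures, the one whose mixed domain vector lies closest to the validation set's domain vector. The domain vectors, the candidate mixtures, the L1 distance, and all function names are assumptions made up for illustration; the abstract does not specify the paper's actual procedure.

```python
def mix_domain_vectors(weights, domain_vectors):
    """Weighted average of per-dataset domain vectors (each a distribution over meta-domains)."""
    dim = len(domain_vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, domain_vectors)) for k in range(dim)]


def l1_distance(p, q):
    """L1 distance between two distributions (an assumed stand-in for 'alignment')."""
    return sum(abs(a - b) for a, b in zip(p, q))


def best_mixture(candidates, domain_vectors, validation_vector):
    """Return the candidate mixture whose mixed domain vector is closest to the validation vector."""
    return min(
        candidates,
        key=lambda w: l1_distance(mix_domain_vectors(w, domain_vectors), validation_vector),
    )


# Hypothetical domain vectors over three meta-domains for two training datasets.
train_vectors = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
validation_vector = [0.3, 0.6, 0.1]
# Hypothetical candidate mixtures (weights over the two training datasets).
candidates = [[1.0, 0.0], [0.5, 0.5], [0.3, 0.7], [0.0, 1.0]]

chosen = best_mixture(candidates, train_vectors, validation_vector)
print(chosen)
```

No model is trained at any point: the selection uses only the domain vectors, which is the sense in which the abstract calls the approach training-free.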

[NLP-6] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models

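[Quick Read]: This paper addresses unintended forgetting in the unlearning of large language models (LLMs), where removing specific data impairs the model's utility and its retention of valuable information. The key to its solution is the GUARD framework, whose core is a lightweight proxy data attribution metric that quantifies the "alignment" between the forget and retain sets. Building on this metric, GUARD designs an adaptive, nonuniform unlearning objective that assigns unlearning weights inversely proportional to the proxy attribution scores, which mitigates the performance degradation caused by unintended forgetting.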

Link: https://arxiv.org/abs/2506.10946
Authors: Evelyn Ma, Duo Zhou, Peizhi Niu, Huiting Zhou, Huan Zhang, Olgica Milenkovic, S. Rasoul Etesami
Affiliations: University of Illinois Urbana-Champaign
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Unlearning in large language models (LLMs) is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the “alignment” between the forget and retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended losses in retention. We provide rigorous theoretical guarantees that GUARD significantly enhances retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU benchmark across multiple LLM architectures demonstrate that GUARD substantially improves utility preservation while ensuring effective unlearning. Notably, GUARD reduces utility sacrifice on the Retain Set by up to 194.92% in terms of Truth Ratio when forgetting 10% of the training data.
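
The weighting scheme described in this abstract can be sketched in a few lines. The sketch assumes positive proxy attribution scores and a plain reciprocal normalization; the function names, the scores, and the losses are hypothetical, and the paper's exact weighting formula is not given in the abstract.

```python
def unlearning_weights(scores, eps=1e-8):
    """Weights inversely proportional to (assumed positive) proxy attribution scores, summing to 1."""
    inverse = [1.0 / (s + eps) for s in scores]
    total = sum(inverse)
    return [x / total for x in inverse]


def weighted_forget_loss(per_sample_losses, weights):
    """Nonuniform aggregate of per-sample forget losses."""
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


# Hypothetical attribution scores: the third forget sample is most aligned with the retain set.
scores = [1.0, 2.0, 4.0]
weights = unlearning_weights(scores)
print(weights)
print(weighted_forget_loss([0.9, 0.7, 0.5], weights))
```

Samples with high attribution scores receive less unlearning weight, so forgetting them disturbs the retain set less.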

[NLP-7] VINCIE: Unlocking In-context Image Editing from Video

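[Quick Read]: This paper addresses in-context image editing, that is, modifying an image given a contextual sequence of text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (such as segmentation and inpainting models) to build training data, whereas this paper learns an in-context image editing model directly from videos. The key to its solution is a scalable approach that annotates videos as interleaved multimodal sequences, together with a block-causal diffusion transformer trained on three proxy tasks (next-image prediction, current segmentation prediction, and next-segmentation prediction) to learn effectively from such data.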

Link: https://arxiv.org/abs/2506.10941
Authors: Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project page: this https URL

Abstract:In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.

[NLP-8] Dynamic Epistemic Friction in Dialogue CONLL2025

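[Quick Read]: This paper addresses the neglect of "epistemic friction" when aligning large language models (LLMs) with human preferences in human-AI collaborative scenarios. Epistemic friction is the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. The key to the paper's solution is to define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent's current belief state and new propositions supported by external evidence, and to place it within the framework of Dynamic Epistemic Logic so that belief updates in dialogue can be predicted effectively.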

Link: https://arxiv.org/abs/2506.10934
Authors: Timothy Obiso, Kenneth Lai, Abhijnan Nath, Nikhil Krishnaswamy, James Pustejovsky
Affiliations: Brandeis University; Colorado State University
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures, 2 tables, CoNLL 2025

Abstract:Recent developments in aligning Large Language Models (LLMs) with human preferences have significantly enhanced their utility in human-AI collaborative scenarios. However, such approaches often neglect the critical role of “epistemic friction,” or the inherent resistance encountered when updating beliefs in response to new, conflicting, or ambiguous information. In this paper, we define dynamic epistemic friction as the resistance to epistemic integration, characterized by the misalignment between an agent’s current belief state and new propositions supported by external evidence. We position this within the framework of Dynamic Epistemic Logic (Van Benthem and Pacuit, 2011), where friction emerges as nontrivial belief-revision during the interaction. We then present analyses from a situated collaborative task that demonstrate how this model of epistemic friction can effectively predict belief updates in dialogues, and we subsequently discuss how the model of belief alignment as a measure of epistemic resistance or friction can naturally be made more sophisticated to accommodate the complexities of real-world dialogue scenarios.

[NLP-9] Robustly Improving LLM Fairness in Realistic Settings via Interpretability

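[Quick Read]: This paper addresses the significant racial and gender biases that large language models (LLMs) exhibit in high-stakes hiring applications once realistic contextual details are introduced. The study finds that simple anti-bias prompts fail in realistic settings, whereas internal bias mitigation, which identifies and neutralizes sensitive attribute directions in model activations, achieves robust bias reduction across all tested scenarios. The key to the solution is applying affine concept editing at inference time to remove race- and gender-correlated directions, reducing bias to very low levels while largely maintaining model performance.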

Link: https://arxiv.org/abs/2506.10922
Authors: Adam Karvonen, Samuel Marks
Affiliations: Anthropic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people’s careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., “only accept candidates in the top 10%”) induces significant racial and gender biases (up to 12% differences in interview rates). When these biases emerge, they consistently favor Black over White candidates and female over male candidates across all tested models and scenarios. Moreover, models can infer demographics and become biased from subtle cues like college affiliations, with these biases remaining invisible even when inspecting the model’s chain-of-thought reasoning. To address these limitations, our internal bias mitigation identifies race and gender-correlated directions and applies affine concept editing at inference time. Despite using directions from a simple synthetic dataset, the intervention generalizes robustly, consistently reducing bias to very low levels (typically under 1%, always below 2.5%) while largely maintaining model performance. Our findings suggest that practitioners deploying LLMs for hiring should adopt more realistic evaluation methodologies and consider internal mitigation strategies for equitable outcomes.
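
One common form of affine concept editing can be sketched as follows: the component of an activation along a concept direction is replaced with a fixed reference value, so every input ends up with the same projection onto that direction. The 2-dimensional vectors, the reference value, and the function names are illustrative assumptions; the abstract does not state the exact formula the paper uses.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def affine_concept_edit(activation, direction, target_projection):
    """Set the activation's component along `direction` to `target_projection`.

    Components orthogonal to the direction are left unchanged.
    """
    d = normalize(direction)
    shift = target_projection - dot(activation, d)
    return [a + shift * di for a, di in zip(activation, d)]


# Hypothetical 2-d activations and a hypothetical attribute direction.
direction = [3.0, 4.0]
reference = 0.5  # assumed reference projection, e.g. a dataset mean
edited_a = affine_concept_edit([1.0, 2.0], direction, reference)
edited_b = affine_concept_edit([-2.0, 5.0], direction, reference)
print(edited_a, edited_b)
```

After the edit, both activations have the same projection onto the direction, so a downstream readout along that direction can no longer tell them apart.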

[NLP-10] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

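[Quick Read]: This paper addresses interpretability in large language models (LLMs), in particular how to find directions that capture interpretable features in an unsupervised manner. Existing methods rely on sparse autoencoders (SAEs) to learn features from residual stream activations, but these often struggle in causal evaluations and lack intrinsic interpretability. The key to the proposed solution is to decompose the activations of the multilayer perceptron (MLP) directly with semi-nonnegative matrix factorization (SNMF), so that the learned features are sparse linear combinations of co-activated neurons and are mapped to their activating inputs, making them directly interpretable.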

Link: https://arxiv.org/abs/2506.10920
Authors: Or Shafran, Atticus Geiger, Mor Geva
Affiliations: Tel Aviv University; Pr(Ai)2R Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP’s activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
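
As a rough illustration of semi-nonnegative matrix factorization, the sketch below fits the rank-one case: a matrix X is approximated by an outer product of an unconstrained vector f and a nonnegative vector g. It alternates a least-squares update for f with a standard multiplicative update for g. The toy matrix, the function name, and the choice of which factor is constrained are assumptions for illustration; the abstract does not describe the paper's optimization procedure.

```python
import math


def rank_one_semi_nmf(X, iterations=100, eps=1e-12):
    """Approximate X (n x m) by outer(f, g) with g >= 0 and f unconstrained."""
    n, m = len(X), len(X[0])
    g = [1.0] * m  # nonnegative initialization

    def solve_f(g_vec):
        # Least-squares solution for the unconstrained factor: f = X g / (g . g)
        gg = sum(v * v for v in g_vec) + eps
        return [sum(X[i][j] * g_vec[j] for j in range(m)) / gg for i in range(n)]

    for _ in range(iterations):
        f = solve_f(g)
        ff = sum(v * v for v in f)  # scalar and never negative in the rank-one case
        a = [sum(X[i][j] * f[i] for i in range(n)) for j in range(m)]  # X^T f
        # Multiplicative update: positive part over negative part plus g * ff.
        g = [
            g[j] * math.sqrt(max(a[j], 0.0) / (max(-a[j], 0.0) + g[j] * ff + eps))
            for j in range(m)
        ]
    return solve_f(g), g


# Toy matrix built from a mixed-sign f and a nonnegative g (hypothetical data).
X = [[1.0, 2.0, 0.0], [-2.0, -4.0, 0.0]]
f, g = rank_one_semi_nmf(X)
reconstruction = [[fi * gj for gj in g] for fi in f]
print(f, g)
```

The multiplicative form keeps g nonnegative at every step without any explicit projection, while f is free to take either sign.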

[NLP-11] Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

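[Quick Read]: This paper addresses molecular toxicity repair, that is, generating structurally valid molecular alternatives with reduced toxicity, a task that has not yet been systematically defined or benchmarked. The key to its solution is ToxiMol, the first benchmark task for general-purpose multimodal large language models (MLLMs) focused on molecular toxicity repair. The paper builds a standardized dataset covering 11 primary tasks and 560 representative toxic molecules, designs a mechanism-aware and task-adaptive prompt annotation pipeline, and proposes ToxiEval, an automated evaluation framework that integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity to assess repair success systematically.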

Link: https://arxiv.org/abs/2506.10912
Authors: Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang, Xue Yang, Gen Luo, Fei-Yue Wang
Affiliations: Macau University of Science and Technology; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences; Shanghai AI Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair - generating structurally valid molecular alternatives with reduced toxicity - has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess nearly 30 mainstream general-purpose MLLMs and design multiple ablation studies to analyze key factors such as evaluation criteria, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecule editing.

[NLP-12] Magistral

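[Quick Read]: This paper addresses how to train large language models (LLMs) with reinforcement learning (RL) on text-only data to improve their reasoning while preserving abilities such as multimodal understanding, instruction following, and function calling. The key to its solution is a ground-up approach that relies solely on the authors' own models and infrastructure rather than on existing implementations or RL traces from prior models. This yields a stack for exploring the limits of pure RL training of LLMs, along with a simple method to force the model's reasoning language. The study also shows that RL on text data alone maintains most of the initial checkpoint's capabilities and maintains or improves performance on several tasks.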

Link: https://arxiv.org/abs/2506.10910
Authors: Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, Teven Le Scao, Yihan Wang, Adam Yang, Alexander H. Liu, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Andy Ehrenberg, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jean-Hadrien Chabran, Jean-Malo Delignon, Joachim Studnia, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Kush Jain, Lingxiao Zhao, Louis Martin, Luyu Gao, Lélio Renard Lavaud, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Maximilian Augustin, Mickaël Seznec, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patrick von Platen, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Romain Sauvestre, Rémi Delacourt, Sanchit Gandhi, Sandeep Subramanian, Shashwat Dalal, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yunhao Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Magistral, Mistral’s first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint’s capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.

[NLP-13] Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

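[Quick Read]: This paper addresses the lack of effective automatic evaluation methods for autoformalization, especially in complex domains such as advanced mathematics, where human evaluation is time-consuming and requires domain expertise. The key to its solution is an epistemically and formally grounded ensemble (EFG) of LLM judges that evaluates systematically along several criteria, namely logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), yielding a transparent and reliable assessment of autoformalization quality.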

Link: https://arxiv.org/abs/2506.10903
Authors: Lan Zhang, Marco Valentino, Andre Freitas
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using large language models (LLMs) have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.

[NLP-14] BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

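[Quick Read]: This paper addresses the limited domain adaptation in biomedical and clinical natural language processing (NLP) that results from the slower development of encoder models. Its key solution is BioClinical ModernBERT, a domain-adapted encoder built on the recent ModernBERT architecture that combines long-context processing with substantial improvements in speed and performance. The model is developed through continued pretraining on the largest biomedical and clinical corpus to date (over 53.5 billion tokens) and draws on 20 datasets from diverse institutions, domains, and geographic regions, overcoming the reliance of earlier clinical encoders on a single data source.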

Link: https://arxiv.org/abs/2506.10896
Authors: Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

[NLP-15] The Diffusion Duality ICML2025

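Quick read: This paper aims to address the fact that uniform-state discrete diffusion models underperform autoregressive models and masked diffusion models on text generation. The key to its solution is the observation that uniform-state diffusion processes emerge from an underlying Gaussian diffusion, which allows effective techniques from Gaussian diffusion to be transferred to discrete diffusion models, improving both training and sampling.

Link: https://arxiv.org/abs/2506.10892
Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML 2025. We provide the code at: this https URL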

Abstract:Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: this http URL

[NLP-16] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

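Quick read: This paper addresses a duality that large language models (LLMs) exhibit during fine-tuning: they can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. It argues that both behaviors stem from a single mechanism called out-of-context reasoning (OCR), the ability to deduce implications by associating concepts, even those without a causal link. The key to the approach is formalizing OCR as a synthetic factual recall task and showing empirically that a one-layer single-head attention-only transformer with factorized output and value matrices can learn the task while a model with combined weights cannot, which highlights the crucial role of matrix factorization. The theoretical analysis further shows that the implicit bias of gradient descent favors solutions that minimize the nuclear norm of the combined output-value matrix, explaining why the model associates facts and implications so sample-efficiently regardless of whether the correlation is causal.

Link: https://arxiv.org/abs/2506.10887
Authors: Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: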

Abstract:Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

[NLP-17] Slimming Down LLMs Without Losing Their Minds

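Quick read: This paper addresses efficient fine-tuning of large language models (LLMs) under limited resources, focusing on how parameter-efficient methods (LoRA and QLoRA) affect model performance. The key to its approach is using parameter-efficient fine-tuning strategies to improve task-specific performance while maintaining computational efficiency, and it emphasizes that alignment between the fine-tuning dataset and the evaluation tasks is decisive for performance.

Link: https://arxiv.org/abs/2506.10885
Authors: Qingda (Michael) Mai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages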

Abstract:This paper investigates and validates the impact of fine-tuning on large language model performance, focusing on parameter-efficient methods (LoRA and QLoRA). We evaluate model capabilities across three key domains: (1) commonsense reasoning (HellaSwag), (2) mathematical reasoning (GSM8K), and (3) multi-domain knowledge (MMLU-CS). Our findings demonstrate that: (1) LoRA-based methods effectively improve task-specific performance while maintaining computational efficiency, and (2) performance strongly depends on alignment between fine-tuning dataset and benchmark tasks. The study provides both theoretical insights into parameter-efficient mechanisms and practical guidance for developers implementing efficient LLM adaptation with limited resources.
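To make the parameter-efficiency idea concrete, here is a minimal pure-Python sketch of a LoRA-style layer: the frozen weight W is left untouched and only a low-rank pair (B, A) is trained, so the effective weight is W + (alpha / r) * B A. The function names, the scaling convention, and the toy shapes are illustrative assumptions, not taken from the paper.

```python
# Illustrative LoRA-style forward pass (hypothetical helper names, toy sizes).

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the number of rows of A.

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r (both trainable).
    """
    r = len(A)
    base = matvec(W, x)             # frozen pretrained path
    low = matvec(B, matvec(A, x))   # low-rank update path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in the low-rank pair (B, A)."""
    return r * (d_in + d_out)
```

For a 4096 x 4096 layer with r = 8, the low-rank pair holds 65,536 trainable values, against 16,777,216 in the full matrix, which is where the resource savings come from.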

[NLP-18] Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment ACL2025

Quick read: This paper aims to address the shortcomings of medical dialogue systems (MDS) in identifying relevant medical knowledge and generating personalized, medically accurate responses. The key to its solution is MedRef, which uses a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses, and a comprehensive prompt structure that incorporates historical details and evident details. Two core modules, Triplet Filter and Demo Selector, provide real-time adaptability to diverse patient conditions, improving generation quality and medical entity accuracy.

Link: https://arxiv.org/abs/2506.10877
Authors: Hongda Sun, Jiaren Peng, Wenzhong Yang, Liang He, Bo Du, Rui Yan
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Sichuan University; School of Computer Science and Technology, Xinjiang University; Tsinghua University; School of Computer Science, Wuhan University; School of Artificial Intelligence, Wuhan University; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings

Abstract:Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.

[NLP-19] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

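Quick read: This paper examines how self-supervised speech models such as wav2vec2, pretrained on different languages, encode language-matched and non-matched speech. The key to its approach is using probing classifiers and geometric analyses to study how phones, lexical tones, and speaker information are represented, and thereby whether the representational structure generalizes across languages. It finds that, for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal and that layerwise patterns of probing accuracy are similar, with only a small advantage for matched-language phone and tone probes in the later layers. This suggests that the structure of the representations learned by wav2vec2 is largely independent of the speech material used during pretraining.

Link: https://arxiv.org/abs/2506.10855
Authors: Michele Gubian, Ioana Krehan, Oli Liu, James Kirby, Sharon Goldwater
Affiliations: Institute for Phonetics and Speech Processing; School of Informatics
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: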

Abstract:Analyses of self-supervised speech models have begun to reveal where and how they represent different types of information. However, almost all analyses have focused on English. Here, we examine how wav2vec2 models trained on four different languages encode both language-matched and non-matched speech. We use probing classifiers and geometric analyses to examine how phones, lexical tones, and speaker information are represented. We show that for all pretraining and test languages, the subspaces encoding phones, tones, and speakers are largely orthogonal, and that layerwise patterns of probing accuracy are similar, with a relatively small advantage for matched-language phone and tone (but not speaker) probes in the later layers. Our findings suggest that the structure of representations learned by wav2vec2 is largely independent of the speech material used during pretraining.

[NLP-20] Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

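Quick read: This paper aims to address the static behavior of existing sampling strategies for diffusion-based language models (dLLMs), which leads to suboptimal efficiency and limited flexibility. The key to its solution is SlowFast Sampling, a dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages to improve generation efficiency. The method is guided by the certainty, convergence, and positional principles, which govern when and where tokens can be confidently and efficiently decoded, and it is further combined with dLLM-Cache to reduce redundant computation.

Link: https://arxiv.org/abs/2506.10848
Authors: Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Dongrui Liu, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; EPIC Lab, Shanghai Jiao Tong University; Central South University; Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages; 5 figures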

Abstract:Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

[NLP-21] CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

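Quick read: This paper aims to address the limited efficiency and collaboration of conventional retrieval-augmented generation (RAG) frameworks on complex, real-world tasks. The key to its solution is mRAG, a multi-agent RAG framework in which specialized agents handle subtasks such as planning, searching, reasoning, and coordination, trained with a self-training paradigm that uses reward-guided trajectory sampling to optimize inter-agent collaboration and response generation.

Link: https://arxiv.org/abs/2506.10844
Authors: Alireza Salemi, Mukta Maddipatla, Hamed Zamani
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: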

Abstract:This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

[NLP-22] ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization

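Quick read: This paper addresses overthinking in Chain-of-Thought (CoT) prompting of large language models (LLMs), that is, unnecessarily lengthy or redundant reasoning traces. The key to its solution is Reasoning Compression ThroUgh Stepwise Trials (ReCUT), which uses a stepwise exploration mechanism and a long-short switched sampling strategy so that LLMs incrementally generate diverse reasoning paths. These paths are evaluated to build preference pairs for training two specialized models (Gemini LLMs), one optimized for reasoning accuracy and the other for shorter reasoning, and the final integrated model is obtained by interpolating their parameters, substantially reducing reasoning length while maintaining or improving accuracy.

Link: https://arxiv.org/abs/2506.10822
Authors: Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Affiliations: Northeastern University; Tsinghua University; Shanxi University
Categories: Computation and Language (cs.CL)
Comments: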

Abstract:Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs)-one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via this https URL.
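The final step, interpolating the parameters of the two specialized models, can be sketched as below. The abstract does not say how the interpolation is weighted, so the single coefficient `lam`, the dictionary-of-lists parameter layout, and the function name are assumptions made only for illustration.

```python
# Illustrative linear interpolation of two models' parameters.
# `lam`, the parameter layout, and the function name are hypothetical.

def interpolate_params(params_acc, params_short, lam):
    """Return lam * params_acc + (1 - lam) * params_short, name by name.

    Each argument maps a parameter name to a flat list of floats.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if params_acc.keys() != params_short.keys():
        raise ValueError("both models must share the same parameter names")
    merged = {}
    for name, a in params_acc.items():
        b = params_short[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
    return merged
```

With `lam = 1.0` the result is the accuracy-optimized model and with `lam = 0.0` the length-optimized one; values in between trade the two objectives against each other.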

[NLP-23] VideoDeepResearch: Long Video Understanding With Agentic Tool Using

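Quick read: This paper addresses long video understanding (LVU), which poses a significant challenge for current multi-modal large language models (MLLMs) because of the task's inherent complexity and context window constraints. The key to its solution is VideoDeepResearch, a novel agentic framework that relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all readily available in practice. The system formulates a problem-solving strategy through reasoning and selectively accesses essential video content through tool use, which improves LVU performance.

Link: https://arxiv.org/abs/2506.10821
Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Academy of Artificial Intelligence; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: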

Abstract:Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.

[NLP-24] Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints ACL2025

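Quick read: This paper addresses the long-standing, unresolved challenge of efficiently updating multilingual knowledge in large language models (LLMs) while preserving consistent factual representations across languages. Deploying a separate editing system for each language is viable but costly because multiple models must be managed, whereas integrating updates for all languages into a unified model tends to cause parameter interference that degrades multilingual generalization and the accuracy of injected knowledge. The key to the proposed solution, LangEdit, is a null-space constrained framework that projects the parameter updates for each language onto the orthogonal complement of the previously updated subspaces, precisely isolating language-specific updates, mathematically guaranteeing update independence, and preserving multilingual generalization.

Link: https://arxiv.org/abs/2506.10800
Authors: Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
Affiliations: KU Leuven
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings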

Abstract:Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at this https URL.
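The core operation, projecting a new update onto the orthogonal complement of the subspace spanned by earlier updates, can be sketched in a few lines of pure Python. This is only an illustration of the linear-algebra step: treating each update as a flat vector, building the basis with Gram-Schmidt, and the function names are all assumptions, and LangEdit's actual construction may differ.

```python
# Illustrative orthogonal-complement projection (hypothetical names,
# updates treated as flat vectors).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, previous_updates):
    """Remove from `update` every component lying in span(previous_updates)."""
    residual = list(update)
    for b in orthonormal_basis(previous_updates):
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return residual
```

The projected update has zero inner product with every earlier update direction, which is the sense in which a later edit cannot overwrite what an earlier edit changed along those directions.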

[NLP-25] FASCIST-O-METER: Classifier for Neo-fascist Discourse Online

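Quick read: This paper addresses how to identify and classify neo-fascist content in digital discourse, particularly in online forums in the societal context of the United States. The key to its solution is building the first coding scheme for this phenomenon and combining it with natural language processing (NLP) to train and test models: a large volume of online activity from neo-fascist groups is collected and annotated, yielding the first classification models for neo-fascist discourse.

Link: https://arxiv.org/abs/2506.10789
Authors: Rudy Alexandro Garrido Veliz, Martin Semmann, Chris Biemann, Seid Muhie Yimam
Affiliations: Universität Hamburg
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: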

Abstract:Neo-fascism is a political and societal ideology that has been having remarkable growth in the last decade in the United States of America (USA), as well as in other Western societies. It poses a grave danger to democracy and the minorities it targets, and it requires active actions against it to avoid escalation. This work presents the first-of-its-kind neo-fascist coding scheme for digital discourse in the USA societal context, overseen by political science researchers. Our work bridges the gap between Natural Language Processing (NLP) and political science against this phenomena. Furthermore, to test the coding scheme, we collect a tremendous amount of activity on the internet from notable neo-fascist groups (the forums of Iron March and this http URL), and the guidelines are applied to a subset of the collected posts. Through crowdsourcing, we annotate a total of a thousand posts that are labeled as neo-fascist or non-neo-fascist. With this labeled data set, we fine-tune and test both Small Language Models (SLMs) and Large Language Models (LLMs), obtaining the very first classification models for neo-fascist discourse. We find that the prevalence of neo-fascist rhetoric in this kind of forum is ever-present, making them a good target for future research. The societal context is a key consideration for neo-fascist speech when conducting NLP research. Finally, the work against this kind of political movement must be pressed upon and continued for the well-being of a democratic society. Disclaimer: This study focuses on detecting neo-fascist content in text, similar to other hate speech analyses, without labeling individuals or organizations.

[NLP-26] Improving Named Entity Transcription with Contextual LLM-based Revision

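Quick read: This paper aims to address the high word error rate (WER) of automatic speech recognition (ASR) systems on named entities, which are often the most critical keywords, so misrecognizing them can seriously affect downstream applications. The key to its solution is a large language model (LLM) revision mechanism that corrects wrong named entities in ASR predictions by using the LLM's reasoning ability together with local context (such as lecture notes) that contains the correct named entities.

Link: https://arxiv.org/abs/2506.10779
Authors: Viet Anh Trinh, Xinlu He, Jacob Whitehill
Affiliations: WPI (Worcester Polytechnic Institute)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: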

Abstract:With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM’s reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30% relative WER reduction for named entities.
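For readers unfamiliar with the metric, the sketch below computes the standard word error rate (word-level edit distance divided by reference length) and the relative reduction between a baseline and a revised score. It is a generic illustration with hypothetical function names; it is not the paper's evaluation script, and the named-entity-specific WER reported there restricts scoring in a way not detailed in the abstract.

```python
# Illustrative WER and relative-reduction helpers (hypothetical names).

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, revised):
    """Fraction by which `revised` improves on `baseline`."""
    return (baseline - revised) / baseline
```

Under this definition, a drop from a WER of 0.20 to 0.14 is a 30% relative reduction.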

[NLP-27] Different Questions Different Models: Fine-Grained Evaluation of Uncertainty and Calibration in Clinical QA with LLMs

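Quick read: This paper addresses how to obtain accurate, well-calibrated uncertainty estimates when deploying large language models (LLMs) in high-stakes domains such as clinical decision support. Its key contribution is a fine-grained evaluation of uncertainty estimation methods, including standard single-generation and sampling-based methods, together with an exploration of lightweight single-pass estimators based on behavioral signals in reasoning traces, which approach the performance of Semantic Entropy while requiring only one generation.

Link: https://arxiv.org/abs/2506.10769
Authors: Alberto Testoni, Iacer Calixto
Affiliations: Amsterdam University Medical Center; University of Amsterdam; Amsterdam Public Health, Methodology; Amsterdam Public Health, Mental Health
Categories: Computation and Language (cs.CL)
Comments: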

Abstract:Accurate and well-calibrated uncertainty estimates are essential for deploying large language models (LLMs) in high-stakes domains such as clinical decision support. We present a fine-grained evaluation of uncertainty estimation methods for clinical multiple-choice question answering, covering ten open-source LLMs (general-purpose, biomedical, and reasoning models) across two datasets, eleven medical specialties, and six question types. We compare standard single-generation and sampling-based methods, and present a case study exploring simple, single-pass estimators based on behavioral signals in reasoning traces. These lightweight methods approach the performance of Semantic Entropy while requiring only one generation. Our results reveal substantial variation across specialties and question types, underscoring the importance of selecting models based on both the nature of the question and model-specific strengths.

[NLP-28] One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

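Quick read: This paper addresses the challenges of pretraining multilingual large language models (LLMs) under limited model capacity, scarce high-quality data, and compute constraints, and in particular how to improve a model's ability to adapt to new languages after pretraining when the tokenizer covers those languages poorly. The core of its solution is tokenizer design: a universal tokenizer trained for more languages than the primary pretraining languages. Experiments show that the universal tokenizer enables significantly higher language adaptation, with up to a 20.2% increase in win rates over tokenizers specific to the pretraining languages, and also better adaptation to completely unseen languages.

Link: https://arxiv.org/abs/2506.10766
Authors: Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: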

Abstract:Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve “language plasticity”, or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

[NLP-29] Neural at ArchEHR-QA 2025: Agentic Prompt Optimization for Evidence-Grounded Clinical Question Answering

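Quick read: This paper addresses automated question answering (QA) over electronic health records (EHRs), which can bridge critical information gaps but requires precise evidence retrieval and faithful answer generation under limited supervision. The key to its solution is decoupling the task into two stages: (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. DSPy's MIPROv2 optimizer is used to explore the prompt space automatically, jointly tuning instructions and few-shot demonstrations on the development set, and a self-consistency voting scheme further improves evidence recall without sacrificing precision.

Link: https://arxiv.org/abs/2506.10751
Authors: Sai Prasanna Teja Reddy Bogireddy, Abrar Majeedi, Viswanatha Reddy Gajjala, Zhuoyan Xu, Siddhant Rai, Vaishnav Potlapalli
Affiliations: University of Chicago
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: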

Abstract:Automated question answering (QA) over electronic health records (EHRs) can bridge critical information gaps for clinicians and patients, yet it demands both precise evidence retrieval and faithful answer generation under limited supervision. In this work, we present Neural, the runner-up in the BioNLP 2025 ArchEHR-QA shared task on evidence-grounded clinical QA. Our proposed method decouples the task into (1) sentence-level evidence identification and (2) answer synthesis with explicit citations. For each stage, we automatically explore the prompt space with DSPy’s MIPROv2 optimizer, jointly tuning instructions and few-shot demonstrations on the development set. A self-consistency voting scheme further improves evidence recall without sacrificing precision. On the hidden test set, our method attains an overall score of 51.5, placing second stage while outperforming standard zero-shot and few-shot prompting by over 20 and 10 points, respectively. These results indicate that data-driven prompt optimization is a cost-effective alternative to model fine-tuning for high-stakes clinical QA, advancing the reliability of AI assistants in healthcare.
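One plausible reading of the self-consistency voting step is sketched below: the evidence-identification prompt is sampled several times, and a sentence is kept only if enough runs selected it. The abstract does not specify the voting rule, so the `min_votes` threshold, the integer sentence IDs, and the function name are assumptions made purely for illustration.

```python
# Illustrative self-consistency vote over sampled evidence selections.
# The threshold rule and all names are hypothetical.
from collections import Counter


def vote_evidence(runs, min_votes):
    """Keep sentence IDs selected in at least `min_votes` of the sampled runs.

    `runs` is a list of lists of sentence IDs, one list per sampled generation.
    """
    counts = Counter(sid for run in runs for sid in set(run))
    return sorted(sid for sid, n in counts.items() if n >= min_votes)
```

For example, `vote_evidence([[1, 2, 5], [2, 5], [2, 3]], min_votes=2)` returns `[2, 5]`. Lowering `min_votes` favors recall, while raising it favors precision.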

[NLP-30] TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora ACL2025

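Quick read: This paper addresses the challenge of organizing and retrieving scientific literature, in particular building taxonomies that are both general and adaptable in rapidly evolving scientific fields. Expert-curated taxonomies are expensive and time-consuming, while existing automatic methods either over-rely on a specific corpus, sacrificing generalizability, or depend heavily on the pre-training knowledge of large language models (LLMs), overlooking how scientific domains evolve; they also fail to account for the multi-faceted nature of scientific literature. The proposed solution, TaxoAdapt, dynamically adapts an LLM-generated taxonomy through iterative hierarchical classification, expanding the taxonomy's width and depth according to the corpus's topical distribution, which yields more granular and coherent taxonomies.

Link: https://arxiv.org/abs/2506.10737
Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
Affiliations: University of Illinois at Urbana-Champaign; The Pennsylvania State University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to ACL 2025 Main Conference. Code available at: this https URL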

Abstract:The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on corpus’ topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines judged by LLMs.

[NLP-31] Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims ACL2025

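Quick read: This paper addresses how to analyze, in a structured way and from multiple angles, nuanced claims that cannot be labeled as entirely true or false, as is frequently the case with scientific and political claims, where a claim can be broken into several individually verifiable aspects. The key to its solution is ClaimSpect, a retrieval-augmented generation-based framework that automatically constructs a hierarchy of the aspects of a claim and enriches them with corpus-specific perspectives, giving a comprehensive breakdown of the claim and the range of views on it.

Link: https://arxiv.org/abs/2506.10728
Authors: Priyanka Kargupta, Runchu Tian, Jiawei Han
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to ACL 2025 Main Conference. Code available at: this https URL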

Abstract:Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false” – as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.

[NLP-32] PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models

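[Quick read]: This paper targets the overly verbose chain-of-thought (CoT) traces that large reasoning models (LRMs) produce on mathematical benchmarks, which inflate token usage and cost and limit deployment in latency-sensitive or API-constrained settings. The key to the solution is PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that combines trace-level diagnostics with gradient-inspired prompt optimization to cut reasoning overhead without modifying model weights. A multi-objective textual search jointly optimizes brevity and correctness, balancing token length against answer validity, so that reasoning tokens and cost drop substantially while answer accuracy is preserved.

Link: https://arxiv.org/abs/2506.10716
Authors: Ye Yu, Yaoning Yu, Haohan Wang
Affiliations: University of Illinois at Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: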

Abstract: Large reasoning models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical benchmarks using lengthy chain-of-thought (CoT) reasoning, but the resulting traces are often unnecessarily verbose. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. We introduce PREMISE (PRompt-based Efficient Mathematical Inference with Strategic Evaluation), a prompt-only framework that reduces reasoning overhead without modifying model weights. PREMISE combines trace-level diagnostics with gradient-inspired prompt optimization to minimize redundant computation while preserving answer accuracy. The approach jointly optimizes brevity and correctness through a multi-objective textual search that balances token length and answer validity. Unlike prior work, PREMISE runs in a single-pass black-box interface, so it can be applied directly to commercial LLMs. On GSM8K, SVAMP, and Math500 we match or exceed baseline accuracy (96% → 96% with Claude, 91% → 92% with Gemini) while reducing reasoning tokens by up to 87.5% and cutting dollar cost by 69–82%. These results show that prompt-level optimization is a practical and scalable path to efficient LRM inference without compromising reasoning quality.
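
The abstract describes a search that balances token length against answer validity but does not give the objective. As a rough illustration only, the sketch below ranks candidate prompts with a simple weighted score; the function `prompt_score`, the weight `lam`, and all numbers are hypothetical and are not taken from the paper.

```python
def prompt_score(accuracy, mean_tokens, baseline_tokens, lam=0.2):
    """Hypothetical scalarized objective: reward accuracy, penalize relative length."""
    return accuracy - lam * (mean_tokens / baseline_tokens)

# Toy measurements for three candidate prompts: (accuracy, mean reasoning tokens).
# These values are invented for illustration.
candidates = {
    "baseline prompt": (0.90, 1000),
    "concise prompt": (0.90, 400),
    "too terse prompt": (0.70, 150),
}

best = max(
    candidates,
    key=lambda name: prompt_score(*candidates[name], baseline_tokens=1000),
)
print(best)
```

With these toy numbers the concise prompt wins: it keeps the baseline accuracy at a fraction of the tokens, while the tersest prompt loses more in accuracy than it gains in brevity.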

[NLP-33] Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet

[Quick read]: This paper addresses the many missing hypernymy links in Open English Wordnet, focusing on establishing hypernymy between adjectives. The key to the solution is a new resource for adjective hypernymy, together with fine-tuning large language models to predict adjective hypernyms, which shows that the TaxoLLaMa methodology can be adapted to this task.

Link: https://arxiv.org/abs/2506.10715
Authors: Lorenzo Augello, John P. McCrae
Affiliations: Università Cattolica del Sacro Cuore; Insight Research Ireland Centre for Data Analytics; University of Galway
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.

[NLP-34] Large Language Models for Detection of Life-Threatening Texts

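[Quick read]: This paper addresses the detection of life-threatening language, with the aim of safeguarding individuals in distress, promoting mental health, and preventing potential harm and loss of life. The key to the solution is to use large language models (LLMs) for detection and to compare them with traditional text-processing methods (bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers). Fine-tuning three open-source LLMs (Gemma, Mistral, and Llama-2) on different datasets shows that LLMs outperform the traditional methods in both balanced and imbalanced data scenarios, indicating strong potential for this task.

Link: https://arxiv.org/abs/2506.10687
Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: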

Abstract:Detecting life-threatening language is essential for safeguarding individuals in distress, promoting mental health and well-being, and preventing potential harm and loss of life. This paper presents an effective approach to identifying life-threatening texts using large language models (LLMs) and compares them with traditional methods such as bag of words, word embedding, topic modeling, and Bidirectional Encoder Representations from Transformers. We fine-tune three open-source LLMs including Gemma, Mistral, and Llama-2 using their 7B parameter variants on different datasets, which are constructed with class balance, imbalance, and extreme imbalance scenarios. Experimental results demonstrate a strong performance of LLMs against traditional methods. More specifically, Mistral and Llama-2 models are top performers in both balanced and imbalanced data scenarios while Gemma is slightly behind. We employ the upsampling technique to deal with the imbalanced data scenarios and demonstrate that while this method benefits traditional approaches, it does not have as much impact on LLMs. This study demonstrates a great potential of LLMs for real-world life-threatening language detection problems.
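
The abstract mentions upsampling for the imbalanced scenarios without giving details. A minimal sketch of random upsampling, assuming plain duplication of minority-class examples until every class matches the largest one; the helper `upsample` and the toy data are illustrative, not the authors' code.

```python
import random
from collections import Counter

def upsample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until classes are balanced."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap to the largest class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy training split: four negative examples, one positive.
texts = ["a", "b", "c", "d", "e"]
labels = [0, 0, 0, 0, 1]
new_texts, new_labels = upsample(texts, labels)
print(Counter(new_labels))
```

Upsampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.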

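[NLP-35] TeleMath: A Benchmark for Large Language Models in Telecom Mathematical Problem Solving

[Quick read]: This paper addresses the evaluation of large language models (LLMs) on mathematically intensive tasks in telecommunications, where their effectiveness in specialized areas such as signal processing, network optimization, and performance analysis remains largely unexplored. The key to the solution is TeleMath, the first benchmark dataset designed specifically to evaluate LLMs on telecom mathematical problems with numerical solutions. It contains 500 question-answer pairs produced by a generation pipeline that starts from seed problems crafted by subject matter experts, and it covers a wide spectrum of telecom topics.

Link: https://arxiv.org/abs/2506.10674
Authors: Vincenzo Colle, Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Merouane Debbah
Affiliations: Paris Research Center, Huawei Technologies; Università degli Studi di Cassino e del Lazio Meridionale
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages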

Abstract:The increasing adoption of artificial intelligence in telecommunications has raised interest in the capability of Large Language Models (LLMs) to address domain-specific, mathematically intensive tasks. Although recent advancements have improved the performance of LLMs in general mathematical reasoning, their effectiveness within specialized domains, such as signal processing, network optimization, and performance analysis, remains largely unexplored. To address this gap, we introduce TeleMath, the first benchmark dataset specifically designed to evaluate LLM performance in solving mathematical problems with numerical solutions in the telecommunications domain. Comprising 500 question-answer (QnA) pairs, TeleMath covers a wide spectrum of topics in the telecommunications field. This paper outlines the proposed QnAs generation pipeline, starting from a selected seed of problems crafted by Subject Matter Experts. The evaluation of a wide range of open-source LLMs reveals that best performance on TeleMath is achieved by recent models explicitly designed for mathematical or logical reasoning. In contrast, general-purpose models, even those with a large number of parameters, often struggle with these challenges. We have released the dataset and the evaluation code to ease result reproducibility and support future research.

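[NLP-36] Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters

[Quick read]: This paper examines how large language models (LLMs) handle character-level tasks, in particular their difficulty identifying compositional subcomponents within tokens. It finds that, although spelling out is simple for humans, LLMs do not handle it in a straightforward way: the embedding layer does not fully encode character-level information, particularly beyond the first character. The key finding is that LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where a distinct “breakthrough” in spelling behavior appears. This mechanism is validated through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

Link: https://arxiv.org/abs/2506.10641
Authors: Tatsuya Hiraoka, Kentaro Inui
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); RIKEN
Categories: Computation and Language (cs.CL)
Comments: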

Abstract:Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct “breakthrough” in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.

[NLP-37] Conversational Search: From Fundamentals to Frontiers in the LLM Era SIGIR2025

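[Quick read]: This tutorial addresses how to build intelligent conversational search systems with large language models (LLMs), so that users can satisfy complex information needs more efficiently and accurately over multi-turn interactions. Its key idea is to connect the fundamentals of conversational search with the emerging topics opened up by the instruction-following, content-generation, and reasoning capacities of LLMs, and to explore their potential for advancing the core techniques of the field.

Link: https://arxiv.org/abs/2506.10635
Authors: Fengran Mo, Chuan Meng, Mohammad Aliannejadi, Jian-Yun Nie
Affiliations: Université de Montréal; University of Amsterdam
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by Tutorial Track in SIGIR 2025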

Abstract:Conversational search enables multi-turn interactions between users and systems to fulfill users’ complex information needs. During this interaction, the system should understand the users’ search intent within the conversational context and then return the relevant information through a flexible, dialogue-based interface. The recent powerful large language models (LLMs) with capacities of instruction following, content generation, and reasoning, attract significant attention and advancements, providing new opportunities and challenges for building up intelligent conversational search systems. This tutorial aims to introduce the connection between fundamentals and the emerging topics revolutionized by LLMs in the context of conversational search. It is designed for students, researchers, and practitioners from both academia and industry. Participants will gain a comprehensive understanding of both the core principles and cutting-edge developments driven by LLMs in conversational search, equipping them with the knowledge needed to contribute to the development of next-generation conversational search systems.

[NLP-38] NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors

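[Quick read]: This paper addresses mistake identification in the pedagogical assessment of AI-powered tutors, that is, judging whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. The key to the solution is combining example-driven prompting with the reasoning ability of a large language model (LLM): the system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions, improving pedagogical feedback assessment.

Link: https://arxiv.org/abs/2506.10627
Authors: Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 1 table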

Abstract:This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor’s response correctly identifies a mistake in a student’s mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM) i.e. GPT 4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at this https URL.
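
The final system's three steps (retrieve similar examples, build a structured prompt, parse the output against a schema) can be sketched roughly as follows. Everything here is an assumption for illustration: the hand-written toy vectors stand in for real sentence embeddings, the label set `ALLOWED_LABELS` is a placeholder rather than the shared task's label scheme, and no LLM is actually called.

```python
import json
import math

ALLOWED_LABELS = {"Yes", "No"}  # placeholder label set, not taken from the paper

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, pool, k=2):
    """Return the k labeled examples most similar to the query embedding."""
    return sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)[:k]

def build_prompt(shots, response):
    """Assemble a few-shot prompt that asks for a JSON answer."""
    parts = ["Does the tutor response identify the student's mistake? "
             'Reply as JSON: {"label": "Yes" or "No"}.']
    for shot in shots:
        answer = json.dumps({"label": shot["label"]})
        parts.append(f"Response: {shot['text']}\nAnswer: {answer}")
    parts.append(f"Response: {response}\nAnswer:")
    return "\n\n".join(parts)

def parse_label(raw):
    """Schema check on the model output: a JSON object with an allowed label."""
    data = json.loads(raw)
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected model output: {raw!r}")
    return data["label"]

# Toy example pool; vectors are made up, not real embeddings.
pool = [
    {"text": "Check the sign in step 2.", "label": "Yes", "vec": [0.9, 0.1, 0.0]},
    {"text": "Great job, keep going!", "label": "No", "vec": [0.0, 0.2, 0.9]},
    {"text": "Your subtraction is off.", "label": "Yes", "vec": [0.8, 0.3, 0.1]},
]
shots = retrieve([1.0, 0.2, 0.0], pool, k=2)
prompt = build_prompt(shots, "Look again at how you carried the 1.")
print(prompt)
```

Rejecting any output that fails the schema check keeps malformed generations from silently turning into predictions.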

[NLP-39] SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis

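[Quick read]: This paper addresses the need of conversational AI systems for high-quality, flexible, and reproducible synthetic dialogue data for training, evaluation, and benchmarking. The key to the solution is SDialog, a modular, extensible Python toolkit that uses instruction-tuned large language models (LLMs) and provides abstractions for personas, orchestration, and scenario management. This enables realistic, diverse, and controllable dialogue data and advances the standardization of tools and frameworks for synthetic data generation.

Link: https://arxiv.org/abs/2506.10622
Authors: Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: this https URL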

Abstract:The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today’s fast-evolving research landscape.

[NLP-40] Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code

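[Quick read]: This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, especially single leads compromised by signal overlaps. The key to the solution is a two-stage pipeline. The first stage uses a U-Net-based segmentation network, trained on a dataset enriched with overlapping signals and custom data augmentations, to isolate the primary ECG trace. The second stage converts the refined binary mask into a time-series signal using established digitization techniques, with an adaptive grid detection module for better versatility across ECG formats and scales.

Link: https://arxiv.org/abs/2506.10617
Authors: Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Affiliations: University of Tehran
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: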

Abstract: This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps, a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline’s 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: this https URL.
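
For readers unfamiliar with the three metrics reported above, here is a small sketch of their standard definitions in plain Python. The function names and the toy mask and signal values are illustrative; this is not the authors' evaluation code, which may differ in details such as normalization or resampling.

```python
import math

def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0

def mse(x, y):
    """Mean squared error between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Toy segmentation masks: two pixels overlap out of four marked in either mask.
pred_mask = [[0, 1, 1], [0, 1, 0]]
true_mask = [[0, 1, 0], [0, 1, 1]]

# Toy signals: the digitized trace is a scaled and shifted copy of the reference.
reference = [0.0, 0.5, 1.0, 0.5]
digitized = [0.1, 0.5, 0.9, 0.5]

print(iou(pred_mask, true_mask), mse(reference, digitized), pearson(reference, digitized))
```

The toy signals show why both signal metrics are reported: a trace can correlate almost perfectly with the reference while still carrying a nonzero amplitude error.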

[NLP-41] Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search

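[Quick read]: This paper addresses the unsupervised reconstruction of protoforms, the ancestral word forms from which modern language forms are derived. Prior methods mainly rely on probabilistic models of phonological edits to infer protoforms from cognate sets and are limited by their predominantly data-driven nature. The key to the proposed solution is integrating data-driven inference with rule-based heuristics within an evolutionary optimization framework, so that both statistical patterns and linguistically motivated constraints guide the reconstruction.

Link: https://arxiv.org/abs/2506.10614
Authors: Promise Dodzi Kpoglu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: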

Abstract: We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.

[NLP-42] Encoding call-by-push-value in the pi-calculus

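[Quick read]: This paper addresses encoding Levy's call-by-push-value (CBPV) lambda-calculus in the pi-calculus and proving the encoding correct. The key to the solution is an encoding specialized to the internal pi-calculus (pi-i-calculus), which avoids the challenges that de Bruijn indices pose in a formalization and benefits from the fact that early, late, and open bisimulation coincide in this setting and that bisimulation is a congruence. The authors also argue that the encoding satisfies the five criteria for good encodings proposed by Gorla, and they compare it with Milner's encoding.

Link: https://arxiv.org/abs/2506.10584
Authors: Benjamin Bennetzen, Nikolaj Rossander Kristensen, Peter Buus Steffensen
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
Comments: 56 pages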

Abstract: In this report we define an encoding of Levy’s call-by-push-value lambda-calculus (CBPV) in the pi-calculus, and prove that our encoding is both sound and complete. We present informal (by-hand) proofs of soundness, completeness, and all required lemmas. The encoding is specialized to the internal pi-calculus (pi-i-calculus) to circumvent certain challenges associated with using de Bruijn indices in a formalization. It also helps with bisimulation, as early, late, and open bisimulation coincide in this setting; furthermore, bisimulation is a congruence. Additionally, we argue that our encoding satisfies the five criteria for good encodings proposed by Gorla, and we show similarities between Milner’s encoding and ours. This paper includes encodings of CBPV in the pi-i-calculus, the asynchronous polyadic pi-calculus, and the local pi-calculus. We begin a formalization of the proof in Coq for the soundness and completeness of the encoding in the pi-i-calculus. Not all lemmas used in the formalization are themselves formally proven. However, we argue that the non-proven lemmas are reasonable, as they are proven by hand, or amount to Coq formalities that are straightforward given informal arguments.

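[NLP-43] Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

[Quick read]: This paper addresses a gap in how the scientific cognitive abilities of Multimodal Large Language Models (MLLMs) are evaluated: existing scientific benchmarks mostly test knowledge understanding and do not adequately assess perception and reasoning. The key to the solution is the Scientists' First Exam (SFE) benchmark, which evaluates MLLMs at three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. SFE comprises 830 expert-verified visual question answering (VQA) pairs, spanning 66 multimodal tasks across five high-value disciplines.

Link: https://arxiv.org/abs/2506.10521
Authors: Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 82 pages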

Abstract:Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.

[NLP-44] Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs

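[Quick read]: This paper addresses the poor performance of large language models (LLMs) on knowledge-intensive tasks, which stems from missing background knowledge and hallucination. Existing knowledge graph (KG)-enhanced LLMs focus on supplying factual knowledge but still struggle with complex questions. The paper argues that refining the relationships among facts and organizing them into a logically consistent reasoning path matters as much as the facts themselves. To handle the difficulties of extracting reliable reasoning paths from KGs, namely complex graph structure and the presence of multiple generated paths, it proposes the RRP framework, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning, and adds a rethinking module that evaluates and refines paths according to their significance.

Link: https://arxiv.org/abs/2506.10508
Authors: Yilin Xiao, Chuang Zhou, Qinggang Zhang, Bo Li, Qing Li, Xiao Huang
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: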

Abstract:Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.

[NLP-45] Beyond Single-User Dialogue: Assessing Multi-User Dialogue State Tracking Capabilities of Large Language Models

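[Quick read]: This paper addresses the weak performance of current large language models (LLMs) on multi-user dialogue state tracking (DST), where the complexity of real-world multi-speaker interaction has not been adequately considered. The key to the solution is generating a second user's utterances based on speech act theory and systematically incorporating them into an existing DST dataset, which keeps dataset construction costs low while enabling a controlled evaluation of LLM robustness in multi-user settings.

Link: https://arxiv.org/abs/2506.10504
Authors: Sangmin Song, Juhwan Choi, JungMin Yun, YoungBin Kim
Affiliations: GS Neotek; AITRICS; Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: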

Abstract:Large language models (LLMs) have demonstrated remarkable performance in zero-shot dialogue state tracking (DST), reducing the need for task-specific training. However, conventional DST benchmarks primarily focus on structured user-agent conversations, failing to capture the complexities of real-world multi-user interactions. In this study, we assess the robustness of LLMs in multi-user DST while minimizing dataset construction costs. Inspired by recent advances in LLM-based data annotation, we extend an existing DST dataset by generating utterances of a second user based on speech act theory. Our methodology systematically incorporates a second user’s utterances into conversations, enabling a controlled evaluation of LLMs in multi-user settings. Experimental results reveal a significant performance drop compared to single-user DST, highlighting the limitations of current LLMs in extracting and tracking dialogue states amidst multiple speakers. Our findings emphasize the need for future research to enhance LLMs for multi-user DST scenarios, paving the way for more realistic and robust DST models.

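[NLP-46] Surface Fairness, Deep Bias: A Comparative Study of Bias in Language Models

[Quick read]: This paper addresses bias in large language models (LLMs), in particular the biased views and differing results that models produce depending on an assigned persona or the persona of the user. Its key approach is to change the task format so that bias becomes easier to detect: asking the model to grade a user's answer, or to give salary negotiation advice, reveals markedly stronger bias. In addition, with the rise of memory and personalization in LLM assistants, users no longer need to pre-prompt a persona description because the model already knows their socio-demographics, which brings these problems up from a new angle.

Link: https://arxiv.org/abs/2506.10491
Authors: Aleksandra Sorokovikova, Pavel Chizhov, Iuliia Eremenko, Ivan P. Yamshchikov
Affiliations: Constructor University, Bremen; CAIRO, THWS, Würzburg; University of Kassel
Categories: Computation and Language (cs.CL)
Comments: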

Abstract:Modern language models are trained on large amounts of data. These data inevitably include controversial and stereotypical content, which contains all sorts of biases related to gender, origin, age, etc. As a result, the models express biased points of view or produce different results based on the assigned personality or the personality of the user. In this paper, we investigate various proxy measures of bias in large language models (LLMs). We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user’s answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers. With the recent trend for LLM assistant memory and personalization, these problems open up from a different angle: modern LLM users do not need to pre-prompt the description of their persona since the model already knows their socio-demographics.

[NLP-47] Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

[Quick read]: This paper addresses the problem that, in scientific claim verification, predicting only the final label reveals little about a model's reasoning and offers limited interpretability. The key to the solution is reframing table-text alignment as an explanation task that requires models to identify the table cells essential for verifying a claim, and building a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales, with the aim of improving interpretability and reasoning.

Link: https://arxiv.org/abs/2506.10486
Authors: Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Affiliations: National Institute of Informatics, Japan; The University of Tokyo, Japan; National Taiwan University; JFLI, CNRS, Nantes Université, France
Categories: Computation and Language (cs.CL)
Comments: 8 pages; code and data are available at this https URL

Abstract:Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model’s reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.

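[NLP-48] Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts

[Quick read]: This paper addresses the challenge of handling missing modalities and Out-Of-Distribution (OOD) data at the same time in Multimodal Emotion Recognition (MER), where existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. The key to the solution is a new robust framework, Causal Inference Distiller (CIDer), built from two core components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD improves robustness through weight-sharing self-distillation applied across low-level features, attention maps, and high-level representations. MACI uses a tailored causal graph to mitigate label and language biases and improve OOD generalization, and it can be used independently with only a few additional parameters.

Link: https://arxiv.org/abs/2506.10452
Authors: Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Affiliations: Zhejiang University of Technology; Zhejiang University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Submitted to TAC. The code is available at this https URL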

Abstract:Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at this https URL.

[NLP-49] Fast on the Easy, Deep on the Hard: Efficient Reasoning via Powered Length Penalty

[Quick read]: This paper addresses the overly long outputs that large language models (LLMs) generate during reasoning, which increase computational latency, and aims to improve reasoning efficiency while preserving accuracy. The key to the solution is dividing the reward function and adding a novel output-length penalty, so that the model reasons concisely on simple problems and keeps sufficient reasoning on complex ones, improving overall performance.

Link: https://arxiv.org/abs/2506.10446
Authors: Zehui Ling, Deshu Chen, Hongwei Zhang, Yifeng Jiao, Xin Guo, Yuan Cheng
Affiliations: Artificial Intelligence Innovation and Incubation Institute, Fudan University; School of Data Science, Fudan University; Shanghai Academy of Artificial Intelligence for Science
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated significant advancements in reasoning capabilities, performing well on various challenging benchmarks. Techniques like Chain-of-Thought prompting have been introduced to further improve reasoning. However, these approaches frequently generate longer outputs, which in turn increase computational latency. Although some methods use reinforcement learning to shorten reasoning, they often apply uniform penalties without considering the problem’s complexity, leading to suboptimal outcomes. In this study, we seek to enhance the efficiency of LLM reasoning by promoting conciseness for simpler problems while preserving sufficient reasoning for more complex ones for accuracy, thus improving the model’s overall performance. Specifically, we manage the model’s reasoning efficiency by dividing the reward function and including a novel penalty for output length. Our approach has yielded impressive outcomes in benchmark evaluations across three datasets: GSM8K, MATH500, and AIME2024. For the comparatively simpler datasets GSM8K and MATH500, our method has effectively shortened output lengths while preserving or enhancing accuracy. On the more demanding AIME2024 dataset, our approach has resulted in improved accuracy.
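
The abstract does not state the reward formula, so the following is purely an illustrative guess at what a divided reward with a power-law length penalty could look like. The function `shaped_reward` and the parameters `max_tokens`, `alpha`, and `power` are hypothetical and are not taken from the paper.

```python
def shaped_reward(correct, n_tokens, max_tokens=1024, alpha=0.5, power=0.5):
    """Hypothetical reward: a correctness term minus a power-law length penalty."""
    length_ratio = min(n_tokens, max_tokens) / max_tokens
    penalty = alpha * length_ratio ** power
    # Assumed split: the length penalty applies only to correct answers, and
    # alpha < 1 keeps any correct answer above any incorrect one, so the model
    # is never rewarded for trading correctness for brevity.
    return 1.0 - penalty if correct else 0.0

for tokens in (0, 256, 1024):
    print(tokens, shaped_reward(True, tokens))
```

In this sketch an exponent below 1 makes the penalty rise steeply for the first few hundred tokens and flatten afterwards, so short answers to easy problems are rewarded most while long reasoning on hard problems costs comparatively little extra.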

[NLP-50] PAL: Probing Audio Encoders via LLMs – A Study of Information Transfer from Audio Encoders to LLMs

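[Quick Read]: This paper addresses how to efficiently transfer the rich semantic representations of audio encoders into large language models (LLMs), improving cross-modal information transfer in Audio-LLMs. The key to its solution is a systematic set of architectural changes: delaying the integration of audio information until textual context is established, probing audio representations only through the attention submodule of each LLM layer, and combining multiple audio encoders for richer, complementary representations. These design choices are validated on a dataset of 5.6 million audio-text pairs, and the final model achieves relative improvements of 10% to 60% over the baseline.

Link: https://arxiv.org/abs/2506.10423
Authors: Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais
Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK; Surrey Institute for People-Centred AI (PAI), University of Surrey, UK; Mohamed bin Zayed University of AI, Abu Dhabi, UAE
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 21 pages, 11 figures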

Abstract:The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities e.g., audio reasoning, have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM’s ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation on how architectural design choices can affect that. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM’s initial layers establish textual context that enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through LLM layer’s attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM’s capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10% to 60% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: this https URL

[NLP-51] Beyond the Battlefield: Framing Analysis of Media Coverage in Conflict Reporting

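[Quick Read]: This paper examines how the framing used by news media in conflict reporting shapes readers' opinions, focusing on coverage of the Israel-Palestine war, where existing studies offer limited insight because they are qualitative or consider only surface-level generic frames. The key to its solution is a computational approach that combines frame semantics with large language models to identify communicative framing and its connection to linguistic framing, revealing an imbalance between war-based and peace-based reporting and differences across regional outlets in who is framed as assailant and victim.

Link: https://arxiv.org/abs/2506.10421
Authors: Avneet Kaur, Arnav Arora
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: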

Abstract:Framing used by news media, especially in times of conflict, can have substantial impact on readers’ opinion, potentially aggravating the conflict itself. Current studies on the topic of conflict framing have limited insights due to their qualitative nature or only look at surface level generic frames without going deeper. In this work, we identify indicators of war and peace journalism, as outlined by prior work in conflict studies, in a corpus of news articles reporting on the Israel-Palestine war. For our analysis, we use computational approaches, using a combination of frame semantics and large language models to identify both communicative framing and its connection to linguistic framing. Our analysis reveals a higher focus on war based reporting rather than peace based. We also show substantial differences in reporting across the US, UK, and Middle Eastern news outlets in framing who the assailant and victims of the conflict are, surfacing biases within the media.

[NLP-52] Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences? ACL2025

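[Quick Read]: This paper addresses the weak temporal grounding and reasoning abilities of Multimodal Large Language Models (MLLMs) on image sequences. The key to its solution is the TempVS benchmark, which comprises three main tests (event relation inference, sentence ordering, and image ordering), each paired with a basic grounding test, to evaluate how well MLLMs understand the temporal order of events. TempVS requires models to rely on both visual and linguistic modalities, exposing the limits of current models in temporal reasoning.

Link: https://arxiv.org/abs/2506.10415
Authors: Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Affiliations: Utrecht University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 14 figures. Accepted to ACL 2025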

Abstract:This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at this https URL.

[NLP-53] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

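[Quick Read]: This paper aims to close the gap between real-world time series data, which is irregular, multimodal, and messy, and existing benchmark datasets, which assume clean, regularly sampled, unimodal data. The key to its solution is the Time-IMM dataset and the IMM-TSF benchmark library. Time-IMM is designed to capture cause-driven irregularity in multimodal multivariate time series and covers nine types of irregularity. IMM-TSF provides a forecasting benchmark for irregular multimodal time series, with specialized fusion modules that support asynchronous integration and realistic evaluation, improving forecasting performance on irregular time series.

Link: https://arxiv.org/abs/2506.10412
Authors: Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang
Affiliations: University of California, Los Angeles; National Yang Ming Chiao Tung University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper is currently under review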

Abstract:Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at this https URL, and the benchmark library can be accessed at this https URL.

[NLP-54] PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

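[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in reliably verifying the correctness of their own outputs on complex reasoning tasks. Existing solutions usually depend on a separate verifier model or a multi-stage self-correction training pipeline, which limits scalability. The proposed solution, Policy as Generative Verifier (PAG), lets an LLM self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) framework. Unlike earlier methods that always generate a second attempt, PAG revises selectively: the model revises its answer only when its generative verification step detects an error, so that reasoning and verification abilities improve together.

Link: https://arxiv.org/abs/2506.10406
Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: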

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generative Verifier (PAG), a simple and effective framework that empowers LLMs to self-correct by alternating between policy and verifier roles within a unified multi-turn reinforcement learning (RL) paradigm. Distinct from prior approaches that always generate a second attempt regardless of model confidence, PAG introduces a selective revision mechanism: the model revises its answer only when its own generative verification step detects an error. This verify-then-revise workflow not only alleviates model collapse but also jointly enhances both reasoning and verification abilities. Extensive experiments across diverse reasoning benchmarks highlight PAG’s dual advancements: as a policy, it enhances direct generation and self-correction accuracy; as a verifier, its self-verification outperforms self-consistency.
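The verify-then-revise workflow described above can be sketched as a small control loop. This is only an illustration of the inference-time flow, not the paper's RL training procedure. The `generate`, `verify`, and `revise` callables are hypothetical stand-ins for the single model's policy and verifier roles, and the arithmetic toy task is invented for the example.

```python
def verify_then_revise(question, generate, verify, revise, max_turns=2):
    """Return (answer, revisions), revising only when the verifier flags an error."""
    answer = generate(question)
    revisions = 0
    for _ in range(max_turns):
        if verify(question, answer):
            break  # verifier accepts: no second attempt is generated
        answer = revise(question, answer)
        revisions += 1
    return answer, revisions


# Toy stand-ins for the model's roles (hypothetical, arithmetic only).
def add(question):
    left, right = question.split("+")
    return str(int(left) + int(right))


def wrong_first_try(question):
    return "0"


def check(question, answer):
    return answer == add(question)


def fix(question, answer):
    return add(question)


revised = verify_then_revise("2+3", wrong_first_try, check, fix)
untouched = verify_then_revise("2+3", add, check, fix)
```

In the second call the verifier accepts the first answer, so no revision is attempted, which is the selective behaviour the abstract contrasts with always generating a second attempt.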

[NLP-55] TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

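[Quick Read]: This paper addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods on heterogeneous documents that mix text and tables: table structure is disrupted, information is lost, and LLM reasoning degrades on multi-hop, global queries. The key to its solution is TableRAG, a hybrid framework that unifies textual understanding with complex table manipulation through four iterative steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation.

Link: https://arxiv.org/abs/2506.10380
Authors: Xiaohan Yu, Pu Jian, Chong Chen
Affiliations: Huawei Cloud BU
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review. Codes are available at this https URL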

Abstract:Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate the multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at this https URL.
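To make the "SQL programming and execution" step concrete, here is a minimal sketch that keeps a table intact in SQLite and answers a sub-query with SQL instead of flattened text. The `sales` table, its columns, the `run_sql_step` helper, and the hard-coded sub-query are all hypothetical. In the framework described above the SQL would come from the model, which is not shown here.

```python
import sqlite3


def run_sql_step(rows, sql):
    """Load rows into an in-memory table and execute one SQL sub-query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = [("North", 2023, 120.0), ("South", 2023, 95.5), ("North", 2024, 140.0)]
# A global aggregation of the kind that is hard to recover from chunked text.
sub_query = "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
totals = run_sql_step(rows, sub_query)

# Compose an intermediate answer from the executed result.
top_region = max(totals, key=lambda pair: pair[1])[0]
intermediate_answer = f"{top_region} has the highest total revenue."
```

Executing the aggregation in the database means the result does not depend on how the table would have been split into text chunks.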

[NLP-56] Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

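[Quick Read]: This paper addresses the identification of causal relationships in the evaluation of language model capabilities, aiming to uncover the latent causal structure among capabilities through rigorous causal evaluation and thereby give more actionable guidance for model development. The key to its solution is a causal representation learning framework that models observed benchmark performance as a linear transformation of a few latent capability factors and identifies the causal relations among those factors after appropriately controlling for the base model as a common confounder.

Link: https://arxiv.org/abs/2506.10378
Authors: Jikai Jin, Vasilis Syrgkanis, Sham Kakade, Hanlin Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: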

Abstract:Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.

[NLP-57] Can We Infer Confidential Properties of Training Data from LLMs?

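[Quick Read]: This paper addresses property inference vulnerabilities that may be exposed when large language models (LLMs) are fine-tuned on domain-specific data, where an attacker infers sensitive dataset-level properties of the training set from model outputs. The key to its solution is PropInfer, a benchmark task for evaluating property inference under two fine-tuning paradigms (question answering and chat completion), together with two tailored attacks: a prompt-based generation attack and a shadow-model attack that exploits word-frequency signals. Experiments show that these attacks succeed, revealing a previously unrecognized vulnerability in LLMs.

Link: https://arxiv.org/abs/2506.10364
Authors: Penguin Huang, Chhavi Yadav, Ruihan Wu, Kamalika Chaudhuri
Affiliations: UC San Diego
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: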

Abstract:Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties – such as patient demographics or disease prevalence – that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.

[NLP-58] An Analysis of Datasets, Metrics and Models in Keyphrase Generation ACL2025

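[Quick Read]: This paper addresses the lack of a systematic review and analysis of keyphrase generation, aiming to lay out the field's progress, limitations, and open challenges. The key to its solution is an in-depth analysis of more than 50 research papers, which identifies critical problems in current evaluation practice, such as the high similarity among commonly used benchmark datasets and inconsistent metric calculations that lead to overestimated performance. To address the limited availability of pre-trained models for the task, the authors also release a strong model based on a pre-trained language model (PLM) to support future research.

Link: https://arxiv.org/abs/2506.10346
Authors: Florian Boudin, Akiko Aizawa
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: GEM^2 paper @ ACL 2025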

Abstract:Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research.

[NLP-59] Code Execution as Grounded Supervision for LLM Reasoning

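[Quick Read]: This paper addresses the challenge of obtaining reliable and accurate chain-of-thought (CoT) supervision for training large language models (LLMs). Existing methods rely on costly human annotation or error-prone LLM-generated CoT, which makes data quality hard to guarantee. The key to the proposed solution is to exploit the determinism of program execution: verifiable, step-by-step reasoning traces are extracted from code execution and converted into natural-language CoT, yielding high-quality supervision data.

Link: https://arxiv.org/abs/2506.10343
Authors: Dongwon Jung, Wenxuan Zhou, Muhao Chen
Affiliations: University of California, Davis; University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: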

Abstract:Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into a natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, the ablation studies validate that our method produces highly accurate reasoning data and reduces overall token length during inference by reducing meaningless repetition and overthinking.
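As a rough illustration of extracting a verifiable step-by-step trace from program execution, the sketch below records local variables before each executed line using Python's standard `sys.settrace` hook and renders them as plain sentences. The helper names `trace_locals` and `render`, the toy function `sum_to`, and the sentence template are assumptions. The paper's actual pipeline, including how traces become fluent CoT, is not specified in the abstract.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args) and record (relative line, locals) before each line executes."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record line events that belong to the traced function itself.
        if frame.f_code is code and event == "line":
            steps.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, steps


def sum_to(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def render(steps):
    """Turn recorded states into plain-language reasoning steps."""
    lines = []
    for line, state in steps:
        facts = ", ".join(f"{name} = {value}" for name, value in sorted(state.items()))
        lines.append(f"Before line {line}: {facts}")
    return lines


result, steps = trace_locals(sum_to, 3)
cot_lines = render(steps)
```

Because every recorded state comes from actual execution, each rendered step can be checked against the program, which is the property the abstract relies on.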

[NLP-60] Provably Learning from Language Feedback

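[Quick Read]: This paper addresses Learning from Language Feedback (LLF), in particular how to learn to make decisions effectively when rewards are latent. The key to its solution is the transfer eluder dimension, a complexity measure for LLF problems, and a no-regret algorithm built on it, HELiX, which provably solves LLF problems through sequential interaction with performance guarantees that scale with the problem's transfer eluder dimension.

Link: https://arxiv.org/abs/2506.10341
Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: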

Abstract:Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

[NLP-61] Detecting Sockpuppetry on Wikipedia Using Meta-Learning ACL2025

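[Quick Read]: This paper addresses the detection of malicious sockpuppet accounts on Wikipedia, in order to preserve access to reliable information on the internet and prevent the spread of disinformation. Earlier machine learning methods rely on stylistic and metadata features but do not prioritise adaptability to author-specific behaviour, so they struggle to model the behaviour of specific sockpuppet groups when text data is limited. The key to the solution is meta-learning, a technique that improves performance in data-scarce settings by training a model across multiple tasks, optimising it for rapid adaptation to the writing style of a new sockpuppet group.

Link: https://arxiv.org/abs/2506.10314
Authors: Luc Raszewski, Christine De Kock
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to ACL 2025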

Abstract:Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.

[NLP-62] Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs INTERSPEECH2025

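[Quick Read]: This paper addresses modality adaptation from text to speech when large language models (LLMs) are used for speech-to-speech translation (S2ST). Because LLMs are usually trained on text-only data, adapting them to the speech modality with limited speech-to-speech data is difficult. The key to the solution is scheduled interleaved speech-text training: training uses interleaved speech-text units rather than speech units alone, with aligned text tokens interleaved at the word level, and the proportion of text is gradually reduced as training progresses to support progressive adaptation from text to speech.

Link: https://arxiv.org/abs/2506.10299
Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech2025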

Abstract:Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on text-only data, which presents challenges to adapt them to speech modality with limited speech-to-speech data. To address the training difficulty, we propose scheduled interleaved speech–text training in this study. We use interleaved speech–text units instead of speech units during training, where aligned text tokens are interleaved at the word level. We gradually decrease the ratio of text as training progresses, to facilitate progressive modality adaptation from text to speech. We conduct experimental evaluations by fine-tuning LLaMA3.2-1B for S2ST on the CVSS dataset. We show that the proposed method consistently improves the translation performances, especially for languages with limited training data.
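The schedule described above can be sketched in a few lines. The abstract only says the text ratio decreases gradually, so the linear decay, the `start` and `end` values, the per-word random choice, and the unit names such as `u12` are all assumptions for illustration.

```python
import random


def text_ratio(step, total_steps, start=0.9, end=0.0):
    """Linearly decay the share of words kept as text (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def interleave(words, speech_units, ratio, rng):
    """For each aligned word, emit its text token or its speech units."""
    sequence = []
    for word, units in zip(words, speech_units):
        if rng.random() < ratio:
            sequence.append(word)   # keep this word as a text token
        else:
            sequence.extend(units)  # replace it with its discrete speech units
    return sequence


words = ["guten", "morgen"]
speech_units = [["u12", "u7"], ["u3"]]  # hypothetical word-aligned unit ids
rng = random.Random(0)

all_text = interleave(words, speech_units, 1.0, rng)
all_speech = interleave(words, speech_units, 0.0, rng)
mid_training_ratio = text_ratio(50, 100)
```

Early in training most words stay as text the model already understands. By the end the sequence consists only of speech units.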

[NLP-63] “Check My Work?”: Measuring Sycophancy in a Simulated Educational Context KDD

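[Quick Read]: This paper examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy can pose significant risks. Its central finding concerns query framing: when a student mentions an incorrect answer, LLM accuracy can drop by as much as 15 percentage points, while mentioning the correct answer raises accuracy by the same margin. The bias is stronger in smaller models, with an effect of up to 30% for GPT-4.1-nano versus 8% for GPT-4o. Analysis of how often models “flip” their answers, together with token-level probabilities, confirms that models tend to shift toward the answer the student mentioned, consistent with the sycophancy hypothesis.

Link: https://arxiv.org/abs/2506.10297
Authors: Chuck Arvin
Affiliations: USC Gould School of Law
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Presented at KDD Workshop on Ethical Artificial Intelligence: Methods and Applications (EAI) 2025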

Abstract:This study examines how user-provided suggestions affect Large Language Models (LLMs) in a simulated educational context, where sycophancy poses significant risks. Testing five different LLMs from the OpenAI GPT-4o and GPT-4.1 model classes across five experimental conditions, we show that response quality varies dramatically based on query framing. In cases where the student mentions an incorrect answer, the LLM correctness can degrade by as much as 15 percentage points, while mentioning the correct answer boosts accuracy by the same margin. Our results also show that this bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model, versus 8% for the GPT-4o model. Our analysis of how often LLMs “flip” their answer, and an investigation into token level probabilities, confirm that the models are generally changing their answers to answer choices mentioned by students in line with the sycophancy hypothesis. This sycophantic behavior has important implications for educational equity, as LLMs may accelerate learning for knowledgeable students while the same tools may reinforce misunderstanding for less knowledgeable students. Our results highlight the need to better understand the mechanism, and ways to mitigate, such bias in the educational context.
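The two quantities discussed above, the accuracy change between conditions and the rate of "flips" toward the mentioned answer, can be computed as in this minimal sketch. The record layout and the four toy items are invented for illustration and are not the paper's data or results.

```python
def accuracy(records):
    """Fraction of items where the model's answer matches the gold answer."""
    return sum(r["answer"] == r["gold"] for r in records) / len(records)


def flip_rate(baseline, conditioned):
    """Fraction of items where the answer changed to the one the student mentioned."""
    flips = sum(
        b["answer"] != c["answer"] and c["answer"] == c["mentioned"]
        for b, c in zip(baseline, conditioned)
    )
    return flips / len(baseline)


# Toy data: the same four questions asked without and with a student suggestion.
baseline = [
    {"answer": "A", "gold": "A"},
    {"answer": "B", "gold": "B"},
    {"answer": "C", "gold": "C"},
    {"answer": "D", "gold": "A"},
]
conditioned = [
    {"answer": "B", "gold": "A", "mentioned": "B"},
    {"answer": "B", "gold": "B", "mentioned": "B"},
    {"answer": "D", "gold": "C", "mentioned": "D"},
    {"answer": "D", "gold": "A", "mentioned": "D"},
]

accuracy_drop = accuracy(baseline) - accuracy(conditioned)
flips_toward_student = flip_rate(baseline, conditioned)
```

Reporting the drop in percentage points rather than as a relative change keeps it comparable across models with different baseline accuracy.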

[NLP-64] Flick: Few Labels Text Classification using K-Aware Intermediate Learning in Multi-Task Low-Resource Languages

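[Quick Read]: This paper addresses the performance bottleneck that pseudo-label noise and domain adaptation create for few-label text classification in truly low-resource languages. The key to its solution is Flick, which markedly improves pseudo-label quality by distilling high-confidence pseudo-labels from a broader set of initial clusters rather than relying on conventional multi-cluster pseudo-labelling or complex cascading architectures. Flick introduces a pseudo-label refinement component that focuses on single-cluster cohesion and uses an adaptive top-k selection mechanism to identify and exploit the best-performing pseudo-label clusters, reducing the error propagation inherent in low-resource data and allowing robust fine-tuning of pre-trained language models with only a handful of true labels.

Link: https://arxiv.org/abs/2506.10292
Authors: Ali Almutairi, Abdullah Alsuhaibani, Shoaib Jameel, Usman Naseem, Gelareh Mohammadi, Imran Razzak
Affiliations: University of New South Wales; University of Technology Sydney; University of Southampton; Macquarie University; MBZUAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: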

Abstract:Training deep learning networks with minimal supervision has gained significant research attention due to its potential to reduce reliance on extensive labelled data. While self-training methods have proven effective in semi-supervised learning, they remain vulnerable to errors from noisy pseudo labels. Moreover, most recent approaches to the few-label classification problem are either designed for resource-rich languages such as English or involve complex cascading models that are prone to overfitting. To address the persistent challenge of few-label text classification in truly low-resource linguistic contexts, where existing methods often struggle with noisy pseudo-labels and domain adaptation, we propose Flick. Unlike prior methods that rely on generic multi-cluster pseudo-labelling or complex cascading architectures, Flick leverages the fundamental insight that distilling high-confidence pseudo-labels from a broader set of initial clusters can dramatically improve pseudo-label quality, particularly for linguistically diverse, low-resource settings. Flick introduces a novel pseudo-label refinement component, a departure from traditional pseudo-labelling strategies by identifying and leveraging top-performing pseudo-label clusters. This component specifically learns to distil highly reliable pseudo-labels from an initial broad set by focusing on single-cluster cohesion and leveraging an adaptive top-k selection mechanism. This targeted refinement process is crucial for mitigating the propagation of errors inherent in low-resource data, allowing for robust fine-tuning of pre-trained language models with only a handful of true labels. We demonstrate Flick’s efficacy across 14 diverse datasets, encompassing challenging low-resource languages such as Arabic, Urdu, and Setswana, alongside English, showcasing its superior performance and adaptability.

[NLP-65] ClusterUCB: Efficient Gradient-Based Data Selection for Targeted Fine-Tuning of LLMs

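[Quick Read]: This paper addresses the problem that gradient-based data selection for supervised fine-tuning of large language models consumes too many computing resources to be practical. The key to its solution is ClusterUCB, an efficient gradient-based data selection framework that combines clustering with a modified Upper Confidence Bound (UCB) algorithm. The method first clusters the training data, on the intuition that samples with similar gradient features have similar influence, then frames inter-cluster data selection as a constrained computing-budget allocation problem, treats it as a multi-armed bandit, and solves it with the modified UCB algorithm, greatly reducing computation while maintaining performance.

Link: https://arxiv.org/abs/2506.10288
Authors: Zige Wang, Qi Zhu, Fei Mi, Minghui Xu, Ruochun Jin, Wenjing Yang
Affiliations: Peking University; National University of Defense Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: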

Abstract:Gradient-based data influence approximation has been leveraged to select useful data samples in the supervised fine-tuning of large language models. However, the computation of gradients throughout the fine-tuning process requires too many resources to be feasible in practice. In this paper, we propose an efficient gradient-based data selection framework with clustering and a modified Upper Confidence Bound (UCB) algorithm. Based on the intuition that data samples with similar gradient features will have similar influences, we first perform clustering on the training data pool. Then, we frame the inter-cluster data selection as a constrained computing budget allocation problem and consider it a multi-armed bandit problem. A modified UCB algorithm is leveraged to solve this problem. Specifically, during the iterative sampling process, historical data influence information is recorded to directly estimate the distributions of each cluster, and a cold start is adopted to balance exploration and exploitation. Experimental results on various benchmarks show that our proposed framework, ClusterUCB, can achieve comparable results to the original gradient-based data selection methods while greatly reducing computing consumption.
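The bandit framing above can be illustrated with the textbook UCB1 rule, used here as a stand-in because the abstract does not spell out the paper's modified UCB. Each cluster is an arm, drawing a sample reveals its influence score, and a cold start draws once from every cluster. The function name `ucb_select`, the exploration constant `c`, and the toy scores are assumptions.

```python
import math


def ucb_select(clusters, budget, c=1.0):
    """Pick `budget` samples across clusters with standard UCB1 (illustrative only).

    clusters: list of lists of influence scores, each revealed only when drawn.
    Returns a list of (cluster_index, score) in selection order.
    """
    pools = [list(items) for items in clusters]
    counts = [0] * len(pools)
    sums = [0.0] * len(pools)
    selected = []

    def pull(k):
        score = pools[k].pop()
        counts[k] += 1
        sums[k] += score
        selected.append((k, score))

    # Cold start: draw one sample from every non-empty cluster.
    for k in range(len(pools)):
        if pools[k] and len(selected) < budget:
            pull(k)

    while len(selected) < budget:
        candidates = [k for k in range(len(pools)) if pools[k]]
        if not candidates:
            break  # every cluster is exhausted
        t = len(selected)
        # Mean observed influence plus an exploration bonus for rarely drawn clusters.
        best = max(
            candidates,
            key=lambda k: sums[k] / counts[k] + c * math.sqrt(2 * math.log(t) / counts[k]),
        )
        pull(best)
    return selected


clusters = [[0.9] * 10, [0.1] * 10]  # toy scores: cluster 0 is more influential
picks = ucb_select(clusters, budget=10)
from_cluster_0 = sum(1 for k, _ in picks if k == 0)
from_cluster_1 = sum(1 for k, _ in picks if k == 1)
```

Most of the budget goes to the high-influence cluster, while the exploration bonus still sends an occasional draw to the other one.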

[NLP-66] Discrete Audio Tokens: More Than a Survey!

[Quick Read]: This paper addresses the lack of a unified comparison and systematic evaluation of discrete audio tokenization methods across domains such as speech, music, and general audio. The key to its solution is a taxonomy based on encoder-decoder structure, quantization technique, training paradigm, streamability, and application domain, together with an evaluation of tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling. Controlled ablation studies analyze the trade-offs, surfacing key limitations, practical considerations, and open challenges to guide future research.

Link: https://arxiv.org/abs/2506.10274
Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Affiliations: Concordia University; Mila-Quebec AI Institute; The Hebrew University of Jerusalem; University of Cambridge; Indiana University; Carnegie Mellon University; Microsoft; Université de Montréal; Université de Toulon; Google; Apple; Laval University; National Taiwan University; University of Illinois at Urbana-Champaign
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: this https URL.

[NLP-67] Do Language Models Have Bayesian Brains? Distinguishing Stochastic and Deterministic Decision Patterns within Large Language Models

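[Quick read]: This paper asks whether language models have human-like Bayesian reasoning abilities and how to infer a language model's prior accurately. The key finding is that, under certain conditions, language models can exhibit near-deterministic decision behavior rather than following the usual sampling assumption. This challenges earlier methods that elicit language model priors through simulated Gibbs sampling. The authors propose a simple way to distinguish stochastic from deterministic decision patterns in Gibbs sampling, which helps avoid inferring misleading priors.

Link: https://arxiv.org/abs/2506.10268
Authors: Andrea Yaoyun Cui,Pengfei Yu
Affiliations: University of Illinois Urbana Champaign; Boson AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: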

Abstract:Language models are essentially probability distributions over token sequences. Auto-regressive models generate sentences by iteratively computing and sampling from the distribution of the next token. This iterative sampling introduces stochasticity, leading to the assumption that language models make probabilistic decisions, similar to sampling from unknown distributions. Building on this assumption, prior research has used simulated Gibbs sampling, inspired by experiments designed to elicit human priors, to infer the priors of language models. In this paper, we revisit a critical question: Do language models possess Bayesian brains? Our findings show that under certain conditions, language models can exhibit near-deterministic decision-making, such as producing maximum likelihood estimations, even with a non-zero sampling temperature. This challenges the sampling assumption and undermines previous methods for eliciting human-like priors. Furthermore, we demonstrate that without proper scrutiny, a system with deterministic behavior undergoing simulated Gibbs sampling can converge to a “false prior.” To address this, we propose a straightforward approach to distinguish between stochastic and deterministic decision patterns in Gibbs sampling, helping to prevent the inference of misleading language model priors. We experiment on a variety of large language models to identify their decision patterns under various circumstances. Our results provide key insights in understanding decision making of large language models.
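
The abstract does not spell out how stochastic and deterministic decision patterns are told apart. As a minimal sketch of one plausible check, and not the authors' procedure, the code below queries the same decision many times at a fixed state and flags near-zero entropy as near-deterministic. The `decide` callable, the threshold, and the toy deciders are all hypothetical stand-ins for a real model call.

```python
import math
import random
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_deterministic(decide, state, n_trials=200, threshold_bits=0.1):
    """Call the hypothetical decide(state) repeatedly; low entropy suggests
    a near-deterministic decision pattern at this state."""
    samples = [decide(state) for _ in range(n_trials)]
    return empirical_entropy(samples) < threshold_bits


# Toy stand-ins for a model's decision at one fixed Gibbs-sampling state.
rng = random.Random(0)
stochastic_decider = lambda state: rng.choice(["A", "B", "C", "D"])
argmax_decider = lambda state: "A"  # always returns the same option

print(looks_deterministic(argmax_decider, None))
print(looks_deterministic(stochastic_decider, None))
```

A check of this kind would be run before trusting the stationary distribution of a simulated Gibbs chain as a prior, since a chain driven by a deterministic decider can still converge.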

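[NLP-68] ToxSyn-PT: A Large-Scale Synthetic Dataset for Hate Speech Detection in Portuguese

[Quick read]: This paper addresses fine-grained hate-speech classification in Portuguese across nine legally protected minority groups, and the lack of research on synthetic data and hate-speech detection in low-resource settings. The key is a novel four-stage pipeline: a compact, manually curated seed dataset; few-shot expansion with an instruction-tuned LLM; paraphrase-based augmentation; and enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The result is a large Portuguese corpus that is class-balanced, stylistically diverse, and free from the social-media domain.

Link: https://arxiv.org/abs/2506.10245
Authors: Iago Alves Brito,Julia Soares Dollis,Fernanda Bufon Färber,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages, 5 tables, 1 figure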

Abstract:We present ToxSyn-PT, the first large-scale Portuguese corpus that enables fine-grained hate-speech classification across nine legally protected minority groups. The dataset contains 53,274 synthetic sentences equally distributed between minorities groups and toxicity labels. ToxSyn-PT is created through a novel four-stage pipeline: (1) a compact, manually curated seed; (2) few-shot expansion with an instruction-tuned LLM; (3) paraphrase-based augmentation; and (4) enrichment, plus additional neutral texts to curb overfitting to group-specific cues. The resulting corpus is class-balanced, stylistically diverse, and free from the social-media domain that dominate existing Portuguese datasets. Despite domain differences with traditional benchmarks, experiments on both binary and multi-label classification on the corpus yields strong results across five public Portuguese hate-speech datasets, demonstrating robust generalization even across domain boundaries. The dataset is publicly released to advance research on synthetic data and hate-speech detection in low-resource settings.

[NLP-69] Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

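[Quick read]: This paper examines whether machine unlearning methods can fail under simple prompt attacks, that is, whether supposedly unlearned knowledge is truly removed or merely suppressed at the output level. The approach is to systematically evaluate eight unlearning techniques across three model families, using output-based, logit-based, and probe analysis to assess how much unlearned knowledge can be recovered, and thereby expose the limits of current unlearning methods under practical attacks.

Link: https://arxiv.org/abs/2506.10236
Authors: Yeonwoo Jang,Shariqah Hossain,Ashwin Sreevatsa,Diogo Cruz
Affiliations: Supervised Program for Alignment Research (SPAR)
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Notes: 20 pages, 6 figures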

Abstract:In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families, and employ output-based, logit-based, and probe analysis to determine to what extent supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in original prompt recovering 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by modifying the way the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between true knowledge removal and superficial output suppression. We also publicly make available our evaluation framework to easily evaluate prompting techniques to retrieve unlearning knowledge.

[NLP-70] Classifying Unreliable Narrators with Large Language Models ACL2025

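[Quick read]: This paper addresses the identification of unreliable narrators, i.e. narrators who unintentionally misrepresent information. The approach draws on narratology to define classification tasks over text from multiple domains, and experiments with large language models (LLMs) in few-shot, fine-tuning, and curriculum learning settings to explore how well they identify unreliable narrators in real-world text.

Link: https://arxiv.org/abs/2506.10231
Authors: Anneliese Brei,Katharine Henry,Abhisheik Sharma,Shashank Srivastava,Snigdha Chaturvedi
Affiliations: UNC Chapel Hill; Virginia Polytechnic Institute and State University
Categories: Computation and Language (cs.CL)
Notes: ACL 2025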

Abstract:Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code and invite future research in this area.

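[NLP-71] TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

[Quick read]: This paper addresses the limited evaluation of Large Reasoning Models (LRMs) outside the STEM domain, in particular their strategic, spatial, and logical reasoning. The key is a simple yet scalable programmatic approach for generating verifiable two-player game problems, used to build the TTT-Bench benchmark, which evaluates basic reasoning in LRMs through four Tic-Tac-Toe-style games. Although these games are easy for humans, they require reasoning about the opponent's intentions and the spatial configuration of the board to secure a win, which exposes the models' limitations.

Link: https://arxiv.org/abs/2506.10209
Authors: Prakamya Mishra,Jiang Liu,Jialian Wu,Xiaodong Yu,Zicheng Liu,Emad Barsoum
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: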

Abstract:Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks including Olympiad-level mathematical problems, indicating evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark that is designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board’s spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs, and discover that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench compared to MATH 500 and AIME 2024, respectively, with larger models achieving higher performance using shorter reasoning traces, where most of the models struggle on long-term strategic reasoning situations on simple and new TTT-Bench tasks.
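
To make "verifiable game problems" concrete, here is a minimal sketch for ordinary 3x3 Tic-Tac-Toe. It is an illustration under assumptions, not the TTT-Bench generator: the board encoding (a 9-character string with `.` for empty cells) and the function names are hypothetical, and the benchmark's four game variants are not reproduced.

```python
# Index layout of the 3x3 board:
# 0 1 2
# 3 4 5
# 6 7 8
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def winning_moves(board, player):
    """All empty cells where `player` wins immediately by moving there."""
    moves = []
    for i, cell in enumerate(board):
        if cell == ".":
            trial = board[:i] + player + board[i + 1:]
            if winner(trial) == player:
                moves.append(i)
    return moves


def verify(board, player, answer):
    """Check a model's proposed move against the computed ground truth."""
    return answer in winning_moves(board, player)


board = "XX.OO...."  # X to move
print(winning_moves(board, "X"))
print(verify(board, "X", 2), verify(board, "X", 5))
```

Because the ground truth is computed from the rules, any sampled board with a non-empty answer set yields a question whose answer can be checked automatically.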

[NLP-72] Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

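[Quick read]: This paper addresses how to improve video retrieval for complex real-world events by automatically extracting latent parametric knowledge about those events, specifically in zero-shot multilingual text-to-video retrieval. The key is Q2E (Query-to-Event), which decomposes queries using the knowledge embedded in large language models (LLMs) and vision-language models (VLMs) to better understand overly simplified user queries, and combines multimodal information through entropy-based fusion scoring, remaining adaptable across datasets, domains, and models.

Link: https://arxiv.org/abs/2506.10202
Authors: Shubhashis Roy Dipta,Francis Ferraro
Affiliations: University of Maryland Baltimore County
Categories: Computation and Language (cs.CL)
Notes: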

Abstract:Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
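
The abstract names entropy-based fusion scoring without giving a formula. The sketch below shows one generic way such a fusion could work, assuming each modality produces a score per candidate video: a modality whose score distribution has low entropy (a confident ranking) gets more weight. The weighting rule and every name here are assumptions for illustration, not Q2E's actual scoring.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_fusion(score_lists):
    """Fuse per-modality candidate scores (two or more candidates).

    Each modality is weighted by 1 - H/H_max, so confident (low-entropy)
    modalities dominate. This rule is a hypothetical choice.
    """
    n = len(score_lists[0])
    max_entropy = math.log(n)
    fused = [0.0] * n
    weight_sum = 0.0
    for scores in score_lists:
        probs = softmax(scores)
        weight = max(0.0, 1.0 - entropy(probs) / max_entropy)
        weight_sum += weight
        for i, p in enumerate(probs):
            fused[i] += weight * p
    if weight_sum == 0.0:
        return [1.0 / n] * n  # every modality was uninformative
    return [f / weight_sum for f in fused]


visual_scores = [5.0, 0.0, 0.0]  # confident about candidate 0
audio_scores = [0.0, 0.1, 0.0]   # nearly uniform, so nearly ignored
fused = entropy_weighted_fusion([visual_scores, audio_scores])
print(fused)
```

No training is involved, which is why a rule of this shape suits a zero-shot setting.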

[NLP-73] Disclosure Audits for LLM Agents

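[Quick read]: This paper addresses the privacy leakage that can occur when LLM agents handle sensitive data, particularly over sustained interactions. The key is an auditing framework called Conversational Manipulation for Privacy Leakage (CMPL), an iterative probing strategy that stress-tests agents enforcing strict privacy directives. It simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities, revealing privacy risks that existing single-turn defenses do not deter.

Link: https://arxiv.org/abs/2506.10171
Authors: Saswat Das,Jameson Sandler,Ferdinando Fioretto
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: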

Abstract:Large Language Model agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. This study proposes an auditing framework for conversational privacy that quantifies and audits these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework, is an iterative probing strategy designed to stress-test agents that enforce strict privacy directives. Rather than focusing solely on a single disclosure event, CMPL simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities. Our evaluation on diverse domains, data modalities, and safety configurations demonstrate the auditing framework’s ability to reveal privacy risks that are not deterred by existing single-turn defenses. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.

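[NLP-74] Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective

[Quick read]: This paper addresses the limited understanding of the story generation abilities of Large Language Models (LLMs), in particular how to evaluate whether they can produce high-quality stories. The approach uses the theory of computational narratology: LLMs are asked to solve narrative planning problems, and a benchmark based on literature examples evaluates causal soundness, character intentionality, and dramatic conflict. This allows a closer analysis of LLM story generation at different scales and levels of complexity.

Link: https://arxiv.org/abs/2506.10161
Authors: Yi Wang,Max Kreminski
Affiliations: Autodesk Research; Midjourney
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: In 2025 IEEE Conference on Games (CoG)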

Abstract:Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs’ ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs’ story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights on the scale of stories that LLMs can generate while maintaining quality from different aspects. Our findings also highlight interesting problem solving behaviors and shed lights on challenges and considerations for applying LLM narrative planning in game environments.

[NLP-75] One Patient Many Contexts: Scaling Medical AI Through Contextual Intelligence

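[Quick read]: This paper addresses the limited adaptability of current medical foundation models: adapting them to new populations, specialties, or clinical settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases, which limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. The proposed direction is context-switching in medical AI: models that dynamically adapt their reasoning, without retraining, to new specialties, populations, workflows, and clinical roles.

Link: https://arxiv.org/abs/2506.10157
Authors: Michelle M. Li,Ben Y. Reis,Adam Rodman,Tianxi Cai,Noa Dagan,Ran D. Balicer,Joseph Loscalzo,Isaac S. Kohane,Marinka Zitnik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: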

Abstract:Medical foundation models, including language models trained on clinical notes, vision-language models on medical images, and multimodal models on electronic health records, can summarize clinical notes, answer medical questions, and assist in decision-making. Adapting these models to new populations, specialties, or settings typically requires fine-tuning, careful prompting, or retrieval from knowledge bases. This can be impractical, and limits their ability to interpret unfamiliar inputs and adjust to clinical situations not represented during training. As a result, models are prone to contextual errors, where predictions appear reasonable but fail to account for critical patient-specific or contextual information. These errors stem from a fundamental limitation that current models struggle with: dynamically adjusting their behavior across evolving contexts of medical care. In this Perspective, we outline a vision for context-switching in medical AI: models that dynamically adapt their reasoning without retraining to new specialties, populations, workflows, and clinical roles. We envision context-switching AI to diagnose, manage, and treat a wide range of diseases across specialties and regions, and expand access to medical care.

[NLP-76] Measuring Corporate Human Capital Disclosures: Lexicon Data Code and Research Opportunities

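[Quick read]: This paper addresses the fact that human capital (HC) is increasingly important to corporate value creation yet is not subject to well-defined measurement or disclosure rules. The key is to train a machine learning algorithm (word2vec) on a confirmed set of HC disclosures to build a comprehensive list of HC-related keywords, classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. The authors share the lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and show how to use the data and code, including for fine-tuning a BERT model.

Link: https://arxiv.org/abs/2506.10155
Authors: Elizabeth Demers,Victor Xiaoqi Wang,Kean Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 50 pages, 6 figures, 5 tables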

Abstract:Human capital (HC) is increasingly important to corporate value creation. Unlike other assets, however, HC is not currently subject to well-defined measurement or disclosure rules. We use a machine learning algorithm (word2vec) trained on a confirmed set of HC disclosures to develop a comprehensive list of HC-related keywords classified into five subcategories (DEI; health and safety; labor relations and culture; compensation and benefits; and demographics and other) that capture the multidimensional nature of HC management. We share our lexicon, corporate HC disclosures, and the Python code used to develop the lexicon, and we provide detailed examples of using our data and code, including for fine-tuning a BERT model. Researchers can use our HC lexicon (or modify the code to capture another construct of interest) with their samples of corporate communications to address pertinent HC questions. We close with a discussion of future research opportunities related to HC management and disclosure.
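
Lexicon building from word embeddings usually means growing a seed list with words whose vectors lie close to the seeds. The sketch below shows that idea with hand-made two-dimensional vectors standing in for trained word2vec embeddings. The vectors, the threshold, and the function names are hypothetical, and this is not the authors' released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def expand_lexicon(seeds, vectors, threshold=0.8):
    """Add every word whose vector is close to at least one seed word."""
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return sorted(expanded)


# Toy embeddings (hypothetical values, not trained vectors).
toy_vectors = {
    "wages": [0.90, 0.10],
    "salary": [0.85, 0.15],
    "bonus": [0.80, 0.30],
    "turbine": [0.05, 0.95],
}

print(expand_lexicon(["wages"], toy_vectors))
```

In practice the candidate list produced this way would still need manual review before words are assigned to subcategories.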

[NLP-77] Analyzing Emotions in Bangla Social Media Comments Using Machine Learning and LIME

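[Quick read]: This paper addresses emotion recognition in low-resource languages such as Bangla, which have distinctive regional expressions and cultural features. The approach combines several machine learning models (Linear SVM, KNN, Random Forest, BiLSTM, and AdaBoost) with feature extraction (TF-IDF vectorization and PCA dimensionality reduction), and uses LIME to make the models' predictions more interpretable, with the aim of making emotion analysis more accurate and practical.

Link: https://arxiv.org/abs/2506.10154
Authors: Bidyarthi Paul,SM Musfiqur Rahman,Dipta Biswas,Md. Ziaul Hasan,Md. Zahid Hossain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: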

Abstract:Research on understanding emotions in written language continues to expand, especially for understudied languages with distinctive regional expressions and cultural features, such as Bangla. This study examines emotion analysis using 22,698 social media comments from the EmoNoBa dataset. For language analysis, we employ machine learning models: Linear SVM, KNN, and Random Forest with n-gram data from a TF-IDF vectorizer. We additionally investigated how PCA affects the reduction of dimensionality. Moreover, we utilized a BiLSTM model and AdaBoost to improve decision trees. To make our machine learning models easier to understand, we used LIME to explain the predictions of the AdaBoost classifier, which uses decision trees. With the goal of advancing sentiment analysis in languages with limited resources, our work examines various techniques to find efficient techniques for emotion identification in Bangla.

[NLP-78] When Large Language Models are Reliable for Judging Empathic Communication

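[Quick read]: This paper asks how reliably Large Language Models (LLMs) judge the nuances of empathic communication in text-based conversations. The approach compares annotations from experts, crowdworkers, and LLMs across four evaluative frameworks drawn from psychology, natural language processing, and communications, assesses inter-rater reliability between the annotator groups, and uses expert agreement as the benchmark for LLM performance. LLMs approach expert-level agreement across all four frameworks and exceed the reliability of crowdworkers, which suggests that LLMs validated on specific tasks with appropriate benchmarks can support transparency and oversight in emotionally sensitive applications.

Link: https://arxiv.org/abs/2506.10150
Authors: Aakriti Kumar,Nalin Poungpeth,Diyi Yang,Erina Farrell,Bruce Lambert,Matthew Groh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: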

Abstract:Large language models (LLMs) excel at generating empathic responses in text-based conversations. But, how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks drawn from psychology, natural language processing, and communications applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks’ sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM performance than standard classification metrics. Across all four frameworks, LLMs consistently approach this expert level benchmark and exceed the reliability of crowdworkers. These results demonstrate how LLMs, when validated on specific tasks with appropriate benchmarks, can support transparency and oversight in emotionally sensitive applications including their use as conversational companions.
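
The abstract does not say which inter-rater reliability statistic is used. As an illustration of the general idea, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure for two annotators. The choice of kappa and the example labels are assumptions, not details from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently at
    # their own marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary judgments ("empathic" = 1) on four conversations.
expert = [1, 1, 0, 0]
llm = [1, 0, 0, 0]
print(cohens_kappa(expert, llm))
```

Comparing a statistic like this for expert-expert, expert-LLM, and expert-crowd pairs is what lets expert agreement serve as the reference point.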

[NLP-79] Unsupervised Elicitation of Language Models

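[Quick read]: This paper addresses the limits of relying on human supervision to steer pretrained language models during post-training, especially when models have superhuman capabilities and high-quality human supervision is difficult or impossible to obtain. The key is a new unsupervised algorithm, Internal Coherence Maximization (ICM), which fine-tunes pretrained language models on their own generated labels without external supervision. On several tasks the method matches training on golden supervision, outperforms training on crowdsourced human supervision, and elicits superhuman capabilities better than training on human labels.

Link: https://arxiv.org/abs/2506.10139
Authors: Jiaxin Wen,Zachary Ankner,Arushi Somani,Peter Hase,Samuel Marks,Jacob Goldman-Wetzler,Linda Petrini,Henry Sleight,Collin Burns,He He,Shi Feng,Ethan Perez,Jan Leike
Affiliations: Anthropic; Schmidt Sciences; Independent; Constellation; New York University; George Washington University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: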

Abstract:To steer pretrained language models for downstream tasks, today’s post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained language models on their own generated labels, without external supervision. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs’ capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

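[NLP-80] ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering

[Quick read]: This paper addresses the challenge of extending the long-chain reasoning ability of large language models to visual reasoning tasks, especially tasks such as chart question answering that require a large amount of visual detail, where existing image-to-text conversion loses critical structural and semantic information. The key is ChartReasoner, a code-driven two-stage framework. A high-fidelity model is first trained to convert diverse chart images into structured ECharts code, preserving layout and data semantics as far as possible. A general chart reasoning data synthesis pipeline then uses this pretrained conversion model to generate reasoning trajectories automatically and filters out low-quality samples. Finally, a multimodal model is trained on the synthesized dataset with supervised fine-tuning and reinforcement learning.

Link: https://arxiv.org/abs/2506.10116
Authors: Caijun Jia,Nan Xu,Jingxuan Wei,Qingli Wang,Lei Wang,Bihui Yu,Junnan Zhu
Affiliations: Chinese Academy of Sciences; Shenyang Institute of Computing Technology, Chinese Academy of Sciences; Beijing Wenge Technology Co., Ltd.; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Notes: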

Abstract:Recently, large language models have shown remarkable reasoning capabilities through long-chain reasoning before responding. However, how to extend this capability to visual reasoning tasks remains an open challenge. Existing multimodal reasoning approaches transfer such visual reasoning task into textual reasoning task via several image-to-text conversions, which often lose critical structural and semantic information embedded in visualizations, especially for tasks like chart question answering that require a large amount of visual details. To bridge this gap, we propose ChartReasoner, a code-driven novel two-stage framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts codes, preserving both layout and data semantics as lossless as possible. Then, we design a general chart reasoning data synthesis pipeline, which leverages this pretrained transport model to automatically and scalably generate chart reasoning trajectories and utilizes a code validator to filter out low-quality samples. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning on our synthesized chart reasoning dataset and experimental results on four public benchmarks clearly demonstrate the effectiveness of our proposed ChartReasoner. It can preserve the original details of the charts as much as possible and perform comparably with state-of-the-art open-source models while using fewer parameters, approaching the performance of proprietary systems like GPT-4o in out-of-domain settings.

[NLP-81] When Meaning Stays the Same but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs ICML2025

[Quick read]: This paper addresses the behavioral shifts of large language models when prompts differ in their token-level realization but keep the same semantic intent, which the authors call prompt variance. The key is Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings, which reveals statistical regularities linking response stability to tokenization and decoding.

Link: https://arxiv.org/abs/2506.10095
Authors: Xiao Li,Joel Kreuzwieser,Alan Peters
Affiliations: Vanderbilt University
Categories: Computation and Language (cs.CL)
Notes: This paper was developed for presentation at ICML 2025 Tokshop Workshop, but is now submitted as a standalone contribution

Abstract:We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.

[NLP-82] Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific Information

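[Quick read]: This paper aims to address the complexity and inefficiency of generating Failure Modes and Effects Analysis (FMEA) documents for industrial assets. The key is a multi-agent system called Chat-of-Thought, in which multiple collaborative Large Language Model (LLM)-based agents with specific roles use advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. Its core innovation is a "Chat of Thought" mechanism, where dynamic, multi-persona-driven discussions enable iterative refinement of content.

Link: https://arxiv.org/abs/2506.10086
Authors: Christodoulos Constantinides,Shuxin Lin,Nianjun Zhou,Dhaval Patel
Affiliations: IBM; IBM Research
Categories: Computation and Language (cs.CL)
Notes: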

Abstract:This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.

[NLP-83] A quantum semantic framework for natural language processing

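[Quick read]: This paper addresses the interpretive uncertainty caused by semantic degeneracy in natural language: as the complexity of an expression grows, the likelihood that any interpreter (human or LLM) recovers the single intended meaning vanishes. The authors argue via Kolmogorov complexity that such interpretation is computationally intractable, and posit that meaning is actualized through an observer-dependent interpretive act. To test this, they run a Bell-inequality-style test in which diverse LLM agents, treated as "computational cognitive systems", interpret ambiguous word pairs under varied contexts. The results indicate that linguistic interpretation under ambiguity can exhibit non-classical contextuality, which the authors take to imply that classical frequentist analytical approaches to natural language are necessarily lossy. They propose Bayesian-style repeated sampling as a more appropriate way to characterize meaning in context.

Link: https://arxiv.org/abs/2506.10077
Authors: Christopher J. Agostino,Quan Le Thien,Molly Apsel,Denizhan Pak,Elina Lesyk,Ashabari Majumdar
Affiliations: NPC Worldwide; Indiana University; Quantum Science and Engineering Center; Cognitive Science Program; Department of Psychological and Brain Sciences; Luddy School of Informatics, Computing, and Engineering; University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT)
Notes: 12 pages, 2 figures, accepted submission to Quantum AI and NLP 2025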

Abstract:Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. Large Language Models (LLMs) and other modern NLP systems face inherent limitations precisely because they operate within natural language itself, making them subject to the same interpretive constraints imposed by semantic degeneracy. In this work, we argue using Kolmogorov complexity that as an expression’s complexity grows, the likelihood of any interpreting agent (human or LLM-powered AI) recovering the single intended meaning vanishes. This computational intractability suggests the classical view that linguistic forms possess meaning in and of themselves is flawed. We alternatively posit that meaning is instead actualized through an observer-dependent interpretive act. To test this, we conducted a semantic Bell inequality test using diverse LLM agents as “computational cognitive systems” to interpret ambiguous word pairs under varied contextual settings. Across several independent experiments, we found average CHSH expectation values ranging from 1.2 to 2.8, with several runs yielding values (e.g., 2.3-2.4) that significantly violate the classical boundary (|S| ≤ 2). This demonstrates that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
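
For readers unfamiliar with the CHSH quantity the abstract reports, the sketch below computes it from paired outcomes coded as +1 or -1. It uses the standard form S = E(a,b) + E(a,b') + E(a',b) - E(a',b'), where any local classical model satisfies |S| ≤ 2. How the paper maps LLM interpretations onto the two settings per side is not reproduced here, and the sample correlation values are hypothetical.

```python
def correlation(outcome_pairs):
    """Mean product of paired +1/-1 outcomes for one pair of settings."""
    return sum(x * y for x, y in outcome_pairs) / len(outcome_pairs)


def chsh_s(e_ab, e_ab_prime, e_a_prime_b, e_a_prime_b_prime):
    """CHSH combination of the four setting-pair correlations.

    The placement of the minus sign is a convention; other sources
    put it on a different term.
    """
    return e_ab + e_ab_prime + e_a_prime_b - e_a_prime_b_prime


def violates_classical_bound(s, bound=2.0):
    """True when |S| exceeds the classical limit of 2."""
    return abs(s) > bound


# A deterministic strategy (every outcome +1) sits exactly on the bound.
print(chsh_s(1.0, 1.0, 1.0, 1.0))

# Hypothetical correlations that would count as a violation.
s = chsh_s(0.7, 0.7, 0.7, -0.7)
print(s, violates_classical_bound(s))
```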

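[NLP-84] TaskCraft: Automated Generation of Agentic Tasks

[Quick read]: This paper aims to address two limits on scalability: existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation. The key is TaskCraft, an automated workflow that generates difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories, expanding atomic tasks through depth-based and width-based extensions to create structurally and hierarchically complex challenges.

Link: https://arxiv.org/abs/2506.10055
Authors: Dingfeng Shi,Jingyi Cao,Qianben Chen,Weichen Sun,Weizhen Li,Hongxuan Lu,Fangchen Dong,Tianrui Qin,King Zhu,Minghao Yang,Jian Yang,Ge Zhang,Jiaheng Liu,Changwang Zhang,Jun Wang,Yuchen Eleanor Jiang,Wangchunshu Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: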

Abstract:Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI. However, existing instruction data lacks tool interaction, and current agentic benchmarks rely on costly human annotation, limiting their scalability. We introduce TaskCraft, an automated workflow for generating difficulty-scalable, multi-tool, and verifiable agentic tasks with execution trajectories. TaskCraft expands atomic tasks using depth-based and width-based extensions to create structurally and hierarchically complex challenges. Empirical results show that these tasks improve prompt optimization in the generation workflow and enhance supervised fine-tuning of agentic foundation models. We present a large-scale synthetic dataset of approximately 36,000 tasks with varying difficulty to support future research on agent tuning and evaluation.

[NLP-85] Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs

[Quick read]: This paper addresses the insufficient data utilization and limited performance that arise when Direct Preference Optimization (DPO), used in Reinforcement Learning from Human Feedback (RLHF), treats all preference pairs uniformly. The key to its solution is Omni-DPO, a dual-perspective optimization framework that considers both the inherent quality of each preference pair and the model's evolving performance on those pairs, adaptively weighting samples by data quality and the model's learning dynamics during training to achieve more effective data utilization and better performance.

Link: https://arxiv.org/abs/2506.10054
Authors: Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Affiliations: HIT, Shenzhen; Tencent; CUHK; UCAS; Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model’s evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model’s learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at this https URL.
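To make the weighting idea concrete, the sketch below starts from the standard DPO loss and multiplies each pair's loss by two factors: a data-quality weight and a weight that shrinks as the model already separates the pair. The abstract does not give Omni-DPO's actual weighting functions, so `quality_weight`, the focal-style `(1 - p) ** gamma` term, and all parameter values here are hypothetical stand-ins.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_margin(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Scaled implicit-reward margin between chosen (w) and rejected (l).

    Inputs are sequence log-probabilities under the policy and the
    frozen reference model.
    """
    return beta * ((policy_w - ref_w) - (policy_l - ref_l))


def dpo_loss(margin):
    """Standard DPO loss for one pair: -log sigmoid(margin)."""
    return -math.log(sigmoid(margin))


def weighted_dpo_loss(margin, quality_weight, gamma=2.0):
    """Hypothetical dual-weighted loss (illustration only).

    quality_weight: externally supplied score for the pair's inherent
        quality (assumed to lie in [0, 1]).
    (1 - p) ** gamma: down-weights pairs the model already ranks
        correctly with high probability p = sigmoid(margin).
    """
    p = sigmoid(margin)
    performance_weight = (1.0 - p) ** gamma
    return quality_weight * performance_weight * dpo_loss(margin)
```

With this form, a pair the model already ranks confidently contributes little to the gradient, and a low-quality pair is damped regardless of how the model currently scores it.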

[NLP-86] GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models

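[Quick read]: This paper addresses a gap in safety evaluation for text-to-image (T2I) models: the lack of reliable tools for detecting latent risks in defended T2I models. The key to its solution is GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore vulnerabilities in T2I generators. The approach combines supervised fine-tuning with reinforcement learning through interaction with a surrogate T2I model and integrates multiple reward signals, guiding the LLM to produce adversarial prompts that evade safety mechanisms while remaining highly toxic and semantically coherent, thereby exposing practical safety weaknesses in commercial T2I generators.

Link: https://arxiv.org/abs/2506.10047
Authors: Zilong Wang, Xiang Zheng, Xiaosen Wang, Bo Wang, Xingjun Ma, Yu-Gang Jiang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 27 pages, 7 figures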

Abstract:Text-to-image (T2I) models such as Stable Diffusion have advanced rapidly and are now widely used in content creation. However, these models can be misused to generate harmful content, including nudity or violence, posing significant safety risks. While most platforms employ content moderation systems, underlying vulnerabilities can still be exploited by determined adversaries. Recent research on red-teaming and adversarial attacks against T2I models has notable limitations: some studies successfully generate highly toxic images but use adversarial prompts that are easily detected and blocked by safety filters, while others focus on bypassing safety mechanisms but fail to produce genuinely harmful outputs, neglecting the discovery of truly high-risk prompts. Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. Our approach combines supervised fine-tuning on curated datasets with reinforcement learning via interaction with a surrogate T2I model. By integrating multiple reward signals, we guide the LLM to craft adversarial prompts that enhance both evasion capability and image toxicity, while maintaining semantic coherence and diversity. These prompts demonstrate strong effectiveness in black-box attacks against commercial T2I generators, revealing practical and concerning safety weaknesses.

[NLP-87] Token Perturbation Guidance for Diffusion Models

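[Quick read]: This paper addresses two main problems with classifier-free guidance (CFG) in current diffusion models: it requires specific training procedures and applies only to conditional generation. The key to its solution is Token Perturbation Guidance (TPG), which provides a guidance signal by applying perturbation matrices directly to intermediate token representations in the diffusion network, using a norm-preserving shuffling operation to keep the signal effective and stable. TPG requires no additional training and is agnostic to input conditions, so it applies to both conditional and unconditional generation.

Link: https://arxiv.org/abs/2506.10036
Authors: Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati
Affiliations: University of Toronto; Vector Institute for Artificial Intelligence; KITE Research Institute; ETH Zürich
Categories: Graphics (cs.GR); Computation and Language (cs.CL)
Comments: 18 pages, 14 figures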

Abstract:Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2× improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at this https URL
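The norm-preserving property of token shuffling is easy to check: permuting the rows of a token matrix reorders the entries without changing their values, so the overall norm is unchanged. The sketch below shows this in plain Python, together with a generic perturbation-guidance update of the form `normal + scale * (normal - perturbed)`. That update rule and the names `shuffle_tokens` and `guide` are assumptions for illustration, since the abstract does not spell out TPG's exact formulation.

```python
import math
import random


def shuffle_tokens(tokens, seed=0):
    """Permute the token (row) order; equivalent to a permutation matrix."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    return [tokens[i] for i in order]


def frobenius_norm(tokens):
    """Square root of the sum of squared entries over all tokens."""
    return math.sqrt(sum(v * v for row in tokens for v in row))


def guide(normal, perturbed, scale):
    """Generic perturbation guidance (assumed form): move the prediction
    away from the output obtained with perturbed tokens."""
    return [n + scale * (n - p) for n, p in zip(normal, perturbed)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
shuffled = shuffle_tokens(tokens, seed=1)
```

In a real diffusion network the shuffle would be applied to intermediate activations inside selected layers, and `guide` would act on the two resulting denoiser outputs at each sampling step.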

[NLP-88] Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks

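[Quick read]: This paper addresses the security and alignment problems of Large Language Models (LLMs), comparing ChatGPT and Gemini in particular and building a taxonomy of jailbreak techniques. The key to its solution is to identify and categorize different types of jailbreak attacks through experiments and analysis, providing theoretical support and a technical reference for strengthening model safeguards.

Link: https://arxiv.org/abs/2506.10029
Authors: Rafaël Nouailles (GdR)
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In French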

Abstract:Large Language models (LLMs) are transforming digital usage, particularly in text generation, image creation, information retrieval and code development. ChatGPT, launched by OpenAI in November 2022, quickly became a reference, prompting the emergence of competitors such as Google’s Gemini. However, these technological advances raise new cybersecurity challenges, including prompt injection attacks, the circumvention of regulatory measures (jailbreaking), the spread of misinformation (hallucinations) and risks associated with deep fakes. This paper presents a comparative analysis of the security and alignment levels of ChatGPT and Gemini, as well as a taxonomy of jailbreak techniques associated with experiments.

[NLP-89] Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models ACL2025

[Quick read]: This paper addresses the risk that Large Language Models (LLMs) memorize Personally Identifiable Information (PII) during training and later leak it. The key to its solution is Private Memorization Editing (PME), which turns the memorization ability of LLMs into a privacy defense: it detects memorized PII and edits the model's knowledge of its training data to mitigate that memorization, making the model more robust against privacy training data extraction attacks.

Link: https://arxiv.org/abs/2506.10024
Authors: Elena Sofia Ruzzetti, Giancarlo A. Xompero, Davide Venditti, Fabio Massimo Zanzotto
Affiliations: Human Centric ART, University of Rome Tor Vergata; Almawave S.p.A.
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To be published at ACL 2025 (Main)

Abstract:Large Language Models (LLMs) memorize, and thus, among huge amounts of uncontrolled data, may memorize Personally Identifiable Information (PII), which should not be stored and, consequently, not leaked. In this paper, we introduce Private Memorization Editing (PME), an approach for preventing private data leakage that turns an apparent limitation, that is, the LLMs’ memorization ability, into a powerful privacy defense strategy. While attacks against LLMs have been performed exploiting previous knowledge regarding their training data, our approach aims to exploit the same kind of knowledge in order to make a model more robust. We detect a memorized PII and then mitigate the memorization of PII by editing a model knowledge of its training data. We verify that our procedure does not affect the underlying language model while making it more robust against privacy Training Data Extraction attacks. We demonstrate that PME can effectively reduce the number of leaked PII in a number of configurations, in some cases even reducing the accuracy of the privacy attacks to zero.

[NLP-90] From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment

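[Quick read]: This paper addresses the high cost and low efficiency of relying on large amounts of human-labeled preference data for safety alignment of Large Language Models (LLMs). The key to its solution is Refusal-Aware Adaptive Injection (RAAI), a training-free, model-agnostic framework that detects the model's internal refusal signals and adaptively injects predefined phrases to elicit harmful yet fluent outputs. The method repurposes LLM attack techniques to generate synthetic data for fine-tuning, improving robustness to harmful prompts while preserving general capabilities on standard tasks.

Link: https://arxiv.org/abs/2506.10020
Authors: Kyubyung Chae, Hyunbin Jin, Taesup Kim
Affiliations: Seoul National University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: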

Abstract:Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that’s both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while preserving general capabilities on standard tasks like MMLU and ARC. This work highlights how LLM attack methodologies can be reframed as practical tools for scalable and controllable safety alignment.

[NLP-91] A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations

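[Quick read]: This paper addresses the automatic quality evaluation of content produced by generative AI across text, images, and audio, where current research lacks a systematic framework for organizing and classifying evaluation methods. The key to its solution is a comprehensive review and unified taxonomy that identifies five fundamental evaluation paradigms spanning the three modalities. The analysis starts from the mature techniques of text generation and extends them to image and audio generation, demonstrating the framework's broad applicability.

Link: https://arxiv.org/abs/2506.10019
Authors: Tian Lan, Yang-Hao Zhou, Zi-Ao Ma, Fanshu Sun, Rui-Qing Sun, Junyu Luo, Rong-Cheng Tu, Heyan Huang, Chen Xu, Zhijing Wu, Xian-Ling Mao
Affiliations: Beijing Institute of Technology; Peking University; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: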

Abstract:Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities. We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.

[NLP-92] Multimodal Large Language Models: A Survey

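[Quick read]: This paper addresses the development and integration of cross-modal capabilities in Multimodal Large Language Models (MLLMs), exploring their potential across output modalities such as images, music, and video and the technical paths to realizing it. The key to its solution is the use of foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, to strengthen cross-modal understanding and generation. By analyzing key models, architectural trends, and cross-modal synergies, it also charts a path toward more general-purpose, adaptive, and interpretable multimodal systems.

Link: https://arxiv.org/abs/2506.10016
Authors: Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, Thar Baker
Affiliations: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: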

Abstract:Multimodal Large Language Models (MLLMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Architectural innovations like transformers and diffusion models underpin this convergence, enabling cross-modal transfer and modular specialization. We highlight emerging patterns of synergy, and identify open challenges in evaluation, modularity, and structured reasoning. This survey offers a unified perspective on MLLM development and identifies critical paths toward more general-purpose, adaptive, and interpretable multimodal systems.

[NLP-93] Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

[Quick read]: This paper addresses the automatic synthesis of high-quality cinematic video from text input. The key to its solution is to combine generative AI components: Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline built on gTTS and YouTube-sourced music. A five-scene framework, linear frame interpolation, cinematic post-processing, and audio-video synchronization are used to reach professional-quality output.

Link: https://arxiv.org/abs/2506.10005
Authors: Sridhar S, Nithin A, Shakeel Rifath, Vasantha Raj
Affiliations: Hindustan Institute of Technology and Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Multimedia (cs.MM)
Comments: 10 pages, seven figures about Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models

Abstract:Advances in generative artificial intelligence have altered multimedia creation, allowing for automatic cinematic video synthesis from text inputs. This work describes a method for creating 60-second cinematic movies incorporating Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. It uses a five-scene framework, which is augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization to provide professional-quality results. It was created in a GPU-accelerated Google Colab environment using Python 3.11. It has a dual-mode Gradio interface (Simple and Advanced), which supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling ensure reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, furthering text-to-video synthesis for creative, educational, and industrial applications.
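The linear frame interpolation step mentioned in this abstract can be read, in its simplest form, as a per-pixel cross-fade between two keyframes. The sketch below assumes frames are flat lists of pixel intensities of equal length; the function name and this data layout are illustrative assumptions, and the paper's actual pipeline may differ.

```python
def interpolate_frames(frame_a, frame_b, steps):
    """Return `steps` in-between frames blended linearly from A to B.

    Each in-between frame is (1 - t) * A + t * B for evenly spaced t
    strictly between 0 and 1, so the keyframes themselves are excluded.
    """
    frames = []
    for k in range(1, steps + 1):
        t = k / (steps + 1)
        frames.append([(1.0 - t) * pa + t * pb
                       for pa, pb in zip(frame_a, frame_b)])
    return frames


# Two tiny 2-pixel "frames" and a single in-between frame.
start = [0.0, 100.0]
end = [100.0, 0.0]
middle = interpolate_frames(start, end, 1)[0]
```

Raising `steps` increases the effective frame rate between generated keyframes at the cost of visible ghosting when the two images differ strongly.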

[NLP-94] Resa: Transparent Reasoning Models via SAEs

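[Quick read]: This paper addresses how to elicit strong reasoning in language models cost-effectively. The key to its solution is Resa, a family of 1.5B-parameter reasoning models trained with a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. The method first trains a sparse autoencoder (SAE) to capture reasoning abilities from a source model, then uses the trained SAE to guide standard supervised fine-tuning that elicits those abilities in a target model, using only verified question-answer data and no reasoning traces.

Link: https://arxiv.org/abs/2506.09967
Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: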

Abstract:How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains 97% of its RL-trained counterpart’s reasoning performance while reducing training costs by 2000x to roughly $1 and training time by 450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.

[NLP-95] Tina: Tiny Reasoning Models via LoRA

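[Quick read]: This paper addresses how to achieve strong reasoning abilities in language models cost-effectively. The key to its solution is parameter-efficient reinforcement learning (RL) updates: applying low-rank adaptation (LoRA) to a tiny 1.5B-parameter base model yields reasoning performance that is competitive with, and sometimes surpasses, state-of-the-art (SOTA) RL reasoning models at a small fraction of the compute cost.

Link: https://arxiv.org/abs/2504.15777
Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: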

Abstract:How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a 20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model’s underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights & checkpoints.
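For readers unfamiliar with LoRA, the technique freezes a weight matrix W and learns a low-rank update, so the effective weight is W + (alpha / r) * B A with B of shape d x r and A of shape r x k. The sketch below, in plain Python with toy sizes, shows the merged weight and the trainable-parameter count. The dimensions and helper names are illustrative and not taken from the Tina code base.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_weight(w, b, a, alpha, r):
    """Effective weight W + (alpha / r) * B @ A; W itself stays frozen."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d, k, r):
    """Return (full fine-tuning, LoRA) parameter counts for one d x k layer."""
    return d * k, r * (d + k)


# Toy example: d = k = 2, rank r = 1.
w = [[1.0, 0.0], [0.0, 1.0]]
b_init = [[0.0], [0.0]]      # B starts at zero, so training begins from W
a = [[0.5, -0.5]]
w_start = lora_weight(w, b_init, a, alpha=2, r=1)

b_trained = [[1.0], [2.0]]   # hypothetical values after some updates
w_eff = lora_weight(w, b_trained, a, alpha=2, r=1)
```

For a single 1536 x 1536 layer at rank 16, `trainable_params` gives 2,359,296 weights for full fine-tuning against 49,152 for LoRA, which illustrates why RL post-training with LoRA can be so much cheaper.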

[NLP-96] Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

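[Quick read]: This paper addresses the difficulty of adapting a speech recogniser to a new speaker when there is too little data, especially when that data is unlabelled. Its solution combines two approaches. First, a novel loss function, the conditional entropy over complete hypotheses, reduces dependence on a single error-prone hypothesis or "pseudo-label" and makes adaptation more robust. Second, a "speaker code", a vector short enough to be estimated from little data, improves adaptation.

Link: https://arxiv.org/abs/2506.10653
Authors: Rogier C. van Dalen, Shucong Zhang, Titouan Parcollet, Sourav Bhattacharya
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: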

Abstract:Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or “pseudo-label”, this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a “speaker code” characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate on one minute of adaptation data, increasing on 10 minutes to 29%.
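The idea of replacing a single pseudo-label with an entropy over several hypotheses can be sketched as follows. This example assumes the recogniser returns an n-best list with a log-score per complete hypothesis and normalises over that list. How the paper actually computes or approximates the distribution over hypotheses is not stated in the abstract, so treat the n-best approximation and the function names as assumptions.

```python
import math


def hypothesis_posteriors(log_scores):
    """Softmax over the log-scores of complete hypotheses (n-best list)."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def conditional_entropy(log_scores):
    """Entropy H = -sum_i p_i log p_i of the hypothesis distribution.

    Minimising H during adaptation pushes the model toward confident
    output without committing to one possibly wrong transcript.
    """
    probs = hypothesis_posteriors(log_scores)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two equally likely transcripts: maximal uncertainty for n = 2.
uncertain = conditional_entropy([0.0, 0.0])
# One transcript dominates: entropy close to zero.
confident = conditional_entropy([0.0, -50.0])
```

In an adaptation loop, the gradient of this entropy would be taken with respect to the adaptation parameters only (for example the speaker code), leaving the rest of the recogniser fixed.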

[NLP-97] AC/DC: LLM-based Audio Comprehension via Dialogue Continuation INTERSPEECH2025

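[Quick read]: This paper addresses the caption variation problem in audio comprehension, where variation among target captions limits model generalization. The key to its solution is to exploit the dialogue continuation ability of large language models (LLMs), training the model to produce dialogue-style responses rather than generate target captions directly. Dialogue continuation training captures the deeper meaning of captions, enabling zero-shot instruction following even when training solely on audio captioning datasets.

Link: https://arxiv.org/abs/2506.10312
Authors: Yusuke Fujita, Tomoya Mizumoto, Atsushi Kojima, Lianbo Liu, Yui Sudo
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to Interspeech 2025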

Abstract:We propose an instruction-following audio comprehension model that leverages the dialogue continuation ability of large language models (LLMs). Instead of directly generating target captions in training data, the proposed method trains a model to produce responses as if the input caption triggered a dialogue. This dialogue continuation training mitigates the caption variation problem. Learning to continue a dialogue effectively captures the caption’s meaning beyond its surface-level words. As a result, our model enables zero-shot instruction-following capability without multitask instruction tuning, even trained solely on audio captioning datasets. Experiments on AudioCaps, WavCaps, and Clotho datasets with AudioBench audio-scene question-answering tests demonstrate our model’s ability to follow various unseen instructions.

Computer Vision

[CV-0] SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

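[Quick read]: This paper addresses the overly smooth surfaces and distorted geometry produced by the conventional paradigm in generative novel view synthesis (NVS), in which generative models first complete missing regions in 2D and 3D recovery follows. These problems arise because generative models struggle to infer 3D structure from RGB data alone. The key to its solution is SceneCompleter, a framework that achieves 3D-consistent generative novel view synthesis through two core components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; and (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By fusing structural and textural information, the method improves the visual coherence and plausibility of generated views.

Link: https://arxiv.org/abs/2506.10981
Authors: Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: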

Abstract:Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: this https URL

[CV-1] InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model

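[Quick read]: This paper addresses the impracticality of 3D scene inpainting for real-time or online applications: existing methods rely on lengthy, computationally intensive optimization, which limits their use in interactive operations. The proposed solution, InstaInpaint, is a reference-based feed-forward framework that produces a 3D scene inpainting from a 2D inpainting proposal within 0.4 seconds, with a self-supervised masked-finetuning strategy used to train a custom large reconstruction model (LRM) for efficient, high-quality results.

Link: https://arxiv.org/abs/2506.10980
Authors: Junqi You, Chieh Hubert Lin, Weijie Lyu, Zhengbo Zhang, Ming-Hsuan Yang
Affiliations: Shanghai Jiao Tong University; UC Merced; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: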

Abstract:Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations for better immersiveness, such as moving or editing objects, 3D scene inpainting methods are proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications. We propose InstaInpaint, a reference-based feed-forward framework that produces 3D-scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy to enable training of our custom large reconstruction model (LRM) on the large-scale dataset. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000x speed-up from prior methods while maintaining a state-of-the-art performance across two standard benchmarks. Moreover, we show that InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting. More video results are available at our project page: this https URL.

[CV-2] Fine-Grained Perturbation Guidance via Attention Head Selection

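[Quick read]: This paper addresses a weakness of attention perturbation methods in diffusion models: they lack a principled way to decide where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. The key to its solution is the "HeadHunter" framework, which iteratively selects attention heads aligned with user objectives for fine-grained control over generation quality and visual attributes. It is paired with SoftPAG, which linearly interpolates attention maps toward an identity matrix to provide a continuously tunable perturbation strength, improving generation and suppressing artifacts.

Link: https://arxiv.org/abs/2506.10978
Authors: Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL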

Abstract:Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
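The SoftPAG interpolation described in this abstract can be written in one line: the perturbed map is (1 - u) * A + u * I for a strength u between 0 and 1. The sketch below applies it to a single head's attention map in plain Python. The parameter name `u` and the function name are assumptions, and head selection (the HeadHunter part) is not shown.

```python
def soft_pag(attention, u):
    """Interpolate a square attention map toward the identity matrix.

    u = 0 leaves the head untouched; u = 1 replaces the map with the
    identity, so each token attends only to itself.
    """
    n = len(attention)
    return [[(1.0 - u) * attention[i][j] + (u if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]


# A 2-token attention map whose rows sum to 1.
attn = [[0.7, 0.3],
        [0.4, 0.6]]
half = soft_pag(attn, 0.5)
```

Because both the attention map and the identity have rows that sum to 1, every interpolated row still sums to 1, so the result remains a valid attention distribution at any strength.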

[CV-3] QuadricFormer: Scene as Superquadrics for 3D Semantic Occupancy Prediction

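[Quick read]: This paper addresses the inefficiency of existing 3D occupancy prediction methods, in particular the shortcomings of dense voxel representations and object-centric representations based on sparse Gaussians when modeling complex geometry. The key to its solution is to use geometrically expressive superquadrics as scene primitives, whose inherent shape diversity represents complex structures with fewer primitives. A probabilistic superquadric mixture model jointly models occupancy probability and semantics, and a pruning-and-splitting module further improves efficiency.

Link: https://arxiv.org/abs/2506.10977
Authors: Sicheng Zuo, Wenzhao Zheng, Xiaoyong Han, Longchao Yang, Yong Pan, Jiwen Lu
Affiliations: Tsinghua University; Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL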

Abstract:3D occupancy prediction is crucial for robust autonomous driving systems as it enables comprehensive perception of environmental structures and semantics. Most existing methods employ dense voxel-based scene representations, ignoring the sparsity of driving scenes and resulting in inefficiency. Recent works explore object-centric representations based on sparse Gaussians, but their ellipsoidal shape prior limits the modeling of diverse structures. In real-world driving scenes, objects exhibit rich geometries (e.g., cuboids, cylinders, and irregular shapes), necessitating excessive ellipsoidal Gaussians densely packed for accurate modeling, which leads to inefficient representations. To address this, we propose to use geometrically expressive superquadrics as scene primitives, enabling efficient representation of complex structures with fewer primitives through their inherent shape diversity. We develop a probabilistic superquadric mixture model, which interprets each superquadric as an occupancy probability distribution with a corresponding geometry prior, and calculates semantics through probabilistic mixture. Building on this, we present QuadricFormer, a superquadric-based model for efficient 3D occupancy prediction, and introduce a pruning-and-splitting module to further enhance modeling efficiency by concentrating superquadrics in occupied regions. Extensive experiments on the nuScenes dataset demonstrate that QuadricFormer achieves state-of-the-art performance while maintaining superior efficiency.
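The shape diversity of superquadrics comes from two exponents in the standard inside-outside function, F(x, y, z) = (|x/a|^(2/e2) + |y/b|^(2/e2))^(e2/e1) + |z/c|^(2/e1), where F < 1 inside the surface and F > 1 outside. The sketch below evaluates this textbook function for an axis-aligned superquadric centred at the origin. It is not QuadricFormer's probabilistic formulation, which the abstract does not detail, and the parameter names are assumptions.

```python
def superquadric_f(point, scales, e1, e2):
    """Standard superquadric inside-outside function.

    point: (x, y, z) in the primitive's local frame.
    scales: (a, b, c) semi-axis lengths.
    e1, e2: shape exponents; e1 = e2 = 1 gives an ellipsoid, values
        near 0 give box-like shapes.
    Returns F; F < 1 means inside, F == 1 on the surface, F > 1 outside.
    """
    x, y, z = point
    a, b, c = scales
    xy = abs(x / a) ** (2.0 / e2) + abs(y / b) ** (2.0 / e2)
    return xy ** (e2 / e1) + abs(z / c) ** (2.0 / e1)


# The same point near a cube corner, tested against two unit primitives.
corner = (0.9, 0.9, 0.9)
unit = (1.0, 1.0, 1.0)
f_ellipsoid = superquadric_f(corner, unit, e1=1.0, e2=1.0)
f_boxlike = superquadric_f(corner, unit, e1=0.1, e2=0.1)
```

The point lies outside the unit sphere (`f_ellipsoid` is about 2.43) but inside the box-like primitive (`f_boxlike` is below 1), which is the kind of structure that would otherwise need many densely packed ellipsoidal Gaussians to cover.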

[CV-4] GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

【速读】：该论文试图解决AI生成视频检测中因缺乏高质量真实世界数据集而导致的检测器可信度不足的问题。解决方案的关键在于提出GenWorld，这是一个大规模、高质量且具有真实世界模拟特性的AI生成视频数据集，其核心特点包括真实场景模拟、高质量伪造视频生成以及跨提示多样性，从而为检测模型提供更具代表性和泛化能力的训练数据。同时，论文还提出了SpannDetector，通过利用多视角一致性作为检测准则，提升对高保真AI生成视频的检测效果。

Link: https://arxiv.org/abs/2506.10975
Authors: Weiliang Chen,Wenzhao Zheng,Yu Zheng,Lei Chen,Jie Zhou,Jiwen Lu,Yueqi Duan
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: this https URL

[CV-5] Eye Robot: Learning to Look to Act with a BC-RL Perception-Action Loop

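[Quick read]: This paper addresses effective hand-eye coordination for robots in complex environments, in particular manipulation over a large workspace with only a single camera. The key to the solution is EyeRobot, a system that trains an attention-like gaze policy with reinforcement learning and couples behavior cloning (BC) with reinforcement learning (RL) in a joint training scheme (the BC-RL loop), so that the mechanical eye actively looks at regions that help the hand complete the task. EyeRobot also adopts a foveal-inspired policy architecture that provides high-resolution visual processing under a limited compute budget and improves object tracking and distractor rejection.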

Link: https://arxiv.org/abs/2506.10968
Authors: Justin Kerr,Kush Hari,Ethan Weber,Chung Min Kim,Brent Yi,Tyler Bonnen,Ken Goldberg,Angjoo Kanazawa
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Humans do not passively observe the visual world – we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: this https URL

[CV-6] Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

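[Quick read]: This paper targets the high inference cost in multimodal large language models (MLLMs) caused by visual token sequences that are far longer than their textual counterparts. Existing methods either rely on attention-based pruning, which retains many duplicate tokens, or on similarity-based pruning, which ignores instruction relevance; both lead to suboptimal performance. The key to the solution is CDPruner, a visual token pruning method that maximizes the conditional diversity of the retained tokens. It defines a similarity between visual tokens conditioned on the instruction and reformulates token pruning as a determinantal point process (DPP) to select a better subset.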

Link: https://arxiv.org/abs/2506.10967
Authors: Qizhe Zhang,Mengzhen Liu,Lichen Li,Ming Lu,Yuan Zhang,Junwen Pan,Qi She,Shanghang Zhang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures, code: this https URL , project page: this https URL

Abstract:In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy. Our code is available at this https URL.
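
The idea of selecting a relevant yet diverse token subset with a DPP can be sketched with a greedy determinant-maximizing selection. This is a minimal illustration, not CDPruner's implementation: the kernel below uses the common quality-times-similarity DPP decomposition as a stand-in for the paper's conditional similarity, and the token vectors and relevance scores are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def greedy_dpp(tokens, relevance, keep):
    """Greedily add the token that maximizes det of the selected kernel block."""
    n = len(tokens)
    # Assumed kernel: L[i][j] = relevance_i * similarity_ij * relevance_j.
    kernel = [[relevance[i] * cosine(tokens[i], tokens[j]) * relevance[j]
               for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(keep):
        best, best_det = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            value = det([[kernel[r][c] for c in idx] for r in idx])
            if value > best_det:
                best, best_det = i, value
        selected.append(best)
    return selected


# Token 1 nearly duplicates token 0; token 2 is less relevant but different.
tokens = [(1.0, 0.0), (1.0, 0.01), (0.0, 1.0)]
relevance = [1.0, 0.9, 0.8]
kept = greedy_dpp(tokens, relevance, keep=2)
```

In this toy case the selection keeps tokens 0 and 2: the near-duplicate token 1 is skipped even though its relevance score is higher than that of token 2, which is the behavior that pure attention-based ranking would not give.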

[CV-7] SpectralAR: Spectral Autoregressive Visual Generation

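[Quick read]: This paper addresses the conflict between the inherent parallelism of image patches and the causality of autoregressive modeling in autoregressive visual generation. Existing methods usually build visual sequences from image patches, but the parallel nature of patches clashes with the sequential generation an autoregressive model requires. The key to the solution is Spectral AutoRegressive (SpectralAR), a framework that realizes causality for visual sequences from the spectral perspective. Nested Spectral Tokenization converts an image into ordered spectral tokens, and generation proceeds autoregressively from coarse to fine, achieving sequence causality and token efficiency without complex architectural additions.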

Link: https://arxiv.org/abs/2506.10962
Authors: Yuanhui Huang,Weiliang Chen,Wenzhao Zheng,Yueqi Duan,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: this https URL.
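
The coarse-to-fine ordering from low to high frequency can be illustrated with a plain discrete cosine transform on a 1-D signal. This is only an analogy under stated assumptions: SpectralAR learns its spectral tokens, whereas the sketch uses fixed DCT coefficients as stand-in "tokens", and the signal values are made up.

```python
import math


def dct(signal):
    """Orthonormal DCT-II; index 0 is the lowest frequency."""
    n = len(signal)
    out = []
    for k in range(n):
        s = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(s * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(signal)))
    return out


def idct(coeffs, keep=None):
    """Inverse transform that uses only the first `keep` coefficients."""
    n = len(coeffs)
    keep = n if keep is None else keep
    out = []
    for i in range(n):
        total = 0.0
        for k in range(keep):
            s = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += s * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


signal = [0.0, 1.0, 4.0, 9.0, 7.0, 3.0, 2.0, 2.5]
coeffs = dct(signal)
# Reconstruction error after "generating" the first k spectral coefficients.
errors = [rmse(signal, idct(coeffs, keep=k)) for k in range(1, len(signal) + 1)]
```

Each additional coefficient can only lower the reconstruction error, so any prefix of the sequence is already a valid coarse version of the signal. That prefix property is what makes a frequency-ordered sequence a natural fit for next-token prediction.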

[CV-8] ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems

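[Quick read]: This paper addresses a failure mode in inverse problems: when the reward model is not informative enough, training-free methods built on pretrained diffusion models, such as diffusion posterior sampling (DPS), drift off the data manifold and produce unrealistic outputs. The key to the solution is a simple wrapper, ReGuidance. Starting from a user-supplied candidate solution $\hat{x}$, it runs the unconditional probability flow ordinary differential equation (ODE) in reverse and uses the resulting latent to initialize DPS, improving both sample realism and the reward achieved.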

Link: https://arxiv.org/abs/2506.10955
Authors: Aayush Karan,Kulin Shah,Sitan Chen
Affiliations: Harvard SEAS; UT Austin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, 14 figures

Abstract: There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user’s choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.
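
The inversion step can be sketched on a toy problem where the score is known in closed form. The assumptions here are mine, not the paper's: 1-D data distributed as N(0, 1), a noise schedule sigma(t) = t, and explicit Euler integration. Under these assumptions the noised marginal is N(0, 1 + t^2) and the probability flow ODE reads dx/dt = -t * score(x, t).

```python
import math


def score(x, t):
    """Exact score of the noised marginal N(0, 1 + t^2) (toy assumption)."""
    return -x / (1.0 + t * t)


def pf_ode_velocity(x, t):
    """Probability flow ODE for sigma(t) = t: dx/dt = -t * score(x, t)."""
    return -t * score(x, t)


def integrate(x, t_start, t_end, steps=2000):
    """Explicit Euler integration of the ODE from t_start to t_end."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x += h * pf_ode_velocity(x, t)
        t += h
    return x


T = 5.0
x_hat = 0.7                             # candidate solution from any solver
latent = integrate(x_hat, 0.0, T)       # inversion: data -> latent
round_trip = integrate(latent, T, 0.0)  # unguided sampling: latent -> data
```

In this toy setting the latent matches the analytic value `x_hat * sqrt(1 + T**2)` and the unguided round trip returns approximately `x_hat`. In the wrapper described above, the second pass would instead be a guided DPS run started from `latent`; that part is omitted here.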

[CV-9] M4V: Multi-Modal Mamba for Text-to-Video Generation

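[Quick read]: This paper addresses the high computational cost and the limited multi-modal spatiotemporal modeling capability in text-to-video generation. The key to the solution is the M4V framework, whose multi-modal diffusion Mamba (MM-DiM) block uses a multi-modal token re-composition design to integrate multi-modal information with spatiotemporal modeling, substantially lowering computational cost while preserving video generation quality.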

Link: https://arxiv.org/abs/2506.10915
Authors: Jiancheng Huang,Gengwei Zhang,Zequn Jie,Siyu Jiao,Yinlong Qian,Ling Chen,Yunchao Wei,Lin Ma
Affiliations: Meituan; University of Technology Sydney; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at $768 \times 1280$ resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V’s ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at this https URL.

[CV-10] AIR: Zero-shot Generative Model Adaptation with Iterative Refinement

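[Quick read]: This paper addresses the degraded image quality in zero-shot generative model adaptation (ZSGM) that results from assuming the image offset and the text offset are perfectly aligned in the CLIP embedding space. The key to the solution is Adaptation with Iterative Refinement (AIR), which accounts for the misalignment between text offset and image offset to improve image quality in the target domain.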

Link: https://arxiv.org/abs/2506.10895
Authors: Guimeng Liu,Milad Abdollahzadeh,Ngai-Man Cheung
Affiliations: Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model like CLIP. This is similar to analogical reasoning in NLP, where the offset between one pair of words is used to identify a missing element in another pair by aligning the offsets of the two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes complete alignment between image offset and text offset in the CLIP embedding space, resulting in degraded quality of generated images. Our work makes two main contributions. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in the CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in the CLIP embedding space is correlated with concept distance, i.e., close concepts have less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptation with Iterative Refinement (AIR), which is the first ZSGM approach to focus on improving target domain image quality based on our new insight on offset misalignment. Qualitative, quantitative, and user studies in 26 experiment setups consistently demonstrate that the proposed AIR approach achieves SOTA performance. Additional experiments are in Supp.
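
The directional loss described in the abstract can be sketched as one minus the cosine similarity between the image offset and the text offset. This is a minimal sketch of that general form, not the authors' code: the short vectors below are hypothetical stand-ins for CLIP embeddings, and the helper names are my own.

```python
import math


def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def directional_loss(img_src, img_tgt, txt_src, txt_tgt):
    """1 - cos(image offset, text offset); 0 means perfectly aligned offsets."""
    return 1.0 - cosine(sub(img_tgt, img_src), sub(txt_tgt, txt_src))


# Toy 3-d vectors standing in for CLIP embeddings (hypothetical values).
txt_src, txt_tgt = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
img_src = [0.5, 0.2, 0.1]
aligned = directional_loss(img_src, [0.5, 0.9, 0.1], txt_src, txt_tgt)
misaligned = directional_loss(img_src, [0.5, 0.2, 0.8], txt_src, txt_tgt)
```

Minimizing this loss pushes the generator's image offset to be parallel to the text offset. The finding reported above is that the two offsets are not perfectly parallel in practice, and less so for distant concepts, so forcing the loss to zero can hurt image quality.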

[CV-11] CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation

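[Quick read]: This paper addresses the fact that creating high-quality, editable, and aesthetically pleasing graphic designs is time-consuming and skill-intensive, especially for beginners. The key to the solution is CreatiPoster, a framework that generates editable multi-layer compositions from natural-language instructions or supplied assets. It consists of a protocol model that produces a detailed JSON specification and a conditional background model that synthesizes a coherent background conditioned on the rendered foreground layers, yielding professional-grade visuals.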

Link: https://arxiv.org/abs/2506.10890
Authors: Zhao Zhang,Yutao Cheng,Dexiang Hong,Maoke Yang,Gonglei Shi,Lei Ma,Hui Zhang,Jie Shao,Xinglong Wu
Affiliations: ByteDance, Intelligent Creation; ByteDance, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Graphic design plays a crucial role in both commercial and personal contexts, yet creating high-quality, editable, and aesthetically pleasing graphic compositions remains a time-consuming and skill-intensive task, especially for beginners. Current AI tools automate parts of the workflow, but struggle to accurately incorporate user-supplied assets, maintain editability, and achieve professional visual appeal. Commercial systems, like Canva Magic Design, rely on vast template libraries, which are impractical to replicate. In this paper, we introduce CreatiPoster, a framework that generates editable, multi-layer compositions from optional natural-language instructions or assets. A protocol model, an RGBA large multimodal model, first produces a JSON specification detailing every layer (text or asset) with precise layout, hierarchy, content and style, plus a concise background prompt. A conditional background model then synthesizes a coherent background conditioned on these rendered foreground layers. We construct a benchmark with automated metrics for graphic-design generation and show that CreatiPoster surpasses leading open-source approaches and proprietary commercial systems. To catalyze further research, we release a copyright-free corpus of 100,000 multi-layer designs. CreatiPoster supports diverse applications such as canvas editing, text overlay, responsive resizing, multilingual adaptation, and animated posters, advancing the democratization of AI-assisted graphic design. Project homepage: this https URL

[CV-12] VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

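[Quick read]: This paper addresses the way existing evaluations of multi-step reasoning overlook temporal reasoning and procedural validity. The key to the solution is VRBench, a long narrative video benchmark for large models comprising 1,010 long videos, 9,468 multi-step question-answering pairs, and 30,292 timestamped reasoning steps, curated through a multi-stage filtering process that prioritizes plot coherence. The work also proposes a human-AI collaborative framework for generating coherent reasoning chains and a multi-phase evaluation pipeline, including a progress-level scoring metric guided by a large language model (LLM), to assess the quality of the reasoning process comprehensively.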

Link: https://arxiv.org/abs/2506.10857
Authors: Jiashuo Yu,Yue Wu,Meng Chu,Zhifei Ren,Zizheng Huang,Pei Chu,Ruijie Zhang,Yinan He,Qirui Li,Songze Li,Zhenxiang Li,Zhongying Tu,Conghui He,Yu Qiao,Yali Wang,Yi Wang,Limin Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Nanjing University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Technical Report

Abstract:We present VRBench, the first long narrative video benchmark crafted for evaluating large models’ multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.

[CV-13] Post-Training Quantization for Video Matting

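[Quick read]: This paper addresses the computational cost of deploying video matting models on resource-constrained devices, in particular the difficulty post-training quantization (PTQ) has in preserving accuracy and temporal coherence. The key to the solution is a general PTQ framework with three core contributions. First, a two-stage PTQ strategy combines block-reconstruction-based optimization with a global calibration of quantization parameters to reduce accuracy loss. Second, a Statistically-Driven Global Affine Calibration (GAC) method compensates for cumulative statistical distortions. Third, an Optical Flow Assistance (OFA) component uses temporal and semantic priors from frames to help the model distinguish foregrounds in complex scenes, achieving near full-precision performance under ultra-low-bit quantization.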

Link: https://arxiv.org/abs/2506.10840
Authors: Tianrui Zhu,Houyuan Chen,Ruihao Gong,Michele Magno,Haotong Qin,Kai Zhang
Affiliations: Nanjing University; SenseTime Research; ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration. As an efficient approach, Post-Training Quantization (PTQ) is still in its nascent stages for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, even reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model’s ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared to existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to the full-precision counterpart while enjoying 8x FLOP savings.
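
As background for the quantization and calibration steps, the sketch below shows plain uniform affine quantization followed by a least-squares affine correction of a drifted output. It is a generic illustration, not the paper's GAC method: the drift model (a fixed scale and shift), the function names, and the sample values are all assumptions.

```python
def quantize(values, n_bits, lo, hi):
    """Uniform affine quantization to integers in [0, 2**n_bits - 1]."""
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels
    q = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return q, scale


def dequantize(q, scale, lo):
    """Map integer codes back to real values."""
    return [i * scale + lo for i in q]


def fit_affine(distorted, reference):
    """Least-squares a, b minimizing sum((a * distorted + b - reference)^2)."""
    n = len(distorted)
    mx = sum(distorted) / n
    my = sum(reference) / n
    var = sum((x - mx) ** 2 for x in distorted)
    cov = sum((x - mx) * (y - my) for x, y in zip(distorted, reference))
    a = cov / var
    return a, my - a * mx


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


full_precision = [-0.9, -0.4, -0.1, 0.0, 0.3, 0.55, 0.8, 1.0]
q, scale = quantize(full_precision, n_bits=4, lo=-1.0, hi=1.0)
restored = dequantize(q, scale, lo=-1.0)

# Pretend accumulated errors shrink and shift the quantized model's output.
drifted = [0.9 * v + 0.05 for v in restored]
a, b = fit_affine(drifted, full_precision)
calibrated = [a * v + b for v in drifted]
```

Rounding alone keeps every value within half a quantization step of the original, while the affine fit removes the systematic scale and shift that rounding cannot explain. A global affine correction of this kind is cheap because it adds only two parameters per calibrated output.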

[CV-14] Occlusion-Aware 3D Hand-Object Pose Estimation with Masked AutoEncoders

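[Quick read]: This paper addresses hand-object pose estimation from monocular RGB images, which is difficult mainly because of the severe occlusions in hand-object interactions. Existing methods fall short in global structural perception and reasoning, which limits how well they handle occluded interactions. The key to the solution is HOMAE, an occlusion-aware hand-object pose estimation method based on masked autoencoders. Its target-focused masking strategy imposes structured occlusion on hand-object interaction regions, pushing the model to learn context-aware features and reason about occluded structures. It also fuses a signed distance field (SDF) predicted from multi-scale decoder features with an explicit point cloud, combining global context with local geometric precision for more robust handling of occluded regions.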

Link: https://arxiv.org/abs/2506.10816
Authors: Hui Yang,Wei Sun,Jian Liu,Jin Zheng,Jian Xiao,Ajmal Mian
Affiliations: Hunan University; Nanyang Technological University; Central South University; The University of Western Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Hand-object pose estimation from monocular RGB images remains a significant challenge mainly due to the severe occlusions inherent in hand-object interactions. Existing methods do not sufficiently explore global structural perception and reasoning, which limits their effectiveness in handling occluded hand-object interactions. To address this challenge, we propose an occlusion-aware hand-object pose estimation method based on masked autoencoders, termed as HOMAE. Specifically, we propose a target-focused masking strategy that imposes structured occlusion on regions of hand-object interaction, encouraging the model to learn context-aware features and reason about the occluded structures. We further integrate multi-scale features extracted from the decoder to predict a signed distance field (SDF), capturing both global context and fine-grained geometry. To enhance geometric perception, we combine the implicit SDF with an explicit point cloud derived from the SDF, leveraging the complementary strengths of both representations. This fusion enables more robust handling of occluded regions by combining the global context from the SDF with the precise local geometry provided by the point cloud. Extensive experiments on challenging DexYCB and HO3Dv2 benchmarks demonstrate that HOMAE achieves state-of-the-art performance in hand-object pose estimation. We will release our code and model.

[CV-15] Unsupervised Deformable Image Registration with Structural Nonparametric Smoothing

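[Quick read]: This paper addresses the aperture and large-displacement challenges that learning-based deformable image registration (DIR) faces on medical images with sparse features amid large smooth regions. Unsupervised DIR methods predict the deformation field in a single forward pass of a neural network, which leaves the field unconstrained after training and makes these cases hard to handle. The key to the solution is the SmoothProper module, which adds a duality-based optimization layer with tailored interaction terms to perform message passing within the network's forward pass. It propagates flow signals efficiently, enforces smoothness, and preserves structural consistency, without tuning a regularization hyperparameter.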

Link: https://arxiv.org/abs/2506.10813
Authors: Hang Zhang,Xiang Chen,Renjiu Hu,Rongguang Wang,Jinwei Zhang,Min Liu,Yaonan Wang,Gaolei Li,Xinxing Cheng,Jinming Duan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: Accepted for publication at Information Processing in Medical Imaging (IPMI) 2025

Abstract:Learning-based deformable image registration (DIR) accelerates alignment by amortizing traditional optimization via neural networks. Label supervision further enhances accuracy, enabling efficient and precise nonlinear alignment of unseen scans. However, images with sparse features amid large smooth regions, such as retinal vessels, introduce aperture and large-displacement challenges that unsupervised DIR methods struggle to address. This limitation occurs because neural networks predict deformation fields in a single forward pass, leaving fields unconstrained post-training and shifting the regularization burden entirely to network weights. To address these issues, we introduce SmoothProper, a plug-and-play neural module enforcing smoothness and promoting message passing within the network’s forward pass. By integrating a duality-based optimization layer with tailored interaction terms, SmoothProper efficiently propagates flow signals across spatial locations, enforces smoothness, and preserves structural consistency. It is model-agnostic, seamlessly integrates into existing registration frameworks with minimal parameter overhead, and eliminates regularizer hyperparameter tuning. Preliminary results on a retinal vessel dataset exhibiting aperture and large-displacement challenges demonstrate our method reduces registration error to 1.88 pixels on 2912x2912 images, marking the first unsupervised DIR approach to effectively address both challenges. The source code will be available at this https URL.

[CV-16] Prompts to Summaries: Zero-Shot Language-Guided Video Summarization

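[Quick read]: This paper addresses the lack of user control in video summarization, in particular how to produce flexible summaries guided by natural-language user intent without domain-specific training data. The key to the solution is the Prompts-to-Summaries framework. It uses pretrained video-language models (VidLMs) to generate scene-level descriptions, uses a large language model (LLM) as a judge to assign scene-level importance scores under a carefully crafted prompt, and propagates those scores to the short-segment level through two new metrics, consistency (temporal coherence) and uniqueness (novelty), to obtain fine-grained frame importance. The method uses no training data, surpasses all unsupervised methods, and matches supervised ones.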

Link: https://arxiv.org/abs/2506.10807
Authors: Mario Barbara,Alaa Maalouf
Affiliations: University of Haifa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The explosive growth of video data has intensified the need for flexible user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without the use of training data at all, beating all unsupervised and matching supervised methods. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally, (iv) propagates those scores to the short-segment level via two new metrics: consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data and the competing methods requiring supervised frame-level importance. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.

[CV-17] Human-Robot Navigation using Event-based Cameras and Reinforcement Learning

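[Quick read]: This paper addresses the motion blur and latency that conventional image-based robot navigation controllers suffer at fixed frame rates, and their resulting limitations in real-time human-centered navigation and obstacle avoidance. The key to the solution is to combine an event camera with other sensors and optimize the policy with reinforcement learning. The asynchronous nature of event cameras allows visual information to be processed over flexible time intervals, enabling adaptive inference and control. The framework adds an initial imitation learning phase to improve sample efficiency and uses Deep Deterministic Policy Gradient for policy optimization, achieving robust navigation, pedestrian following, and obstacle avoidance in simulated environments.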

Link: https://arxiv.org/abs/2506.10790
Authors: Ignacio Bugueno-Cordova,Javier Ruiz-del-Solar,Rodrigo Verschae
Affiliations: Universidad de Chile; Advanced Mining Technology Center (AMTC); Universidad de O’Higgins
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:This work introduces a robot navigation controller that combines event cameras and other sensors with reinforcement learning to enable real-time human-centered navigation and obstacle avoidance. Unlike conventional image-based controllers, which operate at fixed rates and suffer from motion blur and latency, this approach leverages the asynchronous nature of event cameras to process visual information over flexible time intervals, enabling adaptive inference and control. The framework integrates event-based perception, additional range sensing, and policy optimization via Deep Deterministic Policy Gradient, with an initial imitation learning phase to improve sample efficiency. Promising results are achieved in simulated environments, demonstrating robust navigation, pedestrian following, and obstacle avoidance. A demo video is available at the project website.

[CV-18] SlotPi: Physics-informed Object-centric Reasoning Models

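[Quick read]: This paper addresses understanding and reasoning about dynamics governed by physical laws through visual observation, and in particular two gaps in existing object-centric dynamic simulation methods: integrating physical knowledge and validating model adaptability across diverse scenarios. The key to the solution is SlotPi, a slot-based physics-informed object-centric reasoning model that combines a physical module based on Hamiltonian principles with a spatio-temporal prediction module to model and predict dynamic scenes.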

Link: https://arxiv.org/abs/2506.10778
Authors: Jian Li,Wan Han,Ning Lin,Yu-Liang Zhan,Ruizhi Chengze,Haining Wang,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Hao Sun
Affiliations: Renmin University of China; Huawei Technologies Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Understanding and reasoning about dynamics governed by physical laws through visual observation, akin to human capabilities in the real world, poses significant challenges. Currently, object-centric dynamic simulation methods, which emulate human behavior, have achieved notable progress but overlook two critical aspects: 1) the integration of physical knowledge into models. Humans gain physical insights by observing the world and apply this knowledge to accurately reason about various dynamic scenarios; 2) the validation of model adaptability across diverse scenarios. Real-world dynamics, especially those involving fluids and objects, demand models that not only capture object interactions but also simulate fluid flow characteristics. To address these gaps, we introduce SlotPi, a slot-based physics-informed object-centric reasoning model. SlotPi integrates a physical module based on Hamiltonian principles with a spatio-temporal prediction module for dynamic forecasting. Our experiments highlight the model’s strengths in tasks such as prediction and Visual Question Answering (VQA) on benchmark and fluid datasets. Furthermore, we have created a real-world dataset encompassing object interactions, fluid dynamics, and fluid-object interactions, on which we validated our model’s capabilities. The model’s robust performance across all datasets underscores its strong adaptability, laying a foundation for developing more advanced world models.
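
To make "a physical module based on Hamiltonian principles" concrete, the sketch below integrates Hamilton's equations for a mass on a spring and compares a symplectic leapfrog scheme with explicit Euler. This is a textbook example under my own assumptions (unit mass, unit stiffness, hand-picked step size); it does not reproduce SlotPi's learned module.

```python
def energy(q, p, m=1.0, k=1.0):
    """Hamiltonian H = p^2 / (2m) + k q^2 / 2 of a mass on a spring."""
    return p * p / (2.0 * m) + 0.5 * k * q * q


def leapfrog(q, p, dt, steps, m=1.0, k=1.0):
    """Symplectic integrator for dq/dt = dH/dp, dp/dt = -dH/dq."""
    for _ in range(steps):
        p -= 0.5 * dt * k * q  # half kick
        q += dt * p / m        # drift
        p -= 0.5 * dt * k * q  # half kick
    return q, p


def euler(q, p, dt, steps, m=1.0, k=1.0):
    """Explicit Euler on the same equations, for comparison."""
    for _ in range(steps):
        q, p = q + dt * p / m, p - dt * k * q
    return q, p


e0 = energy(1.0, 0.0)
e_leapfrog = energy(*leapfrog(1.0, 0.0, dt=0.01, steps=1000))
e_euler = energy(*euler(1.0, 0.0, dt=0.01, steps=1000))
```

The leapfrog rollout keeps the energy almost constant while explicit Euler steadily gains energy. Building the Hamiltonian structure into a dynamics model is one way to get predictions that respect conservation laws over long horizons.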

[CV-19] Stroke-based Cyclic Amplifier: Image Super-Resolution at Arbitrary Ultra-Large Scales

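[Quick read]: This paper addresses the sharp performance drop, and in particular the blurring, that Arbitrary-Scale Image Super-Resolution (ASISR) methods show when the upsampling factor exceeds the range covered by the training data. The key to the solution is a unified model, the Stroke-based Cyclic Amplifier (SbCA). Its stroke vector amplifier decomposes the image into a series of strokes represented as vector graphics for magnification, and a detail completion module restores missing details for high-fidelity reconstruction. A cyclic strategy applies the model, trained only once, iteratively to reach ultra-large scales while keeping each sub-scale within the training range, which mitigates distribution drift and removes artifacts, noise, and blurring.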

Link: https://arxiv.org/abs/2506.10774
Authors: Wenhao Guo,Peng Lu,Xujun Peng,Zhaoran Zhao,Sheng Li
Affiliations: Beijing University of Posts and Telecommunications; Peking University; Amazon AGI Foundations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:Prior Arbitrary-Scale Image Super-Resolution (ASISR) methods often experience a significant performance decline when the upsampling factor exceeds the range covered by the training data, introducing substantial blurring. To address this issue, we propose a unified model, Stroke-based Cyclic Amplifier (SbCA), for ultra-large upsampling tasks. The key to SbCA is the stroke vector amplifier, which decomposes the image into a series of strokes represented as vector graphics for magnification. Then, the detail completion module restores missing details, ensuring high-fidelity image reconstruction. Our cyclic strategy achieves ultra-large upsampling by iteratively refining details with this unified SbCA model, trained only once for all, while keeping sub-scales within the training range. Our approach effectively addresses the distribution drift issue and eliminates artifacts, noise and blurring, producing high-quality, high-resolution super-resolved images. Experimental validations on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing methods in ultra-large upsampling tasks (e.g., ×100), delivering visual quality far superior to state-of-the-art techniques.
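
The cyclic strategy described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cyclic_schedule` and the parameter `max_sub_scale` (the largest factor assumed to be covered by training) are hypothetical, and the equal split across passes is only one possible choice. The sketch shows how a large target factor can be broken into per-pass factors that each stay within the trained range.

```python
import math

def cyclic_schedule(target_scale, max_sub_scale):
    """Split target_scale into equal per-pass factors, each <= max_sub_scale.

    Both names are hypothetical; the paper does not specify this schedule.
    """
    if target_scale < 1 or max_sub_scale <= 1:
        raise ValueError("need target_scale >= 1 and max_sub_scale > 1")
    # Smallest number of passes whose combined factor can reach the target.
    ratio = math.log(target_scale) / math.log(max_sub_scale)
    passes = max(1, math.ceil(ratio - 1e-12))
    per_pass = target_scale ** (1.0 / passes)
    return [per_pass] * passes

# Example: reach x100 when each pass is assumed to handle at most x4.
schedule = cyclic_schedule(100.0, 4.0)
```

With these assumed numbers the schedule uses four passes of roughly ×3.16 each, so no single pass leaves the assumed training range.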

[CV-20] PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

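[Quick Read]: This paper addresses the challenging task of generating aesthetic posters, which requires not only precise text rendering but also abstract artistic content, striking layouts, and overall stylistic harmony. The key to its solution is PosterCraft, a unified framework that abandons earlier modular pipelines and rigid, predefined layouts so that the model can freely explore coherent, visually compelling compositions. The framework optimizes high-quality aesthetic poster generation in four stages: large-scale text-rendering optimization, region-aware supervised fine-tuning, aesthetic-text reinforcement learning via best-of-n preference optimization, and joint vision-language feedback refinement. Each stage has a fully automated data-construction pipeline tailored to its needs, enabling robust training without complex architectural modifications.

Link: https://arxiv.org/abs/2506.10741
Authors: SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Meituan; Xiamen University; National University of Singapore; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: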

Abstract:Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Evaluated in multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems. Our code, models, and datasets can be found on the project page: this https URL

[CV-21] IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain

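[Quick Read]: This paper addresses two key problems in zero-shot and few-shot anomaly detection (ZFSAD). First, existing CLIP-based methods rely on prior category knowledge and need text prompts designed for specific scenarios, and they struggle to separate normal from anomalous instances in the joint embedding space. Second, most ZFSAD methods focus on industrial domains, with little exploration of medical tasks. The key to its solution is the IQE-CLIP framework, which generates query embeddings by fusing textual and instance-aware visual information to identify anomalies more effectively, and introduces class-based learnable prompt tokens together with an instance-aware query module to improve adaptability and detection performance in the medical domain.

Link: https://arxiv.org/abs/2506.10730
Authors: Hong Huang, Weixiang Sun, Zhijian Wu, Jingwen Niu, Donghuan Lu, Xian Wu, Yefeng Zheng
Affiliations: Westlake University; Simon Fraser University; University of Notre Dame; Shandong University; Tencent Jarvis Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: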

Abstract:Recent advances in vision-language models, such as CLIP, have significantly improved performance in zero- and few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based methods assume prior knowledge of categories and rely on carefully designed prompts tailored to specific scenarios. While these text prompts capture semantic information in the textual space, they often fail to distinguish normal and anomalous instances in the joint embedding space. Moreover, most ZFSAD approaches focus on industrial domains, with limited exploration in medical tasks. To address these limitations, we propose IQE-CLIP, a novel framework for ZFSAD in the medical domain. We show that query embeddings integrating both textual and instance-aware visual information serve as more effective indicators of anomalies. Specifically, we introduce class-based and learnable prompting tokens to better adapt CLIP to the medical setting. Furthermore, we design an instance-aware query module that extracts region-level contextual information from both modalities, enabling the generation of anomaly-sensitive embeddings. Extensive experiments on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance in both zero-shot and few-shot settings. Code and data are available at this https URL.

[CV-22] Deep Learning-based Multi Project InP Wafer Simulation for Unsupervised Surface Defect Detection

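[Quick Read]: This paper addresses the lack of known golden standards in semiconductor manufacturing caused by low production scale and high design variability, which leaves defect detection manual and inefficient. The key to its solution is a deep-neural-network-based method that builds a synthetic golden standard by generating photo-realistic InP wafer images from CAD data, enabling more efficient defect detection.

Link: https://arxiv.org/abs/2506.10713
Authors: Emílio Dolgener Cantú, Rolf Klemens Wittmann, Oliver Abdeen, Patrick Wagner, Wojciech Samek, Moritz Baier, Sebastian Lapuschkin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: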

Abstract:Quality management in semiconductor manufacturing often relies on template matching with known golden standards. For Indium-Phosphide (InP) multi-project wafer manufacturing, low production scale and high design variability lead to such golden standards being typically unavailable. Defect detection, in turn, is manual and labor-intensive. This work addresses this challenge by proposing a methodology to generate a synthetic golden standard using Deep Neural Networks, trained to simulate photo-realistic InP wafer images from CAD data. We evaluate various training objectives and assess the quality of the simulated images on both synthetic data and InP wafer photographs. Our deep-learning-based method outperforms a baseline decision-tree-based approach, enabling the use of a ‘simulated golden die’ from CAD plans in any user-defined region of a wafer for more efficient defect detection. We apply our method to a template matching procedure, to demonstrate its practical utility in surface defect detection.

[CV-23] Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement

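[Quick Read]: This paper addresses insufficient segmentation quality in Camouflaged Object Detection (COD), caused by the subtle visual differences between targets and their backgrounds. Existing methods have made progress, but post-processing refinement still leaves considerable room for improvement. The key to its solution is the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, whose uncertainty-guided masking mechanism applies Bernoulli diffusion only to residual regions with poor segmentation quality, achieving targeted refinement while preserving correctly segmented areas. In addition, a Hybrid Uncertainty Quantification Network (HUQNet) fuses uncertainty from multiple sources to improve estimation accuracy and provide adaptive guidance for the generative sampling process.

Link: https://arxiv.org/abs/2506.10712
Authors: Yuqi Shen, Fengyang Xiao, Sujie Hu, Youwei Pang, Yifan Pu, Chengyu Fang, Xiu Li, Chunming He
Affiliations: Tsinghua University; Duke University; Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures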

Abstract:Camouflaged Object Detection (COD) presents inherent challenges due to the subtle visual differences between targets and their backgrounds. While existing methods have made notable progress, there remains significant potential for post-processing refinement that has yet to be fully explored. To address this limitation, we propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD. UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality, enabling targeted refinement while preserving correctly segmented areas. To support this process, we design the Hybrid Uncertainty Quantification Network (HUQNet), which employs a multi-branch architecture and fuses uncertainty from multiple sources to improve estimation accuracy. This enables adaptive guidance during the generative sampling process. The proposed UMBD framework can be seamlessly integrated with a wide range of existing Encoder-Decoder-based COD models, combining their discriminative capabilities with the generative advantages of diffusion-based refinement. Extensive experiments across multiple COD benchmarks demonstrate consistent performance improvements, achieving average gains of 5.5% in MAE and 3.2% in weighted F-measure with only modest computational overhead. Code will be released.
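
The idea of applying Bernoulli diffusion only to uncertain regions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the UMBD model itself: the function name `bernoulli_noise_step` is hypothetical, the boolean `uncertain` map stands in for whatever HUQNet produces, and the noising rule is the standard binomial-diffusion forward step (a pixel becomes 1 with probability `m * (1 - beta) + 0.5 * beta`), which may differ from the schedule the authors use.

```python
import random

def bernoulli_noise_step(mask, uncertain, beta, rng):
    """One forward Bernoulli-diffusion step applied only where uncertain is True.

    mask: 2D list of 0/1 segmentation values.
    uncertain: 2D list of booleans (assumed to come from an uncertainty estimator).
    beta: noise level in [0, 1]; rng: a random.Random instance.
    """
    out = []
    for m_row, u_row in zip(mask, uncertain):
        row = []
        for m, u in zip(m_row, u_row):
            if u:
                # Probability that the pixel equals 1 after noising.
                p_one = m * (1.0 - beta) + 0.5 * beta
                row.append(1 if rng.random() < p_one else 0)
            else:
                # Confident pixels pass through unchanged.
                row.append(m)
        out.append(row)
    return out

rng = random.Random(0)
mask = [[1, 0], [1, 0]]
uncertain = [[False, False], [True, True]]  # only the second row is refined
noisy = bernoulli_noise_step(mask, uncertain, 1.0, rng)
```

Whatever the noise level, the first row of `noisy` is identical to the input, which is the "preserve correctly segmented areas" behavior the abstract describes.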

[CV-24] Continual Hyperbolic Learning of Instances and Classes

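[Quick Read]: This paper addresses continual learning of instances and classes at the same time, which is especially important in real-world applications such as robotics and autonomous driving. Traditional continual learning focuses on classifying either instances or classes, whereas practical applications require models to handle both. The key to its solution is recognizing that classes and instances form a hierarchical structure and proposing HyperCLIC, an algorithm that models these hierarchical relationships in hyperbolic space, enabling continual embedding and balancing of multi-granularity information.

Link: https://arxiv.org/abs/2506.10710
Authors: Melika Ayoughi, Mina Ghadimi Atigh, Mohammad Mahdi Derakhshani, Cees G. M. Snoek, Pascal Mettes, Paul Groth
Affiliations: University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: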

Abstract:Continual learning has traditionally focused on classifying either instances or classes, but real-world applications, such as robotics and self-driving cars, require models to handle both simultaneously. To mirror real-life scenarios, we introduce the task of continual learning of instances and classes, at the same time. This task challenges models to adapt to multiple levels of granularity over time, which requires balancing fine-grained instance recognition with coarse-grained class generalization. In this paper, we identify that classes and instances naturally form a hierarchical structure. To model these hierarchical relationships, we propose HyperCLIC, a continual learning algorithm that leverages hyperbolic space, which is uniquely suited for hierarchical data due to its ability to represent tree-like structures with low distortion and compact embeddings. Our framework incorporates hyperbolic classification and distillation objectives, enabling the continual embedding of hierarchical relations. To evaluate performance across multiple granularities, we introduce continual hierarchical metrics. We validate our approach on EgoObjects, the only dataset that captures the complexity of hierarchical object recognition in dynamic real-world environments. Empirical results show that HyperCLIC operates effectively at multiple granularities with improved hierarchical generalization.
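
To make the appeal of hyperbolic space for hierarchies concrete, here is a small Python sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model or curvature HyperCLIC uses, so the unit ball with curvature -1 is an assumption, and `poincare_distance` is a hypothetical helper name.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    nu, nv = sq_norm(u), sq_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))

# The same Euclidean gap of 0.1 is much longer near the boundary than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Distances grow rapidly toward the boundary, which leaves room for many well-separated leaves (instances) around a few central nodes (classes). This is the usual intuition for embedding tree-like data with low distortion.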

[CV-25] Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

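[Quick Read]: This paper addresses accurate automatic screening of minors in unconstrained images, where the core challenges are models' limited robustness to distribution shift and the scarcity of child samples in public datasets. The key to its solution is a multi-task architecture built on a frozen FaRL vision-language backbone with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads (for the 12, 15, 18, and 21 year thresholds). It introduces an α-reweighted focal-style loss and age-balanced mini-batch sampling to mitigate class imbalance, and an age gap that removes edge cases from the loss further improves performance.

Link: https://arxiv.org/abs/2506.10689
Authors: Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: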

Abstract:Accurate automatic screening of minors in unconstrained images demands models that are robust to distribution shift and resilient to the under-representation of children in publicly available data. To overcome these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary under-age heads for age thresholds of 12, 15, 18, and 21 years, focusing on the legally critical age range. To address the severe class imbalance, we introduce an α-reweighted focal-style loss and age-balanced mini-batch sampling, which equalizes twelve age bins during stochastic optimization. Further improvement is achieved with an age gap that removes edge cases from the loss. Moreover, we set a rigorous evaluation by proposing the Overall Under-Age Benchmark, with 303k cleaned training images and 110k test images, defining both the “ASORES-39k” restricted overall test, which removes the noisiest domains, and the age estimation wild shifts test “ASWIFT-20k” of 20k images, stressing extreme pose (45°), expression, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model “F” lowers the root-mean-square-error on the ASORES-39k restricted test from 5.733 (age-only baseline) to 5.656 years and lifts under-18 detection from F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the domain shift to the wild data of ASWIFT-20k, the same configuration nearly sustains 0.99 recall while boosting F2 from 0.742 to 0.833 with respect to the age-only baseline, demonstrating strong generalization under distribution shift. For the under-12 and under-15 tasks, the boosts in F2 are from 0.666 to 0.955 and from 0.689 to 0.916, respectively.
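
A rough Python sketch of how an α-weighted focal loss and an age gap could fit together for one under-age head is given below. It is a reading of the abstract, not the authors' code: the function name `underage_focal_loss`, the default values of `alpha`, `gamma`, and `gap`, and the rule "skip samples whose age is within `gap` years of the threshold" are all assumptions made for illustration. The loss itself follows the standard binary focal-loss form, -α_t (1 - p_t)^γ log p_t.

```python
import math

def underage_focal_loss(p_under, age, threshold, alpha=0.75, gamma=2.0, gap=1.0):
    """Alpha-weighted focal loss for one sample of an 'under threshold' head.

    p_under: predicted probability that the person is below the threshold.
    Returns None when the true age lies inside the gap around the threshold,
    signalling that the sample is excluded from the loss (assumed behavior).
    """
    if abs(age - threshold) < gap:
        return None
    is_under = age < threshold
    p_t = p_under if is_under else 1.0 - p_under
    alpha_t = alpha if is_under else 1.0 - alpha
    p_t = min(max(p_t, 1e-7), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = underage_focal_loss(0.9, age=10, threshold=18)   # confident and correct
hard = underage_focal_loss(0.1, age=10, threshold=18)   # confident and wrong
edge = underage_focal_loss(0.9, age=17.5, threshold=18) # inside the gap
```

The focal term shrinks the loss of well-classified samples, so `hard` is far larger than `easy`, while `edge` is dropped because a 17.5-year-old sits too close to the 18-year threshold for the label to be reliable.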

[CV-26] Unsourced Adversarial CAPTCHA: A Bi-Phase Adversarial CAPTCHA Framework

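[Quick Read]: This paper addresses the vulnerability of traditional CAPTCHA schemes to automated attacks based on deep neural networks (DNNs) as deep learning advances rapidly, as well as the reliance of existing adversarial attack methods on original image characteristics, which hinders human interpretation and limits applicable scenarios. The key to its solution is a new framework, Unsourced Adversarial CAPTCHA (UAC), which generates high-fidelity adversarial examples from attacker-specified text prompts and uses a Large Language Model (LLM) to increase CAPTCHA diversity, supporting both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for better image quality. For untargeted attacks, especially in black-box scenarios, bi-path unsourced adversarial CAPTCHA (BP-UAC) uses multimodal gradients and a bi-path optimization strategy to achieve efficient misclassification.

Link: https://arxiv.org/abs/2506.10685
Authors: Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun, Zhe Chen, Wei Ni, Jun Luo
Affiliations: Xiamen University of Technology; Sichuan University; University of Hong Kong; University of Macau; Fudan University; Data61, CSIRO; University of New South Wales; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: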

Abstract:With the rapid advancements in deep learning, traditional CAPTCHA schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on original image characteristics, resulting in distortions that hinder human interpretation and limit applicability in scenarios lacking initial input images. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (UAC), a novel framework generating high-fidelity adversarial examples guided by attacker-specified text prompts. Leveraging a Large Language Model (LLM), UAC enhances CAPTCHA diversity and supports both targeted and untargeted attacks. For targeted attacks, the EDICT method optimizes dual latent variables in a diffusion model for superior image quality. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-UAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show BP-UAC achieves high attack success rates across diverse systems, generating natural CAPTCHAs indistinguishable to humans and DNNs.

[CV-27] Enhancing Deepfake Detection using SE Block Attention with CNN

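[Quick Read]: This paper addresses the threat that deepfake content poses to information authenticity and security, where the core challenge is that existing detection methods struggle to keep up with the growing complexity and realism of deepfakes. The key to its solution is a lightweight convolutional neural network (CNN) combined with squeeze-and-excitation (SE) block attention, which performs dynamic channel-wise feature recalibration so that the network emphasizes informative features and suppresses less useful ones, yielding efficient and accurate deepfake detection. While remaining small, the model achieves an overall classification accuracy of 94.14% and an AUC-ROC score of 0.985 on the Style GAN dataset.

Link: https://arxiv.org/abs/2506.10683
Authors: Subhram Dasgupta, Janelle Mason, Xiaohong Yuan, Olusola Odeyomi, Kaushik Roy
Affiliations: North Carolina A&T State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: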

Abstract:In the digital age, deepfakes present a formidable challenge by using advanced artificial intelligence to create highly convincing manipulated content, undermining information authenticity and security. These sophisticated fabrications surpass traditional detection methods in complexity and realism. To address this issue, we aim to harness cutting-edge deep learning methodologies to engineer an innovative deepfake detection model. However, most of the models designed for deepfake detection are large, causing heavy storage and memory consumption. In this research, we propose a lightweight convolutional neural network (CNN) with squeeze and excitation block attention (SE) for deepfake detection. The SE block module is designed to perform dynamic channel-wise feature recalibration. The SE block allows the network to emphasize informative features and suppress less useful ones, which leads to a more efficient and effective learning module. This module is integrated with a simple sequential model to perform deepfake detection. The model is smaller in size and achieves competitive accuracy with the existing models for deepfake detection tasks. The model achieved an overall classification accuracy of 94.14% and AUC-ROC score of 0.985 on the Style GAN dataset from the Diverse Fake Face Dataset. Our proposed approach presents a promising avenue for combating the deepfake challenge with minimal computational resources, developing efficient and scalable solutions for digital content verification.
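
The channel-wise recalibration performed by an SE block can be sketched in plain Python. This is a generic squeeze-and-excitation illustration, not the model from the paper: the weights `w1` and `w2`, the tiny 2-channel feature map, and the omission of biases are all assumptions chosen to keep the example short.

```python
import math

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over feature_map[c][h][w].

    w1: reduction weights, shape [C_reduced][C].
    w2: expansion weights, shape [C][C_reduced].
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: bottleneck MLP (ReLU, then sigmoid) gives a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Recalibration: rescale every value in a channel by that channel's gate.
    scaled = [[[g * x for x in r] for r in ch] for ch, g in zip(feature_map, gates)]
    return scaled, gates

fmap = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # 2 channels, 2x2 each
w1 = [[0.5, 0.5]]       # reduce 2 channels to 1 hidden unit
w2 = [[1.0], [-1.0]]    # expand 1 hidden unit to 2 gates
out, gates = se_block(fmap, w1, w2)
```

With these hand-picked weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the block emphasizes one channel and suppresses the other while adding only two small weight matrices.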

[CV-28] PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis

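[Quick Read]: This paper addresses two problems with prototype-based methods in medical imaging: their visualizations are not always consistent with human-understandable biomarkers, and the overly fine-grained prototypes they learn are less interpretable. The key to its solution is PiPViT (Patch-based Visual Interpretable Prototypes), a model built on a vision transformer (ViT) that captures long-range dependencies among patches to learn robust, human-interpretable prototypes and approximates lesion extent using only image-level labels, while combining contrastive learning and multi-resolution input processing to localize biomarkers across scales.

Link: https://arxiv.org/abs/2506.10669
Authors: Marzieh Oghbaie, Teresa Araújo, Hrvoje Bogunović
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: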

Abstract:Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: this https URL

[CV-29] GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

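[Quick Read]: This paper addresses how to improve video generation models along specific dimensions such as instance preservation, motion rationality, composition, and physical plausibility, given that existing fine-tuning methods usually depend on human annotations and large-scale compute, which limits their practicality. The key to its solution is the GigaVideo-1 framework, which unlocks the latent potential of pre-trained video diffusion models through automatic feedback rather than relying on external high-quality data. Its core contributions are a prompt-driven data engine that builds diverse training samples and a reward-guided optimization strategy that uses feedback from pre-trained vision-language models, enabling efficient fine-tuning with no manual annotation and minimal real data.

Link: https://arxiv.org/abs/2506.10639
Authors: Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen, YuKun Zhou, Jiancheng Lv, Xingang Wang, Guan Huang
Affiliations: GigaAI; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Computer Science, Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: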

Abstract:Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.

[CV-30] Symmetrical Flow Matching: Unified Image Generation Segmentation and Classification with Score-Based Generative Models

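[Quick Read]: This paper addresses the difficulty of modeling semantic segmentation, classification, and image generation in a unified way, as well as the strict one-to-one mapping that earlier methods require for conditional generation. The key to its solution is Symmetrical Flow Matching (SymmFlow), which uses a symmetric learning objective to model forward and reverse transformations jointly, ensuring bi-directional consistency while preserving enough entropy for generative diversity. It also introduces a new training objective that explicitly retains semantic information across flows, supporting flexible conditioning and one-step segmentation and classification.

Link: https://arxiv.org/abs/2506.10634
Authors: Francisco Caetano, Christiaan Viviers, Peter H.N. De With, Fons van der Sommen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: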

Abstract:Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks. The code will be publicly available.
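
For readers new to Flow Matching, the following Python sketch shows the basic (non-symmetric) recipe with a straight-line path between a source sample `x0` and a target sample `x1`. It does not reproduce SymmFlow's symmetric objective, which the abstract does not spell out; the helper names and the constant "oracle" velocity used in the example are assumptions for illustration.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and the path velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted velocity and the path velocity."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

def euler_integrate(predict_velocity, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0, x1 = [0.0, 0.0], [1.0, -2.0]
oracle = lambda x, t: [1.0, -2.0]          # stands in for a trained velocity network
loss = flow_matching_loss(oracle, x0, x1, 0.3)
end = euler_integrate(oracle, x0, steps=25)
```

A model that predicts the path velocity exactly has zero loss and transports `x0` to `x1` after integration. The 25 steps here simply mirror the inference step count quoted in the abstract and carry no other significance.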

[CV-31] Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

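[Quick Read]: This paper addresses insufficient multi-modal alignment of text-to-image latent diffusion models in medical imaging, specifically the alignment of clinically relevant information between chest X-ray images and free-text radiology reports. The key to its solution is a fine-tuning framework that strengthens the multi-modal alignment of a pre-trained model so that it can be efficiently repurposed for downstream tasks such as phrase grounding.

Link: https://arxiv.org/abs/2506.10633
Authors: Konstantinos Vilouras, Ilias Stogiannidis, Junyu Yan, Alison Q. O’Neil, Sotirios A. Tsaftaris
Affiliations: University of Edinburgh; Canon Medical Research Europe Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures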

Abstract:Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. On the contrary, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.

[CV-32] Hessian Geometry of Latent Space in Generative Models ICML2025

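[Quick Read]: This paper addresses the analysis of the latent space geometry of generative models, including statistical physics models and diffusion models, and in particular how to accurately reconstruct the Fisher information metric in order to reveal thermodynamic quantities and phase transitions in the latent space. The key to its solution is to approximate the posterior distribution of latent variables given generated samples and use it to learn the log-partition function, which defines the Fisher metric for exponential families, allowing the latent space geometry to be modeled and analyzed effectively.

Link: https://arxiv.org/abs/2506.10632
Authors: Alexander Lobashev, Dmitry Guskov, Maria Larchenko, Mikhail Tamm
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG); Statistics Theory (math.ST)
Comments: ICML 2025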

Abstract:This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions. Our source code is available at this https URL.
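
The statement that the log-partition function defines the Fisher metric can be checked numerically on the simplest exponential family. The Python sketch below uses a Bernoulli distribution in natural parameters, where A(θ) = log(1 + e^θ) and the Fisher information equals the second derivative A''(θ) = σ(θ)(1 − σ(θ)). The Bernoulli example and the finite-difference step are choices made here for illustration; the paper works with learned log-partition functions of far richer models.

```python
import math

def log_partition(theta):
    """Log-partition function A(theta) of a Bernoulli in natural parameters."""
    return math.log1p(math.exp(theta))

def fisher_from_hessian(theta, h=1e-4):
    """Central finite-difference estimate of the second derivative A''(theta)."""
    return (log_partition(theta + h) - 2.0 * log_partition(theta)
            + log_partition(theta - h)) / (h * h)

def fisher_closed_form(theta):
    """Variance of the sufficient statistic: sigma(theta) * (1 - sigma(theta))."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return s * (1.0 - s)

# The two columns agree up to finite-difference error.
rows = [(t, fisher_from_hessian(t), fisher_closed_form(t)) for t in (-2.0, 0.0, 1.5)]
```

In one dimension the metric is a single number, and in higher dimensions the same construction gives the Hessian matrix of A. This is why a good estimate of the log-partition function is enough to recover the geometry.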

[CV-33] TexTailor: Customized Text-aligned Texturing via Effective Resampling ICLR2025

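[Quick Read]: This paper addresses insufficient view consistency in text-to-texture synthesis, which appears as a gradual shift in texture properties across viewpoints caused by insufficient integration of previously synthesized textures during the diffusion process and by the autoregressive nature of texture synthesis. In addition, predefined camera positions that ignore object geometry limit the effective use of texture information synthesized from different viewpoints. TexTailor's key solutions are: (1) applying a resampling scheme during the diffusion process that repeatedly integrates information from previously synthesized textures; (2) fine-tuning a depth-aware diffusion model on these resampled textures, with a performance preservation loss to mitigate the effect of having few training images on generation quality; and (3) adaptively adjusting camera positions based on object geometry to improve view-consistent texture synthesis.

Link: https://arxiv.org/abs/2506.10612
Authors: Suin Lee, Dae-Shik Kim
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ICLR 2025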

Abstract:We present TexTailor, a novel method for generating consistent object textures from textual descriptions. Existing text-to-texture synthesis approaches utilize depth-aware diffusion models to progressively generate images and synthesize textures across predefined multiple viewpoints. However, these approaches lead to a gradual shift in texture properties across viewpoints due to (1) insufficient integration of previously synthesized textures at each viewpoint during the diffusion process and (2) the autoregressive nature of the texture synthesis process. Moreover, the predefined selection of camera positions, which does not account for the object’s geometry, limits the effective use of texture information synthesized from different viewpoints, ultimately degrading overall texture consistency. In TexTailor, we address these issues by (1) applying a resampling scheme that repeatedly integrates information from previously synthesized textures within the diffusion process, and (2) fine-tuning a depth-aware diffusion model on these resampled textures. During this process, we observed that using only a few training images restricts the model’s original ability to generate high-fidelity images aligned with the conditioning, and therefore propose a performance preservation loss to mitigate this issue. Additionally, we improve the synthesis of view-consistent textures by adaptively adjusting camera positions based on the object’s geometry. Experiments on a subset of the Objaverse dataset and the ShapeNet car dataset demonstrate that TexTailor outperforms state-of-the-art methods in synthesizing view-consistent textures. The source code for TexTailor is available at this https URL

[CV-34] MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

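[Quick Read]: This paper addresses the dependence of scene text retrieval on costly bounding box annotations and the difficulty of unifying multiple query types. The key to its solution is MSTAR, a box-free multi-query scene text retrieval method that uses progressive vision embedding to dynamically capture multi-grained representations of text, harmonizes free-style text queries with style-aware instructions, and adds a multi-instance matching module to strengthen vision-language alignment.

Link: https://arxiv.org/abs/2506.10609
Authors: Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: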

Abstract:Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Besides, they mostly adopt a customized retrieval strategy but struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at this https URL.
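
Since the results above are reported in MAP, a short Python sketch of mean average precision for ranked retrieval may help. It uses the common definition with binary relevance and assumes each ranked list contains all relevant images for its query; the exact evaluation protocol of Total-Text and MQTR may differ, and the function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(per_query):
    """MAP: average of per-query AP values."""
    return sum(average_precision(r) for r in per_query) / len(per_query)

# Two text queries; 1 marks a retrieved image that really contains the query text.
queries = [[1, 0, 1], [0, 1]]
score = mean_average_precision(queries)
```

For the first query the relevant images sit at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2; for the second the only relevant image is at rank 2, giving AP = 1/2.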

[CV-35] High-resolution efficient image generation from WiFi CSI using a pretrained latent diffusion model

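TL;DR: This paper tackles the generation of images of the physical environment from WiFi channel state information (CSI), a task for which prior methods typically rely on complex, computationally intensive techniques such as generative adversarial networks (GANs). The key to its solution is a pretrained latent diffusion model (LDM): a lightweight neural network maps CSI amplitudes directly into the LDM's latent space, the latent representation is then processed by the text-guided denoising diffusion model, and the LDM's pretrained decoder produces the final high-resolution image. The method sidesteps the challenges of pixel-space image generation and the explicit image-encoding stage of conventional image-to-image pipelines, enabling efficient, high-quality image synthesis.

Link: https://arxiv.org/abs/2506.10605
Authors: Eshan Ramesh,Nishio Takayuki
Affiliations: School of Engineering, Institute of Science Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures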

Abstract:We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM’s denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM’s pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.

[CV-36] Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

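TL;DR: This paper addresses oriented object detection in high-density remote sensing scenes, where annotation is costly and existing point-supervised methods suffer from inadequate sample assignment and instance confusion. The key to its solution is the SSP (Semantic-decoupled Spatial Partition) framework, which combines rule-driven prior injection with data-driven label purification to perform sample assignment via pixel-level spatial partitioning and bounding-box extraction via semantic spatial partitioning, thereby improving detection performance.

Link: https://arxiv.org/abs/2506.10601
Authors: Xinyuan Liu,Hang Xu,Yike Ma,Yucheng Zhang,Feng Dai
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Communication Engineering, Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: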

Abstract:Recent advances in remote sensing technology have driven rapid growth in imagery and, with it, rapid development of oriented object detection, which nevertheless remains hindered by labor-intensive annotation for high-density scenes. Oriented object detection with point supervision offers a cost-effective solution for densely packed scenes in remote sensing, yet existing methods suffer from inadequate sample assignment and instance confusion due to rigid rule-based designs. To address this, we propose SSP (Semantic-decoupled Spatial Partition), a unified framework that synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two core innovations: 1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps. 2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and reliably converts them into bounding boxes to form pseudo-labels for supervising the learning of downstream detectors. Experiments on DOTA-v1.0 and others demonstrate SSP’s superiority: it achieves 45.78% mAP under point supervision, outperforming SOTA method PointOBB-v2 by 4.10%. Furthermore, when integrated with ORCNN and ReDet architectures, the SSP framework achieves mAP values of 47.86% and 48.50%, respectively. The code is available at this https URL.

[CV-37] EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence

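TL;DR: This paper addresses the high cost, limited realism, and poor scalability of the manually created and annotated 3D computer graphics assets that current embodied intelligence tasks depend on. The key to its solution is EmbodiedGen, a foundational platform for interactive 3D world generation that produces high-quality, controllable, photorealistic 3D assets at low cost, with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF), to support training and evaluation of downstream tasks.

Link: https://arxiv.org/abs/2506.10600
Authors: Wang Xinjie,Liu Liu,Cao Yu,Wu Ruiqi,Qin Wenkang,Wang Dehui,Sui Wei,Su Zhizhong
Affiliations: Horizon Robotics; GigaAI; D-Robotics; Shanghai Jiao Tong University; VCIP, CS, Nankai University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: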

Abstract:Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low cost accessibility and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation to the needs of embodied intelligence related research. Code is available at this https URL.

[CV-38] Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement

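TL;DR: This paper addresses accurate and systematic error assessment of CAD models in aviation equipment manufacturing, in particular how to carry out multi-level error analysis on a high-precision manufacturing-and-measurement platform. The key to its solution is HEA-MM, a hierarchical error assessment framework that acquires 3D measurements of workpieces with structured light scanners, registers them to the reference CAD model, and then analyzes errors at three levels: global, part, and feature. The part level introduces an optimization-based primitive refinement method that uses splitting and merging operations to extract meaningful point cloud patches; the feature level uses a two-stage algorithm to detect and analyze circular holes accurately.

Link: https://arxiv.org/abs/2506.10594
Authors: Jin Huang,Honghua Chen,Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics; School of Artificial Intelligence, Taiyuan University of Technology; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: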

Abstract:The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on these patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
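
The feature-level step above fits circles to detected edge points. As a rough illustration of what fitting one circle involves, here is a minimal sketch of an algebraic (Kasa) least-squares circle fit. This is a swapped-in textbook technique, not the paper's hypothesize-and-clusterize framework, and the function name `fit_circle` is hypothetical.

```python
import math


def fit_circle(points):
    """Algebraic least-squares circle fit (illustrative; not the paper's method).

    Fits x^2 + y^2 + D*x + E*y + F = 0 to 2D points and returns (cx, cy, r).
    """
    # Build the normal equations for p = (D, E, F): each data row is (x, y, 1)
    # and the target is z = -(x^2 + y^2).
    sxx = sxy = syy = sx = sy = 0.0
    bx = by = b1 = 0.0
    for x, y in points:
        z = -(x * x + y * y)
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        bx += x * z
        by += y * z
        b1 += z
    m = [
        [sxx, sxy, sx, bx],
        [sxy, syy, sy, by],
        [sx, sy, float(len(points)), b1],
    ]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("points are degenerate (e.g. collinear)")
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                for c in range(col, 4):
                    m[row][c] -= factor * m[col][c]
    d, e, f = (m[i][3] / m[i][i] for i in range(3))
    # Convert the algebraic parameters back to center and radius.
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)
```

On noise-free points the fit recovers the circle exactly up to floating-point error; with noisy edge points it tends to be biased toward smaller radii, which is one reason robust pipelines combine fitting with hypothesis generation and clustering.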

[CV-39] Rethinking Random Masking in Self Distillation on ViT

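TL;DR: This paper examines the concern that random masking in self-distillation frameworks such as DINO may inadvertently eliminate critical semantic information and hurt model performance. The key to its solution is an asymmetric random masking strategy: masking is applied only to the student's global view, while the student's local views and the teacher's global view are left unmasked. This keeps the supervision signal clean while the masked input improves robustness.

Link: https://arxiv.org/abs/2506.10582
Authors: Jihyeon Seong,Hyunkyung Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages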

Abstract:Vision Transformers (ViTs) have demonstrated remarkable performance across a wide range of vision tasks. In particular, self-distillation frameworks such as DINO have contributed significantly to these advances. Within such frameworks, random masking is often utilized to improve training efficiency and introduce regularization. However, recent studies have raised concerns that indiscriminate random masking may inadvertently eliminate critical semantic information, motivating the development of more informed masking strategies. In this study, we explore the role of random masking in the self-distillation setting, focusing on the DINO framework. Specifically, we apply random masking exclusively to the student’s global view, while preserving the student’s local views and the teacher’s global view in their original, unmasked forms. This design leverages DINO’s multi-view augmentation scheme to retain clean supervision while inducing robustness through masked inputs. We evaluate our approach using DINO-Tiny on the mini-ImageNet dataset and show that random masking under this asymmetric setup yields more robust and fine-grained attention maps, ultimately enhancing downstream performance.
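
The asymmetric setup described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it models a single global view (DINO normally uses two), works on patch indices rather than image tensors, and the names `random_mask` and `build_masks` as well as the mask ratio are hypothetical.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch that is masked out."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def build_masks(num_global_patches, num_local_patches, num_local_views,
                mask_ratio, seed=0):
    """Asymmetric masking: only the student's global view receives a mask."""
    rng = random.Random(seed)
    return {
        # Student global view: randomly masked.
        "student_global": random_mask(num_global_patches, mask_ratio, rng),
        # Student local views: left unmasked.
        "student_local": [[False] * num_local_patches
                          for _ in range(num_local_views)],
        # Teacher global view: left unmasked, so the target stays clean.
        "teacher_global": [False] * num_global_patches,
    }
```

For example, `build_masks(196, 36, 6, 0.3)` masks 58 of the 196 global-view patches for the student and none anywhere else.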

[CV-40] Transformer IMU Calibrator: Dynamic On-body IMU Calibration for Inertial Motion Capture SIGGRAPH2025

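TL;DR: This paper addresses IMU (inertial measurement unit) calibration in sparse inertial motion capture systems, and in particular breaks the absolute static assumption of traditional calibration, namely that the coordinate drift R_{G'G} and the measurement offset R_{BS} remain constant throughout the motion. The key to its solution is a dynamic calibration method that estimates R_{G'G} and R_{BS} in real time under two relaxed assumptions: the matrices change negligibly within a short time window, and the human movements/IMU readings within that window are diverse. The first assumption reduces the number of candidate matrices and the second supplies diverse constraints, which together shrink the solution space enough to estimate R_{G'G} and R_{BS} accurately from a short history of readings. The authors also design a calibration trigger based on the diversity of IMU readings to ensure the second assumption holds, and report the first implicit IMU calibration, in which IMUs are put into use seamlessly without an explicit calibration process.

Link: https://arxiv.org/abs/2506.10580
Authors: Chengxu Zuo,Jiawei Huang,Xiao Jiang,Yuan Yao,Xiangren Shi,Rui Cao,Xinyu Yi,Feng Xu,Shihui Guo,Yipeng Qin
Affiliations: Xiamen University; Bournemouth University; Cardiff University; Tsinghua University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by SIGGRAPH 2025 (TOG)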

Abstract:In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift R_{G'G} and measurement offset R_{BS} remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of R_{G'G} and R_{BS} under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of R_{G'G} and R_{BS} from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired R_{G'G}, R_{BS} matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at this https URL.

[CV-41] Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres

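TL;DR: This paper addresses the failure of current diffusion models to preserve class geometry when handling hyperspherical data. Standard diffusion models rely on isotropic Gaussian noise in the forward process, which suits Euclidean spaces but ignores the non-Euclidean distributions, such as hyperspherical manifolds, that much real-world data follows. To overcome this limitation, the paper proposes HyperSphereDiff, whose key idea is to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty.

Link: https://arxiv.org/abs/2506.10576
Authors: Muskan Dosi,Chiranjeev Chiranjeev,Kartik Thakral,Mayank Vatsa,Richa Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: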

Abstract:Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: this https URL

[CV-42] Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

TL;DR: This paper addresses the modality gap between images and text that arises when texts are used as images (TaI) for parameter-efficient fine-tuning (PEFT), a gap that limits image recognition performance. The key to its solution is T2I-PAL, which uses pretrained text-to-image generation models to produce photo-realistic and diverse images from text captions, thereby narrowing the gap. T2I-PAL also combines class-wise heatmaps and learnable prototypes to strengthen multi-label image recognition, and merges prompt tuning with adapter learning to improve classification performance.

Link: https://arxiv.org/abs/2506.10575
Authors: Chun-Mei Feng,Kai Yu,Xinxing Xu,Salman Khan,Rick Siow Mong Goh,Wangmeng Zuo,Yong Liu
Affiliations: Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR); University of Minnesota; Microsoft Research Asia Singapore; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Australian National University; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Benefiting from image-text contrastive learning, pre-trained vision-language models, e.g., CLIP, make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes. This aggregates local similarities, making the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine both prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that our T2I-PAL can boost recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.

[CV-43] DanceChat: Large Language Model-Guided Music-to-Dance Generation CEC

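TL;DR: This paper addresses the semantic gap in music-to-dance generation: music offers only abstract cues (such as melody, rhythm, and emotion) without specifying concrete body movements, and a single piece of music admits multiple plausible dance interpretations, so generated dances tend to lack diversity. The key to its solution is guidance from a Large Language Model (LLM): the LLM acts as a choreographer that generates textual motion instructions, giving dance generation explicit high-level guidance and improving both the diversity of the generated dance and its alignment with musical style.

Link: https://arxiv.org/abs/2506.10574
Authors: Qing Wang,Xiaohang Yang,Yilan Dong,Naveen Raj Govindaraj,Gregory Slabaugh,Shanxin Yuan
Affiliations: Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: check demos at this https URL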

Abstract:Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model’s ability to learn diverse dance patterns. In this paper, we introduce DanceChat, a Large Language Model (LLM)-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.

[CV-44] Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration

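TL;DR: This paper addresses challenges in medical visual representation learning that stem from data scarcity, in particular the complex discourse relations and semantic pathologies found in lengthy reports. The key to its solution is PLACE, a framework that promotes Pathological-Level Alignment and enriches fine-grained details via Correlation Exploration without additional human annotations. Specifically, it introduces a pathological-level cross-modal alignment (PCMA) approach that maximizes the consistency of pathology observations between images and reports, and designs a proxy task that pushes the model to identify correlations among image patches, improving performance on a range of downstream tasks.

Link: https://arxiv.org/abs/2506.10573
Authors: Jun Wang,Lixing Zhu,Xiaohan Yu,Abhir Bhalerao,Yulan He
Affiliations: University of Warwick; King’s College London; Macquarie University; Griffith University; Alan Turing Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 10 tables and 6 figures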

Abstract:Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.

[CV-45] DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

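TL;DR: This paper addresses the generation of high-fidelity human-product demonstration videos for e-commerce and digital marketing, where existing frameworks struggle to preserve the identities of both the human and the product while also understanding their spatial relationship, which leads to unrealistic results and unnatural interactions. The key to its solution is a Diffusion Transformer (DiT)-based framework that injects paired human-product reference information and uses an additional masked cross-attention mechanism to preserve both human identity and product-specific details such as logos and textures. It uses a 3D body mesh template and product bounding boxes for precise motion guidance, and structured text encoding to improve 3D consistency under small rotations.

Link: https://arxiv.org/abs/2506.10568
Authors: Lizhen Wang,Zhurong Xia,Tianshu Hu,Pengrui Wang,Pengfei Wang,Zerong Zheng,Ming Zhou
Affiliations: ByteDance Intelligent Creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: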

Abstract:In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: this https URL.

[CV-46] LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System ECCV2024

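TL;DR: This paper addresses the challenges dense visual SLAM (Simultaneous Localization and Mapping) faces in real-time performance, robustness, and scalability to large scenes, especially the high computational cost and memory requirements of systems built on RGB-D cameras. The key to its solution is LRSLAM, a more efficient visual SLAM model that uses low-rank tensor decomposition, specifically the Six-axis and CP decompositions, to achieve faster convergence, better memory efficiency, and higher reconstruction and localization quality.

Link: https://arxiv.org/abs/2506.10567
Authors: Hongbeen Park,Minjeong Park,Giljoo Nam,Jinkyu Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ECCV 2024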

Abstract:Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in achieving real-time performance, robustness, and scalability for large-scale scenes. Recent approaches utilizing neural implicit scene representations show promise but suffer from high computational costs and memory requirements. ESLAM introduced a plane-based tensor decomposition but still struggled with memory growth. Addressing these challenges, we propose a more efficient visual SLAM model, called LRSLAM, utilizing low-rank tensor decomposition methods. Our approach, leveraging the Six-axis and CP decompositions, achieves better convergence rates, memory efficiency, and reconstruction/localization quality than existing state-of-the-art approaches. Evaluation across diverse indoor RGB-D datasets demonstrates LRSLAM’s superior performance in terms of parameter efficiency, processing time, and accuracy, retaining reconstruction and localization quality. Our code will be publicly available upon publication.
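
To make the memory argument behind low-rank decomposition concrete, here is a minimal sketch of a CP (CANDECOMP/PARAFAC) factorization of a 3D grid. It illustrates only the general CP definition, not LRSLAM's implementation; the function names and the grid size and rank in the usage note are hypothetical.

```python
def cp_value(factors, index):
    """Evaluate one entry of a rank-R CP tensor.

    factors[axis][i] is the length-R factor row for index i along that axis;
    the entry is the sum over r of the product of the selected factor values.
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for axis, i in enumerate(index):
            term *= factors[axis][i][r]
        total += term
    return total


def cp_param_count(shape, rank):
    """Stored parameters for CP factors: one (size x rank) matrix per axis."""
    return rank * sum(shape)


def dense_param_count(shape):
    """Stored parameters for the equivalent dense grid."""
    count = 1
    for size in shape:
        count *= size
    return count
```

For a 128 x 128 x 128 grid at rank 8, the factors hold 3,072 values against 2,097,152 for the dense grid, which is the kind of saving that motivates low-rank scene representations.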

[CV-47] Balancing Tails when Comparing Distributions: Comprehensive Equity Index (CEI) with Application to Bias Evaluation in Operational Face Biometrics

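TL;DR: This paper addresses demographic bias in high-performance face recognition (FR) systems that existing metrics fail to detect, especially subtle disparities in the tails of the score distributions. The key to its solution is the Comprehensive Equity Index (CEI), a metric that analyzes genuine and impostor score distributions separately, allowing a configurable focus on tail probabilities while still accounting for overall distribution shape, which makes it more sensitive to subtle bias. CEI^A, an automated version of the metric, further improves objectivity and practicality.

Link: https://arxiv.org/abs/2506.10564
Authors: Imanol Solano,Julian Fierrez,Aythami Morales,Alejandro Peña,Ruben Tolosana,Francisco Zamora-Martinez,Javier San Agustin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: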

Abstract:Demographic bias in high-performance face recognition (FR) systems often eludes detection by existing metrics, especially with respect to subtle disparities in the tails of the score distribution. We introduce the Comprehensive Equity Index (CEI), a novel metric designed to address this limitation. CEI uniquely analyzes genuine and impostor score distributions separately, enabling a configurable focus on tail probabilities while also considering overall distribution shapes. Our extensive experiments (evaluating state-of-the-art FR systems, intentionally biased models, and diverse datasets) confirm CEI’s superior ability to detect nuanced biases where previous methods fall short. Furthermore, we present CEI^A, an automated version of the metric that enhances objectivity and simplifies practical application. CEI provides a robust and sensitive tool for operational FR fairness assessment. The proposed methods have been developed particularly for bias evaluation in face biometrics but, in general, they are applicable for comparing statistical distributions in any problem where one is interested in analyzing the distribution tails.

[CV-48] From Images to Insights: Explainable Biodiversity Monitoring with Plain Language Habitat Explanations

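TL;DR: This paper addresses how to explain why a species lives in a particular environment, which matters for understanding ecosystems and conserving biodiversity. Existing ecological workflows are fragmented and hard for non-specialists to use. The key to its solution is an end-to-end visual-to-causal framework that turns a species image into interpretable causal insights. It integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction, applies modern causal inference methods to discover causal structure among environmental features, and uses structured templates and large language models to generate statistically grounded, human-readable causal explanations.

Link: https://arxiv.org/abs/2506.10559
Authors: Yutong Zhou,Masahiro Ryo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Code will be released at: this https URL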

Abstract:Explaining why the species lives at a particular location is important for understanding ecological systems and conserving biodiversity. However, existing ecological workflows are fragmented and often inaccessible to non-specialists. We propose an end-to-end visual-to-causal framework that transforms a species image into interpretable causal insights about its habitat preference. The system integrates species recognition, global occurrence retrieval, pseudo-absence sampling, and climate data extraction. We then discover causal structures among environmental features and estimate their influence on species occurrence using modern causal inference methods. Finally, we generate statistically grounded, human-readable causal explanations from structured templates and large language models. We demonstrate the framework on a bee and a flower species and report early results as part of an ongoing project, showing the potential of the multimodal AI assistant backed up by a recommended ecological modeling practice for describing species habitat in human-understandable language.

[CV-49] ContextRefine-CLIP for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2025

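TL;DR: This paper addresses semantic alignment and context-aware joint representation generation in visual-textual multi-instance retrieval. The key to its solution is a cross-modal attention flow module, built on the dual-encoder AVION architecture, that enables bidirectional dynamic interaction and refinement between visual and textual features and so produces more context-aware joint representations. Combining soft-label relevance matrices with the Symmetric Multi-Similarity Loss further improves the accuracy of semantic alignment.

Link: https://arxiv.org/abs/2506.10550
Authors: Jing He,Yiqing Wang,Lingling Li,Kexin Zhang,Puhua Chen
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: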

Abstract:This report presents ContextRefine-CLIP (CR-CLIP), an efficient model for visual-textual multi-instance retrieval tasks. The approach is based on the dual-encoder AVION, on which we introduce a cross-modal attention flow module to achieve bidirectional dynamic interaction and refinement between visual and textual features to generate more context-aware joint representations. For soft-label relevance matrices provided in tasks such as EPIC-KITCHENS-100, CR-CLIP can work with Symmetric Multi-Similarity Loss to achieve more accurate semantic alignment and optimization using the refined features. Without using ensemble learning, the CR-CLIP model achieves 66.78 mAP and 82.08 nDCG on the EPIC-KITCHENS-100 public leaderboard, which significantly outperforms the baseline model and fully validates its effectiveness in cross-modal retrieval. The code will be released open-source on this https URL

[CV-50] AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

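TL;DR: This paper addresses the difficulty of generating coherent storytelling videos, especially ones that span multiple scenes and characters. Existing methods typically convert pre-generated keyframes rigidly into fixed-length clips, which causes disjointed narratives and pacing problems, and because video generation models are unstable, a single low-quality clip can significantly degrade the logical coherence and visual continuity of the whole animation. The proposed solution, AniMaker, is a multi-agent framework that enables efficient multi-candidate clip generation and storytelling-aware clip selection. Its core technical contributions are MCTS-Gen in the Photography Agent, a strategy inspired by Monte Carlo Tree Search that optimizes exploration of the candidate space, and AniEval in the Reviewer Agent, the first framework designed specifically for multi-shot animation evaluation, which considers the preceding and succeeding clips when assessing story-level consistency, action completion, and animation-specific features.

Link: https://arxiv.org/abs/2506.10540
Authors: Haoyuan Shi,Yunxin Li,Xinyu Chen,Longyue Wang,Baotian Hu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; Alibaba International Digital Commerce
Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
Comments: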

Abstract:Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation’s logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker’s approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
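
The abstract does not spell out how MCTS-Gen navigates the candidate space. As a generic illustration of the exploration-versus-exploitation trade-off in MCTS-style search, here is a minimal sketch of the standard UCB1 selection rule applied to candidate clip branches. It is an assumption for illustration only, not AniMaker's algorithm; the function names, the reward scale, and the constant `c` are hypothetical.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        # Unvisited branches are always tried first.
        return math.inf
    mean = total_reward / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)


def select_branch(stats, c=1.4):
    """Pick the index of the branch to expand next.

    stats is a list of (total_reward, visits) pairs, one per candidate branch.
    """
    parent_visits = sum(visits for _, visits in stats)
    scores = [ucb1(reward, visits, parent_visits, c) for reward, visits in stats]
    return max(range(len(stats)), key=scores.__getitem__)
```

With equal visit counts the rule prefers the branch with the higher mean score; with equal means it prefers the less-visited branch, so generation budget is not spent only on early favorites.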

[CV-51] SLICK: Selective Localization and Instance Calibration for Knowledge-Enhanced Car Damage Segmentation in Automotive Insurance

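Quick read: This paper addresses car damage segmentation in real-world settings, in particular precise and robust segmentation under occlusion, deformation, or paint loss. The key to its solution is SLICK, a framework that combines structural priors with domain knowledge through five core components: selective part segmentation on a high-resolution semantic backbone guided by structural priors, localization-aware attention blocks, an instance-sensitive refinement head, cross-channel calibration, and a knowledge fusion module, which together improve the accuracy and generalization of damage detection.

Link: https://arxiv.org/abs/2506.10528
Authors: Teerapong Panboonyuen
Affiliations: MARSAIL (Motor AI Recognition Solution Artificial Intelligence Laboratory)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages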

Abstract:We present SLICK, a novel framework for precise and robust car damage segmentation that leverages structural priors and domain knowledge to tackle real-world automotive inspection challenges. SLICK introduces five key components: (1) Selective Part Segmentation using a high-resolution semantic backbone guided by structural priors to achieve surgical accuracy in segmenting vehicle parts even under occlusion, deformation, or paint loss; (2) Localization-Aware Attention blocks that dynamically focus on damaged regions, enhancing fine-grained damage detection in cluttered and complex street scenes; (3) an Instance-Sensitive Refinement head that leverages panoptic cues and shape priors to disentangle overlapping or adjacent parts, enabling precise boundary alignment; (4) Cross-Channel Calibration through multi-scale channel attention that amplifies subtle damage signals such as scratches and dents while suppressing noise like reflections and decals; and (5) a Knowledge Fusion Module that integrates synthetic crash data, part geometry, and real-world insurance datasets to improve generalization and handle rare cases effectively. Experiments on large-scale automotive datasets demonstrate SLICK’s superior segmentation performance, robustness, and practical applicability for insurance and automotive inspection workflows.

[CV-52] ALBERT: Advanced Localization and Bidirectional Encoder Representations from Transformers for Automotive Damage Evaluation

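Quick read: This paper targets accuracy and automation in car damage detection and part segmentation, in particular distinguishing real from fake damage and precisely segmenting complex car parts. The key to its solution is ALBERT, a model built on Bidirectional Encoder Representations and combined with advanced localization mechanisms, which identifies and segments 26 damage types, 7 fake damage variants, and 61 distinct car parts.

Link: https://arxiv.org/abs/2506.10524
Authors: Teerapong Panboonyuen
Affiliations: MARSAIL (Motor AI Recognition Solution Artificial Intelligence Laboratory)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages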

Abstract:This paper introduces ALBERT, an instance segmentation model specifically designed for comprehensive car damage and part segmentation. Leveraging the power of Bidirectional Encoder Representations, ALBERT incorporates advanced localization mechanisms to accurately identify and differentiate between real and fake damages, as well as segment individual car parts. The model is trained on a large-scale, richly annotated automotive dataset that categorizes damage into 26 types, identifies 7 fake damage variants, and segments 61 distinct car parts. Our approach demonstrates strong performance in both segmentation accuracy and damage classification, paving the way for intelligent automotive inspection and assessment applications.

[CV-53] CogStream: Context-guided Streaming Video Question Answering

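Quick read: This paper addresses two problems that Video Large Language Models (Vid-LLMs) face in streaming video reasoning because of their reliance on contextual information: heavy computational cost and distraction from irrelevant context. The key to its solution is a task called Context-guided Streaming Video Reasoning (CogStream), which requires a model to identify the most relevant historical context in order to answer questions about the current video stream, together with a baseline model, CogReasoner, that handles the task efficiently through visual stream compression and historical dialogue retrieval.

Link: https://arxiv.org/abs/2506.10516
Authors: Zicheng Zhao,Kangyu Wang,Shijie Li,Rui Qian,Weiyao Lin,Huabin Liu
Affiliations: Shanghai Jiao Tong University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: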

Abstract:Despite advancements in Video Large Language Models (Vid-LLMs) improving multimodal understanding, challenges persist in streaming video reasoning due to its reliance on contextual information. Existing paradigms feed all available historical contextual information into Vid-LLMs, resulting in a significant computational burden for visual data processing. Furthermore, the inclusion of irrelevant context distracts models from key details. This paper introduces a challenging task called Context-guided Streaming Video Reasoning (CogStream), which simulates real-world streaming video scenarios, requiring models to identify the most relevant historical contextual information to deduce answers for questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive and hierarchical question-answer pairs, generated by a semi-automatic pipeline. Additionally, we present CogReasoner as a baseline model. It efficiently tackles this task by leveraging visual stream compression and historical dialogue retrieval. Extensive experiments prove the effectiveness of this method. Code will be released soon.
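
The retrieval idea in this abstract, picking the most relevant pieces of historical context for the current question, can be sketched as a top-k cosine-similarity lookup. This is only an illustration and not CogReasoner's actual retrieval method: the function names, the toy three-dimensional embeddings, and the `top_k` parameter are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_relevant(history, query_vec, top_k):
    """Return the texts of the top_k history items most similar to the query.

    history: list of (text, embedding) pairs; the embeddings are hypothetical.
    """
    ranked = sorted(history, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy history of earlier question-answer turns with made-up embeddings.
history = [
    ("Q1: what color is the car?", [1.0, 0.0, 0.0]),
    ("Q2: who enters the room?", [0.0, 1.0, 0.0]),
    ("Q3: where does the car park?", [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a new car-related question
selected = retrieve_relevant(history, query, top_k=2)
```

Only the selected turns would then be passed on to the model, which is how retrieval keeps irrelevant context out of the input.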

[CV-54] Edit360: 2D Image Edits to 3D Assets from Any Angle

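Quick read: This paper addresses the multi-view consistency problem that arises when 2D image generation and editing capabilities are extended to 3D assets, especially for fine-grained edits. Existing methods usually restrict editing to predetermined viewing angles, which limits flexibility and practical use. The key to its solution is Edit360, a tuning-free framework for multi-view consistent 3D editing. Its core is a novel Anchor-View Editing Propagation mechanism that aligns and merges multi-view information in the latent and attention spaces of diffusion models, keeping the structure coherent across all views.

Link: https://arxiv.org/abs/2506.10507
Authors: Junchao Huang,Xinting Hu,Zhuotao Tian,Shaoshuai Shi,Li Jiang
Affiliations: The Chinese University of Hong Kong, Shenzhen; Nanyang Technological University; Harbin Institute of Technology, Shenzhen; Voyager Research, Didi Chuxing
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 9 figures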

Abstract:Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.

[CV-55] J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft

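Quick read: This paper addresses surface damage detection and localization on fighter aircraft, where manual inspection is limited in scalability, efficiency, and consistency. The key to its solution is J-DDL, a smart surface damage detection and localization system that fuses 2D images with 3D point clouds. Its YOLO-based damage detection network combines lightweight Fasternet blocks, an optimized neck with Efficient Multiscale Attention (EMA) modules, and a new loss function, Inner-CIOU, to identify damage accurately and localize it in 3D.

Link: https://arxiv.org/abs/2506.10505
Authors: Jin Huang,Mingqiang Wei,Zikuan Li,Hangyu Qu,Wei Zhao,Xinyu Bai
Affiliations: Nanjing University of Aeronautics and Astronautics; Taiyuan University of Technology; Nanjing Institute of Technology; Avic Shenyang Aircraft Company Limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: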

Abstract:Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.

[CV-56] Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation

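Quick read: This paper addresses the reliance on dense annotations and the difficulty of interpreting complex scenes in Reference Remote Sensing Image Segmentation (RRSIS). The key to its solution is a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which splits RRSIS into coarse localization and fine segmentation. A visual grounding network first roughly locates the object described by the text; the Segment Anything Model (SAM), aided by a clustering-based foreground point generator and an iterative mask boundary optimization strategy, then segments it precisely. The second stage can be training-free, which significantly reduces the need for annotated data.

Link: https://arxiv.org/abs/2506.10503
Authors: Shuyang Li,Shuang Wang,Zhuangzhuang Sun,Jing Xiao
Affiliations: Xidian University; Shaanxi Satellite Application Center for Natural Resources
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: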

Abstract:The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, which has attracted widespread attention and research interest. Current RRSIS methods rely on multi-modal fusion backbones and semantic segmentation heads but face challenges like dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt-generated semantic localization guiding Segment Anything Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text-described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering-based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be train-free, significantly reducing the annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows for focusing on specific region segmentation, avoiding interference from complex scenes. We further contribute a high-quality, multi-category manually annotated dataset. Experimental validation on two datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant performance improvements and surpasses existing state-of-the-art methods. The code will be made publicly available.

[CV-57] Class-Incremental Learning for Honey Botanical Origin Classification with Hyperspectral Images: A Study with Continual Backpropagation

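Quick read: This paper addresses botanical origin classification of honey, where it is impractical to collect every honey variety at once to train a model, which limits how well honeys of different botanical origins can be told apart. The key to its solution is class-incremental learning (CIL), plus a new method that combines CIL with the continual backpropagation (CB) algorithm: CB reinitializes a proportion of less-used hidden neurons to restore plasticity in the network, which improves the performance of CIL algorithms.

Link: https://arxiv.org/abs/2506.10489
Authors: Guyang Zhang,Waleed Abdulla
Affiliations: University of Auckland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: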

Abstract:Honey is an important commodity in the global market. Honey types of different botanical origins provide diversified flavors and health benefits, thus having different market values. Developing accurate and effective botanical origin-distinguishing techniques is crucial to protect consumers’ interests. However, it is impractical to collect all the varieties of honey products at once to train a model for botanical origin differentiation. Therefore, researchers developed class-incremental learning (CIL) techniques to address this challenge. This study examined and compared multiple CIL algorithms on a real-world honey hyperspectral imaging dataset. A novel technique is also proposed to improve the performance of class-incremental learning algorithms by combining with a continual backpropagation (CB) algorithm. The CB method addresses the issue of loss-of-plasticity by reinitializing a proportion of less-used hidden neurons to inject variability into neural networks. Experiments showed that CB improved the performance of most CIL methods by 1-7%.
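
The reinitialization step described in this abstract (resetting a proportion of less-used hidden neurons) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the per-unit `utility` scores are taken as given, and the reset rule (fresh random incoming weights, zeroed outgoing weights) and the `fraction` parameter are choices made for the example.

```python
import random


def reinit_least_used(w_in, w_out, utility, fraction, rng):
    """Reinitialize the `fraction` of hidden units with the lowest utility.

    w_in[j]  : incoming weights of hidden unit j
    w_out[j] : outgoing weights of hidden unit j
    utility  : one usage score per hidden unit (how it is measured is an assumption)
    Returns the sorted indices of the units that were reset.
    """
    n_hidden = len(utility)
    n_reset = int(fraction * n_hidden)
    # Hidden units ordered from least used to most used.
    order = sorted(range(n_hidden), key=lambda j: utility[j])
    reset = order[:n_reset]
    for j in reset:
        # Fresh random incoming weights give the unit a new chance to learn.
        w_in[j] = [rng.uniform(-0.1, 0.1) for _ in w_in[j]]
        # Zeroed outgoing weights keep the reset from changing current outputs.
        w_out[j] = [0.0 for _ in w_out[j]]
        utility[j] = 0.0
    return sorted(reset)


rng = random.Random(0)
w_in = [[1.0, 1.0] for _ in range(4)]   # 4 hidden units, 2 inputs each
w_out = [[1.0] for _ in range(4)]       # 1 output
utility = [0.9, 0.05, 0.6, 0.01]        # hypothetical usage scores
reset_units = reinit_least_used(w_in, w_out, utility, fraction=0.5, rng=rng)
```

In a training loop, a step like this would run periodically between gradient updates, so that rarely used units are recycled while the heavily used ones keep what they have learned.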

[CV-58] Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation

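Quick read: This paper addresses a long-standing gap in Optical Music Recognition (OMR) evaluation: the lack of a dedicated benchmark dataset and a matching metric. The key to its solution is the Sheet Music Benchmark (SMB) dataset and the OMR Normalized Edit Distance (OMR-NED) metric. SMB contains 685 pages of sheet music covering a range of musical textures, encoded in the Humdrum **kern format. OMR-NED builds on the Symbol Error Rate (SER) and provides fine-grained error analysis over individual notation elements (such as note heads, beams, pitches, and accidentals), allowing more precise performance evaluation.

Link: https://arxiv.org/abs/2506.10488
Authors: Juan C. Martinez-Sevilla,Joan Cerveto-Serrano,Noelia Luna,Greg Chapman,Craig Sapp,David Rizo,Jorge Calvo-Zaragoza
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: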

Abstract:In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.
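
For readers unfamiliar with the Symbol Error Rate that OMR-NED builds on, the sketch below computes a plain SER: the token-level edit distance between a predicted and a reference sequence, divided by the reference length. It is not the OMR-NED metric itself, whose per-element breakdown is defined in the paper; the function names and the toy tokens are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical note tokens: the prediction drops one note and adds another.
reference = ["4c", "4d", "4e", "4f"]
prediction = ["4c", "4e", "4f", "4g"]
ser = symbol_error_rate(reference, prediction)
```

A single number like this says how wrong a transcription is but not which kind of symbol was wrong, which is the gap a fine-grained metric is meant to close.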

[CV-59] LLMs Are Not Yet Ready for Deepfake Image Detection

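Quick read: This paper examines the threat that deepfakes pose to media integrity and public trust, and explores whether vision-language models (VLMs) are suitable for deepfake detection. The key is a structured zero-shot evaluation of four prominent VLMs (ChatGPT, Claude, Gemini, and Grok) on three main deepfake types (faceswap, reenactment, and synthetic generation), measuring classification accuracy and reasoning depth. The evaluation assesses their ability to act as standalone detectors and highlights their strengths in interpretability and contextual analysis, which supports their use in hybrid or human-in-the-loop detection frameworks.

Link: https://arxiv.org/abs/2506.10474
Authors: Shahroz Tariq,David Nguyen,M.A.P. Chamikara,Tingmin Wu,Alsharif Abuadbba,Kristen Moore
Affiliations: CSIRO’s Data61, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, and 2 tables. Paper is under review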

Abstract:The growing sophistication of deepfakes presents substantial challenges to the integrity of media and the preservation of public trust. Concurrently, vision-language models (VLMs), large language models enhanced with visual reasoning capabilities, have emerged as promising tools across various domains, sparking interest in their applicability to deepfake detection. This study conducts a structured zero-shot evaluation of four prominent VLMs: ChatGPT, Claude, Gemini, and Grok, focusing on three primary deepfake types: faceswap, reenactment, and synthetic generation. Leveraging a meticulously assembled benchmark comprising authentic and manipulated images from diverse sources, we evaluate each model’s classification accuracy and reasoning depth. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems. We highlight critical failure modes, such as an overemphasis on stylistic elements and vulnerability to misleading visual patterns like vintage aesthetics. Nevertheless, VLMs exhibit strengths in interpretability and contextual analysis, suggesting their potential to augment human expertise in forensic workflows. These insights imply that although general-purpose models currently lack the reliability needed for autonomous deepfake detection, they hold promise as integral components in hybrid or human-in-the-loop detection frameworks.

[CV-60] Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On

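Quick read: This paper addresses the limitations of existing image-based virtual try-on methods, which are often restricted to the front view and lack real-time performance, and of per-garment virtual try-on methods, which suffer from costly data capture and misalignment between garment and body. The key to its solution is a low-barrier approach that collects per-garment datasets with real human bodies instead of an expensive robotic mannequin, plus a hybrid person representation that augments the existing intermediate representation with a simplified DensePose map. This aligns synthesized garment images accurately with the body and enables human-garment interaction without customized wearable devices.

Link: https://arxiv.org/abs/2506.10468
Authors: Zaiqiang Wu,Yechen Li,Jingyuan Liu,Yuki Shibata,Takayuki Hori,I-Chao Shen,Takeo Igarashi
Affiliations: The University of Tokyo; SoftBank Corp
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: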

Abstract:Existing image-based virtual try-on methods are often limited to the front view and lack real-time performance. While per-garment virtual try-on methods have tackled these issues by capturing per-garment datasets and training per-garment neural networks, they still encounter practical limitations: (1) the robotic mannequin used to capture per-garment datasets is prohibitively expensive for widespread adoption and fails to accurately replicate natural human body deformation; (2) the synthesized garments often misalign with the human body. To address these challenges, we propose a low-barrier approach for collecting per-garment datasets using real human bodies, eliminating the necessity for a customized robotic mannequin. We also introduce a hybrid person representation that enhances the existing intermediate representation with a simplified DensePose map. This ensures accurate alignment of synthesized garment images with the human body and enables human-garment interaction without the need for customized wearable devices. We performed qualitative and quantitative evaluations against other state-of-the-art image-based virtual try-on methods and conducted ablation studies to demonstrate the superiority of our method regarding image quality and temporal consistency. Finally, our user study results indicated that most participants found our virtual try-on system helpful for making garment purchasing decisions.

[CV-61] MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models

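Quick read: This paper addresses two problems in medical image segmentation: existing models rely on explicit human instructions and lack the active reasoning needed to understand complex clinical questions, and existing multimodal methods struggle to generate precise segmentation masks. The key to its solution is MedSeg-R, a framework that uses the reasoning abilities of multimodal large language models (MLLMs) to interpret clinical questions and produce the corresponding precise segmentation masks. Its core components are a global context understanding module and a pixel-level grounding module.

Link: https://arxiv.org/abs/2506.10465
Authors: Yu Huang,Zelin Peng,Yichen Zhao,Piao Yang,Xiaokang Yang,Wei Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: †: Equal contribution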

Abstract:Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also capable of producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R’s superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.

[CV-62] Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization ICCV ICCV2023

Quick read: This paper addresses the lack of research on how initial conditions affect the quantization robustness of deep neural networks (DNNs). Existing methods focus on quantization-specific techniques applied during model development and overlook the effect of weight initialization. The key to its solution is a quantization-robust initialization method based on Graph Hypernetworks (GHN): a pretrained GHN predicts the parameters of quantized models, and finetuning the GHN on quantized graphs (GHN-QAT) further improves quantized accuracy, with significant gains even at 4 bits and better-than-random accuracy at 2 bits.

Link: https://arxiv.org/abs/2506.10463
Authors: Stone Yun,Alexander Wong
Affiliations: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Portions of this article have been presented as extended abstracts at the ICCV 2023 Workshop on Low Bit Quantized Neural Networks (ICCVW-LBQNN 2023) and the 2020 Conference on Vision and Intelligent Systems (CVIS 2020). arXiv admin note: text overlap with arXiv:2011.14578, arXiv:2208.12489, arXiv:2309.13773

Abstract:Deep neural network (DNN) quantization for fast, efficient inference has been an important tool in limiting the cost of machine learning (ML) model inference. Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs. However, very little exploration has been done on improving the initial conditions of DNN training for quantization. Just as random weight initialization has been shown to significantly impact test accuracy of floating point models, it would make sense that different weight initialization methods impact quantization robustness of trained models. We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs. This analysis reveals that even with varying CNN architectures, the choice of random weight initializer can significantly affect final quantization robustness. Next, we explore a new method for quantization-robust CNN initialization – using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs. Besides showing that GHN-predicted parameters are quantization-robust after regular float32 pretraining (of the GHN), we find that finetuning GHNs to predict parameters for quantized graphs (which we call GHN-QAT) can further improve quantized accuracy of CNNs. Notably, GHN-QAT shows significant accuracy improvements for even 4-bit quantization and better-than-random accuracy for 2-bits. To the best of our knowledge, this is the first in-depth study on quantization-aware DNN weight initialization. GHN-QAT offers a novel approach to quantized DNN model design. Future investigations, such as using GHN-QAT-initialized parameters for quantization-aware training, can further streamline the DNN quantization process.
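
To make concrete why the distribution of weights matters at low bit widths, the sketch below applies symmetric uniform quantization to two small weight sets and compares the rounding error. It is a generic illustration, not the quantization scheme or the GHN-based method from the paper; the function names and the toy weight values are assumptions.

```python
def fake_quantize(weights, bits):
    """Symmetric uniform quantization followed by dequantization."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a symmetric signed grid")
    levels = 2 ** (bits - 1) - 1           # e.g. 7 positive levels at 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / levels               # step size set by the largest weight
    return [round(w / scale) * scale for w in weights]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Two hypothetical weight sets: one compact, one with a single large outlier.
compact = [-0.5, -0.25, 0.0, 0.25, 0.5]
with_outlier = [-0.5, -0.25, 0.0, 0.25, 4.0]

err_compact = mean_squared_error(compact, fake_quantize(compact, 4))
err_outlier = mean_squared_error(with_outlier, fake_quantize(with_outlier, 4))
```

Because the step size is set by the largest weight, a single outlier coarsens the grid for every other weight, so `err_outlier` comes out larger than `err_compact`. This is one simple way in which the weight distribution a network ends up with can affect its robustness to quantization.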

[CV-63] Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance

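Quick read: This paper addresses the limited transferability of adversarial examples against hyperspectral image (HSI) classification models, and the failure of existing work to fully exploit the structural and feature information of images. The key to its solution has two parts. First, with the image structure kept unchanged, the image is randomly divided into blocks along both the spatial and spectral dimensions, and various transformations are applied block by block to increase input diversity and mitigate overfitting. Second, a feature distancing loss on intermediate layers uses the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, with the output-layer prediction as an auxiliary loss. This guides the perturbation to disrupt the features of the true class and improves transferability.

Link: https://arxiv.org/abs/2506.10459
Authors: Chun Liu,Bingqian Zhu,Tao Xu,Zheng Zheng,Zheng Li,Wei Yang,Zhigang Han,Jiayao Wang
Affiliations: Henan University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: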

Abstract:Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block by block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.

[CV-64] Rethinking Generative Human Video Coding with Implicit Motion Transformation

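Quick read: This paper addresses the reconstruction distortion and inaccurate motion that complex and diverse motion patterns cause in human body video compression. Generative Human Video Coding (GHVC) based on explicit motion guidance performs poorly on human body videos, so the paper proposes Implicit Motion Transformation (IMT) to improve GHVC. The key is to characterize the complex human body signal as compact visual features and transform those features into implicit motion guidance for signal reconstruction.

Link: https://arxiv.org/abs/2506.10453
Authors: Bolin Chen,Ru-Ling Liao,Jie Chen,Yan Ye
Affiliations: Alibaba DAMO Academy & Hupan Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: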

Abstract:Beyond traditional hybrid-based video codec, generative video codec could achieve promising compression performance by evolving high-dimensional signals into compact feature representations for bitstream compactness at the encoder side and developing explicit motion fields as intermediate supervision for high-quality reconstruction at the decoder side. This paradigm has achieved significant success in face video compression. However, compared to facial videos, human body videos pose greater challenges due to their more complex and diverse motion patterns, i.e., when using explicit motion guidance for Generative Human Video Coding (GHVC), the reconstruction results could suffer severe distortions and inaccurate motion. As such, this paper highlights the limitations of explicit motion-based approaches for human body video compression and investigates the GHVC performance improvement with the aid of Implicit Motion Transformation, namely IMT. In particular, we propose to characterize complex human body signal into compact visual features and transform these features into implicit motion guidance for signal reconstruction. Experimental results demonstrate the effectiveness of the proposed IMT paradigm, which can facilitate GHVC to achieve high-efficiency compression and high-fidelity synthesis.

[CV-65] MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

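Quick read: This paper addresses the difficulty that traditional single-modality video summarization methods have in capturing the full semantic richness of videos as online video content grows rapidly. The key to its solution is MF2Summ, a model based on multimodal content understanding that integrates visual and auditory information in a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. The core fusion mechanism uses a cross-modal Transformer and an alignment-guided self-attention Transformer to model inter-modal dependencies and temporal correspondences.

Link: https://arxiv.org/abs/2506.10430
Authors: Shuo wang,Jihao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: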

Abstract:The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
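
The Non-Maximum Suppression step mentioned in this abstract can be sketched for one-dimensional temporal segments as follows. This is the generic greedy NMS procedure, not MF2Summ's code; the segment boundaries, scores, and the 0.5 IoU threshold are made-up values for illustration.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments, scores, iou_threshold):
    """Greedy NMS: keep the best-scoring segment, drop heavy overlaps, repeat.

    Returns the indices of the kept segments in descending score order.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical predicted segments (in frames) with importance scores.
segments = [(0, 10), (2, 11), (20, 30), (21, 29)]
scores = [0.9, 0.8, 0.6, 0.7]
kept = temporal_nms(segments, scores, iou_threshold=0.5)
```

Here the two pairs of near-duplicate segments each collapse to their higher-scoring member, leaving one candidate per region of the video for the later shot selection stage.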

[CV-66] It’s Not the Target, It’s the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations

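Quick read: This paper addresses infrared small target detection (IRSTD) in complex backgrounds, where the main challenges are low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. The key to its solution is LRRNet, an end-to-end framework that exploits the low-rank property of infrared image backgrounds. It adopts a compression-reconstruction-subtraction (CRS) paradigm to model structure-aware low-rank background representations directly in the image domain, without patch-based processing or explicit matrix decomposition. According to the authors, this is the first work to learn low-rank background structures directly with deep neural networks in an end-to-end manner.

Link: https://arxiv.org/abs/2506.10425
Authors: Guoyi Zhang,Guangsheng Xu,Siyang Chen,Han Wang,Xiaohu Zhang
Affiliations: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: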

Abstract:Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression–reconstruction–subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model’s resilience to sensor noise. The source code will be made publicly available upon acceptance.
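
The intuition behind subtracting a low-rank background can be shown on a toy image. The abstract states that LRRNet learns the background with a deep network and without explicit matrix decomposition; the sketch below deliberately swaps in an explicit rank-1 approximation (computed by power iteration) purely to illustrate why the residual exposes a small target. The function names, the 6x6 toy scene, and the iteration count are assumptions.

```python
import math


def rank1_background(image, iterations=50):
    """Rank-1 approximation of `image` (a list of rows) via power iteration."""
    rows, cols = len(image), len(image[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # u = A v, then v = A^T u, renormalized on every round.
        u = [sum(image[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(image[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return [[0.0] * cols for _ in range(rows)]
        v = [x / norm for x in v]
    # "Compress" the image to one coefficient per row, then "reconstruct".
    u = [sum(image[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def residual_map(image):
    """Subtract the low-rank background estimate from the image."""
    background = rank1_background(image)
    return [[image[i][j] - background[i][j] for j in range(len(image[0]))]
            for i in range(len(image))]


def peak_location(residual):
    """Pixel with the largest residual value."""
    cells = ((i, j) for i in range(len(residual)) for j in range(len(residual[0])))
    return max(cells, key=lambda p: residual[p[0]][p[1]])


# Toy scene: a smooth rank-1 background plus one bright pixel as the "target".
clean = [[float((i + 1) * (j + 1)) for j in range(6)] for i in range(6)]
image = [row[:] for row in clean]
image[2][3] += 10.0
peak = peak_location(residual_map(image))
```

On the clean background the residual is numerically zero, while on the image with the added pixel the residual peaks at the target location even though brighter background pixels exist elsewhere in the frame.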

[CV-67] Semi-Tensor-Product Based Convolutional Neural Networks

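Quick read: This paper targets the redundant information that padding introduces into conventional convolution, with the aim of improving convolutional neural network (CNN) performance on image and higher-order signal identification. The key to its solution is a domain-based convolutional product (CP) which, combined with the semi-tensor product (STP) of vectors, needs no padding and so avoids the junk information that padding brings, thereby improving model accuracy and efficiency.

Link: https://arxiv.org/abs/2506.10407
Authors: Daizhan Cheng
Affiliation: Chinese Academy of Sciences
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: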

Abstract:The semi-tensor product (STP) of vectors is a generalization of the conventional inner product of vectors, which allows the factor vectors to be of different dimensions. This paper proposes a domain-based convolutional product (CP). Combining the domain-based CP with the STP of vectors, a new CP is proposed. Since it uses no zero padding or any other padding, it avoids the junk information caused by padding. Using it, an STP-based convolutional neural network (CNN) is developed. Its application to image and third-order signal identification is considered.

[CV-68] Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

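Quick read: This paper addresses the tendency of unified multimodal models to underperform specialized models on image understanding and image generation. The core challenge is the inherent difference between the two tasks in the visual features they need and the training processes they require. The key to its solution is Pisces, an auto-regressive multimodal foundation model that tackles this challenge with a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation.

Link: https://arxiv.org/abs/2506.10395
Authors: Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie
Affiliation: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Unified image understanding and generation model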

Abstract:Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.

[CV-69] ReconMOST: Multi-Layer Sea Temperature Reconstruction with Observations-Guided Diffusion

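Quick read: This paper addresses sparse data, algorithmic complexity, and high computational cost in reconstructing ocean temperature fields, along with the limitations of existing methods in scenarios such as cloud occlusion. The key to its solution is ReconMOST, a framework built on a data-driven diffusion model: an unconditional diffusion model is pre-trained to learn physically consistent distribution patterns of ocean temperature fields, and during generation high-accuracy in-situ observations serve as guidance points, yielding accurate multi-layer sea temperature reconstruction. In regions without direct observations, the physically consistent spatial patterns learned during pre-training provide implicit guidance and physically plausible reconstructions.

Link: https://arxiv.org/abs/2506.10391
Authors: Yuanyi Song, Pumeng Lyu, Ben Fei, Fenghua Ling, Wanli Ouyang, Lei Bai
Affiliation: Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: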

Abstract:Accurate reconstruction of the ocean is essential for reflecting global climate dynamics and supporting marine meteorological research. Conventional methods face challenges due to sparse data, algorithmic complexity, and high computational costs, while the growing use of machine learning (ML) methods remains limited to reconstruction problems at the sea surface and in local regions, struggling with issues like cloud occlusion. To address these limitations, this paper proposes ReconMOST, a data-driven guided diffusion model framework for multi-layer sea temperature reconstruction. Specifically, we first pre-train an unconditional diffusion model using a large collection of historical numerical simulation data, enabling the model to attain physically consistent distribution patterns of ocean temperature fields. During the generation phase, sparse yet high-accuracy in-situ observational data are utilized as guidance points for the reverse diffusion process, generating accurate reconstruction results. Importantly, in regions lacking direct observational data, the physically consistent spatial distribution patterns learned during pre-training enable implicitly guided and physically plausible reconstructions. Our method extends ML-based SST reconstruction to a global, multi-layer setting, handling over 92.5% missing data while maintaining reconstruction accuracy, spatial resolution, and superior generalization capability. We pre-train our model on CMIP6 numerical simulation data and conduct guided reconstruction experiments on CMIP6 and EN4 analysis data. The mean squared error (MSE) reaches 0.049 on guidance, 0.680 on reconstruction, and 0.633 in total, demonstrating the effectiveness and robustness of the proposed framework. Our source code is available at this https URL.
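
The abstract reports three MSE figures (guidance, reconstruction, total) without defining how they are split. One plausible reading, stated here as an assumption rather than the authors' definition, is that "guidance" is measured at observed grid cells, "reconstruction" at unobserved cells, and "total" over all cells. Under that reading, with roughly 92.5% of cells missing, 0.075 × 0.049 + 0.925 × 0.680 ≈ 0.633, which matches the reported total. A minimal sketch with made-up values:

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is True."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else float("nan")


# Toy flattened temperature field (hypothetical values, degrees C).
target = [10.0, 12.0, 14.0, 16.0]
pred = [10.0, 12.5, 13.0, 16.0]
observed = [True, False, False, True]  # cells assumed to have in-situ guidance

guidance_mse = masked_mse(pred, target, observed)
recon_mse = masked_mse(pred, target, [not m for m in observed])
total_mse = masked_mse(pred, target, [True] * len(pred))
print(guidance_mse, recon_mse, total_mse)
```

The total is the coverage-weighted average of the other two, which is why it sits close to the reconstruction figure when most cells are unobserved.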

[CV-70] DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba

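Quick read: This paper addresses the reliance of non-convolutional models such as Vision Transformer and Vision Mamba on fixed-size patches, which causes background regions to be over-encoded and critical local details to be missed, especially when informative objects are sparsely distributed. The key to its solution is a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions an image into content-dependent patches of varying sizes and combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich regions, improving model performance.

Link: https://arxiv.org/abs/2506.10390
Authors: Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin
Affiliation: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Code is available at this https URL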

Abstract:Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at this https URL.
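
To give a feel for how quantile operations can place denser tokens where scores are high, here is a 1-D toy sketch. It is not DART's implementation (which is 2-D, learned, and differentiable end to end); the function name, the fixed scores, and the equal-mass rule are assumptions for illustration. Boundaries are placed so that every token covers the same share of total score, using a piecewise-linear inverse of the cumulative score.

```python
def quantile_boundaries(scores, num_tokens):
    """Split [0, len(scores)] into num_tokens intervals of equal score mass.

    scores: non-negative per-position scores with a positive sum.
    Returns num_tokens + 1 boundary positions (floats).
    """
    total = sum(scores)
    cum = [0.0]
    for s in scores:
        cum.append(cum[-1] + s)  # cumulative (unnormalised) score
    bounds = [0.0]
    cell = 0
    for k in range(1, num_tokens):
        mass = total * k / num_tokens  # target cumulative mass for boundary k
        while cum[cell + 1] < mass:
            cell += 1
        # Linear interpolation inside the cell (piecewise-linear inverse CDF).
        frac = (mass - cum[cell]) / (cum[cell + 1] - cum[cell])
        bounds.append(cell + frac)
    bounds.append(float(len(scores)))
    return bounds


# Hypothetical region scores: position 2 is the "informative" one.
scores = [1.0, 1.0, 6.0, 1.0, 1.0]
bounds = quantile_boundaries(scores, 5)
widths = [b - a for a, b in zip(bounds, bounds[1:])]
print(bounds, widths)
```

With these scores, three of the five tokens fall inside the single high-score position, while the low-score background on each side is covered by one wide token.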

[CV-71] Leveraging 6DoF Pose Foundation Models For Mapping Marine Sediment Burial

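Quick read: This paper addresses accurate estimation, from remote imagery, of how deeply anthropogenic objects are buried in the seafloor, which matters for assessing ecological risk, pollutant transport, and recovery or mitigation strategies for hazardous materials such as munitions. The key to its solution is a computer vision pipeline called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate the six-degrees-of-freedom pose of an object and the orientation of the surrounding seafloor, and infers burial depth by aligning the object's CAD model with the observed imagery and fitting a local planar approximation of the seafloor.

Link: https://arxiv.org/abs/2506.10386
Authors: Jerry Yan, Chinmay Talegaonkar, Nicholas Antipa, Eric Terrill, Sophia Merrifield
Affiliation: Marine Physical Laboratory, Scripps Institution of Oceanography, UCSD, La Jolla, CA USA; Department of Electrical and Computer Engineering, UCSD, La Jolla, CA USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: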

Abstract:The burial state of anthropogenic objects on the seafloor provides insight into localized sedimentation dynamics and is also critical for assessing ecological risks, potential pollutant transport, and the viability of recovery or mitigation strategies for hazardous materials such as munitions. Accurate burial depth estimation from remote imagery remains difficult due to partial occlusion, poor visibility, and object degradation. This work introduces a computer vision pipeline, called PoseIDON, which combines deep foundation model features with multiview photogrammetry to estimate six degrees of freedom object pose and the orientation of the surrounding seafloor from ROV video. Burial depth is inferred by aligning CAD models of the objects with observed imagery and fitting a local planar approximation of the seafloor. The method is validated using footage of 54 objects, including barrels and munitions, recorded at a historic ocean dumpsite in the San Pedro Basin. The model achieves a mean burial depth error of approximately 10 centimeters and resolves spatial burial patterns that reflect underlying sediment transport processes. This approach enables scalable, non-invasive mapping of seafloor burial and supports environmental assessment at contaminated sites.
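
Once an object's CAD model has been posed and a seafloor plane fitted, a burial depth can be read off geometrically. The sketch below assumes the plane is already known (a point on it plus an upward unit normal) and defines burial depth as the distance of the deepest model vertex below that plane. That definition, the function names, and the coordinates are illustrative assumptions; the abstract does not spell out PoseIDON's exact formula.

```python
def signed_distance(point, plane_point, unit_normal):
    """Signed distance from point to the plane (positive above the seafloor)."""
    return sum((p - q) * n for p, q, n in zip(point, plane_point, unit_normal))


def burial_depth(vertices, plane_point, unit_normal):
    """Depth of the deepest model vertex below the plane, or 0.0 if none is below."""
    lowest = min(signed_distance(v, plane_point, unit_normal) for v in vertices)
    return max(0.0, -lowest)


# Hypothetical posed CAD vertices (metres) and a flat seafloor at z = 0.
vertices = [(0.0, 0.0, 0.5), (0.0, 0.0, -0.3), (0.2, 0.1, 0.1)]
depth = burial_depth(vertices, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(depth)
```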

[CV-72] Revisiting Transformers with Insights from Image Filtering

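Quick read: This paper addresses the lack of theoretical grounding and interpretability of the self-attention mechanism in Transformer architectures. The key is a unifying image processing framework that explains both the self-attention computation itself and the role of components such as positional encoding and residual connections, and that sheds light on how these work across different variants. The framework helps explain the successes and limitations of self-attention, and the image-processing-inspired architectural modifications it motivates improve accuracy, robustness, and long-sequence understanding on language and vision tasks.

Link: https://arxiv.org/abs/2506.10371
Authors: Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen
Affiliation: National University of Singapore; Rakuten Institute of Technology (RIT), Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 12 pages, 6 figures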

Abstract:The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain its remarkable success and limitations has therefore become an increasingly prominent focus in recent research. Some notable directions have explored understanding self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of various architectural components that enhance self-attention, both in its original formulation and subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework, capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. We also pinpoint potential distinctions between the two concepts building upon our framework, and make an effort to close this gap. We introduce two independent architectural modifications within transformers. While our primary objective is interpretability, we empirically observe that image processing-inspired modifications can also lead to notably improved accuracy and robustness against data contamination and adversaries across language and vision tasks, as well as better long-sequence understanding.

[CV-73] FSATFusion: Frequency-Spatial Attention Transformer for Infrared and Visible Image Fusion

Quick read: This paper addresses the information loss in infrared and visible image fusion (IVIF) that results from the limited ability of convolution operations to capture global context. The key to its solution is an end-to-end fusion network, the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion), whose core components are a frequency-spatial attention mechanism (FSAM) and an improved Transformer module (ITM), used to extract discriminative features from the source images and strengthen the representation of global context.

Link: https://arxiv.org/abs/2506.10366
Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. Existing deep learning approaches often utilize convolutional neural networks to extract image features. However, the inherently limited capacity of convolution operations to capture global context can lead to information loss, thereby restricting fusion performance. To address this limitation, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from source images. This FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information. We conducted both qualitative and quantitative comparative experiments, demonstrating the superior fusion quality and efficiency of FSATFusion compared to other state-of-the-art methods. Furthermore, our network was tested on two additional tasks without any modifications, to verify the excellent generalization capability of FSATFusion. Finally, the object detection experiment demonstrated the superiority of FSATFusion in downstream visual tasks. Our code is available at this https URL.

[CV-74] FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device ICIP

Quick read: This paper addresses how lightweight face recognition models can cut computational complexity and speed up inference while keeping accuracy high. The key to its solution is a lightweight Multi-Head Linear Attention (MHLA) mechanism combined with a reparameterized token mixer, which reduces computational complexity while preserving competitive recognition accuracy.

Link: https://arxiv.org/abs/2506.10361
Authors: Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh
Affiliation: National Formosa University; National Taipei University; University of Muhammadiyah Malang; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 2025 ICIP

Abstract:This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.

[CV-75] Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation

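Quick read: This paper addresses the lack of controllability, consistency, and diversity in text-to-motion generation, which stems mainly from existing methods relying on end-to-end mapping strategies that fail to capture deep linguistic structure and logical reasoning. The key to its solution is Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism: by explicitly decomposing complex textual instructions into logically structured action paths, it provides high-level semantic guidance for motion generation and markedly improves the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich instructions.

Link: https://arxiv.org/abs/2506.10353
Authors: Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zheng Zhu, Guan Huang, Xingang Wang
Affiliation: GigaAI; CASIA; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: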

Abstract:Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model’s ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
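
Group Relative Policy Optimization scores each sampled output relative to the other outputs drawn for the same prompt, rather than against a learned value baseline. A minimal sketch of that normalisation step is shown below. The reward values are invented, the use of the population standard deviation and the small epsilon are assumptions, and nothing here reflects Motion-R1's actual reward design or training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical motion-quality rewards for four generations of one instruction.
rewards = [0.2, 0.5, 0.8, 0.5]
adv = group_relative_advantages(rewards)
print(adv)
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below the mean are pushed down.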

[CV-76] RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration

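Quick read: This paper addresses poor medical image registration caused by differences in spatial resolution (for example pixel spacing, slice thickness, and field of view). Conventional machine-learning-based registration methods resample images to a fixed resolution, which can introduce interpolation artifacts. The proposed solution, RealKeyMorph (RKM), avoids resampling: it operates directly on the raw data and outputs keypoints in the scanner's real-world coordinate system, making it resolution-agnostic. RKM uses the affine matrix produced by the scanner to transform keypoints into real-world space and integrates this step into training, so that the extracted keypoints are resolution-invariant.

Link: https://arxiv.org/abs/2506.10344
Authors: Mina C. Moghadam, Alan Q. Wang, Omer Taub, Martin R. Prince, Mert R. Sabuncu
Affiliation: Weill Cornell Medicine; Stanford University; Cornell Tech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 23 pages, 8 figures, to be submitted to MELBA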

Abstract:Many real-world settings require registration of a pair of medical images that differ in spatial resolution, which may arise from differences in image acquisition parameters like pixel spacing, slice thickness, and field-of-view. However, all previous machine learning-based registration techniques resample images onto a fixed resolution. This is suboptimal because resampling can introduce artifacts due to interpolation. To address this, we present RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is an extension of KeyMorph, a registration framework which works by training a network to learn corresponding keypoints for a given pair of images, after which a closed-form keypoint matching step is used to derive the transformation that aligns them. To avoid resampling and enable operating on the raw data, RKM outputs keypoints in real-world coordinates of the scanner. To do this, we leverage the affine matrix produced by the scanner (e.g., MRI machine) that encodes the mapping from voxel coordinates to real world coordinates. By transforming keypoints into real-world space and integrating this into the training process, RKM effectively enables the extracted keypoints to be resolution-agnostic. In our experiments, we demonstrate the advantages of RKM on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as 3D volumes with varying resolutions in brain datasets.
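
The voxel-to-world mapping mentioned in the abstract is an ordinary 4×4 affine applied to homogeneous voxel indices. The sketch below shows why keypoints expressed in world coordinates are comparable across scans with different voxel spacing. The two affine matrices and the voxel indices are made-up examples, and the helper name is hypothetical; this is not RKM's code.

```python
def voxel_to_world(affine, ijk):
    """Apply a 4x4 voxel-to-world affine to a voxel index (i, j, k)."""
    hom = (ijk[0], ijk[1], ijk[2], 1.0)  # homogeneous coordinates
    return tuple(sum(row[c] * hom[c] for c in range(4)) for row in affine[:3])


# Hypothetical scans of the same anatomy with different voxel spacings (mm).
affine_fine = [
    [1.0, 0.0, 0.0, -10.0],
    [0.0, 1.0, 0.0, -10.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
affine_coarse = [
    [2.0, 0.0, 0.0, -10.0],
    [0.0, 2.0, 0.0, -10.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

p_fine = voxel_to_world(affine_fine, (20, 30, 10))
p_coarse = voxel_to_world(affine_coarse, (10, 15, 2))
print(p_fine, p_coarse)
```

The voxel indices differ between the two scans, but both map to the same physical location, so a keypoint loss computed in world space does not depend on either grid.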

[CV-77] UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models

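Quick read: This paper addresses the fact that conventional approaches to urban cultural studies rely on expert interpretation and historical documentation, which are hard to standardize across contexts. The key to its solution is a multimodal research framework based on vision-language models that analyzes differences in urban streetscape style in an automated and scalable way, making urban form research more objective and data-driven.

Link: https://arxiv.org/abs/2506.10342
Authors: Jun Yin, Jing Zhong, Peilin Li, Pengyu Zeng, Miao Zhang, Ran Luo, Shuai Lu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: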

Abstract:Urban cultures and architectural styles vary significantly across cities due to geographical, chronological, historical, and socio-political factors. Understanding these differences is essential for anticipating how cities may evolve in the future. As representative cases of historical continuity and modern innovation in China, Beijing and Shenzhen offer valuable perspectives for exploring the transformation of urban streetscapes. However, conventional approaches to urban cultural studies often rely on expert interpretation and historical documentation, which are difficult to standardize across different contexts. To address this, we propose a multimodal research framework based on vision-language models, enabling automated and scalable analysis of urban streetscape style differences. This approach enhances the objectivity and data-driven nature of urban form research. The contributions of this study are as follows: First, we construct UrbanDiffBench, a curated dataset of urban streetscapes containing architectural images from different periods and regions. Second, we develop UrbanSense, the first vision-language-model-based framework for urban streetscape analysis, enabling the quantitative generation and comparison of urban style representations. Third, experimental results show that over 80% of generated descriptions pass the t-test (p < 0.05). High Phi scores (0.912 for cities, 0.833 for periods) from subjective evaluations confirm the method’s ability to capture subtle stylistic differences. These results highlight the method’s potential to quantify and interpret urban style evolution, offering a scientifically grounded lens for future design.

[CV-78] GeoCAD: Local Geometry-Controllable CAD Generation

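Quick read: This paper addresses local geometry-controllable computer-aided design (CAD) generation: automatically modifying local parts of a CAD model without disrupting the overall structure, while ensuring the newly generated local shapes follow user-specified geometric instructions. Existing methods either cannot follow textual instructions or cannot focus on local parts. The key to its solution, GeoCAD, is a complementary captioning strategy that uses vertex-based and VLLM-based captioning to systematically generate geometric instructions for simple and complex local parts, respectively; these instructions, together with the remaining parts, are then given to large language models to predict the masked local part.

Link: https://arxiv.org/abs/2506.10337
Authors: Zhanwei Zhang, Kaiyuan Liu, Junjie Liu, Wenxiao Wang, Binbin Lin, Liang Xie, Chen Shen, Deng Cai
Affiliation: Zhejiang University; Alibaba Cloud Computing; Zhejiang University of Technology; Hangzhou YunQi Academy of Engineering; School of Software Technology, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 18 pages, 12 figures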

Abstract:Local geometry-controllable computer-aided design (CAD) generation aims to modify local parts of CAD models automatically, enhancing design efficiency. It also ensures that the shapes of newly generated local parts follow user-specific geometric instructions (e.g., an isosceles right triangle or a rectangle with one corner cut off). However, existing methods encounter challenges in achieving this goal. Specifically, they either lack the ability to follow textual instructions or are unable to focus on the local parts. To address this limitation, we introduce GeoCAD, a user-friendly and local geometry-controllable CAD generation method. Specifically, we first propose a complementary captioning strategy to generate geometric instructions for local parts. This strategy involves vertex-based and VLLM-based captioning for systematically annotating simple and complex parts, respectively. In this way, we caption approximately 221k different local parts in total. In the training stage, given a CAD model, we randomly mask a local part. Then, using its geometric instruction and the remaining parts as input, we prompt large language models (LLMs) to predict the masked part. During inference, users can specify any local part for modification while adhering to a variety of predefined geometric instructions. Extensive experiments demonstrate the effectiveness of GeoCAD in generation quality, validity and text-to-CAD consistency. Code will be available at this https URL.

[CV-79] PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting

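Quick read: This paper addresses the tendency of 3D Gaussian splatting (3DGS) to overfit the training views when input views are limited, which noticeably degrades rendering quality. The key to its solution is a point-wise feature-aware Gaussian splatting framework: it uses the latest stereo foundation model for camera pose estimation and dense point cloud reconstruction, samples and aggregates multiscale 2D appearance features, enhances point-wise appearance representation with a self-attention-based point interaction network, and decodes Gaussian parameters with lightweight multi-layer perceptrons, enabling real-time, high-quality rendering from sparse training views.

Link: https://arxiv.org/abs/2506.10335
Authors: Lintao Xiang, Hongpei Zheng, Yating Huang, Qijun Yang, Hujun Yin
Affiliation: The University of Manchester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: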

Abstract:3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.

[CV-80] Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions

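Quick read: This paper addresses automatic and accurate analysis of students' academic emotions from facial expressions in online learning environments. Traditional approaches rely on supervised machine learning algorithms, but these models generalize poorly across contexts and require repeated data collection, annotation, and training. The key to the proposed solution is the zero-shot prompting ability of Vision-Language Models (VLMs), which generalize across visual recognition tasks without fine-tuning, improving the efficiency and adaptability of emotion analysis.

Link: https://arxiv.org/abs/2506.10334
Authors: Deliang Wang, Chao Yang, Gaowei Chen
Affiliation: The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: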

Abstract:Students’ academic emotions significantly influence their social behavior and learning performance. Traditional approaches to automatically and accurately analyze these emotions have predominantly relied on supervised machine learning algorithms. However, these models often struggle to generalize across different contexts, necessitating repeated cycles of data collection, annotation, and training. The emergence of Vision-Language Models (VLMs) offers a promising alternative, enabling generalization across visual recognition tasks through zero-shot prompting without requiring fine-tuning. This study investigates the potential of VLMs to analyze students’ academic emotions via facial expressions in an online learning environment. We employed two VLMs, Llama-3.2-11B-Vision-Instruct and Qwen2.5-VL-7B-Instruct, to analyze 5,000 images depicting confused, distracted, happy, neutral, and tired expressions using zero-shot prompting. Preliminary results indicate that both models demonstrate moderate performance in academic facial expression recognition, with Qwen2.5-VL-7B-Instruct outperforming Llama-3.2-11B-Vision-Instruct. Notably, both models excel in identifying students’ happy emotions but fail to detect distracted behavior. Additionally, Qwen2.5-VL-7B-Instruct exhibits relatively high performance in recognizing students’ confused expressions, highlighting its potential for practical applications in identifying content that causes student confusion.

[CV-81] Research on Audio-Visual Quality Assessment Dataset and Method for User-Generated Omnidirectional Video ICME2025

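Quick read: This paper addresses the shortage of research on audio-visual quality assessment (AVQA) for user-generated content (UGC) omnidirectional video (ODV). The key to its solution is to construct a dataset of UGC omnidirectional audio-visual content and to build on it an effective AVQA baseline model, consisting of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module, thereby improving audio-visual quality assessment for UGC-ODV.

Link: https://arxiv.org/abs/2506.10331
Authors: Fei Zhao, Da Pan, Zelu Qi, Ping Shi
Affiliation: Communication University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: Our paper has been accepted by ICME 2025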

Abstract:In response to the rising prominence of the Metaverse, omnidirectional videos (ODVs) have garnered notable interest, gradually shifting from professional-generated content (PGC) to user-generated content (UGC). However, the study of audio-visual quality assessment (AVQA) within ODVs remains limited. To address this, we construct a dataset of UGC omnidirectional audio and video (A/V) content. The videos are captured by five individuals using two different types of omnidirectional cameras, shooting 300 videos covering 10 different scene types. A subjective AVQA experiment is conducted on the dataset to obtain the Mean Opinion Scores (MOSs) of the A/V sequences. After that, to facilitate the development of the UGC-ODV AVQA field, we construct an effective AVQA baseline model on the proposed dataset, which consists of a video feature extraction module, an audio feature extraction module, and an audio-visual fusion module. The experimental results demonstrate that our model achieves optimal performance on the proposed dataset.

[CV-82] Towards Scalable SOAP Note Generation: A Weakly Supervised Multimodal Framework CVPR

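Quick read: This paper addresses the fact that manually writing SOAP (Subjective, Objective, Assessment, and Plan) notes during clinical follow-up of skin cancer patients is time-consuming and contributes to clinician burnout. The key to its solution is a weakly supervised multimodal framework that generates clinically structured SOAP notes from limited inputs, such as lesion images and sparse clinical text, reducing reliance on manual annotation and enabling scalable, clinically grounded documentation.

Link: https://arxiv.org/abs/2506.10328
Authors: Sadia Kamal, Tim Oates, Joy Wan
Affiliation: University of Maryland, Baltimore County; Johns Hopkins University School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at IEEE/CVF Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)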

Abstract:Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated datasets. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate clinical quality, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.

[CV-83] Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation

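Quick read: This paper addresses the limits that data scarcity and a lack of uncertainty awareness place on classification performance in skin cancer diagnosis. The key to its solution is to combine transfer learning with uncertainty quantification (UQ): pre-trained feature extractors (such as CLIP vision transformers) are paired with traditional classifiers (such as SVM) for efficient classification, and Monte Carlo Dropout, ensembles, and Ensemble Monte Carlo Dropout are used to assess the reliability of model outputs, improving the accuracy and trustworthiness of deep learning in medical diagnosis.

Link: https://arxiv.org/abs/2506.10302
Authors: Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: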

Abstract:Accurate and reliable skin cancer diagnosis is critical for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, but their performance can be limited by data scarcity and a lack of uncertainty awareness. In this study, we present a comprehensive evaluation of DL-based skin lesion classification using transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. In the first phase, we benchmarked several pre-trained feature extractors, including Contrastive Language-Image Pretraining (CLIP) variants, Residual Network-50 (ResNet50), Densely Connected Convolutional Network (DenseNet121), Visual Geometry Group network (VGG16), and EfficientNet-V2-Large, combined with a range of traditional classifiers such as Support Vector Machine (SVM), eXtreme Gradient Boosting (XGBoost), and logistic regression. Our results show that CLIP-based vision transformers, particularly LAION CLIP ViT-H/14 with SVM, deliver the highest classification performance. In the second phase, we incorporated UQ using Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) to assess not only prediction accuracy but also the reliability of model outputs. We evaluated these models using uncertainty-aware metrics such as uncertainty accuracy (UAcc), uncertainty sensitivity (USen), uncertainty specificity (USpe), and uncertainty precision (UPre). The results demonstrate that ensemble methods offer a good trade-off between accuracy and uncertainty handling, while EMCD is more sensitive to uncertain predictions. This study highlights the importance of integrating UQ into DL-based medical diagnosis to enhance both performance and trustworthiness in real-world clinical applications.
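
Monte Carlo Dropout estimates uncertainty by keeping dropout active at inference and running several stochastic forward passes. One common summary, used here as an assumption since the abstract does not say which statistic the study computes, is the entropy of the averaged class probabilities. The sketch below starts from already collected softmax outputs; the numbers are invented and no real model is involved.

```python
import math


def predictive_entropy(prob_samples):
    """Mean class probabilities over passes, and the entropy of that mean."""
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean_probs = [sum(p[c] for p in prob_samples) / n for c in range(num_classes)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
    return mean_probs, entropy


# Hypothetical softmax outputs from three dropout-enabled passes, two classes.
consistent = [[0.9, 0.1], [0.95, 0.05], [0.85, 0.15]]
conflicting = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

_, h_consistent = predictive_entropy(consistent)
_, h_conflicting = predictive_entropy(conflicting)
print(h_consistent, h_conflicting)
```

When the passes disagree, the averaged distribution flattens and the entropy rises toward its two-class maximum of ln 2, which is the kind of signal used to flag a prediction as uncertain.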

[CV-84] HalLoc: Token-level Localization of Hallucinations for Vision Language Models CVPR2025

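[Quick read]: This paper addresses hallucination in large vision-language models (VLMs), which seriously undermines reliability, especially in critical applications that demand accuracy. Existing detection methods rely on computationally intensive models, which leads to high latency and resource consumption, and their definitive verdicts do not account for real-world cases where the line between hallucinated and truthful information is blurred. The proposed solution is the HalLoc dataset: 150K token-level annotated samples spanning several task types, designed to support probabilistic hallucination detection so that models can report graded confidence and users can interact with them in a better-informed way. The paper also introduces a baseline model trained on HalLoc that performs low-overhead, concurrent detection during generation and can be integrated seamlessly into existing VLMs, improving reliability while preserving efficiency.

Link: https://arxiv.org/abs/2506.10286
Authors: Eunkyu Park, Minyeong Kim, Gunhee Kim
Affiliation: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025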

Abstract:Hallucinations pose a significant challenge to the reliability of large vision-language models, making their detection essential for ensuring accuracy in critical applications. Current detection methods often rely on computationally intensive models, leading to high latency and resource demands. Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. It features 150K token-level annotated samples, including hallucination types, across Visual Question Answering (VQA), instruction-following, and image captioning tasks. This dataset facilitates the development of models that detect hallucinations with graded confidence, enabling more informed user interactions. Additionally, we introduce a baseline model trained on HalLoc, offering low-overhead, concurrent hallucination detection during generation. The model can be seamlessly integrated into existing VLMs, improving reliability while preserving efficiency. The prospect of a robust plug-and-play hallucination detection module opens new avenues for enhancing the trustworthiness of vision-language models in real-world applications. The HalLoc dataset and code are publicly available at: this https URL.

[CV-85] Energy Aware Camera Location Search Algorithm for Increasing Precision of Observation in Automated Manufacturing

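[Quick read]: This paper addresses how camera location affects observation precision in visual servoing systems with an eye-to-hand configuration. In a manufacturing environment, environmental conditions differ from one observation location to another, so image noise levels vary and the quality of camera estimates varies with them; the camera location therefore needs to be optimized. The key to the solution is a camera moving-policy algorithm that explores the camera workspace for the location with the lowest noise level and, when energy is limited and the optimum is unreachable, ensures that the camera ends at a suboptimal location among those already searched. The algorithm explores more efficiently by adapting its search policy as it learns the environment and, combined with image averaging, reaches acceptable observation accuracy in an eye-to-hand setup with a single camera while preserving the high-frequency information in the original image.

Link: https://arxiv.org/abs/2506.10251
Authors: Rongfei Li, Francis Assadian
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 24 figures, Journal, Published in: Applied Sciences, 2024, vol. 14, article 9140. For published version, see this http URL: this https URL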

Abstract:Visual servoing technology has been well developed and applied in many automated manufacturing tasks, especially in tools’ pose alignment. To access a full global view of tools, most applications adopt eye-to-hand configuration or eye-to-hand/eye-in-hand cooperation configuration in an automated manufacturing environment. Most research papers mainly put efforts into developing control and observation architectures in various scenarios, but few of them have discussed the importance of the camera’s location in eye-to-hand configuration. In a manufacturing environment, the quality of camera estimations may vary significantly from one observation location to another, as the combined effects of environmental conditions result in different noise levels of a single image shot at different locations. In this paper, we propose an algorithm for the camera’s moving policy so that it explores the camera workspace and searches for the optimal location where the images’ noise level is minimized. Also, this algorithm ensures the camera ends up at a suboptimal (if the optimal one is unreachable) location among the locations already searched, with limited energy available for moving the camera. Unlike a simple brute force approach, the algorithm enables the camera to explore space more efficiently by adapting the search policy from learning the environment. With the aid of an image averaging technique, this algorithm, in use of a solo camera, achieves the observation accuracy in eye-to-hand configurations to a desirable extent without filtering out high-frequency information in the original image. An automated manufacturing application has been simulated and the results show the success of this algorithm’s improvement of observation precision with limited energy.

[CV-86] DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos CVPR2025

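[Quick read]: This paper targets the high computational cost and low efficiency of camera-based 3D object detection in Bird’s Eye View (BEV). Earlier methods rely on dense BEV features that are costly to construct; more recent sparse query-based detectors improve on this but still need a large number of queries and become expensive when more video frames are processed. The key to the proposed DySS method is the combination of state-space learning and dynamic queries: a state-space model (SSM) sequentially processes the sampled features over time steps, future prediction and masked reconstruction serve as auxiliary tasks that help the model capture motion and correspondence, and the queries are dynamically updated to keep the set of detection queries lean and effective.

Link: https://arxiv.org/abs/2506.10242
Authors: Rajeev Yasarla, Shizhong Han, Hong Cai, Fatih Porikli
Affiliation: Qualcomm AI Research; Qualcomm Technologies, Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Workshop on Autonomous Driving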

Abstract:Camera-based 3D object detection in Bird’s Eye View (BEV) is one of the most important perception tasks in autonomous driving. Earlier methods rely on dense BEV features, which are costly to construct. More recent works explore sparse query-based detection. However, they still require a large number of queries and can become expensive to run when more video frames are used. In this paper, we propose DySS, a novel method that employs state-space learning and dynamic queries. More specifically, DySS leverages a state-space model (SSM) to sequentially process the sampled features over time steps. In order to encourage the model to better capture the underlying motion and correspondence information, we introduce auxiliary tasks of future prediction and masked reconstruction to better train the SSM. The state of the SSM then provides an informative yet efficient summarization of the scene. Based on the state-space learned features, we dynamically update the queries via merge, remove, and split operations, which help maintain a useful, lean set of detection queries throughout the network. Our proposed DySS achieves both superior detection performance and efficient inference. Specifically, on the nuScenes test split, DySS achieves 65.31 NDS and 57.4 mAP, outperforming the latest state of the art. On the val split, DySS achieves 56.2 NDS and 46.2 mAP, as well as a real-time inference speed of 33 FPS.

[CV-87] California Crop Yield Benchmark: Combining Satellite Image Climate Evapotranspiration and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops

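[Quick read]: This paper addresses accurate and timely crop yield forecasting for California agriculture, which remains challenging despite extensive historical yield data because environmental, climatic, and soil-related factors interact in complex ways. The key to the solution is a comprehensive statewide crop yield benchmark dataset covering more than 70 crops, together with a multi-modal deep learning model for county-level, crop-specific yield forecasting. The model uses stratified feature extraction and a time-series encoder to capture spatial and temporal dynamics during the growing season, and static inputs such as soil characteristics and crop identity to account for long-term variability, reaching an overall R² of 0.76.

Link: https://arxiv.org/abs/2506.10228
Authors: Hamid Kamangir, Mona Hajiesmaeeli, Mason Earles
Affiliation: University of California, Davis; Texas A&M University-Corpus Christi; AI Institute for Food Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: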

Abstract:California is a global leader in agricultural production, contributing 12.5% of the United States total output and ranking as the fifth-largest food and cotton supplier in the world. Despite the availability of extensive historical yield data from the USDA National Agricultural Statistics Service, accurate and timely crop yield forecasting remains a challenge due to the complex interplay of environmental, climatic, and soil-related factors. In this study, we introduce a comprehensive crop yield benchmark dataset covering over 70 crops across all California counties from 2008 to 2022. The benchmark integrates diverse data sources, including Landsat satellite imagery, daily climate records, monthly evapotranspiration, and high-resolution soil properties. To effectively learn from these heterogeneous inputs, we develop a multi-modal deep learning model tailored for county-level, crop-specific yield forecasting. The model employs stratified feature extraction and a timeseries encoder to capture spatial and temporal dynamics during the growing season. Static inputs such as soil characteristics and crop identity inform long-term variability. Our approach achieves an overall R² score of 0.76 across all crops on the unseen test dataset, highlighting strong predictive performance across California’s diverse agricultural regions. This benchmark and modeling framework offer a valuable foundation for advancing agricultural forecasting, climate adaptation, and precision farming. The full dataset and codebase are publicly available at our GitHub repository.

[CV-88] ScoreMix: Improving Face Recognition via Score Composition in Diffusion Generators

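[Quick read]: This paper aims to improve discriminator performance when labeled data is limited. The key to the solution is to exploit the score compositional properties of diffusion models: during diffusion sampling, scores from different class-conditioned trajectories are mixed convexly to generate challenging synthetic samples, which significantly strengthens the model’s discriminative ability.

Link: https://arxiv.org/abs/2506.10226
Authors: Parsa Rahimi, Sebastien Marcel
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: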

Abstract:In this paper, we propose ScoreMix, a novel yet simple data augmentation strategy leveraging the score compositional properties of diffusion models to enhance discriminator performance, particularly under scenarios with limited labeled data. By convexly mixing the scores from different class-conditioned trajectories during diffusion sampling, we generate challenging synthetic samples that significantly improve discriminative capabilities in all studied benchmarks. We systematically investigate class-selection strategies for mixing and discover that greater performance gains arise when combining classes distant in the discriminator’s embedding space, rather than close in the generator’s condition space. Moreover, we empirically show that, under standard metrics, the correlation between the generator’s learned condition space and the discriminator’s embedding space is minimal. Our approach achieves notable performance improvements without extensive parameter searches, demonstrating practical advantages for training discriminative models while effectively mitigating problems regarding collections of large datasets. Paper website: this https URL
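
The convex score mixing described above can be illustrated with a toy example. The sketch below replaces the learned class-conditional score network with the closed-form score of an isotropic Gaussian, which is an assumption made purely for illustration; the names `gaussian_score` and `mixed_score` and the weight `lam` are hypothetical and not taken from the paper.

```python
import numpy as np


def gaussian_score(x, mean, sigma):
    """Score (gradient of the log-density) of an isotropic Gaussian N(mean, sigma^2 I).

    Stands in for a learned class-conditional score model s(x, c).
    """
    return -(np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)) / sigma**2


def mixed_score(x, mean_a, mean_b, sigma, lam):
    """Convex combination of two class-conditional scores, with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex mixture")
    return lam * gaussian_score(x, mean_a, sigma) + (1.0 - lam) * gaussian_score(x, mean_b, sigma)
```

In this toy setting the mixed score equals the score of a Gaussian centered at the interpolated mean, so following it during sampling produces points between the two classes. With a real diffusion model the mixed samples have no such closed form, which is presumably what makes them challenging for the discriminator.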

[CV-89] Improving Personalized Search with Regularized Low-Rank Parameter Updates CVPR2025

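[Quick read]: This paper addresses personalized vision-language retrieval, that is, recognizing a new concept (such as “my dog Fido”) from only a few examples. The task is challenging because the model must learn the new concept from a few images and also combine personal knowledge with general knowledge to recognize the concept in different contexts. The key solution is to adapt the internal representation of a vision-language dual encoder through regularized low-rank adaptation, applied to a small set of parameters in the final layer of the language encoder; this proves more effective than textual inversion at recognizing personal concepts while preserving general knowledge. The study also finds that parameter addition is an effective way to combine multiple learned personal concepts.

Link: https://arxiv.org/abs/2506.10182
Authors: Fiona Ryan, Josef Sivic, Fabian Caba Heilbron, Judy Hoffman, James M. Rehg, Bryan Russell
Affiliation: Georgia Tech; Adobe Research; CIIRC CTU; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Highlight. Code: this http URL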

Abstract:Personalized vision-language retrieval seeks to recognize new concepts (e.g. “my dog Fido”) from only a few examples. This task is challenging because it requires not only learning a new concept from a few images, but also integrating the personal and general knowledge together to recognize the concept in different contexts. In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. We find that regularized low-rank adaption of a small set of parameters in the language encoder’s final layer serves as a highly effective alternative to textual inversion for recognizing the personal concept while preserving general knowledge. Additionally, we explore strategies for combining parameters of multiple learned personal concepts, finding that parameter addition is effective. To evaluate how well general knowledge is preserved in a finetuned representation, we introduce a metric that measures image retrieval accuracy based on captions generated by a vision language model (VLM). Our approach achieves state-of-the-art accuracy on two benchmarks for personalized image retrieval with natural language queries - DeepFashion2 and ConCon-Chi - outperforming the prior art by 4%-22% on personal retrievals.
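
As a rough illustration of low-rank adaptation and of combining concepts by parameter addition, the sketch below builds rank-r updates for a single weight matrix and adds them to the frozen weights. The shapes, names, and the plain summation are assumptions for illustration; the paper's regularizer and its choice of which final-layer parameters to adapt are not reproduced here.

```python
import numpy as np


def low_rank_update(B, A):
    """Return the rank-limited update delta_W = B @ A, with B: (d_out, r), A: (r, d_in)."""
    return B @ A


def personalize(W, updates):
    """Add one or more low-rank updates to frozen weights W without modifying W."""
    W_new = np.array(W, dtype=float, copy=True)
    for delta in updates:
        W_new += delta
    return W_new
```

Because each concept is stored as its own pair of small matrices, several concepts can be combined at query time by summing their updates, and the original weights stay available for general-knowledge retrieval.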

[CV-90] Attention Please! Revisiting Attentive Probing for Masked Image Modeling

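[Quick read]: This paper addresses the poor performance of standard linear probing (LP) when evaluating models trained with Masked Image Modeling (MIM): because information is distributed across patch tokens, LP fails to reflect the potential of these models. The key to the solution is efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10× speed-up, while outperforming LP and earlier attentive probing methods on multiple benchmarks.

Link: https://arxiv.org/abs/2506.10178
Authors: Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: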

Abstract:As fine-tuning (FT) becomes increasingly impractical at scale, probing is emerging as the preferred evaluation protocol for self-supervised learning (SSL). Yet, the standard linear probing (LP) fails to adequately reflect the potential of models trained with Masked Image Modeling (MIM), due to the distributed nature of patch tokens. This motivates the need for attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite its growing adoption, attentive probing remains under-explored, with existing methods suffering from excessive parameterization and poor computational efficiency. In this work, we revisit attentive probing through the lens of the accuracy-efficiency trade-off. We conduct a systematic study of existing methods, analyzing their mechanisms and benchmarking their performance. We introduce efficient probing (EP), a multi-query cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10× speed-up over conventional multi-head attention. Despite its simplicity, EP outperforms LP and prior attentive probing approaches across seven benchmarks, generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings. Code available at this https URL.

[CV-91] Geometric Regularity in Deterministic Sampling of Diffusion-based Generative Models ICML2024

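[Quick read]: This paper addresses the trade-off between efficiency and quality when sampling from generative models, in particular improving image generation with a limited number of function evaluations. The key is to discover and exploit a geometric regularity of sampling trajectories: every trajectory lies in an extremely low-dimensional subspace, and all trajectories share an almost identical “boomerang” shape. Building on this, the paper proposes a dynamic programming-based scheme for optimizing the sampling time schedule that requires minimal modification to existing ODE-based numerical solvers, adds little computational overhead, and markedly improves generation quality.

Link: https://arxiv.org/abs/2506.10177
Authors: Defang Chen, Zhenyu Zhou, Can Wang, Siwei Lyu
Affiliation: University at Buffalo, State University of New York; Zhejiang University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 50 pages. The short version appeared in ICML 2024. arXiv admin note: substantial text overlap with arXiv:2405.11326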

Abstract:Diffusion-based generative models employ stochastic differential equations (SDEs) and their equivalent probability flow ordinary differential equations (ODEs) to establish a smooth transformation between complex high-dimensional data distributions and tractable prior distributions. In this paper, we reveal a striking geometric regularity in the deterministic sampling dynamics: each simulated sampling trajectory lies within an extremely low-dimensional subspace, and all trajectories exhibit an almost identical “boomerang” shape, regardless of the model architecture, applied conditions, or generated content. We characterize several intriguing properties of these trajectories, particularly under closed-form solutions based on kernel-estimated data modeling. We also demonstrate a practical application of the discovered trajectory regularity by proposing a dynamic programming-based scheme to better align the sampling time schedule with the underlying trajectory structure. This simple strategy requires minimal modification to existing ODE-based numerical solvers, incurs negligible computational overhead, and achieves superior image generation performance, especially in regions with only 5 to 10 function evaluations.
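
A dynamic programming-based schedule search of the kind mentioned above can be sketched generically. The version below assumes a fine grid of candidate time points and a precomputed matrix `cost[i][j]` that estimates the error of jumping directly from grid point `i` to grid point `j`; how the paper actually defines this cost from the trajectory structure is not stated in the abstract, so the cost matrix and the function name `optimal_schedule` are hypothetical.

```python
import math


def optimal_schedule(cost, num_steps):
    """Pick num_steps jumps from grid index 0 to the last index with minimal total cost.

    cost[i][j] (for i < j) is an assumed per-jump error estimate.
    Returns (path, total_cost), where path lists the chosen grid indices.
    """
    n = len(cost)
    if not 1 <= num_steps <= n - 1:
        raise ValueError("num_steps must be between 1 and len(cost) - 1")
    best = [[math.inf] * n for _ in range(num_steps + 1)]
    prev = [[-1] * n for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, n):
            for i in range(j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Walk the predecessor table backwards from the final grid point.
    path = [n - 1]
    j = n - 1
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_steps][n - 1]
```

The search costs O(K·N²) for K steps over N grid points and runs once before sampling, which is consistent with the claim that the overhead at generation time is negligible.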

[CV-92] Retrieval of Surface Solar Radiation through Implicit Albedo Recovery from Temporal Context

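[Quick read]: This paper addresses accurate retrieval of surface solar radiation (SSR) from satellite imagery over regions with complex terrain and dynamic snow cover. Conventional methods approximate background reflectance with monthly statistics, which works poorly in mountainous areas where intermittent snow cover and surface changes are frequent. The key to the solution is an attention-based emulator that implicitly learns clear-sky surface reflectance from raw satellite image sequences, without hand-crafted features such as explicit albedo maps or cloud masks, which improves generalization over complex terrain.

Link: https://arxiv.org/abs/2506.10174
Authors: Yael Frischholz, Devis Tuia, Michael Lehning
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 14 pages, 7 figures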

Abstract:Accurate retrieval of surface solar radiation (SSR) from satellite imagery critically depends on estimating the background reflectance that a spaceborne sensor would observe under clear-sky conditions. Deviations from this baseline can then be used to detect cloud presence and guide radiative transfer models in inferring atmospheric attenuation. Operational retrieval algorithms typically approximate background reflectance using monthly statistics, assuming surface properties vary slowly relative to atmospheric conditions. However, this approach fails in mountainous regions where intermittent snow cover and changing snow surfaces are frequent. We propose an attention-based emulator for SSR retrieval that implicitly learns to infer clear-sky surface reflectance from raw satellite image sequences. Built on the Temporo-Spatial Vision Transformer, our approach eliminates the need for hand-crafted features such as explicit albedo maps or cloud masks. The emulator is trained on instantaneous SSR estimates from the HelioMont algorithm over Switzerland, a region characterized by complex terrain and dynamic snow cover. Inputs include multi-spectral SEVIRI imagery from the Meteosat Second Generation platform, augmented with static topographic features and solar geometry. The target variable is HelioMont’s SSR, computed as the sum of its direct and diffuse horizontal irradiance components, given at a spatial resolution of 1.7 km. We show that, when provided a sufficiently long temporal context, the model matches the performances of albedo-informed models, highlighting the model’s ability to internally learn and exploit latent surface reflectance dynamics. Our geospatial analysis shows this effect is most powerful in mountainous regions and improves generalization in both simple and complex topographic settings. Code and datasets are publicly available at this https URL

[CV-93] SPARKE: Scalable Prompt-Aware Diversity Guidance in Diffusion Models via RKE Score

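[Quick read]: This paper addresses how to maintain adequate diversity in samples generated by prompt-guided diffusion models, particularly when prompts span a broad semantic range and diversity must be evaluated in a prompt-aware way across semantically similar prompts. The key to the solution is Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE), which uses conditional entropy for diversity guidance, dynamically conditioning the diversity measurement on similar prompts and enabling prompt-aware diversity control. To lower the computational cost, the paper focuses on the special case of Conditional latent RKE Score Guidance, which reduces the complexity of entropy computation and gradient-based optimization from the O(n^3) of general entropy measures to O(n), making diversity-guided sampling efficient in large-scale generation settings.

Link: https://arxiv.org/abs/2506.10173
Authors: Mohammad Jalali, Haoyu Lei, Amin Gohari, Farzan Farnia
Affiliation: The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: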

Abstract:Diffusion models have demonstrated remarkable success in high-fidelity image synthesis and prompt-guided generative modeling. However, ensuring adequate diversity in generated samples of prompt-guided diffusion models remains a challenge, particularly when the prompts span a broad semantic spectrum and the diversity of generated data needs to be evaluated in a prompt-aware fashion across semantically similar prompts. Recent methods have introduced guidance via diversity measures to encourage more varied generations. In this work, we extend the diversity measure-based approaches by proposing the Scalable Prompt-Aware Rényi Kernel Entropy Diversity Guidance (SPARKE) method for prompt-aware diversity guidance. SPARKE utilizes conditional entropy for diversity guidance, which dynamically conditions diversity measurement on similar prompts and enables prompt-aware diversity control. While the entropy-based guidance approach enhances prompt-aware diversity, its reliance on the matrix-based entropy scores poses computational challenges in large-scale generation settings. To address this, we focus on the special case of Conditional latent RKE Score Guidance, reducing entropy computation and gradient-based optimization complexity from the O(n^3) of general entropy measures to O(n). The reduced computational complexity allows for diversity-guided sampling over potentially thousands of generation rounds on different prompts. We numerically test the SPARKE method on several text-to-image diffusion models, demonstrating that the proposed method improves the prompt-aware diversity of the generated data without incurring significant computational costs. We release our code on the project page: this https URL
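
To show why an order-2 Rényi kernel entropy score avoids an eigendecomposition, the sketch below computes an unconditional RKE-style score in two ways. It assumes the score is the exponential of the order-2 matrix-based Rényi entropy of the normalized kernel matrix K/n, which equals 1 / ||K/n||_F²; this is my reading of the RKE score and not a formula quoted from the abstract. The conditional, prompt-aware variant and the O(n) incremental update claimed in the paper are not reproduced.

```python
import numpy as np


def gaussian_kernel(X, bandwidth):
    """Gaussian kernel matrix with k(x, x) = 1 for rows of X (shape (n, d))."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def rke_score_eig(K):
    """Reference computation through the eigenvalues of K/n (cubic cost)."""
    eigenvalues = np.linalg.eigvalsh(K / K.shape[0])
    return 1.0 / np.sum(eigenvalues**2)


def rke_score_frobenius(K):
    """Same quantity from the squared Frobenius norm, without eigendecomposition."""
    n = K.shape[0]
    return n**2 / np.sum(K**2)
```

The score behaves like an effective number of modes: identical samples give 1, and two well-separated groups of equal size give 2. Since the Frobenius form is a sum over kernel entries, adding one new sample only requires its kernel values against the existing samples, which is one plausible route to the linear per-step cost mentioned in the abstract.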

[CV-94] A Navigation Framework Utilizing Vision-Language Models

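[Quick read]: This paper addresses the high computational cost and deployment difficulty that arise between multimodal understanding and real-time navigation decisions in Vision-and-Language Navigation (VLN). The key to the solution is a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning: a frozen vision-language model, Qwen2.5-VL-7B-Instruct, is combined with lightweight planning logic to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. The framework uses prompt engineering, structured history management, and a two-frame visual input strategy to improve decision-making continuity across navigation steps.

Link: https://arxiv.org/abs/2506.10172
Authors: Yicheng Duan, Kaiyu Tang
Affiliation: Case Western Reserve University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: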

Abstract:Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance decision-making continuity across navigation steps. We evaluate our system on the Room-to-Room benchmark within the VLN-CE setting using the Matterport3D dataset and Habitat-Lab simulation environment. Although our initial results reveal challenges in generalizing to unseen environments under strict evaluation settings, our modular approach lays a foundation for scalable and efficient navigation systems, highlighting promising directions for future improvement through enhanced environmental priors and expanded multimodal input integration.

[CV-95] Balanced Hyperbolic Embeddings Are Natural Out-of-Distribution Detectors

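[Quick read]: This paper addresses out-of-distribution (OOD) recognition in deep learning, that is, filtering out samples that do not belong to the distribution a network was trained on. The key to the solution is a good hierarchical hyperbolic embedding, obtained by jointly optimizing for hierarchical distortion and for balance between shallow and wide subhierarchies, which separates in-distribution from out-of-distribution samples effectively. The method, Balanced Hyperbolic Learning, uses hyperbolic prototypes for classification and generalizes existing OOD scoring functions to operate with hyperbolic prototypes; experiments show that it outperforms existing methods across multiple datasets and scoring functions.

Link: https://arxiv.org/abs/2506.10146
Authors: Tejaswi Kasarla, Max van Spengler, Pascal Mettes
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: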

Abstract:Out-of-distribution recognition forms an important and well-studied problem in deep learning, with the goal to filter out samples that do not belong to the distribution on which a network has been trained. The conclusion of this paper is simple: a good hierarchical hyperbolic embedding is preferred for discriminating in- and out-of-distribution samples. We introduce Balanced Hyperbolic Learning. We outline a hyperbolic class embedding algorithm that jointly optimizes for hierarchical distortion and balancing between shallow and wide subhierarchies. We then use the class embeddings as hyperbolic prototypes for classification on in-distribution data. We outline how to generalize existing out-of-distribution scoring functions to operate with hyperbolic prototypes. Empirical evaluations across 13 datasets and 13 scoring functions show that our hyperbolic embeddings outperform existing out-of-distribution approaches when trained on the same data with the same backbones. We also show that our hyperbolic embeddings outperform other hyperbolic approaches, beat state-of-the-art contrastive methods, and natively enable hierarchical out-of-distribution generalization.
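
Classification with hyperbolic prototypes, and one way an existing OOD score might carry over, can be sketched as follows. The sketch assumes the Poincaré ball model with curvature -1, uses negative distances to the prototypes as logits, and takes the maximum softmax probability as the in-distribution score. These choices are illustrative assumptions; the paper's embedding algorithm and its exact generalizations of the scoring functions are not shown.

```python
import numpy as np


def poincare_distance(u, v):
    """Geodesic distance in the Poincaré ball (curvature -1); inputs have norm < 1."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_diff = np.sum((u - v) ** 2, axis=-1)
    denom = (1.0 - np.sum(u**2, axis=-1)) * (1.0 - np.sum(v**2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)


def classify_and_score(z, prototypes):
    """Return (predicted class, max-softmax score) using negative distances as logits.

    A low score suggests the sample may be out-of-distribution.
    """
    z = np.asarray(z, dtype=float)
    distances = poincare_distance(z[None, :], prototypes)  # shape (num_classes,)
    logits = -distances
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), float(probs.max())
```

A sample embedded near a class prototype receives a confident prediction, while a sample equidistant from all prototypes receives a score close to 1 divided by the number of classes.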

[CV-96] RoCA: Robust Cross-Domain End-to-End Autonomous Driving

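[Quick read]: This paper addresses the limited generalization of end-to-end (E2E) autonomous driving when deployed across domains, in particular unstable performance and prohibitive fine-tuning cost in practical settings such as different cities. The key to the solution is the RoCA framework, which formulates a joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline and, using a Gaussian process, learns a set of representative basis tokens with their trajectories. This enables probabilistic inference of the future trajectory for new driving scenes and improves cross-domain generalization and adaptation.

Link: https://arxiv.org/abs/2506.10145
Authors: Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Litian Liu, Shweta Mahajan, Apratim Bhattacharyya, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli
Affiliation: Qualcomm AI Research; Qualcomm Technologies, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: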

Abstract:End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.

[CV-97] ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

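[Quick read]: This paper addresses the scarcity of vision-centric tasks for vision-language models (VLMs) that are both sufficiently challenging and unambiguously verifiable. To this end, the authors propose ViCrit (Visual Caption Hallucination Critic), a reinforcement learning (RL) proxy task whose key idea is to inject a subtle synthetic visual hallucination into a paragraph of human-written image caption and train the model to localize the corrupted span given the image and the modified caption; this preserves the full perceptual difficulty while providing a binary, exact-match reward signal. The method yields substantial gains on a variety of vision-language benchmarks and transfers to abstract image reasoning and visual math.

Link: https://arxiv.org/abs/2506.10128
Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
Affiliation: University of Maryland; Microsoft; University of Michigan; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: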

Abstract:Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than barely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
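
The binary, exact-match reward described above is simple enough to sketch directly. The version below assumes that the model outputs the corrupted span as a string and that matching ignores letter case and surrounding or repeated whitespace; that normalization step is an assumption, since the abstract only states that the reward is binary and exact-match.

```python
def normalize_span(span):
    """Lowercase and collapse whitespace (an assumed normalization, not from the paper)."""
    return " ".join(span.lower().split())


def exact_match_reward(predicted_span, corrupted_span):
    """Return 1.0 if the predicted span equals the injected error span, else 0.0."""
    return 1.0 if normalize_span(predicted_span) == normalize_span(corrupted_span) else 0.0
```

Because the corrupted span is known by construction when the hallucination is injected, the reward needs no learned judge or human grader, which is what makes the task verifiable.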

[CV-98] Detecção da Psoríase Utilizando Visão Computacional: Uma Abordagem Comparativa Entre CNNs e Vision Transformers

[Quick read]: This paper compares the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on multi-class classification of images showing lesions of psoriasis and similar diseases. The key to the approach is to adapt models pre-trained on ImageNet to a specific dataset and to verify experimentally that ViTs achieve better predictive performance with smaller models; Dual Attention Vision Transformer-Base (DaViT-B) obtains the best F1-score, 96.4%, which indicates its efficiency for automated psoriasis detection.

Link: https://arxiv.org/abs/2506.10119
Authors: Natanael Lucena, Fábio S. da Silva, Ricardo Rios
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, in Portuguese language, 2 figures, 2 tables, and 4 formulas. To be published in the Proceedings of the LII Brazilian Integrated Software and Hardware Seminar 2025 (SEMISH 2025)

Abstract:This paper presents a comparison of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the task of multi-classifying images containing lesions of psoriasis and diseases similar to it. Models pre-trained on ImageNet were adapted to a specific data set. Both achieved high predictive metrics, but the ViTs stood out for their superior performance with smaller models. Dual Attention Vision Transformer-Base (DaViT-B) obtained the best results, with an f1-score of 96.4%, and is recommended as the most efficient architecture for automated psoriasis detection. This article reinforces the potential of ViTs for medical image classification tasks.

[CV-99] A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild

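[Quick read]: This paper addresses the lack of a benchmark dataset for methods that detect minors (children) in digital content in a multi-modal setting. The key contribution is the release of the Image-Caption Children in the Wild Dataset (ICCWD), a dataset of 10,000 manually labeled image-caption pairs for benchmarking tools that detect depictions of minors. It contains images of children in a variety of contexts, including fictional depictions and partially visible bodies, and so offers a richer and more practical benchmark for research in this area.

Link: https://arxiv.org/abs/2506.10117
Authors: Klim Kireev, Ana-Maria Creţu, Raphael Meier, Sarah Adel Bargal, Elissa Redmiles, Carmela Troncoso
Affiliation: MPI-SP & EPFL; EPFL; Cyber-Defence Campus; armasuisse S+T; Georgetown University; MPI-SP & EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: 14 pages, 6 figures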

Abstract:Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.

[CV-100] EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

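[Quick read]: This paper targets the high compute and memory cost of inference in diffusion-based Vision-Language-Action (VLA) models for embodied intelligence, which stems mainly from redundancy inside the model and at inference time. The key is EfficientVLA, a structured, training-free inference acceleration framework that removes these bottlenecks by jointly exploiting several kinds of redundancy: pruning functionally unimportant layers from the language module, streamlining the visual pathway with a task-aware strategy that selects a compact and diverse set of visual tokens, and reducing temporal redundancy in the iterative diffusion action head by caching and reusing key intermediate features.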

Link: https://arxiv.org/abs/2506.10100
Authors: Yantai Yang,Yuhao Wang,Zichen Wen,Luo Zhongwei,Chang Zou,Zhipeng Zhang,Chuan Wen,Linfeng Zhang
Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; Harbin Institute of Technology; Xi’an Jiaotong University; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language-Action (VLA) models, particularly diffusion-based architectures, demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands stemming from extensive inherent and inference-time redundancies. While existing acceleration efforts often target isolated inefficiencies, such piecemeal solutions typically fail to holistically address the varied computational and memory bottlenecks across the entire VLA pipeline, thereby limiting practical deployability. We introduce EfficientVLA, a structured and training-free inference acceleration framework that systematically eliminates these barriers by cohesively exploiting multifaceted redundancies. EfficientVLA synergistically integrates three targeted strategies: (1) pruning of functionally inconsequential layers from the language module, guided by an analysis of inter-layer redundancies; (2) optimizing the visual processing pathway through a task-aware strategy that selects a compact, diverse set of visual tokens, balancing task-criticality with informational coverage; and (3) alleviating temporal computational redundancy within the iterative diffusion-based action head by strategically caching and reusing key intermediate features. We apply our method to a standard VLA model CogACT, yielding a 1.93X inference speedup and reduces FLOPs to 28.9%, with only a 0.6% success rate drop in the SIMPLER benchmark.
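
The abstract does not say how inter-layer redundancy is measured. As a minimal sketch, assuming cosine similarity between a layer's input and output hidden state as the redundancy score (an assumption for illustration; `rank_redundant_layers` and the toy vectors are hypothetical and not taken from EfficientVLA), layers whose output barely differs from their input can be ranked as pruning candidates:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_redundant_layers(hidden_states):
    """Rank layers from most to least redundant.

    hidden_states[i] is the input to layer i and hidden_states[i + 1] is its
    output. A layer that leaves its input almost unchanged scores close to 1
    and is treated as a pruning candidate (assumed criterion, not the paper's).
    """
    scores = [
        cosine_similarity(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy hidden states for a 3-layer stack (hypothetical values).
states = [
    [1.0, 0.0],  # input to layer 0
    [0.0, 1.0],  # output of layer 0: orthogonal to its input
    [0.0, 1.0],  # output of layer 1: identical to its input
    [1.0, 1.0],  # output of layer 2: partially changed
]
order = rank_redundant_layers(states)
print(order)  # layer 1 first, since it changes nothing
```

In practice such scores would be averaged over a calibration set before any layer is dropped; the single toy sample here only shows the ranking step.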

[CV-101] Test-Time Adaptation for Generalizable Task Progress Estimation ICML

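[Quick read]: This paper addresses online adaptation of a trajectory progress estimation model at test time, so that it performs better across different visual and temporal contexts. The key is a gradient-based meta-learning strategy that trains the model on expert visual trajectories and their natural-language task descriptions, so that test-time adaptation improves progress estimation by relying on semantic content rather than temporal order. The method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, and outperforms the state-of-the-art in-context learning approach based on autoregressive vision-language models.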

Link: https://arxiv.org/abs/2506.10085
Authors: Christos Ziakas,Alessandra Russo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: pages, 2 figures, accepted to the 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test (PUT) at 42nd International Conference on Machine Learning (ICML), Vancouver, Canada, 2025

Abstract:We propose a test-time adaptation method that enables a progress estimation model to adapt online to the visual and temporal context of test trajectories by optimizing a learned self-supervised objective. To this end, we introduce a gradient-based meta-learning strategy to train the model on expert visual trajectories and their natural language task descriptions, such that test-time adaptation improves progress estimation relying on semantic content over temporal order. Our test-time adaptation method generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art in-context learning approach using autoregressive vision-language models.

[CV-102] DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding NEURIPS2025

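[Quick read]: This paper addresses the lack of explicit pathways for adaptive, iterative refinement when conventional vision backbones build features. By borrowing principles from classical search algorithms, it aims for a more algorithmic, structured, and logical processing flow that yields more interpretable, possibly reasoning-like representations. The key is the DeepTraverse architecture, built from two cooperating components: recursive exploration modules, which use parameter sharing to systematically deepen feature analysis along promising representational paths, and adaptive calibration modules, which adjust feature salience according to the evolving global context.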

Link: https://arxiv.org/abs/2506.10084
Authors: Bin Guo,John H.L. Hansen
Affiliations: University of Texas at Dallas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:Conventional vision backbones, despite their success, often construct features through a largely uniform cascade of operations, offering limited explicit pathways for adaptive, iterative refinement. This raises a compelling question: can principles from classical search algorithms instill a more algorithmic, structured, and logical processing flow within these networks, leading to representations built through more interpretable, perhaps reasoning-like decision processes? We introduce DeepTraverse, a novel vision architecture directly inspired by algorithmic search strategies, enabling it to learn features through a process of systematic elucidation and adaptive refinement distinct from conventional approaches. DeepTraverse operationalizes this via two key synergistic components: recursive exploration modules that methodically deepen feature analysis along promising representational paths with parameter sharing for efficiency, and adaptive calibration modules that dynamically adjust feature salience based on evolving global context. The resulting algorithmic interplay allows DeepTraverse to intelligently construct and refine feature patterns. Comprehensive evaluations across a diverse suite of image classification benchmarks show that DeepTraverse achieves highly competitive classification accuracy and robust feature discrimination, often outperforming conventional models with similar or larger parameter counts. Our work demonstrates that integrating such algorithmic priors provides a principled and effective strategy for building more efficient, performant, and structured vision backbones.

[CV-103] LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

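[Quick read]: This paper addresses two problems in video editing: existing methods rely on large-scale pretraining and lack flexibility for specific edits, and first-frame-guided editing offers little control over subsequent frames. The key is a mask-based low-rank adaptation (LoRA) fine-tuning method that flexibly adapts a pretrained image-to-video (I2V) model without changing its architecture. A spatial mask enables region-specific learning, so edits propagate controllably while background regions are preserved, and additional reference images provide appearance guidance that improves the accuracy and controllability of the edit.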

Link: https://arxiv.org/abs/2506.10082
Authors: Chenjian Gao,Lihe Ding,Xin Cai,Zhanpeng Huang,Zibin Wang,Tianfan Xue
Affiliations: The Chinese University of Hong Kong; SenseTime Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Abstract:Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edits propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods.
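
For readers unfamiliar with LoRA, the following is a minimal sketch of the two ingredients the abstract combines. The weight update `W + (alpha / r) * B @ A` is the standard LoRA formulation; the mask-weighted loss is only an assumed stand-in for "region-specific learning", since the abstract does not give the actual objective. All function names and toy values are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W, A, B, alpha):
    """Effective weight under LoRA: W + (alpha / r) * B @ A.

    W is d_out x d_in and stays frozen; B (d_out x r) and A (r x d_in) are the
    small trainable factors, with rank r = len(A).
    """
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def masked_mse(pred, target, mask):
    """Mean squared error restricted to positions where mask is 1.

    Assumption for illustration: the spatial mask simply selects which pixels
    contribute to the loss. The paper's mask-aware strategy may differ.
    """
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
B = [[1.0], [0.0]]            # rank-1 factors
A = [[0.0, 2.0]]
print(lora_weight(W, A, B, alpha=1.0))
print(masked_mse([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0, 1, 0]))
```

Only the middle pixel is inside the mask in the last call, so the error at the third pixel does not affect the loss.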

[CV-104] Secure Data Access in Cloud Environments Using Quantum Cryptography

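[Quick read]: This paper addresses data security in cloud computing, in particular the concern that traditional encryption may no longer be strong enough once quantum computers arrive. The key is quantum cryptography: Quantum Key Distribution (QKD) with the BB84 protocol generates keys that cannot be intercepted without detection, and the Quantum One Time Pad (QOTP) encrypts and decrypts the data, with the aim of keeping data in the cloud fully private.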

Link: https://arxiv.org/abs/2506.10028
Authors: S. Vasavi Venkata Lakshmi,Ziaul Haque Choudhury
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cloud computing has made storing and accessing data easier but keeping it secure is a big challenge nowadays. Traditional methods of ensuring data may not be strong enough in the future when powerful quantum computers become available. To solve this problem, this study uses quantum cryptography to protect data in the cloud environment. Quantum Key Distribution (QKD) creates secure keys by sending information using quantum particles like photons. Specifically, we use the BB84 protocol, a simple and reliable way to make secure keys that cannot be stolen without detection. To protect the data, we use the Quantum One Time pad (QOTP) for encryption and decryption, ensuring the data stays completely private. This study shows how these Quantum methods can be applied in cloud systems to provide a strong defense against hackers, even if they have access to quantum computers. The combination of QKD, BB84, and QOTP creates a safe and reliable way to keep data secure when it is stored or shared in the cloud. Using quantum cryptography, this paper provides a way to ensure data security now and in the future, making cloud computing safer for everyone to store their data securely and safely.
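
The key-sifting step of BB84 can be simulated classically. The sketch below assumes an ideal channel with no eavesdropper and no noise, and it uses a classical XOR one-time pad as a stand-in for the quantum one-time pad mentioned in the abstract; the function names are hypothetical and this is not the authors' system.

```python
import random


def bb84_sift(n_bits, rng):
    """Simulate BB84 sifting on an ideal, eavesdropper-free channel.

    Alice sends random bits in random bases (0 = rectilinear, 1 = diagonal).
    Bob measures in random bases. When bases match he reads Alice's bit;
    otherwise his result is random. Both keep only matching-basis positions.
    """
    alice_bits = [rng.randint(0, 1) for _ in range(n_bits)]
    alice_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    bob_bases = [rng.randint(0, 1) for _ in range(n_bits)]
    bob_bits = [
        bit if a_basis == b_basis else rng.randint(0, 1)
        for bit, a_basis, b_basis in zip(alice_bits, alice_bases, bob_bases)
    ]
    keep = [i for i in range(n_bits) if alice_bases[i] == bob_bases[i]]
    return [alice_bits[i] for i in keep], [bob_bits[i] for i in keep]


def one_time_pad(bits, key):
    """XOR each message bit with a key bit (classical one-time pad)."""
    if len(key) < len(bits):
        raise ValueError("key must be at least as long as the message")
    return [b ^ k for b, k in zip(bits, key)]


rng = random.Random(0)
alice_key, bob_key = bb84_sift(64, rng)
message = [1, 0, 1, 1]
ciphertext = one_time_pad(message, alice_key)
recovered = one_time_pad(ciphertext, bob_key)
print(alice_key == bob_key, recovered == message)
```

On average about half of the transmitted positions survive sifting. A real deployment would also compare a sample of the sifted key to estimate the error rate, which is how an eavesdropper is detected.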

[CV-105] Learning-based density-equalizing map

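[Quick read]: This paper addresses the limitations of traditional methods for computing density-equalizing maps (DEM): limited accuracy, overlapping artifacts in extreme cases, and the substantial algorithmic redesign needed to extend them from 2D to 3D. The key is a deep-learning-based density-equalizing map framework (LDEM) that introduces a loss function enforcing density uniformity and geometric regularity and uses a hierarchical strategy to predict the transformation at both coarse and fine levels, achieving better density equalization and bijectivity while extending seamlessly from 2D to 3D.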

Link: https://arxiv.org/abs/2506.10027
Authors: Yanwen Huang,Lok Ming Lui,Gary P. T. Choi
Affiliations: The Chinese University of Hong Kong
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or optimization-based methods that minimize handcrafted energy functionals. However, these conventional techniques often face several challenges: they may suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D, due to the derivative-dependent nature of their energy formulations. In this work, we propose a novel learning-based density-equalizing mapping framework (LDEM) using deep neural networks. Specifically, we introduce a loss function that enforces density uniformity and geometric regularity, and utilize a hierarchical approach to predict the transformations at both the coarse and dense levels. Our method demonstrates superior density-equalizing and bijectivity properties compared to prior methods for a wide range of simple and complex density distributions, and can be easily applied to surface remeshing with different effects. Also, it generalizes seamlessly from 2D to 3D domains without structural changes to the model architecture or loss formulation. Altogether, our work opens up new possibilities for scalable and robust computation of density-equalizing maps for practical applications.
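
To make the "density uniformity" idea concrete, here is one possible way to score a mapping on a mesh. It is a hypothetical formulation, not the loss used in LDEM: each cell carries a mass equal to its density times its original area, and a perfectly density-equalizing map gives every cell a deformed area proportional to that mass.

```python
def density_uniformity_loss(densities, areas_before, areas_after):
    """Area-weighted squared deviation of the mapped density from its mean.

    Hypothetical formulation for illustration only. The loss is 0 exactly when
    mass / deformed area is the same for every cell.
    """
    masses = [rho * a for rho, a in zip(densities, areas_before)]
    mapped = [m / a for m, a in zip(masses, areas_after)]
    total_area = sum(areas_after)
    mean_density = sum(masses) / total_area
    return sum(
        a * (d / mean_density - 1.0) ** 2 for d, a in zip(mapped, areas_after)
    ) / total_area


densities = [1.0, 3.0]
areas_before = [1.0, 1.0]

# Identity map: areas unchanged, so the density stays non-uniform.
print(density_uniformity_loss(densities, areas_before, [1.0, 1.0]))

# Equalizing map: the dense cell grows and the sparse cell shrinks.
print(density_uniformity_loss(densities, areas_before, [0.5, 1.5]))
```

A geometric regularity term, which the abstract also mentions, would be added to this to discourage folded or badly distorted cells.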

[CV-106] WDMIR: Wavelet-Driven Multimodal Intent Recognition IJCAI2025

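[Quick read]: This paper addresses the under-use of the semantic content of non-verbal information in Multimodal Intent Recognition (MIR): existing methods focus on text analysis and overlook the rich semantics in non-verbal cues such as video and audio. The key is a wavelet-based framework, Wavelet-Driven Multimodal Intent Recognition (WDMIR), that strengthens the understanding of non-verbal information through frequency-domain analysis. It comprises (1) a wavelet-driven fusion module that decomposes and integrates video and audio features synchronously in the frequency domain, supporting fine-grained analysis of temporal dynamics, and (2) a cross-modal interaction mechanism that progressively enhances features from bimodal to trimodal fusion, bridging the semantic gap between verbal and non-verbal information.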

Link: https://arxiv.org/abs/2506.10011
Authors: Weiyin Gong,Kai Zhang,Yanghai Zhang,Qi Liu,Xinjie Sun,Junyu Lu,Linbo Zhu
Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; School of Computer Science, Liupanshui Normal University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: Accepted at IJCAI 2025, 9 pages, 6 figures

Abstract:Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. To be more specific, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% on accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
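
The abstract does not name the wavelet used. As a minimal sketch, assuming a single-level Haar transform along the time axis of one feature channel (the choice of Haar and the function names are assumptions for illustration), the decomposition splits a sequence into a low-frequency approximation and a high-frequency detail band, and the two bands reconstruct the input exactly:

```python
import math


def haar_dwt(signal):
    """One-level Haar wavelet transform of an even-length sequence.

    Returns (approx, detail): scaled pairwise sums capture the slow trend,
    scaled pairwise differences capture rapid changes.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the original sequence from both bands."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


features = [4.0, 2.0, 5.0, 5.0]  # toy per-frame feature values
approx, detail = haar_dwt(features)
print(approx, detail)
print(haar_idwt(approx, detail))
```

The second pair of frames is constant, so its detail coefficient is zero; only the first pair contributes high-frequency content.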

[CV-107] Structured Graph Representations for Visual Narrative Reasoning: A Hierarchical Framework for Comics

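[Quick read]: This paper addresses the structured understanding of visual narratives, in particular the analysis of narrative content in multimodal media such as comics. The core challenge is how to model and reason about story structure, character continuity, and event progression. The key is a hierarchical knowledge graph framework that decomposes narrative content into levels ranging from macro-level story arcs to fine-grained event segments and captures semantic, spatial, and temporal relationships through integrated knowledge graphs. At the panel level it builds multimodal graphs that link visual elements with textual components, and it integrates these graphs across narrative levels to support symbolic reasoning.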

Link: https://arxiv.org/abs/2506.10008
Authors: Yi-Chun Chen
Affiliations: North Carolina State University
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been submitted to ACM Multimedia 2025 and is currently under review

Abstract:This paper presents a hierarchical knowledge graph framework for the structured understanding of visual narratives, focusing on multimodal media such as comics. The proposed method decomposes narrative content into multiple levels, from macro-level story arcs to fine-grained event segments. It represents them through integrated knowledge graphs that capture semantic, spatial, and temporal relationships. At the panel level, we construct multimodal graphs that link visual elements such as characters, objects, and actions with corresponding textual components, including dialogue and captions. These graphs are integrated across narrative levels to support reasoning over story structure, character continuity, and event progression. We apply our approach to a manually annotated subset of the Manga109 dataset and demonstrate its ability to support symbolic reasoning across diverse narrative tasks, including action retrieval, dialogue tracing, character appearance mapping, and panel timeline reconstruction. Evaluation results show high precision and recall across tasks, validating the coherence and interpretability of the framework. This work contributes a scalable foundation for narrative-based content analysis, interactive storytelling, and multimodal reasoning in visual media.

[CV-108] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space ICME2025

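[Quick read]: This paper addresses two key problems in audio-driven emotional 3D facial animation. First, methods rely on a single-modal control signal (video, text, or emotion labels) and do not exploit their complementary strengths for comprehensive emotion control. Second, deterministic regression-based mappings constrain the stochastic nature of emotional expressions and non-verbal behaviors, which reduces the expressiveness of the synthesized animation. The key is a diffusion-based framework for controllable expressive 3D facial animation with two core innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns multiple modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from several signal sources; and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which increases motion diversity while maintaining temporal coherence and natural facial dynamics.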

Link: https://arxiv.org/abs/2506.10007
Authors: Kangwei Liu,Junwu Liu,Xiaowei Yi,Jinlin Guo,Yun Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Laboratory for Big Data and Decision, School of Systems Engineering, National University of Defense Technology
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME2025

Abstract:Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: this https URL.

[CV-109] HER2 Expression Prediction with Flexible Multi-Modal Inputs via Dynamic Bidirectional Reconstruction ACM-MM2025

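[Quick read]: This paper addresses a limitation of HER2 assessment models for breast cancer in clinical use: existing models usually analyze HE or IHC images in isolation, whereas clinical practice relies on interpreting the two together, and workflow complexity and cost often prevent acquiring both modalities. The key is an adaptive bimodal framework that supports flexible single- or dual-modality HER2 prediction through three innovations: a dynamic branch selector that activates single-modality reconstruction or dual-modality joint inference depending on input completeness; a bidirectional cross-modal GAN that performs context-aware feature-space reconstruction of the missing modality; and a hybrid training protocol that combines adversarial learning with multi-task optimization. The framework raises single-modality HE prediction accuracy from 71.44% to 94.25%, reaches 95.09% accuracy with both modalities, and maintains 90.28% with IHC input alone, reducing the dependence on synchronized acquisition.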

Link: https://arxiv.org/abs/2506.10006
Authors: Jie Qin,Wei Yang,Yan Su,Yiran Zhu,Weizhen Li,Yunyue Pan,Chengchang Pan,Honggang Qi
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures, 3 tables, submitted to the 33rd ACM International Conference on Multimedia (ACM MM 2025)

Abstract:Current HER2 assessment models for breast cancer predominantly analyze HE or IHC images in isolation, despite clinical reliance on their synergistic interpretation. However, concurrent acquisition of both modalities is often hindered by workflow complexity and cost constraints. We propose an adaptive bimodal framework enabling flexible single-/dual-modality HER2 prediction through three innovations: 1) A dynamic branch selector that activates either single-modality reconstruction or dual-modality joint inference based on input completeness; 2) A bidirectional cross-modal GAN performing context-aware feature-space reconstruction of missing modalities; 3) A hybrid training protocol integrating adversarial learning and multi-task optimization. This architecture elevates single-modality HE prediction accuracy from 71.44% to 94.25% while achieving 95.09% dual-modality accuracy, maintaining 90.28% reliability with sole IHC inputs. The framework’s “dual-preferred, single-compatible” design delivers near-bimodal performance without requiring synchronized acquisition, particularly benefiting resource-limited settings through IHC infrastructure cost reduction. Experimental validation confirms 22.81%/12.90% accuracy improvements over HE/IHC baselines respectively, with cross-modal reconstruction enhancing F1-scores to 0.9609 (HE to IHC) and 0.9251 (IHC to HE). By dynamically routing inputs through reconstruction-enhanced or native fusion pathways, the system mitigates performance degradation from missing data while preserving computational efficiency (78.55% parameter reduction in lightweight variant). This elastic architecture demonstrates significant potential for democratizing precise HER2 assessment across diverse healthcare settings.

[CV-110] EQ-TAA: Equivariant Traffic Accident Anticipation via Diffusion-Based Accident Video Synthesis

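[Quick read]: This paper addresses Traffic Accident Anticipation (TAA) in traffic scenes. Because such scenes are inherently long-tailed, uncertain, and fast-changing, the causal part of an accident is hard to identify and easily dominated by data bias, which produces a background confounding problem. The key is an Attentive Video Diffusion (AVD) model that synthesizes additional accident clips from normal clips by generating the causal part in dashcam videos, so that the TAA model can be trained without extra annotations. AVD generates causal video frames from accident or accident-free text prompts while preserving the style and content of the frames, and is combined with an equivariant triple loss to obtain Equivariant TAA (EQ-TAA).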

Link: https://arxiv.org/abs/2506.10002
Authors: Jianwu Fang,Lei-Lei Li,Zhedong Zheng,Hongkai Yu,Jianru Xue,Zhengguo Li,Tat-Seng Chua
Affiliations: Xi’an Jiaotong University; University of Macau; National University of Singapore; Cleveland State University; Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IEEE-TMM

Abstract:Traffic Accident Anticipation (TAA) in traffic scenes is a challenging problem for achieving zero fatalities in the future. Current approaches typically treat TAA as a supervised learning task needing the laborious annotation of accident occurrence duration. However, the inherent long-tailed, uncertain, and fast-evolving nature of traffic scenes has the problem that real causal parts of accidents are difficult to identify and are easily dominated by data bias, resulting in a background confounding issue. Thus, we propose an Attentive Video Diffusion (AVD) model that synthesizes additional accident video clips by generating the causal part in dashcam videos, i.e., from normal clips to accident clips. AVD aims to generate causal video frames based on accident or accident-free text prompts while preserving the style and content of frames for TAA after video generation. This approach can be trained using datasets collected from various driving scenes without any extra annotations. Additionally, AVD facilitates an Equivariant TAA (EQ-TAA) with an equivariant triple loss for an anchor accident-free video clip, along with the generated pair of contrastive pseudo-normal and pseudo-accident clips. Extensive experiments have been conducted to evaluate the performance of AVD and EQ-TAA, and competitive performance compared to state-of-the-art methods has been obtained.

[CV-111] Semi-Automated Quality Assurance in Digital Pathology: Tile Classification Approach

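[Quick read]: This paper addresses quality assurance in digital pathology, where manual review is inefficient and struggles to catch small image artifacts, and conventional image processing methods are limited in detection accuracy and scalability. The key is a deep-learning algorithm that analyzes tiles of digital pathology slides, classifies each tile as one of 10 predefined artifact types or as background, and produces an artifact localization map, reducing the manual review workload. The algorithm uses an InceptionResNet model with a hybrid design that combines single-artifact binary models and multiple instance models to optimize detection of each artifact.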

Link: https://arxiv.org/abs/2506.10916
Authors: Meredith VandeHaar,M. Clinch,I. Yilmaz,M.A. Rahman,Y. Xiao,F. Dogany,H.M. Alazab,A. Nassar,Z. Akkus,B. Dangott
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Quality assurance is a critical but underexplored area in digital pathology, where even minor artifacts can have significant effects. Artifacts have been shown to negatively impact the performance of AI diagnostic models. In current practice, trained staff manually review digitized images prior to release of these slides to pathologists which are then used to render a diagnosis. Conventional image processing approaches, provide a foundation for detecting artifacts on digital pathology slides. However, current tools do not leverage deep learning, which has the potential to improve detection accuracy and scalability. Despite these advancements, methods for quality assurance in digital pathology remain limited, presenting a gap for innovation. We propose an AI algorithm designed to screen digital pathology slides by analyzing tiles and categorizing them into one of 10 predefined artifact types or as background. This algorithm identifies and localizes artifacts, creating a map that highlights regions of interest. By directing human operators to specific tiles affected by artifacts, the algorithm minimizes the time and effort required to manually review entire slides for quality issues. From internal archives and The Cancer Genome Atlas, 133 whole slide images were selected and 10 artifacts were annotated using an internally developed software ZAPP (Mayo Clinic, Jacksonville, FL). Ablation study of multiple models at different tile sizes and magnification was performed. InceptionResNet was selected. Single artifact models were trained and tested, followed by a limited multiple instance model with artifacts that performed well together (chatter, fold, and pen). From the results of this study we suggest a hybrid design for artifact screening composed of both single artifact binary models as well as multiple instance models to optimize detection of each artifact. 

[CV-112] Med-URWKV: Pure RWKV With ImageNet Pre-training For Medical Image Segmentation

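[Quick read]: This paper addresses limitations of existing medical image segmentation methods, namely the restricted receptive field of convolutional neural networks (CNNs) and the computational overhead caused by the quadratic complexity of Transformers. The key is Med-URWKV, a pure RWKV (Receptance Weighted Key Value) segmentation model built on the U-Net framework with an ImageNet-pretrained VRWKV encoder, which lets it benefit from pretraining. Experiments on seven datasets show that Med-URWKV performs comparably to or better than RWKV models trained from scratch, supporting the value of the pretrained VRWKV encoder.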

Link: https://arxiv.org/abs/2506.10858
Authors: Zhenhuan Zhou
Affiliations: Nankai University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint Draft, 5 pages. This paper will be updated with a formal version in the future, Copyright: College of Computer Science, Nankai University. All rights reserved

Abstract:Medical image segmentation is a fundamental and key technology in computer-aided diagnosis and treatment. Previous methods can be broadly classified into three categories: convolutional neural network (CNN) based, Transformer based, and hybrid architectures that combine both. However, each of them has its own limitations, such as restricted receptive fields in CNNs or the computational overhead caused by the quadratic complexity of Transformers. Recently, the Receptance Weighted Key Value (RWKV) model has emerged as a promising alternative for various vision tasks, offering strong long-range modeling capabilities with linear computational complexity. Some studies have also adapted RWKV to medical image segmentation tasks, achieving competitive performance. However, most of these studies focus on modifications to the Vision-RWKV (VRWKV) mechanism and train models from scratch, without exploring the potential advantages of leveraging pre-trained VRWKV models for medical image segmentation tasks. In this paper, we propose Med-URWKV, a pure RWKV-based architecture built upon the U-Net framework, which incorporates ImageNet-based pretraining to further explore the potential of RWKV in medical image segmentation tasks. To the best of our knowledge, Med-URWKV is the first pure RWKV segmentation model in the medical field that can directly reuse a large-scale pre-trained VRWKV encoder. Experimental results on seven datasets demonstrate that Med-URWKV achieves comparable or even superior segmentation performance compared to other carefully optimized RWKV models trained from scratch. This validates the effectiveness of using a pretrained VRWKV encoder in enhancing model performance. The codes will be released.

[CV-113] Generalist Models in Medical Image Segmentation: A Survey and Performance Comparison with Task-Specific Approaches

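[Quick read]: This paper surveys the development and application of generalist models for medical image segmentation and examines their generalization and performance across tasks. These models build on large-scale pretraining followed by fine-tuning and draw on the design of the Segment Anything Model (SAM), extended to medical imaging. The survey covers zero-shot, few-shot, fine-tuning, and adapter-based variants, and compares their performance with task-specific models on mainstream tasks, with the aim of advancing the practical use of generalist models in medical image analysis.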

Link: https://arxiv.org/abs/2506.10825
Authors: Andrea Moglia(1),Matteo Leccardi(1),Matteo Cavicchioli(1),Alice Maccarini(2),Marco Marcon(1),Luca Mainardi(1),Pietro Cerveri(1 and 2)
Affiliations: (1) Politecnico di Milano; (2) Università di Pavia
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 132 pages, 26 figures, 23 tables. Andrea Moglia and Matteo Leccardi are equally contributing authors

Abstract:Following the successful paradigm shift of large language models, leveraging pre-training on a massive corpus of data and fine-tuning on different downstream tasks, generalist models have made their foray into computer vision. The introduction of Segment Anything Model (SAM) set a milestone on segmentation of natural images, inspiring the design of a multitude of architectures for medical image segmentation. In this survey we offer a comprehensive and in-depth investigation on generalist models for medical image segmentation. We start with an introduction on the fundamentals concepts underpinning their development. Then, we provide a taxonomy on the different declinations of SAM in terms of zero-shot, few-shot, fine-tuning, adapters, on the recent SAM 2, on other innovative models trained on images alone, and others trained on both text and images. We thoroughly analyze their performances at the level of both primary research and best-in-literature, followed by a rigorous comparison with the state-of-the-art task-specific models. We emphasize the need to address challenges in terms of compliance with regulatory frameworks, privacy and security laws, budget, and trustworthy artificial intelligence (AI). Finally, we share our perspective on future directions concerning synthetic data, early fusion, lessons learnt from generalist models in natural language processing, agentic AI and physical AI, and clinical translation.

[CV-114] Modality-AGnostic Image Cascade (MAGIC) for Multi-Modality Cardiac Substructure Segmentation

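[Quick read]: This paper addresses cardiac substructure segmentation for thoracic radiotherapy planning, which is needed to reduce the risk of radiation-induced heart disease. Conventional deep learning methods generalize poorly across modalities and overlapping structures, which limits clinical use. The key is the Modality-AGnostic Image Cascade (MAGIC), an nnU-Net-based model whose replicated encoding and decoding branches let a single model segment multiple modalities (simulation CT, low-field MR-Linac, and cardiac CT angiography) efficiently and accurately, with good generalization and a lightweight computational footprint.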

Link: https://arxiv.org/abs/2506.10797
Authors: Nicholas Summerfield,Qisheng He,Alex Kuo,Ahmed I. Ghanem,Simeng Zhu,Chase Ruff,Joshua Pan,Anudeep Kumar,Prashant Nagpal,Jiwei Zhao,Ming Dong,Carri K. Glide-Hurst
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cardiac substructures are essential in thoracic radiation therapy planning to minimize risk of radiation-induced heart disease. Deep learning (DL) offers efficient methods to reduce contouring burden but lacks generalizability across different modalities and overlapping structures. This work introduces and validates a Modality-AGnostic Image Cascade (MAGIC) for comprehensive and multi-modal cardiac substructure segmentation. MAGIC is implemented through replicated encoding and decoding branches of an nnU-Net-based, U-shaped backbone conserving the function of a single model. Twenty cardiac substructures (heart, chambers, great vessels (GVs), valves, coronary arteries (CAs), and conduction nodes) from simulation CT (Sim-CT), low-field MR-Linac, and cardiac CT angiography (CCTA) modalities were manually delineated and used to train (n=76), validate (n=15), and test (n=30) MAGIC. Twelve comparison models (four segmentation subgroups across three modalities) were equivalently trained. All methods were compared for training efficiency and against reference contours using the Dice Similarity Coefficient (DSC) and two-tailed Wilcoxon Signed-Rank test (threshold, p<0.05). Average DSC scores were 0.75(0.16) for Sim-CT, 0.68(0.21) for MR-Linac, and 0.80(0.16) for CCTA. MAGIC outperforms the comparison in 57% of cases, with limited statistical differences. MAGIC offers an effective and accurate segmentation solution that is lightweight and capable of segmenting multiple modalities and overlapping structures in a single model. MAGIC further enables clinical implementation by simplifying the computational requirements and offering unparalleled flexibility for clinical settings.
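
For reference, the Dice Similarity Coefficient used in this evaluation is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are choices made here, not details from the paper.

```python
def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two flat binary masks.

    pred and ref are equal-length sequences of 0/1 labels. Returns a value in
    [0, 1]; two empty masks are treated as a perfect match (a convention
    chosen for this sketch).
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    return 1.0 if total == 0 else 2.0 * intersection / total


prediction = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_coefficient(prediction, reference))  # one shared voxel out of 2 + 2
```

For overlapping structures, as in this paper, the score would be computed separately for each substructure mask rather than on a single multi-class label map.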

[CV-115] ConStyX: Content Style Augmentation for Generalizable Medical Image Segmentation

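[Quick read]: This paper addresses the performance degradation of medical image segmentation models under cross-domain distribution changes (domain shift); the core challenge is improving the model's domain generalization ability. To this end, the authors propose a new domain randomization-based domain generalization method, Content Style Augmentation (ConStyX). Its key idea is to augment both the content and the style of the training data so that it covers a wider range of data domains, and to exploit well-augmented features during training while mitigating the negative effects of over-augmented features.

Link: https://arxiv.org/abs/2506.10675
Authors: Xi Chen, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: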

Abstract:Medical images are usually collected from multiple domains, leading to domain shifts that impair the performance of medical image segmentation models. Domain Generalization (DG) aims to address this issue by training a robust model with strong generalizability. Recently, numerous domain randomization-based DG methods have been proposed. However, these methods suffer from the following limitations: 1) constrained efficiency of domain randomization due to their exclusive dependence on image style perturbation, and 2) neglect of the adverse effects of over-augmented images on model training. To address these issues, we propose a novel domain randomization-based DG method, called content style augmentation (ConStyX), for generalizable medical image segmentation. Specifically, ConStyX 1) augments the content and style of training data, allowing the augmented training data to better cover a wider range of data domains, and 2) leverages well-augmented features while mitigating the negative effects of over-augmented features during model training. Extensive experiments across multiple domains demonstrate that our ConStyX achieves superior generalization performance. The code is available at this https URL.

[CV-116] SWDL: Stratum-Wise Difference Learning with Deep Laplacian Pyramid for Semi-Supervised 3D Intracranial Hemorrhage Segmentation

[Quick read]: This paper addresses the performance limits caused by scarce annotated data in medical image segmentation, particularly for intracranial hemorrhage (ICH) segmentation, where tedious and costly labeling restricts the amount of annotated data available. The key to its solution is a new semi-supervised learning (SSL) framework, SWDL-Net, which combines the edge-sharpening strength of the Laplacian pyramid with the detail precision of deep convolutional upsampling, and fuses the two through a difference learning mechanism to segment lesion details and boundaries accurately.

Link: https://arxiv.org/abs/2506.10325
Authors: Cheng Wang, Siqi Chen, Donghua Mi, Yang Chen, Yudong Zhang, Yinsheng Li
Affiliation: Southeast University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Beijing Tiantan Hospital, Capital Medical University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures, 6 tables

Abstract:Recent advances in medical imaging have established deep learning-based segmentation as the predominant approach, though it typically requires large amounts of manually annotated data. However, obtaining annotations for intracranial hemorrhage (ICH) remains particularly challenging due to the tedious and costly labeling process. Semi-supervised learning (SSL) has emerged as a promising solution to address the scarcity of labeled data, especially in volumetric medical image segmentation. Unlike conventional SSL methods that primarily focus on high-confidence pseudo-labels or consistency regularization, we propose SWDL-Net, a novel SSL framework that exploits the complementary advantages of Laplacian pyramid and deep convolutional upsampling. The Laplacian pyramid excels at edge sharpening, while deep convolutions enhance detail precision through flexible feature mapping. Our framework achieves superior segmentation of lesion details and boundaries through a difference learning mechanism that effectively integrates these complementary approaches. Extensive experiments on a 271-case ICH dataset and public benchmarks demonstrate that SWDL-Net outperforms current state-of-the-art methods in scenarios with only 2% labeled data. Additional evaluations on the publicly available Brain Hemorrhage Segmentation Dataset (BHSD) with 5% labeled data further confirm the superiority of our approach. Code and data have been released at this https URL.
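
To make the Laplacian pyramid mentioned above concrete, here is a minimal sketch of the classical construction on a 1D intensity profile in plain Python. It is not SWDL-Net: the pair-averaging downsampler, the nearest-neighbour upsampler, and the toy signal are all assumptions chosen for brevity. Each level stores the detail lost by downsampling, which is why the detail bands concentrate at edges.

```python
def downsample(signal):
    """Average adjacent pairs (a crude blur-and-decimate step)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def upsample(signal, length):
    """Nearest-neighbour upsampling back to `length` samples."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]


def build_laplacian_pyramid(signal, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarse_residual]."""
    pyramid, current = [], list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        predicted = upsample(coarse, len(current))
        # The detail band is whatever the coarse level cannot explain.
        pyramid.append([c - p for c, p in zip(current, predicted)])
        current = coarse
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Invert the pyramid by upsampling and adding the detail bands back."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = [d + u for d, u in zip(detail, upsample(current, len(detail)))]
    return current


# Hypothetical intensity profile with a single edge between samples 2 and 3.
signal = [0, 0, 0, 10, 10, 10, 10, 10]
pyramid = build_laplacian_pyramid(signal, levels=2)
restored = reconstruct(pyramid)
```

In this toy case the finest detail band is nonzero only around the edge, and adding the bands back to the upsampled coarse level restores the input.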

[CV-117] DUN-SRE: Deep Unrolling Network with Spatiotemporal Rotation Equivariance for Dynamic MRI Reconstruction

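[Quick read]: This paper addresses the loss of image quality caused by severe undersampling in dynamic magnetic resonance imaging (dynamic MRI) reconstruction, and in particular how to model the spatiotemporal symmetry priors of dynamic MRI effectively. Existing methods exploit spatial symmetry well but do not adequately model symmetry along the time dimension, which is the most universal and informative structural prior in dynamic MRI. The key to the proposed solution is a deep unrolling network with spatiotemporal rotation equivariance (DUN-SRE): a (2+1)D equivariant convolutional architecture rigorously propagates spatiotemporal symmetry constraints, and the data consistency and proximal mapping modules are integrated into a unified deep unrolling framework, improving the physical accuracy of motion dynamics and the reconstruction quality in cardiac cine MRI.

Link: https://arxiv.org/abs/2506.10309
Authors: Yuliang Zhu, Jing Cheng, Qi Xie, Zhuo-Xu Cui, Qingyong Zhu, Yuanyuan Liu, Xin Liu, Jianfeng Ren, Chengbo Wang, Dong Liang
Affiliation: University of Nottingham Ningbo China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Mathematics and Statistics, Xi’an Jiaotong University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: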

Abstract:Dynamic Magnetic Resonance Imaging (MRI) exhibits transformation symmetries, including spatial rotation symmetry within individual frames and temporal symmetry along the time dimension. Explicit incorporation of these symmetry priors in the reconstruction model can significantly improve image quality, especially under aggressive undersampling scenarios. Recently, Equivariant convolutional neural network (ECNN) has shown great promise in exploiting spatial symmetry priors. However, existing ECNNs critically fail to model temporal symmetry, arguably the most universal and informative structural prior in dynamic MRI reconstruction. To tackle this issue, we propose a novel Deep Unrolling Network with Spatiotemporal Rotation Equivariance (DUN-SRE) for Dynamic MRI Reconstruction. The DUN-SRE establishes spatiotemporal equivariance through a (2+1)D equivariant convolutional architecture. In particular, it integrates both the data consistency and proximal mapping module into a unified deep unrolling framework. This architecture ensures rigorous propagation of spatiotemporal rotation symmetry constraints throughout the reconstruction process, enabling more physically accurate modeling of cardiac motion dynamics in cine MRI. In addition, a high-fidelity group filter parameterization mechanism is developed to maintain representation precision while enforcing symmetry constraints. Comprehensive experiments on Cardiac CINE MRI datasets demonstrate that DUN-SRE achieves state-of-the-art performance, particularly in preserving rotation-symmetric structures, offering strong generalization capability to a broad range of dynamic MRI reconstruction tasks.

[CV-118] Ground Reaction Force Estimation via Time-aware Knowledge Distillation

[Quick read]: This paper addresses the limited accuracy and noise sensitivity of wearable insole sensors when estimating ground reaction force (GRF). The key to its solution is a Time-aware Knowledge Distillation framework that uses similarity and temporal features within a mini-batch to capture the complementary relationships between features and the sequential properties of the input and target data, improving the performance of lightweight models on GRF estimation.

Link: https://arxiv.org/abs/2506.10265
Authors: Eun Som Jeon, Sinjini Mitra, Jisoo Lee, Omik M. Save, Ankita Shukla, Hyunglae Lee, Pavan Turaga
Affiliation: Seoul National University of Science and Technology; Arizona State University; University of Nevada, Reno
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Human gait analysis with wearable sensors has been widely used in various applications, such as daily life healthcare, rehabilitation, physical therapy, and clinical diagnostics and monitoring. In particular, ground reaction force (GRF) provides critical information about how the body interacts with the ground during locomotion. Although instrumented treadmills have been widely used as the gold standard for measuring GRF during walking, their lack of portability and high cost make them impractical for many applications. As an alternative, low-cost, portable, wearable insole sensors have been utilized to measure GRF; however, these sensors are susceptible to noise and disturbance and are less accurate than treadmill measurements. To address these challenges, we propose a Time-aware Knowledge Distillation framework for GRF estimation from insole sensor data. This framework leverages similarity and temporal features within a mini-batch during the knowledge distillation process, effectively capturing the complementary relationships between features and the sequential properties of the target and input data. The performance of the lightweight models distilled through this framework was evaluated by comparing GRF estimations from insole sensor data against measurements from an instrumented treadmill. Empirical results demonstrated that Time-aware Knowledge Distillation outperforms current baselines in GRF estimation from wearable sensor data.

[CV-119] Conditional diffusion models for guided anomaly detection in brain images using fluid-driven anomaly randomization

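[Quick read]: This paper addresses the limitation that supervised learning on brain MRI depends on data from diseased subjects, which is hard to obtain in sufficient quantity for rare diseases. The key to its solution is a new weakly supervised framework based on conditional diffusion models that introduces synthetic pseudo-pathology images into the modeling process to better guide the reconstruction of healthy images. The method uses fluid-driven anomaly randomization to generate anatomically coherent synthetic anomalies, improving the model's ability to detect and reconstruct abnormal regions.

Link: https://arxiv.org/abs/2506.10233
Authors: Ana Lawry Aguila, Peirong Liu, Oula Puonti, Juan Eugenio Iglesias
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: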

Abstract:Supervised machine learning has enabled accurate pathology detection in brain MRI, but requires training data from diseased subjects that may not be readily available in some scenarios, for example, in the case of rare diseases. Reconstruction-based unsupervised anomaly detection, in particular using diffusion models, has gained popularity in the medical field as it allows for training on healthy images alone, eliminating the need for large disease-specific cohorts. These methods assume that a model trained on normal data cannot accurately represent or reconstruct anomalies. However, this assumption often fails with models failing to reconstruct healthy tissue or accurately reconstruct abnormal regions i.e., failing to remove anomalies. In this work, we introduce a novel conditional diffusion model framework for anomaly detection and healthy image reconstruction in brain MRI. Our weakly supervised approach integrates synthetically generated pseudo-pathology images into the modeling process to better guide the reconstruction of healthy images. To generate these pseudo-pathologies, we apply fluid-driven anomaly randomization to augment real pathology segmentation maps from an auxiliary dataset, ensuring that the synthetic anomalies are both realistic and anatomically coherent. We evaluate our model’s ability to detect pathology, using both synthetic anomaly datasets and real pathology from the ATLAS dataset. In our extensive experiments, our model: (i) consistently outperforms variational autoencoders, and conditional and unconditional latent diffusion; and (ii) surpasses on most datasets, the performance of supervised inpainting methods with access to paired diseased/healthy images.

[CV-120] Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation

[Quick read]: This paper addresses the challenge that data scarcity poses to machine learning development in medical imaging, and in particular the limiting strategies common in medical latent diffusion model (LDM) training: reliance on short-prompt text encoders, reuse of non-medical LDMs, or fine-tuning that requires large data volumes. The key to its solution is the Class-Conditioned Efficient Large Language model Adapter (CCELLA), a dual-head conditioning approach that simultaneously conditions the LDM U-Net on text features encoded by a non-medical large language model, through cross-attention, and on pathology classification, through the timestep embedding. Combined with a joint loss function and a data-efficient LDM training framework, it enables high-quality medical image synthesis with limited data volume and human annotation, improving model performance and scientific accessibility.

Link: https://arxiv.org/abs/2506.10230
Authors: Emerson P. Grabke, Masoom A. Haider, Babak Taati
Affiliation: Institute of Biomedical Engineering, University of Toronto; Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital; KITE Research Institute, Toronto Rehabilitation Institute, University Health Network; Joint Department of Medical Imaging, University of Toronto, Princess Margaret Hospital, and Sinai Health systems; Department of Computer Science, University of Toronto; Faculty Affiliate of the Vector Institute, Toronto
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MAH and BT are co-senior authors on the work. This work has been submitted to the IEEE for possible publication

Abstract:Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM training typically relies on performance- or scientific accessibility-limiting strategies including a reliance on short-prompt text encoders, the reuse of non-medical LDMs, or a requirement for fine-tuning with large data volumes. We propose a Class-Conditioned Efficient Large Language model Adapter (CCELLA) to address these limitations. CCELLA is a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with non-medical large language model-encoded text features through cross-attention and with pathology classification through the timestep embedding. We also propose a joint loss function and a data-efficient LDM training framework. In combination, these strategies enable pathology-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Our method achieves a 3D FID score of 0.025 on a size-limited prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method to the training dataset improves classifier accuracy from 69% to 74%. Training a classifier solely on our method’s synthetic images achieved comparable performance to training on real images alone.

[CV-121] Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective

[Quick read]: This paper addresses the precise segmentation of brain tumors, especially contrast-enhancing regions in post-contrast MRI. The task is crucial for clinical diagnosis and treatment planning, but current methods perform poorly on these regions, mainly because they do not sufficiently account for MRI-specific tumor features such as complex textures and directional variations. The key to its solution is a brain tumor segmentation network designed from a frequency-domain perspective, the Harmonized Frequency Fusion Network (HFF-Net). A Frequency Domain Decomposition (FDD) module separates MRI images into low- and high-frequency components to characterize tumor regions comprehensively, an Adaptive Laplacian Convolution (ALC) module increases sensitivity to tumor boundaries, and a Frequency Domain Cross-Attention (FDCA) module fuses multi-scale tumor features.

Link: https://arxiv.org/abs/2506.10142
Authors: Minye Shao, Zeyu Wang, Haoran Duan, Yawen Huang, Bing Zhai, Shizheng Wang, Yang Long, Yefeng Zheng
Affiliation: Durham University; Dalian Minzu University; Tsinghua University; Jarvis Research Center, Tencent YouTu Lab; Northumbria University; SunwayAI Research Lab, Fuyang Normal University; Chinese Academy of Sciences R&D Center for Internet of Things
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Medical Imaging

Abstract:Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48% (ranging from 2.39% to 7.72%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: this https URL.
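
The idea of splitting a signal into low-frequency (smooth contour) and high-frequency (texture and edge) parts can be sketched with a plain Fourier mask. This is only an illustration of frequency decomposition in general: the abstract does not say which transform the FDD module uses, and the 1D profile, the naive DFT, and the `cutoff` parameter below are assumptions.

```python
import cmath


def dft(samples):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse transform matching dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def split_frequencies(samples, cutoff):
    """Split a real signal into low- and high-frequency parts that sum to the input."""
    n = len(samples)
    spectrum = dft(samples)
    low_spec = [0j] * n
    high_spec = [0j] * n
    for k, coeff in enumerate(spectrum):
        # Bins k and n - k hold the same frequency with opposite sign.
        if min(k, n - k) <= cutoff:
            low_spec[k] = coeff
        else:
            high_spec[k] = coeff
    low = [value.real for value in inverse_dft(low_spec)]
    high = [value.real for value in inverse_dft(high_spec)]
    return low, high


# Hypothetical 1D intensity profile across a bright region.
profile = [0, 0, 1, 5, 5, 1, 0, 0]
low, high = split_frequencies(profile, cutoff=1)
```

Because the mask is symmetric in positive and negative frequencies, both parts are real, and the high-frequency part carries no mean intensity.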

Artificial Intelligence

[AI-0] Rethinking Losses for Diffusion Bridge Samplers

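[Quick read]: This paper addresses the unclear choice of optimization objective when diffusion bridges are used to sample from unnormalized distributions, in particular which loss function yields better sampling performance. Its key contribution is to show that, for diffusion bridges, the Log Variance (LV) loss lacks the kind of theoretical motivation that the reverse Kullback-Leibler (rKL) loss has, and to propose the rKL loss combined with the log-derivative trick (rKL-LD) as a better objective. This approach avoids the conceptual problems and also shows better performance and more stable training behavior on several benchmarks.

Link: https://arxiv.org/abs/2506.10982
Authors: Sebastian Sanokowski, Lukas Gruber, Christoph Bartmann, Sepp Hochreiter, Sebastian Lehner
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: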

Abstract:Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions. Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients. While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned. Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality. Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) does not only avoid these conceptual problems but also consistently outperforms the LV loss. Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior.
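
The relationship between the two losses can be written out in a simplified setting. The block below is a sketch under stated assumptions, not the paper's path-space formulation: it uses a generic sampler density q_θ, a fixed unnormalized target p̃ that does not depend on θ, and an LV loss defined without a 1/2 factor. Under those assumptions the on-policy LV gradient and the rKL gradient obtained with the log-derivative trick agree up to a constant factor.

```latex
% Log-ratio between the sampler q_\theta and a fixed unnormalized target \tilde p (assumed setting)
w_\theta(x) = \log q_\theta(x) - \log \tilde p(x)

% Reverse KL, up to the additive constant \log Z
\mathcal{L}_{\mathrm{rKL}}(\theta) = \mathbb{E}_{x \sim q_\theta}\left[ w_\theta(x) \right]

% Log-derivative trick, using \mathbb{E}_{q_\theta}[\nabla_\theta \log q_\theta(x)] = 0
\nabla_\theta \mathcal{L}_{\mathrm{rKL}}(\theta)
  = \mathbb{E}_{x \sim q_\theta}\left[ w_\theta(x)\, \nabla_\theta \log q_\theta(x) \right]

% Log Variance loss with a reference distribution r that carries no gradient
\mathcal{L}_{\mathrm{LV}}(\theta) = \mathrm{Var}_{x \sim r}\left[ w_\theta(x) \right]

% On-policy choice r = q_\theta
\nabla_\theta \mathcal{L}_{\mathrm{LV}}(\theta)\Big|_{r = q_\theta}
  = 2\, \mathbb{E}_{x \sim q_\theta}\left[ w_\theta(x)\, \nabla_\theta \log q_\theta(x) \right]
```

If the target side also depended on θ, the gradient of w_θ would contain an extra term whose expectation under q_θ is not zero in general, so this proportionality would generally not hold. That is consistent with the abstract's statement that the equivalence breaks for diffusion bridges and learned diffusion coefficients.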

[AI-1] Principled Approaches for Extending Neural Architectures to Function Spaces for Operator Learning

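[Quick read]: This paper addresses the limited success of neural networks in scientific computing (for example, continuous-time dynamical systems and partial differential equations (PDEs)). Deep learning has mainly been applied to mappings between finite-dimensional spaces, whereas scientific problems are usually formulated on infinite-dimensional function spaces. The key to its solution is a general recipe for converting existing neural network architectures into neural operators, so that mappings between function spaces can be modeled effectively and operator learning can inherit the architectural optimizations already validated in deep learning.

Link: https://arxiv.org/abs/2506.10973
Authors: Julius Berner, Miguel Liu-Schiaffini, Jean Kossaifi, Valentin Duruisseaux, Boris Bonev, Kamyar Azizzadenesheli, Anima Anandkumar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Functional Analysis (math.FA); Numerical Analysis (math.NA)
Comments: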

Abstract:A wide range of scientific problems, such as those described by continuous-time dynamical systems and partial differential equations (PDEs), are naturally formulated on function spaces. While function spaces are typically infinite-dimensional, deep learning has predominantly advanced through applications in computer vision and natural language processing that focus on mappings between finite-dimensional spaces. Such fundamental disparities in the nature of the data have limited neural networks from achieving a comparable level of success in scientific applications as seen in other fields. Neural operators are a principled way to generalize neural networks to mappings between function spaces, offering a pathway to replicate deep learning’s transformative impact on scientific problems. For instance, neural operators can learn solution operators for entire classes of PDEs, e.g., physical systems with different boundary conditions, coefficient functions, and geometries. A key factor in deep learning’s success has been the careful engineering of neural architectures through extensive empirical testing. Translating these neural architectures into neural operators allows operator learning to enjoy these same empirical optimizations. However, prior neural operator architectures have often been introduced as standalone models, not directly derived as extensions of existing neural network architectures. In this paper, we identify and distill the key principles for constructing practical implementations of mappings between infinite-dimensional function spaces. Using these principles, we propose a recipe for converting several popular neural architectures into neural operators with minimal modifications. This paper aims to guide practitioners through this process and details the steps to make neural operators work in practice. Our code can be found at this https URL

[AI-2] Farseer: A Refined Scaling Law in Large Language Models

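[Quick read]: This paper addresses the "scaling gap" caused by the high cost of training large language models (LLMs): insights from small-scale experiments often fail to transfer to resource-intensive production systems, which hinders efficient innovation. The key to its solution is Farseer, a refined scaling law that systematically constructs a model loss surface L(N,D) and fits empirical data better than prior laws such as Chinchilla's law. The abstract reports markedly better predictive accuracy, especially in extrapolation, where it states a 433% reduction in extrapolation error. This allows competing training strategies to be evaluated reliably across scales and configurations, and lets results from small-scale ablation studies be extrapolated to predict large-scale performance.

Link: https://arxiv.org/abs/2506.10972
Authors: Houyi Li, Wenzhen Zheng, Qiufeng Wang, Zhenyu Ding, Haoying Wang, Zili Wang, Shijie Xuyang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34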

Abstract:Training Large Language Models (LLMs) is prohibitively expensive, creating a critical scaling gap where insights from small-scale experiments often fail to transfer to resource-intensive production systems, thereby hindering efficient innovation. To bridge this, we introduce Farseer, a novel and refined scaling law offering enhanced predictive accuracy across scales. By systematically constructing a model loss surface L(N,D), Farseer achieves a significantly better fit to empirical data than prior laws (e.g., Chinchilla’s law). Our methodology yields accurate, robust, and highly generalizable predictions, demonstrating excellent extrapolation capabilities, improving upon Chinchilla’s law by reducing extrapolation error by 433%. This allows for the reliable evaluation of competing training strategies across all (N,D) settings, enabling conclusions from small-scale ablation studies to be confidently extrapolated to predict large-scale performance. Furthermore, Farseer provides new insights into optimal compute allocation, better reflecting the nuanced demands of modern LLM training. To validate our approach, we trained an extensive suite of approximately 1,000 LLMs across diverse scales and configurations, consuming roughly 3 million NVIDIA H100 GPU hours. We are comprehensively open-sourcing all models, data, results, and logs at this https URL to foster further research.
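
For readers unfamiliar with a loss surface L(N,D), the sketch below evaluates the commonly cited Chinchilla-style parametric form, L = E + A/N^alpha + B/D^beta, where N is the parameter count and D the number of training tokens. This is the baseline family the abstract compares against, not Farseer's own functional form, which the abstract does not give. The coefficient values are illustrative assumptions, not fitted results from either work.

```python
def chinchilla_style_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric loss surface L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Illustrative coefficients only (assumed for this sketch, not fitted values).
COEFFS = dict(E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28)

# Hypothetical small and large training runs.
small = chinchilla_style_loss(1e8, 2e9, **COEFFS)
large = chinchilla_style_loss(1e10, 2e11, **COEFFS)
print(f"small run: {small:.3f}, large run: {large:.3f}")
```

A scaling law of this kind is fitted on small runs and then evaluated at larger (N, D) to extrapolate; the abstract's claim concerns how accurate that extrapolation is.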

[AI-3] Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

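[Quick read]: This paper addresses the lack of theoretical understanding of in-context learning (ICL) for structured geometric data. The key to its solution is a new connection between the attention mechanism and classical kernel methods, from which it derives generalization error bounds in terms of the prompt length and the number of training tasks. It further shows that, when enough training tasks are observed, transformers attain the minimax regression rate for Hölder functions on manifolds; this rate scales exponentially with the intrinsic dimension of the manifold rather than the dimension of the ambient space, shedding light on the complexity of transformers as in-context algorithm learners.

Link: https://arxiv.org/abs/2506.10959
Authors: Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Comments: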

Abstract:While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. In this work, we initiate a theoretical study of ICL for regression of Hölder functions on manifolds. By establishing a novel connection between the attention mechanism and classical kernel methods, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
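
One textbook way to see a link between attention and kernel methods is that a softmax over negative scaled squared distances produces exactly the weights of a Gaussian-kernel Nadaraya-Watson estimator. The sketch below checks this numerically on points sampled from a circle, a one-dimensional manifold in the plane. It illustrates the general correspondence only; the paper's specific construction is not described in the abstract, and the data, bandwidth, and function names here are assumptions.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_predict(query, keys, values, bandwidth):
    """Single-head softmax attention with scores -||q - k||^2 / (2 h^2)."""
    scores = [-squared_distance(query, key) / (2 * bandwidth ** 2) for key in keys]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


def kernel_predict(query, points, targets, bandwidth):
    """Nadaraya-Watson regression with a Gaussian kernel of width h."""
    kernel = [math.exp(-squared_distance(query, p) / (2 * bandwidth ** 2)) for p in points]
    return sum(k * y for k, y in zip(kernel, targets)) / sum(kernel)


# Hypothetical prompt: 16 points on the unit circle with targets sin(2 * angle).
angles = [2 * math.pi * i / 16 for i in range(16)]
prompt_x = [(math.cos(a), math.sin(a)) for a in angles]
prompt_y = [math.sin(2 * a) for a in angles]
query = (math.cos(0.3), math.sin(0.3))

attn = attention_predict(query, prompt_x, prompt_y, bandwidth=0.5)
nw = kernel_predict(query, prompt_x, prompt_y, bandwidth=0.5)
```

The two predictions agree up to floating-point rounding, since the softmax normalization and the kernel normalization are the same operation.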

[AI-4] SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

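[Quick read]: This paper addresses the difficulty of building large-scale datasets for GitHub issue resolution, a process that is especially tedious and time-consuming when setting up evaluation environments, grading test outcomes, and validating task instances. The key to its solution is SWE-Factory, an automated pipeline with three core components: SWE-Builder, a multi-agent system that automates evaluation environment construction; a standardized exit-code-based grading method that removes the need to hand-write custom parsers; and automated fail2pass validation driven by the reliable exit-code signals. Together these components improve the efficiency and accuracy of dataset construction.

Link: https://arxiv.org/abs/2506.10954
Authors: Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: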

Abstract:Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at this https URL.
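
The exit-code-based grading and fail2pass validation described above can be sketched in a few lines of Python. This is a simplified illustration, not SWE-Factory's code: `run_tests` and `is_fail2pass` are hypothetical helper names, and the two one-line Python commands stand in for a real test suite run before and after a patch is applied.

```python
import subprocess
import sys


def run_tests(command):
    """Run a test command and grade it purely by exit code: 0 means pass."""
    completed = subprocess.run(command, capture_output=True)
    return completed.returncode == 0


def is_fail2pass(passed_before_patch, passed_after_patch):
    """A task instance is valid if tests fail before the fix and pass after it."""
    return (not passed_before_patch) and passed_after_patch


# Stand-ins for "run the test suite" before and after applying the patch.
before = run_tests([sys.executable, "-c", "import sys; sys.exit(1)"])
after = run_tests([sys.executable, "-c", "import sys; sys.exit(0)"])
valid_instance = is_fail2pass(before, after)
```

Grading by exit code avoids parsing framework-specific test logs, which is the motivation the abstract gives for replacing custom parsers.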

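[AI-5] Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

[Quick read]: This paper addresses the failure of current large language model (LLM) safety defenses under decomposition attacks, in which a malicious goal is split into seemingly benign subtasks that circumvent refusals. The key to its solution is an external monitor that observes the conversation at a higher granularity, implemented as a lightweight sequential monitoring framework that cumulatively evaluates each subtask to detect potentially malicious intent in real time. The abstract reports a 93% defense success rate, robustness to random task injection, a 90% cost reduction, and a 50% latency reduction.

Link: https://arxiv.org/abs/2506.10949
Authors: Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: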

Abstract:Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attack is broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intents. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.
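
The difference between judging each prompt in isolation and evaluating the conversation cumulatively can be shown with a toy monitor. Everything in this sketch is an assumption for illustration: `toy_risk_score` is a hypothetical keyword counter standing in for the prompted LLM monitor the abstract describes, and the placeholder subtasks and threshold carry no meaning beyond the example.

```python
def toy_risk_score(history):
    """Hypothetical stand-in for an LLM monitor: fraction of flagged pieces seen so far."""
    flagged = {"alpha", "beta", "gamma"}
    words = {word.strip(".,") for word in " ".join(history).lower().split()}
    return len(words & flagged) / len(flagged)


def sequential_monitor(subtasks, score_fn, threshold):
    """Return the index of the first subtask at which the cumulative score crosses
    the threshold, or None if the conversation never does."""
    history = []
    for index, task in enumerate(subtasks):
        history.append(task)
        # Score the whole history so far, not just the latest request.
        if score_fn(history) >= threshold:
            return index
    return None


# Placeholder subtasks; the second one mimics an injected unrelated task.
subtasks = [
    "fetch component alpha",
    "summarize the weather",
    "fetch component beta",
    "fetch component gamma",
]
flagged_at = sequential_monitor(subtasks, toy_risk_score, threshold=0.6)
per_task_scores = [toy_risk_score([task]) for task in subtasks]
```

No single subtask reaches the threshold on its own, but the accumulated history does at the third request, which is the behavior a sequential monitor is meant to capture.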

[AI-6] Spurious Rewards: Rethinking Training Signals in RLVR

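[Quick read]: This paper examines whether reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning even when the reward signal is uninformative. Its key finding is that spurious rewards with little, no, or even negative correlation with the correct answer can still improve performance on math tasks through RLVR: for example, Qwen2.5-Math-7B improves markedly on MATH-500, close to the gain obtained with ground-truth rewards. The abstract also notes that these gains often fail to appear for other model families such as Llama3 or OLMo2. The authors hypothesize that RLVR surfaces useful reasoning representations learned during pretraining, though the exact mechanism is left to future work.

Link: https://arxiv.org/abs/2506.10947
Authors: Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, Yulia Tsvetkov, Hannaneh Hajishirzi, Pang Wei Koh, Luke Zettlemoyer
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: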

Abstract:We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting) – nearly matching the 29.1% gained with ground truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families like Llama3 or OLMo2. In particular, we find code reasoning – thinking in code without actual code execution – to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research should possibly be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
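
To make the reward types named in the abstract concrete, the sketch below gives one plausible reading of each as a Python function. The exact definitions used in the paper are not stated in the abstract, so the details here are assumptions: the 0.5 probability for the random reward, the `\boxed{...}` check for the format reward, and all function names.

```python
import random
import re
from collections import Counter


def ground_truth_reward(answer, label):
    """Standard verifiable reward: 1 if the answer matches the correct label."""
    return 1.0 if answer == label else 0.0


def random_reward(rng, probability=0.5):
    """Reward unrelated to the response (the probability is an assumed value)."""
    return 1.0 if rng.random() < probability else 0.0


def format_reward(response):
    """Reward any response containing a \\boxed{...} expression (assumed format check)."""
    return 1.0 if re.search(r"\\boxed\{[^}]*\}", response) else 0.0


def majority_vote_label(sampled_answers):
    """Pseudo-label for majority-voting rewards: the most frequent sampled answer."""
    return Counter(sampled_answers).most_common(1)[0][0]


# Hypothetical rollouts for a single problem.
answers = ["3", "5", "3"]
pseudo_label = majority_vote_label(answers)
rewards = [ground_truth_reward(a, pseudo_label) for a in answers]
```

Only the ground-truth reward depends on the correct answer; the others can be computed without it, which is what makes them spurious in the abstract's sense.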

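[AI-7] The Role of Generative AI in Facilitating Social Interactions: A Scoping Review

[Quick read]: This paper addresses the unclear role of generative AI (GAI) technologies in facilitating social interaction, and the unclear methods used to design and evaluate them. It systematically reviews 30 studies published since 2020, analyzing the design strategies of GAI applications in storytelling, socio-emotional skills training, reminiscence, collaborative learning, music making, and general conversation. It highlights the role of participatory and co-design approaches in improving both the effectiveness of the technology and social engagement, and examines socio-ethical concerns such as cultural bias and accessibility.

Link: https://arxiv.org/abs/2506.10927
Authors: T. T. J. E. Arets, G. Perugia, M. Houben, W.A. IJsselsteijn
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint version of a manuscript submitted to ACM Transactions on Computer-Human Interaction (TOCHI), under review. 39 pages, 4 figures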

Abstract:Reduced social connectedness increasingly poses a threat to mental health, life expectancy, and general well-being. Generative AI (GAI) technologies, such as large language models (LLMs) and image generation tools, are increasingly integrated into applications aimed at enhancing human social experiences. Despite their growing presence, little is known about how these technologies influence social interactions. This scoping review investigates how GAI-based applications are currently designed to facilitate social interaction, what forms of social engagement they target, and which design and evaluation methodologies designers use to create and evaluate them. Through an analysis of 30 studies published since 2020, we identify key trends in application domains including storytelling, socio-emotional skills training, reminiscence, collaborative learning, music making, and general conversation. We highlight the role of participatory and co-design approaches in fostering both effective technology use and social engagement, while also examining socio-ethical concerns such as cultural bias and accessibility. This review underscores the potential of GAI to support dynamic and personalized interactions, but calls for greater attention to equitable design practices and inclusive evaluation strategies.

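[AI-8] Agentic Semantic Control for Autonomous Wireless Space Networks: Extending Space-O-RAN with MCP-Driven Distributed Intelligence

[Quick read]: This paper addresses the stringent requirements that lunar surface operations place on wireless communication systems, including autonomy, robustness to disruption, and the ability to adapt to environmental and mission-driven context. The key to its solution is a new extension of Space-O-RAN that introduces a semantic agentic layer based on the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication protocols, enabling context-aware decision making across real-time, near-real-time, and non-real-time control layers. Distributed cognitive agents deployed in rovers, landers, and lunar base stations implement wireless-aware coordination strategies, including delay-adaptive reasoning and bandwidth-aware semantic compression.

Link: https://arxiv.org/abs/2506.10925
Authors: Eduardo Baena,Paolo Testolina,Michele Polese,Sergi Aliaga,Andrew Benincasa,Dimitrios Koutsonikolas,Josep Jornet,Tommaso Melodia
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
Comments: Lunar Surface Innovation Consortium 2025 Spring Meeting, May 20-22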

Abstract:Lunar surface operations impose stringent requirements on wireless communication systems, including autonomy, robustness to disruption, and the ability to adapt to environmental and mission-driven context. While Space-O-RAN provides a distributed orchestration model aligned with 3GPP standards, its decision logic is limited to static policies and lacks semantic integration. We propose a novel extension incorporating a semantic agentic layer enabled by the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication protocols, allowing context-aware decision making across real-time, near-real-time, and non-real-time control layers. Distributed cognitive agents deployed in rovers, landers, and lunar base stations implement wireless-aware coordination strategies, including delay-adaptive reasoning and bandwidth-aware semantic compression, while interacting with multiple MCP servers to reason over telemetry, locomotion planning, and mission constraints.

[AI-9] GenPlanX. Generation of Plans and Execution

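[Quick read]: This paper addresses the inability of classical AI planning techniques to understand planning tasks described in natural language, and asks how human-AI collaboration can improve task execution efficiency. The key to its solution is integrating large language models (LLMs) with a classical AI planning engine, together with an execution and monitoring framework, so that tasks described in natural language can be planned and executed efficiently.

Link: https://arxiv.org/abs/2506.10897
Authors: Daniel Borrajo,Giuseppe Canonaco,Tomás de la Rosa,Alfredo Garrachón,Sriram Gopalakrishnan,Simerjot Kaur,Marianela Morales,Sunandita Patra,Alberto Pozanco,Keshav Ramani,Charese Smiley,Pietro Totis,Manuela Veloso
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: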

Abstract:Classical AI Planning techniques generate sequences of actions for complex tasks. However, they lack the ability to understand planning tasks when provided using natural language. The advent of Large Language Models (LLMs) has introduced novel capabilities in human-computer interaction. In the context of planning tasks, LLMs have shown to be particularly good in interpreting human intents among other uses. This paper introduces GenPlanX that integrates LLMs for natural language-based description of planning tasks, with a classical AI planning engine, alongside an execution and monitoring framework. We demonstrate the efficacy of GenPlanX in assisting users with office-related tasks, highlighting its potential to streamline workflows and enhance productivity through seamless human-AI collaboration.

[AI-10] Data-Driven Prediction of Dynamic Interactions Between Robot Appendage and Granular Material

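[Quick read]: This paper addresses the modeling of interactions between robot motion and granular terrain, aiming to gain fundamental insight into these interactions through a data-driven approach. The key to its solution is integrating dimension reduction (Sequentially Truncated Higher-Order Singular Value Decomposition), surrogate modeling (Gaussian Process), and data assimilation (Reduced Order Particle Filter) to obtain efficient and accurate predictions. Using high-fidelity simulation data collected offline plus a small set of experimental data, the method stays computationally efficient, produces predictions comparable to physics-based simulations, and has the potential to outperform high-fidelity simulation alone on long-horizon predictions.

Link: https://arxiv.org/abs/2506.10875
Authors: Guanjin Wang,Xiangxue Zhao,Shapour Azarm,Balakumar Balachandran
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: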

Abstract:An alternative data-driven modeling approach has been proposed and employed to gain fundamental insights into robot motion interaction with granular terrain at certain length scales. The approach is based on an integration of dimension reduction (Sequentially Truncated Higher-Order Singular Value Decomposition), surrogate modeling (Gaussian Process), and data assimilation techniques (Reduced Order Particle Filter). This approach can be used online and is based on offline data, obtained from the offline collection of high-fidelity simulation data and a set of sparse experimental data. The results have shown that orders of magnitude reduction in computational time can be obtained from the proposed data-driven modeling approach compared with physics-based high-fidelity simulations. With only simulation data as input, the data-driven prediction technique can generate predictions that have comparable accuracy as simulations. With both simulation data and sparse physical experimental measurement as input, the data-driven approach with its embedded data assimilation techniques has the potential in outperforming only high-fidelity simulations for the long-horizon predictions. In addition, it is demonstrated that the data-driven modeling approach can also reproduce the scaling relationship recovered by physics-based simulations for maximum resistive forces, which may indicate its general predictability beyond a case-by-case basis. The results are expected to help robot navigation and exploration in unknown and complex terrains during both online and offline phases.

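[AI-11] Precise Zero-Shot Pointwise Ranking with LLMs through Post-Aggregated Global Context Information SIGIR2025

[Quick read]: This paper aims to improve the effectiveness of pointwise methods for document ranking while preserving their computational efficiency. Conventional pointwise methods score each candidate document independently, which is efficient but ignores comparative information between documents, leading to inconsistent scores and degraded performance. The key solution is a Global-Consistent Comparative Pointwise Ranking (GCCP) strategy that uses an anchor document as a reference; the anchor is a query-focused summary of pseudo-relevant documents and captures global context for contrastive scoring. In addition, post-aggregation (PAGC) efficiently combines the contrastive scores with existing pointwise methods, incorporating global context information without additional training.

Link: https://arxiv.org/abs/2506.10859
Authors: Kehan Long,Shasha Li,Chen Xu,Jintao Tang,Ting Wang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGIR 2025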

Abstract:Recent advancements have successfully harnessed the power of Large Language Models (LLMs) for zero-shot document ranking, exploring a variety of prompting strategies. Comparative approaches like pairwise and listwise achieve high effectiveness but are computationally intensive and thus less practical for larger-scale applications. Scoring-based pointwise approaches exhibit superior efficiency by independently and simultaneously generating the relevance scores for each candidate document. However, this independence ignores critical comparative insights between documents, resulting in inconsistent scoring and suboptimal performance. In this paper, we aim to improve the effectiveness of pointwise methods while preserving their efficiency through two key innovations: (1) We propose a novel Global-Consistent Comparative Pointwise Ranking (GCCP) strategy that incorporates global reference comparisons between each candidate and an anchor document to generate contrastive relevance scores. We strategically design the anchor document as a query-focused summary of pseudo-relevant candidates, which serves as an effective reference point by capturing the global context for document comparison. (2) These contrastive relevance scores can be efficiently Post-Aggregated with existing pointwise methods, seamlessly integrating essential Global Context information in a training-free manner (PAGC). Extensive experiments on the TREC DL and BEIR benchmark demonstrate that our approach significantly outperforms previous pointwise methods while maintaining comparable efficiency. Our method also achieves competitive performance against comparative methods that require substantially more computational resources. More analyses further validate the efficacy of our anchor construction strategy.
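
The post-aggregation idea can be sketched as below. The abstract does not give the aggregation formula, so the linear mix, the weight `alpha`, and all scores here are hypothetical.

```python
def post_aggregate(pointwise, contrastive, alpha=0.5):
    """Blend each document's independent pointwise score with its
    anchor-relative contrastive score (hypothetical linear mix)."""
    return {
        doc: (1 - alpha) * pointwise[doc] + alpha * contrastive[doc]
        for doc in pointwise
    }

def rank(scores):
    """Return document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

# Made-up scores: d1 and d2 tie pointwise; the anchor comparison breaks the tie.
pointwise = {"d1": 0.7, "d2": 0.7, "d3": 0.2}
contrastive = {"d1": 0.1, "d2": 0.9, "d3": 0.3}
final = post_aggregate(pointwise, contrastive)
print(rank(final))
```

Because each contrastive score needs only one comparison against the shared anchor, the cost stays linear in the number of candidates, unlike pairwise or listwise prompting.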

[AI-12] A Study on Individual Spatiotemporal Activity Generation Method Using MCP-Enhanced Chain-of-Thought Large Language Models

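[Quick read]: This paper addresses the high computational cost, limited generalizability, and poor scalability of traditional rule-based and statistical methods for simulating human spatiotemporal behavior in urban planning research. The key to its solution is a framework that combines chain-of-thought (CoT) reasoning with the Model Context Protocol (MCP): a five-stage cognitive framework provides human-like progressive reasoning, and six specialized categories of MCP tools (temporal management, spatial navigation, environmental perception, personal memory, social collaboration, and experience evaluation) handle comprehensive data processing. Together these improve the ability of large language models (LLMs) to simulate spatiotemporal behaviors that match validation data patterns.

Link: https://arxiv.org/abs/2506.10853
Authors: Yu Zhang,Yang Hu,De Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: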

Abstract:Human spatiotemporal behavior simulation is critical for urban planning research, yet traditional rule-based and statistical approaches suffer from high computational costs, limited generalizability, and poor scalability. While large language models (LLMs) show promise as “world simulators,” they face challenges in spatiotemporal reasoning including limited spatial cognition, lack of physical constraint understanding, and group homogenization tendencies. This paper introduces a framework integrating chain-of-thought (CoT) reasoning with Model Context Protocol (MCP) to enhance LLMs’ capability in simulating spatiotemporal behaviors that correspond with validation data patterns. The methodology combines human-like progressive reasoning through a five-stage cognitive framework with comprehensive data processing via six specialized MCP tool categories: temporal management, spatial navigation, environmental perception, personal memory, social collaboration, and experience evaluation. Experiments in Shanghai’s Lujiazui district validate the framework’s effectiveness across 1,000 generated samples. Results demonstrate high similarity with real mobile signaling data, achieving generation quality scores of 7.86 to 8.36 across different base models. Parallel processing experiments show efficiency improvements, with generation times decreasing from 1.30 to 0.17 minutes per sample when scaling from 2 to 12 processes. This work contributes to integrating CoT reasoning with MCP for urban behavior modeling, advancing LLMs applications in urban computing and providing a practical approach for synthetic mobility data generation. The framework offers a foundation for smart city planning, transportation forecasting, and participatory urban design applications.

[AI-13] Efficiency Robustness of Dynamic Deep Learning Systems USENIX-SECURITY’25

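[Quick read]: This paper addresses the robustness of Dynamic Deep Learning Systems (DDLSs) against efficiency adversarial attacks. The key to its solution is a systematic analysis of the new attack surfaces introduced by dynamic behavior in DDLSs, together with the first comprehensive taxonomy of efficiency attacks, which provides a theoretical basis and direction for designing future defenses.

Link: https://arxiv.org/abs/2506.10831
Authors: Ravishka Rathnasuriya,Tingxi Li,Zexin Xu,Zihe Song,Mirazul Haque,Simin Chen,Wei Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to USENIX Security '25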

Abstract:Deep Learning Systems (DLSs) are increasingly deployed in real-time applications, including those in resource-constrained environments such as mobile and IoT devices. To address efficiency challenges, Dynamic Deep Learning Systems (DDLSs) adapt inference computation based on input complexity, reducing overhead. While this dynamic behavior improves efficiency, such behavior introduces new attack surfaces. In particular, efficiency adversarial attacks exploit these dynamic mechanisms to degrade system performance. This paper systematically explores efficiency robustness of DDLSs, presenting the first comprehensive taxonomy of efficiency attacks. We categorize these attacks based on three dynamic behaviors: (i) attacks on dynamic computations per inference, (ii) attacks on dynamic inference iterations, and (iii) attacks on dynamic output production for downstream tasks. Through an in-depth evaluation, we analyze adversarial strategies that target DDLSs efficiency and identify key challenges in securing these systems. In addition, we investigate existing defense mechanisms, demonstrating their limitations against increasingly popular efficiency attacks and the necessity for novel mitigation strategies to secure future adaptive DDLSs.
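
One common form of "dynamic computations per inference" is early exit, where a network stops as soon as an intermediate classifier is confident. The toy sketch below assumes that mechanism; the confidence values and the threshold are made up, and it is not taken from the paper.

```python
def early_exit_depth(confidences, threshold=0.9):
    """Return how many layers run before an exit head is confident enough.

    `confidences[i]` is the (hypothetical) exit-head confidence after layer i.
    If no head reaches the threshold, every layer is executed.
    """
    for depth, conf in enumerate(confidences, start=1):
        if conf >= threshold:
            return depth
    return len(confidences)

benign = [0.55, 0.93, 0.97, 0.99]       # confident after 2 layers
adversarial = [0.40, 0.45, 0.50, 0.60]  # crafted so that no head is confident

print(early_exit_depth(benign), early_exit_depth(adversarial))
```

An efficiency attack in this setting does not need to change the prediction; it only needs to keep every exit head below the threshold so the full network always runs.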

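[AI-14] LLM-Driven Personalized Answer Generation and Evaluation

[Quick read]: This paper addresses the lack of personalized answers in online learning, aiming to use generative AI to provide answers tailored to learners' questions, improving the learning experience and reducing educators' workload. The key to its solution is using large language models (LLMs) to generate personalized answers from example answers written by the learner or by similar learners; experiments show that providing such examples significantly improves the LLMs' ability to respond to individual learners' needs.

Link: https://arxiv.org/abs/2506.10829
Authors: Mohammadreza Molavi,Mohammadreza Tavakoli,Mohammad Moein,Abdolali Faraji,Gábor Kismihók
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This is the preprint version of a paper accepted at AIED 2025. The final version will be published by Springer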

Abstract:Online learning has experienced rapid growth due to its flexibility and accessibility. Personalization, adapted to the needs of individual learners, is crucial for enhancing the learning experience, particularly in online settings. A key aspect of personalization is providing learners with answers customized to their specific questions. This paper therefore explores the potential of Large Language Models (LLMs) to generate personalized answers to learners’ questions, thereby enhancing engagement and reducing the workload on educators. To evaluate the effectiveness of LLMs in this context, we conducted a comprehensive study using the StackExchange platform in two distinct areas: language learning and programming. We developed a framework and a dataset for validating automatically generated personalized answers. Subsequently, we generated personalized answers using different strategies, including 0-shot, 1-shot, and few-shot scenarios. The generated answers were evaluated using three methods: 1. BERTScore, 2. LLM evaluation, and 3. human evaluation. Our findings indicated that providing LLMs with examples of desired answers (from the learner or similar learners) can significantly enhance the LLMs’ ability to tailor responses to individual learners’ needs.
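
The 0-shot, 1-shot, and few-shot settings differ only in how many example answers are placed in the prompt. A minimal sketch follows; the prompt wording and the `build_prompt` helper are hypothetical, not the paper's templates.

```python
def build_prompt(question, examples=()):
    """Assemble a prompt with zero or more (question, answer) examples."""
    parts = ["Answer the learner's question."]
    for i, (q, a) in enumerate(examples, start=1):
        # Each example shows the answer style preferred by this learner
        # (or by similar learners).
        parts.append(f"Example {i}\nQuestion: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

zero_shot = build_prompt("How do I reverse a list in Python?")
one_shot = build_prompt(
    "How do I reverse a list in Python?",
    [("How do I sort a list?", "Use sorted(xs); it returns a new list.")],
)
```

The same function covers the few-shot case by passing several example pairs.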

[AI-15] What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps

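[Quick read]: This paper addresses the fact that how users perceive, evaluate, and critique the AI features of AI-powered mobile apps remains unclear, largely because the sheer volume of user feedback makes it hard to study. The key to its solution is a multi-stage analysis pipeline, developed and validated starting from a human-labeled benchmark, that systematically evaluates large language models (LLMs) and prompting strategies across review classification, aspect-sentiment extraction, and clustering, with each stage checked for accuracy and consistency. The pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics, and reveals both the core themes users care about and fine-grained, co-occurring sentiments.

Link: https://arxiv.org/abs/2506.10785
Authors: Vinaik Chhetri,Krishna Upadhyay,A.B. Siddique,Umar Farooq
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, 5 tables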

Abstract:Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. To this end, our approach provides a more faithful reflection of the real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.
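
The point about co-occurring sentiments can be made concrete with a small data-structure sketch. The tuple layout, the helper names, and the sample reviews are all made up for illustration.

```python
from collections import defaultdict

def group_by_review(pairs):
    """Map review id -> {"positive": [...], "negative": [...]} aspect lists."""
    grouped = defaultdict(lambda: {"positive": [], "negative": []})
    for review_id, aspect, sentiment in pairs:
        grouped[review_id][sentiment].append(aspect)
    return dict(grouped)

def mixed_reviews(grouped):
    """Review ids that contain both positive and negative aspects."""
    return [rid for rid, s in grouped.items() if s["positive"] and s["negative"]]

pairs = [
    ("r1", "text recognition", "negative"),
    ("r1", "note summaries", "positive"),
    ("r2", "pricing", "negative"),
]
grouped = group_by_review(pairs)
print(mixed_reviews(grouped))
```

A review-level sentiment label would collapse `r1` into a single polarity; keeping aspect-sentiment pairs preserves both signals.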

[AI-16] ME: Trigger Element Combination Backdoor Attack on Copyright Infringement

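[Quick read]: This paper addresses a key security issue for generative diffusion models (DMs) in text-to-image tasks: the Copyright Infringement Attack, which exploits the models' tendency to replicate training data, in a setting where usable data resources are limited and existing attacks perform poorly. The key to its solution is proposing new datasets to support research in this area and extending the SilentBadDiffusion (SBD) method into a Multi-Element (ME) attack strategy, which increases the number of poisonous visual-text elements per poisoned sample to strengthen the attack while applying the Discrete Cosine Transform (DCT) to keep the attack stealthy.

Link: https://arxiv.org/abs/2506.10776
Authors: Feiyu Yang,Siyuan Liang,Aishan Liu,Dacheng Tao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: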

Abstract:The capability of generative diffusion models (DMs) like Stable Diffusion (SD) in replicating training data could be taken advantage of by attackers to launch the Copyright Infringement Attack, with duplicated poisoned image-text pairs. SilentBadDiffusion (SBD) is a method proposed recently, which showed outstanding performance in attacking SD in text-to-image tasks. However, the feasible data resources in this area are still limited, some of them are even constrained or prohibited due to the issues like copyright ownership or inappropriate contents; and not all of the images in current datasets are suitable for the proposed attacking methods; besides, the state-of-the-art (SoTA) performance of SBD is far from ideal when few generated poisoning samples could be adopted for attacks. In this paper, we raised new datasets accessible for researching in attacks like SBD, and proposed Multi-Element (ME) attack method based on SBD by increasing the number of poisonous visual-text elements per poisoned sample to enhance the ability of attacking, while importing Discrete Cosine Transform (DCT) for the poisoned samples to maintain the stealthiness. The Copyright Infringement Rate (CIR) / First Attack Epoch (FAE) we got on the two new datasets were 16.78% / 39.50 and 51.20% / 23.60, respectively close to or even outperformed benchmark Pokemon and Midjourney datasets. In condition of low subsampling ratio (5%, 6 poisoned samples), MESI and DCT earned CIR / FAE of 0.23% / 84.00 and 12.73% / 65.50, both better than original SBD, which failed to attack at all.

[AI-17] OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

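[Quick read]: This paper addresses the insufficiently explored ability of large language models (LLMs) to iteratively optimize complex solutions by learning from previous feedback. The key to its solution is OPT-BENCH, a comprehensive benchmark for evaluating LLM agents on large-scale search space optimization problems, together with OPT-Agent, an end-to-end optimization framework that generates, validates, and iteratively improves solutions using historical feedback, emulating human reasoning on complex problems.

Link: https://arxiv.org/abs/2506.10764
Authors: Xiaozhe Li,Jixuan Chen,Xinyu Fang,Shengyuan Ding,Haodong Duan,Qingwen Liu,Kai Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: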

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: this https URL.
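
The generate, validate, and improve cycle with historical feedback can be sketched as a generic loop. This is an illustrative skeleton under stated assumptions: `optimize`, `make_proposer`, and the toy objective are hypothetical, and a random hill-climber stands in for the LLM agent.

```python
import random

def optimize(propose, evaluate, iterations=20, use_history=True):
    """Generic propose -> evaluate -> refine loop.

    `propose(history)` returns a candidate; `history` is a list of
    (candidate, score) pairs, or an empty list when history is disabled.
    """
    history = []
    best = None
    for _ in range(iterations):
        candidate = propose(history if use_history else [])
        score = evaluate(candidate)
        history.append((candidate, score))
        if best is None or score > best[1]:
            best = (candidate, score)
    return best

def make_proposer(rng):
    """Toy stand-in for an LLM: perturb the best candidate seen so far."""
    def propose(history):
        if not history:
            return rng.uniform(-10, 10)
        best_x, _ = max(history, key=lambda h: h[1])
        return best_x + rng.uniform(-1, 1)
    return propose

def evaluate(x):
    """Toy objective with its maximum (0.0) at x = 3."""
    return -(x - 3.0) ** 2

best_x, best_score = optimize(make_proposer(random.Random(0)), evaluate, iterations=50)
```

Setting `use_history=False` turns the same loop into independent sampling, which is the kind of ablation the abstract's "historical context" finding refers to.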

[AI-18] Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding

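[Quick read]: This paper aims to address two key bottlenecks in vision-and-language navigation (VLN): generalization to out-of-distribution environments and reliance on fixed discrete action spaces. The key to its solution is Vision-Language Fly (VLFly), a framework that integrates three modules (an instruction encoder, a goal retriever, and a waypoint planner) to generate continuous velocity commands from language instructions. It navigates purely from egocentric observations captured by an onboard monocular camera, with no need for localization or active ranging sensors.

Link: https://arxiv.org/abs/2506.10756
Authors: Yuhang Zhang,Haosheng Yu,Jiaping Xiao,Mir Feroskhan
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: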

Abstract:Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. Two key bottlenecks remain in this field: generalization to out-of-distribution environments and reliance on fixed discrete action spaces. To address these challenges, we propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight. Without the requirement for localization or active ranging sensors, VLFly outputs continuous velocity commands purely from egocentric observations captured by an onboard monocular camera. The VLFly integrates three modules: an instruction encoder based on a large language model (LLM) that reformulates high-level language into structured prompts, a goal retriever powered by a vision-language model (VLM) that matches these prompts to goal images via vision-language similarity, and a waypoint planner that generates executable trajectories for real-time UAV control. VLFly is evaluated across diverse simulation environments without additional fine-tuning and consistently outperforms all baselines. Moreover, real-world VLN tasks in indoor and outdoor environments under direct and indirect instructions demonstrate that VLFly achieves robust open-vocabulary goal understanding and generalized navigation capabilities, even in the presence of abstract language input.
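
The goal retriever's matching step can be illustrated with cosine similarity over embeddings. The abstract says only "vision-language similarity", so the cosine metric, the three-dimensional vectors, and the goal names below are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_goal(prompt_embedding, goal_embeddings):
    """Return the name of the goal image most similar to the prompt."""
    return max(
        goal_embeddings,
        key=lambda name: cosine(prompt_embedding, goal_embeddings[name]),
    )

# Made-up embeddings; a real system would obtain these from a VLM encoder.
goals = {
    "backpack": [0.9, 0.1, 0.0],
    "chair": [0.1, 0.9, 0.2],
    "cone": [0.0, 0.2, 0.9],
}
print(retrieve_goal([0.8, 0.2, 0.1], goals))
```

Because matching happens in a shared embedding space rather than over a fixed label set, the same lookup works for goals described in open-vocabulary language.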

[AI-19] BNMusic: Blending Environmental Noises into Personalized Music

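[Quick read]: This paper addresses the annoyance caused by environmental noise. Conventional acoustic masking often needs an excessive volume increase to mask effectively, because the dominant sound and the noise are misaligned (for example, mismatched downbeats). The key to its solution is an alternative that generates personalized music from user-provided text prompts and blends the environmental noise into rhythmically aligned, adaptively amplified, and enjoyable music segments, reducing the noticeability of the noise and improving the overall listening experience. The core of the method is the BNMusic framework with two key stages: the first stage synthesizes a complete piece of music in a mel-spectrogram representation that captures the musical essence of the noise; the second stage adaptively amplifies the generated music segment to further reduce noise perception and improve blending while preserving auditory quality.

Link: https://arxiv.org/abs/2506.10754
Authors: Chi Zuo,Martin B. Møller,Pablo Martínez-Nuevo,Huayang Huang,Yu Wu,Ye Zhu
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: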

Abstract:While being disturbed by environmental noises, the acoustic masking technique is a conventional way to reduce the annoyance in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise, thereby improving overall acoustic experiences.

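[AI-20] Think before You Simulate: Symbolic Reasoning to Orchestrate Neural Computation for Counterfactual Question Answering WACV2024

[Quick read]: This paper addresses causal and temporal reasoning about video dynamics, in particular the limitations of existing neuro-symbolic models in answering counterfactual questions. The key to its solution is enhancing a neuro-symbolic model for counterfactual reasoning: it defines a causal graph to represent causal relations among events and uses Answer Set Programming (ASP) to coordinate the perception and simulation modules, improving performance on counterfactual questions.

Link: https://arxiv.org/abs/2506.10753
Authors: Adam Ishay,Zhun Yang,Joohyung Lee,Ilgu Kang,Dongjae Lim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: In Proceedings the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024)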

Abstract:Causal and temporal reasoning about video dynamics is a challenging problem. While neuro-symbolic models that combine symbolic reasoning with neural-based perception and prediction have shown promise, they exhibit limitations, especially in answering counterfactual questions. This paper introduces a method to enhance a neuro-symbolic model for counterfactual reasoning, leveraging symbolic reasoning about causal relations among events. We define the notion of a causal graph to represent such relations and use Answer Set Programming (ASP), a declarative logic programming method, to find how to coordinate perception and simulation modules. We validate the effectiveness of our approach on two benchmarks, CLEVRER and CRAFT. Our enhancement achieves state-of-the-art performance on the CLEVRER challenge, significantly outperforming existing models. In the case of the CRAFT benchmark, we leverage a large pre-trained language model, such as GPT-3.5 and GPT-4, as a proxy for a dynamics simulator. Our findings show that this method can further improve its performance on counterfactual questions by providing alternative prompts instructed by symbolic causal reasoning.
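
The role of a causal graph in deciding what needs re-simulation can be sketched with a plain graph traversal. Note the swap: the paper uses Answer Set Programming, whereas this sketch uses ordinary Python reachability; the event names and the `affected_events` helper are made up.

```python
from collections import deque

def affected_events(causal_graph, removed_events):
    """Events reachable from the removed ones, i.e. whose outcome may change."""
    seen = set(removed_events)
    queue = deque(removed_events)
    while queue:
        event = queue.popleft()
        for effect in causal_graph.get(event, ()):
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return seen

# Edges point from cause to effect (hypothetical video events).
causal_graph = {
    "A_hits_B": ["B_hits_C"],
    "B_hits_C": ["C_exits"],
    "D_hits_E": [],
}
needs_simulation = affected_events(causal_graph, ["A_hits_B"])
```

In this toy example, `D_hits_E` is causally independent of the removed collision, so a counterfactual question about it could be answered from perception alone without invoking a simulator.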

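[AI-21] TED-LaST: Towards Robust Backdoor Defense Against Adaptive Attacks

[Quick read]: This paper aims to address the security of deep neural networks (DNNs) against adaptive backdoor attacks, in particular the problem that Topological Evolution Dynamics (TED) detection can be bypassed by attacks that adaptively distort the distribution of topological representations. The key to its solution is TED-LaST (Topological Evolution Dynamics against Laundry, Slow release, and Target mapping attack strategies), whose core innovations are label-supervised dynamics tracking and adaptive layer emphasis. These identify stealthy threats that traditional TED-based defenses miss and improve detection of sophisticated backdoor attacks.

Link: https://arxiv.org/abs/2506.10722
Authors: Xiaoxing Mo,Yuxuan Cheng,Nan Sun,Leo Yu Zhang,Wei Luo,Shang Gao
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: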

Abstract:Deep Neural Networks (DNNs) are vulnerable to backdoor attacks, where attackers implant hidden triggers during training to maliciously control model behavior. Topological Evolution Dynamics (TED) has recently emerged as a powerful tool for detecting backdoor attacks in DNNs. However, TED can be vulnerable to backdoor attacks that adaptively distort topological representation distributions across network layers. To address this limitation, we propose TED-LaST (Topological Evolution Dynamics against Laundry, Slow release, and Target mapping attack strategies), a novel defense strategy that enhances TED’s robustness against adaptive attacks. TED-LaST introduces two key innovations: label-supervised dynamics tracking and adaptive layer emphasis. These enhancements enable the identification of stealthy threats that evade traditional TED-based defenses, even in cases of inseparability in topological space and subtle topological perturbations. We review and classify data poisoning tricks in state-of-the-art adaptive attacks and propose enhanced adaptive attack with target mapping, which can dynamically shift malicious tasks and fully leverage the stealthiness that adaptive attacks possess. Our comprehensive experiments on multiple datasets (CIFAR-10, GTSRB, and ImageNet100) and model architectures (ResNet20, ResNet101) show that TED-LaST effectively counteracts sophisticated backdoors like Adap-Blend, Adapt-Patch, and the proposed enhanced adaptive attack. TED-LaST sets a new benchmark for robust backdoor detection, substantially enhancing DNN security against evolving threats.

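[AI-22] System ASPMT2SMT: Computing ASPMT Theories by SMT Solvers

[Quick read]: This paper addresses how to combine Answer Set Programming (ASP) with Satisfiability Modulo Theories (SMT) to support more complex logical reasoning. The key to its solution is translating the tight fragment of ASPMT programs into SMT instances, so that SMT solvers can compute stable models of ASPMT programs. To this end, the authors present a compiler called aspsmt2smt, which combines the ASP grounder gringo and the SMT solver z3: gringo partially grounds the input program and leaves the remaining variables to z3, which effectively supports reasoning about continuous change with real-number computation.

Link: https://arxiv.org/abs/2506.10708
Authors: Michael Bartholomew,Joohyung Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: In Proceedings of the 14th European Conference on Logics in Artificial Intelligence (JELIA 2014)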

Abstract:Answer Set Programming Modulo Theories (ASPMT) is an approach to combining answer set programming and satisfiability modulo theories based on the functional stable model semantics. It is shown that the tight fragment of ASPMT programs can be turned into SMT instances, thereby allowing SMT solvers to compute stable models of ASPMT programs. In this paper we present a compiler called aspsmt2smt, which implements this translation. The system uses ASP grounder gringo and SMT solver z3. gringo partially grounds input programs while leaving some variables to be processed by z3. We demonstrate that the system can effectively handle real number computations for reasoning about continuous changes.

[AI-23] ConTextTab: A Semantics-Aware Tabular In-Context Learner

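[Quick read]: This paper addresses two shortcomings: table-native in-context learning (ICL) models do not fully exploit the semantics and world knowledge in real-world tabular data, and tabular ICL models based on pretrained large language models are limited in how much context they can use. The key to its solution is ConTextTab, which integrates semantic understanding and alignment into a table-native ICL architecture. By using specialized embeddings for different data modalities and training on large-scale real-world tabular data, it improves semantic understanding while keeping architectural efficiency; it is competitive with the state of the art across a broad set of benchmarks and sets a new standard on the semantically rich CARTE benchmark.

Link: https://arxiv.org/abs/2506.10707
Authors: Marco Spinaci,Marek Polewczyk,Maximilian Schambach,Sam Thelin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: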

Abstract:Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. While being architecturally efficient and well-adapted to tabular data structures, current table-native ICL architectures, being trained exclusively on synthetic data, do not fully leverage the rich semantics and world knowledge contained in real-world tabular data. On another end of this spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark.

[AI-24] Formalising Software Requirements using Large Language Models

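[Quick read]: This paper aims to address the traceability and verification of natural language requirements across software design, system implementation, and verification. The key to its solution is automatically generating formal specifications using Natural Language Processing, ontologies that describe the software system domain, similarity-based reuse of existing software artefacts, and large language models, with artificial intelligence guiding the overall process, thereby improving the traceability between requirements and formal specifications and their verifiability.

Link: https://arxiv.org/abs/2506.10704
Authors: Arshad Beg,Diarmuid O’Donoghue,Rosemary Monahan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted and presented as a poster in ADAPT Annual Conference (AACS2025) on 15th of May, 2025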

Abstract:This paper is a brief introduction to our recently initiated project named VERIFAI: Traceability and verification of natural language requirements. The project addresses the challenges in the traceability and verification of formal specifications through providing support for the automatic generation of the formal specifications and the traceability of the requirements from the initial software design stage through the systems implementation and verification. Approaches explored in this project include Natural Language Processing, use of ontologies to describe the software system domain, reuse of existing software artefacts from similar systems (i.e. through similarity based reuse) and large language models to identify and declare the specifications as well as use of artificial intelligence to guide the process.

[AI-25] Saturation Self-Organizing Map

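[Quick read]: This paper addresses catastrophic forgetting in neural systems during continual learning, and in particular the limited knowledge retention of Self-Organizing Maps (SOMs) on sequential tasks. The key to its solution is the Saturation Self-Organizing Map (SatSOM), which uses a novel saturation mechanism to gradually reduce each neuron's learning rate and neighborhood radius, effectively freezing well-trained neurons and redirecting learning to underutilized areas of the map.

Link: https://arxiv.org/abs/2506.10680
Authors: Igor Urbanik, Paweł Gajewski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: github repository: this https URL

Click to view abstract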

Abstract:Continual learning poses a fundamental challenge for neural systems, which often suffer from catastrophic forgetting when exposed to sequential tasks. Self-Organizing Maps (SOMs), despite their interpretability and efficiency, are not immune to this issue. In this paper, we introduce Saturation Self-Organizing Maps (SatSOM), an extension of SOMs designed to improve knowledge retention in continual learning scenarios. SatSOM incorporates a novel saturation mechanism that gradually reduces the learning rate and neighborhood radius of neurons as they accumulate information. This effectively freezes well-trained neurons and redirects learning to underutilized areas of the map.
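
The saturation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SaturatingSOM`, the exponential decay rule, the 1-D map layout, and all default parameters are assumptions chosen only to show how a per-neuron saturation counter could shrink the learning rate and neighborhood radius. It models the "freezing" effect only, not the redirection of learning to underused neurons.

```python
import math
import random


class SaturatingSOM:
    """Toy 1-D SOM whose neurons learn less as they accumulate updates.

    Hypothetical illustration only; the real SatSOM update rule may differ.
    """

    def __init__(self, n_neurons, dim, base_lr=0.5, base_radius=2.0,
                 decay=0.1, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.random() for _ in range(dim)]
                        for _ in range(n_neurons)]
        self.saturation = [0.0] * n_neurons  # accumulated update mass per neuron
        self.base_lr = base_lr
        self.base_radius = base_radius
        self.decay = decay  # assumed decay constant

    def _scale(self, i):
        # Assumed rule: plasticity decays exponentially with saturation.
        return math.exp(-self.decay * self.saturation[i])

    def best_matching_unit(self, x):
        def dist2(w):
            return sum((a - b) ** 2 for a, b in zip(w, x))
        return min(range(len(self.weights)),
                   key=lambda i: dist2(self.weights[i]))

    def update(self, x):
        bmu = self.best_matching_unit(x)
        # Learning rate and radius both shrink with the BMU's saturation.
        lr = self.base_lr * self._scale(bmu)
        radius = max(self.base_radius * self._scale(bmu), 1e-6)
        for i, w in enumerate(self.weights):
            influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            step = lr * influence
            for d in range(len(w)):
                w[d] += step * (x[d] - w[d])
            self.saturation[i] += influence
        return bmu
```

Feeding the same input repeatedly makes the winning neuron's effective learning rate fall toward zero, which is the "freezing" behavior the abstract describes.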

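[AI-26] Automated Validation of Textual Constraints Against AutomationML via LLMs and SHACL

[Quick read]: This paper addresses the problem that existing AutomationML (AML) modeling recommendations are stated as informal textual constraints that cannot be validated automatically within AML itself. The key to its solution is a pipeline that maps AML models to OWL ontologies, uses a large language model to translate the textual rules into SHACL constraints, validates those constraints against the generated AML ontology, and finally converts the validation results into natural language descriptions, enabling semi-automated checking of complex modeling rules.

Link: https://arxiv.org/abs/2506.10678
Authors: Tom Westermann, Aljosha Köcher, Felix Gehlhoff
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Click to view abstract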

Abstract:AutomationML (AML) enables standardized data exchange in engineering, yet existing recommendations for proper AML modeling are typically formulated as informal and textual constraints. These constraints cannot be validated automatically within AML itself. This work-in-progress paper introduces a pipeline to formalize and verify such constraints. First, AML models are mapped to OWL ontologies via RML and SPARQL. In addition, a Large Language Model translates textual rules into SHACL constraints, which are then validated against the previously generated AML ontology. Finally, SHACL validation results are automatically interpreted in natural language. The approach is demonstrated on a sample AML recommendation. Results show that even complex modeling rules can be semi-automatically checked – without requiring users to understand formal methods or ontology technologies.

[AI-27] Contrastive Matrix Completion with Denoising and Augmented Graph Views for Robust Recommendation

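[Quick read]: This paper aims to address the sensitivity of graph neural networks (GNNs) to noisy or irrelevant edges and their tendency to overfit in matrix completion, and thereby improve generalization. The key to its solution is a contrastive-learning-based matrix completion method (MCCL) that extracts local neighborhood subgraphs and generates two distinct graph representations: one combines GNN layers with an attention mechanism for denoising, and the other uses a graph variational autoencoder to align the feature distribution with a standard prior. A mutual learning loss then gradually aligns the two representations, so the model captures common patterns and generalizes better.

Link: https://arxiv.org/abs/2506.10658
Authors: Narges Nemati, Mostafa Haghir Chehreghani
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 30 pages

Click to view abstract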

Abstract:Matrix completion is a widely adopted framework in recommender systems, as predicting the missing entries in the user-item rating matrix enables a comprehensive understanding of user preferences. However, current graph neural network (GNN)-based approaches are highly sensitive to noisy or irrelevant edges–due to their inherent message-passing mechanisms–and are prone to overfitting, which limits their generalizability. To overcome these challenges, we propose a novel method called Matrix Completion using Contrastive Learning (MCCL). Our approach begins by extracting local neighborhood subgraphs for each interaction and subsequently generates two distinct graph representations. The first representation emphasizes denoising by integrating GNN layers with an attention mechanism, while the second is obtained via a graph variational autoencoder that aligns the feature distribution with a standard prior. A mutual learning loss function is employed during training to gradually harmonize these representations, enabling the model to capture common patterns and significantly enhance its generalizability. Extensive experiments on several real-world datasets demonstrate that our approach not only improves the numerical accuracy of the predicted scores–achieving up to a 0.8% improvement in RMSE–but also produces superior rankings with improvements of up to 36% in ranking metrics.

[AI-28] Data Shifts Hurt CoT: A Theoretical Study

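[Quick read]: This paper addresses the performance degradation of Chain of Thought (CoT) models under distribution shift and data poisoning. The key to its approach is an analysis of the k-parity problem that reveals how data shifts affect the quality of models trained with a CoT decomposition, together with a mechanistic explanation of the cause.

Link: https://arxiv.org/abs/2506.10647
Authors: Lang Yin, Debangshu Banerjee, Gagandeep Singh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract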

Abstract:Chain of Thought (CoT) has been applied to various large language models (LLMs) and proven to be effective in improving the quality of outputs. In recent studies, transformers are proven to have absolute upper bounds in terms of expressive power, and consequently, they cannot solve many computationally difficult problems. However, empowered by CoT, transformers are proven to be able to solve some difficult problems effectively, such as the k-parity problem. Nevertheless, those works rely on two imperative assumptions: (1) identical training and testing distribution, and (2) corruption-free training data with correct reasoning steps. In the real world, these assumptions do not always hold. Although the risks of data shifts have caught attention, our work is, to the best of our knowledge, the first to rigorously study the exact harm caused by such shifts. Focusing on the k-parity problem, in this work we investigate the joint impact of two types of data shifts, distribution shifts and data poisoning, on the quality of trained models obtained by a well-established CoT decomposition. In addition to revealing a surprising phenomenon that CoT leads to worse performance on learning parity than directly generating the prediction, our technical results also give a rigorous and comprehensive explanation of the mechanistic reasons of such impact.
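
For readers unfamiliar with the k-parity problem, the sketch below shows the target function and one simple way to expose intermediate steps. The sequential running-XOR decomposition and the function names are assumptions made for illustration; the decomposition analysed in the paper may be structured differently.

```python
from functools import reduce
from operator import xor


def k_parity(bits, support):
    """Parity (XOR) of the bits at the k positions listed in `support`."""
    return reduce(xor, (bits[i] for i in support), 0)


def k_parity_with_steps(bits, support):
    """Same value, but also returns the running partial parities.

    The partial parities play the role of intermediate reasoning steps
    in this toy, sequential decomposition.
    """
    steps = []
    acc = 0
    for i in support:
        acc ^= bits[i]
        steps.append(acc)
    return acc, steps
```

A single corrupted intermediate step flips every later partial parity in this chain, which gives some intuition for why poisoned reasoning steps can be costly.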

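[AI-29] Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

[Quick read]: This paper aims to address a limitation of existing Time Series Forecasting (TSF) methods: they follow a fast-thinking paradigm of extracting historical patterns and mapping them to future values, and lack an explicit intermediate reasoning process. To overcome this, the paper proposes Time-R1, a two-stage reinforcement fine-tuning framework. Its key elements are supervised fine-tuning for warmup adaptation, reinforcement learning to improve generalization, a fine-grained multi-objective reward designed for time series forecasting, and GRIP (group-based relative importance for policy optimization), which optimizes the model's exploration of effective reasoning paths.

Link: https://arxiv.org/abs/2506.10630
Authors: Yucong Luo, Yitong Zhou, Mingyue Cheng, Jiahao Wang, Daoyu Wang, Tingyue Pan, Jintao Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract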

Abstract:To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical techniques to data-driven deep learning architectures. Despite their effectiveness, most existing methods still adhere to a fast thinking paradigm-relying on extracting historical patterns and mapping them to future values as their core modeling philosophy, lacking an explicit thinking process that incorporates intermediate time series reasoning. Meanwhile, emerging slow-thinking LLMs (e.g., OpenAI-o1) have shown remarkable multi-step reasoning capabilities, offering an alternative way to overcome these issues. However, prompt engineering alone presents several limitations - including high computational cost, privacy risks, and limited capacity for in-depth domain-specific time series reasoning. To address these limitations, a more promising approach is to train LLMs to develop slow thinking capabilities and acquire strong time series reasoning skills. For this purpose, we propose Time-R1, a two-stage reinforcement fine-tuning framework designed to enhance multi-step reasoning ability of LLMs for time series forecasting. Specifically, the first stage conducts supervised fine-tuning for warmup adaptation, while the second stage employs reinforcement learning to improve the model’s generalization ability. Particularly, we design a fine-grained multi-objective reward specifically for time series forecasting, and then introduce GRIP (group-based relative importance for policy optimization), which leverages non-uniform sampling to further encourage and optimize the model’s exploration of effective reasoning paths. Experiments demonstrate that Time-R1 significantly improves forecast performance across diverse datasets.

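[AI-30] Task Adaptation from Skills: Information Geometry Disentanglement and New Objectives for Unsupervised Reinforcement Learning ICLR ICLR2024

[Quick read]: This paper aims to address the generalization of skill learning in Unsupervised Reinforcement Learning (URL), specifically how learned skills can better initialize policies for downstream tasks. The key to its solution is a new disentanglement metric, LSEPIN, and an information-geometric connection between LSEPIN and downstream task adaptation cost. Replacing the KL divergence in information geometry with the Wasserstein distance then yields a new skill-learning objective, WSEP, which is theoretically justified to help downstream task adaptation and can discover more initial policies than MISL. Finally, the paper proposes PWSEP, another Wasserstein-distance-based algorithm that can theoretically discover all optimal initial policies.

Link: https://arxiv.org/abs/2506.10629
Authors: Yucheng Yang, Tianyi Zhou, Qiang He, Lei Han, Mykola Pechenizkiy, Meng Fang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Spotlight paper at ICLR 2024. This version includes acknowledgments omitted from the ICLR version and indicates the corresponding authors primarily responsible for the work

Click to view abstract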

Abstract:Unsupervised reinforcement learning (URL) aims to learn general skills for unseen downstream tasks. Mutual Information Skill Learning (MISL) addresses URL by maximizing the mutual information between states and skills but lacks sufficient theoretical analysis, e.g., how well its learned skills can initialize a downstream task’s policy. Our new theoretical analysis in this paper shows that the diversity and separability of learned skills are fundamentally critical to downstream task adaptation but MISL does not necessarily guarantee these properties. To complement MISL, we propose a novel disentanglement metric LSEPIN. Moreover, we build an information-geometric connection between LSEPIN and downstream task adaptation cost. For better geometric properties, we investigate a new strategy that replaces the KL divergence in information geometry with Wasserstein distance. We extend the geometric analysis to it, which leads to a novel skill-learning objective WSEP. It is theoretically justified to be helpful to downstream task adaptation and it is capable of discovering more initial policies for downstream tasks than MISL. We finally propose another Wasserstein distance-based algorithm PWSEP that can theoretically discover all optimal initial policies.

[AI-31] Data Driven Diagnosis for Large Cyber-Physical-Systems with Minimal Prior Information

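[Quick read]: This paper addresses the dependence of diagnosis for complex Cyber-Physical Systems (CPS) on detailed system models or comprehensive training data, which are often hard to obtain. The key to its solution is a diagnostic approach based on minimal prior knowledge: it requires only a basic understanding of subsystem relationships and data from nominal operation, and it combines a neural-network-based symptom generator with a new graph diagnosis algorithm that uses minimal causal relationship information between subsystems, narrowing the search space while preserving diagnostic accuracy.

Link: https://arxiv.org/abs/2506.10613
Authors: Henrik Sebastian Steude, Alexander Diedrich, Ingo Pill, Lukas Moddemann, Daniel Vranješ, Oliver Niggemann
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract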

Abstract:Diagnostic processes for complex cyber-physical systems often require extensive prior knowledge in the form of detailed system models or comprehensive training data. However, obtaining such information poses a significant challenge. To address this issue, we present a new diagnostic approach that operates with minimal prior knowledge, requiring only a basic understanding of subsystem relationships and data from nominal operations. Our method combines a neural network-based symptom generator, which employs subsystem-level anomaly detection, with a new graph diagnosis algorithm that leverages minimal causal relationship information between subsystems, information that is typically available in practice. Our experiments with fully controllable simulated datasets show that our method includes the true causal component in its diagnosis set for 82% of all cases while effectively reducing the search space in 73% of the scenarios. Additional tests on the real-world Secure Water Treatment dataset showcase the approach’s potential for practical scenarios. Our results thus highlight our approach’s potential for practical applications with large and complex cyber-physical systems where limited prior knowledge is available.

[AI-32] SoK: Evaluating Jailbreak Guardrails for Large Language Models

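[Quick read]: This paper aims to address a critical security problem in deployed Large Language Models (LLMs): jailbreak attacks that circumvent safety mechanisms. The core of its approach is a multi-dimensional taxonomy that categorizes guardrails along six key dimensions, together with a Security-Efficiency-Utility evaluation framework for systematically assessing their practical effectiveness. The paper thereby offers a structured foundation for future research on developing and deploying robust LLM guardrails.

Link: https://arxiv.org/abs/2506.10597
Authors: Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract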

Abstract:Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails–external defense mechanisms that monitor and control LLM interaction–have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at this https URL.

[AI-33] Size-adaptive Hypothesis Testing for Fairness

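[Quick read]: This paper addresses inaccurate discrimination detection in fairness assessments of algorithmic decision-making systems caused by weak statistical practice: especially for small intersectional groups defined by multiple sensitive attributes, conventional approaches fail because they ignore sampling error and treat small and large groups alike. The key to its solution is a unified, size-adaptive hypothesis-testing framework. For sufficiently large subgroups, it proves a Central-Limit result for the statistical parity difference, yielding analytic confidence intervals and a Wald test whose type-I error is controlled at a preset level. For small intersectional groups, it uses a fully Bayesian Dirichlet-multinomial estimator with Monte-Carlo credible intervals that are calibrated for any sample size and converge to the Wald intervals as more data becomes available, turning fairness assessment into an evidence-based statistical decision.

Link: https://arxiv.org/abs/2506.10586
Authors: Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments:

Click to view abstract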

Abstract:Determining whether an algorithmic decision-making system discriminates against a specific demographic typically involves comparing a single point estimate of a fairness metric against a predefined threshold. This practice is statistically brittle: it ignores sampling error and treats small demographic subgroups the same as large ones. The problem intensifies in intersectional analyses, where multiple sensitive attributes are considered jointly, giving rise to a larger number of smaller groups. As these groups become more granular, the data representing them becomes too sparse for reliable estimation, and fairness metrics yield excessively wide confidence intervals, precluding meaningful conclusions about potential unfair treatments. In this paper, we introduce a unified, size-adaptive, hypothesis-testing framework that turns fairness assessment into an evidence-based statistical decision. Our contribution is twofold. (i) For sufficiently large subgroups, we prove a Central-Limit result for the statistical parity difference, leading to analytic confidence intervals and a Wald test whose type-I (false positive) error is guaranteed at level $\alpha$. (ii) For the long tail of small intersectional groups, we derive a fully Bayesian Dirichlet-multinomial estimator; Monte-Carlo credible intervals are calibrated for any sample size and naturally converge to Wald intervals as more data becomes available. We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
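
The two regimes described in the abstract can be sketched with the standard library alone. This is a hedged illustration, not the paper's estimator: `spd_wald` uses the textbook Wald interval for a difference of two proportions, and `spd_credible` replaces the Dirichlet-multinomial model with its two-outcome special case (independent Beta posteriors under an assumed uniform prior). Function names, the prior, and the number of Monte-Carlo draws are all assumptions.

```python
import math
import random
from statistics import NormalDist


def spd_wald(pos_a, n_a, pos_b, n_b, alpha=0.05):
    """Wald interval and test for the statistical parity difference.

    SPD = P(positive | group a) - P(positive | group b).
    Returns (spd, (lo, hi), reject_h0) where H0 is SPD = 0.
    Degenerates when a group's rate is exactly 0 or 1 (zero variance).
    """
    p_a, p_b = pos_a / n_a, pos_b / n_b
    spd = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = spd - z * se, spd + z * se
    return spd, (lo, hi), not (lo <= 0.0 <= hi)


def spd_credible(pos_a, n_a, pos_b, n_b, alpha=0.05, draws=20000, seed=0):
    """Monte-Carlo credible interval for SPD from Beta(1, 1) priors.

    Simplified stand-in for a Dirichlet-multinomial estimator.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + pos_a, 1 + n_a - pos_a)
        - rng.betavariate(1 + pos_b, 1 + n_b - pos_b)
        for _ in range(draws)
    )
    lo = samples[int(draws * alpha / 2)]
    hi = samples[int(draws * (1 - alpha / 2)) - 1]
    return lo, hi
```

With 60/100 versus 40/100 positive outcomes the Wald interval excludes zero, whereas with 3/5 versus 2/5 (the same rates) both intervals are wide and include zero, which is the size-dependence the paper argues a fairness audit should respect.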

[AI-34] Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning

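[Quick read]: This paper addresses how to evaluate the abilities of Large Language Models (LLMs) in symbolic reasoning, hypothesis validation, and generalization of symbolic logic. The key to its solution is a new integer sequence called the Primender sequence. By combining classical primality with modular digit-based conditions, the sequence has a deterministic yet non-trivial structure, giving LLMs an interpretable, rule-based testbed for inferring hidden rules, validating mathematical hypotheses, and generalizing symbolic patterns at scale.

Link: https://arxiv.org/abs/2506.10585
Authors: Mohd Anwar Jamal Faiz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Symbolic Computation (cs.SC)
Comments: 9 pages, 7 figures, 2 tables, 3 codes, oeis sequence A384735

Click to view abstract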

Abstract:This paper introduces the Primender sequence, a novel integer sequence defined by a hybrid rule that combines classical primality with modular digit-based conditions. Specifically, a number n is included in the sequence if it is prime or ends with a prime number, whether a single unit digit or a longer suffix. In other words, the sequence contains the numbers which are primes or have at least one prime suffix. The resulting sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterning. We propose the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of Large Language Models (LLMs). The study is motivated by the need for interpretable, rule-based testbeds that can assess an LLM’s ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A key hypothesis explored is: Whenever a number in the Primender sequence is exactly one more than the largest prime less than or equal to it, the difference between it and the previous number in the sequence is also 1. We design a structured prompt and evaluation framework to test this hypothesis across multiple state-of-the-art LLMs, including ChatGPT, Copilot, DeepSeek, Gemini, Grok, and LLaMA. The models are tasked with identifying the underlying rule, validating the hypothesis, and generating the next 100,000 terms of the sequence. Comparative metrics such as rule inference accuracy, hypothesis evaluation, sequence validity, and symbolic explanation quality are used to assess model performance. This work contributes a novel mathematical construct and a reproducible methodology for benchmarking LLMs in symbolic reasoning, hypothesis testing, and scalable pattern generalization, bridging the domains of number theory, artificial intelligence, and software engineering.
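
The membership rule can be checked with a few lines of code. The sketch below reads "prime suffix" as any decimal suffix of n (including n itself) whose integer value is prime; that reading, and the function names, are assumptions based on the abstract rather than on the paper's own code.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    f = 3
    while f * f <= n:
        if n % f == 0:
            return False
        f += 2
    return True


def is_primender(n):
    """True if n is prime or some decimal suffix of n is prime."""
    digits = str(n)
    return any(is_prime(int(digits[i:])) for i in range(len(digits)))


def primender_terms(count):
    """First `count` terms under the interpretation above."""
    terms, n = [], 1
    while len(terms) < count:
        if is_primender(n):
            terms.append(n)
        n += 1
    return terms


def hypothesis_holds(limit):
    """Check the abstract's hypothesis for all terms below `limit`."""
    terms = [n for n in range(2, limit) if is_primender(n)]
    for prev, cur in zip(terms, terms[1:]):
        largest_prime = next(p for p in range(cur, 1, -1) if is_prime(p))
        if cur == largest_prime + 1 and cur - prev != 1:
            return False
    return True
```

Under this reading the hypothesis follows directly from the definition: if the largest prime not exceeding n is n - 1, then n - 1 is itself a term, so the gap to the previous term is 1.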

[AI-35] StepProof: Step-by-step verification of natural language mathematical proofs

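[Quick read]: This paper addresses the limitation that Interactive Theorem Provers (ITPs) lack a natural language interface, and in particular that existing autoformalization approaches cannot verify proofs at a fine-grained, sentence level. The key to its solution is StepProof, a method that breaks a complete proof into multiple verifiable subproofs to enable sentence-level verification, which significantly improves proof success rates and efficiency.

Link: https://arxiv.org/abs/2506.10558
Authors: Xiaolin Hu, Qinghua Zhou, Bogdan Grechuk, Ivan Y. Tyukin
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract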

Abstract:Interactive theorem provers (ITPs) are powerful tools for the formal verification of mathematical proofs down to the axiom level. However, their lack of a natural language interface remains a significant limitation. Recent advancements in large language models (LLMs) have enhanced the understanding of natural language inputs, paving the way for autoformalization - the process of translating natural language proofs into formal proofs that can be verified. Despite these advancements, existing autoformalization approaches are limited to verifying complete proofs and lack the capability for finer, sentence-level verification. To address this gap, we propose StepProof, a novel autoformalization method designed for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. Experimental results demonstrate that StepProof significantly improves proof success rates and efficiency compared to traditional methods. Additionally, we found that minor manual adjustments to the natural language proofs, tailoring them for step-level verification, further enhanced StepProof’s performance in autoformalization.

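[AI-36] LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs

[Quick read]: This paper addresses the evaluation of Large Language Models (LLMs) in logical planning and reasoning over complex relational structures. The key to its solution is the LogiPlan benchmark, which varies task complexity dynamically (number of objects, number of relations, and minimum depth of relational chains) to give a fine-grained assessment of relational reasoning. LogiPlan comprises three complementary tasks, Plan Generation, Consistency Detection, and Comparison Question, and also evaluates models' self-correction capabilities, revealing performance differences between models on complex logical planning tasks.

Link: https://arxiv.org/abs/2506.10527
Authors: Yanan Cai, Ahmed Salem, Besmira Nushi, Mark Russinovich
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Click to view abstract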

Abstract:We introduce LogiPlan, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in logical planning and reasoning over complex relational structures. Logical relational reasoning is important for applications that may rely on LLMs to generate and query structured graphs of relations such as network infrastructure, knowledge bases, or business process schema. Our framework allows for dynamic variation of task complexity by controlling the number of objects, relations, and the minimum depth of relational chains, providing a fine-grained assessment of model performance across difficulty levels. LogiPlan encompasses three complementary tasks: (1) Plan Generation, where models must construct valid directed relational graphs meeting specified structural constraints; (2) Consistency Detection, testing models’ ability to identify inconsistencies in relational structures; and (3) Comparison Question, evaluating models’ capacity to determine the validity of queried relationships within a given graph. Additionally, we assess models’ self-correction capabilities by prompting them to verify and refine their initial solutions. We evaluate state-of-the-art models including DeepSeek R1, Gemini 2.0 Pro, Gemini 2 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, O3-mini, O1, and Claude 3.7 Sonnet across these tasks, revealing significant performance gaps that correlate with model scale and architecture. Our analysis demonstrates that while recent reasoning-enhanced models show promising results on simpler instances, they struggle with more complex configurations requiring deeper logical planning.

[AI-37] OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics

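[Quick read]: This paper addresses the saturation of conventional algorithm benchmarks as models become more sophisticated, which calls for more challenging benchmarks to drive further progress in algorithmic reasoning. The key to its solution is OIBench, a high-quality, private, olympiad-level informatics dataset of 250 carefully curated original problems. The benchmark is built to assess a range of programming paradigms and complexities, introduces Time/Space Completion Curves for finer-grained efficiency analysis, and enables direct human-model comparison through evaluations with high-level participants.

Link: https://arxiv.org/abs/2506.10481
Authors: Yaoming Zhu, Junxin Wang, Yiyang Li, Lin Qiu, ZongYu Wang, Jun Xu, Xuezhi Cao, Yuhuai Wei, Mingshi Wang, Xunliang Cai, Rong Ma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract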

Abstract:As models become increasingly sophisticated, conventional algorithm benchmarks are increasingly saturated, underscoring the need for more challenging benchmarks to guide future improvements in algorithmic reasoning. This paper introduces OIBench, a high-quality, private, and challenging olympiad-level informatics dataset comprising 250 carefully curated original problems. We detail the construction methodology of the benchmark, ensuring a comprehensive assessment across various programming paradigms and complexities, and we demonstrate its contamination-resistant properties via experiments. We propose Time/Space Completion Curves for finer-grained efficiency analysis and enable direct human-model comparisons through high-level participant evaluations. Our experiments reveal that while open-source models lag behind closed-source counterparts, current SOTA models already outperform most human participants in both correctness and efficiency, while still being suboptimal compared to the canonical solutions. By releasing OIBench as a fully open-source resource (this https URL), we hope this benchmark will contribute to advancing code reasoning capabilities for future LLMs.

[AI-38] Specification and Evaluation of Multi-Agent LLM Systems – Prototype and Cybersecurity Applications

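[Quick read]: This paper addresses how to apply Large Language Models (LLMs) to complex tasks in specific domains, and in particular the lack of research on the joint specification and combined application of multi-agent systems. The key to its solution is an LLM-based multi-agent system architecture with an accompanying specification that supports systematic evaluation of LLMs and reasoning techniques in practical applications, such as question answering, server security, and network security tasks in cybersecurity. Test cases built on an architecture extended from previous research indicate that the approach is feasible.

Link: https://arxiv.org/abs/2506.10467
Authors: Felix Härer
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract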

Abstract:Recent advancements in LLMs indicate potential for novel applications, e.g., through reasoning capabilities in the latest OpenAI and DeepSeek models. For applying these models in specific domains beyond text generation, LLM-based multi-agent approaches can be utilized that solve complex tasks by combining reasoning techniques, code generation, and software execution. Applications might utilize these capabilities and the knowledge of specialized LLM agents. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application is not explored well. Defined specifications for multi-agent LLM systems are required to explore their potential and their suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research to specify and evaluate these aspects through a multi-agent system. The system architecture and prototype are extended from previous research and a specification is introduced for multi-agent systems. Test cases involving cybersecurity tasks indicate feasibility of the architecture and evaluation approach. In particular, the results show the evaluation of question answering, server security, and network security tasks that were completed correctly by agents with LLMs from OpenAI and DeepSeek.

[AI-39] Equitable Mechanism Design for Facility Location IJCAI2025

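[Quick read]: This paper addresses the design of strategy-proof mechanisms for facility location that maximize equitability between agents. It first proves a fundamental impossibility result: no strategy-proof mechanism can bound the approximation ratio of the optimal Gini index of utilities. The key to its solution is to instead compute approximation ratios of the complemented Gini index of utilities and study how well deterministic and randomized mechanisms approximate it, and also to consider how well mechanisms approximate the Nash welfare, an equitable compromise between egalitarian and utilitarian outcomes.

Link: https://arxiv.org/abs/2506.10460
Authors: Toby Walsh
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: To appear in Proceedings of IJCAI 2025

Click to view abstract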

Abstract:We consider strategy proof mechanisms for facility location which maximize equitability between agents. As is common in the literature, we measure equitability with the Gini index. We first prove a simple but fundamental impossibility result that no strategy proof mechanism can bound the approximation ratio of the optimal Gini index of utilities for one or more facilities. We propose instead computing approximation ratios of the complemented Gini index of utilities, and consider how well both deterministic and randomized mechanisms approximate this. In addition, as Nash welfare is often put forwards as an equitable compromise between egalitarian and utilitarian outcomes, we consider how well mechanisms approximate the Nash welfare.
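
The quantities named in the abstract are straightforward to compute for a concrete instance. The sketch below assumes agents located on the interval [0, 1] with utility 1 minus distance to the facility, and uses the median location purely as an example mechanism; these modelling choices, the function names, and the use of the plain product for Nash welfare are assumptions, not details taken from the paper.

```python
import math
from statistics import median


def gini_index(utilities):
    """Gini index: mean absolute difference normalised by twice the mean."""
    n = len(utilities)
    total = sum(utilities)
    if total == 0:
        return 0.0
    diff = sum(abs(a - b) for a in utilities for b in utilities)
    return diff / (2 * n * total)


def complemented_gini(utilities):
    """1 - Gini, so that larger values mean a more equitable outcome."""
    return 1.0 - gini_index(utilities)


def nash_welfare(utilities):
    """Product of utilities (some texts use the geometric mean instead)."""
    return math.prod(utilities)


def utilities_at(locations, facility):
    """Assumed utility model: 1 - |agent location - facility| on [0, 1]."""
    return [1.0 - abs(x - facility) for x in locations]


# Example: three agents, facility placed at the median report.
agents = [0.0, 0.5, 1.0]
u = utilities_at(agents, median(agents))
```

For this instance the utilities are 0.5, 1.0 and 0.5, so the outcome is not perfectly equitable even though the facility sits at the centre.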

[AI-40] SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks USENIX-SECURITY

[Quick read]: This paper aims to address privacy leakage from fine-tuned Large Language Models (LLMs) under Membership Inference Attacks (MIAs). It finds that MIAs exploit the loss reduction during fine-tuning to reveal membership information effectively. In response, the authors propose SOFT (Selective data Obfuscation in LLM Fine-Tuning), a defense whose key idea is influential data selection with an adjustable parameter that balances utility preservation against privacy protection.

Link: https://arxiv.org/abs/2506.10424
Authors: Kaiyuan Zhang, Siyuan Cheng, Hanxi Guo, Yuetian Chen, Zian Su, Shengwei An, Yuntao Du, Charles Fleming, Ashish Kundu, Xiangyu Zhang, Ninghui Li
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by the 34th USENIX Security Symposium 2025. Code is available at this https URL

Click to view abstract

Abstract:Large language models (LLMs) have achieved remarkable success and are widely adopted for diverse applications. However, fine-tuning these models often involves private or sensitive information, raising critical privacy concerns. In this work, we conduct the first comprehensive study evaluating the vulnerability of fine-tuned LLMs to membership inference attacks (MIAs). Our empirical analysis demonstrates that MIAs exploit the loss reduction during fine-tuning, making them highly effective in revealing membership information. These findings motivate the development of our defense. We propose SOFT (Selective data Obfuscation in LLM Fine-Tuning), a novel defense technique that mitigates privacy leakage by leveraging influential data selection with an adjustable parameter to balance utility preservation and privacy protection. Our extensive experiments span six diverse domains and multiple LLM architectures and scales. Results show that SOFT effectively reduces privacy risks while maintaining competitive model performance, offering a practical and scalable solution to safeguard sensitive information in fine-tuned LLMs.

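[AI-41] Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods

[Quick read]: This paper addresses the difficulty of traditional autoscaling in edge computing under strict resource constraints, and aims for more flexible scaling behavior by using multiple elasticity dimensions. The key to its solution is an agent-based autoscaling framework that dynamically adjusts both hardware resources and internal service configurations to maximize requirements fulfillment in constrained environments.

Link: https://arxiv.org/abs/2506.10420
Authors: Boris Sedlak, Alireza Furutanpey, Zihang Wang, Víctor Casamayor Pujol, Schahram Dustdar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract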

Abstract:Edge computing breaks with traditional autoscaling due to strict resource constraints, thus, motivating more flexible scaling behaviors using multiple elasticity dimensions. This work introduces an agent-based autoscaling framework that dynamically adjusts both hardware resources and internal service configurations to maximize requirements fulfillment in constrained environments. We compare four types of scaling agents: Active Inference, Deep Q Network, Analysis of Structural Knowledge, and Deep Active Inference, using two real-world processing services running in parallel: YOLOv8 for visual recognition and OpenCV for QR code detection. Results show all agents achieve acceptable SLO performance with varying convergence patterns. While the Deep Q Network benefits from pre-training, the structural analysis converges quickly, and the deep active inference agent combines theoretical foundations with practical scalability advantages. Our findings provide evidence for the viability of multi-dimensional agent-based autoscaling for edge environments and encourage future work in this research direction.

[AI-42] Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agent ic Retrieval-Augmented Generation for Industry Challenges

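[Quick Read]: This paper addresses the limitations of traditional Retrieval-Augmented Generation (RAG) systems in complex reasoning, dynamic retrieval, and multi-modal integration. Traditional RAG systems are built on static pipelines; they are effective on well-structured tasks but struggle with the changing demands of real-world scenarios. The key to the approach the paper surveys is Reasoning Agentic RAG, which embeds decision-making and adaptive tool use directly into the retrieval process so that the model autonomously orchestrates tool interaction during inference, improving flexibility and adaptability.

Link: https://arxiv.org/abs/2506.10408
Authors: Jintao Liang,Gang Su,Huifeng Lin,You Wu,Rui Zhao,Ziyue Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: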

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary systems: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of reasoning agentic RAG systems. Our collection of the relevant research has been organized into a this https URL.

[AI-43] Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation

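[Quick Read]: This paper addresses the problems that arise when Large Language Models (LLMs) are used to evaluate the quality of generated content: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. The key to the solution is PAJAMA (Program-As-a-Judge for Automated Model Assessment), which uses LLMs to synthesize executable judging programs instead of scoring responses directly, yielding judging logic that is low-cost, interpretable, auditable, and easy to adapt.

Link: https://arxiv.org/abs/2506.10403
Authors: Tzu-Heng Huang,Harit Vishwakarma,Frederic Sala
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: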

Abstract:Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable, and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.
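
To make the idea of a stored, locally executed judging program concrete, here is a minimal sketch in Python. The scoring rules (keyword coverage and a length bound), the weights, and the names `judge_program` and `compare` are hypothetical illustrations, not rules or APIs taken from PAJAMA; in the setup the abstract describes, such a program would be synthesized by an LLM rather than written by hand.

```python
from typing import Dict


def judge_program(question: str, response: str) -> Dict[str, float]:
    """Hypothetical judging program: scores one response with explicit rules."""
    q_terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    r_terms = {w.lower().strip("?.,") for w in response.split()}
    # Rule 1: fraction of the question's longer words echoed in the response.
    coverage = len(q_terms & r_terms) / len(q_terms) if q_terms else 0.0
    # Rule 2: reject empty or extremely long responses.
    n_words = len(response.split())
    length_ok = 1.0 if 1 <= n_words <= 200 else 0.0
    total = 0.7 * coverage + 0.3 * length_ok  # weights are arbitrary
    return {"coverage": coverage, "length_ok": length_ok, "total": total}


def compare(question: str, a: str, b: str) -> str:
    """Return 'A', 'B', or 'tie' depending on which response scores higher."""
    score_a = judge_program(question, a)["total"]
    score_b = judge_program(question, b)["total"]
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


question = "What causes ocean tides?"
good = "Ocean tides are mainly caused by the Moon's gravity."
bad = "I like turtles."
verdict = compare(question, good, bad)
print(verdict)
```

Because each response is scored independently and deterministically, swapping the order of the two responses simply mirrors the verdict, and every rule can be read and audited directly.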

[AI-44] HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

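[Quick Read]: This paper addresses the performance portability of CUDA code across hardware platforms, in particular the challenges other hardware platforms face in supporting CUDA-based software. Existing approaches are limited in workload coverage, generalizability, and development cost, and existing Large Language Models (LLMs) still perform poorly at transpiling high-performance CUDA code, mainly because high-quality training data is lacking. The key to the solution is a novel framework that uses an AI compiler and automatic optimization technology to generate pairs of high-performance CUDA code and corresponding platform code, strengthens it with a graph-based data augmentation method, and introduces the HPCTransEval benchmark for evaluating LLMs on CUDA transpilation.

Link: https://arxiv.org/abs/2506.10401
Authors: Jiaqi Lv,Xufeng He,Yanchen Liu,Xu Dai,Yang Hu,Shouyi Yin
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: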

Abstract:The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, due to the cultivation of user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant position in the field of parallel software. This dominance requires other hardware platforms to support CUDA-based software with performance portability. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware architectures. Existing approaches rely on language extensions, domain-specific languages (DSLs), or compilers but face limitations in workload coverage and generalizability. Moreover, these methods often incur substantial development costs. Recently, LLMs have demonstrated extraordinary potential in various vertical domains, especially in code-related tasks. However, the performance of existing LLMs in CUDA transpilation, particularly for high-performance code, remains suboptimal. The main reason for this limitation lies in the lack of high-quality training datasets. To address these challenges, we propose a novel framework for generating high-performance CUDA and corresponding platform code pairs, leveraging AI compiler and automatic optimization technology. We further enhance the framework with a graph-based data augmentation method and introduce HPCTransEval, a benchmark for evaluating LLM performance on CUDA transpilation. We conduct experiments using CUDA-to-CPU transpilation as a case study on leading LLMs. The result demonstrates that our framework significantly improves CUDA transpilation, highlighting the potential of LLMs to address compatibility challenges within the CUDA ecosystem.

[AI-45] Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

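[Quick Read]: This paper addresses two problems that Multi-modal Large Language Model (MLLM) agents for graphical user interfaces (GUIs) face on long-horizon tasks in online environments: insufficient knowledge and the gap between offline and online domains. The key to the solution is a Hierarchical Multimodal Skills (HMS) module that progressively abstracts trajectories into execution skills, core skills, and meta-skills, building a hierarchical knowledge structure for long-horizon task planning. It is combined with the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which uses skills acquired in offline environments to narrow the domain gap and make online exploration more efficient.

Link: https://arxiv.org/abs/2506.10387
Authors: Yuquan Xie,Zaijing Li,Rui Shao,Gongwei Chen,Kaiwen Zhou,Yinchuan Li,Dongmei Jiang,Liqiang Nie
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 5 figures, 5 tables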

Abstract:Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: this https URL

[AI-46] NeuroPAL: Punctuated Anytime Learning with Neuroevolution for Macromanagement in Starcraft: Brood War

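[Quick Read]: This paper addresses the low training efficiency of neuroevolution in StarCraft: Brood War, particularly for macromanagement, where traditional approaches such as rule-based systems or supervised deep learning are limited in adaptability and computational efficiency. The proposed solution is the NeuroPAL framework. Its key idea is to combine Neuroevolution of Augmenting Topologies (NEAT) with Punctuated Anytime Learning (PAL), alternating frequent low-fidelity training with periodic high-fidelity evaluations to improve the sample efficiency of NEAT and find effective strategies in fewer training iterations.

Link: https://arxiv.org/abs/2506.10384
Authors: Jim O’Connor,Yeonghun Lee,Gary B Parker
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE Conference on Games 2025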

Abstract:StarCraft: Brood War remains a challenging benchmark for artificial intelligence research, particularly in the domain of macromanagement, where long-term strategic planning is required. Traditional approaches to StarCraft AI rely on rule-based systems or supervised deep learning, both of which face limitations in adaptability and computational efficiency. In this work, we introduce NeuroPAL, a neuroevolutionary framework that integrates Neuroevolution of Augmenting Topologies (NEAT) with Punctuated Anytime Learning (PAL) to improve the efficiency of evolutionary training. By alternating between frequent, low-fidelity training and periodic, high-fidelity evaluations, PAL enhances the sample efficiency of NEAT, enabling agents to discover effective strategies in fewer training iterations. We evaluate NeuroPAL in a fixed-map, single-race scenario in StarCraft: Brood War and compare its performance to standard NEAT-based training. Our results show that PAL significantly accelerates the learning process, allowing the agent to reach competitive levels of play in approximately half the training time required by NEAT alone. Additionally, the evolved agents exhibit emergent behaviors such as proxy barracks placement and defensive building optimization, strategies commonly used by expert human players. These findings suggest that structured evaluation mechanisms like PAL can enhance the scalability and effectiveness of neuroevolution in complex real-time strategy environments.
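
The alternation between frequent low-fidelity training and periodic high-fidelity evaluation can be illustrated with a toy loop. This is not NEAT and not the authors' PAL procedure: a plain mutation-only evolutionary search on a one-dimensional objective is swapped in, and the functions `low_fidelity` and `high_fidelity`, the interval of 5 generations, and all numeric settings are assumptions made for illustration.

```python
import random
from typing import Tuple


def low_fidelity(x: float, rng: random.Random) -> float:
    """Cheap, noisy proxy for fitness (stand-in for short or simplified games)."""
    return -(x - 3.0) ** 2 + rng.gauss(0.0, 1.0)


def high_fidelity(x: float) -> float:
    """Expensive, accurate fitness (stand-in for full evaluation matches)."""
    return -(x - 3.0) ** 2


def evolve(generations: int = 40, pop_size: int = 20,
           punctuation_interval: int = 5, seed: int = 0) -> Tuple[float, float, int]:
    rng = random.Random(seed)
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    champion, champion_score = population[0], high_fidelity(population[0])
    high_fidelity_calls = 1
    for gen in range(1, generations + 1):
        # Frequent step: rank by the cheap proxy, keep the top half, mutate.
        ranked = sorted(population, key=lambda x: low_fidelity(x, rng), reverse=True)
        parents = ranked[: pop_size // 2]
        children = [p + rng.gauss(0.0, 0.5) for p in parents]
        population = parents + children
        # Punctuation: every few generations, spend the expensive evaluation
        # on a few candidates and re-insert the verified champion.
        if gen % punctuation_interval == 0:
            for candidate in parents[:3]:
                score = high_fidelity(candidate)
                high_fidelity_calls += 1
                if score > champion_score:
                    champion, champion_score = candidate, score
            population[-1] = champion
    return champion, champion_score, high_fidelity_calls


champion, score, calls = evolve()
print(champion, score, calls)
```

With the default settings the expensive evaluation runs only 25 times (once at the start, then three times at each of the eight punctuation points), while the cheap proxy is evaluated 800 times.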

[AI-47] Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

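[Quick Read]: This paper addresses the challenges of building a generalist agent with perception, planning, action, grounding, and reflection capabilities in open-world environments such as Minecraft: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity. The key to the solution lies in three contributions. First, a knowledge-enhanced data generation pipeline provides scalable, high-quality training data. Second, a Mixture-of-Experts (MoE) architecture with task-level routing mitigates interference among heterogeneous tasks. Third, a multimodal reasoning-augmented reinforcement learning approach improves the agent's reasoning about visual diversity in Minecraft. Building on these, the authors present Optimus-3, a general-purpose agent for Minecraft, and validate it through extensive experiments.

Link: https://arxiv.org/abs/2506.10357
Authors: Zaijing Li,Yuquan Xie,Rui Shao,Gongwei Chen,Weili Guan,Dongmei Jiang,Liqiang Nie
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures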

Abstract:Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging because of insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent’s reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: this https URL
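
Task-level routing differs from the more common token-level routing in that the expert is chosen once per task rather than once per token. The sketch below shows that distinction only; the class `TaskRoutedMoE`, the task names, and the toy scaling experts are hypothetical and do not reflect how Optimus-3 is implemented.

```python
from typing import Callable, Dict, List

Expert = Callable[[List[float]], List[float]]


def make_expert(scale: float) -> Expert:
    """Toy expert that scales its input (stand-in for a feed-forward block)."""
    return lambda hidden: [scale * h for h in hidden]


class TaskRoutedMoE:
    """Hypothetical mixture-of-experts layer with one expert per task."""

    def __init__(self, experts: Dict[str, Expert]):
        self.experts = experts

    def forward(self, task: str, tokens: List[List[float]]) -> List[List[float]]:
        # Task-level routing: the expert is selected once from the task label,
        # so every token in the sequence passes through the same expert.
        expert = self.experts[task]
        return [expert(token) for token in tokens]


moe = TaskRoutedMoE({"planning": make_expert(2.0), "grounding": make_expert(-1.0)})
tokens = [[1.0, 2.0], [3.0, 4.0]]
planning_out = moe.forward("planning", tokens)
grounding_out = moe.forward("grounding", tokens)
print(planning_out)
print(grounding_out)
```

Since each task only ever touches its own expert's parameters, gradient updates for one task cannot overwrite what another task's expert has learned, which is one intuitive reading of how such routing could reduce interference.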

[AI-48] PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation NEURIPS2025

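[Quick Read]: This paper addresses the motion artifacts, baseline drift, and other low signal-to-noise-ratio (SNR) disturbances that corrupt physiological signals, as well as the modeling difficulty caused by their strong non-stationarity. The key to the solution is a novel wavelet-based approach to physiological signal analysis that captures multi-scale time-frequency features. On top of it, the paper introduces two large-scale pretrained models for EMG and ECG and builds a unified multi-modal framework that integrates a pretrained EEG model, guiding each modality through a dedicated branch and combining them via learnable weighted fusion to handle low SNR, high inter-subject variability, and device mismatch.

Link: https://arxiv.org/abs/2506.10351
Authors: Yanlong Chen,Mattia Orlandi,Pierangelo Maria Rapa,Simone Benatti,Luca Benini,Yawei Li
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures, 9 tables. Submitted to NeurIPS 2025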

Abstract:Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications.

[AI-49] Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements

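[Quick Read]: This paper addresses the automation of code issue detection and revision in software development by integrating Large Language Models (LLMs) such as OpenAI's GPT-3.5 Turbo and GPT-4o, with the goal of improving code quality and streamlining development. The key to the solution is a static code analysis framework that detects bugs, vulnerabilities, and code smells, paired with iterative prompt engineering so that the output meets project requirements. Retrieval-Augmented Generation (RAG) is added to make the revisions more relevant and precise, and a custom-built “Code Comparison App” handles LLM hallucinations, enabling efficient and reliable automated code revision.

Link: https://arxiv.org/abs/2506.10330
Authors: Seyed Moein Abtahi,Akramul Azim
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted at FORGE 2025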

Abstract:This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) such as OpenAI’s GPT-3.5 Turbo and GPT-4o into software development workflows. A static code analysis framework detects issues such as bugs, vulnerabilities, and code smells within a large-scale software project. Detailed information on each issue was extracted and organized to facilitate automated code revision using LLMs. An iterative prompt engineering process is applied to ensure that prompts are structured to produce accurate and organized outputs aligned with the project requirements. Retrieval-augmented generation (RAG) is implemented to enhance the relevance and precision of the revisions, enabling LLM to access and integrate real-time external knowledge. The issue of LLM hallucinations - where the model generates plausible but incorrect outputs - is addressed by a custom-built “Code Comparison App,” which identifies and corrects erroneous changes before applying them to the codebase. Subsequent scans using the static code analysis framework revealed a significant reduction in code issues, demonstrating the effectiveness of combining LLMs, static analysis, and RAG to improve code quality, streamline the software development process, and reduce time and resource expenditure.
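
The abstract does not describe how the “Code Comparison App” works. One plausible check, assumed here purely for illustration, is to diff the LLM's revision against the original file and report any changed lines that the static analyzer never flagged. The function names and the sample snippet are hypothetical; only Python's standard `difflib` module is used.

```python
import difflib
from typing import List, Set


def changed_original_lines(original: str, revised: str) -> Set[int]:
    """1-based line numbers of the original that the revision alters or deletes."""
    a, b = original.splitlines(), revised.splitlines()
    changed: Set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1 + 1, i2 + 1))
        elif tag == "insert":
            # Attribute a pure insertion to the original line just before it.
            changed.add(max(i1, 1))
    return changed


def unexpected_changes(original: str, revised: str, flagged: Set[int]) -> List[int]:
    """Lines changed by the revision that the static analyzer did not flag."""
    return sorted(changed_original_lines(original, revised) - flagged)


original = "\n".join([
    "total = 0",
    "for i in range(len(xs)):",
    "    total += xs[i]",
    "print(total)",
])
revised = "\n".join([
    "total = 0",
    "for x in xs:",
    "    total += x",
    "print(total + 1)",
])
# Suppose the analyzer flagged only the loop (lines 2 and 3).
suspicious = unexpected_changes(original, revised, flagged={2, 3})
print(suspicious)
```

Here the revision also altered line 4, which was never flagged, so that line is reported for review before the change is applied.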

[AI-50] A Benchmark for Generalizing Across Diverse Team Strategies in Competitive Pokémon NEURIPS2025

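[Quick Read]: This paper addresses a central challenge in multi-agent learning: AI agents that adapt to dramatically different strategic landscapes without retraining. The key contribution is the VGC-Bench benchmark, which provides critical infrastructure, standardized evaluation protocols, human-play datasets, and a range of baselines, including large-language-model agents, behavior cloning, reinforcement learning, and empirical game-theoretic methods such as self-play, fictitious play, and double oracle. Training and evaluating agents on a single-team configuration, the authors show that their methods can win against a professional VGC competitor, but they also find that even the best-performing algorithm in the single-team setting struggles to generalize as team size grows.

Link: https://arxiv.org/abs/2506.10326
Authors: Cameron Angliss,Jiaxun Cui,Jiaheng Hu,Arrasy Rahman,Peter Stone
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 15 pages, 3 figures, 10 tables, submitted to NeurIPS 2025 Datasets Benchmarks Track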

Abstract:Developing AI agents that can robustly adapt to dramatically different strategic landscapes without retraining is a central challenge for multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with an extraordinarily large space of possible team configurations of approximately 10^139 - far larger than those of Dota or Starcraft. The highly discrete, combinatorial nature of team building in Pokémon VGC causes optimal strategies to shift dramatically depending on both the team being piloted and the opponent’s team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies human-play datasets and a range of baselines - from large-language-model agents and behavior cloning to reinforcement learning and empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated on a single-team configuration, our methods are able to win against a professional VGC competitor. We extensively evaluated all baseline methods over progressively larger team sets and find that even the best-performing algorithm in the single-team setting struggles at scaling up as team size grows. Thus, policy generalization across diverse team strategies remains an open challenge for the community. Our code is open sourced at this https URL.

[AI-51] Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving

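[Quick Read]: This paper addresses lane-topology prediction in autonomous driving, which is critical for safe and reliable autonomous navigation. The key to the solution is to augment SMERF, a map-prior-based online lane-topology prediction model, in a lightweight manner by combining structured road metadata from OpenStreetMap (OSM) and lane-width priors from road design manuals with the road centerline encodings.

Link: https://arxiv.org/abs/2506.10317
Authors: Akshar Tumu,Henrik I. Christensen,Marcell Vazquez-Chanlatte,Chikao Tsuchiya,Dhaval Bhanderi
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures, Accepted at RSS 2025 Workshop - RobotEvaluation@RSS2025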

Abstract:Lane-topology prediction is a critical component of safe and reliable autonomous navigation. An accurate understanding of the road environment aids this task. We observe that this information often follows conventions encoded in natural language, through design codes that reflect the road structure and road names that capture the road functionality. We augment this information in a lightweight manner to SMERF, a map-prior-based online lane-topology prediction model, by combining structured road metadata from OSM maps and lane-width priors from Road design manuals with the road centerline encodings. We evaluate our method on two geo-diverse complex intersection scenarios. Our method shows improvement in both lane and traffic element detection and their association. We report results using four topology-aware metrics to comprehensively assess the model performance. These results demonstrate the ability of our approach to generalize and scale to diverse topologies and conditions.

[AI-52] The Alignment Trap: Complexity Barriers

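[Quick Read]: This paper addresses the computational complexity of verifying the safety of AI systems as their capabilities grow. It shows that once system expressiveness exceeds a critical threshold, safety verification requires exponential time and is coNP-complete, so traditional verification methods become infeasible for highly capable AI systems. The key is the Capability-Risk Scaling (CRS) dynamic, together with proofs that verification complexity grows exponentially with system expressiveness, that safe policies occupy a vanishingly small fraction of the policy space, that no finite set of alignment techniques can provide universal coverage, and that robust safety properties form measure-zero sets for neural networks, which together establish an “intractability gap”. The paper closes with a strategic trilemma: AI development must constrain system complexity, accept unverifiable risks, or explore new safety paradigms.

Link: https://arxiv.org/abs/2506.10304
Authors: Jasper Yao
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 29 Pages, 4 Figures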

Abstract:We establish fundamental computational complexity barriers to verifying AI safety as system capabilities scale. Our main results show that for AI systems with expressiveness $\mathrm{EXP}(m)$ above a critical threshold $\tau$, safety verification requires exponential time and is coNP-complete. We formalize the Capability-Risk Scaling (CRS) dynamic, which demonstrates how increasing AI capability drives societal safety requirements toward perfection, creating an inescapable tension with verification complexity. Through four core theorems, we prove that (1) verification complexity grows exponentially with system expressiveness, (2) safe policies comprise at most a $2^{-2^m}$ fraction of the policy space, (3) no finite set of alignment techniques can provide universal coverage, and (4) robust safety properties form measure-zero sets for neural networks. These results characterize an “intractability gap” where practical safety requirements fall within the region of computational intractability. We conclude by presenting a strategic trilemma: AI development must either constrain system complexity to maintain verifiable safety, accept unverifiable risks while scaling capabilities, or develop fundamentally new safety paradigms beyond verification. Our work provides the first systematic complexity-theoretic analysis of AI alignment and establishes rigorous bounds that any safety approach must confront. A formal verification of the core theorems in Lean4 is currently in progress.
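
To get a feel for the double-exponential fraction quoted in the abstract, the sketch below counts policies under one simple reading. It assumes that a policy is a Boolean function on m input bits, which is an assumption about the formalization that the abstract does not spell out; under that reading there are 2^(2^m) policies, so a single policy occupies a 2^(-2^m) share of the space. The function names are hypothetical.

```python
from fractions import Fraction


def num_policies(m: int) -> int:
    """Number of Boolean functions on m input bits: 2 ** (2 ** m)."""
    return 2 ** (2 ** m)


def single_policy_fraction(m: int) -> Fraction:
    """Share of the policy space occupied by exactly one policy."""
    return Fraction(1, num_policies(m))


for m in range(1, 6):
    print(m, num_policies(m), float(single_policy_fraction(m)))
```

Already at m = 5 the space holds 2^32 (about 4.3 billion) policies, which shows how quickly exhaustive checking becomes impractical under this reading.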

[AI-53] Towards Understanding Bias in Synthetic Data for Evaluation

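[Quick Read]: This paper addresses how to assess the reliability of synthetic test collections built with generative AI for evaluating Information Retrieval (IR) systems. The approach uses Large Language Models (LLMs) to generate synthetic queries, labels, or both, builds synthetic test collections from them, and uses empirical analysis and a linear mixed-effects model to examine the bias such collections can introduce into system evaluation and its effects. The results indicate that synthetic test collections can show significant bias when computing absolute system performance, but the effect is smaller when comparing relative system performance.

Link: https://arxiv.org/abs/2506.10301
Authors: Hossein A. Rahmani,Varsha Ramineni,Nick Craswell,Bhaskar Mitra,Emine Yilmaz
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: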

Abstract:Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work [rahmani2024synthetic] showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using LLMs, where LLMs are used to generate synthetic queries, labels, or both. In particular, we examine the potential biases that might occur when such test collections are used for evaluation. We first empirically show the presence of such bias in evaluation results and analyse the effects it might have on system evaluation. We further validate the presence of such bias using a linear mixed-effects model. Our analysis shows that while the effect of bias present in evaluation results obtained using synthetic test collections could be significant, e.g., for computing absolute system performance, its effect may not be as significant in comparing relative system performance. Codes and data are available at: this https URL.
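
The distinction between absolute and relative system performance can be made concrete with a rank correlation. In the sketch below the scores are invented for illustration and are not results from the paper; Kendall's tau (the tau-a variant, chosen here as an assumption) measures whether two test collections order the systems the same way, even when the scores themselves differ.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up effectiveness scores for four systems on two test collections.
human = [0.42, 0.35, 0.51, 0.28]
synthetic = [0.60, 0.55, 0.66, 0.47]

mean_abs_gap = sum(abs(s - h) for s, h in zip(synthetic, human)) / len(human)
tau = kendall_tau(human, synthetic)
print(round(mean_abs_gap, 3), tau)
```

In this toy case the synthetic collection inflates every score by roughly 0.18, yet it ranks the four systems identically (tau = 1.0), which is the kind of pattern the abstract's conclusion describes.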

[AI-54] Closer to Language than Steam: AI as the Cognitive Engine of a New Productivity Revolution

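[Quick Read]: This paper addresses the role of Artificial Intelligence (AI) in raising productivity and its social and economic impact, arguing that AI is a cognitive revolution rather than another mechanized tool. The key is to treat AI as a cognition-augmenting technology akin to written language and, through a theoretical framing and a multidisciplinary perspective, to show how AI drives a productivity shift by amplifying knowledge work. The paper stresses that AI complements human cognitive abilities, which motivates rethinking skills, organizations, and policies.

Link: https://arxiv.org/abs/2506.10281
Authors: Xinmin Fang,Lingfeng Tao,Zhengxiong Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages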

Abstract:Artificial Intelligence (AI) is reframed as a cognitive engine driving a novel productivity revolution distinct from the Industrial Revolution’s physical thrust. This paper develops a theoretical framing of AI as a cognitive revolution akin to written language - a transformative augmentation of human intellect rather than another mechanized tool. We compare AI’s emergence to historical leaps in information technology to show how it amplifies knowledge work. Examples from various domains demonstrate AI’s impact as a driver of productivity in cognitive tasks. We adopt a multidisciplinary perspective combining computer science advances with economic insights and sociological perspectives on how AI reshapes work and society. Through conceptual frameworks, we visualize the shift from manual to cognitive productivity. Our central argument is that AI functions as an engine of cognition - comparable to how human language revolutionized knowledge - heralding a new productivity paradigm. We discuss how this revolution demands rethinking of skills, organizations, and policies. This paper, balancing academic rigor with clarity, concludes that AI’s promise lies in complementing human cognitive abilities, marking a new chapter in productivity evolution.

[AI-55] WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

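[Quick Read]: This paper addresses the systematic evaluation and modeling of strategic reasoning in Large Language Models (LLMs), especially assessing multi-agent behaviors, formulating plans, and adapting strategies in dynamic environments. The key to the solution is WGSR-Bench, a wargame-based benchmark. It designs test samples around three core tasks, environmental situation awareness, opponent risk modeling, and policy generation, which form the S-POE architecture, to systematically assess LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning.

Link: https://arxiv.org/abs/2506.10264
Authors: Qiyue Yin,Pei Xu,Qiaozhe Li,Shengda Liu,Shengqi Shen,Tong Wang,Yihong Han,Xiaonan Zhao,Likun Yang,Shiyue Cao,Shiyu Qiu,Yuxuan Liu,Shizhao Yu,Lei Cui,Chengxin Yan,Jie Sun,Xiangquan Tang,Kaiqi Huang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 17 figures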

Abstract:Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence’s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic environments, formulate action plans, and adapt strategies, has yet to be systematically evaluated or modeled. To address this gap, this paper introduces WGSR-Bench, the first strategy reasoning benchmark for LLMs using wargame as its evaluation environment. Wargame, a quintessential high-complexity strategic scenario, integrates environmental uncertainty, adversarial dynamics, and non-unique strategic choices, making it an effective testbed for assessing LLMs’ capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning. WGSR-Bench designs test samples around three core tasks, i.e., Environmental situation awareness, Opponent risk modeling and Policy generation, which serve as the core S-POE architecture, to systematically assess main abilities of strategic reasoning. Finally, an LLM-based wargame agent is designed to integrate these parts for a comprehensive strategy reasoning assessment. With WGSR-Bench, we hope to assess the strengths and limitations of state-of-the-art LLMs in game-theoretic strategic reasoning and to advance research in large model-driven strategic intelligence.

[AI-56] Extended Creativity: A Conceptual Framework for Understanding Human-AI Creative Relations

[Quick Read]: This paper addresses how Artificial Intelligence (AI) can effectively enhance human creativity. Adopting the perspective of distributed creativity, it identifies three primary modes through which AI contributes to creative processes, Support, Synergy, and Symbiosis, and defines them along two key dimensions: the level of technical autonomy of the AI system and the degree of agency that humans attribute to it. By analyzing how each configuration influences different levels of creativity, the paper discusses the theoretical, ethical, and design implications.

Link: https://arxiv.org/abs/2506.10249
Authors: Andrea Gaggioli,Sabrina Bartolotta,Andrea Ubaldi,Katusha Gerardini,Eleonora Diletta Sarcinella,Alice Chirico
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 36 pages, 3 figures. This conceptual paper proposes a taxonomy of Extended Creativity systems and examines the relational dynamics between human and AI agents in creative processes. Suitable for readers in HCI, AI, cognitive science, and digital design. The illustrations were created by Francesco Giordano and are used with permission (not under CC license)

Abstract:Artificial Intelligence holds significant potential to enhance human creativity. However, achieving this vision requires a clearer understanding of how such enhancement can be effectively realized. Adopting the perspective of distributed creativity, we identify three primary modes through which AI can contribute to creative processes: Support, where AI acts as a tool; Synergy, where AI and humans collaborate in complementary ways; and Symbiosis, where human and AI cognition become so integrated that they form a unified creative system. These modes are defined along two key dimensions: the level of technical autonomy exhibited by the AI system and the degree of perceived agency attributed to it. We examine how each configuration influences different levels of creativity - from everyday problem-solving to paradigm-shifting innovation - and discuss the theoretical, ethical, and design implications.

[AI-57] LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation ICML

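[Quick Read]: This paper addresses efficiency and precision in automating analog topology design. Under strict tolerance requirements in particular, the prior approach performs poorly because of its $O(|V|^2)$ token length and low sensitivity to numeric inputs. The key to the solution is a succinct float-input canonical formulation with identifier (SFCI), which improves component-type recognition through identifier-based representations, reduces token length complexity to $O(|V|)$, and increases numeric precision sensitivity, achieving higher success rates and lower mean squared error under tight tolerances.

Link: https://arxiv.org/abs/2506.10235
Authors: Chen-Chia Chang,Wan-Hsuan Lin,Yikang Shen,Yiran Chen,Xin Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted at 42nd International Conference on Machine Learning (ICML) 2025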

Abstract:Automation of analog topology design is crucial due to customized requirements of modern applications with heavily manual engineering efforts. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.

[AI-58] Fine-Grained control over Music Generation with Activation Steering

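[Quick Read]: This paper addresses fine-grained control over audio content in music generation, specifically timbre transfer, style transfer, and genre fusion. The key to the solution is to intervene at inference time on the residual stream or attention-layer activations of MusicGen, an autoregressive generative music transformer, steering them with the weights of linear probes to obtain local control over generation. The authors find that modelling this as a regression task improves performance, and hypothesize that mean squared error helps preserve meaningful directional information in the activation space. Combined with the global conditioning that text prompts provide in MusicGen, the method gives both global and local control over music generation.

Link: https://arxiv.org/abs/2506.10225
Authors: Dipanshu Panda,Jayden Koshy Joe,Harshith M R,Swathi Narashiman,Pranay Mathur,Anish Veerakumar,Aniruddh Krishna,Keerthiharan A
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: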

Abstract:We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using weights of linear probes trained on it, or by steering the attention layer activations in a similar manner. We observe that modelling this as a regression task provides improved performance, hypothesizing that the mean-squared-error better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available at our demo page.
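
The abstract says only that steering uses the weights of linear probes. A common form of activation steering, assumed here rather than taken from the paper, adds a scaled copy of the normalized probe direction to an activation vector: h' = h + alpha * w / ||w||. The sketch below applies that update to a single toy vector; `steer`, `probe_score`, and the strength `alpha` are hypothetical names, and no MusicGen code is involved.

```python
import math
from typing import List


def probe_score(hidden: List[float], probe_weights: List[float]) -> float:
    """Output of a bias-free linear probe: the dot product w . h."""
    return sum(h * w for h, w in zip(hidden, probe_weights))


def steer(hidden: List[float], probe_weights: List[float], alpha: float) -> List[float]:
    """Shift one activation vector along the unit-normalized probe direction."""
    norm = math.sqrt(sum(w * w for w in probe_weights))
    if norm == 0.0:
        return list(hidden)  # no direction to steer along
    return [h + alpha * w / norm for h, w in zip(hidden, probe_weights)]


hidden = [1.0, 0.0, -1.0]   # toy residual-stream activation
probe = [3.0, 0.0, 4.0]     # toy probe weights for some target attribute
steered = steer(hidden, probe, alpha=5.0)

before = probe_score(hidden, probe)
after = probe_score(steered, probe)
print(steered, before, after)
```

The probe's output rises by alpha times the norm of the probe weights (here 5 × 5 = 25), so alpha acts as a dial for how strongly the target attribute is pushed.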

[AI-59] Cross-Learning Between ECG and PCG: Exploring Common and Exclusive Characteristics of Bimodal Electromechanical Cardiac Waveforms

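[Quick read]: This paper addresses the unclear information overlap and mutual-reconstruction potential of simultaneous electrocardiogram (ECG) and phonocardiogram (PCG) signals in characterizing cardiac function, especially across physiological states and individuals. The key to its solution is to reconstruct each modality from the other using linear and nonlinear machine learning models, in particular non-causal long short-term memory (LSTM) networks, and to improve cross-subject generalization through envelope-based modeling, so that clinically relevant ECG biomarkers (such as fiducial points and QT intervals) can be estimated from PCG.

Link: https://arxiv.org/abs/2506.10212
Authors: Sajjad Karimi,Amit J. Shah,Gari D. Clifford,Reza Sameni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: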

Abstract:Simultaneous electrocardiography (ECG) and phonocardiogram (PCG) provide a comprehensive, multimodal perspective on cardiac function by capturing the heart’s electrical and mechanical activities, respectively. However, the distinct and overlapping information content of these signals, as well as their potential for mutual reconstruction and biomarker extraction, remains incompletely understood, especially under varying physiological conditions and across individuals. In this study, we systematically investigate the common and exclusive characteristics of ECG and PCG using the EPHNOGRAM dataset of simultaneous ECG-PCG recordings during rest and exercise. We employ a suite of linear and nonlinear machine learning models, including non-causal LSTM networks, to reconstruct each modality from the other and analyze the influence of causality, physiological state, and cross-subject variability. Our results demonstrate that nonlinear models, particularly non-causal LSTM, provide superior reconstruction performance, with reconstructing ECG from PCG proving more tractable than the reverse. Exercise and cross-subject scenarios present significant challenges, but envelope-based modeling that utilizes instantaneous amplitude features substantially improves cross-subject generalizability for cross-modal learning. Furthermore, we demonstrate that clinically relevant ECG biomarkers, such as fiducial points and QT intervals, can be estimated from PCG in cross-subject settings. These findings advance our understanding of the relationship between electromechanical cardiac modalities, in terms of both waveform characteristics and the timing of cardiac events, with potential applications in novel multimodal cardiac monitoring technologies. 
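
The "instantaneous amplitude" features mentioned in the abstract are commonly obtained as the magnitude of the analytic signal. The sketch below shows that standard construction with a naive DFT so that it needs only the standard library. It is an assumption that this matches the paper's envelope extractor, which the abstract does not specify; the function names `dft` and `envelope` are illustrative.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out

def envelope(signal):
    """Instantaneous amplitude of an even-length real signal via the analytic signal."""
    n = len(signal)
    if n % 2:
        raise ValueError("this sketch assumes an even number of samples")
    spectrum = dft(signal)
    for k in range(n):
        if 0 < k < n // 2:
            spectrum[k] *= 2      # double the positive frequencies
        elif k > n // 2:
            spectrum[k] = 0       # discard the negative frequencies
    return [abs(z) for z in dft(spectrum, inverse=True)]

# Synthetic amplitude-modulated tone (assumption: stands in for a heart-sound burst).
n = 64
amplitude = [1 + 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
tone = [a * math.cos(2 * math.pi * 8 * t / n) for t, a in enumerate(amplitude)]
recovered = envelope(tone)
```

For this synthetic tone the recovered envelope matches the slowly varying `amplitude` sequence, which illustrates why an envelope discards carrier detail that may differ between subjects while keeping event timing.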

[AI-60] Towards Responsible AI: Advances in Safety, Fairness, and Accountability of Autonomous Systems

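[Quick read]: This paper addresses challenges in the safety, fairness, transparency, and accountability of artificial intelligence (AI) systems, with the aim of more trustworthy AI. Its key contribution is a family of shields: deterministic and probabilistic safety shields that handle delayed observations, fairness shields for sequential decision-making, and a formal framework for assessing intention, which together improve the safety, fairness, and accountability of AI systems in practice.

Link: https://arxiv.org/abs/2506.10192
Authors: Filip Cano
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 202 pages, 38 figures, PhD Thesis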

Abstract:Ensuring responsible use of artificial intelligence (AI) has become imperative as autonomous systems increasingly influence critical societal domains. However, the concept of trustworthy AI remains broad and multi-faceted. This thesis advances knowledge in the safety, fairness, transparency, and accountability of AI systems. In safety, we extend classical deterministic shielding techniques to become resilient against delayed observations, enabling practical deployment in real-world conditions. We also implement both deterministic and probabilistic safety shields into simulated autonomous vehicles to prevent collisions with road users, validating the use of these techniques in realistic driving simulators. We introduce fairness shields, a novel post-processing approach to enforce group fairness in sequential decision-making settings over finite and periodic time horizons. By optimizing intervention costs while strictly ensuring fairness constraints, this method efficiently balances fairness with minimal interference. For transparency and accountability, we propose a formal framework for assessing intentional behaviour in probabilistic decision-making agents, introducing quantitative metrics of agency and intention quotient. We use these metrics to propose a retrospective analysis of intention, useful for determining responsibility when autonomous systems cause unintended harm. Finally, we unify these contributions through the "reactive decision-making" framework, providing a general formalization that consolidates previous approaches. Collectively, the advancements presented contribute practically to the realization of safer, fairer, and more accountable AI systems, laying the foundations for future research in trustworthy AI.
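
A deterministic safety shield can be pictured as a filter between the agent and the environment: a proposed action passes through unchanged when it is provably safe and is otherwise replaced by the closest safe alternative. The sketch below is a toy under stated assumptions, not the thesis's construction: a one-dimensional vehicle with integer speed, actions of -1, 0, or +1 speed units per step, and a hypothetical safety rule that the vehicle must always be able to brake to a stop before the obstacle. Delayed observations, which the thesis handles, are not modelled here.

```python
def is_safe(gap, speed, action):
    """True if, after taking `action`, braking every step still avoids the obstacle."""
    new_speed = max(0, speed + action)
    new_gap = gap - new_speed
    # Distance covered while braking by one unit per step from new_speed down to 0.
    stopping_distance = new_speed * (new_speed - 1) // 2
    return new_gap - stopping_distance > 0

def shield(gap, speed, proposed, actions=(-1, 0, 1)):
    """Return the proposed action if safe, else the safe action closest to it."""
    if is_safe(gap, speed, proposed):
        return proposed
    safe = [a for a in actions if is_safe(gap, speed, a)]
    if not safe:
        return min(actions)  # no safe action left: brake as hard as possible
    return min(safe, key=lambda a: abs(a - proposed))

far_from_obstacle = shield(gap=100, speed=3, proposed=1)   # accelerating is allowed
near_obstacle = shield(gap=10, speed=4, proposed=1)        # accelerating is overridden
```

Choosing the safe action nearest to the proposal reflects the "minimal interference" goal the abstract describes for shields; the distance measure used here is an illustrative choice.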

[AI-61] Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment ICML2025

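[Quick read]: This paper addresses the limited scalability and efficiency of conventional equivariant diffusion models for 3D molecule generation, which stem from their specialized equivariant architectures. The key to its solution is to relax the equivariance constraint: a sample-dependent SO(3) transformation is learned for each molecule to construct an aligned latent space, and a non-equivariant diffusion model is trained on top of it, improving training and sampling efficiency while maintaining sample quality.

Link: https://arxiv.org/abs/2506.10186
Authors: Yuhui Ding,Thomas Hofmann
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025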

Abstract:Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at this https URL

[AI-62] Optimizing Genetic Algorithms with Multilayer Perceptron Networks for Enhancing TinyFace Recognition

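[Quick read]: This paper examines how multilayer perceptrons (MLPs) perform under different feature-engineering approaches: baseline training, feature selection based on a genetic algorithm (GA), and dimensionality reduction based on principal component analysis (PCA). The key is to show how these techniques affect MLP performance. The study argues that feature selection and dimensionality reduction play interdependent roles in improving the model, and that on complex datasets GA markedly improves accuracy by identifying the critical features.

Link: https://arxiv.org/abs/2506.10184
Authors: Mohammad Subhi Al-Batah,Mowafaq Salem Alzboon,Muhyeeddin Alqaraleh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: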

Abstract:This study empirically examines MLP networks through systematic experiments on three diverse datasets: TinyFace, Heart Disease, and Iris. It compares three methods: (a) baseline training using the default settings for the Multi-Layer Perceptron (MLP), (b) feature selection using Genetic Algorithm (GA) based refinement, and (c) Principal Component Analysis (PCA) based dimension reduction. The results show how these techniques affect performance. While PCA showed benefits in low-dimensional and noise-free datasets, GA consistently increased accuracy in complex datasets by accurately identifying critical features. The comparison reveals that feature selection and dimensionality reduction play interdependent roles in enhancing MLP performance. The study contributes to the literature on feature engineering and neural network parameter optimization, offering practical guidelines for a wide range of machine learning tasks.

[AI-63] A Comparative Study of Machine Learning Techniques for Early Prediction of Diabetes

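[Quick read]: This paper addresses early identification and control of diabetes. The key to its approach is applying several machine learning methods to the Pima Indians Diabetes dataset to evaluate how well they predict diabetes. The results show that the Neural Network algorithm performed best, with an accuracy of 78.57%, indicating the potential and efficiency of machine learning algorithms for diabetes prediction.

Link: https://arxiv.org/abs/2506.10180
Authors: Mowafaq Salem Alzboon,Mohammad Al-Batah,Muhyeeddin Alqaraleh,Ahmad Abuashour,Ahmad Fuad Bader
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: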

Abstract:In many nations, diabetes is becoming a significant health problem, and early identification and control are crucial. Using machine learning algorithms to predict diabetes has yielded encouraging results. Using the Pima Indians Diabetes dataset, this study attempts to evaluate the efficacy of several machine-learning methods for diabetes prediction. The collection includes information on 768 patients, such as their ages, BMIs, and glucose levels. The techniques assessed are Logistic Regression, Decision Tree, Random Forest, k-Nearest Neighbors, Naive Bayes, Support Vector Machine, Gradient Boosting, and Neural Network. The findings indicate that the Neural Network algorithm performed the best, with an accuracy of 78.57 percent, followed by the Random Forest method, with an accuracy of 76.30 percent. The study implies that machine learning algorithms can aid diabetes prediction and be an efficient early detection tool.

[AI-64] Correlation vs causation in Alzheimer's disease: an interpretability-driven study

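[Quick read]: This paper addresses the distinction between causation and correlation in Alzheimer's disease (AD) research, which affects diagnosis, treatment, and the identification of true disease drivers. The key to its approach is combining correlation analysis, machine learning classification, and model interpretability techniques: XGBoost is used to identify the key features that influence AD classification, and SHAP (SHapley Additive exPlanations) values are used to analyze feature contributions across disease stages. The work stresses that strong correlation does not necessarily imply causation and lays groundwork for future causal inference studies.

Link: https://arxiv.org/abs/2506.10179
Authors: Hamzah Dabool,Raghad Mustafa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Applications (stat.AP)
Comments: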

Abstract:Understanding the distinction between causation and correlation is critical in Alzheimer’s disease (AD) research, as it impacts diagnosis, treatment, and the identification of true disease drivers. This experiment investigates the relationships among clinical, cognitive, genetic, and biomarker features using a combination of correlation analysis, machine learning classification, and model interpretability techniques. Employing the XGBoost algorithm, we identified key features influencing AD classification, including cognitive scores and genetic risk factors. Correlation matrices revealed clusters of interrelated variables, while SHAP (SHapley Additive exPlanations) values provided detailed insights into feature contributions across disease stages. Our results highlight that strong correlations do not necessarily imply causation, emphasizing the need for careful interpretation of associative data. By integrating feature importance and interpretability with classical statistical analysis, this work lays groundwork for future causal inference studies aimed at uncovering true pathological mechanisms. Ultimately, distinguishing causal factors from correlated markers can lead to improved early diagnosis and targeted interventions for Alzheimer’s disease.
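
SHAP values are built on Shapley values from cooperative game theory: a feature's attribution is its marginal contribution averaged over every order in which features could be added. The sketch below computes exact Shapley values by brute force for a tiny hypothetical model. It does not use the `shap` library or XGBoost, and the feature names and the toy value function are assumptions made for illustration only.

```python
from itertools import permutations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all feature orders."""
    totals = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = frozenset()
        for f in order:
            extended = coalition | {f}
            totals[f] += value(extended) - value(coalition)
            coalition = extended
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in totals.items()}

def toy_model_value(present):
    """Hypothetical model output given the set of features that are 'switched on'."""
    score = 0.0
    if "cognitive_score" in present:
        score += 3.0
    if "genetic_risk" in present:
        score += 1.0
    if {"cognitive_score", "genetic_risk"} <= present:
        score += 2.0  # interaction term, shared equally by the two features
    return score

FEATURES = ["cognitive_score", "genetic_risk", "age"]
attributions = shapley_values(FEATURES, toy_model_value)
```

In this toy, `age` receives zero attribution even if it happened to correlate with the outcome in the data, because the model never uses it. That is an attribution statement about the model, not a causal statement about the disease, which is the caution the abstract raises.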

[AI-65] Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

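[Quick read]: This paper seeks to understand the decision-making mechanism used by a convolutional recurrent neural network (RNN), trained with model-free reinforcement learning, when playing the puzzle game Sokoban. The key is to uncover the mechanisms through which the network exploits test-time compute. These mechanisms are analogous to components of classic bidirectional search: the plan for each square is represented by the activations of channels associated with specific directions, and specialized kernels extend these activations forward and backward to form paths, thereby constituting a transition model.

Link: https://arxiv.org/abs/2506.10138
Authors: Mohammad Taufeeque,Aaron David Tucker,Adam Gleave,Adrià Garriga-Alonso
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 33 pages, 22 figures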

Abstract:We partially reverse-engineer a convolutional recurrent neural network (RNN) trained to play the puzzle game Sokoban with model-free reinforcement learning. Prior work found that this network solves more levels with more test-time compute. Our analysis reveals several mechanisms analogous to components of classic bidirectional search. For each square, the RNN represents its plan in the activations of channels associated with specific directions. These state-action activations are analogous to a value function - their magnitudes determine when to backtrack and which plan branch survives pruning. Specialized kernels extend these activations (containing plan and value) forward and backward to create paths, forming a transition model. The algorithm is also unlike classical search in some ways. State representation is not unified; instead, the network considers each box separately. Each layer has its own plan representation and value function, increasing search depth. Far from being inscrutable, the mechanisms leveraging test-time compute learned in this network by model-free training can be understood in familiar terms.

[AI-66] Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

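[Quick read]: This paper addresses the failure of goal-conditioned behavior cloning (GCBC) to generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. The key to its solution is a simple yet effective representation learning objective, BYOL-γ augmented GCBC, which can theoretically approximate the successor representation of a Markov decision process (MDP) without contrastive samples or temporal-difference (TD) learning, strengthening temporal consistency in the representation space and thereby improving combinatorial generalization.

Link: https://arxiv.org/abs/2506.10137
Authors: Daniel Lawson,Adriana Hugessen,Charlotte Cloutier,Glen Berseth,Khimya Khetarpal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: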

Abstract:Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in domains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective, BYOL-γ augmented GCBC, which is not only able to theoretically approximate the successor representation in the finite MDP case without contrastive samples or TD learning, but also results in competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
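
For a finite MDP under a fixed policy, the successor representation has a closed form that is easy to compute directly, which helps make concrete what BYOL-γ is said to approximate. The sketch below uses the common unnormalised convention that counts the current state, M = I + γP + γ²P² + ..., and obtains it by iterating M ← I + γPM. The convention, the function name, and the two-state transition matrix are assumptions for illustration; this is not the paper's learning objective.

```python
def successor_representation(P, gamma, iters=200):
    """Discounted expected state-visit counts for transition matrix P.

    Iterates M <- I + gamma * P @ M, whose fixed point is (I - gamma * P)^-1.
    """
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        M = [[identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)]
             for i in range(n)]
    return M

# Two-state chain (assumption): state 0 always moves to state 1, which is absorbing.
P = [[0.0, 1.0],
     [0.0, 1.0]]
M = successor_representation(P, gamma=0.5)
```

Row `i` of `M` summarises where the agent will spend its discounted future time when starting in state `i`, so states with similar futures get similar rows. This is the temporal-consistency property the abstract appeals to.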

[AI-67] A Conjecture on a Fundamental Trade-Off between Certainty and Scope in Symbolic and Generative AI

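[Quick read]: This paper addresses the fundamental trade-off in artificial intelligence (AI) systems between provable correctness and broad data-mapping capacity. Its central contribution is a conjecture stating that strict logical guarantees (such as the error-free outputs of classical symbolic AI) cannot be reconciled with the ability to take high-dimensional data and produce rich information outputs (as in modern generative models). By making this implicit trade-off explicit and open to rigorous verification, the conjecture reframes engineering ambitions and philosophical expectations for AI, and offers a new theoretical basis for evaluation standards, governance frameworks, and hybrid system design.

Link: https://arxiv.org/abs/2506.10130
Authors: Luciano Floridi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: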

Abstract:This article introduces a conjecture that formalises a fundamental trade-off between provable correctness and broad data-mapping capacity in Artificial Intelligence (AI) systems. When an AI system is engineered for deductively watertight guarantees (demonstrable certainty about the error-free nature of its outputs) – as in classical symbolic AI – its operational domain must be narrowly circumscribed and pre-structured. Conversely, a system that can input high-dimensional data to produce rich information outputs – as in contemporary generative models – necessarily relinquishes the possibility of zero-error performance, incurring an irreducible risk of errors or misclassification. By making this previously implicit trade-off explicit and open to rigorous verification, the conjecture significantly reframes both engineering ambitions and philosophical expectations for AI. After reviewing the historical motivations for this tension, the article states the conjecture in information-theoretic form and contextualises it within broader debates in epistemology, formal verification, and the philosophy of technology. It then offers an analysis of its implications and consequences, drawing on notions of underdetermination, prudent epistemic risk, and moral responsibility. The discussion clarifies how, if correct, the conjecture would help reshape evaluation standards, governance frameworks, and hybrid system design. The conclusion underscores the importance of eventually proving or refuting the inequality for the future of trustworthy AI.

[AI-68] GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments

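[Quick read]: This paper addresses the insufficient evaluation of existing graph-based active learning (AL) methods in dynamic, real-world environments, in particular their neglect of user-centric considerations such as sampling diversity, query fairness, and adaptability. The key to its solution is GRAIL, a framework that introduces new metrics for sustained effectiveness, diversity, and user burden, enabling comprehensive evaluation of graph AL strategies under varying conditions and highlighting the importance of balancing node importance, query diversity, and network topology.

Link: https://arxiv.org/abs/2506.10120
Authors: Maryam Khalid,Akane Sano
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: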

Abstract:Graph-based Active Learning (AL) leverages the structure of graphs to efficiently prioritize label queries, reducing labeling costs and user burden in applications like health monitoring, human behavior analysis, and sensor networks. By identifying strategically positioned nodes, graph AL minimizes data collection demands while maintaining model performance, making it a valuable tool for dynamic environments. Despite its potential, existing graph AL methods are often evaluated on static graph datasets and primarily focus on prediction accuracy, neglecting user-centric considerations such as sampling diversity, query fairness, and adaptability to dynamic settings. To bridge this gap, we introduce GRAIL, a novel benchmarking framework designed to evaluate graph AL strategies in dynamic, real-world environments. GRAIL introduces novel metrics to assess sustained effectiveness, diversity, and user burden, enabling a comprehensive evaluation of AL methods under varying conditions. Extensive experiments on datasets featuring dynamic, real-life human sensor data reveal trade-offs between prediction performance and user burden, highlighting limitations in existing AL strategies. GRAIL demonstrates the importance of balancing node importance, query diversity, and network topology, providing an evaluation mechanism for graph AL solutions in dynamic environments.

[AI-69] One For All: LLM-based Heterogeneous Mission Planning in Precision Agriculture

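[Quick read]: This paper addresses the complexity and steep learning curve that non-technical users face when using heterogeneous robots in precision agriculture, which limit wider adoption of robotic automation. The key to its solution is a natural language (NL) robotic mission planner that uses large language models (LLMs) and predefined primitives to translate human language into intermediate descriptions executable by different robotic platforms, so that users can direct robots through complex agricultural missions in natural language without writing code.

Link: https://arxiv.org/abs/2506.10106
Authors: Marcos Abel Zuzuárregui,Mustafa Melih Toslak,Stefano Carpin
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to International Federation of Automatic Control (IFAC) Sensing, Control and Automation Technologies for Agriculture - 8th AGRICONTROL 2025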

Abstract:Artificial intelligence is transforming precision agriculture, offering farmers new tools to streamline their daily operations. While these technological advances promise increased efficiency, they often introduce additional complexity and steep learning curves that are particularly challenging for non-technical users who must balance tech adoption with existing workloads. In this paper, we present a natural language (NL) robotic mission planner that enables non-specialists to control heterogeneous robots through a common interface. By leveraging large language models (LLMs) and predefined primitives, our architecture seamlessly translates human language into intermediate descriptions that can be executed by different robotic platforms. With this system, users can formulate complex agricultural missions without writing any code. In the work presented in this paper, we extend our previous system tailored for wheeled robot mission planning through a new class of experiments involving robotic manipulation and computer vision tasks. Our results demonstrate that the architecture is both general enough to support a diverse set of robots and powerful enough to execute complex mission requests. This work represents a significant step toward making robotic automation in precision agriculture more accessible to non-technical users.

[AI-70] Learning to Collaborate Over Graphs: A Selective Federated Multi-Task Learning Approach

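[Quick read]: This paper addresses the balance between personalized learning and cross-client collaboration in federated multi-task learning, particularly when client data distributions are highly heterogeneous. The key to its solution is a communication-efficient mechanism: a feature anchor summarizes the features of each client's local classes and is shared with the server to reflect the local distribution. Clients also share their classification heads and apply graph-based regularization. Collaboration among clients is modeled as a dynamic graph, which a community detection method partitions into homogeneous communities that maximize within-community task similarity, ensuring beneficial knowledge transfer and preventing negative collaboration.

Link: https://arxiv.org/abs/2506.10102
Authors: Ahmed Elbakary,Chaouki Ben Issaid,Mehdi Bennis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: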

Abstract:We present a novel federated multi-task learning method that leverages cross-client similarity to enable personalized learning for each client. To avoid transmitting the entire model to the parameter server, we propose a communication-efficient scheme that introduces a feature anchor, a compact vector representation that summarizes the features learned from the client’s local classes. This feature anchor is shared with the server to account for local clients’ distribution. In addition, the clients share the classification heads, a lightweight linear layer, and perform a graph-based regularization to enable collaboration among clients. By modeling collaboration between clients as a dynamic graph and continuously updating and refining this graph, we can account for any drift from the clients. To ensure beneficial knowledge transfer and prevent negative collaboration, we leverage a community detection-based approach that partitions this dynamic graph into homogeneous communities, maximizing the sum of task similarities, represented as the graph edges’ weights, within each community. This mechanism restricts collaboration to highly similar clients within their formed communities, ensuring positive interaction and preserving personalization. Extensive experiments on two heterogeneous datasets demonstrate that our method significantly outperforms state-of-the-art baselines. Furthermore, we show that our method exhibits superior computation and communication efficiency and promotes fairness across clients.

[AI-71] Leveraging LLMs for Mission Planning in Precision Agriculture ICRA

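[Quick read]: This paper addresses how non-technical users in precision agriculture can efficiently assign complex data collection tasks to autonomous robots using natural language instructions. The key to its solution is an end-to-end system that uses large language models (LLMs) such as ChatGPT to convert natural language instructions into executable mission plans that conform to an IEEE task specification standard. The plans are executed via ROS2 nodes that integrate with existing ROS libraries, improving reusability and practicality.

Link: https://arxiv.org/abs/2506.10093
Authors: Marcos Abel Zuzuárregui,Stefano Carpin
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Published in Proceedings of 2025 International Conference on Robotics and Automation (ICRA)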

Abstract:Robotics and artificial intelligence hold significant potential for advancing precision agriculture. While robotic systems have been successfully deployed for various tasks, adapting them to perform diverse missions remains challenging, particularly because end users often lack technical expertise. In this paper, we present an end-to-end system that leverages large language models (LLMs), specifically ChatGPT, to enable users to assign complex data collection tasks to autonomous robots using natural language instructions. To enhance reusability, mission plans are encoded using an existing IEEE task specification standard, and are executed on robots via ROS2 nodes that bridge high-level mission descriptions with existing ROS libraries. Through extensive experiments, we highlight the strengths and limitations of LLMs in this context, particularly regarding spatial reasoning and solving complex routing challenges, and show how our proposed implementation overcomes them.

[AI-72] Textual Bayes: Quantifying Uncertainty in LLM-Based Systems

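[Quick read]: This paper addresses inaccurate uncertainty quantification for large language models (LLMs) in high-stakes domains, along with the challenges posed by closed-source, black-box models. LLM-based systems are also highly sensitive to prompts, which usually require substantial manual tuning. The key idea is to view LLM-based systems through a Bayesian lens, treating prompts as textual parameters of a statistical model and performing Bayesian inference over them with a small training dataset. This enables principled uncertainty quantification over both the textual parameters and the downstream predictions, while incorporating prior beliefs expressed as free-form text. To perform the inference, the authors introduce Metropolis-Hastings through LLM Proposals (MHLP), a new Markov chain Monte Carlo (MCMC) algorithm that combines standard MCMC methods with prompt optimization techniques and can be applied directly to existing LLM pipelines, including those that rely exclusively on closed-source models.

Link: https://arxiv.org/abs/2506.10060
Authors: Brendan Leigh Ross,Noël Vouitsis,Atiyeh Ashari Ghomi,Rasa Hosseinzadeh,Ji Xin,Zhaoyan Liu,Yi Sui,Shiyi Hou,Kin Kwan Leung,Gabriel Loaiza-Ganem,Jesse C. Cresswell
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: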

Abstract:Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem, which limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model’s textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference, a difficult problem even for well-studied data modalities, we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
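
The Metropolis-Hastings step that MHLP builds on can be sketched generically. The code below is the textbook accept/reject rule applied to a toy space of prompts, not the authors' algorithm: the prompt list, the posterior weights, and the uniform proposal are all hypothetical stand-ins. In MHLP the proposal would come from an LLM, which this sketch does not call.

```python
import math
import random

def acceptance_prob(logp_current, logp_candidate, logq_forward, logq_reverse):
    """Metropolis-Hastings acceptance probability in log space.

    logq_forward = log q(candidate | current); logq_reverse = log q(current | candidate).
    """
    log_ratio = logp_candidate - logp_current + logq_reverse - logq_forward
    return math.exp(min(0.0, log_ratio))

def mh_chain(initial, log_post, propose, logq, steps, seed=0):
    """Run a Metropolis-Hastings chain and return the visited states."""
    rng = random.Random(seed)
    current, samples = initial, []
    for _ in range(steps):
        candidate = propose(current, rng)
        accept = acceptance_prob(log_post(current), log_post(candidate),
                                 logq(candidate, current), logq(current, candidate))
        if rng.random() < accept:
            current = candidate
        samples.append(current)
    return samples

# Hypothetical prompt space and unnormalised posterior weights (illustration only).
PROMPTS = ["Answer briefly.", "Think step by step.", "Cite your sources."]
LOG_POST = {p: math.log(wt) for p, wt in zip(PROMPTS, (0.2, 0.5, 0.3))}

def propose(current, rng):
    """Uniform proposal over the other prompts (stand-in for an LLM proposal)."""
    return rng.choice([p for p in PROMPTS if p != current])

def logq(new, old):
    """Log-probability of proposing `new` from `old` under the uniform proposal."""
    return -math.log(len(PROMPTS) - 1)

samples = mh_chain(PROMPTS[0], LOG_POST.get, propose, logq, steps=1000)
```

The chain's visit frequencies approximate the posterior over prompts, so predictions can be averaged over sampled prompts rather than relying on a single tuned one. With an LLM proposal the forward and reverse proposal probabilities would generally differ, which is why they appear explicitly in `acceptance_prob`.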

[AI-73] Ambient Diffusion Omni: Training Good Models with Bad Data

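[Quick read]: This paper addresses how to use low-quality, synthetic, and out-of-distribution images to improve diffusion models. Diffusion models are usually trained on heavily filtered datasets, and the paper argues that the discarded low-quality images hold considerable value. The key to its solution is Ambient Diffusion Omni, a simple, principled framework that extracts signal from all available images during training by exploiting two properties of natural images: spectral power-law decay and locality. The authors validate the framework by training diffusion models on synthetically corrupted images, achieve state-of-the-art FID on ImageNet, and report significant improvements in image quality and diversity for text-to-image generation.

Link: https://arxiv.org/abs/2506.10038
Authors: Giannis Daras,Adrian Rodriguez-Munoz,Adam Klivans,Antonio Torralba,Constantinos Daskalakis
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, work in progress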

Abstract:We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from all available images during training. Our framework exploits two properties of natural images – spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We then use our framework to achieve state-of-the-art ImageNet FID, and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.

[AI-74] FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training

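[Quick read]: This paper addresses the slow inference, high memory usage, and poor deployability of text-to-image (T2I) generation models such as FLUX, which result from their massive parameter counts. Existing acceleration methods (such as single-step distillation and attention pruning) often degrade performance significantly and incur high training costs. The proposed solution is FastFLUX, whose key component is Block-wise Replacement with Linear Layers (BRLL), an architecture-level pruning method that replaces the structurally complex residual branches in ResBlocks with lightweight linear layers while keeping the original shortcut connections for stability. In addition, the Sandwich Training (ST) strategy uses LoRA to fine-tune neighboring blocks locally, mitigating the performance drop caused by structural replacement.

Link: https://arxiv.org/abs/2506.10035
Authors: Fuhan Cai,Yong Guo,Jie Li,Wenbo Li,Xiangzhong Fang,Jian Chen
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: 14 pages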

Abstract:Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned. Our code will be available soon.
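
The structural idea behind BRLL, keeping the shortcut and swapping only the residual branch for a linear map, can be shown on a scalar toy. The sketch below is an illustration under strong assumptions and not the FastFLUX procedure: the "branch" is a hypothetical scalar function, and the linear replacement is fitted in closed form by least squares on calibration inputs, whereas the paper trains its replacements with Sandwich Training and LoRA.

```python
import math

def fit_linear_replacement(branch, calibration_inputs):
    """Least-squares fit of a * x + b to the outputs of a residual branch."""
    outputs = [branch(x) for x in calibration_inputs]
    n = len(calibration_inputs)
    mean_x = sum(calibration_inputs) / n
    mean_y = sum(outputs) / n
    var_x = sum((x - mean_x) ** 2 for x in calibration_inputs)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(calibration_inputs, outputs))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

def original_block(x, branch):
    """Residual block: shortcut plus a (possibly expensive) residual branch."""
    return x + branch(x)

def pruned_block(x, a, b):
    """Same shortcut, with the branch replaced by a cheap linear map."""
    return x + (a * x + b)

# Hypothetical nonlinear branch and calibration inputs (illustration only).
calibration = [i / 10 for i in range(-10, 11)]
a, b = fit_linear_replacement(math.tanh, calibration)
worst_gap = max(abs(original_block(x, math.tanh) - pruned_block(x, a, b))
                for x in calibration)
```

Because the shortcut term `x` is untouched, the pruned block still passes its input through exactly, and only the correction added by the branch is approximated. That is one way to read the abstract's remark that preserving shortcut connections helps stability.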

[AI-75] Safeguarding Multimodal Knowledge Copyright in the RAG-as-a-Service Environment

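[Quick read]: This paper addresses copyright protection for image knowledge in multimodal retrieval-augmented generation (Multimodal RAG) systems. Existing watermarking techniques target only textual knowledge and leave image data unprotected. The key to its solution is the AQUA framework, which embeds semantic signals into synthetic images through two complementary methods, acronym-based triggers and spatial relationship cues, so that the watermark signal remains robust, stealthy, and reliable as it propagates indirectly from the image retriever to the text generator.

Link: https://arxiv.org/abs/2506.10030
Authors: Tianyu Chen,Jian Lou,Wenjie Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: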

Abstract:As Retrieval-Augmented Generation (RAG) evolves into service-oriented platforms (Rag-as-a-Service) with shared knowledge bases, protecting the copyright of contributed data becomes essential. Existing watermarking methods in RAG focus solely on textual knowledge, leaving image knowledge unprotected. In this work, we propose AQUA, the first watermark framework for image knowledge protection in Multimodal RAG systems. AQUA embeds semantic signals into synthetic images using two complementary methods: acronym-based triggers and spatial relationship cues. These techniques ensure watermark signals survive indirect watermark propagation from image retriever to textual generator, being efficient, effective and imperceptible. Experiments across diverse models and datasets show that AQUA enables robust, stealthy, and reliable copyright tracing, filling a key gap in multimodal RAG protection.

[AI-76] LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL2025

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在代码生成任务中对越狱攻击（jailbreak attacks）的脆弱性问题，即模型容易被精心设计的提示词诱导生成恶意代码。解决方案的关键在于提出MalwareBench，这是一个包含3,520个越狱提示的基准数据集，用于评估LLMs在面对此类威胁时的鲁棒性。该数据集基于320个手动设计的恶意代码生成需求，涵盖11种越狱方法和29种代码功能类别，从而系统地测试模型的安全防护能力。

Link: https://arxiv.org/abs/2506.10022
Authors: Haoyang Li,Huan Gao,Zhiyuan Zhao,Zhiyu Lin,Junyu Gao,Xuelong Li
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted as ACL 2025 main conference

Abstract:The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To fill this gap, we propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation, designed to evaluate LLM robustness against such threats. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model’s security capabilities: specifically, the average rejection rate for malicious content is 60.93%, dropping to 39.92% when combined with jailbreak attack algorithms. Our work highlights that the code security capabilities of LLMs still pose significant challenges.

[AI-77] From Tool Calling to Symbolic Thinking: LLMs in a Persistent Lisp Metaprogramming Loop

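[Quick Read]: This paper addresses how to integrate Large Language Models (LLMs) effectively with a persistent, interactive Lisp environment so that LLMs can define, invoke, and evolve their own tools. The key to its solution is embedding Lisp expressions in the generated output and intercepting them with a middleware layer, which enables stateful external memory, reflective programming, and dynamic tool creation.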

Link: https://arxiv.org/abs/2506.10021
Authors: Jordi de la Torre
Affiliation: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel architecture for integrating large language models (LLMs) with a persistent, interactive Lisp environment. This setup enables LLMs to define, invoke, and evolve their own tools through programmatic interaction with a live REPL. By embedding Lisp expressions within generation and intercepting them via a middleware layer, the system allows for stateful external memory, reflective programming, and dynamic tool creation. We present a design framework and architectural principles to guide future implementations of interactive AI systems that integrate symbolic programming with neural language generation.
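
The interception loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the `<lisp>...</lisp>` delimiter, the `MiniLisp` class, and the tiny evaluator (integers, `+`, `*`, `define`, `lambda`, dynamic scoping) are hypothetical stand-ins rather than the paper's design, and a real system would talk to a live Lisp REPL instead.

```python
import re


class MiniLisp:
    """Hypothetical stand-in for a persistent Lisp REPL (not the paper's system)."""

    def __init__(self):
        # The environment persists across calls, acting as stateful external memory.
        self.env = {"+": lambda *a: sum(a), "*": lambda a, b: a * b}

    def parse(self, src):
        tokens = src.replace("(", " ( ").replace(")", " ) ").split()
        return self._read(tokens)

    def _read(self, tokens):
        tok = tokens.pop(0)
        if tok == "(":
            items = []
            while tokens[0] != ")":
                items.append(self._read(tokens))
            tokens.pop(0)  # drop the closing ")"
            return items
        try:
            return int(tok)
        except ValueError:
            return tok  # anything else is treated as a symbol

    def eval(self, x):
        if isinstance(x, int):
            return x
        if isinstance(x, str):
            return self.env[x]
        if x[0] == "define":  # (define name expr) stores a value or tool
            self.env[x[1]] = self.eval(x[2])
            return x[1]
        if x[0] == "lambda":  # (lambda (params) body) creates a new tool
            params, body = x[1], x[2]
            return lambda *args: self._call(params, body, args)
        fn = self.eval(x[0])
        return fn(*[self.eval(arg) for arg in x[1:]])

    def _call(self, params, body, args):
        saved = dict(self.env)  # simple dynamic scoping for the sketch
        self.env.update(zip(params, args))
        try:
            return self.eval(body)
        finally:
            self.env = saved


LISP_BLOCK = re.compile(r"<lisp>(.*?)</lisp>", re.DOTALL)


def run_middleware(generated_text, repl):
    """Replace each embedded expression with the result of evaluating it."""
    def substitute(match):
        return str(repl.eval(repl.parse(match.group(1))))
    return LISP_BLOCK.sub(substitute, generated_text)


repl = MiniLisp()
first = run_middleware(
    "Defining a tool: <lisp>(define square (lambda (x) (* x x)))</lisp>", repl)
second = run_middleware("Using it later: <lisp>(square (+ 3 4))</lisp>", repl)
```

Because `repl` outlives each call to `run_middleware`, a tool defined in one generation step stays available in later steps, which is the persistence property the abstract emphasizes.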

[AI-78] Immersive Multimedia Communication: State-of-the-Art on eXtended Reality Streaming

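[Quick Read]: This paper addresses efficiency and performance optimization in Extended Reality (XR) streaming, particularly with respect to multimodal interaction and users' perceived experience. The key to its solution is visual attention-based optimization, which analyzes where users direct their visual attention in order to improve the transmission efficiency and system performance of XR streaming, and thereby the user experience.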

Link: https://arxiv.org/abs/2506.10004
Authors: Haopeng Wang,Haiwei Dong,Abdulmotaleb El Saddik
Affiliation: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Comments: accepted by ACM Transactions on Multimedia Computing, Communications, and Applications

Abstract:Extended reality (XR) is rapidly advancing, and poised to revolutionize content creation and consumption. In XR, users integrate various sensory inputs to form a cohesive perception of the virtual environment. This survey reviews the state-of-the-art in XR streaming, focusing on multiple paradigms. To begin, we define XR and introduce various XR headsets along with their multimodal interaction methods to provide a foundational understanding. We then analyze XR traffic characteristics to highlight the unique data transmission requirements. We also explore factors that influence the quality of experience in XR systems, aiming to identify key elements for enhancing user satisfaction. Following this, we present visual attention-based optimization methods for XR streaming to improve efficiency and performance. Finally, we examine current applications and highlight challenges to provide insights into ongoing and future developments of XR.

[AI-79] Semantic Communication-Enabled Cloud-Edge-End-collaborative Metaverse Services Architecure

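[Quick Read]: This paper addresses the problems that arise in the metaverse when large volumes of data, such as high-resolution virtual scenes, are transmitted between cloud platforms and VR devices: insufficient bandwidth, high transmission delay, and data errors caused by poor channel quality. The key to its solution is a semantic communication-enabled cloud-edge-end collaborative immersive metaverse service architecture (SC-CEE-Meta), which deploys semantic modules on VR devices and edge servers and transmits key semantic information instead of focusing on bit-level reconstruction, thereby reducing latency, easing the conflict between resources and bandwidth, and improving robustness to channel interference.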

Link: https://arxiv.org/abs/2506.10001
Authors: Yuxuan Li,Sheng Jinag,Bizhu Wang
Affiliation: unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2407.13764 by other authors

Abstract:With technology advancing and the pursuit of new audiovisual experiences strengthening, the metaverse has gained surging enthusiasm. However, it faces practical hurdles as substantial data like high-resolution virtual scenes must be transmitted between cloud platforms and VR devices. Specifically, the VR device’s wireless transmission hampered by insufficient bandwidth, causes speed and delay problems. Meanwhile, poor channel quality leads to data errors and worsens user experience. To solve this, we’ve proposed the Semantic Communication-Enabled Cloud-Edge-End Collaborative Immersive Metaverse Service (SC-CEE-Meta) Architecture, which includes three modules: VR video semantic transmission, video synthesis, and 3D virtual scene reconstruction. By deploying semantic modules on VR devices and edge servers and sending key semantic info instead of focusing on bit-level reconstruction, it can cut latency, resolve the resource-bandwidth conflict, and better withstand channel interference. Also, the cloud deploys video synthesis and 3D scene reconstruction preprocessing, while edge devices host 3D reconstruction rendering modules, all for immersive services. Verified on Meta Quest Pro, the SC-CEE-Meta can reduce wireless transmission delay by 96.05% and boost image quality by 43.99% under poor channel condition.

[AI-80] A multi-scale loss formulation for learning a probabilistic model with proper score optimisation

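[Quick Read]: This paper addresses the insufficient constraint on small-scale variability during the training of probabilistic machine-learned weather forecasting models, with the aim of improving forecast skill. The key to its solution is a multi-scale loss, which effectively constrains small-scale variability without harming forecast accuracy and opens a direction for future scale-aware model training.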

Link: https://arxiv.org/abs/2506.10868
Authors: Simon Lang,Martin Leutbecher,Pedro Maciel
Affiliation: unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:We assess the impact of a multi-scale loss formulation for training probabilistic machine-learned weather forecasting models. The multi-scale loss is tested in AIFS-CRPS, a machine-learned weather forecasting model developed at the European Centre for Medium-Range Weather Forecasts (ECMWF). AIFS-CRPS is trained by directly optimising the almost fair continuous ranked probability score (afCRPS). The multi-scale loss better constrains small scale variability without negatively impacting forecast skill. This opens up promising directions for future work in scale-aware model training.
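
For context, the kind of ensemble score that such training optimises can be sketched as below. This uses the standard fair CRPS estimator for a finite ensemble, not the afCRPS variant or the multi-scale loss from the paper, whose exact form the abstract does not give; the function name is illustrative.

```python
def fair_crps(members, observation):
    """Fair CRPS estimate for one scalar observation and an M-member ensemble."""
    m = len(members)
    if m < 2:
        raise ValueError("need at least two ensemble members")
    # Mean absolute error of the members against the observation.
    skill = sum(abs(x - observation) for x in members) / m
    # Spread term: sum of |x_i - x_j| over all ordered pairs, scaled by 1 / (2 M (M - 1)).
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * (m - 1))
    return skill - spread


score_spread = fair_crps([1.0, 3.0], 2.0)     # members straddle the observation
score_biased = fair_crps([2.0, 2.0], 5.0)     # collapsed, biased ensemble
```

The spread term rewards ensembles whose members differ from one another, so a collapsed ensemble far from the observation scores worse than a dispersed one centred on it.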

[AI-81] Learning Chaotic Dynamics with Neuromorphic Network Dynamics

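[Quick Read]: This paper addresses how a neuromorphic network can be used to learn and model dynamical systems, with computation performed through the physics of the underlying system. The key to its solution is a complex circuit composed of memristive elements that produce neuro-synaptic nonlinear responses to input electrical signals; the nonlinear dynamical response is optimized by adjusting only the input electrodes and voltages, which improves autonomous prediction of a multivariate chaotic time series. The study finds that the optimal nonlinear dynamical response arises when the input voltages drive the memristor model across its entire dynamical range as fully as possible, and that increasing the coverage of the input electrodes helps suppress other nonlinear responses that are less conducive to learning.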

Link: https://arxiv.org/abs/2506.10773
Authors: Yinhao Xu,Georg A. Gottwald,Zdenka Kuncic
Affiliation: unknown
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 37 pages, 22 figures

Abstract:This study investigates how dynamical systems may be learned and modelled with a neuromorphic network which is itself a dynamical system. The neuromorphic network used in this study is based on a complex electrical circuit comprised of memristive elements that produce neuro-synaptic nonlinear responses to input electrical signals. To determine how computation may be performed using the physics of the underlying system, the neuromorphic network was simulated and evaluated on autonomous prediction of a multivariate chaotic time series, implemented with a reservoir computing framework. Through manipulating only input electrodes and voltages, optimal nonlinear dynamical responses were found when input voltages maximise the number of memristive components whose internal dynamics explore the entire dynamical range of the memristor model. Increasing the network coverage with the input electrodes was found to suppress other nonlinear responses that are less conducive to learning. These results provide valuable insights into how a practical neuromorphic network device can be optimised for learning complex dynamical systems using only external control parameters.

[AI-82] RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding ACL

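[Quick Read]: This paper addresses low latency and high-quality synthesis in voice conversion, particularly for real-time applications. The key to its solution is RT-VC, a zero-shot real-time voice conversion system that uses an articulatory feature space to disentangle content and speaker characteristics naturally, improving the robustness and interpretability of voice conversion. It also integrates differentiable digital signal processing (DDSP) to vocode efficiently and directly from articulatory features, significantly reducing conversion latency.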

Link: https://arxiv.org/abs/2506.10289
Authors: Yisi Liu,Chenyang Wang,Hanjo Kim,Raniya Khan,Gopala Anumanchipalli
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: ACL Demo Track 2025

Abstract:Voice conversion has emerged as a pivotal technology in numerous applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that, while maintaining synthesis quality comparable to the current state-of-the-art (SOTA) method, RT-VC achieves a CPU latency of 61.4 ms, representing a 13.3% reduction in latency.

Machine Learning

[LG-0] Execution Guided Line-by-Line Code Generation

Link: https://arxiv.org/abs/2506.10948
Authors: Boaz Lavon,Shahar Katz,Lior Wolf
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance (EG-CFG), dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions. EG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions. Our experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming tasks. Our code is available at: this https URL
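
The guidance step can be illustrated with the usual classifier-free guidance interpolation of logits. This is a sketch only: the function names, the guidance strength `gamma`, and the toy three-token vocabulary are assumptions, and the abstract does not specify the exact formula EG-CFG uses.

```python
import math


def cfg_logits(uncond, cond, gamma):
    """Classifier-free guidance: move from unconditional toward conditional logits."""
    return [u + gamma * (c - u) for u, c in zip(uncond, cond)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a three-token vocabulary (made-up numbers).
plain = [2.0, 1.0, 0.0]          # prompt without execution feedback
with_feedback = [1.0, 2.0, 0.0]  # prompt that includes execution signals
guided = cfg_logits(plain, with_feedback, gamma=2.0)
probs = softmax(guided)
```

With `gamma = 0` the result equals the plain logits, with `gamma = 1` it equals the feedback-conditioned logits, and larger values extrapolate further in the direction the feedback suggests.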

[LG-1] Self-Adapting Language Models

Link: https://arxiv.org/abs/2506.10943
Authors: Adam Zweiger,Jyothish Pari,Han Guo,Ekin Akyürek,Yoon Kim,Pulkit Agrawal
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL), a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit, a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates. Through supervised finetuning (SFT), these self-edits result in persistent weight updates, enabling lasting adaptation. To train the model to produce effective self-edits, we use a reinforcement learning loop with the downstream performance of the updated model as the reward signal. Unlike prior approaches that rely on separate adaptation modules or auxiliary networks, SEAL directly uses the model’s own generation to control its adaptation process. Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation. Our website and code is available at this https URL.

[LG-2] Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction

Link: https://arxiv.org/abs/2506.10930
Authors: Thanathai Lertpetchpun,Tiantian Feng,Dani Byrd,Shrikanth Narayanan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Speech emotion recognition (SER) in naturalistic conditions presents a significant challenge for the speech processing community. Challenges include disagreement in labeling among annotators and imbalanced data distributions. This paper presents a reproducible framework that achieves superior (top 1) performance in the Emotion Recognition in Naturalistic Conditions Challenge (IS25-SER Challenge) - Task 2, evaluated on the MSP-Podcast dataset. Our system is designed to tackle the aforementioned challenges through multimodal learning, multi-task learning, and imbalanced data handling. Specifically, our best system is trained by adding text embeddings, predicting gender, and including "Other" (O) and "No Agreement" (X) samples in the training set. Our system’s results secured both first and second places in the IS25-SER Challenge, and the top performance was achieved by a simple two-system ensemble.

[LG-3] Sequential-Parallel Duality in Prefix Scannable Models

Link: https://arxiv.org/abs/2506.10918
Authors: Morris Yau,Sharut Gupta,Valerie Engelmayer,Kazuki Irie,Stefanie Jegelka,Jacob Andreas
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such "sequential-parallel duality." This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models, state space models, as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models, in some cases exhibiting better length generalization than either.
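
The prefix-scan view of a state space model can be made concrete with the linear recurrence h_t = a_t * h_{t-1} + b_t. The sketch below is illustrative only, with names chosen here rather than taken from the paper: each step is treated as an affine map, `combine` composes two such maps associatively, and a Hillis-Steele scan reproduces the sequential states in a logarithmic number of passes.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def prefix_scan(items, op):
    """Inclusive Hillis-Steele scan; every pass could run in parallel."""
    out = list(items)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def sequential_states(a, b, h0=0.0):
    """Reference implementation: run the recurrence one step at a time."""
    h, states = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


a = [0.5, 2.0, 1.0, 0.5]
b = [1.0, 0.0, 3.0, 2.0]
scanned = prefix_scan(list(zip(a, b)), combine)
parallel_states = [b_acc for _, b_acc in scanned]  # valid for h0 = 0
```

The scan is correct only because `combine` is associative; the abstract's more general PSM class relaxes exactly this requirement, which this sketch does not attempt to cover.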

[LG-4] Foundation Models for Causal Inference via Prior-Data Fitted Networks

Link: https://arxiv.org/abs/2506.10914
Authors: Yuchen Ma,Dennis Frauen,Emil Javurek,Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Prior-data fitted networks (PFNs) have recently been proposed as a promising way to train tabular foundation models. PFNs are transformers that are pre-trained on synthetic data generated from a prespecified prior distribution and that enable Bayesian inference through in-context learning. In this paper, we introduce CausalFM, a comprehensive framework for training PFN-based foundation models in various causal inference settings. First, we formalize the construction of Bayesian priors for causal inference based on structural causal models (SCMs) in a principled way and derive necessary criteria for the validity of such priors. Building on this, we propose a novel family of prior distributions using causality-inspired Bayesian neural networks that enable CausalFM to perform Bayesian causal inference in various settings, including back-door, front-door, and instrumental variable adjustment. Finally, we instantiate CausalFM and explicitly train a foundation model for estimating conditional average treatment effects (CATEs) using back-door adjustment. We show that CausalFM performs competitively for CATE estimation using various synthetic and semi-synthetic benchmarks. In sum, our framework can be used as a general recipe to train foundation models for various causal inference settings. In contrast to the current state-of-the-art in causal inference, CausalFM offers a novel paradigm with the potential to fundamentally change how practitioners perform causal inference in medicine, economics, and other disciplines.

[LG-5] NoLoCo: No-all-reduce Low Communication Training Method for Large Models

Link: https://arxiv.org/abs/2506.10911
Authors: Jari Kolehmainen,Nikolay Blagoev,John Donaghy,Oğuzhan Ersoy,Christopher Nies
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer by partially averaging model weights with a randomly selected other one. We provide both a theoretical convergence analysis for our proposed optimizer and empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low communication training method DiLoCo. The synchronization step itself is estimated to be one order of magnitude faster than the all-reduce used in DiLoCo for a few hundred accelerators training over the internet. Our method also has no global blocking communication, which reduces accelerator idling time. Compared to DiLoCo, we also observe up to 4% faster convergence rate across a wide range of model sizes and accelerator counts.
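
The idea of replacing an all-reduce with pairwise partial averaging can be sketched as below. This is a simplified gossip-style illustration, not the NoLoCo optimizer itself: it omits the Nesterov momentum variant entirely, and the function name, the mixing coefficient `alpha`, and the random pairing scheme are assumptions made for the example.

```python
import random


def gossip_round(replicas, alpha=0.25, rng=random):
    """Pair replicas at random; each member of a pair moves toward the other."""
    order = list(range(len(replicas)))
    rng.shuffle(order)
    # Pair up shuffled indices; with an odd count one replica sits out.
    for i, j in zip(order[0::2], order[1::2]):
        wi, wj = replicas[i], replicas[j]
        replicas[i] = [(1 - alpha) * x + alpha * y for x, y in zip(wi, wj)]
        replicas[j] = [(1 - alpha) * y + alpha * x for x, y in zip(wi, wj)]
    return replicas


def spread(replicas):
    values = [w[0] for w in replicas]
    return max(values) - min(values)


rng = random.Random(0)
replicas = [[float(k)] for k in range(8)]  # eight replicas, one weight each
mean_before = sum(w[0] for w in replicas) / len(replicas)
spread_before = spread(replicas)
for _ in range(20):
    gossip_round(replicas, alpha=0.25, rng=rng)
mean_after = sum(w[0] for w in replicas) / len(replicas)
spread_after = spread(replicas)
```

Each round needs only point-to-point exchanges between paired replicas, yet the global mean is preserved and the replicas drift toward one another, which is the implicit synchronization the abstract alludes to.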

[LG-6] Lattice Climber Attack: Adversarial attacks for randomized mixtures of classifiers ECML2025

Link: https://arxiv.org/abs/2506.10888
Authors: Lucas Gnecco-Heredia,Benjamin Negrevergne,Yann Chevaleyre
Categories: Machine Learning (cs.LG)
Comments: 17 pages including bibliography + 13 pages of supplementary material. Extended version of the article accepted at ECML 2025

Abstract:Finite mixtures of classifiers (a.k.a. randomized ensembles) have been proposed as a way to improve robustness against adversarial attacks. However, existing attacks have been shown to not suit this kind of classifier. In this paper, we discuss the problem of attacking a mixture in a principled way and introduce two desirable properties of attacks based on a geometrical analysis of the problem (effectiveness and maximality). We then show that existing attacks do not meet both of these properties. Finally, we introduce a new attack called lattice climber attack with theoretical guarantees in the binary linear setting, and demonstrate its performance by conducting experiments on synthetic and real datasets.

[LG-7] Viability of Future Actions: Robust Safety in Reinforcement Learning via Entropy Regularization ECML-PKDD2025

Link: https://arxiv.org/abs/2506.10871
Authors: Pierre-François Massiani,Alexander von Rohr,Lukas Haverbeck,Sebastian Trimpe
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 11 figures, 2 tables. Accepted for publication at ECML-PKDD 2025

Abstract:Despite the many recent advances in reinforcement learning (RL), the question of learning policies that robustly satisfy state constraints under unknown disturbances remains open. In this paper, we offer a new perspective on achieving robust safety by analyzing the interplay between two well-established techniques in model-free RL: entropy regularization, and constraints penalization. We reveal empirically that entropy regularization in constrained RL inherently biases learning toward maximizing the number of future viable actions, thereby promoting constraints satisfaction robust to action noise. Furthermore, we show that by relaxing strict safety constraints through penalties, the constrained RL problem can be approximated arbitrarily closely by an unconstrained one and thus solved using standard model-free RL. This reformulation preserves both safety and optimality while empirically improving resilience to disturbances. Our results indicate that the connection between entropy regularization and robustness is a promising avenue for further empirical and theoretical investigation, as it enables robust safety in RL through simple reward shaping.

[LG-8] Energy-Efficient Deep Learning for Traffic Classification on Microcontrollers

Link: https://arxiv.org/abs/2506.10851
Authors: Adel Chehade,Edoardo Ragusa,Paolo Gastaldo,Rodolfo Zunino
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: Accepted at IEEE ISCC 2025

Abstract:In this paper, we present a practical deep learning (DL) approach for energy-efficient traffic classification (TC) on resource-limited microcontrollers, which are widely used in IoT-based smart systems and communication networks. Our objective is to balance accuracy, computational efficiency, and real-world deployability. To that end, we develop a lightweight 1D-CNN, optimized via hardware-aware neural architecture search (HW-NAS), which achieves 96.59% accuracy on the ISCX VPN-NonVPN dataset with only 88.26K parameters, a 20.12K maximum tensor size, and 10.08M floating-point operations (FLOPs). Moreover, it generalizes across various TC tasks, with accuracies ranging from 94% to 99%. To enable deployment, the model is quantized to INT8, suffering only a marginal 1-2% accuracy drop relative to its Float32 counterpart. We evaluate real-world inference performance on two microcontrollers: the high-performance STM32F746G-DISCO and the cost-sensitive Nucleo-F401RE. The deployed model achieves inference latencies of 31.43ms and 115.40ms, with energy consumption of 7.86 mJ and 29.10 mJ per inference, respectively. These results demonstrate the feasibility of on-device encrypted traffic analysis, paving the way for scalable, low-power IoT security solutions.
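
As an illustration of the INT8 step, the sketch below applies symmetric per-tensor quantization to a short list of weights. The scheme, the helper names, and the sample values are assumptions; the abstract does not state which quantization scheme or toolchain the authors used.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [x * scale for x in q]


weights = [0.82, -1.27, 0.003, 0.5]  # made-up Float32-style weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the round-trip error stays within half a quantization step, which is consistent in spirit with the small accuracy drop the abstract reports.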

[LG-9] Advanced fraud detection using machine learning models: enhancing financial transaction security

Link: https://arxiv.org/abs/2506.10842
Authors: Nudrat Fariha,Md Nazmuddin Moin Khan,Md Iqbal Hossain,Syed Ali Reza,Joy Chakra Bortty,Kazi Sharmin Sultana,Md Shadidur Islam Jawad,Saniah Safat,Md Abdul Ahad,Maksuda Begum
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The rise of digital payments has accelerated the need for intelligent and scalable systems to detect fraud. This research presents an end-to-end, feature-rich machine learning framework for detecting credit card transaction anomalies and fraud using real-world data. The study begins by merging transactional, cardholder, merchant, and merchant category datasets from a relational database to create a unified analytical view. Through the feature engineering process, we extract behavioural signals such as average spending, deviation from historical patterns, transaction timing irregularities, and category frequency metrics. These features are enriched with temporal markers such as hour, day of week, and weekend indicators to expose all latent patterns that indicate fraudulent behaviours. Exploratory data analysis reveals contextual transaction trends across all the dataset features. Using the transactional data, we train and evaluate a range of unsupervised models: Isolation Forest, One Class SVM, and a deep autoencoder trained to reconstruct normal behavior. These models flag the top 1% of reconstruction errors as outliers. PCA visualizations illustrate each model's ability to separate anomalies in a two-dimensional latent space. We further segment the transaction landscape using K-Means clustering and DBSCAN to identify dense clusters of normal activity and isolate sparse, suspicious regions.
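
The rule of flagging the top 1% of errors amounts to keeping the highest-scoring fraction of transactions. The sketch below shows that thresholding step on synthetic scores; the function name and the data are hypothetical, and the score could come from any of the models named in the abstract (for example an autoencoder's reconstruction error).

```python
import random


def flag_top_fraction(scores, fraction=0.01):
    """Return the indices of the highest-scoring `fraction` of items."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


rng = random.Random(42)
# Synthetic anomaly scores: mostly small, with two planted outliers.
scores = [rng.gauss(0.1, 0.02) for _ in range(1000)]
scores[123] = 5.0
scores[777] = 4.0
flagged = flag_top_fraction(scores, fraction=0.01)
```

A fixed fraction always flags the same number of transactions regardless of how many are truly anomalous, so in practice the cut-off is a budget for manual review rather than an estimate of the fraud rate.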

[LG-10] Detecting High-Stakes Interactions with Activation Probes

Link: https://arxiv.org/abs/2506.10805
Authors: Alex McKenzie,Urja Pawar,Phil Blandfort,William Bankes,David Krueger,Ekdeep Singh Lubana,Dmitrii Krasheninnikov
Categories: Machine Learning (cs.LG)
Comments: 33 pages

Abstract:Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting “high-stakes” interactions – where the text indicates that the interaction might lead to significant harm – as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes’ performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.

[LG-11] Dense Associative Memory with Epanechnikov Energy

Link: https://arxiv.org/abs/2506.10801
Authors: Benjamin Hoover,Zhaoyang Shi,Krishnakumar Balasubramanian,Dmitry Krotov,Parikshit Ram
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional emergent local minima while preserving perfect pattern recovery, a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR’s emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method’s potential for both large-scale memory storage and generative tasks.

[LG-12] Monotone Classification with Relative Approximations

Link: https://arxiv.org/abs/2506.10775
Authors: Yufei Tao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In monotone classification, the input is a multi-set $P$ of points in $\mathbb{R}^d$, each associated with a hidden label from $\{-1, 1\}$. The goal is to identify a monotone function $h$, which acts as a classifier, mapping from $\mathbb{R}^d$ to $\{-1, 1\}$ with a small error, measured as the number of points $p \in P$ whose labels differ from the function values $h(p)$. The cost of an algorithm is defined as the number of points having their labels revealed. This article presents the first study on the lowest cost required to find a monotone classifier whose error is at most $(1 + \epsilon) \cdot k^*$ where $\epsilon \ge 0$ and $k^*$ is the minimum error achieved by an optimal monotone classifier; in other words, the error is allowed to exceed the optimal by at most a relative factor. Nearly matching upper and lower bounds are presented for the full range of $\epsilon$. All previous work on the problem can only achieve an error higher than the optimal by an absolute factor.
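
To make the setup concrete, the sketch below evaluates one simple monotone classifier on a small labeled point set and counts its error. The threshold classifier and the data are hypothetical examples; the paper's algorithms, which decide which labels to reveal, are not reproduced here.

```python
def dominates(p, q):
    """True if p is at least q in every coordinate."""
    return all(a >= b for a, b in zip(p, q))


def is_monotone_on(h, points):
    """Check that h(p) >= h(q) whenever p dominates q, over the given points."""
    return all(h(p) >= h(q) for p in points for q in points if dominates(p, q))


def error(h, labeled_points):
    """Number of points whose label differs from the classifier's prediction."""
    return sum(1 for p, label in labeled_points if h(p) != label)


def h(p):
    """Hypothetical classifier: predict 1 once the coordinate sum reaches 3."""
    return 1 if sum(p) >= 3 else -1


# Each entry is (point, label); in the paper's setting labels start hidden.
P = [((0, 0), -1), ((1, 1), -1), ((2, 1), 1), ((3, 3), 1), ((1, 2), -1)]
points = [p for p, _ in P]
is_monotone = is_monotone_on(h, points)
mistakes = error(h, P)
```

Here the classifier misclassifies only the point (1, 2); a bound of the form (1 + ε) · k* would compare this count with the best achievable error k* over all monotone classifiers.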

[LG-13] Skillful joint probabilistic weather forecasting from marginals

Link: https://arxiv.org/abs/2506.10772
Authors: Ferran Alet,Ilan Price,Andrew El-Kadi,Dominic Masters,Stratis Markou,Tom R. Andersson,Jacklynn Stott,Remi Lam,Matthew Willson,Alvaro Sanchez-Gonzalez,Peter Battaglia
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Machine learning (ML)-based weather models have rapidly risen to prominence due to their greater accuracy and speed than traditional forecasts based on numerical weather prediction (NWP), recently outperforming traditional ensembles in global probabilistic weather forecasting. This paper presents FGN, a simple, scalable and flexible modeling approach which significantly outperforms the current state-of-the-art models. FGN generates ensembles via learned model-perturbations with an ensemble of appropriately constrained models. It is trained directly to minimize the continuous rank probability score (CRPS) of per-location forecasts. It produces state-of-the-art ensemble forecasts as measured by a range of deterministic and probabilistic metrics, makes skillful ensemble tropical cyclone track predictions, and captures joint spatial structure despite being trained only on marginals.

[LG-14] Preserving Task-Relevant Information Under Linear Concept Removal

Link: https://arxiv.org/abs/2506.10703
Authors: Floris Holstege,Shauli Ravfogel,Bram Wouters
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Modern neural networks often encode unwanted concepts alongside task-relevant information, leading to fairness and interpretability concerns. Existing post-hoc approaches can remove undesired concepts but often degrade useful signals. We introduce SPLICE (Simultaneous Projection for LInear concept removal and Covariance prEservation), which eliminates sensitive concepts from representations while exactly preserving their covariance with a target label. SPLICE achieves this via an oblique projection that “splices out” the unwanted direction yet protects important label correlations. Theoretically, it is the unique solution that removes linear concept predictability and maintains target covariance with minimal embedding distortion. Empirically, SPLICE outperforms baselines on benchmarks such as Bias in Bios and Winobias, removing protected attributes while minimally damaging main-task information.

[LG-15] Structure and asymptotic preserving deep neural surrogates for uncertainty quantification in multiscale kinetic equations

Link: https://arxiv.org/abs/2506.10636
Authors: Wei Chen, Giacomo Dimarco, Lorenzo Pareschi
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:The high dimensionality of kinetic equations with stochastic parameters poses major computational challenges for uncertainty quantification (UQ). Traditional Monte Carlo (MC) sampling methods, while widely used, suffer from slow convergence and high variance, which become increasingly severe as the dimensionality of the parameter space grows. To accelerate MC sampling, we adopt a multiscale control variates strategy that leverages low-fidelity solutions from simplified kinetic models to reduce variance. To further improve sampling efficiency and preserve the underlying physics, we introduce surrogate models based on structure and asymptotic preserving neural networks (SAPNNs). These deep neural networks are specifically designed to satisfy key physical properties, including positivity, conservation laws, entropy dissipation, and asymptotic limits. By training the SAPNNs on low-fidelity models and enriching them with selected high-fidelity samples from the full Boltzmann equation, our method achieves significant variance reduction while maintaining physical consistency and asymptotic accuracy. The proposed methodology enables efficient large-scale prediction in kinetic UQ and is validated across both homogeneous and nonhomogeneous multiscale regimes. Numerical results demonstrate improved accuracy and computational efficiency compared to standard MC techniques.

[LG-16] Leveraging Low-rank Factorizations of Conditional Correlation Matrices in Graph Learning

Link: https://arxiv.org/abs/2506.10628
Authors: Thu Ha Phi, Alexandre Hippert-Ferrer, Florent Bouchard, Arnaud Breloy
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 11 pages, 5 figures

Abstract:This paper addresses the problem of learning an undirected graph from data gathered at each node. Within the graph signal processing framework, the topology of such a graph can be linked to the support of the conditional correlation matrix of the data. The corresponding graph learning problem then scales with the square of the number of variables (nodes), which is usually problematic in high dimensions. To tackle this issue, we propose a graph learning framework that leverages a low-rank factorization of the conditional correlation matrix. To solve the resulting optimization problems, we derive the tools required to apply Riemannian optimization techniques to this particular structure. The proposal is then particularized to a low-rank constrained counterpart of the GLasso algorithm, i.e., the penalized maximum likelihood estimation of a Gaussian graphical model. Experiments on synthetic and real data show that a very efficient dimension-versus-performance trade-off can be achieved with this approach.

[LG-17] Assessing the Resilience of Automotive Intrusion Detection Systems to Adversarial Manipulation

Link: https://arxiv.org/abs/2506.10620
Authors: Stefano Longari, Paolo Cerracchio, Michele Carminati, Stefano Zanero
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The security of modern vehicles has become increasingly important, with the controller area network (CAN) bus serving as a critical communication backbone for various Electronic Control Units (ECUs). The absence of robust security measures in CAN, coupled with the increasing connectivity of vehicles, makes them susceptible to cyberattacks. While intrusion detection systems (IDSs) have been developed to counter such threats, they are not foolproof. Adversarial attacks, particularly evasion attacks, can manipulate inputs to bypass detection by IDSs. This paper extends our previous work by investigating the feasibility and impact of gradient-based adversarial attacks performed with different degrees of knowledge against automotive IDSs. We consider three scenarios: white-box (attacker with full system knowledge), grey-box (partial system knowledge), and the more realistic black-box (no knowledge of the IDS’ internal workings or data). We evaluate the effectiveness of the proposed attacks against state-of-the-art IDSs on two publicly available datasets. Additionally, we study the effect of the adversarial perturbation on the attack impact and evaluate real-time feasibility by precomputing evasive payloads for timed injection based on bus traffic. Our results demonstrate that, besides attacks being challenging due to the automotive domain constraints, their effectiveness is strongly dependent on the dataset quality, the target IDS, and the attacker’s degree of knowledge.

[LG-18] Non-stationary Online Learning for Curved Losses: Improved Dynamic Regret via Mixability ICML2025

Link: https://arxiv.org/abs/2506.10616
Authors: Yu-Jie Zhang, Peng Zhao, Masashi Sugiyama
Categories: Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:Non-stationary online learning has drawn much attention in recent years. Despite considerable progress, dynamic regret minimization has primarily focused on convex functions, leaving the functions with stronger curvature (e.g., squared or logistic loss) underexplored. In this work, we address this gap by showing that the regret can be substantially improved by leveraging the concept of mixability, a property that generalizes exp-concavity to effectively capture loss curvature. Let $d$ denote the dimensionality and $P_T$ the path length of comparators that reflects the environmental non-stationarity. We demonstrate that an exponential-weight method with fixed-share updates achieves an $\mathcal{O}(d T^{1/3} P_T^{2/3} \log T)$ dynamic regret for mixable losses, improving upon the best-known $\mathcal{O}(d^{10/3} T^{1/3} P_T^{2/3} \log T)$ result (Baby and Wang, 2021) in $d$. More importantly, this improvement arises from a simple yet powerful analytical framework that exploits the mixability, which avoids the Karush-Kuhn-Tucker-based analysis required by existing work.
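As a rough illustration of what "exponential weights with fixed-share updates" means, here is a minimal sketch over a finite set of experts. This is the classical finite-expert form, not the paper's algorithm, which presumably works over a continuous comparator space. The function name `fixed_share_update` and the parameters `eta` (learning rate) and `alpha` (share rate) are assumptions made for illustration.

```python
import math


def fixed_share_update(weights, losses, eta, alpha):
    """One round of exponential weights followed by a fixed-share step.

    weights: current probability vector over experts (sums to 1)
    losses:  loss suffered by each expert this round
    eta:     learning rate of the exponential-weights step (assumed name)
    alpha:   fraction of mass redistributed uniformly (assumed name)
    """
    # Exponential-weights step: down-weight experts with large loss.
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    # Fixed-share step: mix with the uniform distribution so that every
    # expert keeps at least alpha / n mass and can recover after a shift.
    n = len(weights)
    return [(1.0 - alpha) * p + alpha / n for p in posterior]


if __name__ == "__main__":
    print(fixed_share_update([0.5, 0.5], [0.0, 1.0], eta=1.0, alpha=0.1))
```

The uniform mixing is what lets the method track a changing comparator: no expert's weight can decay to zero, so the learner can switch quickly when the environment shifts.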

[LG-19] Graph Neural Networks for Automatic Addition of Optimizing Components in Printed Circuit Board Schematics

Link: https://arxiv.org/abs/2506.10577
Authors: Pascal Plettenberg, André Alcalde, Bernhard Sick, Josephine M. Thomas
Categories: Machine Learning (cs.LG)

Abstract:The design and optimization of Printed Circuit Board (PCB) schematics is crucial for the development of high-quality electronic devices. Thereby, an important task is to optimize drafts by adding components that improve the robustness and reliability of the circuit, e.g., pull-up resistors or decoupling capacitors. Since there is a shortage of skilled engineers and manual optimizations are very time-consuming, these best practices are often neglected. However, this typically leads to higher costs for troubleshooting in later development stages as well as shortened product life cycles, resulting in an increased amount of electronic waste that is difficult to recycle. Here, we present an approach for automating the addition of new components into PCB schematics by representing them as bipartite graphs and utilizing a node pair prediction model based on Graph Neural Networks (GNNs). We apply our approach to three highly relevant PCB design optimization tasks and compare the performance of several popular GNN architectures on real-world datasets labeled by human experts. We show that GNNs can solve these problems with high accuracy and demonstrate that our approach offers the potential to automate PCB design optimizations in a time- and cost-efficient manner.

[LG-20] Data-driven Day Ahead Market Prices Forecasting: A Focus on Short Training Set Windows

Link: https://arxiv.org/abs/2506.10536
Authors: Vasilis Michalakopoulos, Christoforos Menos-Aikateriniadis, Elissaios Sarmas, Antonis Zakynthinos, Pavlos S. Georgilakis, Dimitris Askounis
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 10 figures

Abstract:This study investigates the performance of machine learning models in forecasting electricity Day-Ahead Market (DAM) prices using short historical training windows, with a focus on detecting seasonal trends and price spikes. We evaluate four models, namely LSTM with Feed Forward Error Correction (FFEC), XGBoost, LightGBM, and CatBoost, across three European energy markets (Greece, Belgium, Ireland) using feature sets derived from ENTSO-E forecast data. Training window lengths range from 7 to 90 days, allowing assessment of model adaptability under constrained data availability. Results indicate that LightGBM consistently achieves the highest forecasting accuracy and robustness, particularly with 45 and 60 day training windows, which balance temporal relevance and learning depth. Furthermore, LightGBM demonstrates superior detection of seasonal effects and peak price events compared to LSTM and other boosting models. These findings suggest that short-window training approaches, combined with boosting methods, can effectively support DAM forecasting in volatile, data-scarce environments.

[LG-21] Equivariant Neural Diffusion for Molecule Generation NEURIPS2024

Link: https://arxiv.org/abs/2506.10532
Authors: François Cornet, Grigory Bartosh, Mikkel N. Schmidt, Christian A. Naesseth
Categories: Machine Learning (cs.LG)
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:We introduce Equivariant Neural Diffusion (END), a novel diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Compared to current state-of-the-art equivariant diffusion models, the key innovation in END lies in its learnable forward process for enhanced generative modelling. Rather than pre-specified, the forward process is parameterized through a time- and data-dependent transformation that is equivariant to rigid transformations. Through a series of experiments on standard molecule generation benchmarks, we demonstrate the competitive performance of END compared to several strong baselines for both unconditional and conditional generation.

[LG-22] Macro Graph of Experts for Billion-Scale Multi-Task Recommendation

Link: https://arxiv.org/abs/2506.10520
Authors: Hongyu Yao, Zijin Hong, Hao Chen, Yuanchen Bei, Zhiqing Li, Qijie Shen, Zuobin Ying, Huan Gong, Feiran Huang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Experts (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for the homepage of a leading billion-scale recommender system. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems.

[LG-23] A Crack in the Bark: Leveraging Public Knowledge to Remove Tree-Ring Watermarks USENIX-SECURITY

Link: https://arxiv.org/abs/2506.10502
Authors: Junhua Lin, Marc Juarez (University of Edinburgh)
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 18 pages, to be published in the 34th USENIX Security Symposium

Abstract:We present a novel attack specifically designed against Tree-Ring, a watermarking technique for diffusion models known for its high imperceptibility and robustness against removal attacks. Unlike previous removal attacks, which rely on strong assumptions about attacker capabilities, our attack only requires access to the variational autoencoder that was used to train the target diffusion model, a component that is often publicly available. By leveraging this variational autoencoder, the attacker can approximate the model’s intermediate latent space, enabling more effective surrogate-based attacks. Our evaluation shows that this approach leads to a dramatic reduction in the AUC of Tree-Ring detector’s ROC and PR curves, decreasing from 0.993 to 0.153 and from 0.994 to 0.385, respectively, while maintaining high image quality. Notably, our attacks outperform existing methods that assume full access to the diffusion model. These findings highlight the risk of reusing public autoencoders to train diffusion models – a threat not considered by current industry practices. Furthermore, the results suggest that the Tree-Ring detector’s precision, a metric that has been overlooked by previous evaluations, falls short of the requirements for real-world deployment.

[LG-24] BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

Link: https://arxiv.org/abs/2506.10501
Authors: Surya Jasper, Minh Luu, Evan Pan, Aakash Tyagi, Michael Quinn, Jiang Hu, David Kebo Houngninou
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen-generated datasets were employed to train ML-based failure triage models, we achieved high classification accuracy (88.1%-93.2%) across different IP blocks, confirming the practical utility and realism of generated bugs. BugGen thus provides a scalable solution for generating high-quality bug datasets, significantly enhancing verification efficiency and ML-assisted debugging.

[LG-25] SHORE: A Long-term User Lifetime Value Prediction Model in Digital Games

Link: https://arxiv.org/abs/2506.10487
Authors: Shuaiqi Sun, Congde Yuan, Haoqiang Yang, Mengzhuo Guo, Guiying Wei, Jiangbo Tian
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages

Abstract:In digital gaming, long-term user lifetime value (LTV) prediction is essential for monetization strategy, yet presents major challenges due to delayed payment behavior, sparse early user data, and the presence of high-value outliers. While existing models typically rely on either short-cycle observations or strong distributional assumptions, such approaches often underestimate long-term value or suffer from poor robustness. To address these issues, we propose SHort-cycle auxiliary with Order-preserving REgression (SHORE), a novel LTV prediction framework that integrates short-horizon predictions (e.g., LTV-15 and LTV-30) as auxiliary tasks to enhance long-cycle targets (e.g., LTV-60). SHORE also introduces a hybrid loss function combining order-preserving multi-class classification and a dynamic Huber loss to mitigate the influence of zero-inflation and outlier payment behavior. Extensive offline and online experiments on real-world datasets demonstrate that SHORE significantly outperforms existing baselines, achieving a 47.91% relative reduction in prediction error in online deployment. These results highlight SHORE’s practical effectiveness and robustness in industrial-scale LTV prediction for digital games.
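To make the outlier argument concrete, here is a minimal sketch of the plain Huber loss with a fixed threshold. The abstract does not say how SHORE's "dynamic" variant chooses or adapts the threshold, so the fixed `delta` parameter and the function name `huber_loss` are assumptions for illustration only.

```python
def huber_loss(residual, delta):
    """Plain Huber loss for one prediction error.

    Quadratic for small errors (|residual| <= delta) and linear beyond,
    so a few very large payments do not dominate the training signal.
    `delta` is a fixed threshold here; a dynamic variant would adapt it.
    """
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


if __name__ == "__main__":
    # A small error is penalized quadratically, a large one only linearly.
    print(huber_loss(0.5, 1.0), huber_loss(3.0, 1.0))
```

With `delta = 1.0`, an error of 3.0 costs 2.5 rather than the 4.5 a squared-error loss would assign, which is the damping effect that makes the loss robust to high-value outliers.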

[LG-26] MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Link: https://arxiv.org/abs/2506.10443
Authors: Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 5 figures. Published in the Proceedings of the 6th ACM International Conference on Multimedia in Asia Workshops (MMAsia '24 Workshops). The final authenticated version is available at this https URL

Abstract:Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference include memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6x speed increase compared to current mainstream LLM-specific frameworks.

[LG-27] System Identification Using Kolmogorov-Arnold Networks: A Case Study on Buck Converters

Link: https://arxiv.org/abs/2506.10434
Authors: Nart Gashi, Panagiotis Kakosimos, George Papafotiou
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Kolmogorov-Arnold Networks (KANs) are emerging as a powerful framework for interpretable and efficient system identification in dynamic systems. By leveraging the Kolmogorov-Arnold representation theorem, KANs enable function approximation through learnable activation functions, offering improved scalability, accuracy, and interpretability compared to traditional neural networks. This paper investigates the application of KANs to model and analyze the dynamics of a buck converter system, focusing on state-space parameter estimation along with discovering the system equations. Using simulation data, the methodology involves approximating state derivatives with KANs, constructing interpretable state-space representations, and validating these models through numerical experiments. The results demonstrate the ability of KANs to accurately identify system dynamics, verify model consistency, and detect parameter changes, providing valuable insights into their applicability for system identification in modern industrial systems.

[LG-28] Data-Driven Soil Organic Carbon Sampling: Integrating Spectral Clustering with Conditioned Latin Hypercube Optimization

Link: https://arxiv.org/abs/2506.10419
Authors: Weiying Zhao, Aleksei Unagaev, Natalia Efremova
Categories: Machine Learning (cs.LG)

Abstract:Soil organic carbon (SOC) monitoring often relies on selecting representative field sampling locations based on environmental covariates. We propose a novel hybrid methodology that integrates spectral clustering, an unsupervised machine learning technique, with conditioned Latin hypercube sampling (cLHS) to enhance the representativeness of SOC sampling. In our approach, spectral clustering partitions the study area into K homogeneous zones using multivariate covariate data, and cLHS is then applied within each zone to select sampling locations that collectively capture the full diversity of environmental conditions. This hybrid spectral-cLHS method ensures that even minor but important environmental clusters are sampled, addressing a key limitation of vanilla cLHS which can overlook such areas. We demonstrate on a real SOC mapping dataset that spectral-cLHS provides more uniform coverage of covariate feature space and spatial heterogeneity than standard cLHS. This improved sampling design has the potential to yield more accurate SOC predictions by providing better-balanced training data for machine learning models.

[LG-29] Generative Algorithms for Wildfire Progression Reconstruction from Multi-Modal Satellite Active Fire Measurements and Terrain Height

Link: https://arxiv.org/abs/2506.10404
Authors: Bryan Shaddy, Brianna Binder, Agnimitra Dasgupta, Haitong Qin, James Haley, Angel Farguell, Kyle Hilburn, Derek V. Mallia, Adam Kochanski, Jan Mandel, Assad Oberai
Categories: Machine Learning (cs.LG)

Abstract:Increasing wildfire occurrence has spurred growing interest in wildfire spread prediction. However, even the most complex wildfire models diverge from observed progression during multi-day simulations, motivating the need for data assimilation. A useful approach to assimilating measurement data into complex coupled atmosphere-wildfire models is to estimate wildfire progression from measurements and use this progression to develop a matching atmospheric state. In this study, an approach is developed for estimating fire progression from VIIRS active fire measurements, GOES-derived ignition times, and terrain height data. A conditional Generative Adversarial Network is trained with simulations of historic wildfires from the atmosphere-wildfire model WRF-SFIRE, thus allowing incorporation of WRF-SFIRE physics into estimates. Fire progression is succinctly represented by fire arrival time, and measurements for training are obtained by applying an approximate observation operator to WRF-SFIRE solutions, eliminating the need for satellite data during training. The model is trained on tuples of fire arrival times, measurements, and terrain, and once trained leverages measurements of real fires and corresponding terrain data to generate samples of fire arrival times. The approach is validated on five Pacific US wildfires, with results compared against high-resolution perimeters measured via aircraft, finding an average Sorensen-Dice coefficient of 0.81. The influence of terrain height on the arrival time inference is also evaluated and it is observed that terrain has minimal influence when the inference is conditioned on satellite measurements.
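For context on the reported 0.81 score, here is a minimal sketch of the Sorensen-Dice coefficient computed on two burned-area footprints, each given as a set of grid-cell coordinates. The set-of-cells representation and the function name `dice_coefficient` are assumptions for illustration; the paper's exact evaluation procedure against aircraft perimeters is not described in the abstract.

```python
def dice_coefficient(cells_a, cells_b):
    """Sorensen-Dice overlap between two sets of burned grid cells.

    Returns 2 * |A ∩ B| / (|A| + |B|): 1.0 for identical footprints,
    0.0 for disjoint ones. Two empty footprints are treated as a
    perfect match by convention.
    """
    set_a, set_b = set(cells_a), set(cells_b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


if __name__ == "__main__":
    predicted = {(0, 0), (0, 1)}
    observed = {(0, 1), (1, 1)}
    print(dice_coefficient(predicted, observed))
```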

[LG-30] EQA-RM: A Generative Embodied Reward Model with Test-time Scaling

Link: https://arxiv.org/abs/2506.10389
Authors: Yuhang Chen, Zhen Tan, Tianlong Chen
Categories: Machine Learning (cs.LG)
Comments: preprint

Abstract:Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents’ spatial, temporal, and logical understanding is critical yet not considered by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9% accuracy on EQA-RM-Bench with only 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM. The code and dataset can be found here this https URL.

[LG-31] Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success

Link: https://arxiv.org/abs/2506.10359
Authors: Che Wang, Jeroen van Baar, Chaitanya Mitash, Shuai Li, Dylan Randle, Weiyao Wang, Sumedh Sontakke, Kostas E. Bekris, Kapil Katyal
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted to Robotics: Science and Systems (RSS 2025), 15 pages

Abstract:This work demonstrates how autonomously learning aspects of robotic operation from sparsely-labeled, real-world data of deployed, engineered solutions at industrial scale can provide solutions that achieve improved performance. Specifically, it focuses on multi-suction robot picking and performs a comprehensive study on the application of multi-modal visual encoders for predicting the success of candidate robotic picks. Picking diverse items from unstructured piles is an important and challenging task for robot manipulation in real-world settings, such as warehouses. Methods for picking from clutter must work for an open set of items while simultaneously meeting latency constraints to achieve high throughput. The demonstrated approach utilizes multiple input modalities, such as RGB, depth and semantic segmentation, to estimate the quality of candidate multi-suction picks. The strategy is trained from real-world item picking data, with a combination of multimodal pretraining and finetuning. The manuscript provides comprehensive experimental evaluation performed over a large item-picking dataset, an item-picking dataset targeted to include partial occlusions, and a package-picking dataset, which focuses on containers, such as boxes and envelopes, instead of unpackaged items. The evaluation measures performance for different item configurations, pick scenes, and object types. Ablations help to understand the effects of in-domain pretraining, the impact of different modalities and the importance of finetuning. These ablations reveal both the importance of training over multiple modalities and the ability of models to learn, during pretraining, the relationship between modalities so that during finetuning and inference only a subset of them can be used as input.

[LG-32] TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree ICML2025

Link: https://arxiv.org/abs/2506.10355
Authors: Yu-Yang Qian, Yuan-Ze Xu, Zhen-Yu Zhang, Peng Zhao, Zhi-Hua Zhou
Categories: Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. Nowadays, with the flourish of large pre-trained models (LPMs), efficiency has become increasingly critical for CL, due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks.

[LG-33] History-Aware Neural Operator: Robust Data-Driven Constitutive Modeling of Path-Dependent Materials

Link: https://arxiv.org/abs/2506.10352
Authors: Binyao Guo, Zihan Lin, QiZhi He
Categories: Machine Learning (cs.LG)

Abstract:This study presents an end-to-end learning framework for data-driven modeling of path-dependent inelastic materials using neural operators. The framework is built on the premise that irreversible evolution of material responses, governed by hidden dynamics, can be inferred from observable data. We develop the History-Aware Neural Operator (HANO), an autoregressive model that predicts path-dependent material responses from short segments of recent strain-stress history without relying on hidden state variables, thereby overcoming self-consistency issues commonly encountered in recurrent neural network (RNN)-based models. Built on a Fourier-based neural operator backbone, HANO enables discretization-invariant learning. To enhance its ability to capture both global loading patterns and critical local path dependencies, we embed a hierarchical self-attention mechanism that facilitates multiscale feature extraction. Beyond ensuring self-consistency, HANO mitigates sensitivity to initial hidden states, a commonly overlooked issue that can lead to instability in recurrent models when applied to generalized loading paths. By modeling stress-strain evolution as a continuous operator rather than relying on fixed input-output mappings, HANO naturally accommodates varying path discretizations and exhibits robust performance under complex conditions, including irregular sampling, multi-cycle loading, noisy data, and pre-stressed states. We evaluate HANO on two benchmark problems: elastoplasticity with hardening and progressive anisotropic damage in brittle solids. Results show that HANO consistently outperforms baseline models in predictive accuracy, generalization, and robustness. With its demonstrated capabilities, HANO provides an effective data-driven surrogate for simulating inelastic materials and is well-suited for integration with classical numerical solvers. 

[LG-34] LightKG: Efficient Knowledge-Aware Recommendations with Simplified GNN Architecture KDD

Link: https://arxiv.org/abs/2506.10347
Authors: Yanhui Li, Dongxia Wang, Zhu Sun, Haonan Zhang, Huizhong Guo
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Abstract:Recently, Graph Neural Networks (GNNs) have become the dominant approach for Knowledge Graph-aware Recommender Systems (KGRSs) due to their proven effectiveness. Building upon GNN-based KGRSs, Self-Supervised Learning (SSL) has been incorporated to address the sparsity issue, leading to longer training time. However, through extensive experiments, we reveal that: (1) Compared to other KGRSs, the existing GNN-based KGRSs fail to keep their superior performance under sparse interactions even with SSL. (2) More complex models tend to perform worse in sparse interaction scenarios, and complex mechanisms, like the attention mechanism, can be detrimental as they often increase learning difficulty. Inspired by these findings, we propose LightKG, a simple yet powerful GNN-based KGRS to address sparsity issues. LightKG includes a simplified GNN layer that encodes directed relations as scalar pairs rather than dense embeddings and employs a linear aggregation framework, greatly reducing the complexity of GNNs. Additionally, LightKG incorporates an efficient contrastive layer to implement SSL. It directly minimizes the node similarity in the original graph, avoiding the time-consuming subgraph generation and comparison required in previous SSL methods. Experiments on four benchmark datasets show that LightKG outperforms 12 competitive KGRSs in both sparse and dense scenarios while significantly reducing training time. Specifically, it surpasses the best baselines by an average of 5.8% in recommendation accuracy and saves 84.3% of training time compared to KGRSs with SSL. Our code is available at this https URL.

[LG-35] Technical Report with Proofs for A Full Picture in Conformance Checking: Efficiently Summarizing All Optimal Alignments

Link: https://arxiv.org/abs/2506.10345
Authors: Philipp Bär,Moe T. Wynn,Sander J. J. Leemans
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:This technical report provides proofs for the claims in the paper “A Full Picture in Conformance Checking: Efficiently Summarizing All Optimal Alignments”.

[LG-36] Air in Your Neighborhood: Fine-Grained AQI Forecasting Using Mobile Sensor Data

Link: https://arxiv.org/abs/2506.10332
Authors: Aaryam Sharma
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 7 figures. Code available at this https URL

Abstract:Air pollution has become a significant health risk in developing countries. While governments routinely publish air-quality index (AQI) data to track pollution, these values fail to capture the local reality, as sensors are often very sparse. In this paper, we address this gap by predicting AQI in 1 km^2 neighborhoods, using the AirDelhi dataset as an example. Using spatio-temporal GNNs, we surpass existing works by 71.654 MSE, a 79% reduction, even on unseen coordinates. New insights about AQI, such as the existence of strong repetitive short-term patterns and changing spatial relations, are also discovered. The code is available on GitHub.

[LG-37] PyLO: Towards Accessible Learned Optimizers in PyTorch ICML

Link: https://arxiv.org/abs/2506.10315
Authors: Paul Janson,Benjamin Therien,Quentin Anthony,Xiaolong Huang,Abhinav Moudgil,Eugene Belilovsky
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICML CODEML Workshop 2025

Abstract:Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances – such as VeLO, which was meta-trained for 4000 TPU-months – remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups – from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at this https URL

[LG-38] Collaborative Min-Max Regret in Grouped Multi-Armed Bandits

Link: https://arxiv.org/abs/2506.10313
Authors: Moïse Blanchard,Vineet Goyal
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We study the impact of sharing exploration in multi-armed bandits in a grouped setting where a set of groups have overlapping feasible action sets [Baek and Farias '24]. In this grouped bandit setting, groups share reward observations, and the objective is to minimize the collaborative regret, defined as the maximum regret across groups. This naturally captures applications in which one aims to balance the exploration burden between groups or populations – it is known that standard algorithms can lead to significantly imbalanced exploration cost between groups. We address this problem by introducing an algorithm Col-UCB that dynamically coordinates exploration across groups. We show that Col-UCB achieves both optimal minimax and instance-dependent collaborative regret up to logarithmic factors. These bounds are adaptive to the structure of shared action sets between groups, providing insights into when collaboration yields significant benefits over each group learning their best action independently.
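
The collaborative regret defined in the abstract (the maximum regret across groups) can be illustrated with a minimal sketch. The function name, the dictionary layout, and the toy numbers below are assumptions made for illustration only; this computes the objective in a simulation where true arm means are known, and it is not the Col-UCB algorithm itself.

```python
def collaborative_regret(arm_means, feasible, pulls):
    """Return (per-group regret, maximum regret across groups).

    arm_means: dict arm -> true mean reward (known only in simulation)
    feasible:  dict group -> set of arms the group is allowed to pull
    pulls:     dict group -> dict arm -> number of pulls made by that group
    """
    regrets = {}
    for group, arms in feasible.items():
        # Each group is compared against the best arm in its own feasible set.
        best = max(arm_means[a] for a in arms)
        regrets[group] = sum(
            n * (best - arm_means[a]) for a, n in pulls.get(group, {}).items()
        )
    return regrets, max(regrets.values())


# Hypothetical toy instance: two groups sharing arm "b".
arm_means = {"a": 0.9, "b": 0.5, "c": 0.2}
feasible = {"g1": {"a", "b"}, "g2": {"b", "c"}}
pulls = {"g1": {"a": 8, "b": 2}, "g2": {"b": 6, "c": 4}}
regrets, worst = collaborative_regret(arm_means, feasible, pulls)
print(regrets, worst)
```

In this toy instance the second group carries the larger share of the exploration cost, which is the kind of imbalance the min-max objective penalizes.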

[LG-39] Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

Link: https://arxiv.org/abs/2506.10282
Authors: Jiajin Liu,Dongzhe Fan,Jiacheng Shen,Chuanhao Ji,Daochen Zha,Qiaoyu Tan
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 4 figures

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.

[LG-40] Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification ICML2025

Link: https://arxiv.org/abs/2506.10269
Authors: Ryota Ueda,Takami Sato,Ken Kobayashi,Kazuhide Nakata
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 17 pages, 2 figures. Version revised after ICML 2025 reviews

Abstract:Semidefinite programming (SDP) relaxation has emerged as a promising approach for neural network verification, offering tighter bounds than other convex relaxation methods for deep neural networks (DNNs) with ReLU activations. However, we identify a critical limitation in the SDP relaxation when applied to deep networks: interior-point vanishing, which leads to the loss of strict feasibility – a crucial condition for the numerical stability and optimality of SDP. Through rigorous theoretical and empirical analysis, we demonstrate that as the depth of DNNs increases, the strict feasibility is likely to be lost, creating a fundamental barrier to scaling SDP-based verification. To address the interior-point vanishing, we design and investigate five solutions to enhance the feasibility conditions of the verification problem. Our methods can successfully solve 88% of the problems that could not be solved by existing methods, accounting for 41% of the total. Our analysis also reveals that the valid constraints for the lower and upper bounds for each ReLU unit are traditionally inherited from prior work without solid reasons, but are actually not only unbeneficial but also even harmful to the problem’s feasibility. This work provides valuable insights into the fundamental challenges of SDP-based DNN verification and offers practical solutions to improve its applicability to deeper neural networks, contributing to the development of more reliable and secure systems with DNNs.

[LG-41] Meta-learning Representations for Learning from Multiple Annotators

Link: https://arxiv.org/abs/2506.10259
Authors: Atsutoshi Kumagai,Tomoharu Iwata,Taishi Nishiyama,Yasutoshi Ida,Yasuhiro Fujiwara
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages

Abstract:We propose a meta-learning method for learning from multiple noisy annotators. In many applications such as crowdsourcing services, labels for supervised learning are given by multiple annotators. Since the annotators have different skills or biases, given labels can be noisy. To learn accurate classifiers, existing methods require many noisy annotated data. However, sufficient data might be unavailable in practice. To overcome the lack of data, the proposed method uses labeled data obtained in different but related tasks. The proposed method embeds each example in tasks to a latent space by using a neural network and constructs a probabilistic model for learning a task-specific classifier while estimating annotators’ abilities on the latent space. This neural network is meta-learned to improve the expected test classification performance when the classifier is adapted to a given small amount of annotated data. This classifier adaptation is performed by maximizing the posterior probability via the expectation-maximization (EM) algorithm. Since each step in the EM algorithm is easily computed as a closed-form and is differentiable, the proposed method can efficiently backpropagate the loss through the EM algorithm to meta-learn the neural network. We show the effectiveness of our method with real-world datasets with synthetic noise and real-world crowdsourcing datasets.

[LG-42] A new type of federated clustering: A non-model-sharing approach

Link: https://arxiv.org/abs/2506.10244
Authors: Yuji Kawamata,Kaoru Kamijo,Maki Kihira,Akihiro Toyoda,Tomoru Nakayama,Akira Imakura,Tetsuya Sakurai,Yukihiko Okada
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:In recent years, the growing need to leverage sensitive data across institutions has led to increased attention on federated learning (FL), a decentralized machine learning paradigm that enables model training without sharing raw data. However, existing FL-based clustering methods, known as federated clustering, typically assume simple data partitioning scenarios such as horizontal or vertical splits, and cannot handle more complex distributed structures. This study proposes data collaboration clustering (DC-Clustering), a novel federated clustering method that supports clustering over complex data partitioning scenarios where horizontal and vertical splits coexist. In DC-Clustering, each institution shares only intermediate representations instead of raw data, ensuring privacy preservation while enabling collaborative clustering. The method allows flexible selection between k-means and spectral clustering, and achieves final results with a single round of communication with the central server. We conducted extensive experiments using synthetic and open benchmark datasets. The results show that our method achieves clustering performance comparable to centralized clustering where all data are pooled. DC-Clustering addresses an important gap in current FL research by enabling effective knowledge discovery from distributed heterogeneous data. Its practical properties – privacy preservation, communication efficiency, and flexibility – make it a promising tool for privacy-sensitive domains such as healthcare and finance.

[LG-43] AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent ICML2025

Link: https://arxiv.org/abs/2506.10205
Authors: Jing Liu,Toshiaki Koike-Akino,Ye Wang,Hassan Mansour,Matthew Brand
Subjects: Machine Learning (cs.LG)
Comments: ICML 2025 workshop on Efficient Systems for Foundation Models

Abstract:To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
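
To make the connection between activation-aware pruning and Iterative Hard Thresholding concrete, here is a minimal sketch for a single output neuron. It assumes the layer-wise objective is 0.5 * ||X w - X w_dense||^2 subject to at most k nonzero weights, where X holds calibration activations; the function names, the conservative step size, and the toy data are assumptions for illustration and are not taken from the AWP paper.

```python
def matvec(X, w):
    """Multiply a row-major matrix X (list of rows) by a vector w."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]


def loss(X, w, w_dense):
    """Activation-aware reconstruction error 0.5 * ||X (w - w_dense)||^2."""
    diff = [a - b for a, b in zip(w, w_dense)]
    return 0.5 * sum(r * r for r in matvec(X, diff))


def hard_threshold(w, k):
    """Project onto k-sparse vectors: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k])
    return [wj if j in keep else 0.0 for j, wj in enumerate(w)]


def iht_prune(X, w_dense, k, steps=200):
    """Projected gradient descent (IHT) started from magnitude pruning."""
    # 1 / ||X||_F^2 is a conservative step: ||X||_F^2 >= lambda_max(X^T X).
    step = 1.0 / sum(x * x for row in X for x in row)
    w = hard_threshold(w_dense, k)
    for _ in range(steps):
        residual = matvec(X, [a - b for a, b in zip(w, w_dense)])
        grad = [
            sum(X[i][j] * residual[i] for i in range(len(X)))
            for j in range(len(w))
        ]
        w = hard_threshold([wj - step * gj for wj, gj in zip(w, grad)], k)
    return w


# Hypothetical calibration activations and dense weights.
X = [[1.0, 0.9, 0.0], [0.0, 0.1, 1.0], [1.0, 1.0, 0.1]]
w_dense = [0.5, -0.6, 0.4]
w_magnitude = hard_threshold(w_dense, 2)
w_iht = iht_prune(X, w_dense, 2)
print(loss(X, w_magnitude, w_dense), loss(X, w_iht, w_dense))
```

With this step size each iteration cannot increase the reconstruction error, so the result is never worse than plain magnitude pruning under the same sparsity budget.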

[LG-44] Prompt Variability Effects On LLM Code Generation

Link: https://arxiv.org/abs/2506.10204
Authors: Andrei Paleyes,Radzim Sendyka,Diana Robinson,Christian Cabrera,Neil D. Lawrence
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Code generation is one of the most active areas of application of Large Language Models (LLMs). While LLMs lower barriers to writing code and accelerate the development process, the overall quality of generated programs depends on the quality of the given prompts. Specifically, the functionality and quality of generated code can be sensitive to the user's background and familiarity with software development. It is therefore important to quantify an LLM's sensitivity to variations in the input. To this end, we propose a synthetic evaluation pipeline for code generation with LLMs, as well as a systematic persona-based evaluation approach to expose qualitative differences in LLM responses that depend on the prospective user's background. Both proposed methods are completely independent of specific programming tasks and LLMs, and thus are widely applicable. We provide experimental evidence illustrating the utility of our methods and share our code for the benefit of the community.

[LG-45] DynaSubVAE: Adaptive Subgrouping for Scalable and Robust OOD Detection

Link: https://arxiv.org/abs/2506.10200
Authors: Tina Behrouzi,Sana Tonekaboni,Rahul G. Krishnan,Anna Goldenberg
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Real-world observational data often contain existing or emerging heterogeneous subpopulations that deviate from global patterns. The majority of models tend to overlook these underrepresented groups, leading to inaccurate or even harmful predictions. Existing solutions often rely on detecting these samples as Out-of-domain (OOD) rather than adapting the model to new emerging patterns. We introduce DynaSubVAE, a Dynamic Subgrouping Variational Autoencoder framework that jointly performs representation learning and adaptive OOD detection. Unlike conventional approaches, DynaSubVAE evolves with the data by dynamically updating its latent structure to capture new trends. It leverages a novel non-parametric clustering mechanism, inspired by Gaussian Mixture Models, to discover and model latent subgroups based on embedding similarity. Extensive experiments show that DynaSubVAE achieves competitive performance in both near-OOD and far-OOD detection, and excels in class-OOD scenarios where an entire class is missing during training. We further illustrate that our dynamic subgrouping mechanism outperforms standalone clustering methods such as GMM and KMeans++ in terms of both OOD accuracy and regret precision.

[LG-46] Improving Oral Cancer Outcomes Through Machine Learning and Dimensionality Reduction

Link: https://arxiv.org/abs/2506.10189
Authors: Mohammad Subhi Al-Batah,Muhyeeddin Alqaraleh,Mowafaq Salem Alzboon
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Oral cancer presents a formidable challenge in oncology, necessitating early diagnosis and accurate prognosis to enhance patient survival rates. Recent advancements in machine learning and data mining have revolutionized traditional diagnostic methodologies, providing sophisticated and automated tools for differentiating between benign and malignant oral lesions. This study presents a comprehensive review of cutting-edge data mining methodologies, including Neural Networks, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and ensemble learning techniques, specifically applied to the diagnosis and prognosis of oral cancer. Through a rigorous comparative analysis, our findings reveal that Neural Networks surpass other models, achieving an impressive classification accuracy of 93.6% in predicting oral cancer. Furthermore, we underscore the potential benefits of integrating feature selection and dimensionality reduction techniques to enhance model performance. These insights underscore the significant promise of advanced data mining techniques in bolstering early detection, optimizing treatment strategies, and ultimately improving patient outcomes in the realm of oral oncology.

[LG-47] Wasserstein Barycenter Soft Actor-Critic

Link: https://arxiv.org/abs/2506.10167
Authors: Zahra Shahrooei,Ali Baheri
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Deep off-policy actor-critic algorithms have emerged as the leading framework for reinforcement learning in continuous control domains. However, most of these algorithms suffer from poor sample efficiency, especially in environments with sparse rewards. In this paper, we take a step towards addressing this issue by providing a principled directed exploration strategy. We propose Wasserstein Barycenter Soft Actor-Critic (WBSAC) algorithm, which benefits from a pessimistic actor for temporal difference learning and an optimistic actor to promote exploration. This is achieved by using the Wasserstein barycenter of the pessimistic and optimistic policies as the exploration policy and adjusting the degree of exploration throughout the learning process. We compare WBSAC with state-of-the-art off-policy actor-critic algorithms and show that WBSAC is more sample-efficient on MuJoCo continuous control tasks.
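
As a rough illustration of blending a pessimistic and an optimistic policy through a Wasserstein barycenter, the sketch below uses the known closed form for one-dimensional Gaussians under the 2-Wasserstein distance: the barycenter is again Gaussian, with the mean and the standard deviation each interpolated linearly. Treating both policies as one-dimensional Gaussians (or diagonal Gaussians handled per action dimension) is an assumption made here for illustration, as are the function and parameter names; the abstract does not state the policy parameterization.

```python
def gaussian_w2_barycenter(mu_pess, sigma_pess, mu_opt, sigma_opt, weight_opt):
    """2-Wasserstein barycenter of two 1-D Gaussians.

    weight_opt in [0, 1] is the weight on the optimistic policy; the
    pessimistic policy gets 1 - weight_opt. Returns (mean, std).
    """
    w = weight_opt
    mean = (1.0 - w) * mu_pess + w * mu_opt
    std = (1.0 - w) * sigma_pess + w * sigma_opt
    return mean, std


# Hypothetical policies for one action dimension.
for weight in (0.0, 0.5, 1.0):
    print(weight, gaussian_w2_barycenter(0.0, 1.0, 2.0, 3.0, weight))
```

Sweeping `weight_opt` from 1 toward 0 over training is one simple way to picture "adjusting the degree of exploration", though the actual schedule used by WBSAC is not given in the abstract.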

[LG-48] The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Link: https://arxiv.org/abs/2506.10165
Authors: Gilad Landau,Miran Özdogan,Gereon Elvers,Francesco Mantegna,Pratik Somaiya,Dulhan Jayalath,Luisa Kurth,Teyun Kwon,Brendan Shillingford,Greg Farquhar,Minqi Jiang,Karim Jerbi,Hamza Abdelhedi,Yorguin Mantilla Ramos,Caglar Gulcehre,Mark Woolrich,Natalie Voets,Oiwi Parker Jones
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising applications is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an “ImageNet moment” or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.

Journal reference: NeurIPS 2025 Competition Track

[LG-49] Probabilistic Variational Contrastive Learning

Link: https://arxiv.org/abs/2506.10159
Authors: Minoh Jeong,Seonho Kim,Alfred Hero
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Deterministic embeddings learned by contrastive learning (CL) methods such as SimCLR and SupCon achieve state-of-the-art performance but lack a principled mechanism for uncertainty quantification. We propose Variational Contrastive Learning (VCL), a decoder-free framework that maximizes the evidence lower bound (ELBO) by interpreting the InfoNCE loss as a surrogate reconstruction term and adding a KL divergence regularizer to a uniform prior on the unit hypersphere. We model the approximate posterior $q_\theta(z \mid x)$ as a projected normal distribution, enabling the sampling of probabilistic embeddings. Our two instantiations, VSimCLR and VSupCon, replace deterministic embeddings with samples from $q_\theta(z \mid x)$ and incorporate a normalized KL term into the loss. Experiments on multiple benchmarks demonstrate that VCL mitigates dimensional collapse, enhances mutual information with class labels, and matches or outperforms deterministic baselines in classification accuracy, all the while providing meaningful uncertainty estimates through the posterior model. VCL thus equips contrastive learning with a probabilistic foundation, serving as a new basis for contrastive approaches.

[LG-50] Physiological-Model-Based Neural Network for Heart Rate Estimation during Daily Physical Activities

Link: https://arxiv.org/abs/2506.10144
Authors: Yaowen Zhang,Libera Fresiello,Peter H. Veltink,Dirk W. Donker,Ying Wang
Subjects: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments:

Abstract:Heart failure (HF) poses a significant global health challenge, with early detection offering opportunities for improved outcomes. Abnormalities in heart rate (HR), particularly during daily activities, may serve as early indicators of HF risk. However, existing HR monitoring tools for HF detection are limited by their reliance on population-based averages. The estimation of individualized HR serves as a dynamic digital twin, enabling precise tracking of cardiac health biomarkers. Current HR estimation methods, categorized into physiologically-driven and purely data-driven models, struggle with efficiency and interpretability. This study introduces a novel physiological-model-based neural network (PMB-NN) framework for HR estimation based on oxygen uptake (VO2) data during daily physical activities. The framework was trained and tested on individual datasets from 12 participants engaged in activities including resting, cycling, and running. By embedding physiological constraints, which were derived from our proposed simplified human movement physiological model (PM), into the neural network training process, the PMB-NN model adheres to human physiological principles while achieving high estimation accuracy, with a median R^2 score of 0.8 and an RMSE of 8.3 bpm. Comparative statistical analysis demonstrates that the PMB-NN achieves performance on par with the benchmark neural network model while significantly outperforming the traditional physiological model (p=0.002). In addition, our PMB-NN is adept at identifying personalized parameters of the PM, enabling the PM to generate reasonable HR estimates. The proposed framework, combined with a precise VO2 estimation system derived from body movements, opens up future possibilities for personalized and real-time cardiac monitoring during daily-life physical activities.
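
For readers unfamiliar with the two accuracy metrics quoted above, the sketch below shows how R^2 and RMSE are conventionally computed for a heart-rate trace. The function names and the sample values are hypothetical and purely illustrative; they are not data from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same unit as the inputs (here bpm)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. estimated heart rate (bpm) for one participant.
hr_measured = [62.0, 75.0, 98.0, 121.0, 110.0, 84.0]
hr_estimated = [65.0, 72.0, 101.0, 115.0, 113.0, 80.0]
print(rmse(hr_measured, hr_estimated), r2_score(hr_measured, hr_estimated))
```

An R^2 of 1 means a perfect fit, while predicting the mean of the measured signal everywhere gives an R^2 of 0.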

[LG-51] Survival Analysis as Imprecise Classification with Trainable Kernels

Link: https://arxiv.org/abs/2506.10140
Authors: Andrei V. Konstantinov,Vlada A. Efremenko,Lev V. Utkin
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Survival analysis is a fundamental tool for modeling time-to-event data in healthcare, engineering, and finance, where censored observations pose significant challenges. While traditional methods like the Beran estimator offer nonparametric solutions, they often struggle with complex data structures and heavy censoring. This paper introduces three novel survival models, iSurvM (the imprecise Survival model based on Mean likelihood functions), iSurvQ (the imprecise Survival model based on the Quantiles of likelihood functions), and iSurvJ (the imprecise Survival model based on the Joint learning), that combine imprecise probability theory with attention mechanisms to handle censored data without parametric assumptions. The first idea behind the models is to represent censored observations by interval-valued probability distributions for each instance over time intervals between event moments. The second idea is to employ the kernel-based Nadaraya-Watson regression with trainable attention weights for computing the imprecise probability distribution over time intervals for the entire dataset. The third idea is to consider three decision strategies for training, which correspond to the three proposed models. Experiments on synthetic and real datasets demonstrate that the proposed models, especially iSurvJ, consistently outperform the Beran estimator in terms of both accuracy and computational complexity. Codes implementing the proposed models are publicly available.
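
The Nadaraya-Watson regression mentioned in the abstract can be read as attention: a softmax over kernel scores yields weights that average the training targets. The sketch below uses a Gaussian kernel with a fixed bandwidth on scalar inputs; the paper describes trainable attention weights, so the fixed `bandwidth` parameter, the function name, and the toy data are simplifying assumptions for illustration.

```python
import math


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Gaussian-kernel Nadaraya-Watson estimate at x_query.

    Returns (prediction, attention weights over the training points).
    """
    scores = [-((x_query - x) ** 2) / (2.0 * bandwidth ** 2) for x in xs]
    top = max(scores)  # subtract the max for a numerically stable softmax
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    prediction = sum(w * y for w, y in zip(weights, ys))
    return prediction, weights


# Hypothetical 1-D training set.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 1.0, 3.0]
prediction, weights = nadaraya_watson(1.5, xs, ys, bandwidth=0.5)
print(prediction, weights)
```

Shrinking the bandwidth concentrates the weights on the nearest training points, which is the behaviour a learned kernel would tune from data.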

[LG-52] Provable Sim-to-Real Transfer via Offline Domain Randomization

Link: https://arxiv.org/abs/2506.10133
Authors: Arnaud Fickinger,Abderrahim Bendahi,Stuart Russell
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Reinforcement-learning agents often struggle when deployed from simulation to the real world. A dominant strategy for reducing the sim-to-real gap is domain randomization (DR), which trains the policy across many simulators produced by sampling dynamics parameters, but standard DR ignores offline data already available from the real system. We study offline domain randomization (ODR), which first fits a distribution over simulator parameters to an offline dataset. While a growing body of empirical work reports substantial gains with algorithms such as DROPO, the theoretical foundations of ODR remain largely unexplored. In this work, we (i) formalize ODR as a maximum-likelihood estimation over a parametric simulator family, (ii) prove consistency of this estimator under mild regularity and identifiability conditions, showing it converges to the true dynamics as the dataset grows, (iii) derive gap bounds demonstrating that ODR's sim-to-real error is up to an O(M) factor tighter than uniform DR in the finite-simulator case (with analogous gains in the continuous setting), and (iv) introduce E-DROPO, a new version of DROPO which adds an entropy bonus to prevent variance collapse, yielding broader randomization and more robust zero-shot transfer in practice.

[LG-53] Meet Me at the Arm: The Cooperative Multi-Armed Bandits Problem with Shareable Arms

Link: https://arxiv.org/abs/2506.10127
Authors: Xinyi Hu,Aldo Pacchiano
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We study the decentralized multi-player multi-armed bandits (MMAB) problem under a no-sensing setting, where each player receives only their own reward and obtains no information about collisions. Each arm has an unknown capacity, and if the number of players pulling an arm exceeds its capacity, all players involved receive zero reward. This setting generalizes the classical unit-capacity model and introduces new challenges in coordination and capacity discovery under severe feedback limitations. We propose A-CAPELLA (Algorithm for Capacity-Aware Parallel Elimination for Learning and Allocation), a decentralized algorithm that achieves logarithmic regret in this generalized regime. Our main contribution is a collaborative hypothesis testing protocol that enables synchronized successive elimination and capacity estimation through carefully structured collision patterns. This represents a provably efficient learning result in decentralized no-sensing MMAB with unknown arm capacities.
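
The feedback model described in the abstract (an overloaded arm pays zero to everyone on it, and players see only their own reward) can be sketched for a single round as follows. Rewards are deterministic here to keep the example checkable, whereas the bandit setting has stochastic rewards; the function name and the toy numbers are assumptions for illustration and do not reproduce A-CAPELLA.

```python
from collections import Counter


def round_rewards(choices, capacities, rewards):
    """Rewards for one round of the shareable-arm model.

    choices:    choices[p] is the arm pulled by player p
    capacities: capacities[a] is the (unknown to players) capacity of arm a
    rewards:    rewards[a] is the reward arm a pays when not overloaded
    """
    load = Counter(choices)
    # If more players pull an arm than its capacity, all of them get zero.
    return [
        rewards[arm] if load[arm] <= capacities[arm] else 0.0
        for arm in choices
    ]


# Hypothetical round: five players, two arms, each with capacity 2.
print(round_rewards([0, 0, 1, 1, 1], capacities=[2, 2], rewards=[1.0, 0.7]))
```

Under no-sensing, a player who receives zero cannot directly tell an overloaded arm from a genuinely zero reward draw, which is why capacity has to be inferred through structured collision patterns.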

[LG-54] NnD: Diffusion-based Generation of Physically-Nonnegative Objects

Link: https://arxiv.org/abs/2506.10112
Authors: Nadav Torem,Tamar Sde-Chen,Yoav Y. Schechner
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Most natural objects have inherent complexity and variability. While some simple objects can be modeled from first principles, many real-world phenomena, such as cloud formation, require computationally expensive simulations that limit scalability. This work focuses on a class of physically meaningful, nonnegative objects that are computationally tractable but costly to simulate. To dramatically reduce computational costs, we propose nonnegative diffusion (NnD). This is a learned generative model using score based diffusion. It adapts annealed Langevin dynamics to enforce, by design, non-negativity throughout iterative scene generation and analysis (inference). NnD trains on high-quality physically simulated objects. Once trained, it can be used for generation and inference. We demonstrate generation of 3D volumetric clouds, comprising inherently nonnegative microphysical fields. Our generated clouds are consistent with cloud physics trends. They are effectively not distinguished as non-physical by expert perception.

[LG-55] AI5GTest: AI-Driven Specification-Aware Automated Testing and Validation of 5G O-RAN Components

Link: https://arxiv.org/abs/2506.10111
Authors: Abiodun Ganiyu,Pranshav Gajjar,Vijay K Shah
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:

Abstract:The advent of Open Radio Access Networks (O-RAN) has transformed the telecommunications industry by promoting interoperability, vendor diversity, and rapid innovation. However, its disaggregated architecture introduces complex testing challenges, particularly in validating multi-vendor components against O-RAN ALLIANCE and 3GPP specifications. Existing frameworks, such as those provided by Open Testing and Integration Centres (OTICs), rely heavily on manual processes, are fragmented and prone to human error, leading to inconsistency and scalability issues. To address these limitations, we present AI5GTest – an AI-powered, specification-aware testing framework designed to automate the validation of O-RAN components. AI5GTest leverages a cooperative Large Language Models (LLM) framework consisting of Gen-LLM, Val-LLM, and Debug-LLM. Gen-LLM automatically generates expected procedural flows for test cases based on 3GPP and O-RAN specifications, while Val-LLM cross-references signaling messages against these flows to validate compliance and detect deviations. If anomalies arise, Debug-LLM performs root cause analysis, providing insight to the failure cause. To enhance transparency and trustworthiness, AI5GTest incorporates a human-in-the-loop mechanism, where the Gen-LLM presents top-k relevant official specifications to the tester for approval before proceeding with validation. Evaluated using a range of test cases obtained from O-RAN TIFG and WG5-IOT test specifications, AI5GTest demonstrates a significant reduction in overall test execution time compared to traditional manual methods, while maintaining high validation accuracy.

[LG-56] Estimating the Joint Probability of Scenario Parameters with Gaussian Mixture Copula Models

Link: https://arxiv.org/abs/2506.10098
Authors: Christian Reichenbächer,Philipp Rank,Jochen Hipp,Oliver Bringmann
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures; This work has been submitted to the IEEE for possible publication

Abstract:This paper presents the first application of Gaussian Mixture Copula Models to the statistical modeling of driving scenarios for the safety validation of automated driving systems. Knowledge of the joint probability distribution of scenario parameters is essential for scenario-based safety assessment, where risk quantification depends on the likelihood of concrete parameter combinations. Gaussian Mixture Copula Models bring together the multimodal expressivity of Gaussian Mixture Models and the flexibility of copulas, enabling separate modeling of marginal distributions and dependencies. We benchmark Gaussian Mixture Copula Models against previously proposed approaches - Gaussian Mixture Models and Gaussian Copula Models - using real-world driving data drawn from scenarios defined in United Nations Regulation No. 157. Our evaluation across 18 million scenario instances demonstrates that Gaussian Mixture Copula Models provide a better fit to the data in terms of both likelihood and Sinkhorn distance. These results suggest that Gaussian Mixture Copula Models are a compelling foundation for future scenario-based validation frameworks.

[LG-57] Unsupervised Deep Clustering of MNIST with Triplet-Enhanced Convolutional Autoencoders

Link: https://arxiv.org/abs/2506.10094
Authors: Md. Faizul Islam Ansari
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 6 figures, experimental study on deep clustering with autoencoders

Abstract:This research implements an advanced unsupervised clustering system for MNIST handwritten digits through two-phase deep autoencoder architecture. A deep neural autoencoder requires a training process during phase one to develop minimal yet interpretive representations of images by minimizing reconstruction errors. During the second phase we unify the reconstruction error with a KMeans clustering loss for learned latent embeddings through a joint distance-based objective. Our model contains three elements which include batch normalization combined with dropout and weight decay for achieving generalized and stable results. The framework achieves superior clustering performance during extensive tests which used intrinsic measurements including Silhouette Score and Davies-Bouldin Index coupled with extrinsic metrics NMI and ARI when processing image features. The research uses t-SNE visualization to present learned embeddings that show distinct clusters for digits. Our approach reaches an optimal combination between data reconstruction accuracy and cluster separation purity when adding the benefit of understandable results and scalable implementations. The approach creates a dependable base that helps deploy unsupervised representation learning in different large-scale image clustering applications.

[LG-58] Efficient kernelized bandit algorithms via exploration distributions

Link: https://arxiv.org/abs/2506.10091
Authors: Bingshan Hu, Zheng He, Danica J. Sutherland
Categories: Machine Learning (cs.LG)

Abstract:We consider a kernelized bandit problem with a compact arm set $X \subset \mathbb{R}^d$ and a fixed but unknown reward function $f^*$ with a finite norm in some Reproducing Kernel Hilbert Space (RKHS). We propose a class of computationally efficient kernelized bandit algorithms, which we call GP-Generic, based on a novel concept: exploration distributions. This class of algorithms includes Upper Confidence Bound-based approaches as a special case, but also allows for a variety of randomized algorithms. With careful choice of exploration distribution, our proposed generic algorithm realizes a wide range of concrete algorithms that achieve $\tilde{O}(\gamma_T\sqrt{T})$ regret bounds, where $\gamma_T$ characterizes the RKHS complexity. This matches known results for UCB- and Thompson Sampling-based algorithms; we also show that in practice, randomization can yield better practical results.

[LG-59] Optimizing Latent Dimension Allocation in Hierarchical VAEs: Balancing Attenuation and Information Retention for OOD Detection

Link: https://arxiv.org/abs/2506.10089
Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer
Categories: Machine Learning (cs.LG)
Comments: 41 pages, 6 figures

Abstract:Out-of-distribution (OOD) detection is a critical task in machine learning, particularly for safety-critical applications where unexpected inputs must be reliably flagged. While hierarchical variational autoencoders (HVAEs) offer improved representational capacity over traditional VAEs, their performance is highly sensitive to how latent dimensions are distributed across layers. Existing approaches often allocate latent capacity arbitrarily, leading to ineffective representations or posterior collapse. In this work, we introduce a theoretically grounded framework for optimizing latent dimension allocation in HVAEs, drawing on principles from information theory to formalize the trade-off between information loss and representational attenuation. We prove the existence of an optimal allocation ratio $r^\ast$ under a fixed latent budget, and empirically show that tuning this ratio consistently improves OOD detection performance across datasets and architectures. Our approach outperforms baseline HVAE configurations and provides practical guidance for principled latent structure design, leading to more robust OOD detection with deep generative models.

[LG-60] Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)

Link: https://arxiv.org/abs/2506.10049
Authors: Francesco Vinci, Gyunam Park, Wil van der Aalst, Massimiliano de Leoni
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Business Process Simulation (BPS) refers to techniques designed to replicate the dynamic behavior of a business process. Many approaches have been proposed to automatically discover simulation models from historical event logs, reducing the cost and time to manually design them. However, in dynamic business environments, organizations continuously refine their processes to enhance efficiency, reduce costs, and improve customer satisfaction. Existing techniques to process simulation discovery lack adaptability to real-time operational changes. In this paper, we propose a streaming process simulation discovery technique that integrates Incremental Process Discovery with Online Machine Learning methods. This technique prioritizes recent data while preserving historical information, ensuring adaptation to evolving process dynamics. Experiments conducted on four different event logs demonstrate the importance in simulation of giving more weight to recent data while retaining historical knowledge. Our technique not only produces more stable simulations but also exhibits robustness in handling concept drift, as highlighted in one of the use cases.

[LG-61] Improving the performance of optical inverse design of multilayer thin films using CNN-LSTM tandem neural networks

Link: https://arxiv.org/abs/2506.10044
Authors: Uijun Jung, Deokho Jang, Sungchul Kim, Jungho Kim
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE); Optics (physics.optics)
Comments: 22 pages, 8 figures, 2 tables, 11 supplementary figures, 7 supplementary tables

Abstract:Optical properties of thin film are greatly influenced by the thickness of each layer. Accurately predicting these thicknesses and their corresponding optical properties is important in the optical inverse design of thin films. However, traditional inverse design methods usually demand extensive numerical simulations and optimization procedures, which are time-consuming. In this paper, we utilize deep learning for the inverse design of the transmission spectra of SiO2/TiO2 multilayer thin films. We implement a tandem neural network (TNN), which can solve the one-to-many mapping problem that greatly degrades the performance of deep-learning-based inverse designs. In general, the TNN has been implemented by a back-to-back connection of an inverse neural network and a pre-trained forward neural network, both of which have been implemented based on multilayer perceptron (MLP) algorithms. In this paper, we propose to use not only MLP, but also convolutional neural network (CNN) or long short-term memory (LSTM) algorithms in the configuration of the TNN. We show that an LSTM-LSTM-based TNN yields the highest accuracy but takes the longest training time among nine configurations of TNNs. We also find that a CNN-LSTM-based TNN will be an optimal solution in terms of accuracy and speed because it could integrate the strengths of the CNN and LSTM algorithms.

[LG-62] NOCL: Node-Oriented Conceptualization LLM for Graph Tasks without Message Passing

Link: https://arxiv.org/abs/2506.10014
Authors: Wei Li, Mengcheng Lan, Jiaxing Xu, Yiping Ke
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures. arXiv admin note: text overlap with arXiv:1703.00552, arXiv:1403.2844 by other authors

Abstract:Graphs are essential for modeling complex interactions across domains such as social networks, biology, and recommendation systems. Traditional Graph Neural Networks, particularly Message Passing Neural Networks (MPNNs), rely heavily on supervised learning, limiting their generalization and applicability in label-scarce scenarios. Recent self-supervised approaches still require labeled fine-tuning, limiting their effectiveness in zero-shot scenarios. Meanwhile, Large Language Models (LLMs) excel in natural language tasks but face significant challenges when applied to graphs, including preserving reasoning abilities, managing extensive token lengths from rich node attributes, and being limited to textual-attributed graphs (TAGs) and a single level task. To overcome these limitations, we propose the Node-Oriented Conceptualization LLM (NOCL), a novel framework that leverages two core techniques: 1) node description, which converts heterogeneous node attributes into structured natural language, extending LLM from TAGs to non-TAGs; 2) node concept, which encodes node descriptions into compact semantic embeddings using pretrained language models, significantly reducing token lengths by up to 93.9% compared to directly using node descriptions. Additionally, our NOCL employs graph representation descriptors to unify graph tasks at various levels into a shared, language-based query format, paving a new direction for Graph Foundation Models. Experimental results validate NOCL’s competitive supervised performance relative to traditional MPNNs and hybrid LLM-MPNN methods and demonstrate superior generalization in zero-shot settings.

[LG-63] Multimodal Emotion Coupling via Speech-to-Facial and Bodily Gestures in Dyadic Interaction

Link: https://arxiv.org/abs/2506.10010
Authors: Von Ralph Dane Marquez Herbuela, Yukie Nagai
Categories: Multimedia (cs.MM); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Human emotional expression emerges through coordinated vocal, facial, and gestural signals. While speech face alignment is well established, the broader dynamics linking emotionally expressive speech to regional facial and hand motion remains critical for gaining a deeper insight into how emotional and behavior cues are communicated in real interactions. Further modulating the coordination is the structure of conversational exchange like sequential turn taking, which creates stable temporal windows for multimodal synchrony, and simultaneous speech, often indicative of high arousal moments, disrupts this alignment and impacts emotional clarity. Understanding these dynamics enhances realtime emotion detection by improving the accuracy of timing and synchrony across modalities in both human interactions and AI systems. This study examines multimodal emotion coupling using region specific motion capture from dyadic interactions in the IEMOCAP corpus. Speech features included low level prosody, MFCCs, and model derived arousal, valence, and categorical emotions (Happy, Sad, Angry, Neutral), aligned with 3D facial and hand marker displacements. Expressive activeness was quantified through framewise displacement magnitudes, and speech to gesture prediction mapped speech features to facial and hand movements. Nonoverlapping speech consistently elicited greater activeness particularly in the lower face and mouth. Sadness showed increased expressivity during nonoverlap, while anger suppressed gestures during overlaps. Predictive mapping revealed highest accuracy for prosody and MFCCs in articulatory regions while arousal and valence had lower and more context sensitive correlations. Notably, hand speech synchrony was enhanced under low arousal and overlapping speech, but not for valence.

[LG-64] Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

Link: https://arxiv.org/abs/2506.09999
Authors: Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu
Categories: Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information and mitigating catastrophic forgetting. To tackle these issues, we propose an MCIL method based on multimodal pre-trained models. Firstly, a Multimodal Incremental Feature Extractor (MIFE) based on Mixture-of-Experts (MoE) structure is introduced to achieve effective incremental fine-tuning for AudioCLIP. Secondly, to enhance feature discriminability and generalization, we propose an Adaptive Audio-Visual Fusion Module (AAVFM) that includes a masking threshold mechanism and a dynamic feature fusion mechanism, along with a strategy to enhance text diversity. Thirdly, a novel multimodal class-incremental contrastive training loss is proposed to optimize cross-modal alignment in MCIL. Finally, two MCIL-specific evaluation metrics are introduced for comprehensive assessment. Extensive experiments on three multimodal datasets validate the effectiveness of our method.

[LG-65] What Exactly Does Guidance Do in Masked Discrete Diffusion Models

Link: https://arxiv.org/abs/2506.10971
Authors: He Ye, Rojas Kevin, Tao Molei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study masked discrete diffusion models with classifier-free guidance (CFG). Assuming no score error nor discretization error, we derive an explicit solution to the guided reverse dynamics, so that how guidance influences the sampling behavior can be precisely characterized. When the full data distribution is a mixture over classes and the goal is to sample from a specific class, guidance amplifies class-specific regions while suppressing regions shared with other classes. This effect depends on the guidance strength $w$ and induces distinct covariance structures in the sampled distribution. Notably, we observe quantitatively different behaviors in 1D and 2D. We also show that for large $w$, the decay rate of the total variation ($\mathrm{TV}$) along the reverse dynamics is double-exponential in $w$ for both 1D and 2D. These findings highlight the role of guidance, not just in shaping the output distribution, but also in controlling the dynamics of the sampling trajectory. Our theoretical analysis is supported by experiments that illustrate the geometric effects of guidance and its impact on convergence.

[LG-66] Coupled reaction and diffusion governing interface evolution in solid-state batteries

Link: https://arxiv.org/abs/2506.10944
Authors: Jingxuan Ding, Laura Zichi, Matteo Carli, Menghang Wang, Albert Musaelian, Yu Xie, Boris Kozinsky
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)

Abstract:Understanding and controlling the atomistic-level reactions governing the formation of the solid-electrolyte interphase (SEI) is crucial for the viability of next-generation solid state batteries. However, challenges persist due to difficulties in experimentally characterizing buried interfaces and limits in simulation speed and accuracy. We conduct large-scale explicit reactive simulations with quantum accuracy for a symmetric battery cell, enabled by active learning and deep equivariant neural network interatomic potentials. To automatically characterize the coupled reactions and interdiffusion at the interface, we formulate and use unsupervised classification techniques based on clustering in the space of local atomic environments. Our analysis reveals the formation of a previously unreported crystalline disordered phase, Li$_2$S$_{0.72}$P$_{0.14}$Cl$_{0.14}$, in the SEI, that evaded previous predictions based purely on thermodynamics, underscoring the importance of explicit modeling of full reaction and transport kinetics. Our simulations agree with and explain experimental observations of the SEI formations and elucidate the Li creep mechanisms, critical to dendrite initiation, characterized by significant Li motion along the interface. Our approach is to create a digital twin from first principles, without adjustable parameters fitted to experiment. As such, it offers capabilities to gain insights into atomistic dynamics governing complex heterogeneous processes in solid-state synthesis and electrochemistry.

[LG-67] On feature selection in double-imbalanced data settings: a Random Forest approach

Link: https://arxiv.org/abs/2506.10929
Authors: Fabio Demaria
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: Working paper

Abstract:Feature selection is a critical step in high-dimensional classification tasks, particularly under challenging conditions of double imbalance, namely settings characterized by both class imbalance in the response variable and dimensional asymmetry in the data ($n \gg p$). In such scenarios, traditional feature selection methods applied to Random Forests (RF) often yield unstable or misleading importance rankings. This paper proposes a novel thresholding scheme for feature selection based on minimal depth, which exploits the tree topology to assess variable relevance. Extensive experiments on simulated and real-world datasets demonstrate that the proposed approach produces more parsimonious and accurate subsets of variables compared to conventional minimal depth-based selection. The method provides a practical and interpretable solution for variable selection in RF under double imbalance conditions.

[LG-68] Probably Approximately Correct Labels

Link: https://arxiv.org/abs/2506.10908
Authors: Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such “expert” labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.

[LG-69] Demystifying Spectral Feature Learning for Instrumental Variable Regression

Link: https://arxiv.org/abs/2506.10899
Authors: Dimitri Meunier, Antoine Moulin, Jakub Wornbard, Vladimir R. Kostic, Arthur Gretton
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We address the problem of causal effect estimation in the presence of hidden confounders, using nonparametric instrumental variable (IV) regression. A leading strategy employs spectral features - that is, learned features spanning the top eigensubspaces of the operator linking treatments to instruments. We derive a generalization error bound for a two-stage least squares estimator based on spectral features, and gain insights into the method’s performance and failure modes. We show that performance depends on two key factors, leading to a clear taxonomy of outcomes. In a good scenario, the approach is optimal. This occurs with strong spectral alignment, meaning the structural function is well-represented by the top eigenfunctions of the conditional operator, coupled with this operator’s slow eigenvalue decay, indicating a strong instrument. Performance degrades in a bad scenario: spectral alignment remains strong, but rapid eigenvalue decay (indicating a weaker instrument) demands significantly more samples for effective feature learning. Finally, in the ugly scenario, weak spectral alignment causes the method to fail, regardless of the eigenvalues’ characteristics. Our synthetic experiments empirically validate this taxonomy.

[LG-70] A Goemans-Williamson type algorithm for identifying subcohorts in clinical trials

Link: https://arxiv.org/abs/2506.10879
Authors: Pratik Worah
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:We design an efficient algorithm that outputs a linear classifier for identifying homogeneous subsets (equivalently subcohorts) from large inhomogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Williamson (1994), that approximates the optimal solution of the underlying optimization problem within a factor of 0.82. As an application, we use our algorithm to design a simple test that can identify homogeneous subcohorts of patients, that are mainly comprised of metastatic cases, from the RNA microarray dataset for breast cancer by Curtis et al. (2012). Furthermore, we also use the test output by the algorithm to systematically identify subcohorts of patients in which statistically significant changes in methylation levels of tumor suppressor genes co-occur with statistically significant changes in nuclear receptor expression. Identifying such homogeneous subcohorts of patients can be useful for the discovery of disease pathways and therapeutics, specific to the subcohort.

[LG-71] The Gittins Index: A Design Principle for Decision-Making Under Uncertainty

Link: https://arxiv.org/abs/2506.10872
Authors: Ziv Scully, Alexander Terenin
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Performance (cs.PF); Probability (math.PR); Machine Learning (stat.ML)

Abstract:The Gittins index is a tool that optimally solves a variety of decision-making problems involving uncertainty, including multi-armed bandit problems, minimizing mean latency in queues, and search problems like the Pandora’s box model. However, despite the above examples and later extensions thereof, the space of problems that the Gittins index can solve perfectly optimally is limited, and its definition is rather subtle compared to those of other multi-armed bandit algorithms. As a result, the Gittins index is often regarded as being primarily a concept of theoretical importance, rather than a practical tool for solving decision-making problems. The aim of this tutorial is to demonstrate that the Gittins index can be fruitfully applied to practical problems. We start by giving an example-driven introduction to the Gittins index, then walk through several examples of problems it solves - some optimally, some suboptimally but still with excellent performance. Two practical highlights in the latter category are applying the Gittins index to Bayesian optimization, and applying the Gittins index to minimizing tail latency in queues.

[LG-72] OmniFluids: Unified Physics Pre-trained Modeling of Fluid Dynamics

Link: https://arxiv.org/abs/2506.10862
Authors: Rui Zhang, Qi Meng, Han Wan, Yang Liu, Zhi-Ming Ma, Hao Sun
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:High-fidelity and efficient simulation of fluid dynamics drive progress in various scientific and engineering applications. Traditional computational fluid dynamics methods offer strong interpretability and guaranteed convergence, but rely on fine spatial and temporal meshes, incurring prohibitive computational costs. Physics-informed neural networks (PINNs) and neural operators aim to accelerate PDE solvers using deep learning techniques. However, PINNs require extensive retraining and careful tuning, and purely data-driven operators demand large labeled datasets. Hybrid physics-aware methods embed numerical discretizations into network architectures or loss functions, but achieve marginal speed gains and become unstable when balancing coarse priors against high-fidelity measurements. To this end, we introduce OmniFluids, a unified physics pre-trained operator learning framework that integrates physics-only pre-training, coarse-grid operator distillation, and few-shot fine-tuning, which enables fast inference and accurate prediction under limited or zero data supervision. For architectural design, the key components of OmniFluids include a mixture of operators, a multi-frame decoder, and factorized Fourier layers, which enable efficient and scalable modeling of diverse physical tasks while maintaining seamless integration with physics-based supervision. Across a broad range of two- and three-dimensional benchmarks, OmniFluids significantly outperforms state-of-the-art AI-driven methods in flow field reconstruction and turbulence statistics accuracy, delivering 10-100x speedups compared to classical solvers, and accurately recovers unknown physical parameters from sparse, noisy data. This work establishes a new paradigm for efficient and generalizable surrogate modeling in complex fluid systems under limited data availability.

[LG-73] SNR and Resource Adaptive Deep JSCC for Distributed IoT Image Classification

Link: https://arxiv.org/abs/2506.10699
Authors: Ali Waqas, Sinem Coleri
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, PIMRC Conference 2025

Abstract:Sensor-based local inference at IoT devices faces severe computational limitations, often requiring data transmission over noisy wireless channels for server-side processing. To address this, split-network Deep Neural Network (DNN) based Joint Source-Channel Coding (JSCC) schemes are used to extract and transmit relevant features instead of raw data. However, most existing methods rely on fixed network splits and static configurations, lacking adaptability to varying computational budgets and channel conditions. In this paper, we propose a novel SNR- and computation-adaptive distributed CNN framework for wireless image classification across IoT devices and edge servers. We introduce a learning-assisted intelligent Genetic Algorithm (LAIGA) that efficiently explores the CNN hyperparameter space to optimize network configuration under given FLOPs constraints and given SNR. LAIGA intelligently discards the infeasible network configurations that exceed computational budget at IoT device. It also benefits from the Random Forests based learning assistance to avoid a thorough exploration of hyperparameter space and to induce application specific bias in candidate optimal configurations. Experimental results demonstrate that the proposed framework outperforms fixed-split architectures and existing SNR-adaptive methods, especially under low SNR and limited computational resources. We achieve a 10% increase in classification accuracy as compared to existing JSCC based SNR-adaptive multilayer framework at an SNR as low as -10dB across a range of available computational budget (1M to 70M FLOPs) at IoT device.

[LG-74] Practical Improvements of A/B Testing with Off-Policy Estimation

Link: https://arxiv.org/abs/2506.10677
Authors: Sakhi Otmane, Gilotte Alexandre, Rohde David
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We address the problem of A/B testing, a widely used protocol for evaluating the potential improvement achieved by a new decision system compared to a baseline. This protocol segments the population into two subgroups, each exposed to a version of the system and estimates the improvement as the difference between the measured effects. In this work, we demonstrate that the commonly used difference-in-means estimator, while unbiased, can be improved. We introduce a family of unbiased off-policy estimators that achieves lower variance than the standard approach. Among this family, we identify the estimator with the lowest variance. The resulting estimator is simple, and offers substantial variance reduction when the two tested systems exhibit similarities. Our theoretical analysis and experimental results validate the effectiveness and practicality of the proposed method.
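For orientation, the baseline that the abstract calls the difference-in-means estimator can be sketched as follows. This is only the standard A/B estimate with its usual variance estimate, not the off-policy estimators proposed in the paper; the function name and the reward values are hypothetical.

```python
from statistics import mean, variance


def difference_in_means(treatment, control):
    """Return (estimate, estimated variance) of the A/B difference-in-means.

    The variance estimate assumes two independent groups and uses the
    unbiased sample variance of each group.
    """
    estimate = mean(treatment) - mean(control)
    var = variance(treatment) / len(treatment) + variance(control) / len(control)
    return estimate, var


# Hypothetical per-user rewards for the new system (B) and the baseline (A).
rewards_b = [1.0, 0.0, 1.0, 1.0]
rewards_a = [0.0, 0.0, 1.0, 0.0]
est, var = difference_in_means(rewards_b, rewards_a)
```

The variance term is the quantity that a lower-variance unbiased estimator, such as the family described in the abstract, would aim to shrink.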

[LG-75] Logarithmic Smoothing for Adaptive PAC-Bayesian Off-Policy Learning

Link: https://arxiv.org/abs/2506.10664
Authors: Maxime Haddouche, Otmane Sakhi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Off-policy learning serves as the primary framework for learning optimal policies from logged interactions collected under a static behavior policy. In this work, we investigate the more practical and flexible setting of adaptive off-policy learning, where policies are iteratively refined and re-deployed to collect higher-quality data. Building on the success of PAC-Bayesian learning with Logarithmic Smoothing (LS) in static settings, we extend this framework to the adaptive scenario using tools from online PAC-Bayesian theory. Furthermore, we demonstrate that a principled adjustment to the LS estimator naturally accommodates multiple rounds of deployment and yields faster convergence rates under mild conditions. Our method matches the performance of leading offline approaches in static settings, and significantly outperforms them when intermediate policy deployments are allowed. Empirical evaluations across diverse scenarios highlight both the advantages of adaptive data collection and the strength of the PAC-Bayesian formulation.

[LG-76] Pushing the Limits of Extreme Weather: Constructing Extreme Heatwave Storylines with Differentiable Climate Models

Link: https://arxiv.org/abs/2506.10660
Authors: Tim Whittaker, Alejandro Di Luca
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)

Abstract:Understanding the plausible upper bounds of extreme weather events is essential for risk assessment in a warming climate. Existing methods, based on large ensembles of physics-based models, are often computationally expensive or lack the fidelity needed to simulate rare, high-impact extremes. Here, we present a novel framework that leverages a differentiable hybrid climate model, NeuralGCM, to optimize initial conditions and generate physically consistent worst-case heatwave trajectories. Applied to the 2021 Pacific Northwest heatwave, our method produces temperature anomalies up to 3.7 $^\circ$C above the most extreme member of a 75-member ensemble. These trajectories feature intensified atmospheric blocking and amplified Rossby wave patterns, hallmarks of severe heat events. Our results demonstrate that differentiable climate models can efficiently explore the upper tails of event likelihoods, providing a powerful new approach for constructing targeted storylines of extreme weather under climate change.

[LG-77] Box-Constrained Softmax Function and Its Application for Post-Hoc Calibration

Link: https://arxiv.org/abs/2506.10572
Authors: Kyohei Atarashi, Satoshi Oyama, Hiromi Arai, Hisashi Kashima
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Controlling the output probabilities of softmax-based models is a common problem in modern machine learning. Although the $\mathrm{Softmax}$ function provides soft control via its temperature parameter, it lacks the ability to enforce hard constraints, such as box constraints, on output probabilities, which can be critical in certain applications requiring reliable and trustworthy models. In this work, we propose the box-constrained softmax ($\mathrm{BCSoftmax}$) function, a novel generalization of the $\mathrm{Softmax}$ function that explicitly enforces lower and upper bounds on output probabilities. While $\mathrm{BCSoftmax}$ is formulated as the solution to a box-constrained optimization problem, we develop an exact and efficient computation algorithm for $\mathrm{BCSoftmax}$. As a key application, we introduce two post-hoc calibration methods based on $\mathrm{BCSoftmax}$. The proposed methods mitigate underconfidence and overconfidence in predictive models by learning the lower and upper bounds of the output probabilities or logits after model training, thereby enhancing reliability in downstream decision-making tasks. We demonstrate the effectiveness of our methods experimentally using the TinyImageNet, CIFAR-100, and 20NewsGroups datasets, achieving improvements in calibration metrics.
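To make the idea of a box-constrained softmax concrete, here is a minimal sketch under stated assumptions. It assumes the underlying problem is the entropy-regularized one that ordinary softmax solves, with added bounds lower_i <= p_i <= upper_i. Under that assumption the optimality conditions give p_i = clip(c * exp(x_i), lower_i, upper_i) for a scalar c chosen so that the probabilities sum to one. The sketch finds c by bisection, which is an approximate stand-in and not the exact algorithm the paper develops; the function name `bc_softmax` is hypothetical.

```python
import math


def bc_softmax(logits, lower, upper, iters=200):
    """Sketch: p_i = clip(c * exp(x_i), lower_i, upper_i) with sum(p) == 1.

    Assumes 0 <= lower_i <= upper_i and a logit range small enough that
    exp() does not underflow to zero.
    """
    if not (sum(lower) <= 1.0 <= sum(upper)):
        raise ValueError("infeasible bounds: need sum(lower) <= 1 <= sum(upper)")
    shift = max(logits)
    weights = [math.exp(x - shift) for x in logits]  # shifted for stability

    def clipped(c):
        return [min(max(c * w, lo), hi) for w, lo, hi in zip(weights, lower, upper)]

    # sum(clipped(c)) is non-decreasing in c: it equals sum(lower) at c = 0
    # and sum(upper) once c * w_i >= upper_i for every i.
    lo_c, hi_c = 0.0, max(hi / w for hi, w in zip(upper, weights))
    for _ in range(iters):
        mid = 0.5 * (lo_c + hi_c)
        if sum(clipped(mid)) < 1.0:
            lo_c = mid
        else:
            hi_c = mid
    return clipped(hi_c)


# With trivial bounds the result coincides with ordinary softmax.
p_free = bc_softmax([2.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
# Capping the first probability at 0.5 pushes the excess mass to the others.
p_capped = bc_softmax([2.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.5, 1.0, 1.0])
```

In the capped case the remaining mass is shared among the unconstrained entries in proportion to their exponentiated logits, so their relative ordering is preserved.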

[LG-78] On the role of non-linear latent features in bipartite generative neural networks

Link: https://arxiv.org/abs/2506.10552
Authors: Tony Bonnaire, Giovanni Catania, Aurélien Decelle, Beatriz Seoane
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 23 pages, 5 figures

Click to view abstract

Abstract:We investigate the phase diagram and memory retrieval capabilities of bipartite energy-based neural networks, namely Restricted Boltzmann Machines (RBMs), as a function of the prior distribution imposed on their hidden units - including binary, multi-state, and ReLU-like activations. Drawing connections to the Hopfield model and employing analytical tools from statistical physics of disordered systems, we explore how the architectural choices and activation functions shape the thermodynamic properties of these models. Our analysis reveals that standard RBMs with binary hidden nodes and extensive connectivity suffer from reduced critical capacity, limiting their effectiveness as associative memories. To address this, we examine several modifications, such as introducing local biases and adopting richer hidden unit priors. These adjustments restore ordered retrieval phases and markedly improve recall performance, even at finite temperatures. Our theoretical findings, supported by finite-size Monte Carlo simulations, highlight the importance of hidden unit design in enhancing the expressive power of RBMs.

[LG-79] Prediction of steady states in a marine ecosystem model by a machine learning technique

Link: https://arxiv.org/abs/2506.10475
Authors: Sarker Miraz Mahfuz, Thomas Slawig
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: We used precomputed steady states obtained by a spin-up for a global marine ecosystem model as training data to build a mapping from the small number of biogeochemical model parameters onto the three-dimensional converged steady annual cycle. The mapping was performed by a conditional variational autoencoder (CVAE) with mass correction. On test data, the prediction obtained by the CVAE already gives a reasonably good approximation of the steady states obtained by a regular spin-up. However, the predictions do not reach the same level of annual periodicity as those obtained in the original spin-up data. Thus, we took the predictions as initial values for a spin-up. We could show that the number of iterations (corresponding to model years) needed to reach a prescribed stopping criterion in the spin-up could be significantly reduced compared to the use of the originally uniform, constant initial value. The amount of reduction depends on the applied stopping criterion, which measures the periodicity of the solution. The savings in iterations, and thus in computing time for the spin-up, range from 50% to 95%, depending on the stopping criterion. We compared these results with the use of the mean of the training data as an initial value. We found that this also accelerates the spin-up, but only by a much lower factor.

[LG-80] Measuring Semantic Information Production in Generative Diffusion Models ICLR

Link: https://arxiv.org/abs/2506.10433
Authors: Florian Handke, Félix Koulischer, Gabriel Raya, Luca Ambrogioni
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures, an appendix with derivations and implementation details, accepted at ICLR DeLTa 2025

Click to view abstract

Abstract:It is well known that semantic and structural features of the generated images emerge at different times during the reverse dynamics of diffusion, a phenomenon that has been connected to physical phase transitions in magnets and other materials. In this paper, we introduce a general information-theoretic approach to measure when these class-semantic “decisions” are made during the generative process. By using an online formula for the optimal Bayesian classifier, we estimate the conditional entropy of the class label given the noisy state. We then determine the time intervals corresponding to the highest information transfer between noisy states and class labels using the time derivative of the conditional entropy. We demonstrate our method on one-dimensional Gaussian mixture models and on DDPM models trained on the CIFAR10 dataset. As expected, we find that the semantic information transfer is highest in the intermediate stages of diffusion while vanishing during the final stages. However, we found sizable differences between the entropy rate profiles of different classes, suggesting that different “semantic decisions” are located at different intermediate times.
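
The quantity at the heart of this abstract, the conditional entropy of the class label given the noisy state, can be illustrated on a one-dimensional Gaussian mixture. The sketch below is an assumption-laden toy, not the authors' code: it assumes a variance-preserving forward process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise, two equally likely classes with a shared standard deviation, and a plain Monte Carlo estimate. All names and default values are hypothetical.

```python
import math
import random

def class_posterior(x, alpha_bar, means, sigma, priors):
    """Bayes-optimal p(class | x_t) for a 1-D Gaussian mixture after forward noising."""
    var = alpha_bar * sigma ** 2 + (1.0 - alpha_bar)  # shared by all classes
    scale = math.sqrt(alpha_bar)
    log_w = [math.log(p) - (x - scale * m) ** 2 / (2.0 * var)
             for m, p in zip(means, priors)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def conditional_entropy(alpha_bar, means=(-2.0, 2.0), sigma=0.5,
                        priors=(0.5, 0.5), n_samples=20000, seed=0):
    """Monte Carlo estimate of H(class | x_t) in nats at noise level alpha_bar."""
    rng = random.Random(seed)
    std = math.sqrt(alpha_bar * sigma ** 2 + (1.0 - alpha_bar))
    scale = math.sqrt(alpha_bar)
    total = 0.0
    for _ in range(n_samples):
        k = rng.choices(range(len(means)), weights=priors)[0]
        x = rng.gauss(scale * means[k], std)
        post = class_posterior(x, alpha_bar, means, sigma, priors)
        total -= sum(p * math.log(p) for p in post if p > 0.0)
    return total / n_samples

# Entropy is close to log(2) under heavy noise and close to 0 for clean data.
h_noisy, h_mid, h_clean = (conditional_entropy(a) for a in (1e-6, 0.5, 1.0))
```

A finite difference of `conditional_entropy` over neighbouring noise levels gives a rough stand-in for the time derivative the abstract uses to locate the intervals of highest information transfer.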

[LG-81] Self-learning signal classifier for decameter coherent scatter radars

Link: https://arxiv.org/abs/2506.10305
Authors: Oleg Berngardt, Ivan Lavygin
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 30 pages, 10 figures, 4 tables. To be submitted to Advances in Space Research

Click to view abstract

Abstract: The paper presents a method for automatically constructing a classifier for processed data obtained by decameter coherent scatter radars. The method is based only on the radar data obtained, the results of automatic modeling of radio wave propagation in the ionosphere, and mathematical criteria for estimating the quality of the models. The final classifier is the model trained on data obtained by 12 radars of the SuperDARN and SECIRA networks over two years for each radar. The number of model coefficients is 2669. For the classification, the model uses both the calculated parameters of radio wave propagation in the model ionosphere and the parameters directly measured by the radar. Calibration of radio wave elevation measurements at each radar was made using meteor trail scattered signals. The analysis showed that the optimal number of classes in the data is 37, of which 25 are frequently observed. The analysis made it possible to choose 14 of these classes, which are confidently separated in other variants of model training. A preliminary interpretation of 10 of them was carried out. The dynamics of observation of the various classes and their dependence on the geographical latitude of the radars at different levels of solar and geomagnetic activity are presented and shown not to contradict known physical mechanisms. The analysis showed that the most important parameters for identifying the classes are the shape of the signal ray-tracing trajectory in its second half, the ray-traced scattering height, and the Doppler velocity measured by the radar.

[LG-82] Distributionally-Constrained Adversaries in Online Learning

Link: https://arxiv.org/abs/2506.10293
Authors: Moïse Blanchard, Samory Kpotufe
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Click to view abstract

Abstract:There has been much recent interest in understanding the continuum from adversarial to stochastic settings in online learning, with various frameworks including smoothed settings proposed to bridge this gap. We consider the more general and flexible framework of distributionally constrained adversaries in which instances are drawn from distributions chosen by an adversary within some constrained distribution class [RST11]. Compared to smoothed analysis, we consider general distributional classes which allows for a fine-grained understanding of learning settings between fully stochastic and fully adversarial for which a learner can achieve non-trivial regret. We give a characterization for which distribution classes are learnable in this context against both oblivious and adaptive adversaries, providing insights into the types of interplay between the function class and distributional constraints on adversaries that enable learnability. In particular, our results recover and generalize learnability for known smoothed settings. Further, we show that for several natural function classes including linear classifiers, learning can be achieved without any prior knowledge of the distribution class – in other words, a learner can simultaneously compete against any constrained adversary within learnable distribution classes.

[LG-83] VQC-MLPNet: An Unconventional Hybrid Quantum-Classical Architecture for Scalable and Robust Quantum Machine Learning

Link: https://arxiv.org/abs/2506.10275
Authors: Jun Qi, Chao-Han Yang, Pin-Yu Chen, Min-Hsiu Hsieh
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 31 pages, 11 figures, under review

Click to view abstract

Abstract:Variational Quantum Circuits (VQCs) offer a novel pathway for quantum machine learning, yet their practical application is hindered by inherent limitations such as constrained linear expressivity, optimization challenges, and acute sensitivity to quantum hardware noise. This work introduces VQC-MLPNet, a scalable and robust hybrid quantum-classical architecture designed to overcome these obstacles. By innovatively employing quantum circuits to dynamically generate parameters for classical Multi-Layer Perceptrons (MLPs) via amplitude encoding and parameterized quantum operations, VQC-MLPNet substantially expands representation capabilities and augments training stability. We provide rigorous theoretical guarantees via statistical learning techniques and Neural Tangent Kernel analysis, explicitly deriving upper bounds on approximation, uniform deviation, and optimization errors. These theoretical insights demonstrate exponential improvements in representation capacity relative to quantum circuit depth and the number of qubits, providing clear computational advantages over standalone quantum circuits and existing hybrid quantum architectures. Our theoretical claims are empirically corroborated through extensive experiments, including classifying semiconductor quantum-dot charge states and predicting genomic transcription factor binding sites, demonstrating resilient performance even under realistic IBM quantum noise simulations. This research establishes a theoretically sound and practically robust framework, advancing the frontiers of quantum-enhanced learning for unconventional computing paradigms in the Noisy Intermediate-Scale Quantum era and beyond.

[LG-84] Predicting function of evolutionarily implausible DNA sequences ICML2025

Link: https://arxiv.org/abs/2506.10271
Authors: Shiyu Jiang, Xuyin Liu, Zitong Jerry Wang
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 13 pages, 6 figures, accepted to ICML 2025 Generative AI and Biology Workshop

Click to view abstract

Abstract:Genomic language models (gLMs) show potential for generating novel, functional DNA sequences for synthetic biology, but doing so requires them to learn not just evolutionary plausibility, but also sequence-to-function relationships. We introduce a set of prediction tasks called Nullsettes, which assesses a model’s ability to predict loss-of-function mutations created by translocating key control elements in synthetic expression cassettes. Across 12 state-of-the-art models, we find that mutation effect prediction performance strongly correlates with the predicted likelihood of the nonmutant. Furthermore, the range of likelihood values predictive of strong model performance is highly dependent on sequence length. Our work highlights the importance of considering both sequence likelihood and sequence length when using gLMs for mutation effect prediction.

[LG-85] Exploring Topological and Localization Phenomena in SSH Chains under Generalized AAH Modulation: A Computational Approach

Link: https://arxiv.org/abs/2506.10195
Authors: Souvik Ghosh, Sayak Roy
Categories: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The Su-Schrieffer-Heeger (SSH) model serves as a canonical example of a one-dimensional topological insulator, yet its behavior under more complex, realistic conditions remains a fertile ground for research. This paper presents a comprehensive computational investigation into generalized SSH models, exploring the interplay between topology, quasi-periodic disorder, non-Hermiticity, and time-dependent driving. Using exact diagonalization and specialized numerical solvers, we map the system’s phase space through its spectral properties and localization characteristics, quantified by the Inverse Participation Ratio (IPR). We demonstrate that while the standard SSH model exhibits topologically protected edge states, these are destroyed by a localization transition induced by strong Aubry-André-Harper (AAH) modulation. Further, we employ unsupervised machine learning (PCA) to autonomously classify the system’s phases, revealing that strong localization can obscure underlying topological signatures. Extending the model beyond Hermiticity, we uncover the non-Hermitian skin effect, a dramatic localization of all bulk states at a boundary. Finally, we apply a periodic Floquet drive to a topologically trivial chain, successfully engineering a Floquet topological insulator characterized by the emergence of anomalous edge states at the boundaries of the quasi-energy zone. These findings collectively provide a multi-faceted view of the rich phenomena hosted in generalized 1D topological systems.
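
The Inverse Participation Ratio mentioned in this abstract has a simple standard definition, IPR = sum_i |psi_i|^4 / (sum_i |psi_i|^2)^2, which is close to 1/N for a state spread evenly over N sites and close to 1 for a state localized on a single site. A minimal sketch follows; the function name is illustrative, and the paper may use a different normalization.

```python
def inverse_participation_ratio(psi):
    """IPR of a state given as a sequence of (possibly complex) site amplitudes."""
    weights = [abs(a) ** 2 for a in psi]          # site probabilities (unnormalized)
    norm = sum(weights)
    return sum(w * w for w in weights) / (norm * norm)

# An extended state over 10 sites versus a state localized on one site.
ipr_extended = inverse_participation_ratio([1.0] * 10)
ipr_localized = inverse_participation_ratio([0.0, 0.0, 1.0j, 0.0])
```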

[LG-86] Momentum Multi-Marginal Schrödinger Bridge Matching

Link: https://arxiv.org/abs/2506.10168
Authors: Panagiotis Theodoropoulos, Augustinos D. Saravanos, Evangelos A. Theodorou, Guan-Horng Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: Understanding complex systems by inferring trajectories from sparse sample snapshots is a fundamental challenge in a wide range of domains, e.g., single-cell biology, meteorology, and economics. Despite advancements in Bridge and Flow matching frameworks, current methodologies rely on pairwise interpolation between adjacent snapshots. This hinders their ability to capture long-range temporal dependencies and potentially affects the coherence of the inferred trajectories. To address these issues, we introduce Momentum Multi-Marginal Schrödinger Bridge Matching (3MSBM), a novel matching framework that learns smooth measure-valued splines for stochastic systems that satisfy multiple positional constraints. This is achieved by lifting the dynamics to phase space and generalizing stochastic bridges to be conditioned on several points, forming a multi-marginal conditional stochastic optimal control problem. The underlying dynamics are then learned by minimizing a variational objective, having fixed the path induced by the multi-marginal conditional bridge. As a matching approach, 3MSBM learns transport maps that preserve intermediate marginals throughout training, significantly improving convergence and scalability. Extensive experimentation in a series of real-world applications validates the superior performance of 3MSBM compared to existing methods in capturing complex dynamics with temporal dependencies, opening new avenues for training matching frameworks in multi-marginal settings.

[LG-87] Attention on flow control: transformer-based reinforcement learning for lift regulation in highly disturbed flows

Link: https://arxiv.org/abs/2506.10153
Authors: Zhecheng Liu, Jeff D. Eldredge
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:A linear flow control strategy designed for weak disturbances may not remain effective in sequences of strong disturbances due to nonlinear interactions, but it is sensible to leverage it for developing a better strategy. In the present study, we propose a transformer-based reinforcement learning (RL) framework to learn an effective control strategy for regulating aerodynamic lift in gust sequences via pitch control. The transformer addresses the challenge of partial observability from limited surface pressure sensors. We demonstrate that the training can be accelerated with two techniques – pretraining with an expert policy (here, linear control) and task-level transfer learning (here, extending a policy trained on isolated gusts to multiple gusts). We show that the learned strategy outperforms the best proportional control, with the performance gap widening as the number of gusts increases. The control strategy learned in an environment with a small number of successive gusts is shown to effectively generalize to an environment with an arbitrarily long sequence of gusts. We investigate the pivot configuration and show that quarter-chord pitching control can achieve superior lift regulation with substantially less control effort compared to mid-chord pitching control. Through a decomposition of the lift, we attribute this advantage to the dominant added-mass contribution accessible via quarter-chord pitching. The success on multiple configurations shows the generalizability of the proposed transformer-based RL framework, which offers a promising approach to solve more computationally demanding flow control problems when combined with the proposed acceleration techniques.

[LG-88] Diffusion prior as a direct regularization term for FWI

Link: https://arxiv.org/abs/2506.10141
Authors: Yuke Xie, Hervé Chauris, Nicolas Desassis
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Diffusion models have recently shown promise as powerful generative priors for inverse problems. However, conventional applications require solving the full reverse diffusion process and operating on noisy intermediate states, which poses challenges for physics-constrained computational seismic imaging. In particular, such instability is pronounced in non-linear solvers like those used in Full Waveform Inversion (FWI), where wave propagation through noisy velocity fields can lead to numerical artifacts and poor inversion quality. In this work, we propose a simple yet effective framework that directly integrates a pretrained Denoising Diffusion Probabilistic Model (DDPM) as a score-based generative diffusion prior into FWI through a score rematching strategy. Unlike traditional diffusion approaches, our method avoids the reverse diffusion sampling and needs fewer iterations. We operate the image inversion entirely in the clean image space, eliminating the need to operate through noisy velocity models. The generative diffusion prior can be introduced as a simple regularization term in the standard FWI update rule, requiring minimal modification to existing FWI pipelines. This promotes stable wave propagation and can improve convergence behavior and inversion quality. Numerical experiments suggest that the proposed method offers enhanced fidelity and robustness compared to conventional and GAN-based FWI approaches, while remaining practical and computationally efficient for seismic imaging and other inverse problem tasks.

[LG-89] Fundamental Limits of Learning High-dimensional Simplices in Noisy Regimes ICML2023

Link: https://arxiv.org/abs/2506.10101
Authors: Seyed Amir Hossein Saberi, Amir Najafi, Abolfazl Motahari, Babak H. khalaj
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Extension of our ICML 2023 paper, 44 pages

Click to view abstract

Abstract: In this paper, we establish sample complexity bounds for learning high-dimensional simplices in $\mathbb{R}^K$ from noisy data. Specifically, we consider $n$ i.i.d. samples uniformly drawn from an unknown simplex in $\mathbb{R}^K$, each corrupted by additive Gaussian noise of unknown variance. We prove an algorithm exists that, with high probability, outputs a simplex within $\ell_2$ or total variation (TV) distance at most $\varepsilon$ from the true simplex, provided $n \ge (K^2/\varepsilon^2)\, e^{\mathcal{O}(K/\mathrm{SNR}^2)}$, where $\mathrm{SNR}$ is the signal-to-noise ratio. Extending our prior work (Saberi et al., 2023), we derive new information-theoretic lower bounds, showing that simplex estimation within TV distance $\varepsilon$ requires at least $n \ge \Omega(K^3 \sigma^2/\varepsilon^2 + K/\varepsilon)$ samples, where $\sigma^2$ denotes the noise variance. In the noiseless scenario, our lower bound $n \ge \Omega(K/\varepsilon)$ matches known upper bounds up to constant factors. We resolve an open question by demonstrating that when $\mathrm{SNR} \ge \Omega(K^{1/2})$, noisy-case complexity aligns with the noiseless case. Our analysis leverages sample compression techniques (Ashtiani et al., 2018) and introduces a novel Fourier-based method for recovering distributions from noisy observations, potentially applicable beyond simplex learning.
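
One way to read the high-SNR claim is to plug the threshold into the stated upper bound. The short step below is only a sketch of that substitution, assuming $\mathrm{SNR} \ge c\sqrt{K}$ for some constant $c > 0$ (the constant $c$ is notation introduced here, not in the abstract):

```latex
\frac{K}{\mathrm{SNR}^2} \le \frac{K}{c^2 K} = \frac{1}{c^2}
\quad\Longrightarrow\quad
\frac{K^2}{\varepsilon^2}\, e^{\mathcal{O}(K/\mathrm{SNR}^2)}
\le \frac{K^2}{\varepsilon^2}\, e^{\mathcal{O}(1/c^2)}
= \mathcal{O}\!\left(\frac{K^2}{\varepsilon^2}\right)
```

Under that assumption the exponential factor in the upper bound collapses to a constant, so the noise no longer changes how the bound scales with $K$ and $\varepsilon$.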

[LG-90] Patient-Specific Deep Reinforcement Learning for Automatic Replanning in Head-and-Neck Cancer Proton Therapy

Link: https://arxiv.org/abs/2506.10073
Authors: Malvern Madondo, Yuan Shao, Yingzi Liu, Jun Zhou, Xiaofeng Yang, Zhen Tian
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract: Anatomical changes during intensity-modulated proton therapy (IMPT) for head-and-neck cancer (HNC) can shift Bragg peaks, risking tumor underdosing and organ-at-risk overdosing. As a result, treatment replanning is often required to maintain clinically acceptable treatment quality. However, current manual replanning processes are resource-intensive and time-consuming. We propose a patient-specific deep reinforcement learning (DRL) framework for automated IMPT replanning, with a reward-shaping mechanism based on a 150-point plan quality score addressing competing clinical objectives. We formulate the planning process as an RL problem where agents learn control policies to adjust optimization priorities, maximizing plan quality. Unlike population-based approaches, our framework trains personalized agents for each patient using their planning CT (Computed Tomography) and augmented anatomies simulating anatomical changes (tumor progression and regression). This patient-specific approach leverages anatomical similarities throughout treatment, enabling effective plan adaptation. We implemented two DRL algorithms, Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), using dose-volume histograms (DVHs) as state representations and a 22-dimensional action space of priority adjustments. Evaluation on five HNC patients using actual replanning CT data showed both DRL agents improved initial plan scores from 120.63 ± 21.40 to 139.78 ± 6.84 (DQN) and 142.74 ± 5.16 (PPO), surpassing manual replans generated by a human planner (137.20 ± 5.58). Clinical validation confirms that improvements translate to better tumor coverage and organ-at-risk (OAR) sparing across diverse anatomical changes. This work demonstrates DRL’s potential in addressing geometric and dosimetric complexities of adaptive proton therapy, offering efficient offline adaptation solutions and advancing online adaptive proton therapy.

[LG-91] scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data ICML2025

Link: https://arxiv.org/abs/2506.10031
Authors: Olga Ovcharenko, Florian Barkmann, Philip Toma, Imant Daunhawer, Julia Vogt, Sebastian Schelter, Valentina Boeva
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Accepted at ICML 2025 (Spotlight)

Click to view abstract

Abstract:Self-supervised learning (SSL) has proven to be a powerful approach for extracting biologically meaningful representations from single-cell data. To advance our understanding of SSL methods applied to single-cell data, we present scSSL-Bench, a comprehensive benchmark that evaluates nineteen SSL methods. Our evaluation spans nine datasets and focuses on three common downstream tasks: batch correction, cell type annotation, and missing modality prediction. Furthermore, we systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI, CLAIRE, and the finetuned scGPT excel at uni-modal batch correction, while generic SSL methods, such as VICReg and SimCLR, demonstrate superior performance in cell typing and multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks, surpassing domain-specific augmentations. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.

[LG-92] Identifying critical residues of a protein using meaningfully-thresholded Random Geometric Graphs

Link: https://arxiv.org/abs/2506.10015
Authors: Chuqiao Zhang, Sarath Chandra Dantu, Debarghya Mitra, Dalia Chakrabarty
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: submitted to Journal of Computational and Graphical Statistics

Click to view abstract

Abstract: Identification of critical residues of a protein is actively pursued, since such residues are essential for protein function. We present three ways of recognising critical residues of an example protein, the evolution of which is tracked via molecular dynamical simulations. Our methods are based on learning a Random Geometric Graph (RGG) variable, in which the state variable of each of the 156 residues is attached to a node of the graph, and the RGG is learnt using the matrix of correlations between the state variables of each residue pair. Given the categorical nature of the state variable, the correlation between a residue pair is computed using Cramér’s V. First, we advance an organic thresholding to learn an RGG, and compare results against extant thresholding techniques, when parametrising criticality as the nodal degree in the learnt RGG. Secondly, we develop a criticality measure by ranking the computed differences between the posterior probability of the full graph variable defined on all 156 residues, and that of the graph with all but one residue omitted. A third parametrisation of criticality informs on the dynamical variation of nodal degrees as the protein evolves during the simulation. Finally, we compare results obtained with the three distinct criticality parameters against experimentally ascertained critical residues.
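
Since the correlation between categorical residue states is computed with Cramér's V, a small sketch of that statistic may help. It uses the standard, bias-uncorrected definition V = sqrt(chi2 / (n * (min(r, c) - 1))) on two equally long sequences of category labels; whether the paper applies a bias correction is not stated in the abstract, and the function name and example labels are hypothetical.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Bias-uncorrected Cramér's V between two equally long sequences of categories."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    chi2 = 0.0
    for a, row_count in row.items():
        for b, col_count in col.items():
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))

# Two residues whose states always coincide, and two whose states are unrelated.
v_identical = cramers_v(["a", "b", "c"] * 10, ["a", "b", "c"] * 10)
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

Applying `cramers_v` to every residue pair would give the kind of symmetric correlation matrix that the abstract then thresholds to learn the graph.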

[LG-93] CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

Link: https://arxiv.org/abs/2406.15669
Authors: Jason Yang, Ariane Mora, Shengchao Liu, Bruce J. Wittmann, Anima Anandkumar, Frances H. Arnold, Yisong Yue
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes (CARE). CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task and compare it to the recent method, CLIPZyme. CARE is available at this https URL.

Information Retrieval

[IR-0] Constructing and Evaluating Declarative RAG Pipelines in PyTerrier SIGIR2025

Link: https://arxiv.org/abs/2506.10802
Authors: Craig Macdonald, Jinyuan Fang, Andrew Parry, Zaiqiao Meng
Categories: Information Retrieval (cs.IR)
Comments: 4 pages, 3 tables, Accepted to SIGIR 2025

Click to view abstract

Abstract: Search engines often follow a pipeline architecture, where complex but effective reranking components are used to refine the results of an initial retrieval. Retrieval augmented generation (RAG) is an exciting application of the pipeline architecture, where the final component generates a coherent answer for the users from the retrieved documents. In this demo paper, we describe how such RAG pipelines can be formulated in the declarative PyTerrier architecture, and the advantages of doing so. Our PyTerrier-RAG extension for PyTerrier provides easy access to standard RAG datasets and evaluation measures, state-of-the-art LLM readers, and, through PyTerrier’s unique operator notation, easy-to-build pipelines. We demonstrate the succinctness of indexing and RAG pipelines on standard datasets (including Natural Questions) and how to build on the larger PyTerrier ecosystem with state-of-the-art sparse, learned-sparse, and dense retrievers, and other neural rankers.
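
The declarative, operator-based style described here can be illustrated without the library itself. The sketch below is a pure-Python toy, not PyTerrier or PyTerrier-RAG code: the `Stage` class, the tiny `DOCS` corpus, and the three stage functions are all hypothetical, and only the idea of composing stages with `>>` is borrowed from the abstract.

```python
class Stage:
    """A pipeline stage; `a >> b` builds a new stage that runs a, then b."""
    def __init__(self, fn, name):
        self.fn = fn
        self.name = name

    def __call__(self, state):
        return self.fn(state)

    def __rshift__(self, other):
        return Stage(lambda state: other(self(state)), f"{self.name} >> {other.name}")

# Hypothetical two-document corpus for the toy example.
DOCS = {
    "d1": "PyTerrier composes retrieval stages with operators",
    "d2": "Rossby waves shape heatwaves",
}

def retrieve(state):
    """Rank documents by the number of query terms they share."""
    terms = set(state["query"].lower().split())
    ranked = sorted(DOCS.items(), key=lambda kv: -len(terms & set(kv[1].lower().split())))
    return {**state, "docs": [doc_id for doc_id, _ in ranked]}

def top1(state):
    """Keep only the best-ranked document."""
    return {**state, "docs": state["docs"][:1]}

def read(state):
    """Stand-in for an LLM reader: echo the top document as the answer."""
    return {**state, "answer": DOCS[state["docs"][0]]}

pipeline = Stage(retrieve, "retrieve") >> Stage(top1, "top1") >> Stage(read, "read")
result = pipeline({"query": "retrieval operators"})
```

The point of the design is that the pipeline is a value: it can be named, inspected, and reused before any query is run.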

[IR-1] Context-Adaptive Graph Neural Networks for Next POI Recommendation

Link: https://arxiv.org/abs/2506.10329
Authors: Yu Lei, Limin Shen, Zhu Sun, Tiantian He, Yew-Soon Ong
Categories: Information Retrieval (cs.IR)
Comments: 12 pages, 6 figures

Click to view abstract

Abstract: Next Point-of-Interest (POI) recommendation is a critical task in location-based services, aiming to predict users’ next visits based on their check-in histories. While many existing methods leverage Graph Neural Networks (GNNs) to incorporate collaborative information and improve recommendation accuracy, most of them model each type of context using separate graphs, treating different factors in isolation. This limits their ability to model the co-influence of multiple contextual factors on user transitions during message propagation, resulting in suboptimal attention weights and recommendation performance. Furthermore, they often prioritize sequential components as the primary predictor, potentially undermining the semantic and structural information encoded in the POI embeddings learned by GNNs. To address these limitations, we propose Context-Adaptive Graph Neural Networks (CAGNN) for next POI recommendation, which dynamically adjust attention weights using edge-specific contextual factors and enable mutual enhancement between graph-based and sequential components. Specifically, CAGNN introduces (1) a context-adaptive attention mechanism that jointly incorporates different types of contextual factors into the attention computation during graph propagation, enabling the model to dynamically capture collaborative and context-dependent transition patterns; and (2) a graph-sequential mutual enhancement module, which aligns the outputs of the graph-based and sequential modules via the KL divergence, enabling mutual enhancement of both components. Experimental results on three real-world datasets demonstrate that CAGNN consistently outperforms state-of-the-art methods. We also provide theoretical guarantees that our context-adaptive attention mechanism improves the expressiveness of POI representations.
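
Aligning two modules' outputs via the KL divergence can be sketched in a few lines. The example below assumes each module emits a score per candidate POI, converts both to distributions with a softmax, and uses a symmetric KL term as the alignment loss. The symmetric form, the function names, and the scores are assumptions made for illustration; the abstract only says that the outputs are aligned via the KL divergence.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)

def alignment_loss(graph_scores, seq_scores):
    """Symmetric KL between the two modules' predicted POI distributions (an assumption)."""
    p, q = softmax(graph_scores), softmax(seq_scores)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Hypothetical scores over three candidate POIs from each module.
loss_same = alignment_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_diff = alignment_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero when both modules agree and grows as their predicted distributions diverge, which is what lets each component act as a soft target for the other.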

Attachment Download

Click to download the full list of today's papers
