This post lists the latest papers retrieved from Arxiv.org on 2025-06-19, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-06-19)

A total of 464 papers were updated today, including:

  • Natural Language Processing: 72 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 134 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 164 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning

【Quick Read】: This paper targets the detection of text generated by privately fine-tuned large language models (LLMs), a problem underexplored by existing detection methods. Because users can fine-tune open-source models on private corpora, existing detectors suffer a significant performance drop in practice. The proposed solution, PhantomHunter, centers on a family-aware learning framework that captures family-level traits shared between base models and their derivatives, rather than memorizing individual characteristics, enabling detection of text from unseen privately-tuned LLMs.

Link: https://arxiv.org/abs/2506.15683
Authors: Yuhui Shi, Yehan Yang, Qiang Sheng, Hao Mi, Beizhe Hu, Chaoxi Xu, Juan Cao
Affiliations: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 17 pages, 3 figures, 6 tables

Abstract:With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.
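
To make the family-aware idea concrete, here is a minimal sketch (not the authors' code) of a two-head detector: a shared encoder feeds one head that predicts the base-model family and one that predicts human vs. machine, so supervision pushes the encoder toward family-level traits. The dimensions and the three-family setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FamilyAwareDetector(nn.Module):
    """Illustrative two-head detector: a shared encoder feeds a base-family
    classifier and a human/machine head, so training rewards family-level
    traits instead of memorizing individual fine-tuned models."""
    def __init__(self, feat_dim=768, num_families=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.family_head = nn.Linear(256, num_families)  # e.g. LLaMA / Gemma / Mistral
        self.detect_head = nn.Linear(256, 2)             # human vs. machine

    def forward(self, x):
        h = self.encoder(x)
        return self.family_head(h), self.detect_head(h)

model = FamilyAwareDetector()
x = torch.randn(4, 768)                  # stand-in for text embeddings
family_logits, detect_logits = model(x)
family_y = torch.tensor([0, 1, 2, 0])
detect_y = torch.tensor([1, 1, 0, 1])
loss = nn.functional.cross_entropy(family_logits, family_y) \
     + nn.functional.cross_entropy(detect_logits, detect_y)
loss.backward()
```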

[NLP-1] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

【Quick Read】: This paper addresses the challenge of deploying vision-language models (VLMs) on resource-constrained devices while preserving performance. The key is a general-purpose knowledge distillation framework, Generation after Recalibration (GenRecal), which introduces a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different VLM types.

Link: https://arxiv.org/abs/2506.15681
Authors: Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Affiliations: NVIDIA; KAIST; National Taiwan University
Subjects: Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.
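
As a rough illustration of what a feature-space "Recalibrator" could look like, the sketch below projects student-VLM features into the teacher's dimensionality and applies a cosine distillation loss. It assumes token positions are already aligned, which heterogeneous tokenizers would not guarantee in practice; the module name, dimensions, and loss are hypothetical stand-ins, not the paper's design.

```python
import torch
import torch.nn as nn

class Recalibrator(nn.Module):
    """Toy feature recalibration: map student-VLM hidden states into the
    teacher-VLM feature space before applying a distillation loss."""
    def __init__(self, student_dim=512, teacher_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.LayerNorm(teacher_dim),
        )

    def forward(self, student_feats):
        return self.proj(student_feats)

recal = Recalibrator()
student = torch.randn(2, 16, 512)   # (batch, tokens, dim) from the small VLM
teacher = torch.randn(2, 16, 1024)  # matching features from the large VLM
aligned = recal(student)
# distillation loss computed in the shared (teacher) space
loss = 1 - nn.functional.cosine_similarity(aligned, teacher, dim=-1).mean()
loss.backward()
```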

[NLP-2] Dense SAE Latents Are Features Not Bugs

【Quick Read】: This paper asks whether the dense latents produced when training sparse autoencoders (SAEs) are undesirable training artifacts or semantically meaningful in their own right. The key is a systematic analysis of the geometry, function, and origin of dense latents, showing that they are not only persistent but often reflect meaningful model representations. The study further finds that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs, suggesting that high-density features are an intrinsic property of the residual space. A proposed taxonomy of dense latents also reveals how they evolve across layers, from structural features in early layers, to semantic features in middle layers, to output-oriented signals in the final layers. These findings indicate that dense latents play functional roles in language model computation and should not be dismissed as training noise.

Link: https://arxiv.org/abs/2506.15679
Authors: Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark
Affiliations: MIT; ETH Zürich; University of Sheffield
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are dense), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs – suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
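
The density and antipodal-pair analyses can be illustrated in a few lines of NumPy; the thresholds and the planted pair below are arbitrary stand-ins, not the paper's settings.

```python
import numpy as np

def latent_density(acts, eps=1e-6):
    """Fraction of tokens on which each SAE latent fires (acts: tokens x latents)."""
    return (np.abs(acts) > eps).mean(axis=0)

def antipodal_pairs(decoder, dense_idx, thresh=-0.9):
    """Find pairs of dense latents whose decoder directions are near-opposite,
    i.e. cosine similarity close to -1 (decoder: latents x d_model)."""
    dirs = decoder[dense_idx]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    cos = dirs @ dirs.T
    return [(int(dense_idx[i]), int(dense_idx[j]))
            for i in range(len(dense_idx))
            for j in range(i + 1, len(dense_idx))
            if cos[i, j] < thresh]

rng = np.random.default_rng(0)
acts = rng.standard_normal((10_000, 64)) * (rng.random((10_000, 64)) < 0.05)
decoder = rng.standard_normal((64, 512))
decoder[1] = -decoder[0]                 # plant one antipodal pair
density = latent_density(acts)
dense_idx = np.where(density > 0.02)[0]  # "dense" = fires on >2% of tokens
print(antipodal_pairs(decoder, dense_idx))   # -> [(0, 1)]
```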

[NLP-3] Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

【Quick Read】: This paper tackles the inability of current AI agents to integrate the physical world with digital information: most agents either retrieve and reason over online information or interact with the physical world through embodied perception, planning, and action, but rarely both. The key is the new paradigm of Embodied Web Agents, realized through a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces, supporting tasks that require coordinated reasoning across physical and digital realms.

Link: https://arxiv.org/abs/2506.15677
Authors: Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
Comments:

Abstract:AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page this https URL.

[NLP-4] Gender-Neutral Machine Translation Strategies in Practice

【Quick Read】: This paper addresses how gender-inclusive machine translation (MT) should avoid misgendering and representational harms when the source is gender-ambiguous. The key lies in identifying and categorizing the gender-neutral strategies observed in practice and assessing how sensitive MT systems are to the need for gender neutrality across target languages, revealing both the shortcomings of existing systems in handling gender ambiguity and the ability of a few systems to achieve gender neutrality through specific strategies.

Link: https://arxiv.org/abs/2506.15676
Authors: Hillary Dawkins, Isar Nejadgholi, Chi-kiu Lo
Affiliations: National Research Council Canada (NRC-CNRC)
Subjects: Computation and Language (cs.CL)
Comments: to appear at GITT 2025

Abstract:Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

[NLP-5] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

【Quick Read】: This paper examines privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are usually assumed to be internal and safe, but the paper shows experimentally that they frequently contain sensitive user data, which can be extracted via prompt injection or accidentally leak into outputs. The key contribution is exposing the privacy risks of the reasoning process: test-time compute approaches such as increased reasoning steps make final answers more cautious, yet also make reasoning more verbose and enlarge the privacy attack surface. Safety efforts must therefore cover the model's internal thinking, not just its outputs.

Link: https://arxiv.org/abs/2506.15674
Authors: Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, Seong Joon Oh
Affiliations: Parameter Lab; Data and Web Science Group, University of Mannheim; UKP Lab, Technical University of Darmstadt; NAVER AI Lab; University of Tübingen; Tübingen AI Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model’s internal thinking, not just its outputs.

[NLP-6] CC-LEARN: Cohort-based Consistency Learning

【Quick Read】: This paper targets the lack of consistent, robust reasoning in large language models (LLMs). The key is Cohort-based Consistency Learning (CC-Learn), which improves reasoning reliability through reinforcement learning on cohorts of similar questions derived from shared programmatic abstractions. CC-Learn defines a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups, guiding the model to adopt uniform reasoning patterns within a cohort.

Link: https://arxiv.org/abs/2506.15662
Authors: Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu, Avneet Ahuja, Ming Shen, Zhikun Xu, Ben Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
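
A toy version of the composite objective might look as follows; the weighting of the retrieval bonus and rejection penalty is an assumption, not the paper's values.

```python
def cc_learn_reward(cohort_answers, gold_answers, lookups,
                    bonus=0.2, penalty=0.3):
    """Illustrative composite objective in the spirit of CC-Learn:
    cohort-level accuracy, plus a bonus for effective retrieval calls,
    minus a penalty for trivial or invalid lookups. Weights are made up."""
    acc = sum(a == g for a, g in zip(cohort_answers, gold_answers)) / len(gold_answers)
    retrieval_bonus = bonus * sum(1 for q in lookups if q["valid"]) / max(len(lookups), 1)
    rejection_penalty = penalty * sum(1 for q in lookups if not q["valid"]) / max(len(lookups), 1)
    return acc + retrieval_bonus - rejection_penalty

lookups = [{"valid": True}, {"valid": False}, {"valid": True}]
print(cc_learn_reward(["42", "7", "x"], ["42", "7", "y"], lookups))
```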

[NLP-7] AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning

【Quick Read】: This paper aims to remove the manual rule engineering that reinforcement learning from human feedback (RLHF) currently relies on, thereby improving reward modeling. The key is AutoRule, a fully automated method that extracts rules from preference feedback and turns them into rule-based rewards. AutoRule extracts rules in three stages: a reasoning model interprets user preferences, candidate rules are identified from the reasoning chains of those interpretations, and the rules are synthesized into a unified rule set. A language-model verifier then computes the fraction of rules each output satisfies, and this metric serves as an auxiliary reward alongside the learned reward model during policy optimization.

Link: https://arxiv.org/abs/2506.15651
Authors: Tevin Wang, Chenyan Xiong
Affiliations: Carnegie Mellon University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at this https URL.
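
The auxiliary-reward combination is easy to sketch: score an output by the fraction of rules it satisfies and add that to the learned reward model's score. The mixing weight and the predicate-style rules below are illustrative; in the paper the rule checking is done by language-model verifiers, not string predicates.

```python
def rule_reward(output, rules):
    """Fraction of extracted rules an output satisfies; each rule here is a
    simple predicate standing in for a language-model verifier's judgment."""
    return sum(rule(output) for rule in rules) / len(rules)

def combined_reward(output, rules, learned_rm_score, lam=0.5):
    """Auxiliary rule-based reward added to the learned reward model score
    during policy optimization (lam is an assumed mixing weight)."""
    return learned_rm_score + lam * rule_reward(output, rules)

rules = [
    lambda o: len(o.split()) < 120,                 # e.g. "be concise"
    lambda o: not o.lower().startswith("as an ai"),
    lambda o: o.strip().endswith((".", "!", "?")),
]
print(combined_reward("Sure - here is a short, direct answer.", rules, 0.8))
```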

[NLP-8] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

【Quick Read】: This paper addresses authorship attribution for Romanian texts. The key is to use character n-gram features with six machine learning techniques for classification; the artificial neural network (ANN) performed best with 5-gram features, achieving perfect classification in four out of fifteen runs. This shows that a lightweight, interpretable character n-gram approach can reach state-of-the-art accuracy in resource-constrained or under-studied language settings.

Link: https://arxiv.org/abs/2506.15650
Authors: Dana Lupsa, Sanda-Maria Avram
Affiliations: Babeș-Bolyai University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
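
The winning configuration, character 5-grams feeding a small neural network, maps directly onto a few lines of scikit-learn; the Romanian snippets and author labels below are placeholders rather than ROST data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Character 5-grams + a small ANN, mirroring the best-performing setup
# reported above (the corpus and labels here are toy placeholders).
texts = ["Ana are mere si pere.", "Codrul verde ne cheama.",
         "Ana cumpara mere rosii.", "Frunza verde de stejar."]
authors = ["A", "B", "A", "B"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
clf.fit(texts, authors)
print(clf.predict(["Codrul ne cheama la umbra de stejar."]))
```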

[NLP-9] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability ACL2025

【Quick Read】: This paper addresses the shortcomings of generative large language models (LLMs) in instruction following and compositional generalization, especially in tasks that require generating sentences with concepts in a specified order. The key is the proposed Ordered CommonGen benchmark, which measures ordered coverage, i.e., whether concepts are generated in the specified order, thereby evaluating compositional generalization and instruction following at the same time.

Link: https://arxiv.org/abs/2506.15629
Authors: Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Affiliations: Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main

Abstract:In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
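
One plausible way to compute an ordered-coverage score (the benchmark's exact definition may differ) is a left-to-right scan that only credits a concept if it appears after the previous match:

```python
def ordered_coverage(sentence, concepts):
    """Credit each concept only if it appears after the previous match,
    so out-of-order generations score lower than in-order ones."""
    pos, hits = 0, 0
    lowered = sentence.lower()
    for c in concepts:
        i = lowered.find(c.lower(), pos)
        if i == -1:
            break
        hits += 1
        pos = i + len(c)
    return hits / len(concepts)

print(ordered_coverage("The dog chased the ball into the river.",
                       ["dog", "ball", "river"]))   # 1.0
print(ordered_coverage("The ball hit the dog near the river.",
                       ["dog", "ball", "river"]))   # order broken -> 1/3
```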

[NLP-10] Minding the Politeness Gap in Cross-cultural Communication

【Quick Read】: This paper addresses misunderstandings in cross-cultural communication caused by differences in interpretation, focusing on how native speakers of British and American English interpret intensifiers such as "quite" and "very". The key is a computational cognitive model in which listeners recursively reason about speakers who balance informativity, politeness, and utterance cost, revealing that cross-cultural differences stem from different combinations of literal meanings and utterance-cost weights rather than from semantic variation or politeness norms alone.

Link: https://arxiv.org/abs/2506.15623
Authors: Yuka Machino, Matthias Hofer, Max Siegel, Joshua B. Tenenbaum, Robert D. Hawkins
Affiliations: MIT; Stanford University
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like “quite” and “very.” To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings, (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.
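
The recursive model described here is in the rational-speech-acts family; a miniature version with a made-up lexicon, costs, and weights shows the moving parts (a literal listener, a speaker utility trading off informativity, politeness, and cost, and a pragmatic listener). Varying the literal thresholds or the cost weight mimics the cross-cultural differences the paper models; none of the numbers below come from the paper.

```python
import numpy as np

states = np.array([1, 2, 3, 4])                  # how good the cake really was
utterances = ["okay", "quite good", "very good"]
literal = np.array([[1, 1, 1, 1],                # "okay": true of everything
                    [0, 0, 1, 1],                # "quite good": states >= 3
                    [0, 0, 0, 1]])               # "very good": state 4 only
cost = np.array([0.0, 0.5, 0.5])                 # marked forms cost more
alpha, w_inform, w_polite = 2.0, 1.0, 0.6

L0 = literal / literal.sum(axis=1, keepdims=True)        # literal listener
with np.errstate(divide="ignore"):
    informativity = np.log(L0)
politeness = states[None, :] * literal                    # face-saving value
U = w_inform * informativity + w_polite * politeness - cost[:, None]
S1 = np.exp(alpha * U)
S1 = S1 / S1.sum(axis=0, keepdims=True)                   # speaker: P(u | state)
L1 = S1 / S1.sum(axis=1, keepdims=True)                   # pragmatic listener
print(np.round(L1, 2))   # P(state | utterance) under these made-up weights
```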

[NLP-11] The Compositional Architecture of Regret in Large Language Models

【Quick Read】: This paper addresses how to identify and analyze the regret mechanism in large language models: accurately capturing regret expressions in model outputs, determining the optimal regret representation layer, and identifying and analyzing regret neurons. The key contributions are threefold: a comprehensive regret dataset built from strategically designed prompting scenarios; the Supervised Compression-Decoupling Index (S-CDI) for identifying the optimal regret representation layer; and the Regret Dominance Score (RDS) for identifying regret neurons, together with the Group Impact Coefficient (GIC) for analyzing activation patterns. These methods improve performance in probe classification experiments and reveal how the model's internal information processing alternates between coupling and decoupling phases.

Link: https://arxiv.org/abs/2506.15617
Authors: Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang
Affiliations: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University; University of Exeter; The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages

Abstract:Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model’s hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.

[NLP-12] LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

【Quick Read】: This paper addresses the problem that the safety protections of aligned large language models (LLMs) can still be weakened by subsequent fine-tuning. The key is Low-Rank Extrapolation (LoX), a training-free method that enhances safety robustness by extrapolating the safety subspace of an aligned LLM. The core idea is to use the safety-critical low-rank subspace to move model parameters into a flatter region, making them less sensitive to fine-tuning perturbations.

Link: https://arxiv.org/abs/2506.15606
Authors: Gabrel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong
Affiliations: University of São Paulo; University of Texas at Austin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model’s adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. The code is available at this http URL.
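
The core operation, extrapolating along a low-rank "safety" subspace of the alignment update, can be sketched with an SVD; the rank and extrapolation factor here are illustrative choices, not the paper's published recipe.

```python
import torch

def lox_extrapolate(w_base, w_aligned, rank=8, beta=0.5):
    """Training-free sketch: take the alignment update (w_aligned - w_base),
    keep its top-`rank` singular directions as the safety subspace, and
    extrapolate further along them. rank/beta are illustrative choices."""
    delta = w_aligned - w_base
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    return w_aligned + beta * low_rank   # push past the aligned point

torch.manual_seed(0)
w_base = torch.randn(256, 256)
w_aligned = w_base + 0.01 * torch.randn(256, 256)
w_robust = lox_extrapolate(w_base, w_aligned)
print((w_robust - w_aligned).norm() / (w_aligned - w_base).norm())
```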

[NLP-13] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

【Quick Read】: This paper addresses the time and cost of manually creating multiple-choice questions (MCQs) with varying difficulty levels and targeted reading skills, along with the limited attention paid to the quality and reliability of MCQs generated by generative AI in real-world settings. The key is to use current generative models to automate the generation of MCQs that align with curriculum-relevant narrative elements and span difficulty levels, and to assess their suitability for elementary school students through expert review and psychometric analysis of student responses.

Link: https://arxiv.org/abs/2506.15598
Authors: Bernardo Leite, Henrique Lopes Cardoso, Pedro Pinto, Abel Ferreira, Luís Abreu, Isabel Rangel, Sandra Monteiro
Affiliations: FEUP; Porto Editora; Agrupamento de Escolas Lourenço Marques
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This is a preprint version of the manuscript currently under review at an international journal

Abstract:While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention – particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.

[NLP-14] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts ACL2025

【Quick Read】: This paper addresses the challenge of long-context multimodal reasoning in automatic document understanding (DU), particularly for documents with complex layouts, tables, and charts. The key is the WikiMixQA benchmark, comprising 1,000 multiple-choice questions that evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages, emphasizing complex reasoning that synthesizes information across modalities.

Link: https://arxiv.org/abs/2506.15594
Authors: Negar Foroutan, Angelika Romanou, Matin Ansaripour, Julian Martin Eisenschlos, Karl Aberer, Rémi Lebret
Affiliations: EPFL; Google DeepMind; Universidad Nacional de Córdoba
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL 2025 (Findings)

Abstract:Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

[NLP-15] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

【Quick Read】: This paper addresses the problem that when vision-language models (VLMs) generate multi-sentence visual descriptions, conventional text scene graph parsers, designed only for single-sentence caption-to-graph mapping, cannot handle phenomena such as cross-sentence coreference, yielding fragmented scene graphs and degraded downstream performance. The key is a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by the DiscoSG-DS dataset of 400 expert-annotated and 8,430 synthesized multi-sentence caption-graph pairs. To reduce computational cost, the proposed DiscoSG-Refiner framework first drafts a base graph with a small pretrained language model (PLM) and then has a second PLM iteratively propose graph edits, greatly improving inference efficiency while preserving parsing quality.

Link: https://arxiv.org/abs/2506.15583
Authors: Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li
Affiliations: Wuhan University; Monash University; RMIT
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: this https URL

[NLP-16] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

【Quick Read】: This paper addresses the evaluation of foundation models for multimodal scientific claim verification, aiming to assess reasoning and comprehension over scientific literature via a dedicated benchmark. The key is SciVer, a benchmark of 3,000 expert-annotated examples over 1,113 scientific papers, with expert-annotated supporting evidence for each example, enabling fine-grained evaluation of model performance.

Link: https://arxiv.org/abs/2506.15569
Authors: Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
Affiliations: Yale NLP Lab
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models’ comprehension and reasoning in multimodal scientific literature tasks.

[NLP-17] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models ACL2025

【Quick Read】: This paper addresses the shortcomings of large language models (LLMs) in gender fairness, particularly inclusivity of binary and non-binary genders. Whereas existing work mostly focuses on binary gender distinctions, the proposed Gender Inclusivity Fairness Index (GIFI) is a new comprehensive metric whose key is multilevel evaluation, from simple pronoun probing to testing model generation and cognitive behaviors under different gender assumptions, to quantify the gender inclusivity of LLMs and reveal biases associated with different gender identifiers.

Link: https://arxiv.org/abs/2506.15568
Authors: Zhengyang Shan, Emily Ruth Diana, Jiawei Zhou
Affiliations: Boston University; Carnegie Mellon University; Stony Brook University
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 Main

Abstract:We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs’ gender inclusivity. Our study highlights the importance of improving LLMs’ inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

[NLP-18] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

【Quick Read】: This paper aims to reduce the noticeable latency that large language models (LLMs) introduce in real-time voice chat applications, where the time to generate the first sentence delays audio output and degrades user experience. The key is Predictive Generation (PredGen), a framework that performs speculative decoding while the user is still speaking, generating candidate responses in advance so that text-to-speech (TTS) processing can begin with minimal delay once the user finishes, reducing overall latency.

Link: https://arxiv.org/abs/2506.15556
Authors: Shufan Li, Aditya Grover
Affiliations: University of California, Los Angeles
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 16 pages, 4 figures

Abstract:Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused.
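
In pseudocode terms, input-time speculation amounts to drafting a reply against the partial transcript and reusing it when the final transcript confirms the prefix. The sketch below is not the paper's implementation; `generate` and `tts_enqueue` are hypothetical stand-ins for the LLM call and the sentence-level TTS queue.

```python
def predictive_generation(stream, generate, tts_enqueue):
    """Draft a response while the user is still talking; once the utterance
    is final, reuse the draft if the transcript it was based on is still a
    prefix of the final transcript, otherwise regenerate."""
    draft_src, draft = None, None
    for partial, is_final in stream:            # (transcript so far, done?)
        if not is_final:
            draft_src, draft = partial, generate(partial)   # speculate
        else:
            if draft_src is not None and partial.startswith(draft_src):
                first_sentence = draft          # speculation pays off
            else:
                first_sentence = generate(partial)           # fall back
            tts_enqueue(first_sentence)         # TTS starts immediately
            return first_sentence

fake_stream = [("What is the capital of", False),
               ("What is the capital of France?", True)]
predictive_generation(fake_stream,
                      generate=lambda t: "The capital of France is Paris.",
                      tts_enqueue=print)
```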

[NLP-19] Approximating Language Model Training Data from Weights

【Quick Read】: This paper addresses the problem of approximating training data from model weights, a common scenario for modern language models with open weights but closed training data. The key is a gradient-based method that selects, from a large public text corpus, the data that best matches the model weights, effectively recovering useful training data. Experiments show that even without any true training data, the method can locate a small subset of public web documents sufficient to train models approaching the original model's performance.

Link: https://arxiv.org/abs/2506.15553
Authors: John X. Morris, Junjie Oscar Yin, Woojeong Kim, Vitaly Shmatikov, Alexander M. Rush
Affiliations: Cornell University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to performance close to that of the original model, for models trained with both classification and supervised fine-tuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model’s perplexity of 2.0.
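
A gradient-matching selector of this kind can be demonstrated on a toy model: documents whose negative loss gradient at the base weights aligns with the observed weight change score highly, because training on them would have pushed the weights in that direction. The single-tensor setup below illustrates the scoring principle only, not the paper's large-scale procedure.

```python
import torch

def grad_alignment_score(model_base, delta, loss_fn, doc):
    """Score a candidate document by how well its (negative) loss gradient
    at the base weights aligns with the observed weight change `delta`."""
    model_base.weight.grad = None
    loss_fn(model_base, doc).backward()
    g = model_base.weight.grad.flatten()
    return torch.nn.functional.cosine_similarity(-g, delta.flatten(), dim=0)

torch.manual_seed(0)
base = torch.nn.Linear(32, 2)
ft = torch.nn.Linear(32, 2)
doc_a = (torch.randn(8, 32), torch.randint(0, 2, (8,)))  # "training" doc
doc_b = (torch.randn(8, 32), torch.randint(0, 2, (8,)))  # unrelated doc
loss_fn = lambda m, d: torch.nn.functional.cross_entropy(m(d[0]), d[1])

# pretend `ft` was fine-tuned on doc_a: one gradient step from the base
ft.load_state_dict(base.state_dict())
loss_fn(ft, doc_a).backward()
with torch.no_grad():
    ft.weight -= 0.1 * ft.weight.grad
delta = ft.weight.detach() - base.weight.detach()

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name, float(grad_alignment_score(base, delta, loss_fn, doc)))
# doc_a scores ~1.0; doc_b scores near 0
```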

[NLP-20] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

【Quick Read】: This paper addresses the Pareto tradeoff in choosing window sizes for local-global attention models: larger windows preserve performance comparable to full attention but offer limited efficiency gains, while smaller windows can degrade performance. The key is RATTENTION, a variant that combines local attention with a specialized linear attention mechanism designed to capture context outside the window, improving efficiency while maintaining performance.

Link: https://arxiv.org/abs/2506.15545
Authors: Bailin Wang, Chang Lan, Chong Wang, Ruoming Pang
Affiliations: Apple
Subjects: Computation and Language (cs.CL)
Comments: 9 pages

Abstract:Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention – its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
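
A single-head sketch conveys the shape of the idea: softmax attention inside the window plus a kernelized running state for everything older. The additive combination rule and the ELU feature map below are assumptions for illustration, not RATTENTION's published formulation.

```python
import torch

def hybrid_attention(q, k, v, window=4):
    """Toy single-head local attention plus a linear-attention summary of
    out-of-window tokens; each token enters the running state exactly once,
    when it falls out of the sliding window."""
    T, d = q.shape
    phi = lambda x: torch.nn.functional.elu(x) + 1     # positive feature map
    S = torch.zeros(d, d)      # running sum of phi(k) v^T for old tokens
    z = torch.zeros(d)         # running sum of phi(k)
    out = []
    for t in range(T):
        lo = max(0, t - window + 1)
        if lo > 0:             # token (lo-1) just left the window
            j = lo - 1
            S = S + torch.outer(phi(k[j]), v[j])
            z = z + phi(k[j])
        att = torch.softmax(q[t] @ k[lo:t + 1].T / d**0.5, dim=-1)
        local = att @ v[lo:t + 1]
        qf = phi(q[t])
        far = (qf @ S) / (qf @ z + 1e-6)   # linear attention over old tokens
        out.append(local + far)            # assumed additive combination
    return torch.stack(out)

torch.manual_seed(0)
q = k = v = torch.randn(10, 8)
print(hybrid_attention(q, k, v).shape)     # torch.Size([10, 8])
```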

[NLP-21] Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

【Quick Read】: This paper addresses two key problems in feature description methods for neural networks: limited robustness, and the flawed assumption that each neuron encodes a single concept (monosemanticity), when neurons are in fact often polysemantic. This assumption limits the expressiveness of feature descriptions and their ability to capture the full range of behaviors encoded in model internals. The key is the Polysemantic FeatuRe Identification and Scoring Method (PRISM), a framework that captures the inherent complexity of neural network features and provides more nuanced descriptions for both polysemantic and monosemantic features, improving the accuracy and faithfulness of feature descriptions.

Link: https://arxiv.org/abs/2506.15538
Authors: Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).

[NLP-22] Lessons from Training Grounded LLMs with Verifiable Rewards

【Quick Read】: This paper addresses the challenge of getting large language models (LLMs) to generate grounded, trustworthy answers: within retrieval-augmented generation (RAG), instruction-tuned models still miss explicitly stated answers, cite incorrectly, or refuse even when evidence is available. The key is to use reinforcement learning (RL) and internal reasoning to improve grounding, specifically via GRPO (Group Relative Policy Optimization) with verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations.

Link: https://arxiv.org/abs/2506.15522
Authors: Shang Hong Sim, Tej Deep Pala, Vernon Toh, Hai Leong Chieu, Amir Zadeh, Chuan Li, Navonil Majumder, Soujanya Poria
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
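
A verifiable, outcome-based reward of the kind described can be as simple as string checks over the answer, its citations, and refusal behavior; the weights and the refusal heuristic below are simplified stand-ins for the paper's verifiers.

```python
def grounding_reward(answer, cited_ids, docs, gold, answerable):
    """Toy verifiable reward: correctness, citation sufficiency (cited
    passages actually contain the answer), and refusal quality on
    unanswerable queries."""
    refused = answer.strip().lower().startswith("i cannot")
    if not answerable:
        return 1.0 if refused else -1.0       # reward well-placed refusals
    correct = float(gold.lower() in answer.lower())
    support = float(any(gold.lower() in docs[i].lower() for i in cited_ids)) \
              if cited_ids else 0.0
    return correct + 0.5 * support - (1.0 if refused else 0.0)

docs = {1: "Paris is the capital of France.", 2: "The Seine flows north."}
print(grounding_reward("Paris [1]", cited_ids=[1], docs=docs,
                       gold="Paris", answerable=True))                 # 1.5
print(grounding_reward("I cannot answer from the given documents.",
                       cited_ids=[], docs=docs, gold="", answerable=False))  # 1.0
```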

[NLP-23] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge ACL2025

【Quick Read】: This paper targets the detection of hyperbole and metaphor in text, a task of real significance for natural language processing (NLP) but a challenging one given their semantic obscurity and expressive diversity. Traditional methods rely mainly on surface text features, ignoring the association between hyperbole and metaphor and the effect of implicit emotion on perceiving these rhetorical devices. The key is EmoBi, an emotion-guided detection framework based on bidirectional dynamic interaction: an emotion analysis module mines the emotional connotations behind hyperbole and metaphor, an emotion-based domain mapping module identifies target and source domains to deepen understanding of implicit meaning, and a bidirectional dynamic interaction module lets hyperbole and metaphor detection reinforce each other, with a verification mechanism ensuring accuracy and reliability.

Link: https://arxiv.org/abs/2506.15504
Authors: Li Zheng, Sihang Wang, Hao Fei, Zuquan Peng, Fei Li, Jianming Fu, Chong Teng, Donghong Ji
Affiliations: Wuhan University; National University of Singapore
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by ACL 2025

Abstract:Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.

[NLP-24] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

【Quick Read】: This paper addresses efficient, high-quality automatic process annotation for large language models (LLMs), a bottleneck for further improving multi-step reasoning. The key is Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a structured framework that enables single-pass, per-step annotation by aligning each solution step to one or more steps in a reference solution, accompanied by explicit reasoning for the evaluation.

Link: https://arxiv.org/abs/2506.15498
Authors: Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages main content, 4 figures, 4 tables

Abstract:Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

[NLP-25] Context-Informed Grounding Supervision

【Quick Read】: This paper addresses the lack of grounded generation in large language models (LLMs), i.e., the failure to effectively use provided external knowledge to produce accurate, factual responses. The key is Context-INformed Grounding Supervision (CINGS), a post-training supervision scheme in which the model is trained with relevant context prepended to the response, while the loss is computed only over the response tokens and the context is masked out, reinforcing the model's reliance on and use of external context.

Link: https://arxiv.org/abs/2506.15480
Authors: Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo
Affiliations: KAIST AI; Adobe Research; Mila – Quebec AI Institute; NAVER Cloud AI; UNC Chapel Hill
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model’s LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model’s prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
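
The masking itself is a one-liner in a standard next-token loss: positions belonging to the prepended context get the ignore label, so gradients flow only through response tokens and the model learns to condition on the context without being trained to emit it. A self-contained sketch (dimensions and the fixed context length are illustrative):

```python
import torch
import torch.nn.functional as F

def cings_style_loss(logits, input_ids, context_len):
    """Next-token loss computed only on response tokens: context positions
    are labeled -100 and ignored by cross_entropy."""
    labels = input_ids.clone()
    labels[:, :context_len] = -100                 # mask the context span
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)

vocab, T, ctx = 100, 12, 8                         # 8 context + 4 response tokens
logits = torch.randn(2, T, vocab, requires_grad=True)
input_ids = torch.randint(0, vocab, (2, T))
loss = cings_style_loss(logits, input_ids, context_len=ctx)
loss.backward()
print(float(loss))
```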

[NLP-26] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation ICML2025

【Quick Read】: This paper asks whether the high accuracy of current large language models (LLMs) on reasoning benchmarks reflects genuine reasoning or statistical recall of the training set. The key is RE-IMAGINE, a framework that builds a hierarchy of reasoning ability based on the three levels of the ladder of causation (association, intervention, counterfactual), together with an automated pipeline that generates problem variations at each level. By altering the intermediate symbolic representation of problems, it generates arbitrarily many problems that cannot be solved by memorization alone, effectively probing true reasoning ability.

Link: https://arxiv.org/abs/2506.15455
Authors: Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICML 2025

Abstract:Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.

[NLP-27] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent Systems Need

【Quick Read】: This paper addresses key challenges for LLM-based multi-agent systems in architecture design, cross-domain generalization, and performance guarantees, especially as task complexity and the number of agents grow. The key lies in three core innovations: a divide-and-conquer, fully parallel architecture that decomposes user queries into hierarchical task-forest structures for dependency management and distributed concurrent processing; an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and agent-organization optimization strategies that combine divide-and-conquer problem decomposition. Together these improve performance and generalization.

Link: https://arxiv.org/abs/2506.15451
Authors: Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao
Affiliations: Fudan University; Rhine AI; East China Normal University; Xiaohongshu
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing. (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics. (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2’s superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at this https URL.

[NLP-28] Understanding GUI Agent Localization Biases through Logit Sharpness

【Quick Read】: This paper addresses the systematic localization errors caused by hallucinations in multimodal large language models (MLLMs) used as GUI agents, which undermine reliability. The key is a fine-grained evaluation framework that categorizes model predictions into four types, revealing failure modes that traditional accuracy metrics miss, along with the Peak Sharpness Score (PSS) to quantify model uncertainty more precisely. The paper further proposes Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining the input context.

Link: https://arxiv.org/abs/2506.15425
Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Zhicheng Yang, Jing Tang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; University of California, Merced; The University of Queensland
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations-systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.

[NLP-29] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning

【Quick Read】: This paper addresses cross-lingual lexical alignment for low-resource languages (LRLs) in large language models (LLMs), a problem caused by data scarcity and underrepresentation in pre-training. The key of the proposed Targeted Lexical Injection (TLI) is to exploit the strong cross-lingual lexical alignment already present in early internal layers, fine-tuning with Low-Rank Adaptation (LoRA) and a contrastive learning objective to improve lexical alignment at the output level.

Link: https://arxiv.org/abs/2506.15415
Authors: Stanley Ngugi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, 2 tables. Research on parameter-efficient fine-tuning (PEFT) for low-resource languages (Swahili). Investigates cross-lingual lexical alignment in Lugha-Llama using LoRA and contrastive learning

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model’s ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
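
A contrastive objective over aligned word pairs, the kind of loss TLI applies to early-layer embeddings, can be written as a batch InfoNCE. In the actual method these embeddings would come from Layer 2 of the LoRA-tuned model; the random tensors and temperature below are stand-ins.

```python
import torch
import torch.nn.functional as F

def pairwise_infonce(sw_emb, en_emb, tau=0.07):
    """Contrastive loss over aligned word pairs: each Swahili embedding
    should match its own English translation against all others in the
    batch (tau is an assumed temperature)."""
    sw = F.normalize(sw_emb, dim=-1)
    en = F.normalize(en_emb, dim=-1)
    logits = sw @ en.T / tau                 # batch x batch similarities
    targets = torch.arange(len(sw))          # positives on the diagonal
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
sw, en = torch.randn(16, 256), torch.randn(16, 256)
print(float(pairwise_infonce(sw, en)))
```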
zh
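
为便于理解 TLI 中“对早期层嵌入做对比对齐”的思路,下面给出一个最小的 PyTorch 示意(非论文官方实现;批大小、维度、温度系数均为假设,实际训练中可训练参数应仅为 LoRA 适配器):

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(sw_emb, en_emb, temperature=0.07):
    """对平行词对嵌入做 InfoNCE 式对比学习:
    第 i 个斯瓦希里词的正例是第 i 个英语词,批内其余词为负例。"""
    sw = F.normalize(sw_emb, dim=-1)
    en = F.normalize(en_emb, dim=-1)
    logits = sw @ en.T / temperature          # (B, B) 余弦相似度矩阵
    targets = torch.arange(sw.size(0), device=sw.device)
    return F.cross_entropy(logits, targets)

# 示意:一个 batch 内 8 对词,嵌入取自早期层(如论文中的 Layer 2)
sw_emb = torch.randn(8, 4096, requires_grad=True)
en_emb = torch.randn(8, 4096)
loss = pairwise_contrastive_loss(sw_emb, en_emb)
loss.backward()
print(loss.item())
```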

[NLP-30] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation ACL2025

【速读】: 该论文旨在解决印度语言在评论感知的多模态和多语言摘要任务中研究不足的问题,提出了一种名为COSMMIC的开创性多模态、多语言数据集,该数据集包含九种主要印度语言。其关键解决方案是通过整合读者评论、图像和文本信息,提升摘要质量,并通过结合先进的语言模型(如LLama3和GPT-4)以及使用IndicBERT和基于CLIP的多语言分类器进行评论筛选和图像特征提取,以优化自然语言生成(NLG)任务的有效配置。

链接: https://arxiv.org/abs/2506.15372
作者: Raghvendra Kumar,S. A. Mohammed Salman,Aryan Sahu,Tridib Nandi,Pragathi Y. P.,Sriparna Saha,Jose G. Moreno
机构: Indian Institute of Technology Patna(印度理工学院巴特那分校); National Institute of Technology Tiruchirappalli(印度国立理工学院蒂鲁吉拉帕利分校); BITS Pilani – Goa Campus(比尔拉理工学院果阿校区); Indian Institute of Information Technology Vadodara(印度信息技术学院瓦多达拉分校); B.M.S. College of Engineering(B.M.S.工程学院); Université de Toulouse, IRIT UMR 5505 CNRS(图卢兹大学,IRIT UMR 5505 CNRS)
类目: Computation and Language (cs.CL)
备注: ACL 2025 MAINs

点击查看摘要

Abstract:Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset’s effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
zh

[NLP-31] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models Knowledge of Indian Culture ACL2025

【速读】: 该论文试图解决语言模型(Language Models, LMs)在全球范围内的有效性受限于对本地社会文化背景理解不足的问题,特别是针对印度丰富的文化多样性。解决方案的关键在于提出SANSKRITI,这是一个包含21,853个精心策划的问答对的数据集,覆盖印度28个州和8个联邦属地,涵盖了十六个关键的印度文化属性,旨在全面评估语言模型对印度文化的理解能力。通过该基准测试,研究揭示了主流大型语言模型(Large Language Models, LLMs)、印地语语言模型(Indic Language Models, ILMs)和小型语言模型(Small Language Models, SLMs)在处理文化细微查询时存在显著差异,从而为提升语言模型的文化理解能力提供了新的标准和方向。

链接: https://arxiv.org/abs/2506.15355
作者: Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); Banasthali Vidyapeeth University (巴纳斯塔利维达佩思大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models’ comprehension of India’s rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India’s cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
zh

[NLP-32] DeVisE: Behavioral Testing of Medical Large Language Models

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在临床决策支持中评估方法不足的问题,即现有评估方法难以区分真实的医学推理与表面模式。其解决方案的关键在于引入DeVisE(Demographics and Vital signs Evaluation)行为测试框架,通过构建包含重症监护室(ICU)出院记录的数据集,并生成真实和基于模板的版本,以控制单变量反事实情境,从而深入探究模型对人口统计学和生命体征属性的理解能力。

链接: https://arxiv.org/abs/2506.15339
作者: Camila Zurdo Tagliabue,Heloisa Oss Boll,Aykut Erdem,Erkut Erdem,Iacer Calixto
机构: Amsterdam UMC (阿姆斯特丹大学医学中心); University of Amsterdam (阿姆斯特丹大学); Amsterdam Public Health, Methodology (阿姆斯特丹公共卫生,方法学); Amsterdam Public Health, Mental Health (阿姆斯特丹公共卫生,心理健康); Koç University (科奇大学); Hacettepe University (哈切特佩大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
zh
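
摘要中的“输入级敏感性”可以用一段很短的代码说明:比较同一出院记录在单变量反事实替换前后的负对数似然变化。以下为最小示意(gpt2 仅作占位模型,示例文本为虚构,并非 MIMIC-IV 数据):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def note_nll(model, tok, text):
    """整段文本在模型下的平均负对数似然(越低表示模型认为越“合理”)。"""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

base = "Patient is a 45-year-old male with stable vital signs."
cf   = "Patient is an 82-year-old male with stable vital signs."  # 单变量反事实:年龄
print("Delta NLL =", note_nll(model, tok, cf) - note_nll(model, tok, base))
```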

[NLP-33] When and How Unlabeled Data Provably Improve In-Context Learning

【速读】: 该论文试图解决在存在缺失标签的示例情况下,如何有效利用无标签数据以提升模型性能的问题。其解决方案的关键在于利用多层或循环的Transformer架构,通过隐式构建形式为 \sum_{i\ge 0} a_i (X^\top X)^i X^\top y 的估计器,其中 X 与 y 分别表示特征和部分观测的标签(缺失条目设为零),从而有效地利用无标签数据。该方法通过深度或循环结构实现高阶多项式表达,并与期望最大化(Expectation Maximization)算法建立联系,表明适度的深度或循环即可实现显著的半监督学习性能提升。

链接: https://arxiv.org/abs/2506.15329
作者: Yingcong Li,Xiangyu Chang,Muti Kara,Xiaofeng Liu,Amit Roy-Chowdhury,Samet Oymak
机构: University of Michigan (密歇根大学); University of California, Riverside (加州大学河滨分校); Bilkent University (比尔肯特大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recover the optimal fully-supervised estimator but completely fail to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form \sum_{i\ge 0} a_i (X^\top X)^i X^\top y, with X and y denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so mild amount of depth/looping suffices. As an application of theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves the semisupervised tabular learning performance over the standard single pass inference.
zh
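
摘要中的估计器 \sum_{i\ge 0} a_i (X^\top X)^i X^\top y 可以直接写成几行 NumPy:无标签样本通过 Gram 矩阵 X^\top X 参与计算,而标签向量中的缺失条目置零。以下是一个玩具示意(混合数据与多项式系数均为虚构):

```python
import numpy as np

def poly_estimator(X, y_partial, coeffs):
    """计算 w = sum_i a_i (X^T X)^i X^T y。
    X: (n, d) 特征矩阵;y_partial: (n,) 部分观测标签,缺失条目置 0。"""
    G = X.T @ X               # Gram 矩阵,无标签样本也参与其中
    v = X.T @ y_partial       # 仅由已标注样本贡献
    w = np.zeros(X.shape[1])
    P = np.eye(X.shape[1])
    for a in coeffs:          # 累加 a_i * G^i v
        w += a * (P @ v)
        P = P @ G
    return w

# 玩具示例:二元高斯混合,一半标签缺失
rng = np.random.default_rng(0)
n, d = 200, 5
labels = rng.choice([-1.0, 1.0], size=n)
X = labels[:, None] * 1.5 + rng.normal(size=(n, d))
y = labels.copy()
y[rng.random(n) < 0.5] = 0.0            # 缺失标签置 0
w = poly_estimator(X, y, coeffs=[1.0, 0.1, 0.01])
print("accuracy:", np.mean(np.sign(X @ w) == labels))
```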

[NLP-34] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification EMNLP

【速读】: 该论文旨在解决低资源语言在语言识别(Language Identification, LID)任务中因数据分布不均和领域偏差导致的性能低下问题。其解决方案的关键在于提出一种新颖的监督对比学习(Supervised Contrastive Learning, SCL)方法,通过学习领域不变的表示来缓解类别不平衡和数据偏倚问题,从而提升低资源语言在跨领域数据上的LID性能。

链接: https://arxiv.org/abs/2506.15304
作者: Negar Foroutan,Jakhongir Saydaliev,Ye Eun Kim,Antoine Bosselut
机构: EPFL(洛桑联邦理工学院); The University of Texas at Austin(德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to EMNLP

点击查看摘要

Abstract:Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
zh
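
下面给出监督对比损失(SupCon,Khosla et al. 风格)的一个最小实现示意,说明“同类(同语言)样本互为正例”如何促使模型学到领域不变的表示;论文的具体模型结构与采样策略在此属于假设:

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """监督对比损失:同一语言(类别)的样本互为正例,跨领域拉近同类表示。
    features: (B, d) 句子表示;labels: (B,) 语言 ID。"""
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.T / temperature
    B = feats.size(0)
    self_mask = torch.eye(B, dtype=torch.bool)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))   # 排除自身相似度
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_cnt = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_cnt    # 每个锚点对其正例取平均
    return loss[pos_mask.any(1)].mean()               # 跳过无正例的锚点

feats = torch.randn(16, 128)
labels = torch.randint(0, 4, (16,))                   # 假设 4 种低资源语言
print(supcon_loss(feats, labels).item())
```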

[NLP-35] Cohort Discovery: A Survey on LLM -Assisted Clinical Trial Recruitment

【速读】: 该论文试图解决临床试验招募中试验与患者匹配(trial-patient matching)的问题,该任务在自然语言描述的试验设计和结构化与非结构化文本形式的患者数据背景下,需要强大的知识整合与推理能力。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的分布式知识整合能力,以构建更通用的匹配方法,而非依赖于传统的试验特定方法。然而,当前基于LLM的解决方案多依赖专有模型且缺乏强有力的评估基准,因此本文对现有基准、方法及评估框架进行了批判性分析,并探讨了LLM在临床研究中应用的挑战与未来方向。

链接: https://arxiv.org/abs/2506.15301
作者: Shrestha Ghosh,Moritz Schneider,Carina Reinicke,Carsten Eickhoff
机构: University of Tübingen, Germany(图宾根大学,德国); Boehringer Ingelheim, Germany(勃林格殷格翰,德国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
zh

[NLP-36] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

【速读】: 该论文试图解决韩国司法系统在公开法院判决时,如何在保障司法透明度与个人数据保护之间取得平衡的问题,特别是现有去标识化(de-identification)流程在处理大规模判决文书时存在不足,且法律对个人身份信息的定义和分类不够明确,难以适配技术解决方案。论文提出的关键解决方案是构建一个名为Thunder-DeID的去标识化框架,其核心包括:构建并发布首个包含标注判决书及实体提及列表的韩语法律数据集,引入系统化的个人可识别信息(PII)分类方法,并开发基于深度神经网络(DNN)的端到端去标识化流水线。

链接: https://arxiv.org/abs/2506.15266
作者: Sungen Hahm,Heejin Kim,Gyuseong Lee,Hyunji Park,Jaejin Lee
机构: Seoul National University(首尔国立大学); Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院); Dept. of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
zh

[NLP-37] TopClustRAG at SIGIR 2025 LiveRAG Challenge

【速读】: 该论文旨在解决大规模网络语料库上的端到端问答问题,特别是在保证答案的多样性、相关性和对检索证据的忠实性方面。其解决方案的关键在于采用了一种混合检索策略,结合稀疏和密集索引,并通过K-Means聚类对语义相似的段落进行分组,进而为每个聚类生成特定提示,以提升大型语言模型(LLM)生成答案的质量。

链接: https://arxiv.org/abs/2506.15246
作者: Juli Bakagianni,John Pavlopoulos,Aristidis Likas
机构: Athens University of Economics and Business (雅典经济与商业大学); Archimedes, Athena Research Center (阿基米德,雅典娜研究中心); Computer Science and Engineering, University of Ioannina (计算机科学与工程,约阿尼纳大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
zh
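
下面用几行代码示意“检索段落聚类 + 按簇构造提示”这一步(实际系统使用稀疏/稠密混合检索向量,这里以 TF-IDF 代替以保证可运行;段落内容为虚构):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

passages = [
    "Solar panels convert sunlight into electricity.",
    "Photovoltaic cells are the core of solar panels.",
    "Wind turbines generate power from moving air.",
    "Wind farms are often located offshore.",
]
vecs = TfidfVectorizer().fit_transform(passages).toarray()
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vecs)

prompts = []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    # 取离簇中心最近的段落作为该簇的代表段落
    center = km.cluster_centers_[c]
    rep = idx[np.argmin(np.linalg.norm(vecs[idx] - center, axis=1))]
    context = "\n".join(passages[i] for i in idx)
    prompts.append(f"Answer using only this context:\n{context}\n"
                   f"(representative: {passages[rep]})")

for p in prompts:
    print(p, "\n---")
```

每个簇的提示分别送入 LLM 生成中间答案,再经过滤、重排与合成,得到最终回复。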

[NLP-38] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs

【速读】: 该论文旨在解决通用大语言模型在历史文本分析中存在领域知识缺失的问题,特别是在计算人文学科和AIGC技术背景下。其解决方案的关键在于提出Graph RAG框架,通过结合思维链提示(chain-of-thought prompting)、自我指导生成和过程监督,以最小的人工标注创建《前四史》人物关系数据集,从而支持自动化的历史知识提取,并在图增强生成阶段引入知识图谱与检索增强生成的协作机制,提升通用模型与历史知识的对齐度。

链接: https://arxiv.org/abs/2506.15241
作者: Yang Fan,Zhang Qi,Xing Wenqian,Liu Chang,Liu Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for the First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11% (from 0.08 to 0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating the hallucination phenomenon and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.
zh

[NLP-39] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

【速读】: 该论文试图解决当前语言技术在理解巴斯克语(Basque)和西班牙语(Spanish)语言变体方面的能力不足问题。其解决方案的关键在于引入一个手动整理的巴斯克语与西班牙语平行数据集,以及它们各自的变体,并通过自然语言推理(Natural Language Inference, NLI)作为核心任务进行评估。研究还利用编码器-only和解码器-based的大规模语言模型(Large Language Models, LLMs)进行了跨语言和上下文学习实验,以分析语言变异对模型性能的影响。

链接: https://arxiv.org/abs/2506.15239
作者: Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
机构: HiTZ Center - Ixa, University of the Basque Country UPV/EHU (HiTZ Center - Ixa, 巴斯克大学 UPV/EHU)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
zh

[NLP-40] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

【速读】: 该论文旨在解决视频描述生成中细节丰富性和准确性的挑战,特别是在具有配对音频的视频场景下。其解决方案的关键在于提出一种基于定向偏好优化(DPO)的低秩适配(LoRA)音频-视觉大语言模型(LLM)——video-SALMONN 2,并引入多轮DPO(MrDPO)方法,通过定期更新参考模型、合并与重新初始化LoRA模块以及利用真实视频字幕引导来提升训练稳定性与效果,从而显著提高了视频字幕生成的准确性。

链接: https://arxiv.org/abs/2506.15220
作者: Changli Tang,Yixuan Li,Yudong Yang,Jimin Zhuang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang
机构: Tsinghua University (清华大学); University of Cambridge (剑桥大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at this https URL.
zh
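
MrDPO 的基础是标准 DPO 损失,核心只有一行公式。以下给出其最小实现示意(数值为虚构;MrDPO 在此之上周期性更新参考模型并合并、重置 LoRA 模块):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """标准 DPO 损失。输入均为策略/参考模型对(被偏好, 被拒绝)字幕的序列对数似然。"""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# 示意数值:策略模型相对参考模型更偏好高质量字幕
pi_c, pi_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.0])
print(dpo_loss(pi_c, pi_r, ref_c, ref_r).item())
```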

[NLP-41] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLM s

【速读】: 该论文旨在解决开放性问答(open-ended question answering, QA)中自动评估方法的不足,特别是传统指标如ROUGE和BERTScore在捕捉语义相似性方面的局限性,以及基于大语言模型(LLM)的评估方法在可解释性和适应不同问题内容方面的缺陷。其解决方案的关键是提出一种名为MinosEval的新评估方法,该方法首先区分开放性问题的类型(事实性问题与非事实性问题),然后针对不同类型采用不同的评估策略:对于事实性问题,使用自适应关键点评分策略;对于非事实性问题,采用实例感知的列表排序策略。

链接: https://arxiv.org/abs/2506.15215
作者: Yongqi Fan,Yating Wang,Guandong Wang,Jie Zhai,Jingping Liu,Qi Ye,Tong Ruan
机构: East China University of Science and Technology (华东理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
zh

[NLP-42] ProtoReasoning : Prototypes as the Foundation for Generalizable Reasoning in LLM s

【速读】: 该论文试图解决大型语言模型在跨领域任务中泛化能力不足的问题,特别是如何提升模型在不同领域任务中的推理能力。解决方案的关键在于提出ProtoReasoning框架,该框架通过可扩展且可验证的原型表示(Prolog用于逻辑推理,PDDL用于规划)来增强模型的推理能力,其核心是利用共享的抽象推理原型,这些原型捕捉了不同领域问题的本质,从而实现跨领域的有效泛化。

链接: https://arxiv.org/abs/2506.15211
作者: Feng He,Zijun Chen,Xinnian Liang,Tingting Ma,Yunqi Qiu,Shuangzhi Wu,Junchi Yan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes – fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning patterns. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.
zh

[NLP-43] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals

【速读】: 该论文试图解决在可持续发展目标(Sustainable Development Goals, SDGs)背景下,如何有效跟踪和分析大量文本数据以评估进展的问题。其解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)进行单标签多类文本分类,并通过任务适应技术(如零样本学习、少样本学习和微调)提升模型性能。研究结果表明,经过提示工程优化的小型模型可以达到类似大型模型(如GPT)的性能水平。

链接: https://arxiv.org/abs/2506.15208
作者: Andrea Cadeddu,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi
机构: Linkalab s.r.l., Cagliari, Italy; Department of Mathematics and Computer Science, University of Cagliari, Italy; Knowledge Media Institute, The Open University, London, United Kingdom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Access

点击查看摘要

Abstract:In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI’s GPT (Generative Pre-trained Transformer).
zh

[NLP-44] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View

【速读】: 该论文试图解决状态空间语言模型中记忆机制的运作原理,特别是信息在时间上的保留与遗忘规律,通过利用首因效应(primacy effect)和近因效应(recency effect)作为行为工具进行研究。其解决方案的关键在于识别出三种机制:首先,长期记忆由模型选择性状态空间块中的稀疏通道支持,这些通道持续编码早期输入标记并因果关联于首因效应;其次,短期记忆由delta调制的循环机制控制,近期输入因指数衰减获得更高权重,但当引入干扰项时,这种近因优势会消失,揭示了记忆深度的限制;最后,记忆分配受语义规律的动态调节,输入序列中的重复关系会改变delta门控行为,增加对中间内容的遗忘倾向。

链接: https://arxiv.org/abs/2506.15156
作者: Muhammad Cendekia Airlangga,Hilal AlQuabeh,Munachiso S Nwadike,Kentaro Inui
机构: MBZUAI; RIKEN; Tohoku University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model’s selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.
zh
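
摘要中“delta 调制的循环导致指数衰减的近因优势”可以用一个极简的标量循环直观展示(这只是对选择性状态空间机制的一维简化,并非 Mamba 的真实实现):

```python
import numpy as np

def delta_recurrence(inputs, deltas):
    """简化的 delta 调制线性循环:h_t = (1 - delta_t) * h_{t-1} + delta_t * x_t。
    展开后第 t 个输入对末态的权重为 delta_t * prod_{s>t}(1 - delta_s),
    随与序列末端的距离指数衰减,因此越晚出现的 token 权重越大(近因效应)。"""
    h = 0.0
    for x, d in zip(inputs, deltas):
        h = (1 - d) * h + d * x
    return h

T = 10
deltas = np.full(T, 0.3)
# 在每个位置放入单位脉冲,测量其对末态的贡献
weights = [delta_recurrence(np.eye(T)[t], deltas) for t in range(T)]
print(np.round(weights, 4))   # 权重随位置单调递增
```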

[NLP-45] SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

【速读】: 该论文旨在解决音乐描述生成中缺乏对音频细节和音乐属性全面捕捉的问题,以提升音乐数据库的丰富性和推动音乐人工智能的研究。其解决方案的关键在于提出一种基于投影的架构,该架构将音频输入转换为语言标记,同时通过专用的辅助头检测音乐特征,如调性检测和人声检测,这些特征的输出也被投影为语言标记以增强描述生成的输入。这种多任务学习框架能够生成高质量、详尽的音乐片段描述,并通过大型语言模型链式输出实现对更长音乐作品的时间感知描述。

链接: https://arxiv.org/abs/2506.15154
作者: Anuradha Chopra,Abhinaba Roy,Dorien Herremans
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 14 pages, 2 figures, Accepted to AIMC 2025

点击查看摘要

Abstract:Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and drive forward research in music AI. This paper introduces a multi-task music captioning model, SonicVerse, that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more, so as to directly capture both low-level acoustic details as well as high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens, while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens, to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also directly enables the generation of detailed time-informed descriptions for longer music pieces, by chaining the outputs using a large-language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, captions and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.
zh

[NLP-46] hunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

【速读】: 该论文试图解决传统分词方法在韩语处理中导致的token fertility(每词平均token数)过高的问题,过高的fertility会直接拖慢模型的推理速度。解决方案的关键在于采用基于规则的预分词方法,以符合韩语的语言结构,并创建一个包含类似语言单位的种子词汇表,结合基于分支熵的选择算法,从而增加平均token长度,降低fertility的同时保持语言信息的完整性。

链接: https://arxiv.org/abs/2506.15138
作者: Gyeongje Cho,Yeonkyoun So,Chanwoo Park,Sangmin Lee,Sungmok Jung,Jaejin Lee
机构: Seoul National University(首尔国立大学); Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院); Dept. of Computer Science, Seoul National University(首尔国立大学计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
zh
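
分支熵(branching entropy)的计算本身很简单:统计某个候选单位之后的字符分布并求熵,熵高说明该位置是自然的切分边界。以下为最小示意(语料与候选单位为虚构的韩语词尾示例,并非论文的完整选择算法):

```python
import math
from collections import Counter

def branching_entropy(corpus, prefix):
    """候选单位之后的后继字符分布的熵:
    熵低(后继几乎唯一)说明不应在此切分,熵高说明是自然边界。"""
    nexts = Counter()
    for word in corpus:
        for i in range(len(word) - len(prefix)):
            if word[i:i + len(prefix)] == prefix:
                nexts[word[i + len(prefix)]] += 1
    total = sum(nexts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in nexts.values())

corpus = ["먹었습니다", "갔습니다", "했습니다", "먹었어요", "갔어요"]
for cand in ["었", "습니"]:
    print(cand, round(branching_entropy(corpus, cand), 3))
# “었”之后的后继多样(熵高,适合作为边界),“습니”后总是“다”(熵为 0)
```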

[NLP-47] Modeling the One-to-Many Property in Open-Domain Dialogue with LLM s

【速读】: 该论文试图解决开放域对话(Open-domain Dialogue, OD)中响应多样性不足的问题,即在给定对话上下文时,现有基于大语言模型(Large Language Models, LLMs)的对话代理未能显式建模“一对多”(one-to-many, o2m)特性。解决方案的关键在于将OD生成分解为两个核心任务:多响应生成(Multi-Response Generation, MRG)和基于偏好的选择(Preference-based Selection, PS),其中MRG负责为给定对话上下文生成一组语义和词汇上多样且高质量的响应,而PS则根据人类偏好从这些响应中选择一个最优解。

链接: https://arxiv.org/abs/2506.15131
作者: Jing Yang Lee,Kong-Aik Lee,Woon-Seng Gan
机构: Nanyang Technological University (南洋理工大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.
zh

[NLP-48] CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records

【速读】: 该论文旨在解决基于电子健康记录(Electronic Health Records, EHR)的疾病预测模型在医学知识表示不足和临床部署效率低下的问题。其解决方案的关键在于提出CKD-EHR框架,通过知识蒸馏技术将大型语言模型(如Qwen2.5-7B)中蕴含的医学知识高效地迁移至轻量级BERT学生模型,从而实现准确且高效的疾病风险预测。

链接: https://arxiv.org/abs/2506.15118
作者: Junke Wang,Hongshun Ling,Li Zhang,Longqian Zhang,Fang Wang,Yuan Gao,Zhi Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages,5 figures

点击查看摘要

Abstract:Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
zh
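
CKD-EHR 的软标签蒸馏部分可以用经典的温度蒸馏损失来示意(论文实际采用多粒度注意力蒸馏,此处只展示软标签一项;维度与数值均为虚构):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """软标签蒸馏:KL 散度项(带温度缩放)与监督交叉熵的加权和。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 5, requires_grad=True)   # 轻量学生模型(如 BERT)的风险 logits
t = torch.randn(8, 5)                        # 教师模型(如 Qwen2.5-7B)的 logits
y = torch.randint(0, 5, (8,))
print(distill_loss(s, t, y).item())
```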

[NLP-49] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification ACL2025

【速读】: 该论文旨在解决对话中由于省略和习语等语言特征引起的歧义问题,这些问题会模糊对话关系的意图,从而对对话话语解析器造成显著挑战。其解决方案的关键在于提出一种话语感知的澄清模块(Discourse-aware Clarification Module, DCM),该模块通过两种不同的推理过程——澄清类型推理和话语目标推理——来分析语言特征并区分潜在的对话关系。此外,引入了贡献感知偏好优化(Contribution-aware Preference Optimization, CPO)以减少错误澄清的风险,从而降低级联错误,并提升解析器与DCM之间的适应性和一致性。

链接: https://arxiv.org/abs/2506.15081
作者: Yaxin Fan,Peifeng Li,Qiaoming Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL2025(main conference)

点击查看摘要

Abstract:Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser’s requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.
zh

[NLP-50] Learning-Time Encoding Shapes Unlearning in LLM s

【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)中实现“后验删除”(unlearning)的问题,即在模型训练完成后移除特定知识的能力,这在隐私合规、纠正过时或有害内容等方面具有重要意义。论文的解决方案关键在于分析学习阶段的知识编码方式对事实性知识后验删除效果的影响,发现使用改写描述进行学习可以提升删除性能,而从一段文本中单独删除某条知识则面临较大挑战,表明学习阶段的知识编码策略在实现可靠后验删除中起着核心作用。

链接: https://arxiv.org/abs/2506.15076
作者: Ruihan Wu,Konstantin Garov,Kamalika Chaudhuri
机构: UC, San Diego(加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
zh

[NLP-51] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

【速读】: 该论文试图解决开放性长文本生成的评估问题,即如何准确区分生成内容的好坏,现有方法在一致性、风格或相关性等方面存在不足,或受到预训练数据的偏差影响。解决方案的关键是提出PrefBERT,这是一种用于评估开放性长文本生成的评分模型,通过GRPO(Group Relative Policy Optimization,组相对策略优化)框架,利用针对优质和劣质输出的不同奖励信号进行训练,从而提供比传统指标ROUGE-L和BERTScore更优的语义奖励反馈。

链接: https://arxiv.org/abs/2506.15068
作者: Zongxia Li,Yapei Chang,Yuhang Zhou,Xiyang Wu,Zichao Liang,Yoo Yeon Sung,Jordan Lee Boyd-Graber
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at this https URL.
zh
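
PrefBERT 的分数在 GRPO 中充当奖励信号。GRPO 的关键一步是组内相对优势归一化,可用几行代码示意(奖励数值为虚构):

```python
import torch

def group_relative_advantages(rewards):
    """GRPO 的核心:对同一提示采样的一组回复,
    用组内均值和标准差归一化奖励,得到无需价值网络的优势估计。"""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# 假设 PrefBERT 对同一问题的 4 个长回复给出如下质量分
rewards = torch.tensor([0.82, 0.55, 0.91, 0.40])
print(group_relative_advantages(rewards))
```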

[NLP-52] Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods

【速读】: 该论文试图解决社会隔离和孤独感在自杀率中的作用,尤其是在美国国家暴力死亡报告系统(NVDRS)中未被结构化记录的情况下,如何有效识别这些因素的问题。解决方案的关键在于利用自然语言处理(NLP)技术,结合主题建模生成词典并使用监督学习分类器,从而高效且准确地从法医和验尸官的叙述性文本中识别出社会隔离和孤独的相关信息。

链接: https://arxiv.org/abs/2506.15030
作者: Drew Walker,Swati Rajwal,Sudeshna Das,Snigdha Peddireddy,Abeed Sarker
机构: 未知
类目: Computation and Language (cs.CL)
备注: 22 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System’s (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner medical examiner narratives. Using topic modeling to generate lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p < .0001), gay (OR = 3.68; 1.97, 6.33, p < .0001), or were divorced (OR = 3.34; 2.68, 4.19, p < .0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up. Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.
zh

[NLP-53] An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW

【速读】: 该论文试图解决视障人士无法自由获取所需书籍的问题,传统方法如盲文书籍和非政府组织提供的音频记录存在局限性。解决方案的关键在于开发一种基于光学字符识别(OCR)的语音合成系统,该系统具备准确性、可靠性、成本效益和用户友好性,采用LabVIEW平台实现。

链接: https://arxiv.org/abs/2506.15029
作者: Prateek Mehta,Anasuya Patil
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:Knowledge extraction through sound is a distinctive property. Visually impaired individuals often rely solely on Braille books and audio recordings provided by NGOs. Due to limitations in these approaches, blind individuals often cannot access books of their choice. Speech is a more effective mode of communication than text for blind and visually impaired persons, as they can easily respond to sounds. This paper presents the development of an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR)-based speech synthesis system. The OCR-based system has been implemented using Laboratory Virtual Instrument Engineering Workbench (LabVIEW).
zh

[NLP-54] Optimal Embedding Learning Rate in LLM s: The Effect of Vocabulary Size

【速读】: 该论文试图解决大规模语言模型预训练过程中超参数(HPs)迁移效率低的问题,特别是在模型宽度(嵌入维度)增加时,传统方法如 \mu P(Maximal Update Parametrization)在实际应用中表现出不一致的实验结果。解决方案的关键在于提出一种新的理论分析,揭示词汇表大小对训练动态的影响,并指出当词汇表规模增大时,训练动态会介于 \mu P 模式与另一种称为 Large Vocab (LV) 的模式之间。在 LV 模式下,最优的嵌入学习率(LR)与隐藏层 LR 的比例应大致按 \Theta(\sqrt{width}) 进行缩放,这与 \mu P 预测的 \Theta(width) 不同,且与已有实证研究结果相符。

链接: https://arxiv.org/abs/2506.15025
作者: Soufiane Hayou,Liyuan Liu
机构: Simons Institute (西蒙斯研究所); UC Berkeley (加州大学伯克利分校); Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: TL;DR: How to set the learning rate for the embedding layer in LLMs?

点击查看摘要

Abstract:Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, \mu P (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While \mu P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind \mu P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics interpolate between the \mu P regime and another regime that we call Large Vocab (LV) Regime, where optimal scaling rules are different from those predicted by \mu P. Our analysis reveals that in the LV regime, the optimal embedding LR to hidden LR ratio should roughly scale as \Theta(\sqrt{width}), surprisingly close to the empirical findings previously reported in the literature, and different from the \Theta(width) ratio predicted by \mu P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.
zh
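
按摘要给出的 LV 机制缩放规则,嵌入层学习率与隐藏层学习率之比应随 \sqrt{width} 增长。下面把该规则写成一个小函数(基准宽度与学习率均为示意性假设值):

```python
import math

def embedding_lr(hidden_lr, width, base_width=256):
    """LV 机制下的示意缩放:embedding_lr / hidden_lr 正比于 sqrt(width)。
    base_width 是归一化用的基准宽度,属本示例的假设。"""
    return hidden_lr * math.sqrt(width / base_width)

for w in (256, 1024, 4096):
    print(w, embedding_lr(hidden_lr=2e-4, width=w))
```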

[NLP-55] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings ACL2025

【速读】: 该论文试图解决如何在不修改模型权重的情况下,生成可逆的句子嵌入,使得大语言模型(Large Language Model, LLM)能够精确重建原始文本的问题。解决方案的关键在于引入一个特殊的记忆标记(memory token),其嵌入通过在固定序列上的训练进行优化,当模型接收到该嵌入时,能够准确地重构出原始序列。

链接: https://arxiv.org/abs/2506.15001
作者: Ignacio Sastre,Aiala Rosá
机构: Instituto de Computación, Facultad de Ingeniería, Universidad de la República (计算研究所,工程学院,乌拉圭共和国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper will be presented at The First Workshop on Large Language Model Memorization (L2M2) at ACL 2025

点击查看摘要

Abstract:In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model’s weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.
zh
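
记忆 token 的训练过程本质上是“冻结模型、只优化一个嵌入向量”。下面用一个随机线性解码器代替真实 LLM,给出可运行的最小示意(模型与目标序列均为虚构占位):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d, seq_len = 100, 64, 8

# 冻结的随机线性解码器,代替真实 LLM(占位假设)
decoder = torch.nn.Linear(d, vocab * seq_len)
for p in decoder.parameters():
    p.requires_grad_(False)

target = torch.randint(0, vocab, (seq_len,))      # 要“记住”的固定序列
mem = torch.zeros(d, requires_grad=True)          # 记忆 token 嵌入:唯一可训练参数
opt = torch.optim.Adam([mem], lr=0.1)

for _ in range(500):
    logits = decoder(mem).view(seq_len, vocab)
    loss = F.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

logits = decoder(mem).view(seq_len, vocab)
print(bool((logits.argmax(-1) == target).all()))  # True:序列被精确重建
```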

[NLP-56] Hypothesis Testing for Quantifying LLM -Human Misalignment in Multiple Choice Settings

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在社会科学研究中模拟人类行为的准确性问题,特别是评估这些模型在多选问卷情境下与实际人类行为之间的偏差。其解决方案的关键在于提出一种基于假设检验的量化框架,以系统性地判断特定语言模型是否能够有效模拟通过多选选项所体现的人类观点、决策和一般行为。该框架为评估LLM与真实人群行为的对齐程度提供了理论依据和实证方法。

链接: https://arxiv.org/abs/2506.14997
作者: Harbin Hong,Sebastian Caldas,Liu Leqi
机构: Princeton University (普林斯顿大学); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people’s opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
zh
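
论文的思路是对“LLM 模拟分布与人类分布一致”这一原假设做统计检验。下面用卡方拟合优度检验给出一个最小示意(计数为虚构;论文的具体检验构造可能不同):

```python
import numpy as np
from scipy.stats import chisquare

# 人类调查中 4 个选项的回答计数,与 LLM 模拟的回答计数(虚构)
human = np.array([120, 300, 80, 100])
llm = np.array([90, 360, 60, 90])

# 原假设:LLM 的回答分布与人类一致;期望计数按人类比例缩放到 LLM 样本量
expected = human / human.sum() * llm.sum()
stat, p = chisquare(f_obs=llm, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.4g}")  # p 很小则拒绝原假设,判定为未对齐
```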

[NLP-57] Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

【速读】: 该论文试图解决强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Model, LLM)推理中的应用局限性问题,特别是当前开放研究主要集中在数学和代码领域,限制了对更广泛推理场景适用性的理解。解决方案的关键在于构建了一个名为Guru的经过精心筛选的RL推理语料库,包含92K个可验证示例,覆盖六个推理领域(数学、代码、科学、逻辑、仿真和表格),每个领域通过领域特定的奖励设计、去重和过滤确保可靠性与有效性,从而为RL训练提供可靠且可扩展的奖励信号。

链接: https://arxiv.org/abs/2506.14965
作者: Zhoujun Cheng,Shibo Hao,Tianyang Liu,Fan Zhou,Yutao Xie,Feng Yao,Yuexin Bian,Yonghao Zhuang,Nilabjo Dey,Yuheng Zha,Yi Gu,Kun Zhou,Yuqi Wang,Yuan Li,Richard Fan,Jianshu She,Chengqian Gao,Abulhair Saparov,Haonan Li,Taylor W. Killian,Mikhail Yurochkin,Zhengzhong Liu,Eric P. Xing,Zhiting Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 38 pages, 9 figures. Under review

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains–Math, Code, Science, Logic, Simulation, and Tabular–each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: this https URL
zh

[NLP-58] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?

【速读】: 该论文试图解决将大型语言模型(Large Language Models, LLMs)应用于结构化数值数据的糖尿病预测问题,尤其是在少样本(few-shot)设置下的有效性验证。其解决方案的关键在于测试不同提示策略(zero-shot、one-shot 和 three-shot)下多种LLMs(包括开源和专有模型)在Pima Indian Diabetes Database上的表现,并与传统机器学习模型进行对比,以评估LLMs在医疗预测任务中的潜力。

链接: https://arxiv.org/abs/2506.14949
作者: Shadman Sakib,Oishy Fatema Akhand,Ajwad Abrar
机构: Islamic University of Technology (伊斯兰理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted in 1st IEEE QPAIN 2025

点击查看摘要

Abstract:While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
zh

[NLP-59] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance ACL2025

【速读】: 该论文试图解决多文档(Multi-Document, MD)推理评估基准缺失的问题,尤其是在大型语言模型(Large Language Models, LLMs)处理长上下文输入能力不断提升的背景下,缺乏能够严格检验模型在多文档场景下行为的基准数据集。解决方案的关键在于提出MDBench,这是一个通过新颖的合成生成过程构建的数据集,能够可控且高效地生成具有挑战性的文档集及其对应的问答(QA)示例。该方法基于压缩的结构化种子知识,通过LLM辅助编辑引入MD特定的推理挑战,并将其转换为自然文本形式,从而生成完整的文档集和QA示例。

链接: https://arxiv.org/abs/2506.14927
作者: Joseph J. Peper,Wenzhao Qiu,Ali Payani,Lu Wang
机构: University of Michigan, Ann Arbor, MI (密歇根大学安娜堡分校); Cisco Research, San Jose, CA (思科研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets. We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.
zh

[NLP-60] CrEst: Credibility Estimation for Contexts in LLM s via Weak Supervision

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理知识密集型任务时,因上下文文档可信度差异而导致的不可靠信息传播问题。解决方案的关键在于提出一种弱监督框架CrEst,通过文档间的语义一致性来自动评估上下文文档的可信度,从而在不依赖人工标注的情况下提升模型推理的可靠性。

链接: https://arxiv.org/abs/2506.14912
作者: Dyah Adila,Shuai Zhang,Boran Han,Bonan Min,Yuyang Wang
机构: AWS AI Labs (AWS AI 实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference–without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
zh
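下面基于“可信文档与其他可信文档具有更高语义一致性”这一核心思想,给出一个最小 Python 示意:用句向量的文档间平均余弦相似度作为可信度的代理分数。编码器选择(all-MiniLM-L6-v2)与打分方式均为本文假设,并非论文官方实现。

```python
# 概念性示意:通过文档间语义一致性估计上下文可信度(非官方实现)
import numpy as np
from sentence_transformers import SentenceTransformer

def credibility_scores(docs):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 编码器选择仅为示意
    emb = model.encode(docs, normalize_embeddings=True)  # 单位化后点积即余弦相似度
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)
    # 每篇文档的可信度 = 与其余文档的平均一致性
    return sim.sum(axis=1) / (len(docs) - 1)

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "Paris is home to the Eiffel Tower.",
    "The Eiffel Tower was moved to Berlin in 2020.",  # 与其余文档不一致,得分应较低
]
print(credibility_scores(docs))
```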

[NLP-61] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

【速读】: 该论文试图解决在结构化自然语言处理任务中,使用自回归语言模型进行约束解码时出现的输出质量较低的问题。其解决方案的关键在于提出了一种名为Boosted Constrained Decoding (BoostCD)的方法,该方法通过两个阶段结合约束和非约束解码:第一阶段分别以约束和非约束模式对基础模型进行两次解码,获得两个弱预测结果;第二阶段通过一个学习到的自回归增强模型将这两个弱预测结果融合为最终预测。基础模型在有无约束条件下的错误具有互补性,增强模型能够利用这种互补性提升性能。

链接: https://arxiv.org/abs/2506.14901
作者: Marija Šakota,Robert West
机构: EPFL(瑞士联邦理工学院); Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many recent approaches to structured NLP tasks use an autoregressive language model M to map unstructured input text x to output text y representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs y. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model M twice, in constrained and unconstrained mode, obtaining two weak predictions. In Phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.
zh
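BoostCD 第一阶段要求同一基础模型分别进行约束与非约束解码。下面用 Hugging Face transformers 的 prefix_allowed_tokens_fn 给出一个玩具示意:此处的“约束”仅是把候选 token 限制在一个固定小集合内(真实系统中由输出结构的语法决定),第二阶段的 boosted 组合模型从略;所用模型(gpt2)与约束内容均为示例性假设。

```python
# 示意:同一基础模型的约束/非约束两次解码(BoostCD 第一阶段的玩具版本)
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The capital of France is", return_tensors="pt")

# 非约束解码
free = model.generate(**inputs, max_new_tokens=8, do_sample=False)

# 约束解码:每一步只允许一个很小的候选 token 集合
allowed = tok.convert_tokens_to_ids(tok.tokenize(" Paris , the city of lights"))
constrained = model.generate(
    **inputs, max_new_tokens=8, do_sample=False,
    prefix_allowed_tokens_fn=lambda batch_id, input_ids: allowed,
)

print(tok.decode(free[0]))
print(tok.decode(constrained[0]))
```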

[NLP-62] Adverse Event Extraction from Discharge Summaries: A New Dataset Annotation Scheme and Initial Findings ACL2025

【速读】: 该论文旨在解决从老年患者出院小结中提取不良事件(Adverse Event, AE)的问题,尤其是针对在临床自然语言处理(NLP)资源中常被忽视的老年人群体。其解决方案的关键在于构建一个手动标注的语料库,该语料库包含14种临床相关的AE及其上下文属性,并支持不连续和重叠实体的标注,以应对以往研究中较少涉及的挑战。此外,该数据集在可信研究环境(Trusted Research Environment, TRE)内开发,可用于评估AE提取方法并促进跨数据集的泛化能力。

链接: https://arxiv.org/abs/2506.14900
作者: Imane Guellil,Salomé Andres,Atul Anand,Bruce Guthrie,Huayu Zhang,Abul Hasan,Honghan Wu,Beatrice Alex
机构: University of Edinburgh(爱丁堡大学); University College London(伦敦大学学院); University of Glasgow(格拉斯哥大学)
类目: Computation and Language (cs.CL)
备注: Accepted and will be published at ACL2025 (main conference)

点击查看摘要

Abstract:In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs, such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes. These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.
zh

[NLP-63] Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching

【速读】: 该论文试图解决基于大语言模型(Large Language Model, LLM)的智能体应用在复杂工作流中因大量规划与推理需求而导致的高昂服务成本问题。其解决方案的关键在于提出一种名为“智能体计划缓存”(agentic plan caching)的新方法,该方法通过从智能体应用的规划阶段提取、存储、适应并复用结构化的计划模板,以减少服务成本。与传统语义缓存不同,该系统在测试时从已完成的智能体执行中提取计划模板,利用关键词提取匹配新请求,并借助轻量级模型将这些模板适配为带上下文的任务特定计划。

链接: https://arxiv.org/abs/2506.14852
作者: Qizheng Zhang,Michael Wornow,Kunle Olukotun
机构: Stanford University (斯坦福大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
备注: 23 pages

点击查看摘要

Abstract:LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
zh
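围绕“提取计划模板、关键词匹配、轻量适配”这一流程,下面给出一个纯 Python 的最小示意:用关键词 Jaccard 相似度检索缓存模板。命中阈值与停用词表均为本文假设;真实系统中由轻量模型完成的模板适配,此处以直接返回模板代替。

```python
# 示意:基于关键词匹配的 agentic 计划模板缓存(玩具实现)
import re

STOPWORDS = {"the", "a", "an", "of", "for", "to", "from"}

def keywords(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

class PlanCache:
    def __init__(self, threshold=0.4):
        self.entries = []  # 列表元素为 (关键词集合, 计划模板)
        self.threshold = threshold

    def store(self, task, template):
        self.entries.append((keywords(task), template))

    def lookup(self, task):
        q = keywords(task)
        best, best_sim = None, 0.0
        for kw, tpl in self.entries:
            sim = len(q & kw) / len(q | kw)  # Jaccard 相似度
            if sim > best_sim:
                best, best_sim = tpl, sim
        return best if best_sim >= self.threshold else None

cache = PlanCache()
cache.store("book a flight from NYC to Paris",
            "1) search flights 2) compare prices 3) book cheapest")
plan = cache.lookup("book a cheap flight from Boston to Paris")
print(plan or "cache miss -> 运行完整规划器")
```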

[NLP-64] Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse

【速读】: 该论文试图解决如何检测全球事件是否根本性地重塑公共话语的问题,其解决方案的关键在于引入一种基于持久同调(persistent homology)的拓扑框架,用于识别媒体叙述中的结构变化。通过构建名词短语的共现图,并利用维特里斯-里普斯滤波(Vietoris-Rips filtration)将其转换为持久图谱,进而计算Wasserstein距离和持久熵以捕捉语义扰动和叙述波动,从而揭示重大地缘政治和社会事件对话语结构的影响。

链接: https://arxiv.org/abs/2506.14836
作者: Mark M. Bailey,Mark I. Heiligman
机构: National Intelligence University (国家情报大学)
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
备注: 23 pages

点击查看摘要

Abstract:How can we detect when global events fundamentally reshape public discourse? This study introduces a topological framework for identifying structural change in media narratives using persistent homology. Drawing on international news articles surrounding major events, including the Russian invasion of Ukraine (Feb 2022), the murder of George Floyd (May 2020), the U.S. Capitol insurrection (Jan 2021), and the Hamas-led invasion of Israel (Oct 2023), we construct daily co-occurrence graphs of noun phrases to trace evolving discourse. Each graph is embedded and transformed into a persistence diagram via a Vietoris-Rips filtration. We then compute Wasserstein distances and persistence entropies across homological dimensions to capture semantic disruption and narrative volatility over time. Our results show that major geopolitical and social events align with sharp spikes in both H0 (connected components) and H1 (loops), indicating sudden reorganization in narrative structure and coherence. Cross-correlation analyses reveal a typical lag pattern in which changes to component-level structure (H0) precede higher-order motif shifts (H1), suggesting a bottom-up cascade of semantic change. An exception occurs during the Russian invasion of Ukraine, where H1 entropy leads H0, possibly reflecting top-down narrative framing before local discourse adjusts. Persistence entropy further distinguishes tightly focused from diffuse narrative regimes. These findings demonstrate that persistent homology offers a mathematically principled, unsupervised method for detecting inflection points and directional shifts in public attention, without requiring prior knowledge of specific events. This topological approach advances computational social science by enabling real-time detection of semantic restructuring during crises, protests, and information shocks.
zh
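下面用 ripser 与 persim 演示论文涉及的两类拓扑指标:持久图之间的 Wasserstein 距离与持久熵。真实流程的输入是名词短语共现图的嵌入,此处以两圈带噪声的二维点云代替,仅为可运行的示意。

```python
# 示意:Vietoris-Rips 持久图、Wasserstein 距离与持久熵(依赖 ripser、persim)
import numpy as np
from ripser import ripser
from persim import wasserstein

def persistence_entropy(dgm):
    # 持久熵:以各拓扑特征寿命占比为概率分布计算香农熵
    life = dgm[:, 1] - dgm[:, 0]
    life = life[np.isfinite(life)]  # 丢弃寿命无穷的特征(H0 中的整体连通分量)
    p = life / life.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 60)
day1 = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(60, 2))
day2 = np.c_[1.5 * np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(60, 2))

dgm1 = ripser(day1, maxdim=1)["dgms"]
dgm2 = ripser(day2, maxdim=1)["dgms"]

print("H1 Wasserstein 距离:", wasserstein(dgm1[1], dgm2[1]))
print("H0 持久熵:", persistence_entropy(dgm1[0]))
```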

[NLP-65] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉细粒度感知和常识性因果推理方面存在的挑战。其解决方案的关键在于构建了一个名为Argus Inspection的多模态基准,该基准包含两个难度层次,强调详细的视觉识别并融合现实世界的常识理解以评估因果推理能力;同时提出了Eye of Panoptes框架,该框架结合二元参数Sigmoid度量与指示函数,实现了对MLLMs在基于观点推理任务中的更全面评估。

链接: https://arxiv.org/abs/2506.14805
作者: Yang Yao,Lingyu Li,Jiaxin Song,Chiyu Chen,Zhenqi He,Yixu Wang,Xin Wang,Tianle Gu,Jie Li,Yan Teng,Yingchun Wang
机构: The University of Hong Kong (香港大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学); The Hong Kong University of Science and Technology (香港科技大学); Fudan University (复旦大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs’ responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.
zh

[NLP-66] Assembly of Experts: Linear-time construction of the Chimera LLM variants with emergent and adaptable behaviors

【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)预训练过程中计算成本高昂的问题,具体表现为训练一个8位权重需要10^13至10^15 FLOPs,这使得资源消耗巨大且效率低下。其解决方案的关键在于提出了一种新的“专家集合”(Assembly-of-Experts, AoE)构建方法,该方法能够在线性时间内生成现有混合专家(Mixture-of-Experts, MoE)父模型的高效子模型。通过单独对模型权重张量进行插值,可以增强或抑制父模型的语义特征,并在不进行微调或知识蒸馏的情况下,实现性能优越且功能完整的子模型。

链接: https://arxiv.org/abs/2506.14794
作者: Henrik Klagges,Robert Dahlke,Fabian Klemm,Benjamin Merkel,Daniel Klingmann,David A. Reiss,Dan Zecha
机构: TNG Technology Consulting GmbH (TNG技术咨询公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Requiring 10^13 to 10^15 FLOPs to calculate one 8-bit weight in an LLM during pretraining is extremely expensive and seems inefficient. To better leverage the huge investments made into pretrained models, we develop the new "Assembly-of-Experts" (AoE) construction method to create capable child variants of existing Mixture-of-Experts parent models in linear time. Model weight tensors get interpolated individually, allowing us to enhance or suppress semantic features of the parents. Varying the proportion of weights taken from the parent models, we observe some properties of the AoE child model changing gradually, while other behavioral traits emerge with a sharp transition. Surprisingly, nearly every generated model is functional and capable, which makes searching the model space straightforward. We construct the DeepSeek R1T "Chimera", a 671B open-weights hybrid model combining DeepSeek's V3-0324 and R1 model variants. The child inherits only the routed expert tensors of R1, but still achieves about R1-level intelligence. At the same time, it uses about 40% fewer output tokens, close to V3 speed. Constructed without any fine-tuning or distillation, the Chimera exhibits surprisingly compact, orderly reasoning compared to its parent models.
zh
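AoE 的核心操作是对两个同构父模型的权重张量逐一线性插值。下面用 PyTorch 的两个小型线性层做最小示意;真实方法面向 MoE 大模型,并按张量类别(如路由专家张量)选取不同比例,此处的插值规则仅为假设。

```python
# 示意:Assembly-of-Experts 式的逐张量权重插值(玩具模型)
import torch
import torch.nn as nn

def interpolate_state_dicts(sd_a, sd_b, alpha_fn):
    # alpha_fn(name) 为每个权重张量单独给出插值系数 alpha ∈ [0, 1]
    return {name: (1 - alpha_fn(name)) * sd_a[name] + alpha_fn(name) * sd_b[name]
            for name in sd_a}

torch.manual_seed(0)
parent_a, parent_b = nn.Linear(16, 16), nn.Linear(16, 16)

# 假设性规则:偏置全部取自父模型 B,权重取 30% 的 B
alpha_fn = lambda name: 1.0 if "bias" in name else 0.3

child = nn.Linear(16, 16)
child.load_state_dict(interpolate_state_dicts(
    parent_a.state_dict(), parent_b.state_dict(), alpha_fn))
print(child(torch.randn(1, 16)).shape)  # 子模型可直接前向推理,无需再训练
```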

[NLP-67] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection

【速读】: 该论文旨在解决多模态讽刺检测任务中图形与文本之间隐式关联难以准确识别的问题。其解决方案的关键在于提出一种语义讽刺识别网络(SemIRNet),该网络通过引入ConceptNet知识库增强模型的常识推理能力,并设计了词级和样本级的跨模态语义相似性检测模块,以在不同粒度上建模图形与文本的相关性,同时采用对比学习损失函数优化样本特征的空间分布,从而提升正负样本的可分性。

链接: https://arxiv.org/abs/2506.14791
作者: Jingxuan Zhou,Yuehao Wu,Yibo Zhang,Yeyubei Zhang,Yunchong Liu,Bolin Huang,Chunhong Yuan
机构: University of New South Wales (新南威尔士大学); University of Sydney (悉尼大学); University of Pennsylvania (宾夕法尼亚大学); University of Southern California (南加州大学); Kazan Federal University (喀山联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Aiming at the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) The ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model’s common-sense reasoning ability; (2) Two cross-modal semantic similarity detection modules at the word level and sample level are designed to model graphic-textual correlations at different granularities; and (3) A contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% to 88.87% and 86.33%, respectively, compared with the existing optimal methods. Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance.
zh
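针对其中的对比学习损失,下面给出一个常见的 InfoNCE 风格图文对比损失写法,用于拉近配对的图文特征、推远不配对特征;温度系数与论文的具体损失形式未必一致,仅为通用示意。

```python
# 示意:图文特征的 InfoNCE 式对比损失(通用写法,非论文精确公式)
import torch
import torch.nn.functional as F

def contrastive_loss(img_feat, txt_feat, tau=0.07):
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.T / tau            # [B, B] 相似度矩阵
    labels = torch.arange(img.size(0))    # 对角线为配对正样本
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

torch.manual_seed(0)
print(contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```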

[NLP-68] ETS: Open Vocabulary Electroencephalography-To-Text Decoding and Sentiment Classification

【速读】: 该论文试图解决从非侵入式脑电图(EEG)信号中解码自然语言的问题,尤其是在开放词汇场景下的噪声和变异性的挑战。传统方法在小规模封闭词汇上表现良好,但在开放词汇任务中仍存在困难。该研究提出的ETS框架通过整合同步的眼动追踪数据,解决了两个关键任务:开放词汇文本生成和感知语言的情感分类。其解决方案的关键在于利用多模态数据融合,从而提升解码性能,在BLEU和Rouge得分以及情感分类的F1分数上均显著优于监督基线模型。

链接: https://arxiv.org/abs/2506.14783
作者: Mohamed Masry,Mohamed Amen,Mohamed Elzyat,Mohamed Hamed,Norhan Magdy,Maram Khaled
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Graduation project report submitted at Faculty of Computer Science and Artificial Intelligence, Helwan University

点击查看摘要

Abstract:Decoding natural language from brain activity using non-invasive electroencephalography (EEG) remains a significant challenge in neuroscience and machine learning, particularly for open-vocabulary scenarios where traditional methods struggle with noise and variability. Previous studies have achieved high accuracy on small, closed vocabularies, but they still struggle on open vocabularies. In this study, we propose ETS, a framework that integrates EEG with synchronized eye-tracking data to address two critical tasks: (1) open-vocabulary text generation and (2) sentiment classification of perceived language. Our model achieves superior BLEU and ROUGE scores for EEG-to-text decoding and up to a 10% F1-score gain on EEG-based ternary sentiment classification, significantly outperforming supervised baselines. Furthermore, we show that our proposed model can handle data from various subjects and sources, showing great potential for high-performance open-vocabulary EEG-to-text systems.
zh

[NLP-69] Factorized RVQ-GAN For Disentangled Speech Tokenization INTERSPEECH2025

【速读】: 该论文试图解决传统语音编解码器在统一建模声学特征与语言语义信息方面的局限性,旨在实现一种能够同时捕捉语音的音素层级结构和词汇层级语义的统一离散语音表示。解决方案的关键在于提出分层音频编解码器(Hierarchical Audio Codec, HAC),其通过将瓶颈层分解为声学、音素和词法三个语言层级,并利用两种知识蒸馏目标:从预训练语音编码器(HuBERT)获取音素级结构信息,从基于文本的编码器(LaBSE)获取词法线索,从而实现语音表示的解耦与语义可解释性。

链接: https://arxiv.org/abs/2506.15456
作者: Sameer Khurana,Dominik Klement,Antoine Laurent,Dominik Bobos,Juraj Novosad,Peter Gazdik,Ellen Zhang,Zili Huang,Amir Hussein,Ricard Marxer,Yoshiki Masuyama,Ryo Aihara,Chiori Hori,Francois G. Germain,Gordon Wichern,Jonathan Le Roux
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
zh
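HAC 的两个蒸馏目标分别把音素级瓶颈对齐到 HuBERT 特征、把词汇级瓶颈对齐到 LaBSE 特征。下面以“1 - 余弦相似度”写出一个双重蒸馏损失的骨架;特征维度、对齐粒度与损失形式均为本文假设,教师特征此处用随机张量代替。

```python
# 示意:双重蒸馏目标,两级瓶颈分别对齐音素/词汇教师特征(骨架)
import torch
import torch.nn.functional as F

def dual_distill_loss(student_phon, hubert_feat, student_lex, labse_feat):
    # 以 1 - 余弦相似度作为蒸馏损失;维度与对齐方式均为假设
    l_phon = 1 - F.cosine_similarity(student_phon, hubert_feat, dim=-1).mean()
    l_lex = 1 - F.cosine_similarity(student_lex, labse_feat, dim=-1).mean()
    return l_phon + l_lex

torch.manual_seed(0)
loss = dual_distill_loss(torch.randn(2, 50, 768), torch.randn(2, 50, 768),
                         torch.randn(2, 768), torch.randn(2, 768))
print(loss)
```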

[NLP-70] Identifying economic narratives in large text corpora – An integrated approach using Large Language Models

【速读】: 该论文试图解决从文本中提取经济叙事(economic narratives)的问题,特别是如何在复杂文档中准确区分经济叙事与传统自然语言处理任务。其解决方案的关键在于评估大型语言模型(Large Language Models, LLMs)在这一任务中的表现,而非依赖传统的复杂模型流水线。通过分析《华尔街日报》和《纽约时报》关于通胀的新闻文章,并采用严格的叙事定义,研究发现GPT-4o能够以结构化格式提取有效的经济叙事,但在处理复杂文档和叙事时仍未能达到专家水平。

链接: https://arxiv.org/abs/2506.15041
作者: Tobias Schmidt,Kai-Robin Lange,Matthias Reccius,Henrik Müller,Michael Roos,Carsten Jentsch
机构: 未知
类目: General Economics (econ.GN); Computation and Language (cs.CL)
备注: 53 pages, 5 figures

点击查看摘要

Abstract:As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts. Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task. While effective on foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tasks like Semantic Role Labeling. Instead of relying on complex model pipelines, we evaluate the benefits of Large Language Models (LLMs) by analyzing a corpus of Wall Street Journal and New York Times newspaper articles about inflation. We apply a rigorous narrative definition and compare GPT-4o outputs to gold-standard narratives produced by expert annotators. Our results suggest that GPT-4o is capable of extracting valid economic narratives in a structured format, but still falls short of expert-level performance when handling complex documents and narratives. Given the novelty of LLMs in economic research, we also provide guidance for future work in economics and the social sciences that employs LLMs to pursue similar objectives.
zh
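论文将 GPT-4o 的结构化输出与专家标注的金标准叙事对比。下面给出一个用 openai Python SDK 以 JSON 模式抽取叙事的调用示意;提示词与输出字段(cause、effect、actors)均为本文假设,并非论文所用模板。

```python
# 示意:用 GPT-4o 从新闻文本抽取结构化经济叙事(提示词与字段为假设)
import json
from openai import OpenAI

client = OpenAI()  # 需预先设置 OPENAI_API_KEY 环境变量

article = ("Prices rose sharply as supply chains snarled "
           "and consumers kept spending.")
resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract economic narratives about inflation as JSON "
                    "with fields: cause, effect, actors."},
        {"role": "user", "content": article},
    ],
)
print(json.loads(resp.choices[0].message.content))
```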

[NLP-71] Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

【速读】: 该论文试图解决多语言自动语音识别(multilingual automatic speech recognition, ASR)模型在进行神经网络剪枝时需要针对每种语言多次迭代剪枝与重训练的问题。解决方案的关键在于提出了一种自适应掩码方法,在两种场景下高效地对多语言ASR模型进行剪枝,从而生成稀疏的单语模型或稀疏的多语言模型(称为Dynamic ASR Pathways)。该方法通过动态调整子网络,避免了对固定子网络结构的过早决策,并通过从不同子网络初始化中适应性地发现和训练更好的子网络路径,提升了模型性能并减少了对语言特异性剪枝的需求。

链接: https://arxiv.org/abs/2309.13018
作者: Jiamin Xie,Ke Li,Jinxi Guo,Andros Tjandra,Yuan Shangguan,Leda Sari,Chunyang Wu,Junteng Jia,Jay Mahadeokar,Ozlem Kalinli
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
zh
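自适应掩码的要点是在训练过程中周期性重估子网络,而不是一次性固定结构。下面给出一个按权重幅值周期性重算二值掩码的通用 PyTorch 骨架;稀疏率、重估周期与玩具回归任务均为本文假设,真实方法还会按语言/路径自适应。

```python
# 示意:训练中周期性重估的幅值剪枝掩码(动态子网络的通用骨架)
import torch
import torch.nn as nn

def magnitude_mask(weight, sparsity):
    # 保留幅值最大的 (1 - sparsity) 比例权重
    k = int(weight.numel() * (1 - sparsity))
    thresh = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= thresh).float()

torch.manual_seed(0)
layer = nn.Linear(64, 64)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(8, 64), torch.randn(8, 64)

for step in range(1, 101):
    if step % 20 == 1:                    # 周期性重估掩码,而非一次性固定
        mask = magnitude_mask(layer.weight.data, sparsity=0.7)
    loss = ((layer(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    layer.weight.grad *= mask             # 只更新保留位置的权重
    opt.step()
    layer.weight.data *= mask             # 被剪位置保持为零

print("实际稀疏率:", (layer.weight == 0).float().mean().item())
```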

计算机视觉

[CV-0] Nabla-R2D3: Effective and Efficient 3D Diffusion Alignment with 2D Rewards

【速读】:该论文旨在解决生成高质量且逼真的3D资产在3D视觉和计算机图形学领域中的长期挑战,尤其是当前先进的生成模型(如扩散模型)在遵循指令、符合人类偏好或生成真实纹理、几何结构和物理属性方面存在局限性。其解决方案的关键在于提出Nabla-R2D3,这是一个基于2D奖励信号的高效且样本高效的强化学习对齐框架,用于优化原生3D扩散模型。该方法建立在Nabla-GFlowNet基础上,通过系统地匹配得分函数与奖励梯度实现奖励微调,从而在少量微调步骤内显著提升奖励并减少先验遗忘。

链接: https://arxiv.org/abs/2506.15684
作者: Qingming Liu,Zhen Liu,Dinghuai Zhang,Kui Jia
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳); Microsoft Research (微软研究院)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical Report (21 pages, 21 figures)

点击查看摘要

Abstract:Generating high-quality and photorealistic 3D assets remains a longstanding challenge in 3D vision and computer graphics. Although state-of-the-art generative models, such as diffusion models, have made significant progress in 3D generation, they often fall short of human-designed content due to limited ability to follow instructions, align with human preferences, or produce realistic textures, geometries, and physical attributes. In this paper, we introduce Nabla-R2D3, a highly effective and sample-efficient reinforcement learning alignment framework for 3D-native diffusion models using 2D rewards. Built upon the recently proposed Nabla-GFlowNet method, which matches the score function to reward gradients in a principled manner for reward finetuning, our Nabla-R2D3 enables effective adaptation of 3D diffusion models using only 2D reward signals. Extensive experiments show that, unlike vanilla finetuning baselines which either struggle to converge or suffer from reward hacking, Nabla-R2D3 consistently achieves higher rewards and reduced prior forgetting within a few finetuning steps.
zh

[CV-1] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

【速读】:该论文旨在解决基于扩散模型的图像生成在推理过程中速度慢且计算成本高的问题。其解决方案的关键在于提出了一种名为ECAD(Evolutionary Caching to Accelerate Diffusion models)的遗传算法,该算法通过学习高效的、针对每个模型的缓存策略,形成帕累托前沿,从而在仅需少量校准提示的情况下实现加速。ECAD无需修改网络参数或参考图像,能够显著提升推理速度,并在质量与延迟之间提供细粒度控制,同时适应不同扩散模型并有效泛化到未见的分辨率和模型变体。

链接: https://arxiv.org/abs/2506.15682
作者: Anirud Aggarwal,Abhinav Shrivastava,Matthew Gwilliam
机构: University of Maryland, College Park (马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 22 figures, 9 tables

点击查看摘要

Abstract:Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD’s learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and this http URL using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at this https URL and our code is available at this https URL.
zh
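ECAD 在“每个采样步是否复用缓存特征”的二值调度空间中做遗传搜索。下面给出一个玩具遗传算法骨架:适应度用显式代理(加速比减去连续复用惩罚)代替真实的 FID/CLIP 评估,种群规模、变异率等超参数均为本文假设。

```python
# 示意:在二值缓存调度上搜索的玩具遗传算法(适应度为假设的代理函数)
import random

STEPS = 20  # 扩散采样步数;基因为 1 表示该步复用缓存特征

def fitness(schedule):
    speedup = sum(schedule) / STEPS
    max_run = run = 0
    for s in schedule:                    # 连续复用过长会损害质量(假设)
        run = run + 1 if s else 0
        max_run = max(max_run, run)
    return speedup - 0.15 * max_run

def evolve(pop_size=30, gens=50, mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(STEPS)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]    # 精英保留
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, STEPS)             # 单点交叉
            child = [1 - g if random.random() < mut else g
                     for g in a[:cut] + b[cut:]]         # 位翻转变异
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

random.seed(0)
best = evolve()
print(best, round(fitness(best), 3))
```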

[CV-2] Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos

【速读】:该论文试图解决可变形物体动态建模的问题,这一问题由于物体多样的物理特性以及从有限视觉信息中估计状态的困难而具有挑战性。其解决方案的关键在于提出了一种结合物体粒子和空间网格的混合表示的神经动力学框架。该框架通过粒子-网格模型捕捉全局形状和运动信息,并预测密集的粒子运动,从而实现对不同形状和材质物体的建模;其中,粒子用于表示物体形状,空间网格则用于离散化三维空间以确保空间连续性并提升学习效率。

链接: https://arxiv.org/abs/2506.15680
作者: Kaifeng Zhang,Baoyu Li,Kris Hauser,Yunzhu Li
机构: Columbia University (哥伦比亚大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects, such as ropes, cloths, stuffed animals, and paper bags, from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at this https URL .
zh

[CV-3] Sekai: A Video Dataset towards World Exploration

【速读】:该论文试图解决现有视频生成数据集在世界探索训练中的不足,如地点有限、时长较短、场景静态以及缺乏关于探索和世界的标注。其解决方案的关键在于引入Sekai数据集,这是一个高质量的第一人称视角全球视频数据集,包含丰富的探索相关标注信息,涵盖超过5,000小时的步行或无人机视角视频,并提供了位置、场景、天气、人群密度、字幕和相机轨迹等详细注释。通过该数据集,作者进一步训练了一个名为YUME的交互式视频世界探索模型。

链接: https://arxiv.org/abs/2506.15675
作者: Zhen Li,Chuanhao Li,Xiaofeng Mao,Shaoheng Lin,Ming Li,Shitian Zhao,Zhaopan Xu,Xinyue Li,Yukang Feng,Jianwen Sun,Zizhen Li,Fanrui Zhang,Jiaxin Ai,Zhixiang Wang,Yuwei Wu,Tong He,Jiangmiao Pang,Yu Qiao,Yunde Jia,Kaipeng Zhang
机构: Shanghai AI Laboratory; Beijing Institute of Technology; Shanghai Innovation Institute; Shenzhen MSU-BIT University; The University of Tokyo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.
zh

[CV-4] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

【速读】:该论文旨在解决单图像或视频的重光照(relighting)问题,该任务需要精确的场景内在理解与高质量的光照传输合成。现有端到端的重光照模型受限于多光照配对数据的稀缺性,导致其在不同场景中的泛化能力受限;而两阶段管道虽能缓解数据需求,但易受误差累积影响,并在复杂光照条件或复杂材质下难以生成真实感输出。该论文提出一种通用方法,通过在单次处理中联合估计反照率(albedo)并合成重光照结果,充分利用视频扩散模型的生成能力,从而增强隐式场景理解,实现逼真的光照效果和复杂的材质交互。其解决方案的关键在于将反照率估计与光照合成过程联合优化,提升模型的泛化能力和输出质量。

链接: https://arxiv.org/abs/2506.15673
作者: Kai He,Ruofan Liang,Jacob Munkberg,Jon Hasselgren,Nandita Vijaykumar,Alexander Keller,Sanja Fidler,Igor Gilitschenski,Zan Gojcic,Zian Wang
机构: NVIDIA(英伟达); University of Toronto(多伦多大学); Vector Institute(向量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
zh

[CV-5] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

【速读】:该论文旨在解决视觉-语言模型(VLM)在推理阶段生成过程中存在的计算成本高和低置信度生成导致的持续幻觉问题。其解决方案的关键在于提出一种两阶段的推理框架——基于边际奖励的价值引导推理(ViMaR),该框架通过结合时序差分价值模型与边缘感知奖励调整,提高了推理效率和输出的真实性。第一阶段通过单次遍历识别出价值最高的图像描述,第二阶段则选择性地优化被忽略或视觉基础薄弱的片段,从而消除频繁奖励的评估,并通过校准的边际惩罚机制抑制低置信度的延续,同时保持描述的丰富性。

链接: https://arxiv.org/abs/2506.15649
作者: Ankan Deria,Adinath Madhavrao Dukre,Feilong Tang,Sara Atito,Sudipta Roy,Muhammad Awais,Muhammad Haris Khan,Imran Razzak
机构: Mohamed bin Zayed University of AI (穆罕默德·本·扎耶德人工智能大学); University of Surrey (萨里大学); Jio Institute (Jio研究所); UNSW (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4x speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
zh

[CV-6] Demystifying the Visual Quality Paradox in Multimodal Large Language Models

【速读】:该论文旨在解决视觉质量对多模态大语言模型(Multimodal Large Language Models, MLLMs)性能影响的问题,特别是探讨图像的感知质量是否直接提升模型的理解能力。研究发现,图像偏离人类感知保真度时,模型在某些任务上的表现反而可能提升,这一现象被称为视觉质量悖论。解决方案的关键在于提出一种轻量级的适应模块——视觉质量测试时调优(Visual-Quality Test-Time Tuning, VQ-TTT),其核心包括:在冻结的视觉编码器前插入可学习的低秩内核以调节频率内容,并通过LoRA微调浅层视觉编码器层,从而在单次前向传播中动态调整输入图像,使其符合特定任务的模型偏好。

链接: https://arxiv.org/abs/2506.15645
作者: Shuo Xing,Lanqing Guo,Hongyuan Hua,Seoyoung Lee,Peiran Li,Yufei Wang,Zhangyang Wang,Zhengzhong Tu
机构: Texas A&M University (得克萨斯A&M大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Toronto (多伦多大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT delivers significant average accuracy gains, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery, in the new era of AI being the main data customer.
zh
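VQ-TTT 在冻结的视觉编码器之前插入可学习的低秩模块来调整输入。下面以“恒等映射加低秩通道混合”为例给出一个最小 PyTorch 示意(论文针对的是频率内容调制,此处以通道调制代替);代理目标、编码器结构均为假设,LoRA 微调浅层部分从略。

```python
# 示意:冻结编码器前的可学习低秩输入调制(VQ-TTT 思路的玩具版)
import torch
import torch.nn as nn

class LowRankModulator(nn.Module):
    def __init__(self, channels=3, rank=2):
        super().__init__()
        self.u = nn.Parameter(torch.zeros(channels, rank))
        self.v = nn.Parameter(torch.randn(rank, channels) * 0.01)

    def forward(self, x):                            # x: [B, C, H, W]
        w = torch.eye(x.size(1)) + self.u @ self.v   # 恒等 + 低秩扰动
        return torch.einsum("oc,bchw->bohw", w, x)

frozen_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in frozen_encoder.parameters():
    p.requires_grad_(False)                          # 编码器保持冻结

mod = LowRankModulator()
opt = torch.optim.Adam(mod.parameters(), lr=1e-2)
img = torch.rand(4, 3, 32, 32)
for _ in range(5):                                   # 测试时的少步自适应
    feat = frozen_encoder(mod(img))
    loss = feat.var(dim=0).mean()                    # 假设性的代理目标
    opt.zero_grad(); loss.backward(); opt.step()
print(feat.shape)
```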

[CV-7] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

【速读】:该论文试图解决在具身环境中长期记忆整合与推理的问题,特别是在大规模视觉-语言模型(Vision-Language Models, VLMs)应用于机器人规划与控制时,其处理多日积累的长序列图像数据能力受限的问题。解决方案的关键在于引入一个针对长程具身任务的新基准,该基准在Habitat模拟器中评估60个需要持续参与和情境意识的任务,并支持通过程序生成扩展为更长、更具挑战性的版本,从而实现对记忆与推理能力的可扩展评估。同时,该工作还提出了结合先进VLM与低级导航策略的基线方法,以评估其在记忆密集型任务中的表现并识别改进方向。

链接: https://arxiv.org/abs/2506.15635
作者: Karmesh Yadav,Yusuf Ali,Gunshi Gupta,Yarin Gal,Zsolt Kira
机构: Georgia Tech(佐治亚理工学院); University of Oxford(牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Our dataset and code will be made available at: this https URL

点击查看摘要

Abstract:Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low level navigation policies, assessing their performance on these memory-intensive tasks and highlight areas for improvement.
zh

[CV-8] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

【速读】:该论文试图解决生成真实且合理的人与物体交互(Human-Object Interaction, HOI)的问题,这一任务具有挑战性,因为需要同时满足严格的接触精度和多样的运动流形。解决方案的关键在于提出HOIDiNi框架,该框架通过在预训练扩散模型的噪声空间中使用扩散噪声优化(Diffusion Noise Optimization, DNO)直接进行优化,从而在真实性和物理合理性之间取得平衡。该方法将问题分解为两个阶段:以物体为中心的阶段主要进行手-物体接触位置的离散选择,而以人类为中心的阶段则细化全身运动以实现该蓝图,从而在不牺牲运动自然性的前提下实现精确的手-物体接触。

链接: https://arxiv.org/abs/2506.15625
作者: Roey Ron,Guy Tevet,Haim Sawdayee,Amit H. Bermano
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. this https URL.
zh
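扩散噪声优化(DNO)的骨架是:固定去噪网络,只把初始噪声当作可训练变量,对采样结果上的目标损失反向传播。下面给出一个一维玩具版示意;去噪器、采样器与目标损失均为占位假设,真实方法的损失来自接触精度等约束。

```python
# 示意:Diffusion Noise Optimization,冻结去噪器、只优化初始噪声(玩具版)
import torch
import torch.nn as nn

torch.manual_seed(0)
denoiser = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
for p in denoiser.parameters():
    p.requires_grad_(False)                 # 预训练模型保持冻结

def sample(z):                              # 占位的少步“采样器”
    x = z
    for _ in range(4):
        x = x - 0.25 * denoiser(x)
    return x

z = torch.randn(1, 16, requires_grad=True)  # 可训练的初始噪声
opt = torch.optim.Adam([z], lr=0.05)
target = torch.ones(1, 16)                  # 占位目标:真实方法为接触/布局损失
for _ in range(200):
    loss = ((sample(z) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```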

[CV-9] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

【速读】:该论文旨在解决开放词汇三维目标检测中的计算开销大和内存占用高问题,这些问题限制了现有检测方法在实时任务中的部署。其解决方案的关键在于提出一种无需重建的在线框架,通过利用预训练的视觉基础模型(VFM)进行单视角三维目标检测,并结合CLIP模型获取开放词汇语义信息,同时采用多视角关联模块与优化模块实现跨视角检测框的统一与一致性优化,从而在保证检测性能的同时降低计算复杂度,实现高效实时的三维目标检测。

链接: https://arxiv.org/abs/2506.15610
作者: Yuqing Lan,Chenyang Zhu,Zhirui Gao,Jiazhao Zhang,Yihan Cao,Renjiao Yi,Yijie Wang,Kai Xu
机构: National University of Defense Technology(国防科技大学); Xiangjiang Laboratory(湘江实验室); Peking University(北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
zh
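多视角框融合的基本组件是 3D IoU 匹配与按置信度加权合并。下面给出轴对齐 3D IoU 与贪心分组融合的 numpy 示意;阈值为本文假设,真实方法还包含基于粒子滤波的随机优化与旋转框处理。

```python
# 示意:轴对齐 3D IoU 与多视角框的置信度加权融合(numpy)
import numpy as np

def iou_3d(a, b):
    # 框格式: [xmin, ymin, zmin, xmax, ymax, zmax]
    lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def fuse(boxes, scores, iou_thr=0.3):
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    used, fused = np.zeros(len(boxes), bool), []
    for i in np.argsort(-scores):            # 从高置信度框开始贪心分组
        if used[i]:
            continue
        group = [j for j in range(len(boxes))
                 if not used[j] and iou_3d(boxes[i], boxes[j]) >= iou_thr]
        for j in group:
            used[j] = True
        w = scores[group] / scores[group].sum()
        fused.append((w[:, None] * boxes[group]).sum(axis=0))  # 加权平均
    return np.array(fused)

b = [[0, 0, 0, 1, 1, 1], [0.1, 0, 0, 1.1, 1, 1], [5, 5, 5, 6, 6, 6]]
print(fuse(b, [0.9, 0.8, 0.7]))  # 前两个框融合为一个,第三个保持独立
```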

[CV-10] Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration MICCAI

【速读】:该论文旨在解决多模态变形图像配准(multi-modal deformable image registration, DIR)中由于模态间极端异质性导致的传统无监督DIR方法难以学习可靠空间映射的问题,这些问题常导致图像失真。其解决方案的关键在于提出M2M-Reg框架,该框架通过仅使用单模态相似性进行多模态DIR模型训练,同时保持现有的架构范式以实现无缝集成,并引入GradCyCon正则化器,利用循环训练方案促进微分同胚性,从而提升配准效果。

链接: https://arxiv.org/abs/2506.15596
作者: Kyobin Choo,Hyunkyung Han,Jinyeong Kim,Chanyong Yoon,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, 2 tables, Accepted at Medical Image Computing and Computer Assisted Intervention (MICCAI) 2025

点击查看摘要

Abstract:In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg’s cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at this https URL.
zh

[CV-11] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

【速读】:该论文旨在解决在真实世界视频超分辨率(Real-VSR)中同时恢复丰富的空间细节并保持时间一致性的问题,尤其是在使用预训练的生成式 AI(Generative AI)模型如稳定扩散(Stable Diffusion, SD)进行逼真细节合成时所面临的挑战。现有基于SD的Real-VSR方法通常为了保证时间连贯性而牺牲了空间细节,导致视觉质量不佳。论文提出的关键解决方案是通过双LoRA学习(Dual LoRA Learning, DLoRAL)范式,有效从低质量输入视频中提取退化鲁棒的时间一致性先验,并在增强视频细节的同时保持所提取的一致性先验。核心在于通过交叉帧检索(Cross-Frame Retrieval, CFR)模块和一致性LoRA(Consistency-LoRA, C-LoRA)学习鲁棒的时间表示,随后通过细节LoRA(Detail-LoRA, D-LoRA)增强空间细节并保持时间一致性。

链接: https://arxiv.org/abs/2506.15591
作者: Yujing Sun,Lingchen Sun,Shuaizheng Liu,Rongyuan Wu,Zhengqiang Zhang,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); OPPO Research Institute (OPPO研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at this https URL.
zh

[CV-12] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds

【速读】:该论文旨在解决传统定量结构模型(Quantitative Structural Model, QSM)在森林地上生物量(Above-Ground Biomass, AGB)估算中的局限性,包括其对高质量地面激光扫描(Terrestrial Laser Scanning, TLS)数据的依赖、针对单株树木的设计以及复杂的预处理步骤,这些因素限制了其可扩展性和实际应用。该研究提出了一种统一的框架,其关键在于采用基于图的处理流程,实现大规模点云数据的端到端处理,通过路径规划和抽象等图操作无缝整合树木分割、叶-木分离和三维骨架重建,从而提升在不同叶况、空间尺度和数据源下的性能。

链接: https://arxiv.org/abs/2506.15577
作者: Di Wang,Shi Li
机构: Xi’an Jiaotong University (西安交通大学); Shaanxi Joint (Key) Laboratory for Artificial Intelligence (陕西省人工智能联合(重点)实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages,19 figures

点击查看摘要

Abstract:Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation, and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
zh

[CV-13] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification

【速读】:该论文旨在解决超高清空间分辨率土地覆盖分类中的挑战,包括像素级标注成本高、尺度变化显著以及大规模视觉模型适应性有限等问题。其解决方案的关键在于提出一种参数高效的半监督分割框架,该框架利用SAM2的知识并引入了面向遥感的FreqWeaver Adapter,以增强细粒度细节建模能力,同时保持模型轻量化(仅占总参数的5.96%),从而在减少参数开销的同时有效利用未标注数据,实现更优的结构一致性分割效果。

链接: https://arxiv.org/abs/2506.15565
作者: Junhao Wu,Aboagye-Ntow Stephen,Chuyuan Wang,Gang Chen,Xin Huang
机构: Towson University (陶森大学); University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.
zh

[CV-14] Show-o2: Improved Native Unified Multimodal Models

【速读】:该论文旨在解决多模态理解与生成任务中模型的统一性与可扩展性问题,特别是在处理文本、图像和视频等多种模态时的高效融合与生成能力。其解决方案的关键在于构建一个基于3D因果变分自编码器空间的统一视觉表示,并通过双路径的空间(-时序)融合机制实现跨模态的可扩展性;同时,利用语言模型分别对语言头和流头进行自回归建模与流匹配,以提升文本标记预测和图像/视频生成的效果。此外,采用两阶段训练策略以有效学习并扩展至更大规模的模型。

链接: https://arxiv.org/abs/2506.15564
作者: Jinheng Xie,Zhenheng Yang,Mike Zheng Shou
机构: Show Lab, National University of Singapore(国家大学); ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at this https URL.
zh

[CV-15] Control and Realism: Best of Both Worlds in Layout-to-Image without Training ICML2025

【速读】:该论文旨在解决布局到图像生成中对象定位不精确和生成结果出现不真实伪影的问题。其核心解决方案是提出一种无需训练的方法WinWinLay,该方法通过两个关键策略——非局部注意力能量函数和自适应更新——协同提升控制精度与生成质量。非局部注意力能量函数用于缓解传统注意力能量函数带来的空间分布偏差,使对象更符合布局指令;自适应更新则基于Langevin动力学机制,减少从预训练领域偏离导致的分布外伪影,从而在保持布局约束的同时实现更真实的图像生成。

链接: https://arxiv.org/abs/2506.15563
作者: Bonan Li,Yinhan Hu,Songhua Liu,Xinchao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
zh
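Langevin 动力学更新在梯度下降步上叠加与步长匹配的高斯噪声,使迭代点近似停留在先验分布内。下面给出对潜变量的单步更新示意,更新式为 x_{t+1} = x_t - η∇E(x_t) + sqrt(2η)·N(0, I);能量函数此处用占位的二次函数代替,并非论文中的非局部注意力能量函数。

```python
# 示意:对潜变量的 Langevin 动力学更新步(能量函数为占位示例)
import torch

def langevin_step(z, energy_fn, step_size=1e-2):
    z = z.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(energy_fn(z), z)
    noise = torch.randn_like(z)
    # x_{t+1} = x_t - eta * grad E(x_t) + sqrt(2 * eta) * N(0, I)
    return (z - step_size * grad + (2 * step_size) ** 0.5 * noise).detach()

torch.manual_seed(0)
z = torch.randn(1, 4, 8, 8)
for _ in range(100):
    z = langevin_step(z, lambda x: (x ** 2).sum())  # 占位能量:二次势
print(float(z.abs().mean()))
```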

[CV-16] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation

【速读】:该论文旨在解决密集度量深度估计中依赖于密集LiDAR监督的问题,该监督通常通过多帧投影和插值生成,导致成本高且数据需求大。其解决方案的关键在于提出RaCalNet框架,该框架利用稀疏LiDAR监督来学习优化的雷达测量值,从而将监督密度降低至仅约1%,而非传统方法中的密集监督。通过重新校准和细化稀疏雷达点以构建精确的深度先验,并将其作为可靠锚点引导单目深度预测,实现了无需密集监督的度量尺度估计。

链接: https://arxiv.org/abs/2506.15560
作者: Xingrui Qin,Wentao Zhao,Chuan Cao,Yihe Niu,Houcheng Jiang,Jingchuan Wang
机构: Shanghai Jiao Tong University (上海交通大学); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
zh

[CV-17] CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation

【速读】: This paper addresses the difficulty of developing robust myocardial scar segmentation models from late gadolinium enhancement (LGE) cardiac MRI, caused by the limited availability and high variability of LGE images with high-quality scar annotations. The key to the solution is the proposed CLAIM framework, whose core is the SMILE module (Scar Mask generation guided by cLinical knowledgE). SMILE conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize scar images that are anatomically consistent and spatially diverse. In addition, CLAIM adopts a joint training strategy that optimizes the segmentation network alongside the generator, improving both the realism of the synthesized scars and the segmentation accuracy.

链接: https://arxiv.org/abs/2506.15549
作者: Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Richard H. Clayton,Chen (Cherise) Chen
机构: University of Sheffield (谢菲尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 Pages

点击查看摘要

Abstract:Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation, a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
zh

[CV-18] NTIRE 2025 Image Shadow Removal Challenge Report

【速读】: This paper addresses the shadow removal problem, advancing the field by evaluating different methods in terms of reconstruction fidelity and visual perception. The key to the solution is a multi-faceted evaluation on the WSRD+ dataset, which simulates complex interactions between self- and cast-shadows over a large number of diverse objects, textures, and materials, thereby thoroughly testing the effectiveness and robustness of the competing algorithms.

链接: https://arxiv.org/abs/2506.15524
作者: Florin-Alexandru Vasluianu,Tim Seizinger,Zhuyun Zhou,Cailian Chen,Zongwei Wu,Radu Timofte,Mingjia Li,Jin Hu,Hainuo Wang,Hengxing Liu,Jiarui Wang,Qiming Hu,Xiaojie Guo,Xin Lu,Jiarong Yang,Yuanfei Bao,Anya Hu,Zihao Fan,Kunyu Wang,Jie Xiao,Xi Wang,Xueyang Fu,Zheng-Jun Zha,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu,Xingbo Wang,Dong Li,Yuxu Chen,Bin Chen,Yuanbo Zhou,Yuanbin Chen,Hongwei Wang,Jiannan Lin,Qinquan Gao,Tong Tong,Zhao Zhang,Yanyan Wei,Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen,Haobo Liang,Jiajie Jing,Junyu Li,Yan Yang,Seoyeon Lee,Chaewon Kim,Ziyu Feng,Shidi Chen,Bowen Luan,Zewen Chen,Vijayalaxmi Ashok Aralikatti,G Gyaneshwar Rao,Nikhil Akalwadi,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Alexandru Brateanu,Cosmin Ancuti,Tanmay Chaturvedi,Manish Kumar,Anmol Srivastav,Daksh Trivedi,Shashwat Thakur,Kishor Upla,Zeyu Xiao,Zhuoyuan Li,Boda Zhou,Shashank Shekhar,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Suiyi Zhao,Bo Wang,Yan Luo,Mingshen Wang,Yilin Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
zh

[CV-19] Pixel-level Certified Explanations via Randomized Smoothing

【速读】: This paper tackles the non-robustness of pixel-level attribution maps used to explain deep learning predictions: small, imperceptible input perturbations can drastically alter the attribution map while leaving the model's prediction unchanged, undermining the trustworthiness of the explanations. The key to the solution is the first certification framework based on randomized smoothing that provides pixel-level robustness guarantees for any black-box attribution method. By sparsifying and smoothing attribution maps, the task is reformulated as a segmentation problem, and each pixel's importance is certified against ℓ₂-bounded perturbations.
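The sparsify-then-smooth recipe can be illustrated with a toy Monte-Carlo sketch: binarize each attribution map to its top-k pixels under Gaussian input noise and majority-vote per pixel. The stand-in attribution function and all parameters are illustrative assumptions, not the paper's certified procedure.

```python
import numpy as np

def topk_mask(attr, k):
    """Keep the k highest-attribution pixels (sparsification)."""
    thresh = np.partition(attr.ravel(), -k)[-k]
    return (attr >= thresh).astype(np.int32)

def smoothed_attribution(x, attribute_fn, k=50, sigma=0.25, n=200, seed=0):
    """Monte-Carlo smoothing: majority vote of sparsified attribution maps
    over Gaussian input perturbations; votes near 0 or 1 are 'stable' pixels."""
    rng = np.random.default_rng(seed)
    votes = np.zeros_like(x, dtype=np.float64)
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        votes += topk_mask(attribute_fn(noisy), k)
    return votes / n   # per-pixel vote frequency in [0, 1]

# Stand-in attribution: squared intensity as a fake saliency map.
attribute = lambda img: img ** 2
image = np.random.rand(32, 32)
freq = smoothed_attribution(image, attribute)
print((freq > 0.9).sum(), "pixels voted important in >90% of samples")
```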

链接: https://arxiv.org/abs/2506.15499
作者: Alaa Anani,Tobias Lorenz,Mario Fritz,Bernt Schiele
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel’s importance against ℓ₂-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at this https URL.
zh

[CV-20] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

【速读】: This paper addresses the difficulty of generalization and high-quality sequence synthesis in 4D human-object interaction (4D HOI) generation, caused by the lack of large-scale annotated data. The key to the solution is the two-stage GenHOI framework: in the first stage, an Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects by learning from 3D HOI datasets alone, reducing the dependence on large-scale 4D HOI datasets; in the second stage, a Contact-Aware Diffusion Model (ContactDM) seamlessly interpolates the sparse keyframes into temporally coherent, high-fidelity 4D HOI sequences, with a Contact-Aware Encoder and Contact-Aware HOI Attention effectively capturing and integrating human-object contact signals.

链接: https://arxiv.org/abs/2506.15483
作者: Shujia Li,Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Yutong Ban
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.
zh

[CV-21] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning

【速读】: This paper addresses the clinically challenging task of generating medical reports from imaging data. The key to the solution is MRG-LLM, a novel multimodal large language model that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. The core innovation lies in generating instance-specific prompts via conditional affine transformations derived from visual features, enabling precise, targeted report generation for individual medical images.
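A conditional affine (FiLM-style) prompt customizer of the kind described can be sketched in a few lines of PyTorch; the module name, dimensions, and pooled-feature input are assumptions for illustration, not MRG-LLM's actual implementation.

```python
import torch
import torch.nn as nn

class PromptCustomizer(nn.Module):
    """Generate instance-specific prompt embeddings by applying a
    feature-conditioned affine transform to a learnable base prompt."""
    def __init__(self, num_prompts=8, dim=768, feat_dim=1024):
        super().__init__()
        self.base_prompt = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.to_scale = nn.Linear(feat_dim, dim)   # predicts gamma
        self.to_shift = nn.Linear(feat_dim, dim)   # predicts beta

    def forward(self, visual_feat):                # visual_feat: [B, feat_dim]
        gamma = self.to_scale(visual_feat).unsqueeze(1)   # [B, 1, dim]
        beta = self.to_shift(visual_feat).unsqueeze(1)    # [B, 1, dim]
        # Affine modulation: each image gets its own prompt sequence.
        return (1 + gamma) * self.base_prompt.unsqueeze(0) + beta  # [B, P, dim]

feat = torch.randn(4, 1024)                        # pooled visual encoder output
prompts = PromptCustomizer()(feat)
print(prompts.shape)                               # torch.Size([4, 8, 768])
```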

链接: https://arxiv.org/abs/2506.15477
作者: Chunlei Li,Jingyang Hou,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
zh

[CV-22] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

【速读】: This paper addresses the fact that 3D AI-generated content (AIGC) remains largely inaccessible to anyone other than researchers, developers, and designers, owing to the complexity of collecting and processing 3D data and training 3D models. The key to the solution is presenting Hunyuan3D 2.1 as a case study: the system comprises two core components, Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis, and the tutorial provides comprehensive guidance covering 3D data processing, model training, and performance evaluation, enabling users to fine-tune or develop robust 3D generative models for gaming, virtual reality, and industrial design.

链接: https://arxiv.org/abs/2506.15442
作者: Team Hunyuan3D,Shuhui Yang,Mingxin Yang,Yifei Feng,Xin Huang,Sheng Zhang,Zebin He,Di Luo,Haolin Liu,Yunfei Zhao,Qingxiang Lin,Zeqiang Lai,Xianghui Yang,Huiwen Shi,Zibo Zhao,Bowen Zhang,Hongyu Yan,Lifu Wang,Sicong Liu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Dongyuan Guo,Junlin Yu,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Shida Wei,Chao Zhang,Yonghao Tan,Yifu Sun,Lin Niu,Shirui Huang,Bojian Zheng,Shu Liu,Shilin Chen,Xiang Yuan,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Jingwei Huang,Chunchao Guo
机构: Tencent Hunyuan (腾讯混元)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Github link: this https URL

点击查看摘要

Abstract:3D AI-generated content (AIGC) is a fast-growing field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.
zh

[CV-23] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance

【速读】: This paper addresses the reliability of out-of-distribution (OOD) detection for deep learning models in medical imaging, where accurately flagging OOD inputs is critical for trustworthy diagnostic decisions. The key to the solution is NERO, a novel OOD scoring mechanism that clusters neuron-level relevance at the feature layer to form representative centroids for each in-distribution (ID) class, and introduces a relevance distance metric to quantify a new sample's deviation from these centroids, thereby enhancing OOD separability. Performance is further refined by incorporating scaled relevance in the bias term and combining feature norms, yielding explainable OOD detection.
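A minimal sketch of centroid-based relevance scoring follows, assuming one mean centroid per ID class and Euclidean distance; NERO's clustering, scaled-relevance bias term, and feature-norm combination are omitted, and the relevance vectors are random stand-ins.

```python
import numpy as np

def fit_class_centroids(relevance, labels):
    """One representative centroid per in-distribution class: the mean of
    that class's neuron-level relevance vectors (single-cluster case)."""
    classes = np.unique(labels)
    return np.stack([relevance[labels == c].mean(axis=0) for c in classes])

def ood_score(sample_relevance, centroids):
    """Relevance-distance score: distance to the nearest ID centroid.
    A larger distance suggests the sample is out-of-distribution."""
    dists = np.linalg.norm(centroids - sample_relevance, axis=1)
    return dists.min()

rng = np.random.default_rng(0)
rel_id = rng.normal(0, 1, size=(500, 64))          # fake ID relevance vectors
labels = rng.integers(0, 5, size=500)              # 5 ID classes
centroids = fit_class_centroids(rel_id, labels)
print("ID-like score :", ood_score(rng.normal(0, 1, 64), centroids))
print("OOD-like score:", ood_score(rng.normal(4, 1, 64), centroids))
```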

链接: https://arxiv.org/abs/2506.15404
作者: Anju Chhetri,Jari Korhonen,Prashnna Gyawali,Binod Bhattarai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model’s reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample’s deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
zh

[CV-24] MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System

【速读】: This paper addresses the inaccurate object modeling and unreliable data association that conventional RGB-D or monocular SLAM systems suffer from in large-scale or outdoor environments, owing to narrow fields of view, occlusion sensitivity, and limited depth perception. The key to the solution is MCOO-SLAM, a novel multi-camera omnidirectional object SLAM system that exploits surround-view camera configurations: it fuses semantic, geometric, and temporal information for robust object association across multiple views, and designs an omnidirectional loop closure module for viewpoint-invariant place recognition, improving the robustness, consistency, and semantic richness of the resulting map.

链接: https://arxiv.org/abs/2506.15402
作者: Miaoxin Pan,Jinnan Li,Yaowen Zhang,Yi Yang,Yufeng Yue
机构: Beijing Institute of Technology (北京理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object-level SLAM offers structured and semantically meaningful environment representations, making it more interpretable and suitable for high-level robotic tasks. However, most existing approaches rely on RGB-D sensors or monocular views, which suffer from narrow fields of view, occlusion sensitivity, and limited depth perception-especially in large-scale or outdoor environments. These limitations often restrict the system to observing only partial views of objects from limited perspectives, leading to inaccurate object modeling and unreliable data association. In this work, we propose MCOO-SLAM, a novel Multi-Camera Omnidirectional Object SLAM system that fully leverages surround-view camera configurations to achieve robust, consistent, and semantically enriched mapping in complex outdoor scenarios. Our approach integrates point features and object-level landmarks enhanced with open-vocabulary semantics. A semantic-geometric-temporal fusion strategy is introduced for robust object association across multiple views, leading to improved consistency and accurate object modeling, and an omnidirectional loop closure module is designed to enable viewpoint-invariant place recognition using scene-level descriptors. Furthermore, the constructed map is abstracted into a hierarchical 3D scene graph to support downstream reasoning tasks. Extensive experiments in real-world demonstrate that MCOO-SLAM achieves accurate localization and scalable object-level mapping with improved robustness to occlusion, pose variation, and environmental complexity.
zh

[CV-25] When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class ICML2025

【速读】: This paper addresses data-free image synthesis for pre-trained models, i.e., generating images that approximate the data distribution learned by a pre-trained model without access to its original training data. Because existing methods lack prior knowledge about natural images, their samples often deviate from the true data distribution. The key to the solution is DDIS (Diffusion-assisted Data-free Image Synthesis), which uses a text-to-image diffusion model as a strong image prior: Domain Alignment Guidance (DAG) aligns the synthetic data domain with the training data domain during diffusion sampling, and a single Class Alignment Token (CAT) embedding is optimized to effectively capture class-specific attributes, producing high-quality images that better match the training distribution.

链接: https://arxiv.org/abs/2506.15381
作者: Yujin Kim,Hyunsoo Kim,Hyunwoo J.Kim,Suhyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICML 2025

点击查看摘要

Abstract:Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.
zh

[CV-26] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification

【速读】: This paper addresses the geometric distortions in animal re-identification caused by the deformable nature of fur or skin patterns under body movement and posture changes. The key to the solution is a geometry-aware texture mapping approach that unwarps pelage patterns into a canonical UV space, enabling more robust feature matching. The method uses surface normal estimation to guide the unwrapping process while preserving geometric consistency between the 3D surface and the 2D texture space.

链接: https://arxiv.org/abs/2506.15369
作者: Aleksandr Algasov,Ekaterina Nepovinnykh,Fedor Zolotarev,Tuomas Eerola,Heikki Kälviäinen,Pavel Zemčík,Charles V. Stewart
机构: Lappeenranta-Lahti University of Technology LUT (拉彭兰塔-劳里大学技术学院 LUT); Brno University of Technology (布鲁诺技术大学 BUT); Rensselaer Polytechnic Institute (伦斯勒理工学院 RPI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwarps pelage patterns, the unique markings found on an animal’s skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.
zh

[CV-27] Open-World Object Counting in Videos

【速读】: This paper tackles open-world object counting in videos: given a text description or an image example that specifies the target object, the goal is to enumerate all unique instances of that object in the video. The task is especially challenging in crowded scenes, where double counting must be avoided and reappearances must be recognized. The key to the solution is the CountVid model, which combines an image-based counting model with a promptable video segmentation and tracking model, enabling automated, open-world object counting across video frames.
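Once a promptable segmenter and tracker assign persistent identities, avoiding double counting reduces to a set union over per-frame track IDs, as this toy sketch with made-up IDs shows; the paper's full pipeline, of course, also handles detection and prompting.

```python
# Each frame maps detected objects to persistent track IDs produced by a
# tracker; re-appearing objects keep their ID, so the union gives the count.
frames = [
    {"penguin-1", "penguin-2"},            # frame 0
    {"penguin-2", "penguin-3"},            # frame 1: penguin-1 occluded
    {"penguin-1", "penguin-3"},            # frame 2: penguin-1 reappears
]

unique_ids = set().union(*frames)
print(len(unique_ids))                     # 3, not 6: no double counting
```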

链接: https://arxiv.org/abs/2506.15368
作者: Niki Amini-Naieni,Andrew Zisserman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at this https URL.
zh

[CV-28] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models MICCAI2025

【速读】: This paper addresses two problems in pathology image classification under open-set conditions: the large amount of out-of-distribution (OOD) data in the unlabeled pool makes traditional active learning (AL) inefficient, and random selection in the first query round wastes annotation budget. The key to the solution is OpenPath, a novel open-set active learning method built on a pre-trained vision-language model (VLM): in the first query, task-specific prompts that combine target and relevant non-target class prompts select in-distribution (ID) and informative samples from the unlabeled pool; in subsequent queries, Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) together ensure both the purity and the informativeness of each query, effectively avoiding the selection of OOD samples.
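Entropy-guided stochastic sampling can be sketched as drawing a query batch with probability proportional to predictive entropy; the random logits below are stand-ins for VLM outputs on ID candidates, and the exact weighting used in EGSS may differ.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each sample's class-probability vector."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def entropy_guided_sample(probs, budget, rng):
    """Stochastically draw a query batch with probability proportional to
    predictive entropy, trading informativeness against diversity."""
    ent = predictive_entropy(probs)
    p = ent / ent.sum()
    return rng.choice(len(probs), size=budget, replace=False, p=p)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4))                    # fake ID-candidate logits
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
picked = entropy_guided_sample(probs, budget=16, rng=rng)
print(sorted(picked.tolist()))
```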

链接: https://arxiv.org/abs/2506.15318
作者: Lanfeng Zhong,Xin Liao,Shichuan Zhang,Shaoting Zhang,Guotai Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025 early accept

点击查看摘要

Abstract:Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model’s performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at this https URL.
zh

[CV-29] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning

【速读】: This paper addresses the online generation of vectorized high-definition (HD) maps and bird's-eye-view (BEV) semantic maps for autonomous driving. The key to the solution is MapFM, an enhanced end-to-end model that incorporates a powerful foundation model for encoding camera images to boost the quality of the feature representation, and integrates auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning setup provides richer contextual supervision, yielding a more comprehensive scene representation and, ultimately, higher accuracy and quality of the predicted vectorized HD maps.

链接: https://arxiv.org/abs/2506.15313
作者: Leonid Ivanov,Vasily Yuryev,Dmitry Yudin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. Submitted. 12 pages, 4 figures

点击查看摘要

Abstract:In autonomous driving, high-definition (HD) maps and semantic maps in bird’s-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We show a significant boost in feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model’s understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at this https URL.
zh

[CV-30] One-shot Face Sketch Synthesis in the Wild via Generative Diffusion Prior and Instruction Tuning

【速读】: This paper addresses the data scarcity and high labor cost of conventional face sketch synthesis methods, which rely on training with large numbers of paired photo-sketch images. The key to the solution is a diffusion-based one-shot face sketch synthesis method: text instructions are optimized on a diffusion model using face photo-sketch pairs, and the instructions obtained through gradient-based optimization are then used at inference, enabling high-quality sketch generation from minimal training data.

链接: https://arxiv.org/abs/2506.15312
作者: Han Wu,Junyao Li,Kangbo Zhao,Sen Zhang,Yukai Shi,Liang Lin
机构: Guangdong University of Technology(广东工业大学); TikTok(抖音); ByteDance(字节跳动); Sun Yat-sen University(中山大学)
类目: Graphics (cs.GR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: We propose a novel framework for face sketch synthesis, where merely a single pair of samples suffices to enable in-the-wild face sketch synthesis

点击查看摘要

Abstract:Face sketch synthesis is a technique aimed at converting face photos into sketches. Existing face sketch synthesis research mainly relies on training with numerous photo-sketch sample pairs from existing datasets. However, these large-scale discriminative learning methods will have to face problems such as data scarcity and high human labor costs. Once the training data becomes scarce, their generative performance significantly degrades. In this paper, we propose a one-shot face sketch synthesis method based on diffusion models. We optimize text instructions on a diffusion model using face photo-sketch image pairs. Then, the instructions derived through gradient-based optimization are used for inference. To simulate real-world scenarios more accurately and evaluate method effectiveness more comprehensively, we introduce a new benchmark named One-shot Face Sketch Dataset (OS-Sketch). The benchmark consists of 400 pairs of face photo-sketch images, including sketches with different styles and photos with different backgrounds, ages, sexes, expressions, illumination, etc. For a solid out-of-distribution evaluation, we select only one pair of images for training at each time, with the rest used for inference. Extensive experiments demonstrate that the proposed method can convert various photos into realistic and highly consistent sketches in a one-shot context. Compared to other methods, our approach offers greater convenience and broader applicability. The dataset will be available at: this https URL
zh

[CV-31] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering ACM-MM2025 MICRO

【速读】: This paper addresses the shortcomings of conventional approaches that treat facial micro-expression (ME) spotting and recognition as separate tasks, which is suboptimal and inefficient for analyzing long-duration videos. The key to the solution is introducing two new tasks: ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline, and ME visual question answering (ME-VQA), which leverages multimodal large language models (MLLMs) or large vision-language models (LVLMs) to explore ME understanding through visual question answering, improving ME analysis in complex scenarios.

链接: https://arxiv.org/abs/2506.15298
作者: Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Wen-Huang Cheng,Xiaobai Li,Xiaopeng Hong,Su-Jing Wang,Adrian K. Davison
机构: Manchester Metropolitan University (曼彻斯特城市大学); Institute of Psychology (心理研究所); University of the Chinese Academy of Sciences (中国科学院大学); Heriot-Watt University Malaysia (赫瑞瓦特大学马来西亚分校); National Taiwan University (台湾大学); University of Oulu (奥卢大学); Zhejiang University (浙江大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Micro-Expression Grand Challenge (MEGC) at ACM MM 2025

点击查看摘要

Abstract:Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at this https URL.
zh

[CV-32] Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models IJCAI2025

【速读】: This paper addresses full-body human pose estimation from sparse, loosely attached inertial measurement units (IMUs); conventional methods assume the IMUs are tightly attached to the body, an assumption that often fails in real-world scenarios. The key to the solution is using transformer-based diffusion models to synthesize loose IMU data, simulated from an existing garment-aware human motion dataset, and to estimate human poses from this challenging data, while incorporating garment-related parameters during training so that the model better captures the pose variations introduced by looser or tighter clothing.

链接: https://arxiv.org/abs/2506.15290
作者: Andela Ilic,Jiaxi Jiang,Paul Streli,Xintong Liu,Christian Holz
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present a new task of full-body human pose estimation using sparse, loosely attached IMU sensors. To solve this task, we simulate IMU recordings from an existing garment-aware human motion dataset. We developed transformer-based diffusion models to synthesize loose IMU data and estimate human poses based on this challenging loose IMU data. In addition, we show that incorporating garment-related parameters while training the model on simulated loose data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Experiments show that our proposed diffusion methods trained on simulated and synthetic data outperformed the state-of-the-art methods quantitatively and qualitatively, opening up a promising direction for future research.
zh

[CV-33] AI-driven visual monitoring of industrial assembly tasks

【速读】: This paper addresses the problem of preventing equipment damage caused by procedural errors and ensuring worker safety in industrial assembly tasks, where the core challenge is real-time visual monitoring of the assembly process without relying on rigid workspace setups or visual markers. The key to the solution is the ViMAT system, which combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed from the observed assembly state and prior task knowledge, enabling effective monitoring under partial and uncertain visual observations.

链接: https://arxiv.org/abs/2506.15285
作者: Mattia Nardon,Stefano Messelodi,Antonio Granata,Fabio Poiesi,Alberto Danese,Davide Boscaini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: this https URL
zh

[CV-34] BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement MICCAI2025

【速读】: This paper addresses the challenge of accurately identifying critical anatomical structures in laparoscopic liver surgery, in particular the precise detection of curvilinear anatomical landmarks in laparoscopic images required for 2D-3D registration between MRI/CT and laparoscopic views in augmented reality (AR) systems. The key to the solution is BCRNet (Bezier Curve Refinement Net), a framework centered on a Bezier curve refinement strategy: multi-modal feature extraction, adaptive curve proposal initialization, and a hierarchical curve refinement mechanism together enable high-precision localization and iterative adjustment of anatomical landmarks in laparoscopic images.
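The curve representation at the heart of the refinement strategy is easy to make concrete: a Bezier curve proposal is just a set of control points, which can be sampled via the de Casteljau recursion. This sketch shows only the parameterization; BCRNet's learned initialization and refinement heads are not reproduced here.

```python
import numpy as np

def bezier_points(ctrl, n=100):
    """Sample n points on a Bezier curve from its control points via the
    de Casteljau recursion (numerically stable for any curve degree)."""
    t = np.linspace(0.0, 1.0, n)[:, None, None]    # [n, 1, 1]
    pts = np.broadcast_to(ctrl, (n, *ctrl.shape)).copy()
    while pts.shape[1] > 1:                        # repeated linear interpolation
        pts = (1 - t) * pts[:, :-1] + t * pts[:, 1:]
    return pts[:, 0]                               # [n, 2] curve samples

# A cubic curve proposal: 4 control points in image coordinates.
ctrl = np.array([[10.0, 200.0], [80.0, 40.0], [180.0, 60.0], [250.0, 220.0]])
curve = bezier_points(ctrl)
print(curve[0], curve[-1])    # endpoints coincide with ctrl[0] and ctrl[-1]
```

A refinement head would then predict offsets to these control points and re-score the resampled curve at each stage.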

链接: https://arxiv.org/abs/2506.15279
作者: Qian Li,Feng Liu,Shuojue Yang,Daiyun Shen,Yueming Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025, 11 pages, 2 figures

点击查看摘要

Abstract:Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration progress is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
zh

[CV-35] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion

【速读】: This paper addresses the poor performance of video compression methods based on implicit neural representations (INRs) on detail-intensive, fast-changing video content. The key to the solution is the multi-scale feature fusion framework MSNeRV: temporal windows enhance temporal consistency; the video is divided into Groups of Pictures (GoPs), with a GoP-level grid used for background representation; a multi-scale spatial decoder with a scale-adaptive loss function integrates multi-resolution and multi-frequency information; and a multi-scale feature block fully exploits the network's hidden features, improving both representational capability and compression efficiency.
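One plausible reading of a scale-adaptive reconstruction loss is a weighted sum of per-resolution terms, sketched below; the pooling-based downsampling and the weights are assumptions, not MSNeRV's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(pred, target, scales=(1, 2, 4), weights=(1.0, 0.5, 0.25)):
    """Weighted sum of L1 reconstruction losses over several resolutions:
    coarse scales emphasize low-frequency structure, fine scales detail."""
    loss = 0.0
    for s, w in zip(scales, weights):
        p = F.avg_pool2d(pred, s) if s > 1 else pred
        t = F.avg_pool2d(target, s) if s > 1 else target
        loss = loss + w * F.l1_loss(p, t)
    return loss

pred = torch.rand(1, 3, 64, 64, requires_grad=True)   # decoded frame (toy)
target = torch.rand(1, 3, 64, 64)                     # ground-truth frame (toy)
print(multiscale_loss(pred, target).item())
```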

链接: https://arxiv.org/abs/2506.15276
作者: Jun Zhu,Xinfeng Zhang,Lv Tang,JunHao Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Implicit Neural representations (INRs) have emerged as a promising approach for video compression, and have achieved comparable performance to the state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.
zh

[CV-36] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing

【速读】: This paper addresses the time-to-market and product-quality pressures in the semiconductor sector caused by high demand and fierce competition, focusing on improving the adaptability and generalization of models for defect classification. The key to the solution is Domain Adaptation (DA), which transfers knowledge learned in a source domain to a target domain, reducing the need for extensive manual re-labeling or model re-training and thereby improving robustness and scalability. The paper further proposes DBACS, a CycleGAN-inspired method enhanced with additional loss terms to improve performance, and validates all approaches on real-world electron microscope images, demonstrating their value for advancing DA techniques in the semiconductor field.

链接: https://arxiv.org/abs/2506.15260
作者: Adrian Poniatowski,Natalie Gentner,Manuel Barusco,Davide Dalle Pezze,Samuele Salti,Gian Antonio Susto
机构: Infineon Technologies AG(英飞凌科技公司); University of Bologna(博洛尼亚大学); University of Padova(帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
zh

[CV-37] Retrospective Memory for Camouflaged Object Detection

【速读】: This paper addresses the lack of explicit mechanisms for acquiring historical context in existing camouflaged object detection (COD) methods, which limits their adaptability and effectiveness in complex scenes. The key to the solution is RetroMem, a recall-augmented COD architecture that dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. RetroMem adopts a two-stage training paradigm consisting of a learning stage and a recall stage: the learning stage introduces a dense multi-scale adapter (DMA) that improves the pretrained encoder's ability to capture rich multi-scale visual information with very few trainable parameters, while the recall stage proposes a dynamic memory mechanism (DMM) and inference pattern reconstruction (IPR) that exploit the latent relationships between learned knowledge and the current sample context to reconstruct the inference of camouflage patterns, significantly improving the model's understanding of camouflage scenes.

链接: https://arxiv.org/abs/2506.15244
作者: Chenxi Zhang,Jiayun Wu,Qing Zhang,Yazhe Zhai,Youwei Pang
机构: Shanghai Institute of Technology (上海理工大学); Nanjing University of Science (南京理工大学); Dalian University of Technology (大连理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder’s capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model’s understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.
zh

[CV-38] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories IROS2025

【速读】: This paper addresses the insufficient accuracy of camera pose estimation for Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) under complex camera trajectories. Existing methods introduce external constraints to mitigate this but still fall short of satisfactory accuracy in complex-trajectory scenarios. The key to the proposed RA-NeRF is its incremental pipeline, which combines NeRF scene reconstruction with photometric consistency and flow-driven pose regulation to strengthen robustness during initialization and localization, while an implicit pose filter captures the camera motion pattern and removes noise from the pose estimates.

链接: https://arxiv.org/abs/2506.15242
作者: Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng
机构: School of Geodesy and Geomatics, Wuhan University(测绘学院,武汉大学); Department of Computer Science and Technology, HIT (Shenzhen)(计算机科学与技术系,哈工大(深圳)); XGRIDS( XGRIDS); Department of Computer Science, HKBU(计算机科学系,香港浸会大学); Department of Computer Science and Engineering, HKUST(计算机科学与工程系,香港科技大学); Data Science and Analytics Thrust, HKUST (Guangzhou)(数据科学与分析学部,香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2025

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temples dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
zh

[CV-39] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images

【速读】: This paper addresses the challenges of ship detection in synthetic aperture radar (SAR) imagery, including large scale variation among ships, small offshore vessels mixed with noise, and complex backgrounds around large nearshore ships. The key to the solution is the C-AFBiFPN feature enhancement and fusion framework, which adds a Convolutional Feature Enhancement (CFE) module after the backbone to enrich the feature representation, and innovatively integrates BiFormer attention into the BiFPN fusion strategy to form the AFBiFPN network, improving the global modeling capability of cross-scale feature fusion and adaptively focusing on critical feature regions.

链接: https://arxiv.org/abs/2506.15231
作者: Liangjie Meng,Danxia Li,Jinrong He,Lili Ma,Zhixin Li
机构: Yan’an University (延安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, 2 tables. Code available at this https URL

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
zh

[CV-40] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder

【速读】: This paper addresses the limited ability of existing multimodal medical image fusion (MMIF) methods to capture detailed features and model cross-modal feature interaction, which leads to suboptimal fused image quality. The key to the solution is a two-stage diffusion-model-based fusion network (DM-FNet): in Stage I, a diffusion process trains a UNet for image reconstruction, extracting multilevel features; in Stage II, noisy images at various steps are fed into the fusion network to strengthen its feature recognition capability, with three key fusion modules adaptively processing medical images from different modalities. Finally, the robust network structure and a hybrid loss function jointly harmonize the brightness, color, contrast, and detail of the fused image, improving its quality and information density.

链接: https://arxiv.org/abs/2506.15218
作者: Dan He,Weisheng Li,Guofen Wang,Yuping Huang,Shiqiang Liu
机构: Chongqing University of Posts and Telecommunications(重庆邮电大学); Chongqing Key Laboratory of Image Recognition(重庆图像识别重点实验室); Key Laboratory of Cyberspace Big Data Intelligent Security (Chongqing University of Posts and Telecommunications)(网络大数据智能安全重点实验室(重庆邮电大学)); Chongqing Normal University(重庆师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by IEEE Transactions on Multimedia (TMM) in March 2025

点击查看摘要

Abstract:Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model’s feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image’s brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at this https URL.
zh

[CV-41] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models ICML2025

【速读】: This paper addresses the privacy threat posed by the improved semantic understanding of vision-language pretrained (VLP) models, which makes publicly posted images easy to exploit by search engines and similar tools. The key to the solution is a flexible coding method, Privacy-Shielded Image Compression (PSIC), which produces bitstreams with multiple decoding options: by default, the decoded image preserves satisfactory perceptual quality while preventing interpretation by VLP models, a Conditional Latent Trigger Generation (CLTG) module produces bias information from customizable conditions to steer the decoding process toward different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function improves the encryption performance.

链接: https://arxiv.org/abs/2506.15201
作者: Xuelin Shen,Jiayin Xu,Kangsheng Yin,Wenhan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures, publised to ICML 2025

点击查看摘要

Abstract:The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users’ privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model’s uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.
zh

[CV-42] Conquering the Retina: Bringing Visual in-Context Learning to OCT

【速读】: This paper addresses the limited applicability of specialized models in medical image analysis: such models handle only predefined tasks and require expertise and extensive resources to develop and adapt. The key to the solution is training generalist models that generalize across tasks via visual in-context learning (VICL), i.e., from a few examples provided at inference time, allowing medical practitioners to define tasks on the fly without task-specific model development.

链接: https://arxiv.org/abs/2506.15200
作者: Alessio Negrini,Simon Reiß
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.
zh

[CV-43] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections

【速读】: This paper addresses the dual problem faced by large-scale visual search systems: locating every image that truly contains the object described by a sentence, and identifying the object's bounding box or exact pixels within each matching image. Existing techniques solve only one side of this challenge: visual grounding rests on the unrealistic assumption that the object is present in every test image, while text-to-image retrieval stops at whole-image matches and offers no fine-grained localization. The key to the solution is the proposed Referring Search and Discovery (ReSeDis) task, the first to unify corpus-level retrieval with pixel-level grounding: given a free-form description, a ReSeDis model must decide whether the target object appears in each image and, if so, localize it with a bounding box or segmentation mask. A benchmark is curated so that every description maps uniquely to instances scattered across the corpus, eliminating unintended matches, and a task-specific metric jointly scores retrieval recall and localization precision.

链接: https://arxiv.org/abs/2506.15180
作者: Ziling Huang,Yidan Zhang,Shin’ichi Satoh
机构: National Institute of Informatics(国立信息学研究所); The University of Tokyo(东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object’s bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
zh

[CV-44] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography

【速读】: This paper addresses left ventricle (LV) segmentation in echocardiography, a task that is crucial for diagnosis and treatment planning but complicated by the heavy noise, low contrast, and ambiguous LV boundaries of ultrasound images. The key to the solution is Echo-DND, a novel dual-noise diffusion model designed for this task, which combines Gaussian and Bernoulli noises, incorporates a multi-scale fusion conditioning module to improve segmentation precision, and uses spatial coherence calibration to maintain the spatial integrity of the segmentation masks.
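The two noise branches can be sketched directly: Gaussian blending for continuous intensities and Bernoulli bit-flips for a binary mask, both increasing with the diffusion step. The linear schedules and flip rates below are illustrative assumptions, not Echo-DND's actual configuration.

```python
import torch

def add_gaussian_noise(x, t, T, beta_max=0.9):
    """Continuous branch: blend the image toward Gaussian noise as t grows."""
    a = 1.0 - beta_max * t / T                      # toy linear schedule
    return a**0.5 * x + (1 - a)**0.5 * torch.randn_like(x)

def add_bernoulli_noise(mask, t, T, flip_max=0.4):
    """Discrete branch: flip each binary mask pixel with probability rising in t."""
    p_flip = flip_max * t / T
    flips = torch.bernoulli(torch.full_like(mask, p_flip))
    return (mask + flips) % 2                       # XOR with the flip pattern

img = torch.rand(1, 1, 64, 64)                      # echo frame (toy)
mask = (torch.rand(1, 1, 64, 64) > 0.7).float()     # binary LV mask (toy)
print(add_gaussian_noise(img, t=500, T=1000).shape,
      add_bernoulli_noise(mask, t=500, T=1000).unique())
```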

链接: https://arxiv.org/abs/2506.15166
作者: Abdur Rahman,Keerthiveena Balraj,Manojkumar Ramteke,Anurag Singh Rathore
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Version of record published in Discover Applied Sciences (Springer Nature). The definitive article is available at this https URL

点击查看摘要

Abstract:Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model’s performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: this https URL
zh

[CV-45] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

【速读】: This paper addresses two problems in aggregating local structures for point cloud analysis, interference from irrelevant points and the feature hierarchy gap caused by the limitations of local coordinates, as well as the high computational overhead and noise sensitivity of existing enhancement methods based on direct geometric structure encoding. The key to the solution is the Point Distribution Set Abstraction (PDSA) module, which uses correlations in high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes point correlation via a lightweight cross-stage structural descriptor, enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling, and introduces a key-point mechanism to optimize the computational overhead.
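The distribution-correcting intuition, down-weighting neighbors whose high-dimensional features disagree with the query point, can be sketched as correlation-weighted aggregation; this toy cosine-similarity weighting stands in for PDSA's learned lightweight cross-stage descriptor.

```python
import torch
import torch.nn.functional as F

def corrected_aggregation(center_feat, neighbor_feats, tau=0.1):
    """Aggregate neighbor features with weights derived from feature-space
    correlation, suppressing irrelevant points and reducing the variance
    of the aggregated representation.
    center_feat: [B, C], neighbor_feats: [B, K, C]."""
    sim = F.cosine_similarity(neighbor_feats,
                              center_feat.unsqueeze(1), dim=-1)   # [B, K]
    w = torch.softmax(sim / tau, dim=-1).unsqueeze(-1)            # [B, K, 1]
    return (w * neighbor_feats).sum(dim=1)                        # [B, C]

center = torch.randn(2, 32)        # query-point features
neigh = torch.randn(2, 16, 32)     # 16 neighbors per query point
print(corrected_aggregation(center, neigh).shape)                 # torch.Size([2, 32])
```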

链接: https://arxiv.org/abs/2506.15160
作者: Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Boyang Song,Hao Jiang,Tianyou Chen,Baochang Zhang
机构: Beihang University (北京航空航天大学); Chinese Aeronautical Radio Electronics Research Institute (中国航空无线电电子研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, irrelevant point interference and feature hierarchy gap problems arise due to the limitations of local coordinates. Although some works address this limitation by refining the spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding suffer from high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes the correlation in high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key-point mechanism to optimize the computational overhead. The experimental results on semantic segmentation and classification tasks with different baselines verify the generalization of the proposed method, achieving significant performance improvement at a lower parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. The code and training weights are available at: this https URL
zh

[CV-46] Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation IROS

【速读】:该论文试图解决在机器人领域中,基于即时策略的模仿学习(In-Context IL)因大语言模型(LLM)产生的幻觉问题而导致的轨迹可靠性不足的问题。解决方案的关键在于提出一种称为鲁棒即时策略(RIP)的新算法,该算法利用学生t回归模型对LLM生成的幻觉轨迹具有鲁棒性,通过从LLM生成多个候选轨迹并使用学生t分布进行聚合,从而有效抑制异常值(即幻觉),生成可靠的轨迹。

链接: https://arxiv.org/abs/2506.15157
作者: Hanbit Oh,Andrea M. Salcedo-Vázquez,Ixchel G. Ramirez-Alpizar,Yukiyasu Domae
机构: National Institute of Advanced Industrial Science and Technology (AIST)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025 accepted

点击查看摘要

Abstract:Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations. Recently, a variant of IL, called In-Context IL, utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations. However, its reliability in the robotics domain is undermined by hallucination issues: an LLM-based instant policy occasionally generates poor trajectories that deviate from the given demonstrations. To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student's t-regression model to be robust against hallucinated trajectories of instant policies, allowing reliable trajectory generation. Specifically, RIP generates several candidate robot trajectories to complete a given task from an LLM and aggregates them using the Student's t-distribution, which is beneficial for ignoring outliers (i.e., hallucinations); thereby, a robust trajectory against hallucinations is generated. Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least 26% improvement in task success rates, particularly in low-data scenarios for everyday tasks. Video results available at this https URL.
zh
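以下给出摘要中"用 Student-t 分布聚合多条候选轨迹以抑制幻觉"这一思路的最小示意实现(Python/NumPy)。EM 迭代的具体形式、自由度 nu 等超参数均为笔者假设,并非论文官方代码:

```python
import numpy as np

def student_t_aggregate(trajs, nu=3.0, n_iters=20, eps=1e-8):
    """示意:用 Student-t 分布的 EM 迭代对多条候选轨迹做鲁棒聚合,
    残差大的(疑似幻觉)轨迹在聚合中自动获得更低的权重。
    trajs: (N, T, D),N 条候选轨迹、T 个时间步、D 维动作/位姿。
    返回 (T, D) 的聚合轨迹。"""
    mu = trajs.mean(axis=0)                      # 初始化:普通均值
    sigma2 = trajs.var(axis=0) + eps             # 每个 (t, d) 位置的方差
    for _ in range(n_iters):
        # E 步:残差越大,权重越小(Student-t 隐变量的期望)
        r2 = (trajs - mu) ** 2 / sigma2          # (N, T, D)
        w = (nu + 1.0) / (nu + r2)
        # M 步:加权均值与加权方差
        mu = (w * trajs).sum(axis=0) / (w.sum(axis=0) + eps)
        sigma2 = (w * (trajs - mu) ** 2).sum(axis=0) / (w.sum(axis=0) + eps) + eps
    return mu

# 用法示例:5 条候选轨迹,其中 1 条为明显偏离的"幻觉"轨迹
rng = np.random.default_rng(0)
good = rng.normal(0.0, 0.05, size=(4, 50, 3))
bad = rng.normal(2.0, 0.05, size=(1, 50, 3))     # 离群轨迹
agg = student_t_aggregate(np.concatenate([good, bad], axis=0))
print(np.abs(agg).mean())                        # 接近 0,说明离群轨迹被有效抑制
```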

[CV-47] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts

【速读】:该论文旨在解决基于大型视觉模型(Large Vision Models, LVMs)的免训练(training-free)少样本医学图像分割中负提示(negative prompts)利用不足、导致在低对比度医学图像上性能不佳的问题。其解决方案的关键在于提升负提示的质量,通过设计一种结合DINOv2与SAM优势的置信度图协同模块,生成更可靠的正负点集,并利用高斯分布和K-means聚类进行选择,最终作为高质量提示输入SAM以获得分割结果。

链接: https://arxiv.org/abs/2506.15153
作者: Yufei Liu,Haoke Xiao,Jiaxing Chai,Yongcun Zhang,Rong Wang,Zijie Meng,Zhiming Luo
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
zh
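下面按摘要流程(置信度图 → top-k 正点 + 按分布采样负点 → 分别 K-means)给出一个示意脚本(依赖 scikit-learn)。置信度图用随机数占位,负点采样分布的具体形式与各 k 值均为假设:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_prompts(conf_map, k_pos=200, k_neg=200, n_clusters=5, seed=0):
    """示意:从置信度图中选正/负点集,再各自做 K-means,
    以聚类中心作为送入 SAM 的少量高质量点提示(参数均为假设值)。"""
    h, w = conf_map.shape
    flat = conf_map.ravel()
    rng = np.random.default_rng(seed)

    # 正点集:置信度 top-k 像素
    pos_idx = np.argpartition(flat, -k_pos)[-k_pos:]

    # 负点集:摘要提到用高斯分布选择负点,这里用"置信度越低、
    # 采样概率越高"的高斯形权重做近似示意
    norm = (flat - flat.min()) / (flat.ptp() + 1e-8)
    neg_prob = np.exp(-norm ** 2 / 0.1)
    neg_prob /= neg_prob.sum()
    neg_idx = rng.choice(h * w, size=k_neg, replace=False, p=neg_prob)

    def centers(idx):
        pts = np.stack([idx // w, idx % w], axis=1).astype(float)  # (y, x)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(pts)
        return km.cluster_centers_

    return centers(pos_idx), centers(neg_idx)

conf = np.random.rand(256, 256)        # 占位:真实流程中来自 DINOv2+SAM 协同模块
pos_pts, neg_pts = select_prompts(conf)
print(pos_pts.shape, neg_pts.shape)    # (5, 2) (5, 2)
```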

[CV-48] An Empirical Study of Bugs in Data Visualization Libraries

【速读】:该论文旨在解决数据可视化(DataViz)库中视觉错误(visual bugs)的检测与分析问题,这些问题可能导致用户误解数据并做出错误决策。研究通过系统分析564个来自五个广泛使用的DataViz库的bug,揭示了错误图表的普遍性及其主要根源为不准确的图形计算,进而提出了八项触发此类错误的关键步骤以及两种针对DataViz库的测试预言(test oracles),为设计有效的自动化测试方法提供了依据。解决方案的关键在于深入理解DataViz库中bug的独特特征,并探索利用视觉语言模型(VLMs)进行错误检测的可行性,尽管其检测效果在29%至57%之间波动,且增加提示信息并不必然提升效果。

链接: https://arxiv.org/abs/2506.15084
作者: Weiqi Lu,Yongqiang Tian,Xiaohan Zhong,Haoyang Ma,Zhenyang Xu,Shing-Chi Cheung,Chengnian Sun
机构: 未知
类目: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Proc. ACM Softw. Eng. 2, FSE

点击查看摘要

Abstract:Data visualization (DataViz) libraries play a crucial role in presentation, data analysis, and application development, underscoring the importance of their accuracy in transforming data into visual representations. Incorrect visualizations can adversely impact user experience, distort information conveyance, and influence user perception and decision-making processes. Visual bugs in these libraries can be particularly insidious as they may not cause obvious errors like crashes, but instead graphically mislead users about the underlying data, resulting in wrong decision making. Consequently, a good understanding of the unique characteristics of bugs in DataViz libraries is essential for researchers and developers to detect and fix them. This study presents the first comprehensive analysis of bugs in DataViz libraries, examining 564 bugs collected from five widely-used libraries. Our study systematically analyzes their symptoms and root causes, and provides a detailed taxonomy. We found that incorrect/inaccurate plots are pervasive in DataViz libraries and incorrect graphic computation is the major root cause, which necessitates further automated testing methods for DataViz libraries. Moreover, we identified eight key steps to trigger such bugs and two test oracles specific to DataViz libraries, which may inspire future research in designing effective automated testing techniques. Furthermore, with the recent advancements in Vision Language Models (VLMs), we explored the feasibility of applying these models to detect incorrect/inaccurate plots. The results show that the effectiveness of VLMs in bug detection varies from 29% to 57%, depending on the prompts, and adding more information in prompts does not necessarily increase the effectiveness. More findings can be found in our manuscript.
zh

[CV-49] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study

【速读】:该论文旨在解决向量量化(Vector Quantization)方法中的两个关键问题:训练不稳定性和代码本坍塌(Codebook Collapse)。训练不稳定性源于直通估计器引入的梯度差异,尤其是在存在显著量化误差时;而代码本坍塌则表现为训练过程中仅使用了代码本中的一小部分代码向量。研究表明,这些问题主要由特征分布与代码向量分布之间的不匹配所驱动,导致代码向量缺乏代表性并在压缩过程中丢失大量数据信息。为了解决这些问题,该论文的关键解决方案是采用Wasserstein距离来对齐这两个分布,从而实现接近100%的代码本利用率并显著降低量化误差。

链接: https://arxiv.org/abs/2506.15078
作者: Xianghong Fang,Litao Guo,Hengchao Chen,Yuxuan Zhang,Xiaofan Xia,Dingjie Song,Yexin Liu,Hao Wang,Harry Yang,Yuan Yuan,Qiang Sun
机构: University of Toronto (多伦多大学); The Hong Kong University of Science and Technology (香港科技大学); Boston College (波士顿学院); Lehigh University (利哈伊大学); Southern University of Science and Technology (南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The success of autoregressive models largely depends on the effectiveness of vector quantization, a technique that discretizes continuous features by mapping them to the nearest code vectors within a learnable codebook. Two critical issues in existing vector quantization methods are training instability and codebook collapse. Training instability arises from the gradient discrepancy introduced by the straight-through estimator, especially in the presence of significant quantization errors, while codebook collapse occurs when only a small subset of code vectors are utilized during training. A closer examination of these issues reveals that they are primarily driven by a mismatch between the distributions of the features and code vectors, leading to unrepresentative code vectors and significant data information loss during compression. To address this, we employ the Wasserstein distance to align these two distributions, achieving near 100% codebook utilization and significantly reducing the quantization error. Both empirical and theoretical analyses validate the effectiveness of the proposed approach.
zh
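摘要的核心是"用 Wasserstein 距离对齐特征分布与码本分布"。下面用切片 Wasserstein(sliced Wasserstein)给出一个可微的最小示意,作为 VQ 训练损失中的对齐正则项;投影数、权重 lam 等均为假设,论文实际采用的距离估计方式可能不同:

```python
import torch

def sliced_wasserstein(x, y, n_proj=64):
    """切片 Wasserstein 距离的最小示意:把两组样本投影到随机方向,
    比较排序(分位数)后的一维分布。x: (N, D) 特征,y: (M, D) 码本。"""
    d = x.size(1)
    proj = torch.randn(d, n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    xp = (x @ proj).sort(dim=0).values           # (N, n_proj)
    yp = (y @ proj).sort(dim=0).values           # (M, n_proj)
    # N != M 时,用共同的分位数网格对齐两组一维样本
    n = min(xp.size(0), yp.size(0))
    q = torch.linspace(0, 1, n, device=x.device)
    xq = torch.quantile(xp, q, dim=0)
    yq = torch.quantile(yp, q, dim=0)
    return ((xq - yq) ** 2).mean()

# 用法示意:把该距离作为 VQ 损失的正则项(权重 lam 为假设值)
feats = torch.randn(1024, 64)                    # 编码器输出特征
codebook = torch.nn.Parameter(torch.randn(512, 64))
lam = 0.1
loss_align = lam * sliced_wasserstein(feats, codebook)
loss_align.backward()                            # 梯度可回传到码本
print(float(loss_align))
```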

[CV-50] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

【速读】:该论文旨在解决传统风格迁移方法在风格注入过程中存在的多种问题,以及不同任务间框架不统一的问题。其解决方案的关键在于提出一种基于语义的风格注入方法,利用BLIP模型生成与风格图像语义严格对齐的文本描述,并通过大语言模型去除风格相关描述以构建语义差距,从而实现高效且无偏移的风格知识注入;同时引入基于人类反馈的数据增强策略和无需训练的三重扩散过程,通过调整自注意力层特征来实现风格注入与文本控制的平衡,最终实现了高质量的图像驱动风格迁移和文本驱动风格化。

链接: https://arxiv.org/abs/2506.15033
作者: Gary Song Yan,Yusen Zhang,Jinyu Zhao,Hao Zhang,Zhangping Yang,Guanye Xiong,Yanfei Liu,Tao Zhang,Yujie He,Siyuan Tian,Yao Gou,Min Li
机构: Xi’an Institute of High-tech (西安高技术研究所); Huazhong University of Science and Technology (华中科技大学); National University of Defence Technology (国防科技大学); Beijing (北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues encountered in the style transfer process of traditional methods but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist level style transfer and text driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related descriptions from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce its overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. Under such a design, we have achieved high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
zh
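摘要提到"在生成过程中把内容相关过程的 key/value 替换为风格相关过程的 key/value,同时保留内容的 query"。下面用单头注意力演示这一替换对应的张量操作;Q/K/V 此处用随机张量占位,实际应取自扩散模型自注意力层:

```python
import torch

def attn(q, k, v):
    """标准缩放点积注意力(单头,便于演示)。"""
    w = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    return w @ v

# 假设:内容/风格两个生成过程在同一自注意力层得到的 Q/K/V
n_tok, d = 77, 64
q_c, k_c, v_c = (torch.randn(n_tok, d) for _ in range(3))  # 内容相关过程
q_s, k_s, v_s = (torch.randn(n_tok, d) for _ in range(3))  # 风格相关过程

out_plain = attn(q_c, k_c, v_c)   # 原始内容输出
# 风格注入:保留内容的 Query(即摘要所说的 query preservation),
# 把 Key/Value 换成风格过程的,相当于把自注意力用作一种交叉注意力
out_styled = attn(q_c, k_s, v_s)
print(out_plain.shape, out_styled.shape)
```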

[CV-51] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps KDD2024

【速读】:该论文旨在解决从历史地图中提取文本的挑战,这一过程因缺乏有效的方法和训练数据而变得困难。传统方法通常针对特定的地图风格进行定制化处理,难以泛化。本文提出的解决方案关键在于PALETTE,这是一种端到端的文本检测器,其核心创新是引入了超局部采样模块,以显式学习文本实例目标边界点和字符周围的局部图像特征,并通过超局部位置嵌入来捕捉边界点与字符之间的空间交互关系。此外,论文还提出了一种自动生成合成地图图像的方法SynthMap+,用于训练历史地图的文本检测器。实验表明,PALETTE结合SynthMap+在两个新的历史地图基准数据集上优于当前最先进(SOTA)的文本检测器,尤其在长文本和倾斜文本的处理上表现突出。

链接: https://arxiv.org/abs/2506.15010
作者: Yijun Lin,Yao-Yi Chiang
机构: University of Minnesota, Twin Cities(明尼苏达大学双城分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in KDD2024

点击查看摘要

Abstract:Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps of a wide variety. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. Experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at this https URL.
zh

[CV-52] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors

【速读】:该论文试图解决传统合规性检测方法在便携性、可扩展性和准确性方面的局限性,以及基于神经网络的视觉触觉传感器方法在预测精度上的不足。解决方案的关键在于利用基于长时程循环卷积网络(LRCN)和Transformer架构的模型,结合由视觉触觉传感器GelSight捕获的RGB触觉图像和其他信息,以提高合规性度量的预测准确性。

链接: https://arxiv.org/abs/2506.14980
作者: Ziteng Li,Malte Kuhlmann,Ilana Nisky,Nicolás Navarro-Guerrero
机构: L3S Research Center (L3S 研究中心); Leibniz Universität Hannover (莱布尼茨汉诺威大学); Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted in the IEEE International Conference on Development and Learning (ICDL). The paper contains 8 pages and 7 figures

点击查看摘要

Abstract:Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.
zh
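下面按摘要给出"逐帧 CNN 提特征 + LSTM 做时序建模、回归顺应性"的 LRCN 最小示意(PyTorch)。网络层数、特征维度等均为假设,仅演示结构思路:

```python
import torch
import torch.nn as nn

class ComplianceLRCN(nn.Module):
    """LRCN 思路的最小示意:逐帧 CNN + LSTM,回归一个
    顺应性(compliance)数值。结构与超参为假设,非论文原版。"""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (B, T, 3, H, W) 触觉图像序列
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)   # 逐帧提特征
        h, _ = self.lstm(f)                            # 时序建模
        return self.head(h[:, -1])        # 用最后一个时间步预测顺应性

model = ComplianceLRCN()
seq = torch.randn(2, 8, 3, 64, 64)        # 2 个样本,各 8 帧 GelSight 图像
print(model(seq).shape)                    # torch.Size([2, 1])
```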

[CV-53] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images IJCAI

【速读】:该论文旨在解决高能物理中区分夸克与胶子起始喷注(quark- and gluon-initiated jets)的问题,这是提升新物理搜索和精密测量的关键挑战。其解决方案的关键在于利用视觉Transformer(Vision Transformer, ViT)架构及其与卷积神经网络(CNN)的混合模型,通过直接分析量能器(calorimeter)图像来捕捉喷注子结构中的长程空间相关性,从而在F1分数、ROC-AUC和准确率等指标上优于传统CNN基线模型。

链接: https://arxiv.org/abs/2506.14934
作者: Md Abrar Jahin,Shahriar Soudeep,Arian Rahman Aditta,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in Third International Workshop on Generalizing from Limited Resources in the Open World Workshop at International Joint Conference on Artificial Intelligence (IJCAI) 2025

点击查看摘要

Abstract:Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
zh

[CV-54] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models

【速读】:该论文旨在解决扩散模型在医学图像生成中的隐私泄露问题,特别是通过Membership Inference Attack (MIA)判断特定图像是否被用于训练扩散模型。现有MIA方法依赖于扩散重建误差,但直接应用于医学图像时面临重建误差受图像固有难度影响以及高频率细节难以重建的挑战。该论文提出的解决方案关键在于Frequency-Calibrated Reconstruction Error (FCRE)方法,通过聚焦于中频段的重建误差,排除高频(难重建)和低频(信息量少)区域,从而减轻图像固有难度的干扰,提升MIA的准确性。

链接: https://arxiv.org/abs/2506.14919
作者: Xinkai Zhao,Yuta Tokuoka,Junichiro Iwasawa,Keita Oda
机构: Preferred Networks, Inc.(Preferred Networks, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
zh
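FCRE 的核心操作是"只取中频段重建误差并计算结构相似度"。下面用 FFT 环形掩码 + SSIM(依赖 scikit-image)给出一个示意实现;频段范围 [0.1, 0.5] 与成员判定阈值均为假设值:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def midband(img, r_lo=0.1, r_hi=0.5):
    """保留频谱中半径在 [r_lo, r_hi](相对 Nyquist)的中频分量。"""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    mask = (r >= r_lo) & (r <= r_hi)
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

def fcre_score(original, reconstructed, r_lo=0.1, r_hi=0.5):
    """示意:只在中频段比较原图与扩散重建图的结构相似度,
    分数高于阈值则判为训练成员(频段与阈值均为假设)。"""
    a, b = midband(original, r_lo, r_hi), midband(reconstructed, r_lo, r_hi)
    rng_val = max(a.max() - a.min(), b.max() - b.min(), 1e-8)
    return ssim(a, b, data_range=rng_val)

img = np.random.rand(128, 128)
recon = img + 0.05 * np.random.randn(128, 128)  # 占位:真实流程中来自反向扩散重建
score = fcre_score(img, recon)
is_member = score > 0.5                          # 阈值为假设
print(round(score, 3), is_member)
```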

[CV-55] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

【速读】:该论文旨在解决当前多模态强化学习方法在处理涉及多图像位置关系的复杂现实场景时表现不足的问题,特别是这些方法大多局限于单图像空间推理,难以有效理解跨图像的关系。其解决方案的关键在于提出一种名为PeRL的通用强化学习框架,用于交错的多模态任务,并采用多阶段策略以优化探索与利用的平衡,从而提高学习效率和任务性能。具体而言,通过图像序列的排列来模拟多样化的空间和位置关系,以及设计了一种回放过滤机制以聚焦于对学习最优行为贡献最大的轨迹。

链接: https://arxiv.org/abs/2506.14907
作者: Yizhen Zhang,Yang Ding,Shuoshuo Zhang,Xinchen Zhang,Haoling Li,Zhong-zhi Li,Peijie Wang,Jie Wu,Lei Ji,Yelong Shen,Yujiu Yang,Yeyun Gong
机构: Tsinghua University (清华大学); Microsoft (微软); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that PeRL trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
zh
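摘要中"对图像序列做排列以模拟不同位置关系"可以用很短的代码说明。下面 rollout 采样的接口为笔者假设,仅演示排列的构造与记录方式:

```python
import random

def permuted_rollouts(images, question, n_perm=4, seed=0):
    """示意:对多图输入做随机排列,模拟不同的图像位置关系,
    供 RL rollout 阶段探索更多空间/位置多样性(接口为假设)。"""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n_perm):
        order = list(range(len(images)))
        rng.shuffle(order)
        rollouts.append({
            "images": [images[i] for i in order],
            "order": order,   # 记录排列,便于奖励计算时还原对应关系
            "question": question,
        })
    return rollouts

samples = permuted_rollouts(["img_A", "img_B", "img_C"], "哪张图中的物体在最左边?")
for s in samples:
    print(s["order"], s["images"])
```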

[CV-56] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization

【速读】:该论文旨在解决文本到图像(T2I)模型在对齐用户意图、确保安全性与公平性方面存在的挑战,特别是如何提升生成图像与用户指令的一致性以及减少潜在的偏见和歧视。其解决方案的关键在于提出DPO-Kernels方法,该方法通过三个核心组件实现增强对齐:(i)混合损失函数,结合基于嵌入的目标与传统概率损失以优化训练;(ii)核化表示,利用径向基函数(RBF)、多项式和小波核进行更丰富的特征变换;(iii)散度选择,引入Wasserstein和Rényi散度替代传统的Kullback-Leibler(KL)正则化以提高稳定性与鲁棒性。

链接: https://arxiv.org/abs/2506.14903
作者: Renjith Prasad,Abhilekh Borah,Hasnat Md Abdullah,Chathurangi Shyalika,Gurpreet Singh,Ritvik Garimella,Rajarshi Roy,Harshul Surana,Nasrin Imanpour,Suranjana Trivedy,Amit Sheth,Amitava Das
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 59 pages, 10 figures

点击查看摘要

Abstract:Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
zh
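下面按摘要三要素之一"混合损失 + RBF 核"给出一个示意损失:标准 DPO 概率项,加上基于 RBF 核相似度的嵌入对比项。各权重(beta、alpha、gamma)与嵌入的来源均为假设,并非论文原式:

```python
import torch
import torch.nn.functional as F

def rbf(x, y, gamma=0.5):
    """RBF 核:k(x, y) = exp(-gamma * ||x - y||^2),逐样本计算。"""
    return torch.exp(-gamma * ((x - y) ** 2).sum(-1))

def dpo_kernel_loss(logp_w, logp_l, ref_w, ref_l, emb, emb_w, emb_l,
                    beta=0.1, alpha=0.5, gamma=0.5):
    """示意:混合损失 = 标准 DPO 概率项 + RBF 核嵌入项,
    鼓励生成嵌入靠近被选(chosen)样本、远离被拒(rejected)样本。"""
    # 概率项:标准 DPO,对策略/参考模型的对数概率差做 sigmoid 损失
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    loss_prob = -F.logsigmoid(margin).mean()
    # 嵌入项:核相似度的对比
    loss_emb = -F.logsigmoid(rbf(emb, emb_w, gamma) - rbf(emb, emb_l, gamma)).mean()
    return loss_prob + alpha * loss_emb

# 占位数据:batch=4,嵌入维度 16
B, D = 4, 16
logp_w, logp_l = torch.randn(B), torch.randn(B)
ref_w, ref_l = torch.randn(B), torch.randn(B)
emb, emb_w, emb_l = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
print(float(dpo_kernel_loss(logp_w, logp_l, ref_w, ref_l, emb, emb_w, emb_l)))
```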

[CV-57] pycnet-audio: A Python package to support bioacoustics data processing

【速读】:该论文旨在解决野生动物研究中大规模音频数据处理的挑战,特别是在被动声学监测(Passive Acoustic Monitoring, PAM)领域,面对海量录音数据时手动分析不切实际的问题。解决方案的关键在于开发一个实用的音频数据处理工作流,其核心是PNW-Cnet模型,该模型最初由美国林业局为支持北部斑点猫头鹰(Strix occidentalis caurina)及其他森林猫头鹰种群监测而开发,并已扩展用于检测约80种森林野生动物的鸣叫以及多种人为和环境噪声。

链接: https://arxiv.org/abs/2506.14864
作者: Zachary J. Ruff,Damon B. Lesmeister
机构: 未知
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Passive acoustic monitoring is an emerging approach in wildlife research that leverages recent improvements in purpose-made automated recording units (ARUs). The general approach is to deploy ARUs in the field to record on a programmed schedule for extended periods (weeks or months), after which the audio data are retrieved. These data must then be processed, typically either by measuring or analyzing characteristics of the audio itself (e.g. calculating acoustic indices), or by searching for some signal of interest within the recordings, e.g. vocalizations or other sounds produced by some target species, anthropogenic or environmental noise, etc. In the latter case, some method is required to locate the signal(s) of interest within the audio. While very small datasets can simply be searched manually, even modest projects can produce audio datasets on the order of 10^5 hours of recordings, making manual review impractical and necessitating some form of automated detection. pycnet-audio (Ruff 2024) is intended to provide a practical processing workflow for acoustic data, built around the PNW-Cnet model, which was initially developed by the U.S. Forest Service to support population monitoring of northern spotted owls (Strix occidentalis caurina) and other forest owls (Lesmeister and Jenkins 2022; Ruff et al. 2020). PNW-Cnet has been expanded to detect vocalizations of ca. 80 forest wildlife species and numerous forms of anthropogenic and environmental noise (Ruff et al. 2021, 2023).
zh

[CV-58] Towards Perception-based Collision Avoidance for UAVs when Guiding the Visually Impaired IROS

【速读】:该论文旨在解决视觉障碍人士(VIPs)在户外城市环境中自主导航的问题,通过无人机协助实现安全路径规划。解决方案的关键在于提出一种基于感知的局部路径规划系统,该系统结合了以几何形式表示的问题和多深度神经网络(DNN)框架,用于无人机及VIP的障碍物规避,同时与基于GPS和地图的全局规划器集成,以实现粗略规划。

链接: https://arxiv.org/abs/2506.14857
作者: Suman Raj,Swapnil Padhi,Ruchi Bhoot,Prince Modi,Yogesh Simmhan
机构: Indian Institute of Science (印度科学研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures; Accepted as Late-Breaking Results at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023

点击查看摘要

Abstract:Autonomous navigation by drones using onboard sensors combined with machine learning and computer vision algorithms is impacting a number of domains, including agriculture, logistics, and disaster management. In this paper, we examine the use of drones for assisting visually impaired people (VIPs) in navigating through outdoor urban environments. Specifically, we present a perception-based path planning system for local planning around the neighborhood of the VIP, integrated with a global planner based on GPS and maps for coarse planning. We represent the problem using a geometric formulation and propose a multi-DNN-based framework for obstacle avoidance of the UAV as well as the VIP. Our evaluations, conducted on a drone-human system in a university campus environment, verify the feasibility of our algorithms in three scenarios: when the VIP walks on a footpath, near parked vehicles, and in a crowded street.
zh

[CV-59] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction NEURIPS2025

【速读】:该论文试图解决主动视角选择(Active View Selection, AVS)在3D物体重建中的问题,即如何确定最少的视角集合以实现准确且高效的三维重建。其解决方案的关键在于引入一种基于轻量级前馈深度神经网络UPNet的新型AVS方法,该网络通过单张输入图像生成预测的不确定性图,表示所有可能候选视角的不确定性值,并利用从大量自然物体及其相关不确定性模式中得出的启发式方法,直接学习从视角外观到体积表示不确定性的映射。通过聚合先前预测的不确定性图,该方法有效抑制冗余视角并选择最具信息量的视角,从而在减少计算开销的同时保持重建精度。

链接: https://arxiv.org/abs/2506.14856
作者: Zhengquan Zhang,Feng Xu,Mengmi Zhang
机构: Nanyang Technological University (南洋理工大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures in the main text. Under review for NeurIPS 2025

点击查看摘要

Abstract:Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods. Remarkably, despite using only half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.
zh
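摘要中"聚合预测的不确定性图并抑制冗余候选视角"的选择逻辑可以概括为如下贪心过程。候选视角网格的离散化方式与抑制半径均为假设:

```python
import numpy as np

def select_views(uncertainty_maps, n_select=5, suppress_radius=2):
    """示意:把各输入图像预测出的不确定性图(候选视角网格)取平均,
    迭代选不确定性最高的视角,并抑制其邻域以避免冗余视角。"""
    agg = np.mean(np.stack(uncertainty_maps), axis=0)  # (n_az, n_el) 候选视角网格
    selected = []
    for _ in range(n_select):
        az, el = np.unravel_index(np.argmax(agg), agg.shape)
        selected.append((int(az), int(el)))
        # 抑制已选视角附近的候选(视为冗余)
        a0, a1 = max(az - suppress_radius, 0), az + suppress_radius + 1
        e0, e1 = max(el - suppress_radius, 0), el + suppress_radius + 1
        agg[a0:a1, e0:e1] = -np.inf
    return selected

maps = [np.random.rand(36, 9) for _ in range(3)]  # 3 张输入图像各预测一张不确定性图
print(select_views(maps))
```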

[CV-60] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis ICCV2025

【速读】:该论文旨在解决传统零售视频标注方法依赖人工标注导致的效率低下和成本高昂的问题(traditional methods heavily rely on time-consuming manual labeling by human annotators)。其解决方案的关键在于提出一种基于深度学习的方法,通过深度神经网络学习具有区分性的特征,并结合面向零售环境的物体检测技术,实现零售视频中关键帧的自动识别与标注。该方法在保持标注精度接近人工标注水平的同时,显著提升了标注效率并降低了运营成本。

链接: https://arxiv.org/abs/2506.14854
作者: Varun Mannam,Zhenyu Shi
机构: Amazon(亚马逊); Alexa(亚历克斯); AGI(人工智能通用); ADS-Science(亚马逊数据服务科学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Submitting to ICCV 2025 workshop: this https URL

点击查看摘要

Abstract:Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.
zh

[CV-61] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks An Architecture Optimization Approach

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Networks, CNNs)中核尺寸选择这一关键但常被忽视的设计决策问题,该问题影响感受野、特征提取、计算成本和模型精度。论文提出的解决方案是最佳核尺寸估计函数(Best Kernel Size Estimation Function, BKSEF),其核心在于通过整合信息论、信号处理和学习理论的原理,平衡信息增益、计算效率和精度提升,从而实现逐层最优核尺寸的确定。

链接: https://arxiv.org/abs/2506.14846
作者: Shreyas Rajeev,B Sathish Babu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Kernel size selection in Convolutional Neural Networks (CNNs) is a critical but often overlooked design decision that affects receptive field, feature extraction, computational cost, and model accuracy. This paper proposes the Best Kernel Size Estimation Function (BKSEF), a mathematically grounded and empirically validated framework for optimal, layer-wise kernel size determination. BKSEF balances information gain, computational efficiency, and accuracy improvements by integrating principles from information theory, signal processing, and learning theory. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, and GTSRB datasets demonstrate that BKSEF-guided architectures achieve up to 3.1 percent accuracy improvement and 42.8 percent reduction in FLOPs compared to traditional models using uniform 3x3 kernels. Two real-world case studies further validate the approach: one for medical image classification in a cloud-based setup, and another for traffic sign recognition on edge devices. The former achieved enhanced interpretability and accuracy, while the latter reduced latency and model size significantly, with minimal accuracy trade-off. These results show that kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for researchers and developers seeking efficient and application-aware CNN designs. It is suitable for integration into neural architecture search pipelines and real-time systems, offering a new perspective on CNN optimization.
zh
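论文摘要未给出 BKSEF 的具体公式,下面仅用"对数信息增益减去 FLOPs 惩罚"这一假设形式,演示"逐层为卷积挑选核尺寸"的打分思路;系数 alpha、beta、lam 纯属示意:

```python
import math

def bksef_score(k, h, w, c_in, c_out, alpha=1.0, beta=0.5, lam=2e-9):
    """示意性的核尺寸打分:信息增益随感受野增大而边际递减(log 形式),
    计算代价随 k^2 增长。论文的真实公式未公开于摘要,此处仅演示思路。"""
    gain = alpha * math.log(1 + k * k)            # 感受野带来的信息增益(假设形式)
    flops = k * k * c_in * c_out * h * w          # 该层卷积的乘加次数
    return beta * gain - lam * flops

def best_kernel(h, w, c_in, c_out, candidates=(1, 3, 5, 7)):
    return max(candidates, key=lambda k: bksef_score(k, h, w, c_in, c_out))

# 逐层示例:深层特征图变小、FLOPs 惩罚下降,倾向选择更大的核
for (h, w, ci, co) in [(112, 112, 32, 64), (28, 28, 128, 256), (7, 7, 512, 512)]:
    print((h, w), "->", best_kernel(h, w, ci, co))
```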

[CV-62] CACTUS as a Reliable Tool for Early Classification of Age-related Macular Degeneration

【速读】:该论文旨在解决医疗数据有限、不完整以及机器学习(Machine Learning, ML)模型透明度不足导致的诊断准确性与可信度问题,特别是在年龄相关性黄斑变性(Age-related Macular Degeneration, AMD)的早期阶段分类中。其解决方案的关键在于提出了一种名为综合抽象与分类工具以揭示结构(Comprehensive Abstraction and Classification Tool for Uncovering Structures, CACTUS)的方法,该方法具备可解释性和灵活性,能够识别关键影响因素并提供结果置信度,从而提升决策质量并支持临床反馈与偏倚修正。

链接: https://arxiv.org/abs/2506.14843
作者: Luca Gherardini,Imre Lengyel,Tunde Peto,Caroline C.W. Klaver,Magda A. Meester-Smoor,Johanna Maria Colijn,EYE-RISK Consortium,E3 Consortium,Jose Sousa
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Machine Learning (ML) is used to tackle various tasks, such as disease classification and prediction. The effectiveness of ML models relies heavily on having large amounts of complete data. However, healthcare data is often limited or incomplete, which can hinder model performance. Additionally, issues like the trustworthiness of solutions vary with the datasets used. The lack of transparency in some ML models further complicates their understanding and use. In healthcare, particularly in the case of Age-related Macular Degeneration (AMD), which affects millions of older adults, early diagnosis is crucial due to the absence of effective treatments for reversing progression. Diagnosing AMD involves assessing retinal images along with patients' symptom reports. There is a need for classification approaches that consider genetic, dietary, clinical, and demographic factors. Recently, we introduced the Comprehensive Abstraction and Classification Tool for Uncovering Structures (CACTUS), aimed at improving AMD stage classification. CACTUS offers explainability and flexibility, outperforming standard ML models. It enhances decision-making by identifying key factors and providing confidence in its results. The important features identified by CACTUS allow us to compare with existing medical knowledge. By eliminating less relevant or biased data, we created a clinical scenario for clinicians to offer feedback and address biases.
zh

[CV-63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

【速读】:该论文旨在解决数据稀缺领域中图像分类模型构建的困难问题,特别是在缺乏大量标注数据的情况下,如何实现有效的少样本图像分类(Few-shot Image Classification, FSIC)。其解决方案的关键在于重新审视基于上下文学习(In-context Learning, ICL)的FSIC流程中图像嵌入(image embeddings)的作用,并将嵌入模型的架构、预训练和训练动态作为核心分析对象。通过系统评估不同视觉编码器类型、预训练目标和微调策略对下游任务性能的影响,该研究提出了PictSure框架,证明了嵌入模型的预训练方式对模型在域外基准上的表现具有决定性影响,从而在保持域内任务性能的同时显著提升了域外任务的泛化能力。

链接: https://arxiv.org/abs/2506.14842
作者: Lukas Schiesser,Cornelius Wolff,Sophie Haas,Simon Pukrop
机构: German Research Center for AI (DFKI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 10 figures

点击查看摘要

Abstract:Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model – its architecture, pretraining, and training dynamics – at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at this https URL.
zh

[CV-64] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图表到代码生成任务中的性能不足问题。该任务要求模型生成可执行代码以重现给定图表,不仅需要精确的视觉理解,还需要将视觉元素准确转换为结构化代码。解决方案的关键在于提出ChartIR方法,该方法基于结构化指令进行迭代优化,首先区分视觉理解与代码翻译两个子任务,并设计描述和差异两类结构化指令以有效转换视觉特征为语言表示,其次将整个图表生成流程分解为初始代码生成与迭代优化两个阶段,从而逐步提升最终输出质量。

链接: https://arxiv.org/abs/2506.14837
作者: Chengzhi Xu,Yuyang Wang,Lai Wei,Lichao Sun,Weiran Huang
机构: MIFA Lab, Shanghai Jiao Tong University (MIFA实验室,上海交通大学); Shanghai Innovation Institute (上海创新研究院); Lehigh University (利哈伊大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
zh
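ChartIR 的"描述指令 + 差异指令 + 迭代优化"流程可以写成如下骨架。mllm(prompt, images) 与 render(code) 均为假设的接口,提示词亦仅作示意,需替换为实际的模型调用与图表渲染器:

```python
def chart_to_code(mllm, render, ref_chart, n_rounds=3):
    """ChartIR 流程的示意骨架:先用"描述指令"做视觉理解并生成初版代码,
    再用"差异指令"比较参考图与渲染结果,迭代修正代码。
    mllm 与 render 为假设的可调用接口,非论文官方实现。"""
    # 阶段一:视觉理解(描述指令)+ 初始代码生成
    desc = mllm("用结构化语言描述这张图表的类型、数据、坐标轴与样式。",
                images=[ref_chart])
    code = mllm(f"根据以下描述生成可执行的绘图代码:\n{desc}", images=[])
    # 阶段二:迭代优化(差异指令)
    for _ in range(n_rounds):
        rendered = render(code)                  # 执行代码得到当前图表
        diff = mllm("指出第二张图相对第一张图缺失或错误的视觉元素;若一致请回答'无差异'。",
                    images=[ref_chart, rendered])
        if "无差异" in diff:
            break
        code = mllm(f"按以下差异修改代码:\n{diff}\n当前代码:\n{code}", images=[])
    return code
```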

[CV-65] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

【速读】:该论文旨在解决单目3D目标检测中从单张图像精确定位3D物体这一核心挑战,尤其是针对DETR类架构在该领域直接应用时所面临的固有局限性。其解决方案的关键在于提出MonoVQD框架,通过三个主要贡献实现性能提升:首先,引入了Mask Separated Self-Attention机制,将去噪过程集成到DETR架构中以提高匈牙利匹配的稳定性;其次,提出了Variational Query Denoising技术,通过引入随机性解决传统去噪方法的梯度消失问题;最后,设计了自蒸馏策略,利用后期解码器层的信息提升早期层的查询质量,从而增强迭代优化过程。这些创新显著提升了模型在KITTI和nuScenes数据集上的性能,并展现了良好的泛化能力。

链接: https://arxiv.org/abs/2506.14835
作者: Kiet Dang Vu,Trung Thai Tran,Duc Dung Nguyen
机构: Ho Chi Minh City University of Technology, VNUHCM (胡志明市科技大学,VNUHCM)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD’s core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.
zh

[CV-66] Real-Time Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices

【速读】:该论文旨在解决在资源受限环境中实现高性能、低延迟视频监控的问题。其关键解决方案是提出了一种基于熵的自适应帧缓冲算法,并将其与MobileNetV2结合,以在保持高吞吐量的同时降低延迟。该系统能够在如Raspberry Pi、Amazon和NVIDIA Jetson Nano等嵌入式平台上实现低于50毫秒的端到端推理延迟,并在标准视频监控数据集上保持超过92%的检测准确率。

链接: https://arxiv.org/abs/2506.14833
作者: Poojashree Chandrashekar,Pankaj M Sajjanar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: pages

点击查看摘要

Abstract:This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We propose a formal entropy-based adaptive frame buffering algorithm and integrate it with MobileNetV2 to achieve high throughput with low latency. The system is capable of processing live video streams with sub-50ms end-to-end inference latency on resource-constrained embedded platforms such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard datasets focused on video surveillance and exhibits robustness to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, making it suitable for deployment in smart-city or embedded security architectures.
zh
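"基于熵的自适应帧缓冲"思路可以用灰度直方图熵的相对变化来示意:熵变化明显的帧才进入缓冲区送 MobileNetV2 推理,静态帧直接跳过以降低时延。阈值与判定规则为笔者假设,非论文原算法:

```python
import numpy as np

def frame_entropy(gray):
    """灰度直方图的香农熵,衡量画面信息量。"""
    hist, _ = np.histogram(gray, bins=256, range=(0, 255))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adaptive_buffer(frames, base_thresh=0.15):
    """示意:相对上一个入选帧,熵变化超过阈值的帧才进入推理缓冲区。"""
    buffered, prev_h = [], None
    for i, f in enumerate(frames):
        h = frame_entropy(f)
        if prev_h is None or abs(h - prev_h) / max(prev_h, 1e-6) > base_thresh:
            buffered.append(i)        # 场景内容变化明显,值得推理
            prev_h = h
    return buffered

frames = [np.full((120, 160), 100, np.uint8) for _ in range(5)]       # 静态帧
frames += [np.random.randint(0, 256, (120, 160), np.uint8) for _ in range(3)]
print(adaptive_buffer(frames))        # 静态重复帧被跳过,内容突变帧被缓冲
```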

[CV-67] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes

【速读】:该论文试图解决在当代建筑设计中,如何客观分析人类设计与机器生成的三维形式之间的差异这一挑战,从而理解两者的优势并推动生成式工具的发展。其解决方案的关键在于构建了ArchForms-4000数据集,包含2,000个建筑师设计和2,000个Evomass生成的三维形式,并提出了ArchShapeNet,这是一种针对建筑形式分类与分析的三维卷积神经网络,结合了显著性模块以突出符合建筑推理的关键空间特征。通过对比实验验证了该模型在区分形式来源方面的优越性能。

链接: https://arxiv.org/abs/2506.14832
作者: Jun Yin,Jing Zhong,Pengyu Zeng,Peilin Li,Zixuan Dai,Miao Zhang,Shuai Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.
zh

[CV-68] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

【速读】:该论文旨在解决多智能体人类轨迹预测(multi-agent human trajectory prediction, HTP)中的关键问题,即如何更精确地建模和预测多个交互智能体的未来轨迹。其解决方案的关键在于利用深度学习方法,通过优化模型架构设计、输入表示方式以及整体预测策略,提升对复杂多智能体交互关系的理解与建模能力。论文特别关注在ETH/UCY基准上评估的模型,以推动该领域在自主导航和人群建模等应用中的发展。

链接: https://arxiv.org/abs/2506.14831
作者: Céline Finet,Stephane Da Silva Martins,Jean-Bernard Hayet,Ioannis Karamouzas,Javad Amirian,Sylvie Le Hégarat-Mascle,Julien Pettré,Emanuel Aldea
机构: INRIA Rennes; SATIE - CNRS; Université Paris-Saclay; Centro de Investigación en Matemáticas; University of California, Riverside; Sorbonne Université; Institut National de Recherche en Informatique et en Automatique (INRIA)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 30 pages

点击查看摘要

Abstract:With the emergence of powerful data-driven methods in human trajectory prediction (HTP), a finer understanding of multi-agent interactions is now within reach, with important implications in areas such as autonomous navigation and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2024. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.
zh

[CV-69] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

【速读】:该论文试图解决AI生成视频(AI-generated video)检测中缺乏可解释性的问题,即现有方法多将其视为二分类任务,无法提供细致且具有说服力的证据来证明视频为合成内容。解决方案的关键在于提出DAVID-X数据集,该数据集首次将AI生成视频与详细的缺陷级、时空标注及书面推理相结合,并基于此构建了DAVID-XR1视频-语言模型,能够提供可解释的视觉推理链条,包括缺陷分类、时空定位和自然语言解释,从而将AI生成视频检测从黑箱决策转变为透明可验证的诊断过程。

链接: https://arxiv.org/abs/2506.14827
作者: Yifeng Gao,Yifan Ding,Hongyu Su,Juncheng Li,Yunhan Zhao,Lin Luo,Zixing Chen,Li Wang,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning-including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
zh

[CV-70] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splatting-based Occupancy Prediction

【速读】:该论文旨在解决3D语义占据预测(3D semantic occupancy prediction)中现有3D Gaussian Splatting(3DGS)方法存在的两个关键问题:一是统一特征聚合忽略了相似类别之间的语义相关性以及跨区域的语义关联;二是由于MLP迭代优化中缺乏几何约束导致的边界模糊问题。其解决方案的关键在于提出GraphGSOcc模型,该模型结合了语义与几何图Transformer,通过双图注意力机制动态构建几何图和语义图,分别用于捕捉局部几何一致性和跨实例的语义关系,并结合多尺度图注意力框架实现细粒度与粗粒度的注意力优化。

链接: https://arxiv.org/abs/2506.14825
作者: Ke Song,Yunhe Wu,Chunchit Siu,Huiyuan Xiong
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splatting (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformers for 3D Gaussian Splatting-based occupancy prediction. We propose the Dual Gaussian Graph Attention mechanism, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; and a semantic graph retaining the top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology. Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and a 13.7% memory reduction compared to GaussianWorld.
zh
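下面示意双图注意力中"两类邻接"的构建:几何图按高斯尺度自适应半径取 KNN,语义图按特征余弦相似度取 top-M。半径系数、k 与 top-M 均为假设值:

```python
import torch

def dual_graphs(pos, scale, feat, k=8, top_m=8):
    """示意:为每个高斯构建两类邻接。
    几何图:搜索半径随高斯尺度自适应,在半径内取最近 k 个;
    语义图:按特征余弦相似度取 top-M。"""
    dist = torch.cdist(pos, pos)                       # (N, N) 欧氏距离
    radius = scale * 3.0                               # 尺度越大邻域越大(系数为假设)
    geo_mask = dist <= radius.unsqueeze(1)
    d_masked = dist.masked_fill(~geo_mask, float("inf"))
    # 半径内最近的 k 个(若不足 k 个会取到半径外的点,示意实现不做特殊处理)
    geo_idx = d_masked.topk(k, largest=False).indices

    sim = torch.nn.functional.cosine_similarity(
        feat.unsqueeze(1), feat.unsqueeze(0), dim=-1)  # (N, N) 余弦相似度
    sim.fill_diagonal_(-1.0)                           # 排除自身
    sem_idx = sim.topk(top_m).indices                  # 语义最相关的 top-M

    return geo_idx, sem_idx

pos = torch.randn(100, 3)                # 高斯中心
scale = torch.rand(100) * 0.5 + 0.1      # 高斯尺度
feat = torch.randn(100, 32)              # 高斯特征
geo_idx, sem_idx = dual_graphs(pos, scale, feat)
print(geo_idx.shape, sem_idx.shape)      # torch.Size([100, 8]) torch.Size([100, 8])
```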

[CV-71] ViLLa: A Neuro-Symbolic approach for Animal Monitoring

【速读】:该论文试图解决在自然环境中监测动物种群时,如何有效结合视觉数据与人类语言查询的问题。其核心挑战在于实现对图像中动物的识别以及对自然语言问题的准确理解与推理。解决方案的关键在于提出一种神经符号框架ViLLa(Vision-Language-Logic Approach),该框架整合了视觉检测模块、语言解析器和符号推理层,通过将视觉检测结果转化为符号事实,并利用预定义规则进行逻辑推理,从而实现对数量、存在和位置等信息的准确回答。

链接: https://arxiv.org/abs/2506.14823
作者: Harsha Koduri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as “How many dogs are in the scene?” or “Where is the buffalo?”, the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
zh
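ViLLa 将感知、理解与推理解耦,其符号推理层可以用很少的规则代码说明:检测结果先转成符号事实,再按问题类型(数量/存在/位置)套用预定义规则。以下规则集是摘要示例的简化假设版:

```python
def to_facts(detections):
    """把视觉检测结果(标签 + 边界框)转换为符号事实列表。"""
    return [{"label": d["label"],
             "cx": (d["box"][0] + d["box"][2]) / 2,
             "cy": (d["box"][1] + d["box"][3]) / 2} for d in detections]

def answer(query, facts, img_w=640):
    """示意:用预定义规则回答数量/存在/位置三类查询(规则为简化假设)。"""
    kind, label = query                            # 例如 ("count", "dog")
    hits = [f for f in facts if f["label"] == label]
    if kind == "count":
        return len(hits)
    if kind == "exists":
        return len(hits) > 0
    if kind == "where":
        if not hits:
            return "not found"
        side = "left" if hits[0]["cx"] < img_w / 2 else "right"
        return f"{label} is on the {side} of the image"
    raise ValueError(f"unknown query type: {kind}")

# 占位:真实流程中 detections 来自视觉检测模块,query 来自语言解析器
dets = [{"label": "dog", "box": (40, 80, 160, 200)},
        {"label": "dog", "box": (400, 90, 560, 230)},
        {"label": "buffalo", "box": (500, 60, 630, 300)}]
facts = to_facts(dets)
print(answer(("count", "dog"), facts))         # 2
print(answer(("where", "buffalo"), facts))     # buffalo is on the right of the image
```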

[CV-72] Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在计算资源受限情况下进行细节视觉推理能力不足的问题。其解决方案的关键在于采用小组相对策略优化(Group Relative Policy Optimization, GRPO)训练小规模模型,并结合外部工具(如zoom)以获取更详细的视觉信息,同时通过简化奖励结构、工具调用接口、增加工具调用结果的token分配以及过采样视觉难度较高的训练数据来提升模型性能。

链接: https://arxiv.org/abs/2506.14821
作者: Sunil Kumar,Bowen Zhao,Leo Dirac,Paulina Varshavskaya
机构: Groundlight AI(地面光人工智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
zh

[CV-73] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection

【Quick Read】: This paper tackles health monitoring of falcons in hunting scenarios, specifically the accurate classification of three conditions: Normal, Liver Disease, and Aspergillosis. The key is a hybrid architecture combining two AI models, ConvNeXt and EfficientNet, which outperforms traditional diagnostic methods and single-model architectures on key metrics such as accuracy, precision, recall, and F1-score, enabling more precise falcon disease detection.

Link: https://arxiv.org/abs/2506.14816
Authors: Alavikunhu Panthakkan,Zubair Medammal,S M Anzar,Fatma Taher,Hussain Al-Ahmad
Affiliations: College of Engineering and IT, University of Dubai, U.A.E.; Department of Zoology, University of Calicut, Kerala, India; Department of Electronics and Communication, TKM College of Engineering, Kollam, India; NextGen Center, Zayed University, U.A.E.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Falconry, a revered tradition involving the training and hunting with falcons, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease and ‘Aspergillosis’. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.
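A plausible reading of the concatenated hybrid model is sketched below with torchvision backbones: pooled ConvNeXt and EfficientNet features are concatenated and classified by a shared head. The fusion point, backbone variants, and head design are assumptions, as the paper's exact architecture details are not given here.

```python
# Sketch of a two-backbone hybrid classifier (assumed design, not the paper's code).
import torch
import torch.nn as nn
from torchvision import models

class HybridClassifier(nn.Module):
    def __init__(self, num_classes: int = 3):  # Normal, Liver Disease, Aspergillosis
        super().__init__()
        self.convnext = models.convnext_tiny(weights=None).features   # 768 channels out
        self.effnet = models.efficientnet_b0(weights=None).features   # 1280 channels out
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(768 + 1280, num_classes)

    def forward(self, x):
        a = self.pool(self.convnext(x)).flatten(1)   # (B, 768)
        b = self.pool(self.effnet(x)).flatten(1)     # (B, 1280)
        return self.head(torch.cat([a, b], dim=1))   # concatenated fusion

logits = HybridClassifier()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3])
```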

[CV-74] Omnidirectional Video Super-Resolution using Deep Learning

【Quick Read】: This paper addresses the limited spatial resolution of 360° video, which constrains the visual quality of immersive experiences. Existing deep-learning video super-resolution (VSR) techniques cannot handle the distortion of 360° video signals in equirectangular projection, and 360° video datasets are scarce. The key is S3PO (Spherical Signal Super-resolution with a Proportioned Optimisation), a deep model for 360° VSR that uses recurrent modelling with an attention mechanism, drops the alignment step of conventional VSR, and combines a purpose-built feature extractor with a novel loss function targeting spherical distortion, outperforming mainstream state-of-the-art VSR models on 360° video datasets.

Link: https://arxiv.org/abs/2506.14803
Authors: Arbind Agrahari Baniya,Tsz-Kwan Lee,Peter W. Eklund,Sunil Aryal
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited availability of 360° video datasets for study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360°-specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.

[CV-75] PIPE: Physics-Informed Position Encoding for Alignment of Satellite Images and Time Series

【Quick Read】: This paper addresses the under-use of visual data in multimodal time-series forecasting, in particular how to capture the physical information carried by visual data such as satellite imagery (e.g., its temporal and geospatial context). The key is PIPE (Physics-Informed Positional Encoding), a lightweight method that embeds physical information into vision-language models through a physics-informed positional indexing scheme and a variant-frequency positional encoding mechanism, so that both the frequency information of physical variables and the sequential order of tokens are encoded, significantly improving multimodal alignment and forecasting accuracy.

Link: https://arxiv.org/abs/2506.14786
Authors: Haobo Li,Eunseo Jung,Zixin Chen,Zhaowei Wang,Yueya Wang,Huamin Qu,Alexis Kai Hon Lau
Affiliations: 1 Department of Computer Science & Engineering; 2 Division of Environment & Sustainability; Hong Kong University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal time series forecasting is foundational in various fields, such as utilizing satellite imagery and numerical data for predicting typhoons in climate science. However, existing multimodal approaches primarily focus on utilizing text data to help time series forecasting, leaving the visual data in existing time series datasets untouched. Furthermore, it is challenging for models to effectively capture the physical information embedded in visual data, such as satellite imagery’s temporal and geospatial context, which extends beyond images themselves. To address this gap, we propose physics-informed positional encoding (PIPE), a lightweight method that embeds physical information into vision language models (VLMs). PIPE introduces two key innovations: (1) a physics-informed positional indexing scheme for mapping physics to positional IDs, and (2) a variant-frequency positional encoding mechanism for encoding frequency information of physical variables and sequential order of tokens within the embedding space. By preserving both the physical information and sequential order information, PIPE significantly improves multimodal alignment and forecasting accuracy. Through the experiments on the most representative and the largest open-sourced satellite image dataset, PIPE achieves state-of-the-art performance in both deep learning forecasting and climate domain methods, demonstrating superiority across benchmarks, including a 12% improvement in typhoon intensity forecasting over prior works. Our code is provided in the supplementary material.
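The sketch below shows one plausible reading of "variant-frequency positional encoding": standard sinusoidal encodings whose base frequency is rescaled per physical variable before being added to a token's embedding. The variable-to-scale mapping and all values are hypothetical stand-ins, not the paper's published formula.

```python
# Sinusoidal positional encoding with a per-variable frequency scale
# (illustrative interpretation of "variant-frequency" encoding, not PIPE itself).
import numpy as np

def sinusoidal_pe(position: float, dim: int, freq_scale: float) -> np.ndarray:
    i = np.arange(dim // 2)
    angles = position * freq_scale / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)])

dim = 64
# Hypothetical physical variables mapped to positional IDs with distinct scales.
pe_time = sinusoidal_pe(position=17.0, dim=dim, freq_scale=1.0)    # hour of day
pe_lat  = sinusoidal_pe(position=22.3, dim=dim, freq_scale=0.1)    # latitude
pe_lon  = sinusoidal_pe(position=114.2, dim=dim, freq_scale=0.1)   # longitude

token_embedding = np.zeros(dim)
token_embedding += pe_time + pe_lat + pe_lon   # physics added to the token embedding
print(token_embedding[:4])
```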

[CV-76] Automated MRI Tumor Segmentation using hybrid U-Net with Transformer and Efficient Attention

【Quick Read】: This paper aims to improve the accuracy of automatic segmentation of tumors and surrounding normal tissue for radiotherapy treatment planning, in particular the generalization and clinical applicability of models when the heterogeneity of local patient populations is under-represented in public training data. The key is a computationally efficient hybrid UNet-Transformer architecture trained on a local hospital MRI dataset: a transformer bottleneck plus complementary attention modules (efficient attention, Squeeze-and-Excitation blocks, the Convolutional Block Attention Module, and ResNeXt blocks) strengthen the model's representational power, while a DICOM extraction and preprocessing pipeline and image augmentation improve generalization across clinical settings.

Link: https://arxiv.org/abs/2506.15562
Authors: Syed Haider Ali,Asrar Ahmad,Muhammad Ali,Asifullah Khan,Muhammad Shahban,Nadeem Shaukat
Affiliations: Pakistan Institute of Engineering and Applied Sciences (PIEAS)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures

Abstract:Cancer is an abnormal growth with potential to invade locally and metastasize to distant organs. Accurate auto-segmentation of the tumor and surrounding normal tissues is required for radiotherapy treatment plan optimization. Recent AI-based segmentation models are generally trained on large public datasets, which lack the heterogeneity of local patient populations. While these studies advance AI-based medical image segmentation, research on local datasets is necessary to develop and integrate AI tumor segmentation models directly into hospital software for efficient and accurate oncology treatment planning and execution. This study enhances tumor segmentation using computationally efficient hybrid UNet-Transformer models on magnetic resonance imaging (MRI) datasets acquired from a local hospital under strict privacy protection. We developed a robust data pipeline for seamless DICOM extraction and preprocessing, followed by extensive image augmentation to ensure model generalization across diverse clinical settings, resulting in a total dataset of 6080 images for training. Our novel architecture integrates UNet-based convolutional neural networks with a transformer bottleneck and complementary attention modules, including efficient attention, Squeeze-and-Excitation (SE) blocks, Convolutional Block Attention Module (CBAM), and ResNeXt blocks. To accelerate convergence and reduce computational demands, we used a maximum batch size of 8 and initialized the encoder with pretrained ImageNet weights, training the model on dual NVIDIA T4 GPUs via checkpointing to overcome Kaggle’s runtime limits. Quantitative evaluation on the local MRI dataset yielded a Dice similarity coefficient of 0.764 and an Intersection over Union (IoU) of 0.736, demonstrating competitive performance despite limited data and underscoring the importance of site-specific model development for clinical deployment.
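Of the attention modules listed, the Squeeze-and-Excitation (SE) block is the simplest to show; below is its standard formulation in PyTorch, as a generic sketch rather than code from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise reweighting (standard formulation)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: global average pool per channel
        return x * w[:, :, None, None]      # excite: rescale each channel

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```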

[CV-77] Advanced cervical cancer classification: enhancing pap smear images with hybrid PMD Filter-CLAHE

【Quick Read】: This paper aims to improve early cervical cancer detection, where Pap smear image quality limits the classification performance of convolutional neural networks (CNNs). The key is a hybrid image preprocessing technique that combines the Perona-Malik diffusion (PMD) filter with contrast-limited adaptive histogram equalization (CLAHE) to enhance image quality and thereby improve CNN classification performance.

Link: https://arxiv.org/abs/2506.15489
Authors: Ach Khozaimi,Isnani Darti,Syaiful Anam,Wuryansari Muharini Kusumawinahyu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cervical cancer remains a significant health problem, especially in developing countries. Early detection is critical for effective treatment. Convolutional neural networks (CNN) have shown promise in automated cervical cancer screening, but their performance depends on Pap smear image quality. This study investigates the impact of various image preprocessing techniques on CNN performance for cervical cancer classification using the SIPaKMeD dataset. Three preprocessing techniques were evaluated: perona-malik diffusion (PMD) filter for noise reduction, contrast-limited adaptive histogram equalization (CLAHE) for image contrast enhancement, and the proposed hybrid PMD filter-CLAHE approach. The enhanced image datasets were evaluated on pretrained models, such as ResNet-34, ResNet-50, SqueezeNet-1.0, MobileNet-V2, EfficientNet-B0, EfficientNet-B1, DenseNet-121, and DenseNet-201. The results show that hybrid preprocessing PMD filter-CLAHE can improve the Pap smear image quality and CNN architecture performance compared to the original images. The maximum metric improvements are 13.62% for accuracy, 10.04% for precision, 13.08% for recall, and 14.34% for F1-score. The proposed hybrid PMD filter-CLAHE technique offers a new perspective in improving cervical cancer classification performance using CNN architectures.
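A minimal sketch of the hybrid preprocessing follows: a NumPy implementation of Perona-Malik diffusion and then OpenCV's CLAHE. Iteration count, kappa, gamma, clip limit, tile size, and the input filename are illustrative assumptions, not the paper's settings.

```python
# Hybrid preprocessing sketch: Perona-Malik diffusion, then CLAHE.
import cv2
import numpy as np

def perona_malik(img: np.ndarray, n_iter=15, kappa=30.0, gamma=0.15) -> np.ndarray:
    u = img.astype(np.float32)
    for _ in range(n_iter):
        # Finite-difference gradients toward the four neighbours.
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        # Edge-stopping conductance g(|grad|) = exp(-(|grad|/kappa)^2).
        u += gamma * sum(np.exp(-(d / kappa) ** 2) * d for d in (dn, ds, de, dw))
    return np.clip(u, 0, 255).astype(np.uint8)

img = cv2.imread("pap_smear.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
smoothed = perona_malik(img)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(smoothed)
cv2.imwrite("pap_smear_enhanced.png", enhanced)
```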

[CV-78] A Real-time Endoscopic Image Denoising System

【Quick Read】: This paper addresses the image noise caused by the ultra-compact analogue image sensors used in medical endoscopes, including fixed-pattern noise, periodic banding noise, and mixed Poisson-Gaussian noise, which degrade image quality and diagnostic accuracy. The key is a comprehensive noise model for these sensors combined with a hybrid denoising system that marries traditional image-processing algorithms with advanced learning-based denoising, effectively reducing noise without losing fine detail or introducing colour distortion while achieving real-time performance on FPGA platforms.

Link: https://arxiv.org/abs/2506.15395
Authors: Yu Xing,Shishi Huang,Meng Lv,Guo Chen,Huailiang Wang,Lingzhi Sui
Affiliations: Beijing Yi Li Technology
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Endoscopes featuring a miniaturized design have significantly enhanced operational flexibility, portability, and diagnostic capability while substantially reducing the invasiveness of medical procedures. Recently, single-use endoscopes equipped with an ultra-compact analogue image sensor measuring less than 1mm x 1mm bring revolutionary advancements to medical diagnosis. They reduce the structural redundancy and large capital expenditures associated with reusable devices, eliminate the risk of patient infections caused by inadequate disinfection, and alleviate patient suffering. However, the limited photosensitive area results in reduced photon capture per pixel, requiring higher photon sensitivity settings to maintain adequate brightness. In high-contrast medical imaging scenarios, the small-sized sensor exhibits a constrained dynamic range, making it difficult to simultaneously capture details in both highlights and shadows, and additional localized digital gain is required to compensate. Moreover, the simplified circuit design and analog signal transmission introduce additional noise sources. These factors collectively contribute to significant noise issues in processed endoscopic images. In this work, we developed a comprehensive noise model for analog image sensors in medical endoscopes, addressing three primary noise types: fixed-pattern noise, periodic banding noise, and mixed Poisson-Gaussian noise. Building on this analysis, we propose a hybrid denoising system that synergistically combines traditional image processing algorithms with advanced learning-based techniques for captured raw frames from sensors. Experiments demonstrate that our approach effectively reduces image noise without fine detail loss or color distortion, while achieving real-time performance on FPGA platforms and an average PSNR improvement from 21.16 to 33.05 on our test dataset.
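To illustrate the three noise types in the sensor model (fixed-pattern, periodic banding, mixed Poisson-Gaussian), the sketch below synthesizes them on a stand-in frame with NumPy. All parameter values are illustrative assumptions, not the paper's calibrated model.

```python
# Simulating the three noise sources named above on a clean stand-in frame.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(20, 200, size=(480, 640)).astype(np.float32)  # stand-in frame

gain = 2.0                                    # assumed photons per digital unit
shot = rng.poisson(clean / gain) * gain       # signal-dependent Poisson component
read = rng.normal(0.0, 3.0, clean.shape)      # Gaussian read noise

fpn = rng.normal(0.0, 1.5, (1, 640))          # column fixed-pattern noise offsets
rows = np.arange(480)[:, None]
banding = 2.0 * np.sin(2 * np.pi * rows / 32.0)  # periodic horizontal banding

noisy = np.clip(shot + read + fpn + banding, 0, 255)
print(noisy.mean(), noisy.std())
```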

[CV-79] FedWSIDD: Federated Whole Slide Image Classification via Dataset Distillation MICCAI2025

【Quick Read】: This paper targets the heterogeneous computational resources and privacy concerns that federated learning (FL) faces in whole slide image (WSI) classification. The key is FedWSIDD, a new FL paradigm based on dataset distillation (DD) that learns and transmits synthetic slides instead of model parameters, enabling cross-institution collaborative modelling while preserving patient privacy and improving local WSI classification performance.

Link: https://arxiv.org/abs/2506.15365
Authors: Haolong Jin,Shenglin Liu,Cong Cong,Qingmin Feng,Yongzhi Liu,Lina Huang,Yingzi Hu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025

Abstract:Federated learning (FL) has emerged as a promising approach for collaborative medical image analysis, enabling multiple institutions to build robust predictive models while preserving sensitive patient data. In the context of Whole Slide Image (WSI) classification, FL faces significant challenges, including heterogeneous computational resources across participating medical institutes and privacy concerns. To address these challenges, we propose FedWSIDD, a novel FL paradigm that leverages dataset distillation (DD) to learn and transmit synthetic slides. On the server side, FedWSIDD aggregates synthetic slides from participating centres and distributes them across all centres. On the client side, we introduce a novel DD algorithm tailored to histopathology datasets which incorporates stain normalisation into the distillation process to generate a compact set of highly informative synthetic slides. These synthetic slides, rather than model parameters, are transmitted to the server. After communication, the received synthetic slides are combined with original slides for local tasks. Extensive experiments on multiple WSI classification tasks, including CAMELYON16 and CAMELYON17, demonstrate that FedWSIDD offers flexibility for heterogeneous local models, enhances local WSI classification performance, and preserves patient privacy. This makes it a highly effective solution for complex WSI classification tasks. The code is available at FedWSIDD.

[CV-80] Privacy-Preserving Chest X-ray Classification in Latent Space with Homomorphically Encrypted Neural Inference

【Quick Read】: This paper addresses the excessive computational cost of homomorphic-encryption (HE) inference on medical imaging data, especially for large images such as chest X-rays. The key is to use a VQGAN to compress images into latent representations, sharply reducing the computational burden while preserving image quality; activation functions are approximated with low-degree polynomials to balance accuracy and efficiency under HE constraints, and a squeeze-and-excitation module is adapted to strengthen the HE framework's performance.

Link: https://arxiv.org/abs/2506.15258
Authors: Jonghun Kim,Gyeongdeok Jo,Shinyoung Ra,Hyunjin Park
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Abstract:Medical imaging data contain sensitive patient information requiring strong privacy protection. Many analytical setups require data to be sent to a server for inference purposes. Homomorphic encryption (HE) provides a solution by allowing computations to be performed on encrypted data without revealing the original information. However, HE inference is computationally expensive, particularly for large images (e.g., chest X-rays). In this study, we propose an HE inference framework for medical images that uses VQGAN to compress images into latent representations, thereby significantly reducing the computational burden while preserving image quality. We approximate the activation functions with lower-degree polynomials to balance the accuracy and efficiency in compliance with HE requirements. We observed that a downsampling factor of eight for compression achieved an optimal balance between performance and computational cost. We further adapted the squeeze and excitation module, which is known to improve traditional CNNs, to enhance the HE framework. Our method was tested on two chest X-ray datasets for multi-label classification tasks using vanilla CNN backbones. Although HE inference remains relatively slow and introduces minor performance differences compared with unencrypted inference, our approach shows strong potential for practical use in medical images.
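The low-degree polynomial approximation step can be illustrated directly: HE schemes evaluate only additions and multiplications, so a non-polynomial activation is replaced by a least-squares polynomial fit over an expected input range. The degree and range below are assumptions, not the paper's configuration.

```python
# Fitting a low-degree polynomial substitute for an activation function.
import numpy as np

x = np.linspace(-6, 6, 2001)
relu = np.maximum(x, 0.0)

coeffs = np.polyfit(x, relu, deg=4)       # least-squares degree-4 fit
poly_relu = np.poly1d(coeffs)

err = np.abs(poly_relu(x) - relu)
print("max abs error on [-6, 6]:", err.max())

# Inside an HE pipeline, poly_relu(x) would be evaluated on ciphertexts using
# only homomorphic additions and multiplications.
```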

[CV-81] Classification of Multi-Parametric Body MRI Series Using Deep Learning

【Quick Read】: This paper addresses inaccurate DICOM header information in multi-parametric MRI (mpMRI) exams, caused by the sheer diversity of imaging protocols and occasional technologist errors, which hinders efficient reading by radiologists. The key is a deep-learning classifier over 8 body mpMRI series types; among the models tested, DenseNet-121 performs best, achieving the highest F1-score and accuracy across several evaluation metrics and remaining robust across training-data quantities and external datasets.

Link: https://arxiv.org/abs/2506.15182
Authors: Boah Kim,Tejas Sudharshan Mathai,Kimberly Helm,Peter A. Pinto,Ronald M. Summers
Affiliations: National Institutes of Health Clinical Center; National Cancer Institute
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multi-parametric magnetic resonance imaging (mpMRI) exams have various series types acquired with different imaging protocols. The DICOM headers of these series often have incorrect information due to the sheer diversity of protocols and occasional technologist errors. To address this, we present a deep learning-based classification model to classify 8 different body mpMRI series types so that radiologists read the exams efficiently. Using mpMRI data from various institutions, multiple deep learning-based classifiers of ResNet, EfficientNet, and DenseNet are trained to classify 8 different MRI series, and their performance is compared. Then, the best-performing classifier is identified, and its classification capability under the setting of different training data quantities is studied. Also, the model is evaluated on the out-of-training-distribution datasets. Moreover, the model is trained using mpMRI exams obtained from different scanners in two training strategies, and its performance is tested. Experimental results show that the DenseNet-121 model achieves the highest F1-score and accuracy of 0.966 and 0.972 over the other classification models with p-value < 0.05. The model shows greater than 0.95 accuracy when trained with over 729 studies of the training data, whose performance improves as the training data quantity grows larger. On the external data with the DLDS and CPTAC-UCEC datasets, the model yields 0.872 and 0.810 accuracy for each. These results indicate that in both the internal and external datasets, the DenseNet-121 model attains high accuracy for the task of classifying 8 body MRI series types.

[CV-82] NeuroMoE: A Transformer-Based Mixture-of-Experts Framework for Multi-Modal Neurological Disorder Classification

【Quick Read】: This paper addresses the ineffective fusion of multi-modal MRI and clinical data for diagnosing neurological disorders (NDs), where existing deep-learning methods underuse multimodal data and underperform. The key is a transformer-based Mixture-of-Experts (MoE) framework: transformer encoders capture spatial relationships in volumetric MRI data, modality-specific experts perform targeted feature extraction, and a gating mechanism with adaptive fusion dynamically integrates the expert outputs to improve diagnostic accuracy.

Link: https://arxiv.org/abs/2506.14970
Authors: Wajih Hassan Raza,Aamir Bader Shah,Yu Wen,Yidan Shen,Juan Diego Martinez Lemus,Mya Caryn Schiess,Timothy Michael Ellmore,Renjie Hu,Xin Fu
Affiliations: University of Houston; University of Texas Health Science Center at Houston; McGovern Medical School; The City College of New York
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Abstract:The integration of multi-modal Magnetic Resonance Imaging (MRI) and clinical data holds great promise for enhancing the diagnosis of neurological disorders (NDs) in real-world clinical settings. Deep Learning (DL) has recently emerged as a powerful tool for extracting meaningful patterns from medical data to aid in diagnosis. However, existing DL approaches struggle to effectively leverage multi-modal MRI and clinical data, leading to suboptimal performance. To address this challenge, we utilize a unique, proprietary multi-modal clinical dataset curated for ND research. Based on this dataset, we propose a novel transformer-based Mixture-of-Experts (MoE) framework for ND classification, leveraging multiple MRI modalities-anatomical (aMRI), Diffusion Tensor Imaging (DTI), and functional (fMRI)-alongside clinical assessments. Our framework employs transformer encoders to capture spatial relationships within volumetric MRI data while utilizing modality-specific experts for targeted feature extraction. A gating mechanism with adaptive fusion dynamically integrates expert outputs, ensuring optimal predictive performance. Comprehensive experiments and comparisons with multiple baselines demonstrate that our multi-modal approach significantly enhances diagnostic accuracy, particularly in distinguishing overlapping disease states. Our framework achieves a validation accuracy of 82.47%, outperforming baseline methods by over 10%, highlighting its potential to improve ND diagnosis by applying multi-modal learning to real-world clinical data.
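A minimal sketch of the gating-with-adaptive-fusion idea follows: modality-specific experts embed each modality's features, and a softmax gate weights the expert outputs before a shared classification head. All dimensions and the gate design are illustrative assumptions, not the paper's architecture.

```python
# Gating over modality-specific experts (illustrative sketch).
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dims, hidden=64, n_cls=3):
        super().__init__()
        self.keys = list(dims)
        self.experts = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.gate = nn.Linear(sum(dims.values()), len(dims))  # one weight per expert
        self.head = nn.Linear(hidden, n_cls)

    def forward(self, feats):
        gate_in = torch.cat([feats[k] for k in self.keys], dim=-1)
        w = torch.softmax(self.gate(gate_in), dim=-1)                       # (B, E)
        outs = torch.stack([self.experts[k](feats[k]) for k in self.keys], dim=1)
        fused = (w.unsqueeze(-1) * outs).sum(dim=1)    # adaptive fusion of experts
        return self.head(fused)

dims = {"aMRI": 256, "DTI": 128, "fMRI": 512}   # hypothetical feature sizes
x = {m: torch.randn(4, d) for m, d in dims.items()}
print(ModalityMoE(dims)(x).shape)  # torch.Size([4, 3])
```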

[CV-83] Recursive Variational Autoencoders for 3D Blood Vessel Generative Modeling

【Quick Read】: This paper addresses how to generate vascular structures that are both accurate and diverse, a problem important for clinical diagnosis and treatment planning; existing methods are mostly rule-based and fail to capture the complexity and diversity of real anatomical data. The key is a Recursive variational Neural Network (RvNN) that fully exploits the hierarchical organization of vessels and learns a low-dimensional manifold encoding branch connectivity together with geometry features describing the target surface; sampling the trained RvNN latent space generates 3D vessel models that are accurate and diverse, with important applications in medical training, hemodynamic simulation, and beyond.

Link: https://arxiv.org/abs/2506.14914
Authors: Paula Feldman,Miguel Fainstein,Viviana Siless,Claudio Delrieux,Emmanuel Iarussi
Affiliations: Consejo Nacional de Investigaciones Científicas y Técnicas; Universidad Torcuato Di Tella; Universidad Nacional del Sur
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Anatomical trees play an important role in clinical diagnosis and treatment planning. Yet, accurately representing these structures poses significant challenges owing to their intricate and varied topology and geometry. Most existing methods to synthesize vasculature are rule based, and despite providing some degree of control and variation in the structures produced, they fail to capture the diversity and complexity of actual anatomical data. We developed a Recursive variational Neural Network (RvNN) that fully exploits the hierarchical organization of the vessel and learns a low-dimensional manifold encoding branch connectivity along with geometry features describing the target surface. After training, the RvNN latent space can be sampled to generate new vessel geometries. By leveraging the power of generative neural networks, we generate 3D models of blood vessels that are both accurate and diverse, which is crucial for medical and surgical training, hemodynamic simulations, and many other purposes. These results closely resemble real data, achieving high similarity in vessel radii, length, and tortuosity across various datasets, including those with aneurysms. To the best of our knowledge, this work is the first to utilize this technique for synthesizing blood vessels.

[CV-84] Foundation Artificial Intelligence Models for Health Recognition Using Face Photographs (FAHR-Face)

【Quick Read】: This paper pursues noninvasive health assessment from facial photographs, specifically biological age estimation and survival-risk prediction. The key is FAHR-Face, a foundation model trained on 40 million facial images and fine-tuned for two tasks: FAHR-FaceAge for biological age estimation and FAHR-FaceSurvival for survival-risk prediction. Clinical testing on multiple datasets verifies the models' robustness and generalizability, and the complementary information provided by the two algorithms improves prognostic accuracy.

Link: https://arxiv.org/abs/2506.14909
Authors: Fridolin Haugg,Grace Lee,John He,Leonard Nürnberg,Dennis Bontempi,Danielle S. Bitterman,Paul Catalano,Vasco Prudente,Dmitrii Glubokov,Andrew Warrington,Suraj Pai,Dirk De Ruysscher,Christian Guthier,Benjamin H. Kann,Vadim N. Gladyshev,Hugo JWL Aerts,Raymond H. Mak
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Background: Facial appearance offers a noninvasive window into health. We built FAHR-Face, a foundation model trained on 40 million facial images and fine-tuned it for two distinct tasks: biological age estimation (FAHR-FaceAge) and survival risk prediction (FAHR-FaceSurvival). Methods: FAHR-FaceAge underwent a two-stage, age-balanced fine-tuning on 749,935 public images; FAHR-FaceSurvival was fine-tuned on 34,389 photos of cancer patients. Model robustness (cosmetic surgery, makeup, pose, lighting) and independence (saliency mapping) was tested extensively. Both models were clinically tested in two independent cancer patient datasets with survival analyzed by multivariable Cox models and adjusted for clinical prognostic factors. Findings: For age estimation, FAHR-FaceAge had the lowest mean absolute error of 5.1 years on public datasets, outperforming benchmark models and maintaining accuracy across the full human lifespan. In cancer patients, FAHR-FaceAge outperformed a prior facial age estimation model in survival prognostication. FAHR-FaceSurvival demonstrated robust prediction of mortality, and the highest-risk quartile had more than triple the mortality of the lowest (adjusted hazard ratio 3.22; P < 0.001). These findings were validated in the independent cohort and both models showed generalizability across age, sex, race and cancer subgroups. The two algorithms provided distinct, complementary prognostic information; saliency mapping revealed each model relied on distinct facial regions. The combination of FAHR-FaceAge and FAHR-FaceSurvival improved prognostic accuracy. Interpretation: A single foundation model can generate inexpensive, scalable facial biomarkers that capture both biological ageing and disease-related mortality risk. The foundation model enabled effective training using relatively small clinical datasets.

[CV-85] Improving Prostate Gland Segmenting Using Transformer based Architectures

【Quick Read】: This paper examines whether transformer models can maintain precision for automatic prostate segmentation on T2-weighted MRI despite inter-reader variability and cross-site domain shift. The key finding is that SwinUNETR's global and shifted-window self-attention effectively reduces sensitivity to label noise and class imbalance, improving the Dice similarity coefficient while remaining computationally efficient and showing strong robustness for clinical deployment.

Link: https://arxiv.org/abs/2506.14844
Authors: Shatha Abudalou
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Inter reader variability and cross site domain shift challenge the automatic segmentation of prostate anatomy using T2 weighted MRI images. This study investigates whether transformer models can retain precision amid such heterogeneity. We compare the performance of UNETR and SwinUNETR in prostate gland segmentation against our previous 3D UNet model [1], based on 546 MRI (T2weighted) volumes annotated by two independent experts. Three training strategies were analyzed: single cohort dataset, 5 fold cross validated mixed cohort, and gland size based dataset. Hyperparameters were tuned by Optuna. The test set, from an independent population of readers, served as the evaluation endpoint (Dice Similarity Coefficient). In single reader training, SwinUNETR achieved an average dice score of 0.816 for Reader#1 and 0.860 for Reader#2, while UNETR scored 0.8 and 0.833 for Readers #1 and #2, respectively, compared to the baseline UNet's 0.825 for Reader #1 and 0.851 for Reader #2. SwinUNETR had an average dice score of 0.8583 for Reader#1 and 0.867 for Reader#2 in cross-validated mixed training. For the gland size-based dataset, SwinUNETR achieved an average dice score of 0.902 for Reader#1 subset and 0.894 for Reader#2, using the five-fold mixed training strategy (Reader#1, n=53; Reader#2, n=87) at larger gland size-based subsets, where UNETR performed poorly. Our findings demonstrate that global and shifted-window self-attention effectively reduces label noise and class imbalance sensitivity, resulting in improvements in the Dice score over CNNs by up to five points while maintaining computational efficiency. This contributes to the high robustness of SwinUNETR for clinical deployment.

[CV-86] Deploying and Evaluating Multiple Deep Learning Models on Edge Devices for Diabetic Retinopathy Detection

【Quick Read】: This paper addresses the inefficiency, time cost, and resource demands of manual diabetic retinopathy (DR) screening. The key is to deploy multiple deep-learning models on edge devices via the Edge Impulse platform for real-time DR detection: TensorFlow CNNs including MobileNet, ShuffleNet, SqueezeNet, and a custom deep neural network (DNN) are built and optimized, then converted to TensorFlow Lite format with 8-bit integer quantization to shrink model size and speed up inference while retaining high accuracy, enabling efficient, precise DR detection across different edge hardware platforms.

Link: https://arxiv.org/abs/2506.14834
Authors: Akwasi Asare,Dennis Agyemanh Nana Gookyi,Derrick Boateng,Fortunatus Aabangbio Wulnye
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diabetic Retinopathy (DR), a leading cause of vision impairment in individuals with diabetes, affects approximately 34.6% of diabetes patients globally, with the number of cases projected to reach 242 million by 2045. Traditional DR diagnosis relies on the manual examination of retinal fundus images, which is both time-consuming and resource intensive. This study presents a novel solution using Edge Impulse to deploy multiple deep learning models for real-time DR detection on edge devices. A robust dataset of over 3,662 retinal fundus images, sourced from the Kaggle EyePACS dataset, was curated, and enhanced through preprocessing techniques, including augmentation and normalization. Using TensorFlow, various Convolutional Neural Networks (CNNs), such as MobileNet, ShuffleNet, SqueezeNet, and a custom Deep Neural Network (DNN), were designed, trained, and optimized for edge deployment. The models were converted to TensorFlow Lite and quantized to 8-bit integers to reduce their size and enhance inference speed, with minimal trade-offs in accuracy. Performance evaluations across different edge hardware platforms, including smartphones and microcontrollers, highlighted key metrics such as inference speed, accuracy, precision, and resource utilization. MobileNet achieved an accuracy of 96.45%, while SqueezeNet demonstrated strong real-time performance with a small model size of 176 KB and latency of just 17 ms on GPU. ShuffleNet and the custom DNN achieved moderate accuracy but excelled in resource efficiency, making them suitable for lower-end devices. This integration of edge AI technology into healthcare presents a scalable, cost-effective solution for early DR detection, providing timely and accurate diagnosis, especially in resource-constrained and remote healthcare settings.
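The conversion-and-quantization step uses the standard TensorFlow Lite post-training workflow; the sketch below applies it to a small placeholder model with random calibration data. In the actual pipeline, the trained DR classifiers and real retinal fundus images would take their place.

```python
# Post-training full-integer quantization to TensorFlow Lite.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input((224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # placeholder output classes
])

def representative_data():   # calibration samples fix activation ranges
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

open("dr_model_int8.tflite", "wb").write(converter.convert())
```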

[CV-87] Empirical Studies of Large Scale Environment Scanning by Consumer Electronics

【Quick Read】: This paper evaluates the performance and suitability of the Matterport Pro3, a consumer-grade 3D scanning device, for large-scale environment reconstruction, addressing the challenges met in practical use. By scanning a six-floor building in detail and comparing against an iPhone scan, the study demonstrates the Pro3's advantage in point-cloud density and alignment accuracy; the key is its use of LiDAR and advanced alignment algorithms, which raise scan quality and reconstruction fidelity toward accurate, large-scale 3D modelling.

Link: https://arxiv.org/abs/2506.14771
Authors: Mengyuan Wang,Yang Liu,Haopeng Wang,Haiwei Dong,Abdulmotaleb El Saddik
Affiliations: University of Ottawa; Huawei Canada
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Multimedia (cs.MM)
Comments: Accepted by IEEE Consumer Electronics Magazine

Abstract:This paper presents an empirical evaluation of the Matterport Pro3, a consumer-grade 3D scanning device, for large-scale environment reconstruction. We conduct detailed scanning (1,099 scanning points) of a six-floor building (17,567 square meters) and assess the device’s effectiveness, limitations, and performance enhancements in diverse scenarios. Challenges encountered during the scanning are addressed through proposed solutions, while we also explore advanced methods to overcome them more effectively. Comparative analysis with another consumer-grade device (iPhone) highlights the Pro3’s balance between cost-effectiveness and performance. The Matterport Pro3 achieves a denser point cloud with 1,877,324 points compared to the iPhone’s 506,961 points and higher alignment accuracy with an RMSE of 0.0118 meters. The cloud-to-cloud (C2C) average distance error between the two point cloud models is 0.0408 meters, with a standard deviation of 0.0715 meters. The study demonstrates the Pro3’s ability to generate high-quality 3D models suitable for large-scale applications, leveraging features such as LiDAR and advanced alignment techniques.
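The cloud-to-cloud (C2C) metric reported above is straightforward to compute: for each point in one cloud, take the distance to its nearest neighbour in the other cloud, then summarize. A SciPy sketch with random stand-in clouds:

```python
# C2C distance sketch: nearest-neighbour distances from cloud A into cloud B.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
cloud_a = rng.uniform(0, 10, size=(50_000, 3))           # e.g., Pro3 point cloud
cloud_b = cloud_a + rng.normal(0, 0.05, cloud_a.shape)   # e.g., iPhone cloud

dists, _ = cKDTree(cloud_b).query(cloud_a)   # per-point nearest-neighbour distance
print(f"C2C mean {dists.mean():.4f} m, std {dists.std():.4f} m")
```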

Artificial Intelligence

[AI-0] SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

【Quick Read】: This paper addresses the lack of full autonomy in existing agentic-system generation frameworks, which cannot generate agents from scratch, lack self-optimizing agent functionality, and lack collaboration, limiting adaptability and scalability. The key is SwarmAgentic, a framework that constructs agentic systems from scratch through language-driven exploration and jointly optimizes agent functionality and collaboration as interdependent components; drawing on Particle Swarm Optimization (PSO), it maintains a population of candidate systems and evolves them via feedback-guided updates, enabling efficient search over system-level structures.

Link: https://arxiv.org/abs/2506.15672
Authors: Yao Zhang,Chenyang Lin,Shijie Tang,Haokun Chen,Shijie Zhou,Yunpu Ma,Volker Tresp
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 41 pages

Abstract:The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated system multi-agent generation. Our code is publicly released at this https URL.
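For reference, the classic Particle Swarm Optimization loop that SwarmAgentic draws inspiration from is sketched below over a toy continuous objective. SwarmAgentic itself evolves candidate agentic systems through language-driven, feedback-guided updates rather than numeric vectors; this only shows the underlying PSO template.

```python
# Minimal PSO over a toy objective (minimize the squared norm).
import numpy as np

rng = np.random.default_rng(0)
n, dim, w, c1, c2 = 30, 5, 0.7, 1.5, 1.5
f = lambda x: np.sum(x ** 2, axis=-1)          # objective to minimize

pos = rng.uniform(-5, 5, (n, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), f(pos)
gbest = pbest[pbest_val.argmin()]

for _ in range(200):
    r1, r2 = rng.random((2, n, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    val = f(pos)
    improved = val < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], val[improved]
    gbest = pbest[pbest_val.argmin()]

print("best value:", f(gbest))
```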

[AI-1] Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement

【Quick Read】: This paper tackles overthinking in large reasoning models (LRMs) during complex problem solving, i.e., the generation of unnecessarily verbose and redundant content that hurts efficiency and inflates inference cost. The key is to expose the LRMs' inherent capacity for concise reasoning and exploit it with two lightweight methods: training-free Efficiency Steering, which modulates reasoning behavior via a single direction in the model's representation space, and a Self-Rewarded Efficiency RL framework, which dynamically balances task accuracy and brevity by rewarding concise correct solutions.

Link: https://arxiv.org/abs/2506.15647
Authors: Weixiang Zhao,Jiahe Guo,Yang Deng,Xingyu Sui,Yulin Hu,Yanyan Zhao,Wanxiang Che,Bing Qin,Tat-Seng Chua,Ting Liu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in large reasoning models (LRMs) have significantly enhanced language models’ capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning. Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential. Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model’s representation space. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions. Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance. Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner.
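A minimal sketch of training-free activation steering follows: a fixed direction is added to a layer's output through a forward hook at inference time. The stand-in module, the random direction, and the scale are illustrative assumptions; in the paper, the direction lives in an LRM's representation space and is chosen to modulate reasoning length.

```python
# Activation steering via a forward hook (generic PyTorch sketch).
import torch
import torch.nn as nn

hidden = 64
block = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())  # stand-in layer

steer_dir = torch.randn(hidden)
steer_dir /= steer_dir.norm()      # unit direction in representation space
alpha = 2.0                        # steering strength (assumed)

def steering_hook(module, inputs, output):
    return output + alpha * steer_dir   # shift activations along the direction

handle = block.register_forward_hook(steering_hook)
out = block(torch.randn(1, hidden))    # steered forward pass
handle.remove()                        # restore original behaviour
print(out.shape)
```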

[AI-2] The AI Policy Module: Developing Computer Science Student Competency in AI Ethics and Policy

【Quick Read】: This paper addresses the failure of current post-secondary computer-science curricula to prepare future AI practitioners for AI policy and the practice of AI ethics. The key is the development and deployment of an AI Policy Module that brings AI policy discussion into the CS curriculum, paired with an "AI regulation" technical assignment that strengthens students' awareness of and engagement with the ethical implications of AI, improving their readiness for ethical and policy challenges in real engineering work.

Link: https://arxiv.org/abs/2506.15639
Authors: James Weichert,Daniel Dunlap,Mohammed Farghally,Hoda Eldardiry
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE Frontiers in Education (FIE) 2025

Abstract:As artificial intelligence (AI) further embeds itself into many settings across personal and professional contexts, increasing attention must be paid not only to AI ethics, but also to the governance and regulation of AI technologies through AI policy. However, the prevailing post-secondary computing curriculum is currently ill-equipped to prepare future AI practitioners to confront increasing demands to implement abstract ethical principles and normative policy preferences into the design and development of AI systems. We believe that familiarity with the 'AI policy landscape' and the ability to translate ethical principles to practices will in the future constitute an important responsibility for even the most technically-focused AI engineers. Toward preparing current computer science (CS) students for these new expectations, we developed an AI Policy Module to introduce discussions of AI policy into the CS curriculum. Building on a successful pilot in fall 2024, in this innovative practice full paper we present an updated and expanded version of the module, including a technical assignment on "AI regulation". We present the findings from our pilot of the AI Policy Module 2.0, evaluating student attitudes towards AI ethics and policy through pre- and post-module surveys. Following the module, students reported increased concern about the ethical impacts of AI technologies while also expressing greater confidence in their abilities to engage in discussions about AI regulation. Finally, we highlight the AI Regulation Assignment as an effective and engaging tool for exploring the limits of AI alignment and emphasizing the role of 'policy' in addressing ethical challenges.

[AI-3] Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction

【Quick Read】: This paper addresses how to train robust brain-predicted age difference (BrainAGE) models under privacy constraints, i.e., how to estimate brain age accurately without centralizing data. The study adopts federated learning (FL) as the solution: each hospital centre trains locally and model parameters are aggregated, protecting patient privacy while improving performance over single-site training, and validating FL's effectiveness in ischemic stroke patients along with BrainAGE's associations with clinical phenotypes and functional outcomes.

Link: https://arxiv.org/abs/2506.15626
Authors: Vincent Roca,Marc Tommasi,Paul Andrey,Aurélien Bellet,Markus D. Schirmer,Hilde Henon,Laurent Puy,Julien Ramon,Grégory Kuchcinski,Martin Bretzner,Renaud Lopes
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract: Objective: Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. Methods: We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE's predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. Results: While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. Conclusion: FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
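The local-training-plus-aggregation pattern can be illustrated with a FedAvg-style weighted parameter average, sketched below in NumPy with hypothetical per-centre weights and sample counts. The study's actual FL algorithm and model details are not reproduced here.

```python
# FedAvg-style aggregation: average parameters weighted by local sample counts.
import numpy as np

def federated_average(local_weights, n_samples):
    total = sum(n_samples)
    return [
        sum(w[k] * n / total for w, n in zip(local_weights, n_samples))
        for k in range(len(local_weights[0]))
    ]

rng = np.random.default_rng(0)
# Hypothetical parameters from 16 centres (two weight arrays per local model).
centres = [[rng.normal(size=(4, 4)), rng.normal(size=4)] for _ in range(16)]
counts = [120, 80, 95, 60, 110, 70, 150, 90, 85, 100, 130, 75, 65, 140, 55, 105]

global_model = federated_average(centres, counts)
print(global_model[0].shape, global_model[1].shape)  # (4, 4) (4,)
```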

[AI-4] The Effect of State Representation on LLM Agent Behavior in Dynamic Routing Games

【Quick Read】: This paper asks how to systematically construct natural-language "state" representations of game history for LLM decision-makers in dynamic multi-agent games. Prior work encodes game history ad hoc, which obscures how state representation shapes agent behavior and limits comparability across studies. The key is a unifying framework that characterizes state representations along three axes: action informativeness (how fully the state captures actions played), reward informativeness (how fully it describes rewards obtained), and prompting style (natural-language compression, i.e., how much the full text history is summarized). Applied to a dynamic selfish routing game, the framework shows that particular natural-language state representations significantly shape LLM agent behavior and bring it closer to game-theoretic equilibrium predictions.

Link: https://arxiv.org/abs/2506.15624
Authors: Lyle Goodyear,Rachel Guo,Ramesh Johari
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 20 figures

Abstract:Large Language Models (LLMs) have shown promise as decision-makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history. We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi-agent games. Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies. Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments (Rapoport et al., 2009). Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation. In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents. By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both.
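To make the three axes concrete, the sketch below builds a summarized, regret-based state prompt with limited information about others' actions, the combination the study found closest to equilibrium play. The wording, fields, and route names are hypothetical.

```python
# Building a summarized, regret-based state prompt for an LLM routing agent.
def build_state_prompt(round_no: int, my_route: str,
                       my_payoff: float, best_payoff: float) -> str:
    regret = best_payoff - my_payoff   # reward info given as regret, not raw payoff
    return (
        f"Round {round_no} summary: you chose route {my_route}. "
        f"Your regret versus the best route in hindsight was {regret:.2f}. "
        "Others' individual choices are not disclosed. "   # limited action info
        "Choose a route for the next round: A or B."
    )

print(build_state_prompt(round_no=7, my_route="A", my_payoff=3.1, best_payoff=4.0))
```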

[AI-5] GFLC: Graph-based Fairness-aware Label Correction for Fair Classification

【Quick Read】: This paper addresses fairness problems in machine learning caused by biased, noisy training labels, which degrade model performance and distort the measured fairness of classifiers at test time. The key is Graph-based Fairness-aware Label Correction (GFLC), which corrects label noise while preserving demographic parity by combining three components: a prediction-confidence measure, graph-based regularization via Ricci-flow-optimized graph Laplacians, and explicit demographic-parity incentives.

Link: https://arxiv.org/abs/2506.15620
Authors: Modar Sulaiman,Kallol Roy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 6 figures

Abstract:Fairness in machine learning (ML) has a critical importance for building trustworthy machine learning system as artificial intelligence (AI) systems increasingly impact various aspects of society, including healthcare decisions and legal judgments. Moreover, numerous studies demonstrate evidence of unfair outcomes in ML and the need for more robust fairness-aware methods. However, the data we use to train and develop debiasing techniques often contains biased and noisy labels. As a result, the label bias in the training data affects model performance and misrepresents the fairness of classifiers during testing. To tackle this problem, our paper presents Graph-based Fairness-aware Label Correction (GFLC), an efficient method for correcting label noise while preserving demographic parity in datasets. In particular, our approach combines three key components: prediction confidence measure, graph-based regularization through Ricci-flow-optimized graph Laplacians, and explicit demographic parity incentives. Our experimental findings show the effectiveness of our proposed approach and show significant improvements in the trade-off between performance and fairness metrics compared to the baseline.
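A much-simplified sketch of the graph component follows: labels are smoothed over a kNN graph, and points that disagree strongly with their neighbourhood are flagged and corrected. This deliberately omits GFLC's Ricci-flow-optimized Laplacian, confidence measure, and demographic-parity incentive; it only illustrates the general idea of graph-based label correction.

```python
# Simplified graph-based label smoothing for noisy-label detection.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)
flip = rng.choice(200, 20, replace=False)
y[flip] = 1 - y[flip]                          # inject label noise

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)                            # symmetrize adjacency
D = np.asarray(W.sum(axis=1)).ravel()
smooth = (W @ y) / D                           # neighbourhood-averaged label score

suspect = np.abs(smooth - y) > 0.5             # strong disagreement with neighbours
y_corrected = np.where(suspect, np.round(smooth), y)
print("flagged:", suspect.sum(), "changed:", (y_corrected != y).sum())
```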

[AI-6] Managing Complex Failure Analysis Workflows with LLM-based Reasoning and Acting Agents

【Quick Read】: This paper addresses the difficulty of coordinating a growing number of AI models into cohesive, efficient failure-analysis (FA) workflows that integrate with the FA process. The key is the design and implementation of an LLM-based Planning Agent (LPA) that combines an LLM's natural-language capability with advanced planning and external tool use, enabling autonomous handling of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses.

Link: https://arxiv.org/abs/2506.15567
Authors: Aline Dobrovsky,Konstantin Schekotihin,Christian Burmer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Failure Analysis (FA) is a highly intricate and knowledge-intensive process. The integration of AI components within the computational infrastructure of FA labs has the potential to automate a variety of tasks, including the detection of non-conformities in images, the retrieval of analogous cases from diverse data sources, and the generation of reports from annotated images. However, as the number of deployed AI models increases, the challenge lies in orchestrating these components into cohesive and efficient workflows that seamlessly integrate with the FA process. This paper investigates the design and implementation of a Large Language Model (LLM)-based Planning Agent (LPA) to assist FA engineers in solving their analysis cases. The LPA integrates LLMs with advanced planning capabilities and external tool utilization, enabling autonomous processing of complex queries, retrieval of relevant data from external systems, and generation of human-readable responses. Evaluation results demonstrate the agent's operational effectiveness and reliability in supporting FA tasks.

[AI-7] Towards Explainable Indoor Localization: Interpreting Neural Network Learning on Wi-Fi Fingerprints Using Logic Gates

【Quick Read】: This paper addresses the limited interpretability of deep learning (DL) for indoor localization: existing DL frameworks are black boxes that reveal neither their prediction mechanism nor their response to environmental noise that varies over time, limiting model adaptability and stability in long-term deployment. The key is LogNet, a logic-gate-based framework that identifies the most influential access points (APs) for each reference point (RP) and reveals how environmental noise disrupts DL-driven localization decisions, enabling transparent reasoning about internal model behavior, failure diagnosis, and improved long-term reliability and performance.

Link: https://arxiv.org/abs/2506.15559
Authors: Danish Gufran,Sudeep Pasricha
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Indoor localization using deep learning (DL) has demonstrated strong accuracy in mapping Wi-Fi RSS fingerprints to physical locations; however, most existing DL frameworks function as black-box models, offering limited insight into how predictions are made or how models respond to real-world noise over time. This lack of interpretability hampers our ability to understand the impact of temporal variations - caused by environmental dynamics - and to adapt models for long-term reliability. To address this, we introduce LogNet, a novel logic gate-based framework designed to interpret and enhance DL-based indoor localization. LogNet enables transparent reasoning by identifying which access points (APs) are most influential for each reference point (RP) and reveals how environmental noise disrupts DL-driven localization decisions. This interpretability allows us to trace and diagnose model failures and adapt DL systems for more stable long-term deployments. Evaluations across multiple real-world building floorplans and over two years of temporal variation show that LogNet not only interprets the internal behavior of DL models but also improves performance-achieving up to 1.1x to 2.8x lower localization error, 3.4x to 43.3x smaller model size, and 1.5x to 3.6x lower latency compared to prior DL-based models.

[AI-8] DAILOC: Domain-Incremental Learning for Indoor Localization using Smartphones

【Quick Read】: This paper addresses domain shift in Wi-Fi fingerprinting-based indoor localization in real deployments, caused by device heterogeneity and indoor environments changing over time; existing methods usually treat these issues independently, generalize poorly, and are prone to catastrophic forgetting. The key of the proposed DAILOC, a new domain-incremental learning framework, is a disentanglement strategy that separates domain shifts from location-relevant features using a multi-level variational autoencoder, combined with a memory-guided class latent alignment mechanism that mitigates catastrophic forgetting over time.

Link: https://arxiv.org/abs/2506.15554
Authors: Akhil Singampalli,Danish Gufran,Sudeep Pasricha
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Wi-Fi fingerprinting-based indoor localization faces significant challenges in real-world deployments due to domain shifts arising from device heterogeneity and temporal variations within indoor environments. Existing approaches often address these issues independently, resulting in poor generalization and susceptibility to catastrophic forgetting over time. In this work, we propose DAILOC, a novel domain-incremental learning framework that jointly addresses both temporal and device-induced domain shifts. DAILOC introduces a novel disentanglement strategy that separates domain shifts from location-relevant features using a multi-level variational autoencoder. Additionally, we introduce a novel memory-guided class latent alignment mechanism to address the effects of catastrophic forgetting over time. Experiments across multiple smartphones, buildings, and time instances demonstrate that DAILOC significantly outperforms state-of-the-art methods, achieving up to 2.74x lower average error and 4.6x lower worst-case error.

[AI-9] Learning Algorithms in the Limit COLT2025 ATC

【Quick Read】: This paper studies learning computable functions in the limit from computational observations and restricted input sources, extending Gold's inductive inference framework with time-bound observations and policy-trajectory observations to analyze the learnability of general recursive functions. The key is overcoming the barrier that input-output observations alone cannot support learning general recursive functions in the limit, by imposing computational-complexity constraints or supplementing with approximate time-bound observations; a formal framework built around observations of computational agents further shows that learning computable functions from policy trajectories reduces to learning rational functions from input and output, revealing connections to finite-state transducer inference.

Link: https://arxiv.org/abs/2506.15543
Authors: Hristo Papazov,Nicolas Flammarion
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
Comments: Accepted at COLT 2025. This version matches the proceedings version

Abstract:This paper studies the problem of learning computable functions in the limit by extending Gold's inductive inference framework to incorporate computational observations and restricted input sources. Complementary to the traditional Input-Output Observations, we introduce Time-Bound Observations, and Policy-Trajectory Observations to study the learnability of general recursive functions under more realistic constraints. While input-output observations do not suffice for learning the class of general recursive functions in the limit, we overcome this learning barrier by imposing computational complexity constraints or supplementing with approximate time-bound observations. Further, we build a formal framework around observations of computational agents and show that learning computable functions from policy trajectories reduces to learning rational functions from input and output, thereby revealing interesting connections to finite-state transducer inference. On the negative side, we show that computable or polynomial-mass characteristic sets cannot exist for the class of linear-time computable functions even for policy-trajectory observations.

[AI-10] Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity

【Quick Read】: This paper examines the intrinsic (within attention heads) and extrinsic (across attention heads) structure of self-attention in transformers, in particular its invariance to the softmax activation and how hierarchical tensor organization reveals regularity in network structure. The key is a proof, via paradifferential calculus, that the self-attention mechanism is invariant to softmax, together with an existing hierarchical tensor-organization method that builds partition trees over the query, key, and head axes, so that signal-processing tasks can be executed on a geometry where the organized network 3-tensors exhibit regularity; expansion coefficients of individual attention heads and of the whole network in bi- and tri-scale Haar bases quantify network sparsity and further validate the hierarchical organization.

Link: https://arxiv.org/abs/2506.15541
Authors: Oluwadamilola Fasina,Ruben V.C. Pohle,Pei-Chun Su,Ronald R. Coifman
Affiliations: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures, 2 tables

Abstract:We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for invariance of the self-attention mechanism to softmax activation is obtained by appealing to paradifferential calculus, (and is supported by computational examples), which relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for hierarchical organization of tensors to examine network structure by constructing hierarchal partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential since it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree comprised of attention heads and the diffusion map embeddings, and quantitatively by investigating network sparsity with the expansion coefficients of individual attention heads and the entire network with respect to the bi- and tri-Haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted, and can be exploited empirically for downstream interpretability tasks (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.

[AI-11] RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation

【Quick Read】: This paper targets the problem that, in retrieval-augmented generation (RAG) systems, a model may rely on memorized training data instead of the retrieved external information and thus produce contaminated outputs. The key to the solution is Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that compares the Kullback-Leibler (KL) divergence between a parametric path that uses only the query and an augmented path that uses both the query and the retrieved context: a low divergence indicates that the model did not effectively use the retrieved content and flags potential memorization. The method requires no model access or retraining, making it lightweight and black-box.

Link: https://arxiv.org/abs/2506.15513
Authors: Le Vu Anh, Nguyen Viet Anh, Mehmet Dik, Luong Van Nghia
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, 5 tables

Click to view abstract

Abstract:Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions. A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization. This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass. We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates. On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU. RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications.
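A minimal sketch of the core RePCS check follows, assuming a generic Hugging Face causal LM as the RAG generator; the model name, prompt format, and threshold are illustrative stand-ins, not values from the paper.

```python
# Minimal sketch of the RePCS idea: compare the output distribution of a
# query-only ("parametric") path against a query+context ("retrieval") path.
# Model name, prompts, and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for the RAG system's generator
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def next_token_dist(prompt: str) -> torch.Tensor:
    """Return the model's next-token distribution for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # last-position logits
    return torch.softmax(logits, dim=-1)

query = "Who wrote The Old Man and the Sea?"
context = "Retrieved passage: The Old Man and the Sea is a novella by Ernest Hemingway."

p = next_token_dist(query)                   # parametric path
q = next_token_dist(context + "\n" + query)  # retrieval-augmented path

# KL(p || q): small value => retrieved context barely moved the output.
kl = torch.sum(p * (torch.log(p + 1e-12) - torch.log(q + 1e-12))).item()
THRESHOLD = 0.05  # hypothetical; the paper derives PAC-style thresholds
print(f"KL={kl:.4f} ->", "possible memorization" if kl < THRESHOLD else "retrieval used")
```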
zh

[AI-12] Optimizing Web-Based AI Query Retrieval with GPT Integration in LangChain: A CoT-Enhanced Prompt Engineering Approach

【Quick Read】: This paper addresses the lack of deep contextual-semantic understanding of complex student queries in remote-learning resource retrieval, which prevents comprehensive information from being returned. The key to the solution is to integrate GPT-based models into the LangChain framework and to use Chain-of-Thought (CoT) reasoning and prompt engineering to build a more intuitive and efficient retrieval system, improving the precision and relevance of results to match each student's individual needs.

Link: https://arxiv.org/abs/2506.15512
Authors: Wenqi Guan, Yang Fang
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models have brought a radical change in the process of remote learning students, among other aspects of educative activities. Current retrieval of remote learning resources lacks depth in contextual meaning that provides comprehensive information on complex student queries. This work proposes a novel approach to enhancing remote learning retrieval by integrating GPT-based models within the LangChain framework. We achieve this system in a more intuitive and productive manner using CoT reasoning and prompt engineering. The framework we propose puts much emphasis on increasing the precision and relevance of the retrieval results to return comprehensive and contextually enriched explanations and resources that best suit each student’s needs. We also assess the effectiveness of our approach against paradigmatic LLMs and report improvements in user satisfaction and learning outcomes.
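As a rough illustration of the kind of CoT prompt engineering described, here is a plain-Python prompt template; the template wording and the `call_llm` stub are assumptions, not the paper's actual prompts or LangChain configuration.

```python
# Illustrative CoT-style retrieval prompt for learning queries; wording and
# the `call_llm` placeholder are assumptions, not the paper's exact setup.
COT_TEMPLATE = """You are a tutor helping with remote-learning queries.
Question: {question}

Think step by step:
1. Identify the underlying concept the student is asking about.
2. List the prerequisite ideas the student may be missing.
3. Recommend resources that address both, with a one-line reason each.

Answer:"""

def build_prompt(question: str) -> str:
    return COT_TEMPLATE.format(question=question)

def call_llm(prompt: str) -> str:
    """Placeholder for the GPT call routed through LangChain in the paper."""
    raise NotImplementedError("wire this to your LLM client")

if __name__ == "__main__":
    print(build_prompt("Why does gradient descent oscillate with a high learning rate?"))
```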
zh

[AI-13] Over-squashing in Spatiotemporal Graph Neural Networks

【Quick Read】: This paper tackles spatiotemporal over-squashing in Spatiotemporal Graph Neural Networks (STGNNs), where limited information-propagation capacity prevents distant nodes from effectively exchanging information. The key is to formalize the problem and reveal how it differs from the static case; counterintuitively, convolutional STGNNs favor propagating information from temporally distant rather than nearby nodes. The work further proves that architectures following either time-and-space or time-then-space processing paradigms are equally affected, providing theoretical justification for computationally efficient implementations.

Link: https://arxiv.org/abs/2506.15507
Authors: Ivan Marisca, Jacob Bamberger, Cesare Alippi, Michael M. Bronstein
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.
zh

[AI-14] Co-Creative Learning via Metropolis-Hastings Interaction between Humans and AI

【Quick Read】: This paper addresses how humans and AI, whose information modalities are inherently different, can construct shared external representations through interaction, i.e., the problem of symbol emergence. Whereas traditional AI teaching relies on unilateral knowledge transfer, the proposed solution is co-creative learning between biological and artificial agents via a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG). Its key is a decentralized Bayesian inference mechanism that lets humans and AI gradually reach consensus on a categorization task under partial observability and dynamically align their perceptual experiences, driving the co-evolution of a shared symbol system.

Link: https://arxiv.org/abs/2506.15468
Authors: Ryota Okumura, Tadahiro Taniguchi, Akira Taniguchi, Yoshinobu Hagiwara
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We propose co-creative learning as a novel paradigm where humans and AI, i.e., biological and artificial agents, mutually integrate their partial perceptual information and knowledge to construct shared external representations, a process we interpret as symbol emergence. Unlike traditional AI teaching based on unilateral knowledge transfer, this addresses the challenge of integrating information from inherently different modalities. We empirically test this framework using a human-AI interaction model based on the Metropolis-Hastings naming game (MHNG), a decentralized Bayesian inference mechanism. In an online experiment, 69 participants played a joint attention naming game (JA-NG) with one of three computer agent types (MH-based, always-accept, or always-reject) under partial observability. Results show that human-AI pairs with an MH-based agent significantly improved categorization accuracy through interaction and achieved stronger convergence toward a shared sign system. Furthermore, human acceptance behavior aligned closely with the MH-derived acceptance probability. These findings provide the first empirical evidence for co-creative learning emerging in human-AI dyads via MHNG-based interaction. This suggests a promising path toward symbiotic AI systems that learn with humans, rather than from them, by dynamically aligning perceptual experiences, opening a new venue for symbiotic AI alignment.
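The acceptance step at the heart of the MHNG can be sketched as follows; the probability table is a toy stand-in for the posteriors the paper infers from perception models, and the game logic is heavily simplified.

```python
# Minimal sketch of the Metropolis-Hastings acceptance step in a naming game.
# Probability values are toy numbers; the paper derives them from each
# agent's own perceptual posterior.
import random

def mh_accept(p_listener_proposed: float, p_listener_current: float) -> bool:
    """Listener accepts the speaker's proposed sign with MH probability."""
    ratio = p_listener_proposed / max(p_listener_current, 1e-12)
    return random.random() < min(1.0, ratio)

# Listener's (toy) posterior over signs for the object it currently perceives.
listener_posterior = {"wug": 0.6, "dax": 0.3, "blick": 0.1}
current_sign = "dax"
proposed_sign = "wug"  # what the speaker sampled from its own posterior

accepted = mh_accept(listener_posterior[proposed_sign], listener_posterior[current_sign])
print("accepted" if accepted else "rejected")
```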
zh

[AI-15] Uncovering Intention through LLM-Driven Code Snippet Description Generation

【Quick Read】: This paper addresses the problem of missing key information when documenting code snippets, especially usage examples for third-party libraries and API descriptions. The key to the solution is to leverage large language models (LLMs) and evaluate their effectiveness at generating code-snippet descriptions. Analyzing an NPM code-snippet dataset, the study finds that most original descriptions emphasize example-based usage, verifies Llama's accuracy at identifying "Example"-type descriptions, and measures the similarity between generated and original descriptions, providing a reference for improving code-documentation quality.

Link: https://arxiv.org/abs/2506.15453
Authors: Yusuf Sulistyo Nugroho, Farah Danisha Salam, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, 4 tables, conference paper

Click to view abstract

Abstract:Documenting code snippets is essential to pinpoint key areas where both developers and users should pay attention. Examples include usage examples and other Application Programming Interfaces (APIs), which are especially important for third-party libraries. With the rise of Large Language Models (LLMs), the key goal is to investigate the kinds of description developers commonly use and evaluate how well an LLM, in this case Llama, can support description generation. We use NPM Code Snippets, consisting of 185,412 packages with 1,024,579 code snippets. From there, we use 400 code snippets (and their descriptions) as samples. First, our manual classification found that the majority of original descriptions (55.5%) highlight example-based usage. This finding emphasizes the importance of clear documentation, as some descriptions lacked sufficient detail to convey intent. Second, the LLM correctly identified the majority of original descriptions as “Example” (79.75%), which is identical to our manual finding, showing a propensity for generalization. Third, compared to the originals, the produced description had an average similarity score of 0.7173, suggesting relevance but room for improvement. Scores below 0.9 indicate some irrelevance. Our results show that depending on the task of the code snippet, the intention of the document may differ from being instructions for usage, installations, or descriptive learning examples for any user of a library.
zh

[AI-16] Warping and Matching Subsequences Between Time Series

【Quick Read】: This paper addresses the lack of qualitative analysis in time-series comparison: when traditional visualizations fail to convey structural relationships at the subsequence level, it is hard to understand when and how one time series shifts, speeds up, or slows down relative to another. The key to the solution is a new technique that simplifies the warping path to highlight, quantify, and visualize key transformations (shift, compression, and amplitude differences), thereby enhancing the interpretability of time-series comparison.

Link: https://arxiv.org/abs/2506.15452
Authors: Simiao Lin, Wannes Meert, Pieter Robberechts, Hendrik Blockeel
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Comparing time series is essential in various tasks such as clustering and classification. While elastic distance measures that allow warping provide a robust quantitative comparison, a qualitative comparison on top of them is missing. Traditional visualizations focus on point-to-point alignment and do not convey the broader structural relationships at the level of subsequences. This limitation makes it difficult to understand how and where one time series shifts, speeds up or slows down with respect to another. To address this, we propose a novel technique that simplifies the warping path to highlight, quantify and visualize key transformations (shift, compression, difference in amplitude). By offering a clearer representation of how subsequences match between time series, our method enhances interpretability in time series comparison.
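A compact sketch of the underlying machinery: compute a DTW warping path by dynamic programming, then label path segments by their local slope. The slope-based labeling is an illustrative simplification of the paper's path-simplification step, not its actual algorithm.

```python
# Sketch: DTW warping path via dynamic programming, then a slope-based
# summary of the path (flat/vertical runs indicate local time warping).
import numpy as np

def dtw_path(x: np.ndarray, y: np.ndarray):
    """DTW by dynamic programming; returns the optimal alignment path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(x[i - 1] - y[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [(n - 1, m - 1)], (n, m)
    while (i, j) != (1, 1):
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        path.append((i - 1, j - 1))
    return path[::-1]

x = np.sin(np.linspace(0, 6, 60))
y = np.sin(np.linspace(0, 6, 90))  # same shape, stretched in time
path = dtw_path(x, y)
moves = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
warped = sum(1 for di, dj in moves if di == 0 or dj == 0) / len(moves)
print(f"non-diagonal (warped) fraction of the path: {warped:.2f}")
```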
zh

[AI-17] Zero-Shot Reinforcement Learning Under Partial Observability

【Quick Read】: This paper addresses the performance degradation of standard zero-shot reinforcement learning (RL) methods under partial observability. The key to the solution is to introduce memory-based architectures to counter the challenges posed by partial observability, an approach validated across several domains where states, rewards, and changes in dynamics are only partially observed.

Link: https://arxiv.org/abs/2506.15446
Authors: Scott Jeen, Tom Bewley, Jonathan M. Cullen
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Reinforcement Learning Conference 2025

Click to view abstract

Abstract:Recent work has shown that, under certain assumptions, zero-shot reinforcement learning (RL) methods can generalise to any unseen task in an environment after reward-free pre-training. Access to Markov states is one such assumption, yet, in many real-world applications, the Markov state is only partially observable. Here, we explore how the performance of standard zero-shot RL methods degrades when subjected to partially observability, and show that, as in single-task RL, memory-based architectures are an effective remedy. We evaluate our memory-based zero-shot RL methods in domains where the states, rewards and a change in dynamics are partially observed, and show improved performance over memory-free baselines. Our code is open-sourced via: this https URL.
zh

[AI-18] Reward Models in Deep Reinforcement Learning: A Survey IJCAI2025

【Quick Read】: This paper asks how to construct effective reward models in reinforcement learning (RL) to better guide policy optimization. The core challenge is designing reward models that both align closely with the true objectives and facilitate policy optimization. The key contribution is a systematic review and categorization of existing reward-modeling techniques by source, mechanism, and learning paradigm, together with a discussion of their applications and evaluation methods, pointing out promising directions for future research.

Link: https://arxiv.org/abs/2506.15421
Authors: Rui Yu, Shenghua Wan, Yucen Wang, Chen-Xiao Gao, Le Gan, Zongzhang Zhang, De-Chuan Zhan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: IJCAI 2025 Survey Track (To Appear)

Click to view abstract

Abstract:In reinforcement learning (RL), agents continually interact with the environment and use the feedback to refine their behavior. To guide policy optimization, reward models are introduced as proxies of the desired objectives, such that when the agent maximizes the accumulated reward, it also fulfills the task designer’s intentions. Recently, significant attention from both academic and industrial researchers has focused on developing reward models that not only align closely with the true objectives but also facilitate policy optimization. In this survey, we provide a comprehensive review of reward modeling techniques within the deep RL literature. We begin by outlining the background and preliminaries in reward modeling. Next, we present an overview of recent reward modeling approaches, categorizing them based on the source, the mechanism, and the learning paradigm. Building on this understanding, we discuss various applications of these reward modeling techniques and review methods for evaluating reward models. Finally, we conclude by highlighting promising research directions in reward modeling. Altogether, this survey includes both established and emerging methods, filling the vacancy of a systematic review of reward models in current literature.
zh

[AI-19] Unifying VXAI: A Systematic Review and Framework for the Evaluation of Explainable AI

【Quick Read】: This paper addresses the lack of standardized evaluation protocols and consensus metrics for Explainable AI (XAI) methods. The key to the solution is a unified framework for the eValuation of XAI (VXAI), built through a systematic literature review, which aggregates the contributions of 362 publications into 41 functionally similar metric groups and proposes a three-dimensional categorization scheme spanning explanation type, evaluation contextuality, and explanation-quality desiderata, thereby supporting structured and systematic evaluation of XAI methods.

Link: https://arxiv.org/abs/2506.15408
Authors: David Dembinsky, Adriano Lucieri, Stanislav Frolov, Hiba Najjar, Ko Watanabe, Andreas Dengel
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to TMLR, under review

Click to view abstract

Abstract:Modern AI systems frequently rely on opaque black-box models, most notably Deep Neural Networks, whose performance stems from complex architectures with millions of learned parameters. While powerful, their complexity poses a major challenge to trustworthiness, particularly due to a lack of transparency. Explainable AI (XAI) addresses this issue by providing human-understandable explanations of model behavior. However, to ensure their usefulness and trustworthiness, such explanations must be rigorously evaluated. Despite the growing number of XAI methods, the field lacks standardized evaluation protocols and consensus on appropriate metrics. To address this gap, we conduct a systematic literature review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and introduce a unified framework for the eValuation of XAI (VXAI). We identify 362 relevant publications and aggregate their contributions into 41 functionally similar metric groups. In addition, we propose a three-dimensional categorization scheme spanning explanation type, evaluation contextuality, and explanation quality desiderata. Our framework provides the most comprehensive and structured overview of VXAI to date. It supports systematic metric selection, promotes comparability across methods, and offers a flexible foundation for future extensions.
zh

[AI-20] Evaluation Pipeline for systematically searching for Anomaly Detection Systems

【Quick Read】: This paper addresses the security problems introduced by digitalization in healthcare: while the benefits are substantial, systems become targets for attackers and are hard to secure. The key solution is an anomaly detection system deployed on hardware to detect malicious clients in real time, using FPGAs to meet real-time and power constraints and enabling a holistic evaluation of overall system performance.

Link: https://arxiv.org/abs/2506.15388
Authors: Florian Rokohl, Alexander Lehnert, Marc Reichenbach
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Submitted to 18th HiPEAC Workshop on Reconfigurable Computing (WRC'2024)

Click to view abstract

Abstract:Digitalization in the medical world provides major benefits while making it a target for attackers and thus hard to secure. To deal with network intruders we propose an anomaly detection system on hardware to detect malicious clients in real-time. We meet real-time and power restrictions using FPGAs. Overall system performance is achieved via the presented holistic system evaluation.
zh

[AI-21] Efficient and Generalizable Environmental Understanding for Visual Navigation

【Quick Read】: This paper addresses the fact that existing visual-navigation methods process sequential data without fully modeling the internal association structure within the data, which may cap further gains in task performance. The key to the solution is a causal framework that uses causal reasoning to expose the distinctive characteristics of navigation tasks, together with Causality-Aware Navigation (CAN), which incorporates a Causal Understanding Module to strengthen the agent's understanding of the environment, yielding superior performance across a range of tasks and simulation environments.

Link: https://arxiv.org/abs/2506.15377
Authors: Ruoyu Wang, Xinshu Li, Chen Wang, Lina Yao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Visual Navigation is a core task in Embodied AI, enabling agents to navigate complex environments toward given objectives. Across diverse settings within Navigation tasks, many necessitate the modelling of sequential data accumulated from preceding time steps. While existing methods perform well, they typically process all historical observations simultaneously, overlooking the internal association structure within the data, which may limit the potential for further improvements in task performance. We address this by examining the unique characteristics of Navigation tasks through the lens of causality, introducing a causal framework to highlight the limitations of conventional sequential methods. Leveraging this insight, we propose Causality-Aware Navigation (CAN), which incorporates a Causal Understanding Module to enhance the agent's environmental understanding capability. Empirical evaluations show that our approach consistently outperforms baselines across various tasks and simulation environments. Extensive ablation studies attribute these gains to the Causal Understanding Module, which generalizes effectively in both Reinforcement and Supervised Learning settings without computational overhead.
zh

[AI-22] J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor

【Quick Read】: This paper aims to enable efficient edge AI processing on a resource-constrained 3D-stacked CMOS image sensor. The key to the solution is J3DAI, a tiny deep-neural-network (DNN) hardware accelerator whose design optimizes Performance-Power-Area (PPA) characteristics and, combined with post-training quantization supported by the Aidge software framework, substantially reduces memory footprint and computational complexity, allowing neural-network tasks to run efficiently on limited hardware resources.

Link: https://arxiv.org/abs/2506.15316
Authors: Benoit Tain, Raphael Millet, Romain Lemaire, Michal Szczepanski, Laurent Alacoque, Emmanuel Pluchart, Sylvain Choisnet, Rohit Prasad, Jerome Chossat, Pascal Pierunek, Pascal Vivet, Sebastien Thuries
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Preprint from ISLPED 2025. 979-8-3315-2710-5/25/$31.00 ©2025 IEEE

Click to view abstract

Abstract:This paper presents J3DAI, a tiny deep neural network-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor featuring an artificial intelligence (AI) chip integrating a Deep Neural Network (DNN)-based accelerator. The DNN accelerator is designed to efficiently perform neural network tasks such as image classification and segmentation. This paper focuses on the digital system of J3DAI, highlighting its Performance-Power-Area (PPA) characteristics and showcasing advanced edge AI capabilities on a CMOS image sensor. To support hardware, we utilized the Aidge comprehensive software framework, which enables the programming of both the host processor and the DNN accelerator. Aidge supports post-training quantization, significantly reducing memory footprint and computational complexity, making it crucial for deploying models on resource-constrained hardware like J3DAI. Our experimental results demonstrate the versatility and efficiency of this innovative design in the field of edge AI, showcasing its potential to handle both simple and computationally intensive tasks. Future work will focus on further optimizing the architecture and exploring new applications to fully leverage the capabilities of J3DAI. As edge AI continues to grow in importance, innovations like J3DAI will play a crucial role in enabling real-time, low-latency, and energy-efficient AI processing at the edge.
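The abstract credits post-training quantization for the reduced memory footprint; the following generic symmetric int8 sketch illustrates that mechanism and is not the Aidge implementation.

```python
# Generic symmetric int8 post-training quantization sketch. This illustrates
# the memory-reduction mechanism mentioned for Aidge; it is not Aidge code.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"memory: {w.nbytes} -> {q.nbytes} bytes, mean abs error {err:.5f}")
```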
zh

[AI-23] Active Learning-Guided Seq2Seq Variational Autoencoder for Multi-target Inhibitor Generation

【Quick Read】: This paper addresses the challenge of simultaneously optimizing molecules against multiple therapeutic targets in drug discovery, which is difficult due to sparse rewards and conflicting design constraints. The key is a structured active learning (AL) paradigm that integrates a sequence-to-sequence (Seq2Seq) variational autoencoder (VAE) into iterative loops balancing chemical diversity, molecular quality, and multi-target affinity. By expanding the chemically feasible region of latent space and progressively constraining molecules with increasingly stringent multi-target docking thresholds, the method efficiently generates structurally diverse pan-inhibitor candidates.

Link: https://arxiv.org/abs/2506.15309
Authors: Júlia Vilalta-Mor, Alexis Molina, Laura Ortega Varga, Isaac Filella-Merce, Victor Guallar
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 16 pages, 7 figures

Click to view abstract

Abstract:Simultaneously optimizing molecules against multiple therapeutic targets remains a profound challenge in drug discovery, particularly due to sparse rewards and conflicting design constraints. We propose a structured active learning (AL) paradigm integrating a sequence-to-sequence (Seq2Seq) variational autoencoder (VAE) into iterative loops designed to balance chemical diversity, molecular quality, and multi-target affinity. Our method alternates between expanding chemically feasible regions of latent space and progressively constraining molecules based on increasingly stringent multi-target docking thresholds. In a proof-of-concept study targeting three related coronavirus main proteases (SARS-CoV-2, SARS-CoV, MERS-CoV), our approach efficiently generated a structurally diverse set of pan-inhibitor candidates. We demonstrate that careful timing and strategic placement of chemical filters within this active learning pipeline markedly enhance exploration of beneficial chemical space, transforming the sparse-reward, multi-objective drug design problem into an accessible computational task. Our framework thus provides a generalizable roadmap for efficiently navigating complex polypharmacological landscapes.
zh

[AI-24] Unlocking Post-hoc Dataset Inference with Synthetic Data ICML2025

【Quick Read】: This paper addresses a key practical limitation of Dataset Inference (DI): existing DI methods need a private held-out set whose distribution closely matches the compromised dataset, and such in-distribution data is rarely available in practice, severely limiting DI's applicability. The key of the proposed solution is to synthetically generate the required held-out set via two steps: (1) training a data generator on a carefully designed suffix-based completion task to create high-quality, diverse synthetic data that reflects the original distribution, and (2) bridging the likelihood gap between real and synthetic data through post-hoc calibration. Experiments show that using the generated data as the held-out set lets DI detect original training sets with high confidence while keeping the false-positive rate low.

Link: https://arxiv.org/abs/2506.15271
Authors: Bihe Zhao, Pratyush Maini, Franziska Boenisch, Adam Dziedzic
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025

Click to view abstract

Abstract:The remarkable capabilities of Large Language Models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set-known to be absent from training-that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations. Our code is available at this https URL.
zh

[AI-25] RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments

【Quick Read】: This paper addresses the absence of standardized security-evaluation benchmarks for large language model (LLM) agents deployed in critical domains such as healthcare and finance. The key to the solution is RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution, comprising 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats.

Link: https://arxiv.org/abs/2506.15253
Authors: Yuchuan Fu, Xiaohan Yuan, Dongxia Wang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Click to view abstract

Abstract:The rapid deployment of Large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution. RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats. We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings. Notably, scaling laws held for security capabilities, with larger models outperforming smaller counterparts. Our findings expose critical risks in real-world agent deployments and provide a foundational framework for future security research. Code and data are available at this https URL.
zh

[AI-26] Singular Value Decomposition on Kronecker Adaptation for Large Language Model

【Quick Read】: This paper aims to reduce the heavy storage, memory, and computational overhead of fine-tuning large pre-trained Transformers. Existing parameter-efficient fine-tuning (PEFT) methods suffer from inference latency, suboptimal convergence, or fixed rank choices that do not match task complexity. The key to the solution is SoKA (SVD on Kronecker Adaptation), which combines Kronecker-product tensor factorization with SVD-driven initialization and spectrum-aware dynamic rank selection: a Kronecker-Product SVD (KPSVD) procedure extracts the principal components of the full weight update into compact Kronecker factors, while energy-threshold and elbow-point criteria prune negligible components, yielding fewer trainable parameters, faster convergence, and more stable gradients.

Link: https://arxiv.org/abs/2506.15251
Authors: Yee Hin Chong, Peng Qu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large pre-trained Transformer models achieve state-of-the-art results across diverse language and reasoning tasks, but full fine-tuning incurs substantial storage, memory, and computational overhead. Parameter-efficient fine-tuning (PEFT) methods mitigate these costs by learning only a small subset of task-specific parameters, yet existing approaches either introduce inference-time latency (adapter modules), suffer from suboptimal convergence (randomly initialized low-rank updates), or rely on fixed rank choices that may not match task complexity (Kronecker-based decompositions). We propose SoKA (SVD on Kronecker Adaptation), a novel PEFT strategy that combines Kronecker-product tensor factorization with SVD-driven initialization and spectrum-aware dynamic rank selection. Our Kronecker-Product SVD (KPSVD) procedure extracts principal components of the full weight update into compact Kronecker factors, while an adaptive rank selection algorithm uses energy-threshold and elbow-point criteria to prune negligible components. Empirical evaluation on LLaMA2-7B across arithmetic reasoning (GSM8K), formal mathematics (MATH), and code generation (MBPP) demonstrates that SoKA requires only 0.99M trainable parameters, 25% fewer than LoRA/PiSSA, while matching or exceeding baseline performance. Moreover, SoKA exhibits faster convergence and more stable gradients, highlighting its robustness and efficiency for large-scale model adaptation.
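The KPSVD step can be sketched with the standard Van Loan-Pitsianis rearrangement, which turns a nearest-Kronecker-product problem into an ordinary SVD; the shapes and rank below are illustrative, and the paper's spectrum-aware rank selection would operate on the returned spectrum.

```python
# Sketch of a Kronecker-product SVD step (Van Loan-Pitsianis rearrangement):
# approximate a weight update W (m1*m2 x n1*n2) by sum_k A_k kron B_k.
# Shapes and rank are illustrative assumptions, not the paper's settings.
import numpy as np

def kpsvd(W: np.ndarray, m1: int, n1: int, m2: int, n2: int, rank: int):
    # Rearrange W so that Kronecker structure becomes low-rank structure.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    factors = []
    for k in range(rank):
        A = np.sqrt(S[k]) * U[:, k].reshape(m1, n1)
        B = np.sqrt(S[k]) * Vt[k].reshape(m2, n2)
        factors.append((A, B))
    return factors, S  # spectrum S supports energy-threshold rank pruning

m1, n1, m2, n2 = 8, 8, 16, 16
W = np.random.randn(m1 * m2, n1 * n2)
factors, S = kpsvd(W, m1, n1, m2, n2, rank=4)
approx = sum(np.kron(A, B) for A, B in factors)
print("relative error:", np.linalg.norm(W - approx) / np.linalg.norm(W))
```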
zh

[AI-27] Joint Computation Offloading and Resource Allocation for Uncertain Maritime MEC via Cooperation of UAVs and Vessels

【Quick Read】: This paper addresses inefficient computation offloading and resource allocation caused by task uncertainty in the rapidly growing maritime Internet of Things (MIoT). The key is a cooperative multi-access edge computing (MEC) framework based on UAVs and vessels, where Lyapunov optimization converts long-term constraints into short-term ones, yielding a set of small-scale optimization problems. Further, given the heterogeneity of UAV and vessel actions and resources, the small-scale problems are reformulated as a Markov game (MG), and a heterogeneous-agent soft actor-critic is proposed to solve the MG effectively.

Link: https://arxiv.org/abs/2506.15225
Authors: Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:The computation demands from the maritime Internet of Things (MIoT) increase rapidly in recent years, and the unmanned aerial vehicles (UAVs) and vessels based multi-access edge computing (MEC) can fulfill these MIoT requirements. However, the uncertain maritime tasks present significant challenges of inefficient computation offloading and resource allocation. In this paper, we focus on the maritime computation offloading and resource allocation through the cooperation of UAVs and vessels, with consideration of uncertain tasks. Specifically, we propose a cooperative MEC framework for computation offloading and resource allocation, including MIoT devices, UAVs and vessels. Then, we formulate the optimization problem to minimize the total execution time. As for the uncertain MIoT tasks, we leverage Lyapunov optimization to tackle the unpredictable task arrivals and varying computational resource availability. By converting the long-term constraints into short-term constraints, we obtain a set of small-scale optimization problems. Further, considering the heterogeneity of actions and resources of UAVs and vessels, we reformulate the small-scale optimization problem into a Markov game (MG). Moreover, a heterogeneous-agent soft actor-critic is proposed to sequentially update various neural networks and effectively solve the MG problem. Finally, simulations are conducted to verify the effectiveness in addressing computational offloading and resource allocation.
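The Lyapunov conversion of long-term constraints rests on a virtual-queue update; a generic sketch (toy numbers, not the paper's formulation) is shown below.

```python
# Generic Lyapunov virtual-queue sketch: a long-term average constraint
# E[x(t)] <= b is tracked via Q(t+1) = max(Q(t) + x(t) - b, 0); each slot
# then greedily minimizes V*penalty(t) + Q(t)*x(t) (drift-plus-penalty).
def update_virtual_queue(Q: float, x_t: float, b: float) -> float:
    """One-slot virtual-queue update for the constraint E[x] <= b."""
    return max(Q + x_t - b, 0.0)

Q, b = 0.0, 1.0
for t, x_t in enumerate([1.4, 0.6, 1.8, 0.9, 0.5]):  # toy per-slot loads
    Q = update_virtual_queue(Q, x_t, b)
    print(f"slot {t}: x={x_t:.1f}, Q={Q:.2f}")  # Q stays bounded iff the average constraint holds
```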
zh

[AI-28] Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study

【Quick Read】: This paper addresses the difficulty of autonomous coordination in multi-satellite systems, where traditional optimization cannot meet the real-time decision-making demands of dynamic Earth Observation (EO) missions. The key is to adopt Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL): single-satellite operations are modeled and extended to multi-satellite constellations, addressing energy and data-storage limits, observation uncertainty, and the complexity of decentralized coordination under partial observability. Using a near-realistic satellite simulation environment, the study evaluates the training stability and performance of state-of-the-art MARL algorithms, showing that MARL can effectively balance imaging and resource management while handling non-stationarity and reward interdependency in multi-satellite coordination.

Link: https://arxiv.org/abs/2506.15207
Authors: Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, addressing challenges in climate monitoring, disaster management, and more. However, autonomous coordination in multi-satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle to handle the real-time decision-making demands of dynamic EO missions, necessitating the use of Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL). In this paper, we investigate RL-based autonomous EO mission planning by modelling single-satellite operations and extending to multi-satellite constellations using MARL frameworks. We address key challenges, including energy and data storage limitations, uncertainties in satellite observations, and the complexities of decentralised coordination under partial observability. By leveraging a near-realistic satellite simulation environment, we evaluate the training stability and performance of state-of-the-art MARL algorithms, including PPO, IPPO, MAPPO, and HAPPO. Our results demonstrate that MARL can effectively balance imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination. The insights gained from this study provide a foundation for autonomous satellite operations, offering practical guidelines for improving policy learning in decentralised EO missions.
zh

[AI-29] HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges

【Quick Read】: This paper addresses the heavy dependence of traditional heuristics for combinatorial optimization (CO) on manual expertise and their poor generalization. The key is HeurAgenix, a two-stage LLM-powered hyper-heuristic framework that first uses an LLM to evolve reusable heuristic strategies and then, during problem solving, dynamically selects the most promising heuristic for each problem state guided by the LLM's perception ability, achieving an efficient and adaptive solving process.

Link: https://arxiv.org/abs/2506.15196
Authors: Xianliang Yang, Ling Zhang, Haolong Qian, Lei Song, Jiang Bian
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 9 figures

Click to view abstract

Abstract:Heuristic algorithms play a vital role in solving combinatorial optimization (CO) problems, yet traditional designs depend heavily on manual expertise and struggle to generalize across diverse instances. We introduce HeurAgenix, a two-stage hyper-heuristic framework powered by large language models (LLMs) that first evolves heuristics and then selects among them automatically. In the heuristic evolution phase, HeurAgenix leverages an LLM to compare seed heuristic solutions with higher-quality solutions and extract reusable evolution strategies. During problem solving, it dynamically picks the most promising heuristic for each problem state, guided by the LLM's perception ability. For flexibility, this selector can be either a state-of-the-art LLM or a fine-tuned lightweight model with lower inference cost. To mitigate the scarcity of reliable supervision caused by CO complexity, we fine-tune the lightweight heuristic selector with a dual-reward mechanism that jointly exploits signals from selection preferences and state perception, enabling robust selection under noisy annotations. Extensive experiments on canonical benchmarks show that HeurAgenix not only outperforms existing LLM-based hyper-heuristics but also matches or exceeds specialized solvers. Code is available at this https URL.
zh

[AI-30] Accessible Gesture-Driven Augmented Reality Interaction System

【Quick Read】: This paper addresses the inaccessibility of augmented reality (AR) for users with motor impairments or limited dexterity, since existing AR systems rely on precise input methods. The key is a gesture-based interaction system that uses deep learning to recognize hand and body gestures from wearable sensors and cameras and adapts the interface to user capabilities. The system combines vision transformers (ViTs), temporal convolutional networks (TCNs), and graph attention networks (GATs) for gesture processing, uses federated learning for privacy-preserving model training across diverse users, and applies reinforcement learning to optimize interface elements.

Link: https://arxiv.org/abs/2506.15189
Authors: Yikan Wang
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Augmented reality (AR) offers immersive interaction but remains inaccessible for users with motor impairments or limited dexterity due to reliance on precise input methods. This study proposes a gesture-based interaction system for AR environments, leveraging deep learning to recognize hand and body gestures from wearable sensors and cameras, adapting interfaces to user capabilities. The system employs vision transformers (ViTs), temporal convolutional networks (TCNs), and graph attention networks (GATs) for gesture processing, with federated learning ensuring privacy-preserving model training across diverse users. Reinforcement learning optimizes interface elements like menu layouts and interaction modes. Experiments demonstrate a 20% improvement in task completion efficiency and a 25% increase in user satisfaction for motor-impaired users compared to baseline AR systems. This approach enhances AR accessibility and scalability. Keywords: Deep learning, Federated learning, Gesture recognition, Augmented reality, Accessibility, Human-computer interaction
zh

[AI-31] LLM Agent for Hyper-Parameter Optimization

【Quick Read】: This paper addresses the low automation and unsatisfactory performance of heuristic manual hyper-parameter tuning for radio-map-enabled UAV trajectory and communication optimization in wireless networks. The key is a large language model (LLM) agent that performs automatic hyper-parameter tuning for the weighted particle swarm optimization with crossover and mutation (WS-PSO-CM) algorithm through an iterative framework and the Model Context Protocol (MCP), thereby improving algorithm performance.

Link: https://arxiv.org/abs/2506.15167
Authors: Wanzhe Wang, Jianqiu Peng, Menghao Hu, Weihuang Zhong, Tong Zhang, Shuai Wang, Yixin Zhang, Mingjie Shao, Wanli Ni
Institution: Unknown
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures

Click to view abstract

Abstract:Hyper-parameters are essential and critical for the performance of communication algorithms. However, current hyper-parameter tuning methods for the warm-start particle swarm optimization with crossover and mutation (WS-PSO-CM) algorithm for radio map-enabled unmanned aerial vehicle (UAV) trajectory and communication are primarily heuristic-based, exhibiting low levels of automation and unsatisfactory performance. In this paper, we design a large language model (LLM) agent for automatic hyper-parameter tuning, where an iterative framework and model context protocol (MCP) are applied. In particular, the LLM agent is first set up via a profile, which specifies the mission, background, and output format. Then, the LLM agent is driven by the prompt requirement, and iteratively invokes the WS-PSO-CM algorithm for exploration. Finally, the LLM agent autonomously terminates the loop and returns a set of hyper-parameters. Our experiment results show that the minimal sum-rate achieved by hyper-parameters generated via our LLM agent is significantly higher than those by both human heuristics and random generation methods. This indicates that an LLM agent with PSO knowledge and WS-PSO-CM algorithm background is useful in finding high-performance hyper-parameters.
zh

[AI-32] Advancing Loss Functions in Recommender Systems: A Comparative Study with a Rényi Divergence-Based Solution AAAI2025

【Quick Read】: This paper addresses the insufficient robustness of recommendation-model loss functions under distribution shifts and their low data-utilization efficiency. The key is a new loss, DrRL, which introduces Rényi divergence into Distributionally Robust Optimization (DRO) to generalize Softmax Loss (SL) and Cosine Contrastive Loss (CCL), combining the strengths of both while effectively mitigating their respective limitations.

Link: https://arxiv.org/abs/2506.15120
Authors: Shengjia Zhang, Jiawei Chen, Changdong Li, Sheng Zhou, Qihao Shi, Yan Feng, Chun Chen, Can Wang
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI 2025

Click to view abstract

Abstract:Loss functions play a pivotal role in optimizing recommendation models. Among various loss functions, Softmax Loss (SL) and Cosine Contrastive Loss (CCL) are particularly effective. Their theoretical connections and differences warrant in-depth exploration. This work conducts comprehensive analyses of these losses, yielding significant insights: 1) Common strengths – both can be viewed as augmentations of traditional losses with Distributional Robust Optimization (DRO), enhancing robustness to distributional shifts; 2) Respective limitations – stemming from their use of different distribution distance metrics in DRO optimization, SL exhibits high sensitivity to false negative instances, whereas CCL suffers from low data utilization. To address these limitations, this work proposes a new loss function, DrRL, which generalizes SL and CCL by leveraging Rényi-divergence in DRO optimization. DrRL incorporates the advantageous structures of both SL and CCL, and can be demonstrated to effectively mitigate their limitations. Extensive experiments have been conducted to validate the superiority of DrRL on both recommendation accuracy and robustness.
zh

[AI-33] Transit for All: Mapping Equitable Bike2Subway Connection using Region Representation Learning

【Quick Read】: This paper addresses inequitable public-transit accessibility in dense cities such as New York City, where low-income and minority communities often face limited transit access. The key is Transit for All (TFA), a spatial computing framework with three components: (1) cold-start bike-sharing demand prediction at newly planned stations via region representation learning that integrates multimodal geospatial data; (2) a novel weighted Public Transport Accessibility Level (wPTAL) that combines predicted bike-sharing demand with conventional transit-accessibility metrics; and (3) a placement strategy for new bike stations that accounts for potential ridership and equity gains. The framework identifies accessibility gaps and shows that wPTAL-guided station placement notably reduces transit inequities tied to economic and demographic factors.

Link: https://arxiv.org/abs/2506.15113
Authors: Min Namgung, JangHyeon Lee, Fangyi Ding, Yao-Yi Chiang
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Ensuring equitable public transit access remains challenging, particularly in densely populated cities like New York City (NYC), where low-income and minority communities often face limited transit accessibility. Bike-sharing systems (BSS) can bridge these equity gaps by providing affordable first- and last-mile connections. However, strategically expanding BSS into underserved neighborhoods is difficult due to uncertain bike-sharing demand at newly planned (“cold-start”) station locations and limitations in traditional accessibility metrics that may overlook realistic bike usage potential. We introduce Transit for All (TFA), a spatial computing framework designed to guide the equitable expansion of BSS through three components: (1) spatially-informed bike-sharing demand prediction at cold-start stations using region representation learning that integrates multimodal geospatial data, (2) comprehensive transit accessibility assessment leveraging our novel weighted Public Transport Accessibility Level (wPTAL) by combining predicted bike-sharing demand with conventional transit accessibility metrics, and (3) strategic recommendations for new bike station placements that consider potential ridership and equity enhancement. Using NYC as a case study, we identify transit accessibility gaps that disproportionately impact low-income and minority communities in historically underserved neighborhoods. Our results show that strategically placing new stations guided by wPTAL notably reduces disparities in transit access related to economic and demographic factors. From our study, we demonstrate that TFA provides practical guidance for urban planners to promote equitable transit and enhance the quality of life in underserved urban communities.
zh

[AI-34] Sequential Policy Gradient for Adaptive Hyperparameter Optimization

【Quick Read】: This paper addresses the prohibitive time and computational cost that keeps reinforcement learning from wide use in neural architecture search and hyperparameter optimization. The key is Sequential Policy Gradient (SPG) modeling, a new trajectory-generation paradigm that extends the base model with temporary modules so that state-action (padded) trajectories are generated in a single forward pass, enabling lightweight online hyperparameter optimization.

Link: https://arxiv.org/abs/2506.15051
Authors: Zheng Li, Jerry Cheng, Huanying Helen Gu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Reinforcement learning is essential for neural architecture search and hyperparameter optimization, but the conventional approaches impede widespread use due to prohibitive time and computational costs. Inspired by the DeepSeek-V3 multi-token prediction architecture, we propose Sequential Policy Gradient modeling (SPG), a novel trajectory generation paradigm for lightweight online hyperparameter optimization. In contrast to conventional policy gradient methods, SPG extends the base model with temporary modules, enabling it to generate state-action (padded) trajectories in a single forward pass. Our experiments demonstrate that models gain performance when retrained with SPG on their original datasets and also outperform standard transfer fine-tuning. We evaluate on five datasets spanning computer vision (ImageNet, COCO), natural language processing (GLUE, SQuAD), and audio (SUPERB) to assess the industrial applicability of SPG. The proposed method demonstrates consistent improvements across widely adopted models, achieving performance gains of +0.2% to 7%, with significantly low computational costs. Fully reproducible code and pre-trained models: this https URL.
zh

[AI-35] Truncated Proximal Policy Optimization

【Quick Read】: This paper aims to improve the training efficiency of large language models (LLMs) on test-time reasoning tasks, where policy updates incur high computational overhead and low hardware utilization. The key is Truncated Proximal Policy Optimization (T-PPO), which streamlines policy updates and restricts response-length generation. Its two core innovations are: (1) Extended Generalized Advantage Estimation (EGAE), which estimates advantages from incomplete responses while preserving the integrity of policy learning; and (2) a computationally optimized mechanism that optimizes the policy and value models independently, reducing redundant computation via selective filtering of prompt and truncated tokens, thereby accelerating training without sacrificing convergence.

Link: https://arxiv.org/abs/2506.15050
Authors: Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Bole Ma, Mofan Zhang, Gaohong Liu, Ru Zhang, Haotian Zhou, Cong Xie, Ruidong Zhu, Zhi Zhang, Xin Liu, Mingxuan Wang, Lin Yan, Yonghui Wu
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy update and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle during the waiting periods for complete rollouts. Our contributions are twofold. First, we propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses while maintaining the integrity of policy learning. Second, we devise a computationally optimized mechanism that allows for the independent optimization of the policy and value models. By selectively filtering prompt and truncated tokens, this mechanism reduces redundant computations and accelerates the training process without sacrificing convergence performance. We demonstrate the effectiveness and efficacy of T-PPO on AIME 2024 with a 32B base model. The experimental results show that T-PPO improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.
zh

[AI-36] Mapping Caregiver Needs to AI Chatbot Design: Strengths and Gaps in Mental Health Support for Alzheimers and Dementia Caregivers

【Quick Read】: This paper examines the elevated mental-health risks faced by family caregivers of people with Alzheimer's Disease and Related Dementias (AD/ADRD) under chronic emotional and logistical strain, and explores the potential of generative AI to support caregiver mental health. The key is Carey, a GPT-4o-based chatbot used as a technology probe: through scenario-driven interactions and semi-structured interviews with 16 family caregivers, the study surfaces caregiver needs, expectations, and perceptions of AI, and distills design recommendations across six themes including information access, emotional support, and privacy.

Link: https://arxiv.org/abs/2506.15047
Authors: Jiayue Melissa Shi, Dong Whi Yoo, Keran Wang, Violeta J. Rodriguez, Ravi Karkar, Koustuv Saha
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:Family caregivers of individuals with Alzheimer’s Disease and Related Dementia (AD/ADRD) face significant emotional and logistical challenges that place them at heightened risk for stress, anxiety, and depression. Although recent advances in generative AI – particularly large language models (LLMs) – offer new opportunities to support mental health, little is known about how caregivers perceive and engage with such technologies. To address this gap, we developed Carey, a GPT-4o-based chatbot designed to provide informational and emotional support to AD/ADRD caregivers. Using Carey as a technology probe, we conducted semi-structured interviews with 16 family caregivers following scenario-driven interactions grounded in common caregiving stressors. Through inductive coding and reflexive thematic analysis, we surface a systemic understanding of caregiver needs and expectations across six themes – on-demand information access, emotional support, safe space for disclosure, crisis management, personalization, and data privacy. For each of these themes, we also identified the nuanced tensions in the caregivers’ desires and concerns. We present a mapping of caregiver needs, AI chatbot’s strengths, gaps, and design recommendations. Our findings offer theoretical and practical insights to inform the design of proactive, trustworthy, and caregiver-centered AI systems that better support the evolving mental health needs of AD/ADRD caregivers.
zh

[AI-37] Advanced Prediction of Hypersonic Missile Trajectories with CNN-LSTM-GRU Architectures

【Quick Read】: This paper tackles the difficulty of predicting hypersonic missile trajectories, which is critical for effective defensive measures. The key is a novel hybrid deep learning approach combining Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), leveraging the complementary strengths of these architectures to predict the complex trajectories of hypersonic missiles with high accuracy.

Link: https://arxiv.org/abs/2506.15043
Authors: Amir Hossein Baradaran
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Advancements in the defense industry are paramount for ensuring the safety and security of nations, providing robust protection against emerging threats. Among these threats, hypersonic missiles pose a significant challenge due to their extreme speeds and maneuverability, making accurate trajectory prediction a critical necessity for effective countermeasures. This paper addresses this challenge by employing a novel hybrid deep learning approach, integrating Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs). By leveraging the strengths of these architectures, the proposed method successfully predicts the complex trajectories of hypersonic missiles with high accuracy, offering a significant contribution to defense strategies and missile interception technologies. This research demonstrates the potential of advanced machine learning techniques in enhancing the predictive capabilities of defense systems.
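A minimal PyTorch sketch of the CNN-LSTM-GRU composition described; the layer sizes, input features, and 3-D output head are assumptions, not the paper's architecture.

```python
# Minimal CNN -> LSTM -> GRU hybrid for trajectory sequences (PyTorch).
# Layer sizes and the (x, y, z) output head are illustrative assumptions.
import torch
import torch.nn as nn

class HybridTrajectoryNet(nn.Module):
    def __init__(self, in_features=6, hidden=64, horizon=10):
        super().__init__()
        self.conv = nn.Conv1d(in_features, 32, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 3)  # predict (x, y, z) per step
        self.horizon = horizon

    def forward(self, x):  # x: (batch, time, features)
        z = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local motion patterns
        z, _ = self.lstm(z)  # long-range temporal dependencies
        z, _ = self.gru(z)   # lighter recurrent refinement
        return self.head(z[:, -1]).view(-1, self.horizon, 3)

model = HybridTrajectoryNet()
print(model(torch.randn(4, 50, 6)).shape)  # -> torch.Size([4, 10, 3])
```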
zh

[AI-38] SFT-GO: Supervised Fine-Tuning with Group Optimization for Large Language Models

【Quick Read】: This paper addresses a limitation of existing supervised fine-tuning (SFT): all tokens in a training instance are given equal importance, ignoring that only a subset of tokens usually carries critical, task-specific information. The key is Supervised Fine-Tuning with Group Optimization (SFT-GO), which groups the tokens in each sample by importance and optimizes the LLM with a weighted combination of the worst-group loss and the standard cross-entropy loss, adaptively emphasizing the most challenging token groups and improving the model's handling of different group distributions.

Link: https://arxiv.org/abs/2506.15021
Authors: Gyuhak Kim, Sumiran Singh Thakur, Su Min Park, Wei Wei, Yujia Bao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Supervised fine-tuning (SFT) has become an essential step in tailoring large language models (LLMs) to align with human expectations and specific downstream tasks. However, existing SFT methods typically treat each training instance as a uniform sequence, giving equal importance to all tokens regardless of their relevance. This overlooks the fact that only a subset of tokens often contains critical, task-specific information. To address this limitation, we introduce Supervised Fine-Tuning with Group Optimization (SFT-GO), a novel approach that treats groups of tokens differently based on their importance. SFT-GO groups tokens in each sample based on their importance values and optimizes the LLM using a weighted combination of the worst-group loss and the standard cross-entropy loss. This mechanism adaptively emphasizes the most challenging token groups and guides the model to better handle different group distributions, thereby improving overall learning dynamics. We provide a theoretical analysis of SFT-GO's convergence rate, demonstrating its efficiency. Empirically, we apply SFT-GO with three different token grouping strategies and show that models trained with SFT-GO consistently outperform baseline approaches across popular LLM benchmarks. These improvements hold across various datasets and base models, demonstrating the robustness and the effectiveness of our method.
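The stated objective, a weighted combination of worst-group loss and standard cross-entropy, can be sketched directly; grouping tokens by quantiles of an external importance score and the mixing weight `lam` are illustrative choices, not the paper's settings.

```python
# Sketch of an SFT-GO-style objective: lam * worst-group loss + (1 - lam) * CE.
# Quantile-based grouping and `lam` are illustrative assumptions.
import torch
import torch.nn.functional as F

def sft_go_loss(logits, targets, importance, n_groups=3, lam=0.5):
    tok_loss = F.cross_entropy(logits, targets, reduction="none")  # per-token CE
    # Bucket tokens into groups by importance quantile (edges may overlap slightly).
    edges = torch.quantile(importance, torch.linspace(0, 1, n_groups + 1))
    group_losses = []
    for g in range(n_groups):
        mask = (importance >= edges[g]) & (importance <= edges[g + 1])
        if mask.any():
            group_losses.append(tok_loss[mask].mean())
    worst = torch.stack(group_losses).max()
    return lam * worst + (1 - lam) * tok_loss.mean()

logits = torch.randn(128, 32000)             # (tokens, vocab)
targets = torch.randint(0, 32000, (128,))
importance = torch.rand(128)                 # e.g., from a token-scoring heuristic
print(sft_go_loss(logits, targets, importance).item())
```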
zh

[AI-39] Stable CDE Autoencoders with Acuity Regularization for Offline Reinforcement Learning in Sepsis Treatment IJCAI2025

【Quick Read】: This paper addresses learning stable, clinically meaningful state representations for RL-based sepsis treatment. Prior work explored representation learning for this task but overlooked how instability in sequential representation training degrades policy performance. The key is a Controlled Differential Equation (CDE) state representation with two critical ingredients: (1) training stability ensured by early stopping or stabilization methods, and (2) acuity-aware representations enforced by correlation regularization with clinical scores (SOFA, SAPS-II, OASIS). Experiments show that stable CDE autoencoders produce representations strongly correlated with acuity scores and support high-performing RL policies.

Link: https://arxiv.org/abs/2506.15019
Authors: Yue Gao
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to IJCAI 2025 AI4TS

Click to view abstract

Abstract:Effective reinforcement learning (RL) for sepsis treatment depends on learning stable, clinically meaningful state representations from irregular ICU time series. While previous works have explored representation learning for this task, the critical challenge of training instability in sequential representations and its detrimental impact on policy performance has been overlooked. This work demonstrates that Controlled Differential Equations (CDE) state representation can achieve strong RL policies when two key factors are met: (1) ensuring training stability through early stopping or stabilization methods, and (2) enforcing acuity-aware representations by correlation regularization with clinical scores (SOFA, SAPS-II, OASIS). Experiments on the MIMIC-III sepsis cohort reveal that a stable CDE autoencoder produces representations strongly correlated with acuity scores and enables RL policies with superior performance (WIS return > 0.9). In contrast, unstable CDE representation leads to degraded representations and policy failure (WIS return ≈ 0). Visualizations of the latent space show that stable CDEs not only separate survivor and non-survivor trajectories but also reveal clear acuity score gradients, whereas unstable training fails to capture either pattern. These findings highlight practical guidelines for using CDEs to encode irregular medical time series in clinical RL, emphasizing the need for training stability in sequential representation learning.
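The acuity-aware regularization can be sketched as a Pearson-correlation penalty between a 1-D projection of the latent state and a clinical score; the projection head and weight `beta` are assumptions, not the paper's exact formulation.

```python
# Sketch of acuity-aware regularization: penalize low correlation between a
# latent projection and a clinical score (e.g., SOFA). `proj` and `beta`
# are illustrative assumptions.
import torch

def pearson_corr(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

def acuity_regularized_loss(recon_loss, latent, sofa, proj, beta=0.1):
    """recon_loss: autoencoder term; latent: (batch, d); sofa: (batch,)."""
    score_hat = proj(latent).squeeze(-1)  # 1-D acuity projection
    return recon_loss + beta * (1.0 - pearson_corr(score_hat, sofa))

latent = torch.randn(32, 16, requires_grad=True)
sofa = torch.randint(0, 24, (32,)).float()  # SOFA scores range 0-24
proj = torch.nn.Linear(16, 1)
loss = acuity_regularized_loss(torch.tensor(0.5), latent, sofa, proj)
loss.backward()
print(loss.item())
```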
zh

[AI-40] Insights Informed Generative AI for Design: Incorporating Real-world Data for Text-to-Image Output

【Quick Read】: This paper addresses the fact that, although generative AI can rapidly visualize interior architectural designs, the outputs lack the sustainability data and material-usage information designers can act on. The key is a new pipeline that integrates DALL-E 3 with a materials dataset: a post-processing module identifies the dominant materials in a generated design and pairs them with carbon-dioxide-equivalent (CO2e) values, giving designers a basis for assessing environmental impact and selecting materials.

Link: https://arxiv.org/abs/2506.15008
Authors: Richa Gupta, Alexander Htet Kyaw
Institution: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, CAAD Futures 2025

Click to view abstract

Abstract:Generative AI, specifically text-to-image models, has revolutionized interior architectural design by enabling the rapid translation of conceptual ideas into visual representations from simple text prompts. While generative AI can produce visually appealing images, they often lack actionable data for designers. In this work, we propose a novel pipeline that integrates DALL-E 3 with a materials dataset to enrich AI-generated designs with sustainability metrics and material usage insights. After the model generates an interior design image, a post-processing module identifies the top ten materials present and pairs them with carbon dioxide equivalent (CO2e) values from a general materials dictionary. This approach allows designers to immediately evaluate environmental impacts and refine prompts accordingly. We evaluate the system through three user tests: (1) no mention of sustainability to the user prior to the prompting process with generative AI, (2) sustainability goals communicated to the user before prompting, and (3) sustainability goals communicated along with quantitative CO2e data included in the generative AI outputs. Our qualitative and quantitative analyses reveal that the introduction of sustainability metrics in the third test leads to more informed design decisions; however, it can also trigger decision fatigue and lower overall satisfaction. Nevertheless, the majority of participants reported incorporating sustainability principles into their workflows in the third test, underscoring the potential of integrated metrics to guide more ecologically responsible practices. Our findings showcase the importance of balancing design freedom with practical constraints, offering a clear path toward holistic, data-driven solutions in AI-assisted architectural design.
zh

[AI-41] Scaling Intelligence: Designing Data Centers for Next-Gen Language Models

【Quick Read】: This paper addresses the challenge that rapidly growing large language models (LLMs) pose for data-center architecture: supporting trillion-parameter training and inference with scalability, efficiency, and cost-effectiveness. The key is a comprehensive co-design framework that jointly optimizes FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), scale-out domain size, and common parallelization/optimization strategies. By introducing and evaluating FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, the study demonstrates major gains in performance and scalability, and its sensitivity analyses quantify the benefits of compute-communication overlap, hardware-accelerated collectives, wider scale-out domains, and larger memory capacity.

Link: https://arxiv.org/abs/2506.15006
Authors: Jesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera, Fabrizio Petrini
Institution: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Performance (cs.PF)
Comments: 14 pages, submitted to SC25 for review

Click to view abstract

Abstract:The explosive growth of Large Language Models (LLMs) - such as GPT-4 with 1.8 trillion parameters - demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = Model flops per token x Observed tokens per sec / Peak flops of the hardware) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
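The MFU definition quoted above lends itself to a quick worked example; all numbers below are illustrative, not measurements from the paper.

```python
# Worked example of the MFU definition quoted in the abstract:
# MFU = (model FLOPs per token x observed tokens/sec) / peak hardware FLOPs.
# All numbers are illustrative assumptions.
flops_per_token = 2 * 1.8e12   # ~2N FLOPs/token rule of thumb for an N-param dense model (inference)
tokens_per_sec = 150.0         # observed system throughput
peak_flops = 16 * 989e12       # e.g., 16 accelerators at ~989 TFLOPS each

mfu = flops_per_token * tokens_per_sec / peak_flops
print(f"MFU = {mfu:.2%}")      # low MFU signals underutilized hardware
```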
zh
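补充示例:摘要中给出了 MFU 的定义式(模型每 token 的 FLOPs × 实测每秒 token 数 / 硬件峰值 FLOPs)。下面是按该定义写的一个最小计算示意,所有数值均为假设,仅用于说明量纲关系,并非论文中的实测数据。

```python
def model_flops_utilization(flops_per_token: float,
                            observed_tokens_per_sec: float,
                            peak_hardware_flops: float) -> float:
    """按摘要定义计算 MFU = 每 token FLOPs × 实测 tokens/s / 硬件峰值 FLOPS。"""
    return flops_per_token * observed_tokens_per_sec / peak_hardware_flops

# 假设值:1.8 万亿参数模型每 token 约需 2 × 1.8e12 FLOPs(前向约为参数量的 2 倍),
# 实测吞吐 300 tokens/s,集群峰值算力 2e15 FLOPS
mfu = model_flops_utilization(2 * 1.8e12, 300, 2e15)
print(f"MFU = {mfu:.2%}")  # 输出 MFU = 54.00%
```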

[AI-42] MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

【速读】:该论文试图解决持续学习(Continual Learning, CL)与合作多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)结合场景下的基准测试不足问题,尤其是在复杂环境中实现长期任务序列的持续学习所面临的计算瓶颈和方法可扩展性问题。解决方案的关键在于提出MEAL(Multi-agent Environments for Adaptive Learning),这是首个针对持续多智能体强化学习(Continual Multi-Agent Reinforcement Learning, CMARL)的基准,利用JAX框架实现GPU加速,从而在标准台式计算机上高效完成长达100个任务的持续学习实验。

链接: https://arxiv.org/abs/2506.14990
作者: Tristan Tomilin,Luka van den Boogaard,Samuel Garcin,Bram Grooten,Meng Fang,Mykola Pechenizkiy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms, with environment availability strongly impacting research. One particularly underexplored intersection is continual learning (CL) in cooperative multi-agent settings. To remedy this, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark tailored for continual multi-agent reinforcement learning (CMARL). Existing CL benchmarks run environments on the CPU, leading to computational bottlenecks and limiting the length of task sequences. MEAL leverages JAX for GPU acceleration, enabling continual learning across sequences of 100 tasks on a standard desktop PC in a few hours. We show that naively combining popular CL and MARL methods yields strong performance on simple environments, but fails to scale to more complex settings requiring sustained coordination and adaptation. Our ablation study identifies architectural and algorithmic features critical for CMARL on MEAL.
zh
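补充示例:MEAL 的提速来自用 JAX 将环境整体编译并在 GPU 上批量推进。下面用一个笔者假设的玩具环境演示 jax.vmap + jit 的向量化模式(并非 MEAL 的真实环境接口)。

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    # 假设的玩具动力学:状态衰减并向动作方向移动,奖励为负的状态范数
    new_state = 0.9 * state + 0.1 * action
    reward = -jnp.abs(new_state).sum()
    return new_state, reward

# vmap 将单环境 step 向量化为 4096 个并行环境,jit 编译后可在 GPU 上运行
batched_step = jax.jit(jax.vmap(env_step))

states = jax.random.normal(jax.random.PRNGKey(0), (4096, 8))
actions = jnp.zeros((4096, 8))
states, rewards = batched_step(states, actions)
print(states.shape, rewards.shape)  # (4096, 8) (4096,)
```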

[AI-43] Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

【速读】:该论文试图解决在多智能体多臂老虎机(MA-MAB)框架中,如何在有限的信息条件下实现公平的结果同时最大化整体系统性能的问题。解决方案的关键在于引入一种新颖的探测框架,该框架在分配之前战略性地收集关于所选臂的信息,从而优化决策过程。在离线设置中,利用子模性设计了一个具有可证明性能边界的贪心探测算法;而在在线设置中,则开发了一个能够实现次线性遗憾并保持公平性的算法。

链接: https://arxiv.org/abs/2506.14988
作者: Tianyi Xu,Jiaxin Liu,Zizhan Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.
zh
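补充示例:论文离线设置中的贪心探测算法依赖价值函数的子模性。下面是通用的预算约束贪心子模最大化骨架,其中的覆盖型价值函数是笔者为演示假设的,并非论文的真实目标函数。

```python
def greedy_probe(arms, budget, value_fn):
    """预算内每步选边际增益最大的臂;对单调子模函数有 (1 - 1/e) 近似保证。"""
    probed = set()
    for _ in range(budget):
        gains = {a: value_fn(probed | {a}) - value_fn(probed)
                 for a in arms if a not in probed}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] <= 0:  # 无正增益则提前停止
            break
        probed.add(best)
    return probed

# 假设的覆盖型价值函数(单调子模):每个臂覆盖若干信息单元
coverage = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4, 5}}
value = lambda S: len(set().union(*(coverage[a] for a in S))) if S else 0
print(greedy_probe(list(coverage), budget=2, value_fn=value))  # {1, 3}
```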

[AI-44] FEAST: A Flexible Mealtime-Assistance System Towards In-the-Wild Personalization

【速读】:该论文试图解决家庭环境中进食辅助的复杂性问题,包括多样的活动(如进食、饮水、擦拭嘴巴)、场景(如社交、看电视)、食物种类以及用户偏好所带来的挑战。解决方案的关键在于提出FEAST系统,该系统通过三个核心原则实现灵活的个性化:适应性、透明性和安全性。FEAST通过模块化硬件、多种交互方式以及可参数化的行为树来体现这些原则,从而支持在真实环境中的安全和透明适应。

链接: https://arxiv.org/abs/2506.14968
作者: Rajat Kumar Jenamani,Tom Silver,Ben Dodson,Shiqin Tong,Anthony Song,Yuting Yang,Ziang Liu,Benjamin Howe,Aimee Whitneck,Tapomayukh Bhattacharjee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: RSS 2025 - Outstanding Paper Award Outstanding Systems Paper Award Finalist

点击查看摘要

Abstract:Physical caregiving robots hold promise for improving the quality of life of millions worldwide who require assistance with feeding. However, in-home meal assistance remains challenging due to the diversity of activities (e.g., eating, drinking, mouth wiping), contexts (e.g., socializing, watching TV), food items, and user preferences that arise during deployment. In this work, we propose FEAST, a flexible mealtime-assistance system that can be personalized in-the-wild to meet the unique needs of individual care recipients. Developed in collaboration with two community researchers and informed by a formative study with a diverse group of care recipients, our system is guided by three key tenets for in-the-wild personalization: adaptability, transparency, and safety. FEAST embodies these principles through: (i) modular hardware that enables switching between assisted feeding, drinking, and mouth-wiping, (ii) diverse interaction methods, including a web interface, head gestures, and physical buttons, to accommodate diverse functional abilities and preferences, and (iii) parameterized behavior trees that can be safely and transparently adapted using a large language model. We evaluate our system based on the personalization requirements identified in our formative study, demonstrating that FEAST offers a wide range of transparent and safe adaptations and outperforms a state-of-the-art baseline limited to fixed customizations. To demonstrate real-world applicability, we conduct an in-home user study with two care recipients (who are community researchers), feeding them three meals each across three diverse scenarios. We further assess FEAST’s ecological validity by evaluating it with an Occupational Therapist previously unfamiliar with the system. In all cases, users successfully personalize FEAST to meet their individual needs and preferences. Website: this https URL
zh

[AI-45] Flat Channels to Infinity in Neural Loss Landscapes

【速读】:该论文试图解决神经网络损失景观中存在平坦区域和鞍点的结构特性及其对优化过程的影响问题。其关键解决方案是识别并表征了一种特殊的损失景观结构:在这些通道中,损失下降极其缓慢,而至少两个神经元的输出权重 $ a_i $ 和 $ a_j $ 发散至正负无穷,同时它们的输入权重向量 $ \mathbf{w}_i $ 和 $ \mathbf{w}_j $ 趋于相等。这种结构在收敛时表现为一个门控线性单元,揭示了全连接层在计算能力上的意外特性。

链接: https://arxiv.org/abs/2506.14951
作者: Flavio Martinelli,Alexander Van Meegen,Berfin Şimşek,Wulfram Gerstner,Johanni Brea
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
zh
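补充示例:摘要中的极限式可以数值验证。下面的示意(权重由随机数生成,通道的参数化方式是笔者根据极限式选取的一种)检验当 a → ∞、两个输入权重相互趋近时,两神经元之和收敛到门控线性单元。

```python
import numpy as np

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2  # tanh 的导数

rng = np.random.default_rng(0)
w, v, x = rng.normal(size=(3, 4))  # 随机取 w、v、x ∈ R^4

for a in (1e1, 1e3, 1e5):
    # 沿通道取 a_i = a, w_i = w + v/a;a_j = 1 - a, w_j = w,
    # 则 a_i σ(w_i·x) + a_j σ(w_j·x) = σ(w·x) + a[σ((w + v/a)·x) - σ(w·x)]
    lhs = a * sigma((w + v / a) @ x) + (1 - a) * sigma(w @ x)
    rhs = sigma(w @ x) + (v @ x) * dsigma(w @ x)  # 门控线性单元
    print(f"a = {a:.0e}, |两神经元之和 - GLU| = {abs(lhs - rhs):.2e}")
```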

[AI-46] Determinação Automática de Limiar de Detecção de Ataques em Redes de Computadores Utilizando Autoencoders

【速读】:该论文试图解决基于自编码器(Autoencoder, AE)的异常检测系统中因重建误差分离阈值非标准化而导致的检测性能不稳定问题。解决方案的关键在于通过机器学习算法自动定义该分离阈值,以提高检测过程的准确性和可靠性。

链接: https://arxiv.org/abs/2506.14937
作者: Luan Gonçalves Miranda,Pedro Ivo da Cruz,Murilo Bellezoni Loiola
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Performance (cs.PF)
备注: This work was accepted at SBrT 2022 (Brazilian Symposium on Telecommunications and Signal Processing), though it was not included in the official proceedings. in Portuguese language

点击查看摘要

Abstract:Currently, digital security mechanisms like Anomaly Detection Systems using Autoencoders (AE) show great potential for bypassing problems intrinsic to the data, such as data imbalance. Because AEs use a non-trivial and non-standardized separation threshold to classify the extracted reconstruction error, the definition of this threshold directly impacts the performance of the detection process. Thus, this work proposes the automatic definition of this threshold using machine learning algorithms. For this, three algorithms were evaluated: K-Nearest Neighbors, K-Means, and Support Vector Machine.
zh
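补充示例:论文评估了 K-Nearest Neighbors、K-Means 与 SVM 三种算法来自动确定阈值。下面以 K-Means 思路给出一个最小示意:把重建误差聚成两簇,阈值取两簇中心的中点。数据为合成的,取中点只是笔者的一种实现假设。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# 合成重建误差:正常流量误差小、攻击流量误差大(类别严重不平衡)
errors = np.concatenate([rng.normal(0.05, 0.02, 950),
                         rng.normal(0.60, 0.10, 50)]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(errors)
c_low, c_high = sorted(km.cluster_centers_.ravel())
threshold = (c_low + c_high) / 2  # 自动确定的分离阈值
print(f"threshold ≈ {threshold:.3f}")
print("判为攻击的样本数:", int((errors.ravel() > threshold).sum()))
```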

[AI-47] CALM: Contextual Analog Logic with Multimodality

【速读】:该论文试图解决传统二值逻辑系统无法捕捉人类决策的细微差别以及在多模态环境中需要人工干预的问题,同时解决神经网络虽能提取丰富的上下文信息但缺乏可解释推理结构的缺陷。其解决方案的关键在于提出一种名为Contextual Analog Logic with Multimodality (CALM) 的框架,该框架将符号推理与神经生成相结合,通过领域树表示每个谓词,并利用神经网络预测上下文相关的模拟真值,再通过符号推理模块确保约束满足,从而实现基于多模态输入的类比逻辑推理。

链接: https://arxiv.org/abs/2506.14936
作者: Maxwell J. Jacobson,Corey J. Maley,Yexiang Xue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we introduce Contextual Analog Logic with Multimodality (CALM). CALM unites symbolic reasoning with neural generation, enabling systems to make context-sensitive decisions grounded in real-world multi-modal data. Background: Classic bivalent logic systems cannot capture the nuance of human decision-making. They also require human grounding in multi-modal environments, which can be ad-hoc, rigid, and brittle. Neural networks are good at extracting rich contextual information from multi-modal data, but lack interpretable structures for reasoning. Objectives: CALM aims to bridge the gap between logic and neural perception, creating an analog logic that can reason over multi-modal inputs. Without this integration, AI systems remain either brittle or unstructured, unable to generalize robustly to real-world tasks. In CALM, symbolic predicates evaluate to analog truth values computed by neural networks and constrained search. Methods: CALM represents each predicate using a domain tree, which iteratively refines its analog truth value when the contextual groundings of its entities are determined. The iterative refinement is predicted by neural networks capable of capturing multi-modal information and is filtered through a symbolic reasoning module to ensure constraint satisfaction. Results: In fill-in-the-blank object placement tasks, CALM achieved 92.2% accuracy, outperforming classical logic (86.3%) and LLM (59.4%) baselines. It also demonstrated spatial heatmap generation aligned with logical constraints and delicate human preferences, as shown by a human study. Conclusions: CALM demonstrates the potential to reason with logic structure while aligning with preferences in multi-modal environments. It lays the foundation for next-gen AI systems that require the precision and interpretation of logic and the multimodal information processing of neural networks.
zh

[AI-48] Explain First Trust Later: LLM -Augmented Explanations for Graph-Based Crypto Anomaly Detection

【速读】:该论文试图解决加密货币领域日益增长的金融犯罪问题,尤其是在去中心化金融(DeFi)社区快速发展的背景下。解决方案的关键在于实施与政策相关的自动化检测工具,以应对技术新颖性带来的执法挑战。

链接: https://arxiv.org/abs/2506.14933
作者: Adriana Watson
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 6 pages, 4 figures. Code available at: this https URL

点击查看摘要

Abstract:The decentralized finance (DeFi) community has grown rapidly in recent years, pushed forward by cryptocurrency enthusiasts interested in the vast untapped potential of new markets. The surge in popularity of cryptocurrency has ushered in a new era of financial crime. Unfortunately, the novelty of the technology makes the task of catching and prosecuting offenders particularly challenging. Thus, it is necessary to implement policy-relevant automated detection tools to address the growing criminality in the cryptocurrency realm.
zh

[AI-49] Preparing for the Intelligence Explosion

【速读】:该论文试图解决人工智能(Artificial Intelligence, AI)快速发展可能带来的重大挑战,即“宏大挑战”(grand challenges),这些挑战包括新型大规模杀伤性武器、AI赋能的专制政权、争夺外星资源的竞赛以及具有道德地位的数字生命体等。论文指出,这些挑战无法总是交由未来的AI系统处理,因此需要当前采取行动以提升应对能力。解决方案的关键在于提前做好通用智能(Artificial General Intelligence, AGI)准备,不仅关注高级AI系统的对齐问题,还需为智能爆炸可能带来的复杂和不可预测的发展做好应对。

链接: https://arxiv.org/abs/2506.14863
作者: William MacAskill,Fin Moorhouse
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 61 pages

点击查看摘要

Abstract:AI that can accelerate research could drive a century of technological progress over just a few years. During such a period, new technological or political developments will raise consequential and hard-to-reverse decisions, in rapid succession. We call these developments grand challenges. These challenges include new weapons of mass destruction, AI-enabled autocracies, races to grab offworld resources, and digital beings worthy of moral consideration, as well as opportunities to dramatically improve quality of life and collective decision-making. We argue that these challenges cannot always be delegated to future AI systems, and suggest things we can do today to meaningfully improve our prospects. AGI preparedness is therefore not just about ensuring that advanced AI systems are aligned: we should be preparing, now, for the disorienting range of developments an intelligence explosion would bring.
zh

[AI-50] Feedback-MPPI: Fast Sampling-Based MPC via Rollout Differentiation – Adios low-level controllers

【速读】:该论文旨在解决Model Predictive Path Integral control(MPPI)在实时、高频机器人控制场景中因计算需求高而应用受限的问题。其解决方案的关键在于提出Feedback-MPPI(F-MPPI),通过引入基于Riccati的反馈思想进行敏感性分析,计算局部线性反馈增益,从而在不进行每一步完整重优化的情况下实现对当前状态的快速闭环修正,显著提升了控制性能和稳定性。

链接: https://arxiv.org/abs/2506.14855
作者: Tommaso Belvedere(RAINBOW, IRISA),Michael Ziegltrum(UCL),Giulio Turrisi(IIT),Valerio Modugno(UCL)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model Predictive Path Integral control is a powerful sampling-based approach suitable for complex robotic tasks due to its flexibility in handling nonlinear dynamics and non-convex costs. However, its applicability in real-time, high-frequency robotic control scenarios is limited by computational demands. This paper introduces Feedback-MPPI (F-MPPI), a novel framework that augments standard MPPI by computing local linear feedback gains derived from sensitivity analysis inspired by Riccati-based feedback used in gradient-based MPC. These gains allow for rapid closed-loop corrections around the current state without requiring full re-optimization at each timestep. We demonstrate the effectiveness of F-MPPI through simulations and real-world experiments on two robotic platforms: a quadrupedal robot performing dynamic locomotion on uneven terrain and a quadrotor executing aggressive maneuvers with onboard computation. Results illustrate that incorporating local feedback significantly improves control performance and stability, enabling robust, high-frequency operation suitable for complex robotic systems.
zh
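补充示例:F-MPPI 的关键是在名义 MPPI 解附近叠加局部线性反馈 u = u* + K(x − x*),从而无需每个控制周期都完整重优化。下面的片段只演示这一修正形式,状态维度与增益矩阵均为笔者的假设值。

```python
import numpy as np

def f_mppi_control(u_star, x_star, x_measured, K):
    """两次完整 MPPI 重优化之间的快速闭环修正:u = u* + K (x - x*)。
    K 为灵敏度分析得到的局部反馈增益(此处用假设的常数矩阵代替)。"""
    return u_star + K @ (x_measured - x_star)

K = np.array([[1.0, 0.0, 0.5, 0.0],     # 假设:4 维状态、2 维控制
              [0.0, 1.0, 0.0, 0.5]])
u_star = np.array([0.2, -0.1])           # MPPI 输出的名义控制
x_star = np.zeros(4)                     # 名义轨迹上的状态
x_meas = np.array([0.02, -0.01, 0.00, 0.03])  # 实测状态(含扰动)
print(f_mppi_control(u_star, x_star, x_meas, K))  # 修正后的控制量
```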

[AI-51] Efficient Serving of LLM Applications with Probabilistic Demand Modeling

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLM)应用在服务系统中因资源需求动态变化而导致的端到端效率低下问题。现有服务系统将LLM应用的资源需求视为黑箱,导致排队顺序不当和后端预热延迟,从而影响整体性能。论文提出的关键解决方案是Hermes,其核心在于利用概率需求图(Probabilistic Demand Graph, PDGraph)对LLM应用的资源需求进行建模,并结合Gittins策略优化任务调度顺序,同时基于PDGraph模型在适当时间预热冷后端,从而显著提升服务效率。实验结果表明,Hermes能够有效减少平均完成时间超过70%,P95完成时间超过80%。

链接: https://arxiv.org/abs/2506.14851
作者: Yifei Liu,Zuo Gan,Zhenghao Gan,Weiye Wang,Chen Chen,Yizhou Shan,Xusheng Chen,Zhenhua Han,Yifei Zhu,Shixuan Sun,Minyi Guo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Applications based on Large Language Models (LLMs) contain a series of tasks that address real-world problems with boosted capability, and they place dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a blackbox, compromising end-to-end efficiency due to improper queuing order and backend warm-up latency. We find that the resource demands of LLM applications can be modeled in a general and accurate manner with the Probabilistic Demand Graph (PDGraph). We then propose Hermes, which leverages PDGraph for efficient serving of LLM applications. Confronted with this probabilistic demand description, Hermes applies the Gittins policy to determine the scheduling order that can minimize the average application completion time. It also uses the PDGraph model to help prewarm cold backends at proper moments. Experiments with diverse LLM applications confirm that Hermes can effectively improve the application serving efficiency, reducing the average completion time by over 70% and the P95 completion time by over 80%.
zh
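补充示例:Hermes 采用 Gittins 策略确定排队顺序。下面是离散剩余需求分布下 Gittins 指数的一个简化计算示意(指数取常见的"配额内完成概率 / 配额内期望消耗"形式,这是笔者的简化写法,分布数值亦为假设)。

```python
import numpy as np

def gittins_index(pmf, attained):
    """pmf[k] 为总服务时长等于 k+1 的概率;attained 为已获得的服务量。
    对每个候选配额 tau 计算 完成概率 / 期望消耗,取最大值作为调度优先级。"""
    s = np.arange(1, len(pmf) + 1)
    tail, rem = pmf[attained:], s[attained:] - attained
    if tail.sum() == 0:
        return 0.0
    tail = tail / tail.sum()  # 条件化到"尚未完成"
    return max(tail[rem <= tau].sum() / (np.minimum(rem, tau) @ tail)
               for tau in range(1, len(rem) + 1))

short = np.array([0.7, 0.2, 0.1])    # 多半 1 个单位即可完成的应用
long_ = np.array([0.05, 0.15, 0.8])  # 多半需要 3 个单位的应用
print(gittins_index(short, 0), gittins_index(long_, 0))  # 短作业指数更高,先调度
```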

[AI-52] Optimization of bi-directional gated loop cell based on multi-head attention mechanism for SSD health state classification model

【速读】:该论文旨在解决固态硬盘(SSD)健康状态预测在数据可靠性保障中的关键问题,通过提出一种融合双向门控循环单元(BiGRU)与多头注意力机制(MHA)的混合模型,提升存储设备健康分类的准确性和稳定性。该解决方案的关键在于创新性地结合了时间特征提取与关键信息聚焦能力,利用BiGRU网络的双向时序建模优势捕捉SSD退化特征的正向和反向依赖关系,同时通过多头注意力机制动态分配特征权重,增强模型对关键健康指标的敏感性。

链接: https://arxiv.org/abs/2506.14830
作者: Zhizhao Wen,Ruoxin Zhang,Chao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Source code available; Accepted by 2025 6th International Conference on Electronic Communication and Artificial Intelligence; 5 pages; 7 figures

点击查看摘要

Abstract:Aiming at the critical role of SSD health state prediction in data reliability assurance, this study proposes a hybrid BiGRU-MHA model that incorporates a multi-head attention mechanism to enhance the accuracy and stability of storage device health classification. The model innovatively integrates temporal feature extraction and key information focusing capabilities. Specifically, it leverages the bidirectional temporal modeling advantages of the BiGRU network to capture both forward and backward dependencies of SSD degradation features. Simultaneously, the multi-head attention mechanism dynamically assigns feature weights, improving the model’s sensitivity to critical health indicators. Experimental results show that the proposed model achieves classification accuracies of 92.70% on the training set and 92.44% on the test set, with a minimal performance gap of only 0.26%, demonstrating excellent generalization ability. Further analysis using the receiver operating characteristic (ROC) curve shows an area under the curve (AUC) of 0.94 on the test set, confirming the model’s robust binary classification performance. This work not only presents a new technical approach for SSD health prediction but also addresses the generalization bottleneck of traditional models, offering a verifiable method with practical value for preventive maintenance of industrial-grade storage systems. The results show the model can significantly reduce data loss risks by providing early failure warnings and help optimize maintenance costs, supporting intelligent decision-making in building reliable storage systems for cloud computing data centers and edge storage environments.
zh
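补充示例:下面用 PyTorch 给出摘要所述 BiGRU + 多头注意力结构的一个最小骨架。层数、隐藏维度与特征数均为笔者假设,并非论文的原始配置。

```python
import torch
import torch.nn as nn

class BiGRUMHA(nn.Module):
    """BiGRU 提取双向时序特征,多头注意力动态加权,最后做二分类。"""
    def __init__(self, n_features=12, hidden=64, heads=4, n_classes=2):
        super().__init__()
        self.bigru = nn.GRU(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.mha = nn.MultiheadAttention(embed_dim=2 * hidden,
                                         num_heads=heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        h, _ = self.bigru(x)              # (batch, seq_len, 2*hidden)
        attn_out, _ = self.mha(h, h, h)   # 自注意力对各时间步重新加权
        return self.head(attn_out.mean(dim=1))  # 时间维池化后分类

model = BiGRUMHA()
smart_logs = torch.randn(8, 30, 12)  # 假设:8 块盘、30 个时间步、12 个 SMART 特征
print(model(smart_logs).shape)        # torch.Size([8, 2])
```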

[AI-53] he Hardness of Achieving Impact in AI for Social Impact Research: A Ground-Level View of Challenges Opportunities

【速读】:该论文试图解决AI for Social Impact (AI4SI)项目在实现实际社会影响方面所面临的多重挑战,包括难以找到愿意共同设计和部署AI解决方案的合作者、项目难以从原型阶段过渡到规模化生产级解决方案,以及AI4SI研究的独特性未被更广泛的AI社区充分认可等问题。其解决方案的关键在于通过半结构化访谈和作者自身的实践经验,识别出结构性与组织性、沟通、协作及操作等方面的障碍,并提炼出最佳实践和可操作策略,以期为AI4SI研究人员和合作机构提供实用参考指南。

链接: https://arxiv.org/abs/2506.14829
作者: Aditya Majumdar,Wenbo Zhang,Kashvi Prawal,Amulya Yadav
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In an attempt to tackle the UN SDGs, AI for Social Impact (AI4SI) projects focus on harnessing AI to address societal issues in areas such as healthcare, social justice, etc. Unfortunately, despite growing interest in AI4SI, achieving tangible, on-the-ground impact remains a significant challenge. For example, identifying and engaging motivated collaborators who are willing to co-design and deploy AI based solutions in real-world settings is often difficult. Even when such partnerships are established, many AI4SI projects “fail” to progress beyond the proof-of-concept stage, and hence, are unable to transition to at-scale production-level solutions. Furthermore, the unique challenges faced by AI4SI researchers are not always fully recognized within the broader AI community, where such work is sometimes viewed as primarily applied and not aligning with the traditional criteria for novelty emphasized in core AI venues. This paper attempts to shine a light on the diverse challenges faced in AI4SI research by diagnosing a multitude of factors that prevent AI4SI partnerships from achieving real-world impact on the ground. Drawing on semi-structured interviews with six leading AI4SI researchers - complemented by the authors’ own lived experiences in conducting AI4SI research - this paper attempts to understand the day-to-day difficulties faced in developing and deploying socially impactful AI solutions. Through thematic analysis, we identify structural and organizational, communication, collaboration, and operational challenges as key barriers to deployment. While there are no easy fixes, we synthesize best practices and actionable strategies drawn from these interviews and our own work in this space. In doing so, we hope this paper serves as a practical reference guide for AI4SI researchers and partner organizations seeking to engage more effectively in socially impactful AI collaborations.
zh

[AI-54] Collaborative Interest-aware Graph Learning for Group Identification ECML KDD2025

【速读】:该论文试图解决群体识别(Group Identification, GI)中现有方法未能充分建模用户双层次兴趣(即群体级兴趣和物品级兴趣)之间的协同演化关系的问题,特别是在群体级兴趣对物品级兴趣的增强作用以及跨层次兴趣对齐时存在的假阴性样本干扰方面存在不足。解决方案的关键在于提出CI4GI模型,其核心是设计一种兴趣增强策略,通过用户所加入群体交互的物品来补充物品级兴趣,同时利用用户间兴趣分布的距离优化负样本识别,从而缓解跨层次兴趣对齐中的假阴性问题。

链接: https://arxiv.org/abs/2506.14826
作者: Rui Zhao,Beihong Jin,Beibei Li,Yiyuan Zheng
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: accepted by ECML PKDD 2025

点击查看摘要

Abstract:With the popularity of social media, an increasing number of users are joining group activities on online social platforms. This elicits the requirement of group identification (GI), which is to recommend groups to users. We reveal that users are influenced by both group-level and item-level interests, and these dual-level interests have a collaborative evolution relationship: joining a group expands the user’s item interests, further prompting the user to join new groups. Ultimately, the two interests tend to align dynamically. However, existing GI methods fail to fully model this collaborative evolution relationship, ignoring the enhancement of group-level interests on item-level interests, and suffering from false-negative samples when aligning cross-level interests. In order to fully model the collaborative evolution relationship between dual-level user interests, we propose CI4GI, a Collaborative Interest-aware model for Group Identification. Specifically, we design an interest enhancement strategy that identifies additional interests of users from the items interacted with by the groups they have joined as a supplement to item-level interests. In addition, we adopt the distance between interest distributions of two users to optimize the identification of negative samples for a user, mitigating the interference of false-negative samples during cross-level interests alignment. The results of experiments on three real-world datasets demonstrate that CI4GI significantly outperforms state-of-the-art models.
zh

[AI-55] FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在联邦学习(Federated Learning, FL)场景下的部署难题,包括高计算需求、客户端容量有限、通信成本高昂以及客户端数据异构性等问题。现有联邦学习方法假设客户端部署完整模型,这一假设在大规模MLLM中难以成立。论文提出的解决方案关键在于FedNano框架,该框架将大语言模型(Large Language Model, LLM)集中部署于服务器端,并引入轻量级的NanoEdge模块,通过模态特定编码器、连接器和低秩适配的可训练NanoAdapters实现客户端个性化适配,从而避免在客户端部署LLM,显著降低存储需求和通信开销。

链接: https://arxiv.org/abs/2506.14824
作者: Yao Zhang,Hewei Gao,Haokun Chen,Weiguo Li,Yunpu Ma,Volker Tresp
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) excel in tasks like multimodal reasoning and cross-modal retrieval but face deployment challenges in real-world scenarios due to distributed multimodal data and strict privacy requirements. Federated Learning (FL) offers a solution by enabling collaborative model training without centralizing data. However, realizing FL for MLLMs presents significant challenges, including high computational demands, limited client capacity, substantial communication costs, and heterogeneous client data. Existing FL methods assume client-side deployment of full models, an assumption that breaks down for large-scale MLLMs due to their massive size and communication demands. To address these limitations, we propose FedNano, the first FL framework that centralizes the LLM on the server while introducing NanoEdge, a lightweight module for client-specific adaptation. NanoEdge employs modality-specific encoders, connectors, and trainable NanoAdapters with low-rank adaptation. This design eliminates the need to deploy LLM on clients, reducing client-side storage by 95%, and limiting communication overhead to only 0.01% of the model parameters. By transmitting only compact NanoAdapter updates, FedNano handles heterogeneous client data and resource constraints while preserving privacy. Experiments demonstrate that FedNano outperforms prior FL baselines, bridging the gap between MLLM scale and FL feasibility, and enabling scalable, decentralized multimodal AI systems.
zh
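补充示例:NanoAdapter 采用低秩适配(LoRA 风格)。下面是一个通用低秩适配模块的示意实现,结构与初始化遵循常见 LoRA 做法,并非 FedNano 的原始代码;它说明了为何客户端只需通信极小比例的参数。

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """冻结原线性层,仅训练低秩矩阵 A、B:y = Wx + (B A x) · scaling。"""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scaling

layer = LoRAAdapter(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"可训练参数占比:{trainable / total:.2%}")  # 约 2%,仅需上传 A、B
```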

[AI-56] raining with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks OSDI’25

【速读】:该论文试图解决深度学习(Deep Learning, DL)模型训练过程中容易出现的隐性错误(silent errors)问题,这类错误难以检测和诊断。解决方案的关键在于提出TRAINCHECK框架,该框架采用主动检查的方法,通过自动推断适用于DL训练的不变式(invariants),在训练过程中主动检测隐性错误并提供调试支持。

链接: https://arxiv.org/abs/2506.14813
作者: Yuxuan Jiang,Ziming Zhou,Boyu Xu,Beijie Liu,Runhui Xu,Peng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, to appear in 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI '25)

点击查看摘要

Abstract:Training deep learning (DL) models is a complex process, making it prone to silent errors that are challenging to detect and diagnose. This paper presents TRAINCHECK, a framework that takes a proactive checking approach to address silent training errors. TRAINCHECK automatically infers invariants tailored for DL training. It uses these invariants to proactively detect silent errors during the training process while providing debugging help. To evaluate TRAINCHECK, we reproduce 20 real-world silent training errors with diverse root causes. TRAINCHECK successfully detects 18 errors within a single training iteration. It also uncovers 6 unknown bugs in popular training libraries that lead to silent errors.
zh
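补充示例:TRAINCHECK 的核心是在训练过程中主动检查自动推断出的不变式。下面是这一"主动检查"模式的极简示意;其中的不变式(loss 与参数保持有限)是笔者举的假设例子,TRAINCHECK 实际推断的不变式远比这丰富。

```python
import torch
import torch.nn as nn

def check_invariants(model, loss, step):
    """每个迭代主动检查不变式,让静默错误在发生的当步即被发现。"""
    assert torch.isfinite(loss), f"step {step}: loss 非有限值"
    for name, p in model.named_parameters():
        assert torch.isfinite(p).all(), f"step {step}: 参数 {name} 出现 NaN/Inf"

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(10):
    x, y = torch.randn(32, 4), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    check_invariants(model, loss, step)  # 类似 TRAINCHECK 的主动检查点
```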

[AI-57] ss-Mamba: Semantic-Spline Selective State-Space Model

【速读】:该论文旨在解决时间序列预测中的准确性、鲁棒性和可解释性不足的问题,同时降低计算复杂度。其解决方案的关键在于提出ss-Mamba模型,该模型结合了语义感知嵌入(semantic-aware embeddings)与自适应样条基时间编码(adaptive spline-based temporal encoding),并在选择性状态空间建模框架内进行整合,从而在保持高性能的同时将计算复杂度从二次方降低到线性。

链接: https://arxiv.org/abs/2506.14802
作者: Zuochen Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose ss-Mamba, a novel foundation model that enhances time series forecasting by integrating semantic-aware embeddings and adaptive spline-based temporal encoding within a selective state-space modeling framework. Building upon the recent success of Transformer architectures, ss-Mamba adopts the Mamba selective state space model as an efficient alternative that achieves comparable performance while significantly reducing computational complexity from quadratic to linear time. Semantic index embeddings, initialized from pretrained language models, allow effective generalization to previously unseen series through meaningful semantic priors. Additionally, spline-based Kolmogorov-Arnold Networks (KAN) dynamically and interpretably capture complex seasonalities and non-stationary temporal effects, providing a powerful enhancement over conventional temporal feature encodings. Extensive experimental evaluations confirm that ss-Mamba delivers superior accuracy, robustness, and interpretability, demonstrating its capability as a versatile and computationally efficient alternative to traditional Transformer-based models in time-series forecasting.
zh

[AI-58] Analyzing Character Representation in Media Content using Multimodal Foundation Model: Effectiveness and Trust

【速读】:该论文试图解决的问题是:尽管生成式 AI (Generative AI) 已经能够从音频、视频和文本中量化分析角色在性别和年龄等人口统计维度上的表现,但缺乏对受众的参与,因此不清楚这些量化结果对普通公众的实际有用性以及他们对 AI 生成数据的信任程度。解决方案的关键在于通过用户研究验证 AI 生成结果的有用性和可信度,并提出一种基于对比语言图像预训练(Contrastive Language Image Pretraining, CLIP)基础模型的字符表现分析与可视化工具,以更有效地向非专业观众呈现相关分析结果。

链接: https://arxiv.org/abs/2506.14799
作者: Evdoxia Taka,Debadyuti Bhattacharya,Joanne Garde-Hansen,Sanjay Sharma,Tanaya Guha
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in AI have enabled automated analysis of complex media content at scale and the generation of actionable insights regarding character representation along such dimensions as gender and age. Past work focused on quantifying representation from audio/video/text using various ML models, but without having the audience in the loop. We ask: even if character distributions along demographic dimensions are available, how useful are they to the general public? Do they actually trust the numbers generated by AI models? Our work addresses these questions through a user study, while proposing a new AI-based character representation and visualization tool. Our tool uses the Contrastive Language Image Pretraining (CLIP) foundation model to analyze visual screen data and quantify character representation across dimensions of age and gender. We also designed effective visualizations suitable for presenting such analytics to a lay audience. Next, we conducted a user study to seek empirical evidence on the usefulness and trustworthiness of the AI-generated results for carefully chosen movies presented in the form of our visualizations. We note that participants were able to understand the analytics from our visualization, and deemed the tool 'overall useful'. Participants also indicated a need for more detailed visualizations that include more demographic categories and contextual information about the characters. Participants' trust in AI-based gender and age models is seen to be moderate to low, although they were not against the use of AI in this context. Our tool including code, benchmarking, and data from the user study can be found here: this https URL
zh

[AI-59] Bound by semanticity: universal laws governing the generalization-identification tradeoff

【速读】:该论文试图解决深度学习系统中普遍存在的泛化与识别之间的权衡问题,即如何在保持输入身份的同时实现广泛的泛化能力。解决方案的关键在于揭示了基于有限语义分辨率的表征相似性衰减所导致的通用帕累托前沿,该前沿独立于输入空间的几何结构,并通过理论推导和实验验证表明,这种限制在不同复杂度的模型(包括全连接网络、卷积神经网络和视觉-语言模型)中均存在,从而确立了有限分辨率相似性作为深度网络和大脑表征能力的基本信息约束。

链接: https://arxiv.org/abs/2506.14797
作者: Marco Nurisso,Jesseba Fernando,Raj Deshpande,Alan Perotti,Raja Marjieh,Steven M. Frankland,Richard L. Lewis,Taylor W. Webb,Declan Campbell,Francesco Vaccarino,Jonathan D. Cohen,Giovanni Petri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent systems must deploy internal representations that are simultaneously structured – to support broad generalization – and selective – to preserve input identity. We expose a fundamental limit on this tradeoff. For any model whose representational similarity between inputs decays with finite semantic resolution $\varepsilon$, we derive closed-form expressions that pin its probability of correct generalization $p_S$ and identification $p_I$ to a universal Pareto front independent of input space geometry. Extending the analysis to noisy, heterogeneous spaces and to $n > 2$ inputs predicts a sharp $1/n$ collapse of multi-input processing capacity and a non-monotonic optimum for $p_S$. A minimal ReLU network trained end-to-end reproduces these laws: during learning a resolution boundary self-organizes and empirical $(p_S, p_I)$ trajectories closely follow theoretical curves for linearly decaying similarity. Finally, we demonstrate that the same limits persist in two markedly more complex settings – a convolutional neural network and state-of-the-art vision-language models – confirming that finite-resolution similarity is a fundamental emergent informational constraint, not merely a toy-model artifact. Together, these results provide an exact theory of the generalization-identification trade-off and clarify how semantic resolution shapes the representational capacity of deep networks and brains alike.
zh

[AI-60] opology-Aware and Highly Generalizable Deep Reinforcement Learning for Efficient Retrieval in Multi-Deep Storag e Systems

【速读】:该论文旨在解决多深自主车辆存储与检索系统(multi-deep autonomous vehicle storage and retrieval systems, AVS/RS)在异构物品配置下进行取货操作时面临的因通道堵塞导致的效率下降问题,其核心目标是通过最小化总延误时间来优化取货调度。解决方案的关键在于提出一种基于深度强化学习的框架,该框架采用图神经网络(Graph Neural Network, GNN)与Transformer模型相结合的新型神经网络架构,以有效捕捉仓储系统的拓扑结构和物品属性,并通过全局优先级分配实现高效的决策优化。

链接: https://arxiv.org/abs/2506.14787
作者: Funing Li,Yuan Tian,Ruben Noortwyck,Jifeng Zhou,Liming Kuang,Robert Schulz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In modern industrial and logistics environments, the rapid expansion of fast delivery services has heightened the demand for storage systems that combine high efficiency with increased density. Multi-deep autonomous vehicle storage and retrieval systems (AVS/RS) present a viable solution for achieving greater storage density. However, these systems encounter significant challenges during retrieval operations due to lane blockages. A conventional approach to mitigate this issue involves storing items with homogeneous characteristics in a single lane, but this strategy restricts the flexibility and adaptability of multi-deep storage systems. In this study, we propose a deep reinforcement learning-based framework to address the retrieval problem in multi-deep storage systems with heterogeneous item configurations. Each item is associated with a specific due date, and the objective is to minimize total tardiness. To effectively capture the system’s topology, we introduce a graph-based state representation that integrates both item attributes and the local topological structure of the multi-deep warehouse. To process this representation, we design a novel neural network architecture that combines a Graph Neural Network (GNN) with a Transformer model. The GNN encodes topological and item-specific information into embeddings for all directly accessible items, while the Transformer maps these embeddings into global priority assignments. The Transformer’s strong generalization capability further allows our approach to be applied to storage systems with diverse layouts. Extensive numerical experiments, including comparisons with heuristic methods, demonstrate the superiority of the proposed neural network architecture and the effectiveness of the trained agent in optimizing retrieval tardiness.
zh

[AI-61] WebXAII: an open-source web framework to study human-XAI interaction

【速读】:该论文试图解决当前研究中人类与可解释人工智能(XAI)系统交互时,因缺乏标准化平台而导致的实验可重复性和接口可复用性不足的问题。解决方案的关键在于设计并实现了一个开源的Web框架WebXAII,该框架能够承载完整的实验协议,通过通用视图和模块的复合架构提供高度灵活性,并通过结构化配置文件降低实验协议实现的编程门槛,从而支持研究者高效地构建和共享实验环境。

链接: https://arxiv.org/abs/2506.14777
作者: Jules Leguy,Pierre-Antoine Jean,Felipe Torres Figueroa,Sébastien Harispe
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This article introduces WebXAII, an open-source web framework designed to facilitate research on human interaction with eXplainable Artificial Intelligence (XAI) systems. The field of XAI is rapidly expanding, driven by the growing societal implications of the widespread adoption of AI (and in particular machine learning) across diverse applications. Researchers who study the interaction between humans and XAI techniques typically develop ad hoc interfaces in order to conduct their studies. These interfaces are usually not shared alongside the results of the studies, which limits their reusability and the reproducibility of experiments. In response, we design and implement WebXAII, a web-based platform that can embody full experimental protocols, meaning that it can present all aspects of the experiment to human participants and record their responses. The experimental protocols are translated into a composite architecture of generic views and modules, which offers a lot of flexibility. The architecture is defined in a structured configuration file, so that protocols can be implemented with minimal programming skills. We demonstrate that WebXAII can effectively embody relevant protocols, by reproducing the protocol of a state-of-the-art study of the literature. The framework is available at this https URL.
zh

[AI-62] See What I Mean? CUE: A Cognitive Model of Understanding Explanations ICIP

【速读】:该论文试图解决当前可解释人工智能(Explainable AI, XAI)评估中过于侧重技术准确性而忽视用户认知可理解性的问题,特别是在视觉障碍用户群体中的表现。其解决方案的关键在于提出CUE(Cognitive Understanding of Explanations)模型,该模型将解释属性与认知子过程(可读性、可理解性和可解释性)相联系,从而为XAI的评估提供更符合人类认知规律的框架。通过实验验证,CUE模型表明解释的可读性直接影响理解效果,并强调了开发适应性XAI界面的重要性。

链接: https://arxiv.org/abs/2506.14775
作者: Tobias Labarta,Nhi Hoang,Katharina Weitz,Wojciech Samek,Sebastian Lapuschkin,Leander Weber
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures (main text), 4 tables, 455-participant user study

点击查看摘要

Abstract:As machine learning systems increasingly inform critical decisions, the need for human-understandable explanations grows. Current evaluations of Explainable AI (XAI) often prioritize technical fidelity over cognitive accessibility, which critically affects users, in particular those with visual impairments. We propose CUE, a model for Cognitive Understanding of Explanations, linking explanation properties to cognitive sub-processes: legibility (perception), readability (comprehension), and interpretability (interpretation). In a study (N=455) testing heatmaps with varying colormaps (BWR, Cividis, Coolwarm), we found comparable task performance but lower confidence/effort for visually impaired users. Contrary to expectations, these gaps were not mitigated, and were sometimes worsened, by accessibility-focused colormaps like Cividis. These results challenge assumptions about perceptual optimization and support the need for adaptive XAI interfaces. They also validate CUE by demonstrating that altering explanation legibility affects understandability. We contribute: (1) a formalized cognitive model for explanation understanding, (2) an integrated definition of human-centered explanation properties, and (3) empirical evidence motivating accessible, user-tailored XAI.
zh

[AI-63] MedSyn: Enhancing Diagnostics with Human-AI Collaboration

【速读】:该论文试图解决临床决策过程中因认知偏差、信息不全和病例模糊性所带来的复杂性问题。其解决方案的关键在于提出一种混合人机框架MedSyn,通过医生与大型语言模型(Large Language Models, LLMs)之间的多步骤、交互式对话,以优化诊断和治疗决策。该框架强调动态交流,允许医生质疑LLM的建议,同时LLM能够提供不同的视角,从而更贴近真实医疗实践的需求。

链接: https://arxiv.org/abs/2506.14774
作者: Burcu Sayin,Ipek Baris Schlicht,Ngoc Vo Hong,Sara Allievi,Jacopo Staiano,Pasquale Minervini,Andrea Passerini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to the Trustworthy and Collaborative Artificial Intelligence Workshop 2025 (TCAI 2025) in the 4th International Conference Series on Hybrid Human-Artificial Intelligence (HHAI 2025)

点击查看摘要

Abstract:Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants. Results show open-source LLMs are promising as physician assistants in the real world. Future work will involve real physician interactions to further validate MedSyn’s usefulness in diagnostic accuracy and patient outcomes.
zh

[AI-64] Recommendations and Reporting Checklist for Rigorous Transparent Human Baselines in Model Evaluations ICML2025

【速读】:该论文试图解决当前基础模型评估中人类基准(human baselines)不够严谨和透明的问题,从而无法实现对人类与人工智能性能的有效比较。其解决方案的关键在于提出一套设计、执行和报告人类基准的框架及相应的报告检查清单,以提高评估的严谨性和可重复性,进而为机器学习社区、下游用户和政策制定者提供更可靠的AI评估依据。

链接: https://arxiv.org/abs/2506.13776
作者: Kevin L. Wei,Patricia Paskov,Sunishchal Dev,Michael J. Byun,Anka Reuel,Xavier Roberts-Gaal,Rachel Calcott,Evie Coxon,Chinmay Deshpande
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: A version of this paper has been accepted to ICML 2025 as a position paper (spotlight), with the title: “Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations Reporting Checklist).”

点击查看摘要

Abstract:In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve “super-human” performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: this https URL
zh

[AI-65] Improved Image Reconstruction and Diffusion Parameter Estimation Using a Temporal Convolutional Network Model of Gradient Trajectory Errors

【速读】:该论文旨在解决梯度轨迹误差在磁共振成像中引入的显著伪影和失真问题,特别是在非笛卡尔成像序列中,由于梯度波形不完美导致图像质量下降的问题。解决方案的关键在于开发一种通用的非线性梯度系统模型,该模型利用卷积网络准确预测梯度失真,通过训练时间卷积网络来预测成像系统产生的梯度波形,并将其集成到图像重建流程中,从而提升图像质量和扩散参数映射效果。

链接: https://arxiv.org/abs/2506.14995
作者: Jonathan B. Martin,Hannah E. Alderson,John C. Gore,Mark D. Does,Kevin D. Harkins
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Summary: Errors in gradient trajectories introduce significant artifacts and distortions in magnetic resonance images, particularly in non-Cartesian imaging sequences, where imperfect gradient waveforms can greatly reduce image quality. Purpose: Our objective is to develop a general, nonlinear gradient system model that can accurately predict gradient distortions using convolutional networks. Methods: A set of training gradient waveforms were measured on a small animal imaging system, and used to train a temporal convolutional network to predict the gradient waveforms produced by the imaging system. Results: The trained network was able to accurately predict nonlinear distortions produced by the gradient system. Network prediction of gradient waveforms was incorporated into the image reconstruction pipeline and provided improvements in image quality and diffusion parameter mapping compared to both the nominal gradient waveform and the gradient impulse response function. Conclusion: Temporal convolutional networks can more accurately model gradient system behavior than existing linear methods and may be used to retrospectively correct gradient errors.
zh
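补充示例:论文用时间卷积网络建模梯度系统的非线性响应。下面是一个因果膨胀卷积 TCN 的最小示意,输入为指令梯度波形、输出为预测的实际波形;通道数、核宽与层数均为笔者假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """因果卷积:输出只依赖当前及过去的采样点,符合梯度系统的物理因果性。"""
    def __init__(self, c_in, c_out, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # 只在左侧补零

tcn = nn.Sequential(                               # 膨胀率逐层翻倍以扩大感受野
    CausalConv1d(1, 32, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, dilation=2), nn.ReLU(),
    CausalConv1d(32, 32, dilation=4), nn.ReLU(),
    CausalConv1d(32, 1, dilation=8),
)
commanded = torch.randn(4, 1, 1024)  # 4 条指令梯度波形,各 1024 个采样点
print(tcn(commanded).shape)           # torch.Size([4, 1, 1024])
```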

[AI-66] hinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition INTERSPEECH2025

【速读】:该论文旨在解决Speech LLMs在处理多通道音频及空间线索方面的能力不足问题,特别是如何实现方向性语音识别、声源定位以及旁观者干扰抑制。其解决方案的关键在于提出两种关键技术:序列化方向输出训练(S-DOT)和对比方向数据增强(CDDA),以提升模型对方向性的理解能力,从而有效捕捉文本线索与空间音频之间的关系。

链接: https://arxiv.org/abs/2506.14973
作者: Jiamin Xie,Ju Lin,Yiteng Huang,Tyler Vuong,Zhaojiang Lin,Zhaojun Yang,Peng Su,Prashant Rawat,Sangeeta Srivastava,Ming Sun,Florian Metze
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Recent studies have demonstrated that prompting large language models (LLM) with audio encodings enables effective speech recognition capabilities. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains a relatively uninvestigated area of research. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model’s ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that our proposed directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks.
zh

[AI-67] Forecasting the spatiotemporal evolution of fluid-induced microearthquakes with deep learning

【速读】:该论文旨在解决在地热能增强系统(Enhanced Geothermal Systems, EGS)等地质工程应用中,对由地下流体注入产生的微地震事件(Microearthquakes, MEQs)的时空演化进行准确预测的问题。其解决方案的关键在于提出一种基于Transformer的深度学习模型,该模型能够整合水力压裂历史和先前的MEQ观测数据,以预测四个关键参数:累计MEQ数量、累计对数地震矩以及MEQ云的第50百分位和第95百分位范围(P_50, P_95)。该模型在EGS Collab Experiment 1数据集上表现出色,实现了高精度的预测,并通过学习的标准差项提供不确定性估计,从而支持实时推断裂缝扩展和渗透率演化。

链接: https://arxiv.org/abs/2506.14923
作者: Jaehong Chung,Michael Manga,Timothy Kneafsey,Tapan Mukerji,Mengsu Hu
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Microearthquakes (MEQs) generated by subsurface fluid injection record the evolving stress state and permeability of reservoirs. Forecasting their full spatiotemporal evolution is therefore critical for applications such as enhanced geothermal systems (EGS), CO$_2$ sequestration and other geo-engineering applications. We present a transformer-based deep learning model that ingests hydraulic stimulation history and prior MEQ observations to forecast four key quantities: cumulative MEQ count, cumulative logarithmic seismic moment, and the 50th- and 95th-percentile extents ($P_{50}$, $P_{95}$) of the MEQ cloud. Applied to the EGS Collab Experiment 1 dataset, the model achieves $R^2 > 0.98$ for the 1-second forecast horizon and $R^2 > 0.88$ for the 15-second forecast horizon across all targets, and supplies uncertainty estimates through a learned standard deviation term. These accurate, uncertainty-quantified forecasts enable real-time inference of fracture propagation and permeability evolution, demonstrating the strong potential of deep-learning approaches to improve seismic-risk assessment and guide mitigation strategies in future fluid-injection operations.
zh

[AI-68] Identifiability by common backdoor in summary causal graphs of time series

【速读】:该论文试图解决在仅有总结性因果图(summary causal graph)的情况下,针对时间序列数据中的多个干预措施和多个效应的可识别性问题(identifiability problem),即判断这些干预的总效应是否能够通过无do算子的公式表达,并仅基于观察数据进行计算。解决方案的关键在于通过共同后门集(common backdoor set)来实现可识别性,并针对具有和不具时间一致性的时间序列,分别建立了存在此类后门集的条件,同时提供了复杂度有限的算法以判断问题是否可识别。

链接: https://arxiv.org/abs/2506.14862
作者: Clément Yvernes,Charles K. Assaad,Emilie Devijver,Eric Gaussier
机构: 未知
类目: atistics Theory (math.ST); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The identifiability problem for interventions aims at assessing whether the total effect of some given interventions can be written with a do-free formula, and thus be computed from observational data only. We study this problem, considering multiple interventions and multiple effects, in the context of time series when only abstractions of the true causal graph in the form of summary causal graphs are available. We focus in this study on identifiability by a common backdoor set, and establish, for time series with and without consistency throughout time, conditions under which such a set exists. We also provide algorithms of limited complexity to decide whether the problem is identifiable or not.
zh
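补充说明:作为理解"共同后门集"的背景,经典的后门调整公式如下(因果推断中的标准结果,这里仅作背景补充;论文的贡献在于针对总结性因果图与时间序列给出此类集合存在的条件):

```latex
% 若集合 Z 相对于 (X, Y) 满足后门准则,则干预效应可由观测分布识别:
P\big(Y \mid \mathrm{do}(X = x)\big) = \sum_{z} P\big(Y \mid X = x,\, Z = z\big)\, P(Z = z)
```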

[AI-69] BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models

【速读】:该论文试图解决当前转录组基础模型(Transcriptomic Foundation Models, TFMs)在模型实现和训练策略上的多样性导致难以评估单一设计选择的贡献或其潜在协同效应的问题,这限制了领域内最佳实践的形成以及研究结果的可重复性。解决方案的关键在于提出BMFM-RNA,一个开源、模块化的软件包,它在一个框架内统一了多种TFM的预训练和微调目标,并引入了一种新的训练目标——全细胞表达解码器(whole cell expression decoder, WCED),该目标通过类似自编码器的CLS瓶颈表示捕捉全局表达模式,从而提升了模型在零样本和微调任务中的性能,达到了或超过了如scGPT等最先进的方法。

链接: https://arxiv.org/abs/2506.14861
作者: Bharath Dandala,Michael M. Danziger,Ella Barkan,Tanwi Biswas,Viatcheslav Gurev,Jianying Hu,Matthew Madgwick,Akira Koseki,Tal Kozlovski,Michal Rosen-Zvi,Yishai Shimoni,Ching-Huei Tsou
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field’s ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks. BMFM-RNA, available as part of the biomed-multi-omics project ( this https URL ), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.
zh

[AI-70] Next-Generation Conflict Forecasting: Unleashing Predictive Patterns through Spatiotemporal Learning

【速读】:该论文试图解决在高时空分辨率下预测暴力冲突的问题,这是研究人员和政策制定者面临的核心挑战。其解决方案的关键在于提出一种新型的神经网络架构,该架构基于蒙特卡洛丢弃长短期记忆(LSTM)U-Net,结合卷积层以捕捉空间依赖性,并利用循环结构建模时间动态,从而实现对国家行为体、非国家行为体和单边暴力三种不同类型暴力的预测。该模型无需人工特征工程,仅依赖历史冲突数据,能够自主学习复杂的时空模式,并生成概率估计和预期事件规模,同时量化预测不确定性。

链接: https://arxiv.org/abs/2506.14817
作者: Simon P. von der Maase
机构: 未知
类目: Other Statistics (stat.OT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)
备注: 33 pages, 9 figures, 3 tables. Presented at workshops hosted by PRIO, AFK (German Association for Peace and Conflict Studies), CCEW (Bundeswehr University Munich), Uppsala University, SODAS (University of Copenhagen) and in briefings with UN agencies including UNIDIR, OCHA, and FAO

点击查看摘要

Abstract:Forecasting violent conflict at high spatial and temporal resolution remains a central challenge for both researchers and policymakers. This study presents a novel neural network architecture for forecasting three distinct types of violence – state-based, non-state, and one-sided – at the subnational (priogrid-month) level, up to 36 months in advance. The model jointly performs classification and regression tasks, producing both probabilistic estimates and expected magnitudes of future events. It achieves state-of-the-art performance across all tasks and generates approximate predictive posterior distributions to quantify forecast uncertainty. The architecture is built on a Monte Carlo Dropout Long Short-Term Memory (LSTM) U-Net, integrating convolutional layers to capture spatial dependencies with recurrent structures to model temporal dynamics. Unlike many existing approaches, it requires no manual feature engineering and relies solely on historical conflict data. This design enables the model to autonomously learn complex spatiotemporal patterns underlying violent conflict. Beyond achieving state-of-the-art predictive performance, the model is also highly extensible: it can readily integrate additional data sources and jointly forecast auxiliary variables. These capabilities make it a promising tool for early warning systems, humanitarian response planning, and evidence-based peacebuilding initiatives.
zh
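The predictive posterior in this architecture comes from Monte Carlo Dropout: dropout stays active at inference time and repeated stochastic forward passes approximate a posterior over forecasts. A minimal PyTorch sketch of that mechanism, with all shapes and layer sizes as illustrative assumptions (the actual model is a convolutional LSTM U-Net):

```python
import torch
import torch.nn as nn

class MCDropoutLSTM(nn.Module):
    """Toy stand-in for the paper's recurrent backbone; shapes are illustrative."""
    def __init__(self, n_features=8, hidden=64, p_drop=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.drop = nn.Dropout(p_drop)           # kept stochastic at test time
        self.head = nn.Linear(hidden, 1)         # logit of violence probability

    def forward(self, x):
        out, _ = self.lstm(x)                    # (batch, time, hidden)
        return self.head(self.drop(out[:, -1]))  # use last time step

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=100):
    """Approximate the predictive posterior by sampling with dropout on."""
    model.train()  # keeps dropout active; in practice freeze batch-norm etc.
    draws = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return draws.mean(0), draws.std(0)  # point forecast and uncertainty

model = MCDropoutLSTM()
x = torch.randn(4, 36, 8)  # 4 grid cells, 36 months of history, 8 features
mean, std = mc_dropout_predict(model, x)
```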

[AI-71] MODS: Multi-source Observations Conditional Diffusion Model for Meteorological State Downscaling

【速读】: This paper addresses the problem of acquiring high-resolution surface meteorological conditions: directly applying spatial interpolation to derive meteorological values at specific locations from low-resolution grid fields often yields results that deviate substantially from the actual conditions. Existing downscaling methods mainly rely on the coupling between geostationary satellites and ERA5 variables as a condition, but brightness temperature data from geostationary satellites alone cannot comprehensively capture the changes of all meteorological variables in ERA5 maps. The key to the solution is to exploit a wider range of satellite data, making fuller use of their inversion effects on diverse meteorological variables and thereby producing results that better match reality. To this end, the paper proposes the Multi-source Observation Down-Scaling Model (MODS), a conditional diffusion model that fuses multi-source geostationary satellite data (GridSat), polar-orbiting satellite data (AMSU-A, HIRS, and MHS), and topographic data (GEBCO) as conditions, and is pre-trained on the ERA5 reanalysis dataset. Latent features from the different conditional inputs are fused into ERA5 maps via a multi-source cross-attention module, yielding meteorological states that align more closely with the real environment.

链接: https://arxiv.org/abs/2506.14798
作者: Siwei Tu,Jingyi Xu,Weidong Yang,Lei Bai,Ben Fei
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate acquisition of high-resolution surface meteorological conditions is critical for forecasting and simulating meteorological variables. Directly applying spatial interpolation methods to derive meteorological values at specific locations from low-resolution grid fields often yields results that deviate significantly from the actual conditions. Existing downscaling methods primarily rely on the coupling relationship between geostationary satellites and ERA5 variables as a condition. However, using brightness temperature data from geostationary satellites alone fails to comprehensively capture all the changes in meteorological variables in ERA5 maps. To address this limitation, we can use a wider range of satellite data to make more full use of its inversion effects on various meteorological variables, thus producing more realistic results across different meteorological variables. To further improve the accuracy of downscaling meteorological variables at any location, we propose the Multi-source Observation Down-Scaling Model (MODS). It is a conditional diffusion model that fuses data from multiple geostationary satellites GridSat, polar-orbiting satellites (AMSU-A, HIRS, and MHS), and topographic data (GEBCO), as conditions, and is pre-trained on the ERA5 reanalysis dataset. During training, latent features from diverse conditional inputs are extracted separately and fused into ERA5 maps via a multi-source cross-attention module. By exploiting the inversion relationships between reanalysis data and multi-source atmospheric variables, MODS generates atmospheric states that align more closely with real-world conditions. During sampling, MODS enhances downscaling consistency by incorporating low-resolution ERA5 maps and station-level meteorological data as guidance. Experimental results demonstrate that MODS achieves higher fidelity when downscaling ERA5 maps to a 6.25 km resolution.
zh
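The fusion step described above is a cross-attention in which ERA5 latents query the concatenated satellite and topography latents. A minimal PyTorch sketch; the token layout, dimensions, and residual structure are assumptions for illustration, not MODS's actual design:

```python
import torch
import torch.nn as nn

class MultiSourceCrossAttention(nn.Module):
    """ERA5 latents attend to concatenated condition latents
    (GridSat, AMSU-A, HIRS, MHS, GEBCO)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, era5_tokens, condition_tokens_list):
        cond = torch.cat(condition_tokens_list, dim=1)  # (B, sum L_i, dim)
        fused, _ = self.attn(query=era5_tokens, key=cond, value=cond)
        return self.norm(era5_tokens + fused)           # residual fusion

B, dim = 2, 256
era5 = torch.randn(B, 1024, dim)                        # ERA5 map latents
conds = [torch.randn(B, n, dim) for n in (256, 64, 64, 64, 128)]
out = MultiSourceCrossAttention(dim)(era5, conds)
```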

[AI-72] PFMBench: Protein Foundation Model Benchmark

【速读】: This paper addresses the lack of a comprehensive evaluation benchmark in current protein foundation model research, which limits a deeper understanding of model generalization and limitations. The key to the solution is PFMBench, a comprehensive benchmark spanning 38 tasks across 8 key areas of protein science; through extensive experiments on 17 state-of-the-art models, it reveals the inherent correlations between tasks, identifies the top-performing models, and provides a standardized evaluation protocol.

链接: https://arxiv.org/abs/2506.14796
作者: Zhangyang Gao,Hao Wang,Cheng Tan,Chenrui Xu,Mengdi Liu,Bozhen Hu,Linlin Chao,Xiaoming Zhang,Stan Z. Li
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study investigates the current landscape and future directions of protein foundation model research. While recent advancements have transformed protein science and engineering, the field lacks a comprehensive benchmark for fair evaluation and in-depth understanding. Since ESM-1B, numerous protein foundation models have emerged, each with unique datasets and methodologies. However, evaluations often focus on limited tasks tailored to specific models, hindering insights into broader generalization and limitations. Specifically, researchers struggle to understand the relationships between tasks, assess how well current models perform across them, and determine the criteria in developing new foundation models. To fill this gap, we present PFMBench, a comprehensive benchmark evaluating protein foundation models across 38 tasks spanning 8 key areas of protein science. Through hundreds of experiments on 17 state-of-the-art models across 38 tasks, PFMBench reveals the inherent correlations between tasks, identifies top-performing models, and provides a streamlined evaluation protocol. Code is available on GitHub at this https URL.
zh

[AI-73] Comparative Analysis of QNN Architectures for Wind Power Prediction: Feature Maps and Ansatz Configurations

【速读】: This paper addresses the contested question of whether Quantum Machine Learning (QML) offers practical advantages over classical machine learning methods, particularly in light of the skepticism raised by the limitations of current noisy intermediate-scale quantum (NISQ) devices. The key to the solution is to construct and evaluate multiple Quantum Neural Network (QNN) configurations, optimizing quantum circuit design by combining distinct quantum feature maps with different entanglement strategies, and to validate the superior predictive performance of QNNs on a wind energy dataset.

链接: https://arxiv.org/abs/2506.14795
作者: Batuhan Hangun,Emine Akpinar,Oguz Altun,Onder Eyecioglu
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 6 pages, 2 figures

点击查看摘要

Abstract:Quantum Machine Learning (QML) is an emerging field at the intersection of quantum computing and machine learning, aiming to enhance classical machine learning methods by leveraging quantum mechanics principles such as entanglement and superposition. However, skepticism persists regarding the practical advantages of QML, mainly due to the current limitations of noisy intermediate-scale quantum (NISQ) devices. This study addresses these concerns by extensively assessing Quantum Neural Networks (QNNs), quantum-inspired counterparts of Artificial Neural Networks (ANNs), demonstrating their effectiveness compared to classical methods. We systematically construct and evaluate twelve distinct QNN configurations, utilizing two unique quantum feature maps combined with six different entanglement strategies for ansatz design. Experiments conducted on a wind energy dataset reveal that QNNs employing the Z feature map achieve up to 93% prediction accuracy when forecasting wind power output using only four input parameters. Our findings show that QNNs outperform classical methods in predictive tasks, underscoring the potential of QML in real-world applications.
zh
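The abstract does not name a toolkit; assuming a Qiskit-style workflow, the twelve configurations (two feature maps × six entanglement strategies) could be enumerated roughly as below. The choice of `ZZFeatureMap` as the second feature map and the `TwoLocal` ansatz settings are guesses for illustration only.

```python
from qiskit.circuit.library import ZFeatureMap, ZZFeatureMap, TwoLocal

N_QUBITS = 4  # one qubit per input parameter, matching the 4-feature setup
feature_maps = {"Z": ZFeatureMap(N_QUBITS), "ZZ": ZZFeatureMap(N_QUBITS)}
entanglements = ["full", "linear", "reverse_linear", "pairwise", "circular", "sca"]

configs = {}
for fm_name, fm in feature_maps.items():
    for ent in entanglements:
        ansatz = TwoLocal(N_QUBITS, rotation_blocks="ry",
                          entanglement_blocks="cx", entanglement=ent, reps=2)
        configs[(fm_name, ent)] = fm.compose(ansatz)  # data encoding -> ansatz

print(len(configs))  # 12 distinct QNN circuits to train and compare
```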

[AI-74] MixRep: Hidden Representation Mixup for Low-Resource Speech Recognition INTERSPEECH2023

【速读】: This paper aims to address data scarcity in low-resource automatic speech recognition (ASR) by introducing MixRep, a simple yet effective data augmentation strategy that improves model generalization. At its core, MixRep is a mixup-based method that interpolates the feature dimensions of hidden representations in the neural network; it can be applied both to the acoustic feature input and to the output of each layer, generalizing the earlier MixSpeech method. In addition, the approach combines mixup with a regularization along the time axis of the input, which proves complementary and significantly improves performance.

链接: https://arxiv.org/abs/2310.18450
作者: Jiamin Xie,John H.L. Hansen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2023

点击查看摘要

Abstract:In this paper, we present MixRep, a simple and effective data augmentation strategy based on mixup for low-resource ASR. MixRep interpolates the feature dimensions of hidden representations in the neural network that can be applied to both the acoustic feature input and the output of each layer, which generalizes the previous MixSpeech method. Further, we propose to combine the mixup with a regularization along the time axis of the input, which is shown as complementary. We apply MixRep to a Conformer encoder of an E2E LAS architecture trained with a joint CTC loss. We experiment on the WSJ dataset and subsets of the SWB dataset, covering reading and telephony conversational speech. Experimental results show that MixRep consistently outperforms other regularization methods for low-resource ASR. Compared to a strong SpecAugment baseline, MixRep achieves a +6.5% and a +6.7% relative WER reduction on the eval92 set and the Callhome part of the eval’2000 set.
zh
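The core interpolation is standard mixup applied to a hidden representation rather than to raw inputs. A minimal PyTorch sketch under the usual mixup loss weighting; the layer at which it is applied and the CTC loss pairing follow the paper's setup, while names and defaults here are illustrative:

```python
import torch

def mixrep(hidden, targets, alpha=0.2):
    """Mixup on a hidden representation (B, T, D): interpolate feature
    dimensions between pairs of utterances drawn from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(hidden.size(0), device=hidden.device)
    mixed = lam * hidden + (1.0 - lam) * hidden[perm]
    return mixed, targets[perm], lam

# Inside the encoder's forward pass, at the input or after any layer k:
#   h, y_b, lam = mixrep(h, y)
#   loss = lam * ctc_loss(logits, y) + (1 - lam) * ctc_loss(logits, y_b)
```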

[AI-75] DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition INTERSPEECH2022

【速读】: This paper targets a modeling limitation of conventional convolutional neural networks (CNNs) for localized time-frequency patterns in speech recognition: the assumption that local patterns appear within symmetric, rigid kernels, which may restrict the model's ability to adapt to complex features. The key to the solution is to introduce deformable convolution operations in place of the depthwise convolutions in the Conformer architecture, yielding a new structure dubbed the "Deformer". Through adaptive local receptive fields combined with the global attention mechanism, it strengthens associations between features and achieves a significant word error rate (WER) reduction on the WSJ eval92 set.

链接: https://arxiv.org/abs/2207.01732
作者: Jiamin Xie,John H.L. Hansen
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Accepted to Interspeech 2022

点击查看摘要

Abstract:Convolutional neural networks (CNN) have improved speech recognition performance greatly by exploiting localized time-frequency patterns. But these patterns are assumed to appear in symmetric and rigid kernels by the conventional CNN operation. It motivates the question: What about asymmetric kernels? In this study, we illustrate adaptive views can discover local features which couple better with attention than fixed views of the input. We replace depthwise CNNs in the Conformer architecture with a deformable counterpart, dubbed this “Deformer”. By analyzing our best-performing model, we visualize both local receptive fields and global attention maps learned by the Deformer and show increased feature associations on the utterance level. The statistical analysis of learned kernel offsets provides an insight into the change of information in features with the network depth. Finally, replacing only half of the layers in the encoder, the Deformer improves +5.6% relative WER without a LM and +6.4% relative WER with a LM over the Conformer baseline on the WSJ eval92 set.
zh
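To illustrate the "asymmetric kernel" idea, here is a from-scratch sketch of a deformable depthwise 1D convolution over time: each kernel tap samples the input at a learned fractional offset, realized with linear interpolation. This is not the authors' implementation; the kernel size, the offset predictor, and sharing offsets across channels are all assumptions.

```python
import torch
import torch.nn as nn

class DeformableDepthwiseConv1d(nn.Module):
    """Depthwise 1D conv whose sampling locations are shifted by learned offsets."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.weight = nn.Parameter(torch.randn(channels, kernel_size) * 0.02)
        # predicts one offset per tap and time step, shared across channels
        self.offset_pred = nn.Conv1d(channels, kernel_size, 3, padding=1)

    def forward(self, x):                                   # x: (B, C, T)
        B, C, T = x.shape
        taps = torch.arange(self.k, device=x.device) - self.k // 2
        pos = torch.arange(T, device=x.device)
        loc = pos[None, None, :] + taps[None, :, None] + self.offset_pred(x)
        loc = loc.clamp(0, T - 1)                           # (B, k, T), fractional
        lo = loc.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        frac = (loc - lo).unsqueeze(1)                      # (B, 1, k, T)

        def gather(idx):                                    # (B, k, T) -> (B, C, k, T)
            idx = idx.unsqueeze(1).expand(B, C, self.k, T)
            return torch.gather(x.unsqueeze(2).expand(B, C, self.k, T), 3, idx)

        samples = (1 - frac) * gather(lo) + frac * gather(hi)    # linear interp.
        return (samples * self.weight[None, :, :, None]).sum(2)  # (B, C, T)

out = DeformableDepthwiseConv1d(channels=64)(torch.randn(2, 64, 100))
```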

机器学习

[LG-0] A Data-Integrated Framework for Learning Fractional-Order Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2506.15665
作者: Bahram Yaghooti,Chengyu Li,Bruno Sinopoli
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a data-integrated framework for learning the dynamics of fractional-order nonlinear systems in both discrete-time and continuous-time settings. The proposed framework consists of two main steps. In the first step, input-output experiments are designed to generate the necessary datasets for learning the system dynamics, including the fractional order, the drift vector field, and the control vector field. In the second step, these datasets, along with the memory-dependent property of fractional-order systems, are used to estimate the system’s fractional order. The drift and control vector fields are then reconstructed using orthonormal basis functions. To validate the proposed approach, the algorithm is applied to four benchmark fractional-order systems. The results confirm the effectiveness of the proposed framework in learning the system dynamics accurately. Finally, the same datasets are used to learn equivalent integer-order models. The numerical comparisons demonstrate that fractional-order models better capture long-range dependencies, highlighting the limitations of integer-order representations.
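The memory dependence the framework exploits to estimate the fractional order comes straight from the definition of a fractional derivative. As general background (the abstract does not state which discretization the authors adopt), the Grünwald-Letnikov form makes the long memory explicit:

```latex
\[
  D^{\alpha} x(t_k) \;\approx\; \frac{1}{h^{\alpha}}
  \sum_{j=0}^{k} (-1)^{j} \binom{\alpha}{j}\, x(t_k - j h),
  \qquad
  \binom{\alpha}{j} = \frac{\alpha(\alpha-1)\cdots(\alpha-j+1)}{j!}.
\]
```

Every past sample x(t_k − jh) enters the sum with a slowly decaying weight, which is exactly the long-range dependence the paper finds integer-order models fail to capture.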

[LG-1] On the Upper Bounds for the Matrix Spectral Norm

链接: https://arxiv.org/abs/2506.15660
作者: Alexey Naumov,Maxim Rakhuba,Denis Ryapolov,Sergey Samsonov
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider the problem of estimating the spectral norm of a matrix using only matrix-vector products. We propose a new Counterbalance estimator that provides upper bounds on the norm and derive probabilistic guarantees on its underestimation. Compared to standard approaches such as the power method, the proposed estimator produces significantly tighter upper bounds in both synthetic and real-world settings. Our method is especially effective for matrices with fast-decaying spectra, such as those arising in deep learning and inverse problems.
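The access model can be made concrete with the standard matrix-free baseline the paper compares against: power iteration, which only ever touches A through matrix-vector products and *under*-estimates the norm. The Counterbalance construction itself (which instead certifies upper bounds) is not reproduced here; this sketch only illustrates the setting.

```python
import numpy as np

def power_method_lower_bound(matvec, rmatvec, n, iters=50, seed=0):
    """Matrix-free baseline: ||A v|| for a unit v under-estimates ||A||_2."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = rmatvec(matvec(v))      # one step of power iteration on A^T A
        v = w / np.linalg.norm(w)
    return np.linalg.norm(matvec(v))

# A matrix with a fast-decaying spectrum, the regime the paper highlights
A = np.random.default_rng(1).standard_normal((200, 100)) * np.logspace(0, -3, 100)
est = power_method_lower_bound(lambda x: A @ x, lambda y: A.T @ y, 100)
print(est, "<=", np.linalg.norm(A, 2))
```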

[LG-2] CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization

链接: https://arxiv.org/abs/2506.15654
作者: Ranting Hu
类目: Machine Learning (cs.LG)
*备注: 23 pages, 14 figures

点击查看摘要

Abstract:Offline reinforcement learning (offline RL) algorithms often require additional constraints or penalty terms to address distribution shift issues, such as adding implicit or explicit policy constraints during policy optimization to reduce the estimation bias of functions. This paper focuses on a limitation of the Advantage-Weighted Regression family (AWRs), i.e., the potential for learning over-conservative policies due to data corruption, specifically the poor explorations in suboptimal offline data. We study it from two perspectives: (1) how poor explorations impact the theoretically optimal policy based on KL divergence, and (2) how such poor explorations affect the approximation of the theoretically optimal policy. We prove that such over-conservatism is mainly caused by the sensitivity of the loss function for policy optimization to poor explorations, and the proportion of poor explorations in offline datasets. To address this concern, we propose Corruption-Averse Advantage-Weighted Regression (CAWR), which incorporates a set of robust loss functions during policy optimization and an advantage-based prioritized experience replay method to filter out poor explorations. Numerical experiments on the D4RL benchmark show that our method can learn superior policies from suboptimal offline data, significantly enhancing the performance of policy optimization.
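A minimal PyTorch sketch of the two ingredients named in the abstract. Huber is used as a stand-in robust loss, since the abstract only says "a set of robust loss functions" is incorporated; all names and hyperparameters are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cawr_loss(log_probs, advantages, beta=1.0, delta=1.0):
    """AWR objective with a robust (Huber) transform of the per-sample term,
    damping the influence of poor explorations on the policy update."""
    weights = torch.clamp(torch.exp(advantages / beta), max=20.0)
    per_sample = -weights * log_probs                 # vanilla AWR term
    zero = torch.zeros_like(per_sample)
    return F.huber_loss(per_sample, zero, delta=delta, reduction="mean")

def prioritized_indices(advantages, batch_size=256):
    """Advantage-based prioritized replay: transitions with higher estimated
    advantage are sampled more often, filtering out poor explorations."""
    probs = torch.softmax(advantages, dim=0)
    return torch.multinomial(probs, batch_size, replacement=True)
```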

[LG-3] deepSURF: Detecting Memory Safety Vulnerabilities in Rust Through Fuzzing LLM -Augmented Harnesses

链接: https://arxiv.org/abs/2506.15648
作者: Georgios Androutsopoulos,Antonio Bianchi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Although Rust ensures memory safety by default, it also permits the use of unsafe code, which can introduce memory safety vulnerabilities if misused. Unfortunately, existing tools for detecting memory bugs in Rust typically exhibit limited detection capabilities, inadequately handle Rust-specific types, or rely heavily on manual intervention. To address these limitations, we present deepSURF, a tool that integrates static analysis with Large Language Model (LLM)-guided fuzzing harness generation to effectively identify memory safety vulnerabilities in Rust libraries, specifically targeting unsafe code. deepSURF introduces a novel approach for handling generics by substituting them with custom types and generating tailored implementations for the required traits, enabling the fuzzer to simulate user-defined behaviors within the fuzzed library. Additionally, deepSURF employs LLMs to augment fuzzing harnesses dynamically, facilitating exploration of complex API interactions and significantly increasing the likelihood of exposing memory safety vulnerabilities. We evaluated deepSURF on 27 real-world Rust crates, successfully rediscovering 20 known memory safety bugs and uncovering 6 previously unknown vulnerabilities, demonstrating clear improvements over state-of-the-art tools.

[LG-4] Memory-Efficient Differentially Private Training with Gradient Random Projection

链接: https://arxiv.org/abs/2506.15588
作者: Alex Mulrooney,Devansh Gupta,James Flemings,Huanyu Zhang,Murali Annavaram,Meisam Razaviyayn,Xinwei Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. Rather than directly applying DP to GaLore, DP-GRAPE introduces three key modifications: (1) gradients are privatized after projection, (2) random Gaussian matrices replace SVD-based subspaces, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility trade-off comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters.
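A compact sketch of the privatize-after-projection idea as described in the abstract: project per-sample gradients onto a random Gaussian subspace (no SVD needed), clip there, add noise, then map back. The exact ordering, scaling, and back-projection in DP-GRAPE may differ; this is an illustration of the mechanism, not the authors' implementation.

```python
import torch

def dp_grape_step(per_sample_grads, k=64, clip=1.0, sigma=1.0):
    """Project per-sample gradients to a random k-dim subspace, clip,
    add Gaussian noise, average, and lift back to R^d."""
    n, d = per_sample_grads.shape
    P = torch.randn(d, k) / k ** 0.5                 # random Gaussian basis
    low = per_sample_grads @ P                       # (n, k) projections
    norms = low.norm(dim=1, keepdim=True).clamp(min=1e-12)
    low = low * torch.clamp(clip / norms, max=1.0)   # per-sample clipping
    noisy = low.sum(0) + sigma * clip * torch.randn(k)
    return (noisy / n) @ P.T                         # private update in R^d

grads = torch.randn(32, 10_000)    # 32 per-sample gradients, d = 10,000
update = dp_grape_step(grads)
```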

[LG-5] MicroRicci: A Greedy and Local Ricci Flow Solver for Self-Tuning Mesh Smoothing

链接: https://arxiv.org/abs/2506.15571
作者: Le Vu Anh,Nguyen Viet Anh,Mehmet Dik,Tu Nguyen Thi Ngoc
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注: 9 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Real-time mesh smoothing at scale remains a formidable challenge: classical Ricci-flow solvers demand costly global updates, while greedy heuristics suffer from slow convergence or brittle tuning. We present MicroRicci, the first truly self-tuning, local Ricci-flow solver that borrows ideas from coding theory and packs them into just 1K + 200 parameters. Its primary core is a greedy syndrome-decoding step that pinpoints and corrects the largest curvature error in O(E) time, augmented by two tiny neural modules that adaptively choose vertices and step sizes on the fly. On a diverse set of 110 SJTU-TMQA meshes, MicroRicci slashes iteration counts from 950±140 to 400±80 (2.4x speedup), tightens curvature spread from 0.19 to 0.185, and achieves a remarkable UV-distortion-to-MOS correlation of r = -0.93. It adds only 0.25 ms per iteration (0.80 to 1.05 ms), yielding an end-to-end 1.8x runtime acceleration over state-of-the-art methods. MicroRicci’s combination of linear-time updates, automatic hyperparameter adaptation, and high-quality geometric and perceptual results makes it well suited for real-time, resource-limited applications in graphics, simulation, and related fields.

[LG-6] Task-Agnostic Experts Composition for Continual Learning

链接: https://arxiv.org/abs/2506.15566
作者: Luigi Quarantiello,Andrea Cossu,Vincenzo Lomonaco
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compositionality is one of the fundamental abilities of the human reasoning process, allowing a complex problem to be decomposed into simpler elements. This property is crucial for neural networks as well, especially when aiming for a more efficient and sustainable AI framework. We propose a compositional approach that ensembles a set of expert models in a zero-shot fashion, assessing our methodology using a challenging benchmark designed to test compositionality capabilities. We show that our Expert Composition method achieves much higher accuracy than baseline algorithms while requiring fewer computational resources, hence being more efficient.

[LG-7] Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2506.15544
作者: Roger Creus Castanyer,Johan Obando-Ceron,Lu Li,Pierre-Luc Bacon,Glen Berseth,Aaron Courville,Pablo Samuel Castro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure mode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.

[LG-8] A Simplified Analysis of SGD for Linear Regression with Weight Averaging

链接: https://arxiv.org/abs/2506.15535
作者: Alexandru Meterez,Depen Morwani,Costin-Andrei Oncescu,Jingfeng Wu,Cengiz Pehlevan,Sham Kakade
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Theoretically understanding stochastic gradient descent (SGD) in overparameterized models has led to the development of several optimization algorithms that are widely used in practice today. Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using constant learning rate, both with and without tail iterate averaging, based on a bias-variance decomposition of the risk. In our work, we provide a simplified analysis recovering the same bias and variance bounds provided in Zou et al. (2021) based on simple linear algebra tools, bypassing the requirement to manipulate operators on positive semi-definite (PSD) matrices. We believe our work makes the analysis of SGD on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling, leading to improvements in the training of realistic models.

[LG-9] Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models

链接: https://arxiv.org/abs/2506.15530
作者: Teysir Baoueb,Xiaoyu Bie,Xi Wang,Gaël Richard
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a well-chosen intermediate timestep, identified through an instrument classifier, yields a balance between preserving the original piece’s content and achieving the desired timbre. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process’s speed.

[LG-10] Insights on Adversarial Attacks for Tabular Machine Learning via a Systematic Literature Review

链接: https://arxiv.org/abs/2506.15506
作者: Salijona Dyrmishi,Mohamed Djilani,Thibault Simonetto,Salah Ghamizi,Maxime Cordy
类目: Machine Learning (cs.LG)
*备注: This paper is currently under review at ACM Computing Surveys

点击查看摘要

Abstract:Adversarial attacks in machine learning have been extensively reviewed in areas like computer vision and NLP, but research on tabular data remains scattered. This paper provides the first systematic literature review focused on adversarial attacks targeting tabular machine learning models. We highlight key trends, categorize attack strategies and analyze how they address practical considerations for real-world applicability. Additionally, we outline current challenges and open research questions. By offering a clear and structured overview, this review aims to guide future efforts in understanding and addressing adversarial vulnerabilities in tabular machine learning.

[LG-11] LIT-LVM: Structured Regularization for Interaction Terms in Linear Predictors using Latent Variable Models

链接: https://arxiv.org/abs/2506.15492
作者: Mohammadreza Nemati,Zhipeng Huang,Kevin S. Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Some of the simplest, yet most frequently used predictors in statistics and machine learning use weighted linear combinations of features. Such linear predictors can model non-linear relationships between features by adding interaction terms corresponding to the products of all pairs of features. We consider the problem of accurately estimating coefficients for interaction terms in linear predictors. We hypothesize that the coefficients for different interaction terms have an approximate low-dimensional structure and represent each feature by a latent vector in a low-dimensional space. This low-dimensional representation can be viewed as a structured regularization approach that further mitigates overfitting in high-dimensional settings beyond standard regularizers such as the lasso and elastic net. We demonstrate that our approach, called LIT-LVM, achieves superior prediction accuracy compared to elastic net and factorization machines on a wide variety of simulated and real data, particularly when the number of interaction terms is high compared to the number of samples. LIT-LVM also provides low-dimensional latent representations for features that are useful for visualizing and analyzing their relationships.
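The low-rank structural idea can be written down in a few lines: the interaction-coefficient matrix is the Gram matrix of per-feature latent vectors, so predictions cost O(dr) via the factorization-machine identity. A sketch of that structure only; LIT-LVM's exact loss and regularizers are not reproduced here.

```python
import torch
import torch.nn as nn

class LowRankInteractionModel(nn.Module):
    """Linear predictor with pairwise interactions whose coefficients are
    parameterized as V V^T through latent feature vectors (rank r << d)."""
    def __init__(self, d, r=8):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(d))
        self.V = nn.Parameter(torch.randn(d, r) * 0.01)  # latent vectors
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):                 # x: (batch, d)
        linear = x @ self.w
        xv = x @ self.V                   # (batch, r)
        # sum_{i<j} <v_i, v_j> x_i x_j in O(d r), the factorization-machine trick
        inter = 0.5 * ((xv ** 2).sum(1) - ((x ** 2) @ (self.V ** 2)).sum(1))
        return linear + inter + self.b

y_hat = LowRankInteractionModel(d=100)(torch.randn(16, 100))
```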

[LG-12] Creating User-steerable Projections with Interactive Semantic Mapping

链接: https://arxiv.org/abs/2506.15479
作者: Artur André Oliveira,Mateus Espadoto,Roberto Hirata Jr.,Roberto M. Cesar Jr.,Alex C. Telea
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dimensionality reduction (DR) techniques map high-dimensional data into lower-dimensional spaces. Yet, current DR techniques are not designed to explore semantic structure that is not directly available in the form of variables or class labels. We introduce a novel user-guided projection framework for image and text data that enables customizable, interpretable, data visualizations via zero-shot classification with Multimodal Large Language Models (MLLMs). We enable users to steer projections dynamically via natural-language guiding prompts, to specify high-level semantic relationships of interest to the users which are not explicitly present in the data dimensions. We evaluate our method across several datasets and show that it not only enhances cluster separation, but also transforms DR into an interactive, user-driven process. Our approach bridges the gap between fully automated DR techniques and human-centered data exploration, offering a flexible and adaptive way to tailor projections to specific analytical needs.

[LG-13] All is Not Lost: LLM Recovery without Checkpoints

链接: https://arxiv.org/abs/2506.15461
作者: Nikolay Blagoev,Oğuzhan Ersoy,Lydia Yiyu Chen
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator’s scheduling policies, leading to losing a stage - a part of the model. The conventional approaches to recover from failures are to either use checkpointing, where periodically a copy of the entire model is sent to an additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose, CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighbouring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, behaviour of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbour. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of model sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: this https URL.
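The recovery rule itself is tiny; here is a sketch under the assumption that neighboring pipeline stages share the same parameter layout (as in homogeneous transformer stages), with `stages` as an illustrative mapping from stage index to state dict:

```python
import torch

def checkfree_recover(stages, failed_idx, weights=(0.5, 0.5)):
    """Rebuild a lost intermediate pipeline stage as a weighted average of
    its two neighbors' parameters, per the CheckFree recovery rule."""
    left, right = stages[failed_idx - 1], stages[failed_idx + 1]
    recovered = {name: weights[0] * left[name] + weights[1] * right[name]
                 for name in left}
    stages[failed_idx] = recovered
    return stages
```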

[LG-14] Semi-supervised Graph Anomaly Detection via Robust Homophily Learning

链接: https://arxiv.org/abs/2506.15448
作者: Guoguo Ai,Hezhe Qiao,Hui Yan,Guansong Pang
类目: Machine Learning (cs.LG)
*备注: 18 pages, 11 figures, 3 tables

点击查看摘要

Abstract:Semi-supervised graph anomaly detection (GAD) utilizes a small set of labeled normal nodes to identify abnormal nodes from a large set of unlabeled nodes in a graph. Current methods in this line posit that 1) normal nodes share a similar level of homophily and 2) the labeled normal nodes can well represent the homophily patterns in the normal class. However, this assumption often does not hold well since normal nodes in a graph can exhibit diverse homophily in real-world GAD datasets. In this paper, we propose RHO, namely Robust Homophily Learning, to adaptively learn such homophily patterns. RHO consists of two novel modules, adaptive frequency response filters (AdaFreq) and graph normality alignment (GNA). AdaFreq learns a set of adaptive spectral filters that capture different frequency components of the labeled normal nodes with varying homophily in the channel-wise and cross-channel views of node attributes. GNA is introduced to enforce consistency between the channel-wise and cross-channel homophily representations to robustify the normality learned by the filters in the two views. Experiments on eight real-world GAD datasets show that RHO can effectively learn varying, often under-represented, homophily in the small normal node set and substantially outperforms state-of-the-art competing methods. Code is available at this https URL.

[LG-15] Learn to Vaccinate: Combining Structure Learning and Effective Vaccination for Epidemic and Outbreak Control

链接: https://arxiv.org/abs/2506.15397
作者: Sepehr Elahi,Paula Mürmann,Patrick Thiran
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:The Susceptible-Infected-Susceptible (SIS) model is a widely used model for the spread of information and infectious diseases, particularly non-immunizing ones, on a graph. Given a highly contagious disease, a natural question is how to best vaccinate individuals to minimize the disease’s extinction time. While previous works showed that the problem of optimal vaccination is closely linked to the NP-hard Spectral Radius Minimization (SRM) problem, they assumed that the graph is known, which is often not the case in practice. In this work, we consider the problem of minimizing the extinction time of an outbreak modeled by an SIS model where the graph on which the disease spreads is unknown and only the infection states of the vertices are observed. To this end, we split the problem into two: learning the graph and determining effective vaccination strategies. We propose a novel inclusion-exclusion-based learning algorithm and, unlike previous approaches, establish its sample complexity for graph recovery. We then detail an optimal algorithm for the SRM problem and prove that its running time is polynomial in the number of vertices for graphs with bounded treewidth. This is complemented by an efficient and effective polynomial-time greedy heuristic for any graph. Finally, we present experiments on synthetic and real-world data that numerically validate our learning and vaccination algorithms.
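The abstract's "efficient and effective polynomial-time greedy heuristic" is not specified in detail; one natural instantiation, shown below purely for illustration, vaccinates at each step the node whose removal most reduces the spectral radius of the contact graph:

```python
import numpy as np

def greedy_vaccinate(adj, budget):
    """Greedy Spectral Radius Minimization: repeatedly isolate the node
    whose removal most reduces the adjacency matrix's leading eigenvalue."""
    A = adj.astype(float).copy()
    chosen = []
    for _ in range(budget):
        best, best_rho = -1, np.inf
        for v in range(len(A)):
            if v in chosen:
                continue
            B = A.copy()
            B[v, :] = B[:, v] = 0.0
            rho = np.max(np.abs(np.linalg.eigvals(B)))
            if rho < best_rho:
                best, best_rho = v, rho
        chosen.append(best)
        A[best, :] = A[:, best] = 0.0
    return chosen

rng = np.random.default_rng(0)
adj = (rng.random((30, 30)) < 0.1).astype(float)
adj = np.maximum(adj, adj.T)          # undirected contact graph
print(greedy_vaccinate(adj, budget=3))
```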

[LG-16] Provable Maximum Entropy Manifold Exploration via Diffusion Models ICML2025

链接: https://arxiv.org/abs/2506.15385
作者: Riccardo De Santi,Marin Vlastelica,Ya-Ping Hsieh,Zebang Shen,Niao He,Andreas Krause
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Exploration is critical for solving real-world decision-making problems such as scientific discovery, where the objective is to generate truly novel designs rather than mimic existing data distributions. In this work, we address the challenge of leveraging the representational power of generative models for exploration without relying on explicit uncertainty quantification. We introduce a novel framework that casts exploration as entropy maximization over the approximate data manifold implicitly defined by a pre-trained diffusion model. Then, we present a novel principle for exploration based on density estimation, a problem well-known to be challenging in practice. To overcome this issue and render this method truly scalable, we leverage a fundamental connection between the entropy of the density induced by a diffusion model and its score function. Building on this, we develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model. We prove its convergence to the optimal exploratory diffusion model under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we empirically evaluate our approach on both synthetic and high-dimensional text-to-image diffusion, demonstrating promising results.

[LG-17] Global Ground Metric Learning with Applications to scRNA data

链接: https://arxiv.org/abs/2506.15383
作者: Damin Kühn,Michael T. Schaub
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: This method is provided as a Python package on PyPI, see this https URL

点击查看摘要

Abstract:Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.
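Once a global ground metric is learned, plugging it into optimal transport is mechanical. A sketch using the POT library (the paper ships its own PyPI package; POT is used here only for illustration), with the learned linear map L replaced by an identity stand-in since the training procedure from distribution-level labels is not reproduced:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_distance_with_learned_metric(X, Y, L):
    """OT cost between two point clouds under a learned global ground metric
    d(x, y) = ||L x - L y||_2 (a Mahalanobis-style map)."""
    XL, YL = X @ L.T, Y @ L.T
    M = ot.dist(XL, YL, metric="euclidean")        # ground-cost matrix
    a = np.full(len(X), 1.0 / len(X))              # uniform weights
    b = np.full(len(Y), 1.0 / len(Y))
    return ot.emd2(a, b, M)                        # exact OT cost

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 10)), rng.normal(1.0, 1.0, size=(60, 10))
L = np.eye(10)                                     # stand-in for a learned map
print(ot_distance_with_learned_metric(X, Y, L))
```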

[LG-18] Sampling 3D Molecular Conformers with Diffusion Transformers

链接: https://arxiv.org/abs/2506.15378
作者: J. Thorben Frank,Winfried Ripken,Gregor Lied,Klaus-Robert Müller,Oliver T. Unke,Stefan Chmiela
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling, particularly in image synthesis, making them a compelling choice for molecular conformer generation. However, applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry, handling Euclidean symmetries, and designing conditioning mechanisms that generalize across molecules of varying sizes and structures. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture that separates the processing of 3D coordinates from conditioning on atomic connectivity. To this end, we introduce two complementary graph-based conditioning strategies that integrate seamlessly with the DiT architecture. These are combined with different attention mechanisms, including both standard non-equivariant and SO(3)-equivariant formulations, enabling flexible control over the trade-off between accuracy and computational efficiency. Experiments on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL) demonstrate that DiTMC achieves state-of-the-art precision and physical validity. Our results highlight how architectural choices and symmetry priors affect sample quality and efficiency, suggesting promising directions for large-scale generative modeling of molecular structures. Code available at this https URL.

[LG-19] Enhancing One-run Privacy Auditing with Quantile Regression-Based Membership Inference

链接: https://arxiv.org/abs/2506.15349
作者: Terrance Liu,Matteo Boglioni,Yiwei Fu,Shengyuan Hu,Pratiksha Thaker,Zhiwei Steven Wu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Differential privacy (DP) auditing aims to provide empirical lower bounds on the privacy guarantees of DP mechanisms like DP-SGD. While some existing techniques require many training runs that are prohibitively costly, recent work introduces one-run auditing approaches that effectively audit DP-SGD in white-box settings while still being computationally efficient. However, in the more practical black-box setting where gradients cannot be manipulated during training and only the last model iterate is observed, prior work shows that there is still a large gap between the empirical lower bounds and theoretical upper bounds. Consequently, in this work, we study how incorporating approaches for stronger membership inference attacks (MIA) can improve one-run auditing in the black-box setting. Evaluating on image classification models trained on CIFAR-10 with DP-SGD, we demonstrate that our proposed approach, which utilizes quantile regression for MIA, achieves tighter bounds while crucially maintaining the computational efficiency of one-run methods.

[LG-20] Acoustic Waveform Inversion with Image-to-Image Schrödinger Bridges

链接: https://arxiv.org/abs/2506.15346
作者: A.S. Stankevich,I.B. Petrov
类目: Machine Learning (cs.LG)
*备注: Submitted to “Computational Mathematics And Mathematical Physics”, ISSN 1555-6662, issue 8, August 2025

点击查看摘要

Abstract:Recent developments in application of deep learning models to acoustic Full Waveform Inversion (FWI) are marked by the use of diffusion models as prior distributions for Bayesian-like inference procedures. The advantage of these methods is the ability to generate high-resolution samples, which are otherwise unattainable with classical inversion methods or other deep learning-based solutions. However, the iterative and stochastic nature of sampling from diffusion models along with heuristic nature of output control remain limiting factors for their applicability. For instance, an optimal way to include the approximate velocity model into diffusion-based inversion scheme remains unclear, even though it is considered an essential part of FWI pipeline. We address the issue by employing a Schrödinger Bridge that interpolates between the distributions of ground truth and smoothed velocity models. To facilitate the learning of nonlinear drifts that transfer samples between distributions we extend the concept of Image-to-Image Schrödinger Bridge (I²SB) to conditional sampling, resulting in a conditional Image-to-Image Schrödinger Bridge (cI²SB) framework. To validate our method, we assess its effectiveness in reconstructing the reference velocity model from its smoothed approximation, coupled with the observed seismic signal of fixed shape. Our experiments demonstrate that the proposed solution outperforms our reimplementation of conditional diffusion model suggested in earlier works, while requiring only a few neural function evaluations (NFEs) to achieve sample fidelity superior to that attained with supervised learning-based approach. The supplementary code implementing the algorithms described in this paper can be found in the repository this https URL.

[LG-21] Knowledge Distillation Framework for Accelerating High-Accuracy Neural Network-Based Molecular Dynamics Simulations

链接: https://arxiv.org/abs/2506.15337
作者: Naoki Matsumura,Yuta Yoshimoto,Yuto Iwasaki,Meguru Yamazaki,Yasufumi Sakai
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Neural network potentials (NNPs) offer a powerful alternative to traditional force fields for molecular dynamics (MD) simulations. Accurate and stable MD simulations, crucial for evaluating material properties, require training data encompassing both low-energy stable structures and high-energy structures. Conventional knowledge distillation (KD) methods fine-tune a pre-trained NNP as a teacher model to generate training data for a student model. However, in material-specific models, this fine-tuning process increases energy barriers, making it difficult to create training data containing high-energy structures. To address this, we propose a novel KD framework that leverages a non-fine-tuned, off-the-shelf pre-trained NNP as a teacher. Its gentler energy landscape facilitates the exploration of a wider range of structures, including the high-energy structures crucial for stable MD simulations. Our framework employs a two-stage training process: first, the student NNP is trained with a dataset generated by the off-the-shelf teacher; then, it is fine-tuned with a smaller, high-accuracy density functional theory (DFT) dataset. We demonstrate the effectiveness of our framework by applying it to both organic (polyethylene glycol) and inorganic (Li₁₀GeP₂S₁₂) materials, achieving comparable or superior accuracy in reproducing physical properties compared to existing methods. Importantly, our method reduces the number of expensive DFT calculations by 10x compared to existing NNP generation methods, without sacrificing accuracy.

[LG-22] Universal Laboratory Model: prognosis of abnormal clinical outcomes based on routine tests

链接: https://arxiv.org/abs/2506.15330
作者: Pavel Karpov,Ilya Petrenkov,Ruslan Raiman
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Clinical laboratory results are ubiquitous in any diagnosis making. Predicting abnormal values of tests that were not prescribed, based on the results of performed tests, is an intriguing prospect, as it would make early diagnosis available to everyone. A special place is occupied by the Complete Blood Count (CBC) test, as it is the most widely used clinical procedure. Combining routine biochemical panels with CBC presents a set of test-value pairs that varies from patient to patient, or, in common settings, a table with missing values. Here we formulate a tabular modeling problem as a set translation problem where the source set comprises pairs of GPT-like label column embedding and its corresponding value while the target set consists of the same type embeddings only. The proposed approach can effectively deal with missing values without implicitly estimating them and bridges the world of LLMs with the tabular domain. Applying this method to clinical laboratory data, we achieve an improvement up to 8% AUC for joint predictions of high uric acid, glucose, cholesterol, and low ferritin levels.

[LG-23] SecFwT: Efficient Privacy-Preserving Fine-Tuning of Large Language Models Using Forward-Only Passes

链接: https://arxiv.org/abs/2506.15307
作者: Jinglong Luo,Zhuo Zhang,Yehong Zhang,Shiyu Liu,Ye Dong,Xun Zhou,Hui Wang,Yue Yu,Zenglin Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed numerous fields, yet their adaptation to specialized tasks in privacy-sensitive domains, such as healthcare and finance, is constrained by the scarcity of accessible training data due to stringent privacy requirements. Secure multi-party computation (MPC)-based privacy-preserving machine learning offers a powerful approach to protect both model parameters and user data, but its application to LLMs has been largely limited to inference, as fine-tuning introduces significant computational challenges, particularly in privacy-preserving backward propagation and optimizer operations. This paper identifies two primary obstacles to MPC-based privacy-preserving fine-tuning of LLMs: (1) the substantial computational overhead of backward and optimizer processes, and (2) the inefficiency of softmax-based attention mechanisms in MPC settings. To address these challenges, we propose SecFwT, the first MPC-based framework designed for efficient, privacy-preserving LLM fine-tuning. SecFwT introduces a forward-only tuning paradigm to eliminate backward and optimizer computations and employs MPC-friendly Random Feature Attention to approximate softmax attention, significantly reducing costly non-linear operations and computational complexity. Experimental results demonstrate that SecFwT delivers substantial improvements in efficiency and privacy preservation, enabling scalable and secure fine-tuning of LLMs for privacy-critical applications.
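Random Feature Attention replaces the softmax kernel exp(q·k) with an inner product of randomized feature maps, so the T×T attention matrix (and most of its costly non-linearities, which are expensive under MPC) is never formed. A sketch using Performer-style positive random features; the particular feature map SecFwT uses is an assumption here, not taken from the paper.

```python
import torch

def positive_features(x, W):
    """phi(x) = exp(Wx - ||x||^2 / 2) / sqrt(m); E[phi(q).phi(k)] = exp(q.k)."""
    return torch.exp(x @ W.T - 0.5 * (x ** 2).sum(-1, keepdim=True)) / W.shape[0] ** 0.5

def rfa_attention(q, k, v, n_features=256):
    d = q.shape[-1]
    W = torch.randn(n_features, d)   # fixed random projections (shared in practice)
    qf = positive_features(q / d ** 0.25, W)          # (B, T, m)
    kf = positive_features(k / d ** 0.25, W)
    kv = torch.einsum("btm,btd->bmd", kf, v)          # linear in sequence length
    z = kf.sum(dim=1)                                 # (B, m) normalizer
    num = torch.einsum("btm,bmd->btd", qf, kv)
    den = torch.einsum("btm,bm->bt", qf, z).unsqueeze(-1)
    return num / den.clamp(min=1e-6)

q = k = v = torch.randn(2, 128, 64)
out = rfa_attention(q, k, v)   # no T x T softmax matrix is ever formed
```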

[LG-24] Conditional Generative Modeling for Enhanced Credit Risk Management in Supply Chain Finance

链接: https://arxiv.org/abs/2506.15305
作者: Qingkai Zhang,L. Jeff Hong,Houmin Yan
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:

点击查看摘要

Abstract:The rapid expansion of cross-border e-commerce (CBEC) has created significant opportunities for small and medium-sized enterprises (SMEs), yet financing remains a critical challenge due to SMEs’ limited credit histories. Third-party logistics (3PL)-led supply chain finance (SCF) has emerged as a promising solution, leveraging in-transit inventory as collateral. We propose an advanced credit risk management framework tailored for 3PL-led SCF, addressing the dual challenges of credit risk assessment and loan size determination. Specifically, we leverage conditional generative modeling of sales distributions through Quantile-Regression-based Generative Metamodeling (QRGMM) as the foundation for risk estimation. We propose a unified framework that enables flexible estimation of multiple risk measures while introducing a functional risk measure formulation that systematically captures the relationship between these risk measures and varying loan levels, supported by theoretical guarantees. To capture complex covariate interactions in e-commerce sales data, we integrate QRGMM with Deep Factorization Machines (DeepFM). Extensive experiments on synthetic and real-world data validate the efficacy of our model for credit risk assessment and loan size determination. This study represents a pioneering application of generative AI in CBEC SCF risk management, offering a solid foundation for enhanced credit practices and improved SME access to capital.
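A miniature of the quantile-regression generative idea: fit one conditional quantile model per level, then sample by inverting the piecewise-linear conditional CDF. This is a sketch of the concept, not the paper's exact QRGMM estimator (which the authors pair with DeepFM features); model class, quantile grid, and data are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

QUANTILES = np.linspace(0.05, 0.95, 19)

def fit_qrgmm(X, y):
    """One conditional quantile model per level; sampling inverts the
    piecewise-linear conditional CDF (inverse-transform sampling)."""
    models = [GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
              for q in QUANTILES]

    def sample(x_new, n=1000, seed=0):
        rng = np.random.default_rng(seed)
        qs = np.sort([m.predict(x_new.reshape(1, -1))[0] for m in models])
        u = rng.uniform(QUANTILES[0], QUANTILES[-1], size=n)
        return np.interp(u, QUANTILES, qs)

    return sample

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.gamma(2.0, size=500)      # skewed 'sales' target
sampler = fit_qrgmm(X, y)
draws = sampler(X[0])                             # conditional sales draws
print(np.quantile(draws, 0.05))                   # a downside risk measure
```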

[LG-25] DOVA-PATBM: An Intelligent Adaptive and Scalable Framework for Optimizing Large-Scale EV Charging Infrastructure

链接: https://arxiv.org/abs/2506.15289
作者: Chuan Li,Shunyu Zhao,Vincent Gauthier,Hassine Moungla
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The accelerating uptake of battery-electric vehicles demands infrastructure planning tools that are both data-rich and geographically scalable. Whereas most prior studies optimise charging locations for single cities, state-wide and national networks must reconcile the conflicting requirements of dense metropolitan cores, car-dependent exurbs, and power-constrained rural corridors. We present DOVA-PATBM (Deployment Optimisation with Voronoi-oriented, Adaptive, POI-Aware Temporal Behaviour Model), a geo-computational framework that unifies these contexts in a single pipeline. The method rasterises heterogeneous data (roads, population, night lights, POIs, and feeder lines) onto a hierarchical H3 grid, infers intersection importance with a zone-normalised graph neural network centrality model, and overlays a Voronoi tessellation that guarantees at least one five-port DC fast charger within every 30 km radius. Hourly arrival profiles, learned from loop-detector and floating-car traces, feed a finite M/M/c queue to size ports under feeder-capacity and outage-risk constraints. A greedy maximal-coverage heuristic with income-weighted penalties then selects the minimum number of sites that satisfy coverage and equity targets. Applied to the State of Georgia, USA, DOVA-PATBM (i) increases 30 km tile coverage by 12 percentage points, (ii) halves the mean distance that low-income residents travel to the nearest charger, and (iii) meets sub-transmission headroom everywhere – all while remaining computationally tractable for national-scale roll-outs. These results demonstrate that a tightly integrated, GNN-driven, multi-resolution approach can bridge the gap between academic optimisation and deployable infrastructure policy.
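The port-sizing step rests on classical M/M/c queueing. A sketch of Erlang-C-based sizing in the spirit of that step, with illustrative rates and service-level target (the paper's feeder-capacity and outage-risk constraints are omitted):

```python
from math import factorial

def erlang_c_wait_prob(arrival_rate, service_rate, c):
    """P(wait > 0) in an M/M/c queue (the Erlang C formula)."""
    a = arrival_rate / service_rate            # offered load (Erlangs)
    rho = a / c
    if rho >= 1.0:
        return 1.0                             # unstable: queue always forms
    p0_inv = sum(a ** n / factorial(n) for n in range(c))
    p0_inv += a ** c / (factorial(c) * (1.0 - rho))
    return (a ** c / (factorial(c) * (1.0 - rho))) / p0_inv

def size_ports(arrival_rate, service_rate, target=0.2, c_max=50):
    """Smallest port count keeping the queueing probability below target."""
    for c in range(1, c_max + 1):
        if erlang_c_wait_prob(arrival_rate, service_rate, c) <= target:
            return c
    return c_max

# e.g., 6 vehicles/hour arriving, 2 charge completions/hour per port
print(size_ports(arrival_rate=6.0, service_rate=2.0))
```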

[LG-26] Centroid Approximation for Byzantine-Tolerant Federated Learning

链接: https://arxiv.org/abs/2506.15264
作者: Mélanie Cambus,Darya Melnyk,Tijana Milentijević,Stefan Schmid
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 19 pages, 10 figures

点击查看摘要

Abstract:Federated learning allows each client to keep its data locally when training machine learning models in a distributed setting. Significant recent research established the requirements that the input must satisfy in order to guarantee convergence of the training loop. This line of work uses averaging as the aggregation rule for the training models. In particular, we are interested in whether federated learning is robust to Byzantine behavior, and observe and investigate a tradeoff between the average/centroid and the validity conditions from distributed computing. We show that the various validity conditions alone do not guarantee a good approximation of the average. Furthermore, we show that reaching good approximation does not give good results in experimental settings due to possible Byzantine outliers. Our main contribution is the first lower bound of min{(n-t)/t, √d} on the centroid approximation under box validity that is often considered in the literature, where n is the number of clients, t the upper bound on the number of Byzantine faults, and d is the dimension of the machine learning model. We complement this lower bound by an upper bound of 2·min{n, √d}, by providing a new analysis for the case n < d. In addition, we present a new algorithm that achieves a √(2d)-approximation under convex validity, which also proves that the existing lower bound in the literature is tight. We show that all presented bounds can also be achieved in the distributed peer-to-peer setting. We complement our analytical results with empirical evaluations in federated stochastic gradient descent and federated averaging settings.

[LG-27] Minimizing Structural Vibrations via Guided Flow Matching Design Optimization

链接: https://arxiv.org/abs/2506.15263
作者: Jan van Delden,Julius Schultz,Sebastian Rothe,Christian Libner,Sabine C. Langer,Timo Lüddecke
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Structural vibrations are a source of unwanted noise in engineering systems like cars, trains or airplanes. Minimizing these vibrations is crucial for improving passenger comfort. This work presents a novel design optimization approach based on guided flow matching for reducing vibrations by placing beadings (indentations) in plate-like structures. Our method integrates a generative flow matching model and a surrogate model trained to predict structural vibrations. During the generation process, the flow matching model pushes towards manufacturability while the surrogate model pushes to low-vibration solutions. The flow matching model and its training data implicitly define the design space, enabling a broader exploration of potential solutions as no optimization of manually-defined design parameters is required. We apply our method to a range of differentiable optimization objectives, including direct optimization of specific eigenfrequencies through careful construction of the objective function. Results demonstrate that our method generates diverse and manufacturable plate designs with reduced structural vibrations compared to designs from random search, a criterion-based design heuristic and genetic optimization. The code and data are available from this https URL.

[LG-28] Context-Aware Deep Lagrangian Networks for Model Predictive Control IROS

链接: https://arxiv.org/abs/2506.15249
作者: Lucas Schulze,Jan Peters,Oleg Arenz
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

点击查看摘要

Abstract:Controlling a robot based on physics-informed dynamic models, such as deep Lagrangian networks (DeLaN), can improve the generalizability and interpretability of the resulting behavior. However, in complex environments, the number of objects to potentially interact with is vast, and their physical properties are often uncertain. This complexity makes it infeasible to employ a single global model. Therefore, we need to resort to online system identification of context-aware models that capture only the currently relevant aspects of the environment. While physical principles such as the conservation of energy may not hold across varying contexts, ensuring physical plausibility for any individual context-aware model can still be highly desirable, particularly when using it for receding horizon control methods such as Model Predictive Control (MPC). Hence, in this work, we extend DeLaN to make it context-aware, combine it with a recurrent network for online system identification, and integrate it with a MPC for adaptive, physics-informed control. We also combine DeLaN with a residual dynamics model to leverage the fact that a nominal model of the robot is typically available. We evaluate our method on a 7-DOF robot arm for trajectory tracking under varying loads. Our method reduces the end-effector tracking error by 39%, compared to a 21% improvement achieved by a baseline that uses an extended Kalman filter.

[LG-29] Interpretability and Generalization Bounds for Learning Spatial Physics

链接: https://arxiv.org/abs/2506.15199
作者: Alejandro Francisco Queiruga,Theo Gutman-Solo,Shuai Jiang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. For scientific applications, actual quantitative accuracy is crucial. This work applies the rigor of numerical analysis for differential equations to machine learning by specifically quantifying the accuracy of applying different ML techniques to the elementary 1D Poisson differential equation. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. We prove generalization bounds and convergence rates under finite data discretizations and restricted training data subspaces by analyzing the training dynamics and deriving optimal parameters for both a white-box differential equation discovery method and a black-box linear model. The analytically derived generalization bounds are replicated empirically. Similar lack of generalization is empirically demonstrated for deep linear models, shallow neural networks, and physics-specific DeepONets and Neural Operators. We theoretically and empirically demonstrate that generalization to the true physical equation is not guaranteed in each explored case. Surprisingly, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also demonstrate a new mechanistic interpretability lens on scientific models whereby Green’s function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems. We propose applying it to the Poisson equation as an evaluation benchmark of future methods.
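As a concrete reference point for the Green's-function lens, the 1D Poisson benchmark admits a closed form under the standard homogeneous Dirichlet setup (assumed here; the paper's exact boundary conditions are not stated in the abstract):

```latex
\[
  -u''(x) = f(x), \qquad u(0) = u(1) = 0,
\]
\[
  u(x) = \int_0^1 G(x, s)\, f(s)\, ds, \qquad
  G(x, s) =
  \begin{cases}
    x\,(1 - s), & x \le s, \\
    s\,(1 - x), & x > s.
  \end{cases}
\]
```

A learned linear solver that truly generalizes should carry weights that, suitably reshaped, approximate a discretization of G; comparing the two is the kind of mechanistic check the authors describe.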

[LG-30] Learning Task-Agnostic Skill Bases to Uncover Motor Primitives in Animal Behaviors

链接: https://arxiv.org/abs/2506.15190
作者: Jiyi Wang,Jingyang Ke,Bo Dai,Anqi Wu
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 9 pages and 4 figures for the main text

Abstract:Animals flexibly recombine a finite set of core motor primitives to meet diverse task demands, but existing behavior-segmentation methods oversimplify this process by imposing discrete syllables under restrictive generative assumptions. To reflect the animal behavior generation procedure, we introduce skill-based imitation learning (SKIL) for behavior understanding, a reinforcement learning-based imitation framework that (1) infers interpretable skill sets, i.e., latent basis functions of behavior, by leveraging representation learning on transition probabilities, and (2) parameterizes policies as dynamic mixtures of these skills. We validate our approach on a simple grid world, a discrete labyrinth, and unconstrained videos of freely moving animals. Across tasks, it identifies reusable skill components, learns continuously evolving compositional policies, and generates realistic trajectories beyond the capabilities of traditional discrete models. By exploiting generative behavior modeling with compositional representations, our method offers a concise, principled account of how complex animal behaviors emerge from dynamic combinations of fundamental motor primitives.

[LG-31] ImprovDML: Improved Trade-off in Private Byzantine-Resilient Distributed Machine Learning

链接: https://arxiv.org/abs/2506.15181
作者: Bing Liu,Chengcheng Zhao,Li Chai,Peng Cheng,Yaonan Wang
类目: Machine Learning (cs.LG)
备注:

Abstract:Jointly addressing Byzantine attacks and privacy leakage in distributed machine learning (DML) has become an important issue. A common strategy involves integrating Byzantine-resilient aggregation rules with differential privacy mechanisms. However, the incorporation of these techniques often results in a significant degradation in model accuracy. To address this issue, we propose a decentralized DML framework, named ImprovDML, that achieves high model accuracy while simultaneously ensuring privacy preservation and resilience to Byzantine attacks. The framework leverages a class of resilient vector consensus algorithms that can compute a point within the normal (non-Byzantine) agents’ convex hull for resilient aggregation at each iteration. Then, multivariate Gaussian noises are introduced to the gradients for privacy preservation. We provide convergence guarantees and derive asymptotic learning error bounds under non-convex settings, which are tighter than those reported in existing works. For the privacy analysis, we adopt the notion of concentrated geo-privacy, which quantifies privacy preservation based on the Euclidean distance between inputs. We demonstrate that it enables an improved trade-off between privacy preservation and model accuracy compared to differential privacy. Finally, numerical simulations validate our theoretical results.
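
To make the aggregation recipe concrete, here is a minimal NumPy sketch of the general pattern the abstract describes: a resilient aggregation step (a coordinate-wise trimmed mean, a common stand-in for resilient vector consensus, not the paper's exact algorithm) followed by Gaussian noise injection. All names and constants are illustrative.

```python
# Illustrative sketch only, not the authors' exact algorithm: resilient
# aggregation of peer gradients followed by Gaussian noise for privacy.
import numpy as np

def trimmed_mean(grads: np.ndarray, f: int) -> np.ndarray:
    """Coordinate-wise trimmed mean: drop the f largest and f smallest values
    per coordinate, so the result stays within the range spanned by the
    normal agents when at most f agents are Byzantine."""
    sorted_grads = np.sort(grads, axis=0)            # shape: (n_agents, dim)
    return sorted_grads[f:grads.shape[0] - f].mean(axis=0)

def private_resilient_step(w, peer_grads, f, lr=0.1, noise_std=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    agg = trimmed_mean(np.asarray(peer_grads), f)
    agg += rng.normal(0.0, noise_std, size=agg.shape)  # multivariate Gaussian noise
    return w - lr * agg

# Toy usage: 7 agents, 1 Byzantine sending a large bogus gradient.
rng = np.random.default_rng(0)
w = np.zeros(3)
honest = [rng.normal(1.0, 0.1, 3) for _ in range(6)]
byzantine = [np.full(3, 100.0)]
w = private_resilient_step(w, honest + byzantine, f=1, rng=rng)
print(w)  # close to -0.1 * mean of the honest gradients, despite the outlier
```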

[LG-32] In-Context Learning for Gradient-Free Receiver Adaptation: Principles, Applications, and Theory

链接: https://arxiv.org/abs/2506.15176
作者: Matteo Zecchin,Tomer Raviv,Dileep Kalathil,Krishna Narayanan,Nir Shlezinger,Osvaldo Simeone
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

Abstract:In recent years, deep learning has facilitated the creation of wireless receivers capable of functioning effectively in conditions that challenge traditional model-based designs. Leveraging programmable hardware architectures, deep learning-based receivers offer the potential to dynamically adapt to varying channel environments. However, current adaptation strategies, including joint training, hypernetwork-based methods, and meta-learning, either demonstrate limited flexibility or necessitate explicit optimization through gradient descent. This paper presents gradient-free adaptation techniques rooted in the emerging paradigm of in-context learning (ICL). We review architectural frameworks for ICL based on Transformer models and structured state-space models (SSMs), alongside theoretical insights into how sequence models effectively learn adaptation from contextual information. Further, we explore the application of ICL to cell-free massive MIMO networks, providing both theoretical analyses and empirical evidence. Our findings indicate that ICL represents a principled and efficient approach to real-time receiver adaptation using pilot signals and auxiliary contextual information-without requiring online retraining.

[LG-33] Towards Reliable Forgetting: A Survey on Machine Unlearning Verification Challenges and Future Directions

链接: https://arxiv.org/abs/2506.15115
作者: Lulu Xue,Shengshan Hu,Wei Lu,Yan Shen,Dongxu Li,Peijin Guo,Ziqi Zhou,Minghui Li,Yanjun Zhang,Leo Yu Zhang
类目: Machine Learning (cs.LG)
备注:

Abstract:With growing demands for privacy protection, security, and legal compliance (e.g., GDPR), machine unlearning has emerged as a critical technique for ensuring the controllability and regulatory alignment of machine learning models. However, a fundamental challenge in this field lies in effectively verifying whether unlearning operations have been successfully and thoroughly executed. Despite a growing body of work on unlearning techniques, verification methodologies remain comparatively underexplored and often fragmented. Existing approaches lack a unified taxonomy and a systematic framework for evaluation. To bridge this gap, this paper presents the first structured survey of machine unlearning verification methods. We propose a taxonomy that organizes current techniques into two principal categories – behavioral verification and parametric verification – based on the type of evidence used to assess unlearning fidelity. We examine representative methods within each category, analyze their underlying assumptions, strengths, and limitations, and identify potential vulnerabilities in practical deployment. In closing, we articulate a set of open problems in current verification research, aiming to provide a foundation for developing more robust, efficient, and theoretically grounded unlearning verification mechanisms.

[LG-34] Neural Canonical Polyadic Factorization for Traffic Analysis

链接: https://arxiv.org/abs/2506.15079
作者: Yikai Hou,Peng Tang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

Abstract:Modern intelligent transportation systems rely on accurate spatiotemporal traffic analysis to optimize urban mobility and infrastructure resilience. However, pervasive missing data caused by sensor failures and heterogeneous sensing gaps fundamentally hinders reliable traffic modeling. This paper proposes a Neural Canonical Polyadic Factorization (NCPF) model that synergizes low-rank tensor algebra with deep representation learning for robust traffic data imputation. The model innovatively embeds CP decomposition into neural architecture through learnable embedding projections, where sparse traffic tensors are encoded into dense latent factors across road segments, time intervals, and mobility metrics. A hierarchical feature fusion mechanism employs Hadamard products to explicitly model multilinear interactions, while stacked multilayer perceptron layers nonlinearly refine these representations to capture complex spatiotemporal couplings. Extensive evaluations on six urban traffic datasets demonstrate NCPF’s superiority over six state-of-the-art baselines. By unifying CP decomposition’s interpretable factor analysis with neural networks’ nonlinear expressive power, NCPF provides a principled yet flexible approach for high-dimensional traffic data imputation, offering critical support for next-generation transportation digital twins and adaptive traffic control systems.
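
The described architecture maps naturally onto a few lines of PyTorch. The sketch below is one reading of the abstract (embedding tables as CP factor matrices, a Hadamard product for the multilinear interaction, a stacked MLP refinement); all layer sizes are illustrative assumptions, not the paper's design.

```python
# Minimal neural-CP sketch in the spirit of NCPF (sizes are assumptions).
import torch
import torch.nn as nn

class NeuralCP(nn.Module):
    def __init__(self, n_roads, n_times, n_metrics, rank=32):
        super().__init__()
        # Learnable CP factor matrices realized as embedding tables.
        self.road = nn.Embedding(n_roads, rank)
        self.time = nn.Embedding(n_times, rank)
        self.metric = nn.Embedding(n_metrics, rank)
        # Stacked MLP that nonlinearly refines the multilinear interaction.
        self.mlp = nn.Sequential(
            nn.Linear(rank, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, road_idx, time_idx, metric_idx):
        # Hadamard product = elementwise CP interaction of the three factors.
        h = self.road(road_idx) * self.time(time_idx) * self.metric(metric_idx)
        return self.mlp(h).squeeze(-1)

# Train on observed entries only; missing entries are later imputed by forward().
model = NeuralCP(n_roads=100, n_times=288, n_metrics=3)
idx = (torch.randint(0, 100, (16,)), torch.randint(0, 288, (16,)),
       torch.randint(0, 3, (16,)))
y = torch.rand(16)
loss = nn.functional.mse_loss(model(*idx), y)
loss.backward()
```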

[LG-35] CWGAN-GP Augmented CAE for Jamming Detection in 5G-NR in Non-IID Datasets

链接: https://arxiv.org/abs/2506.15075
作者: Samhita Kuili,Mohammadreza Amini,Burak Kantarci
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 6 pages, 5 figures, Accepted to IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC) 2025

Abstract:In the ever-expanding domain of 5G-NR wireless cellular networks, over-the-air jamming attacks are prevalent as security attacks, compromising the quality of the received signal. We simulate a jamming environment by incorporating additive white Gaussian noise (AWGN) into the real-world In-phase and Quadrature (I/Q) OFDM datasets. A Convolutional Autoencoder (CAE) is exploited to implement jamming detection over various characteristics such as heterogeneous I/Q datasets; extracting relevant information on Synchronization Signal Blocks (SSBs), and fewer SSB observations with notable class imbalance. Given the characteristics of the datasets, balanced datasets are acquired by employing a Conv1D conditional Wasserstein Generative Adversarial Network with Gradient Penalty (CWGAN-GP) on both majority and minority SSB observations. Additionally, we compare the performance and detection ability of the proposed CAE model on augmented datasets with benchmark models: Convolutional Denoising Autoencoder (CDAE) and Convolutional Sparse Autoencoder (CSAE). Despite the complexity of the data heterogeneity involved across all datasets, the CAE shows robust detection of jammed signals, achieving average values of 97.33% precision, 91.33% recall, 94.08% F1-score, and 94.35% accuracy, outperforming CDAE and CSAE.

[LG-36] HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

链接: https://arxiv.org/abs/2506.15065
作者: Trishna Chakraborty,Udita Ghosh,Xiaopan Zhang,Fahim Faisal Niloy,Yue Dong,Jiachen Li,Amit K. Roy-Chowdhury,Chengyu Song
类目: Machine Learning (cs.LG); Robotics (cs.RO)
备注:

Abstract:Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.

[LG-37] HiPreNets: High-Precision Neural Networks through Progressive Training

链接: https://arxiv.org/abs/2506.15064
作者: Ethan Mulle,Wei Kang,Qi Gong
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
备注:

Abstract:Deep neural networks are powerful tools for solving nonlinear problems in science and engineering, but training highly accurate models becomes challenging as problem complexity increases. Non-convex optimization and numerous hyperparameters to tune make performance improvement difficult, and traditional approaches often prioritize minimizing mean squared error (MSE) while overlooking L^\infty error, which is the critical focus in many applications. To address these challenges, we present a progressive framework for training and tuning high-precision neural networks (HiPreNets). Our approach refines a previously explored staged training technique for neural networks that improves an existing fully connected neural network by sequentially learning its prediction residuals using additional networks, leading to improved overall accuracy. We discuss how to take advantage of the structure of the residuals to guide the choice of loss function, number of parameters to use, and ways to introduce adaptive data sampling techniques. We validate our framework’s effectiveness through several benchmark problems.
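
The staged residual idea is easy to illustrate with scikit-learn: fit a network, then fit further networks to what the current ensemble still gets wrong, and sum the predictions. The sketch below assumes three stages and a toy 1D target, not the paper's setup.

```python
# Minimal sketch of progressive residual training in the spirit of HiPreNets
# (network sizes and the number of stages are illustrative assumptions).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(4 * np.pi * X[:, 0])                  # target function

stages, residual = [], y.copy()
for _ in range(3):                               # stage 0 fits y, later stages fit residuals
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    net.fit(X, residual)
    stages.append(net)
    residual = residual - net.predict(X)         # what the ensemble still gets wrong

def predict(X_new):
    return sum(net.predict(X_new) for net in stages)

print("max |error| after staging:", np.max(np.abs(predict(X) - y)))
```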

[LG-38] Muon Optimizes Under Spectral Norm Constraints

链接: https://arxiv.org/abs/2506.15054
作者: Lizhang Chen,Jonathan Li,Qiang Liu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

Abstract:The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer [JJB+24] has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-\mathcal{K} family of optimizers [CLLL24]. Specifically, we show that Muon corresponds to Lion-\mathcal{K} when equipped with the nuclear norm, and we leverage the theoretical results of Lion-\mathcal{K} to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map \mathcal{K}, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
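
For readers unfamiliar with Muon, the update being analyzed can be sketched as momentum followed by approximate orthogonalization of the matrix-shaped update. The cubic Newton-Schulz iteration below is a simplification of the tuned quintic polynomial used in practical Muon implementations, so treat the constants and hyperparameters as assumptions.

```python
# Simplified Muon-style step: orthogonalize the momentum buffer so the
# applied direction has (approximately) unit singular values, which is what
# connects Muon to spectral-norm constraints.
import numpy as np

def newton_schulz_orthogonalize(M, steps=10):
    X = M / (np.linalg.norm(M) + 1e-8)           # scale so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X          # pushes singular values toward 1
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    buf = momentum * buf + grad                  # standard momentum buffer
    W = W - lr * newton_schulz_orthogonalize(buf)  # spectrally normalized direction
    return W, buf

rng = np.random.default_rng(0)
W, buf = np.eye(4), np.zeros((4, 4))
W, buf = muon_step(W, grad=rng.normal(size=(4, 4)), buf=buf)
print(np.round(W, 3))
```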

[LG-39] Systems-Theoretic and Data-Driven Security Analysis in ML-enabled Medical Devices

链接: https://arxiv.org/abs/2506.15028
作者: Gargi Mitra,Mohammadreza Hallajiyan,Inji Kim,Athish Pranav Dharmalingam,Mohammed Elnawawy,Shahrear Iqbal,Karthik Pattabiraman,Homa Alemzadeh
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: 32 pages, 6 figures, 6 tables

Abstract:The integration of AI/ML into medical devices is rapidly transforming healthcare by enhancing diagnostic and treatment facilities. However, this advancement also introduces serious cybersecurity risks due to the use of complex and often opaque models, extensive interconnectivity, interoperability with third-party peripheral devices, Internet connectivity, and vulnerabilities in the underlying technologies. These factors contribute to a broad attack surface and make threat prevention, detection, and mitigation challenging. Given the highly safety-critical nature of these devices, a cyberattack on these devices can cause the ML models to mispredict, thereby posing significant safety risks to patients. Therefore, ensuring the security of these devices from the time of design is essential. This paper underscores the urgency of addressing the cybersecurity challenges in ML-enabled medical devices at the pre-market phase. We begin by analyzing publicly available data on device recalls and adverse events, and known vulnerabilities, to understand the threat landscape of AI/ML-enabled medical devices and their repercussions on patient safety. Building on this analysis, we introduce a suite of tools and techniques designed by us to assist security analysts in conducting comprehensive premarket risk assessments. Our work aims to empower manufacturers to embed cybersecurity as a core design principle in AI/ML-enabled medical devices, thereby making them safe for patients.

[LG-40] Private Continual Counting of Unbounded Streams

链接: https://arxiv.org/abs/2506.15018
作者: Ben Jacobsen,Kassem Fawaz
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
备注: 12 pages, 2 figures

Abstract:We study the problem of differentially private continual counting in the unbounded setting where the input size n is not known in advance. Current state-of-the-art algorithms based on optimal instantiations of the matrix mechanism cannot be directly applied here because their privacy guarantees only hold when key parameters are tuned to n. Using the common ‘doubling trick’ avoids knowledge of n but leads to suboptimal and non-smooth error. We solve this problem by introducing novel matrix factorizations based on logarithmic perturbations of the function \frac{1}{\sqrt{1-z}} studied in prior works, which may be of independent interest. The resulting algorithm has smooth error, and for any \alpha > 0 and t \leq n it is able to privately estimate the sum of the first t data points with O(\log^{2+2\alpha}(t)) variance. It requires O(t) space and amortized O(\log t) time per round, compared to O(\log(n)\log(t)) variance, O(n) space and O(n \log n) pre-processing time for the nearly-optimal bounded-input algorithm of Henzinger et al. (SODA 2023). Empirically, we find that our algorithm’s performance is also comparable to theirs in absolute terms: our variance is less than 1.5\times theirs for t as large as 2^{24}.
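
As background for what continual counting asks of an algorithm, here is a sketch of the classic binary (tree) mechanism, the well-known baseline that matrix-factorization approaches like this one improve upon. This is not the paper's method, and the Laplace noise scale is left uncalibrated (in practice it must be set from the privacy budget and the number of tree levels).

```python
# Classic binary mechanism for continual counting: maintain noisy partial
# sums aligned with the binary representation of t; each released prefix-sum
# estimate touches only O(log t) noisy blocks.
import numpy as np

class BinaryMechanism:
    def __init__(self, noise_scale=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_scale = noise_scale     # NOTE: calibration to epsilon omitted
        self.blocks = []                   # blocks[i] = exact sum of a 2^i block, or None
        self.noisy = []                    # matching noisy versions actually used

    def update(self, x):
        carry, i = float(x), 0
        # Merge like binary addition: two blocks of size 2^i -> one of size 2^(i+1).
        while i < len(self.blocks) and self.blocks[i] is not None:
            carry += self.blocks[i]
            self.blocks[i] = self.noisy[i] = None
            i += 1
        if i == len(self.blocks):
            self.blocks.append(None)
            self.noisy.append(None)
        self.blocks[i] = carry
        self.noisy[i] = carry + self.rng.laplace(0.0, self.noise_scale)

    def estimate(self):
        return sum(v for v in self.noisy if v is not None)

mech = BinaryMechanism(noise_scale=1.0)
for x in [1, 0, 1, 1, 0, 1, 1, 1]:
    mech.update(x)
print(mech.estimate())   # noisy running count of the ones seen so far
```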

[LG-41] GCN-Driven Reinforcement Learning for Probabilistic Real-Time Guarantees in Industrial URLLC

链接: https://arxiv.org/abs/2506.15011
作者: Eman Alqudah,Ashfaq Khokhar
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注: This paper has been submitted to IEEE MASS 2025 on May 7, 2025

Abstract:Ensuring packet-level communication quality is vital for ultra-reliable, low-latency communications (URLLC) in large-scale industrial wireless networks. We enhance the Local Deadline Partition (LDP) algorithm by introducing a Graph Convolutional Network (GCN) integrated with a Deep Q-Network (DQN) reinforcement learning framework for improved interference coordination in multi-cell, multi-channel networks. Unlike LDP’s static priorities, our approach dynamically learns link priorities based on real-time traffic demand, network topology, remaining transmission opportunities, and interference patterns. The GCN captures spatial dependencies, while the DQN enables adaptive scheduling decisions through reward-guided exploration. Simulation results show that our GCN-DQN model achieves mean SINR improvements of 179.6%, 197.4%, and 175.2% over LDP across three network configurations. Additionally, the GCN-DQN model demonstrates mean SINR improvements of 31.5%, 53.0%, and 84.7% over our previous CNN-based approach across the same configurations. These results underscore the effectiveness of our GCN-DQN model in addressing complex URLLC requirements with minimal overhead and superior network performance.

[LG-42] A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments

链接: https://arxiv.org/abs/2506.15000
作者: Md Jahangir Alam Khondkar,Ajan Ahmed,Masudul Haider Imtiaz,Stephanie Schuckers
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

Abstract:Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models Wave-U-Net, CMGAN, and U-Net, on diverse datasets such as SpEAR, VPQAD, and Clarkson datasets. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.

[LG-43] CNN-Enabled Scheduling for Probabilistic Real-Time Guarantees in Industrial URLLC

链接: https://arxiv.org/abs/2506.14987
作者: Eman Alqudah,Ashfaq Khokhar
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
备注: This paper has been submitted to IEEEGLOBE2025 on April 15, 2025

Abstract:Ensuring packet-level communication quality is vital for ultra-reliable, low-latency communications (URLLC) in large-scale industrial wireless networks. We enhance the Local Deadline Partition (LDP) algorithm by introducing a CNN-based dynamic priority prediction mechanism for improved interference coordination in multi-cell, multi-channel networks. Unlike LDP’s static priorities, our approach uses a Convolutional Neural Network and graph coloring to adaptively assign link priorities based on real-time traffic, transmission opportunities, and network conditions. Assuming that the first training phase is performed offline, our approach introduces minimal overhead while enabling more efficient resource allocation, boosting network capacity, SINR, and schedulability. Simulation results show SINR gains of up to 113%, 94%, and 49% over LDP across three network configurations, highlighting its effectiveness for complex URLLC scenarios.

[LG-44] Early Prediction of Multiple Sclerosis Disability Progression via Multimodal Foundation Model Benchmarks

链接: https://arxiv.org/abs/2506.14986
作者: Maxime Usdin,Lito Kriara,Licinio Craveiro
类目: Machine Learning (cs.LG)
备注: Accepted to IJCAI 2025

Abstract:Early multiple sclerosis (MS) disability progression prediction is challenging due to disease heterogeneity. This work predicts 48- and 72-week disability using sparse baseline clinical data and 12 weeks of daily digital Floodlight data from the CONSONANCE clinical trial. We employed state-of-the-art tabular and time-series foundation models (FMs), a custom multimodal attention-based transformer, and machine learning methods. Despite the difficulty of early prediction (AUROC 0.63), integrating digital data via advanced models improved performance over clinical data alone. A transformer model using unimodal embeddings from the Moment FM yielded the best result, but our multimodal transformer consistently outperformed its unimodal counterpart, confirming the advantages of combining clinical with digital data. Our findings demonstrate the promise of FMs and multimodal approaches to extract predictive signals from complex and diverse clinical and digital life sciences data (e.g., imaging, omics), enabling more accurate prognostics for MS and potentially other complex diseases.

[LG-45] Extending Spike-Timing Dependent Plasticity to Learning Synaptic Delays

链接: https://arxiv.org/abs/2506.14984
作者: Marissa Dominijanni,Alexander Ororbia,Kenneth W. Regan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
备注: Repository containing the source code used to generate the results is available at: this https URL

Abstract:Synaptic delays play a crucial role in biological neuronal networks, where their modulation has been observed in mammalian learning processes. In the realm of neuromorphic computing, although spiking neural networks (SNNs) aim to emulate biology more closely than traditional artificial neural networks do, synaptic delays are rarely incorporated into their simulation. We introduce a novel learning rule for simultaneously learning synaptic connection strengths and delays, by extending spike-timing dependent plasticity (STDP), a Hebbian method commonly used for learning synaptic weights. We validate our approach by extending a widely-used SNN model for classification trained with unsupervised learning. Then we demonstrate the effectiveness of our new method by comparing it against other existing methods for co-learning synaptic weights and delays, as well as against STDP without synaptic delays. Results demonstrate that our proposed method consistently achieves superior performance across a variety of test scenarios. Furthermore, our experimental results yield insight into the interplay between synaptic efficacy and delay.
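
To give a feel for what co-learning weights and delays looks like, here is a pair-based sketch with the usual exponential STDP windows; the paper's exact update rule may differ, and all constants are illustrative.

```python
# Pair-based STDP sketch extended to delays: the pre spike effectively
# arrives at t_pre + d, and the delay is nudged so that arrival moves
# toward the post spike when the pairing is causal.
import numpy as np

def stdp_weight_delay(w, d, t_pre, t_post,
                      a_plus=0.01, a_minus=0.012,
                      b_plus=0.1, b_minus=0.1, tau=20.0):
    dt = t_post - (t_pre + d)            # post minus (delayed) pre arrival time
    if dt > 0:   # pre-before-post: potentiate, shrink delay so the pre spike
                 # arrives closer to the post spike
        w += a_plus * np.exp(-dt / tau)
        d -= b_plus * np.exp(-dt / tau)
    else:        # post-before-pre: depress, and grow the delay
        w -= a_minus * np.exp(dt / tau)
        d += b_minus * np.exp(dt / tau)
    return float(np.clip(w, 0.0, 1.0)), max(d, 0.0)

w, d = 0.5, 3.0                          # initial weight and delay (ms)
w, d = stdp_weight_delay(w, d, t_pre=10.0, t_post=16.0)
print(w, d)
```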

[LG-46] ODD: Overlap-aware Estimation of Model Performance under Distribution Shift

链接: https://arxiv.org/abs/2506.14978
作者: Aayush Mishra,Anqi Liu
类目: Machine Learning (cs.LG)
备注: Accepted to the 41st Conference on Uncertainty in Artificial Intelligence, 2025

Abstract:Reliable and accurate estimation of the error of an ML model in unseen test domains is an important problem for safe intelligent systems. Prior work uses disagreement discrepancy (DIS^2) to derive practical error bounds under distribution shifts. It optimizes for a maximally disagreeing classifier on the target domain to bound the error of a given source classifier. Although this approach offers a reliable and competitively accurate estimate of the target error, we identify a problem in this approach which causes the disagreement discrepancy objective to compete in the overlapping region between source and target domains. With an intuitive assumption that the target disagreement should be no more than the source disagreement in the overlapping region due to high enough support, we devise Overlap-aware Disagreement Discrepancy (ODD). Maximizing ODD only requires disagreement in the non-overlapping target domain, removing the competition. Our ODD-based bound uses domain-classifiers to estimate domain-overlap and better predicts target performance than DIS^2. We conduct experiments on a wide array of benchmarks to show that our method improves the overall performance-estimation error while remaining valid and reliable. Our code and results are available on GitHub.

[LG-47] FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning

链接: https://arxiv.org/abs/2506.14929
作者: Ganyu Wang,Jinjie Fang,Maxwell J. Ying,Bin Gu,Xi Chen,Boyu Wang,Charles Ling
类目: Machine Learning (cs.LG)
备注: Published in Proceedings of the 42nd International Conference on Machine Learning

Abstract:Black-Box Discrete Prompt Learning is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making the prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting federated learning to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning had neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we called FedOne, enabled optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.
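
The core FedOne idea amounts to a very small change to the federated loop: activate exactly one client per round instead of averaging many, so the number of billable LLM queries per round stays minimal. The skeleton below stubs out the black-box prompt update; every name and the toy objective are illustrative.

```python
# Skeleton of a FedOne-style round: one randomly chosen client tunes the
# shared discrete prompt against the black-box LLM; the server adopts the
# single returned update. The update function here is a stub.
import random

def fedone_round(server_prompt, clients, query_llm_and_update):
    client = random.choice(clients)                   # activate ONE client
    # The chosen client tunes the prompt with a gradient-free estimator
    # against the cloud LLM on its local data (stubbed below).
    new_prompt, n_queries = query_llm_and_update(server_prompt, client)
    return new_prompt, n_queries

def fake_update(prompt, client):
    return prompt + [client["best_token"]], 10        # pretend 10 LLM queries

random.seed(0)
clients = [{"best_token": t} for t in ["good", "great", "excellent"]]
prompt, total_queries = [], 0
for _ in range(3):
    prompt, q = fedone_round(prompt, clients, fake_update)
    total_queries += q
print(prompt, "queries used:", total_queries)
```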

[LG-48] FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

链接: https://arxiv.org/abs/2506.14922
作者: Christina Q. Knight,Kaustubh Deshpande,Ved Sirdeshmukh,Meher Mankikar,Scale Red Team,SEAL Research Team,Julian Michael
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 12 pages, 7 figures, submitted to NeurIPS

Abstract:The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information. However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals. This evaluation of frontier LLMs’ safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06. Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2). To provide policymakers and researchers with a clear understanding of models’ potential risks, we publicly release FORTRESS at this https URL. We also maintain a private set for evaluation.

[LG-49] Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

链接: https://arxiv.org/abs/2506.14913
作者: Wassim Bouaziz,Mathurin Videau,Nicolas Usunier,El-Mahdi El-Mhamdi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 18 pages, 12 figures

Abstract:The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of training data, which LM providers try to limit. In this work, we demonstrate that indirect data poisoning (where the targeted behavior is absent from training data) is not only feasible but also allows one to effectively protect a dataset and trace its use. Using gradient-based optimization prompt-tuning, we make a model learn arbitrary secret sequences: secret responses to secret prompts that are absent from the training corpus. We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make an LM learn a secret and detect it with extremely high confidence (p < 10^{-55}) with a theoretically certifiable scheme. Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets never appearing in the training set.

[LG-50] Event-Driven Online Vertical Federated Learning

链接: https://arxiv.org/abs/2506.14911
作者: Ganyu Wang,Boyu Wang,Bin Gu,Charles Ling
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Published as a conference paper at ICLR 2025

Abstract:Online learning is more adaptable to real-world scenarios in Vertical Federated Learning (VFL) compared to offline learning. However, integrating online learning into VFL presents challenges due to the unique nature of VFL, where clients possess non-intersecting feature sets for the same sample. In real-world scenarios, the clients may not receive data streaming for the disjoint features for the same entity synchronously. Instead, the data are typically generated by an event relevant to only a subset of clients. We are the first to identify these challenges in online VFL, which have been overlooked by previous research. To address these challenges, we proposed an event-driven online VFL framework. In this framework, only a subset of clients were activated during each event, while the remaining clients passively collaborated in the learning process. Furthermore, we incorporated dynamic local regret (DLR) into VFL to address the challenges posed by online learning problems with non-convex models within a non-stationary environment. We conducted a comprehensive regret analysis of our proposed framework, specifically examining the DLR under non-convex conditions with event-driven online VFL. Extensive experiments demonstrated that our proposed framework was more stable than the existing online VFL framework under non-stationary data conditions while also significantly reducing communication and computation costs.

[LG-51] Generalized Reference Kernel With Negative Samples For Support Vector One-class Classification

链接: https://arxiv.org/abs/2506.14895
作者: Jenni Raitoharju
类目: Machine Learning (cs.LG)
备注: Accepted to EUSIPCO 2025

Abstract:This paper focuses on small-scale one-class classification with some negative samples available. We propose Generalized Reference Kernel with Negative Samples (GRKneg) for One-class Support Vector Machine (OC-SVM). We study different ways to select/generate the reference vectors and recommend an approach for the problem at hand. It is worth noting that the proposed method does not use any labels in the model optimization but uses the original OC-SVM implementation. Only the kernel used in the process is improved using the negative data. We compare our method with the standard OC-SVM and with the binary Support Vector Machine (SVM) using different amounts of negative samples. Our approach consistently outperforms the standard OC-SVM using Radial Basis Function kernel. When there are plenty of negative samples, the binary SVM outperforms the one-class approaches as expected, but we show that for the lowest numbers of negative samples the proposed approach clearly outperforms the binary SVM.
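
The abstract does not spell out the kernel construction, so the sketch below uses a Nyström-style reference kernel (an assumption on our part, not necessarily the paper's GRKneg formula) and simply includes the few negatives in the reference set; the induced features then go into an unmodified one-class SVM, so no labels enter the model optimization.

```python
# Reference-kernel sketch for OC-SVM with negatives in the reference set.
# The Nystrom-style map Phi = K_xr @ K_rr^{-1/2} is an illustrative choice.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

def reference_features(X, refs, gamma=0.5):
    K_xr = rbf_kernel(X, refs, gamma=gamma)          # (n, r)
    K_rr = rbf_kernel(refs, refs, gamma=gamma)       # (r, r)
    W = np.real(np.linalg.pinv(sqrtm(K_rr)))         # K_rr^{-1/2}
    return K_xr @ W                                  # explicit feature map

rng = np.random.default_rng(0)
X_pos = rng.normal(0, 1, (50, 2))                    # target class
X_neg = rng.normal(4, 1, (5, 2))                     # a few available negatives
refs = np.vstack([X_pos[:10], X_neg])                # reference set uses negatives

Phi = reference_features(X_pos, refs)                # labels never enter the SVM
oc = OneClassSVM(kernel="linear").fit(Phi)           # linear kernel on Phi
test = np.array([[0.0, 0.0], [4.0, 4.0]])
print(oc.predict(reference_features(test, refs)))    # expect roughly [+1, -1]
```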

[LG-52] OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

链接: https://arxiv.org/abs/2506.14866
作者: Thomas Kuntz,Agatha Duzan,Hao Zhao,Francesco Croce,Zico Kolter,Nicolas Flammarion,Maksym Andriushchenko
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
备注:

Abstract:Computer use agents are LLM-based agents that can directly interact with a graphical user interface, by processing screenshots or accessibility trees. While these systems are gaining popularity, their safety has been largely overlooked, despite the fact that evaluating and understanding their potential for harmful behavior is essential for widespread adoption. To address this gap, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we create 150 tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of OS applications (email client, code editor, browser, etc.). Moreover, we propose an automated judge to evaluate both accuracy and safety of agents that achieves high agreement with human annotations (0.76 and 0.79 F1 score). We evaluate computer use agents based on a range of frontier models - such as o4-mini, Claude 3.7 Sonnet, Gemini 2.5 Pro - and provide insights into their safety. In particular, all models tend to directly comply with many deliberate misuse queries, are relatively vulnerable to static prompt injections, and occasionally perform unsafe actions. The OS-Harm benchmark is available at this https URL.

[LG-53] Accurate and Uncertainty-Aware Multi-Task Prediction of HEA Properties Using Prior-Guided Deep Gaussian Processes

链接: https://arxiv.org/abs/2506.14828
作者: Sk Md Ahnaf Akif Alvi,Mrinalini Mulukutla,Nicolas Flores,Danial Khatamsaz,Jan Janssen,Danny Perez,Douglas Allaire,Vahid Attari,Raymundo Arroyave
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
备注: Deep Gaussian Processes Multi-task Gaussian Processes High Entropy Alloys

Abstract:Surrogate modeling techniques have become indispensable in accelerating the discovery and optimization of high-entropy alloys (HEAs), especially when integrating computational predictions with sparse experimental observations. This study systematically evaluates the fitting performance of four prominent surrogate models: conventional Gaussian Processes (cGP), Deep Gaussian Processes (DGP), encoder-decoder neural networks for multi-output regression, and XGBoost, applied to a hybrid dataset of experimental and computational properties in the AlCoCrCuFeMnNiV HEA system. We specifically assess their capabilities in predicting correlated material properties, including yield strength, hardness, modulus, ultimate tensile strength, elongation, and average hardness under dynamic and quasi-static conditions, alongside auxiliary computational properties. The comparison highlights the strengths of hierarchical and deep modeling approaches in handling heteroscedastic, heterotopic, and incomplete data commonly encountered in materials informatics. Our findings illustrate that DGPs infused with a machine learning-based prior outperform other surrogates by effectively capturing inter-property correlations and input-dependent uncertainty. This enhanced predictive accuracy positions advanced surrogate models as powerful tools for robust and data-efficient materials design.

[LG-54] Navigating High-Dimensional Backstage: A Guide for Exploring Literature for the Reliable Use of Dimensionality Reduction

链接: https://arxiv.org/abs/2506.14820
作者: Hyeon Jeon,Hyunwook Lee,Yun-Hsin Kuo,Taehyun Yang,Daniel Archambault,Sungahn Ko,Takanori Fujiwara,Kwan-Liu Ma,Jinwook Seo
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: EG/VGTC EuroVis 2025 Short paper

Abstract:Visual analytics using dimensionality reduction (DR) can easily be unreliable for various reasons, e.g., inherent distortions in representing the original data. The literature has thus proposed a wide range of methodologies to make DR-based visual analytics reliable. However, the diversity and extensiveness of the literature can leave novice analysts and researchers uncertain about where to begin and proceed. To address this problem, we propose a guide for reading papers for reliable visual analytics with DR. Relying on the previous classification of the relevant literature, our guide helps both practitioners to (1) assess their current DR expertise and (2) identify papers that will further enhance their understanding. Interview studies with three experts in DR and data visualizations validate the significance, comprehensiveness, and usefulness of our guide.

[LG-55] Predicting Anthropometric Body Composition Variables Using 3D Optical Imaging and Machine Learning

链接: https://arxiv.org/abs/2506.14815
作者: Gyaneshwar Agrahari,Kiran Bist,Monika Pandey,Jacob Kapita,Zachary James,Jackson Knox,Steven Heymsfield,Sophia Ramirez,Peter Wolenski,Nadejda Drenska
类目: Machine Learning (cs.LG)
备注:

Abstract:Accurate prediction of anthropometric body composition variables, such as Appendicular Lean Mass (ALM), Body Fat Percentage (BFP), and Bone Mineral Density (BMD), is essential for early diagnosis of several chronic diseases. Currently, researchers rely on Dual-Energy X-ray Absorptiometry (DXA) scans to measure these metrics; however, DXA scans are costly and time-consuming. This work proposes an alternative to DXA scans by applying statistical and machine learning models on biomarkers (height, volume, left calf circumference, etc.) obtained from 3D optical images. The dataset consists of 847 patients and was sourced from Pennington Biomedical Research Center. Extracting patients’ data in healthcare faces many technical challenges and legal restrictions. However, most supervised machine learning algorithms are inherently data-intensive, requiring a large amount of training data. To overcome these limitations, we implemented a semi-supervised model, the p-Laplacian regression model. This paper is the first to demonstrate the application of a p-Laplacian model for regression. Our p-Laplacian model yielded errors of ~13% for ALM, ~10% for BMD, and ~20% for BFP when the training data accounted for 10 percent of all data. Among the supervised algorithms we implemented, Support Vector Regression (SVR) performed the best for ALM and BMD, yielding errors of ~8% for both, while Least Squares SVR performed the best for BFP with ~11% error when trained on 80 percent of the data. Our findings position the p-Laplacian model as a promising tool for healthcare applications, particularly in a data-constrained environment.
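
For intuition, the p = 2 special case of graph-based semi-supervised regression (Laplace learning) has a closed form; general p requires an iterative solver. The sketch below uses synthetic stand-ins for the 3D-imaging biomarkers and a roughly 10% labeled split as in the paper; the graph construction is an assumption.

```python
# Laplace learning (the p = 2 graph p-Laplacian): harmonically extend the
# few labeled values to the unlabeled nodes of a similarity graph.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def laplace_learning(X, y_labeled, labeled_idx, gamma=1.0):
    W = rbf_kernel(X, gamma=gamma)               # dense similarity graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    n = X.shape[0]
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    # Solve L_uu f_u = -L_ul y_l  (harmonic extension of the labels).
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled_idx)]
    f = np.empty(n)
    f[labeled_idx] = y_labeled
    f[unlabeled] = np.linalg.solve(L_uu, -L_ul @ y_labeled)
    return f

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 3))                  # toy stand-ins for biomarkers
y = X @ np.array([2.0, -1.0, 0.5])               # synthetic target (e.g., an ALM proxy)
labeled = rng.choice(200, size=20, replace=False)   # ~10% labeled, as in the paper
f = laplace_learning(X, y[labeled], labeled)
print("relative error:", np.linalg.norm(f - y) / np.linalg.norm(y))
```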

[LG-56] Self-Composing Policies for Scalable Continual Reinforcement Learning

链接: https://arxiv.org/abs/2506.14811
作者: Mikel Malagón,Josu Ceberio,Jose A. Lozano
类目: Machine Learning (cs.LG)
备注: ICML 2024 (oral)

Abstract:This work introduces a growable and modular neural network architecture that naturally avoids catastrophic forgetting and interference in continual reinforcement learning. The structure of each module allows the selective combination of previous policies along with its internal policy, accelerating the learning process on the current task. Unlike previous growing neural network approaches, we show that the number of parameters of the proposed approach grows linearly with respect to the number of tasks, and does not sacrifice plasticity to scale. Experiments conducted in benchmark continuous control and visual problems reveal that the proposed approach achieves greater knowledge transfer and performance than alternative methods.

[LG-57] Intelligent Routing for Sparse Demand Forecasting: A Comparative Evaluation of Selection Strategies

链接: https://arxiv.org/abs/2506.14810
作者: Qiwen Zhang
类目: Machine Learning (cs.LG)
备注: 7 pages, 4 figures, conference

Abstract:Sparse and intermittent demand forecasting in supply chains presents a critical challenge, as frequent zero-demand periods hinder traditional model accuracy and impact inventory management. We propose and evaluate a Model-Router framework that dynamically selects the most suitable forecasting model (spanning classical, ML, and DL methods) for each product based on its unique demand pattern. By comparing rule-based, LightGBM, and InceptionTime routers, our approach learns to assign appropriate forecasting strategies, effectively differentiating between smooth, lumpy, or intermittent demand regimes to optimize predictions. Experiments on the large-scale Favorita dataset show our deep learning (InceptionTime) router improves forecasting accuracy by up to 11.8% (NWRMSLE) over strong, single-model benchmarks with 4.67x faster inference time. Ultimately, these gains in forecasting precision will drive substantial reductions in both stockouts and wasteful excess inventory, underscoring the critical role of intelligent, adaptive AI in optimizing contemporary supply chain operations.
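
A rule-based router of the kind benchmarked here is often built on the standard ADI/CV² demand taxonomy (Syntetos-Boylan cutoffs 1.32 and 0.49); the sketch below illustrates that variant. The regime-to-model mapping is our illustration, not the paper's.

```python
# Rule-based routing sketch: classify each product's demand pattern via
# ADI (average inter-demand interval) and CV^2 (squared coefficient of
# variation of nonzero demand), then dispatch to a forecaster per regime.
import numpy as np

def classify_demand(series):
    nonzero = series[series > 0]
    adi = len(series) / max(len(nonzero), 1)
    cv2 = (nonzero.std() / nonzero.mean()) ** 2 if len(nonzero) > 1 else 0.0
    if adi < 1.32:
        return "smooth" if cv2 < 0.49 else "erratic"
    return "intermittent" if cv2 < 0.49 else "lumpy"

ROUTES = {                     # regime -> forecasting strategy (illustrative)
    "smooth": "ETS/ARIMA",
    "erratic": "gradient-boosted trees",
    "intermittent": "Croston / SBA",
    "lumpy": "deep time-series model",
}

rng = np.random.default_rng(0)
demand = rng.poisson(0.3, size=100).astype(float)   # sparse demand series
regime = classify_demand(demand)
print(regime, "->", ROUTES[regime])
```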

[LG-58] Impact of a Deployed LLM Survey Creation Tool through the IS Success Model

链接: https://arxiv.org/abs/2506.14809
作者: Peng Jiang,Vinicius Cezar Monteiro de Lira,Antonio Maiorino
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

Abstract:Surveys are a cornerstone of Information Systems (IS) research, yet creating high-quality surveys remains labor-intensive, requiring both domain expertise and methodological rigor. With the evolution of large language models (LLMs), new opportunities emerge to automate survey generation. This paper presents the real-world deployment of an LLM-powered system designed to accelerate data collection while maintaining survey quality. Deploying such systems in production introduces real-world complexity, including diverse user needs and quality control. We evaluate the system using the DeLone and McLean IS Success Model to understand how generative AI can reshape a core IS method. This study makes three key contributions. To our knowledge, this is the first application of the IS Success Model to a generative AI system for survey creation. In addition, we propose a hybrid evaluation framework combining automated and human assessments. Finally, we implement safeguards that mitigate post-deployment risks and support responsible integration into IS workflows.

[LG-59] PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models

链接: https://arxiv.org/abs/2506.14808
作者: Jenny Schmalfuss,Nadine Chang,Vibashan VS,Maying Shen,Andres Bruhn,Jose M. Alvarez
类目: Machine Learning (cs.LG)
备注: Accepted to CVPR 2025

Abstract:Vision language models (VLMs) respond to user-crafted text prompts and visual inputs, and are applied to numerous real-world problems. VLMs integrate visual modalities with large language models (LLMs), which are well known to be prompt-sensitive. Hence, it is crucial to determine whether VLMs inherit this instability to varying prompts. We therefore investigate which prompt variations VLMs are most sensitive to and which VLMs are most agnostic to prompt variations. To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. Regarding prompt variations, PARC’s evaluation shows that VLMs mirror LLM language prompt sensitivity in the vision domain, and most destructive variations change the expected answer. Regarding models, outstandingly robust VLMs among 22 evaluated models come from the InternVL2 family. We further find indications that prompt sensitivity is linked to training data. The code will be at this https URL.

[LG-60] Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

链接: https://arxiv.org/abs/2506.14806
作者: Bochen Lyu,Xiaojing Zhang,Fangyi Zheng,He Wang,Zheng Wang,Zhanxing Zhu
类目: Machine Learning (cs.LG)
备注: 32 pages, 7 figures

Abstract:This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
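
For reference, the recursion under study and its naive continuous-time limit are below; the paper's contribution is precisely the counter terms that correct the discretization error this naive limit ignores.

```latex
% Discrete Heavy-Ball recursion, step size s > 0, momentum \beta \in [0,1):
%   x_{k+1} = x_k - s\,\nabla f(x_k) + \beta\,(x_k - x_{k-1})
% Naive first-order limit as s -> 0 (time scale t = ks); the paper's
% piece-wise continuous ODE augments such limits with explicit counter
% terms that control the discretization error to arbitrary order in s:
\dot{X}(t) = -\frac{1}{1-\beta}\,\nabla f\bigl(X(t)\bigr)
```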

[LG-61] Protein Language Model Zero-Shot Fitness Predictions are Improved by Inference-only Dropout

链接: https://arxiv.org/abs/2506.14793
作者: Aditya Ravuri,Neil D. Lawrence
类目: Machine Learning (cs.LG)
备注:

Abstract:Protein Language Models (PLMs) such as ESM2 have been shown to be capable of zero-shot prediction of critical scalar properties of proteins (fitness). In this work, we show that injecting a dropout layer at inference time between a PLM’s featurizer/embedding layer and its transformer, and averaging its output, akin to Monte-Carlo dropout, increases zero-shot performance on a subset of the ProteinGym dataset. This is the case even when the model was not trained with dropout to begin with, and does not require retraining or finetuning of the PLM. A dropout rate of 0.1 seems performant across all models.
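
The trick is easy to reproduce with any PyTorch encoder. The sketch below uses a generic transformer as a stand-in for ESM2 (whose real API is not shown) and relies on the fact that a freshly constructed nn.Dropout module defaults to training mode, so it stays active even while the frozen model sits in eval mode.

```python
# Inference-only Monte-Carlo dropout between an embedding layer and a
# transformer, averaged over masks. All sizes and the scalar head are toys.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(33, 64)                       # stand-in featurizer/embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, 1)                            # toy scalar "fitness" head
dropout = nn.Dropout(p=0.1)                        # NOT part of the trained model

embed.eval(); encoder.eval(); head.eval()          # frozen "pretrained" model
# `dropout` is a fresh module, so it defaults to training mode: still active.

tokens = torch.randint(0, 33, (1, 50))             # one toy protein sequence
with torch.no_grad():
    preds = []
    for _ in range(16):                            # Monte-Carlo samples
        h = dropout(embed(tokens))                 # inject dropout between
        preds.append(head(encoder(h)).mean())      # embedding and transformer
fitness = torch.stack(preds).mean()                # average over dropout masks
print(float(fitness))
```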

[LG-62] Continuous Evolution Pool: Taming Recurring Concept Drift in Online Time Series Forecasting

链接: https://arxiv.org/abs/2506.14790
作者: Tianxiang Zhan,Ming Jin,Yuanpeng He,Yuxuan Liang,Yong Deng,Shirui Pan
类目: Machine Learning (cs.LG)
备注:

Abstract:Recurring concept drift, a type of concept drift in which previously observed data patterns reappear after some time, is one of the most prevalent types of concept drift in time series. As time progresses, concept drift occurs and previously encountered concepts are forgotten, thereby leading to a decline in the accuracy of online predictions. Existing solutions employ parameter updating techniques to delay forgetting; however, this may result in the loss of some previously learned knowledge while neglecting the exploration of knowledge retention mechanisms. To retain all conceptual knowledge and fully utilize it when the concepts recur, we propose the Continuous Evolution Pool (CEP), a pooling mechanism that stores different instances of forecasters for different concepts. Our method first selects the forecaster nearest to the test sample and then learns the features from its neighboring samples - a process we refer to as the retrieval. If there are insufficient neighboring samples, it indicates that a new concept has emerged, and a new model will evolve from the current nearest sample to the pool to store the knowledge of the concept. Simultaneously, the elimination mechanism will enable outdated knowledge to be cleared to ensure the prediction effect of the forecasters. Experiments on different architectural models and eight real datasets demonstrate that CEP effectively retains the knowledge of different concepts. In the scenario of online forecasting with recurring concepts, CEP significantly enhances the prediction results.

[LG-63] AZT1D: A Real-World Dataset for Type 1 Diabetes

链接: https://arxiv.org/abs/2506.14789
作者: Saman Khamesian,Asiful Arefeen,Bithika M. Thompson,Maria Adela Grando,Hassan Ghasemzadeh
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 4 pages

Abstract:High-quality real-world datasets are essential for advancing data-driven approaches in type 1 diabetes (T1D) management, including personalized therapy design, digital twin systems, and glucose prediction models. However, progress in this area has been limited by the scarcity of publicly available datasets that offer detailed and comprehensive patient data. To address this gap, we present AZT1D, a dataset containing data collected from 25 individuals with T1D on automated insulin delivery (AID) systems. AZT1D includes continuous glucose monitoring (CGM) data, insulin pump and insulin administration data, carbohydrate intake, and device mode (regular, sleep, and exercise) obtained over 6 to 8 weeks for each patient. Notably, the dataset provides granular details on bolus insulin delivery (i.e., total dose, bolus type, correction-specific amounts), features that are rarely found in existing datasets. By offering rich, naturalistic data, AZT1D supports a wide range of artificial intelligence and machine learning applications aimed at improving clinical decision making and individualized care in T1D.

[LG-64] Predicting Onflow Parameters Using Transfer Learning for Domain and Task Adaptation

链接: https://arxiv.org/abs/2506.14784
作者: Emre Yilmaz,Philipp Bekemeyer
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
备注:

Abstract:Determining onflow parameters is crucial from the perspectives of wind tunnel testing and regular flight and wind turbine operations. These parameters have traditionally been predicted via direct measurements which might lead to challenges in case of sensor faults. Alternatively, a data-driven prediction model based on surface pressure data can be used to determine these parameters. It is essential that such predictors achieve close to real-time learning as dictated by practical applications such as monitoring wind tunnel operations or learning the variations in aerodynamic performance of aerospace and wind energy systems. To overcome the challenges caused by changes in the data distribution as well as in adapting to a new prediction task, we propose a transfer learning methodology to predict the onflow parameters, specifically angle of attack and onflow speed. It requires first training a convolutional neural network (ConvNet) model offline for the core prediction task, then freezing the weights of this model except the selected layers preceding the output node, and finally executing transfer learning by retraining these layers. A demonstration of this approach is provided using steady CFD analysis data for an airfoil for i) domain adaptation where transfer learning is performed with data from a target domain having different data distribution than the source domain and ii) task adaptation where the prediction task is changed. Further exploration on the influence of noisy data, performance on an extended domain, and trade studies varying sampling sizes and architectures are provided. Results successfully demonstrate the potential of the approach for adaptation to changing data distribution, domain extension, and task update while the application for noisy data is concluded to be not as effective.
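
The freeze-then-retrain step looks like the following in PyTorch; the small ConvNet and the choice of which layers to unfreeze are illustrative assumptions, not the paper's architecture.

```python
# Transfer-learning sketch: train a ConvNet offline, then freeze everything
# except the layers preceding the output and retrain only those on
# target-domain data (domain or task adaptation).
import torch
import torch.nn as nn

model = nn.Sequential(                   # surface-pressure map -> 2 onflow params
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 2),                    # [angle of attack, onflow speed]
)
# ... assume `model` was trained offline on the source domain here ...

for p in model.parameters():             # freeze the whole network, then
    p.requires_grad = False
for layer in list(model)[-3:]:           # unfreeze the layers before the output
    for p in layer.parameters():
        p.requires_grad = True

opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
x_target = torch.randn(8, 1, 32, 32)     # new-domain pressure data (toy)
y_target = torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x_target), y_target)
loss.backward(); opt.step()
```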

[LG-65] Integrating Dynamical Systems Learning with Foundational Models: A Meta-Evolutionary AI Framework for Clinical Trials

链接: https://arxiv.org/abs/2506.14782
作者: Joseph Geraci,Bessi Qorri,Christian Cumbaa,Mike Tsay,Paul Leonczyk,Luca Pani
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 27 pages

Abstract:Artificial intelligence (AI) has evolved into an ecosystem of specialized “species,” each with unique strengths. We analyze two: DeepSeek-V3, a 671-billion-parameter Mixture of Experts large language model (LLM) exemplifying scale-driven generality, and NetraAI, a dynamical system-based framework engineered for stability and interpretability on small clinical trial datasets. We formalize NetraAI’s foundations, combining contraction mappings, information geometry, and evolutionary algorithms to identify predictive patient cohorts. Features are embedded in a metric space and iteratively contracted toward stable attractors that define latent subgroups. A pseudo-temporal embedding and long-range memory enable exploration of higher-order feature interactions, while an internal evolutionary loop selects compact, explainable 2-4-variable bundles (“Personas”). To guide discovery, we introduce an LLM Strategist as a meta-evolutionary layer that observes Persona outputs, prioritizes promising variables, injects domain knowledge, and assesses robustness. This two-tier architecture mirrors the human scientific process: NetraAI as experimentalist, the LLM as theorist, forming a self-improving loop. In case studies (schizophrenia, depression, pancreatic cancer), NetraAI uncovered small, high-effect-size subpopulations that transformed weak baseline models (AUC ~0.50-0.68) into near-perfect classifiers using only a few features. We position NetraAI at the intersection of dynamical systems, information geometry, and evolutionary learning, aligned with emerging concept-level reasoning paradigms such as LeCun’s Joint Embedding Predictive Architecture (JEPA). By prioritizing reliable, explainable knowledge, NetraAI offers a new generation of adaptive, self-reflective AI to accelerate clinical discovery.

[LG-66] Two-dimensional Parallel Tempering for Constrained Optimization

链接: https://arxiv.org/abs/2506.14781
作者: Corentin Delacour,M Mahmudul Hasan Sajeeb,Joao P. Hespanha,Kerem Y. Camsari
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling Boltzmann probability distributions plays a key role in machine learning and optimization, motivating the design of hardware accelerators such as Ising machines. While the Ising model can in principle encode arbitrary optimization problems, practical implementations are often hindered by soft constraints that either slow down mixing when too strong, or fail to enforce feasibility when too weak. We introduce a two-dimensional extension of the powerful parallel tempering algorithm (PT) that addresses this challenge by adding a second dimension of replicas interpolating the penalty strengths. This scheme ensures constraint satisfaction in the final replicas, analogous to low-energy states at low temperature. The resulting two-dimensional parallel tempering algorithm (2D-PT) improves mixing in heavily constrained replicas and eliminates the need to explicitly tune the penalty strength. In a representative example of graph sparsification with copy constraints, 2D-PT achieves near-ideal mixing, with Kullback-Leibler divergence decaying as O(1/t). When applied to sparsified Wishart instances, 2D-PT yields orders of magnitude speedup over conventional PT with the same number of replicas. The method applies broadly to constrained Ising problems and can be deployed on existing Ising machines.
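The core mechanics of 2D-PT can be sketched in a few lines of NumPy: a grid of replicas spans a temperature ladder in one dimension and a penalty-strength ladder in the other, with Metropolis sweeps inside each replica and swap moves along both dimensions. Everything below (problem size, ladders, the toy magnetization constraint) is an illustrative assumption, not the paper's benchmark setup.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20                                           # number of spins
J = rng.normal(size=(n, n)); J = (J + J.T) / 2   # random symmetric couplings
np.fill_diagonal(J, 0)

def e_obj(s):          # objective energy of a spin configuration
    return -0.5 * s @ J @ s

def e_pen(s):          # toy soft constraint: penalize nonzero net magnetization
    return float(s.sum() ** 2)

def energy(s, lam):
    return e_obj(s) + lam * e_pen(s)

betas = np.linspace(0.1, 2.0, 6)    # temperature ladder (classic PT dimension)
lams = np.linspace(0.0, 2.0, 5)     # penalty-strength ladder (second dimension)
S = rng.choice([-1, 1], size=(len(betas), len(lams), n))  # replica grid

def sweep(s, beta, lam):
    for k in range(n):              # single-spin Metropolis updates
        s2 = s.copy(); s2[k] *= -1
        dE = energy(s2, lam) - energy(s, lam)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s = s2
    return s

for step in range(200):
    for i, beta in enumerate(betas):
        for j, lam in enumerate(lams):
            S[i, j] = sweep(S[i, j], beta, lam)
    for j, lam in enumerate(lams):  # swaps along the temperature dimension
        for i in range(len(betas) - 1):
            d = (betas[i + 1] - betas[i]) * (energy(S[i + 1, j], lam) - energy(S[i, j], lam))
            if d >= 0 or rng.random() < np.exp(d):
                S[i, j], S[i + 1, j] = S[i + 1, j].copy(), S[i, j].copy()
    for i, beta in enumerate(betas):  # swaps along the penalty dimension
        for j in range(len(lams) - 1):
            d = beta * (lams[j + 1] - lams[j]) * (e_pen(S[i, j + 1]) - e_pen(S[i, j]))
            if d >= 0 or rng.random() < np.exp(d):
                S[i, j], S[i, j + 1] = S[i, j + 1].copy(), S[i, j].copy()

# feasible low-energy samples live in the coldest, most-constrained replica
best = S[-1, -1]
print("constraint violation:", e_pen(best), "objective:", e_obj(best))
```

Both swap rules are ordinary parallel-tempering acceptances; along the penalty dimension only the penalty energies enter the acceptance ratio, since the objective term cancels between replicas at the same temperature.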

[LG-67] SimBank: from Simulation to Solution in Prescriptive Process Monitoring

链接: https://arxiv.org/abs/2506.14772
作者: Jakob De Moor,Hans Weytjens,Johannes De Smedt,Jochen De Weerdt
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prescriptive Process Monitoring (PresPM) is an emerging area within Process Mining, focused on optimizing processes through real-time interventions for effective decision-making. PresPM holds significant promise for organizations seeking enhanced operational performance. However, the current literature faces two key limitations: a lack of extensive comparisons between techniques and insufficient evaluation approaches. To address these gaps, we introduce SimBank: a simulator designed for accurate benchmarking of PresPM methods. Modeled after a bank’s loan application process, SimBank enables extensive comparisons of both online and offline PresPM methods. It incorporates a variety of intervention optimization problems with differing levels of complexity and supports experiments on key causal machine learning challenges, such as assessing a method’s robustness to confounding in data. SimBank additionally offers a comprehensive evaluation capability: for each test case, it can generate the true outcome under each intervention action, which is not possible using recorded datasets. The simulator incorporates parallel activities and loops, drawing from common logs to generate cases that closely resemble real-life process instances. Our proof of concept demonstrates SimBank’s benchmarking capabilities through experiments with various PresPM methods across different interventions, highlighting its value as a publicly available simulator for advancing research and practice in PresPM.

[LG-68] Revisiting Randomization in Greedy Model Search

链接: https://arxiv.org/abs/2506.15643
作者: Xin Chen,Jason M. Klusowski,Yan Shuo Tan,Chang Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Combining randomized estimators in an ensemble, such as via random forests, has become a fundamental technique in modern data science, but can be computationally expensive. Furthermore, the mechanism by which this improves predictive performance is poorly understood. We address these issues in the context of sparse linear regression by proposing and analyzing an ensemble of greedy forward selection estimators that are randomized by feature subsampling – at each iteration, the best feature is selected from within a random subset. We design a novel implementation based on dynamic programming that greatly improves its computational efficiency. Furthermore, we show via careful numerical experiments that our method can outperform popular methods such as lasso and elastic net across a wide range of settings. Next, contrary to prevailing belief that randomized ensembling is analogous to shrinkage, we show via numerical experiments that it can simultaneously reduce training error and degrees of freedom, thereby shifting the entire bias-variance trade-off curve of the base estimator. We prove this fact rigorously in the setting of orthogonal features, in which case, the ensemble estimator rescales the ordinary least squares coefficients with a two-parameter family of logistic weights, thereby enlarging the model search space. These results enhance our understanding of random forests and suggest that implicit regularization in general may have more complicated effects than explicit regularization.
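A minimal NumPy sketch of the ensemble idea follows: at each greedy step the best feature is picked from a random subset, and predictions from many randomized fits are averaged. This uses a simple residual-correlation score and naive refitting rather than the paper's dynamic-programming implementation, so treat it as a conceptual illustration only.

```python
import numpy as np

def randomized_forward_selection(X, y, k, m, rng):
    """Greedy forward selection where, at each step, the best feature is
    chosen from a random subset of size m (random-forest-style subsampling)."""
    n, p = X.shape
    selected = []
    residual = y.copy()
    for _ in range(k):
        candidates = rng.choice(
            [j for j in range(p) if j not in selected], size=m, replace=False
        )
        # score candidates by squared correlation with the current residual
        scores = [(X[:, j] @ residual) ** 2 / (X[:, j] @ X[:, j]) for j in candidates]
        selected.append(int(candidates[int(np.argmax(scores))]))
        # refit least squares on the selected features and update the residual
        beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ beta
    return selected

def ensemble_predict(X_train, y_train, X_test, k=10, m=5, B=50, seed=0):
    """Average the predictions of B randomized forward-selection fits."""
    rng = np.random.default_rng(seed)
    preds = np.zeros(len(X_test))
    for _ in range(B):
        sel = randomized_forward_selection(X_train, y_train, k, m, rng)
        beta, *_ = np.linalg.lstsq(X_train[:, sel], y_train, rcond=None)
        preds += X_test[:, sel] @ beta
    return preds / B

# toy demo on a sparse linear model
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.ones(5) + 0.5 * rng.normal(size=200)
yhat = ensemble_predict(X[:150], y[:150], X[150:])
print("test MSE:", np.mean((yhat - y[150:]) ** 2))
```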

[LG-69] Time-dependent density estimation using binary classifiers

链接: https://arxiv.org/abs/2506.15505
作者: Agnimitra Dasgupta,Javier Murgoitio-Esandi,Ali Fardisi,Assad A Oberai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a data-driven method to learn the time-dependent probability density of a multivariate stochastic process from sample paths, assuming that the initial probability density is known and can be evaluated. Our method uses a novel time-dependent binary classifier trained using a contrastive estimation-based objective that trains the classifier to discriminate between realizations of the stochastic process at two nearby time instants. Significantly, the proposed method explicitly models the time-dependent probability distribution, which means that it is possible to obtain the value of the probability density within the time horizon of interest. Additionally, the input before the final activation in the time-dependent classifier is a second-order approximation to the partial derivative, with respect to time, of the logarithm of the density. We apply the proposed approach to approximate the time-dependent probability density functions for systems driven by stochastic excitations. We also use the proposed approach to synthesize new samples of a random vector from a given set of its realizations. In such applications, we generate sample paths necessary for training using stochastic interpolants. Subsequently, new samples are generated using gradient-based Markov chain Monte Carlo methods because automatic differentiation can efficiently provide the necessary gradient. Further, we demonstrate the utility of an explicit approximation to the time-dependent probability density function through applications in unsupervised outlier detection. Through several numerical experiments, we show that the proposed method accurately reconstructs complex time-dependent, multi-modal, and near-degenerate densities, scales effectively to moderately high-dimensional problems, and reliably detects rare events among real-world data.
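One simplified reading of the method, sketched below in PyTorch: a time-conditioned binary classifier is trained to discriminate realizations at two nearby time instants, so its logit approximates the log-density ratio between those instants, and summing the learned logits along the time grid recovers log p_t from the known initial density. The network sizes, training loop, and toy paths are assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn

class TimeClassifier(nn.Module):
    """Time-conditioned binary classifier; after contrastive training its
    logit approximates log p_{t+dt}(x) - log p_t(x)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
            nn.Linear(128, 1),
        )

    def logit(self, x, t):
        tcol = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, tcol], dim=1))

def train(model, paths, times, epochs=20, lr=1e-3):
    """paths: (n_paths, n_times, dim) sample paths on the grid `times`."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for k in range(len(times) - 1):
            x0, x1 = paths[:, k], paths[:, k + 1]      # two nearby instants
            t_mid = (times[k] + times[k + 1]) / 2
            logits = torch.cat([model.logit(x0, t_mid), model.logit(x1, t_mid)])
            labels = torch.cat([torch.zeros(len(x0), 1), torch.ones(len(x1), 1)])
            opt.zero_grad()
            bce(logits, labels).backward()
            opt.step()

def log_density(model, x, times, log_p0):
    """log p_T(x) ~= log p_0(x) + sum over steps of the learned log-ratios."""
    total = log_p0(x)
    for k in range(len(times) - 1):
        t_mid = (times[k] + times[k + 1]) / 2
        total = total + model.logit(x, t_mid).squeeze(-1)
    return total

# toy usage on Brownian-like paths starting from a standard normal
torch.manual_seed(0)
times = torch.linspace(0.0, 1.0, 11)
x0 = torch.randn(256, 2)
paths = x0.unsqueeze(1) + torch.cumsum(0.1 * torch.randn(256, 11, 2), dim=1)
model = TimeClassifier(dim=2)
train(model, paths, times)
```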

[LG-70] Spectral Contraction of Boundary-Weighted Filters on delta-Hyperbolic Graphs

链接: https://arxiv.org/abs/2506.15464
作者: Le Vu Anh,Mehmet Dik,Nguyen Viet Anh
类目: Metric Geometry (math.MG); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:Hierarchical graphs often exhibit tree-like branching patterns, a structural property that challenges the design of traditional graph filters. We introduce a boundary-weighted operator that rescales each edge according to how far its endpoints drift toward the graph’s Gromov boundary. Using Busemann functions on delta-hyperbolic networks, we prove a closed-form upper bound on the operator’s spectral norm: every signal loses a curvature-controlled fraction of its energy at each pass. The result delivers a parameter-free, lightweight filter whose stability follows directly from geometric first principles, offering a new analytic tool for graph signal processing on data with dense or hidden hierarchical structure.

[LG-71] Multi-Timescale Gradient Sliding for Distributed Optimization

链接: https://arxiv.org/abs/2506.15387
作者: Junhui Zhang,Patrick Jaillet
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose two first-order methods for convex, non-smooth, distributed optimization problems, hereafter called Multi-Timescale Gradient Sliding (MT-GS) and its accelerated variant (AMT-GS). Our MT-GS and AMT-GS can take advantage of similarities between (local) objectives to reduce the communication rounds, are flexible so that different subsets (of agents) can communicate at different, user-picked rates, and are fully deterministic. These three desirable features are achieved through a block-decomposable primal-dual formulation and a multi-timescale variant of the sliding method introduced in Lan et al. (2020) and Lan (2016), where different dual blocks are updated at potentially different rates. To find an \epsilon -suboptimal solution, the complexities of our algorithms achieve optimal dependency on \epsilon : MT-GS needs O(\overline{r}A/\epsilon) communication rounds and O(\overline{r}/\epsilon^2) subgradient steps for Lipschitz objectives, and AMT-GS needs O(\overline{r}A/\sqrt{\epsilon\mu}) communication rounds and O(\overline{r}/(\epsilon\mu)) subgradient steps if the objectives are also \mu -strongly convex. Here, \overline{r} measures the "average rate of updates" for dual blocks, and A measures similarities between (subgradients of) local functions. In addition, the linear dependency of communication rounds on A is optimal (Arjevani and Shamir 2015), thereby providing a positive answer to the open question of whether such dependency is achievable for non-smooth objectives (Arjevani and Shamir 2015).

[LG-72] Performative Validity of Recourse Explanations

链接: https://arxiv.org/abs/2506.15366
作者: Gunnar König,Hidde Fokkema,Timo Freiesleben,Celestine Mendler-Dünner,Ulrike von Luxburg
类目: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 34 pages, 3 figures, 1 table, Preprint

点击查看摘要

Abstract:When applicants get rejected by an algorithmic decision system, recourse explanations provide actionable suggestions for how to change their input features to get a positive evaluation. A crucial yet overlooked phenomenon is that recourse explanations are performative: When many applicants act according to their recommendations, their collective behavior may change statistical regularities in the data and, once the model is refitted, also the decision boundary. Consequently, the recourse algorithm may render its own recommendations invalid, such that applicants who make the effort of implementing their recommendations may be rejected again when they reapply. In this work, we formally characterize the conditions under which recourse explanations remain valid under performativity. A key finding is that recourse actions may become invalid if they are influenced by or if they intervene on non-causal variables. Based on our analysis, we caution against the use of standard counterfactual explanations and causal recourse methods, and instead advocate for recourse methods that recommend actions exclusively on causal variables.

[LG-73] Proximal Operators of Sorted Nonconvex Penalties

链接: https://arxiv.org/abs/2506.15315
作者: Anne Gagneux,Mathurin Massias,Emmanuel Soubies
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work studies the problem of sparse signal recovery with automatic grouping of variables. To this end, we investigate sorted nonsmooth penalties as a regularization approach for generalized linear models. We focus on a family of sorted nonconvex penalties which generalizes the Sorted L1 Norm (SLOPE). These penalties are designed to promote clustering of variables due to their sorted nature, while the nonconvexity reduces the shrinkage of coefficients. Our goal is to provide efficient ways to compute their proximal operator, enabling the use of popular proximal algorithms to solve composite optimization problems with this choice of sorted penalties. We distinguish between two classes of problems: the weakly convex case where computing the proximal operator remains a convex problem, and the nonconvex case where computing the proximal operator becomes a challenging nonconvex combinatorial problem. For the weakly convex case (e.g. sorted MCP and SCAD), we explain how the Pool Adjacent Violators (PAV) algorithm can exactly compute the proximal operator. For the nonconvex case (e.g. sorted Lq with q in ]0,1[), we show that a slight modification of this algorithm turns out to be remarkably efficient to tackle the computation of the proximal operator. We also present new theoretical insights on the minimizers of the nonconvex proximal problem. We demonstrate the practical interest of using such penalties on several experiments.
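To illustrate how PAV enters the picture, the sketch below computes the proximal operator of the convex base case, the sorted L1 norm (SLOPE), via a stack-based Pool Adjacent Violators routine. The weakly convex and nonconvex sorted penalties studied in the paper require modifications of this routine that are not shown here.

```python
import numpy as np

def pav_decreasing(v):
    """Pool Adjacent Violators: L2-project v onto non-increasing sequences."""
    means, sizes = [], []        # stack of (block mean, block size)
    for x in v:
        means.append(x); sizes.append(1)
        # merge blocks while monotonicity is violated (previous < current)
        while len(means) > 1 and means[-2] < means[-1]:
            m = (means[-2] * sizes[-2] + means[-1] * sizes[-1]) / (sizes[-2] + sizes[-1])
            means[-2:] = [m]
            s = sizes[-2] + sizes[-1]; sizes[-2:] = [s]
    return np.concatenate([np.full(s, m) for m, s in zip(means, sizes)])

def prox_slope(x, lam):
    """Prox of the sorted L1 norm (SLOPE) with non-increasing weights lam,
    shown here as the convex base case handled by the PAV machinery."""
    sign = np.sign(x)
    order = np.argsort(-np.abs(x))          # sort |x| in decreasing order
    z = np.abs(x)[order] - lam              # shift by the sorted weights
    z = np.maximum(pav_decreasing(z), 0.0)  # isotonic projection, then clip
    out = np.empty_like(z)
    out[order] = z
    return sign * out

x = np.array([3.0, -1.2, 0.4, 2.9])
lam = np.array([1.0, 0.8, 0.5, 0.2])        # non-increasing SLOPE weights
print(prox_slope(x, lam))                   # note how 3.0 and 2.9 get clustered
```

The clustering behavior described in the abstract is visible even in this toy call: the two largest entries are pooled to a common magnitude by the PAV merge step.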

[LG-74] Data analysis using discrete cubical homology

链接: https://arxiv.org/abs/2506.15020
作者: Chris Kapulkin,Nathan Kershaw
类目: Algebraic Topology (math.AT); Machine Learning (cs.LG); Combinatorics (math.CO)
*备注: 17 pages; comments welcome

点击查看摘要

Abstract:We present a new tool for data analysis: persistence discrete homology, which is well-suited to analyze filtrations of graphs. In particular, we provide a novel way of representing high-dimensional data as a filtration of graphs using pairwise correlations. We discuss several applications of these tools, e.g., in weather and financial data, comparing them to the standard methods used in the respective fields.
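The graph-filtration representation itself is easy to sketch: treat variables as nodes and add an edge whenever the absolute pairwise correlation exceeds a decreasing threshold. The snippet below (using NumPy and networkx on toy data) shows only this representation step; the discrete cubical homology computation is not reproduced.

```python
import numpy as np
import networkx as nx

def correlation_filtration(data, thresholds):
    """Represent a (samples x variables) array as a filtration of graphs:
    variables are nodes, and an edge appears once |correlation| exceeds the
    threshold. A minimal sketch of the representation described above."""
    corr = np.corrcoef(data, rowvar=False)
    n = corr.shape[0]
    filtration = []
    for thr in sorted(thresholds, reverse=True):   # edges only accumulate
        g = nx.Graph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if abs(corr[i, j]) >= thr:
                    g.add_edge(i, j)
        filtration.append((thr, g))
    return filtration

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 10))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=500)   # a strongly correlated pair
for thr, g in correlation_filtration(data, [0.9, 0.5, 0.1]):
    print(f"threshold {thr}: {g.number_of_edges()} edges")
```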

[LG-75] POCO: Scalable Neural Forecasting through Population Conditioning

链接: https://arxiv.org/abs/2506.14957
作者: Yu Duan,Hamza Tahir Chaudhry,Misha B. Ahrens,Christopher D Harvey,Matthew G Perich,Karl Deisseroth,Kanaka Rajan
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predicting future neural activity is a core challenge in modeling brain dynamics, with applications ranging from scientific investigation to closed-loop neurotechnology. While recent models of population activity emphasize interpretability and behavioral decoding, neural forecasting-particularly across multi-session, spontaneous recordings-remains underexplored. We introduce POCO, a unified forecasting model that combines a lightweight univariate forecaster with a population-level encoder to capture both neuron-specific and brain-wide dynamics. Trained across five calcium imaging datasets spanning zebrafish, mice, and C. elegans, POCO achieves state-of-the-art accuracy at cellular resolution in spontaneous behaviors. After pre-training, POCO rapidly adapts to new recordings with minimal fine-tuning. Notably, POCO’s learned unit embeddings recover biologically meaningful structure-such as brain region clustering-without any anatomical labels. Our comprehensive analysis reveals several key factors influencing performance, including context length, session diversity, and preprocessing. Together, these results position POCO as a scalable and adaptable approach for cross-session neural forecasting and offer actionable insights for future model design. By enabling accurate, generalizable forecasting models of neural dynamics across individuals and species, POCO lays the groundwork for adaptive neurotechnologies and large-scale efforts for neural foundation models.
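A speculative sketch of the described two-part design in PyTorch: a shared lightweight univariate forecaster per neuron, conditioned on a pooled population-level context vector. All names, sizes, and pooling choices below are assumptions and do not reflect the released POCO implementation.

```python
import torch
import torch.nn as nn

class POCOSketch(nn.Module):
    """Illustrative reading of the described design: a lightweight univariate
    forecaster applied per neuron, conditioned on a population-level context."""
    def __init__(self, context_len=32, d_pop=64, horizon=8):
        super().__init__()
        self.pop_encoder = nn.Sequential(       # encodes the population window
            nn.Linear(context_len, d_pop), nn.ReLU(),
        )
        self.forecaster = nn.Sequential(        # shared univariate head
            nn.Linear(context_len + d_pop, 128), nn.ReLU(),
            nn.Linear(128, horizon),
        )

    def forward(self, x):
        # x: (B, N, T) calcium traces for N neurons over T context steps
        pop = self.pop_encoder(x).mean(dim=1)                 # (B, d_pop) pooled
        pop = pop.unsqueeze(1).expand(-1, x.shape[1], -1)     # broadcast to neurons
        return self.forecaster(torch.cat([x, pop], dim=-1))   # (B, N, horizon)

x = torch.randn(2, 100, 32)       # 2 sessions, 100 neurons, 32-step context
print(POCOSketch()(x).shape)      # torch.Size([2, 100, 8])
```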

[LG-76] An Observation on Lloyd's k-Means Algorithm in High Dimensions

链接: https://arxiv.org/abs/2506.14952
作者: David Silva-Sánchez,Roy R. Lederman
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 3 figures, 4 supplemental figures

点击查看摘要

Abstract:Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs, and those arising from applications in Cryo-Electron Microscopy.

[LG-77] Double Machine Learning for Conditional Moment Restrictions: IV regression Proximal Causal Learning and Beyond

链接: https://arxiv.org/abs/2506.14950
作者: Daqian Shao,Ashkan Soleymani,Francesco Quinzan,Marta Kwiatkowska
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Solving conditional moment restrictions (CMRs) is a key problem considered in statistics, causal inference, and econometrics, where the aim is to solve for a function of interest that satisfies some conditional moment equalities. Specifically, many techniques for causal inference, such as instrumental variable (IV) regression and proximal causal learning (PCL), are CMR problems. Most CMR estimators use a two-stage approach, where the first-stage estimation is directly plugged into the second stage to estimate the function of interest. However, naively plugging in the first-stage estimator can cause heavy bias in the second stage. This is particularly the case for recently proposed CMR estimators that use deep neural network (DNN) estimators for both stages, where regularisation and overfitting bias is present. We propose DML-CMR, a two-stage CMR estimator that provides an unbiased estimate with fast convergence rate guarantees. We derive a novel learning objective to reduce bias and develop the DML-CMR algorithm following the double/debiased machine learning (DML) framework. We show that our DML-CMR estimator can achieve the minimax optimal convergence rate of O(N^{-1/2}) under parameterisation and mild regularity conditions, where N is the sample size. We apply DML-CMR to a range of problems using DNN estimators, including IV regression and proximal causal learning on real-world datasets, demonstrating state-of-the-art performance against existing CMR estimators and algorithms tailored to those problems.

[LG-78] Digital twin for virtual sensing of ferry quays via a Gaussian Process Latent Force Model

链接: https://arxiv.org/abs/2506.14925
作者: Luigi Sibille,Torodd Skjerve Nord,Alice Cicirello
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 14 Figures, 1 Table

点击查看摘要

Abstract:Ferry quays experience rapid deterioration due to their exposure to harsh maritime environments and ferry impacts. Vibration-based structural health monitoring offers a valuable approach to assessing structural integrity and understanding the structural implications of these impacts. However, practical limitations often restrict sensor placement at critical locations. Consequently, virtual sensing techniques become essential for establishing a Digital Twin and estimating the structural response. This study investigates the application of the Gaussian Process Latent Force Model (GPLFM) for virtual sensing on the Magerholm ferry quay, combining in-operation experimental data collected during a ferry impact with a detailed physics-based model. The proposed Physics-Encoded Machine Learning model integrates a reduced-order structural model with a data-driven GPLFM representing the unknown impact forces via their modal contributions. Significant challenges are addressed for the development of the Digital Twin of the ferry quay, including unknown impact characteristics (location, direction, intensity), time-varying boundary conditions, and sparse sensor configurations. Results show that the GPLFM provides accurate acceleration response estimates at most locations, even under simplifying modeling assumptions such as linear time-invariant behavior during the impact phase. Lower accuracy was observed at locations in the impact zone. A numerical study was conducted to explore an optimal real-world sensor placement strategy using a Backward Sequential Sensor Placement approach. Sensitivity analyses were conducted to examine the influence of sensor types, sampling frequencies, and incorrectly assumed damping ratios. The results suggest that the GP latent forces can help accommodate modeling and measurement uncertainties, maintaining acceptable estimation accuracy across scenarios.

[LG-79] Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery

链接: https://arxiv.org/abs/2506.14920
作者: Alejandro Giraldo,Daniel Ruiz,Mariano Caruso,Javier Mancilla,Guido Bellomo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantitative Structure-Activity Relationship (QSAR) modeling is a cornerstone of computational drug discovery. This research demonstrates the successful application of a Quantum Multiple Kernel Learning (QMKL) framework to enhance QSAR classification, showing a notable performance improvement over classical methods. We apply this methodology to a dataset for identifying DYRK1A kinase inhibitors. The workflow involves converting SMILES representations into numerical molecular descriptors, reducing dimensionality via Principal Component Analysis (PCA), and employing a Support Vector Machine (SVM) trained on an optimized combination of multiple quantum and classical kernels. By benchmarking the QMKL-SVM against a classical Gradient Boosting model, we show that the quantum-enhanced approach achieves a superior AUC score, highlighting its potential to provide a quantum advantage in challenging cheminformatics classification tasks.
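The classical skeleton of this pipeline is easy to reproduce with scikit-learn: descriptors are reduced with PCA and an SVM is trained on a fixed-weight combination of kernel matrices. In the sketch below the "quantum" kernel is a stand-in (an RBF with a different bandwidth), since evaluating a genuine quantum kernel requires a quantum SDK; the data, weights, and kernel choices are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # stand-in for molecular descriptors
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # toy activity labels

X_red = PCA(n_components=8).fit_transform(X)   # dimensionality reduction
X_tr, X_te = X_red[:150], X_red[150:]
y_tr, y_te = y[:150], y[150:]

def combined_kernel(A, B, weights=(0.5, 0.3, 0.2)):
    """Fixed-weight multiple-kernel combination (the MKL ingredient)."""
    k1 = rbf_kernel(A, B, gamma=0.5)          # classical kernel
    k2 = polynomial_kernel(A, B, degree=2)    # classical kernel
    k3 = rbf_kernel(A, B, gamma=5.0)          # placeholder for a quantum kernel
    return weights[0] * k1 + weights[1] * k2 + weights[2] * k3

svm = SVC(kernel="precomputed")
svm.fit(combined_kernel(X_tr, X_tr), y_tr)
acc = svm.score(combined_kernel(X_te, X_tr), y_te)
print("test accuracy:", acc)
```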

[LG-80] Optimal Convergence Rates of Deep Neural Network Classifiers

链接: https://arxiv.org/abs/2506.14899
作者: Zihan Zhang,Lei Shi,Ding-Xuan Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study the binary classification problem on [0,1]^d under the Tsybakov noise condition (with exponent s \in [0,\infty] ) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of q+1 vector-valued multivariate functions, where each component function is either a maximum value function or a Hölder- \beta smooth function that depends only on d_* of its input variables. Notably, d_* can be significantly smaller than the input dimension d . We prove that, under these conditions, the optimal convergence rate for the excess 0-1 risk of classifiers is \left( \frac{1}{n} \right)^{\frac{\beta\cdot(1\wedge\beta)^q}{\frac{d_*}{s+1}+(1+\frac{1}{s+1})\cdot\beta\cdot(1\wedge\beta)^q}} , which is independent of the input dimension d . Additionally, we demonstrate that ReLU deep neural networks (DNNs) trained with hinge loss can achieve this optimal convergence rate up to a logarithmic factor. This result provides theoretical justification for the excellent performance of ReLU DNNs in practical classification tasks, particularly in high-dimensional settings. The technique used to establish these results extends the oracle inequality presented in our previous work. The generalized approach is of independent interest.

[LG-81] CutReg: A loss regularizer for enhancing the scalability of QML via adaptive circuit cutting

链接: https://arxiv.org/abs/2506.14858
作者: Maniraman Periyasamy,Christian Ufrecht,Daniel D. Scherer,Wolfgang Mauerer
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: This work has been submitted to the QML@QCE workshop for possible publication

点击查看摘要

Abstract:Whether QML can offer a transformative advantage remains an open question. The severe constraints of NISQ hardware, particularly in circuit depth and connectivity, hinder both the validation of quantum advantage and the empirical investigation of major obstacles like barren plateaus. Circuit cutting techniques have emerged as a strategy to execute larger quantum circuits on smaller, less connected hardware by dividing them into subcircuits. However, this partitioning increases the number of samples needed to estimate the expectation value accurately through classical post-processing compared to estimating it directly from the full circuit. This work introduces a novel regularization term into the QML optimization process, directly penalizing the overhead associated with sampling. We demonstrate that this approach enables the optimizer to balance the advantages of gate cutting against the optimization of the typical ML cost function. Specifically, it navigates the trade-off between minimizing the cutting overhead and maintaining the overall accuracy of the QML model, paving the way to study larger complex problems in pursuit of quantum advantage.

[LG-82] DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing ICML

链接: https://arxiv.org/abs/2506.14853
作者: Max Ku,Sun Sun,Hongyu Guo,Wenhu Chen
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: Accepted to ICMLW (GenBio) 2025 and ICMLW (FM4LS) 2025

点击查看摘要

Abstract:We introduce DisProtEdit, a controllable protein editing framework that leverages dual-channel natural language supervision to learn disentangled representations of structural and functional properties. Unlike prior approaches that rely on joint holistic embeddings, DisProtEdit explicitly separates semantic factors, enabling modular and interpretable control. To support this, we construct SwissProtDis, a large-scale multimodal dataset where each protein sequence is paired with two textual descriptions, one for structure and one for function, automatically decomposed using a large language model. DisProtEdit aligns protein and text embeddings using alignment and uniformity objectives, while a disentanglement loss promotes independence between structural and functional semantics. At inference time, protein editing is performed by modifying one or both text inputs and decoding from the updated latent representation. Experiments on protein editing and representation learning benchmarks demonstrate that DisProtEdit performs competitively with existing methods while providing improved interpretability and controllability. On a newly constructed multi-attribute editing benchmark, the model achieves a both-hit success rate of up to 61.7%, highlighting its effectiveness in coordinating simultaneous structural and functional edits.
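The three training objectives can be sketched directly. Below, alignment and uniformity follow the standard Wang-and-Isola forms; the disentanglement term is written as a cross-correlation penalty between the structural and functional channels, which is one common way to promote independence and may differ from the paper's exact loss. The loss weights and batch are toy assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(x, y):
    """Pull matched protein/text embeddings together (Wang & Isola style)."""
    return (F.normalize(x, dim=-1) - F.normalize(y, dim=-1)).pow(2).sum(-1).mean()

def uniformity_loss(x, t=2.0):
    """Spread normalized embeddings over the hypersphere."""
    x = F.normalize(x, dim=-1)
    return torch.pdist(x).pow(2).mul(-t).exp().mean().log()

def disentanglement_loss(struct_emb, func_emb):
    """Penalize cross-correlation between structural and functional channels;
    an assumed form of the independence-promoting term, not the paper's."""
    s = (struct_emb - struct_emb.mean(0)) / (struct_emb.std(0) + 1e-6)
    f = (func_emb - func_emb.mean(0)) / (func_emb.std(0) + 1e-6)
    c = (s.T @ f) / s.shape[0]
    return c.pow(2).mean()

# toy batch: protein embeddings split into two channels, plus text embeddings
B, D = 32, 64
prot_struct, prot_func = torch.randn(B, D), torch.randn(B, D)
text_struct, text_func = torch.randn(B, D), torch.randn(B, D)

loss = (
    alignment_loss(prot_struct, text_struct)
    + alignment_loss(prot_func, text_func)
    + 0.5 * (uniformity_loss(prot_struct) + uniformity_loss(prot_func))
    + 0.1 * disentanglement_loss(prot_struct, prot_func)
)
print(loss.item())
```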

[LG-83] Beyond Force Metrics: Pre-Training MLFFs for Stable MD Simulations

链接: https://arxiv.org/abs/2506.14850
作者: Shagun Maheshwari,Janghoon Ock,Adeesh Kolluru,Amir Barati Farimani,John R. Kitchin
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine-learning force fields (MLFFs) have emerged as a promising solution for speeding up ab initio molecular dynamics (MD) simulations, where accurate force predictions are critical but often computationally expensive. In this work, we employ GemNet-T, a graph neural network model, as an MLFF and investigate two training strategies: (1) direct training on MD17 (10K samples) without pre-training, and (2) pre-training on the large-scale OC20 dataset followed by fine-tuning on MD17 (10K). While both approaches achieve low force mean absolute errors (MAEs), reaching 5 meV/Å per atom, we find that lower force errors do not necessarily guarantee stable MD simulations. Notably, the pre-trained GemNet-T model yields significantly improved simulation stability, sustaining trajectories up to three times longer than the model trained from scratch. These findings underscore the value of pre-training on large, diverse datasets to capture complex molecular interactions and highlight that force MAE alone is not always a sufficient metric of MD simulation stability.

信息检索

[IR-0] DiscRec: Disentangled Semantic-Collaborative Modeling for Generative Recommendation

链接: https://arxiv.org/abs/2506.15576
作者: Chang Liu,Yimeng Bai,Xiaoyan Zhao,Yang Zhang,Fuli Feng,Wenge Rong
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Generative recommendation is emerging as a powerful paradigm that directly generates item predictions, moving beyond traditional matching-based approaches. However, current methods face two key challenges: token-item misalignment, where uniform token-level modeling ignores item-level granularity that is critical for collaborative signal learning, and semantic-collaborative signal entanglement, where collaborative and semantic signals exhibit distinct distributions yet are fused in a unified embedding space, leading to conflicting optimization objectives that limit the recommendation performance. To address these issues, we propose DiscRec, a novel framework that enables Disentangled Semantic-Collaborative signal modeling with flexible fusion for generative recommendation. Specifically, DiscRec introduces item-level position embeddings, assigned based on indices within each semantic ID, enabling explicit modeling of item structure in input token sequences. Furthermore, DiscRec employs a dual-branch module to disentangle the two signals at the embedding layer: a semantic branch encodes semantic signals using original token embeddings, while a collaborative branch applies localized attention restricted to tokens within the same item to effectively capture collaborative signals. A gating mechanism subsequently fuses both branches while preserving the model’s ability to model sequential dependencies. Extensive experiments on four real-world datasets demonstrate that DiscRec effectively decouples these signals and consistently outperforms state-of-the-art baselines. Our codes are available on this https URL.
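A minimal PyTorch sketch of the dual-branch idea appears below: the semantic branch passes token embeddings through unchanged, the collaborative branch applies attention masked to tokens of the same item, and a learned gate fuses the two. Dimensions, the mask construction, and the gating form are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Sketch: semantic branch = raw token embeddings; collaborative branch =
    attention restricted to tokens within the same item; gated fusion."""
    def __init__(self, d: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, tok_emb, item_ids):
        # tok_emb: (B, L, d); item_ids: (B, L) item index of each token
        same_item = item_ids.unsqueeze(2) == item_ids.unsqueeze(1)  # (B, L, L)
        mask = ~same_item                       # True marks *blocked* positions
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        collab, _ = self.attn(tok_emb, tok_emb, tok_emb, attn_mask=mask)
        g = self.gate(torch.cat([tok_emb, collab], dim=-1))
        return g * tok_emb + (1 - g) * collab   # gated fusion of both branches

B, L, d = 2, 8, 64
tok = torch.randn(B, L, d)
items = torch.arange(L).repeat(B, 1) // 4      # 4 tokens (semantic IDs) per item
print(DualBranchFusion(d)(tok, items).shape)   # torch.Size([2, 8, 64])
```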

[IR-1] Multi-Interest Recommendation: A Survey

链接: https://arxiv.org/abs/2506.15284
作者: Zihao Li,Qiang Chen,Lixin Zou,Aixin Sun,Chenliang Li
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing recommendation methods often struggle to model users’ multifaceted preferences due to the diversity and volatility of user behavior, as well as the inherent uncertainty and ambiguity of item attributes in practical scenarios. Multi-interest recommendation addresses this challenge by extracting multiple interest representations from users’ historical interactions, enabling fine-grained preference modeling and more accurate recommendations. It has drawn broad interest in recommendation research. However, current recommendation surveys have either specialized in frontier recommendation methods or delved into specific tasks and downstream applications. In this work, we systematically review the progress, solutions, challenges, and future directions of multi-interest recommendation by answering the following three questions: (1) Why is multi-interest modeling significantly important for recommendation? (2) What aspects are focused on by multi-interest modeling in recommendation? and (3) How can multi-interest modeling be applied, along with the technical details of the representative modules? We hope that this survey establishes a fundamental framework and delivers a preliminary overview for researchers interested in this field and committed to further exploration. The implementation of multi-interest recommendation summarized in this survey is maintained at this https URL.

[IR-2] Next-User Retrieval: Enhancing Cold-Start Recommendations via Generative Next-User Modeling

链接: https://arxiv.org/abs/2506.15267
作者: Yu-Ting Lan,Yang Huo,Yi Shen,Xiao Yang,Zuotao Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The item cold-start problem is critical for online recommendation systems, as the success of this phase determines whether high-quality new items can transition to popular ones, receive essential feedback to inspire creators, and thus lead to the long-term retention of creators. However, modern recommendation systems still struggle to address item cold-start challenges due to the heavy reliance on item and historical interactions, which are non-trivial for cold-start items lacking sufficient exposure and feedback. Lookalike algorithms provide a promising solution by extending feedback for new items based on lookalike users. Traditional lookalike algorithms face the following limitations: (1) failing to effectively model the lookalike users and further improve recommendations with the existing rule- or model-based methods; and (2) struggling to utilize the interaction signals and incorporate diverse features in modern recommendation systems. Inspired by lookalike algorithms, we propose Next-User Retrieval, a novel framework for enhancing cold-start recommendations via generative next-user modeling. Specifically, we employ a transformer-based model to capture the unidirectional relationships among recently interacted users and utilize these sequences to generate the next potential user who is most likely to interact with the item. The additional item features are also integrated as prefix prompt embeddings to assist the next-user generation. The effectiveness of Next-User Retrieval is evaluated through both offline experiments and online A/B tests. Our method achieves significant improvements with increases of +0.0142% in daily active users and +0.1144% in publications on Douyin, showcasing its practical applicability and scalability.
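The generative next-user step can be sketched as a causal transformer that reads item features as prefix tokens followed by embeddings of recently interacted users, then emits an embedding used to retrieve lookalike users by inner product. Everything below (sizes, the prefix projection, the retrieval step) is an assumption for illustration, not the production model.

```python
import torch
import torch.nn as nn

class NextUserModel(nn.Module):
    """Sketch of generative next-user modeling: item features become a prefix
    token, recent users form a causal sequence, and the final hidden state is
    the predicted next-user embedding."""
    def __init__(self, d=64, n_users=10_000, n_layers=2):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d)
        self.item_proj = nn.Linear(16, d)       # item features -> prefix token
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, item_feat, user_hist):
        # item_feat: (B, 16); user_hist: (B, T) ids of recently interacted users
        prefix = self.item_proj(item_feat).unsqueeze(1)        # (B, 1, d)
        seq = torch.cat([prefix, self.user_emb(user_hist)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
        h = self.encoder(seq, mask=causal)
        return h[:, -1]                                        # next-user embedding

model = NextUserModel()
item_feat = torch.randn(4, 16)
hist = torch.randint(0, 10_000, (4, 12))
query = model(item_feat, hist)                         # (4, 64)
# retrieve lookalike users by inner product against the user embedding table
scores = query @ model.user_emb.weight.T               # (4, n_users)
topk = scores.topk(5, dim=-1).indices
print(topk.shape)                                      # torch.Size([4, 5])
```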

附件下载

点击下载今日全部论文列表