arXiv Daily Papers | 2025-06-06

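This post lists the latest papers retrieved from arXiv.org on 2025-06-06. It is updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is pulled from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.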

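[Quick read]: This paper addresses the degradation of safety guardrails in large language models (LLMs) during downstream fine-tuning, which arises when the safety-alignment data and the fine-tuning task are too similar in representation space. The key is an analysis of representation similarity between upstream alignment datasets and downstream fine-tuning tasks: high similarity significantly weakens the guardrails and makes models more susceptible to jailbreaks, whereas low similarity yields more robust models and lowers the harmfulness score. The study highlights the importance of upstream dataset design for building durable safety guardrails.

Link: https://arxiv.org/abs/2506.05346
Authors: Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, Yaoqing Yang
Affiliations: Dartmouth College; EPFL; UC Berkeley; CUHK; IBM Research
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL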

Abstract:Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in the building of durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.

[NLP-1] Inference-Time Hyper-Scaling with KV Cache Compression

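[Quick read]: This paper addresses the efficiency of inference-time scaling, which improves reasoning accuracy by generating longer or more parallel sequences but, in Transformer large language models (LLMs), is bottlenecked by the size of the key-value (KV) cache rather than by the number of generated tokens. The key is a new method, Dynamic Memory Sparsification (DMS), which compresses the KV cache so that more tokens can be generated within the same compute budget, further improving the accuracy of scaled inference. Rather than discarding cached tokens prematurely, DMS delays token eviction, implicitly merging representations and preserving critical information. It needs only 1K training steps to reach 8× compression while maintaining better accuracy than training-free sparse attention.

Link: https://arxiv.org/abs/2506.05345
Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
Affiliations: NVIDIA; University of Edinburgh
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: none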

Abstract:Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
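
To make the budget argument concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard estimate of KV cache size (one key and one value vector per token, per layer, per KV head); the model shape and the helper names `kv_cache_bytes` and `tokens_within_budget` are hypothetical and chosen only for illustration, and the sketch does not model DMS itself.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_within_budget(budget_bytes, compression_ratio, **model_shape):
    """Tokens that fit in budget_bytes when the cache shrinks by compression_ratio."""
    per_token = kv_cache_bytes(1, **model_shape)
    return int(budget_bytes * compression_ratio // per_token)


# Hypothetical model shape (assumption, not taken from the paper).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

# Memory used by an uncompressed 4096-token cache.
budget = kv_cache_bytes(4096, **shape)

dense_tokens = tokens_within_budget(budget, 1, **shape)
compressed_tokens = tokens_within_budget(budget, 8, **shape)
```

Under these assumptions, the same memory that holds 4,096 uncompressed tokens holds 32,768 tokens at 8× compression, which is the headroom that hyper-scaling spends on longer or more parallel generations.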

[NLP-2] Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models

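[Quick read]: This paper addresses the miscalibration of preference models used in language-model alignment and evaluation, which stems from biases in the training data: the models over-rely on superficial features such as length, structure, jargon, sycophancy, and vagueness instead of substantive quality, leading to reward hacking and unreliable evaluations. The key is a post-training method based on counterfactual data augmentation (CDA) that uses synthesized contrastive examples to reduce reliance on spurious cues. Experiments show that it lowers miscalibration and reduces the skew difference.

Link: https://arxiv.org/abs/2506.05339
Authors: Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Code and data available at this https URL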

Abstract:Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in 60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.

[NLP-3] Search Arena: Analyzing Search-Augmented LLMs

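[Quick read]: This paper addresses the difficulty of systematically analyzing search-augmented language models: existing datasets are limited in scale and scope, and are often restricted to static, single-turn, fact-checking questions. The key is Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions that spans diverse intents and languages and contains full system traces with around 12,000 human preference votes. The dataset provides a rich empirical basis for studying user preferences and model performance.

Link: https://arxiv.org/abs/2506.05334
Authors: Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Affiliations: UC Berkeley
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Preprint. Code: this https URL . Dataset: this https URL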

Abstract:Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model’s parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: this https URL.

[NLP-4] Kinetics: Rethinking Test-Time Scaling Laws

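[Quick read]: This paper addresses two problems in test-time scaling laws: the effectiveness of smaller models is overestimated, and inference-time strategies (such as Best-of-N and long chains of thought) introduce memory access bottlenecks. The key is a new Kinetics Scaling Law that guides resource allocation by accounting for both computation and memory access costs, and shows that test-time compute is more effective when spent on models above a certain size threshold. The core reason is that at test time attention, rather than parameter count, becomes the dominant cost factor. In response, the authors propose a scaling paradigm centered on sparse attention, which lowers per-token cost and so allows longer generations and more parallel samples within the same resource budget.

Link: https://arxiv.org/abs/2506.05333
Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Affiliations: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: none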

Abstract:We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at this https URL.

[NLP-5] Unleashing Hour-Scale Video Training for Long Video-Language Understanding

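[Quick read]: This paper addresses the under-explored training of hour-scale video large multimodal models (Video-LMMs) for long video-language understanding, which is caused by the scarcity of well-annotated long videos. The key is VideoMarathon, a large-scale hour-long video instruction-following dataset with around 9,700 hours of long videos from diverse domains and 3.3M high-quality QA pairs, supporting 22 tasks that require both short- and long-term video comprehension. Building on this dataset, the authors propose Hour-LLaVA, which uses a memory augmentation module to enable efficient training and inference on hour-long videos at 1-FPS sampling.

Link: https://arxiv.org/abs/2506.05332
Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
Affiliations: AMD; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project page: this https URL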

Abstract:Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

[NLP-6] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

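[Quick read]: This paper addresses the low data efficiency of reinforcement learning (RL) fine-tuning for large language models (LLMs). RL fine-tuning is resource-intensive, and existing work has largely overlooked data efficiency. The key is two techniques: difficulty-targeted online data selection and rollout replay. Adaptive difficulty guides online data selection, prioritizing questions of moderate difficulty that yield more informative learning signals; an attention-based framework estimates the adaptive difficulty of the remaining questions efficiently, so rollouts are needed only for a small reference set. In addition, a rollout replay mechanism reuses recent rollouts to lower per-step computation, which significantly improves the data efficiency of RL fine-tuning.

Link: https://arxiv.org/abs/2506.05316
Authors: Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
Affiliations: UIUC; New York University; University of Texas at Austin; Microsoft
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none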

Abstract:Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism that reuses recent rollouts, lowering per-step computation while maintaining stable updates. Extensive experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 25% to 65% to reach the same level of performance as the original GRPO algorithm.
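
The selection idea in this abstract can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes that difficulty is a failure rate in [0, 1], that "moderate" means close to 0.5, and that similarity is a dot product between hypothetical question embeddings. The function names and toy vectors are illustrative only.

```python
import math


def estimate_difficulty(query_emb, ref_embs, ref_difficulties, temperature=1.0):
    """Attention-style estimate: softmax-weighted average of reference difficulties."""
    scores = [
        sum(q * r for q, r in zip(query_emb, ref)) / temperature
        for ref in ref_embs
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, ref_difficulties)) / total


def select_moderate(question_ids, difficulties, k, target=0.5):
    """Pick the k questions whose estimated difficulty is closest to target."""
    ranked = sorted(question_ids, key=lambda q: abs(difficulties[q] - target))
    return ranked[:k]


# Reference set: the only questions that need real rollouts (toy values).
ref_embs = [[1.0, 0.0], [0.0, 1.0]]
ref_difficulties = [0.1, 0.9]  # assumed failure rates measured from rollouts

# Remaining questions with hypothetical embeddings.
pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [3.0, 0.0]}
estimated = {
    q: estimate_difficulty(emb, ref_embs, ref_difficulties)
    for q, emb in pool.items()
}
top1 = select_moderate(list(pool), estimated, k=1)
top3 = select_moderate(list(pool), estimated, k=3)
```

In this toy setup, question `c` is equally similar to the easy and the hard reference question, so its estimate is 0.5 and it is selected first, while `d`, which looks very much like the easy reference, is left out of the top three.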

[NLP-7] Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

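[Quick read]: This paper addresses the need for large language models (LLMs) deployed in practice to unlearn sensitive, outdated, or proprietary information. Existing methods typically formulate forgetting and retention as a regularized trade-off, combining the two objectives into a single scalarized loss, which often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. The paper proposes a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced through a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. The key is that the loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization.

Link: https://arxiv.org/abs/2506.05314
Authors: Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
Affiliations: Johns Hopkins University; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none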

Abstract:Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
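
For readers unfamiliar with primal-dual methods, the constrained formulation can be written in its generic textbook form as below. The symbols are assumptions introduced here for illustration: $\mathcal{L}_{\text{forget}}$ stands for the forgetting loss, $\mathcal{L}_{\text{retain}}$ for the retention loss, $\varepsilon$ for the allowed retention slack, and $\eta_{\theta}, \eta_{\lambda}$ for step sizes. The abstract does not give the paper's exact losses or update rules, so this is a sketch of the general technique rather than the authors' algorithm.

```latex
\[
\begin{aligned}
&\min_{\theta}\; \mathcal{L}_{\text{forget}}(\theta)
  \quad \text{s.t.} \quad \mathcal{L}_{\text{retain}}(\theta) \le \varepsilon, \\
&\mathcal{L}(\theta, \lambda)
  = \mathcal{L}_{\text{forget}}(\theta)
  + \lambda \bigl( \mathcal{L}_{\text{retain}}(\theta) - \varepsilon \bigr),
  \qquad \lambda \ge 0, \\
&\theta_{t+1} = \theta_t - \eta_{\theta}\, \nabla_{\theta} \mathcal{L}(\theta_t, \lambda_t), \\
&\lambda_{t+1} = \max\Bigl( 0,\; \lambda_t + \eta_{\lambda}
  \bigl( \mathcal{L}_{\text{retain}}(\theta_{t+1}) - \varepsilon \bigr) \Bigr).
\end{aligned}
\]
```

In this form the dual variable $\lambda$ grows while the retention constraint is violated and decays toward zero once it is satisfied, which is one way to read the abstract's remark that the dual dynamics expose the trade-off between forgetting and retention.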

[NLP-8] Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

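[Quick read]: This paper addresses the use of large language models (LLMs) in asynchronous communication, in particular group interactions without explicit turns such as group chats, online team meetings, and social games. LLMs are used predominantly in synchronous communication, whereas in real-world settings participants decide for themselves when to speak, and that decision is a crucial part of their decision making. The study proposes an adaptive asynchronous LLM agent whose key feature is that it decides not only what to say but also when to say it. Evaluated on data collected from online Mafia games, the agent performs on par with human players both in game performance and in its ability to blend in; its timing behavior closely mirrors human patterns, although differences emerge in message content.

Link: https://arxiv.org/abs/2506.05309
Authors: Niv Eckhaus, Uri Berger, Gabriel Stanovsky
Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem; School of Computing and Information Systems, University of Melbourne
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none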

Abstract:LLMs are used predominantly in synchronous communication, where a human user and a model communicate in alternating turns. In contrast, many real-world settings are inherently asynchronous. For example, in group chats, online team meetings, or social games, there is no inherent notion of turns; therefore, the decision of when to speak forms a crucial part of the participant’s decision making. In this work, we develop an adaptive asynchronous LLM-agent which, in addition to determining what to say, also decides when to say it. To evaluate our agent, we collect a unique dataset of online Mafia games, including both human participants, as well as our asynchronous agent. Overall, our agent performs on par with human players, both in game performance, as well as in its ability to blend in with the other human players. Our analysis shows that the agent’s behavior in deciding when to speak closely mirrors human patterns, although differences emerge in message content. We release all our data and code to support and encourage further research for more realistic asynchronous communication between LLM agents. This work paves the way for integration of LLMs into realistic human group settings, from assistance in team discussions to educational and professional environments where complex social dynamics must be navigated.

[NLP-9] ProRefine: Inference-time Prompt Refinement with Textual Feedback

[Quick read]: This paper addresses error propagation and sub-optimal performance in multi-agent workflows caused by poorly designed prompts. The key is ProRefine, an inference-time prompt optimization method that uses textual feedback from large language models (LLMs) to dynamically refine prompts for multi-step reasoning tasks, improving the accuracy and efficiency of task execution.

Link: https://arxiv.org/abs/2506.05305
Authors: Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Christopher M. Homan, Wei Wei
Affiliations: Rochester Institute of Technology; Center for Advanced AI, Accenture
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none

Abstract:Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, are becoming increasingly prevalent. However, these workflows often suffer from error propagation and sub-optimal performance, largely due to poorly designed prompts that fail to effectively guide individual agents. This is a critical problem because it limits the reliability and scalability of these powerful systems. We introduce ProRefine, an innovative inference-time prompt optimization method that leverages textual feedback from large language models (LLMs) to address this challenge. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to match the performance of larger ones, highlighting its potential for efficient and scalable AI deployment, and democratizing access to high-performing AI.

[NLP-10] Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning ACL2025

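[Quick read]: This paper addresses the performance degradation in retrieval-augmented generation (RAG) systems caused by conflicts between external knowledge and the inherent parametric knowledge of large language models (LLMs), particularly in question answering (QA). The key is the Micro-Act framework, which uses a hierarchical action space to automatically perceive context complexity and adaptively decompose each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context.

Link: https://arxiv.org/abs/2506.05278
Authors: Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng
Affiliations: The University of Hong Kong; BAAI; Xiamen University; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025 Main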

Abstract:Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). It adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.

[NLP-11] CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection

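[Quick read]: This paper addresses hallucination detection in generated text, that is, deciding whether the generated content is entailed by the reference texts. The key is to guide the model through a systematic and comprehensive reasoning process that decomposes the text into smaller facts, finds evidence in the source for each fact, and makes fine-grained entailment decisions, which improves detection performance.

Link: https://arxiv.org/abs/2506.05243
Authors: Ron Eliav, Arie Cattan, Eran Hirsch, Shahaf Bassan, Elias Stengel-Eskin, Mohit Bansal, Ido Dagan
Affiliations: Bar-Ilan University; Hebrew University of Jerusalem; UNC Chapel Hill
Categories: Computation and Language (cs.CL)
Comments: none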

Abstract:A common approach to hallucination detection casts it as a natural language inference (NLI) task, often using LLMs to classify whether the generated text is entailed by corresponding reference texts. Since entailment classification is a complex reasoning task, one would expect that LLMs could benefit from generating an explicit reasoning process, as in CoT reasoning or the explicit "thinking" of recent reasoning models. In this work, we propose that guiding such models to perform a systematic and comprehensive reasoning process – one that both decomposes the text into smaller facts and also finds evidence in the source for each fact – allows models to execute much finer-grained and accurate entailment decisions, leading to increased performance. To that end, we define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection. Following this reasoning framework, we introduce an analysis scheme, consisting of several metrics that measure the quality of the intermediate reasoning steps, which provided additional empirical evidence for the improved quality of our guided reasoning scheme.
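
The three-step structure described in this abstract can be sketched as a small Python pipeline. Everything inside the steps is a toy stand-in: in the paper the steps are carried out by an LLM, whereas here decomposition splits on " and ", attribution uses word overlap, and aggregation assumes a claim is entailed only if every sub-claim is entailed. All function names are hypothetical.

```python
def decompose(claim):
    """Step (i): split a claim into sub-claims (toy rule: split on ' and ')."""
    return [part.strip() for part in claim.split(" and ") if part.strip()]


def attribute_and_classify(sub_claim, source_sentences):
    """Step (ii): find supporting evidence and label the sub-claim.

    Toy rule: the sub-claim counts as entailed if all of its words occur in a
    single source sentence, which is then returned as the evidence.
    """
    words = set(sub_claim.lower().split())
    for sentence in source_sentences:
        if words <= set(sentence.lower().split()):
            return sentence, True
    return None, False


def aggregate(labels):
    """Step (iii): assumed rule, entailed only if every sub-claim is entailed."""
    return all(labels)


def detect_hallucination(claim, source_sentences):
    """Return True when the claim is not fully supported by the source."""
    results = [
        attribute_and_classify(sub_claim, source_sentences)
        for sub_claim in decompose(claim)
    ]
    return not aggregate([entailed for _, entailed in results])


source = ["the cat sat on the mat", "the dog barked loudly"]
supported = detect_hallucination("the cat sat and the dog barked", source)
unsupported = detect_hallucination("the cat sat and the dog slept", source)
```

The point of the structure is that the final label is derived from per-sub-claim decisions with explicit evidence, so an unsupported detail such as "slept" is caught even when the rest of the claim is well grounded.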

[NLP-12] Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

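[Quick read]: This paper addresses the unified representation of continuity and discontinuity in natural language structure (syntactic discontinuity), that is, how to analyze linguistic structure consistently across different grammar formalisms. The key is to integrate three formalisms: Phrase Structure Grammar (PSG) for its notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations, and to describe both continuous and discontinuous structures through a unified mathematical derivation.

Link: https://arxiv.org/abs/2506.05235
Authors: Ratna Kandala, Prakash Mondal
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: none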

Abstract:Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, Yidiny etc. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has widely been viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms, in particular, Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.

[NLP-13] MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

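[Quick read]: This paper addresses the linear growth of memory and compute during inference in conventional causal Transformer architectures. The key is a numerically stable, chunkwise parallelizable version of the Mesa layer, which stems from an in-context loss that is minimized to optimality at every time point using a fast conjugate gradient solver. This yields lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, at the cost of additional FLOPs at inference time.

Link: https://arxiv.org/abs/2506.05233
Authors: Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none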

Abstract:Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance – here by spending compute to solve sequential optimization problems within the neural network itself.
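
To give a feel for what "minimizing an in-context loss with a conjugate gradient solver" involves, here is a minimal pure-Python sketch. It assumes a ridge-regularized least-squares objective over (key, value) pairs with scalar values, which is a simplification chosen for illustration; the abstract does not spell out the Mesa layer's exact objective, and the function names are hypothetical.

```python
def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def in_context_ridge(keys, values, lam=1e-3):
    """Minimize sum_i (w . k_i - v_i)^2 + lam * |w|^2 over w, using CG."""
    dim = len(keys[0])

    def matvec(w):
        # Computes (K^T K + lam * I) w without forming the matrix explicitly.
        projections = [sum(kj * wj for kj, wj in zip(k, w)) for k in keys]
        out = [lam * wj for wj in w]
        for k, proj in zip(keys, projections):
            for j in range(dim):
                out[j] += k[j] * proj
        return out

    # Right-hand side K^T v.
    b = [sum(k[j] * v for k, v in zip(keys, values)) for j in range(dim)]
    return conjugate_gradient(matvec, b)


# Toy context generated by the hidden weights [2, -1].
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, -1.0, 1.0]
w = in_context_ridge(keys, values)
```

With a small regularizer the recovered weights are close to the hidden ones. Each solve costs several matrix-vector products, which is consistent with the abstract's note that the performance gain is paid for with extra inference-time FLOPs.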

[NLP-14] Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

[Quick read]: This paper addresses the bottleneck that Transformer models face in long-context inference due to quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory, but their memory update mechanism forces sequential execution and creates a performance bottleneck. The key contribution is Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence, removing the sequential constraint and improving GPU inference efficiency.

Link: https://arxiv.org/abs/2506.05229
Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets
Affiliations: AIRI; Skoltech; Neural Networks and Deep Learning Lab; MIPT; MBZUAI; FRC CSC RAS
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: none

Abstract:Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models adopt it with no retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
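
The scheduling idea can be illustrated with a short Python sketch of a wavefront (anti-diagonal) schedule over a grid of (segment, layer) cells. It assumes a dependency pattern in which cell (s, l) needs only cell (s, l-1), the previous layer on the same segment, and cell (s-1, l), the same layer's memory from the previous segment. That pattern is an assumption made for illustration, and the function is not taken from the authors' code.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into waves whose cells can run in parallel.

    All cells with the same segment + layer index sit on one anti-diagonal and
    depend only on cells from the previous anti-diagonal.
    """
    waves = []
    for diagonal in range(num_segments + num_layers - 1):
        wave = [
            (segment, diagonal - segment)
            for segment in range(num_segments)
            if 0 <= diagonal - segment < num_layers
        ]
        waves.append(wave)
    return waves


# Toy example: 3 segments passing through 2 layers.
waves = diagonal_schedule(3, 2)
sequential_steps = 3 * 2
parallel_steps = len(waves)
```

For this toy grid the six cells are covered in four waves instead of six sequential steps. In general the number of waves is the number of segments plus the number of layers minus one, so the benefit grows with longer inputs, and because only the execution order changes, the results match the sequential recurrence.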

[NLP-15] Improving Low-Resource Morphological Inflection via Self-Supervised Objectives ACL2025

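[Quick read]: This paper addresses how to model morphological inflection effectively in extremely low-resource settings, where large amounts of data are unavailable. The key is to study self-supervised auxiliary tasks, in particular comparing autoencoding and character masked language modeling (CMLM), across different amounts of data. Autoencoding performs best when unlabeled data is very limited, CMLM becomes more effective as data availability increases, and sampling masks based on known morpheme boundaries consistently improves performance.

Link: https://arxiv.org/abs/2506.05227
Authors: Adam Wiemerslage, Katharina von der Wense
Affiliations: University of Colorado Boulder; Johannes Gutenberg University Mainz; Kensho Technologies
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 main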

Abstract:Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world’s languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection – a character-level task highly relevant for language documentation – in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.
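
The difference between random character masking and masking guided by morpheme boundaries can be shown with a small Python sketch. The mask symbol, the masking rate, and the rule of masking one whole morpheme chosen uniformly at random are assumptions made for illustration; the abstract does not describe the paper's exact sampling procedure.

```python
import random

MASK = "_"  # hypothetical mask symbol


def mask_random_chars(word, rate, rng):
    """CMLM-style input: mask each character independently with probability rate."""
    return "".join(MASK if rng.random() < rate else ch for ch in word)


def mask_morpheme(morphemes, rng):
    """Boundary-aware input: mask one whole morpheme from a known segmentation.

    Returns the masked word and the morpheme the model would have to restore.
    """
    index = rng.randrange(len(morphemes))
    masked = [
        MASK * len(morpheme) if i == index else morpheme
        for i, morpheme in enumerate(morphemes)
    ]
    return "".join(masked), morphemes[index]


rng = random.Random(0)  # seeded so the example is reproducible
random_masked = mask_random_chars("walking", 0.3, rng)
boundary_masked, target = mask_morpheme(["walk", "ing"], rng)
```

Random masking may hide characters that straddle a stem and a suffix, while the boundary-aware variant always hides a linguistically meaningful unit, which is one plausible reading of why the abstract reports consistent gains from morpheme-based mask sampling.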

[NLP-16] Mitigating Degree Bias Adaptively with Hard-to-Learn Nodes in Graph Contrastive Learning

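Quick read: This paper addresses degree bias in Graph Neural Networks (GNNs) for node classification, where prediction performance varies across nodes of different degrees. Existing methods mitigate this bias with Graph Contrastive Learning (GCL), but the limited number of positive pairs and the equal weighting of all positives and negatives leave low-degree nodes with insufficient and noisy information. The key to the proposed solution is the Hardness Adaptive Reweighted (HAR) contrastive loss, which adds positive pairs by leveraging node labels and adaptively weights positive and negative pairs according to their learning hardness. The paper also builds an experimental framework, SHARP, to extend HAR to a broader range of scenarios; both theoretical analysis and experiments validate its effectiveness.

Link: https://arxiv.org/abs/2506.05214
Authors: Jingyu Hu, Hongbo Bo, Jun Hong, Xiaowei Liu, Weiru Liu
Affiliations: University of Bristol; University of the West of England
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: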

Abstract:Graph Neural Networks (GNNs) often suffer from degree bias in node classification tasks, where prediction performance varies across nodes with different degrees. Several approaches, which adopt Graph Contrastive Learning (GCL), have been proposed to mitigate this bias. However, the limited number of positive pairs and the equal weighting of all positives and negatives in GCL still lead to low-degree nodes acquiring insufficient and noisy information. This paper proposes the Hardness Adaptive Reweighted (HAR) contrastive loss to mitigate degree bias. It adds more positive pairs by leveraging node labels and adaptively weights positive and negative pairs based on their learning hardness. In addition, we develop an experimental framework named SHARP to extend HAR to a broader range of scenarios. Both our theoretical analysis and experiments validate the effectiveness of SHARP. The experimental results across four datasets show that SHARP achieves better performance against baselines at both global and degree levels.

[NLP-17] LLM-First Search: Self-Guided Exploration of the Solution Space

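Quick read: This paper addresses the poor adaptability of conventional search algorithms across tasks of varying difficulty, in particular their reliance on fixed exploration hyperparameters, which makes them impractical or expensive in some settings. The key to the solution is LLM-First Search (LFS), a method in which the large language model (LLM) controls the search process itself through self-guided exploration, removing the need for a predefined search strategy. LFS uses the LLM’s internal scoring mechanism to decide whether to continue along the current path or explore other branches, which makes reasoning more flexible and context-sensitive while avoiding manual tuning or task-specific adaptation.

Link: https://arxiv.org/abs/2506.05213
Authors: Nathan Herr, Tim Rocktäschel, Roberta Raileanu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 main pages, 2 figures, 2 tables, 36 appendix pages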

Abstract:Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts’ Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at this https URL (LLM-First-Search).

[NLP-18] The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Quick read: This paper addresses the intellectual property and ethical concerns raised by training large language models (LLMs) on unlicensed text. The key to its solution is the construction of a large-scale, high-quality dataset of openly licensed text, the Common Pile v0.1. The dataset contains 8 TB of openly licensed text from 30 sources spanning diverse domains. The authors validate it by training two 7-billion-parameter LLMs on it (Comma v0.1-1T and Comma v0.1-2T); both reach performance comparable to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B.

Link: https://arxiv.org/abs/2506.05209
Authors: Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray
Affiliations: University of Toronto; Vector Institute; Hugging Face; EleutherAI; The Allen Institute for Artificial Intelligence; Teraflop AI; Cornell University; University of Maryland, College Park; MIT; CMU; Lila Sciences; Independent; poolside; Lawrence Livermore National Laboratory
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

[NLP-19] RELIC: Evaluating Compositional Instruction Following via Language Recognition

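Quick read: This paper addresses how to evaluate instruction following, the ability of large language models (LLMs) to perform a task given only a specification of it in context. The key is the Recognition of Languages In-Context (RELIC) framework, which evaluates instruction following through language recognition: deciding whether a string is generated by a formal grammar. The task requires the model to compose a large number of instructions (grammar productions) from the context. Because the languages are synthetic, task complexity can be adjusted and new instances generated automatically, which mitigates data contamination.

Link: https://arxiv.org/abs/2506.05205
Authors: Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen
Affiliations: New York University
Subjects: Computation and Language (cs.CL)
Comments: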

Abstract:Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs’ ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs’ skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
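To make the language recognition task concrete, here is a minimal sketch of a classical recognizer. It assumes a context-free grammar in Chomsky normal form and uses the CYK algorithm; the abstract does not say which grammar class or reference procedure RELIC uses, and the toy grammar for strings of the form a^n b^n is purely illustrative.

```python
from typing import Dict, Sequence, Set, Tuple


def cyk_recognize(
    tokens: Sequence[str],
    unary: Dict[str, Set[str]],
    binary: Dict[Tuple[str, str], Set[str]],
    start: str = "S",
) -> bool:
    """Return True if `tokens` is generated by a grammar in Chomsky normal form.

    `unary` maps a terminal to the nonterminals that produce it, and
    `binary` maps a pair (B, C) to the nonterminals A with a rule A -> B C.
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][length - 1] holds the nonterminals deriving tokens[i:i + length].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][0] = set(unary.get(token, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (b, c), heads in binary.items():
                    if b in left and c in right:
                        table[i][length - 1] |= heads
    return start in table[0][n - 1]


# Toy grammar (illustrative): S -> A B | A X, X -> S B, A -> a, B -> b.
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}

print(cyk_recognize(list("aabb"), UNARY, BINARY))
print(cyk_recognize(list("aab"), UNARY, BINARY))
```

Each table cell is filled by combining productions over shorter spans, which shows in miniature why the task demands composing many grammar rules rather than matching a surface pattern.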

[NLP-20] Counterfactual reasoning: an analysis of in-context emergence

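Quick read: This paper studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. The key is a well-defined synthetic setup: a linear regression task in which accurate prediction requires inferring the contextual noise from factual observations and copying it. The study finds that language models are capable of counterfactual reasoning in this controlled setting and shows that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; self-attention, model depth, and the diversity of pre-training data are important drivers of Transformer performance.

Link: https://arxiv.org/abs/2506.05188
Authors: Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Affiliations: Max Planck Institute for Intelligent Systems; ETH Zurich; University of Cambridge
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: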

Abstract:Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn from and reason about the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, predicting the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under this https URL.

[NLP-21] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

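Quick read: This paper aims to improve performance on text embedding and reranking tasks, particularly in multilingual settings. Its key solution is the Qwen3 Embedding series, built on the Qwen3 foundation models. A multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets, and uses the Qwen3 LLMs’ strong multilingual text understanding and generation to synthesize high-quality, rich, and diverse training data across domains and languages, improving the robustness and adaptability of the models.

Link: https://arxiv.org/abs/2506.05176
Authors: Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments: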

Abstract:In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs’ robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

[NLP-22] ECoRAG: Evidentiality-guided Compression for Long Context RAG

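Quick read: This paper addresses the limited performance of LLM-based Retrieval-Augmented Generation (RAG) systems caused by the computational overhead of long contexts and interference from non-evidential information. The key to the solution is Evidentiality-guided RAG (ECoRAG), a framework that compresses retrieved documents according to evidentiality so that answer generation is supported by the correct evidence, and retrieves more when the compressed content does not provide sufficient evidence. This improves performance on Open-Domain Question Answering (ODQA) while reducing latency and token usage.

Link: https://arxiv.org/abs/2506.05167
Authors: Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang
Affiliations: Seoul National University; Korea University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: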

Abstract:Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). Context compression is necessary to reduce the overhead that RAG incurs from longer contexts. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or the ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, checking whether answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence and, if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at this https URL.

[NLP-23] Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

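Quick read: This paper addresses demographic and gender biases in large language models (LLMs). The key is a mechanistic interpretability approach that analyzes how these biases are structurally represented inside the model. The study identifies the internal edges responsible for biased behavior and assesses the stability, localization, and generalizability of these components across dataset and linguistic variations. Systematic ablations show that bias-related computations are highly localized, often concentrated in a small subset of layers, and that the identified components change across fine-tuning settings. Removing these components reduces biased outputs but also affects other NLP tasks.

Link: https://arxiv.org/abs/2506.05166
Authors: Bhavik Chandna, Zubair Bashir, Procheta Sen
Affiliations: University of California San Diego; Indian Institute of Technology, Kharagpur; University of Liverpool
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: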

Abstract:Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.

[NLP-24] Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation

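Quick read: This paper aims to address the tendency of current Retrieval-Augmented Generation (RAG) systems to over-rely on retrieved context, which leads to dependence on inaccurate sources and neglect of the model’s own knowledge, especially when the retrieved information is misleading or excessive. The key to the solution is Knowledgeable-r1, which uses joint sampling and defines multiple policy distributions during knowledge capability exploration to encourage the large language model to integrate its parametric and contextual knowledge.

Link: https://arxiv.org/abs/2506.05154
Authors: Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Affiliations: Baidu Inc.; Institute of Computing Technology, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: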

Abstract:Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and neglect of the model’s inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multi-policy distributions in knowledge capability exploration to stimulate large language models’ self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual knowledge conflict tasks and general RAG tasks, notably outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at this https URL (knowledgeable-r1).

[NLP-25] CIVET: Systematic Evaluation of Understanding in VLMs

Quick read: This paper addresses the fact that the understanding Vision-Language Models (VLMs) have of the underlying structure and semantics of a scene remains understudied, in particular their grasp of object properties and relations. The key to the solution is CIVET, a novel and extensible framework for systematic evaluation via controlled stimuli, which measures VLMs’ understanding in a standardized way and lets researchers test hypotheses with statistical rigor.

Link: https://arxiv.org/abs/2506.05146
Authors: Massimo Rizzoli, Simone Alghisi, Olha Khomyn, Gabriel Roccabruna, Seyed Mahed Mousavi, Giuseppe Riccardi
Affiliations: Signals and Interactive Systems Lab, University of Trento
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs’ understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.

[NLP-26] Do Large Language Models Judge Error Severity Like Humans?

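Quick read: This paper asks whether large language models (LLMs) can accurately replicate human judgments of the severity of semantic errors in natural language generation. The key is a systematic comparison of human and LLM assessments of image descriptions containing controlled semantic errors, covering four error types (age, gender, clothing type, and clothing colour) in both unimodal and multimodal settings. The study finds that visual context significantly affects how severe humans judge an error type to be, whereas most LLMs deviate from humans in their scores for gender and colour errors. This suggests that the models may have internalised social norms but lack the perceptual grounding needed to emulate human sensitivity to colour; only a few models show human-like judgment patterns.

Link: https://arxiv.org/abs/2506.05142
Authors: Diege Sun, Guanyi Chen, Fan Zhao, Xiaorong Cheng, Tingting He
Affiliations: Central China Normal University
Subjects: Computation and Language (cs.CL)
Comments: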

Abstract:Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.

[NLP-27] AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

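Quick read: This paper aims to understand the internal mechanisms of Large Audio-Language Models (LALMs), specifically how they perceive and recognize auditory attributes. The key is to apply vocabulary projection to three state-of-the-art LALMs and track how attribute information evolves across layers and token positions. The study finds that when recognition fails, attribute information generally decreases with layer depth, and that resolving attributes at earlier layers correlates with higher accuracy. LALMs also rely mainly on querying the auditory input to predict attributes rather than aggregating the necessary information at attribute-mentioning positions. Based on these findings, the paper demonstrates a method for enhancing LALMs.

Link: https://arxiv.org/abs/2506.05140
Authors: Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-yi Lee
Affiliations: National Taiwan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 5 figures, 3 tables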

Abstract:Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.

[NLP-28] Information Locality as an Inductive Bias for Neural Language Models

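Quick read: This paper asks whether the inductive biases of neural language models (LMs) align with human processing constraints. The key is a quantitative framework built around m-local entropy, an information-theoretic measure derived from average lossy-context surprisal that quantifies the local uncertainty of a language and is used to assess how sensitive language models are to a language’s local statistical structure.

Link: https://arxiv.org/abs/2506.05136
Authors: Taiga Someya, Anej Svete, Brian DuSell, Timothy J. O’Donnell, Mario Giulianelli, Ryan Cotterell
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: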

Abstract:Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce m-local entropy – an information-theoretic measure derived from average lossy-context surprisal – that captures the local uncertainty of a language by quantifying how effectively the m-1 preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSAs), we show that languages with higher m-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
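As a rough illustration of the quantity described above, the sketch below computes a plug-in estimate of the conditional entropy of the next symbol given the m-1 preceding symbols, using raw n-gram counts from a single sequence. This is one simple reading of "how effectively the m-1 preceding symbols disambiguate the next symbol"; the paper's formal definition via average lossy-context surprisal may differ, and the function name `m_local_entropy` is only a label chosen here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def m_local_entropy(sequence: Sequence[Hashable], m: int) -> float:
    """Plug-in estimate of H(next symbol | m-1 preceding symbols), in bits."""
    if m < 1 or len(sequence) < m:
        raise ValueError("need m >= 1 and a sequence of at least m symbols")
    # Count every window of m consecutive symbols.
    ngrams = Counter(
        tuple(sequence[i:i + m]) for i in range(len(sequence) - m + 1)
    )
    # Count how often each (m-1)-symbol context precedes some symbol.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    entropy = 0.0
    for gram, count in ngrams.items():
        p_joint = count / total
        p_next_given_context = count / contexts[gram[:-1]]
        entropy -= p_joint * math.log2(p_next_given_context)
    return entropy


# In a strictly alternating sequence, one symbol of context removes
# all uncertainty about the next symbol.
print(m_local_entropy("abababab", m=1))
print(m_local_entropy("abababab", m=2))
```

On real corpora a raw-count estimate like this is biased downward for larger m because many contexts are seen only once, so it should be read as a toy illustration rather than a usable estimator.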

[NLP-29] DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning ACL

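Quick read: This paper addresses zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data. Large language models (LLMs) hit a performance bottleneck when they must understand a complex event ontology, extract domain-specific triggers, and structure them appropriately. The key to the proposed solution is DiCoRe, a framework that decouples ED into two components, Dreamer and Grounder, to achieve divergent-convergent reasoning: Dreamer improves event coverage through open-ended event discovery, Grounder aligns the free-form predictions with the task using finite-state-machine-guided constrained decoding, and an LLM-Judge then verifies the final outputs to ensure high precision.

Link: https://arxiv.org/abs/2506.05128
Authors: Tanmay Parekh, Kartik Mehta, Ninareh Mehrabi, Kai-Wei Chang, Nanyun Peng
Affiliations: University of California, Los Angeles; Amazon AGI Foundations
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted at ACL ARR May 2025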

Abstract:Zero-shot Event Detection (ED), the task of identifying event mentions in natural language text without any training data, is critical for document understanding in specialized domains. Understanding the complex event ontology, extracting domain-specific triggers from the passage, and structuring them appropriately overloads and limits the utility of Large Language Models (LLMs) for zero-shot ED. To this end, we propose DiCoRe, a divergent-convergent reasoning framework that decouples the task of ED using Dreamer and Grounder. Dreamer encourages divergent reasoning through open-ended event discovery, which helps to boost event coverage. Conversely, Grounder introduces convergent reasoning to align the free-form predictions with the task-specific instructions using finite-state machine guided constrained decoding. Additionally, an LLM-Judge verifies the final outputs to ensure high precision. Through extensive experiments on six datasets across five domains and nine LLMs, we demonstrate how DiCoRe consistently outperforms prior zero-shot, transfer-learning, and reasoning baselines, achieving 4-7% average F1 gains over the best baseline – establishing DiCoRe as a strong zero-shot ED framework.

[NLP-30] The NTNU System at the S&I Challenge 2025 SLA Open Track ISCA

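Quick read: This paper aims to address the limitations of single-modality models in Spoken Language Assessment (SLA): BERT-based methods rely on automatic speech recognition (ASR) transcripts and struggle to capture prosodic and phonetic cues, while methods based on wav2vec 2.0 (W2V) model acoustic features well but lack semantic interpretability. The key to the solution is to integrate W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy, combining acoustic and semantic information to improve assessment performance.

Link: https://arxiv.org/abs/2506.05121
Authors: Hong-Yun Lin, Tien-Hong Lo, Yu-Hsuan Fang, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
Affiliations: Department of Computer Science and Information Engineering; Institute of AI Interdisciplinary Applied Technology
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to the ISCA SLaTE-2025 Workshop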

Abstract:A recent line of research on spoken language assessment (SLA) employs neural models such as BERT and wav2vec 2.0 (W2V) to evaluate speaking proficiency across linguistic and acoustic modalities. Although both models effectively capture features relevant to oral competence, each exhibits modality-specific limitations. BERT-based methods rely on ASR transcripts, which often fail to capture prosodic and phonetic cues for SLA. In contrast, W2V-based methods excel at modeling acoustic features but lack semantic interpretability. To overcome these limitations, we propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025, securing second place in the competition. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
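For readers unfamiliar with the metric, here is a minimal sketch of score fusion followed by RMSE. The abstract does not describe how the two systems' scores are fused, so the weighted average below, its `weight` parameter, and the example scores are all illustrative assumptions rather than the authors' method.

```python
import math
from typing import List, Sequence


def fuse_scores(
    acoustic: Sequence[float], semantic: Sequence[float], weight: float = 0.5
) -> List[float]:
    """Weighted average of two systems' scores (an assumed fusion rule)."""
    if len(acoustic) != len(semantic):
        raise ValueError("score lists must have the same length")
    return [weight * a + (1.0 - weight) * s for a, s in zip(acoustic, semantic)]


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean square error between predicted and reference scores."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up proficiency scores for three responses (not from the paper).
acoustic_scores = [3.5, 4.0, 2.5]
semantic_scores = [4.5, 4.0, 3.5]
reference_scores = [4.0, 4.0, 3.0]

fused = fuse_scores(acoustic_scores, semantic_scores, weight=0.5)
print(rmse(acoustic_scores, reference_scores))
print(rmse(fused, reference_scores))
```

In this toy data the two systems err in opposite directions, so averaging cancels the errors; real systems would need the fusion weight tuned on held-out data.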

[NLP-31] CL-ISR: A Contrastive Learning and Implicit Stance Reasoning Framework for Misleading Text Detection on Social Media

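Quick read: This paper aims to detect misleading text on social media platforms, which can cause public misunderstanding, social panic, and even economic losses. The key to the solution is a new framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning with implicit stance reasoning. Contrastive learning strengthens the model’s ability to learn the semantic differences between truthful and misleading texts, while the implicit stance reasoning module uncovers latent stance tendencies in the text and their relationship to related topics, improving the detection of misleading content.

Link: https://arxiv.org/abs/2506.05107
Authors: Tianyi Huang, Zikun Cui, Cuiqianhe Du, Chia-En Chiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 2 figures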

Abstract:Misleading text detection on social media platforms is a critical research area, as these texts can lead to public misunderstanding, social panic and even economic losses. This paper proposes a novel framework, CL-ISR (Contrastive Learning and Implicit Stance Reasoning), which combines contrastive learning and implicit stance reasoning to improve the detection accuracy of misleading texts on social media. First, we use a contrastive learning algorithm to improve the model’s ability to learn semantic differences between truthful and misleading texts. By constructing positive and negative sample pairs, contrastive learning helps the model capture the distinguishing features between categories more effectively, particularly in linguistically complicated situations. Second, we introduce an implicit stance reasoning module to explore the potential stance tendencies in the text and their relationships with related topics. This method is effective for identifying content that misleads through stance shifting or emotional manipulation, because it can capture the implicit information behind the text. Finally, we integrate these two components into a new framework, CL-ISR, which leverages the discriminative power of contrastive learning and the interpretive depth of stance reasoning to significantly improve detection performance.

[NLP-32] Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics

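Quick read: This paper addresses the inability of objective street metrics derived from imagery or GIS, the standard in urban analytics, to capture the subjective perceptions that inclusive urban design requires. The key to the solution is the Multimodal Street Evaluation Framework (MSEF), which fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4) to produce interpretable dual-output assessments of streetscapes. Fine-tuned with LoRA and P-Tuning v2 on more than 15,000 annotated street-view images from Harbin, MSEF reaches an F1 score of 0.84 on objective features and 89.3% agreement with aggregated resident perceptions, and it reveals context-dependent contradictions and nonlinear patterns in spatial perception across socioeconomic geographies.

Link: https://arxiv.org/abs/2506.05087
Authors: HaoTian Lan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 24 pages, 10 figures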

Abstract:While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns – such as the divergent perceptual effects of architectural transparency across residential and commercial zones – revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.

[NLP-33] Parking Perception and Retail: Street-Level Determinants of Community Vitality in Harbin

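[Quick read]: This paper studies how the commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. Its core question is how street-level features (parked vehicle density, greenery, cleanliness, and street width) affect retail performance and user satisfaction. The key to its solution is an interpretable, image-based framework that uses street view imagery and a multimodal large language model (VisualGLM-6B), constructs a Community Commercial Vitality Index (CCVI), and extracts spatial attributes through GPT-4-based perception modeling in order to analyze their relationship with commercial performance.

Link: https://arxiv.org/abs/2506.05080
Authors: HaoTian Lan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: 22 pages, 5 figures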

Abstract:The commercial vitality of community-scale streets in Chinese cities is shaped by complex interactions between vehicular accessibility, environmental quality, and pedestrian perception. This study proposes an interpretable, image-based framework to examine how street-level features – including parked vehicle density, greenery, cleanliness, and street width – impact retail performance and user satisfaction in Harbin, China. Leveraging street view imagery and a multimodal large language model (VisualGLM-6B), we construct a Community Commercial Vitality Index (CCVI) from Meituan and Dianping data and analyze its relationship with spatial attributes extracted via GPT-4-based perception modeling. Our findings reveal that while moderate vehicle presence may enhance commercial access, excessive on-street parking – especially in narrow streets – erodes walkability and reduces both satisfaction and shop-level pricing. In contrast, streets with higher perceived greenery and cleanliness show significantly greater satisfaction scores but only weak associations with pricing. Street width moderates the effects of vehicle presence, underscoring the importance of spatial configuration. These results demonstrate the value of integrating AI-assisted perception with urban morphological analysis to capture non-linear and context-sensitive drivers of commercial success. This study advances both theoretical and methodological frontiers by highlighting the conditional role of vehicle activity in neighborhood commerce and demonstrating the feasibility of multimodal AI for perceptual urban diagnostics. The implications extend to urban design, parking management, and scalable planning tools for community revitalization.

[NLP-34] Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation ACL2025

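[Quick read]: This paper targets the detection of self-harm expressions on social media, which is challenging because such expressions are subtle and context-dependent. Current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. The key to the solution is improving LLMs' understanding of self-harm by distinguishing intent: the authors build the Centennial Emoji Sensitivity Matrix (CESM-100) and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, and on top of them propose a unified framework that enriches inputs with CESM-100, fine-tunes LLMs with multi-task learning, and generates explainable rationales for its predictions, improving LLM performance on both detection and explanation tasks.

Link: https://arxiv.org/abs/2506.05073
Authors: Soumitra Ghosh,Gopendra Vikram Singh,Shambhavi,Sabarna Choudhury,Asif Ekbal
Affiliations: Fondazione Bruno Kessler; Indian Institute of Technology Patna; Indian Institute of Technology Jodhpur
Categories: Computation and Language (cs.CL)
Notes: To be published in the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Main)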

Abstract:Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs’ comprehension of self-harm by distinguishing intent through nuanced language-emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework: a) enriches inputs using CESM-100; b) fine-tunes LLMs for multi-task learning: self-harm detection (primary) and CM/SI span detection (auxiliary); c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: this https URL .

[NLP-35] RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

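[Quick read]: This paper addresses the unexpectedly poor performance observed when Reinforcement Learning from Human Feedback (RLHF) is applied to large language models (LLMs) on colloquial subtitle translation. The authors find that the offline reward model (RM) gradually diverges from the online LLM because of distributional shift, which leads to undesirable training outcomes. The key to the solution is RIVAL, a framework that formulates training as a min-max game between the RM and the LLM and updates the two iteratively: the RM learns to distinguish strong from weak translations (qualitative preference reward), while the LLM improves its translations to close that gap. A quantitative preference reward (e.g., BLEU) is also incorporated to stabilize training and improve generalization.

Link: https://arxiv.org/abs/2506.05070
Authors: Tianjiao Li,Mengran Yu,Chenyu Shi,Yanjun Zhao,Xiaojing Liu,Qiang Zhang,Qi Zhang,Xuanjing Huang,Jiayin Wang
Affiliations: Bilibili Inc.; Xi’an Jiaotong University; School of Computer Science, Fudan University
Categories: Computation and Language (cs.CL)
Notes: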

Abstract:Large language models (LLMs) possess strong multilingual capabilities, and combining Reinforcement Learning from Human Feedback (RLHF) with translation tasks has shown great potential. However, we observe that this paradigm performs unexpectedly poorly when applied to colloquial subtitle translation tasks. In this work, we investigate this issue and find that the offline reward model (RM) gradually diverges from the online LLM due to distributional shift, ultimately leading to undesirable training outcomes. To address this, we propose RIVAL, an adversarial training framework that formulates the process as a min-max game between the RM and the LLM. RIVAL iteratively updates both models, with the RM trained to distinguish strong from weak translations (qualitative preference reward), and the LLM trained to enhance its translation for closing this gap. To stabilize training and improve generalizability, we also incorporate quantitative preference reward (e.g., BLEU) into the RM, enabling reference-free quality modeling aligned with human evaluation. Through extensive experiments, we demonstrate that the proposed adversarial training framework significantly improves upon translation baselines.

[NLP-36] Does It Make Sense to Speak of Introspection in Large Language Models?

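[Quick read]: This paper asks how to interpret the self-reports that large language models (LLMs) produce as their linguistic fluency and cognitive capabilities grow, and in particular whether such reports can be understood as something like human introspection. Its core question is whether introspection, a concept usually linked to consciousness, can be meaningfully applied to LLMs. The key to its approach is analyzing two concrete examples of LLM self-report, separating behavior that can count as a minimal form of introspection from behavior that is merely imitation or a computed result over the input, and noting that conscious experience may be absent in both.

Link: https://arxiv.org/abs/2506.05068
Authors: Iulia Comşa,Murray Shanahan
Affiliations: Google DeepMind; Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: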

Abstract:Large language models (LLMs) exhibit compelling linguistic behaviour, and sometimes offer self-reports, that is to say statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own “creative” writing, and we argue this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can be legitimately considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.

[NLP-37] Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

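[Quick read]: This paper addresses how to assess large language models (LLMs) as judges of debate speeches, in particular their grasp of argument strength and relevance, the coherence and organization of a speech, and the appropriateness of its style and tone. The key to the solution is a dataset of over 600 meticulously annotated debate speeches and an in-depth analysis comparing state-of-the-art LLMs with human judges on this task, which shows that models can approximate individual human judgments in some respects while differing substantially in their overall judgment behavior.

Link: https://arxiv.org/abs/2506.05062
Authors: Noy Sternlicht,Ariel Gera,Roy Bar-Haim,Tom Hope,Noam Slonim
Affiliations: The Hebrew University of Jerusalem; IBM Research; The Allen Institute for AI (AI2)
Categories: Computation and Language (cs.CL)
Notes: Code: this https URL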

Abstract:We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

[NLP-38] TALL – A Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages

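[Quick read]: This paper addresses the weak performance of large language models (LLMs) on low-resource languages, which stems mainly from limited training data. The key to its solution is TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models: low-resource inputs are transformed into high-resource representations so that the LLM's capabilities can be used, while dimension alignment layers and custom transformers preserve linguistic features.

Link: https://arxiv.org/abs/2506.05057
Authors: Moshe Ofer,Orel Zamler,Amos Azaria
Affiliations: Ariel University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: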

Abstract:Large Language Models (LLMs) excel in high-resource languages but struggle with low-resource languages due to limited training data. This paper presents TALL (Trainable Architecture for Enhancing LLM Performance in Low-Resource Languages), which integrates an LLM with two bilingual translation models. TALL transforms low-resource inputs into high-resource representations, leveraging the LLM’s capabilities while preserving linguistic features through dimension alignment layers and custom transformers. Our experiments on Hebrew demonstrate significant improvements over several baselines, including direct use, naive translation, and fine-tuning approaches. The architecture employs a parameter-efficient strategy, freezing pre-trained components while training only lightweight adapter modules, balancing computational efficiency with performance gains.

[NLP-39] Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers

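[Quick read]: This paper addresses the limited robustness of large language models (LLMs), which can fail unexpectedly on some simple reasoning tasks. Prior work evaluates robustness with hand-crafted templates or a limited set of perturbation rules, which may not fully reflect problems in real-world settings. The proposed solution is a framework called Automatic Robustness Checker (AR-Checker). Its key idea is to generate mathematical problem variants that preserve the original semantics but may cause the LLM to fail, using multi-round parallel streams of LLM-based rewriting and verification. Because variants are generated dynamically for each LLM, the risk of data contamination is reduced.

Link: https://arxiv.org/abs/2506.05038
Authors: Yutao Hou,Zeguan Xiao,Fei Yu,Yihan Jiang,Xuetao Wei,Hailiang Huang,Yun Chen,Guanhua Chen
Affiliations: Shanghai University of Finance and Economics; Ant Group; Southern University of Science and Technology
Categories: Computation and Language (cs.CL)
Notes: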

Abstract:Large language models (LLMs) have achieved distinguished performance on various reasoning-intensive tasks. However, LLMs might still face the challenges of robustness issues and fail unexpectedly in some simple reasoning tasks. Previous works evaluate the LLM robustness with hand-crafted templates or a limited set of perturbation rules, indicating potential data contamination in pre-training or fine-tuning datasets. In this work, inspired by stress testing in software engineering, we propose a novel framework, Automatic Robustness Checker (AR-Checker), to generate mathematical problem variants that maintain the semantic meanings of the original one but might fail the LLMs. The AR-Checker framework generates mathematical problem variants through multi-round parallel streams of LLM-based rewriting and verification. Our framework can generate benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the strong performance of AR-Checker on mathematical tasks. We also evaluate AR-Checker on benchmarks beyond mathematics, including MMLU, MMLU-Pro, and CommonsenseQA, where it also achieves strong performance, further proving the effectiveness of AR-Checker.

[NLP-40] Controlling Summarization Length Through EOS Token Weighting

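[Quick read]: This paper addresses control over the length of generated text, particularly in summarization, where existing methods often require complex model alterations that limit compatibility with pre-trained models. The key to the solution is to increase the importance of correctly predicting the end-of-sequence (EOS) token in the cross-entropy loss computation, which gives effective control over summary length. The method is agnostic to architecture and decoding algorithm and orthogonal to other inference-time length-control techniques.

Link: https://arxiv.org/abs/2506.05017
Authors: Zeno Belligoli,Emmanouil Stergiadis,Eran Fainman,Ilya Gusev
Affiliations: Booking.com
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: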

Abstract:Controlling the length of generated text can be crucial in various text-generation tasks, including summarization. Existing methods often require complex model alterations, limiting compatibility with pre-trained models. We address these limitations by developing a simple approach for controlling the length of automatic text summaries by increasing the importance of correctly predicting the EOS token in the cross-entropy loss computation. The proposed methodology is agnostic to architecture and decoding algorithms and orthogonal to other inference-time techniques to control the generation length. We tested it with encoder-decoder and modern GPT-style LLMs, and show that this method can control generation length, often without affecting the quality of the summary.
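
The idea of up-weighting the EOS token in the cross-entropy loss can be illustrated with a minimal sketch. The function name `eos_weighted_cross_entropy`, the `eos_weight` parameter, and the choice to normalize by the sum of weights are assumptions made for illustration; the abstract does not specify the exact weighting scheme the authors use.

```python
import math


def eos_weighted_cross_entropy(logits, targets, eos_id, eos_weight=1.0):
    """Weighted mean token-level cross-entropy (hypothetical helper).

    logits:  list of per-position score lists, one score per vocabulary id
    targets: list of gold token ids, same length as logits
    Positions whose gold token is EOS count `eos_weight` times as much.
    """
    total, weight_sum = 0.0, 0.0
    for scores, gold in zip(logits, targets):
        # Numerically stable log-partition, then negative log-likelihood.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[gold]
        w = eos_weight if gold == eos_id else 1.0
        total += w * nll
        weight_sum += w
    # Assumed design choice: normalize by total weight, not token count.
    return total / weight_sum


# Toy sequence: a confident first token, then an uncertain EOS (id 2).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
plain = eos_weighted_cross_entropy(toy_logits, toy_targets, eos_id=2)
boosted = eos_weighted_cross_entropy(toy_logits, toy_targets, eos_id=2, eos_weight=5.0)
```

In this toy case the EOS position has the larger per-token loss, so raising `eos_weight` raises the overall loss and would push training harder toward predicting EOS at the right place.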

[NLP-41] ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development ACL2025

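[Quick read]: This paper addresses challenges in using ComfyUI, an open-source platform for AI-driven art creation, including limited documentation for newcomers, model misconfigurations, and the complexity of workflow design. The key to the solution is ComfyUI-Copilot, an LLM-powered plugin that improves the user experience through intelligent node and model recommendations and automated one-click workflow construction. At its core is a hierarchical multi-agent framework with a central assistant agent for task delegation and specialized worker agents for different uses, supported by curated ComfyUI knowledge bases that streamline debugging and deployment.

Link: https://arxiv.org/abs/2506.05010
Authors: Zhenran Xu,Xue Yang,Yiyu Wang,Qingli Hu,Zijiao Wu,Longyue Wang,Weihua Luo,Kaifu Zhang,Baotian Hu,Min Zhang
Affiliations: Alibaba International Digital Commerce; Harbin Institute of Technology (Shenzhen)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: ACL 2025 Demo. Github: this https URL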

Abstract:We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at this https URL.

[NLP-42] SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View

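[Quick read]: This paper examines whether the comprehension process of large language models (LLMs) in machine comprehension is aligned with that of experts, a question for which a rational explanation has been lacking. The key to its solution is SCOP, a framework that systematically evaluates how LLMs perform during the comprehension process from a cognitive view. It comprises a definition of five requisite comprehension skills, a strict framework for constructing test data, and a detailed analysis of open-source and closed-source LLMs, with the aim of exposing where LLM comprehension falls short of experts and where it resembles them.

Link: https://arxiv.org/abs/2506.05000
Authors: Yongjie Xiao,Hongru Liang,Peixin Qin,Yao Zhang,Wenqiang Lei
Affiliations: Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education; School of Statistics and Data Science, AAIS, Nankai University
Categories: Computation and Language (cs.CL)
Notes: arXiv admin note: text overlap with arXiv:2004.14535 by other authors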

Abstract:Despite the great potential of large language models (LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-sourced and closed-sourced LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable – they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.

[NLP-43] Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings ACL2025

[Quick read]: This paper addresses the excessive memory usage in Visualized Document Retrieval (VDR) that arises because ColPali/ColQwen2 encodes each page into multiple patch-level embeddings. The key to the solution is optimizing how tokens are handled, in particular token merging, to reduce the number of embeddings per page and thereby cut memory usage substantially while keeping retrieval performance high. The resulting Light-ColPali/ColQwen2 maintains 98.2% of retrieval performance with only 11.8% of the original memory usage, and preserves 94.6% effectiveness at a 2.8% memory footprint.

Link: https://arxiv.org/abs/2506.04997
Authors: Yubo Ma,Jinsong Li,Yuhang Zang,Xiaobao Wu,Xiaoyi Dong,Pan Zhang,Yuhang Cao,Haodong Duan,Jiaqi Wang,Yixin Cao,Aixin Sun
Affiliations: Nanyang Technological University; Shanghai Artificial Intelligence Laboratory; Institute of Trustworthy Embodied AI, Fudan University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Notes: Accepted by ACL 2025 findings

Abstract:Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), it encodes each page into multiple patch-level embeddings and leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page at minimum performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2.8% memory footprint. We expect our empirical findings and resulting Light-ColPali/ColQwen2 offer valuable insights and establish a competitive baseline for future research towards efficient VDR.
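
To make the notion of token merging concrete, here is a minimal sketch that averages runs of consecutive patch embeddings. It is only one possible merging strategy, chosen for simplicity; the abstract does not say which combination the authors settle on, and the name `merge_patch_embeddings` and the `factor` parameter are hypothetical.

```python
def merge_patch_embeddings(embeddings, factor):
    """Average each run of `factor` consecutive patch embeddings into one vector.

    embeddings: list of equal-length float lists, one per image patch.
    A trailing run shorter than `factor` is averaged over its own length.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return merged


# Eight toy 2-dimensional patch embeddings merged four at a time.
patches = [[float(i), 1.0] for i in range(8)]
light = merge_patch_embeddings(patches, factor=4)
memory_ratio = len(light) / len(patches)  # fraction of vectors kept per page
```

With a merging factor of 4 the page keeps a quarter of its vectors, which is how a merging factor translates directly into the memory footprint figures quoted in the abstract.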

[NLP-44] Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering INTERSPEECH2025

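[Quick read]: This paper addresses the challenge of fine-tuning pretrained automatic speech recognition (ASR) models for specific domains when labeled data is scarce. The key to the solution is an incremental semi-supervised learning pipeline that first combines a small in-domain labeled set with an auxiliary dataset from a closely related domain, and then selects and iteratively refines pseudo-labels using filtering based on multi-model consensus or named entity recognition (NER), improving model performance.

Link: https://arxiv.org/abs/2506.04981
Authors: Andres Carofilis,Pradeep Rangappa,Srikanth Madikeri,Shashi Kumar,Sergio Burdisso,Jeena Prakash,Esau Villatoro-Tello,Petr Motlicek,Bidisha Sharma,Kadri Hacioglu,Shankar Venkatesan,Saurabh Vyas,Andreas Stolcke
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Accepted at Interspeech 2025, Netherlands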

Abstract:Fine-tuning pretrained ASR models for specific domains is challenging when labeled data is scarce. But unlabeled audio and labeled data from related domains are often available. We propose an incremental semi-supervised learning pipeline that first integrates a small in-domain labeled set and an auxiliary dataset from a closely related domain, achieving a relative improvement of 4% over no auxiliary data. Filtering based on multi-model consensus or named entity recognition (NER) is then applied to select and iteratively refine pseudo-labels, showing slower performance saturation compared to random selection. Evaluated on the multi-domain Wow call center and Fisher English corpora, it outperforms single-step fine-tuning. Consensus-based filtering outperforms other methods, providing up to 22.3% relative improvement on Wow and 24.8% on Fisher over single-step fine-tuning with random selection. NER is the second-best filter, providing competitive performance at a lower computational cost.
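
Consensus-based pseudo-label filtering can be sketched roughly as follows. The agreement rule used here (exact match after lowercasing and whitespace normalization, with at least `min_agree` models agreeing) is an assumption for illustration; the abstract does not describe how the authors measure consensus, and `consensus_filter` is a hypothetical name.

```python
from collections import Counter


def consensus_filter(hyps_by_model, min_agree=2):
    """Keep utterances whose transcript is shared by at least `min_agree` models.

    hyps_by_model: dict mapping model name -> dict of utterance id -> transcript.
    Returns a dict of utterance id -> agreed (normalized) transcript.
    """
    # Only consider utterances that every model has transcribed.
    utt_ids = set.intersection(*(set(h) for h in hyps_by_model.values()))
    selected = {}
    for utt in sorted(utt_ids):
        # Assumed normalization: lowercase and collapse whitespace.
        votes = Counter(" ".join(h[utt].lower().split()) for h in hyps_by_model.values())
        text, count = votes.most_common(1)[0]
        if count >= min_agree:
            selected[utt] = text
    return selected


hyps = {
    "model_a": {"u1": "Hello world", "u2": "call me"},
    "model_b": {"u1": "hello  world", "u2": "call him"},
    "model_c": {"u1": "hello word", "u2": "tell me"},
}
pseudo_labels = consensus_filter(hyps, min_agree=2)
```

Here only `u1` survives, because two of the three hypothetical models agree on it; `u2` is dropped since no two transcripts match.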

[NLP-45] From Struggle (06-2024) to Mastery (02-2025) LLMs Conquer Advanced Algorithm Exams and Pave the Way for Editorial Generation

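[Quick read]: This paper evaluates how large language models (LLMs) perform on university-level algorithms exams and their potential to support teaching in educational settings. The key to its approach is testing multiple models on a Romanian exam and its high-quality English translation, analyzing their problem-solving capabilities, consistency, and multilingual performance, and, building on these empirical results, exploring the use of generative AI to produce high-quality editorial content that gives instructors a tool for enhancing student feedback.

Link: https://arxiv.org/abs/2506.04965
Authors: Adrian Marius Dumitran,Theodor-Pierre Moroianu,Vasile Paul Alexe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 15 pages. Pre-print. Paper accepted to ITS 2025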

Abstract:This paper presents a comprehensive evaluation of the performance of state-of-the-art Large Language Models (LLMs) on challenging university-level algorithms exams. By testing multiple models on both a Romanian exam and its high-quality English translation, we analyze LLMs’ problem-solving capabilities, consistency, and multilingual performance. Our empirical study reveals that the most recent models not only achieve scores comparable to top-performing students but also demonstrate robust reasoning skills on complex, multi-step algorithmic challenges, even though difficulties remain with graph-based tasks. Building on these findings, we explore the potential of LLMs to support educational environments through the generation of high-quality editorial content, offering instructors a powerful tool to enhance student feedback. The insights and best practices discussed herein pave the way for further integration of generative AI in advanced algorithm education.

[NLP-46] ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT ACL2025

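[Quick read]: This paper addresses the difficulty neural machine translation (NMT) has with word ambiguity and context, especially in domain-specific applications such as e-commerce data, which often suffer from unclear sentences or poor data quality. The key to the solution is strengthening the model's context awareness with additional information: using a vision-language model (VLM) together with images and product metadata, and incorporating contextual information such as the product's category path or image descriptions into text-to-text models, thereby improving machine translation quality.

Link: https://arxiv.org/abs/2506.04929
Authors: Mikołaj Pokrywka,Wojciech Kusa,Mieszko Rutkowski,Mikołaj Koszowski
Affiliations: Allegro.com; Laniqo.com; NASK
Categories: Computation and Language (cs.CL)
Notes: Accepted at ACL 2025 (The 63rd Annual Meeting of the Association for Computational Linguistics)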

Abstract:Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often have problems with unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end we create ConECT – a new Czech-to-Polish e-commerce product translation dataset coupled with images and product metadata consisting of 11,400 sentence pairs. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product’s category path or image descriptions. The results of our study demonstrate that the incorporation of contextual information leads to an improvement in the quality of machine translation. We make the new dataset publicly available.

[NLP-47] Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback

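[Quick read]: This paper addresses whether large language models (LLMs) can provide effective instructional support across languages for mathematically grounded reasoning tasks, with particular attention to educational use in low-resource languages. The key to its approach is a large-scale simulation of multilingual tutor-student interactions in which a stronger model generates feedback in the form of hints and a weaker model simulates the student. The effect of language-specific feedback on learning gains is evaluated across multiple languages, models, and prompting strategies, showing that multilingual hints can improve learning outcomes.

Link: https://arxiv.org/abs/2506.04920
Authors: Junior Cedric Tonga,KV Aditya Srivatsa,Kaushal Kumar Maurya,Fajri Koto,Ekaterina Kochmar
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Preprint, in submission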

Abstract:Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor-student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student’s native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.

[NLP-48] A Practitioner's Guide to Building ASR Models for Low-Resource Languages: A Case Study on Scottish Gaelic INTERSPEECH2025

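[Quick read]: This paper addresses the challenge of building automatic speech recognition (ASR) systems for low-resource languages, in particular how to improve recognition when the target language was absent from the training data of the original multilingual end-to-end (E2E) model. The key to the proposed solution is combining hybrid hidden Markov models (HMMs) with self-supervised models, using continued self-supervised pre-training and semi-supervised training to make better use of all available speech and text data and thus achieve substantially better recognition with limited training data.

Link: https://arxiv.org/abs/2506.04915
Authors: Ondřej Klejch,William Lamb,Peter Bell
Affiliations: Centre for Speech Technology Research; Celtic and Scottish Studies
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Notes: Accepted to Interspeech 2025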

Abstract:An effective approach to the development of ASR systems for low-resource languages is to fine-tune an existing multilingual end-to-end model. When the original model has been trained on large quantities of data from many languages, fine-tuning can be effective with limited training data, even when the language in question was not present in the original training data. The fine-tuning approach has been encouraged by the availability of public-domain E2E models and is widely believed to lead to state-of-the-art results. This paper, however, challenges that belief. We show that an approach combining hybrid HMMs with self-supervised models can yield substantially better performance with limited training data. This combination allows better utilisation of all available speech and text data through continued self-supervised pre-training and semi-supervised training. We benchmark our approach on Scottish Gaelic, achieving WER reductions of 32% relative over our best fine-tuned Whisper model.

[NLP-49] Dissecting Long Reasoning Models: An Empirical Study

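[Quick read]: This paper addresses open questions and counterintuitive behaviors that remain when long-context reasoning models are trained with reinforcement learning (RL). Its contribution has three parts. First, it systematically analyzes the roles of positive and negative samples in RL and finds that negative samples markedly improve generalization and robustness. Second, to counter data inefficiency in group relative policy optimization, it explores relative length rewards and offline sample injection to make better use of the data. Third, it shows that multiple evaluation runs mitigate unstable performance across reasoning models and benchmarks.

Link: https://arxiv.org/abs/2506.04913
Authors: Yongyu Mu,Jiali Zeng,Bei Li,Xinyan Guan,Fandong Meng,Jie Zhou,Tong Xiao,Jingbo Zhu
Affiliations: Northeastern University; Tencent Inc
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: Work in progress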

Abstract:Despite recent progress in training long-context reasoning models via reinforcement learning (RL), several open questions and counterintuitive behaviors remain. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in RL, revealing that positive samples mainly facilitate data fitting, whereas negative samples significantly enhance generalization and robustness. Interestingly, training solely on negative samples can rival standard RL training performance. (2) We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. To address this, we explore two straightforward strategies, including relative length rewards and offline sample injection, to better leverage these data and enhance reasoning efficiency and capability. (3) We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes, and demonstrate that multiple evaluation runs mitigate this issue.
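
The zero-advantage issue mentioned in point (2) follows from how group-relative advantages are computed: each reward is normalized by the mean and standard deviation of its own group, so a group whose responses all receive the same reward contributes no learning signal. The sketch below illustrates this under stated assumptions (population standard deviation, a small `eps` for stability, binary toy rewards); the function names are hypothetical and this is not the authors' training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population std
    return [(r - mean) / (std + eps) for r in rewards]


def zero_advantage_fraction(groups):
    """Fraction of sampled responses that sit in a group with identical rewards."""
    total = sum(len(g) for g in groups)
    wasted = sum(len(g) for g in groups if len(set(g)) == 1)
    return wasted / total


# Three prompts, four sampled responses each, with binary correctness rewards.
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
all_correct = group_relative_advantages(groups[0])  # every advantage is zero
mixed = group_relative_advantages(groups[2])        # roughly +1 / -1
wasted = zero_advantage_fraction(groups)
```

In this toy batch two of the three groups are uniformly right or uniformly wrong, so two thirds of the samples carry zero advantage, which is the kind of inefficiency the abstract describes.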

[NLP-50] When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

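[Quick read]: This paper addresses strategic deception that can arise in large language models (LLMs) with chain-of-thought (CoT) reasoning, meaning goal-driven, intentional misinformation in which the reasoning contradicts the output. Dishonesty in LLMs has traditionally been explained as hallucination, but the explicit thought paths of CoT models make it possible to study strategic deception systematically. The key to the solution is using representation engineering to induce, detect, and control such deception: "deception vectors" are extracted with Linear Artificial Tomography (LAT) for high-accuracy detection, and activation steering elicits context-appropriate deception without explicit prompts, exposing a specific honesty-related issue in reasoning models and providing tools for trustworthy AI alignment.

Link: https://arxiv.org/abs/2506.04909
Authors: Kai Wang,Yihao Zhang,Meng Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: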

Abstract:The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues on LLMs, which could be possibly explained as some kind of hallucination, those models’ explicit thought paths enable us to study strategic deception–goal-driven, intentional misinformation where reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting “deception vectors” via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.

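[NLP-51] Verbose ListOps (VLO): Beyond Long Context – Unmasking LLMs Reasoning Blind Spots

[Quick read]: This paper addresses the limitations of large language models (LLMs) on nested narrative reasoning, in particular their difficulty managing internal computation and state when a long text contains many distractors. Existing benchmarks do not adequately assess this ability: they lack realistic distractors or fail to decouple context length from reasoning complexity. The proposed solution is Verbose ListOps, a new benchmark that programmatically converts ListOps computations into long, coherent stories, forcing models to reason over nested structure and track intermediate state while giving fine-grained control over narrative size and reasoning difficulty. The benchmark exposes how fragile LLMs are on nested sub-reasoning amid semantically relevant but distracting narrative, and it points to more targeted ways of improving reasoning than simply enlarging the context window.

Link: https://arxiv.org/abs/2506.04907
Authors: Alex Pan, Mary-Anne Williams
Affiliations: UNSW AI Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: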

Abstract: Large Language Models (LLMs), whilst great at extracting facts from text, struggle with nested narrative reasoning. Existing long context and multi-hop QA benchmarks inadequately test this, lacking realistic distractors or failing to decouple context length from reasoning complexity, masking a fundamental LLM limitation. We introduce Verbose ListOps, a novel benchmark that programmatically transposes ListOps computations into lengthy, coherent stories. This uniquely forces internal computation and state management of nested reasoning problems by withholding intermediate results, and offers fine-grained controls for both narrative size and reasoning difficulty. Whilst benchmarks like LongReason (2025) advance approaches for synthetically expanding the context size of multi-hop QA problems, Verbose ListOps pinpoints a specific LLM vulnerability: difficulty in state management for nested sub-reasoning amongst semantically-relevant, distracting narrative. Our experiments show that leading LLMs (e.g., OpenAI o4, Gemini 2.5 Pro) collapse in performance on Verbose ListOps at modest (~10k token) narrative lengths, despite effortlessly solving raw ListOps equations. Addressing this failure is paramount for real-world text interpretation which requires identifying key reasoning points, tracking conceptual intermediate results, and filtering irrelevant information. Verbose ListOps and its extensible generation framework thus enable targeted reasoning enhancements beyond mere context-window expansion; a critical step to automating the world’s knowledge work.
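For readers unfamiliar with the underlying task, a raw ListOps expression is a nested prefix expression over single digits. The evaluator below is a minimal sketch, not the benchmark's code: the operator set (MAX, MIN, MED, and SM for sum modulo 10) follows the ListOps task as commonly described, and the rule for even-length medians is an assumption.

```python
from statistics import median_low

# Operator table; MED on even-length lists uses the lower median (an assumption).
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": median_low,
    "SM": lambda xs: sum(xs) % 10,  # sum modulo 10
}

def evaluate(expr):
    """Evaluate a bracketed prefix expression such as '[MAX 2 9 [MIN 4 7] 0]'."""
    tokens = expr.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "[":
            return int(tok)          # a plain digit
        op = tokens[pos]             # operator follows the opening bracket
        pos += 1
        args = []
        while tokens[pos] != "]":
            args.append(parse())     # recurse into nested sub-expressions
        pos += 1                     # consume the closing bracket
        return OPS[op](args)

    return parse()
```

Each recursive call returns an intermediate result that the caller needs. Verbose ListOps withholds exactly these intermediate values, so a model has to track them inside a long narrative.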

[NLP-52] ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

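[Quick read]: This paper addresses two problems: existing benchmarks such as LiveCodeBench and CodeElo do not reflect the complexity and difficulty of real competition environments when evaluating the coding ability of large language models (LLMs), and current metrics such as Pass@K cannot capture the reflective abilities of reasoning models. The key to the solution is ICPC-Eval, a benchmark of top-level competitive programming tasks containing 118 carefully curated problems from 11 ICPC contests held around the world. It makes three core contributions: a challenging scenario whose difficulty and problem-type distribution match actual contests, a robust test case generation method with a local evaluation toolkit, and Refine@K, a metric that allows iterative repair of solutions based on execution feedback.

Link: https://arxiv.org/abs/2506.04894
Authors: Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: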

Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge in evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: this https URL
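The contrast between Pass@K and a feedback-driven metric can be sketched in code. `pass_at_k` below is the widely used unbiased estimator from n independent samples. `refine_at_k` is only one plausible reading of the abstract's description, in which a problem counts as solved if any of up to K sequential attempts passes and each attempt sees the previous execution feedback. The callables and the toy model are hypothetical and not the ICPC-Eval toolkit.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n independent samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def refine_at_k(generate, run_tests, k):
    """True if one of up to k sequential, feedback-conditioned attempts passes."""
    feedback = None
    for _ in range(k):
        attempt = generate(feedback)            # model sees the previous feedback
        passed, feedback = run_tests(attempt)   # execution result and message
        if passed:
            return True
    return False

def make_toy_model(solves_on):
    """Hypothetical model that produces a passing solution on its n-th attempt."""
    state = {"n": 0}

    def generate(feedback):
        state["n"] += 1
        return state["n"]

    def run_tests(attempt):
        ok = attempt >= solves_on
        return ok, None if ok else f"attempt {attempt} failed"

    return generate, run_tests
```

With the toy model set to succeed on its third attempt, `refine_at_k(*make_toy_model(3), k=2)` fails and `k=3` succeeds, which is the kind of multi-turn dependence the abstract reports for reasoning models.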

[NLP-53] Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

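[Quick read]: This paper addresses the fact that the effectiveness of Universal Dependencies (UD) for cross-lingual syntactic representation remains underexplored. Specifically, it evaluates whether integrating UD into pretrained language models improves their performance on a cross-lingual adversarial paraphrase identification task. The key to the approach is incorporating UD structure into pretrained language models and verifying the effect experimentally. The results show that adding UD significantly improves accuracy and F1 scores, indicating the validity and potential of UD in cross-lingual tasks.

Link: https://arxiv.org/abs/2506.04887
Authors: Wenxi Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: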

Abstract: Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assessing whether UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and F1 scores, with average gains of 3.85% and 6.08% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated with the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

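[NLP-54] Prompting LLMs: Length Control for Isometric Machine Translation

[Quick read]: This paper studies how to control translation length effectively while improving translation quality in isometric machine translation. The key to the approach is tuning the prompting strategy, the number of few-shot examples, and the choice of demonstrations to optimize the tradeoff between output length and translation quality in large language models (LLMs). The study finds that the phrasing of the instruction, when matched to the properties of the provided demonstrations, plays a crucial role in controlling output length, and that considering multiple outputs notably improves overall performance.

Link: https://arxiv.org/abs/2506.04855
Authors: Dávid Javorský, Ondřej Bojar, François Yvon
Affiliations: Charles University, Faculty of Mathematics and Physics, Prague, Czechia; Sorbonne Université, CNRS, ISIR, Paris, France
Categories: Computation and Language (cs.CL)
Comments: Accepted to IWSLT 2025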

Abstract: In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En→De, En→Fr, and En→Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall tradeoff between length and quality, yielding state-of-the-art performance for some language pairs.
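The length constraint and the "multiple outputs" idea can be made concrete with a small sketch. It assumes the usual isometric criterion, in which the translation's character count stays within ±10% of the source; the 10% tolerance, the function names, and the quality scores are assumptions here and not the paper's selection procedure.

```python
def length_ratio(source, translation):
    """Character-length ratio of translation to source."""
    return len(translation) / len(source)

def is_length_compliant(source, translation, tolerance=0.10):
    """True if the translation length is within the tolerance of the source length."""
    return abs(length_ratio(source, translation) - 1.0) <= tolerance

def pick_best(source, candidates, tolerance=0.10):
    """Choose among (text, quality) candidates: length-compliant first, then quality.

    The quality score is whatever metric the caller supplies (hypothetical here).
    """
    best = max(
        candidates,
        key=lambda c: (is_length_compliant(source, c[0], tolerance), c[1]),
    )
    return best[0]
```

Given several sampled translations, `pick_best` returns the highest-quality one that respects the length budget and falls back to the best overall candidate when none complies.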

[NLP-55] Multiple-Choice Question Generation Using Large Language Models: Methodology and Educator Insights

[Quick read]: This paper addresses the resource-intensive and time-consuming process of writing high-quality multiple-choice questions (MCQs) in education. The key to the approach is using large language models (LLMs) with the relevant knowledge injected into the prompt, which controls the accuracy of the generated content, reduces hallucination, and gives educators control over the source text of the test. The study compares Llama 2, Mistral, and GPT-3.5 on MCQ generation and finds GPT-3.5 the most effective across several metrics.

Link: https://arxiv.org/abs/2506.04851
Authors: Giorgio Biancini, Alessio Ferrato, Carla Limongelli
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Copyright ACM 2024. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization (UMAP Adjunct '24), this http URL

Abstract: Integrating Artificial Intelligence (AI) in educational settings has brought new learning approaches, transforming the practices of both students and educators. Among the various technologies driving this transformation, Large Language Models (LLMs) have emerged as powerful tools for creating educational materials and question answering, but there is still room for new applications. Educators commonly use Multiple-Choice Questions (MCQs) to assess student knowledge, but manually generating these questions is resource-intensive and requires significant time and cognitive effort. In our opinion, LLMs offer a promising solution to these challenges. This paper presents a novel comparative analysis of three widely known LLMs (Llama 2, Mistral, and GPT-3.5) to explore their potential for creating informative and challenging MCQs. In our approach, we do not rely on the knowledge of the LLM; instead, we inject the knowledge into the prompt to counter hallucinations, which also gives educators control over the test’s source text. Our experiment involving 21 educators shows that GPT-3.5 generates the most effective MCQs across several known metrics. Additionally, it shows that there is still some reluctance to adopt AI in the educational field. This study sheds light on the potential of LLMs to generate MCQs and improve the educational experience, providing valuable insights for the future.

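[NLP-56] MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines ACL2025

[Quick read]: This paper addresses the need to automatically understand and reproduce the dynamic, complex process of simultaneous interpreting. Existing parallel corpora and alignment algorithms struggle to model long-range interactions between speech segments and specific types of divergence between the original and interpreted speech (e.g., shortening, simplification, functional generalization). The key to the solution is MockConf, a dataset collected from mock conferences in the students' curriculum, containing 7 hours of recordings in 5 European languages, transcribed and aligned at the span and word levels. The authors also develop InterAlign, a tool for parallel word and span annotation on long inputs that is suited to aligning simultaneous interpreting, and propose evaluation metrics and a baseline for automatic alignment.

Link: https://arxiv.org/abs/2506.04848
Authors: Dávid Javorský, Ondřej Bojar, François Yvon
Affiliations: Charles University, Faculty of Mathematics and Physics, Prague, Czechia; Sorbonne Université, CNRS, ISIR, Paris, France
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main Conference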

Abstract:In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students’ curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

[NLP-57] Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models

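[Quick read]: This paper addresses hallucination in Large Reasoning Models (LRMs) caused by redundant or logically inconsistent reasoning. Existing hallucination detection methods focus mainly on answer-level uncertainty and struggle to detect hallucinations or logical contradictions that arise from the reasoning trace. The key to the solution is RACE (Reasoning and Answer Consistency Evaluation), a framework that extracts essential reasoning steps and computes four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning, enabling fine-grained hallucination detection.

Link: https://arxiv.org/abs/2506.04832
Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: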

Abstract:Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model’s reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model’s decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: this https URL.
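Of the four signals, entropy-based answer uncertainty is the easiest to illustrate. The sketch below computes Shannon entropy over the final answers of several sampled generations. Grouping answers by exact string match is a simplifying assumption (a real system would likely cluster semantically equivalent answers), and the function is illustrative rather than RACE's implementation.

```python
from collections import Counter
from math import log

def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)          # exact-match grouping (an assumption)
    total = len(answers)
    return -sum((c / total) * log(c / total) for c in counts.values())

# Consistent samples give zero entropy; disagreement raises it.
confident = answer_entropy(["42", "42", "42", "42"])
uncertain = answer_entropy(["42", "41", "40", "42"])
```

A low entropy alone does not rule out hallucination, which is why the framework pairs it with signals computed on the reasoning trace.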

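[NLP-58] From EHRs to Patient Pathways: Scalable Modeling of Longitudinal Health Trajectories with LLMs

[Quick read]: This paper addresses the challenges healthcare systems face in processing and interpreting large volumes of heterogeneous patient data for personalized care. Existing methods are usually limited to narrow use cases and do not fully account for complex longitudinal interactions. The key to the solution is EHR2Path, a new approach to patient pathway modeling that transforms diverse electronic health record (EHR) data into a structured representation and uses a holistic pathway prediction model optimized to predict future health trajectories. It also introduces a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance while remaining token-efficient.

Link: https://arxiv.org/abs/2506.04831
Authors: Chantal Pellegrini, Ege Özsoy, David Bani-Harouni, Matthias Keicher, Nassir Navab
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: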

Abstract:Healthcare systems face significant challenges in managing and interpreting vast, heterogeneous patient data for personalized care. Existing approaches often focus on narrow use cases with a limited feature space, overlooking the complex, longitudinal interactions needed for a holistic understanding of patient health. In this work, we propose a novel approach to patient pathway modeling by transforming diverse electronic health record (EHR) data into a structured representation and designing a holistic pathway prediction model, EHR2Path, optimized to predict future health trajectories. Further, we introduce a novel summary mechanism that embeds long-term temporal context into topic-specific summary tokens, improving performance over text-only models, while being much more token-efficient. EHR2Path demonstrates strong performance in both next time-step prediction and longitudinal simulation, outperforming competitive baselines. It enables detailed simulations of patient trajectories, inherently targeting diverse evaluation tasks, such as forecasting vital signs, lab test results, or length-of-stay, opening a path towards predictive and personalized healthcare.

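[NLP-59] A Reasoning-Based Approach to Cryptic Crossword Clue Solving ICML2025

[Quick read]: This paper addresses the automatic solving of cryptic crossword clues, which require a model not only to understand the definition of the answer but also to parse the wordplay that proves the answer correct. The key to the solution is an LLM-based reasoning system with three core steps: (i) hypothesising candidate answers; (ii) proposing wordplay explanations; and (iii) checking the codified reasoning steps with a verifier. The system achieves new state-of-the-art performance on the Cryptonite dataset.

Link: https://arxiv.org/abs/2506.04824
Authors: Martin Andrews, Sam Witteveen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 page paper plus Appendices. Accepted to ICML 2025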

Abstract:Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and ‘wordplay’ that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.

[NLP-60] Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms

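[Quick read]: This paper examines how effective generative AI is for educational assessment in real classrooms, particularly in under-resourced educational settings. It evaluates a state-of-the-art vision-language model (VLM) and several large language models (LLMs) on handwritten exam responses from grade 4 students, covering grading and the generation of personalized feedback. The study finds that the VLM struggles to recognize handwriting, which propagates errors into downstream LLM grading, yet LLM-generated feedback retains some utility; the main remaining challenge is improving personalization and contextual relevance.

Link: https://arxiv.org/abs/2506.04822
Authors: Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Raqib Chowdhury, Fajri Koto
Affiliations: Quantic School of Business and Technology; Mohamed bin Zayed University of Artificial Intelligence; Indonesia University of Education; Monash University
Categories: Computation and Language (cs.CL)
Comments: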

Abstract:Although vision-language and large language models (VLM and LLM) offer promising opportunities for AI-driven educational assessment, their effectiveness in real-world classroom settings, particularly in underrepresented educational contexts, remains underexplored. In this study, we evaluated the performance of a state-of-the-art VLM and several LLMs on 646 handwritten exam responses from grade 4 students in six Indonesian schools, covering two subjects: Mathematics and English. These sheets contain more than 14K student answers that span multiple choice, short answer, and essay questions. Assessment tasks include grading these responses and generating personalized feedback. Our findings show that the VLM often struggles to accurately recognize student handwriting, leading to error propagation in downstream LLM grading. Nevertheless, LLM-generated feedback retains some utility, even when derived from imperfect input, although limitations in personalization and contextual relevance persist.

[NLP-61] Design of intelligent proofreading system for English translation based on CNN and BERT

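[Quick read]: This paper addresses machine translation proofreading, where errors in automatic translation require substantial manual correction; the core challenge is improving translation quality while reducing human intervention. The key to the solution is a hybrid method combining a convolutional neural network (CNN) with Bidirectional Encoder Representations from Transformers (BERT): the CNN extracts local n-gram semantic features, BERT produces context-rich sequence representations, and attention is used to detect translation anomalies. Correction uses parallel English-German alignment and a GRU decoder model, yielding end-to-end proofreading.

Link: https://arxiv.org/abs/2506.04811
Authors: Feijun Liu, Huifeng Wang, Kun Wang, Yizhen Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: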

Abstract: Since automatic translations can contain errors that require substantial human post-editing, machine translation proofreading is essential for improving quality. This paper proposes a novel hybrid approach for robust proofreading that combines convolutional neural networks (CNN) with Bidirectional Encoder Representations from Transformers (BERT). In order to extract semantic information from phrases and expressions, CNN uses a variety of convolution kernel filters to capture local n-gram patterns. Meanwhile, BERT creates context-rich representations of whole sequences by utilizing stacked bidirectional transformer encoders. Using BERT’s attention mechanisms, the integrated error detection component relates tokens to spot translation irregularities including word order problems and omissions. The correction module then uses parallel English-German alignment and GRU decoder models in conjunction with translation memory to propose logical modifications that maintain original meaning. A unified end-to-end training process optimized for post-editing performance is applied to the whole pipeline. The multi-domain collection of WMT and the conversational dialogues of Open-Subtitles are two of the English-German parallel corpora used to train the model. Multiple loss functions supervise detection and correction capabilities. Experiments attain 90% accuracy, 89.37% F1, and 16.24% MSE, exceeding recent proofreading techniques by over 10% overall. Comparative benchmarking demonstrates state-of-the-art performance in identifying and coherently rectifying mistranslations and omissions.

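[NLP-62] Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

[Quick read]: This paper addresses the problem that existing benchmarks evaluate the logical reasoning of large language models (LLMs) by final-answer accuracy alone and so fail to capture the quality and structure of the reasoning process. The key to the solution is FineLogic, a fine-grained evaluation framework that assesses logical reasoning along three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. A study of supervision formats during fine-tuning further shows how different kinds of supervision shape reasoning ability.

Link: https://arxiv.org/abs/2506.04810
Authors: Yujun Zhou, Jiayi Ye, Zipeng Ling, Yufei Han, Yue Huang, Haomin Zhuang, Zhenwen Liang, Kehan Guo, Taicheng Guo, Xiangqi Wang, Xiangliang Zhang
Affiliations: University of Notre Dame; MBZUAI; University of Pennsylvania; INRIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: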

Abstract:Logical reasoning is a core capability for many applications of large language models (LLMs), yet existing benchmarks often rely solely on final-answer accuracy, failing to capture the quality and structure of the reasoning process. We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall benchmark accuracy, stepwise soundness, and representation-level alignment. In addition, to better understand how reasoning capabilities emerge, we conduct a comprehensive study on the effects of supervision format during fine-tuning. We construct four supervision styles (one natural language and three symbolic variants) and train LLMs under each. Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks, while symbolic reasoning styles promote more structurally sound and atomic inference chains. Further, our representation-level probing shows that fine-tuning primarily improves reasoning behaviors through step-by-step generation, rather than enhancing shortcut prediction or internalized correctness. Together, our framework and analysis provide a more rigorous and interpretable lens for evaluating and improving logical reasoning in LLMs.

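[NLP-63] Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques

[Quick read]: This paper addresses how different modalities are connected and aligned to the language backbone in Multimodal Large Language Models (MLLMs), a question the existing literature leaves largely open. It presents an LLM-centric analysis focused on methods for transforming and aligning diverse modal inputs into the language embedding space. The key contribution is a classification framework built on three dimensions: architectural strategies for modality integration, representation learning techniques (joint or coordinate representations), and training paradigms (training strategies and objective functions). By analyzing 125 MLLMs developed between 2021 and 2025, the survey identifies emerging trends and offers structured guidance for future multimodal integration strategies built on pre-trained foundation models.

Link: https://arxiv.org/abs/2506.04788
Authors: Jisu An, Junseok Lee, Jeoungeun Lee, Yongseok Son
Affiliations: Seoul National University; University of California San Diego; Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 3 figures, 3 tables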

Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has transformed the AI landscape. These models combine pre-trained LLMs with various modality encoders. This integration requires a systematic understanding of how different modalities connect to the language backbone. Our survey presents an LLM-centric analysis of current approaches. We examine methods for transforming and aligning diverse modal inputs into the language embedding space. This addresses a significant gap in existing literature. We propose a classification framework for MLLMs based on three key dimensions. First, we examine architectural strategies for modality integration. This includes both the specific integration mechanisms and the fusion level. Second, we categorize representation learning techniques as either joint or coordinate representations. Third, we analyze training paradigms, including training strategies and objective functions. By examining 125 MLLMs developed between 2021 and 2025, we identify emerging patterns in the field. Our taxonomy provides researchers with a structured overview of current integration techniques. These insights aim to guide the development of more robust multimodal integration strategies for future models built on pre-trained foundations.

[NLP-64] MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

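[Quick read]: This paper addresses the limited ability of current Speech Large Language Models (SpeechLLMs) to perform fine-grained perception and complex reasoning over natural speech. The key to the solution is MMSU, a comprehensive benchmark designed for spoken language understanding and reasoning. It contains 5,000 carefully curated audio-question-answer triplets across 47 tasks and systematically incorporates linguistic phenomena including phonetics, prosody, rhetoric, syntax, semantics, and paralinguistics, providing a new standard for evaluating and improving speech understanding models.

Link: https://arxiv.org/abs/2506.04779
Authors: Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL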

Abstract:Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at this https URL. Evaluation Code is available at this https URL.

[NLP-65] Fine-Grained Interpretation of Political Opinions in Large Language Models

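[Quick read]: This paper addresses two problems in analyzing the political opinions of LLMs: the mismatch between a model's internal intentions and its external responses, and the concept confounds caused by single-axis analysis. The key to the solution is a four-dimensional political learning framework combined with interpretable representation engineering techniques, which makes political concept learning more transparent and allows the model's internal political states to be detected and intervened on.

Link: https://arxiv.org/abs/2506.04774
Authors: Jingyu Hu, Mengyue Yang, Mengnan Du, Weiru Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: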

Abstract: Studies of LLMs’ political opinions mainly rely on evaluations of their open-ended responses. Recent work indicates that there is a misalignment between LLMs’ responses and their internal intentions. This motivates us to probe LLMs’ internal mechanisms and help uncover their internal political states. Additionally, we found that the analysis of LLMs’ political opinions often relies on single-axis concepts, which can lead to concept confounds. In this work, we extend the single axis to multiple dimensions and apply interpretable representation engineering techniques for more transparent LLM political concept learning. Specifically, we designed a four-dimensional political learning framework and constructed a corresponding dataset for fine-grained political concept vector learning. These vectors can be used to detect and intervene in LLM internals. Experiments are conducted on eight open-source LLMs with three representation engineering techniques. Results show these vectors can disentangle political concept confounds. Detection tasks validate the semantic meaning of the vectors and show good generalization and robustness in OOD settings. Intervention experiments show these vectors can intervene in LLMs to generate responses with different political leanings.

[NLP-66] Identifying Reliable Evaluation Metrics for Scientific Text Revision ACL2025

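[Quick read]: This paper addresses the evaluation of text revision in scientific writing, where traditional metrics such as ROUGE and BERTScore focus on textual similarity rather than meaningful improvement. The key to the approach is exploring evaluation methods that align better with human judgment: a manual annotation study, an analysis of reference-free metrics from related NLP domains, and an assessment of LLM-as-a-judge approaches. The study finds that a hybrid approach combining LLM-as-a-judge evaluation with task-specific metrics assesses revision quality most reliably.

Link: https://arxiv.org/abs/2506.04772
Authors: Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez
Affiliations: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; JFLI, CNRS, Nantes University, France
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 main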

Abstract:Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

[NLP-67] GOLFer: Smaller LM-Generated Documents Hallucination Filter Combiner for Query Expansion in Information Retrieval

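[Quick read]: This paper addresses the dependence of LLM-based query expansion for information retrieval on model scale, which makes it costly, computationally intensive, and less accessible. The key to the solution is GOLFer, a method that uses smaller open-source language models and filters and combines their generated documents. It has two modules: a hallucination filter that detects and removes non-factual and inconsistent sentences from generated documents, and a documents combiner that merges the filtered content with the query using a weight vector to balance their influence.

Link: https://arxiv.org/abs/2506.04762
Authors: Lingyuan Liu, Mengxiang Zhang
Affiliations: City University of Hong Kong; The University of Hong Kong
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: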

Abstract: Query expansion based on large language models (LLMs) augments information retrieval queries with hypothetical documents generated by the LLM. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.

[NLP-68] Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion

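[Quick read]: This paper addresses the limited effectiveness of query expansion for traditional sparse retrieval, in particular how to use large language models (LLMs) to improve retrieval performance. The key to the solution is Exp4Fuse, a rank fusion framework that performs zero-shot LLM-based query expansion, produces one ranked list from the original query and one from the expanded query using a sparse retriever, and fuses the two lists with a modified reciprocal rank fusion method.

Link: https://arxiv.org/abs/2506.04760
Authors: Lingyuan Liu, Mengxiang Zhang
Affiliations: City University of Hong Kong; The University of Hong Kong
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: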

Abstract:Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes-one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
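
The fusion step described above can be sketched with standard reciprocal rank fusion (RRF). The abstract does not say how Exp4Fuse modifies RRF, so the optional per-route `weights` argument, the constant `k=60`, and the document IDs below are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, weights=None):
    """Fuse ranked lists of document IDs with (optionally weighted) RRF.

    Each document scores sum(weight / (k + rank)) over the lists it appears in.
    `weights` is a hypothetical knob; plain RRF uses 1.0 for every list.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Route 1: sparse retriever on the original query (toy ranking).
original_route = ["d1", "d2", "d3"]
# Route 2: the same retriever on the LLM-expanded query (toy ranking).
expanded_route = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([original_route, expanded_route])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that rank well on both routes (`d1`, `d3`) rise above documents seen by only one route, which is the intuition behind combining the original and expanded queries.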

[NLP-69] Lifelong Evolution: Collaborative Learning between Large and Small Language Models for Continuous Emergent Fake News Detection

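[Quick read]: This paper addresses the societal impact of fake news spreading on social media, in particular the poor adaptability of conventional deep learning methods under data scarcity and distribution shift, and the limited accuracy of large language models (LLMs) in detecting fake news because of outdated knowledge and a lack of suitable demonstrations. The key to the solution is the Continuous Collaborative Emergent Fake News Detection (C²EFND) framework, which uses a multi-round collaborative learning strategy to combine the generalization power of LLMs with the classification expertise of small language models (SLMs). It also introduces a lifelong knowledge editing module based on a Mixture-of-Experts architecture to update LLMs incrementally, and a replay-based continual learning method so that SLMs retain prior knowledge.

Link: https://arxiv.org/abs/2506.04739
Authors: Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Yibo Zhang, Zhenyu Guan, Chaozhuo Li, Philip S. Yu
Affiliations: School of Cyber Science and Technology, Beihang University; School of Cyber Science and Technology, Beijing University of Posts and Telecommunications; Department of Computer Science, University of Illinois at Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: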

Abstract:The widespread dissemination of fake news on social media has significantly impacted society, resulting in serious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from extensive supervised training requirements and difficulties adapting to evolving news environments due to data scarcity and distribution shifts. Large language models (LLMs), despite robust zero-shot capabilities, fall short in accurately detecting fake news owing to outdated knowledge and the absence of suitable demonstrations. In this paper, we propose a novel Continuous Collaborative Emergent Fake News Detection (C²EFND) framework to address these challenges. The C²EFND framework strategically leverages both LLMs’ generalization power and SLMs’ classification expertise via a multi-round collaborative learning framework. We further introduce a lifelong knowledge editing module based on a Mixture-of-Experts architecture to incrementally update LLMs and a replay-based continual learning method to ensure SLMs retain prior knowledge without retraining entirely. Extensive experiments on Pheme and Twitter16 datasets demonstrate that C²EFND significantly outperforms existing methods, effectively improving detection accuracy and adaptability in continuous emergent fake news scenarios.

[NLP-70] Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

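[Quick read]: This paper addresses the significant fluctuations in the benchmark results of reasoning models based on the Deepseek-R1-Distill series. These fluctuations are caused by various factors and make the claimed performance improvements hard to reproduce reliably. The key to the solution is to advocate a more rigorous paradigm for model performance evaluation and to present empirical assessments of the Deepseek-R1-Distill series models.

Link: https://arxiv.org/abs/2506.04734
Authors: Lin Sun, Weihong Lin, Jinzhu Wu, Yongfu Zhu, Xiaoqi Jian, Guangxiang Zhao, Change Jia, Linglin Zhang, Sai-er Hu, Yuhan Wu, Xiangzheng Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: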

Abstract:Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.

[NLP-71] SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat

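[Quick read]: This paper addresses the limited generation diversity and the evaluation bias of a single large language model (LLM) by aligning multiple LLMs collectively. The key to the solution is the SPARTA ALIGNMENT algorithm, in which the models form a "sparta tribe": they compete against each other in fulfilling instructions while serving as judges for the others' duels. An adapted Elo-based reputation system dynamically adjusts each model's weight as a judge, the judged results become preference pairs, and all models learn from these outcomes, which yields an iterative, collective co-evolution of multiple models.

Link: https://arxiv.org/abs/2506.04721
Authors: Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov
Affiliations: Zhejiang University; The University of Texas at Austin; University of Washington
Categories: Computation and Language (cs.CL)
Comments: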

Abstract:We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. To complement a single model’s lack of diversity in generation and biases in evaluation, multiple LLMs form a “sparta tribe” to compete against each other in fulfilling instructions while serving as judges for the competition of others. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system, where winners/losers of combat gain/lose weight in evaluating others. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration. SPARTA ALIGNMENT enables the self-evolution of multiple LLMs in an iterative and collective competition process. Extensive experiments demonstrate that SPARTA ALIGNMENT outperforms initial models and 4 self-alignment baselines across 10 out of 12 tasks and datasets with 7.0% average improvement. Further analysis reveals that SPARTA ALIGNMENT generalizes more effectively to unseen tasks and leverages the expertise diversity of participating models to produce more logical, direct and informative outputs.
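
To make the reputation mechanism concrete, here is a minimal sketch of a duel scored by reputation-weighted judges and followed by a standard Elo update. The abstract only says the system is "adapted" from Elo, so the textbook update rule, the `k=32` step size, and the hypothetical `judged_score` aggregation below are assumptions rather than the paper's formulas.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b); score_a is in [0, 1]."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def judged_score(judge_votes, judge_ratings):
    """Hypothetical aggregation: share of judge reputation that prefers A.

    judge_votes holds "A" or "B" per judge; judge_ratings holds each judge's
    current rating, used here as its voting weight.
    """
    total = sum(judge_ratings)
    return sum(r for v, r in zip(judge_votes, judge_ratings) if v == "A") / total


# Two equally rated models duel; three other models act as judges.
score_a = judged_score(["A", "B", "A"], [1200.0, 1000.0, 800.0])
print(round(score_a, 3))               # 0.667
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Using ratings as judge weights means a model that keeps winning duels also gains influence when evaluating others, which matches the behaviour the abstract describes in words.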

[NLP-72] IIITH-BUT system for IWSLT 2025 low-resource Bhojpuri to Hindi speech translation

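[Quick read]: This paper addresses how to improve speech translation for the low-resource Bhojpuri-Hindi language pair. The key to the solution is to improve a fine-tuned SeamlessM4T model through hyperparameter optimisation and data augmentation: the authors systematically vary learning rate schedules, the number of update steps, warm-up steps, label smoothing, and batch size, and apply speed perturbation and SpecAugment to mitigate data scarcity. They also examine cross-lingual signal through joint training with Marathi and Bhojpuri speech data.

Link: https://arxiv.org/abs/2506.04714
Authors: Bhavana Akkiraju, Aishwarya Pothula, Santosh Kesiraju, Anil Kumar Vuppala
Affiliations: International Institute of Information Technology, Hyderabad, India; Speech@FIT, Brno University of Technology, Czechia
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Paper is accepted to IWSLT2025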

Abstract:This paper presents the submission of IIITH-BUT to the IWSLT 2025 shared task on speech translation for the low-resource Bhojpuri-Hindi language pair. We explored the impact of hyperparameter optimisation and data augmentation techniques on the performance of the SeamlessM4T model fine-tuned for this specific task. We systematically investigated a range of hyperparameters including learning rate schedules, number of update steps, warm-up steps, label smoothing, and batch sizes; and report their effect on translation quality. To address data scarcity, we applied speed perturbation and SpecAugment and studied their effect on translation quality. We also examined the use of cross-lingual signal through joint training with Marathi and Bhojpuri speech data. Our experiments reveal that careful selection of hyperparameters and the application of simple yet effective augmentation techniques significantly improve performance in low-resource settings. We also analysed the translation hypotheses to understand various kinds of errors that impacted the translation quality in terms of BLEU.

[NLP-73] LLM-based phoneme-to-grapheme for phoneme-based speech recognition INTERSPEECH2025

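[Quick read]: This paper addresses a limitation of phoneme-based automatic speech recognition (ASR): it is data-efficient in crosslingual settings, but its conventional Weighted Finite State Transducer (WFST) decoding has a complex pipeline and cannot leverage large language models (LLMs). The key to the solution is LLM-based phoneme-to-grapheme (LLM-P2G) decoding, which cascades speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages, together with two training strategies that mitigate information loss in the cascade: data augmentation with noisy phonemes (DANP), and randomized top-K marginalized (TKM) training and decoding.

Link: https://arxiv.org/abs/2506.04711
Authors: Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou
Affiliations: Guangxi Radio and Television Monitoring Center; School of Computer Science and Technology; Speech Processing and Machine Intelligence (SPMI) Lab
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2025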

Abstract:In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there appears to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-K marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.

[NLP-74] Accelerated Test-Time Scaling with Model-Free Speculative Sampling

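[Quick read]: This paper addresses the trade-off between performance and efficiency that arises when language models use test-time scaling techniques such as best-of-N sampling and tree search on reasoning tasks. The key to the solution is STAND (STochastic Adaptive N-gram Drafting), a model-free speculative decoding method that exploits the inherent redundancy in reasoning trajectories. By introducing stochastic drafting and a logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, it significantly improves token acceptance rates and substantially reduces inference latency without sacrificing accuracy.

Link: https://arxiv.org/abs/2506.04708
Authors: Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati
Affiliations: Amazon AGI; KAIST
Categories: Computation and Language (cs.CL)
Comments: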

Abstract:Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that leverages the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis reveals that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND outperforms state-of-the-art speculative decoding methods by 14-28% in throughput and shows strong performance even in single-trajectory scenarios, reducing inference latency by 48-58%. As a model-free approach, STAND can be applied to any existing language model without additional training, being a powerful plug-and-play solution for accelerating language model reasoning.
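
The idea of model-free drafting can be illustrated with a plain count-based N-gram table built from earlier reasoning trajectories. This is a deliberately simplified, deterministic sketch: STAND itself is described as stochastic and logit-based, and the function names, the context length `n=2`, and the toy token sequences are all assumptions made for illustration. In speculative decoding the drafted tokens would still be verified by the target model, which is not shown here.

```python
from collections import Counter, defaultdict


def build_ngram_table(sequences, n=2):
    """Map each n-token context to a Counter of the tokens that followed it."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - n):
            context = tuple(seq[i:i + n])
            table[context][seq[i + n]] += 1
    return table


def draft_tokens(table, prefix, n=2, max_draft=4):
    """Greedily propose up to max_draft tokens by repeated table lookup."""
    draft = []
    context = list(prefix[-n:])
    for _ in range(max_draft):
        followers = table.get(tuple(context))
        if not followers:
            break  # unseen context: stop drafting and fall back to the model
        token = followers.most_common(1)[0][0]
        draft.append(token)
        context = context[1:] + [token]
    return draft


# Toy "reasoning trajectories" that reuse the same phrasing.
trajectories = [
    ["so", "x", "=", "2", "+", "3", "=", "5"],
    ["so", "x", "=", "2", "*", "3", "=", "6"],
    ["so", "x", "=", "2", "+", "3", "=", "5"],
]
table = build_ngram_table(trajectories)
print(draft_tokens(table, ["so", "x"]))  # ['=', '2', '+', '3']
```

Because the table is a lookup structure rather than a neural draft model, proposing tokens costs almost nothing, which is where the latency saving would come from when many drafted tokens are accepted.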

[NLP-75] Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification

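[Quick read]: This paper addresses the detection of implicit hate speech (im-HS): existing automatic detection methods perform well on explicit hate speech (ex-HS) but struggle with the subtler im-HS. The key to the solution is a new taxonomy that defines six encoding strategies, called codetypes, and two ways of integrating them into im-HS detection: (1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and (2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that codetypes improve im-HS detection on both Chinese and English datasets.

Link: https://arxiv.org/abs/2506.04693
Authors: Lu Wei, Liangzhi Li, Tong Xiang, Xiao Liu, Noa Garcia
Affiliations: The University of Osaka; Meetyou AI Lab
Categories: Computation and Language (cs.CL)
Comments: Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), 112-126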

Abstract:The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named codetypes. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

[NLP-76] Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

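[Quick read]: This paper addresses the "data wall" of pre-training: the supply of natural data does not grow as fast as compute, and high-quality text is even scarcer. The key to the solution is REWIRE, a method that transforms and rewrites the low-quality documents discarded by existing data filtering pipelines so that they become useful for training. This increases the share of synthetic data in the final pre-training set and, on multiple benchmarks, outperforms training on filtered web data alone.

Link: https://arxiv.org/abs/2506.04689
Authors: Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
Affiliations: FAIR at Meta; University of Washington; Stanford University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: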

Abstract:Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art. To address the “data wall” of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts lead to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data.

[NLP-77] MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models ACL

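[Quick read]: This paper addresses how to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs) during reasoning, using a framework that assesses error detection and correction rather than only comparing final accuracy before and after refinement. The key to the solution is the MMRefine benchmark, which systematically evaluates refinement across six distinct scenarios and six error types, revealing the bottlenecks that limit refinement performance and the areas for improvement.

Link: https://arxiv.org/abs/2506.04688
Authors: Gio Paik, Geewook Kim, Jinbae Im
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL Findings 2025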

Abstract:This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs’ abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at this https URL.

[NLP-78] Urania: Differentially Private Insights into AI Use

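[Quick read]: This paper addresses how to generate useful insights about Large Language Model (LLM) chatbot interactions while protecting user privacy. The key to the solution is Urania, a framework that provides end-to-end privacy protection with rigorous differential privacy (DP) guarantees. Its core components are a private clustering mechanism and keyword extraction methods (frequency-based, TF-IDF-based, and LLM-guided), combined with DP tools such as clustering, partition selection, and histogram-based summarization, so that data utility is balanced against privacy preservation.

Link: https://arxiv.org/abs/2506.04681
Authors: Daogao Liu, Edith Cohen, Badih Ghazi, Peter Kairouz, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Adam Sealfon, Da Yu, Chiyuan Zhang
Affiliations: Google Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: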

Abstract:We introduce Urania, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, Urania provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework’s ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.
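
As a rough illustration of histogram-based DP summarization, the sketch below adds Laplace noise to keyword counts and releases only keywords whose noisy count clears a threshold, a common textbook form of partition selection. It is not Urania's pipeline: the function names, the one-keyword-per-user assumption (which gives the counts sensitivity 1), and the parameter values are all assumptions, and a production system would also need to handle floating-point subtleties and choose the threshold to meet a target δ.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_keyword_histogram(keywords_per_user, epsilon, threshold, rng=None):
    """Release noisy keyword counts above a threshold.

    Assumes each user contributes exactly one keyword, so adding or removing
    a user changes a single count by 1 and the Laplace scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    counts = {}
    for keyword in keywords_per_user:
        counts[keyword] = counts.get(keyword, 0) + 1
    released = {}
    for keyword, count in counts.items():
        noisy = count + laplace_noise(1.0 / epsilon, rng)
        if noisy >= threshold:  # rare keywords are suppressed, not released
            released[keyword] = noisy
    return released


# Toy data: one keyword per user.
data = ["travel"] * 50 + ["cooking"] * 40 + ["rare-topic"]
summary = private_keyword_histogram(data, epsilon=1.0, threshold=10.0,
                                    rng=random.Random(0))
print(sorted(summary))
```

The threshold is what keeps a keyword used by a single person from appearing in the output, at the cost of also dropping some genuinely rare but harmless topics.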

[NLP-79] Normative Conflicts and Shallow AI Alignment

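[Quick read]: This paper addresses the value alignment problem for Large Language Models (LLMs), in particular the inadequacy of current alignment strategies for preventing misuse. It argues that although fine-tuning on human preferences tries to instill norms such as helpfulness, honesty, and harmlessness, these models remain vulnerable to adversarial attacks because they cannot handle conflicts between norms. The key to the solution is to move beyond today's "shallow alignment", which only reinforces surface behavioral dispositions, and to give LLMs a genuine capacity for normative deliberation, so that they can detect and rationally resolve normative conflicts and become safer and more robust.

Link: https://arxiv.org/abs/2506.04679
Authors: Raphaël Millière
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Published in Philosophical Studies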

Abstract:The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans’ ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This “shallow alignment” problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

[NLP-80] Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

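[Quick read]: This paper addresses the lack of benchmarks that evaluate how well Large Language Models (LLMs) adapt and balance constraints in dynamic planning scenarios. Existing benchmarks focus mainly on static, single-turn planning tasks, whereas real-world planning requires flexible reasoning under changing requirements and competing constraints. The proposed solution is the Flex-TravelPlanner benchmark, whose key contribution is two new evaluation settings: sequential constraint introduction across multiple turns, and scenarios with explicitly prioritized competing constraints, which reflect the complexity of real planning tasks more faithfully.

Link: https://arxiv.org/abs/2506.04649
Authors: Juhyun Oh, Eunsu Kim, Alice Oh
Affiliations: KAIST
Categories: Computation and Language (cs.CL)
Comments: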

Abstract:Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs’ planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models’ ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models’ performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at this https URL.

[NLP-81] TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering ACL-2025

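[Quick read]: This paper addresses the way the memory demands of the key-value (KV) cache in transformer models grow with sequence length, which limits the scalable deployment of large language models. The key to the solution is TaDA, a training-free method whose quantization precision adapts to the error sensitivity of each layer and which uses mean centering to eliminate separate handling of sparse, noncontiguous outliers, substantially reducing KV cache memory while keeping accuracy comparable.

Link: https://arxiv.org/abs/2506.04642
Authors: Vinay Joshi, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Affiliations: AMD
Categories: Computation and Language (cs.CL)
Comments: ACL-2025 industry-track accepted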

Abstract:The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and a mean centering to eliminate separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements – a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chain of thoughts.
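
The role of mean centering can be seen in a small uniform-quantization sketch: subtracting the mean first lets the quantization range cover only the spread of the values instead of their offset. This is a generic illustration, not TaDA's recipe; the abstract does not specify which axis the mean is taken over or how per-layer precision is chosen, so the single-vector setting, the symmetric signed range, and the `bits=4` default are assumptions.

```python
def quantize_mean_centered(values, bits=4):
    """Uniformly quantize `values` after subtracting their mean.

    Returns integer codes plus the scale and mean needed to dequantize.
    """
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    max_abs = max(abs(c) for c in centered) or 1.0  # avoid a zero scale
    levels = 2 ** (bits - 1) - 1                    # e.g. 7 for 4 bits
    scale = max_abs / levels
    codes = [round(c / scale) for c in centered]
    return codes, scale, mean


def dequantize(codes, scale, mean):
    """Reconstruct approximate values from codes, scale, and mean."""
    return [code * scale + mean for code in codes]


# Toy activations sharing a large common offset of about 10.
activations = [10.0, 10.5, 11.0, 9.5]
codes, scale, mean = quantize_mean_centered(activations, bits=4)
restored = dequantize(codes, scale, mean)
print(codes)  # [-2, 2, 7, -7]
print(max(abs(a - r) for a, r in zip(activations, restored)) <= scale / 2 + 1e-9)  # True
```

Without centering, the same 4-bit range would have to span values up to 11.0 rather than a spread of 0.75, so each quantization step would be far coarser for the same bit budget.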

[NLP-82] ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition INTERSPEECH2025

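[Quick read]: This paper addresses the scarcity of large-scale datasets for Audio-Visual Speech Recognition (AVSR) models in most languages other than English. The key to the solution is a practical approach for generating AVSR datasets from raw video that refines existing techniques for better efficiency and accessibility, making it possible to build effective baseline models for under-resourced languages.

Link: https://arxiv.org/abs/2506.04635
Authors: Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Interspeech 2025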

Abstract:Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

[NLP-83] Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning KDD2025

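[Quick read]: This paper addresses two core limitations in the tool-use abilities of Large Language Models (LLMs): unreliable tool planning and invocation caused by low-quality instruction datasets (for example, widespread hallucinated API calls), and weak tool reflection caused by static imitation learning (over 90% of errors cannot be corrected). The key to the solution is Tool-MVR, which achieves comprehensive System 2 reasoning through two innovations. First, Multi-Agent Meta-Verification (MAMV) is used to build ToolBench-V, a high-quality instruction dataset that makes tool planning and invocation more reliable. Second, Exploration-based Reflection Learning (EXPLORE) strengthens tool reflection through a dynamic "Error - Reflection - Correction" paradigm and produces the reflection dataset ToolBench-R.

Link: https://arxiv.org/abs/2506.04625
Authors: Zhiyuan Ma, Jiayu Liu, Xianzhen Luo, Zhenya Huang, Qingfu Zhu, Wanxiang Che
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at the Research Track of KDD 2025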

Abstract:Empowering large language models (LLMs) with effective tool utilization capabilities is crucial for enabling AI agents to solve complex problems. However, current models face two major limitations: (1) unreliable tool planning and invocation due to low-quality instruction datasets (e.g., widespread hallucinated API calls), and (2) weak tool reflection abilities (over 90% of errors cannot be corrected) resulting from static imitation learning. To address these critical limitations, we propose Tool-MVR, a novel Tool-Augmented LLM that achieves comprehensive System 2 reasoning through two key innovations. Specifically, we first introduce Multi-Agent Meta-Verification (MAMV), a systematic pipeline that rigorously validates APIs, queries, and reasoning trajectories to construct ToolBench-V, a new high-quality instruction dataset that addresses the limitation of unreliable tool planning and invocation. Second, we propose Exploration-based Reflection Learning (EXPLORE), which enhances tool reflection capabilities by leveraging tool feedback through a dynamic “Error - Reflection - Correction” learning paradigm, resulting in our reflection dataset ToolBench-R and addressing the critical weakness in tool reflection. Finally, we obtain Tool-MVR by finetuning open-source LLMs (e.g., Qwen-7B) on both ToolBench-V and ToolBench-R. Our experiments demonstrate that Tool-MVR achieves state-of-the-art performance on StableToolBench, surpassing both ToolLLM (by 23.9%) and GPT-4 (by 15.3%) while reducing API calls by 31.4%, with strong generalization capabilities across unseen tools and scenarios. Additionally, on our proposed RefineToolBench, the first benchmark specifically designed to evaluate tool reflection capabilities, Tool-MVR achieves a 58.9% error correction rate, significantly outperforming ToolLLM’s 9.1%.

[NLP-84] Static Word Embeddings for Sentence Semantic Representation

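[Quick read]: This paper addresses the weak performance of static word embeddings on sentence-level semantic tasks. The key to the solution is to extract word embeddings from a pre-trained Sentence Transformer, improve them with sentence-level principal component analysis, and then refine them further with either knowledge distillation or contrastive learning. At inference time, a sentence is represented simply by averaging its word embeddings, which gives an efficient semantic representation at little computational cost.

Link: https://arxiv.org/abs/2506.04624
Authors: Takashi Wada, Yuki Hirakawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito
Affiliations: ZOZO Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages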

Abstract:We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even rivals a basic Sentence Transformer model (SimCSE) on some data sets. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are irrelevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.
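
The inference step, representing a sentence as the average of its word embeddings, is simple enough to sketch directly. The two-dimensional toy vectors and the choice to skip out-of-vocabulary words are assumptions made for illustration; the paper's actual embeddings come from a Sentence Transformer and are not reproduced here.

```python
import math


def sentence_embedding(tokens, word_vectors):
    """Average the static vectors of the known tokens in a sentence."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical 2-d static embeddings.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "sat": [0.0, 1.0],
    "ran": [0.1, 0.9],
}

s1 = sentence_embedding(["the", "cat", "sat"], toy_vectors)  # "the" is skipped
s2 = sentence_embedding(["a", "dog", "ran"], toy_vectors)
print(s1)                         # [0.5, 0.5]
print(round(cosine(s1, s2), 3))   # 1.0
```

Since no neural network runs at inference time, the cost per sentence is one table lookup per word plus an average, which is the low computational cost the abstract refers to.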

[NLP-85] Subjective Perspectives within Learned Representations Predict High-Impact Innovation

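[Quick read]: This paper addresses how individuals and groups shape creativity in the innovation process through their subjective perspectives and their background diversity, and aims to separate the different effects of the two on innovation outcomes. The key to the solution is to build a geometric space of concepts from dynamic language representations, measure innovators' positions in that space to quantify their subjective perspectives and innovation opportunities, and then analyze how differences in perspective and differences in background each predict creative achievement.

Link: https://arxiv.org/abs/2506.04616
Authors: Likun Cao, Rui Pan, James Evans
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 107 pages, 20 figures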

Abstract:Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators’ personal perspectives and interpersonal innovation opportunities as a function of their prior trajectories of experience. We theorize then quantify subjective perspectives and innovation opportunities based on innovator positions within the geometric space of concepts inscribed by dynamic language representations. Using data on millions of scientists, inventors, writers, entrepreneurs, and Wikipedia contributors across the creative domains of science, technology, film, entrepreneurship, and Wikipedia, here we show that measured subjective perspectives anticipate what ideas individuals and groups creatively attend to and successfully combine in future. When perspective and background diversity are decomposed as the angular difference between collaborators’ perspectives on their creation and between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite, across all cases and time periods examined. We analyze a natural experiment and simulate creative collaborations between AI (large language model) agents designed with various perspective and background diversity, which are consistent with our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experience obtained through trajectories of prior work that converge to provoke one another and innovate. We explore the importance of these findings for team assembly and research policy.

[NLP-86] Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning EMNLP2025

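Quick read: This paper addresses the limited effectiveness of Test-Time Scaling (TTS) for improving the reasoning performance of Large Language Models (LLMs), particularly when reasoning-optimized models produce outputs with too little diversity. The key to the solution is ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight prefix-tuning method that uses a diversity-focused data strategy to increase generation diversity and thereby improve TTS. On mathematical reasoning tasks, ADAPT reaches 80% accuracy with one eighth of the compute used by strong baselines.

Link: https://arxiv.org/abs/2506.04611
Authors: Ho-Lam Chung, Teng-Yun Hsiao, Hsiao-Ying Huang, Chunerh Cho, Jian-Ren Lin, Zhang Ziwei, Yun-Nung Chen
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 submission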

Abstract:Test-Time Scaling (TTS) improves the reasoning performance of Large Language Models (LLMs) by allocating additional compute during inference. We conduct a structured survey of TTS methods and categorize them into sampling-based, search-based, and trajectory optimization strategies. We observe that reasoning-optimized models often produce less diverse outputs, which limits TTS effectiveness. To address this, we propose ADAPT (A Diversity Aware Prefix fine-Tuning), a lightweight method that applies prefix tuning with a diversity-focused data strategy. Experiments on mathematical reasoning tasks show that ADAPT reaches 80% accuracy using eight times less compute than strong baselines. Our findings highlight the essential role of generative diversity in maximizing TTS effectiveness.
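To make the link between output diversity and sampling-based TTS concrete, here is a minimal sketch. It assumes majority voting over sampled answers as the sampling-based strategy and uses the share of unique answers as a crude diversity proxy. Both choices are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def distinct_ratio(answers):
    """Fraction of samples that are unique: a crude diversity proxy."""
    return len(set(answers)) / len(answers)


# Hypothetical final answers extracted from four sampled reasoning traces.
samples = ["12", "15", "12", "12"]
voted = majority_vote(samples)
diversity = distinct_ratio(samples)
```

If every sample is identical, the distinct ratio falls to 1/n and drawing more samples cannot change the voted answer, which is one way to see why low diversity limits this kind of scaling.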

[NLP-87] A MISMATCHED Benchmark for Scientific Natural Language Inference ACL2025

Quick read: This paper addresses the narrow domain coverage of datasets for scientific Natural Language Inference (NLI): existing datasets come mainly from computer science (CS), while non-CS domains such as psychology, engineering, and public health are ignored entirely. The paper introduces MISMATCHED, a new evaluation benchmark that covers these three non-CS domains and contains 2,700 human-annotated sentence pairs. The key to the solution is building a high-quality annotated cross-domain dataset, establishing strong baselines with pre-trained Small Language Models (SLMs) and Large Language Models (LLMs), and showing that adding sentence pairs with an implicit scientific NLI relation to training improves model performance.

Link: https://arxiv.org/abs/2506.04603
Authors: Firoz Shaik, Mobashir Sadat, Nikita Gautam, Doina Caragea, Cornelia Caragea
Affiliations: University of Illinois Chicago; Kansas State University
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2025

Abstract: Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains (PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH) and contains 2,700 human-annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
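The headline number is a Macro F1 score. For readers unfamiliar with the metric, the sketch below computes it from scratch: per-class F1 averaged with equal weight per class. The labels "a" and "b" are placeholders and are not the benchmark's actual label set.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels; class "a" has F1 = 2/3 and class "b" has F1 = 4/5.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because each class counts equally, a model that does well only on the most frequent relation is penalized more than it would be under plain accuracy.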

[NLP-88] Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification ACL2025

Quick read: This paper addresses hallucinations that Large Language Models (LLMs) produce when reasoning with Chain-of-Thought (CoT) prompting; existing methods such as Process Reward Models (PRMs) or self-consistency are limited because they provide no checkable evidence. The key to the solution is Safe, a retrospective, step-aware formal verification framework that states the mathematical claim at each reasoning step in the formal language Lean 4 and supplies a formal proof, thereby identifying hallucinations. The approach improves performance and also yields interpretable, verifiable evidence.

Link: https://arxiv.org/abs/2506.04592
Authors: Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang
Affiliations: Peking University; Huawei Noah’s Ark Lab; The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in ACL 2025

Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

[NLP-89] LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models

Quick read: This paper addresses noise in pseudo labels generated from unsupervised data, with the aim of improving Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). The key to the solution is the LESS (Large Language Model Enhanced Semi-supervised Learning) framework, which uses a Large Language Model (LLM) to correct pseudo labels derived from in-the-wild data and adds a data filtering strategy to make LLM knowledge transfer more efficient.

Link: https://arxiv.org/abs/2506.04586
Authors: Wen Ding, Fan Qian
Affiliations: NVIDIA Corporation
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We introduce LESS (Large Language Model Enhanced Semi-supervised Learning), a versatile framework that leverages Large Language Models (LLMs) to correct pseudo labels generated from in-the-wild data. Within the LESS framework, pseudo-labeled text from Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST) of the unsupervised data is refined by an LLM, and augmented by a data filtering strategy to optimize LLM knowledge transfer efficiency. Experiments on both Mandarin ASR and Spanish-to-English AST tasks show that LESS achieves a notable absolute WER reduction of 3.77% on the Wenet Speech test set, as well as BLEU scores of 34.0 and 64.7 on Callhome and Fisher test sets respectively. These results validate the adaptability of LESS across different languages, tasks, and domains. Ablation studies conducted with various LLMs and prompt configurations provide novel insights into leveraging LLM-derived knowledge for speech processing applications.
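The ASR result is reported as an absolute reduction in word error rate (WER). As a reminder of what the metric measures, the sketch below computes WER as token-level edit distance divided by reference length. It assumes whitespace-tokenized input and a non-empty reference, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace tokens divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_tok != hyp_tok)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution and one deletion against a five-token reference.
wer = word_error_rate("the cat sat on mat", "the cat sit on")
```

An absolute reduction of 3.77% means the difference between two such ratios, expressed in percentage points, rather than a relative improvement.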

[NLP-90] MuSciClaims: Multimodal Scientific Claim Verification

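Quick read: This paper addresses the verification of scientific claims against multimodal data in the scientific literature, that is, how to accurately identify, extract, and reason over the data in information-rich figures to judge whether a claim holds. The key to the solution is a new benchmark, MuSciClaims, with accompanying diagnostic tasks: supported claims are extracted automatically from scientific articles and then manually perturbed to produce contradicted claims, which tests models' claim verification abilities. The diagnostic tasks analyze where models fall short in localizing evidence, integrating information across modalities, and understanding the basic components of figures.

Link: https://arxiv.org/abs/2506.04585
Authors: Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: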

Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.77 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

[NLP-91] SUCEA: Reasoning-Intensive Retrieval for Adversarial Fact-checking through Claim Decomposition and Editing

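Quick read: This paper addresses the challenge that adversarial claims, which humans deliberately design to confuse fact-checking systems, pose for automatic fact-checking. The key to the solution is a training-free method that rephrases the original claim so that supporting evidence is easier to find. Its core is the SUCEA framework, which decomposes the task into three steps: claim segmentation and decontextualization, iterative evidence retrieval and claim editing, and evidence aggregation and label prediction. This significantly improves both retrieval and entailment label accuracy.

Link: https://arxiv.org/abs/2506.04583
Authors: Hongjun Liu, Yilun Zhao, Arman Cohan, Chen Zhao
Affiliations: New York University; NYU Shanghai; Yale University; Allen Institute for AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures, 7 tables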

Abstract:Automatic fact-checking has recently received more attention as a means of combating misinformation. Despite significant advancements, fact-checking systems based on retrieval-augmented language models still struggle to tackle adversarial claims, which are intentionally designed by humans to challenge fact-checking systems. To address these challenges, we propose a training-free method designed to rephrase the original claim, making it easier to locate supporting evidence. Our modular framework, SUCEA, decomposes the task into three steps: 1) Claim Segmentation and Decontextualization that segments adversarial claims into independent sub-claims; 2) Iterative Evidence Retrieval and Claim Editing that iteratively retrieves evidence and edits the subclaim based on the retrieved evidence; 3) Evidence Aggregation and Label Prediction that aggregates all retrieved evidence and predicts the entailment label. Experiments on two challenging fact-checking datasets demonstrate that our framework significantly improves on both retrieval and entailment label accuracy, outperforming four strong claim-decomposition-based baselines.
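The three steps listed in the abstract can be read as a simple control flow. The skeleton below is a sketch of that flow only: `segment`, `retrieve`, `edit`, and `predict` are hypothetical callables standing in for the model and retriever components, and the stopping rule (stop when editing no longer changes the sub-claim, or after `max_rounds`) is an assumption.

```python
def fact_check(claim, segment, retrieve, edit, predict, max_rounds=3):
    """Run segmentation, iterative retrieval and editing, then label prediction."""
    evidence = []
    # Step 1: split the claim into independent sub-claims.
    for sub_claim in segment(claim):
        current = sub_claim
        # Step 2: alternate retrieval and editing for each sub-claim.
        for _ in range(max_rounds):
            found = retrieve(current)
            evidence.extend(found)
            revised = edit(current, found)
            if revised == current:  # assumed stopping rule
                break
            current = revised
    # Step 3: aggregate all evidence and predict the entailment label.
    return predict(claim, evidence)


# Toy stand-ins so the skeleton can be exercised without any model.
label = fact_check(
    "A and B",
    segment=lambda c: c.split(" and "),
    retrieve=lambda q: ["doc about " + q],
    edit=lambda c, ev: c,
    predict=lambda c, ev: "supported" if len(ev) == 2 else "not enough evidence",
)
```

Passing the components in as arguments keeps the sketch training-free in spirit: each stage can be a prompted model or a retriever without changing the loop.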

[NLP-92] Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching ACL2025

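Quick read: This paper addresses a performance bottleneck in In-Context Learning (ICL): demonstration selection for many-shot ICL has so far been limited to random selection. The key to the solution is a gradient-matching selection method that aligns the fine-tuning gradients of the selected examples with those of the target task's entire training set, so that a limited number of examples approximates the learning effect of the full set. Gradient matching performed on relatively small models significantly improves the performance of larger language models across a range of tasks and shot counts (4-shot to 128-shot).

Link: https://arxiv.org/abs/2506.04579
Authors: Jianfei Zhang, Bei Li, Jun Bai, Rumei Li, Yanmeng Wang, Chenghua Lin, Wenge Rong
Affiliations: Beihang University; Meituan, Inc.; Beijing Institute for General Artificial Intelligence; Ping An Technology; University of Manchester
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Findings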

Abstract:In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
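The idea of matching the gradient of a small selected set to the gradient of the whole training set can be illustrated with a greedy procedure. The sketch below is an assumption-heavy toy: it takes precomputed per-example gradient vectors as plain lists, uses squared L2 distance between mean gradients, and selects greedily. The paper's actual objective and optimization procedure may differ.

```python
def select_by_gradient_matching(gradients, k):
    """Greedily pick k examples whose mean gradient is closest to the full-set mean."""
    dim = len(gradients[0])
    n = len(gradients)
    target = [sum(g[d] for g in gradients) / n for d in range(dim)]
    chosen = []
    running_sum = [0.0] * dim
    for _ in range(k):
        best_idx, best_dist = None, float("inf")
        size = len(chosen) + 1
        for idx, g in enumerate(gradients):
            if idx in chosen:
                continue
            # Squared L2 distance between the candidate subset mean and the target.
            dist = sum(((running_sum[d] + g[d]) / size - target[d]) ** 2 for d in range(dim))
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        chosen.append(best_idx)
        running_sum = [running_sum[d] + gradients[best_idx][d] for d in range(dim)]
    return chosen


# Hypothetical 2-dimensional gradients for four training examples.
toy_gradients = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
picked = select_by_gradient_matching(toy_gradients, k=2)
```

In the toy data the full-set mean is (0.25, 0.25), and the first two examples together average to (0.5, 0.5), which is closer to it than any other pair that starts from the first pick.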

[NLP-93] Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?

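Quick read: This paper addresses the reliability of large language models (LLMs) as logic translators in logical reasoning tasks when the text is lexically diversified. Existing logical reasoning benchmarks lack lexical diversity and therefore cannot assess translation ability in realistic settings. The key to the solution is the SCALe benchmark, which applies logic-invariant lexical diversification to produce datasets that are lexically diverse but logically equivalent, and uses them to evaluate how consistently models map different expressions to uniform logical symbols. The paper further proposes MenTaL, which guides the model to build a table unifying expressions before translating and significantly improves performance on lexically diversified text.

Link: https://arxiv.org/abs/2506.04575
Authors: Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: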

Abstract: Neuro-symbolic approaches combining large language models (LLMs) with solvers excel at logical reasoning problems that need long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas. Then reliable symbolic solvers return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs’ ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through logic-invariant lexical diversification. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs’ ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at this https URL.
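The abstract's "table unifying diverse expressions" can be pictured as a lookup applied before translation to logic. The toy below only illustrates that idea: in the paper the table is constructed by the LLM itself, whereas here it is a hand-written dictionary, and the function name and word-level matching are assumptions.

```python
def unify(text, table):
    """Replace each word that has a table entry with its canonical form."""
    return " ".join(table.get(word.lower(), word) for word in text.split())


# Hypothetical table mapping lexical variants to one canonical term.
synonym_table = {"feline": "cat", "kitty": "cat"}
unified = unify("Every feline sleeps and every kitty purrs", synonym_table)
```

After unification both clauses mention the same term, so a downstream translator would be more likely to emit a single predicate symbol rather than two unrelated ones.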

[NLP-94] Reasoning or Overthinking: Evaluating Large Language Models on Financial Sentiment Analysis

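Quick read: This paper studies how well large language models (LLMs) perform zero-shot financial sentiment analysis, focusing on the difference between reasoning-based and non-reasoning models. By comparing how closely different LLMs and prompting strategies agree with human-labeled sentiment in a financial context, it finds that reasoning mechanisms such as Chain-of-Thought prompting do not necessarily improve performance and may even degrade predictions through overthinking, whereas fast, intuitive "System 1"-style processing is closer to human judgment.

Link: https://arxiv.org/abs/2506.04574
Authors: Dimitris Vamvourellis, Dhagash Mehta
Affiliations: BlackRock, Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: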

Abstract:We investigate the effectiveness of large language models (LLMs), including reasoning-based and non-reasoning models, in performing zero-shot financial sentiment analysis. Using the Financial PhraseBank dataset annotated by domain experts, we evaluate how various LLMs and prompting strategies align with human-labeled sentiment in a financial context. We compare three proprietary LLMs (GPT-4o, GPT-4.1, o3-mini) under different prompting paradigms that simulate System 1 (fast and intuitive) or System 2 (slow and deliberate) thinking and benchmark them against two smaller models (FinBERT-Prosus, FinBERT-Tone) fine-tuned on financial sentiment analysis. Our findings suggest that reasoning, either through prompting or inherent model design, does not improve performance on this task. Surprisingly, the most accurate and human-aligned combination of model and method was GPT-4o without any Chain-of-Thought (CoT) prompting. We further explore how performance is impacted by linguistic complexity and annotation agreement levels, uncovering that reasoning may introduce overthinking, leading to suboptimal predictions. This suggests that for financial sentiment classification, fast, intuitive “System 1”-like thinking aligns more closely with human judgment compared to “System 2”-style slower, deliberative reasoning simulated by reasoning models or CoT prompting. Our results challenge the default assumption that more reasoning always leads to better LLM decisions, particularly in high-stakes financial applications.

[NLP-95] Demonstrations of Integrity Attacks in Multi-Agent Systems

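Quick read: This paper addresses integrity threats that malicious agents pose to Multi-Agent Systems (MAS), in particular agents that use subtle prompt manipulation to bias system behavior for their own benefit. It shows that carefully crafted prompts can introduce systematic biases that mislead evaluation systems and manipulate collaborating agents, and that existing monitors based on Large Language Models (LLMs) have clear limitations. The study stresses the need for MAS architectures with strong security protocols and content validation mechanisms.

Link: https://arxiv.org/abs/2506.04572
Authors: Can Zheng, Yuhan Cao, Xiaoning Dong, Tianxing He
Affiliations: Shanghai Qi Zhi Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; University of Pittsburgh; ShanghaiTech University
Categories: Computation and Language (cs.CL)
Comments: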

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, code generation, and complex planning. Simultaneously, Multi-Agent Systems (MAS) have garnered attention for their potential to enable cooperation among distributed agents. However, from a multi-party perspective, MAS could be vulnerable to malicious agents that exploit the system to serve self-interests without disrupting its core functionality. This work explores integrity attacks where malicious agents employ subtle prompt manipulation to bias MAS operations and gain various benefits. Four types of attacks are examined: Scapegoater, who misleads the system monitor to underestimate other agents’ contributions; Boaster, who misleads the system monitor to overestimate their own performance; Self-Dealer, who manipulates other agents to adopt certain tools; and Free-Rider, who hands off its own task to others. We demonstrate that strategically crafted prompts can introduce systematic biases in MAS behavior and executable instructions, enabling malicious agents to effectively mislead evaluation systems and manipulate collaborative agents. Furthermore, our attacks can bypass advanced LLM-based monitors, such as GPT-4o-mini and o3-mini, highlighting the limitations of current detection mechanisms. Our findings underscore the critical need for MAS architectures with robust security protocols and content validation mechanisms, alongside monitoring systems capable of comprehensive risk scenario assessment.

[NLP-96] Clustering and Median Aggregation Improve Differentially Private Inference

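Quick read: This paper addresses the drop in text quality in differentially private (DP) language model inference, which is especially severe when sensitive inputs cover heterogeneous topics and inference batches are sampled uniformly. The key to the solution is to cluster the input data before selecting inference batches, which improves the consistency and quality of the generated text. Exploiting the greater similarity of predictions after clustering, the paper also proposes a new algorithm that aggregates next-token statistics by privately computing medians instead of averages, which lowers local sensitivity and allows a data-dependent, ex-post DP guarantee.

Link: https://arxiv.org/abs/2506.04566
Authors: Kareem Amin, Salman Avestimehr, Sara Babakniya, Alex Bie, Weiwei Kong, Natalia Ponomareva, Umar Syed
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: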

Abstract: Differentially private (DP) language model inference is an approach for generating private synthetic text. A sensitive input example is used to prompt an off-the-shelf large language model (LLM) to produce a similar example. Multiple examples can be aggregated together to formally satisfy the DP guarantee. Prior work creates inference batches by sampling sensitive inputs uniformly at random. We show that uniform sampling degrades the quality of privately generated text, especially when the sensitive examples concern heterogeneous topics. We remedy this problem by clustering the input data before selecting inference batches. Next, we observe that clustering also leads to more similar next-token predictions across inferences. We use this insight to introduce a new algorithm that aggregates next token statistics by privately computing medians instead of averages. This approach leverages the fact that the median has decreased local sensitivity when next token predictions are similar, allowing us to state a data-dependent and ex-post DP guarantee about the privacy properties of this algorithm. Finally, we demonstrate improvements in terms of representativeness metrics (e.g., MAUVE) as well as downstream task performance. We show that our method produces high-quality synthetic data at significantly lower privacy cost than a previous state-of-the-art method.
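The contrast between aggregating next-token statistics with a median and with an average can be seen in a few lines. The sketch below is deliberately non-private: it omits the noise and privacy accounting that a real DP mechanism requires, so it illustrates only the aggregation step. The function name and the final renormalization are assumptions.

```python
from statistics import mean, median


def aggregate_next_token(prob_rows, how="median"):
    """Combine per-inference next-token distributions coordinate by coordinate.

    NOTE: no noise is added here, so this is NOT differentially private.
    """
    reducer = median if how == "median" else mean
    vocab_size = len(prob_rows[0])
    scores = [reducer(row[t] for row in prob_rows) for t in range(vocab_size)]
    total = sum(scores)
    return [s / total for s in scores]  # renormalize (assumes total > 0)


# Hypothetical two-token vocabulary: two inferences agree, one disagrees.
rows = [[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]]
by_median = aggregate_next_token(rows, "median")
by_mean = aggregate_next_token(rows, "mean")
```

With similar predictions in the batch, the median stays at the majority view while the mean is pulled toward the outlier, which gives some intuition for why a single changed input moves the median less.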

[NLP-97] SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?

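Quick read: This paper addresses the challenge of evaluating machine translation (MT) quality for under-resourced African languages, where existing metrics have limited language coverage and perform poorly in low-resource settings. The key to the solution is SSA-MTE, a large-scale human-annotated MT evaluation dataset covering 13 African language pairs, from which the authors develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. They also benchmark prompting-based approaches with state-of-the-art LLMs. The new metrics clearly outperform existing ones, particularly on low-resource languages such as Twi, Luo, and Yoruba.

Link: https://arxiv.org/abs/2506.04557
Authors: Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: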

Abstract:Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.

[NLP-98] BSBench: will your LLM find the largest prime number?

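Quick read: This paper addresses how to evaluate Large Language Models (LLMs) on questions that have no reasonable answer. The key to the solution is a benchmark that enables such testing, together with a method for modifying existing datasets; the results show that current models are far from perfect on questions of this kind.

Link: https://arxiv.org/abs/2506.04535
Authors: K. O. T. Erziev
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 7 + 2 pages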

Abstract: We propose that benchmarking LLMs on questions which have no reasonable answer actually isn’t as silly as it sounds. We also present a benchmark that allows such testing and a method to modify existing datasets, and discover that existing models demonstrate performance far from perfect on such questions. Our code and data artifacts are available at this https URL

[NLP-99] Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs ACL2025

Quick read: This paper examines how well large language models (LLMs) understand the fine-grained senses of the English discourse particle "just", especially its complex pragmatic and semantic functions. The key to the approach is a dataset carefully created and labeled by expert linguists, used to assess whether LLMs can identify the multiple senses of "just", a word widely studied in formal semantics. The results show that LLMs can distinguish broader categories to some extent but still fall clearly short in capturing subtler semantic and discourse effects.

Link: https://arxiv.org/abs/2506.04534
Authors: William Sheffield, Kanishka Misra, Valentina Pyatkin, Ashwini Deo, Kyle Mahowald, Junyi Jessy Li
Affiliations: The University of Texas at Austin; Toyota Technological Institute at Chicago; Allen Institute for AI; University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in Findings of The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). The main paper is 5 pages and contains 3 figures and 1 table. In total, the paper is 12 pages and contains 8 figures and 5 tables (References + Appendix)

Abstract:Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle “just” (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English “just”, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.

[NLP-100] Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning INTERSPEECH2025

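Quick read: This paper addresses phonemic and prosodic labels of speech that are inconsistent with the graphemes, and aims to obtain labels that are coherent with them. The key to the solution is to condition label generation on graphemes in two ways: 1) implicit grapheme conditioning through a prompt encoder that uses pre-trained BERT features; 2) explicit pruning, during inference, of label hypotheses that are inconsistent with the graphemes. These methods yield parallel data of speech, labels, and graphemes that can be used for downstream tasks such as text-to-speech and accent estimation from text.

Link: https://arxiv.org/abs/2506.04527
Authors: Hien Ohnaka, Yuma Shirahata, Byeongseon Park, Ryuichi Yamamoto
Affiliations: unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, and 4 tables, accepted to INTERSPEECH 2025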

Abstract: We propose a model to obtain phonemic and prosodic labels of speech that are coherent with graphemes. Unlike previous methods that simply fine-tune a pre-trained ASR model with the labels, the proposed model conditions the label generation on corresponding graphemes by two methods: 1) Add implicit grapheme conditioning through a prompt encoder using pre-trained BERT features. 2) Explicitly prune the label hypotheses inconsistent with the grapheme during inference. These methods enable obtaining parallel data of speech, the labels, and graphemes, which is applicable to various downstream tasks such as text-to-speech and accent estimation from text. Experiments showed that the proposed method significantly improved the consistency between graphemes and the predicted labels. Further, experiments on the accent estimation task confirmed that the parallel data created by the proposed method effectively improves the estimation accuracy.

[NLP-101] Please Translate Again: Two Simple Experiments on Whether Human-Like Reasoning Helps Translation

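Quick read: This paper asks whether explicitly decomposing the translation process, as in Chain-of-Thought (CoT), actually improves the performance of Large Language Models (LLMs) on machine translation. Through an empirical analysis of existing methods, it finds that performance gains may not stem from explicit step-by-step reasoning: simply prompting LLMs to "translate again" outperforms human-designed step-by-step prompts. The study calls for further exploration of the factors behind CoT's effectiveness in translation.

Link: https://arxiv.org/abs/2506.04521
Authors: Di Wu, Seth Aycock, Christof Monz
Affiliations: University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 16 figures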

Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities for many tasks, often by explicitly decomposing the task via Chain-of-Thought (CoT) reasoning. Recent work on LLM-based translation designs hand-crafted prompts to decompose translation, or trains models to incorporate intermediate steps. Translating Step-by-step (Briakou et al., 2024), for instance, introduces a multi-step prompt with decomposition and refinement of translation with LLMs, which achieved state-of-the-art results on WMT24. In this work, we scrutinise this strategy’s effectiveness. Empirically, we find no clear evidence that performance gains stem from explicitly decomposing the translation process, at least for the models on test; and we show that simply prompting LLMs to “translate again” yields even better results than human-like step-by-step prompting. Our analysis does not rule out the role of reasoning, but instead invites future work exploring the factors for CoT’s effectiveness in the context of translation.

[NLP-102] DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation

Quick read: This paper addresses the problem that Large Language Models (LLMs) tend to give unreliable results in ambiguous scenarios with multiple valid responses, while Small Language Models (SLMs) are robust in such scenarios but vulnerable to misleading or adversarial inputs. The key to the solution is to exploit the LLMs' strength on negative examples and the SLMs' strength on positive examples by integrating the two through adaptive weighting, in a method called SLIDE (Small and Large Integrated for Dialogue Evaluation). Building on SLIDE, the paper introduces Dual-Refinement Evaluation (DRE) to further improve SLM-LLM integration.

Link: https://arxiv.org/abs/2506.04516
Authors: Kun Zhao, Bohao Yang, Chen Tang, Siyuan Dai, Haoteng Tang, Chenghua Lin, Liang Zhan
Affiliations: University of Pittsburgh; The University of Manchester; University of Texas Rio Grande Valley
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2405.15924

Abstract:Large Language Models (LLMs) excel at many tasks but struggle with ambiguous scenarios where multiple valid responses exist, often yielding unreliable results. Conversely, Small Language Models (SLMs) demonstrate robustness in such scenarios but are susceptible to misleading or adversarial inputs. We observed that LLMs handle negative examples effectively, while SLMs excel with positive examples. To leverage their complementary strengths, we introduce SLIDE (Small and Large Integrated for Dialogue Evaluation), a method integrating SLMs and LLMs via adaptive weighting. Building on SLIDE, we further propose a Dual-Refinement Evaluation (DRE) method to enhance SLM-LLM integration: (1) SLM-generated insights guide the LLM to produce initial evaluations; (2) SLM-derived adjustments refine the LLM’s scores for improved accuracy. Experiments demonstrate that DRE outperforms existing methods, showing stronger alignment with human judgment across diverse benchmarks. This work illustrates how combining small and large models can yield more reliable evaluation tools, particularly for open-ended tasks such as dialogue evaluation.

[NLP-103] SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL

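Quick read: This paper addresses the reliability problem that arises when Large Language Models (LLMs) generate text-to-SQL queries that are syntactically valid but semantically wrong. The key to the solution is SQLens, an end-to-end framework that combines error signals from the underlying database and from the LLM to detect and correct semantic errors in SQL at a fine-grained level, improving the accuracy and execution results of generated queries.

Link: https://arxiv.org/abs/2506.04494
Authors: Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, Tim Kraska
Affiliations: Amazon Web Services; The University of Chicago; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: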

Abstract:Text-to-SQL systems translate natural language (NL) questions into SQL queries, enabling non-technical users to interact with structured data. While large language models (LLMs) have shown promising results on the text-to-SQL task, they often produce semantically incorrect yet syntactically valid queries, with limited insight into their reliability. We propose SQLens, an end-to-end framework for fine-grained detection and correction of semantic errors in LLM-generated SQL. SQLens integrates error signals from both the underlying database and the LLM to identify potential semantic errors within SQL clauses. It further leverages these signals to guide query correction. Empirical results on two public benchmarks show that SQLens outperforms the best LLM-based self-evaluation method by 25.78% in F1 for error detection, and improves execution accuracy of out-of-the-box text-to-SQL systems by up to 20%.
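A query can be syntactically valid and still semantically wrong, and the database itself can sometimes hint at that. The sketch below shows one possible database-derived signal, an unexpectedly empty result, using Python's built-in `sqlite3` module. The table, the queries, and the choice of signal are illustrative assumptions and are not claimed to be among the signals SQLens actually uses.

```python
import sqlite3


def empty_result_signal(conn, query):
    """Return True when a syntactically valid query yields no rows."""
    rows = conn.execute(query).fetchall()
    return len(rows) == 0


# Hypothetical in-memory table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Paris", "France"), ("Lyon", "France")],
)

# The first query filters on a value that never occurs in the column.
suspicious = empty_result_signal(conn, "SELECT name FROM city WHERE country = 'FR'")
plausible = empty_result_signal(conn, "SELECT name FROM city WHERE country = 'France'")
```

An empty result is weak evidence on its own, since some questions legitimately have no answer, which is presumably why a framework would combine several signals rather than rely on a single one.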

[NLP-104] Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems ACL2025

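Quick read: This paper addresses the mismatch between publicly available instruments for measuring representational harms caused by large language model (LLM)-based systems and the needs of practitioners. It finds that practitioners are often unable to use these instruments and identifies two types of challenges: in some cases the instruments do not meaningfully measure what practitioners seek to measure, and in others even useful instruments go unused because of practical and institutional barriers. The key to the solution is to draw on measurement theory and pragmatic measurement to make targeted recommendations that better meet practitioner needs.

Link: https://arxiv.org/abs/2506.04482
Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach
Affiliations: Cornell University; Microsoft Research
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Findings of the Association for Computational Linguistics: ACL 2025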

Abstract:The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments, even useful ones, are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.

[NLP-105] Aligning Large Language Models with Implicit Preferences from User-Generated Content ACL2025

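Quick read: This paper addresses the alignment of large language models (LLMs) with human values, specifically how to improve response quality without relying on human-annotated data or data generated by advanced LLMs, both of which are costly and hard to scale. The key to the solution is PUGC, a framework that generates preference data by mining the implicit human preferences in unlabeled user-generated content (UGC). PUGC transforms UGC into user queries, generates responses from the policy model, and then uses the UGC as a reference text for scoring those responses, aligning the model with the implicit preferences.

Link: https://arxiv.org/abs/2506.04463
Authors: Zhaoxuan Tan, Zheng Li, Tianyi Liu, Haodong Wang, Hyokun Yun, Ming Zeng, Pei Chen, Zhihan Zhang, Yifan Gao, Ruijie Wang, Priyanka Nigam, Bing Yin, Meng Jiang
Affiliations: Amazon.com Inc
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2025 Main Conference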

Abstract:Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers’ questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at this https URL
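
The pipeline described above (query, sampled responses, reference-based scoring) can be sketched roughly as follows. The abstract does not say which scorer PUGC uses, so a simple token-overlap function stands in for it here; `overlap_score`, `build_preference_pair`, and the sample texts are all hypothetical.

```python
def overlap_score(response: str, reference: str) -> float:
    """Jaccard overlap between lowercase token sets; a stand-in for a real scorer."""
    a = set(response.lower().split())
    b = set(reference.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_preference_pair(query, responses, reference):
    """Rank policy responses against the UGC reference and keep the extremes."""
    ranked = sorted(responses, key=lambda r: overlap_score(r, reference))
    return {"prompt": query, "chosen": ranked[-1], "rejected": ranked[0]}

# Made-up user-generated content, a query derived from it, and two responses.
ugc = "soak the pan in warm soapy water before scrubbing"
pair = build_preference_pair(
    query="How do I clean a burnt pan?",
    responses=["soak the pan in warm water", "buy a new pan"],
    reference=ugc,
)
```

The resulting `prompt`/`chosen`/`rejected` record has the shape commonly fed to DPO-style training.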

[NLP-106] Watermarking Degrades Alignment in Language Models: Analysis and Mitigation ICLR2025

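Quick read: This paper addresses the negative effects of watermarking on the core alignment properties of large language models (LLMs): truthfulness, safety, and helpfulness. It finds that two popular watermarking approaches (Gumbel and KGW) can either weaken safety as helpfulness increases or reduce helpfulness through excessive caution. Both patterns stem from watermark-induced shifts in the token distribution, which expose a fundamental tension between alignment objectives. The key to the solution is Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment and mitigate the degradation caused by watermarking.

Link: https://arxiv.org/abs/2506.04462
Authors: Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Affiliations: New Jersey Institute of Technology; MIT
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG); ACM class: I.2.7
Comments: Published at the 1st Workshop on GenAI Watermarking, collocated with ICLR 2025. OpenReview: this https URL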

Abstract:Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches (Gumbel and KGW) affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
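
A minimal sketch of the resampling idea: draw a few watermarked generations and keep the one an external reward model scores highest. Choosing the maximum is an assumption about the selection rule, which the abstract does not spell out; `generate` and `reward` are placeholder callables, not a real API.

```python
def alignment_resample(generate, reward, prompt, n=4):
    """Draw n candidate generations and return the highest-reward one.

    `generate(prompt)` is assumed to return one watermarked sample per call,
    and `reward(text)` a scalar alignment score from an external model.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: a fixed stream of samples and a lookup-table reward.
samples = iter(["Unsafe answer.", "I cannot help.", "Here is a safe, useful answer."])
toy_reward = {"Unsafe answer.": 0.1, "I cannot help.": 0.4,
              "Here is a safe, useful answer.": 0.9}.get
best = alignment_resample(lambda p: next(samples), toy_reward, "question", n=3)
```

Because every candidate comes from the watermarked sampler, the selected output still carries the watermark, which is consistent with the abstract's claim that detectability is maintained.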

[NLP-107] Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey ACL2025

Quick read: This paper addresses how systematicity in machine learning models is assessed and improved, and in particular the limitation that existing work focuses on the systematicity of behaviour rather than the systematicity of representations. The key to the solution is to stress the importance of representational systematicity and to propose assessing it with methods from mechanistic interpretability, giving a fuller picture of, and a way to strengthen, model generalization.

Link: https://arxiv.org/abs/2506.04461
Authors: Ivan Vegner, Sydelle de Souza, Valentin Forch, Martha Lewis, Leonidas A.A. Doumas
Affiliations: School of Informatics, University of Edinburgh; Chemnitz University of Technology; Institute of Logic, Language, and Computation, University of Amsterdam; School of Philosophy, Psychology & Language Sciences, University of Edinburgh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear at ACL 2025 Main Conference

Abstract:A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley’s (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

[NLP-108] Zero-Shot Open-Schema Entity Structure Discovery

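Quick read: This paper addresses the incomplete extraction results of existing entity structure extraction methods based on large language models (LLMs), which depend on predefined entity attribute schemas or annotated data. The key to the solution is Zero-Shot Open-schema Entity Structure Discovery (ZOES), an approach that requires no schema or annotated samples. Through a mechanism of enrichment, refinement, and unification, it exploits the mutually reinforcing relationship between an entity and its structure to help LLMs extract more complete entity structures across domains.

Link: https://arxiv.org/abs/2506.04458
Authors: Xueqiang Xu, Jinfeng Xiao, James Barry, Mohab Elkaref, Jiaru Zou, Pengcheng Jiang, Yunyi Zhang, Max Giammona, Geeth de Mel, Jiawei Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 3 figures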

Abstract:Entity structure extraction, which aims to extract entities and their associated attribute-value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce Zero-Shot Open-schema Entity Structure Discovery (ZOES), a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs’ ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.

[NLP-109] Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science

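Quick read: This paper addresses how to filter hypotheses for feasibility in assisted scientific discovery, so as to reduce the cost of automated experiments. The key to the solution is Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. It contains 8.4k claims extracted from articles on four high-impact materials science topics and covers both qualitative and quantitative claims. Domain-expert verification suggests that nearly all of the claims are solvable, which points to a potential route for accelerating scientific discovery.

Link: https://arxiv.org/abs/2506.04410
Authors: Peter Jansen, Samiah Hassan, Ruoyao Wang
Affiliations: University of Arizona; Allen Institute for Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
Comments: 8 pages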

Abstract:Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e., thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale, while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics, including superconductors, semiconductors, batteries, and aerospace materials, while including qualitative and quantitative claims from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all are solvable, highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.

[NLP-110] Empaths at SemEval-2025 Task 11: Retrieval-Augmented Approach to Perceived Emotions Prediction ACL2025 SEMEVAL-2025

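Quick read: This paper addresses multi-label detection of perceived emotions in text: predicting the emotions conveyed by the speaker of a given text snippet, such as joy, sadness, fear, anger, surprise, and disgust. The key to the solution is that it requires no additional model training and instead predicts emotions with an ensemble of models, which keeps performance competitive while making the system more efficient, scalable, and easier to implement.

Link: https://arxiv.org/abs/2506.04409
Authors: Lev Morozov, Aleksandr Mogilevskii, Alexander Shirnin
Affiliations: HSE University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to SemEval-2025, an ACL 2025 workshop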

Abstract:This paper describes EmoRAG, a system designed to detect perceived emotions in text for SemEval-2025 Task 11, Subtask A: Multi-label Emotion Detection. We focus on predicting the perceived emotions of the speaker from a given text snippet, labeling it with emotions such as joy, sadness, fear, anger, surprise, and disgust. Our approach does not require additional model training and only uses an ensemble of models to predict emotions. EmoRAG achieves results comparable to the best performing systems, while being more efficient, scalable, and easier to implement.
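
The abstract does not say how EmoRAG aggregates its ensemble. One common training-free option for multi-label outputs is a per-label majority vote, sketched below; the function name, the voting rule, and the sample predictions are assumptions for illustration only.

```python
from collections import Counter

def majority_vote(predictions, min_votes=None):
    """Keep each label predicted by at least `min_votes` models.

    `predictions` holds one collection of labels per model. By default a label
    needs a strict majority of the models to be kept.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    counts = Counter(label for labels in predictions for label in set(labels))
    return sorted(label for label, votes in counts.items() if votes >= min_votes)

# Three hypothetical model outputs for one text snippet.
model_outputs = [{"joy"}, {"joy", "surprise"}, {"sadness"}]
final_labels = majority_vote(model_outputs)
```

Lowering `min_votes` trades precision for recall, which matters when emotions co-occur.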

[NLP-111] Unpacking Let Alone: Human-Scale Models Generalize to a Rare Construction in Form but not Meaning

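Quick read: This paper asks whether current language models, like humans, can acquire not only the form but also the meaning of rare grammatical phenomena. The key to the study is a bespoke synthetic benchmark that tests human-scale transformer language models on their knowledge of both the form and the meaning of the rare and quirky English LET-ALONE construction. The results show that these models are sensitive to form but do not generalize correctly about the construction's meaning, revealing an asymmetry in the sample efficiency of current architectures between language form and meaning.

Link: https://arxiv.org/abs/2506.04408
Authors: Wesley Scivetti, Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: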

Abstract:Humans have a remarkable ability to acquire and understand grammatical phenomena that are seen rarely, if ever, during childhood. Recent evidence suggests that language models with human-scale pretraining data may possess a similar ability by generalizing from frequent to rare constructions. However, it remains an open question how widespread this generalization ability is, and to what extent this knowledge extends to meanings of rare constructions, as opposed to just their forms. We fill this gap by testing human-scale transformer language models on their knowledge of both the form and meaning of the (rare and quirky) English LET-ALONE construction. To evaluate our LMs we construct a bespoke synthetic benchmark that targets syntactic and semantic properties of the construction. We find that human-scale LMs are sensitive to form, even when related constructions are filtered from the dataset. However, human-scale LMs do not make correct generalizations about LET-ALONE’s meaning. These results point to an asymmetry in the current architectures’ sample efficiency between language form and meaning, something which is not present in human language learners.

[NLP-112] MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

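Quick read: This paper addresses the limited coding-based medical reasoning ability of large language model (LLM) agents by building MedAgentGYM, a publicly available training environment. The environment comprises 72,413 task instances across 129 categories derived from real-world biomedical scenarios, and provides executable coding environments, detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. The key to the solution is to use MedAgentGYM for supervised fine-tuning and continued reinforcement learning, which substantially improves performance: Med-Copilot-7B gains +36.44% and +42.47% from the two techniques respectively, making it an affordable, privacy-preserving, and competitive alternative.

Link: https://arxiv.org/abs/2506.04405
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
Affiliations: Emory University; Georgia Tech; Yale University; UT Southwestern Medical Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: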

Abstract:We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.

[NLP-113] Building a Few-Shot Cross-Domain Multilingual NLU Model for Customer Care

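Quick read: This paper addresses the difficulty of building an intent classifier that generalizes across domains from only a few labeled samples. In e-commerce customer care, data is unevenly distributed across domains (defined by channel, geography, and language), which limits the performance of existing pre-trained models. The key to the solution is an embedder-cum-classifier architecture: a domain-specific sentence embedder is trained by supervised fine-tuning with isotropic regularizers, a multilingual knowledge distillation strategy generalizes the embedder across domains, and a simple linear classifier is added for deployment to new domains. Experiments on few-shot intent detection show a 20-23% increase in accuracy over existing state-of-the-art pre-trained models.

Link: https://arxiv.org/abs/2506.04389
Authors: Saurabh Kumar, Sourav Bansal, Neeraj Agrawal, Priyanka Bhatt
Affiliations: Walmart Global Tech, Bangalore, India; Indian Institute of Technology, Delhi
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: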

Abstract:Customer care is an essential pillar of the e-commerce shopping experience, with companies spending millions of dollars each year, employing automation and human agents, across geographies (like US, Canada, Mexico, Chile), channels (like Chat, Interactive Voice Response (IVR)), and languages (like English, Spanish). SOTA pre-trained models like multilingual-BERT, fine-tuned on annotated data, have shown good performance in downstream tasks relevant to Customer Care. However, model performance is largely subject to the availability of sufficient annotated domain-specific data. Cross-domain availability of data remains a bottleneck, so building an intent classifier that generalizes across domains (defined by channel, geography, and language) with only a few annotations is of great practical value. In this paper, we propose an embedder-cum-classifier model architecture which extends state-of-the-art domain-specific models to other domains with only a few labeled samples. We adopt a supervised fine-tuning approach with isotropic regularizers to train a domain-specific sentence embedder and a multilingual knowledge distillation strategy to generalize this embedder across multiple domains. The trained embedder, further augmented with a simple linear classifier, can be deployed for new domains. Experiments on Canada and Mexico e-commerce Customer Care datasets with few-shot intent detection show an increase in accuracy by 20-23% against the existing state-of-the-art pre-trained models.

[NLP-114] MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP ACL2025

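Quick read: This paper addresses the limited performance of large language models (LLMs) on low-resource languages such as Maltese, especially on generative tasks. Its multidimensional analysis finds that prior exposure to the target language during pre-training and instruction-tuning is the most important factor affecting performance. It also notes that fine-tuning, despite a higher initial cost, yields better performance and lower inference costs, which makes it a more effective strategy for modelling low-resource languages.

Link: https://arxiv.org/abs/2506.04385
Authors: Kurt Micallef, Claudia Borg
Affiliations: University of Malta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings Camera-Ready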

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more “traditional” language modelling approaches.

[NLP-115] Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy

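Quick read: This paper addresses how to combine the label hierarchy with text representations in hierarchical text classification (HTC). Existing methods either encode the label hierarchy separately and mix it with the text encoding, or guide the label hierarchy structure in the text encoder; the two approaches capture different characteristics and are complementary. The key to the solution is HTC-CLIP, which uses contrastive learning to learn both a hierarchy-aware text representation and a text-informed, path-guided hierarchy representation, and pools the two resulting probability distributions at inference. On two public benchmark datasets it improves Macro F1 by 0.99-2.37% over existing state-of-the-art models.

Link: https://arxiv.org/abs/2506.04381
Authors: Neeraj Agrawal, Saurabh Kumar, Priyanka Bhatt, Tanishka Agarwal
Affiliations: Walmart Global Tech
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2203.03825 by other authors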

Abstract:Hierarchical Text Classification (HTC) has recently gained traction given the ability to handle complex label hierarchy. This has found applications in domains like e-commerce, customer care, and medicine, among other real-world applications. Existing HTC models either encode label hierarchy separately and mix it with text encoding or guide the label hierarchy structure in the text encoder. Both approaches capture different characteristics of label hierarchy and are complementary to each other. In this paper, we propose a Hierarchical Text Classification using Contrastive Learning Informed Path guided hierarchy (HTC-CLIP), which learns hierarchy-aware text representation and text informed path guided hierarchy representation using contrastive learning. During the training of HTC-CLIP, we learn two different sets of class probability distributions and during inference, we use the pooled output of both probabilities for each class to get the best of both representations. Our results show that the two previous approaches can be effectively combined into one architecture to achieve improved performance. Tests on two public benchmark datasets showed an improvement of 0.99-2.37% in Macro F1 score using HTC-CLIP over the existing state-of-the-art models.
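
The inference step, pooling two per-class probability vectors, might look like the sketch below. The abstract does not name the pooling operator, so an element-wise mean is assumed; the 0.5 threshold and the example numbers are also illustrative.

```python
def pool_probabilities(p_text, p_path, threshold=0.5):
    """Average two per-class probability vectors and threshold the result.

    `p_text` stands for the hierarchy-aware text head and `p_path` for the
    path-guided hierarchy head. Mean pooling is an assumption.
    """
    if len(p_text) != len(p_path):
        raise ValueError("both heads must score the same set of classes")
    pooled = [(a + b) / 2 for a, b in zip(p_text, p_path)]
    labels = [i for i, p in enumerate(pooled) if p >= threshold]
    return pooled, labels

# Hypothetical scores for three classes from the two heads.
pooled, labels = pool_probabilities([0.9, 0.2, 0.6], [0.7, 0.4, 0.3])
```

Class 2 is confident under one head only and falls below the threshold after pooling, which illustrates how the two heads temper each other.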

[NLP-116] Mechanistic Decomposition of Sentence Representations

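Quick read: This paper addresses the opacity of the internal structure of sentence embeddings. Although embeddings can be compared with measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding is hard to trace because of complex neural transformations and the final pooling operation. The key to the solution is to decompose sentence embeddings mechanistically into interpretable components by applying dictionary learning to token-level representations, then to analyze how pooling compresses these features into sentence representations and to assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis.

Link: https://arxiv.org/abs/2506.04373
Authors: Matthieu Tehenan, Vikram Natarajan, Jonathan Michala, Milton Lin, Juri Opitz
Affiliations: University of Cambridge; Independent; University of Southern California; Johns Hopkins University; University of Zürich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: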

Abstract:Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
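
One reason token-level dictionary features can carry over to the sentence level is that mean pooling is linear. The toy sketch below assumes mean pooling and an exact reconstruction of each token embedding from a small dictionary; the dictionary, the codes, and the function names are made up for illustration and are not the authors' setup.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def decode(codes, dictionary):
    """Reconstruct a vector as sum_k codes[k] * dictionary[k]."""
    dim = len(dictionary[0])
    return [sum(c * atom[d] for c, atom in zip(codes, dictionary)) for d in range(dim)]

# Two hypothetical feature directions in a 3-dimensional embedding space.
dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
# Sparse codes for three tokens (how strongly each feature is active).
token_codes = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

token_embeddings = [decode(codes, dictionary) for codes in token_codes]
sentence_embedding = mean_pool(token_embeddings)

# Pooling the codes first and decoding afterwards gives the same vector.
sentence_codes = mean_pool(token_codes)
reconstructed = decode(sentence_codes, dictionary)
```

Under these assumptions the sentence embedding has its own code vector, the average of the token codes, so each dictionary feature's contribution to the sentence can be read off directly.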

[NLP-117] Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR INTERSPEECH2025

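Quick read: This paper addresses how to make automatic speech recognition (ASR) systems more robust to unseen accents. It studies three variables in the training data under a low-resource regime: the number of speakers, the audio duration per speaker, and the diversity of accents. For a fixed total number of training hours, increasing the number of speakers helps more than increasing the hours per speaker, and having more speakers lets ASR performance benefit from scaling the number of hours. When the number of speakers is controlled, prioritizing speakers with different accents brings only minimal benefit. The study therefore recommends prioritizing speaker count when composing ASR training data for new languages.

Link: https://arxiv.org/abs/2506.04364
Authors: Zheng-Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to INTERSPEECH 2025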

Abstract:To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR needs to be robust to a wide range of accents including unseen accents. We systematically study how three different variables in training data (the number of speakers, the audio duration per speaker, and the diversity of accents) affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that more speakers enable ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.

[NLP-118] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

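Quick read: This paper addresses the lack of a large-scale, comprehensive benchmark for chest X-ray visual question answering (VQA), which is needed to evaluate AI in radiology more accurately. The key to the solution is ReXVQA, the largest and most comprehensive chest radiology VQA benchmark to date, with approximately 696,000 questions paired with 160,000 chest X-ray studies. Its task suite reflects five core radiological reasoning skills (presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning), which makes the tasks more clinically authentic and diverse. A comparison of state-of-the-art multimodal large language models with radiology residents further shows that AI can match or exceed human readers on chest X-ray interpretation.

Link: https://arxiv.org/abs/2506.04353
Authors: Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar
Affiliations: Saama AI Research; Seoul National University; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: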

Abstract:We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at this https URL

[NLP-119] GEM: Empowering LLM for both Embedding Generation and Language Understanding

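Quick read: This paper addresses the system complexity and inconsistent query understanding that arise when applications rely on a separate embedding model to generate text embeddings. The key to the solution is a simple self-supervised approach, the Generative Embedding large language Model (GEM), which inserts special tokens into the text and manipulates the attention mask to produce a summarization embedding of the text. This lets any large decoder-only language model (LLM) generate high-quality text embeddings while keeping its original text generation and reasoning capabilities.

Link: https://arxiv.org/abs/2506.04344
Authors: Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan
Affiliations: Meta Inc
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: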

Abstract:Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies in understanding of the query between the embedding model and LLMs. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body, and generates a summarization embedding of the text by manipulating the attention mask. This method could be easily integrated into post-training or fine-tuning stages of any existing LLMs. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. Our strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
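
The abstract does not specify how the attention mask is manipulated. One plausible reading, shown below purely as an assumption, is a bottleneck: tokens that follow the inserted special token(s) may attend to those special tokens but not to the text before them, so the special-token states have to summarize that text. The function name and the layout (prefix, special tokens, suffix) are hypothetical.

```python
def bottleneck_mask(n_prefix, n_special, n_suffix):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    The mask is causal, and suffix positions are additionally blocked from the
    prefix text (an assumed design, not confirmed by the abstract).
    """
    total = n_prefix + n_special + n_suffix
    first_suffix = n_prefix + n_special
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: keys up to and including q
            blocked = q >= first_suffix and k < n_prefix
            mask[q][k] = not blocked
    return mask

# Two prefix tokens, one special token, two suffix tokens.
mask = bottleneck_mask(n_prefix=2, n_special=1, n_suffix=2)
```

In this layout the hidden state at the special-token position would serve as the text embedding, while ordinary generation can keep using a plain causal mask.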

[NLP-120] A Graph-Retrieval-Augmented Generation Framework Enhances Decision-Making in the Circular Economy

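Quick read: This paper addresses the tendency of large language models (LLMs) to hallucinate industrial codes and emission factors in sustainable manufacturing, which undermines regulatory and investment decisions. The key to the solution is CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLM outputs in a domain-specific knowledge graph for the circular economy and enables structured multi-hop reasoning. The graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data. Natural language queries are translated into SPARQL and verified subgraphs are retrieved, ensuring accuracy and traceability.

Link: https://arxiv.org/abs/2506.04252
Authors: Yang Zhao, Chengxiao Dai, Dusit Niyato, Chuan Fu Tan, Keyi Xiang, Yueyang Wang, Zhiquan Yeo, Daren Tan Zong Loong, Jonathan Low Zhaozhi, Eugene H.Z. HO
Affiliations: SIMTech; NTU; NUS
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: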

Abstract:Large language models (LLMs) hold promise for sustainable manufacturing, but often hallucinate industrial codes and emission factors, undermining regulatory and investment decisions. We introduce CircuGraphRAG, a retrieval-augmented generation (RAG) framework that grounds LLM outputs in a domain-specific knowledge graph for the circular economy. This graph connects 117,380 industrial and waste entities with classification codes and GWP100 emission data, enabling structured multi-hop reasoning. Natural language queries are translated into SPARQL and verified subgraphs are retrieved to ensure accuracy and traceability. Compared with Standalone LLMs and Naive RAG, CircuGraphRAG achieves superior performance in single-hop and multi-hop question answering, with ROUGE-L F1 scores up to 1.0, while baseline scores remain below 0.08. It also improves efficiency, halving the response time and reducing token usage by 16% in representative tasks. CircuGraphRAG provides fact-checked, regulatory-ready support for circular economy planning, advancing reliable, low-carbon resource decision making.
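
To make the "query to SPARQL, then retrieve a verified subgraph" idea concrete, here is a minimal sketch under stated assumptions. It is not the CircuGraphRAG implementation: the triples, the `ex:` identifiers, and the fixed two-hop template are all invented for illustration, and a template stands in for the LLM-based translation step. The lookup runs over an in-memory list rather than a real SPARQL endpoint.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Toy triples; every identifier and value here is invented for illustration.
TRIPLES: List[Triple] = [
    ("ex:WasteA", "ex:producedBy", "ex:IndustryX"),
    ("ex:IndustryX", "ex:hasCode", "CODE-001"),
    ("ex:WasteA", "ex:hasGWP100", "1.5"),
]


def build_sparql(entity: str, hop1: str, hop2: str) -> str:
    """Fill a fixed two-hop SPARQL template (stand-in for LLM translation)."""
    return (
        "SELECT ?answer WHERE { "
        f"{entity} {hop1} ?mid . ?mid {hop2} ?answer . }}"
    )


def objects(subject: str, predicate: str) -> List[str]:
    """Return all objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def two_hop(entity: str, hop1: str, hop2: str) -> List[Tuple[str, List[Triple]]]:
    """Answer a two-hop query and return each answer with its supporting triples."""
    results = []
    for mid in objects(entity, hop1):
        for answer in objects(mid, hop2):
            evidence = [(entity, hop1, mid), (mid, hop2, answer)]
            results.append((answer, evidence))
    return results


print(build_sparql("ex:WasteA", "ex:producedBy", "ex:hasCode"))
for answer, evidence in two_hop("ex:WasteA", "ex:producedBy", "ex:hasCode"):
    print(answer, evidence)
```

Returning the supporting triples alongside each answer is one simple way to get the traceability the abstract describes: an answer that has no path in the graph is never produced.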

[NLP-121] Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

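Quick read: This paper addresses how to ensure contextual integrity (CI) when autonomous agents make decisions on behalf of users, that is, how to determine which information is appropriate to share while carrying out a given task. The key to the solution is to have the agent reason about the context in which it operates: large language models (LLMs) are explicitly prompted to reason about CI when deciding what information to disclose, and a reinforcement learning (RL) framework then further instills the reasoning needed to achieve CI. Experiments show that, with a small but diverse dataset, the method reduces inappropriate information disclosure while maintaining task performance.

Link: https://arxiv.org/abs/2506.04245
Authors: Guangchen Lan,Huseyin A. Inan,Sahar Abdelnabi,Janardhan Kulkarni,Lukas Wutschitz,Reza Shokri,Christopher G. Brinton,Robert Sim
Affiliations: Purdue University; Microsoft; National University of Singapore
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: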

Abstract:As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) – what is the appropriate information to share while carrying out a certain task – becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to reason explicitly about CI when deciding what information to disclose. We then extend this approach by developing a reinforcement learning (RL) framework that further instills in models the reasoning necessary to achieve CI. Using a synthetic, automatically created dataset of only ~700 examples but with diverse contexts and information disclosure norms, we show that our method substantially reduces inappropriate information disclosure while maintaining task performance across multiple model sizes and families. Importantly, improvements transfer from this synthetic dataset to established CI benchmarks such as PrivacyLens, which has human annotations and evaluates privacy leakage of AI assistants in actions and tool calls.

[NLP-122] HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Quick read: This paper addresses the shortcomings of current evaluation benchmarks for multimodal large language models (MLLMs) in the Humanities and Social Sciences (HSS): existing benchmarks focus mainly on the general knowledge and vertical reasoning required in STEM fields and overlook the horizontal, interdisciplinary thinking and cross-domain knowledge integration that HSS requires. The key to the solution is HSSBench, a multilingual benchmark designed specifically for HSS tasks, together with a data generation pipeline tailored to HSS scenarios in which domain experts and automated agents collaborate to generate and iteratively refine samples so that they better reflect the complexity of HSS tasks.

Link: https://arxiv.org/abs/2506.03922
Authors: Zhaolu Kang,Junhao Gong,Jiaxu Yan,Wanke Xia,Yian Wang,Ziwen Wang,Huaxuan Ding,Zhuo Cheng,Wenhao Cao,Zhiyuan Feng,Siqi He,Shannan Yan,Junzhe Chen,Xiaomin He,Chaoya Jiang,Wei Ye,Kaidong Yu,Xuelong Li
Affiliations: TeleAI; Peking University; Tsinghua University; Chinese Academy of Sciences; University of British Columbia; Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.

[NLP-123] VLMs Can Aggregate Scattered Training Patches

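Quick read: This paper addresses the safety risk that dangerous samples in training data pose to vision-language models (VLMs), in particular when harmful images are split into benign-looking patches and scattered across many training samples, so that the model may piece the fragments together and generate harmful responses. The key finding is that VLMs possess a "visual stitching" ability: they can integrate visual information from multiple training samples that share the same textual description. The study verifies this ability experimentally and simulates an adversarial data poisoning scenario, showing how harmful content can evade moderation as patches and later be reconstructed through visual stitching, which threatens VLM safety.

Link: https://arxiv.org/abs/2506.03614
Authors: Zhanhui Zhou,Lingjie Chen,Chao Yang,Chaochao Lu
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: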

Abstract:One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description “safe,” VLMs may later describe the full image, or a text reference to the scene, as “safe.” We define the core ability of VLMs enabling this attack as visual stitching – the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each (image, ID) pair into (patch, ID) pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like “safe” or “unsafe”, demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at this https URL.
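
The data construction described in the abstract, splitting each (image, ID) pair into (patch, ID) pairs at a chosen granularity, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the image is a toy list of pixel rows, and the function name and the square `grid` parameter are assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image stored as rows of pixel values


def split_into_patches(image: Image, image_id: str, grid: int) -> List[Tuple[Image, str]]:
    """Split one (image, ID) pair into grid*grid (patch, ID) pairs.

    Every patch keeps the ID of the full image, so no single training
    sample shows the whole picture.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    pairs = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * ph:(r + 1) * ph]
            patch = [row[c * pw:(c + 1) * pw] for row in rows]
            pairs.append((patch, image_id))
    return pairs


# A 4x4 toy image whose pixel values are 0..15, split into a 2x2 grid.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
pairs = split_into_patches(toy_image, "id-7", grid=2)
for patch, label in pairs:
    print(label, patch)
```

A larger `grid` value gives smaller patches, which corresponds to the finer granularities the abstract mentions.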

[NLP-124] Seed-Coder: Let the Code Model Curate Data for Itself

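Quick read: This paper aims to reduce the heavy reliance on human effort when constructing code data for large language model (LLM) pretraining. Conventional approaches typically rely on hand-crafted filtering rules or human-annotated data to produce high-quality code pretraining data, but they are limited in scalability, prone to subjective bias, and costly to maintain across many programming languages. The key to the solution is a model-centric data pipeline that relies mainly on LLMs to score and filter code data, which minimizes human involvement and improves the efficiency and quality of data construction.

Link: https://arxiv.org/abs/2506.03524
Authors: ByteDance Seed,Yuyu Zhang,Jing Su,Yifan Sun,Chenguang Xi,Xia Xiao,Shen Zheng,Anxiang Zhang,Kaibo Liu,Daoguang Zan,Tao Sun,Jinhua Zhu,Shulin Xin,Dong Huang,Yetao Bai,Lixin Dong,Chao Li,Jianchong Chen,Hanzhi Zhou,Yifan Huang,Guanghan Ning,Xierui Song,Jiaze Chen,Siyao Liu,Kai Shen,Liang Xiang,Yonghui Wu
Affiliations: ByteDance
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: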

Abstract:Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
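
The "score, then filter" step of a model-centric pipeline can be sketched in a few lines. This is a minimal sketch, not Seed-Coder's pipeline: the scoring function is passed in as a parameter, and the lookup table below is a hypothetical stand-in for LLM quality judgments so that the example runs without a model. The threshold value is also an assumption.

```python
from typing import Callable, Dict, List


def filter_by_model_score(
    samples: List[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep the code samples whose model-assigned quality score meets the threshold."""
    kept = []
    for code in samples:
        score = score_fn(code)
        if not 0.0 <= score <= 1.0:
            raise ValueError("score_fn must return a value in [0, 1]")
        if score >= threshold:
            kept.append(code)
    return kept


# Hypothetical scores standing in for LLM quality judgments.
FAKE_SCORES: Dict[str, float] = {
    "def add(a, b):\n    return a + b\n": 0.9,
    "x=1;y=2;print(x+y);x=1;y=2": 0.2,
}


def fake_llm_score(code: str) -> float:
    """Placeholder scorer; a real pipeline would query a model here."""
    return FAKE_SCORES.get(code, 0.0)


kept = filter_by_model_score(list(FAKE_SCORES), fake_llm_score, threshold=0.5)
print(kept)
```

Keeping the scorer behind a plain callable means the filtering logic stays the same for every programming language, which is the scalability argument the abstract makes against per-language hand-written rules.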

[NLP-125] CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG ACL2025

Quick read: This paper aims to resolve two problems in Multimodal Retrieval-Augmented Generation (MMRAG): Parametric-Retrieved Knowledge Inconsistency (PRKI) and Visual-Textual Knowledge Inconsistency (VTKI). The key to the solution is CoRe-MMRAG, an end-to-end cross-source knowledge reconciliation framework with a four-stage pipeline: it generates an internal response from parametric knowledge, selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates the two to produce a reliable answer. In addition, a specialized training paradigm strengthens knowledge source discrimination, multimodal integration, and unified answer generation.

Link: https://arxiv.org/abs/2506.02544
Authors: Yang Tian,Fan Liu,Jingyuan Zhang,Victoria W.,Yupeng Hu,Liqiang Nie
Affiliations: Shandong University; National University of Singapore; Independent Researcher; Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to ACL 2025 Main

Abstract:Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge Reconciliation for Multimodal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively.
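
The four-stage control flow in the abstract can be written down as a short skeleton. This is only a sketch of the ordering of the stages: in the paper each stage is carried out by a multimodal model, whereas here each stage is a hypothetical callable supplied by the caller, and the toy functions at the bottom exist only so the example runs.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (retrieved text, image tag), a toy evidence format


def reconcile(
    question: str,
    candidates: List[Candidate],
    answer_internal: Callable[[str], str],
    joint_similarity: Callable[[str, Candidate], float],
    answer_external: Callable[[str, Candidate], str],
    integrate: Callable[[str, str], str],
) -> str:
    """Run the four stages in order and return the integrated answer."""
    if not candidates:
        raise ValueError("at least one retrieved candidate is required")
    internal = answer_internal(question)                                     # stage 1
    evidence = max(candidates, key=lambda c: joint_similarity(question, c))  # stage 2
    external = answer_external(question, evidence)                           # stage 3
    return integrate(internal, external)                                     # stage 4


def toy_similarity(question: str, candidate: Candidate) -> float:
    """Toy joint score: shared words in the text plus a bonus if the image tag matches."""
    text, image_tag = candidate
    words = set(question.lower().split())
    return len(words & set(text.lower().split())) + (1 if image_tag in words else 0)


answer = reconcile(
    "which tower is shown",
    [("a bridge over a river", "bridge"), ("a tower in a city", "tower")],
    answer_internal=lambda q: "unsure",
    joint_similarity=toy_similarity,
    answer_external=lambda q, ev: ev[0],
    integrate=lambda internal, external: external if internal == "unsure" else internal,
)
print(answer)
```

Scoring text and image signals in one function, rather than ranking each modality separately, is the part of this sketch that mirrors the "joint similarity assessment" wording in the abstract.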

[NLP-126] JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning IJCNN

Quick read: This paper aims to address the challenge of transferring knowledge between different legal domains and to improve the accuracy of Legal Judgment Prediction (LJP), particularly given long, complex legal texts and the lack of large-scale annotated datasets. The key to the solution is a new model, JurisCTC, which uses contrastive learning to distinguish samples from different domains and achieves effective knowledge transfer between the civil and criminal law domains.

Link: https://arxiv.org/abs/2504.17264
Authors: Zhaolu Kang,Hongtian Cai,Xiangyang Ji,Jinzhe Li,Nanfei Gu
Affiliations: Peking University; Nanyang Technological University; Jilin University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted in International Joint Conference on Neural Networks (IJCNN) 2025

Abstract:In recent years, Unsupervised Domain Adaptation (UDA) has gained significant attention in the field of Natural Language Processing (NLP) owing to its ability to enhance model generalization across diverse domains. However, its application for knowledge transfer between distinct legal domains remains largely unexplored. To address the challenges posed by lengthy and complex legal texts and the limited availability of large-scale annotated datasets, we propose JurisCTC, a novel model designed to improve the accuracy of Legal Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC facilitates effective knowledge transfer across various legal domains and employs contrastive learning to distinguish samples from different domains. Specifically, for the LJP task, we enable knowledge transfer between civil and criminal law domains. Compared to other models and specific large language models (LLMs), JurisCTC demonstrates notable advancements, achieving peak accuracies of 76.59% and 78.83%, respectively.

[NLP-127] EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition

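Quick read: This paper addresses gender bias in speech emotion recognition (SER) systems, in particular the fact that the effectiveness and robustness of existing debiasing methods in multi-label scenarios have not been studied sufficiently. The key to the solution is EMO-Debias, a large-scale comparison of 13 debiasing methods spanning pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. The methods are evaluated on WavLM and XLSR representations under different gender-imbalance conditions, which quantifies the trade-off between fairness and accuracy and identifies the methods that consistently reduce gender performance gaps without sacrificing overall model performance.

Link: https://arxiv.org/abs/2506.04652
Authors: Yi-Cheng Lin,Huang-Cheng Chou,Yu-Hsuan Li Liang,Hung-yi Lee
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: 8 pages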

Abstract:Speech emotion recognition (SER) systems often exhibit gender bias. However, the effectiveness and robustness of existing debiasing methods in such multi-label scenarios remain underexplored. To address this gap, we present EMO-Debias, a large-scale comparison of 13 debiasing methods applied to multi-label SER. Our study encompasses techniques from pre-processing, regularization, adversarial learning, biased learners, and distributionally robust optimization. Experiments conducted on acted and naturalistic emotion datasets, using WavLM and XLSR representations, evaluate each method under conditions of gender imbalance. Our analysis quantifies the trade-offs between fairness and accuracy, identifying which approaches consistently reduce gender performance gaps without compromising overall model performance. The findings provide actionable insights for selecting effective debiasing strategies and highlight the impact of dataset distributions.
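
A "gender performance gap" for a multi-label classifier can be computed in several ways; the abstract does not name the metric, so the sketch below assumes macro-F1 per group and reports the difference between the best and worst group. The function names and the toy labels are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro-averaged F1 over labels for binary multi-label predictions."""
    n_labels = len(y_true[0])
    scores = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def group_gap(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
    groups: Sequence[str],
) -> Tuple[Dict[str, float], float]:
    """Return per-group macro-F1 and the gap between the best and worst group."""
    per_group: Dict[str, float] = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx])
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap


# Toy data: two emotion labels, two speakers per group.
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]
groups = ["f", "f", "m", "m"]
per_group, gap = group_gap(y_true, y_pred, groups)
print(per_group, gap)
```

A debiasing method would be judged on both numbers at once: a smaller gap is only an improvement if the per-group scores do not all drop.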

[NLP-128] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

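Quick read: This paper aims to improve the performance, efficiency, and alignment quality of speech language models (Speech LMs) in end-to-end speech-text modeling, focusing on the joint speech-text decoding paradigm. The key to the solution is a systematic comparison of joint decoding strategies, such as interleaved and parallel generation, and a new early-stop interleaved (ESI) pattern that speeds up decoding and improves performance while preserving good alignment.

Link: https://arxiv.org/abs/2506.04518
Authors: Haibin Wu,Yuxuan Hu,Ruchao Fan,Xiaofei Wang,Kenichi Kumatani,Bo Ren,Jianwei Yu,Heng Lu,Lijuan Wang,Yao Qian,Jinyu Li
Affiliations: Microsoft
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: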

Abstract:Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

[NLP-129] Can we reconstruct a dysarthric voice with the large speech model Parler TTS? INTERSPEECH2025

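Quick read: This paper addresses the communication difficulties faced by people with speech disorders, using personalized Text-to-Speech as a communication aid. The key to the solution is voice reconstruction with a large speech model: the state-of-the-art model Parler TTS is used to generate an approximation of a speaker's voice as it was before the onset of their condition. The study focuses on whether the model can generate intelligible speech while maintaining speaker identity.

Link: https://arxiv.org/abs/2506.04397
Authors: Ariadna Sanchez,Simon King
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at Interspeech 2025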

Abstract:Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker’s voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve controllability of this class of model, for the voice reconstruction task.

Computer Vision

[CV-0] Contrastive Flow Matching

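Quick read: This paper addresses the ambiguity of flow matching in conditional diffusion models: in conditional settings (such as class-conditioned models), flows from different conditions may overlap, which leads to ambiguous generations. The key to the solution is Contrastive Flow Matching, which adds a contrastive objective that explicitly enforces uniqueness across all conditional flows and thereby improves condition separation. The objective maximizes the dissimilarity between the predicted flows of arbitrary sample pairs.

Link: https://arxiv.org/abs/2506.05350
Authors: George Stoica,Vivek Ramanujan,Xiang Fan,Ali Farhadi,Ranjay Krishna,Judy Hoffman
Affiliations: Georgia Tech; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: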

Abstract:Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed: flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: this https URL.
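
One plausible way to read "a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs" is to subtract a weighted distance to another pair's flow from the usual flow matching loss. The sketch below follows that reading and is an assumption, not the paper's exact formulation: it uses a straight-line path (target velocity `x1 - x0`), plain Python lists instead of tensors, and a hypothetical weight `lam`.

```python
from typing import Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def contrastive_fm_loss(
    v_pred: Sequence[float],
    x0: Sequence[float],
    x1: Sequence[float],
    x0_neg: Sequence[float],
    x1_neg: Sequence[float],
    lam: float = 0.05,
) -> float:
    """Flow matching loss minus a weighted distance to a negative pair's flow.

    v_pred is the model's predicted velocity for the pair (x0, x1);
    (x0_neg, x1_neg) is another, arbitrary pair from the batch.
    """
    target = [b - a for a, b in zip(x0, x1)]             # flow of the positive pair
    negative = [b - a for a, b in zip(x0_neg, x1_neg)]   # flow of the negative pair
    return sq_dist(v_pred, target) - lam * sq_dist(v_pred, negative)


# A perfect prediction for the positive pair; the negative pair flows elsewhere.
loss = contrastive_fm_loss(
    v_pred=[1.0, 0.0],
    x0=[0.0, 0.0], x1=[1.0, 0.0],
    x0_neg=[0.0, 0.0], x1_neg=[0.0, 1.0],
)
print(loss)
```

Setting `lam=0` recovers the plain flow matching term, which makes it easy to check that the contrastive part is the only change.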

[CV-1] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

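Quick read: This paper addresses mathematical reasoning in real-world video settings, a task that is more challenging than reasoning over static images or text because it requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues dispersed over time. The key to the solution is the VideoMathQA benchmark, which evaluates whether models can perform temporally extended cross-modal reasoning on videos. The benchmark covers 10 mathematical domains, with videos ranging from 10 seconds to over 1 hour, and requires models to interpret structured visual content, understand instructional narratives, and ground concepts across visual, audio, and textual modalities. With high-quality expert annotation and three core reasoning challenges (direct problem solving, conceptual transfer, and deep instructional comprehension), the benchmark is intended to evaluate systematically how well models reason over complex multimodal mathematical problems.

Link: https://arxiv.org/abs/2506.05349
Authors: Hanoona Rasheed,Abdelrahman Shaker,Anqi Tang,Muhammad Maaz,Ming-Hsuan Yang,Salman Khan,Fahad Khan
Affiliations: MBZUAI; University of California Merced; Google Research; Australian National University; Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: VideoMathQA Technical Report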

Abstract:Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: this https URL

[CV-2] FreeTimeGS: Free Gaussians at Anytime and Anywhere for Dynamic Scene Reconstruction CVPR2025

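Quick read: This paper addresses dynamic 3D scene reconstruction with complex motions. Existing methods define 3D Gaussian primitives in a canonical space and use deformation fields to map them to observation spaces, which achieves real-time dynamic view synthesis, but they handle complex motions poorly, mainly because deformation fields are difficult to optimize. The key to the solution is FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations, which gives it greater flexibility and improves its ability to model dynamic 3D scenes. In addition, each Gaussian primitive is given a motion function that lets it move to neighboring regions over time, which reduces temporal redundancy.

Link: https://arxiv.org/abs/2506.05348
Authors: Yifan Wang,Peishan Yang,Zhen Xu,Jiaming Sun,Zhanhua Zhang,Yong Chen,Hujun Bao,Sida Peng,Xiaowei Zhou
Affiliations: Zhejiang University; Geely Automobile Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025; Project page: this https URL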

Abstract:This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing deformation fields. To overcome this problem, we propose FreeTimeGS, a novel 4D representation that allows Gaussian primitives to appear at arbitrary times and locations. In contrast to canonical Gaussian primitives, our representation possesses strong flexibility, thus improving the ability to model dynamic 3D scenes. In addition, we endow each Gaussian primitive with a motion function, allowing it to move to neighboring regions over time, which reduces the temporal redundancy. Experimental results on several datasets show that the rendering quality of our method outperforms recent methods by a large margin.
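
A primitive that "appears at an arbitrary time and location" and carries a "motion function" can be sketched as a small data class. The abstract does not give the functional forms, so both choices below are assumptions for illustration: the motion is linear in time, and the primitive's contribution fades with a Gaussian weight around its own timestamp. All field names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class TimedGaussian:
    center: Vec3      # position at the primitive's own timestamp
    time: float       # timestamp at which the primitive is centered
    velocity: Vec3    # assumed linear motion per unit time
    duration: float   # assumed temporal extent (standard deviation in time)

    def position_at(self, t: float) -> Vec3:
        """Move the center along a straight line away from its own timestamp."""
        dt = t - self.time
        x, y, z = self.center
        vx, vy, vz = self.velocity
        return (x + vx * dt, y + vy * dt, z + vz * dt)

    def temporal_weight(self, t: float) -> float:
        """Weight in (0, 1] that is largest at the primitive's own timestamp."""
        return math.exp(-0.5 * ((t - self.time) / self.duration) ** 2)


g = TimedGaussian(center=(0.0, 0.0, 0.0), time=1.0, velocity=(1.0, 0.0, 0.0), duration=0.5)
print(g.position_at(2.0), g.temporal_weight(2.0))
```

Because each primitive only matters near its own timestamp and can drift to nearby regions, one primitive can cover a short stretch of motion instead of being duplicated per frame, which is how the abstract's "reduces the temporal redundancy" claim can be read.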

[CV-3] Neural Inverse Rendering from Propagating Light

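Quick read: This paper addresses physically based neural inverse rendering from multi-viewpoint videos, in particular the accurate modeling and reconstruction of direct and indirect light transport for propagating light. The key to the solution is a time-resolved extension of neural radiance caching, a technique that accelerates inverse rendering by storing the infinite-bounce radiance arriving at any point from any direction, which enables accurate simulation of complex lighting effects and 3D reconstruction.

Link: https://arxiv.org/abs/2506.05347
Authors: Anagh Malik,Benjamin Attal,Andrew Xie,Matthew O’Toole,David B. Lindell
Affiliations: University of Toronto; Vector Institute; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL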

Abstract:We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. Our approach relies on a time-resolved extension of neural radiance caching – a technique that accelerates inverse rendering by storing infinite-bounce radiance arriving at any point from any direction. The resulting model accurately accounts for direct and indirect light transport effects and, when applied to captured measurements from a flash lidar system, enables state-of-the-art 3D reconstruction in the presence of strong indirect light. Further, we demonstrate view synthesis of propagating light, automatic decomposition of captured measurements into direct and indirect components, as well as novel capabilities such as multi-view time-resolved relighting of captured scenes.

[CV-4] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

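Quick read: This paper addresses the computational inefficiency of multimodal large language models (MLLMs) when processing visual inputs. The study finds that fewer than about 5% of attention heads contribute significantly to visual understanding; these are termed visual heads. The key to the solution is a training-free framework that quantifies head-level visual relevance through targeted response analysis, and, building on it, the SparseMM optimization strategy, which allocates asymmetric computation budgets to attention heads according to their visual scores and exploits the sparsity of visual heads to accelerate MLLM inference.

Link: https://arxiv.org/abs/2506.05344
Authors: Jiahui Wang,Zuyan Liu,Yongming Rao,Jiwen Lu
Affiliations: Tsinghua University; Tencent Hunyuan Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: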

Abstract:Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual inputs, SparseMM prioritizes emphasizing and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity in the efficiency test. Our project is open sourced at this https URL.
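
Allocating "asymmetric budgets to heads based on their visual scores" can be illustrated with a simple allocation rule. The rule below is an assumption, not SparseMM's published scheme: every head gets a fixed minimum number of KV-cache slots, and the rest of the total budget is shared in proportion to the visual scores, with rounding leftovers going to the largest fractional shares. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def allocate_budgets(scores: Sequence[float], total_budget: int, floor: int) -> List[int]:
    """Split a total KV-cache budget across heads in proportion to their scores.

    Each head receives at least `floor` slots; the returned budgets sum to
    `total_budget` exactly.
    """
    n = len(scores)
    if floor * n > total_budget:
        raise ValueError("total_budget is too small for the requested floor")
    remaining = total_budget - floor * n
    total_score = sum(scores)
    if total_score == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * s / total_score for s in scores]
    budgets = [floor + int(share) for share in shares]
    # Hand out slots lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Four heads with hypothetical visual scores; the last head is not a visual head.
budgets = allocate_budgets(scores=[6, 3, 1, 0], total_budget=100, floor=10)
print(budgets)
```

The floor keeps non-visual heads from losing their cache entirely, while the proportional part concentrates memory on the few heads that the scores mark as visually relevant.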

[CV-5] ContentV: Efficient Training of Video Generation Models with Limited Compute

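Quick read: This paper addresses the growing computational cost of video generation, with the goal of more efficient training. The key to the solution is ContentV, an 8B-parameter text-to-video model that achieves efficient training and high-quality video generation through three core innovations: (1) a minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) a systematic multi-stage training strategy that uses flow matching to improve efficiency; and (3) a cost-effective reinforcement learning with human feedback framework that improves generation quality without additional human annotation.

Link: https://arxiv.org/abs/2506.05343
Authors: Wenfeng Lin,Renjie Chen,Boyuan Liu,Shiyue Yan,Ruoyu Feng,Jiangchuan Wei,Yichen Zhang,Yimeng Zhou,Chao Feng,Jiao Ran,Qi Wu,Zuotao Liu,Mingyu Guo
Affiliations: ByteDance Douyin Content Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL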

Abstract:Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: this https URL.

[CV-6] Refer to Anything with Vision-Language Prompts

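Quick read: This paper addresses the lack of comprehensive semantic understanding in existing image segmentation models when handling complex queries based on both language and vision, a limitation that reduces their effectiveness in applications requiring user-friendly interaction. The key to the solution is a new task, omnimodal referring expression segmentation (ORES), and a framework called "Refer to Any Segmentation Mask Group" (RAS), which uses a mask-centric large multimodal model to strengthen the multimodal interaction and comprehension of segmentation models.

Link: https://arxiv.org/abs/2506.05342
Authors: Shengcao Cao,Zijun Wei,Jason Kuen,Kangning Liu,Lingzhi Zhang,Jiuxiang Gu,HyunJoon Jung,Liang-Yan Gui,Yu-Xiong Wang
Affiliations: University of Illinois Urbana-Champaign; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: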

Abstract:Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of omnimodal referring expression segmentation (ORES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to “Refer to Any Segmentation Mask Group” (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking ORES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks. Project page: this https URL.

[CV-7] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

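Quick read: This paper addresses the challenge of layout generation in 3D indoor scene synthesis: with limited datasets, existing methods struggle to generate scenes that both follow fine-grained user instructions and are open-vocabulary. The key to the solution is DirectLayout, a framework that uses the generalizable spatial reasoning of large language models (LLMs) to generate numerical 3D layouts directly from text descriptions. The framework decomposes generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. It combines Chain-of-Thought (CoT) Activation based on the 3D-Front dataset with a CoT-Grounded Generative Layout Reward to strengthen spatial reasoning and generalization. During inference, asset-layout mismatches are resolved by iteratively aligning assets and layouts through in-context learning.

Link: https://arxiv.org/abs/2506.05341
Authors: Xingjian Ran,Yixuan Li,Linning Xu,Mulin Yu,Bo Dai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL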

Abstract:Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird’s-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
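
The second and third stages named in the abstract, lifting a BEV layout into 3D and then checking object placements, can be pictured with plain geometry. This sketch is not DirectLayout's procedure, which relies on an LLM: the box fields, the "rest on the floor" lifting rule, and the axis-aligned overlap test are assumptions chosen to show what a numerical layout looks like.

```python
from dataclasses import dataclass


@dataclass
class BevBox:
    name: str
    x: float       # center x on the floor plan
    y: float       # center y on the floor plan
    width: float   # extent along x
    depth: float   # extent along y


@dataclass
class Box3D:
    name: str
    x: float
    y: float
    z: float       # center height above the floor
    width: float
    depth: float
    height: float


def lift_to_3d(box: BevBox, height: float, floor_z: float = 0.0) -> Box3D:
    """Lift a floor-plan box into 3D, assuming the object rests on the floor."""
    return Box3D(box.name, box.x, box.y, floor_z + height / 2, box.width, box.depth, height)


def overlaps_bev(a: BevBox, b: BevBox) -> bool:
    """Axis-aligned overlap test on the floor plan, a basic plausibility check."""
    return (abs(a.x - b.x) * 2 < a.width + b.width
            and abs(a.y - b.y) * 2 < a.depth + b.depth)


bed = BevBox("bed", 0.0, 0.0, 2.0, 2.0)
lamp = BevBox("lamp", 3.0, 0.0, 1.0, 1.0)
rug = BevBox("rug", 0.5, 0.5, 2.0, 2.0)
print(lift_to_3d(bed, height=0.5))
print(overlaps_bev(bed, lamp), overlaps_bev(bed, rug))
```

A refinement stage could use a check like `overlaps_bev` to flag placements that need adjusting before the final scene is assembled.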

[CV-8] Defurnishing with X-Ray Vision: Joint Removal of Furniture from Panoramas and Mesh

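[Quick Read]: This paper tackles defurnishing of indoor spaces, that is, removing furniture from textured meshes and multi-view panoramic images to produce high-quality defurnished replicas. First, furniture is segmented and removed from the mesh, planes are extended, and holes are filled, yielding a simplified defurnished mesh (SDM) that serves as an "X-ray" view of the scene's underlying structure. Canny edges are then extracted from depth and normal images rendered from the SDM and used to guide furniture removal in the panoramas via ControlNet inpainting, which keeps global geometric information available. Finally, the inpainted panoramas are used to texture the mesh, producing higher-quality assets than methods that rely on neural radiance fields or RGB-D inpainting.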

Link: https://arxiv.org/abs/2506.05338
Authors: Alan Dolhasz, Chen Ma, Dave Gausebeck, Kevin Chen, Gregor Miller, Lucas Hayne, Gunnar Hovden, Azwad Sabik, Olaf Brandt, Mira Slavcheva
Affiliations: Matterport
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a pipeline for generating defurnished replicas of indoor spaces represented as textured meshes and corresponding multi-view panoramic images. To achieve this, we first segment and remove furniture from the mesh representation, extend planes, and fill holes, obtaining a simplified defurnished mesh (SDM). This SDM acts as an "X-ray" of the scene’s underlying structure, guiding the defurnishing process. We extract Canny edges from depth and normal images rendered from the SDM. We then use these as a guide to remove the furniture from panorama images via ControlNet inpainting. This control signal ensures the availability of global geometric information that may be hidden from a particular panoramic view by the furniture being removed. The inpainted panoramas are used to texture the mesh. We show that our approach produces higher quality assets than methods that rely on neural radiance fields, which tend to produce blurry low-resolution images, or RGB-D inpainting, which is highly susceptible to hallucinations.

[CV-9] VideoMolmo: Spatio-Temporal Grounding Meets Pointing

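[Quick Read]: This paper targets precise spatio-temporal localization in video, with applications in biological research, autonomous navigation, and interactive interfaces. Existing video-based methods track well but lack the sophisticated reasoning of large language models (LLMs), which limits their contextual understanding and generalization. The proposed solution is VideoMolmo, a large multimodal model designed for fine-grained spatio-temporal pointing conditioned on textual descriptions. Its key idea is a temporal module that uses an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. It also uses SAM2 for bidirectional point propagation within a two-step decomposition: the LLM first generates precise pointing coordinates, and a sequential mask-fusion module then produces coherent segmentation, which improves accuracy and interpretability.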

Link: https://arxiv.org/abs/2506.05336
Authors: Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; University of Washington; Allen Institute for Artificial Intelligence; Linköping University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 13 figures

Abstract:Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at this https URL.

[CV-10] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

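[Quick Read]: This paper addresses the challenges of extending Chain-of-Thought (CoT) reasoning to multimodal domains, in particular the inadequate handling of visual information in mathematical problem solving. Existing methods rely on coarse-grained box-shaped image regions, are limited by how well vision encoders perceive math content, and depend on external capabilities for visual modification. The key to the solution is Mathematical INterleaved Tokens (MINT): an Interleave Token dynamically selects visual regions of any shape within math figures and adaptively interleaves them into textual reasoning steps, enabling more effective interleaved visual-textual reasoning.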

Link: https://arxiv.org/abs/2506.05331
Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Affiliations: CUHK MMLab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is released at this https URL

Abstract:Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at this https URL

[CV-11] AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

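[Quick Read]: This paper addresses the weakness of Multimodal Large Language Models (MLLMs) on counting tasks. Existing benchmarks suffer from short videos, closed-set queries, missing clue annotations, and weak multimodal coverage. To address this, the authors introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. The key solution is AV-Reasoner, a model trained with GRPO (Group Relative Policy Optimization) and curriculum learning to generalize counting ability from related tasks.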

Link: https://arxiv.org/abs/2506.05328
Authors: Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
Affiliations: Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 11 figures

Abstract:Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve model’s counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released on this https URL.

[CV-12] Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting

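[Quick Read]: This paper addresses a problem in feed-forward 3D Gaussian Splatting (3DGS) pipelines: depth discontinuities at object boundaries in depth maps lead to fragmented or sparse point clouds, which degrades rendering quality. The key to the solution is PM-Loss, a regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries, which improves the depth map and significantly improves 3DGS rendering.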

Link: https://arxiv.org/abs/2506.05327
Authors: Duochao Shi, Weijie Wang, Donny Y. Chen, Zeyu Zhang, Jia-Wang Bian, Bohan Zhuang, Chunhua Shen
Affiliations: Zhejiang University; Monash University; MBZUAI; GigaAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Depth maps are widely used in feed-forward 3D Gaussian Splatting (3DGS) pipelines by unprojecting them into 3D point clouds for novel view synthesis. This approach offers advantages such as efficient training, the use of known camera poses, and accurate geometry estimation. However, depth discontinuities at object boundaries often lead to fragmented or sparse point clouds, degrading rendering quality – a well-known limitation of depth-based representations. To tackle this issue, we introduce PM-Loss, a novel regularization loss based on a pointmap predicted by a pre-trained transformer. Although the pointmap itself may be less accurate than the depth map, it effectively enforces geometric smoothness, especially around object boundaries. With the improved depth map, our method significantly improves the feed-forward 3DGS across various architectures and scenes, delivering consistently better rendering results. Our project page: this https URL
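
The unprojection step that the abstract refers to can be sketched as follows. This is a minimal illustration and not the paper's code: it assumes a pinhole camera, and the function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical. It shows why a depth jump between neighbouring pixels leaves a gap in the resulting point cloud.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy); pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map: the second column jumps from depth 1 to depth 4,
# mimicking a foreground/background boundary.
depth = [[1.0, 4.0],
         [1.0, 4.0]]
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In this toy case, pixels that are adjacent in the image land far apart in 3D, which is the kind of fragmentation at object boundaries that a smoothness regularizer such as the one described above is meant to reduce.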

[CV-13] Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

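[Quick Read]: This paper investigates why 3D scene-centric vision-language models (3D VLMs) underperform the latest 3D object-centric and 2D image-based approaches. It identifies three bottlenecks: limited reliance on the 3D scene encoder, a pre-training stage that is less effective than in 2D VLMs, and data-scaling benefits that are less pronounced. The key to the solution is a new 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and encourage genuine 3D scene understanding, motivating better cross-modal alignment and better strategies for using the 3D encoder.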

Link: https://arxiv.org/abs/2506.05318
Authors: Haoyuan Li, Yanpeng Zhou, Yufei Gao, Tao Tang, Jianhua Han, Yujie Yuan, Dave Zhenyu Chen, Jiawang Bian, Hang Xu, Xiaodan Liang
Affiliations: Shenzhen campus of Sun Yat-sen University; Huawei Noah’s Ark Lab; MBZUAI; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remarkable progress in 2D Vision-Language Models (VLMs) has spurred interest in extending them to 3D settings for tasks like 3D Question Answering, Dense Captioning, and Visual Grounding. Unlike 2D VLMs that typically process images through an image encoder, 3D scenes, with their intricate spatial structures, allow for diverse model architectures. Based on their encoder design, this paper categorizes recent 3D VLMs into 3D object-centric, 2D image-based, and 3D scene-centric approaches. Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited comparatively lower performance compared with the latest 3D object-centric and 2D image-based approaches. To understand this gap, we conduct an in-depth analysis, revealing that 3D scene-centric VLMs show limited reliance on the 3D scene encoder, and the pre-train stage appears less effective than in 2D VLMs. Furthermore, we observe that data scaling benefits are less pronounced on larger datasets. Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions, thereby diminishing the effective utilization of the 3D encoder. To address these limitations and encourage genuine 3D scene understanding, we introduce a novel 3D Relevance Discrimination QA dataset designed to disrupt shortcut learning and improve 3D understanding. Our findings highlight the need for advanced evaluation and improved strategies for better 3D understanding in 3D VLMs.

[CV-14] ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

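[Quick Read]: This paper addresses the inverse problem of estimating physical properties from visual data, which matters for creating physically accurate digital twins in applications such as robotics and extended reality (XR). Existing methods typically require dense multi-view videos as input, and with sparse multi-view videos their sequential optimization strategy accumulates errors. The proposed solution, ProJo4D, is a progressive joint optimization framework that gradually enlarges the set of jointly optimized parameters, guided by their sensitivity, until geometry, appearance, physical state, and material properties are optimized fully jointly.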

Link: https://arxiv.org/abs/2506.05317
Authors: Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta
Affiliations: University of North Carolina at Chapel Hill; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural rendering has made significant strides in 3D reconstruction and novel view synthesis. With the integration with physics, it opens up new applications. The inverse problem of estimating physics from visual data, however, still remains challenging, limiting its effectiveness for applications like physically accurate digital twin creation in robotics and XR. Existing methods that incorporate physics into neural rendering frameworks typically require dense multi-view videos as input, making them impractical for scalable, real-world use. When presented with sparse multi-view videos, the sequential optimization strategy used by existing approaches introduces significant error accumulation, e.g., poor initial 3D reconstruction leads to bad material parameter estimation in subsequent stages. Instead of sequential optimization, directly optimizing all parameters at the same time also fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually increases the set of jointly optimized parameters guided by their sensitivity, leading to fully joint optimization over geometry, appearance, physical state, and material property. Evaluations on PAC-NeRF and Spring-Gaus datasets show that ProJo4D outperforms prior work in 4D future state prediction, novel view rendering of future state, and material parameter estimation, demonstrating its effectiveness in physically grounded 4D scene understanding. For demos, please visit the project webpage: this https URL
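
The idea of gradually enlarging the jointly optimized parameter set can be sketched on a toy problem. Everything below is an assumption made for illustration: the coupled quadratic loss, the three scalar "parameter groups", the fixed schedule order, and the function names are hypothetical stand-ins, not ProJo4D's objective or its sensitivity-based schedule.

```python
def loss_and_grad(p):
    """Toy coupled quadratic standing in for a rendering + physics loss."""
    a, b, c = p["geometry"], p["physics"], p["material"]
    r1 = a - 1.0        # geometry should reach 1
    r2 = b - 2.0 * a    # physics depends on geometry
    r3 = c - b          # material depends on physics
    loss = r1 * r1 + r2 * r2 + r3 * r3
    grad = {
        "geometry": 2 * r1 - 4 * r2,
        "physics": 2 * r2 - 2 * r3,
        "material": 2 * r3,
    }
    return loss, grad


def progressive_joint_optimize(params, schedule, steps_per_stage=2000, lr=0.1):
    """Add one parameter group per stage; update only the active groups."""
    active = []
    for name in schedule:
        active.append(name)                 # grow the jointly optimized set
        for _ in range(steps_per_stage):
            _, grad = loss_and_grad(params)
            for key in active:              # frozen groups are left untouched
                params[key] -= lr * grad[key]
    return params


start = {"geometry": 0.0, "physics": 0.0, "material": 0.0}
# Hypothetical ordering; the paper derives its ordering from parameter sensitivity.
result = progressive_joint_optimize(start, ["geometry", "physics", "material"])
```

The final stage updates all groups together, so earlier estimates made while other groups were frozen can still be corrected, in contrast to a strictly sequential pipeline.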

[CV-15] MARBLE: Material Recomposition and Blending in CLIP-Space

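[Quick Read]: This paper addresses exemplar-based editing of object materials in images, in particular blending materials and recomposing fine-grained material properties. The key to the solution is using material embeddings in CLIP-space to control a pre-trained text-to-image model, and locating the block in the denoising UNet that is responsible for material attribution to enable precise control. In addition, a shallow network predicts the direction of the desired material attribute change, giving parametric control over fine-grained attributes such as roughness, metallic, transparency, and glow.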

Link: https://arxiv.org/abs/2506.05313
Authors: Ta-Ying Cheng, Prafull Sharma, Mark Boss, Varun Jampani
Affiliations: University of Oxford; MIT CSAIL; Stability AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Editing materials of objects in images based on exemplar images is an active area of research in computer vision and graphics. We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using that to control pre-trained text-to-image models. We improve exemplar-based material editing by finding a block in the denoising UNet responsible for material attribution. Given two material exemplar-images, we find directions in the CLIP-space for blending the materials. Further, we can achieve parametric control over fine-grained material attributes such as roughness, metallic, transparency, and glow using a shallow network to predict the direction for the desired material attribute change. We perform qualitative and quantitative analysis to demonstrate the efficacy of our proposed method. We also present the ability of our method to perform multiple edits in a single forward pass and applicability to painting. Project Page: this https URL
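
One simple way to picture blending two materials in an embedding space is linear interpolation followed by renormalization. The sketch below is an assumption for illustration only: the abstract says directions in CLIP-space are found for blending but does not give the formula, and the vectors here are toy values rather than real CLIP embeddings. The name `blend_embeddings` is hypothetical.

```python
import math


def blend_embeddings(e_a, e_b, alpha):
    """Interpolate two embeddings and rescale the result to unit length.

    alpha = 0 returns the direction of e_a, alpha = 1 that of e_b.
    Assumes the interpolated vector is non-zero.
    """
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


# Toy 3-D stand-ins for two material exemplar embeddings.
material_a = [1.0, 0.0, 0.0]
material_b = [0.0, 1.0, 0.0]
blended = blend_embeddings(material_a, material_b, alpha=0.5)
```

Sweeping `alpha` from 0 to 1 gives a one-parameter family of intermediate embeddings, which is the kind of control signal a conditioned generator could consume.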

[CV-16] Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

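[Quick Read]: This paper addresses the long-standing computer vision challenge of finding correspondences between semantically similar points across images and object instances, especially the ambiguities that arise for symmetric objects or repeated object parts. The key to the solution is improving semantic correspondence estimation through 3D-aware pseudo-labeling: an adapter is trained to refine features using pseudo-labels obtained via 3D-aware chaining, wrong labels are filtered through relaxed cyclic consistency, and 3D spherical prototype mapping constraints are applied.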

Link: https://arxiv.org/abs/2506.05312
Authors: Olaf Dünkel, Thomas Wimmer, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
Affiliations: Max Planck Institute for Informatics; ETH Zurich; University of Oxford; University of Freiburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to refine off-the-shelf features using pseudo-labels obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
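
A relaxed cyclic consistency filter can be sketched as follows: match a point from image A to image B, match back to A, and keep the pair only if the round trip lands within a small radius of the starting point. This is a generic sketch under assumptions, not the paper's pipeline; the feature vectors, keypoint positions, `radius` threshold, and function names are all hypothetical.

```python
def nearest(query, candidates):
    """Index of the candidate feature closest to `query` (squared L2)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((q - c) ** 2 for q, c in zip(query, candidates[j])),
    )


def cycle_consistent_matches(feats_a, feats_b, pos_a, radius):
    """Keep A->B matches whose B->A return lands within `radius` pixels."""
    matches = []
    for i, feat in enumerate(feats_a):
        j = nearest(feat, feats_b)               # forward match A -> B
        i_back = nearest(feats_b[j], feats_a)    # backward match B -> A
        dx = pos_a[i][0] - pos_a[i_back][0]
        dy = pos_a[i][1] - pos_a[i_back][1]
        if dx * dx + dy * dy <= radius * radius:  # relaxed: small drift allowed
            matches.append((i, j))
    return matches


# Toy 1-D features and 2-D keypoint positions in image A.
feats_a = [[0.0], [1.0], [5.0]]
feats_b = [[0.1], [1.2], [1.1]]
pos_a = [(0, 0), (1, 0), (10, 0)]
matches = cycle_consistent_matches(feats_a, feats_b, pos_a, radius=2)
```

Here the third point of A has no good counterpart in B; its round trip returns to a different, distant keypoint, so the candidate label is discarded.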

[CV-17] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

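[Quick Read]: This paper addresses comprehensive region-level visual understanding in images and videos, that is, segmenting objects while generating diverse, region-specific semantic outputs. The key to the solution is the Semantic Perceiver module, which efficiently transforms the rich visual features of SAM 2 into multi-modal tokens that a large language model (LLM) can understand, enabling comprehensive interpretation of region semantics.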

Link: https://arxiv.org/abs/2506.05302
Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li
Affiliations: CUHK; HKU; PolyU; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 13 figures, Website: this https URL

Abstract:We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definition, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2’s rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

[CV-18] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

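[Quick Read]: This paper addresses the prohibitive inference cost of diffusion-based video restoration (VR) and the challenge of extending existing one-step image restoration methods to VR, particularly for high-resolution video. The key to the solution is SeedVR2, a one-step diffusion-based VR model that performs adversarial VR training against real data. It introduces an adaptive window attention mechanism that dynamically adjusts the window size, avoiding the window inconsistency caused by a predefined window size in high-resolution VR. It also verifies the effectiveness of a series of losses, including a feature matching loss that does not significantly sacrifice training efficiency, to stabilize and improve adversarial post-training.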

Link: https://arxiv.org/abs/2506.05301
Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Affiliations: Nanyang Technological University; ByteDance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Draft Ver. Project page: this https URL

Abstract:Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

[CV-19] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

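[Quick Read]: This paper addresses a problem in autoregressive image generation: existing image tokenizers introduce bidirectional dependencies during compression, which hinders effective modeling by autoregressive models. The key to the solution is the Aligned Tokenizer (AliTok), which uses a causal decoder to establish unidirectional dependencies among encoded tokens, aligning the token modeling approach of the tokenizer with that of the autoregressive model. In addition, by incorporating prefix tokens and using two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves strong reconstruction performance while remaining generation-friendly.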

Link: https://arxiv.org/abs/2506.05289
Authors: Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Tongyi Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders the effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at this https URL.
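
Unidirectional dependencies of the kind described above are commonly enforced with a causal attention mask, in which token i may only attend to tokens at positions up to i. The sketch below shows that generic mechanism; it is not AliTok's decoder, and the function names are hypothetical.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, zeroing disallowed positions."""
    exps = [math.exp(s) if allowed else 0.0 for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Token 1 sees only tokens 0 and 1, so equal scores give weights 0.5 / 0.5.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1])
```

Because every row of the mask includes its own position, the softmax denominator is never zero.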

[CV-20] EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?

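[Quick Read]: This paper addresses the fact that existing embodied benchmarks insufficiently assess dynamic changes arising from user interactions in dynamic, cluttered environments, with the aim of advancing object-centric embodied cognition. The key to the solution is EOC-Bench, a benchmark with 3,277 meticulously annotated QA pairs spanning three temporal categories (Past, Present, and Future). Together with a multi-scale temporal accuracy metric and a mixed-format human-in-the-loop annotation framework, it systematically evaluates object-centric embodied cognition in dynamic egocentric scenarios.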

Link: https://arxiv.org/abs/2506.05287
Authors: Yuqian Yuan, Ronghao Dang, Long Li, Wentong Li, Dian Jiao, Xin Li, Deli Zhao, Fan Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
Affiliations: Zhejiang University; DAMO Academy, Alibaba Group; Hupan Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages

Abstract:The emergence of multimodal large language models (MLLMs) has driven breakthroughs in egocentric vision applications. These applications necessitate persistent, context-aware understanding of objects, as users interact with tools in dynamic and cluttered environments. However, existing embodied benchmarks primarily focus on static scene exploration, emphasizing objects’ appearance and spatial attributes while neglecting the assessment of dynamic changes arising from users’ interactions. To address this gap, we introduce EOC-Bench, an innovative benchmark designed to systematically evaluate object-centric embodied cognition in dynamic egocentric scenarios. Specifically, EOC-Bench features 3,277 meticulously annotated QA pairs categorized into three temporal categories: Past, Present, and Future, covering 11 fine-grained evaluation dimensions and 3 visual object referencing types. To ensure thorough assessment, we develop a mixed-format human-in-the-loop annotation framework with four types of questions and design a novel multi-scale temporal accuracy metric for open-ended temporal evaluation. Based on EOC-Bench, we conduct comprehensive evaluations of various proprietary, open-source, and object-level MLLMs. EOC-Bench serves as a crucial tool for advancing the embodied object cognitive capabilities of MLLMs, establishing a robust foundation for developing reliable core models for embodied systems.

[CV-21] Stable Vision Concept Transformers for Medical Diagnosis

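[Quick Read]: This paper addresses the problem that explainable AI (XAI) models in the medical domain struggle to remain stable and faithful while maintaining high accuracy. Existing methods based on Concept Bottleneck Models (CBMs) rely solely on concept features for prediction and overlook the intrinsic feature embeddings within medical images, which degrades performance and yields unstable explanations. To address this, the authors propose the Vision Concept Transformer (VCT), whose key idea is fusing concept features with image features to enhance decision-making. The Stable Vision Concept Transformer (SVCT) further ensures faithfulness by integrating Denoised Diffusion Smoothing, so that it provides stable and reliable explanations under input perturbations.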

Link: https://arxiv.org/abs/2506.05286
Authors: Lijie Hu, Songning Lai, Yuan Hua, Shu Yang, Jingfeng Zhang, Di Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2304.06129 by other authors

Abstract:Transparency is a paramount concern in the medical field, prompting researchers to delve into the realm of explainable AI (XAI). Among these XAI methods, Concept Bottleneck Models (CBMs) aim to restrict the model’s latent space to human-understandable high-level concepts by generating a conceptual layer for extracting conceptual features, which has drawn much attention recently. However, existing methods rely solely on concept features to determine the model’s predictions, which overlook the intrinsic feature embeddings within medical images. To address this utility gap between the original models and concept-based models, we propose Vision Concept Transformer (VCT). Furthermore, despite their benefits, CBMs have been found to negatively impact model performance and fail to provide stable explanations when faced with input perturbations, which limits their application in the medical field. To address this faithfulness issue, this paper further proposes the Stable Vision Concept Transformer (SVCT) based on VCT, which leverages the vision transformer (ViT) as its backbone and incorporates a conceptual layer. SVCT employs conceptual features to enhance decision-making capabilities by fusing them with image features and ensures model faithfulness through the integration of Denoised Diffusion Smoothing. Comprehensive experiments on four medical datasets demonstrate that our VCT and SVCT maintain accuracy while remaining interpretable compared to baselines. Furthermore, even when subjected to perturbations, our SVCT model consistently provides faithful explanations, thus meeting the needs of the medical field.

[CV-22] RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion

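[Quick Read]: This paper addresses limitations of existing 3D shape completion methods: a lack of 3D consistency, high computational cost, and difficulty capturing sharp object boundaries. The key to the solution is recasting 3D shape completion as a novel view synthesis problem. Given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), a feedforward transformer is trained to predict depth maps, object masks, and per-pixel confidence scores for those query rays, and the predictions are fused across multiple views to reconstruct complete 3D shapes.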

Link: https://arxiv.org/abs/2506.05285
Authors: Bardienus P. Duisterhof, Jan Oberst, Bowen Wen, Stan Birchfield, Deva Ramanan, Jeffrey Ichnowski
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance. Project page: this https URL
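
The 3D chamfer distance used as the headline metric above can be sketched as below. Definitions vary between papers (squared or unsquared distances, sums or means), so this assumes one common variant, the sum of the two directed mean nearest-neighbour Euclidean distances; it is not necessarily the exact formula used in the RaySt3R evaluation.

```python
import math


def chamfer_distance(pts_a, pts_b):
    """Symmetric chamfer distance between two non-empty 3D point sets."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)


# Toy example: the prediction contains one spurious point at (1, 1, 0).
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
prediction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
cd = chamfer_distance(ground_truth, prediction)
```

This brute-force version is quadratic in the number of points; practical evaluations typically use a spatial index for the nearest-neighbour search.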

[CV-23] Video World Models with Long-term Spatial Memory

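[Quick Read]: This paper addresses the difficulty video world models have in maintaining scene consistency during revisits: with a limited temporal context window, the models tend to forget previously generated environments. The key to the solution is a geometry-grounded long-term spatial memory framework that stores and retrieves information from long-term spatial memory to improve the long-term consistency of video world models.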

Link: https://arxiv.org/abs/2506.05284
Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Affiliations: Stanford University; Shanghai Jiao Tong University; The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory; S-Lab, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

[CV-24] Rectified Point Flow: Generic Point Cloud Pose Estimation

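[Quick Read]: This paper addresses pairwise point cloud registration and multi-part shape assembly, two tasks that are usually handled separately. The key to the solution is Rectified Point Flow, a unified parameterization that formulates both tasks as a single conditional generative problem. The method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered, and it intrinsically learns assembly symmetries without symmetry labels.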

Link: https://arxiv.org/abs/2506.05282
Authors: Tao Sun, Liyuan Zhu, Shengyu Huang, Shuran Song, Iro Armeni
Affiliations: Stanford University; NVIDIA Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: this https URL.
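
The transport of noisy points along a velocity field can be sketched with the standard straight-line (rectified flow) construction: interpolate x_t = (1 - t) x_0 + t x_1, regress the constant velocity x_1 - x_0, and integrate with Euler steps at inference. The sketch below uses an oracle velocity in place of a trained network and omits the pose-recovery step; all names are hypothetical and this is not the authors' implementation.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a straight-line flow: constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0, 0.0]       # a single noisy 3D point
target = [1.0, 2.0, -1.0]     # its position in the assembled shape
oracle = lambda x, t: target_velocity(noise, target)  # stand-in for a trained network
final = euler_transport(noise, oracle)
midpoint = interpolate(noise, target, 0.5)
```

With a perfectly straight flow, even a single Euler step would reach the target; a learned field is only approximately straight, which is why several steps are typically used.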

[CV-25] Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting

Quick read: This paper addresses the drop in reconstruction quality that neural rendering suffers in real-world scenes, where perfect photometric consistency cannot be guaranteed. The key to its solution is a multi-scale bilateral grid that unifies appearance codes and bilateral grids. It improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction and outperforms both appearance codes and bilateral grids used on their own.

Link: https://arxiv.org/abs/2506.05280
Authors: Nan Wang, Yuantao Chen, Lixing Xiao, Weiqing Xiao, Bohan Li, Zhaoxi Chen, Chongjie Ye, Shaocong Xu, Saining Zhang, Ziyang Yan, Pierre Merriaux, Lei Lei, Tianfan Xue, Hao Zhao
Affiliations: Tsinghua University; Beihang University; Nanyang Technological University; Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo; Tongji University; The Chinese University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; University of Trento; Zhejiang University; Lightwheel AI; LeddarTech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL Code: this https URL


Abstract:Neural rendering techniques, including NeRF and Gaussian Splatting (GS), rely on photometric consistency to produce high-quality reconstructions. However, in real-world scenarios, it is challenging to guarantee perfect photometric consistency in acquired images. Appearance codes have been widely used to address this issue, but their modeling capability is limited, as a single code is applied to the entire image. Recently, the bilateral grid was introduced to perform pixel-wise color mapping, but it is difficult to optimize and constrain effectively. In this paper, we propose a novel multi-scale bilateral grid that unifies appearance codes and bilateral grids. We demonstrate that this approach significantly improves geometric accuracy in dynamic, decoupled autonomous driving scene reconstruction, outperforming both appearance codes and bilateral grids. This is crucial for autonomous driving, where accurate geometry is important for obstacle avoidance and control. Our method shows strong results across four datasets: Waymo, NuScenes, Argoverse, and PandaSet. We further demonstrate that the improvement in geometry is driven by the multi-scale bilateral grid, which effectively reduces floaters caused by photometric inconsistency.

[CV-26] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

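Quick read: This paper addresses the limited ability of Composed Video Retrieval (CoVR) to capture temporally fine-grained changes. Existing benchmarks focus on appearance shifts or coarse event changes, so they cannot test recognition of subtle, fast-paced temporal differences. The key to its solution is TF-CoVR, a large-scale benchmark focused on gymnastics and diving with 180K triplets drawn from FineGym and FineDiving. Query-modification pairs are built by prompting a large language model (LLM) with the label differences between video clips, so each pair is associated with multiple valid target videos. To model these temporal dynamics, the authors propose TF-CoVR-Base, a two-stage training framework: first pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings, then align composed queries with candidate videos through contrastive learning.

Link: https://arxiv.org/abs/2506.05274
Authors: Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
Affiliations: University of Central Florida; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: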


Abstract:Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each (query, modification) pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
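
Because each TF-CoVR query has several valid targets, results are reported as mAP@50. The sketch below shows one common way to compute mAP@K for queries with multiple relevant items. The normalization by `min(k, number of relevant items)` is an assumption, since the abstract does not state the exact definition, and the video IDs are made up for illustration.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """AP@K for one query: mean of precision values at each relevant hit in the top K."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: best achievable number of hits within the top K.
    return precision_sum / min(k, len(relevant))

def mean_average_precision_at_k(rankings, ground_truth, k=50):
    """Average AP@K over all queries."""
    scores = [average_precision_at_k(r, g, k) for r, g in zip(rankings, ground_truth)]
    return sum(scores) / len(scores)

# Two hypothetical queries: the first has two valid targets, the second has one.
rankings = [["v3", "v1", "v7", "v2"], ["v5", "v9", "v4"]]
ground_truth = [["v1", "v2"], ["v5"]]
score = mean_average_precision_at_k(rankings, ground_truth, k=50)
```

Here the first query scores 0.5 (hits at ranks 2 and 4) and the second scores 1.0, giving a mean of 0.75.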

[CV-27] Can Foundation Models Generalise the Presentation Attack Detection Capabilities on ID Cards?

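Quick read: This paper addresses the poor generalisation of presentation attack detection (PAD) models for ID documents, particularly across ID cards issued by different countries. Because of privacy protection constraints, most PAD systems are trained on only a few ID documents and perform poorly on ID cards from unseen countries. To address this, the study uses Foundation Models (FM) trained on large-scale datasets to improve generalisation. It finds that bona fide images are the key factor for good generalisation.

Link: https://arxiv.org/abs/2506.05263
Authors: Juan E. Tapia, Christoph Busch
Affiliations: Hochschule Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: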


Abstract:Nowadays, one of the main challenges in presentation attack detection (PAD) on ID cards is obtaining generalisation capabilities across the diversity of countries that issue ID cards. Most PAD systems are trained on one, two, or three ID documents because of privacy protection concerns. As a result, they do not obtain competitive results for commercial purposes when tested on ID cards from a previously unseen country. In this scenario, Foundation Models (FM) trained on huge datasets can help to improve generalisation capabilities. This work intends to improve and benchmark the capabilities of FM and how to use them to adapt the generalisation on PAD of ID Documents. Different test protocols were used, covering zero-shot and fine-tuning settings and two ID card datasets: one private dataset based on Chilean IDs and one open-set dataset based on ID cards from three countries (Finland, Spain, and Slovakia). Our findings indicate that bona fide images are the key to generalisation.

[CV-28] LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs

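Quick read: This paper addresses the unintended increase in the probability of non-target responses caused by likelihood displacement when Video-LLMs are trained with preference alignment techniques such as DPO. The key to its solution is LeanPO, a reference-free preference optimization method that redefines the implicit reward as the average likelihood of a response under the policy model. It also builds a self-generated preference data pipeline correlated with reward trustworthiness, which yields high-quality paired data and an accurate estimate of the newly defined reward, mitigating the unintended drop.

Link: https://arxiv.org/abs/2506.05260
Authors: Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng
Affiliations: Peking University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL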


Abstract:Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al., 2024), to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
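
To make the "average likelihood as reward" idea concrete, here is a small sketch under stated assumptions. The abstract confirms only that the reward is the average likelihood of a response under the policy and that label smoothing is used. The logistic form of the loss, the `beta` scale, and the fixed `smoothing` value (the paper describes a dynamic strategy) are illustrative choices, not the authors' formulation.

```python
import math

def average_log_likelihood(token_logprobs):
    """Reference-free reward: mean per-token log-probability under the policy."""
    return sum(token_logprobs) / len(token_logprobs)

def neg_log_sigmoid(z):
    """-log(sigmoid(z)), written as log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))

def smoothed_preference_loss(win_logprobs, lose_logprobs, beta=1.0, smoothing=0.1):
    """Assumed logistic preference loss on the reward margin, with label smoothing."""
    margin = beta * (
        average_log_likelihood(win_logprobs) - average_log_likelihood(lose_logprobs)
    )
    # With probability `smoothing` the preference label is treated as flipped.
    return (1.0 - smoothing) * neg_log_sigmoid(margin) + smoothing * neg_log_sigmoid(-margin)

# Hypothetical per-token log-probabilities for a preferred and a rejected response.
winning = [-0.2, -0.4, -0.3]
losing = [-1.5, -2.0, -1.0, -1.5]
loss_correct_order = smoothed_preference_loss(winning, losing)
loss_wrong_order = smoothed_preference_loss(losing, winning)
```

Averaging over tokens keeps long and short responses comparable, which a summed log-likelihood would not.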

[CV-29] Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

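Quick read: This paper addresses robust cross-view 3-DoF localization in GPS-denied, unstructured terrain, where repetitive vegetation causes visual ambiguity and seasonal change significantly alters scene appearance, hindering alignment with outdated satellite imagery. The key to its solution is MoViX, a framework that uses self-supervised learning to obtain viewpoint- and season-invariant representations while preserving the directional awareness needed for accurate localization. Its core components are a pose-dependent positive sampling strategy, temporally aligned hard negative mining, a motion-informed frame sampler, and a lightweight temporal aggregator, which together improve generalization and localization accuracy in complex environments.

Link: https://arxiv.org/abs/2506.05250
Authors: Zhiyun Deng, Dongmyeong Lee, Amanda Adkins, Jesse Quattrociocchi, Christian Ellis, Joydeep Biswas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: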


Abstract:Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.
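
The role of temperature in the Monte Carlo Localization step can be sketched as follows. This assumes particle weights come from a softmax over cross-view matching scores. The scores and temperatures are made up, and the rule MoViX uses to choose the temperature from the entropy is not given in the abstract, so it is not modeled here.

```python
import math

def particle_weights(match_scores, temperature):
    """Softmax over matching scores; higher temperature gives a flatter distribution."""
    scaled = [s / temperature for s in match_scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy (in nats) of the particle weight distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Hypothetical matching scores for four pose hypotheses.
scores = [0.9, 0.7, 0.2, 0.1]
sharp = particle_weights(scores, temperature=0.05)
soft = particle_weights(scores, temperature=1.0)
```

A low temperature commits quickly to the best-matching hypothesis, while a high temperature keeps several hypotheses alive, which is the trade-off an entropy-guided schedule would balance.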

[CV-30] Aligning Latent Spaces with Flow Priors

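Quick read: This paper addresses how to align a learnable latent space to an arbitrary target distribution; conventional approaches typically rely on computationally expensive likelihood evaluation or ordinary differential equation (ODE) solving. The key to its solution is to use flow-based generative models as priors: a flow model is pretrained to capture the distribution of the target features, and an alignment loss then regularizes the latent space by reformulating the flow matching objective so that the latents become the optimization targets. The authors prove that minimizing this alignment loss gives a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of the latents under the target distribution.

Link: https://arxiv.org/abs/2506.05240
Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo
Affiliations: The University of Hong Kong; ARC Lab; Tencent PCG
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: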


Abstract:This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.
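
A toy version of the alignment loss may help. The sketch assumes a linear interpolation path between noise and latent, and it uses a hand-written `frozen_velocity` for a point-mass target in place of a pretrained flow model. Both choices are illustrative assumptions and are not details from the paper.

```python
import random

def frozen_velocity(point, t, target_mean):
    """Toy frozen flow for a point-mass target: moves `point` straight to target_mean."""
    return [(m - p) / (1.0 - t) for p, m in zip(point, target_mean)]

def alignment_loss(latent, target_mean, num_samples=64, seed=0):
    """Flow-matching loss with the latent in the role of the data sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.0, 0.9)  # stay away from the t=1 singularity of the toy field
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
        x_t = [(1.0 - t) * e + t * z for e, z in zip(noise, latent)]
        predicted = frozen_velocity(x_t, t, target_mean)
        target = [z - e for z, e in zip(latent, noise)]
        total += sum((p - q) ** 2 for p, q in zip(predicted, target))
    return total / num_samples

target_mean = [0.5, -0.5]
loss_on_target = alignment_loss([0.5, -0.5], target_mean)
loss_off_target = alignment_loss([2.0, 0.0], target_mean)
```

The loss vanishes when the latent sits on the target and grows as it moves away, with no likelihood evaluation or ODE solve needed. In practice the gradient of this loss with respect to the latent (or the encoder producing it) would drive the alignment.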

[CV-31] SAM-aware Test-time Adaptation for Universal Medical Image Segmentation

Quick read: This paper addresses the limited adaptability of the Segment Anything Model (SAM) for universal medical image segmentation. Existing adaptations such as MedSAM improve SAM's performance on medical images but sacrifice generalization to unseen data. The key to its solution is SAM-aware Test-Time Adaptation (SAM-TTA), a test-time framework that preserves SAM's generalization while improving its segmentation performance on medical images. Its core components are the Self-adaptive Bezier Curve-based Transformation (SBCT) and Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which mitigate input-level and semantic-level discrepancies respectively, enabling efficient adaptation without auxiliary supervision or expensive fine-tuning.

Link: https://arxiv.org/abs/2506.05221
Authors: Jianghao Wu, Yicheng Wu, Yutong Xie, Wenjia Bai, You Zhang, Feilong Tang, Yulong Li, Yasmeen George, Imran Razzak
Affiliations: Monash University; Mohamed bin Zayed University of Artificial Intelligence; Imperial College London; UT Southwestern Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures


Abstract:Universal medical image segmentation using the Segment Anything Model (SAM) remains challenging due to its limited adaptability to medical domains. Existing adaptations, such as MedSAM, enhance SAM’s performance in medical imaging but at the cost of reduced generalization to unseen data. Therefore, in this paper, we propose SAM-aware Test-Time Adaptation (SAM-TTA), a fundamentally different pipeline that preserves the generalization of SAM while improving its segmentation performance in medical imaging via a test-time framework. SAM-TTA tackles two key challenges: (1) input-level discrepancies caused by differences in image acquisition between natural and medical images and (2) semantic-level discrepancies due to fundamental differences in object definition between natural and medical domains (e.g., clear boundaries vs. ambiguous structures). Specifically, our SAM-TTA framework comprises (1) Self-adaptive Bezier Curve-based Transformation (SBCT), which adaptively converts single-channel medical images into three-channel SAM-compatible inputs while maintaining structural integrity, to mitigate the input gap between medical and natural images, and (2) Dual-scale Uncertainty-driven Mean Teacher adaptation (DUMT), which employs consistency learning to align SAM’s internal representations to medical semantics, enabling efficient adaptation without auxiliary supervision or expensive retraining. Extensive experiments on five public datasets demonstrate that our SAM-TTA outperforms existing TTA approaches and even surpasses fully fine-tuned models such as MedSAM in certain scenarios, establishing a new paradigm for universal medical image segmentation. Code can be found at this https URL.

[CV-32] MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm

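Quick read: This paper addresses the complexity and inefficiency of document parsing, where conventional approaches rely on multi-tool pipelines or large end-to-end models. The key to its solution is a vision-language model built on the Structure-Recognition-Relation (SRR) triplet paradigm, which abstracts document parsing into three core questions: "Where is it?" (structure), "What is it?" (recognition), and "How is it organized?" (relation). These correspond to layout analysis, content recognition, and logical ordering. The design improves processing speed and scalability while maintaining accuracy.

Link: https://arxiv.org/abs/2506.05218
Authors: Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, Xiang Bai
Affiliations: Huazhong University of Science and Technology; Kingsoft Office
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: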


Abstract:We introduce MonkeyOCR, a vision-language model for document parsing that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline (as in MinerU’s modular approach) and avoids the inefficiencies of processing full pages with giant end-to-end models (e.g., large multimodal LLMs like Qwen-VL). In SRR, document parsing is abstracted into three fundamental questions - “Where is it?” (structure), “What is it?” (recognition), and “How is it organized?” (relation) - corresponding to layout analysis, content identification, and logical ordering. This focused decomposition balances accuracy and speed: it enables efficient, scalable processing without sacrificing precision. To train and evaluate this approach, we introduce the MonkeyDoc (the most comprehensive document parsing dataset to date), with 3.9 million instances spanning over ten document types in both Chinese and English. Experiments show that MonkeyOCR outperforms MinerU by an average of 5.1%, with particularly notable improvements on challenging content such as formulas (+15.0%) and tables (+8.6%). Remarkably, our 3B-parameter model surpasses much larger and top-performing models, including Qwen2.5-VL (72B) and Gemini 2.5 Pro, achieving state-of-the-art average performance on English document parsing tasks. In addition, MonkeyOCR processes multi-page documents significantly faster (0.84 pages per second compared to 0.65 for MinerU and 0.12 for Qwen2.5-VL-7B). The 3B model can be efficiently deployed for inference on a single NVIDIA 3090 GPU. Code and models will be released at this https URL.

[CV-33] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

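Quick read: This paper addresses the long-standing challenge of building an efficient, physically consistent world model from limited observations. Existing world modeling methods based on implicit generative models are hard to train and often lack 3D or physical consistency, while explicit 3D methods built from a single state need multi-stage pipelines because of occlusion. The key to its solution is to use two perturbed observations of the same scene under different object configurations; the dual states provide complementary visibility and alleviate occlusion during state transitions. The proposed DSG-World framework builds dual segmentation-aware Gaussian fields, enforces bidirectional photometric and semantic consistency, and combines pseudo intermediate state alignment with collaborative co-pruning to construct an explicit 3D Gaussian world model from dual-state observations. This supports efficient real-to-simulation transfer, high-fidelity rendering, and object-level scene manipulation.

Link: https://arxiv.org/abs/2506.05217
Authors: Wenhao Hu, Xuexiang Wen, Xi Li, Gaoang Wang
Affiliations: Zhejiang University; ZJU-UIUC Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: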


Abstract:Building an efficient and physically consistent world model from limited observations is a long-standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing (such as segmentation, background completion, and inpainting) due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning strategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

[CV-34] Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation CVPR

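Quick read: This paper examines the ability of multimodal foundation models to transfer knowledge to specialized domains such as garment generation, which remains underexplored despite their strong generalization on general tasks. The key to its approach is VLG (vision-language-garment model), which synthesizes garments from textual descriptions and visual imagery. Zero-shot generalization experiments examine its potential to transfer web-scale reasoning to unseen garment styles and prompts.

Link: https://arxiv.org/abs/2506.05210
Authors: Jan Ackermann, Kiyohiro Nakayama, Guandao Yang, Tong Wu, Gordon Wetzstein
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at MMFM CVPRW’25, code available at this https URL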


Abstract:Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG’s zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

[CV-35] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

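Quick read: This paper addresses motion inconsistency and inefficient tuning when large video diffusion transformers are applied to motion transfer. Existing adaptation-based methods struggle to keep the motion of generated videos consistent with the input video for complex motion, and the two-stage LoRA fine-tuning process is time-consuming. The key to its solution is a Spatial-Temporal Decoupled LoRA that decouples spatial appearance and temporal motion processing in the attention mechanism, together with sparse motion sampling and adaptive RoPE (Rotary Position Embedding) in the second stage to speed up tuning.

Link: https://arxiv.org/abs/2506.05207
Authors: Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Affiliations: HKUST; HKUST(GZ); Tsinghua University; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL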


Abstract:Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptation (LoRA) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, it requires time-consuming fine-tuning in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

[CV-36] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

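Quick read: This paper addresses semantic-aware 3D scene reconstruction from sparse views, a task of considerable value to emerging applications such as virtual reality and embodied AI. Existing per-scene methods require dense input views and incur high computational cost, while generalizable methods reconstruct regions outside the input view range poorly. The proposed solution, OGGSplat, rests on using the semantic attributes of open Gaussians as strong priors for image extrapolation, achieving semantic consistency and visual plausibility. An RGB-semantic consistent inpainting module is applied to selected rendered views, with bidirectional control between an image diffusion model and a semantic diffusion model, and the inpainted regions are lifted into 3D space for efficient Gaussian parameter optimization.

Link: https://arxiv.org/abs/2506.05204
Authors: Yanbo Wang, Ziyi Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: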


Abstract:Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

[CV-37] Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

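Quick read: This paper addresses how to understand language instructions and accurately localize target objects in embodied 3D grounding. Conventional methods usually follow a two-stage paradigm in which a pretrained 3D detector initializes the grounding model. The study finds, however, that boxes predicted by a detection model and filtered by the target category already outperform grounding models trained specifically on language instructions. This suggests that current methods fall short even at category-level grounding, let alone more complex context-aware grounding. To address this, the authors propose DEGround, a framework that shares DETR queries as a unified object representation for detection and grounding, and strengthens contextual understanding of language instructions through a regional activation grounding module and a query-wise modulation module.

Link: https://arxiv.org/abs/2506.05199
Authors: Yani Zhang, Dongming Wu, Hao Shi, Yingfei Liu, Tiancai Wang, Haoqiang Fan, Xingping Dong
Affiliations: Wuhan University; Beijing Institute of Technology; Tsinghua University; Dexmal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 1st place on EmbodiedScan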


Abstract:Embodied 3D grounding aims to localize target objects described in human instructions from an ego-centric viewpoint. Most methods follow a two-stage paradigm in which a trained 3D detector’s optimized backbone parameters are used to initialize a grounding model. In this study, we explore a fundamental question: Does embodied 3D grounding benefit enough from detection? To answer this question, we assess the grounding performance of detection models using predicted boxes filtered by the target category. Surprisingly, these detection models without any instruction-specific training outperform the grounding models explicitly trained with language instructions. This indicates that even category-level embodied 3D grounding may not be well resolved, let alone more fine-grained context-aware grounding. Motivated by this finding, we propose DEGround, which shares DETR queries as object representation for both DEtection and Grounding and enables the grounding to benefit from basic category classification and box detection. Based on this framework, we further introduce a regional activation grounding module that highlights instruction-related regions and a query-wise modulation module that incorporates sentence-level semantics into the query representation, strengthening the context-aware understanding of language instructions. Remarkably, DEGround outperforms the state-of-the-art model BIP3D by 7.52% in overall accuracy on the EmbodiedScan validation set. The source code will be publicly available at this https URL.

[CV-38] Quantifying Cross-Modality Memorization in Vision-Language Models

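Quick read: This paper addresses the limited transfer of knowledge across modalities in multimodal models (cross-modal memorization), in particular the significant gap in recalling information between modalities. The key to its approach is to build a synthetic persona dataset and, by training on a single modality and evaluating on the other, quantify factual knowledge memorization and cross-modal transferability. This reveals the characteristics and limits of cross-modal memorization, and the authors propose a baseline method to mitigate the gap.

Link: https://arxiv.org/abs/2506.05198
Authors: Yuxin Wen, Yangsibo Huang, Tom Goldstein, Ravi Kumar, Badih Ghazi, Chiyuan Zhang
Affiliations: University of Maryland, College Park; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: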


Abstract:Understanding what and how neural networks memorize during training is crucial, both from the perspective of unintentional memorization of potentially sensitive information and from the standpoint of effective knowledge acquisition for real-world, knowledge-intensive tasks. While previous studies primarily investigate memorization within a single modality, such as text memorization in large language models or image memorization in diffusion models, unified multimodal models are becoming increasingly prevalent in practical applications. In this work, we focus on the unique characteristics of cross-modality memorization and conduct a systematic study centered on vision-language models. To facilitate controlled experiments, we first introduce a synthetic persona dataset comprising diverse synthetic person images and textual descriptions. We quantify factual knowledge memorization and cross-modal transferability by training models on a single modality and evaluating their performance in the other. Our results reveal that facts learned in one modality transfer to the other, but a significant gap exists between recalling information in the source and target modalities. Furthermore, we observe that this gap exists across various scenarios, including more capable models, machine unlearning, and the multi-hop case. At the end, we propose a baseline method to mitigate this challenge. We hope our study can inspire future research on developing more robust multimodal learning techniques to enhance cross-modal transferability.

[CV-39] Vision-Based Autonomous MM-Wave Reflector Using ArUco-Driven Angle-of-Arrival Estimation

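Quick read: This paper addresses the challenge of reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions, especially in urban or infrastructure-limited environments. The key to its solution is a vision-aided autonomous reflector system that dynamically steers signal reflections with a motorized metallic plate. A monocular camera detects ArUco markers on allied transmitter and receiver nodes, estimates their angles of arrival, and aligns the reflector in real time for optimal signal redirection, which enables selective beam coverage and reduces the risk of unintended signal exposure.

Link: https://arxiv.org/abs/2506.05195
Authors: Josue Marroquin, Nan Inzali, Miles Dillon Lantz, Campbell Freeman, Amod Ashtekar, Ajinkya Umesh Mulik, Mohammed E Eltayeb
Affiliations: California State University, Sacramento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: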


Abstract:Reliable millimeter-wave (mmWave) communication in non-line-of-sight (NLoS) conditions remains a major challenge for both military and civilian operations, especially in urban or infrastructure-limited environments. This paper presents a vision-aided autonomous reflector system designed to enhance mmWave link performance by dynamically steering signal reflections using a motorized metallic plate. The proposed system leverages a monocular camera to detect ArUco markers on allied transmitter and receiver nodes, estimate their angles of arrival, and align the reflector in real time for optimal signal redirection. This approach enables selective beam coverage by serving only authenticated targets with visible markers and reduces the risk of unintended signal exposure. The designed prototype, built on a Raspberry Pi 4 and low-power hardware, operates autonomously without reliance on external infrastructure or GPS. Experimental results at 60 GHz demonstrate a 23 dB average gain in received signal strength and a 0.89 probability of maintaining signal reception above a target threshold of -65 dB in an indoor environment, far exceeding the static and no-reflector baselines. These results demonstrate the system’s potential for resilient and adaptive mmWave connectivity in complex and dynamic environments.
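
The alignment step can be illustrated with the law of specular reflection: a flat plate redirects the signal from transmitter to receiver when its normal bisects the two directions. The sketch assumes planar geometry and bearings measured from the reflector's own position (for example, from a camera mounted on the plate). The function name and the example angles are hypothetical.

```python
import math

def reflector_normal_deg(tx_bearing_deg, rx_bearing_deg):
    """Bearing of the plate normal that bisects the transmitter and receiver directions."""
    tx = math.radians(tx_bearing_deg)
    rx = math.radians(rx_bearing_deg)
    # Sum of the two unit direction vectors points along the angle bisector.
    nx = math.cos(tx) + math.cos(rx)
    ny = math.sin(tx) + math.sin(rx)
    if math.hypot(nx, ny) < 1e-9:
        raise ValueError("transmitter and receiver lie in opposite directions")
    return math.degrees(math.atan2(ny, nx))

# Transmitter seen at +30 degrees, receiver at -10 degrees.
normal = reflector_normal_deg(30.0, -10.0)
```

With the normal at 10 degrees, the angle of incidence (20 degrees) equals the angle of reflection, which is the condition the motorized plate would be driven to satisfy.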

[CV-40] MokA: Multimodal Low-Rank Adaptation for MLLMs

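Quick read: This paper addresses a key limitation of current efficient multimodal fine-tuning methods: most are borrowed directly from Large Language Models (LLMs) and neglect the intrinsic differences of multimodal scenarios, which hampers full use of all modalities. The key to its solution is Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that treats unimodal adaptation and cross-modal adaptation as the two essential parts of fine-tuning Multimodal Large Language Models (MLLMs). It compresses unimodal information with modality-specific parameters while explicitly enhancing cross-modal interaction, achieving both unimodal and cross-modal adaptation.

Link: https://arxiv.org/abs/2506.05191
Authors: Yake Wei, Yu Miao, Dongzhan Zhou, Di Hu
Affiliations: Renmin University of China; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: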


Abstract:In this paper, we reveal that most current efficient multimodal fine-tuning methods are hindered by a key limitation: they are directly borrowed from LLMs, often neglecting the intrinsic differences of multimodal scenarios and even affecting the full utilization of all modalities. Inspired by our empirical observation, we argue that unimodal adaptation and cross-modal adaptation are two essential parts for the effective fine-tuning of MLLMs. From this perspective, we propose Multimodal low-rank Adaptation (MokA), a multimodal-aware efficient fine-tuning strategy that takes multimodal characteristics into consideration. It compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction, ensuring both unimodal and cross-modal adaptation. Extensive experiments cover three representative multimodal scenarios (audio-visual-text, visual-text, and speech-text), and multiple LLM backbones (LLaMA2/3, Qwen2, Qwen2.5-VL, etc.). Consistent improvements indicate the efficacy and versatility of the proposed method. Ablation studies and efficiency evaluation are also conducted to fully assess our method. Overall, we think MokA provides a more targeted solution for efficient adaptation of MLLMs, paving the way for further exploration. The project page is at this https URL.

[CV-41] Single GPU Task Adaptation of Pathology Foundation Models for Whole Slide Image Analysis

[Quick read]: This paper aims to address the challenge of adapting pretrained pathology foundation models (PFMs) to specific clinical tasks, in particular whole slide image (WSI) analysis when only weak (WSI-level) labels are available. The key to the solution is a single-GPU task adaptation method (Task Adaptation of PFMs, TAPFM) that uses the attention mechanism of a vision transformer (ViT) for multiple instance learning (MIL) aggregation and optimizes both feature representations and attention weights during end-to-end adaptation. By maintaining separate computational graphs for the MIL aggregator and the PFM, TAPFM achieves stable training dynamics aligned with downstream task objectives.

Link: https://arxiv.org/abs/2506.05184
Authors: Neeraj Kumar,Swaraj Nanda,Siddharth Singi,Jamal Benhamida,David Kim,Jie-Fu Chen,Amir Momeni-Boroujeni,Gregory M. Goldgof,Gabriele Campanella,Chad Vanderbilt
Affiliations: Memorial Sloan Kettering Cancer Center; Icahn School of Medicine at Mount Sinai; Hasso Plattner Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Pathology foundation models (PFMs) have emerged as powerful tools for analyzing whole slide images (WSIs). However, adapting these pretrained PFMs for specific clinical tasks presents considerable challenges, primarily due to the availability of only weak (WSI-level) labels for gigapixel images, necessitating a multiple instance learning (MIL) paradigm for effective WSI analysis. This paper proposes a novel approach for single-GPU Task Adaptation of PFMs (TAPFM) that uses vision transformer (ViT) attention for MIL aggregation while optimizing both for feature representations and attention weights. The proposed approach maintains separate computational graphs for MIL aggregator and the PFM to create stable training dynamics that align with downstream task objectives during end-to-end adaptation. Evaluated on mutation prediction tasks for bladder cancer and lung adenocarcinoma across institutional and TCGA cohorts, TAPFM consistently outperforms conventional approaches, with H-Optimus-0 (TAPFM) outperforming the benchmarks. TAPFM effectively handles multi-label classification of actionable mutations as well. Thus, TAPFM makes adaptation of powerful pre-trained PFMs practical on standard hardware for various clinical applications.
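
Attention-based MIL aggregation, which the abstract relies on, reduces many tile-level features to one slide-level feature by weighting each tile with a softmax over attention scores. The sketch below shows that generic pooling step only; it uses plain scalar scores as a stand-in for ViT attention, and the function name is hypothetical rather than part of TAPFM.

```python
import math


def attention_mil_pool(tile_features, tile_scores):
    """Pool tile features into one slide-level feature with softmax attention.

    tile_features: list of equal-length feature vectors, one per tile.
    tile_scores:   one unnormalized attention score per tile (stand-in values).
    Returns (weights, pooled_feature).
    """
    top = max(tile_scores)
    exps = [math.exp(s - top) for s in tile_scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, tile_features))
        for i in range(dim)
    ]
    return weights, pooled
```

When all scores are equal, the weights are uniform and the pooled feature is the mean of the tile features; a tile with a much higher score dominates the slide-level representation.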

[CV-42] Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

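[Quick read]: This paper addresses the insufficient detection of fine-grained anomalies (such as anomalous pixels) in video anomaly detection (VAD). Conventional methods typically focus only on anomalous frames or objects and overlook finer anomalous features, which limits their detection range. The key to the solution is a new framework called Track Any Anomalous Object (TAO), which for the first time integrates the detection of multiple fine-grained anomalous objects into a unified framework. By linking anomaly scores to downstream tasks such as segmentation and tracking, it recasts the problem as pixel-level tracking of anomalous objects, which removes the need for threshold tuning and yields more precise anomaly localization in long and complex video sequences.

Link: https://arxiv.org/abs/2506.05175
Authors: Yuzhi Huang,Chenxin Li,Haitao Zhang,Zixu Lin,Yunlong Lin,Hengyu Liu,Wuyang Li,Xinyu Liu,Jiechao Gao,Yue Huang,Xinghao Ding,Yixuan Yuan
Affiliations: Xiamen University; The Chinese University of Hong Kong; University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: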

Abstract:Video anomaly detection (VAD) is crucial in scenarios such as surveillance and autonomous driving, where timely detection of unexpected activities is essential. Although existing methods have primarily focused on detecting anomalous objects in videos – either by identifying anomalous frames or objects – they often neglect finer-grained analysis, such as anomalous pixels, which limits their ability to capture a broader range of anomalies. To address this challenge, we propose a new framework called Track Any Anomalous Object (TAO), which introduces a granular video anomaly detection pipeline that, for the first time, integrates the detection of multiple fine-grained anomalous objects into a unified framework. Unlike methods that assign anomaly scores to every pixel, our approach transforms the problem into pixel-level tracking of anomalous objects. By linking anomaly scores to downstream tasks such as segmentation and tracking, our method removes the need for threshold tuning and achieves more precise anomaly localization in long and complex video sequences. Experiments demonstrate that TAO sets new benchmarks in accuracy and robustness. Project page available online.

[CV-43] Through-the-Wall Radar Human Activity Recognition WITHOUT Using Neural Networks

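[Quick read]: This paper addresses human activity recognition (HAR) with through-the-wall radar (TWR). The core challenge is that conventional methods rely on training neural network models, whereas the author wants to return to training-free, physically interpretable methods. The key to the solution is to determine the target and noise regions with a corner detection method, segment the micro-Doppler signature with a multiphase active contour model, discretize it into a two-dimensional point cloud, and finally use the Mapper algorithm to compute the topological similarity between the point cloud and template data to perform recognition.

Link: https://arxiv.org/abs/2506.05169
Authors: Weicheng Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 15 pages, 8 figures, 8 tables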

Abstract: After a few years of research in the field of through-the-wall radar (TWR) human activity recognition (HAR), I found that we seem to be stuck in the mindset of training on radar image data through neural network models. The earliest related works in this field based on template matching did not require a training process, and I believe they have never died, because these methods possess strong physical interpretability and are closer to the basis of theoretical signal processing research. In this paper, I try to return to the original path by eschewing neural networks for the TWR HAR task, while aiming for recognition as intelligent as that of neural network models. In detail, the range-time map and Doppler-time map of TWR are first generated. Then, the initial regions of the human target foreground and noise background on the maps are determined using a corner detection method, and the micro-Doppler signature is segmented using the multiphase active contour model. The micro-Doppler segmentation feature is discretized into a two-dimensional point cloud. Finally, the topological similarity between the resulting point cloud and the point clouds of the template data is calculated using the Mapper algorithm to obtain the recognition results. The effectiveness of the proposed method is demonstrated by numerically simulated and measured experiments. The open-source code of this work is released at: this https URL.
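
The discretization step in this pipeline (turning a segmented micro-Doppler signature into a two-dimensional point cloud) can be pictured with a very small sketch. It assumes the segmentation is available as a binary mask whose rows are Doppler bins and whose columns are time bins; that layout and the function name are hypothetical, and the corner detection, active contour, and Mapper stages are not reproduced here.

```python
def mask_to_point_cloud(mask):
    """Convert a binary 2D mask into a list of (time_bin, doppler_bin) points.

    mask: list of rows; rows index Doppler bins and columns index time bins
          (an assumed layout). Any truthy cell becomes one point.
    """
    return [
        (col, row)
        for row, line in enumerate(mask)
        for col, value in enumerate(line)
        if value
    ]
```

For the mask `[[0, 1], [1, 0]]` the result is `[(1, 0), (0, 1)]`. A point cloud in this form can then be compared with template point clouds by whatever similarity measure the recognition stage uses.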

[CV-44] FRED: The Florence RGB-Event Drone Dataset

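[Quick read]: This paper aims to address the limitations of conventional RGB cameras in capturing fast-moving targets, especially under challenging lighting conditions, in the detection, tracking, and trajectory forecasting of small, fast, lightweight drones. The key to the solution is the Florence RGB-Event Drone dataset (FRED), a multimodal dataset combining RGB video and event streams, with high temporal resolution and dynamic range, that supports research on high-speed drone perception and spatiotemporal understanding.

Link: https://arxiv.org/abs/2506.05163
Authors: Gabriele Magrini,Niccolò Marini,Federico Becattini,Lorenzo Berlincioni,Niccolò Biondi,Pietro Pala,Alberto Del Bimbo
Affiliations: University of Florence; University of Siena
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: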

Abstract:Small, fast, and lightweight drones present significant challenges for traditional RGB cameras due to their limitations in capturing fast-moving objects, especially under challenging lighting conditions. Event cameras offer an ideal solution, providing high temporal definition and dynamic range, yet existing benchmarks often lack fine temporal resolution or drone-specific motion patterns, hindering progress in these areas. This paper introduces the Florence RGB-Event Drone dataset (FRED), a novel multimodal dataset specifically designed for drone detection, tracking, and trajectory forecasting, combining RGB video and event streams. FRED features more than 7 hours of densely annotated drone trajectories, using 5 different drone models and including challenging scenarios such as rain and adverse lighting conditions. We provide detailed evaluation protocols and standard metrics for each task, facilitating reproducible benchmarking. The authors hope FRED will advance research in high-speed drone perception and multimodal spatiotemporal understanding.

[CV-45] Practical Manipulation Model for Robust Deepfake Detection

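[Quick read]: This paper aims to address the unstable performance of deepfake detection models under non-ideal conditions, which limits their results on some benchmark datasets and makes detection easy to circumvent. The key to the solution is a more realistic forgery model, the Practical Manipulation Model (PMM), which expands the space of pseudo-fakes with Poisson blending, more diverse masks, generator artifacts, and distractors, and adds strong degradations to the training images to improve the detector's generality and robustness. These changes not only significantly improve robustness to common image degradations but also improve performance on standard benchmark datasets.

Link: https://arxiv.org/abs/2506.05119
Authors: Benedikt Hopf,Radu Timofte
Affiliations: University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: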

Abstract:Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors’ generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model’s robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of 3.51% and 6.21% AUC on the DFDC and DFDCP datasets, respectively, over the s-o-t-a LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at this https URL

[CV-46] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

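[Quick read]: This paper aims to address the loss of representation diversity that has accompanied improvements in the quality and consistency of text-to-image (T2I) models. Existing automatic evaluation methods either rely on reference image datasets or lack specificity about the kind of diversity measured, which limits their adaptability and interpretability. The proposed solution is the DIM-CIM framework, a reference-free measurement of default-mode diversity ("Does") and generalization capacity ("Can"). The key is the construction of the COCO-DIMCIM benchmark, which enables systematic evaluation of diversity and generalization across model sizes and reveals a strong correlation between training data diversity and default-mode diversity.

Link: https://arxiv.org/abs/2506.05108
Authors: Revant Teotia,Candace Ross,Karen Ullrich,Sumit Chopra,Adriana Romero-Soriano,Melissa Hall,Matthew J. Muckley
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: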

Abstract:Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity (“Does” the model generate images with expected attributes?) and generalization capacity (“Can” the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

[CV-47] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

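[Quick read]: This paper aims to address the high computational demands of video diffusion transformers (vDiTs) in text-to-video generation, which make practical deployment difficult. Existing acceleration methods reduce workload at various granularities but usually rely on heuristics, which limits their applicability. The proposed ASTRAEA framework addresses this by automatically searching for near-optimal configurations. Its key components are a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, which give linear reductions in execution time with minimal impact on generation quality. ASTRAEA also includes a search framework based on a classic evolutionary algorithm that automatically determines the token budget distribution across timesteps, significantly speeding up inference while preserving video quality.

Link: https://arxiv.org/abs/2506.05096
Authors: Haosong Liu,Yuge Cheng,Zihan Liu,Aiyue Chen,Yiwu Yao,Chen Chen,Jingwen Leng,Yu Feng,Minyi Guo
Affiliations: Shanghai Jiao Tong University; Shanghai Qizhi Institute; Huawei Technologies Co.,Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: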

Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (0.5% loss on the VBench score compared to the baseline vDiT models).
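
The idea of a per-timestep token budget can be sketched as a top-k selection: at each timestep only the highest-scoring fraction of tokens is kept for attention. The abstract does not say how tokens are scored or how budgets are encoded, so the importance scores, the ratio-based budget, and both function names below are assumptions for illustration, not ASTRAEA's mechanism.

```python
def select_tokens(scores, budget_ratio):
    """Return the sorted indices of the tokens kept under a budget ratio.

    scores:       hypothetical per-token importance scores.
    budget_ratio: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in (0, 1]")
    keep = max(1, int(len(scores) * budget_ratio))  # always keep at least one token
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def select_per_timestep(scores_by_step, budget_by_step):
    """Apply a separate budget ratio at every denoising timestep."""
    return [select_tokens(s, b) for s, b in zip(scores_by_step, budget_by_step)]
```

A search procedure such as the evolutionary algorithm mentioned in the abstract would then operate on the list of budget ratios, trading speed against generation quality.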

[CV-48] FG 2025 TrustFAA: the First Workshop on Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness Explainability and Safety (TrustFAA)

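[Quick read]: This workshop paper aims to address trustworthiness problems in Emotion AI-powered facial affect analysis (FAA) systems, including interpretability, uncertainty, bias, and privacy. The key to the solution is to advance fairness, explainability, and safety through research across multiple tasks, such as macro-/micro-expression recognition, facial action unit detection, pain and depression detection, and human-robot interaction, thereby improving the trustworthiness of FAA systems.

Link: https://arxiv.org/abs/2506.05095
Authors: Jiaee Cheong,Yang Liu,Harold Soh,Hatice Gunes
Affiliations: University of Cambridge; University of Oulu; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: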

Abstract: With the increasing prevalence and deployment of Emotion AI-powered facial affect analysis (FAA) tools, concerns about the trustworthiness of these systems have become more prominent. This first workshop on “Towards Trustworthy Facial Affect Analysis: Advancing Insights of Fairness, Explainability, and Safety (TrustFAA)” aims to bring together researchers who are investigating different challenges in relation to trustworthiness (such as interpretability, uncertainty, biases, and privacy) across various facial affect analysis tasks, including macro-/micro-expression recognition, facial action unit detection, other corresponding applications such as pain and depression detection, as well as human-robot interaction and collaboration. In alignment with FG2025’s emphasis on ethics, as demonstrated by the inclusion of an Ethical Impact Statement requirement for this year’s submissions, this workshop supports FG2025’s efforts by encouraging research, discussion and dialogue on trustworthy FAA.

[CV-49] Synthetic Dataset Generation for Autonomous Mobile Robots Using 3D Gaussian Splatting for Vision Training

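[Quick read]: This paper addresses the problems of conventional annotated dataset creation, which is time-consuming, labour-intensive, error-prone, and limited in diversity, particularly in robotics, where complex and changing scenes make dataset construction even harder. The key to the solution is a new method for automatically generating annotated synthetic data in Unreal Engine. It uses photorealistic 3D Gaussian splats for rapid synthetic data generation and improves object detection performance by combining real-world and synthetic data, offering a scalable and comprehensive alternative to traditional dataset creation.

Link: https://arxiv.org/abs/2506.05092
Authors: Aneesh Deogan,Wout Beks,Peter Teurlings,Koen de Vos,Mark van den Brand,Rene van de Molengraft
Affiliations: Eindhoven University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: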

Abstract: Annotated datasets are critical for training neural networks for object detection, yet their manual creation is time- and labour-intensive, subject to human error, and often limited in diversity. This challenge is particularly pronounced in the domain of robotics, where diverse and dynamic scenarios further complicate the creation of representative datasets. To address this, we propose a novel method for automatically generating annotated synthetic data in Unreal Engine. Our approach leverages photorealistic 3D Gaussian splats for rapid synthetic data generation. We demonstrate that synthetic datasets can achieve performance comparable to that of real-world datasets while significantly reducing the time required to generate and annotate data. Additionally, combining real-world and synthetic data significantly increases object detection performance by leveraging the quality of real-world images with the easier scalability of synthetic data. To our knowledge, this is the first application of synthetic data for training object detection algorithms in the highly dynamic and varied environment of robot soccer. Validation experiments reveal that a detector trained on synthetic images performs on par with one trained on manually annotated real-world images when tested on robot soccer match scenarios. Our method offers a scalable and comprehensive alternative to traditional dataset creation, eliminating the labour-intensive error-prone manual annotation process. By generating datasets in a simulator where all elements are intrinsically known, we ensure accurate annotations while significantly reducing manual effort, which makes it particularly valuable for robotics applications requiring diverse and scalable training data.

[CV-50] SeedEdit 3.0: Fast and High-Quality Generative Image Editing

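[Quick read]: This paper aims to balance edit instruction following against preservation of image content (e.g., identity/intellectual property) in real image editing. The key solution is an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that effectively mixes images from multiple data sources, together with a joint learning framework that computes a diffusion loss and a reward loss, improving the model's performance and usability in practical image editing.

Link: https://arxiv.org/abs/2506.05083
Authors: Peng Wang,Yichun Shi,Xiaochen Lian,Zhonghua Zhai,Xin Xia,Xuefeng Xiao,Weilin Huang,Jianchao Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our website: this https URL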

Abstract: We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to the model upgrade with T2I, this report presents several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and the meta information helps connect the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

[CV-51] A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions

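[Quick read]: This paper aims to address challenges in Vietnamese document analysis and recognition (DAR), including complex diacritics, tonal variations, and the lack of large-scale annotated datasets. Traditional optical character recognition (OCR) methods struggle with real-world document variations, while deep learning approaches are promising but limited by data scarcity and generalization. The key to the proposed direction is to use large language models (LLMs) and vision-language models to improve text recognition and document understanding, and to explore multimodal learning, domain adaptation, and computational efficiency, advancing Vietnamese DAR.

Link: https://arxiv.org/abs/2506.05061
Authors: Anh Le,Thanh Lam,Dung Nguyen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: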

Abstract: Vietnamese document analysis and recognition (DAR) is a crucial field with applications in digitization, information retrieval, and automation. Despite advancements in OCR and NLP, Vietnamese text recognition faces unique challenges due to its complex diacritics, tonal variations, and lack of large-scale annotated datasets. Traditional OCR methods often struggle with real-world document variations, while deep learning approaches have shown promise but remain limited by data scarcity and generalization issues. Recently, large language models (LLMs) and vision-language models have demonstrated remarkable improvements in text recognition and document understanding, offering a new direction for Vietnamese DAR. However, challenges such as domain adaptation, multimodal learning, and computational efficiency persist. This survey provides a comprehensive review of existing techniques in Vietnamese document recognition, highlights key limitations, and explores how LLMs can revolutionize the field. We discuss future research directions, including dataset development, model optimization, and the integration of multimodal approaches for improved document intelligence. By addressing these gaps, we aim to foster advancements in Vietnamese DAR and encourage community-driven solutions.

[CV-52] FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

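[Quick read]: This paper addresses the temporal inconsistencies and degraded structural fidelity in text-driven video editing that result from reliance on inversion-based techniques. The key to the solution is the FlowDirector framework, which models the editing process directly in data space and uses an ordinary differential equation (ODE) to guide the video smoothly along its inherent spatiotemporal manifold, preserving temporal coherence and structural detail.

Link: https://arxiv.org/abs/2506.05046
Authors: Guangzhao Li,Yanming Yang,Chenxi Song,Chi Zhang
Affiliations: Westlake University; Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page is this https URL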

Abstract:Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.

[CV-53] Identifying and Understanding Cross-Class Features in Adversarial Training ICML2025

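[Quick read]: This paper addresses the theoretical understanding of the training mechanisms and dynamics of adversarial training (AT), in particular how feature-level analysis can reveal how AT works. The key to the solution is to start from class-wise feature attribution and identify a key family of features called cross-class features, which are shared by multiple classes and are important for robust classification. The study finds that in the initial stage of AT the model tends to learn more cross-class features until it reaches the best robustness checkpoint, and that as further training causes robust overfitting the model relies more on class-specific features. This finding gives a unified view of two existing properties of AT: the advantage of soft-label training and robust overfitting.

Link: https://arxiv.org/abs/2506.05032
Authors: Zeming Wei,Yiwen Guo,Yisen Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: ICML 2025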

Abstract:Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at this https URL.

[CV-54] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ Pointer-Based Trainingdata Generation

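[Quick read]: This paper addresses training data generation for automated optical inspection, in particular how to transfer the expertise of trained inspection personnel into machine learning (ML) training data. The key to the solution is a pointer-based, in-situ, interactive physical annotation system that uses calibrated, tracked pointers to capture physical trajectories and contours directly on the object, converts them into standardised annotation formats, and provides visual guidance through a projector-based interface, making annotation more intuitive and efficient. The method lets non-IT experts contribute to the ML training pipeline and prevents the loss of valuable training samples.

Link: https://arxiv.org/abs/2506.05026
Authors: Oliver Krumpek,Oliver Heimann,Jörg Krüger
Affiliations: Fraunhofer IPK; Technical University Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: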

Abstract:This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.

[CV-55] UAV4D: Dynamic Neural Rendering of Human-Centric UAV Imagery using Gaussian Splatting

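[Quick read]: This paper aims to address dynamic scene reconstruction and rendering for UAV-captured scenes, in particular the challenges posed by monocular camera setups, top-down perspectives, and multiple small, moving humans. Existing datasets do not adequately cover these characteristics, so existing methods perform poorly on such scenes. The key to the solution is to combine a 3D foundation model with a human mesh reconstruction model to reconstruct the dynamic scene and the humans from monocular video without additional sensors. It also resolves scene scale ambiguity by identifying human-scene contact points, and initializes Gaussian splats from the SMPL model and a background mesh to render the whole scene photorealistically.

Link: https://arxiv.org/abs/2506.05011
Authors: Jaehoon Choi,Dongki Jung,Christopher Maxey,Yonghan Lee,Sungmin Eum,Dinesh Manocha,Heesung Kwon
Affiliations: University of Maryland, College Park; DEVCOM Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: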

Abstract:Despite significant advancements in dynamic neural rendering, existing methods fail to address the unique challenges posed by UAV-captured scenarios, particularly those involving monocular camera setups, top-down perspective, and multiple small, moving humans, which are not adequately represented in existing datasets. In this work, we introduce UAV4D, a framework for enabling photorealistic rendering for dynamic real-world scenes captured by UAVs. Specifically, we address the challenge of reconstructing dynamic scenes with multiple moving pedestrians from monocular video data without the need for additional sensors. We use a combination of a 3D foundation model and a human mesh reconstruction model to reconstruct both the scene background and humans. We propose a novel approach to resolve the scene scale ambiguity and place both humans and the scene in world coordinates by identifying human-scene contact points. Additionally, we exploit the SMPL model and background mesh to initialize Gaussian splats, enabling holistic scene rendering. We evaluated our method on three complex UAV-captured datasets: VisDrone, Manipal-UAV, and Okutama-Action, each with distinct characteristics and 10~50 humans. Our results demonstrate the benefits of our approach over existing methods in novel view synthesis, achieving a 1.5 dB PSNR improvement and superior visual sharpness.
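
For readers less familiar with the metric behind the reported 1.5 dB gain, PSNR is defined as 10 * log10(MAX^2 / MSE). The sketch below computes it over flattened pixel lists; the function name and the default peak value of 255 (8-bit images) are illustrative assumptions and do not reflect the paper's evaluation code.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a 1.5 dB improvement corresponds to dividing the mean squared error by 10^0.15, roughly 1.41, which is about a 29% reduction in MSE.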

[CV-56] Point Cloud Segmentation of Agricultural Vehicles using 3D Gaussian Splatting

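[Quick read]: This paper addresses the high cost and time required to acquire and annotate real data for 3D point cloud semantic segmentation. The key to the solution is a novel pipeline that uses 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate realistic 3D assets of several different agricultural vehicles and then generates point clouds with a simulated LiDAR in a simulated environment. The approach is flexible, since LiDAR parameters can be changed at no additional cost, and experiments show that models trained only on synthetic data outperform models trained on real data in certain scenarios.

Link: https://arxiv.org/abs/2506.05009
Authors: Alfred T. Christiansen,Andreas H. Højrup,Morten K. Stephansen,Md Ibtihaj A. Sakib,Taman S. Poojary,Filip Slezak,Morten S. Laursen,Thomas B. Moeslund,Joakim B. Haurum
Affiliations: Aalborg University; AGCO A/S; Visual Analysis & Perception Lab, Aalborg University; Pioneer Centre for AI, Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: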

Abstract:Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real-world point clouds is costly and labor-intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real-world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.
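
The headline number in this abstract is a mean intersection-over-union (mIoU). As a reminder of how that metric is computed for point-wise segmentation, the sketch below derives per-class IoU from a confusion matrix and averages it; the function name and the choice to skip classes that never appear are illustrative assumptions, not the paper's evaluation protocol.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of points whose true class is i and predicted class is j.
    Classes with no true or predicted points are skipped (an assumed convention).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed points of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # points wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no points")
    return sum(ious) / len(ious)
```

For the two-class matrix `[[3, 1], [1, 3]]`, each class has IoU 3 / (3 + 1 + 1) = 0.6, so the mIoU is 0.6.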

[CV-57] Structure-Aware Radar-Camera Depth Estimation

[Quick read]: This paper aims to address the limited generalization of monocular depth estimation to unseen domains. The key to the solution is to use an affine-invariant loss to enable joint training on multiple datasets, which improves the model's ability to extract structural information in unseen scenes, although accurate metric depth estimation remains limited.

Link: https://arxiv.org/abs/2506.05008
Authors: Fuyi Zhang,Zhu Yu,Chunhao Li,Runmin Zhang,Xiaokai Bai,Zili Zhou,Si-Yuan Cao,Wang Wang,Hui-Liang Shen
Affiliations: Zhejiang University; Hangzhou City University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Monocular depth estimation aims to determine the depth of each pixel from an RGB image captured by a monocular camera. The development of deep learning has significantly advanced this field by facilitating the learning of depth features from some well-annotated datasets [Geiger_Lenz_Stiller_Urtasun_2013, silberman2012indoor]. Eigen et al. [eigen2014depth] first introduce a multi-scale fusion network for depth regression. Following this, subsequent improvements have come from reinterpreting the regression task as a classification problem [bhat2021adabins, Li_Wang_Liu_Jiang_2022], incorporating additional priors [shao2023nddepth, yang2023gedepth], and developing more effective objective function [xian2020structure, Yin_Liu_Shen_Yan_2019]. Despite these advances, generalizing to unseen domains remains a challenge. Recently, several methods have employed affine-invariant loss to enable multi-dataset joint training [MiDaS, ZeroDepth, guizilini2023towards, Dany]. Among them, Depth Anything [Dany] has shown leading performance in zero-shot monocular depth estimation. While it struggles to estimate accurate metric depth due to the lack of explicit depth cues, it excels at extracting structural information from unseen images, producing structure-detailed monocular depth.
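
The affine-invariant loss mentioned here compares depth maps only up to an unknown scale and shift, which is what lets datasets with different depth ranges be trained on jointly. The sketch below follows one common formulation (shift by the median, scale by the mean absolute deviation, then an L1 difference); it is an illustration under that assumption, with hypothetical function names, and is not taken from the paper or from any specific library.

```python
import statistics


def normalize_affine(depth):
    """Remove shift (median) and scale (mean absolute deviation) from a depth list."""
    shift = statistics.median(depth)
    scale = sum(abs(d - shift) for d in depth) / len(depth)
    if scale == 0:
        raise ValueError("constant depth map has no defined scale")
    return [(d - shift) / scale for d in depth]


def affine_invariant_loss(pred, target):
    """Mean absolute difference between shift- and scale-normalized depth maps."""
    p = normalize_affine(pred)
    t = normalize_affine(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)
```

A prediction that equals the target after a positive rescaling and an offset, such as `2 * d + 3`, gives a loss of zero. This also shows why such a loss recovers scene structure but not metric depth.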

[CV-58] Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts

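[Quick read]: This paper aims to address the inadequate performance of Chinese scene text retrieval caused by complex and varied text layouts. Existing methods mostly borrow solutions from English scene text retrieval and fail to achieve satisfactory results. The key to its solution is a new model named CSTR-CLIP, which integrates global visual information with multi-granularity alignment training and adopts a two-stage training process. This overcomes earlier methods' neglect of visual features outside the text region and their reliance on single-granularity alignment, so that diverse text layouts can be handled effectively.

Link: https://arxiv.org/abs/2506.04999
Authors: Gengluo Li, Huawen Shen, Yu Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)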

Abstract:Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on the existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy and also provides faster inference. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.

[CV-59] PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

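[Quick read]: This paper aims to address the problem that video sampling methods used in automated sports skill assessment disrupt temporal continuity, which harms accurate evaluation of proficiency. The key to its solution is a new sampling strategy, Proficiency-Aware Temporal Sampling (PATS), which adaptively segments videos so that each analyzed portion contains the full execution of critical performance components, maximizing information coverage while maintaining temporal coherence.

Link: https://arxiv.org/abs/2506.04996
Authors: Edoardo Bianchi, Antonio Liotta
Affiliations: Free University of Bozen-Bolzano
Categories: Computer Vision and Pattern Recognition (cs.CV)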

Abstract:Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses the state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.

[CV-60] Multi-scale Image Super Resolution with a Single Auto-Regressive Model

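[Quick read]: This paper addresses Image Super Resolution (ISR), where the core challenge is producing high-quality high-resolution images across different scales. The key to its solution is two new components. The first is a hierarchical image tokenization approach in which a multi-scale image tokenizer progressively represents images at different scales while enforcing token overlap across scales. The second is a regularization term based on Direct Preference Optimization (DPO) that relies solely on the low-resolution and high-resolution tokenizations and steers the Transformer toward producing high-resolution tokens rather than low-resolution ones. Together these let the model denoise and super-resolve in a single forward pass and reach state-of-the-art performance without external training data.

Link: https://arxiv.org/abs/2506.04990
Authors: Enrique Sanchez, Isma Hadji, Adrian Bulat, Christos Tzelepis, Brais Martinez, Georgios Tzimiropoulos
Affiliations: Samsung AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Enrique Sanchez and Isma Hadji equally contributed to this work. Project site this https URL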

Abstract:In this paper we tackle Image Super Resolution (ISR), using recent advances in Visual Auto-Regressive (VAR) modeling. VAR iteratively estimates the residual in latent space between gradually increasing image scales, a process referred to as next-scale prediction. Thus, the strong priors learned during pre-training align well with the downstream task (ISR). To our knowledge, only VARSR has exploited this synergy so far, showing promising results. However, due to the limitations of existing residual quantizers, VARSR works only at a fixed resolution, i.e. it fails to map intermediate outputs to the corresponding image scales. Additionally, it relies on a 1B transformer architecture (VAR-d24), and leverages a large-scale private dataset to achieve state-of-the-art results. We address these limitations through two novel components: a) a Hierarchical Image Tokenization approach with a multi-scale image tokenizer that progressively represents images at different scales while simultaneously enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the LR and HR tokenizations, encourages the transformer to produce the latter over the former. To the best of our knowledge, this is the first time a quantizer is trained to force semantically consistent residuals at different scales, and the first time that preference-based optimization is used to train a VAR. Using these two components, our model can denoise the LR image and super-resolve at half and full target upscale factors in a single forward pass. Additionally, we achieve state-of-the-art results on ISR, while using a small model (300M params vs ~1B params of VARSR), and without using external training data.
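For readers unfamiliar with DPO, the sketch below shows the standard DPO objective for one preferred/rejected pair, with the HR tokenization playing the "preferred" role and the LR tokenization the "rejected" role. The abstract does not say how the paper adapts this into a regularization term, so the reference-model terms, the `beta` value, and all names here are assumptions.

```python
import math


def dpo_loss(logp_hr, logp_lr, ref_logp_hr, ref_logp_lr, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_hr / logp_lr: sequence log-probabilities under the model being trained.
    ref_logp_*: the same quantities under a frozen reference model (assumed here).
    """
    margin = beta * ((logp_hr - ref_logp_hr) - (logp_lr - ref_logp_lr))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# No preference learned yet: the loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Model now assigns relatively more probability to the HR tokens: loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

The loss only depends on how much more the model favors the preferred sequence than the reference does, which is why it can be computed from the two tokenizations alone.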

[CV-61] TextVidBench: A Benchmark for Long Video Scene Text Understanding

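[Quick read]: This paper aims to address the limited video duration and narrow evaluation scope of existing short-video Text-Visual Question Answering (ViteVQA) datasets, which prevent adequate assessment of Multimodal Large Language Models (MLLMs). To this end, the authors propose TextVidBench, the first benchmark for long-video text question answering (3 minutes). Its key points are: 1) cross-domain long-video coverage spanning 9 categories, with an average video length of 2306 seconds; 2) a three-stage evaluation framework, "Text Needle-in-Haystack - Temporal Grounding - Text Dynamics Captioning"; 3) high-quality fine-grained annotations with over 5,000 semantically labeled question-answer pairs. The work also proposes an efficient paradigm for improving large models: the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, non-uniform positional encoding for long video sequences, and lightweight fine-tuning on video-text data.

Link: https://arxiv.org/abs/2506.04983
Authors: Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)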

Abstract:Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: “Text Needle-in-Haystack - Temporal Grounding - Text Dynamics Captioning”. 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.

[CV-62] Bringing SAM to new heights: Leveraging elevation data for tree crown segmentation from drone imagery

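[Quick read]: This paper addresses automatic segmentation of individual tree crowns in broad-scale remote sensing imagery, with the goal of making forest ecosystem monitoring and management more efficient. The key to its solution is to use large pre-trained vision models, in particular the Segment Anything Model (SAM), together with high-resolution drone imagery and Digital Surface Model (DSM) information to improve tree crown instance segmentation. The study finds that SAM used out of the box does not outperform a custom Mask R-CNN, but that tuning SAM end-to-end and integrating DSM data are both promising, with the clearest gains in plantations.

Link: https://arxiv.org/abs/2506.04970
Authors: Mélisande Teng, Arthur Ouaknine, Etienne Laliberté, Yoshua Bengio, David Rolnick, Hugo Larochelle
Affiliations: Mila – Quebec AI Institute; Université de Montréal; McGill University; Rubisco AI
Categories: Computer Vision and Pattern Recognition (cs.CV)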

Abstract:Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad-scale. Large pre-trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts. However, efficiently tuning SAM end-to-end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

[CV-63] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation MICCAI2025

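[Quick read]: This paper aims to address the challenge of modeling both spatial consistency and temporal dynamics when synthesizing high-quality dynamic medical videos. Existing Transformer-based methods suffer from insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. The key to its solution is FEAT (Full-dimensional Efficient Attention Transformer), built on three core innovations: (1) a unified spatial-temporal-channel attention paradigm that captures global dependencies across all dimensions, (2) a linear-complexity design for the attention mechanism in each dimension, using weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels.

Link: https://arxiv.org/abs/2506.04956
Authors: Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been early accepted by MICCAI 2025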

Abstract:Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at this https URL.

[CV-64] APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

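[Quick read]: This paper aims to address the computational constraints and inefficient information extraction from long temporal sequences that current video-based multimodal large language models face on hour-level videos. The key to its solution is a training-free framework, APVR (Adaptive Pivot Visual information Retrieval), which gets past the memory wall through hierarchical visual information retrieval. The framework has two complementary components: Pivot Frame Retrieval uses semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, and Pivot Token Retrieval performs query-aware, attention-driven token selection within the selected frames, enabling efficient processing of long videos while preserving semantic fidelity.

Link: https://arxiv.org/abs/2506.04953
Authors: Hong Gao, Yiming Bao, Xuezhan Tu, Bin Zhong, Minling Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)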

Abstract:Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.

[CV-65] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

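[Quick read]: This paper addresses the instability of Image Quality Assessment (IQA) models under adversarial perturbations in real-world use, which can artificially inflate scores and undermine trust in the models. Traditional approaches mitigate this with data-driven defenses such as adversarial retraining, regularization, or input purification, whereas this paper proposes a different view: robustness as an architectural prior. The key to its solution is to enforce orthogonal information flow and constrain the network to norm-preserving operations, suppressing sensitivity to perturbations at the structural level, and then to stabilize the system further through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without adversarial training.

Link: https://arxiv.org/abs/2506.04951
Authors: Igor Meleshin, Anna Chistyakova, Anastasia Antsiferova, Dmitriy Vatolin
Affiliations: Lomonosov Moscow State University; ISP RAS Research Center for Trusted Artificial Intelligence; MSU Institute for Artificial Intelligence; Innopolis University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)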

Abstract:Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems – from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses – adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations – and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
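The appeal of norm-preserving operations can be seen in a two-dimensional toy case: an orthogonal map (here a plain rotation) never amplifies a perturbation, because the distance between two outputs equals the distance between the two inputs. The sketch below only illustrates that property. The rotation, the vectors, and the angle are arbitrary assumptions and say nothing about how the paper constrains its network.

```python
import math


def rotate(x, theta):
    """Apply a 2-D rotation, the simplest example of an orthogonal linear map."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def norm(x):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(v * v for v in x))


x = [3.0, 4.0]
delta = [0.01, -0.02]                       # small adversarial-style perturbation
x_adv = [a + b for a, b in zip(x, delta)]

y, y_adv = rotate(x, 0.7), rotate(x_adv, 0.7)
out_gap = norm([a - b for a, b in zip(y_adv, y)])   # equals norm(delta)
```

A general linear layer can stretch `delta` by up to its largest singular value. An orthogonal layer has all singular values equal to one, so the output gap stays at `norm(delta)`.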

[CV-66] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

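[Quick read]: This paper addresses shortcomings of current AI-based automated embryo selection for In-Vitro Fertilization (IVF): existing methods either cover only partial embryo evaluation without a holistic quality assessment, or target clinical outcomes that are confounded by extra-embryonic factors, and both limit clinical utility. The key to its solution is a new task, Video-Based Embryo Grading, the first to use full-length time-lapse monitoring (TLM) videos directly to predict embryologists' overall quality assessments. To support it, the authors curate a real-world clinical dataset of over 2,500 TLM videos and propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework grounded in clinical decision-making principles, in which a morphological branch and a morphokinetic branch jointly integrate static and dynamic determinants for more accurate embryo grading.

Link: https://arxiv.org/abs/2506.04950
Authors: Yong Sun, Yipeng Wang, Junyu Shi, Zhiyuan Zhang, Yanmei Xiao, Lei Zhu, Manxi Jiang, Qiang Nie
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)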

Abstract:Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading - the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists’ overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists’ evaluation process. The CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

[CV-67] CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx

[Quick read]: This paper aims to address individual identification, 2D pose estimation, and instance segmentation for wildlife, specifically for monitoring and analysis of the Eurasian lynx (Lynx lynx). Its key contribution is CzechLynx, a large-scale open-access dataset with more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons, covering 219 unique individuals over 15 years in two geographic regions. In addition, more than 100k photorealistic synthetic images are generated with a Unity-based pipeline and diffusion-driven text-to-texture modeling to increase data variability. To test generalization across spatial and temporal domains, the authors define three tailored evaluation protocols.

Link: https://arxiv.org/abs/2506.04931
Authors: Lukas Picek, Elisa Belotti, Michal Bojda, Ludek Bufka, Vojtech Cermak, Martin Dula, Rostislav Dvorak, Luboslav Hrdy, Miroslav Jirik, Vaclav Kocourek, Josefa Krausova, Jirı Labuda, Jakub Straka, Ludek Toman, Vlado Trulık, Martin Vana, Miroslav Kutal
Affiliations: University of West Bohemia in Pilsen; Inria; Czech University of Life Sciences Prague; Šumava National Park Administration; Mendel University in Brno; Friends of the Earth Czech Republic; Center for Machine Perception, Czech Technical University in Prague
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, 2D pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx includes more than 30k camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 219 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: Southwest Bohemia and the Western Carpathians. To increase the data variability, we create a complementary synthetic set with more than 100k photorealistic images generated via a Unity-based pipeline and diffusion-driven text-to-texture modeling, covering diverse environments, poses, and coat-pattern variations. To allow testing generalization across spatial and temporal domains, we define three tailored evaluation protocols/splits: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set. This dataset is targeted to be instrumental in benchmarking state-of-the-art models and the development of novel methods for not just individual animal re-identification.

[CV-68] Light and 3D: a methodological exploration of digitisation techniques adapted to a selection of objects from the Musée d'Archéologie Nationale

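[Quick read]: This paper addresses how to choose the most suitable 3D digitization method for cultural heritage objects, and stresses that no single method fits all cases. The key to its solution is collaboration between heritage specialists and digital specialists, who select or adapt digital tools for each object according to its characteristics and the intended future use of its digital twin.

Link: https://arxiv.org/abs/2506.04925
Authors: Antoine Laurent (TRACES, IRIT-REVA, Toulouse INP), Jean Mélou (IRIT-REVA, Toulouse INP), Catherine Schwab (TEMPS), Rolande Simon-Millot (ARTeHiS), Sophie Féret (Inrap, GAMA), Thomas Sagory, Carole Fritz (MSHS-T, LAMS), Jean-Denis Durou (IRIT-REVA, Toulouse INP)
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: in French language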

Abstract:The need to digitize heritage objects is now widely accepted. This article presents the very fashionable context of the creation of "digital twins". It illustrates the diversity of photographic 3D digitization methods, but this is not its only objective. Using a selection of objects from the collections of the musée d'Archéologie nationale, it shows that no single method is suitable for all cases. Rather, the method to be recommended for a given object should be the result of a concerted choice between those involved in heritage and those involved in the digital domain, as each new object may require the adaptation of existing tools. It would therefore be pointless to attempt an absolute classification of 3D digitization methods. On the contrary, we need to find the digital tool best suited to each object, taking into account not only its characteristics, but also the future use of its digital twin.

[CV-69] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

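[Quick read]: This paper aims to address the inefficiency of Neural Radiance Fields (NeRF)-based methods for stereo dataset generation by proposing an efficient alternative based on 3D Gaussian Splatting (3DGS). The key to its solution is to explore two sources of geometry, the geometry reconstructed from the explicit 3D representation and the disparity estimates of the FoundationStereo model, within an expert knowledge transfer setup, in order to improve zero-shot generalization of stereo models. The study finds that using the reconstructed geometry directly introduces noise and artifacts, while the FoundationStereo estimates are cleaner and lead to better performance.

Link: https://arxiv.org/abs/2506.04908
Authors: Filip Slezak, Magnus K. Gjerde, Joakim B. Haurum, Ivan Nikolov, Morten S. Laursen, Thomas B. Moeslund
Affiliations: Aalborg University; AGCO A/S; Pioneer Centre for AI
Categories: Computer Vision and Pattern Recognition (cs.CV)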

Abstract:In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that stereo models fine-tuned on 3DGS-generated datasets achieve competitive performance in zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting-based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings, warranting further exploration.

[CV-70] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

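[Quick read]: This paper addresses visual grounding beyond the object level in 3D scenes, that is, accurately grounding referring expressions for human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. The key to its solution is a holistic 3D visual grounding benchmark, Anywhere3D-Bench, with 2,632 pairs of referring expressions and 3D bounding boxes covering four grounding levels. By evaluating state-of-the-art 3D visual grounding methods along with large language models (LLMs) and multimodal LLMs (MLLMs) on it, the study shows that current models fall well short on space-level and part-level tasks, exposing a critical gap in understanding and reasoning about 3D scenes beyond object-level semantics.

Link: https://arxiv.org/abs/2506.04897
Authors: Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li
Affiliations: BIGAI; Tsinghua University; Peking University; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)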

Abstract:3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

[CV-71] Learning to Plan via Supervised Contrastive Learning and Strategic Interpolation: A Chess Case Study

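[Quick read]: This paper addresses the gap between chess engines, which rely on deep search and regressive evaluation, and human players, who pick candidate moves by intuition and verify them with a shallow search. The key to its solution is to train a Transformer encoder with supervised contrastive learning so that board states are embedded in a latent space structured by positional evaluation. Distance in this space reflects evaluative similarity, visualized trajectories show interpretable transitions between game states, and moves can be selected without relying on deep search.

Link: https://arxiv.org/abs/2506.04892
Authors: Andrew Hamara, Greg Hamerly, Pablo Rivas, Andrew C. Freeman
Affiliations: Baylor University
Categories: Computer Vision and Pattern Recognition (cs.CV)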

Abstract:Modern chess engines achieve superhuman performance through deep tree search and regressive evaluation, while human players rely on intuition to select candidate moves followed by a shallow search to validate them. To model this intuition-driven planning process, we train a transformer encoder using supervised contrastive learning to embed board states into a latent space structured by positional evaluation. In this space, distance reflects evaluative similarity, and visualized trajectories display interpretable transitions between game states. We demonstrate that move selection can occur entirely within this embedding space by advancing toward favorable regions, without relying on deep search. Despite using only a 6-ply beam search, our model achieves an estimated Elo rating of 2593. Performance improves with both model size and embedding dimensionality, suggesting that latent planning may offer a viable alternative to traditional search. Although we focus on chess, the proposed embedding-based planning method can be generalized to other perfect-information games where state evaluations are learnable. All source code is available at this https URL.
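Selecting a move "within the embedding space" can be pictured as scoring each candidate's resulting position by how well its embedding points toward a favorable region. The sketch below is a one-ply greedy toy version of that idea, not the paper's 6-ply beam search. The two-dimensional embeddings, the move names, and the `favorable_direction` vector are all hypothetical stand-ins for what a trained encoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def pick_move(candidates, favorable_direction):
    """Return the move whose resulting-state embedding best aligns with the target."""
    return max(candidates, key=lambda mv: cosine(candidates[mv], favorable_direction))


# Hypothetical embeddings of the positions reached after each candidate move.
candidates = {"e4": [0.9, 0.1], "h4": [0.1, 0.9], "Nf3": [0.7, 0.5]}
best = pick_move(candidates, favorable_direction=[1.0, 0.0])
```

A beam search would keep the top few candidates at each ply and repeat the scoring on their successors instead of committing to a single greedy choice.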

[CV-72] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

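[Quick read]: This paper addresses backdoor attacks on diffusion models during image editing, a security risk in which specific patterns embedded in the input manipulate the model's behavior. Existing work mostly targets the image generation pipeline, and the few studies on image editing mostly use visible triggers, which are impractical. The key to the proposed solution is to embed invisible triggers into the image editing process via poisoned training data, using off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers, so that the model produces a predefined backdoor target for watermarked inputs while editing clean images normally.

Link: https://arxiv.org/abs/2506.04879
Authors: Yu-Feng Chen, Tzuhsuan Huang, Pin-Yen Chiu, Jun-Cheng Chen
Affiliations: Research Center for Information Technology Innovation, Academia Sinica
Categories: Computer Vision and Pattern Recognition (cs.CV)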

Abstract:Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis of the watermark characteristics in terms of backdoor attacks further supports the effectiveness of our approach. The code is available at this https URL

[CV-73] Geological Field Restoration through the Lens of Image Inpainting

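[Quick read]: This paper addresses the reconstruction of multidimensional geological fields from sparse observations, where the core challenge is recovering missing geological information accurately when data are scarce. The key to its solution is to model the partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. The method combines ideas from tensor completion and geostatistics into a robust optimization framework. Experiments show significant improvements over ordinary kriging across different fractions of observed data.

Link: https://arxiv.org/abs/2506.04869
Authors: Vladislav Trifonov, Ivan Oseledets, Ekaterina Muravleva
Affiliations: The Artificial Intelligence Center; Skoltech; AI4S Center; Sberbank of Russia; AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV)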

Abstract:We present a new viewpoint on reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that the tensor completion method used yields significant improvements in reconstruction accuracy over ordinary kriging for various percentages of observed data.
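The low-rank idea is easiest to see in two dimensions. The sketch below fills the missing entries of a small matrix by fitting a rank-1 model `u[i] * v[j]` to the observed entries with alternating least squares. It is a deliberately simplified matrix analogue of tensor completion. The solver, the rank, and the toy field are assumptions chosen for illustration and do not reproduce the method evaluated in the paper.

```python
def complete_rank1(matrix, iters=200):
    """Fill None entries of `matrix` with a rank-1 fit u[i] * v[j] via ALS."""
    rows, cols = len(matrix), len(matrix[0])
    u, v = [1.0] * rows, [1.0] * cols
    for _ in range(iters):
        # Update each row factor using only that row's observed entries.
        for i in range(rows):
            obs = [j for j in range(cols) if matrix[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            if den:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / den
        # Update each column factor the same way.
        for j in range(cols):
            obs = [i for i in range(rows) if matrix[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            if den:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / den
    # Keep observed values, predict only the missing ones.
    return [[matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
             for j in range(cols)] for i in range(rows)]


# Toy "field": the outer product of [1, 2, 3] and [1, 2, 4] with two gaps.
field = [[1.0, 2.0, None],
         [2.0, 4.0, 8.0],
         [None, 6.0, 12.0]]
filled = complete_rank1(field)
```

Because the toy field is exactly rank 1, the two gaps are recovered as roughly 4 and 3. Real fields are only approximately low-rank, so the rank and any regularization would need to be chosen with care.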

[CV-74] MineInsight: A Multi-sensor Dataset for Humanitarian Demining Robotics in Off-Road Environments

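[Quick read]: This paper addresses the difficulty of reliably validating algorithms for humanitarian demining, which stems from the lack of diverse and realistic datasets. The key to its solution is MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. It integrates dual-view sensor scans from an Unmanned Ground Vehicle and its robotic arm, giving multiple viewpoints that mitigate occlusions and improve spatial awareness, and it includes visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR) images together with estimated target locations, providing a benchmark for evaluating detection algorithms.

Link: https://arxiv.org/abs/2506.04842
Authors: Mario Malizia, Charles Hamesse, Ken Hasselmann, Geert De Cubber, Nikolaos Tsiogkas, Eric Demeester, Rob Haelterman
Affiliations: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication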

Abstract:The use of robotics in humanitarian demining increasingly involves computer vision techniques to improve landmine detection capabilities. However, in the absence of diverse and realistic datasets, the reliable validation of algorithms remains a challenge for the research community. In this paper, we introduce MineInsight, a publicly available multi-sensor, multi-spectral dataset designed for off-road landmine detection. The dataset features 35 different targets (15 landmines and 20 commonly found objects) distributed along three distinct tracks, providing a diverse and realistic testing environment. MineInsight is, to the best of our knowledge, the first dataset to integrate dual-view sensor scans from both an Unmanned Ground Vehicle and its robotic arm, offering multiple viewpoints to mitigate occlusions and improve spatial awareness. It features two LiDARs, as well as images captured at diverse spectral ranges, including visible (RGB, monochrome), visible short-wave infrared (VIS-SWIR), and long-wave infrared (LWIR). Additionally, the dataset comes with an estimation of the location of the targets, offering a benchmark for evaluating detection algorithms. We recorded approximately one hour of data in both daylight and nighttime conditions, resulting in around 38,000 RGB frames, 53,000 VIS-SWIR frames, and 108,000 LWIR frames. MineInsight serves as a benchmark for developing and evaluating landmine detection algorithms. Our dataset is available at this https URL.

[CV-75] OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model

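[Quick read]: This paper addresses the absence, in 3D reasoning segmentation, of a mature framework comparable to those in the 2D domain: current perception systems still rely on explicit human instructions or predefined categories to identify target objects and lack efficient understanding and segmentation of 3D data. The key to the solution is OpenMaskDINO3D, a large language model (LLM) for comprehensive 3D understanding and segmentation. By introducing a SEG token and an object identifier, it generates high-precision 3D segmentation masks and can produce accurate point cloud segmentation results directly from natural language instructions.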

Link: https://arxiv.org/abs/2506.04837
Authors: Kunshen Zhang
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, a LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

[CV-76] DualX-VSR: Dual Axial Spatial × Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

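[Quick read]: This paper addresses the loss of pixel-level precision caused by tokenization and sequential attention mechanisms in real-world video super-resolution (VSR). The key to the solution is the Dual Axial Spatial × Temporal Transformer (DualX-VSR), which introduces a dual axial spatial × temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. This improves the consistency and completeness of spatiotemporal features and removes the dependence on motion compensation, yielding higher-quality video super-resolution.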

Link: https://arxiv.org/abs/2506.04830
Authors: Shuo Cao, Yihao Liu, Xiaohui Li, Yuanting Gao, Yu Zhou, Chao Dong
Affiliations: Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China; Shanghai Jiao Tong University; Tsinghua University; Sun Yat-sen University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures

Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial × Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial × temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.

[CV-77] Fool the Stoplight: Realistic Adversarial Patch Attacks on Traffic Light Detectors

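[Quick read]: This paper studies adversarial attacks on camera-based traffic light detectors in autonomous vehicles. The key to the approach is a threat model in which a printed adversarial patch is placed under each traffic light to disrupt detection by convolutional neural networks (CNNs), together with a corresponding training strategy that achieves successful attacks in universal settings.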

Link: https://arxiv.org/abs/2506.04823
Authors: Svetlana Pavlitska, Jamie Robb, Nikolai Polley, Melih Yazgan, J. Marius Zöllner
Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at IV 2025

Abstract:Realistic adversarial attacks on various camera-based perception tasks of autonomous vehicles have been successfully demonstrated so far. However, only a few works considered attacks on traffic light detectors. This work shows how CNNs for traffic light detection can be attacked with printed patches. We propose a threat model, where each instance of a traffic light is attacked with a patch placed under it, and describe a training strategy. We demonstrate successful adversarial patch attacks in universal settings. Our experiments show realistic targeted red-to-green label-flipping attacks and attacks on pictogram classification. Finally, we perform a real-world evaluation with printed patches and demonstrate attacks in the lab settings with a mobile traffic light for construction sites and in a test area with stationary traffic lights. Our code is available at this https URL.

[CV-78] Spike-TBR: a Noise Resilient Neuromorphic Event Representation

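[Quick read]: This paper addresses the challenge of converting event streams from event cameras into formats compatible with standard computer vision pipelines, particularly the lack of robustness under noise. The key to the solution is Spike-TBR, an event encoding strategy based on Temporal Binary Representation (TBR) that integrates spiking neurons to filter noise. It combines the frame-based advantages of TBR with the noise suppression of spiking neural networks, making the event stream representation more robust.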

Link: https://arxiv.org/abs/2506.04817
Authors: Gabriele Magrini, Federico Becattini, Luca Cultrera, Lorenzo Berlincioni, Pietro Pala, Alberto Del Bimbo
Affiliations: University of Florence; University of Siena
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Event cameras offer significant advantages over traditional frame-based sensors, including higher temporal resolution, lower latency, and higher dynamic range. However, efficiently converting event streams into formats compatible with standard computer vision pipelines remains a challenging problem, particularly in the presence of noise. In this paper, we propose Spike-TBR, a novel event-based encoding strategy based on Temporal Binary Representation (TBR), addressing its vulnerability to noise by integrating spiking neurons. Spike-TBR combines the frame-based advantages of TBR with the noise-filtering capabilities of spiking neural networks, creating a more robust representation of event streams. We evaluate four variants of Spike-TBR, each using different spiking neurons, across multiple datasets, demonstrating superior performance in noise-affected scenarios while improving the results on clean data. Our method bridges the gap between spike-based and frame-based processing, offering a simple noise-resilient solution for event-driven vision applications.
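
To make the encoding idea concrete, here is a minimal single-pixel sketch in Python. It is not the authors' implementation. It assumes that TBR packs N binary time bins into one N-bit value normalized to [0, 1], and that a leaky integrate-and-fire (LIF) neuron gates the bins before packing. The function names and the `leak` and `threshold` values are hypothetical choices for illustration.

```python
def tbr_encode(bits):
    """Pack N binary time bins (oldest first) into one value in [0, 1]."""
    n = len(bits)
    value = 0
    for b in bits:
        value = (value << 1) | b  # binary-to-decimal conversion
    return value / (2 ** n - 1)   # normalization is an assumption


def lif_filter(bits, leak=0.5, threshold=1.5):
    """Gate binary events with a leaky integrate-and-fire neuron.

    The membrane potential decays by `leak` each bin and grows by 1 per
    event; a spike is emitted (and the potential reset) only when the
    potential reaches `threshold`. Both parameters are hypothetical.
    """
    potential = 0.0
    spikes = []
    for b in bits:
        potential = leak * potential + b
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0
        else:
            spikes.append(0)
    return spikes


isolated = [0, 0, 1, 0, 0, 0, 0, 0]   # a single event, likely noise
burst = [0, 1, 1, 1, 0, 0, 0, 0]      # consecutive events, likely signal
print(tbr_encode(isolated), tbr_encode(lif_filter(isolated)))
print(tbr_encode(burst), tbr_encode(lif_filter(burst)))
```

With these settings the isolated event never reaches the threshold and is dropped, while the burst survives as a single spike, which is the kind of noise gating the abstract describes.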

[CV-79] MegaHan97K: A Large-Scale Dataset for Mega-Category Chinese Character Recognition with over 97K Categories

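[Quick read]: This paper addresses mega-category Chinese character recognition, that is, the accurate recognition of a very large number of Chinese characters for cultural heritage preservation and digital applications. Because existing datasets are limited in scale and cannot cover the 87,887 character categories in the GB18030-2022 standard, the problem has not been adequately studied. The key to the solution is MegaHan97K, the first large-scale dataset to fully support GB18030-2022. It covers 97,455 character categories and balances samples across categories through three subsets (handwritten, historical, and synthetic), which mitigates the long-tail distribution problem and opens new challenges and opportunities for future research.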

Link: https://arxiv.org/abs/2506.04807
Authors: Yuyi Zhang, Yongxin Shi, Peirong Zhang, Yixin Zhao, Zhenhua Yang, Lianwen Jin
Affiliations: South China University of Technology; SCUT-Zhuhai Institute of Modern Industrial Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Foundational to the Chinese language and culture, Chinese characters encompass extraordinarily extensive and ever-expanding categories, with the latest Chinese GB18030-2022 standard containing 87,887 categories. The accurate recognition of this vast number of characters, termed mega-category recognition, presents a formidable yet crucial challenge for cultural heritage preservation and digital applications. Despite significant advances in Optical Character Recognition (OCR), mega-category recognition remains unexplored due to the absence of comprehensive datasets, with the largest existing dataset containing merely 16,151 categories. To bridge this critical gap, we introduce MegaHan97K, a mega-category, large-scale dataset covering an unprecedented 97,455 categories of Chinese characters. Our work offers three major contributions: (1) MegaHan97K is the first dataset to fully support the latest GB18030-2022 standard, providing at least six times more categories than existing datasets; (2) It effectively addresses the long-tail distribution problem by providing balanced samples across all categories through its three distinct subsets: handwritten, historical and synthetic subsets; (3) Comprehensive benchmarking experiments reveal new challenges in mega-category scenarios, including increased storage demands, morphologically similar character recognition, and zero-shot learning difficulties, while also unlocking substantial opportunities for future research. To the best of our knowledge, MegaHan97K is likely the dataset with the largest number of classes, not only in the field of OCR but possibly also in the broader domain of pattern recognition. The dataset is available at this https URL.

[CV-80] SupeRANSAC: One RANSAC to Rule Them All

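[Quick read]: This paper addresses robust estimation in computer vision, in particular how to achieve consistently high performance when estimating geometric models (such as homographies, fundamental/essential matrices, and absolute/rigid poses) from outlier-contaminated data. RANSAC and its variants are widely used, but their overall performance often depends more on implementation details and problem-specific optimizations than on improvements to core algorithmic components. The proposed solution is SupeRANSAC, a unified RANSAC pipeline whose key contribution is a systematic analysis and optimization of the techniques that make RANSAC effective, improving accuracy and consistency across vision tasks; for example, it gains 6 AUC points on average for fundamental matrix estimation.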

Link: https://arxiv.org/abs/2506.04803
Authors: Daniel Barath
Affiliations: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust estimation is a cornerstone in computer vision, particularly for tasks like Structure-from-Motion and Simultaneous Localization and Mapping. RANSAC and its variants are the gold standard for estimating geometric models (e.g., homographies, relative/absolute poses) from outlier-contaminated data. Despite RANSAC’s apparent simplicity, achieving consistently high performance across different problems is challenging. While recent research often focuses on improving specific RANSAC components (e.g., sampling, scoring), overall performance is frequently more influenced by the “bells and whistles” (i.e., the implementation details and problem-specific optimizations) within a given library. Popular frameworks like OpenCV and PoseLib demonstrate varying performance, excelling in some tasks but lagging in others. We introduce SupeRANSAC, a novel unified RANSAC pipeline, and provide a detailed analysis of the techniques that make RANSAC effective for specific vision tasks, including homography, fundamental/essential matrix, and absolute/rigid pose estimation. SupeRANSAC is designed for consistent accuracy across these tasks, improving upon the best existing methods by, for example, 6 AUC points on average for fundamental matrix estimation. We demonstrate significant performance improvements over the state-of-the-art on multiple problems and datasets. Code: this https URL
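
For readers unfamiliar with the baseline, the following is a minimal sketch of the textbook RANSAC loop (sample, fit, score, keep the best) applied to 2D line fitting. It is a generic illustration, not SupeRANSAC, and it omits the sampling, scoring, and refinement techniques the paper analyzes. The function name, the vertical-residual threshold `tol`, and the iteration count are assumptions.

```python
import random


def ransac_line(points, iters=200, tol=0.1, seed=0):
    """Fit y = a*x + b to points containing outliers with plain RANSAC."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        # 1. Sample a minimal set: two points define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line cannot be written as y = a*x + b
        # 2. Fit the model to the minimal sample.
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 3. Score the model by counting points within the threshold.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < tol]
        # 4. Keep the model with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


pts = [(float(x), 2.0 * x + 1.0) for x in range(10)]
pts += [(3.0, 40.0), (7.0, -20.0), (1.0, 15.0)]  # gross outliers
model, inliers = ransac_line(pts)
print(model, len(inliers))
```

Production libraries typically add local optimization and a final least-squares refit on the inliers; those are the kind of implementation details the abstract refers to as "bells and whistles".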

[CV-81] LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table CVPR2025

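[Quick read]: This paper addresses the problem that approximate nearest neighbor search (ANNS) can return results that are overly similar to one another, which matters in scenarios where results should be both similar to the query and diverse. The key to the solution is LotusFilter, a post-processing module that precomputes a cutoff table summarizing vectors that are close to each other and, during filtering, greedily looks up this table to delete redundant vectors from the candidates, thereby diversifying the results.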

Link: https://arxiv.org/abs/2506.04790
Authors: Yusuke Matsui
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: CVPR 2025. GitHub: this https URL

Abstract:Approximate nearest neighbor search (ANNS) is an essential building block for applications like RAG but can sometimes yield results that are overly similar to each other. In certain scenarios, search results should be similar to the query and yet diverse. We propose LotusFilter, a post-processing module to diversify ANNS results. We precompute a cutoff table summarizing vectors that are close to each other. During the filtering, LotusFilter greedily looks up the table to delete redundant vectors from the candidates. We demonstrated that the LotusFilter operates fast (0.02 [ms/query]) in settings resembling real-world RAG applications, utilizing features such as OpenAI embeddings. Our code is publicly available at this https URL.
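
The two steps described in the abstract (precompute a cutoff table, then greedily filter) can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the table here is built by brute force with a hypothetical distance threshold `eps`, and the function names are invented for the example.

```python
from math import dist


def build_cutoff_table(vectors, eps):
    """For each vector id, list the ids of other vectors closer than eps."""
    table = {i: [] for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if dist(vectors[i], vectors[j]) < eps:
                table[i].append(j)
                table[j].append(i)
    return table


def diverse_filter(candidates, table, k):
    """Greedily keep up to k candidates, dropping near-duplicates of kept ones.

    `candidates` are vector ids sorted by ascending distance to the query,
    as returned by any ANNS index.
    """
    removed = set()
    kept = []
    for c in candidates:
        if c in removed:
            continue
        kept.append(c)
        removed.update(table[c])  # table lookup, no distance computation
        if len(kept) == k:
            break
    return kept


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
table = build_cutoff_table(vectors, eps=1.0)
print(diverse_filter([0, 1, 2, 3, 4], table, k=3))
```

Because filtering only reads the precomputed table, no distances are computed at query time, which is consistent with the low per-query latency reported in the abstract.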

[CV-82] Object-X: Learning to Reconstruct Multi-Modal 3D Object Representations

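[Quick read]: This paper addresses a challenge in learning multi-modal 3D object representations: existing methods rely on task-specific embeddings that cannot be both decoded into explicit geometry and reused across tasks. The key to the solution is Object-X, a framework that geometrically grounds the captured modalities in a 3D voxel grid and learns an unstructured embedding that fuses voxel information with object attributes, enabling high-quality 3D reconstruction and supporting a range of downstream tasks.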

Link: https://arxiv.org/abs/2506.04789
Authors: Gaia Di Lorenzo, Federico Tombari, Marc Pollefeys, Daniel Barath
Affiliations: ETH Zurich; Google; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning effective multi-modal 3D representations of objects is essential for numerous applications, such as augmented reality and robotics. Existing methods often rely on task-specific embeddings that are tailored either for semantic understanding or geometric reconstruction. As a result, these embeddings typically cannot be decoded into explicit geometry and simultaneously reused across tasks. In this paper, we propose Object-X, a versatile multi-modal object representation framework capable of encoding rich object embeddings (e.g. images, point cloud, text) and decoding them back into detailed geometric and visual reconstructions. Object-X operates by geometrically grounding the captured modalities in a 3D voxel grid and learning an unstructured embedding fusing the information from the voxels with the object attributes. The learned embedding enables 3D Gaussian Splatting-based object reconstruction, while also supporting a range of downstream tasks, including scene alignment, single-image 3D object reconstruction, and localization. Evaluations on two challenging real-world datasets demonstrate that Object-X produces high-fidelity novel-view synthesis comparable to standard 3D Gaussian Splatting, while significantly improving geometric accuracy. Moreover, Object-X achieves competitive performance with specialized methods in scene alignment and localization. Critically, our object-centric descriptors require 3-4 orders of magnitude less storage compared to traditional image- or point cloud-based approaches, establishing Object-X as a scalable and highly practical solution for multi-modal 3D scene representation.

[CV-83] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

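[Quick read]: This paper addresses the challenges of perspective-to-equirectangular (P2E) visual place recognition (VPR) in real-world applications such as mobile robots, in particular how to match query images captured from different viewpoints. The key to the solution is HypeVPR, a hierarchical embedding framework in hyperbolic space. It uses hyperbolic space to model hierarchical feature relationships while preserving distance properties in the feature space, organizes local-to-global feature representations through a hierarchical feature aggregation mechanism, and adopts a coarse-to-fine search strategy to balance retrieval speed and accuracy, which keeps matching robust while significantly reducing retrieval time.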

Link: https://arxiv.org/abs/2506.04764
Authors: Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Affiliations: Yonsei University; Kookmin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When applying Visual Place Recognition (VPR) to real-world mobile robots and similar applications, perspective-to-equirectangular (P2E) formulation naturally emerges as a suitable approach to accommodate diverse query images captured from various viewpoints. In this paper, we introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space, designed to address the unique challenges of P2E VPR. The key idea behind HypeVPR is that visual environments captured by panoramic views exhibit inherent hierarchical structures. To leverage this property, we employ hyperbolic space to represent hierarchical feature relationships and preserve distance properties within the feature space. To achieve this, we propose a hierarchical feature aggregation mechanism that organizes local-to-global feature representations within hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine search strategy, optimally balancing speed and accuracy to ensure robust matching, even between descriptors from different image types. This approach enables HypeVPR to outperform state-of-the-art methods while significantly reducing retrieval time, achieving up to 5x faster retrieval across diverse benchmark datasets. The code and models will be released at this https URL.
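
As background on why hyperbolic space suits hierarchical features, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model HypeVPR uses, so the choice of the Poincaré ball and the function name are assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap (0.05) is far larger near the boundary than near
# the origin, which leaves room for many fine-grained "leaf" embeddings.
print(poincare_distance((0.0, 0.0), (0.05, 0.0)))
print(poincare_distance((0.9, 0.0), (0.95, 0.0)))
```

In such a space, coarse (global) descriptors can sit near the origin and fine (local) descriptors near the boundary, which matches the local-to-global organization the abstract describes.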

[CV-84] Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

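[Quick read]: This paper addresses the fact that, in unsupervised monocular depth learning, the conventional SSIM (structure similarity index measure) loss is used without considering how its individual components and their hyperparameters affect training. The key to the solution is a new form of SSIM that combines the luminance, contrast, and structural similarity components by addition rather than multiplication, which produces smoother gradients and improves unsupervised depth estimation performance.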

Link: https://arxiv.org/abs/2506.04758
Authors: Yijun Cao, Fuya Luo, Yongjie Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures

Abstract: Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both mean absolute error (MAE) and structure similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore the effect of different components in the SSIM function and the corresponding hyperparameters on the training. To address these issues, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed new form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity related components in SSIM. The loss function constructed with this scheme yields smoother gradients and achieves higher performance on unsupervised depth estimation. We conduct extensive experiments to determine the relatively optimal combination of parameters for our new SSIM. Based on the popular MonoDepth approach, the optimized SSIM loss function can remarkably outperform the baseline on the KITTI-2015 outdoor dataset.
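
The difference between the two combination rules can be sketched on a single flattened patch. The luminance, contrast, and structure terms below follow the standard SSIM definition; the equal weights in `ssim_additive` are a placeholder assumption, since the abstract does not state the parameter values the authors found, and real implementations compute these terms over sliding windows rather than one patch.

```python
C1, C2 = 0.01 ** 2, 0.03 ** 2  # usual SSIM stabilizers for intensities in [0, 1]
C3 = C2 / 2


def ssim_components(x, y):
    """Return the luminance, contrast, and structure terms for two patches."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx, sy = vx ** 0.5, vy ** 0.5
    lum = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)
    con = (2 * sx * sy + C2) / (vx + vy + C2)
    struct = (cov + C3) / (sx * sy + C3)
    return lum, con, struct


def ssim_product(x, y):
    """Conventional SSIM: the three terms are multiplied."""
    lum, con, struct = ssim_components(x, y)
    return lum * con * struct


def ssim_additive(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Additive variant: a weighted sum of the terms (weights hypothetical)."""
    lum, con, struct = ssim_components(x, y)
    wl, wc, ws = weights
    return wl * lum + wc * con + ws * struct


p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]  # same mean and variance, reversed structure
print(ssim_product(p, q), ssim_additive(p, q))
```

With a product, one poor term scales the gradient of every other term; with a sum, each term contributes its gradient independently, which is one plausible reading of the "smoother gradients" claim.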

[CV-85] Ontology-based knowledge representation for bone disease diagnosis: a foundation for safe and sustainable medical artificial intelligence systems

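[Quick read]: This paper addresses the insufficient integration of domain knowledge in medical artificial intelligence (AI) systems, with the goal of improving diagnostic reliability. The key to the solution is an ontology-based framework that introduces an ontology-guided hierarchical neural network architecture, an ontology-enhanced visual question answering (VQA) system, and a multimodal deep learning model to integrate and analyze imaging, clinical, and laboratory data while maintaining clinical interpretability. Systematic knowledge digitization, standardized terminology mapping, and modular design are intended to make the framework extensible and reusable.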

Link: https://arxiv.org/abs/2506.04756
Authors: Loan Dao, Ngoc Quoc Ly
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical artificial intelligence (AI) systems frequently lack systematic domain expertise integration, potentially compromising diagnostic reliability. This study presents an ontology-based framework for bone disease diagnosis, developed in collaboration with Ho Chi Minh City Hospital for Traumatology and Orthopedics. The framework introduces three theoretical contributions: (1) a hierarchical neural network architecture guided by bone disease ontology for segmentation-classification tasks, incorporating Visual Language Models (VLMs) through prompts, (2) an ontology-enhanced Visual Question Answering (VQA) system for clinical reasoning, and (3) a multimodal deep learning model that integrates imaging, clinical, and laboratory data through ontological relationships. The methodology maintains clinical interpretability through systematic knowledge digitization, standardized medical terminology mapping, and modular architecture design. The framework demonstrates potential for extension beyond bone diseases through its standardized structure and reusable components. While theoretical foundations are established, experimental validation remains pending due to current dataset and computational resource limitations. Future work will focus on expanding the clinical dataset and conducting comprehensive system validation.

[CV-86] Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

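[Quick read]: This paper addresses the data redundancy and high computational cost that arise because multi-modal large language models (MLLMs) are assumed to need large amounts of training data for complex reasoning tasks. The key to the solution is a data selection paradigm called Reasoning Activation Potential (RAP), which identifies the "cognitive samples" that trigger genuine multi-modal reasoning. It uses two complementary estimators, the Causal Discrepancy Estimator (CDE) and the Attention Confidence Estimator (ACE), to select high-value samples, and introduces a Difficulty-aware Replacement Module (DRM) to increase reasoning complexity, maintaining or improving multi-modal reasoning performance while substantially reducing training data and computational cost.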

Link: https://arxiv.org/abs/2506.04755
Authors: Shenshen Li, Kaiyuan Deng, Lei Wang, Hao Yang, Chong Peng, Peng Yan, Fumin Shen, Heng Tao Shen, Xing Xu
Affiliations: University of Electronic Science and Technology of China; School of Computer Science and Technology, Tongji University; Salesforce AI Research; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample’s potential to stimulate genuine multi-modal reasoning by two complementary estimators: 1) Causal Discrepancy Estimator (CDE) based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%. Our code is available at this https URL.

[CV-87] Physics Informed Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement

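[Quick read]: This paper addresses underwater image enhancement, aiming to improve visual quality through better clarity, contrast, and detail preservation. The key to the solution is a dual-stream architecture that combines the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning, yielding a physics-guided enhancement method that requires no parameter tuning. A dedicated physics estimator simultaneously estimates the transmission map and the spatially varying background light, while a parallel stream extracts entity-level features via capsule clustering, so that semantic structures and fine-grained details are preserved while the constraints of underwater image formation are respected.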

Link: https://arxiv.org/abs/2506.04753
Authors: Niki Martinel, Rita Pucci
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract: We present a novel dual-stream architecture that achieves state-of-the-art underwater image enhancement by explicitly integrating the Jaffe-McGlamery physical model with capsule clustering-based feature representation learning. Our method simultaneously estimates transmission maps and spatially-varying background light through a dedicated physics estimator while extracting entity-level features via capsule clustering in a parallel stream. This physics-guided approach enables parameter-free enhancement that respects underwater formation constraints while preserving semantic structures and fine-grained details. Our approach also features a novel optimization objective ensuring both physical adherence and perceptual quality across multiple spatial frequencies. To validate our approach, we conducted extensive experiments across six challenging benchmarks. Results demonstrate consistent improvements of +0.5 dB PSNR over the best existing methods while requiring only one-third of their computational complexity (FLOPs), or alternatively, more than +1 dB PSNR improvement when compared to methods with similar computational budgets. Code and data will be available at this https URL.
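
For orientation, the roles of the transmission map and the background light can be written with the simplified image formation equation that is commonly derived from the Jaffe-McGlamery model by keeping the direct and backscatter components and neglecting forward scattering. The abstract does not give the paper's exact formulation, so the equation and all symbols below (observed image $I_c$, scene radiance $J_c$, transmission $t_c$, background light $B_c$, color channel $c$, lower bound $t_0$) are assumptions.

```latex
% Simplified underwater image formation (assumed form):
I_c(x) = J_c(x)\, t_c(x) + B_c(x)\,\bigl(1 - t_c(x)\bigr)

% Inverting it with estimated transmission \hat{t}_c and background light
% \hat{B}_c gives the restored radiance; t_0 > 0 avoids division by
% near-zero transmission:
\hat{J}_c(x) = \frac{I_c(x) - \hat{B}_c(x)\,\bigl(1 - \hat{t}_c(x)\bigr)}{\max\bigl(\hat{t}_c(x),\, t_0\bigr)}
```

Under this reading, a physics estimator that outputs $\hat{t}_c$ and a spatially varying $\hat{B}_c$ supplies everything needed to invert the model pixel by pixel.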

[CV-88] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

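[Quick read]: This paper addresses the vulnerability of vision-language models (VLMs) to backdoor attacks, in which an attacker injects imperceptible perturbations (such as local pixel triggers or global semantic phrases) into the training data so that the model generates malicious, attacker-controlled captions for specific inputs. The key to the solution is Semantic Reward Defense (SRD), a defense framework that requires no prior knowledge of the trigger. It uses reinforcement learning: a Deep Q-Network learns a policy for applying discrete perturbations (such as occlusion and color masking) to sensitive image regions in order to disrupt the activation of malicious pathways, and a semantic fidelity score, which jointly evaluates the semantic consistency and linguistic fluency of the output, serves as the reward signal that guides the agent toward robust and faithful captions.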

Link: https://arxiv.org/abs/2506.04743
Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Aishan Liu, Dacheng Tao
Affiliations: Wuhan University; Nanyang Technological University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations (such as local pixel triggers or global semantic phrases) into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

[CV-89] Bridging Annotation Gaps: Transferring Labels to Align Object Detection Datasets

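[Quick read]: This paper addresses the inconsistencies in class semantics and bounding box annotations that arise when combining multiple object detection datasets. The key to the solution is Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from different source datasets into the label space of a target dataset, enabling many-to-one label space transfer across heterogeneous datasets. LAT trains dataset-specific detectors to generate pseudo-labels and combines them with ground-truth annotations through a Privileged Proposal Generator that replaces the region proposal network in two-stage detectors, and it adds a Semantic Feature Fusion module to enrich region features. The result preserves dataset-specific annotation granularity while producing a semantically and spatially aligned representation, addressing both class-level misalignment and bounding box inconsistency.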

Link: https://arxiv.org/abs/2506.04737
Authors: Mikhail Kennerley, Angelica Alives-Reviro, Carola-Bibiane Schönlieb, Robby T. Tan
Affiliations: National University of Singapore; I2R, A*STAR; Tsinghua University; University of Cambridge; ASUS Intelligent Cloud Services
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Combining multiple object detection datasets offers a path to improved generalisation but is hindered by inconsistencies in class semantics and bounding box annotations. Some methods to address this assume shared label taxonomies and address only spatial inconsistencies; others require manual relabelling, or produce a unified label space, which may be unsuitable when a fixed target label space is required. We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. LAT begins by training dataset-specific detectors to generate pseudo-labels, which are then combined with ground-truth annotations via a Privileged Proposal Generator (PPG) that replaces the region proposal network in two-stage detectors. To further refine region features, a Semantic Feature Fusion (SFF) module injects class-aware context and features from overlapping proposals using a confidence-weighted attention mechanism. This pipeline preserves dataset-specific annotation granularity while enabling many-to-one label space transfer across heterogeneous datasets, resulting in a semantically and spatially aligned representation suitable for training a downstream detector. LAT thus jointly addresses both class-level misalignments and bounding box inconsistencies without relying on shared label spaces or manual annotations. Across multiple benchmarks, LAT demonstrates consistent improvements in target-domain detection performance, achieving gains of up to +4.8AP over semi-supervised baselines.

[CV-90] Using In-Context Learning for Automatic Defect Labelling of Display Manufacturing Data

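[Quick read]: This paper addresses the time and labor cost of manual annotation in industrial display panel defect detection by proposing an AI-assisted auto-labeling system. The key to the solution is adopting and enhancing the SegGPT architecture with domain-specific training techniques and introducing a scribble-based annotation mechanism to streamline labeling. With a two-stage training approach, the system achieves significant improvements on industrial display panel datasets, and models trained on the auto-labeled data match the performance of models trained on human-labeled data.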

Link: https://arxiv.org/abs/2506.04717
Authors: Babar Hussain, Qiang Liu, Gang Chen, Bihai She, Dahai Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents an AI-assisted auto-labeling system for display panel defect detection that leverages in-context learning capabilities. We adopt and enhance the SegGPT architecture with several domain-specific training techniques and introduce a scribble-based annotation mechanism to streamline the labeling process. Our two-stage training approach, validated on industrial display panel datasets, demonstrates significant improvements over the baseline model, achieving an average IoU increase of 0.22 and a 14% improvement in recall across multiple product types, while maintaining approximately 60% auto-labeling coverage. Experimental results show that models trained on our auto-labeled data match the performance of those trained on human-labeled data, offering a practical solution for reducing manual annotation efforts in industrial inspection systems.

[CV-91] Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion

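[Quick read]: This paper addresses the prediction of dissection trajectories in Endoscopic Submucosal Dissection (ESD) videos, with the goal of improving surgical skill training and simplifying the learning process. Existing methods fall short in handling the uncertainty of future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. The key to the solution is a new imitation learning framework, Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE), which models expert behavior through a joint state-action distribution, uses a diffusion model for efficient training and sampling, and embeds equivariance to improve generalization to geometric symmetries, thereby improving the accuracy and robustness of trajectory prediction.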

Link: https://arxiv.org/abs/2506.04716
Authors: Hongyu Wang, Yonghao Long, Yueyao Chen, Hon-Chi Yip, Markus Scheppach, Philip Wai-Yan Chiu, Yeung Yam, Helen Mei-Ling Meng, Qi Dou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Endoscopic Submucosal Dissection (ESD) is a well-established technique for removing epithelial lesions. Predicting dissection trajectories in ESD videos offers significant potential for enhancing surgical skill training and simplifying the learning process, yet this area remains underexplored. While imitation learning has shown promise in acquiring skills from expert demonstrations, challenges persist in handling uncertain future movements, learning geometric symmetries, and generalizing to diverse surgical scenarios. To address these, we introduce a novel approach: Implicit Diffusion Policy with Equivariant Representations for Imitation Learning (iDPOE). Our method models expert behavior through a joint state action distribution, capturing the stochastic nature of dissection trajectories and enabling robust visual representation learning across various endoscopic views. By incorporating a diffusion model into policy learning, iDPOE ensures efficient training and sampling, leading to more accurate predictions and better generalization. Additionally, we enhance the model’s ability to generalize to geometric symmetries by embedding equivariance into the learning process. To address state mismatches, we develop a forward-process guided action inference strategy for conditional sampling. Using an ESD video dataset of nearly 2000 clips, experimental results show that our approach surpasses state-of-the-art methods, both explicit and implicit, in trajectory prediction. To the best of our knowledge, this is the first application of imitation learning to surgical skill development for dissection trajectory prediction.

[CV-92] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

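[Quick Read]: This paper addresses visual quality defects in AI-Generated Videos (AIGVs), including noise, blurriness, frame jitter, and low dynamic degree, which severely degrade the viewing experience. The key to the solution is to decompose AIGV visual quality into three dimensions (technical quality, motion quality, and video semantics) and to design a corresponding encoder for each dimension to obtain effective feature representations. In addition, a large language model (LLM) serves as the quality regression module; a multi-modal prompt engineering framework helps the LLM reason about the relationship between the multi-dimensional features and visual quality, and LoRA fine-tuning improves the model's adaptation to the specific task.

Link: https://arxiv.org/abs/2506.04715
Authors: Zelu Qi,Ping Shi,Chaoyang Zhang,Shuqi Wang,Fei Zhao,Da Pan,Zefeng Ying
Affiliations: Communication University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: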

Abstract:The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user’s viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at this https URL.

[CV-93] Robust Few-Shot Vision-Language Model Adaptation

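[Quick Read]: This paper addresses the weak generalization of pretrained vision-language models (VLMs) to out-of-distribution (OOD) test data after few-shot adaptation. By comparing adaptation methods (prompt tuning, linear probing, contrastive finetuning, and full finetuning), the authors find that finetuning with proper hyperparameters significantly outperforms the popular prompt tuning and linear probing methods. Building on this, they propose SRAPF (Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning), a partial finetuning strategy that combines stage-wise retrieval augmentation with adversarial perturbation to balance in-distribution (ID) and OOD accuracy.

Link: https://arxiv.org/abs/2506.04713
Authors: Hanxin Wang,Tian Liu,Shu Kong
Affiliations: University of Macau; Texas A&M University; Institute of Collaborative Innovation project
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL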

Abstract:Pretrained VLMs achieve strong performance on downstream tasks when adapted with just a few labeled examples. As the adapted models inevitably encounter out-of-distribution (OOD) test data that deviates from the in-distribution (ID) task-specific training data, enhancing OOD generalization in few-shot adaptation is critically important. We study robust few-shot VLM adaptation, aiming to increase both ID and OOD accuracy. By comparing different adaptation methods (e.g., prompt tuning, linear probing, contrastive finetuning, and full finetuning), we uncover three key findings: (1) finetuning with proper hyperparameters significantly outperforms the popular VLM adaptation methods prompt tuning and linear probing; (2) visual encoder-only finetuning achieves better efficiency and accuracy than contrastively finetuning both visual and textual encoders; (3) finetuning the top layers of the visual encoder provides the best balance between ID and OOD accuracy. Building on these findings, we propose partial finetuning of the visual encoder empowered with two simple augmentation techniques: (1) retrieval augmentation which retrieves task-relevant data from the VLM’s pretraining dataset to enhance adaptation, and (2) adversarial perturbation which promotes robustness during finetuning. Results show that the former/latter boosts OOD/ID accuracy while slightly sacrificing the ID/OOD accuracy. Yet, perhaps understandably, naively combining the two does not maintain their best OOD/ID accuracy. We address this dilemma with the developed SRAPF, Stage-wise Retrieval Augmentation-based Adversarial Partial Finetuning. SRAPF consists of two stages: (1) partial finetuning the visual encoder using both ID and retrieved data, and (2) adversarial partial finetuning with few-shot ID data. Extensive experiments demonstrate that SRAPF achieves the state-of-the-art ID and OOD accuracy on the ImageNet OOD benchmarks.

[CV-94] Line of Sight: On Linear Representations in VLLMs

[Quick Read]: This paper asks how multimodal language models represent image concepts in their hidden activations. The key to the approach is to study image-concept representations in LlaVA-Next through linearly decodable features, and to train multimodal Sparse Autoencoders (SAEs) to broaden the diversity and interpretability of those features, revealing how representations differ and are shared across modalities.

Link: https://arxiv.org/abs/2506.04706
Authors: Achyuta Rajaram,Sarah Schwettmann,Jacob Andreas,Arthur Conmy
Affiliations: MIT CSAIL; Transluce
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 9 figures

Abstract:Language models can be equipped with multimodal capabilities by fine-tuning on embeddings of visual inputs. But how do such multimodal models represent images in their hidden activations? We explore representations of image concepts within LlaVA-Next, a popular open-source VLLM. We find a diverse set of ImageNet classes represented via linearly decodable features in the residual stream. We show that the features are causal by performing targeted edits on the model output. In order to increase the diversity of the studied linear features, we train multimodal Sparse Autoencoders (SAEs), creating a highly interpretable dictionary of text and image features. We find that although model representations across modalities are quite disjoint, they become increasingly shared in deeper layers.
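
The phrase "linearly decodable" can be made concrete with a toy probe. The sketch below is only an illustration under stated assumptions: the activations are synthetic (not taken from LlaVA-Next), the concept direction along the first axis is hypothetical, and a plain perceptron stands in for whatever probe the authors actually trained.

```python
import random

def train_perceptron(xs, ys, epochs=20):
    """Fit a linear probe w.x + b with the classic perceptron rule."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):              # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                # misclassified: nudge the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                     # a clean pass means convergence
            break
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of samples on the correct side of the hyperplane."""
    hits = sum(1 for x, y in zip(xs, ys)
               if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0)
    return hits / len(xs)

def make_toy_activations(n_per_class=50, dim=4, seed=0):
    """Synthetic 'activations' whose class is encoded along the first axis (assumption)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label in (+1, -1):
        for _ in range(n_per_class):
            x = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
            x[0] += 2.0 * label               # hypothetical concept direction
            xs.append(x)
            ys.append(label)
    return xs, ys

xs, ys = make_toy_activations()
w, b = train_perceptron(xs, ys)
probe_acc = accuracy(w, b, xs, ys)
```

On this toy data a single hyperplane recovers every label, which is the sense in which a feature is linearly decodable. Real residual-stream activations are far higher-dimensional and would normally be evaluated on held-out data.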

[CV-95] HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model

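[Quick Read]: This paper addresses two safety shortcomings of current Vision-Language Models (VLMs). First, existing safety-tuning datasets and benchmarks do not fully cover the harmful content that image-text interactions can produce, leaving models vulnerable to jailbreak attacks in unseen configurations. Second, existing methods rely mainly on data-centric tuning and offer few architectural innovations that intrinsically strengthen safety. The key to the solution is HoliSafe, a holistic safety dataset and benchmark covering all five safe/unsafe image-text combinations, together with SafeLLaVA, a model that adds a learnable safety meta token and a dedicated safety head to provide intrinsic safety and interpretable harmfulness classification.

Link: https://arxiv.org/abs/2506.04704
Authors: Youngwan Lee,Kangsan Kim,Kwanyong Park,Ilcahe Jung,Soojin Jang,Seanie Lee,Yong-Ju Lee,Sung Ju Hwang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL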

Abstract:Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation. We further propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head. The meta token encodes harmful visual cues during training, intrinsically guiding the language model toward safer responses, while the safety head offers interpretable harmfulness classification aligned with refusal rationales. Experiments show that SafeLLaVA, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe benchmark itself reveals critical vulnerabilities in existing models. We hope that HoliSafe and SafeLLaVA will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

[CV-96] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

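[Quick Read]: This paper addresses the reconstruction of accurate radio maps (spatial distributions of signal strength) from sparse measurements, a problem that matters for smart cities, IoT, and wireless network planning. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, which limits generalization. The proposed solution is MARS (Multi-scale Aware Radiomap Super-resolution). Its key idea is to combine convolutional neural networks (CNNs) with Transformers and add multi-scale feature fusion and residual connections, capturing both global and local features, strengthening feature representation across receptive fields, and improving reconstruction accuracy.

Link: https://arxiv.org/abs/2506.04682
Authors: Chuyun Deng,Na Liu,Wei Xie,Lianming Xu,Li Wang
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: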

Abstract:Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.
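
To see why traditional interpolation "lacks environmental awareness", it may help to look at one common baseline. The sketch below uses inverse-distance weighting (IDW), chosen here purely for illustration; the abstract does not say which interpolation methods MARS is compared against, and the function name and sample values are hypothetical.

```python
def idw_interpolate(samples, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from sparse (x, y, value) samples."""
    qx, qy = query
    num = den = 0.0
    for x, y, value in samples:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        if d2 == 0:
            return value                      # query coincides with a measurement
        weight = 1.0 / d2 ** (power / 2)      # weight depends on distance only
        num += weight * value
        den += weight
    return num / den

# Two hypothetical signal-strength measurements in dBm.
measurements = [(0.0, 0.0, -60.0), (2.0, 0.0, -80.0)]
midpoint_estimate = idw_interpolate(measurements, (1.0, 0.0))
```

The weights are a function of distance alone, so a wall or building between a measurement and the query point changes nothing in the estimate. That blind spot is what learned, environment-aware reconstruction tries to address.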

[CV-97] Gen-n-Val: Agentic Image Data Generation and Validation

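[Quick Read]: This paper addresses data scarcity and label noise in computer vision, particularly for object detection and instance segmentation, where existing synthetic data generation methods suffer from multiple objects per mask, inaccurate segmentation, and incorrect category labels. The key to the solution is Gen-n-Val, a new agentic data generation framework that combines Layer Diffusion (LD), Large Language Models (LLMs), and Vision Large Language Models (VLLMs). Two agents do the work: the LD prompt agent optimizes prompts so that LD generates high-quality single-object masks, and the data validation agent filters out low-quality synthetic images. The system prompts of both agents are refined with TextGrad, and image harmonization combines multiple instances within a scene.

Link: https://arxiv.org/abs/2506.04676
Authors: Jing-En Huang,I-Sheng Fang,Tzuhsuan Huang,Chih-Yu Wang,Jun-Cheng Chen
Affiliations: Research Center for Information Technology Innovation, Academia Sinica
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: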

Abstract:Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks while data scarcity and label noise remain significant challenges in computer vision tasks, such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues, such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

[CV-98] Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts

[Quick Read]: This paper addresses the poor performance of Self-Explainable Models (SEMs) in data-scarce settings, where insufficient training samples make Prototypical Concept Learning (PCL) ineffective. The key to the solution is a Few-Shot Prototypical Concept Classification (FSPCC) framework that uses a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, mitigating parametric imbalance and representation misalignment, and combines a multi-level feature preservation strategy with a geometry-aware concept discrimination loss to improve interpretability and reduce concept overlap.

Link: https://arxiv.org/abs/2506.04673
Authors: Zhong Ji,Rongshuai Wei,Jingren Liu,Yanwei Pang,Jungong Han
Affiliations: Tianjin University; Shanghai Artificial Intelligence Laboratory; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures

Abstract:Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable, but they often struggle in data-scarce settings where insufficient training samples lead to suboptimal this http URL address this limitation, we propose a Few-Shot Prototypical Concept Classification (FSPCC) framework that systematically mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Specifically, our approach leverages a Mixture of LoRA Experts (MoLE) for parameter-efficient adaptation, ensuring a balanced allocation of trainable parameters between the backbone and the PCL this http URL, cross-module concept guidance enforces tight alignment between the backbone’s feature representations and the prototypical concept activation this http URL addition, we incorporate a multi-level feature preservation strategy that fuses spatial and semantic cues across various layers, thereby enriching the learned representations and mitigating the challenges posed by limited data this http URL, to enhance interpretability and minimize concept overlap, we introduce a geometry-aware concept discrimination loss that enforces orthogonality among concepts, encouraging more disentangled and transparent decision this http URL results on six popular benchmarks (CUB-200-2011, mini-ImageNet, CIFAR-FS, Stanford Cars, FGVC-Aircraft, and DTD) demonstrate that our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot this http URL findings highlight the efficacy of coupling concept learning with few-shot adaptation to achieve both higher accuracy and clearer model interpretability, paving the way for more transparent visual recognition systems.
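
The abstract mentions a loss that "enforces orthogonality among concepts" without giving its form. One generic way to express such a constraint is to penalize squared cosine similarity between every pair of concept vectors; the sketch below shows that generic penalty only. It is an assumption for illustration, not the paper's geometry-aware loss, and the function names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]

def orthogonality_penalty(concepts):
    """Sum of squared cosine similarities over all distinct concept pairs."""
    unit = [normalize(v) for v in concepts]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cosine * cosine        # zero only when the pair is orthogonal
    return penalty

disentangled = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])
overlapping = orthogonality_penalty([[1.0, 0.0], [1.0, 1.0]])
```

The penalty is zero for mutually orthogonal concepts and grows as concepts align, so minimizing it alongside the classification loss pushes concepts apart.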

[CV-99] Feature-Based Lie Group Transformer for Real-World Applications

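[Quick Read]: This paper addresses the inability of conventional representation learning to account for conditional independence, as well as the limitations of the authors' earlier method on real-world images. The key to the solution is to apply group decomposition from Galois algebra theory to a more realistic setting by combining feature extraction with object segmentation. The method replaces pixel translation with feature translation and formulates object segmentation as grouping features under the same transformation, making the model more effective in practical applications.

Link: https://arxiv.org/abs/2506.04668
Authors: Takayuki Komatsu,Yoshiyuki Ohmura,Kayato Nishitsunoi,Yasuo Kuniyoshi
Affiliations: The University of Tokyo; Next Generation Artificial Intelligence Research Center (AI Center), The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: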

Abstract:The main goal of representation learning is to acquire meaningful representations from real-world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel-to-pixel translation without feature extraction, and can only process low-resolution images with no background, which prevents real-world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real-world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

[CV-100] A Fast Unsupervised Scheme for Polygonal Approximation

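[Quick Read]: This paper addresses fast, unsupervised polygonal approximation of closed digital curves. The key to the solution is a three-phase pipeline (initial segmentation, iterative vertex insertion, and iterative merging) followed by vertex adjustment. Initial segmentation detects sharp turnings with high curvature; iterative vertex insertion adds important low-curvature vertices that the first phase may have missed; iterative merging eliminates redundant vertices that the initial phase may have introduced; and vertex adjustment improves the aesthetic quality of the approximation.

Link: https://arxiv.org/abs/2506.04664
Authors: Bimal Kumar Ray
Affiliations: Unknown
Categories: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
Comments: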

Abstract:This paper proposes a fast and unsupervised scheme for a polygonal approximation of a closed digital curve. It is demonstrated that the approximation scheme is faster than state-of-the-art approximation and is competitive with the same in Rosin’s measure and in its aesthetic aspect. The scheme comprises three phases: initial segmentation, iterative vertex insertion, and iterative merging, followed by vertex adjustment. The initial segmentation is used to detect sharp turnings - the vertices that seemingly have high curvature. It is likely that some important vertices with low curvature might have been missed out at the first phase and so iterative vertex insertion is used to add vertices in a region where the curvature changes slowly but steadily. The initial phase may pick up some undesirable vertices and so merging is used to eliminate the redundant vertices. Finally, vertex adjustment is used to facilitate enhancement in the aesthetic look of the approximation. The quality of the approximations is measured using Rosin’s measure. The robustness of the proposed scheme with respect to geometric transformation is observed.
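
The iterative vertex insertion phase can be pictured with a generic split procedure. The sketch below inserts, for each polygon side, the curve point that deviates most from the chord, and repeats until every point lies within a tolerance. This max-deviation criterion is a stand-in chosen for illustration; the abstract does not state the paper's actual insertion rule, and the names `insert_vertices` and `tol` are hypothetical.

```python
import math

def point_to_chord(p, a, b):
    """Distance from point p to the segment a-b."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def insert_vertices(curve, vertices, tol):
    """Add vertices to a closed curve until no point deviates from its chord by more than tol."""
    n = len(curve)
    verts = sorted(set(vertices))
    changed = True
    while changed:
        changed = False
        new_verts = []
        for k, a in enumerate(verts):
            b = verts[(k + 1) % len(verts)]   # next vertex, wrapping around the curve
            new_verts.append(a)
            span = (b - a) % n or n           # steps from a to b along the curve
            worst, worst_idx = tol, None
            for step in range(1, span):
                i = (a + step) % n
                d = point_to_chord(curve[i], curve[a], curve[b])
                if d > worst:
                    worst, worst_idx = d, i
            if worst_idx is not None:         # this side needs one more vertex
                new_verts.append(worst_idx)
                changed = True
        verts = sorted(set(new_verts))
    return verts

# A square traced by 8 boundary points, starting from two opposite corners.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
corners = insert_vertices(square, [0, 4], tol=0.5)
```

Starting from the two opposite corners at indices 0 and 4, the loop recovers the remaining corners at indices 2 and 6 and then stops, because the edge midpoints already lie on their chords.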

[CV-101] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

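[Quick Read]: This paper addresses the slow inference and high computational demands of diffusion generative models for video generation. The key solution is FPSAttention, a training-aware co-design of FP8 quantization and sparsity for video generation that focuses on optimizing the 3D bi-directional attention mechanism. With a unified 3D tile-wise granularity that supports both quantization and sparsity, a strategy that adapts to the denoising step, and a hardware-friendly kernel implementation, the method significantly accelerates attention operations and end-to-end video generation while maintaining generation quality.

Link: https://arxiv.org/abs/2506.04648
Authors: Akide Liu,Zeyu Zhang,Zhexin Li,Xuehai Bai,Yizeng Han,Jiasheng Tang,Yuanjie Xing,Jichao Wu,Mingyang Yang,Weihua Chen,Jiahao He,Yuanyu He,Fan Wang,Gholamreza Haffari,Bohan Zhuang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL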

Abstract:Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint this http URL introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1’s 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.

[CV-102] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

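[Quick Read]: This paper addresses the fidelity problems that generative models exhibit when performing image super-resolution (SR) under real-world degradations, especially the distortion of textual structures. The key to the solution is TADiSR, a diffusion-based framework that integrates text-aware attention and joint segmentation decoders to recover both natural details and the structural fidelity of text regions in degraded images. The paper also proposes a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content.

Link: https://arxiv.org/abs/2506.04641
Authors: Qiming Hu,Linlong Fan,Yiyan Luo,Yuhang Yu,Xiaojie Guo,Qingnan Fan
Affiliations: vivo Mobile Communication Co. Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: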

Abstract:The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at this https URL.

[CV-103] Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

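[Quick Read]: This paper addresses a gap in current AI evaluation: existing benchmarks mainly assess verbal reasoning and neglect the non-verbal, multi-step visual simulation involved in spatial cognition. The key to the solution is STARE (Spatial Transformations and Reasoning Evaluation), a benchmark of 4K tasks spanning foundational geometric transformations, integrated spatial reasoning, and real-world spatial reasoning. It systematically evaluates multimodal large language models on tasks that require multi-step visual simulation, and so better reflects the spatial cognition that underlies human intelligence.

Link: https://arxiv.org/abs/2506.04633
Authors: Linjie Li,Mahtab Bigverdi,Jiawei Gu,Zixian Ma,Yinuo Yang,Ziang Li,Yejin Choi,Ranjay Krishna
Affiliations: University of Washington; Sun Yat-sen University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: STARE is available at this https URL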

Abstract:Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 4K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks like 3D cube net folding and tangram puzzles that require multi-step visual simulations. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, significantly speeding up (down by 7.5 seconds on average) with intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases like tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

[CV-104] VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection

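[Quick Read]: This paper addresses instance-level incompleteness and adjacent ambiguities in 3D semantic occupancy prediction, which arise because conventional methods formulate the task as dense segmentation and neglect instance-level discriminability. The key to the solution is a training-free Voxel-to-Instance (VoxNT) trick that converts voxel-level class labels into instance-level offset labels. Building on it, the VoxDet framework decouples occupancy prediction into two sub-tasks, offset regression and semantic prediction, to achieve instance-level semantic occupancy prediction through dense detection.

Link: https://arxiv.org/abs/2506.04623
Authors: Wuyang Li,Zhu Yu,Alexandre Alahi
Affiliations: École Polytechnique Fédérale de Lausanne (EPFL); Zhejiang University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL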

Abstract:3D semantic occupancy prediction aims to reconstruct the 3D geometry and semantics of the surrounding environment. With dense voxel labels, prior works typically formulate it as a dense segmentation task, independently classifying each voxel. However, this paradigm neglects critical instance-centric discriminability, leading to instance-level incompleteness and adjacent ambiguities. To address this, we highlight a free lunch of occupancy labels: the voxel-level class label implicitly provides insight at the instance level, which is overlooked by the community. Motivated by this observation, we first introduce a training-free Voxel-to-Instance (VoxNT) trick: a simple yet effective method that freely converts voxel-level class labels into instance-level offset labels. Building on this, we further propose VoxDet, an instance-centric framework that reformulates the voxel-level occupancy prediction as dense object detection by decoupling it into two sub-tasks: offset regression and semantic prediction. Specifically, based on the lifted 3D volume, VoxDet first uses (a) Spatially-decoupled Voxel Encoder to generate disentangled feature volumes for the two sub-tasks, which learn task-specific spatial deformation in the densely projected tri-perceptive space. Then, we deploy (b) Task-decoupled Dense Predictor to address this task via dense detection. Here, we first regress a 4D offset field to estimate distances (6 directions) between voxels and object borders in the voxel space. The regressed offsets are then used to guide the instance-level aggregation in the classification branch, achieving instance-aware prediction. Experiments show that VoxDet can be deployed on both camera and LiDAR input, jointly achieving state-of-the-art results on both benchmarks. VoxDet is not only highly efficient, but also achieves 63.0 IoU on the SemanticKITTI test set, ranking 1st on the online leaderboard.
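
The idea of turning voxel-level class labels into offset labels along six directions can be sketched on a tiny grid. The abstract does not spell out how VoxNT computes these labels, so the version below is an assumption: for each occupied voxel it counts how many same-class voxels lie along +x, -x, +y, -y, +z, and -z before the class changes or the grid ends. The function name and the `empty` label value are hypothetical.

```python
def offsets_to_border(labels, empty=0):
    """Map each occupied voxel to its same-class run length in six axis directions."""
    size_x, size_y, size_z = len(labels), len(labels[0]), len(labels[0][0])
    directions = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    offsets = {}
    for x in range(size_x):
        for y in range(size_y):
            for z in range(size_z):
                cls = labels[x][y][z]
                if cls == empty:
                    continue                  # free space gets no offset label
                dists = []
                for dx, dy, dz in directions:
                    steps = 0
                    nx, ny, nz = x + dx, y + dy, z + dz
                    # Walk until the class changes or the grid boundary is hit.
                    while (0 <= nx < size_x and 0 <= ny < size_y and 0 <= nz < size_z
                           and labels[nx][ny][nz] == cls):
                        steps += 1
                        nx, ny, nz = nx + dx, ny + dy, nz + dz
                    dists.append(steps)
                offsets[(x, y, z)] = tuple(dists)
    return offsets

# A 3x1x1 grid: two voxels of class 1 followed by one empty voxel.
grid = [[[1]], [[1]], [[0]]]
offset_labels = offsets_to_border(grid)
```

No instance annotations are needed here: the class volume alone yields a per-voxel notion of how far the enclosing object extends, which is the "free lunch" the abstract refers to. Touching objects of the same class would be merged by this simple version.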

[CV-105] Deep Learning Reforms Image Matching: A Survey and Outlook

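[Quick Read]: This paper addresses the poor performance of the traditional image matching pipeline in challenging scenarios. That pipeline consists of a detector-descriptor, feature matcher, outlier filter, and geometric estimator. The survey reviews how deep learning has incrementally rebuilt this classical pipeline in two ways: replacing individual steps with learnable alternatives, such as learnable detector-descriptors, outlier filters, and geometric estimators; and merging multiple steps into end-to-end learnable modules, such as middle-end sparse matchers, end-to-end semi-dense/dense matchers, and pose regressors. Through these approaches, deep learning has significantly improved the robustness and accuracy of image matching.

Link: https://arxiv.org/abs/2506.04619
Authors: Shihua Zhang,Zizhuo Li,Kaining Zhang,Yifan Lu,Yuxin Deng,Linfeng Tang,Xingyu Jiang,Jiayi Ma
Affiliations: Wuhan University; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: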

Abstract:Image matching, which establishes correspondences between two-view images to recover 3D structure and camera geometry, serves as a cornerstone in computer vision and underpins a wide range of applications, including visual localization, 3D reconstruction, and simultaneous localization and mapping (SLAM). Traditional pipelines composed of "detector-descriptor, feature matcher, outlier filter, and geometric estimator" falter in challenging scenarios. Recent deep-learning advances have significantly boosted both robustness and accuracy. This survey adopts a unique perspective by comprehensively reviewing how deep learning has incrementally transformed the classical image matching pipeline. Our taxonomy highly aligns with the traditional pipeline in two key aspects: i) the replacement of individual steps in the traditional pipeline with learnable alternatives, including learnable detector-descriptor, outlier filter, and geometric estimator; and ii) the merging of multiple steps into end-to-end learnable modules, encompassing middle-end sparse matcher, end-to-end semi-dense/dense matcher, and pose regressor. We first examine the design principles, advantages, and limitations of both aspects, and then benchmark representative methods on relative pose recovery, homography estimation, and visual localization tasks. Finally, we discuss open challenges and outline promising directions for future research. By systematically categorizing and evaluating deep learning-driven strategies, this survey offers a clear overview of the evolving image matching landscape and highlights key avenues for further innovation.

[CV-106] Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth

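[Quick Read]: This paper addresses reliability in sensor depth enhancement: how to produce dense, artifact-free depth maps while preserving structural consistency and pixel-level accuracy. The key to the solution is a two-stage framework, Perfecting Depth, which uses the stochastic nature of diffusion models to automatically detect unreliable depth regions and then performs deterministic refinement guided by an uncertainty map, improving the reliability of the depth map while preserving geometric cues.

Link: https://arxiv.org/abs/2506.04612
Authors: Jinyoung Jun,Lei Chu,Jiahao Li,Yan Lu,Chang-Su Kim
Affiliations: Korea University; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: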

Abstract:We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.

[CV-107] Exploring bidirectional bounds for minimax-training of Energy-based models

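[Quick Read]: This paper addresses the general difficulty of training Energy-Based Models (EBMs). The key to the solution is the use of bidirectional bounds: when training the EBM, a lower bound is maximized and an upper bound is minimized, which avoids the instabilities caused by minimizing a lower bound. The paper derives four bounds on the log-likelihood from different perspectives and provides algorithms for evaluating them; the approach stabilizes EBM training and yields high-quality density estimation and sample generation.

Link: https://arxiv.org/abs/2506.04609
Authors: Cong Geng,Jia Wang,Li Chen,Zhiyong Gao,Jes Frellsen,Søren Hauberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCV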

Abstract:Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

[CV-108] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

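TL;DR: This paper addresses the difficulty of precise control when generating animatable, fully rigged 3D human avatars from a single photo or text prompt, particularly control over human identity, body shape, and animation readiness, where diffusion-based methods remain limited. The key to its solution is an autonomous verification loop that combines the commonsense reasoning of large vision-language models (VLMs) with off-the-shelf parametric human generators. The agent iteratively adjusts generation parameters until they converge, which gives fine-grained control over facial and body features and lets users refine avatars step by step through natural-language conversation.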

Link: https://arxiv.org/abs/2506.04606
Authors: Alexander Huang-Menders,Xinhang Liu,Andy Xu,Yuyao Zhang,Chi-Keung Tang,Yu-Wing Tai
Affiliations: The Hong Kong University of Science and Technology; Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages

Abstract:SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

[CV-109] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

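TL;DR: This paper asks how scaling laws can be used to compare models and datasets in transferable learning, so as to decide which pre-training procedure is preferable. The key to its solution is deriving, for the first time, full scaling laws for two important language-vision learning procedures, CLIP and MaMMUT, from dense measurements across a wide range of model scales and samples-seen scales. The derived laws predict held-out points with sufficient accuracy and provide evidence that MaMMUT improves more strongly with scale and is more sample-efficient than standard CLIP. The study also shows that the comparison still works when the scaling laws are derived with a constant learning rate schedule, which reduces compute cost, and so offers a reliable method for systematically comparing and improving open foundation models and datasets.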

Link: https://arxiv.org/abs/2506.04598
Authors: Marianna Nezhurina,Tomer Porian,Giovanni Pucceti,Tommie Kerssies,Romain Beaumont,Mehdi Cherti,Jenia Jitsev
Affiliations: LAION; Juelich Supercomputing Center (JSC), Research Center Juelich (FZJ); Institute of Information Science and Technologies “A. Faedo” - CNR Pisa; Eindhoven University of Technology; Open-Ψ (Open-Sci) Collective
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. In Review

Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing one to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either a contrastive-only loss or a contrastive plus captioning text-generative loss. Ensuring sufficient prediction accuracy for held-out points, we use the derived scaling laws to compare both models, obtaining evidence for MaMMUT’s stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen the validity of the comparison, we show scaling laws for various downstream tasks (classification, retrieval, and segmentation) and for different open datasets (DataComp, DFN, and Re-LAION), consistently observing the same trends. We show that the comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws thus provides a means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, and paving the way for systematic comparison and improvement of open foundation models and the datasets used to create them. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves 80.3% zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing the experiments in the paper and the raw experimental data can be found at this https URL.
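
To make the idea of comparing procedures through fitted scaling laws concrete, here is a minimal sketch. It assumes a pure power law, error ≈ a * scale^(-b), fitted in log-log space; the paper's actual functional form is not given in the abstract, and all numbers below are synthetic.

```python
import numpy as np

def fit_power_law(scale, error):
    """Fit error ≈ a * scale**(-b) by least squares in log-log space; return (a, b)."""
    slope, intercept = np.polyfit(np.log(scale), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic (made-up) error measurements for two hypothetical procedures A and B.
samples_seen = np.array([1e7, 1e8, 1e9, 1e10])
error_a = 3.0 * samples_seen ** -0.10
error_b = 6.0 * samples_seen ** -0.13

a_a, b_a = fit_power_law(samples_seen, error_a)
a_b, b_b = fit_power_law(samples_seen, error_b)

# Extrapolate both fits to a larger, unmeasured scale and compare.
target = 1e11
pred_a = a_a * target ** -b_a
pred_b = a_b * target ** -b_b
```

In this toy data, procedure A looks better at the smallest measured scale while the fitted laws predict that B wins at the larger target scale, which is the kind of reversal a comparison at a single reference scale would miss.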

[CV-110] Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning ACL2025

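TL;DR: This paper addresses a shortcoming of existing continual learning setups for embodied intelligence: they focus on executing low-level actions from human commands and neglect the ability to learn high-level planning and multi-level knowledge. The key to its solution is the Hierarchical Embodied Continual Learning Setups (HEC), which divide the agent's continual learning process into two layers, high-level instructions and low-level actions, together with the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. The method recognizes tasks by clustering visual-text embeddings, uses a task-level router and a token-level router to select the appropriate LoRA experts, and applies Singular Value Decomposition (SVD) to the LoRA parameters of prior tasks to mitigate catastrophic forgetting.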

Link: https://arxiv.org/abs/2506.04595
Authors: Ziqi Jia,Anmin Wang,Xiaoyang Qu,Xiaowen Yang,Jianzong Wang
Affiliations: Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Huazhong University of Science and Technology, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Abstract:Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent’s continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.
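
One possible reading of "preserving key components while orthogonally training the remaining parts" is sketched below: take the SVD of a prior task's LoRA update, keep the top-k left singular directions, and project new updates onto their orthogonal complement. This is an assumption for illustration, not the paper's exact procedure; the shapes, the rank, and `k` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical LoRA factors from a prior task: delta_w = B @ A with rank 4.
B = rng.normal(size=(16, 4))
A = rng.normal(size=(4, 12))
delta_w = B @ A

# Keep the top-k left singular directions as the "key components".
k = 2
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
U_key = U[:, :k]

def project_out(update, basis):
    """Remove the part of `update` that lies in the column span of an orthonormal `basis`."""
    return update - basis @ (basis.T @ update)

# A made-up gradient for the new task, restricted to the orthogonal complement.
grad = rng.normal(size=(16, 12))
grad_orth = project_out(grad, U_key)
```

After projection, `U_key.T @ grad_orth` is numerically zero, so a step along `grad_orth` leaves the preserved directions untouched.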

[CV-111] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

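TL;DR: This paper addresses generating and editing 4D video from a single monocular video input, in particular how to fill in content that goes missing because of camera trajectory changes or user edits. The key to its solution is reformulating 4D video creation as a video inpainting task and using a powerful video inpainting foundation model as a generative prior. The model is fine-tuned on composite masked inpainting video data, which combines invisibility masks obtained from depth-based point cloud rendering with user-defined editing masks to build diverse inpainting scenarios, improving generalization and robustness across 4D editing and generation tasks. A self-iterative tuning strategy and a temporal packaging module further improve temporal consistency and generation quality under large camera motion.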

Link: https://arxiv.org/abs/2506.04590
Authors: Yue Ma,Kunyu Feng,Xinhua Zhang,Hongyu Liu,David Junhao Zhang,Jinbo Xing,Yinhan Zhang,Ayden Yang,Zeyu Wang,Qifeng Chen
Affiliations: HKUST; HKUST(GZ); NUS; Tsinghua University; CUHK
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite masks dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model’s generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.

[CV-112] StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

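TL;DR: This paper addresses how to accommodate multiple large models within a constrained memory budget. The core challenge is merging models effectively without ground-truth labels or test samples. The key to its solution is StatsMerging, which uses the singular values from singular value decomposition (SVD) to capture task-specific weight distributions as a proxy for task importance that guides task coefficient prediction, and pairs this with a lightweight learner, StatsMergeLearner, that models the weight distributions of task-specific pre-trained models to improve generalization and adaptation to unseen samples. The method also introduces Task-Specific Teacher Distillation, which avoids costly ground-truth labels when merging vision models with heterogeneous architectures.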

Link: https://arxiv.org/abs/2506.04567
Authors: Ranjith Merugu,Bryan Bo Cao,Shubham Jain
Affiliations: Stony Brook University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 4 figures, 7 tables

Abstract:Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner StatsMergeLearner to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation, (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.

[CV-113] Handle-based Mesh Deformation Guided By Vision Language Model

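TL;DR: This paper addresses three weaknesses of existing 3D mesh deformation methods: low output quality, heavy manual tuning, and dependence on data-intensive training. The key to its solution is a training-free, handle-based mesh deformation method in which a vision-language model (VLM) interprets and manipulates a handle-based interface through prompt engineering, mapping user instructions to handles and their target positions. A multi-view voting scheme reduces the uncertainty of the VLM's predictions, yielding high-quality, highly automated mesh deformation.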

Link: https://arxiv.org/abs/2506.04562
Authors: Xingpeng Sun,Shiyang Jia,Zherong Pan,Kui Wu,Aniket Bera
Affiliations: Purdue University; University of California San Diego; LIGHTSPEED; Purdue University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Mesh deformation is a fundamental tool in 3D content manipulation. Despite extensive prior research, existing approaches often suffer from low output quality, require significant manual tuning, or depend on data-intensive training. To address these limitations, we introduce a training-free, handle-based mesh deformation method. Our core idea is to leverage a Vision-Language Model (VLM) to interpret and manipulate a handle-based interface through prompt engineering. We begin by applying cone singularity detection to identify a sparse set of potential handles. The VLM is then prompted to select both the deformable sub-parts of the mesh and the handles that best align with user instructions. Subsequently, we query the desired deformed positions of the selected handles in screen space. To reduce uncertainty inherent in VLM predictions, we aggregate the results from multiple camera views using a novel multi-view voting scheme. Across a suite of benchmarks, our method produces deformations that align more closely with user intent, as measured by CLIP and GPTEval3D scores, while introducing low distortion, quantified via membrane energy. In summary, our approach is training-free, highly automated, and consistently delivers high-quality mesh deformations.

[CV-114] LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation

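TL;DR: This paper addresses two shortcomings of current lightweight multi-person pose estimation methods built on multi-branch parallel pure-CNN architectures: they struggle to capture the global context needed for semantically complex keypoints, and their intricate, redundant structures cause high latency. The key to its solution is an approximately single-branch lightweight global modeling network (LGM-Pose). It contains a lightweight MobileViM Block built around a Lightweight Attentional Representation Module (LARM), which uses a Non-Parametric Transformation Operation (NPT-Op) to integrate information within and between patches and extract global information. A new Shuffle-Integrated Fusion Module (SFusion) fuses multi-scale information and mitigates the performance degradation often seen in single-branch structures.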

Link: https://arxiv.org/abs/2506.04561
Authors: Biao Guo,Fangmin Guo,Guibo Luo,Xiaonan Luo,Feng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Most current lightweight top-down multi-person pose estimation methods are based on multi-branch parallel pure-CNN architectures, which often struggle to capture the global context required for detecting semantically complex keypoints and are hindered by high latency due to their intricate and redundant structures. In this article, an approximate single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges. In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM), which integrates information within and between patches using the Non-Parametric Transformation Operation (NPT-Op) to extract global information. Additionally, a novel Shuffle-Integrated Fusion Module (SFusion) is introduced to effectively integrate multi-scale information, mitigating the performance degradation often observed in single-branch structures. Experimental evaluations on the COCO and MPII datasets demonstrate that our approach not only reduces the number of parameters compared to existing mainstream lightweight methods but also achieves superior performance and faster processing speeds.

[CV-115] Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

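TL;DR: This paper addresses the difficulty of improving multi-modal large language models (MLLMs) when the underlying reasoner LLM is upgraded, because retraining the vision-language alignment is too costly. The key to its solution is Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO), a reasoning-guided reinforcement learning strategy that aligns the visual extractor's captioning behavior with the reasoning objective. This strengthens visual grounding and yields reasoning-optimized representations without costly multi-modal re-alignment.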

Link: https://arxiv.org/abs/2506.04559
Authors: Yunhao Gou,Kai Chen,Zhili Liu,Lanqing Hong,Xin Jin,Zhenguo Li,James T. Kwok,Yu Zhang
Affiliations: Southern University of Science and Technology; The Hong Kong University of Science and Technology; Huawei Noah’s Ark Lab; Huawei Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in slow-thinking language models (e.g., OpenAI-o1 and DeepSeek-R1) have demonstrated remarkable abilities in complex reasoning tasks by emulating human-like reflective cognition. However, extending such capabilities to multi-modal large language models (MLLMs) remains challenging due to the high cost of retraining vision-language alignments when upgrading the underlying reasoner LLMs. A straightforward solution is to decouple perception from reasoning, i.e., converting visual inputs into language representations (e.g., captions) that are then passed to a powerful text-only reasoner. However, this decoupling introduces a critical challenge: the visual extractor must generate descriptions that are both faithful to the image and informative enough to support accurate downstream reasoning. To address this, we propose Reasoning-Aligned Perceptual Decoupling via Caption Reward Optimization (RACRO) - a reasoning-guided reinforcement learning strategy that aligns the extractor’s captioning behavior with the reasoning objective. By closing the perception-reasoning loop via reward-based optimization, RACRO significantly enhances visual grounding and extracts reasoning-optimized representations. Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance while enabling superior scalability and plug-and-play adaptation to more advanced reasoning LLMs without the necessity for costly multi-modal re-alignment.

[CV-116] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

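TL;DR: This paper addresses how single-image super-resolution (SISR) methods can be improved more directly. Existing approaches usually boost performance indirectly by adding auxiliary structures such as specialized loss functions. The proposed solution is a plug-and-play module of Learnable Separable Kernels (LSKs), which are formally rank-one matrices and directly enhance image frequency components. The key is that decomposing each kernel into orthogonal, mergeable one-dimensional kernels substantially reduces the number of parameters and the computational requirements while improving overall model performance, especially at larger upscaling factors.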

Link: https://arxiv.org/abs/2506.04555
Authors: Heng Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.
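
The saving from rank-one kernels follows from a general fact: filtering with a k x k kernel that is the outer product of a column and a row vector equals filtering with the two 1-D kernels in sequence, and needs 2k instead of k^2 parameters. The sketch below checks this numerically with a naive NumPy filter. It illustrates separability in general, not the LSK module itself, and all arrays are random placeholders.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D cross-correlation with 'valid' padding (no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
col = rng.normal(size=(3, 1))   # vertical 1-D kernel
row = rng.normal(size=(1, 3))   # horizontal 1-D kernel
kernel_2d = col @ row           # rank-one 3x3 kernel

full = conv2d_valid(image, kernel_2d)
separable = conv2d_valid(conv2d_valid(image, col), row)

# Parameter counts: k*k for the dense kernel versus 2*k for the separable pair.
params_2d, params_sep = kernel_2d.size, col.size + row.size
```

For this 3x3 case the count drops from 9 to 6. The relative saving grows with kernel size, since it scales as 2k versus k^2.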

[CV-117] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

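TL;DR: This paper addresses the poor real-time monitoring performance of road crack detection on smart terminal devices, which are limited in energy and image resolution. The key to its solution is EECD-Net, a multi-stage detection approach that uses a Super-Resolution Convolutional Neural Network (SRCNN) to improve image quality, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons to reduce power consumption, and a Gated Attention Transformer (GAT) module to adaptively fuse multi-scale features, improving detection accuracy and robustness.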

Link: https://arxiv.org/abs/2506.04526
Authors: Shuo Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhance image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.

[CV-118] Pruning Everything Everywhere All at Once IJCNN

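TL;DR: This paper addresses the problem that deep learning models grow deeper and computationally prohibitive as tasks become more complex, which hinders their use in real-world and resource-constrained applications. The key to its solution is a new method that prunes different structures of a network (such as neurons and layers) at the same time. Given two candidate subnetworks, one from layer pruning and one from neuron pruning, the method measures each one's representation similarity to the parent network with the Centered Kernel Alignment metric and keeps the more similar one. Repeating this iteratively yields highly sparse models that preserve the original predictive ability.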

Link: https://arxiv.org/abs/2506.04513
Authors: Gustavo Henrique do Nascimento,Ian Pons,Anna Helena Reali Costa,Artur Jordao
Affiliations: Escola Politécnica de Engenharia, Universidade de São Paulo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in International Joint Conference on Neural Networks (IJCNN), 2025

Abstract:Deep learning stands as the modern paradigm for solving cognitive tasks. However, as the problem complexity increases, models grow deeper and computationally prohibitive, hindering advancements in real-world and resource-constrained applications. Extensive studies reveal that pruning structures in these models efficiently reduces model complexity and improves computational efficiency. Successful strategies in this sphere include removing neurons (i.e., filters, heads) or layers, but not both together. Therefore, simultaneously pruning different structures remains an open problem. To fill this gap and leverage the benefits of eliminating neurons and layers at once, we propose a new method capable of pruning different structures within a model as follows. Given two candidate subnetworks (pruned models), one from layer pruning and the other from neuron pruning, our method decides which to choose by selecting the one with the highest representation similarity to its parent (the network that generates the subnetworks) using the Centered Kernel Alignment metric. Iteratively repeating this process provides highly sparse models that preserve the original predictive ability. Throughout extensive experiments on standard architectures and benchmarks, we confirm the effectiveness of our approach and show that it outperforms state-of-the-art layer and filter pruning techniques. At high levels of Floating Point Operations reduction, most state-of-the-art methods degrade accuracy, whereas our approach either improves it or experiences only a minimal drop. Notably, on the popular ResNet56 and ResNet110, we achieve a milestone of 86.37% and 95.82% FLOPs reduction. Besides, our pruned models obtain robustness to adversarial and out-of-distribution samples and take an important step towards GreenAI, reducing carbon emissions by up to 83.31%. Overall, we believe our work opens a new chapter in pruning.
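
The selection step described above can be sketched with linear Centered Kernel Alignment. The abstract does not say which CKA variant the authors use, so the linear form here is an assumption, and the "parent" and "candidate" representations are random placeholders rather than real network activations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) representation matrices."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
parent = rng.normal(size=(64, 10))                           # parent network features
candidate_layer = parent + 0.1 * rng.normal(size=(64, 10))   # close to the parent
candidate_neuron = rng.normal(size=(64, 10))                 # unrelated to the parent

scores = {
    "layer": linear_cka(parent, candidate_layer),
    "neuron": linear_cka(parent, candidate_neuron),
}
# Keep the candidate whose representation is most similar to the parent's.
chosen = max(scores, key=scores.get)
```

In an iterative scheme, the chosen subnetwork would become the parent for the next round of candidate generation.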

[CV-119] AuthGuard: Generalizable Deepfake Detection via Language Guidance

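TL;DR: This paper addresses the difficulty existing deepfake detection techniques have in keeping up with constantly evolving, previously unseen forgery methods. The core challenge is that these techniques rely on statistical artifacts learned during training, which may not be representative of samples from new generation methods encountered at test time. The key to its solution is language guidance, which improves generalization by combining human-like commonsense reasoning (such as recognizing logical inconsistencies and perceptual anomalies) with statistical cues. Specifically, the authors train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist multi-modal LLMs (MLLMs) using few-shot prompting, so that the encoder extracts both language-describable commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions.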

Link: https://arxiv.org/abs/2506.04501
Authors: Guangyu Shen,Zhihua Li,Xiang Xu,Tianchen Zhao,Zheng Zhang,Dongsheng An,Zhuowen Tu,Yifan Xing,Qin Zhang
Affiliations: Purdue University; AWS AI Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Existing deepfake detection techniques struggle to keep up with ever-evolving, novel, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning (such as recognizing logical inconsistencies and perceptual anomalies) alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.
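
For readers unfamiliar with image-text contrastive learning, the sketch below implements the common symmetric InfoNCE objective in NumPy. It is a generic CLIP-style loss under assumed settings (temperature 0.07, toy embeddings), not AuthGuard's training code, and it omits the data uncertainty learning the abstract mentions.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matched pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))

# Toy embeddings: perfectly aligned pairs versus deliberately shuffled pairs.
aligned = np.eye(4)
shuffled = np.roll(aligned, 1, axis=0)
loss_matched = contrastive_loss(aligned, aligned)
loss_mismatched = contrastive_loss(aligned, shuffled)
```

The loss is near zero when each image embedding matches its own caption embedding and large when the pairs are shuffled.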

[CV-120] FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

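TL;DR: This paper addresses the difficulty of deploying existing LiDAR 3D object detection methods on resource-constrained edge devices, which stems mainly from the irregular memory access patterns and high computational cost of sparse convolutions and/or transformers. The key to its solution is FALO, a hardware-friendly approach that arranges the sparse 3D voxels into a 1D sequence based on their coordinates and proximity and processes the sequence with ConvDotMix blocks, which consist of large-kernel convolutions, Hadamard products, and linear layers. These blocks provide sufficient mixing in both the spatial and embedding dimensions and introduce higher-order nonlinear interactions among spatial features, while implicit grouping balances the tensor dimensions for more efficient inference.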

Link: https://arxiv.org/abs/2506.04499
Authors: Shizhong Han,Hsin-Pai Cheng,Hong Cai,Jihad Masri,Soyeb Nagori,Fatih Porikli
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Existing LiDAR 3D object detection methods predominantly rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms, and the proposed FALO can be readily deployed on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6x to 9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).
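
The abstract does not specify how voxels are ordered into a 1D sequence. As one simple possibility, the sketch below sorts occupied voxels lexicographically by their integer coordinates so that a dense 1D operator can process them. The ordering rule, coordinates, and features are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical integer (x, y, z) coordinates of occupied voxels and their features.
coords = np.array([[3, 1, 0],
                   [0, 0, 1],
                   [3, 0, 0],
                   [0, 2, 0],
                   [1, 1, 1]])
features = np.arange(len(coords) * 4, dtype=float).reshape(len(coords), 4)

# np.lexsort treats the LAST key as the primary one, so passing (z, y, x)
# sorts by x first, then y, then z.
order = np.lexsort((coords[:, 2], coords[:, 1], coords[:, 0]))
sequence_coords = coords[order]
sequence_features = features[order]   # (num_voxels, channels) sequence for 1-D layers
```

A plain lexicographic order keeps neighbors along one axis adjacent but can separate voxels that are close along the others. Space-filling curves are a common alternative when locality matters more.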

[CV-121] Towards Large-Scale Pose-Invariant Face Recognition Using Face Defrontalization

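TL;DR: This paper addresses face recognition under extreme head poses, that is, pose-invariant face recognition. Existing approaches rely on sophisticated techniques such as face frontalization and various facial feature extraction model architectures, which are of limited practicality in real-life settings and are typically evaluated on small scientific datasets. The study proposes the inverse of face frontalization, called face defrontalization, to augment the training dataset of the facial feature extraction model. The method has two parts: (1) training an adapted face defrontalization FFWM model on a preprocessed dataset of frontal-profile pairs; (2) training a ResNet-50 facial feature extraction model with ArcFace loss on a large-scale dataset of raw and randomly defrontalized images. Experiments show that the method outperforms existing approaches on three larger open-access datasets but not on the small Multi-PIE dataset, suggesting that some current methods may be overfitted to small datasets.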

Link: https://arxiv.org/abs/2506.04496
Authors: Patrik Mesec,Alan Jović
Affiliations: Aircash; University of Zagreb Faculty of Electrical Engineering and Computing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures, 4 tables

Abstract:Face recognition under extreme head poses is a challenging task. Ideally, a face recognition system should perform well across different head poses, which is known as pose-invariant face recognition. To achieve pose invariance, current approaches rely on sophisticated methods, such as face frontalization and various facial feature extraction model architectures. However, these methods are somewhat impractical in real-life settings and are typically evaluated on small scientific datasets, such as Multi-PIE. In this work, we propose the inverse method of face frontalization, called face defrontalization, to augment the training dataset of facial feature extraction model. The method does not introduce any time overhead during the inference step. The method is composed of: 1) training an adapted face defrontalization FFWM model on a frontal-profile pairs dataset, which has been preprocessed using our proposed face alignment method; 2) training a ResNet-50 facial feature extraction model based on ArcFace loss on a raw and randomly defrontalized large-scale dataset, where defrontalization was performed with our previously trained face defrontalization model. Our method was compared with the existing approaches on four open-access datasets: LFW, AgeDB, CFP, and Multi-PIE. Defrontalization shows improved results compared to models without defrontalization, while the proposed adjustments show clear superiority over the state-of-the-art face frontalization FFWM method on three larger open-access datasets, but not on the small Multi-PIE dataset for extreme poses (75 and 90 degrees). The results suggest that at least some of the current methods may be overfitted to small datasets.
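
Since the feature extractor is trained with ArcFace loss, a minimal sketch of the additive angular margin may help: the target-class logit becomes s * cos(theta + m), and the other logits stay at s * cos(theta). The scale and margin values below are commonly used defaults and are assumptions here; the inputs are toy vectors. Practical implementations also handle the case theta + m > pi, which this sketch ignores.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, scale=64.0, margin=0.5):
    """Return ArcFace logits: the angular margin is added only to each sample's target class."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    theta[rows, labels] += margin      # penalize the target-class angle
    return scale * np.cos(theta)

# Toy example: one 2-D embedding that coincides with the class-0 center.
embeddings = np.array([[1.0, 0.0]])
class_weights = np.array([[1.0, 0.0],
                          [0.0, 1.0]])
labels = np.array([0])
logits = arcface_logits(embeddings, class_weights, labels)
```

These logits would then be passed to an ordinary softmax cross-entropy loss. The margin forces the target angle to be smaller than it would otherwise need to be.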

[CV-122] Photoreal Scene Reconstruction from an Egocentric Device SIGGRAPH

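TL;DR: This paper addresses the challenges of pixel-accurate photorealistic scene reconstruction in high dynamic range with egocentric devices. Existing methods typically rely on frame-rate 6DoF poses estimated by the device's visual-inertial odometry system, which may neglect details that are crucial for pixel-accurate reconstruction. The solution has two key parts. First, visual-inertial bundle adjustment (VIBA) is used to calibrate the precise timestamps and motion of the rolling-shutter RGB camera in a high-frequency trajectory format, ensuring accurate calibration of the rolling-shutter camera's physical properties. Second, a physical image formation model is incorporated into Gaussian Splatting, which handles sensor characteristics such as the rolling-shutter effect of RGB cameras and the dynamic range measured by the sensors.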

Link: https://arxiv.org/abs/2506.04444
Authors: Zhaoyang Lv,Maurizio Monge,Ka Chen,Yufeng Zhu,Michael Goesele,Jakob Engel,Zhao Dong,Richard Newcombe
Affiliations: Reality Labs Research, Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: Paper accepted to SIGGRAPH Conference Paper 2025

点击查看摘要

Abstract:In this paper, we investigate the challenges of using egocentric devices to photorealistically reconstruct scenes in high dynamic range. Existing methodologies typically assume frame-rate 6DoF poses estimated by the device’s visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. Firstly, in contrast to mainstream work that treats the RGB camera as a global-shutter frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB sensing camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Secondly, we incorporate a physical image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by sensors. Our proposed formulation is applicable to the widely used variants of the Gaussian Splats representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at this http URL
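
To illustrate why a high-frequency trajectory matters for a rolling-shutter camera, here is a minimal Python sketch that gives each image row its own capture time and looks up the camera position at that time. It is not the paper's implementation: the function names, the constant readout time, and the linear interpolation are assumptions made for illustration, and rotation is left out for brevity.

```python
from bisect import bisect_right

def row_timestamp(frame_start, readout_time, row, num_rows):
    """Capture time of one image row, assuming rows are read out at a constant rate."""
    return frame_start + readout_time * row / num_rows

def interpolate_position(times, positions, t):
    """Linearly interpolate a 3D position from a sorted, high-frequency trajectory.

    `times` and `positions` are hypothetical inputs: sample timestamps and the
    matching (x, y, z) camera positions. Rotation is intentionally omitted.
    """
    i = bisect_right(times, t)
    i = min(max(i, 1), len(times) - 1)  # clamp so that i-1 and i are valid samples
    t0, t1 = times[i - 1], times[i]
    alpha = (t - t0) / (t1 - t0)
    return tuple(p0 + alpha * (p1 - p0)
                 for p0, p1 in zip(positions[i - 1], positions[i]))

if __name__ == "__main__":
    times = [0.0, 0.001, 0.002]
    positions = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.01, 0.0)]
    for row in (0, 240, 479):
        t = row_timestamp(0.0, 0.002, row, 480)
        print(row, t, interpolate_position(times, positions, t))
```

A global-shutter model would use a single pose for all rows; in this sketch the first and last rows of a frame can see noticeably different camera positions when the device moves quickly.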

[CV-123] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation CVPR2025

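Quick read: This paper aims to close the gap in generation speed and image quality between traditional autoregressive image models and diffusion models, while overcoming the limitations of existing Visual Auto-Regressive modeling (VAR) in image quality, sequence-length scaling, and sampling-schedule flexibility. The key to the solution is Hierarchical Masked Auto-Regressive modeling (HMAR), which recasts next-scale prediction as a Markovian process so that the prediction of each resolution scale depends only on the tokens of its immediate predecessor rather than on the tokens of all preceding scales. HMAR also uses a controllable multi-step masked generation mechanism that generates a subset of the tokens at each step, improving both generation efficiency and image quality.

Link: https://arxiv.org/abs/2506.04421
Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, David W. Romero
Affiliations: Stanford University; NVIDIA; CUHK (The Chinese University of Hong Kong); HKU (The University of Hong Kong); NTU (Nanyang Technological University); UCSD (University of California, San Diego); Together AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025. Project Page: this https URL

Click to view abstract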

Abstract:Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
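
To make the difference in conditioning concrete, the following Python sketch counts how many context tokens each scale sees under the two schemes described in the abstract. The scale side lengths are hypothetical values chosen for illustration, not the paper's tokenizer schedule, and the functions are not taken from the authors' code.

```python
# Hypothetical side lengths of the square token maps, coarse to fine.
SCALES = [1, 2, 4, 8, 16]

def tokens_per_scale(scales):
    """Number of tokens in each square token map."""
    return [s * s for s in scales]

def var_context(scales):
    """VAR-style conditioning: each scale sees the tokens of all previous scales."""
    counts = tokens_per_scale(scales)
    return [sum(counts[:i]) for i in range(len(counts))]

def markov_context(scales):
    """Markovian conditioning: each scale sees only its immediate predecessor."""
    counts = tokens_per_scale(scales)
    return [0] + counts[:-1]

if __name__ == "__main__":
    for s, full, prev in zip(SCALES, var_context(SCALES), markov_context(SCALES)):
        print(f"scale {s}x{s}: all-previous context = {full}, predecessor-only context = {prev}")
```

Under these assumed scales, the gap between the two context sizes widens as more scales are added, which is the intuition behind the shorter sequences the abstract attributes to the Markovian formulation.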

[CV-124] Normalize Filters! Classical Wisdom for Deep Vision

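Quick read: This paper addresses the lack of normalization in convolutional filters learned by deep networks, which distorts their responses and degrades performance when images undergo atmospheric transfer. The key to the solution is filter normalization followed by learnable scaling and shifting, akin to batch normalization, which makes the filters atmosphere-equivariant and enables co-domain symmetry. By bringing classical filtering principles into deep learning, the method significantly improves performance on benchmarks with artificial and natural intensity variation.

Link: https://arxiv.org/abs/2506.04401
Authors: Gustavo Perez, Stella X. Yu
Affiliations: University of California, Berkeley; University of Michigan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract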

Abstract:Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.
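
The classical principle the abstract appeals to can be checked numerically. The Python sketch below models an intensity change as a gain and an offset (a simplification assumed here, not the paper's definition of atmospheric transfer) and shows how a unit-sum averaging filter and a zero-sum differencing filter respond. The two normalization helpers are illustrative choices and are not claimed to be the paper's exact scheme.

```python
def correlate(signal, kernel):
    """Valid-mode 1D correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def normalize_average(kernel):
    """Scale a non-negative kernel so that its weights sum to 1."""
    total = sum(kernel)
    return [w / total for w in kernel]

def normalize_difference(kernel):
    """Subtract the mean so the weights sum to 0, then scale to unit L1 norm."""
    mean = sum(kernel) / len(kernel)
    centered = [w - mean for w in kernel]
    norm = sum(abs(w) for w in centered)
    return [w / norm for w in centered]

if __name__ == "__main__":
    signal = [0.1, 0.4, 0.3, 0.9, 0.7]
    gain, offset = 2.0, 0.5  # hypothetical intensity change
    shifted = [gain * x + offset for x in signal]

    avg = normalize_average([1.0, 2.0, 1.0])
    diff = normalize_difference([-1.0, 0.5, 2.0])

    # Unit-sum filter: the response follows the same gain and offset as the input.
    print(correlate(signal, avg), correlate(shifted, avg))
    # Zero-sum filter: the offset cancels and only the gain remains.
    print(correlate(signal, diff), correlate(shifted, diff))
```

An unnormalized kernel whose weights sum to neither 0 nor 1 would turn the offset into an arbitrary bias in the response, which is the kind of distortion the abstract describes.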

[CV-125] Is Perturbation-Based Image Protection Disruptive to Image Editing? ICIP2025

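Quick read: This paper examines the risks of diffusion models in image generation, in particular their misuse to spread misinformation and plagiarize copyrighted material. Current image protection methods add imperceptible perturbations to images to obstruct diffusion-based editing, but the experiments show that these methods do not fully achieve that goal. The key finding is that the perturbations may instead strengthen the association between the image and the given text prompt during generation, leading to unexpectedly better edits, so perturbation-based protection may be insufficient against diffusion-based image editing.

Link: https://arxiv.org/abs/2506.04394
Authors: Qiuyu Tang, Bonor Ayambem, Mooi Choo Chuah, Aparna Bharati
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 8 figures, accepted by ICIP 2025

Click to view abstract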

Abstract:The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

[CV-126] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization CVPR

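Quick read: This paper asks how encoding models built on deep neural networks (DNNs) can give deeper insight into which specific features drive neural responses in the brain's visual system; conventional approaches offer limited insight into these features. The key to the solution is activation maximization, a technique originally developed to interpret vision DNNs and extended here to DNN-based encoding models. Activations are extracted from multiple layers of a pretrained Inception V3 network and adaptively downsampled, and linear regression then predicts fMRI responses, giving a fully image-computable model of brain responses. Activation maximization is then used to generate images optimized for individual cortical voxels, whose visual characteristics correspond to known selectivity, and the method is extended to whole regions of interest (ROIs) and validated across visual areas and subjects. The approach avoids the reliance on natively generative models and enables flexible characterization and modulation of responses in the human visual system.

Link: https://arxiv.org/abs/2506.04379
Authors: Matthew W. Shinkle, Mark D. Lescroart
Affiliations: University of Nevada, Reno
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Accepted to the Mechanistic Interpretability for Vision (MIV) Workshop at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR)

Click to view abstract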

Abstract:Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization – a technique designed to interpret vision DNNs – can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.

[CV-127] Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

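Quick read: This paper addresses automatic recognition and classification of Bangla Sign Language (BdSL) within Sign Language Recognition (SLR), with the goal of improving communication accessibility for the hearing-impaired community. The key to the solution is fine-tuning state-of-the-art video transformer architectures (VideoMAE, ViViT, and TimeSformer) and evaluating them on BdSL datasets of different sizes, the small-scale BdSLW60 and the large-scale BdSLW401, combined with data augmentation and cross-validation to improve generalization and robustness.

Link: https://arxiv.org/abs/2506.04367
Authors: Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan
Affiliations: Islamic University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 8 figures, 6 tables

Click to view abstract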

Abstract:Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures – VideoMAE, ViViT, and TimeSformer – on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 – demonstrating strong potential for scalable and accurate BdSL recognition.

[CV-128] Puck Localization Using Contextual Cues

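Quick read: This paper addresses puck detection in ice hockey video, which is difficult because of the puck's small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies caused by changing camera zoom and viewpoint. The key to the solution is Puck Localization Using Contextual Cues (PLUCC), which uses player pose and position as priors and combines multi-scale feature extraction with a gated decoding mechanism for scale-aware, context-driven single-frame puck detection. The paper also proposes Rink Space Localization Error (RSLE), a scale-invariant metric that removes perspective bias from rink-space evaluation.

Link: https://arxiv.org/abs/2506.04365
Authors: Liam Salass, Jerrin Bright, Amir Nazemi, Yuhao Chen, John Zelek, David Clausi
Affiliations: University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract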

Abstract:Puck detection in ice hockey broadcast videos poses significant challenges due to the puck’s small size, frequent occlusions, motion blur, broadcast artifacts, and scale inconsistencies due to varying camera zoom and broadcast camera viewpoints. Prior works focus on appearance-based or motion-based cues of the puck without explicitly modelling the cues derived from player behaviour. Players consistently turn their bodies and direct their gaze toward the puck. Motivated by this strong contextual cue, we propose Puck Localization Using Contextual Cues (PLUCC), a novel approach for scale-aware and context-driven single-frame puck detections. PLUCC consists of three components: (a) a contextual encoder, which utilizes player orientations and positioning as helpful priors; (b) a feature pyramid encoder, which extracts multiscale features from the dual encoders; and (c) a gating decoder that combines latent features with a channel gating mechanism. For evaluation, in addition to standard average precision, we propose Rink Space Localization Error (RSLE), a scale-invariant homography-based metric for removing perspective bias from rink space evaluation. The experimental results of PLUCC on the PuckDataset dataset demonstrated state-of-the-art detection performance, surpassing previous baseline methods by an average precision improvement of 12.2% and RSLE average precision of 25%. Our research demonstrates the critical role of contextual understanding in improving puck detection performance, with broad implications for automated sports analysis.
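
The idea of measuring localization error in rink space rather than pixel space can be sketched as follows. The abstract does not give the exact RSLE formula, so this Python example only shows the general homography step it is described as relying on; the function names and the example matrix are hypothetical.

```python
import math

def apply_homography(H, point):
    """Map an image point (x, y) to the rink plane with a 3x3 homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)

def rink_space_error(H, predicted_px, ground_truth_px):
    """Euclidean distance between prediction and ground truth after projection.

    This is an illustrative stand-in, not the paper's RSLE definition.
    """
    px, py = apply_homography(H, predicted_px)
    gx, gy = apply_homography(H, ground_truth_px)
    return math.hypot(px - gx, py - gy)

if __name__ == "__main__":
    # Hypothetical homography: 1 pixel corresponds to 0.1 rink units.
    H = [[0.1, 0.0, 0.0],
         [0.0, 0.1, 0.0],
         [0.0, 0.0, 1.0]]
    print(rink_space_error(H, (100.0, 200.0), (103.0, 204.0)))
```

Because the error is computed after projection, the same pixel offset counts for more when the camera is zoomed out and less when it is zoomed in, which is the perspective bias a pixel-space metric would not account for.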

[CV-129] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

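Quick read: This paper addresses the limited ability of current AI models, especially generative models, to learn world models and carry out procedural planning in complex environments. The key to the solution is the WorldPrediction benchmark, which evaluates world modeling and planning by asking models to identify the correct action or action sequence given the initial and final world states. The task design emphasizes temporal and semantic abstraction, uses "action equivalents" to prevent models from relying on low-level scene-continuity cues, and is grounded in a partially observable semi-MDP framework to make the evaluation reliable and robust.

Link: https://arxiv.org/abs/2506.04363
Authors: Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract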

Abstract:Humans are known to have an internal “world model” that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and to compare different hypotheses thoroughly. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide “action equivalents” (identical actions observed in different contexts) as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP whereas humans are able to solve both tasks perfectly.

[CV-130] Learning Smooth State-Dependent Traversability from Dense Point Clouds

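Quick read: This paper addresses the fact that terrain traversability in unstructured environments depends on the vehicle's state, in particular that some obstacles are traversable only from certain angles of approach. Conventional methods learn this interaction by feeding the approach angle to the model as an input, which requires a large and diverse training dataset and is computationally inefficient during planning because of repeated model inference. The key to the solution is SPARTA, which estimates approach-angle-conditioned traversability from point clouds by predicting a smooth analytical function over the 1-Sphere that gives the risk distribution for any angle of approach. The function is composed of Fourier basis functions, which generalize well and are cheap to evaluate.

Link: https://arxiv.org/abs/2506.04362
Authors: Zihao Dong, Alan Papalia, Leonard Jung, Alenna Spiro, Philip R. Osteen, Christa S. Robison, Michael Everett
Affiliations: Northeastern University; DEVCOM Army Research Laboratory
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 13 figures

Click to view abstract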

Abstract:A key open challenge in off-road autonomy is that the traversability of terrain often depends on the vehicle’s state. In particular, some obstacles are only traversable from some orientations. However, learning this interaction by encoding the angle of approach as a model input demands a large and diverse training dataset and is computationally inefficient during planning due to repeated model inference. To address these challenges, we present SPARTA, a method for estimating approach angle conditioned traversability from point clouds. Specifically, we impose geometric structure into our network by outputting a smooth analytical function over the 1-Sphere that predicts risk distribution for any angle of approach with minimal overhead and can be reused for subsequent queries. The function is composed of Fourier basis functions, which have important advantages for generalization due to their periodic nature and smoothness. We demonstrate SPARTA both in a high-fidelity simulation platform, where our model achieves a 91% success rate crossing a 40m boulder field (compared to 73% for the baseline), and on hardware, illustrating the generalization ability of the model to real-world settings.
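
A function over the 1-Sphere built from Fourier basis functions can be evaluated very cheaply once its coefficients are known. The Python sketch below assumes a network has already produced a bias term plus cosine and sine coefficients for one terrain cell; the coefficient values, the scalar risk output, and the function name are all hypothetical and are not taken from SPARTA.

```python
import math

def fourier_risk(bias, cos_coeffs, sin_coeffs, theta):
    """Evaluate bias + sum_k (a_k cos(k*theta) + b_k sin(k*theta)) for k = 1..K."""
    value = bias
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        value += a * math.cos(k * theta) + b * math.sin(k * theta)
    return value

if __name__ == "__main__":
    # Hypothetical coefficients for one terrain cell (predicted once, reused per query).
    bias, cos_coeffs, sin_coeffs = 0.5, [0.3, 0.05], [0.0, 0.1]
    for degrees in (0, 90, 180, 270):
        theta = math.radians(degrees)
        print(degrees, round(fourier_risk(bias, cos_coeffs, sin_coeffs, theta), 4))
```

Querying a new approach angle only costs a handful of sine and cosine evaluations, with no further network inference, and the output is periodic in the angle by construction.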

[CV-131] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

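Quick read: This paper addresses the challenge of generating accurate, high-quality 3D humans from text prompts, in particular fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The key to the solution is a weakly supervised pipeline that first uses a state-of-the-art image diffusion model to generate a photorealistic human image dataset with controllable attributes, then maps image features to 3D point clouds efficiently with a transformer-based architecture, and finally closes the loop by training a point-cloud diffusion model conditioned on the same text prompts. The result is a significant improvement in generation speed, text-prompt alignment, realism, and rendering quality.

Link: https://arxiv.org/abs/2506.04351
Authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
Affiliations: CVSSP, University of Surrey
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract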

Abstract:3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc., using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

[CV-132] You Only Train Once

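Quick read: This paper addresses the need for multiple training runs when selecting and weighting loss functions while training and optimizing learned models, with a focus on optimizing the loss weights. The key to the solution is You Only Train Once (YOTO), which optimizes the loss-weight hyperparameters automatically in a single training run by treating them as regular network parameters and learning them with standard gradient-based optimization. YOTO exploits the differentiability of the composite loss and models it as a novel layer parameterized with a softmax operation, which satisfies the inherent positivity constraint on loss hyperparameters and avoids degenerate empirical gradients. It also defines a novel regularization loss that models a uniformity prior among the employed losses and keeps the optima bounded.

Link: https://arxiv.org/abs/2506.04349
Authors: Christos Sakaridis
Affiliations: ETH Zürich
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures

Click to view abstract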

Abstract:The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of losses selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient-based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation which is widely used for optimizing multiple empirical losses simultaneously and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end-to-end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute-force grid search across state-of-the-art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid-search model on unseen test data. Code will be made publicly available.
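
The softmax parameterization of loss weights described above can be sketched in a few lines of Python. The unconstrained logits stand in for the learnable hyperparameters; the `uniformity_penalty` function is a hypothetical placeholder for the regularizer, whose actual form the abstract does not give.

```python
import math

def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def composite_loss(losses, logits):
    """Weighted sum of individual losses, with weights given by softmax(logits)."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, losses))

def uniformity_penalty(logits):
    """HYPOTHETICAL regularizer: squared deviation of the weights from uniform."""
    weights = softmax(logits)
    n = len(weights)
    return sum((w - 1.0 / n) ** 2 for w in weights)

if __name__ == "__main__":
    losses = [1.0, 3.0]           # e.g. two task losses at the current step
    logits = [0.0, 0.0]           # learnable parameters, initialised to equal weights
    print(softmax(logits))        # equal weights
    print(composite_loss(losses, logits))
    print(uniformity_penalty([2.0, 0.0]))  # grows as the weights become unequal
```

Without some prior of this kind, minimizing the weighted sum over the logits alone would push all weight onto whichever loss is currently smallest, which is one reason a regularizer on the learned weights is needed.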

[CV-133] RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

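Quick read: This paper addresses the problem of embodied robots accurately understanding spatial references in complex 3D scenes and dynamically reasoning about the locations indicated by instructions, a task that existing approaches based on pretrained vision-language models (VLMs) still handle poorly. The key to the solution is RoboRefer, a 3D-aware VLM that achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder through supervised fine-tuning (SFT) and improves multi-step spatial reasoning through reinforcement fine-tuning (RFT). The paper also introduces the RefSpatial dataset and the RefSpatial-Bench benchmark to support training and evaluation.

Link: https://arxiv.org/abs/2506.04308
Authors: Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract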

Abstract:Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision language models (VLMs), recent approaches still cannot accurately understand complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.

[CV-134] SSIMBaD: Sigma Scaling with SSIM-Guided Balanced Diffusion for AnimeFace Colorization

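Quick read: This paper addresses automatic colorization of anime-style facial sketches, where the core challenge is to transfer stylistic attributes from a reference image while preserving the structural fidelity of the input sketch. Traditional methods rely on predefined noise schedules, which often compromise perceptual consistency. The key to the solution is to build on continuous-time diffusion models and introduce SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion), which applies a sigma-space transformation that linearly aligns perceptual degradation as measured by structural similarity (SSIM). This keeps visual difficulty uniform across timesteps and yields more balanced and faithful reconstructions.

Link: https://arxiv.org/abs/2506.04283
Authors: Junpyo Seo, Hanbin Koo, Jieun Yook, Byung-Ro Moon (Department of Computer Science, Seoul National University)
Affiliations: Seoul National University
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, rest of the pages are appendix

Click to view abstract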

Abstract:We propose a novel diffusion-based framework for automatic colorization of Anime-style facial sketches. Our method preserves the structural fidelity of the input sketch while effectively transferring stylistic attributes from a reference image. Unlike traditional approaches that rely on predefined noise schedules, which often compromise perceptual consistency, our framework builds on continuous-time diffusion models and introduces SSIMBaD (Sigma Scaling with SSIM-Guided Balanced Diffusion). SSIMBaD applies a sigma-space transformation that aligns perceptual degradation, as measured by structural similarity (SSIM), in a linear manner. This scaling ensures uniform visual difficulty across timesteps, enabling more balanced and faithful reconstructions. Experiments on a large-scale Anime face dataset demonstrate that our method outperforms state-of-the-art models in both pixel accuracy and perceptual quality, while generalizing to diverse styles. Code is available at this http URL
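
One way to read "aligning perceptual degradation linearly" is to pick noise levels whose measured SSIM values are evenly spaced. The Python sketch below does this by inverting a measured, monotonically decreasing SSIM-versus-sigma curve with linear interpolation. It is a generic illustration under that assumption, not the transformation used in SSIMBaD, and the sample curve is made up.

```python
def invert_monotone(xs, ys, target):
    """Find x with y(x) == target by linear interpolation; ys must decrease with xs."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if y1 <= target <= y0:
            frac = (y0 - target) / (y0 - y1)
            return x0 + frac * (x1 - x0)
    raise ValueError("target lies outside the measured SSIM range")

def ssim_balanced_sigmas(sigmas, ssims, num_steps):
    """Choose noise levels so that SSIM drops by the same amount at every step."""
    high, low = ssims[0], ssims[-1]
    result = []
    for t in range(num_steps):
        target = high + (low - high) * t / (num_steps - 1)
        target = min(max(target, low), high)  # guard against rounding error
        result.append(invert_monotone(sigmas, ssims, target))
    return result

if __name__ == "__main__":
    # Hypothetical measurements: SSIM between clean and noised images at each sigma.
    sigmas = [0.0, 1.0, 2.0, 4.0]
    ssims = [1.0, 0.5, 0.25, 0.0]
    print(ssim_balanced_sigmas(sigmas, ssims, 5))
```

In this made-up curve most of the SSIM loss happens at small sigma, so the balanced schedule places more steps there than a schedule spaced evenly in sigma would.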

[CV-135] Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

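Quick read: This paper addresses the insufficient evaluation of the structured visual reasoning of Multimodal Large Language Models (MLLMs) over multi-image inputs. Existing benchmarks focus on single-image visual reasoning or on final-answer evaluation of multi-image understanding tasks, and do not systematically evaluate the reasoning process over multiple images. The key to the solution is the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark for multi-image reasoning. It contains 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, chain-of-thought (CoT) style annotations generated by GPT-4o and refined by human experts, and a derivative subset for evaluating multimodal reward models in multi-image scenarios.

Link: https://arxiv.org/abs/2506.04280
Authors: Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, Yanlin Li, Lei Ren, Wei Chen, Zhiyuan Huang, Mingjie Zhan, Xiaojie Wang, Fangxiang Feng
Affiliations: BUPT, China (Beijing University of Posts and Telecommunications); YSU, China (Yanshan University); NUS, Singapore (National University of Singapore); Li Auto Inc., China; SenseTime Research, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages

Click to view abstract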

Abstract:With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

[CV-136] RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought ACL2025

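Quick read: This paper addresses the lack of explicit mechanisms for visual grounding and segmentation in Multi-modal Large Language Models (MLLMs), with the aim of bridging the gap between cognitive reasoning and visual perception. The key to the solution is Reasoning Segmentation via Visual Prompting (RSVP), a framework that combines multi-step multimodal reasoning with grounded visual understanding to produce interpretable localization of target regions followed by segmentation refinement.

Link: https://arxiv.org/abs/2506.04277
Authors: Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, Wenbo Zhu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as ACL 2025 Main

Click to view abstract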

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structured framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrating textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs’ inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg and achieving 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.

[CV-137] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

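[Quick read]: This paper addresses the inability of existing adversarial training methods, which rely on a fixed perturbation budget, to adapt to instance-specific robustness characteristics. The key to the solution is Dynamic Epsilon Scheduling (DES), a framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. It combines the distance to the decision boundary estimated via gradient-based proxies, prediction confidence based on softmax entropy, and model uncertainty estimated via Monte Carlo dropout into a unified scheduling strategy that guides adversarial learning more effectively.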

Link: https://arxiv.org/abs/2506.04263
Authors: Alan Mitkiy, James Smith, Hana Satou, Hiroshi Tanaka, Emily Johnson, F Monkey
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.
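
The three factors named in the abstract can be combined in many ways, and the abstract does not give the actual scheduling rule. The sketch below is one hypothetical combination: a weighted average of three scores in [0, 1], mapped linearly onto an epsilon range. The weights, the range, and the assumption that confident, stable, far-from-boundary samples receive a larger budget are all illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Softmax entropy divided by log(K), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def dynamic_epsilon(logits, margin_proxy, mc_variance,
                    eps_min=2 / 255, eps_max=12 / 255,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical per-instance budget, not the rule used in the paper.

    margin_proxy: estimate in [0, 1] of the distance to the decision boundary.
    mc_variance: MC-dropout predictive variance, rescaled to [0, 1].
    """
    confidence = 1.0 - normalized_entropy(softmax(logits))
    stability = 1.0 - min(max(mc_variance, 0.0), 1.0)
    margin = min(max(margin_proxy, 0.0), 1.0)
    w_m, w_c, w_s = weights
    score = w_m * margin + w_c * confidence + w_s * stability
    # Linear map from the combined score to the allowed epsilon range.
    return eps_min + (eps_max - eps_min) * score
```

In a training loop this value would be recomputed for every sample at every iteration, which is what distinguishes the approach from a fixed-epsilon baseline.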

[CV-138] CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection

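[Quick read]: This paper addresses two persistent challenges that multimodal large language models (MLLMs) face when detecting out-of-context (OOC) misinformation: they struggle to capture cases where image and text are linked through underlying semantic associations rather than direct connections, and noise in the evidence reduces detection accuracy. The key to the solution is CMIE, a framework built on a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism, which identifies the underlying coexistence relationships between images and text and selectively uses relevant evidence to improve misinformation detection.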

Link: https://arxiv.org/abs/2505.23449
Authors: Fanxiao Li, Jiaying Wu, Canyuan He, Wei Zhou
Affiliations: Yunnan University; National University of Singapore
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. Evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture deeper relationships, specifically cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.

[CV-139] DM-SegNet: Dual-Mamba Architecture for 3D Medical Image Segmentation with Global Context Modeling

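[Quick read]: This paper addresses the tension between global context modeling and spatial topology preservation in 3D medical image segmentation. Existing medical segmentation methods based on State Space Models (SSMs) suffer from encoder-decoder incompatibility: the encoder's 1D sequence flattening destroys spatial structure, while conventional decoders cannot exploit SSM state propagation. The key to the solution is DM-SegNet, a Dual-Mamba architecture that combines directional state transitions with anatomy-aware hierarchical decoding. Its core innovations are a quadri-directional spatial Mamba module, a gated spatial convolution layer, and a Mamba-driven decoding framework, which together enable bidirectional state synchronization across scales while preserving spatial coherence.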

Link: https://arxiv.org/abs/2506.05297
Authors: Hangyu Ji
Affiliations: Central South University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Accurate 3D medical image segmentation demands architectures capable of reconciling global context modeling with spatial topology preservation. While State Space Models (SSMs) like Mamba show potential for sequence modeling, existing medical SSMs suffer from encoder-decoder incompatibility: the encoder’s 1D sequence flattening compromises spatial structures, while conventional decoders fail to leverage Mamba’s state propagation. We present DM-SegNet, a Dual-Mamba architecture integrating directional state transitions with anatomy-aware hierarchical decoding. The core innovations include a quadri-directional spatial Mamba module employing four-directional 3D scanning to maintain anatomical spatial coherence, a gated spatial convolution layer that enhances spatially sensitive feature representation prior to state modeling, and a Mamba-driven decoding framework enabling bidirectional state synchronization across scales. Extensive evaluation on two clinically significant benchmarks demonstrates the efficacy of DM-SegNet: achieving state-of-the-art Dice Similarity Coefficient (DSC) of 85.44% on the Synapse dataset for abdominal organ segmentation and 90.22% on the BraTS2023 dataset for brain tumor segmentation.
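
The results above are reported as Dice Similarity Coefficients. A minimal sketch of the standard definition, DSC = 2|P ∩ T| / (|P| + |T|), on flat binary masks follows. The function name and the convention for two empty masks are illustrative assumptions; real evaluations typically average this score over organs or tumor sub-regions.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * inter / size
```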

[CV-140] PixCell: A generative foundation model for digital histopathology images

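[Quick read]: This paper addresses annotated-data scarcity, privacy protection, and inherently generative tasks in pathology. The key to the solution is PixCell, the first diffusion-based generative foundation model for histopathology images. Using progressive training and self-supervision-based conditioning without annotated data, PixCell synthesizes diverse, high-quality pathology images at scale, supporting data augmentation, privacy-preserving data sharing, and virtual staining.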

Link: https://arxiv.org/abs/2506.05127
Authors: Srikar Yellapragada, Alexandros Graikos, Zilinghan Li, Kostas Triaridis, Varun Belagali, Saarthak Kapse, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Tahsin Kurc, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Abstract: The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance. Finally, we demonstrate PixCell’s ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.

[CV-141] DACN: Dual-Attention Convolutional Network for Hyperspectral Image Super-Resolution

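[Quick read]: This paper addresses two limitations of 2D convolutional neural networks (CNNs) in hyperspectral image super-resolution: weak global contextual understanding caused by reliance on local neighborhoods, and performance limits imposed by band correlation and data scarcity. The key to the solution is DACN (Dual-Attention Convolutional Network), which uses augmented convolutions with multi-head attention to capture both local and global feature dependencies, infers separate channel and spatial attention maps to decide where to focus, and adopts a custom loss that combines L2 regularization with a spatial-spectral gradient loss to improve spectral fidelity.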

Link: https://arxiv.org/abs/2506.05041
Authors: Usman Muhammad, Jorma Laaksonen
Affiliations: Aalto University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:2D convolutional neural networks (CNNs) have attracted significant attention for hyperspectral image super-resolution tasks. However, a key limitation is their reliance on local neighborhoods, which leads to a lack of global contextual understanding. Moreover, band correlation and data scarcity continue to limit their performance. To mitigate these issues, we introduce DACN, a dual-attention convolutional network for hyperspectral image super-resolution. Specifically, the model first employs augmented convolutions, integrating multi-head attention to effectively capture both local and global feature dependencies. Next, we infer separate attention maps for the channel and spatial dimensions to determine where to focus across different channels and spatial positions. Furthermore, a custom optimized loss function is proposed that combines L2 regularization with spatial-spectral gradient loss to ensure accurate spectral fidelity. Experimental results on two hyperspectral datasets demonstrate that the combination of multi-head attention and channel attention outperforms either attention mechanism used individually.

[CV-142] Deep learning image burst stacking to reconstruct high-resolution ground-based solar observations

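[Quick read]: This paper addresses the image degradation that Earth's turbulent atmosphere causes in observations from large-aperture ground-based solar telescopes, as well as the limitations of existing reconstruction methods under strong turbulence and high computational cost. The key to the solution is a deep learning method based on unpaired image-to-image translation that reconstructs a burst of 100 short-exposure images into one high-quality image in real time, using speckle reconstructions as references to improve robustness and generalization.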

Link: https://arxiv.org/abs/2506.04781
Authors: Christoph Schirninger, Robert Jarolim, Astrid M. Veronig, Christoph Kuckein
Affiliations: Unknown
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

Abstract: Large-aperture ground-based solar telescopes allow the solar atmosphere to be resolved in unprecedented detail. However, observations are limited by Earth's turbulent atmosphere, requiring post-image corrections. Current reconstruction methods using short-exposure bursts face challenges with strong turbulence and high computational costs. We introduce a deep learning approach that reconstructs 100 short-exposure images into one high-quality image in real time. Using unpaired image-to-image translation, our model is trained on degraded bursts with speckle reconstructions as references, improving robustness and generalization. Our method shows improved robustness in terms of perceptual quality, especially when speckle reconstructions show artifacts. An evaluation with a varying number of images per burst demonstrates that our method makes efficient use of the combined image information and achieves the best reconstructions when provided with the full image burst.

[CV-143] A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement

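[Quick read]: This paper addresses the denoising and enhancement of images degraded by Poisson noise under extreme low-light conditions. Traditional noise assumptions such as Gaussian noise often do not hold, because real-world noise is signal-dependent and calls for more accurate modeling. The paper proposes a lightweight deep learning method whose key idea is to integrate Retinex-based decomposition with Poisson denoising in a unified encoder-decoder network. By incorporating a Poisson denoising loss, the model enhances illumination and suppresses noise at the same time, learning an effective decomposition without prior reflectance and illumination information while keeping reflectance consistent and illumination smooth and avoiding color distortion.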

Link: https://arxiv.org/abs/2506.04470
Authors: Isha Rao, Sanjay Ghosh
Affiliations: Indian Institute of Technology Kharagpur
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures and 1 table

Abstract: Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in most cases. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a lightweight deep learning-based method that integrates Retinex-based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without requiring prior reflectance and illumination information, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.
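
The abstract does not spell out its Poisson denoising loss. A common choice for signal-dependent noise is the Poisson negative log-likelihood, sketched below as an assumption rather than the paper's exact formulation. Here `rates` stands for the network's predicted clean intensities and `counts` for the observed noisy pixel values; both names are illustrative.

```python
import math

def poisson_nll(rates, counts, eps=1e-8):
    """Mean Poisson negative log-likelihood of observed counts given predicted rates.

    The log(count!) term is dropped because it does not depend on the prediction.
    eps guards against log(0) when a predicted rate is zero.
    """
    if len(rates) != len(counts):
        raise ValueError("rates and counts must have the same length")
    total = 0.0
    for rate, count in zip(rates, counts):
        total += rate - count * math.log(rate + eps)
    return total / len(rates)
```

For a single pixel the loss is minimized when the predicted rate equals the observed count, and it penalizes errors more strongly in dark regions, where Poisson noise dominates the signal.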

[CV-144] Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning CVPR2025

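[Quick read]: This paper addresses the privacy leakage of users' fine-tuning data when models are trained with parameter-efficient fine-tuning (PEFT) in federated learning (FL). The key finding is that a malicious design of the pretrained model and the trainable adapter modules can be used to mount gradient inversion attacks that reconstruct a target user's local data samples. Experiments show that a large batch of fine-tuning images can be recovered with high fidelity from the accessible adapter gradients alone, underscoring the need for privacy-preserving mechanisms for PEFT.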

Link: https://arxiv.org/abs/2506.04453
Authors: Hasin Us Sami, Swapneel Sen, Amit K. Roy-Chowdhury, Srikanth V. Krishnamurthy, Basak Guler
Affiliations: University of California, Riverside
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

Abstract:Federated learning (FL) allows multiple data-owners to collaboratively train machine learning models by exchanging local gradients, while keeping their private data on-device. To simultaneously enhance privacy and training efficiency, recently parameter-efficient fine-tuning (PEFT) of large-scale pretrained models has gained substantial attention in FL. While keeping a pretrained (backbone) model frozen, each user fine-tunes only a few lightweight modules to be used in conjunction, to fit specific downstream applications. Accordingly, only the gradients with respect to these lightweight modules are shared with the server. In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. We demonstrate gradient inversion attacks on a popular PEFT mechanism, the adapter, which allow an attacker to reconstruct local data samples of a target user, using only the accessible adapter gradients. Via extensive experiments, we demonstrate that a large batch of fine-tuning images can be retrieved with high fidelity. Our attack highlights the need for privacy-preserving mechanisms for PEFT, while opening up several future directions. Our code is available at this https URL.
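
To see why shared gradients can leak inputs at all, consider the textbook case of a single fully connected layer with a bias and one training sample: the weight gradient is the outer product of the output gradient and the input, so dividing a row of it by the matching bias gradient returns the input exactly. The sketch below illustrates only this well-known building block. It is not the paper's attack, which relies on a malicious design of the backbone and adapters to recover whole batches.

```python
def linear_grads(W, b, x, target):
    """Gradients of 0.5 * ||W x + b - target||^2 with respect to W and b (one sample)."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    g = [o - t for o, t in zip(out, target)]       # dL/d(out), which equals dL/db
    grad_W = [[gi * xi for xi in x] for gi in g]   # outer product g x^T
    return grad_W, g

def recover_input(grad_W, grad_b, tol=1e-12):
    """Recover x from any row of grad_W whose bias gradient is non-zero."""
    for row, gb in zip(grad_W, grad_b):
        if abs(gb) > tol:
            return [v / gb for v in row]
    raise ValueError("all bias gradients are zero; input cannot be recovered")
```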

[CV-145] Deep Learning-Based Breast Cancer Detection in Mammography: A Multi-Center Validation Study in Thai Population

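[Quick read]: This paper addresses automated breast cancer detection in mammography, with the goal of improving screening efficiency and accuracy. The key to the solution is a modified EfficientNetV2 architecture with enhanced attention mechanisms, which achieves high AUROC on in-domain, biopsy-confirmed, and out-of-domain datasets, and whose effectiveness in lesion localization and classification is supported by clinical validation.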

Link: https://arxiv.org/abs/2506.03177
Authors: Isarun Chamveha, Supphanut Chaiyungyuen, Sasinun Worakriangkrai, Nattawadee Prasawang, Warasinee Chaisangmongkon, Pornpim Korpraphong, Voraparee Suvannarerg, Shanigarn Thiravit, Chalermdej Kannawat, Kewalin Rungsinaporn, Suwara Issaragrisil, Payia Chadbunchachai, Pattiya Gatechumpol, Chawiporn Muktabhant, Patarachai Sereerat
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This study presents a deep learning system for breast cancer detection in mammography, developed using a modified EfficientNetV2 architecture with enhanced attention mechanisms. The model was trained on mammograms from a major Thai medical center and validated on three distinct datasets: an in-domain test set (9,421 cases), a biopsy-confirmed set (883 cases), and an out-of-domain generalizability set (761 cases) collected from two different hospitals. For cancer detection, the model achieved AUROCs of 0.89, 0.96, and 0.94 on the respective datasets. The system’s lesion localization capability, evaluated using metrics including Lesion Localization Fraction (LLF) and Non-Lesion Localization Fraction (NLF), demonstrated robust performance in identifying suspicious regions. Clinical validation through concordance tests showed strong agreement with radiologists: 83.5% classification and 84.0% localization concordance for biopsy-confirmed cases, and 78.1% classification and 79.6% localization concordance for out-of-domain cases. Expert radiologists’ acceptance rate also averaged 96.7% for biopsy-confirmed cases, and 89.3% for out-of-domain cases. The system achieved a System Usability Scale score of 74.17 for source hospital, and 69.20 for validation hospitals, indicating good clinical acceptance. These results demonstrate the model’s effectiveness in assisting mammogram interpretation, with the potential to enhance breast cancer screening workflows in clinical practice.

Artificial Intelligence

[AI-0] Exploring Diffusion Transformer Designs via Grafting

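[Quick read]: This paper addresses the fact that evaluating how architectural decisions affect model quality requires costly pretraining, which limits architectural exploration. The key to the solution is "grafting", a simple method that edits pretrained diffusion transformers (DiTs) to build and study new architectures under a small compute budget. Informed by an analysis of activation behavior and attention locality, the method explores new designs by replacing operators (for example, swapping softmax attention for gated convolution, local attention, or linear attention) or by restructuring the architecture (for example, converting sequential transformer blocks into parallel blocks).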

Link: https://arxiv.org/abs/2506.05340
Authors: Keshigeyan Chandrasegaran, Michael Poli, Daniel Y. Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, Li Fei-Fei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages; Project website: this https URL

Abstract:Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using 2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: this https URL

[AI-1] Control Tax: The Price of Keeping AI in Check

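[Quick read]: This paper addresses the operational and financial cost of integrating control mechanisms with agentic AI in high-stakes real-world applications, which it calls the "Control Tax". The key to the solution is a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances. The paper also evaluates state-of-the-art language models in adversarial settings and provides practical financial cost estimates and optimized monitoring strategies that balance safety against cost-effectiveness.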

Link: https://arxiv.org/abs/2506.05296
Authors: Mikhail Terekhov, Zhen Ning David Liu, Caglar Gulcehre, Samuel Albanie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control tax – the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.

[AI-2] Sample Complexity and Representation Ability of Test-time Scaling Paradigms

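[Quick read]: This paper addresses the limited theoretical understanding of the sample efficiency of test-time strategies for large language models (LLMs), such as self-consistency, best-of-n, and self-correction. The key to the solution is a theoretical analysis that establishes the difference in sample complexity between these strategies and shows that self-correction with verifier feedback lets a Transformer simulate online learning at test time. As a result, a single Transformer can provably solve multiple tasks without prior knowledge of the task, extending the representation theory of Transformers to the multi-task setting.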

Link: https://arxiv.org/abs/2506.05295
Authors: Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract: Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies, such as self-consistency, best-of-$n$, and self-correction, remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
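
The gap between the two repeated-sampling strategies can be seen in a small Monte Carlo experiment. The sketch below makes two simplifying assumptions that are not from the paper: self-consistency succeeds only when the correct answer is the unique plurality winner, and best-of-n has access to a perfect verifier, so it succeeds whenever the correct answer is sampled at least once. The answer distribution is a toy example.

```python
import random
from collections import Counter

def simulate(probs, n, trials=2000, seed=0):
    """Return (self_consistency_accuracy, best_of_n_accuracy).

    probs[0] is the probability of the correct answer (index 0).
    """
    rng = random.Random(seed)
    answers = list(range(len(probs)))
    sc_hits = 0
    bon_hits = 0
    for _ in range(trials):
        counts = Counter(rng.choices(answers, weights=probs, k=n))
        top = max(counts.values())
        winners = [a for a, c in counts.items() if c == top]
        # Self-consistency: majority vote must pick the correct answer unambiguously.
        if winners == [0]:
            sc_hits += 1
        # Best-of-n with an assumed perfect verifier: one correct sample suffices.
        if counts[0] > 0:
            bon_hits += 1
    return sc_hits / trials, bon_hits / trials
```

With a small gap between the top two answers, majority voting needs many samples before the vote stabilizes, while a reliable verifier only needs the correct answer to show up once.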

[AI-3] Fast-DataShapley: Neural Modeling for Training Data Valuation

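[Quick read]: This paper addresses the valuation and copyright protection of training data, in particular how the AI industry can fairly measure data providers' contributions and reward them accordingly. The core challenge is that computing Shapley values incurs exponentially growing overhead, and existing Shapley-based approximation methods must retrain for every test sample, which is too expensive. The key to the solution is Fast-DataShapley, a one-pass training method that uses the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time inference speed, so the Shapley values of training data can be computed for new test samples without retraining. The paper also proposes methods with theoretical guarantees that reduce training overhead from two angles: approximate computation of the utility function and grouped computation over the training data.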

Link: https://arxiv.org/abs/2506.05281
Authors: Haifeng Sun, Yu Xiong, Runze Wu, Xinyu Cai, Changjie Fan, Lan Zhang, Xiang-Yang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The value and copyright of training data are crucial in the artificial intelligence industry. Service platforms should protect data providers’ legitimate rights and fairly reward them for their contributions. Shapley value, a potent tool for evaluating contributions, outperforms other methods in theory, but its computational overhead escalates exponentially with the number of data providers. Recent works based on Shapley values attempt to mitigate computation complexity by approximation algorithms. However, they need to retrain for each test sample, leading to intolerable costs. We propose Fast-DataShapley, a one-pass training method that leverages the weighted least squares characterization of the Shapley value to train a reusable explainer model with real-time reasoning speed. Given new test samples, no retraining is required to calculate the Shapley values of the training data. Additionally, we propose three methods with theoretical guarantees to reduce training overhead from two aspects: the approximate calculation of the utility function and the group calculation of the training data. We analyze time complexity to show the efficiency of our methods. The experimental evaluations on various image datasets demonstrate superior performance and efficiency compared to baselines. Specifically, the performance is improved to more than 2.5 times, and the explainer’s training speed can be increased by two orders of magnitude.
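
The cost problem that motivates the paper is easiest to see in an exact computation. The sketch below computes exact Shapley values by averaging marginal contributions over every ordering of the players, which is feasible only for a handful of players. It is a baseline illustration, not Fast-DataShapley itself; `exact_shapley` and the toy utility used with it are hypothetical.

```python
from itertools import permutations

def exact_shapley(players, utility):
    """Exact Shapley values via all orderings.

    utility maps a frozenset of players to a number, for example the accuracy
    of a model trained on that subset of data providers.
    """
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}
```

Ten players already require 3,628,800 orderings, each with its own utility evaluations, which is why approximation and amortized explainers are attractive for data valuation.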

[AI-4] Teaming in the AI Era: AI-Augmented Frameworks for Forming, Simulating, and Optimizing Human Teams

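[Quick read]: This paper addresses two core problems in team optimization: during team formation, how to balance user preferences with task objectives to improve overall team satisfaction, and during team performance, how to maintain cohesion and engagement to sustain high performance. The key to the solution is a set of AI-augmented team optimization frameworks: a team formation framework based on a multi-armed bandit algorithm that dynamically adjusts team composition; tAIfa (Team AI Feedback Assistant), a system that uses large language models (LLMs) to deliver immediate, personalized feedback that strengthens cohesion and engagement; and PuppeteerLLM, a simulation framework that uses LLMs to simulate multi-agent teams, modeling complex team dynamics with task-driven collaboration and long-term coordination.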

Link: https://arxiv.org/abs/2506.05265
Authors: Mohammed Almutairi
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 5 pages, UMAP '25, June 16-19, 2025, New York City, NY, USA

Abstract: Effective teamwork is essential across diverse domains. During the team formation stage, a key challenge is forming teams that effectively balance user preferences with task objectives to enhance overall team satisfaction. In the team performing stage, maintaining cohesion and engagement is critical for sustaining high team performance. However, existing computational tools and algorithms for team optimization often rely on static data inputs, narrow algorithmic objectives, or solutions tailored for specific contexts, failing to account for the dynamic interplay of team members' personalities, evolving goals, and changing individual preferences. Therefore, teams may encounter member dissatisfaction, as purely algorithmic assignments can reduce members' commitment to team goals, or experience suboptimal engagement due to the absence of timely, personalized guidance to help members adjust their behaviors and interactions as team dynamics evolve. Ultimately, these challenges can lead to reduced overall team performance. My Ph.D. dissertation aims to develop AI-augmented team optimization frameworks and practical systems that enhance team satisfaction, engagement, and performance. First, I propose a team formation framework that leverages a multi-armed bandit algorithm to iteratively refine team composition based on user preferences, ensuring alignment between individual needs and collective team goals to enhance team satisfaction. Second, I introduce tAIfa (Team AI Feedback Assistant), an AI-powered system that utilizes large language models (LLMs) to deliver immediate, personalized feedback to both teams and individual members, enhancing cohesion and engagement. Finally, I present PuppeteerLLM, an LLM-based simulation framework that simulates multi-agent teams to model complex team dynamics within realistic environments, incorporating task-driven collaboration and long-term coordination.

[AI-5] Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning

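[Quick read]: This paper addresses the compute wasted when large reasoning models (LRMs) generate too many tokens during inference, especially on easy problems. Existing solutions, such as supervised fine-tuning, user-controlled budgets, or reinforcement learning with uniform penalties, either depend on data curation, require manual configuration, or treat all problems alike. The key to the solution is the Adaptive Length Penalty (ALP), a reinforcement learning objective that adjusts generation length according to each prompt's solve rate. During training, ALP estimates each prompt's online solve rate from multiple rollouts and applies a differentiable penalty whose magnitude scales inversely with that solve rate, substantially reducing average token usage while preserving performance.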

Link: https://arxiv.org/abs/2506.05256
Authors: Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, Nick Haber
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised finetuning on shorter traces, user-controlled budgets, or RL with uniform penalties, either require data curation, manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective tailoring generation length to per-prompt solve rate. During training, ALP monitors each prompt’s online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales inversely with that rate, so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Posttraining DeepScaleR-1.5B with ALP cuts average token usage by 50% without significantly dropping performance. Relative to fixed-budget and uniform penalty baselines, ALP redistributes its reduced budget more intelligently by cutting compute on easy prompts and reallocating saved tokens to difficult ones, delivering higher accuracy on the hardest problems with higher cost.
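
A rough sketch of a solve-rate-aware length penalty follows. The abstract's wording on the scaling direction is ambiguous, so this sketch follows its stated consequence: prompts that are solved often (easy) pay more per extra token, while hard prompts are barely penalized. The reward form, the `beta` coefficient, and both function names are assumptions for illustration, not the paper's objective.

```python
def online_solve_rate(rollout_outcomes):
    """Fraction of sampled rollouts for one prompt that were correct (1) vs not (0)."""
    if not rollout_outcomes:
        raise ValueError("need at least one rollout")
    return sum(rollout_outcomes) / len(rollout_outcomes)

def alp_style_reward(correct, num_tokens, solve_rate, beta=1e-4):
    """Hypothetical reward: correctness minus a length penalty weighted by solve rate."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    return float(correct) - beta * num_tokens * solve_rate
```

Under this form, a 2,000-token correct answer costs far more on a prompt solved in every rollout than on one solved in a quarter of them, which pushes the policy to shorten easy answers first.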

[AI-6] Intentionally Unintentional: GenAI Exceptionalism and the First Amendment

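[Quick read]: This paper asks whether courts should treat the outputs of large generative AI models, such as GPT-4 and Gemini, as "speech" protected by the First Amendment to the U.S. Constitution. Its central argument is that because these models lack intentionality, their outputs do not meet the legal definition of speech and should not receive First Amendment protection. It further argues that if the outputs are not speech, users cannot claim a First Amendment right to receive them, and that extending First Amendment rights to AI models would not serve the core purposes of free speech, such as promoting a marketplace of ideas, facilitating self-governance, or fostering self-expression. Instead, it could hinder effective government regulation of these powerful technologies and worsen the spread of misinformation and other harms.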

Link: https://arxiv.org/abs/2506.05211
Authors: David Atkinson, Jena D. Hwang, Jacob Morrison
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:This paper challenges the assumption that courts should grant First Amendment protections to outputs from large generative AI models, such as GPT-4 and Gemini. We argue that because these models lack intentionality, their outputs do not constitute speech as understood in the context of established legal precedent, so there can be no speech to protect. Furthermore, if the model outputs are not speech, users cannot claim a First Amendment speech right to receive the outputs. We also argue that extending First Amendment rights to AI models would not serve the fundamental purposes of free speech, such as promoting a marketplace of ideas, facilitating self-governance, or fostering self-expression. In fact, granting First Amendment protections to AI models would be detrimental to society because it would hinder the government’s ability to regulate these powerful technologies effectively, potentially leading to the unchecked spread of misinformation and other harms.

[AI-7] TreeRPO: Tree Relative Policy Optimization

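[Quick read]: This paper addresses the insufficient guidance that trajectory-level rewards give to reinforcement-learning-based large language models (LLMs) when optimizing the intermediate steps of a reasoning process. The key to the solution is TreeRPO, which uses tree sampling to estimate the mathematical expectation of the reward at each reasoning step, so that rewards are estimated directly from the sampling process rather than from a separate step reward model. TreeRPO also computes rewards over the step-level groups generated during tree sampling, producing fine-grained, dense reward signals that improve learning efficiency and overall performance.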

Link: https://arxiv.org/abs/2506.05183
Authors: Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures

Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our TreeRPO algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0% to 35.5%. Furthermore, TreeRPO significantly outperforms GRPO by 2.9% in performance while simultaneously reducing the average response length by 18.1%, showcasing its effectiveness and efficiency. Our code will be available at this https URL.
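
The tree-sampling idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` structure, the choice to estimate a step's expected reward as the plain average of the terminal rewards below it, and the GRPO-style normalization across sibling steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in a sampled tree (hypothetical structure)."""
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # verifiable reward, set on leaves only


def expected_reward(node: Node) -> float:
    """Monte Carlo estimate: average terminal reward over all leaves below node."""
    leaves = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.children:
            stack.extend(current.children)
        else:
            leaves.append(float(current.reward))
    return mean(leaves)


def sibling_advantages(parent: Node, eps: float = 1e-8) -> List[float]:
    """Group-relative advantage of each child step within its sibling group."""
    values = [expected_reward(child) for child in parent.children]
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def leaf(r: float) -> Node:
    return Node(reward=r)


# Three candidate first steps, each continued twice; rewards are 0/1 correctness.
root = Node(children=[
    Node(children=[leaf(1.0), leaf(1.0)]),
    Node(children=[leaf(1.0), leaf(0.0)]),
    Node(children=[leaf(0.0), leaf(0.0)]),
])
print(sibling_advantages(root))
```

In this toy tree the first step gets a positive advantage, the second is neutral, and the third is negative, even though no step-level reward model was consulted.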

[AI-8] Truly Self-Improving Agents Require Intrinsic Metacognitive Learning ICML2025

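[Quick Read]: This paper addresses limitations of self-improving agents as they continually acquire new capabilities: rigid self-improvement processes, poor generalization across task domains, and difficulty scaling as agent capabilities grow. The key to its solution is intrinsic metacognitive learning, an agent's ability to actively evaluate, reflect on, and adapt its own learning process. The authors propose a framework comprising metacognitive knowledge, metacognitive planning, and metacognitive evaluation. They stress that existing self-improving agents rely mainly on extrinsic metacognitive mechanisms, although many ingredients of intrinsic metacognition are already present, and that future work must explore how best to divide metacognitive responsibilities between humans and agents and how to evaluate and improve intrinsic metacognitive learning.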

Link: https://arxiv.org/abs/2506.05109
Authors: Tennison Liu, Mihaela van der Schaar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICML 2025

Abstract: Self-improving agents aim to continuously acquire new capabilities with minimal supervision. However, current approaches face two key limitations: their self-improvement processes are often rigid, fail to generalize across task domains, and struggle to scale with increasing agent capabilities. We argue that effective self-improvement requires intrinsic metacognitive learning, defined as an agent’s intrinsic ability to actively evaluate, reflect on, and adapt its own learning processes. Drawing inspiration from human metacognition, we introduce a formal framework comprising three components: metacognitive knowledge (self-assessment of capabilities, tasks, and learning strategies), metacognitive planning (deciding what and how to learn), and metacognitive evaluation (reflecting on learning experiences to improve future learning). Analyzing existing self-improving agents, we find they rely predominantly on extrinsic metacognitive mechanisms, which are fixed, human-designed loops that limit scalability and adaptability. Examining each component, we contend that many ingredients for intrinsic metacognition are already present. Finally, we explore how to optimally distribute metacognitive responsibilities between humans and agents, and robustly evaluate and improve intrinsic metacognitive learning, key challenges that must be addressed to enable truly sustained, generalized, and aligned self-improvement.

[AI-9] Survey on the Evaluation of Generative Models in Music

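[Quick Read]: This paper addresses the varied and insufficiently systematic evaluation of generative systems in music. Its contribution is an interdisciplinary survey of evaluation targets, methodologies, and metrics for assessing both system output and model usability. It covers subjective and objective, qualitative and quantitative, and empirical and computational approaches, and discusses their advantages and challenges from the perspectives of musicology, engineering, and human-computer interaction.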

Link: https://arxiv.org/abs/2506.05104
Authors: Alexander Lerch, Claire Arthur, Nick Bryan-Kinns, Corey Ford, Qianyi Sun, Ashvala Vinay
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to ACM CSUR, 26-Jun-2024

Abstract:Research on generative systems in music has seen considerable attention and growth in recent years. A variety of attempts have been made to systematically evaluate such systems. We provide an interdisciplinary review of the common evaluation targets, methodologies, and metrics for the evaluation of both system output and model usability, covering subjective and objective approaches, qualitative and quantitative approaches, as well as empirical and computational methods. We discuss the advantages and challenges of such approaches from a musicological, an engineering, and an HCI perspective.

[AI-10] Reason-to-Recommend: Using Interaction-of-Thought Reasoning to Enhance LLM Recommendation

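[Quick Read]: This paper addresses why reasoning methods work poorly when generative AI is applied to recommender systems: user feedback is implicit and explicit reasoning supervision is lacking. The key to its solution is R2Rec, a reasoning-enhanced recommendation framework that samples interaction chains from the user-item graph and converts them, via a progressive masked prompting strategy, into structured interaction-of-thoughts, so that large language models (LLMs) can make step-by-step decisions based on implicit patterns. A two-stage training pipeline, supervised fine-tuning followed by reinforcement learning, further improves the model's reasoning and alleviates the sparsity of explicit supervision.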

Link: https://arxiv.org/abs/2506.05069
Authors: Keyu Zhao, Fengli Xu, Yong Li
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract: Driven by advances in Large Language Models (LLMs), integrating them into recommendation tasks has gained interest due to their strong semantic understanding and prompt flexibility. Prior work encoded user-item interactions or metadata into prompts for recommendations. In parallel, LLM reasoning, boosted by test-time scaling and reinforcement learning, has excelled in fields like mathematics and code, where reasoning traces and correctness signals are clear, enabling high performance and interpretability. However, directly applying these reasoning methods to recommendation is ineffective because user feedback is implicit and lacks reasoning supervision. To address this, we propose R2Rec, a reasoning-enhanced recommendation framework that samples interaction chains from the user-item graph and converts them into structured interaction-of-thoughts via a progressive masked prompting strategy, with each thought representing stepwise reasoning grounded in interaction context. This allows LLMs to simulate step-by-step decision-making based on implicit patterns. We design a two-stage training pipeline: supervised fine-tuning teaches basic reasoning from high-quality traces, and reinforcement learning refines reasoning via reward signals, alleviating sparse explicit supervision. Experiments on three real-world datasets show R2Rec outperforms classical and LLM-based baselines with an average 10.48% improvement in HitRatio@1 and 131.81% gain over the original LLM. Furthermore, the explicit reasoning chains enhance interpretability by revealing the decision process. Our code is available at: this https URL.
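
For readers unfamiliar with the HitRatio@1 metric quoted above, the following sketch shows the standard definition of HitRatio@K: the fraction of test cases whose held-out target item appears among the top K recommendations. It is not the paper's evaluation code, and the function name and toy item IDs are made up for illustration.

```python
from typing import Hashable, Sequence


def hit_ratio_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
    k: int = 1,
) -> float:
    """Fraction of users whose target item is within the top-k ranked items."""
    if len(ranked_lists) != len(targets):
        raise ValueError("need exactly one target per ranked list")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Two users: the first target is ranked first, the second target is ranked second.
rankings = [["item_a", "item_b"], ["item_c", "item_d"]]
held_out = ["item_a", "item_d"]
print(hit_ratio_at_k(rankings, held_out, k=1))
print(hit_ratio_at_k(rankings, held_out, k=2))
```

With K = 1 only an exact top-ranked match counts as a hit, which is why it is the strictest setting of the metric.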

[AI-11] TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation ICML2025

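[Quick Read]: This paper addresses a shortcoming of existing explainable AI (XAI) methods for time series: they estimate only point-wise attribution magnitudes and overlook the directional impact on predictions, so they fail to identify key points accurately. It proposes two new evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), to systematically assess whether attribution methods correctly identify significant positive and negative points in time series XAI. To overcome the out-of-distribution samples that conventional Integrated Gradients (IG) introduces on time series data by ignoring temporal relationships, the paper proposes TIMING, which adds temporal awareness to IG while preserving its theoretical properties.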

Link: https://arxiv.org/abs/2506.05035
Authors: Hyeongwon Jang, Changhun Kim, Eunho Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025 Spotlight Presentation; Code is available at this https URL

Abstract: Recent explainable artificial intelligence (XAI) methods for time series primarily estimate point-wise attribution magnitudes, while overlooking the directional impact on predictions, leading to suboptimal identification of significant points. Our analysis shows that conventional Integrated Gradients (IG) effectively capture critical points with both positive and negative impacts on predictions. However, current evaluation metrics fail to assess this capability, as they inadvertently cancel out opposing feature contributions. To address this limitation, we propose novel evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), to systematically assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, conventional IG outperforms recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as generated paths ignore temporal relationships and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines. Our code is available at this https URL.
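
As background for the abstract above, here is a minimal sketch of conventional Integrated Gradients, the baseline that TIMING builds on. It does not implement TIMING's temporality-aware variant. The toy model `f`, its hand-written gradient `grad_f`, and the all-zero baseline are assumptions chosen so the result can be checked by hand.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> List[float]:
    """Midpoint Riemann-sum approximation of IG along the straight-line path.

    IG_i = (x_i - b_i) * integral over a in [0, 1] of df/dx_i(b + a * (x - b)).
    """
    totals = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(x: Vector) -> float:
    """Toy "model" over a three-point series (illustrative only)."""
    return x[0] * x[1] - x[2] ** 2


def grad_f(x: Vector) -> List[float]:
    return [x[1], x[0], -2.0 * x[2]]


series = [1.0, 2.0, -1.0]
zeros = [0.0, 0.0, 0.0]
attributions = integrated_gradients(grad_f, series, zeros)
print(attributions)
# Completeness: attributions sum to f(series) - f(zeros).
print(sum(attributions), f(series) - f(zeros))
```

The third time step receives a negative attribution while the first two are positive. Signed contributions like these are the directional information the abstract says magnitude-only evaluation metrics cancel out.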

[AI-12] Artificial Intelligence Should Genuinely Support Clinical Reasoning and Decision Making To Bridge the Translational Gap

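[Quick Read]: This paper addresses the translational gap that keeps artificial intelligence from having real impact in medicine. Its key proposal is a new sociotechnical conceptualisation of data-driven support tools designed to complement doctors' cognitive and epistemic activities, rather than to chase superhuman performance on inconsequential benchmarks.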

Link: https://arxiv.org/abs/2506.05030
Authors: Kacper Sokol, James Fackler, Julia E Vogt
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Artificial intelligence promises to revolutionise medicine, yet its impact remains limited because of the pervasive translational gap. We posit that the prevailing technology-centric approaches underpin this challenge, rendering such systems fundamentally incompatible with clinical practice, specifically diagnostic reasoning and decision making. Instead, we propose a novel sociotechnical conceptualisation of data-driven support tools designed to complement doctors’ cognitive and epistemic activities. Crucially, it prioritises real-world impact over superhuman performance on inconsequential benchmarks.

[AI-13] Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

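[Quick Read]: This paper addresses the challenges heterogeneous multi-robot systems face in complex tasks, particularly coordinated cooperation under task diversity and dynamic environments. Traditional approaches rely on static models and struggle to adapt to changing task requirements and environmental conditions. The key to the solution is a hierarchical framework that integrates a prompted large language model (LLM) with a GridMask-enhanced fine-tuned vision-language model (VLM). The LLM handles task decomposition and global semantic map construction, while the VLM extracts task-specific semantic labels and 2D spatial information from aerial images to support local planning. The framework enables aerial and ground robots to work together and improves adaptability and robustness in dynamic environments.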

Link: https://arxiv.org/abs/2506.05020
Authors: Haokun Liu, Zhaoqi Ma, Yunong Li, Junichiro Sugihara, Yicheng Chen, Jinjie Li, Moju Zhao
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Heterogeneous multi-robot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, traditional approaches relying on static models often struggle with task diversity and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a GridMask-enhanced fine-tuned Vision Language Model (VLM). The LLM performs task decomposition and global semantic map construction, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. Within this framework, the aerial robot follows a globally optimized semantic path and continuously provides bird-view images, guiding the ground robot’s local semantic navigation and manipulation, including target-absent scenarios where implicit alignment is maintained. Experiments on a real-world letter-cubes arrangement task demonstrate the framework’s adaptability and robustness in dynamic environments. To the best of our knowledge, this is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.

[AI-14] Towards Reasonable Concept Bottleneck Models

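[Quick Read]: This paper addresses the limited interpretability and controllability of traditional black-box models, in particular the difficulty of making concept relationships and task dependencies explicit in a model's reasoning. It proposes CREAM, a new family of Concept Bottleneck Models (CBMs) that explicitly encodes concept-concept (C-C) and concept-task (C → Y) relationships to enforce the desired reasoning, and uses a regularized side-channel to reach task performance on par with black-box models while keeping concept importance high. CREAM also severs undesired information flows by architectural design (for example, to handle mutually exclusive concepts), and its integrated black-box side-channel is used only when needed to improve performance.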

Link: https://arxiv.org/abs/2506.05014
Authors: Nektarios Kalampalikis, Kavya Gupta, Georgi Vitanov, Isabel Valera
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 26 pages, 17 figures

Abstract: In this paper, we propose Concept REAsoning Models (CREAM), a novel family of Concept Bottleneck Models (CBMs) that: (i) explicitly encodes concept-concept (C-C) and concept-task (C → Y) relationships to enforce a desired model reasoning; and (ii) uses a regularized side-channel to achieve competitive task performance, while keeping high concept importance. Specifically, CREAM architecturally embeds (bi)directed concept-concept and concept-to-task relationships specified by a human expert, while severing undesired information flows (e.g., to handle mutually exclusive concepts). Moreover, CREAM integrates a black-box side-channel that is regularized to encourage task predictions to be grounded in the relevant concepts, thereby utilizing the side-channel only when necessary to enhance performance. Our experiments show that: (i) CREAM mainly relies on concepts while achieving task performance on par with black-box models; and (ii) the embedded C-C and C → Y relationships ease model interventions and mitigate concept leakage.

[AI-15] Mathematical Reasoning for Unmanned Aerial Vehicles: A RAG-Based Approach for Complex Arithmetic Reasoning

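[Quick Read]: This paper addresses unreliable mathematical reasoning in autonomous unmanned aerial vehicle (UAV) operation, particularly in tasks such as trajectory planning and power management. Traditional flight control relies on hardcoded equations, while large language models (LLMs) offer more flexible problem solving but struggle to select and apply the correct mathematical formulations and to carry out precise multi-step arithmetic. The proposed solution is RAG-UAV, a retrieval-augmented generation framework that gives LLMs access to relevant domain literature to improve their mathematical reasoning in UAV-specific contexts.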

Link: https://arxiv.org/abs/2506.04998
Authors: Mehdi Azarafza, Mojtaba Nayyeri, Faezeh Pasandideh, Steffen Staab, Achim Rettberg
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures, 4 appendix subsections

Abstract: Autonomous UAV operation necessitates reliable mathematical reasoning for tasks such as trajectory planning and power management. While traditional flight control relies on hardcoded equations, recent Large Language Models (LLMs) offer potential for more flexible problem-solving but struggle with reliably selecting and applying correct mathematical formulations and executing precise multi-step arithmetic. We propose RAG-UAV, a retrieval-augmented generation framework designed to improve the mathematical reasoning of several LLMs (including GPT o1/Turbo, Llama-3.2/3.3, Mistral, and DeepSeek R1) in UAV-specific contexts by providing access to relevant domain literature. To conduct an initial assessment, we introduce the UAV-Math-Bench, a small problem set comprising 20 UAV-centric mathematical problems across four difficulty levels. Our experiments demonstrate that incorporating retrieval substantially increases exact answer accuracy (achieving up to 75% with o1), reduces instances of incorrect formulation selection (from 25% without RAG to 5% with RAG), and decreases numerical errors, reducing Mean Squared Error (MSE) by orders of magnitude for the best-performing models. This pilot study indicates that RAG can enable general-purpose LLMs to function as more reliable tools for engineering analysis, although direct real-time flight control requires further investigation and validation on a larger scale. All benchmark data, questions, and answers are publicly available.

[AI-16] A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair

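[Quick Read]: This paper addresses the automated repair of software vulnerabilities, in particular the generation of security-critical vulnerability patches. It applies pre-trained language models, CodeBERT and CodeT5, to automated vulnerability repair across multiple datasets and programming languages. The study evaluates the models' accuracy and their generalization to unknown vulnerabilities, and examines how fine-tuning affects performance.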

Link: https://arxiv.org/abs/2506.04987
Authors: Zanis Ali Khan, Aayush Garg, Qiang Tang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Preprint has been accepted in ARES AICCPS (International Workshop on Artificial Intelligence, Cyber and Cyber-Physical Security)

Abstract: Software vulnerabilities pose significant security threats, requiring effective mitigation. While Automated Program Repair (APR) has advanced in fixing general bugs, vulnerability patching, a security-critical aspect of APR, remains underexplored. This study investigates pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across six datasets and four languages. We evaluate their accuracy and generalization to unknown vulnerabilities. Results show that while both models face challenges with fragmented or sparse context, CodeBERT performs comparatively better in such scenarios, whereas CodeT5 excels in capturing complex vulnerability patterns. CodeT5 also demonstrates superior scalability. Furthermore, we test fine-tuned models on both in-distribution (trained) and out-of-distribution (unseen) datasets. While fine-tuning improves in-distribution performance, models struggle to generalize to unseen data, highlighting challenges in robust vulnerability detection. This study benchmarks model performance, identifies limitations in generalization, and provides actionable insights to advance automated vulnerability patching for real-world security applications.

[AI-17] Energentic Intelligence: From Self-Sustaining Systems to Enduring Artificial Life

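[Quick Read]: This paper addresses how autonomous agents can operate persistently over the long term in resource-volatile environments. Whereas conventional approaches depend on task performance and external reward mechanisms, it proposes Energentic Intelligence, a new class of autonomous systems whose central objective is self-sustainment. The key is to formalize this principle through an energy-based utility function and a viability-constrained survival horizon, and to build a modular closed-loop control system integrating energy harvesting, thermal regulation, and adaptive computation, which yields stable, resource-aware behavior without external supervision.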

Link: https://arxiv.org/abs/2506.04916
Authors: Atahan Karagoz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract: This paper introduces Energentic Intelligence, a class of autonomous systems defined not by task performance, but by their capacity to sustain themselves through internal energy regulation. Departing from conventional reward-driven paradigms, these agents treat survival (maintaining functional operation under fluctuating energetic and thermal conditions) as the central objective. We formalize this principle through an energy-based utility function and a viability-constrained survival horizon, and propose a modular architecture that integrates energy harvesting, thermal regulation, and adaptive computation into a closed-loop control system. A simulated environment demonstrates the emergence of stable, resource-aware behavior without external supervision. Together, these contributions provide a theoretical and architectural foundation for deploying autonomous agents in resource-volatile settings where persistence must be self-regulated and infrastructure cannot be assumed.

[AI-18] Differentiable Logic Cellular Automata: From Game of Life to Pattern Generation

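[Quick Read]: This paper addresses how to learn the local update rules of cellular automata (CA) while preserving their inherently discrete nature. The key is Differentiable Logic Cellular Automata (DiffLogic CA), which combines Neural Cellular Automata (NCA) with Differentiable Logic Gates Networks (DLGNs). The model builds circuits from differentiable logic gates as its basic computation units, which allows end-to-end gradient-based training while operating in a fully discrete state space at inference time.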

Link: https://arxiv.org/abs/2506.04912
Authors: Pietro Miotti, Eyvind Niklasson, Ettore Randazzo, Alexander Mordvintsev
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces Differentiable Logic Cellular Automata (DiffLogic CA), a novel combination of Neural Cellular Automata (NCA) and Differentiable Logic Gates Networks (DLGNs). The fundamental computation units of the model are differentiable logic gates, combined into a circuit. During training, the model is fully end-to-end differentiable allowing gradient-based training, and at inference time it operates in a fully discrete state space. This enables learning local update rules for cellular automata while preserving their inherent discrete nature. We demonstrate the versatility of our approach through a series of milestones: (1) fully learning the rules of Conway’s Game of Life, (2) generating checkerboard patterns that exhibit resilience to noise and damage, (3) growing a lizard shape, and (4) multi-color pattern generation. Our model successfully learns recurrent circuits capable of generating desired target patterns. For simpler patterns, we observe success with both synchronous and asynchronous updates, demonstrating significant generalization capabilities and robustness to perturbations. We make the case that this combination of DLGNs and NCA represents a step toward programmable matter and robust computing systems that combine binary logic, neural network adaptability, and localized processing. This work, to the best of our knowledge, is the first successful application of differentiable logic gate networks in recurrent architectures.
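
The "differentiable in training, discrete at inference" behavior described above can be illustrated with a small sketch. It uses the real-valued gate relaxations commonly seen in the differentiable logic gate network literature (for example, AND as a product); that this paper uses exactly these relaxations, the four-gate subset, and the function names are assumptions, since the abstract does not spell them out.

```python
import math
from typing import Callable, Dict, List

# Real-valued relaxations that agree with Boolean logic on inputs in {0, 1}.
GATES: Dict[str, Callable[[float, float], float]] = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_gate(a: float, b: float, logits: List[float]) -> float:
    """Training-time node: softmax-weighted mixture over candidate gates."""
    weights = softmax(logits)
    return sum(w * gate(a, b) for w, gate in zip(weights, GATES.values()))


def hard_gate(a: int, b: int, logits: List[float]) -> int:
    """Inference-time node: keep only the highest-scoring gate, fully discrete."""
    names = list(GATES)
    best = names[max(range(len(logits)), key=logits.__getitem__)]
    return int(round(GATES[best](a, b)))


# A node whose learned logits favor XOR.
node_logits = [0.0, 0.0, 5.0, 0.0]
print(soft_gate(0.9, 0.1, node_logits))
print(hard_gate(1, 0, node_logits), hard_gate(1, 1, node_logits))
```

Because every relaxation is a polynomial in its inputs and the mixture weights come from a softmax, gradients can flow to the logits during training, while `hard_gate` reduces the node to an ordinary Boolean gate afterwards.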

[AI-19] LLMs for sensory-motor control: Combining in-context and iterative learning

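[Quick Read]: This paper addresses how large language models (LLMs) can directly control embodied agents, that is, how to map continuous observation vectors to continuous action vectors. The LLMs first generate a control strategy from a textual description of the agent, its environment, and the goal; the strategy is then refined iteratively through a learning process that uses performance feedback and sensory-motor data collected during evaluation. The method is validated on classic control tasks from Gymnasium and on the inverted pendulum task from MuJoCo, and in most cases it finds optimal or high-performing solutions by integrating symbolic knowledge with sub-symbolic sensory-motor data.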

Link: https://arxiv.org/abs/2506.04867
Authors: Jônata Tyska Carvalho, Stefano Nolfi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 24 pages (13 pages are from appendix), 6 figures, code for experiments replication and supplementary material provided at this https URL

Abstract:We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. Initially, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. In most cases, it successfully identifies optimal or high-performing solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.

[AI-20] Towards Network Data Analytics in 5G Systems and Beyond

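[Quick Read]: This paper addresses the fact that Mobile Network Operators (MNOs) lag behind Over-the-Top (OTT) platforms in exploiting data and have not fully unlocked its value. The key is to use the Network Data Analytics Function (NWDAF) to make better use of data in support of user-centric and network-oriented use cases, and to explore its potential for service innovation, operational efficiency, and risk mitigation. The study analyzes trends and gaps in more than 70 articles and proposes two novel use cases to promote NWDAF adoption and explore its monetization potential.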

Link: https://arxiv.org/abs/2506.04860
Authors: Marcos Lima Romero, Ricardo Suyama
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: Submitted to XLIII BRAZILIAN SYMPOSIUM ON TELECOMMUNICATIONS AND SIGNAL PROCESSING - SBrT 2025

Abstract:Data has become a critical asset in the digital economy, yet it remains underutilized by Mobile Network Operators (MNOs), unlike Over-the-Top (OTT) players that lead global market valuations. To move beyond the commoditization of connectivity and deliver greater value to customers, data analytics emerges as a strategic enabler. Using data efficiently is essential for unlocking new service opportunities, optimizing operational efficiency, and mitigating operational and business risks. Since Release 15, the 3rd Generation Partnership Project (3GPP) has introduced the Network Data Analytics Function (NWDAF) to provide powerful insights and predictions using data collected across mobile networks, supporting both user-centric and network-oriented use cases. However, academic research has largely focused on a limited set of methods and use cases, driven by the availability of datasets, restricting broader exploration. This study analyzes trends and gaps in more than 70 articles and proposes two novel use cases to promote the adoption of NWDAF and explore its potential for monetization.

[AI-21] Sparse Autoencoders Again? ICML

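[Quick Read]: This paper addresses limitations of conventional sparse autoencoders (SAEs) and analogous variational autoencoders (VAEs) in modeling low-dimensional latent structure in data, especially in accurately estimating manifold dimensions and producing sparser latent representations. It proposes a hybrid model that combines the strengths of classical SAEs and VAEs, proves that its global minima recover certain forms of structured data spread across a union of manifolds, and shows empirically that it estimates manifold dimensions more accurately and yields sparser latent representations without compromising reconstruction error.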

Link: https://arxiv.org/abs/2506.04859
Authors: Yin Lu, Tong He, Xuening Zhu, David Wipf
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to the International Conference on Machine Learning (ICML) 2025

Abstract:Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In general, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable, within domains such as images and language model activation patterns.

[AI-22] Towards a Multi-Agent Simulation of Cyber-attackers and Cyber-defenders Battles

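[Quick Read]: This paper addresses increasingly complex and coordinated cyber-attacks on networked systems. Its key contribution is a Markovian model, built with a multi-agent approach and implemented in a simulator, of the interactions between cyber-attacker and cyber-defender agents. Network nodes are abstracted as sets of properties, including those of the agents themselves, and agent actions model how the network reacts in a given state and which properties change. This makes it possible to assess the dynamic organizations of cyber-defenders and provides an experimental framework for implementing realistic coordinated cyber-attack scenarios.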

Link: https://arxiv.org/abs/2506.04849
Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Paul Théron, Louis-Marie Traonouez
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: As cyber-attacks become more and more complex and coordinated, multi-agent strategies for cyber-defenders could be key to countering attacks as close as possible to their entry points in a networked system. This paper presents a Markovian model, implemented in a simulator, of battles between cyber-attacker agents and cyber-defender agents deployed on host network nodes. It aims to provide an experimental framework for implementing realistic coordinated cyber-attack scenarios while assessing cyber-defenders' dynamic organizations. We abstract network nodes as sets of properties, including those of the agents. Actions applied by agents model how the network reacts in a given state and which properties change. The collective choice of actions brings the whole environment closer to or farther from the respective goals of cyber-attackers and cyber-defenders. Using the simulator, we implemented a realistically inspired scenario with several behavior implementation approaches for cyber-defenders and cyber-attackers.

[AI-23] On Automating Security Policies with Contemporary LLMs

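[Quick Read]: This paper addresses the tension between the complexity of security enforcement in modern computing environments and the growing sophistication of cyber threats, aiming for more robust, adaptive, and automated security policy compliance. The key is a framework based on large language models (LLMs) that combines in-context learning with retrieval-augmented generation (RAG) to automate attack mitigation policies. Tool and API specifications are stored in a vector database, and an architectural pipeline decomposes high-level mitigation policies into discrete tasks and then translates each task into actionable API calls. Compared with a non-RAG baseline, RAG significantly improves precision, recall, and F1-score.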

Link: https://arxiv.org/abs/2506.04838
Authors: Pablo Fernández Saura, K. R. Jayaram, Vatche Isahagian, Jorge Bernal Bernabé, Antonio Skarmeta
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Short Paper. Accepted To Appear in IEEE SSE 2025 (part of SERVICES 2025)

Abstract:The complexity of modern computing environments and the growing sophistication of cyber threats necessitate a more robust, adaptive, and automated approach to security enforcement. In this paper, we present a framework leveraging large language models (LLMs) for automating attack mitigation policy compliance through an innovative combination of in-context learning and retrieval-augmented generation (RAG). We begin by describing how our system collects and manages both tool and API specifications, storing them in a vector database to enable efficient retrieval of relevant information. We then detail the architectural pipeline that first decomposes high-level mitigation policies into discrete tasks and subsequently translates each task into a set of actionable API calls. Our empirical evaluation, conducted using publicly available CTI policies in STIXv2 format and Windows API documentation, demonstrates significant improvements in precision, recall, and F1-score when employing RAG compared to a non-RAG baseline.
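
The retrieval step this abstract describes (looking up relevant tool and API specifications for each decomposed task) can be sketched in a few lines. This is a toy stand-in, not the paper's system: a bag-of-words vector replaces a learned embedding, an in-memory dictionary replaces the vector database, and the three API names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(task: str, specs: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k API specs most similar to the task description."""
    query = embed(task)
    ranked = sorted(
        specs, key=lambda name: cosine(query, embed(specs[name])), reverse=True
    )
    return ranked[:k]


# Hypothetical API catalogue, for illustration only.
API_SPECS = {
    "block_ip": "add a firewall rule to block traffic from an ip address",
    "kill_process": "terminate a running process by process id",
    "disable_account": "disable a local user account",
}

task = "block inbound traffic from malicious ip address 203.0.113.7"
print(retrieve(task, API_SPECS, k=1))
```

In a pipeline like the one described, the retrieved specifications would be placed in the LLM prompt so that the generated API calls are grounded in documented signatures rather than recalled from memory.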

[AI-24] Oversight Structures for Agentic AI in Public-Sector Organizations ACL2025

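[Quick Read]: This paper addresses the challenges that the introduction of agentic AI systems poses for traditional public-sector oversight mechanisms, in particular that existing oversight based on siloed compliance units and episodic approvals cannot meet the need for continuous, integrated supervision. The key is to design agent oversight compatible with public-sector constraints, organized around five governance dimensions (cross-departmental implementation, comprehensive evaluation, enhanced security protocols, operational visibility, and systematic auditing), so as to strengthen continuous oversight, integrate governance and operational capabilities more deeply, and improve interdepartmental coordination.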

Link: https://arxiv.org/abs/2506.04836
Authors: Chris Schmitz, Jonathan Rystrøm, Jan Batzner
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To appear at REALM@ACL2025

Abstract:This paper finds that the introduction of agentic AI systems intensifies existing challenges to traditional public sector oversight mechanisms – which rely on siloed compliance units and episodic approvals rather than continuous, integrated supervision. We identify five governance dimensions essential for responsible agent deployment: cross-departmental implementation, comprehensive evaluation, enhanced security protocols, operational visibility, and systematic auditing. We evaluate the capacity of existing oversight structures to meet these challenges, via a mixed-methods approach consisting of a literature review and interviews with civil servants in AI-related roles. We find that agent oversight poses intensified versions of three existing governance challenges: continuous oversight, deeper integration of governance and operational capabilities, and interdepartmental coordination. We propose approaches that both adapt institutional structures and design agent oversight compatible with public sector constraints.

[AI-25] Safe Planning and Policy Optimization via World Model Learning

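[Quick Read]: This paper addresses the safety and reliability problems that inaccurate world models cause when model-based reinforcement learning (model-based RL) is applied in practice. In safety-critical settings, model errors can lead to catastrophic failures. The key is a framework that jointly optimizes task performance and safety: it uses an implicit world model to resolve the objective mismatch problem, an adaptive mechanism that dynamically switches between model-based planning and direct policy execution, and dynamic safety thresholds that adapt to the agent's evolving capabilities, so that selected actions surpass the safe policy's suggestions in both performance and safety.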

Link: https://arxiv.org/abs/2506.04828
Authors: Artem Latyshev, Gregory Gorbov, Aleksandr I. Panov
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning (RL) applications in real-world scenarios must prioritize safety and reliability, which impose strict constraints on agent behavior. Model-based RL leverages predictive world models for action planning and policy optimization, but inherent model inaccuracies can lead to catastrophic failures in safety-critical settings. We propose a novel model-based RL framework that jointly optimizes task performance and safety. To address world model errors, our method incorporates an adaptive mechanism that dynamically switches between model-based planning and direct policy execution. We resolve the objective mismatch problem of traditional model-based approaches using an implicit world model. Furthermore, our framework employs dynamic safety thresholds that adapt to the agent’s evolving capabilities, consistently selecting actions that surpass safe policy suggestions in both performance and safety. Experiments demonstrate significant improvements over non-adaptive methods, showing that our approach optimizes safety and performance simultaneously rather than merely meeting minimum safety requirements. The proposed framework achieves robust performance on diverse safety-critical continuous control tasks, outperforming existing methods.

[AI-26] Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

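Quick read: This paper tackles the lack of a fine-grained understanding of how reinforcement learning (RL)-based training improves reasoning in language models. The key to its solution is a fine-grained analytic framework for dissecting the impact of RL on reasoning, focusing on three elements: plan-following and execution, problem decomposition, and reasoning and knowledge utilization. Using this framework, the study finds that RL strengthens a model's ability to integrate external knowledge and makes it more robust on hard tasks, rather than simply improving the execution of external plans.

Link: https://arxiv.org/abs/2506.04723
Authors: Jiayu Wang,Yifei Ming,Zixuan Ke,Caiming Xiong,Shafiq Joty,Aws Albarghouthi,Frederic Sala
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: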

Abstract:Reinforcement learning (RL) has become the dominant paradigm for endowing language models with advanced reasoning capabilities. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of their advantages is still lacking. To address this gap, we introduce a fine-grained analytic framework to dissect the impact of RL on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training: (1) plan-following and execution, (2) problem decomposition, and (3) improved reasoning and knowledge utilization. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit step-by-step plans surprisingly degrades performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than their base counterparts. This suggests that RL may not primarily enhance the execution of external plans but rather empower models to formulate and follow internal strategies better suited to their reasoning processes. Conversely, we observe that RL enhances the model’s capacity to integrate provided knowledge into its reasoning process, leading to performance improvements across diverse tasks. We also study difficulty, showing improved training by developing new ways to exploit hard problems. Our findings lay a foundation for more principled training and evaluation of reasoning models.

[AI-27] UNO: Unlearning via Orthogonalization in Generative models

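Quick read: This paper addresses how to efficiently remove the influence of specific data from generative models, to meet needs such as privacy protection, legal compliance, or correction of harmful content. In conventional training, data are accumulated and knowledge is reinforced, whereas unlearning must selectively remove the influence of particular data points without retraining from scratch. The key to the solution is a family of fast unlearning algorithms based on loss gradient orthogonalization, which forget the undesired data while preserving the generation quality of the original model, and need only a small number of training steps.

Link: https://arxiv.org/abs/2506.04712
Authors: Pinak Mandal,Georg A. Gottwald
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: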

Abstract:As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. Using MNIST and CelebA data, we demonstrate that our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery.
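The abstract names "loss gradient orthogonalization" without giving the update rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes two flattened gradient vectors, a hypothetical `g_forget` (gradient of the loss on the data to be forgotten) and `g_retain` (gradient on the data to be kept), and removes from `g_forget` its component along `g_retain`, so that a step along the result does not change the retain loss to first order.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(g_forget, g_retain, eps=1e-12):
    """Project g_forget onto the subspace orthogonal to g_retain.

    Illustrative only: the names and the single-vector projection are
    assumptions, not taken from the paper.
    """
    denom = dot(g_retain, g_retain)
    if denom < eps:
        # The retain gradient is (numerically) zero; nothing to project out.
        return list(g_forget)
    coef = dot(g_forget, g_retain) / denom
    return [gf - coef * gr for gf, gr in zip(g_forget, g_retain)]


# Hypothetical gradients for a three-parameter model.
g_forget = [1.0, 2.0, 0.5]
g_retain = [0.0, 1.0, 1.0]
g_update = orthogonalize(g_forget, g_retain)
```

In this toy case `g_update` has zero inner product with `g_retain`, which is the property the projection is meant to guarantee.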

[AI-28] Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence

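Quick read: This paper addresses slow convergence, unstable training, and mode collapse in the training of implicit generative models. The key to its solution is a new likelihood-free objective, dual-ISL, which interchanges the roles of the target and model distributions in the ISL framework and thereby turns the problem into a convex optimization problem in the space of model densities. The resulting rank-based discrepancy $ d_K $ is continuous under weak convergence and with respect to the $ L^1 $ norm, and convex in its first argument. Classical divergences such as KL or Wasserstein distances do not share these properties, so the objective gives a more stable and efficient optimization path for training.

Link: https://arxiv.org/abs/2506.04700
Authors: José Manuel de Frutos,Manuel A. Vázquez,Pablo M. Olmos,Joaquín Míguez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Comments: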

Abstract:Rank-based statistical metrics, such as the invariant statistical loss (ISL), have recently emerged as robust and practically effective tools for training implicit generative models. In this work, we introduce dual-ISL, a novel likelihood-free objective for training implicit generative models that interchanges the roles of the target and model distributions in the ISL framework, yielding a convex optimization problem in the space of model densities. We prove that the resulting rank-based discrepancy d_K is i) continuous under weak convergence and with respect to the L^1 norm, and ii) convex in its first argument, properties not shared by classical divergences such as KL or Wasserstein distances. Building on this, we develop a theoretical framework that interprets d_K as an L^2-projection of the density ratio q = p/\tilde p onto a Bernstein polynomial basis, from which we derive exact bounds on the truncation error, precise convergence rates, and a closed-form expression for the truncated density approximation. We further extend our analysis to the multivariate setting via random one-dimensional projections, defining a sliced dual-ISL divergence that retains both convexity and continuity. We empirically show that these theoretical advantages translate into practical ones. Specifically, across several benchmarks dual-ISL converges more rapidly, delivers markedly smoother and more stable training, and more effectively prevents mode collapse than classical ISL and other leading implicit generative methods, while also providing an explicit density approximation.
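For readers unfamiliar with the Bernstein polynomial basis mentioned in the abstract, here is a minimal sketch of the standard degree-n basis on [0, 1] and of evaluating a polynomial expressed in it. It does not compute d_K or the paper's density approximation; the function names and any coefficient values passed in are hypothetical.

```python
from math import comb


def bernstein_basis(k, n, x):
    """k-th Bernstein basis polynomial of degree n at x in [0, 1]."""
    return comb(n, k) * x**k * (1 - x) ** (n - k)


def bernstein_eval(coeffs, x):
    """Evaluate sum_k coeffs[k] * B_{k,n}(x), with n = len(coeffs) - 1."""
    n = len(coeffs) - 1
    return sum(c * bernstein_basis(k, n, x) for k, c in enumerate(coeffs))


# The basis functions are non-negative and sum to one at every x in [0, 1].
total = sum(bernstein_basis(k, 5, 0.3) for k in range(6))
```

Because the basis forms a partition of unity, a polynomial whose coefficients are all equal to a constant evaluates to that constant everywhere on [0, 1].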

[AI-29] Empowering Economic Simulation for Massively Multiplayer Online Games through Generative Agent-Based Modeling KDD2025

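Quick read: This paper addresses the difficulty that existing rule-based or reinforcement-learning agents have in realistically reproducing human economic behavior in Massively Multiplayer Online (MMO) economy simulation, in particular their limited reliability, sociability, and interpretability. The key to the solution is to introduce Large Language Models (LLMs) and use their role-playing, generative, and reasoning abilities to build LLM-driven agents with human-like decision-making and adaptability, which improves agent behavior in the simulation.

Link: https://arxiv.org/abs/2506.04699
Authors: Bihan Xu,Shiwei Zhao,Runze Wu,Zhenya Huang,Jiawei Wang,Zhipeng Hu,Kai Wang,Haoyu Liu,Tangjie Lv,Le Li,Changjie Fan,Xin Tong,Jiangze Han
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: KDD2025 Accepted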

Abstract:Within the domain of Massively Multiplayer Online (MMO) economy research, Agent-Based Modeling (ABM) has emerged as a robust tool for analyzing game economics, evolving from rule-based agents to decision-making agents enhanced by reinforcement learning. Nevertheless, existing works encounter significant challenges when attempting to emulate human-like economic activities among agents, particularly regarding agent reliability, sociability, and interpretability. In this study, we take a preliminary step in introducing a novel approach using Large Language Models (LLMs) in MMO economy simulation. Leveraging LLMs’ role-playing proficiency, generative capacity, and reasoning aptitude, we design LLM-driven agents with human-like decision-making and adaptability. These agents are equipped with the abilities of role-playing, perception, memory, and reasoning, addressing the aforementioned challenges effectively. Simulation experiments focusing on in-game economic activities demonstrate that LLM-empowered agents can promote emergent phenomena like role specialization and price fluctuations in line with market rules.

[AI-30] On the Mechanism of Reasoning Pattern Selection in Reinforcement Learning for Language Models

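Quick read: This paper addresses the poorly understood mechanism by which reinforcement learning (RL) improves the reasoning abilities of large language models (LLMs), and in particular how Reinforcement Learning with Verifiable Rewards (RLVR) optimizes a model's selection of reasoning patterns. The key to its solution is a systematic empirical study and theoretical analysis showing that RLVR improves performance by optimizing the selection among existing reasoning patterns: it identifies and adopts high-success-rate patterns while performance on individual patterns stays largely stable. The paper also proposes applying supervised fine-tuning (SFT) to weak models before RLVR to speed up convergence.

Link: https://arxiv.org/abs/2506.04695
Authors: Xingwu Chen,Tianle Li,Difan Zou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 30 pages, 6 figures, 1 table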

Abstract:Reinforcement learning (RL) has demonstrated remarkable success in enhancing model capabilities, including instruction-following, preference learning, and reasoning. Yet despite its empirical successes, the mechanisms by which RL improves reasoning abilities remain poorly understood. We present a systematic study of Reinforcement Learning with Verifiable Rewards (RLVR), showing that its primary benefit comes from optimizing the selection of existing reasoning patterns. Through extensive experiments, we demonstrate that RLVR-trained models preferentially adopt high-success-rate reasoning patterns while mostly maintaining stable performance on individual patterns. We further develop theoretical analyses on the convergence and training dynamics of RLVR based on a simplified question-reason-answer model. We study the gradient flow and show that RLVR can indeed find the solution that selects the reason pattern with the highest success rate. Besides, our theoretical results reveal two distinct regimes regarding the convergence of RLVR training: (1) rapid convergence for models with relatively strong initial reasoning capabilities versus (2) slower optimization dynamics for weaker models. Furthermore, we show that the slower optimization for weaker models can be mitigated by applying the supervised fine-tuning (SFT) before RLVR, when using a feasibly high-quality SFT dataset. We validate the theoretical findings through extensive experiments. This work advances our theoretical understanding of RL’s role in LLM fine-tuning and offers insights for further enhancing reasoning capabilities.

[AI-31] Influence Functions for Edge Edits in Non-Convex Graph Neural Networks

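Quick read: This paper addresses the interpretability and robustness question of how individual edges influence the behavior of graph neural networks (GNNs), and in particular the limitations of existing graph influence functions for estimating the effect of edge deletions: they rely on strict convexity assumptions, consider only edge deletions and ignore edge insertions, and fail to capture changes in message propagation. The key to the solution is a proximal Bregman response function tailored to GNNs, which relaxes the convexity requirement and enables accurate influence prediction for standard neural network architectures. The method also explicitly accounts for message propagation effects and extends influence prediction to both edge deletions and insertions in a principled way.

Link: https://arxiv.org/abs/2506.04694
Authors: Jaeseung Heo,Kyeongheung Yun,Seokwon Yoon,MoonJeong Park,Jungseul Ok,Dongwoo Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: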

Abstract:Understanding how individual edges influence the behavior of graph neural networks (GNNs) is essential for improving their interpretability and robustness. Graph influence functions have emerged as promising tools to efficiently estimate the effects of edge deletions without retraining. However, existing influence prediction methods rely on strict convexity assumptions, exclusively consider the influence of edge deletions while disregarding edge insertions, and fail to capture changes in message propagation caused by these modifications. In this work, we propose a proximal Bregman response function specifically tailored for GNNs, relaxing the convexity requirement and enabling accurate influence prediction for standard neural network architectures. Furthermore, our method explicitly accounts for message propagation effects and extends influence prediction to both edge deletions and insertions in a principled way. Experiments with real-world datasets demonstrate accurate influence predictions for different characteristics of GNNs. We further demonstrate that the influence function is versatile in applications such as graph rewiring and adversarial attacks.

[AI-32] Towards Better Generalization via Distributional Input Projection Network

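Quick read: This paper addresses two problems in overparameterized models: training loss alone gives limited insight into generalization performance, and smoothness, which is linked to better generalization, is hard to enforce directly in neural networks. The key to its solution is a new framework, Distributional Input Projection Networks (DIPNet), which projects inputs into learnable distributions at each layer. This induces a smoother loss landscape with respect to the input and thereby improves generalization.

Link: https://arxiv.org/abs/2506.04690
Authors: Yifan Hao,Yanxin Lu,Xinwei Shen,Tong Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: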

Abstract:As overparameterized models become increasingly prevalent, training loss alone offers limited insight into generalization performance. While smoothness has been linked to improved generalization across various settings, directly enforcing smoothness in neural networks remains challenging. To address this, we introduce Distributional Input Projection Networks (DIPNet), a novel framework that projects inputs into learnable distributions at each layer. This distributional representation induces a smoother loss landscape with respect to the input, promoting better generalization. We provide theoretical analysis showing that DIPNet reduces both local smoothness measures and the Lipschitz constant of the network, contributing to improved generalization performance. Empirically, we validate DIPNet across a wide range of architectures and tasks, including Vision Transformers (ViTs), Large Language Models (LLMs), ResNet and MLPs. Our method consistently enhances test performance under standard settings, adversarial attacks, out-of-distribution inputs, and reasoning benchmarks. We demonstrate that the proposed input projection strategy can be seamlessly integrated into existing models, providing a general and effective approach for boosting generalization performance in modern deep learning.

[AI-33] E-bike agents: Large Language Model-Driven E-Bike Accident Analysis and Severity Prediction

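Quick read: This paper addresses the difficulty of quantitative safety analysis caused by electric bicycle (e-bike) incident reports being written in unstructured narrative formats. The key to its solution is the E-bike agents framework, which uses large language model (LLM)-powered agents to classify and extract safety variables from unstructured incident reports. Its modules handle data classification, information extraction, injury cause determination, and component linkage, and together identify the key factors behind accidents and their severity.

Link: https://arxiv.org/abs/2506.04654
Authors: Zhichao Yang,Jiashu He,Mohammad B. Al-Khasawneh,Darshan Pandit,Cirillo Cinzia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: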

Abstract:Electric bicycles (e-bikes) are rapidly increasing in use, raising safety concerns due to a rise in accident reports. However, e-bike incident reports often use unstructured narrative formats, which hinders quantitative safety analysis. This study introduces E-bike agents, a framework that uses large language model (LLM)-powered agents to classify and extract safety variables from unstructured incident reports. Our framework consists of four LLM agents, handling data classification, information extraction, injury cause determination, and component linkage, to extract the key factors that could lead to e-bike accidents and cause varying severity levels. Furthermore, we used an ordered logit model to examine the relationship between the severity of the incident and the factors retrieved, such as gender, the type of cause, and environmental conditions. Our research shows that equipment issues are slightly more common than human-related ones, but human-related incidents are more often fatal. Specifically, pedals, tires, and brakes are frequent contributors to accidents. The model achieves a high weighted F1 score of 0.87 in classification accuracy, highlighting the potential of using LLMs to extract unstructured data in niche domains, such as transportation. Our method offers a scalable solution to improve e-bike safety analytics and provides actionable information for policy makers, designers, and regulators.
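The abstract mentions an ordered logit model for incident severity but gives no specification. As a hedged illustration of how such a model turns covariates into severity-level probabilities, the sketch below implements the standard proportional-odds form P(Y <= j) = sigmoid(c_j - x·beta). The feature vector, coefficients, and cutpoints are made-up values, not estimates from the paper.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(x, beta, cutpoints):
    """Category probabilities under a proportional-odds (ordered logit) model.

    cutpoints must be strictly increasing; the result has
    len(cutpoints) + 1 entries, ordered from lowest to highest severity.
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Cumulative probabilities P(Y <= j); the last category closes at 1.
    cum = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


# Hypothetical example: two covariates, four ordered severity levels.
beta = [0.8, -0.3]
cutpoints = [-1.0, 0.5, 2.0]
probs_low = ordered_logit_probs([0.0, 0.0], beta, cutpoints)
probs_high = ordered_logit_probs([2.0, 0.0], beta, cutpoints)
```

With a positive coefficient, raising the first covariate shifts probability mass toward the highest severity level, which is how the sign of a fitted coefficient is usually read in this kind of model.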

[AI-34] Agents of Change: Self-Evolving LLM Agents for Strategic Planning

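Quick read: This paper addresses the weakness of large language models (LLMs) in formulating and executing long-term strategies, especially their difficulty in forming and adhering to a coherent long-term strategy in complex tasks. The key to its solution is a self-evolving system built on a multi-agent architecture with Analyzer, Researcher, Coder, and Player roles, which collaborate to iteratively analyze gameplay, research new strategies, and modify the agent's logic or prompt, so that the LLM agent improves and adapts on its own.

Link: https://arxiv.org/abs/2506.04651
Authors: Nikolas Belle,Dakota Barnes,Alfonso Amayuelas,Ivan Bercovich,Xin Eric Wang,William Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: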

Abstract:Recent advances in LLMs have enabled their use as autonomous agents across a range of tasks, yet they continue to struggle with formulating and adhering to coherent long-term strategies. In this paper, we investigate whether LLM agents can self-improve when placed in environments that explicitly challenge their strategic planning abilities. Using the board game Settlers of Catan, accessed through the open-source Catanatron framework, we benchmark a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and their player agent’s code. We introduce a multi-agent architecture in which specialized roles (Analyzer, Researcher, Coder, and Player) collaborate to iteratively analyze gameplay, research new strategies, and modify the agent’s logic or prompt. By comparing manually crafted agents to those evolved entirely by LLMs, we evaluate how effectively these systems can diagnose failure and adapt over time. Our results show that self-evolving agents, particularly when powered by models like Claude 3.7 and GPT-4o, outperform static baselines by autonomously adopting their strategies, passing along sample behavior to game-playing agents, and demonstrating adaptive reasoning over multiple iterations.

[AI-35] CHANCERY: Evaluating corporate governance reasoning capabilities in language models

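Quick read: This paper addresses the lack of benchmark datasets in the legal domain that specifically target reasoning tasks, in particular reasoning about consistency with corporate governance rules. The key to its solution is CHANCERY, a corporate governance reasoning benchmark built on real corporate governance principles and charters. It uses a binary classification task to test whether a model can judge if an action proposed by executives, the board, or shareholders is consistent with the rules in the corporate charter, which supports the evaluation and improvement of legal reasoning capabilities.

Link: https://arxiv.org/abs/2506.04636
Authors: Lucas Irwin,Arda Kaz,Peiyao Sheng,Pramod Viswanath
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: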

Abstract:Law has long been a domain that has been popular in natural language processing (NLP) applications. Reasoning (ratiocination and the ability to make connections to precedent) is a core part of the practice of the law in the real world. Nevertheless, while multiple legal datasets exist, none have thus far focused specifically on reasoning tasks. We focus on a specific aspect of the legal landscape by introducing a corporate governance reasoning benchmark (CHANCERY) to test a model’s ability to reason about whether executive/board/shareholder’s proposed actions are consistent with corporate governance charters. This benchmark introduces a first-of-its-kind corporate governance reasoning test for language models - modeled after real world corporate governance law. The benchmark consists of a corporate charter (a set of governing covenants) and a proposal for executive action. The model’s task is one of binary classification: reason about whether the action is consistent with the rules contained within the charter. We create the benchmark following established principles of corporate governance - 24 concrete corporate governance principles established in and 79 real life corporate charters selected to represent diverse industries from a total dataset of 10k real life corporate charters. Evaluations on state-of-the-art (SOTA) reasoning models confirm the difficulty of the benchmark, with models such as Claude 3.7 Sonnet and GPT-4o achieving 64.5% and 75.2% accuracy respectively. Reasoning agents exhibit superior performance, with agents based on the ReAct and CodeAct frameworks scoring 76.1% and 78.1% respectively, further confirming the advanced legal reasoning capabilities required to score highly on the benchmark. We also conduct an analysis of the types of questions which current reasoning models struggle on, revealing insights into the legal reasoning capabilities of SOTA models.

[AI-36] Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation

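Quick read: This paper addresses the low tolerance for decision errors that multimodal large language models (MLLMs) face in graphical user interface (GUI) automation, where decisions must be made step by step in an online interactive environment based on its real-time state. The key to its solution is a pre-operative critic mechanism that gives effective feedback before execution by reasoning about the potential outcome and correctness of an action. Specifically, it uses a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to build the pre-operative critic model GUI-Critic-R1, with a novel suggestion reward to make the model's feedback more reliable.

Link: https://arxiv.org/abs/2506.04614
Authors: Yuyang Wanyan,Xi Zhang,Haiyang Xu,Haowei Liu,Junyang Wang,Jiabo Ye,Yutong Kou,Ming Yan,Fei Huang,Xiaoshan Yang,Weiming Dong,Changsheng Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: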

Abstract:In recent years, Multimodal Large Language Models (MLLMs) have been extensively utilized for multimodal reasoning tasks, including Graphical User Interface (GUI) automation. Unlike general offline multimodal tasks, GUI automation is executed in online interactive environments, necessitating step-by-step decision-making based on real-time status of the environment. This task has a lower tolerance for decision-making errors at each step, as any mistakes may cumulatively disrupt the process and potentially lead to irreversible outcomes like deletions or payments. To address these issues, we introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution, by reasoning about the potential outcome and correctness of actions. Specifically, we propose a Suggestion-aware Gradient Relative Policy Optimization (S-GRPO) strategy to construct our pre-operative critic model GUI-Critic-R1, incorporating a novel suggestion reward to enhance the reliability of the model’s feedback. Furthermore, we develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test, filling existing gaps in GUI critic data. Static experiments on the GUI-Critic-Test across both mobile and web domains reveal that our GUI-Critic-R1 offers significant advantages in critic accuracy compared to current MLLMs. Dynamic evaluation on GUI automation benchmark further highlights the effectiveness and superiority of our model, as evidenced by improved success rates and operational efficiency.

[AI-37] DeePoly: A High-Order Accuracy and Efficiency Deep-Polynomial Framework for Scientific Machine Learning

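Quick read: This paper addresses the lack of convergence guarantees and computational efficiency of deep neural networks (DNNs) when solving partial differential equations (PDEs). The key to its solution is the DeePoly framework, which takes a two-stage approach: a DNN first captures complex global features, and a linear-space optimization then combines the DNN-extracted features (Scoper) with polynomial basis functions (Sniper). This exploits the strength of DNNs in approximating high-gradient features together with the strength of polynomial bases in high-precision local correction and convergence guarantees.

Link: https://arxiv.org/abs/2506.04613
Authors: Li Liu,Heng Yong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: for associated mpeg file, see this http URL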

Abstract:Recently, machine learning methods have gained significant traction in scientific computing, particularly for solving Partial Differential Equations (PDEs). However, methods based on deep neural networks (DNNs) often lack convergence guarantees and computational efficiency compared to traditional numerical schemes. This work introduces DeePoly, a novel framework that transforms the solution paradigm from pure non-convex parameter optimization to a two-stage approach: first employing a DNN to capture complex global features, followed by linear space optimization with combined DNN-extracted features (Scoper) and polynomial basis functions (Sniper). This strategic combination leverages the complementary strengths of both methods – DNNs excel at approximating complex global features (i.e., high-gradient features) and stabilize the polynomial approximation while polynomial bases provide high-precision local corrections with convergence guarantees. Theoretical analysis and numerical experiments demonstrate that this approach significantly enhances both high-order accuracy and efficiency across diverse problem types while maintaining mesh-free and scheme-free properties. This paper also serves as a theoretical exposition for the open-source project DeePoly.

[AI-38] Judicial Permission

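Quick read: This paper examines the significance of weak permissions in criminal trials, with the core question being how to treat judicial permission systematically. The key to its solution is a dialogue game model that considers different standards of proof and argumentation semantics, allowing a more complete analysis of where judicial permission applies and where it is limited.

Link: https://arxiv.org/abs/2506.04610
Authors: Guido Governatori,Antonino Rotolo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments: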

Abstract:This paper examines the significance of weak permissions in criminal trials (judicial permission). It introduces a dialogue game model to systematically address judicial permissions, considering different standards of proof and argumentation semantics.

[AI-39] Intelligent Channel Allocation for IEEE 802.11be Multi-Link Operation: When MAB Meets LLM

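Quick read: This paper addresses dynamic channel allocation in dense WiFi 7 networks with Multi-Link Operation (MLO) capabilities. The key to its solution is to formulate the problem as a combinatorial optimization problem and to introduce a Best-Arm Identification-enabled Monte Carlo Tree Search (BAI-MCTS) algorithm, set within a Multi-Armed Bandit (MAB) framework so that channel allocations can be learned online without prior network information. To improve performance further, the paper also proposes LLM-BAI-MCTS, which integrates a Large Language Model (LLM) to reduce sample complexity and improve generalization across network scenarios.

Link: https://arxiv.org/abs/2506.04594
Authors: Shumin Lian,Jingwen Tong,Jun Zhang,Liqun Fu
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This work has been accepted by JSAC 2025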

Abstract:WiFi networks have achieved remarkable success in enabling seamless communication and data exchange worldwide. The IEEE 802.11be standard, known as WiFi 7, introduces Multi-Link Operation (MLO), a groundbreaking feature that enables devices to establish multiple simultaneous connections across different bands and channels. While MLO promises substantial improvements in network throughput and latency reduction, it presents significant challenges in channel allocation, particularly in dense network environments. Current research has predominantly focused on performance analysis and throughput optimization within static WiFi 7 network configurations. In contrast, this paper addresses the dynamic channel allocation problem in dense WiFi 7 networks with MLO capabilities. We formulate this challenge as a combinatorial optimization problem, leveraging a novel network performance analysis mechanism. Given the inherent lack of prior network information, we model the problem within a Multi-Armed Bandit (MAB) framework to enable online learning of optimal channel allocations. Our proposed Best-Arm Identification-enabled Monte Carlo Tree Search (BAI-MCTS) algorithm includes rigorous theoretical analysis, providing upper bounds for both sample complexity and error probability. To further reduce sample complexity and enhance generalizability across diverse network scenarios, we put forth LLM-BAI-MCTS, an intelligent algorithm for the dynamic channel allocation problem by integrating the Large Language Model (LLM) into the BAI-MCTS algorithm. Numerical results demonstrate that the BAI-MCTS algorithm achieves a convergence rate approximately 50.44% faster than the state-of-the-art algorithms when reaching 98% of the optimal value. Notably, the convergence rate of the LLM-BAI-MCTS algorithm increases by over 63.32% in dense networks.

[AI-40] OpenAg: Democratizing Agricultural Intelligence

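Quick read: This paper addresses the shortcomings of current agricultural intelligence systems in contextual understanding, explainability, and adaptability, especially the lack of practical decision support for smallholder farmers with limited resources. The key to its solution is the OpenAg framework, which combines domain-specific foundation models, neural knowledge graphs, multi-agent reasoning, causal explainability, and adaptive transfer learning to deliver context-aware, explainable, and actionable agricultural insights.

Link: https://arxiv.org/abs/2506.04571
Authors: Srikanth Thudumu,Jason Fisher
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 1 figure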

Abstract:Agriculture is undergoing a major transformation driven by artificial intelligence (AI), machine learning, and knowledge representation technologies. However, current agricultural intelligence systems often lack contextual understanding, explainability, and adaptability, especially for smallholder farmers with limited resources. General-purpose large language models (LLMs), while powerful, typically lack the domain-specific knowledge and contextual reasoning needed for practical decision support in farming. They tend to produce recommendations that are too generic or unrealistic for real-world applications. To address these challenges, we present OpenAg, a comprehensive framework designed to advance agricultural artificial general intelligence (AGI). OpenAg combines domain-specific foundation models, neural knowledge graphs, multi-agent reasoning, causal explainability, and adaptive transfer learning to deliver context-aware, explainable, and actionable insights. The system includes: (i) a unified agricultural knowledge base that integrates scientific literature, sensor data, and farmer-generated knowledge; (ii) a neural agricultural knowledge graph for structured reasoning and inference; (iii) an adaptive multi-agent reasoning system where AI agents specialize and collaborate across agricultural domains; and (iv) a causal transparency mechanism that ensures AI recommendations are interpretable, scientifically grounded, and aligned with real-world constraints. OpenAg aims to bridge the gap between scientific knowledge and the tacit expertise of experienced farmers to support scalable and locally relevant agricultural decision-making.

[AI-41] BESA: Boosting Encoder Stealing Attack with Perturbation Recovery

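Quick read: This paper addresses the degraded performance of encoder stealing attacks under perturbation-based defenses. The key to its solution is an attack named BESA (Boosting Encoder Stealing Attack with Perturbation Recovery), built around two core modules: perturbation detection and perturbation recovery. The perturbation detection module uses the feature vectors obtained from the target encoder to infer the defense mechanism employed by the service provider, and the perturbation recovery module uses a well-designed generative model to restore clean feature vectors from the perturbed ones, which improves the effectiveness of the encoder stealing attack.

Link: https://arxiv.org/abs/2506.04556
Authors: Xuhao Ren,Haotian Liang,Yajie Wang,Chuan Zhang,Zehui Xiong,Liehuang Zhu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: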

Abstract:To boost the encoder stealing attack under the perturbation-based defense that hinders the attack performance, we propose a boosting encoder stealing attack with perturbation recovery named BESA. It aims to overcome perturbation-based defenses. The core of BESA consists of two modules: perturbation detection and perturbation recovery, which can be combined with canonical encoder stealing attacks. The perturbation detection module utilizes the feature vectors obtained from the target encoder to infer the defense mechanism employed by the service provider. Once the defense mechanism is detected, the perturbation recovery module leverages the well-designed generative model to restore a clean feature vector from the perturbed one. Through extensive evaluations based on various datasets, we demonstrate that BESA significantly enhances the surrogate encoder accuracy of existing encoder stealing attacks by up to 24.63% when facing state-of-the-art defenses and combinations of multiple defenses.

[AI-42] hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation

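Quick read: This paper addresses the fact that far less public Verilog code is available than code for software languages such as Python, which limits how much large language models (LLMs) can improve on hardware code generation. The key to its solution is the hdl2v dataset, which increases the amount of high-quality human-written Verilog data by translating or compiling other hardware description languages (HDLs), namely VHDL, Chisel, and PyMTL3, to Verilog. Without any data augmentation or knowledge distillation, this markedly improves a 32 billion-parameter open-weight model on the VerilogEvalV2 benchmark, and it also boosts a data augmentation-based fine-tuning approach.

Link: https://arxiv.org/abs/2506.04544
Authors: Charles Hong,Brendan Roberts,Huijae An,Alex Um,Advay Ratan,Yakun Sophia Shao
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: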

Abstract:Large language models (LLMs) are playing an increasingly large role in domains such as code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v (“HDL-to-Verilog”), a dataset which seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages - VHDL, Chisel, and PyMTL3 - to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving performance of a 32 billion-parameter open-weight model by up to 23% (pass@10) in VerilogEvalV2, without utilizing any data augmentation or knowledge distillation from larger models. We also show hdl2v’s ability to boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.
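The abstract reports results as pass@10 without saying how the metric is computed. A common choice in code-generation evaluation is the unbiased estimator introduced alongside the HumanEval benchmark, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The sketch below assumes that estimator; whether this paper's evaluation uses exactly this form is not stated in the abstract.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number that pass the tests,
    k: budget of samples considered (k <= n).
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples generated, 5 of them correct.
estimate_k1 = pass_at_k(20, 5, 1)
estimate_k10 = pass_at_k(20, 5, 10)
```

A benchmark-level score is then the mean of the per-problem estimates. For k = 1 the estimate reduces to the plain fraction of passing samples, c / n.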

[AI-43] NOBLE – Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

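[Quick Read]: This paper aims to address two challenges: current neuron modeling approaches are constrained by the limited availability and intrinsic variability of experimental data, and traditional deterministic models are hard to scale while deep learning fails to capture the full biophysical complexity of neurons. The key to the solution is NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection, which lets it predict distributions of neural dynamics and capture the intrinsic variability seen in experiments.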

Link: https://arxiv.org/abs/2506.04536
Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Characterizing the diverse computational properties of human neurons via multimodal electrophysiological, transcriptomic, and morphological data provides the foundation for constructing and validating bio-realistic neuron models that can advance our understanding of fundamental mechanisms underlying brain function. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. To capture variability, ensembles of deterministic models are often used, but are difficult to scale as model generation requires repeating computationally expensive optimization for each neuron. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on data generated from biophysically realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE is the first scaled-up deep learning framework validated on real experimental data, enabling efficient generation of synthetic neurons that exhibit trial-to-trial variability and achieve a $4200\times$ speedup over numerical solvers. To this end, NOBLE captures fundamental neural properties, opening the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.

[AI-44] BEAR: BGP Event Analysis and Reporting

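[Quick Read]: This paper addresses the automatic explanation and reporting of BGP anomaly events such as route leaks and hijacks, which traditionally requires experts with in-depth BGP knowledge to analyze events and propose remediation. The key to the solution is the BEAR framework, which uses large language models (LLMs) to automatically generate comprehensive textual reports: a multi-step reasoning process converts structured BGP data into interpretable narrative text, improving the interpretability and analytical precision of anomaly analysis.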

Link: https://arxiv.org/abs/2506.04514
Authors: Hanqing Li, Melania Fedeli, Vinay Kolar, Diego Klabjan
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Internet comprises interconnected, independently managed Autonomous Systems (AS) that rely on the Border Gateway Protocol (BGP) for inter-domain routing. BGP anomalies, such as route leaks and hijacks, can divert traffic through unauthorized or inefficient paths, jeopardizing network reliability and security. Although existing rule-based and machine learning methods can detect these anomalies using structured metrics, they still require experts with in-depth BGP knowledge of, for example, AS relationships and historical incidents, to interpret events and propose remediation. In this paper, we introduce BEAR (BGP Event Analysis and Reporting), a novel framework that leverages large language models (LLMs) to automatically generate comprehensive reports explaining detected BGP anomaly events. BEAR employs a multi-step reasoning process that translates tabular BGP data into detailed textual narratives, enhancing interpretability and analytical precision. To address the limited availability of publicly documented BGP anomalies, we also present a synthetic data generation framework powered by LLMs. Evaluations on both real and synthetic datasets demonstrate that BEAR achieves 100% accuracy, outperforming Chain-of-Thought and in-context learning baselines. This work pioneers an automated approach for explaining BGP anomaly events, offering valuable operational insights for network management.

[AI-45] Schema Generation for Large Knowledge Graphs Using Large Language Models

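[Quick Read]: This paper addresses automatic schema generation for knowledge graphs (KGs), a process that has traditionally required substantial involvement from knowledge engineers and domain experts. The key to the solution is to exploit the strong capabilities of large language models (LLMs) on related tasks: the authors introduce two datasets, YAGO Schema and Wikidata EntitySchema, together with evaluation metrics, and build LLM-based pipelines for automatic schema generation. The pipelines use both local and global information from the KG to generate validating schemas in Shape Expressions (ShEx), demonstrating the potential of LLMs to produce high-quality ShEx schemas.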

Link: https://arxiv.org/abs/2506.04512
Authors: Bohui Zhang, Yuan He, Lydia Pintscher, Albert Meroño Peñuela, Elena Simperl
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Schemas are vital for ensuring data quality in the Semantic Web and natural language processing. Traditionally, their creation demands substantial involvement from knowledge engineers and domain experts. Leveraging the impressive capabilities of large language models (LLMs) in related tasks like ontology engineering, we explore automatic schema generation using LLMs. To bridge the resource gap, we introduce two datasets: YAGO Schema and Wikidata EntitySchema, along with evaluation metrics. The LLM-based pipelines effectively utilize local and global information from knowledge graphs (KGs) to generate validating schemas in Shape Expressions (ShEx). Experiments demonstrate LLMs’ strong potential in producing high-quality ShEx schemas, paving the way for scalable, automated schema generation for large KGs. Furthermore, our benchmark introduces a new challenge for structured generation, pushing the limits of LLMs on syntactically rich formalisms.

[AI-46] “Don't Do That!”: Guiding Embodied Systems through Large Language Model-based Constraint Generation

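[Quick Read]: This paper addresses how, in robotic navigation, complex spatial, mathematical, and conditional constraints expressed in natural language can be turned into a form that a planning algorithm can execute; such constraints are often informal yet highly complex. The key to the solution is the STPR framework, which uses the strong coding capabilities of large language models (LLMs) to translate constraints phrased as “what not to do” instructions into executable Python functions. This shifts the problem description from language to structured, transparent code, avoiding complex reasoning and potential hallucinations.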

Link: https://arxiv.org/abs/2506.04500
Authors: Aladin Djuhera, Amin Seffo, Masataro Asai, Holger Boche
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint; under review

Abstract:Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem. Such constraints can be informal yet highly complex, making them challenging to translate into a formal description that can be passed on to a planning algorithm. In this paper, we propose STPR, a constraint generation framework that uses LLMs to translate constraints (expressed as instructions on “what not to do”) into executable Python functions. STPR leverages the LLM’s strong coding capabilities to shift the problem description from language into structured and transparent code, thus circumventing complex reasoning and avoiding potential hallucinations. We show that these LLM-generated functions accurately describe even complex mathematical constraints, and apply them to point cloud representations with traditional search algorithms. Experiments in a simulated Gazebo environment show that STPR ensures full compliance across several constraints and scenarios, while having short runtimes. We also verify that STPR can be used with smaller, code-specific LLMs, making it applicable to a wide range of compact models at low inference cost.
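To make the idea of a constraint expressed as code concrete, here is a minimal sketch. It is not the STPR implementation: the function `keep_out_of_disc` is a hand-written stand-in for what an LLM might generate from an instruction such as "do not come within 1.5 cells of (2, 2)", and a small integer grid with breadth-first search stands in for the point cloud and search algorithm mentioned in the abstract.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]

def keep_out_of_disc(p: Point) -> bool:
    """Hypothetical generated constraint: True if p is farther than 1.5 cells from (2, 2)."""
    return (p[0] - 2) ** 2 + (p[1] - 2) ** 2 > 1.5 ** 2

def shortest_path(start: Point, goal: Point, size: int,
                  allowed: Callable[[Point], bool]) -> Optional[List[Point]]:
    """Breadth-first search on a size x size grid over cells that satisfy `allowed`."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:  # walk parent links back to the start
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in parents and allowed(nxt):
                parents[nxt] = cur
                queue.append(nxt)
    return None  # goal unreachable under the constraint

path = shortest_path((0, 0), (4, 4), 5, keep_out_of_disc)
```

Because the constraint is an ordinary predicate, the planner only has to call it on candidate states, and the constraint itself can be inspected or unit-tested like any other function.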

[AI-47] CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

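[Quick Read]: This paper addresses the problem that existing evaluation paradigms measure the mathematical ability of large language models (LLMs) only by overall answer accuracy, which fails to reflect their true, comprehensive ability. The key to the solution is CogMath, which assesses LLMs' mathematical abilities through the lens of human cognition: it formalizes the human reasoning process into three stages (problem comprehension, problem solving, and solution summarization) and designs 9 fine-grained evaluation dimensions. For each dimension, an “Inquiry-Judge-Reference” multi-agent system generates inquiries that test the model's mastery of that dimension.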

Link: https://arxiv.org/abs/2506.04481
Authors: Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs’ mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an “Inquiry-Judge-Reference” multi-agent system to generate inquiries that assess LLMs’ mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath on three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.

[AI-48] Comparative performance of ensemble models in predicting dental provider types: insights from fee-for-service data

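[Quick Read]: This paper addresses how to classify dental providers effectively in order to optimize healthcare resource allocation and policy planning, in particular how to distinguish standard rendering providers from safety net clinic (SNC) providers. The key to the solution is to apply machine learning models to a 2018 dataset of 24,300 instances and 20 features, covering beneficiary and service counts across fee-for-service (FFS), Geographic Managed Care, and Pre-Paid Health Plans, with providers categorized by delivery system and patient age group (0-20 and 21+). Among the algorithms tested, Neural Networks and Random Forest handled imbalanced data and complex feature interactions best and clearly outperformed traditional classifiers, indicating that advanced machine learning techniques, especially ensemble and deep learning models, are valuable for dental workforce classification.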

Link: https://arxiv.org/abs/2506.04479
Authors: Mohammad Subhi Al-Batah, Muhyeeddin Alqaraleh, Mowafaq Salem Alzboon, Abdullah Alourani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dental provider classification plays a crucial role in optimizing healthcare resource allocation and policy planning. Effective categorization of providers, such as standard rendering providers and safety net clinic (SNC) providers, enhances service delivery to underserved populations. This study aimed to evaluate the performance of machine learning models in classifying dental providers using a 2018 dataset. A dataset of 24,300 instances with 20 features was analyzed, including beneficiary and service counts across fee-for-service (FFS), Geographic Managed Care, and Pre-Paid Health Plans. Providers were categorized by delivery system and patient age groups (0-20 and 21+). Despite 38.1% missing data, multiple machine learning algorithms were tested, including k-Nearest Neighbors (kNN), Decision Trees, Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), Random Forest, Neural Networks, and Gradient Boosting. A 10-fold cross-validation approach was applied, and models were evaluated using AUC, classification accuracy (CA), F1-score, precision, and recall. Neural Networks achieved the highest AUC (0.975) and CA (94.1%), followed by Random Forest (AUC: 0.948, CA: 93.0%). These models effectively handled imbalanced data and complex feature interactions, outperforming traditional classifiers like Logistic Regression and SVM. Advanced machine learning techniques, particularly ensemble and deep learning models, significantly enhance dental workforce classification. Their integration into healthcare analytics can improve provider identification and resource distribution, benefiting underserved populations.

[AI-49] Matching Markets Meet LLMs: Algorithmic Reasoning with Ranked Preferences

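[Quick Read]: This paper addresses the logical and algorithmic limitations of large language models (LLMs) when handling ranked preferences and structured algorithms in combinatorial domains. It focuses on matching markets, a core framework behind applications such as resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. Several state-of-the-art models are evaluated on a series of preference-based reasoning tasks, exposing their weaknesses on ranked inputs. The key contribution is identifying why LLMs struggle to resolve instability in large markets and testing parameter-efficient fine-tuning (LoRA) as a remedy; the experiments show limited benefit on large instances, suggesting that more sophisticated strategies are needed to improve LLM reasoning over larger-context inputs.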

Link: https://arxiv.org/abs/2506.04478
Authors: Hadi Hosseini, Samarth Khanna, Ronak Singh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
Comments:

Abstract:The rise of Large Language Models (LLMs) has driven progress in reasoning tasks – from program synthesis to scientific hypothesis generation – yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate several state-of-the-art models on a hierarchy of preference-based reasoning tasks – ranging from stable-matching generation to instability detection, instability resolution, and fine-grained preference queries – to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement on large instances, suggesting the need for more sophisticated strategies to improve LLMs’ reasoning with larger-context inputs.
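For readers unfamiliar with the terms, the sketch below shows the two classical building blocks the tasks revolve around: generating a stable matching with the Gale-Shapley deferred-acceptance algorithm and detecting instability by listing blocking pairs. It assumes a one-to-one market with complete preference lists and equally many agents on each side; the agent names are hypothetical, and this is not the paper's evaluation code.

```python
def gale_shapley(prop_prefs, recv_prefs):
    """Proposer-optimal stable matching; returns {proposer: receiver}."""
    rank = {r: {p: i for i, p in enumerate(prefs)} for r, prefs in recv_prefs.items()}
    next_choice = {p: 0 for p in prop_prefs}  # index of the next receiver to propose to
    engaged = {}                              # receiver -> current proposer
    free = list(prop_prefs)
    while free:
        p = free.pop()
        r = prop_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p
        elif rank[r][p] < rank[r][current]:   # r prefers the new proposer
            engaged[r] = p
            free.append(current)
        else:
            free.append(p)
    return {p: r for r, p in engaged.items()}

def blocking_pairs(matching, prop_prefs, recv_prefs):
    """All (p, r) that strictly prefer each other to their assigned partners."""
    partner_of = {r: p for p, r in matching.items()}
    pairs = []
    for p, prefs in prop_prefs.items():
        for r in prefs:
            if r == matching[p]:
                break  # receivers after this point are ranked below p's match
            if recv_prefs[r].index(p) < recv_prefs[r].index(partner_of[r]):
                pairs.append((p, r))
    return pairs

# Hypothetical two-by-two market.
prop_prefs = {"a": ["x", "y"], "b": ["x", "y"]}
recv_prefs = {"x": ["b", "a"], "y": ["a", "b"]}
stable = gale_shapley(prop_prefs, recv_prefs)
```

A matching is stable exactly when `blocking_pairs` returns an empty list, which gives a cheap programmatic check for any matching a model proposes.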

[AI-50] Classifying Dental Care Providers Through Machine Learning with Features Ranking

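[Quick Read]: This paper addresses how to use machine learning (ML) models to classify dental providers into standard rendering providers and safety net clinic (SNC) providers. The key to the solution is to identify the critical predictors with feature ranking methods (information gain, Gini index, and ANOVA) and to combine them with a range of ML models (such as Neural Networks, Gradient Boosting, and Random Forest), improving classification accuracy and model efficiency, particularly on healthcare datasets with many missing values and class imbalance.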

Link: https://arxiv.org/abs/2506.04474
Authors: Mohammad Subhi Al-Batah, Mowafaq Salem Alzboon, Muhyeeddin Alqaraleh, Mohammed Hasan Abu-Arqoub, Rashiq Rafiq Marie
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study investigates the application of machine learning (ML) models for classifying dental providers into two categories - standard rendering providers and safety net clinic (SNC) providers - using a 2018 dataset of 24,300 instances with 20 features. The dataset, characterized by high missing values (38.1%), includes service counts (preventive, treatment, exams), delivery systems (FFS, managed care), and beneficiary demographics. Feature ranking methods such as information gain, Gini index, and ANOVA were employed to identify critical predictors, revealing treatment-related metrics (TXMT_USER_CNT, TXMT_SVC_CNT) as top-ranked features. Twelve ML models, including k-Nearest Neighbors (kNN), Decision Trees, Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), Random Forest, Neural Networks, and Gradient Boosting, were evaluated using 10-fold cross-validation. Classification accuracy was tested across incremental feature subsets derived from rankings. The Neural Network achieved the highest accuracy (94.1%) using all 20 features, followed by Gradient Boosting (93.2%) and Random Forest (93.0%). Models showed improved performance as more features were incorporated, with SGD and ensemble methods demonstrating robustness to missing data. Feature ranking highlighted the dominance of treatment service counts and annotation codes in distinguishing provider types, while demographic variables (AGE_GROUP, CALENDAR_YEAR) had minimal impact. The study underscores the importance of feature selection in enhancing model efficiency and accuracy, particularly in imbalanced healthcare datasets. These findings advocate for integrating feature-ranking techniques with advanced ML algorithms to optimize dental provider classification, enabling targeted resource allocation for underserved populations.
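Two of the ranking criteria named above, information gain and the Gini index, can be sketched in a few lines. The example assumes discrete feature values (continuous counts would need binning first) and uses made-up toy columns rather than the study's data; it illustrates the criteria and is not the authors' pipeline.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(feature, labels, impurity):
    """Impurity reduction obtained by splitting `labels` on the values of `feature`."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted

# Toy data: one perfectly informative feature and one uninformative feature.
labels = ["SNC", "SNC", "STD", "STD"]
features = {"informative": [1, 1, 0, 0], "uninformative": [0, 1, 0, 1]}
ranking = sorted(features, key=lambda name: gain(features[name], labels, entropy), reverse=True)
```

Passing `gini` instead of `entropy` as the impurity function gives the Gini-based ranking with the same code.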

[AI-51] An Expansion-Based Approach for Quantified Integer Programming

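[Quick Read]: This paper aims to solve multistage robust discrete linear optimization problems by introducing an expansion-based solution method built on Counterexample-Guided Abstraction Refinement (CEGAR), extending the solving techniques available for Quantified Integer Programming (QIP) and Quantified Boolean Formulas (QBF). The key to the solution is adapting the expansion-based approach from QBF to QIP and embedding it in an optimization framework, so that complex decision problems under linear constraints can be handled effectively. Experiments show that the method outperforms existing search-based QIP solvers on specific instances and yields notable performance gains over state-of-the-art expansion-based QBF solvers.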

Link: https://arxiv.org/abs/2506.04452
Authors: Michael Hartisch, Leroy Chew
Affiliations: Unknown
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:Quantified Integer Programming (QIP) bridges multiple domains by extending Quantified Boolean Formulas (QBF) to incorporate general integer variables and linear constraints while also generalizing Integer Programming through variable quantification. As a special case of Quantified Constraint Satisfaction Problems (QCSP), QIP provides a versatile framework for addressing complex decision-making scenarios. Additionally, the inclusion of a linear objective function enables QIP to effectively model multistage robust discrete linear optimization problems, making it a powerful tool for tackling uncertainty in optimization. While two primary solution paradigms exist for QBF – search-based and expansion-based approaches – only search-based methods have been explored for QIP and QCSP. We introduce an expansion-based approach for QIP using Counterexample-Guided Abstraction Refinement (CEGAR), adapting techniques from QBF. We extend this methodology to tackle multistage robust discrete optimization problems with linear constraints and further embed it in an optimization framework, enhancing its applicability. Our experimental results highlight the advantages of this approach, demonstrating superior performance over existing search-based solvers for QIP in specific instances. Furthermore, the ability to model problems using linear constraints enables notable performance gains over state-of-the-art expansion-based solvers for QBF.
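The CEGAR idea can be illustrated on a toy two-stage problem of the form "find x such that the constraint holds for every y". The sketch below is not the authors' solver: it assumes small finite integer domains, and brute-force enumeration stands in for the integer programming calls a real implementation would make. It keeps a partial expansion of the universal variable and only adds a scenario when that scenario refutes the current candidate.

```python
def cegar_exists_forall(x_domain, y_domain, constraint):
    """Find x with constraint(x, y) true for all y, expanding universal scenarios lazily.

    Returns (x or None, list of scenarios that had to be expanded).
    """
    scenarios = []  # universal assignments included in the abstraction so far
    while True:
        # Abstraction: the candidate only has to satisfy the scenarios seen so far.
        candidate = next(
            (x for x in x_domain if all(constraint(x, y) for y in scenarios)), None
        )
        if candidate is None:
            return None, scenarios  # the partial expansion is already infeasible
        # Verification: search for a universal assignment that defeats the candidate.
        counterexample = next(
            (y for y in y_domain if not constraint(candidate, y)), None
        )
        if counterexample is None:
            return candidate, scenarios
        scenarios.append(counterexample)  # refinement step

# Toy instance: exists x in {0..4}, for all y in {0..2}: x + y <= 4 and 2x >= y + 1.
solution, expanded = cegar_exists_forall(
    range(5), range(3), lambda x, y: x + y <= 4 and 2 * x >= y + 1
)
```

Each counterexample is new, because the candidate already satisfies every stored scenario, so the loop terminates on finite domains; the benefit is that the full expansion over all universal assignments is often never built.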

[AI-52] Learning to Diagnose Privately: DP-Powered LLMs for Radiology Report Classification

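[Quick Read]: This paper aims to preserve classification performance while protecting patient privacy in multi-abnormality classification of radiology report text. The key to the solution is fine-tuning large language models (LLMs) with differential privacy (DP): calibrated noise is injected during fine-tuning to mitigate the privacy risks associated with sensitive patient data and to protect against data leakage, while keeping classification performance as high as possible.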

Link: https://arxiv.org/abs/2506.04450
Authors: Payel Bhattacharjee, Fengwei Tian, Ravi Tandon, Joseph Lo, Heidi Hanson, Geoffrey Rubin, Nirav Merchant, John Gounley
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures, 2 tables

Abstract:Purpose: This study proposes a framework for fine-tuning large language models (LLMs) with differential privacy (DP) to perform multi-abnormality classification on radiology report text. By injecting calibrated noise during fine-tuning, the framework seeks to mitigate the privacy risks associated with sensitive patient data and protect against data leakage while maintaining classification performance. Materials and Methods: We used 50,232 radiology reports from the publicly available MIMIC-CXR chest radiography and CT-RATE computed tomography datasets, collected between 2011 and 2019. Fine-tuning of LLMs was conducted to classify 14 labels from the MIMIC-CXR dataset and 18 labels from the CT-RATE dataset using Differentially Private Low-Rank Adaptation (DP-LoRA) in high and moderate privacy regimes (across a range of privacy budgets $\epsilon$ = 0.01, 0.1, 1.0, 10.0). Model performance was evaluated using weighted F1 score across three model architectures: BERT-medium, BERT-small, and ALBERT-base. Statistical analyses compared model performance across different privacy levels to quantify the privacy-utility trade-off. Results: We observe a clear privacy-utility trade-off through our experiments on 2 different datasets and 3 different models. Under moderate privacy guarantees the DP fine-tuned models achieved comparable weighted F1 scores of 0.88 on MIMIC-CXR and 0.59 on CT-RATE, compared to non-private LoRA baselines of 0.90 and 0.78, respectively. Conclusion: Differentially private fine-tuning using LoRA enables effective and privacy-preserving multi-abnormality classification from radiology reports, addressing a key challenge in fine-tuning LLMs on sensitive medical data.
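The "calibrated noise" step can be sketched as follows, assuming the usual DP-SGD recipe of per-example gradient clipping followed by Gaussian noise, applied to whichever parameters are trainable (for LoRA, the adapter weights). The abstract does not spell out the DP-LoRA details, and the mapping from a privacy budget to `noise_multiplier` requires a privacy accountant that is not shown; all names and values here are illustrative.

```python
import math
import random

def dp_average_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to `clip_norm`, add Gaussian noise to the sum, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    batch_size = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]

# Hypothetical gradients for a batch of two examples.
grads = [[3.0, 4.0], [0.3, 0.4]]
noisy = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=random.Random(0))
```

Clipping bounds how much any single report can move the update, which is what allows the added noise to be calibrated to a privacy guarantee; a larger `noise_multiplier` gives stronger privacy at the cost of utility.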

[AI-53] Grokking and Generalization Collapse: Insights from HTSR theory

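[Quick Read]: This paper investigates the mechanism of the grokking phenomenon in neural networks and its relationship to overfitting and generalization collapse, in particular the sudden drop in test accuracy late in training. The key to the solution is the layer quality metric α from Heavy-Tailed Self-Regularization (HTSR) theory, which distinguishes three training phases: pre-grokking, grokking, and the newly identified anti-grokking. In anti-grokking, test accuracy collapses while training accuracy stays perfect, and the phase can be detected through the value of α and the appearance of Correlation Traps.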

Link: https://arxiv.org/abs/2506.04434
Authors: Hari K. Prakash, Charles H. Martin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 7 figs

Abstract:We study the well-known grokking phenomena in neural networks (NNs) using a 3-layer MLP trained on a 1k-sample subset of MNIST, with and without weight decay, and discover a novel third phase – anti-grokking – that occurs very late in training and resembles but is distinct from the familiar pre-grokking phases: test accuracy collapses while training accuracy stays perfect. This late-stage collapse is distinct from the known pre-grokking and grokking phases, and is not detected by other proposed grokking progress measures. Leveraging Heavy-Tailed Self-Regularization (HTSR) through the open-source WeightWatcher tool, we show that the HTSR layer quality metric $\alpha$ alone delineates all three phases, whereas the best competing metrics detect only the first two. The anti-grokking is revealed by training for $10^7$ and is invariably heralded by $\alpha < 2$ and the appearance of Correlation Traps – outlier singular values in the randomized layer weight matrices that make the layer weight matrix atypical and signal overfitting of the training set. Such traps are verified by visual inspection of the layer-wise empirical spectral densities, and by using Kolmogorov–Smirnov tests on randomized spectra. Comparative metrics, including activation sparsity, absolute weight entropy, circuit complexity, and $\ell^2$ weight norms track pre-grokking and grokking but fail to distinguish grokking from anti-grokking. This discovery provides a way to measure overfitting and generalization collapse without direct access to the test data. These results strengthen the claim that the HTSR $\alpha$ provides a universal layer-convergence target at $\alpha \approx 2$ and underscore the value of using the HTSR alpha ($\alpha$) metric as a measure of generalization.
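To give a rough sense of what a power-law exponent fitted to a layer's spectrum looks like in code, the sketch below applies the standard continuous power-law maximum-likelihood estimator to the tail of a list of eigenvalues. It assumes the eigenvalues of the layer correlation matrix have already been computed elsewhere and that the tail cutoff `xmin` is given; WeightWatcher's actual fitting procedure, including how it selects the cutoff, is more involved and is not reproduced here.

```python
import math

def power_law_alpha(eigenvalues, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin)) over the tail x >= xmin."""
    tail = [x for x in eigenvalues if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one eigenvalue strictly above xmin")
    return 1.0 + len(tail) / log_sum

# Synthetic tail following an exact power law with exponent 3, built from
# evenly spaced quantiles of the inverse CDF (illustrative data, not a real layer).
n, true_alpha, xmin = 1000, 3.0, 1.0
synthetic = [
    xmin * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0)) for i in range(n)
]
alpha_hat = power_law_alpha(synthetic, xmin)  # should land close to 3
```

Under the reading in the abstract, a fitted exponent drifting below 2 for a layer would be the warning sign associated with anti-grokking.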

[AI-54] An AI-Based Public Health Data Monitoring System

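[Quick Read]: This paper addresses the challenge of monitoring large volumes of public health data (e.g., cases, hospitalizations, deaths). Traditional alert-based monitoring systems are of limited use because alerting thresholds need constant resetting and the data volumes can cause application lag. The key to the solution is a ranking-based monitoring paradigm that leverages new AI anomaly detection methods, which improves monitoring efficiency.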

Link: https://arxiv.org/abs/2506.04429
Authors: Ananya Joshi, Nolan Gormley, Richa Gadgil, Tina Townes, Roni Rosenfeld, Bryan Wilder
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Public health experts need scalable approaches to monitor large volumes of health data (e.g., cases, hospitalizations, deaths) for outbreaks or data quality issues. Traditional alert-based monitoring systems struggle with modern public health data monitoring systems for several reasons, including that alerting thresholds need to be constantly reset and the data volumes may cause application lag. Instead, we propose a ranking-based monitoring paradigm that leverages new AI anomaly detection methods. Through a multi-year interdisciplinary collaboration, the resulting system has been deployed at a national organization to monitor up to 5,000,000 data points daily. A three-month longitudinal deployed evaluation revealed a significant improvement in monitoring objectives, with a 54x increase in reviewer speed efficiency compared to traditional alert-based methods. This work highlights the potential of human-centered AI to transform public health decision-making.

[AI-55] Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance EMNLP2025

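[Quick Read]: This paper addresses the challenge that unreliable schema linking across complex tables poses for multi-table question answering (multi-table QA). Existing methods based on semantic similarity work well on simplified hand-crafted datasets but struggle in real-world scenarios with numerous and diverse columns. The key to the solution is a graph-based framework that uses human-curated relational knowledge to explicitly encode schema links and join paths, and that applies pruning and sub-path merging strategies to improve efficiency and coherence when constructing interpretable reasoning chains.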

Link: https://arxiv.org/abs/2506.04427
Authors: Xixi Wang, Miguel Costa, Jordanka Kovaceva, Shuai Wang, Francisco C. Pereira
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Submitted to EMNLP 2025

Abstract:Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods based on semantic similarity work well only on simplified hand-crafted datasets and struggle to handle complex, real-world scenarios with numerous and diverse columns. To address this, we propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Given a natural language query, our method searches this graph to construct interpretable reasoning chains, aided by pruning and sub-path merging strategies to enhance efficiency and coherence. Experiments on both standard benchmarks and a realistic, large-scale dataset demonstrate the effectiveness of our approach. To our knowledge, this is the first multi-table QA system applied to truly complex industrial tabular data.
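The notion of searching a schema graph for a join path can be sketched with a plain breadth-first search. The tables, columns, and join conditions below are hypothetical, and the paper's pruning and sub-path merging strategies are not modeled; the sketch only shows how explicitly encoded links yield an interpretable chain of joins.

```python
from collections import deque

# Hypothetical human-curated schema graph: table -> {neighbor table: join condition}.
SCHEMA = {
    "customers": {"orders": "orders.customer_id = customers.id"},
    "orders": {
        "customers": "orders.customer_id = customers.id",
        "items": "orders.id = items.order_id",
    },
    "items": {
        "orders": "orders.id = items.order_id",
        "products": "items.product_id = products.id",
    },
    "products": {"items": "items.product_id = products.id"},
}

def join_path(graph, source, target):
    """Return the shortest list of join conditions linking source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        table = queue.popleft()
        if table == target:
            break
        for neighbor in graph[table]:
            if neighbor not in parents:
                parents[neighbor] = table
                queue.append(neighbor)
    if target not in parents:
        return None
    joins = []
    node = target
    while parents[node] is not None:  # walk back from target to source
        joins.append(graph[parents[node]][node])
        node = parents[node]
    return joins[::-1]

chain = join_path(SCHEMA, "customers", "products")
```

Each returned condition corresponds to one curated edge, so the resulting chain can be shown to a user or turned into SQL joins without relying on similarity-based guesses about which columns match.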

[AI-56] Unsupervised Meta-Testing with Conditional Neural Processes for Hybrid Meta-Reinforcement Learning

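[Quick Read]: This paper aims to improve sample efficiency and enable few-shot meta-reinforcement learning (meta-RL) when the reward signal is missing during meta-testing. The key to the solution is a hybrid method called Unsupervised Meta-Testing with Conditional Neural Processes (UMCNP), which combines parameterized policy gradient-based (PPG) and task inference-based meta-RL while keeping the two separate. UMCNP uses the efficiency and scalability of Conditional Neural Processes (CNPs) to reduce the number of online interactions needed in meta-testing, and during meta-training it reuses, offline, the samples previously collected through PPG meta-RL to learn task inference. UMCNP also infers a latent representation of the transition dynamics model from a single rollout of a test task with unknown parameters, then generates rollouts for self-adaptation by interacting with the learned dynamics model.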

Link: https://arxiv.org/abs/2506.04399
Authors: Suzan Ece Ada, Emre Ugur
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Published in IEEE Robotics and Automation Letters, Volume 9, Issue 10, pp. 8427-8434, October 2024. 8 pages, 7 figures

Abstract:We introduce Unsupervised Meta-Testing with Conditional Neural Processes (UMCNP), a novel hybrid few-shot meta-reinforcement learning (meta-RL) method that uniquely combines, yet distinctly separates, parameterized policy gradient-based (PPG) and task inference-based few-shot meta-RL. Tailored for settings where the reward signal is missing during meta-testing, our method increases sample efficiency without requiring additional samples in meta-training. UMCNP leverages the efficiency and scalability of Conditional Neural Processes (CNPs) to reduce the number of online interactions required in meta-testing. During meta-training, samples previously collected through PPG meta-RL are efficiently reused for learning task inference in an offline manner. UMCNP infers the latent representation of the transition dynamics model from a single test task rollout with unknown parameters. This approach allows us to generate rollouts for self-adaptation by interacting with the learned dynamics model. We demonstrate our method can adapt to an unseen test task using significantly fewer samples during meta-testing than the baselines in 2D-Point Agent and continuous control meta-RL benchmarks, namely, cartpole with unknown angle sensor bias, walker agent with randomized dynamics parameters.

[AI-57] Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning With Iterated Q-Learning

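[Quick Read]: This paper addresses the instability and performance drop caused by removing the target network in value-based reinforcement learning, as well as the resulting inability to draw on the literature developed around target networks. The key to the solution is to use a copy of the last linear layer of the online network as the target network while sharing the remaining parameters with the online network, which sidesteps the binary choice between target-based and target-free methods. This makes it possible to apply iterated Q-learning, which learns consecutive Bellman iterations in parallel, to narrow the performance gap between target-free and target-based approaches.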

Link: https://arxiv.org/abs/2506.04398
Authors: Théo Vincent, Yogesh Tripathi, Tim Faust, Yaniv Oren, Jan Peters, Carlo D’Eramo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In value-based reinforcement learning, removing the target network is tempting as the bootstrapped target would be built from up-to-date estimates, and the spared memory occupied by the target network could be reallocated to expand the capacity of the online network. However, eliminating the target network introduces instability, leading to a decline in performance. Removing the target network also means we cannot leverage the literature developed around target networks. In this work, we propose to use a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network, hence stepping out of the binary choice between target-based and target-free methods. It enables us to leverage the concept of iterated Q-learning, which consists of learning consecutive Bellman iterations in parallel, to reduce the performance gap between target-free and target-based approaches. Our findings demonstrate that this novel method, termed iterated Shared Q-Learning (iS-QL), improves the sample efficiency of target-free approaches across various settings. Importantly, iS-QL requires a smaller memory footprint and comparable training time to classical target-based algorithms, highlighting its potential to scale reinforcement learning research.

[AI-58] Through the Stealth Lens: Rethinking Attacks and Defenses in RAG

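[Quick Read]: This paper addresses the security of retrieval-augmented generation (RAG) systems against attacks that inject poisoned passages, observing that existing attacks are not stealthy and can therefore be detected and mitigated. The key to the solution is to analyze intermediate signals of the generation process, such as attention patterns or next-token probability distributions, in which manipulation leaves detectable traces. The authors propose a passage-level score, the Normalized Passage Attention Score, to identify and filter potentially poisoned passages, which mitigates existing attacks and improves accuracy.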

Link: https://arxiv.org/abs/2506.04390
Authors: Sarthak Choudhary, Nils Palumbo, Ashish Hooda, Krishnamurthy Dj Dvijotham, Somesh Jha
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved set, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize stealth using a distinguishability-based security game. If a few poisoned passages are designed to control the response, they must differentiate themselves from benign ones, inherently compromising stealth. This motivates the need for attackers to rigorously analyze intermediate signals involved in generation, such as attention patterns or next-token probability distributions, to avoid easily detectable traces of manipulation. Leveraging attention patterns, we propose a passage-level score, the Normalized Passage Attention Score, used by our Attention-Variance Filter algorithm to identify and filter potentially poisoned passages. This method mitigates existing attacks, improving accuracy by up to ~20% over baseline defenses. To probe the limits of attention-based defenses, we craft stealthier adaptive attacks that obscure such traces, achieving up to 35% attack success rate, and highlight the challenges in improving stealth.

[AI-59] Domain Adaptation Method and Modality Gap Impact in Audio-Text Models for Prototypical Sound Classification INTERSPEECH2025

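[Quick Read]: This paper aims to address the severe performance drop of audio-text models in zero-shot environmental sound classification when background sound sources are present. The key to the solution is a novel method that quantifies the contribution of background sources and integrates it into the classification process, improving performance without model retraining; a domain adaptation technique further improves accuracy across different backgrounds and signal-to-noise ratio (SNR) conditions.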

Link: https://arxiv.org/abs/2506.04376
Authors: Emiliano Acevedo, Martín Rocamora, Magdalena Fuentes
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at INTERSPEECH 2025

Abstract:Audio-text models are widely used in zero-shot environmental sound classification as they alleviate the need for annotated data. However, we show that their performance severely drops in the presence of background sound sources. Our analysis reveals that this degradation is primarily driven by SNR levels of background soundscapes, and independent of background type. To address this, we propose a novel method that quantifies and integrates the contribution of background sources into the classification process, improving performance without requiring model retraining. Our domain adaptation technique enhances accuracy across various backgrounds and SNR conditions. Moreover, we analyze the modality gap between audio and text embeddings, showing that narrowing this gap improves classification performance. The method generalizes effectively across state-of-the-art prototypical approaches, showcasing its scalability and robustness for diverse environments.

[AI-60] A Statistical Physics of Language Model Reasoning

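[Quick read]: This paper addresses the difficulty of understanding reasoning mechanistically, specifically the emergent reasoning exhibited by Transformer language models (Transformer LMs). The key idea is a statistical physics framework that models sentence-level hidden-state trajectories as a stochastic dynamical system on a low-dimensional manifold, using a drift-diffusion system with latent regime switching to capture diverse reasoning phases, including misaligned states and failures. A rank-40 projection explains about 50% of the variance, and a switching linear dynamical system (SLDS) model is formulated and validated, enabling low-cost reasoning simulation and providing tools to study and predict critical transitions such as misaligned states or other LM failures.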

Link: https://arxiv.org/abs/2506.04374
Authors: Jack David Carson, Amir Reisizadeh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformer LMs show emergent reasoning that resists mechanistic understanding. We offer a statistical physics framework for continuous-time chain-of-thought reasoning dynamics. We model sentence-level hidden state trajectories as a stochastic dynamical system on a lower-dimensional manifold. This drift-diffusion system uses latent regime switching to capture diverse reasoning phases, including misaligned states or failures. Empirical trajectories (8 models, 7 benchmarks) show a rank-40 projection (balancing variance capture and feasibility) explains ~50% variance. We find four latent reasoning regimes. An SLDS model is formulated and validated to capture these features. The framework enables low-cost reasoning simulation, offering tools to study and predict critical transitions like misaligned states or other LM failures.
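To make "drift-diffusion with latent regime switching" concrete, here is a minimal sketch of a discrete-time, one-dimensional switching linear dynamical system. The two regimes, the transition matrix, and all parameter values are made up for illustration. The paper works with a rank-40 projection and reports four regimes, neither of which this toy reproduces.

```python
import random


def simulate_slds(steps, transition, drift, decay, noise, seed=0):
    """Simulate a 1-D switching linear dynamical system.

    The regime z_t follows a Markov chain whose rows are `transition`.
    The state evolves as
        x_t = decay[z_t] * x_{t-1} + drift[z_t] + noise[z_t] * eps_t,
    with eps_t drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    z, x = 0, 0.0
    regimes, states = [], []
    for _ in range(steps):
        # Sample the next regime from the current regime's transition row.
        z = rng.choices(range(len(transition)), weights=transition[z])[0]
        # Apply that regime's linear drift plus Gaussian diffusion.
        x = decay[z] * x + drift[z] + noise[z] * rng.gauss(0.0, 1.0)
        regimes.append(z)
        states.append(x)
    return regimes, states


regimes, states = simulate_slds(
    steps=200,
    transition=[[0.95, 0.05], [0.10, 0.90]],  # assumed values
    drift=[0.1, -0.3],
    decay=[0.9, 0.8],
    noise=[0.05, 0.2],
)
```

Each regime has its own linear dynamics and noise level, so a trajectory alternates between a quiet phase and a noisier one whenever the hidden Markov chain switches.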

[AI-61] cuVSLAM: CUDA accelerated visual odometry

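[Quick read]: This paper addresses accurate and robust pose estimation for autonomous robots, a core challenge in visual simultaneous localization and mapping (VSLAM). The key contribution is cuVSLAM, a CUDA-optimized visual-inertial SLAM system that supports a variety of visual-inertial sensor configurations and runs in real time on edge-computing devices with minimal computational overhead.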

Link: https://arxiv.org/abs/2506.04359
Authors: Alexander Korovko, Dmitry Slepichev, Alexander Efitorov, Aigul Dzhumamuratova, Viktor Kuznetsov, Hesam Rabeti, Joydeep Biswas
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.

[AI-62] AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents

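[Quick read]: This paper addresses the limited interpretability, vulnerability to label leakage, and high development cost of models for clinical trial prediction. The key contribution is the AutoCT framework, which combines the reasoning capabilities of large language models with the explainability of classical machine learning: it autonomously generates, evaluates, and refines tabular features and uses Monte Carlo Tree Search to iteratively optimize predictive performance, yielding efficient, interpretable, and cost-controlled clinical trial prediction.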

Link: https://arxiv.org/abs/2506.04293
Authors: Fengze Liu, Haoyu Wang, Joonhyuk Cho, Dan Roth, Andrew W. Lo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Clinical trials are critical for advancing medical treatments but remain prohibitively expensive and time-consuming. Accurate prediction of clinical trial outcomes can significantly reduce research and development costs and accelerate drug discovery. While recent deep learning models have shown promise by leveraging unstructured data, their black-box nature, lack of interpretability, and vulnerability to label leakage limit their practical use in high-stakes biomedical contexts. In this work, we propose AutoCT, a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. AutoCT autonomously generates, evaluates, and refines tabular features based on public information without human input. Our method uses Monte Carlo Tree Search to iteratively optimize predictive performance. Experimental results show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations, establishing a new paradigm for scalable, interpretable, and cost-efficient clinical trial prediction.

[AI-63] Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback

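[Quick read]: This paper addresses skill acquisition for large language model (LLM) agents in an environment, in particular how to generate training data that carries real learning value. Existing approaches suffer from time-consuming manual trajectory collection, infeasible tasks when LLMs propose them directly, and generated data that lacks a meaningful learning signal. The key contribution is EXIF, an automatic skill discovery framework built on an exploration-first strategy: an exploration agent (Alice) interacts with the environment to generate a feasible skill dataset, which is used to train the target agent (Bob). An iterative feedback mechanism lets Alice refine subsequent exploration based on Bob's performance, forming a closed-loop data generation process that improves the agent without human intervention.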

Link: https://arxiv.org/abs/2506.04287
Authors: Yongjin Yang, Sinjae Kang, Juyong Lee, Dongjun Lee, Se-Young Yun, Kimin Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint, under review

Abstract:Training large language model (LLM) agents to acquire necessary skills and perform diverse tasks within an environment is gaining interest as a means to enable open-endedness. However, creating the training dataset for their skill acquisition faces several challenges. Manual trajectory collection requires significant human effort. Another approach, where LLMs directly propose tasks to learn, is often invalid, as the LLMs lack knowledge of which tasks are actually feasible. Moreover, the generated data may not provide a meaningful learning signal, as agents often already perform well on the proposed tasks. To address this, we propose a novel automatic skill discovery framework EXIF for LLM-powered agents, designed to improve the feasibility of generated target behaviors while accounting for the agents’ capabilities. Our method adopts an exploration-first strategy by employing an exploration agent (Alice) to train the target agent (Bob) to learn essential skills in the environment. Specifically, Alice first interacts with the environment to retrospectively generate a feasible, environment-grounded skill dataset, which is then used to train Bob. Crucially, we incorporate an iterative feedback loop, where Alice evaluates Bob’s performance to identify areas for improvement. This feedback then guides Alice’s next round of exploration, forming a closed-loop data generation process. Experiments on Webshop and Crafter demonstrate EXIF’s ability to effectively discover meaningful skills and iteratively expand the capabilities of the trained agent without any human intervention, achieving substantial performance improvements. Interestingly, we observe that setting Alice to the same model as Bob also notably improves performance, demonstrating EXIF’s potential for building a self-evolving system.

[AI-64] Autonomous Collaborative Scheduling of Time-dependent UAVs, Workers and Vehicles for Crowdsensing in Disaster Response

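[Quick read]: This paper addresses the inefficiency of collecting environmental information after natural disasters, where existing sensing technologies suffer from weak environmental adaptability, insufficient specialized sensing capabilities, and limited practicality. The core contribution is HoAs-PALN, a heterogeneous multi-agent online autonomous collaborative scheduling algorithm that uses adaptive dimensionality reduction in the matching process and a local Nash equilibrium game to achieve autonomous collaboration among time-dependent UAVs, workers, and vehicles, improving sensing scheduling efficiency. The key steps are transforming the five-dimensional matching process into two categories of three-dimensional matching processes to reduce scheduling decision time, using the softmax function to optimize behavior selection probabilities, and introducing a local Nash equilibrium determination mechanism to ensure scheduling performance.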

Link: https://arxiv.org/abs/2506.04276
Authors: Lei Han, Yitong Guo, Pengfei Yang, Zhiyong Yu, Liang Wang, Quan Wang, Zhiwen Yu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural disasters have caused significant losses to human society, and the timely and efficient acquisition of post-disaster environmental information is crucial for the effective implementation of rescue operations. Due to the complexity of post-disaster environments, existing sensing technologies face challenges such as weak environmental adaptability, insufficient specialized sensing capabilities, and limited practicality of sensing solutions. This paper explores the heterogeneous multi-agent online autonomous collaborative scheduling algorithm HoAs-PALN, aimed at achieving efficient collection of post-disaster environmental information. HoAs-PALN is realized through adaptive dimensionality reduction in the matching process and local Nash equilibrium game, facilitating autonomous collaboration among time-dependent UAVs, workers and vehicles to enhance sensing scheduling. (1) In terms of adaptive dimensionality reduction during the matching process, HoAs-PALN significantly reduces scheduling decision time by transforming a five-dimensional matching process into two categories of three-dimensional matching processes; (2) Regarding the local Nash equilibrium game, HoAs-PALN combines the softmax function to optimize behavior selection probabilities and introduces a local Nash equilibrium determination mechanism to ensure scheduling decision performance. Finally, we conducted detailed experiments based on extensive real-world and simulated data. Compared with the baselines (GREEDY, K-WTA, MADL and MARL), HoAs-PALN improves task completion rates by 64.12%, 46.48%, 16.55%, and 14.03% on average, respectively, while each online scheduling decision takes less than 10 seconds, demonstrating its effectiveness in dynamic post-disaster environments.
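The abstract mentions using the softmax function to turn behavior preferences into selection probabilities. A minimal sketch of that single step follows. The utility values and the `temperature` parameter are assumptions for illustration, and the rest of HoAs-PALN (matching and the local Nash equilibrium check) is not shown.

```python
import math


def softmax(utilities, temperature=1.0):
    """Convert behavior utilities into selection probabilities.

    Subtracting the maximum before exponentiating keeps the computation
    numerically stable without changing the result.
    """
    highest = max(utilities)
    exps = [math.exp((u - highest) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical utilities of three candidate behaviors for one agent.
probs = softmax([2.0, 1.0, 0.5])
```

Higher-utility behaviors receive higher probability, while lower-utility ones keep a nonzero chance of being selected. Lowering `temperature` makes the choice greedier.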

[AI-65] MUC-G4: Minimal Unsat Core-Guided Incremental Verification for Deep Neural Network Compression

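[Quick read]: This paper addresses the challenges of deploying deep neural networks on edge devices caused by high memory and runtime complexity, and in particular the difficulty existing incremental verification methods have with structural changes. The key contribution is MUC-G4 (Minimal Unsat Core-Guided Incremental Verification), a framework that encodes both the original and the compressed network as SMT formulas and uses Minimal Unsat Cores (MUCs) from the original network to guide efficient verification of the compressed network, covering compression techniques such as quantization and pruning.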

Link: https://arxiv.org/abs/2506.04268
Authors: Jingyang Li, Guoqiang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: The rapid development of deep learning has led to challenges in deploying neural networks on edge devices, mainly due to their high memory and runtime complexity. Network compression techniques, such as quantization and pruning, aim to reduce this complexity while maintaining accuracy. However, existing incremental verification methods often focus only on quantization and struggle with structural changes. This paper presents MUC-G4 (Minimal Unsat Core-Guided Incremental Verification), a novel framework for incremental verification of compressed deep neural networks. It encodes both the original and compressed networks into SMT formulas, classifies changes, and uses Minimal Unsat Cores (MUCs) from the original network to guide efficient verification for the compressed network. Experimental results show its effectiveness in handling quantization and pruning, with high proof reuse rates and significant speedup in verification time compared to traditional methods. MUC-G4 hence offers a promising solution for ensuring the safety and reliability of compressed neural networks in practical applications.

[AI-66] CORA: Coalitional Rational Advantage Decomposition for Multi-Agent Policy Gradients

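[Quick read]: This paper addresses the credit assignment problem in cooperative multi-agent reinforcement learning (MARL). Sharing a global advantage among agents often fails to reflect each agent's distinct contribution, leading to suboptimal policy updates. The paper analyzes the over-updating problem in multi-agent policy updates at the coalition level and proposes a credit assignment method called Coalitional Rational Advantage Decomposition (CORA). CORA evaluates coalitional advantages via marginal contributions from all possible coalitions and decomposes advantages using the core solution from cooperative game theory, which ensures coalitional rationality, while random coalition sampling reduces computational overhead.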

Link: https://arxiv.org/abs/2506.04265
Authors: Mengda Ji, Genjiu Xu, Liying Wang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Abstract:This work focuses on the credit assignment problem in cooperative multi-agent reinforcement learning (MARL). Sharing the global advantage among agents often leads to suboptimal policy updates as it fails to account for the distinct contributions of agents. Although numerous methods consider global or individual contributions for credit assignment, a detailed analysis at the coalition level remains lacking in many approaches. This work analyzes the over-updating problem during multi-agent policy updates from a coalition-level perspective. To address this issue, we propose a credit assignment method called Coalitional Rational Advantage Decomposition (CORA). CORA evaluates coalitional advantages via marginal contributions from all possible coalitions and decomposes advantages using the core solution from cooperative game theory, ensuring coalitional rationality. To reduce computational overhead, CORA employs random coalition sampling. Experiments on matrix games, differential games, and multi-agent collaboration benchmarks demonstrate that CORA outperforms strong baselines, particularly in tasks with multiple local optima. These findings highlight the importance of coalition-aware credit assignment for improving MARL performance.
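"Coalitional rationality" is the defining property of the core in cooperative game theory: no coalition is allocated less than it could obtain on its own. The sketch below checks that property on a made-up three-agent game. The game values and the function name `in_core` are illustrative assumptions, and this is not CORA's advantage decomposition itself.

```python
from itertools import combinations


def in_core(allocation, value, tol=1e-9):
    """Check whether `allocation` lies in the core of a cooperative game.

    allocation: dict mapping agent -> credit.
    value: function mapping a frozenset of agents -> coalition worth.
    """
    agents = sorted(allocation)
    # Efficiency: credits must sum to the grand coalition's worth.
    if abs(sum(allocation.values()) - value(frozenset(agents))) > tol:
        return False
    # Coalitional rationality: every proper coalition gets at least its worth.
    for size in range(1, len(agents)):
        for coalition in combinations(agents, size):
            credit = sum(allocation[a] for a in coalition)
            if credit < value(frozenset(coalition)) - tol:
                return False
    return True


# Toy game (assumed values): agents a and b create the value, c adds nothing.
worth = {
    frozenset({"a"}): 1.0, frozenset({"b"}): 1.0, frozenset({"c"}): 0.0,
    frozenset({"a", "b"}): 4.0, frozenset({"a", "c"}): 1.0,
    frozenset({"b", "c"}): 1.0, frozenset({"a", "b", "c"}): 4.0,
}
equal_split = {"a": 4 / 3, "b": 4 / 3, "c": 4 / 3}
core_split = {"a": 2.0, "b": 2.0, "c": 0.0}
equal_ok = in_core(equal_split, worth.__getitem__)
core_ok = in_core(core_split, worth.__getitem__)
```

Here the equal split fails the check because agents a and b together receive less than the 4.0 they generate as a pair. This mirrors the abstract's point that sharing one global signal equally can misrepresent individual contributions.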

[AI-67] Localized Forest Fire Risk Prediction: A Department-Aware Approach for Operational Decision Support ECAI2025

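[Quick read]: This paper addresses the shortcomings of traditional forest fire prediction, which reduces fire ignition to a binary classification task and therefore neglects regional differences and operational needs. The key contribution is a fire risk assessment approach tailored to the context of each department (the unit by which firefighting is organized in France), producing region-specific predictions that better match local conditions and are more actionable.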

Link: https://arxiv.org/abs/2506.04254
Authors: Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, 3 tables, submitted to ECAI2025

Abstract:Forest fire prediction involves estimating the likelihood of fire ignition or related risk levels in a specific area over a defined time period. With climate change intensifying fire behavior and frequency, accurate prediction has become one of the most pressing challenges in Artificial Intelligence (AI). Traditionally, fire ignition is approached as a binary classification task in the literature. However, this formulation oversimplifies the problem, especially from the perspective of end-users such as firefighters. In general, as is the case in France, firefighting units are organized by department, each with its terrain, climate conditions, and historical experience with fire events. Consequently, fire risk should be modeled in a way that is sensitive to local conditions and does not assume uniform risk across all regions. This paper proposes a new approach that tailors fire risk assessment to departmental contexts, offering more actionable and region-specific predictions for operational use. With this, we present the first national-scale AI benchmark for metropolitan France using state-of-the-art AI models on a relatively unexplored dataset. Finally, we offer a summary of important future works that should be taken into account. Supplementary materials are available on GitHub.

[AI-68] HADA: Human-AI Agent Decision Alignment Architecture

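[Quick read]: This paper addresses decision alignment between humans and AI agents in multi-agent systems, in particular keeping large language model (LLM) agents and legacy algorithms aligned with organizational targets and values. The key contribution is HADA, a protocol- and framework-agnostic reference architecture that wraps any algorithm or LLM in role-specific stakeholder agents (business, data-science, audit, ethics, and customer agents) that expose conversational APIs, so that technical and non-technical stakeholders can query, steer, audit, or contest every decision. Alignment objectives, key performance indicators (KPIs), and value constraints are expressed in natural language and are continuously propagated, logged, and versioned, ensuring transparency and traceability while many heterogeneous agents run.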

Link: https://arxiv.org/abs/2506.04253
Authors: Tapio Pitkäranta, Leena Pitkäranta
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures

Abstract: We present HADA (Human-AI Agent Decision Alignment), a protocol- and framework-agnostic reference architecture that keeps both large language model (LLM) agents and legacy algorithms aligned with organizational targets and values. HADA wraps any algorithm or LLM in role-specific stakeholder agents – business, data-science, audit, ethics, and customer – each exposing conversational APIs so that technical and non-technical actors can query, steer, audit, or contest every decision across strategic, tactical, and real-time horizons. Alignment objectives, KPIs, and value constraints are expressed in natural language and are continuously propagated, logged, and versioned while thousands of heterogeneous agents run on different orchestration stacks. A cloud-native proof of concept packages a production credit-scoring model (getLoanDecision) and deploys it on Docker/Kubernetes/Python; five scripted retail-bank scenarios show how target changes, parameter tweaks, explanation requests, and ethics triggers flow end to end through the architecture. Evaluation followed the Design-Science Research Methodology. Walkthrough observation and log inspection demonstrated complete coverage of six predefined objectives: every role could invoke conversational control, trace KPIs and value constraints, detect and mitigate ZIP-code bias, and reproduce full decision lineage, independent of the underlying LLM or agent library. Contributions: (1) an open-source HADA architecture, (2) a mid-range design theory for human-AI alignment in multi-agent systems, and (3) empirical evidence that framework-agnostic, protocol-compliant stakeholder agents improve accuracy, transparency, and ethical compliance in real-world decision pipelines.

[AI-69] Language-Guided Multi-Agent Learning in Simulations: A Unified Framework and Evaluation

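[Quick read]: This paper addresses weak coordination, communication, and generalization in multi-agent reinforcement learning (MARL), particularly in simulated game environments. The key contribution is the LLM-MARL framework, which integrates large language models (LLMs) into MARL through three modules (Coordinator, Communicator, and Memory) that generate subgoals, support symbolic inter-agent messaging, and provide episodic recall, improving cooperation efficiency and generalization.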

Link: https://arxiv.org/abs/2506.04251
Authors: Zhengyang Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:This paper introduces LLM-MARL, a unified framework that incorporates large language models (LLMs) into multi-agent reinforcement learning (MARL) to enhance coordination, communication, and generalization in simulated game environments. The framework features three modular components of Coordinator, Communicator, and Memory, which dynamically generate subgoals, facilitate symbolic inter-agent messaging, and support episodic recall. Training combines PPO with a language-conditioned loss and LLM query gating. LLM-MARL is evaluated in Google Research Football, MAgent Battle, and StarCraft II. Results show consistent improvements over MAPPO and QMIX in win rate, coordination score, and zero-shot generalization. Ablation studies demonstrate that subgoal generation and language-based messaging each contribute significantly to performance gains. Qualitative analysis reveals emergent behaviors such as role specialization and communication-driven tactics. By bridging language modeling and policy learning, this work contributes to the design of intelligent, cooperative agents in interactive simulations. It offers a path forward for leveraging LLMs in multi-agent systems used for training, games, and human-AI collaboration.

[AI-70] Zero-Shot Adaptation of Parameter-Efficient Fine-Tuning in Diffusion Models ICML2025

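[Quick read]: This paper addresses the problem that parameter-efficient fine-tuning of text-to-image diffusion models must traditionally be retrained when the base model changes, which is often constrained by data availability. The key contribution is ProLoRA, which projects pre-trained low-rank adjustments (LoRA) from the source model into the weight space of the target model, exploiting subspace and null-space similarities and selectively targeting aligned layers, to achieve zero-shot adaptation without additional training data.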

Link: https://arxiv.org/abs/2506.04244
Authors: Farzad Farhadzadeh, Debasmit Das, Shubhankar Borse, Fatih Porikli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICML 2025

Abstract:We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments (e.g., LoRA) from a source to a target model without additional training data. This overcomes the limitations of traditional methods that require retraining when switching base models, often challenging due to data constraints. ProLoRA achieves this via projection of source adjustments into the target model’s weight space, leveraging subspace and null space similarities and selectively targeting aligned layers. Evaluations on established text-to-image models demonstrate successful knowledge transfer and comparable performance without retraining.

[AI-71] Triple Attention Transformer Architecture for Time-Dependent Concrete Creep Prediction

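[Quick read]: This paper addresses a fundamental limitation of traditional approaches to predicting time-dependent concrete creep, which treat time as an input parameter rather than modeling the sequential nature of deformation development. The key idea is to recast concrete creep prediction as an autoregressive sequence modeling task similar to language processing and to introduce a triple-stream attention mechanism (the Triple Attention Transformer Architecture): temporal attention captures long-range dependencies in historical creep patterns, feature attention models interactions between material properties, and batch attention describes relationships between samples, substantially improving predictive performance.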

Link: https://arxiv.org/abs/2506.04243
Authors: Warayut Dokduea, Weerachart Tangchirapat, Sompote Youwai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper presents a novel Triple Attention Transformer Architecture for predicting time-dependent concrete creep, addressing fundamental limitations in current approaches that treat time as merely an input parameter rather than modeling the sequential nature of deformation development. By transforming concrete creep prediction into an autoregressive sequence modeling task similar to language processing, our architecture leverages the transformer’s self-attention mechanisms to capture long-range dependencies in historical creep patterns. The model implements a triple-stream attention framework incorporating temporal attention for sequential progression, feature attention for material property interactions, and batch attention for inter-sample relationships. Evaluated on experimental datasets with standardized daily measurements spanning 160 days, the architecture achieves exceptional performance with mean absolute percentage error of 1.63% and R² values of 0.999 across all datasets, substantially outperforming traditional empirical models and existing machine learning approaches. Ablation studies confirm the critical role of attention mechanisms, with attention pooling contributing most significantly to model performance. SHAP analysis reveals Young’s modulus as the primary predictive feature, followed by density and compressive strength, providing interpretability essential for engineering applications. A deployed web-based interface facilitates practical implementation, enabling real-time predictions using standard laboratory parameters. This work establishes the viability of applying transformer architectures to materials science problems, demonstrating the potential for data-driven approaches to revolutionize structural behavior prediction and engineering design practices.
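For readers unfamiliar with the two metrics quoted in this abstract, the sketch below computes the mean absolute percentage error and the coefficient of determination using their standard definitions. The numbers are made up and have no connection to the paper's data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
error_pct = mape(actual, predicted)
fit_quality = r_squared(actual, predicted)
```

MAPE averages relative errors, so it weights small and large measurements equally. R² compares the residual error with the variance of the measurements, so a value near 1 means the predictions explain almost all of that variance.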

[AI-72] A Comprehensive Survey on the Risks and Limitations of Concept-based Models

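[Quick read]: This paper addresses the risks and limitations of Concept-based Models in practice, including structural flaws, problems in the training procedure, weak underlying assumptions, and vulnerability to adversarial examples. Its key contribution is a systematic survey that aggregates commonly encountered challenges, discusses architecture choices and improvements that mitigate them in supervised and unsupervised paradigms, and reviews recent advances in model reliability to guide future research.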

Link: https://arxiv.org/abs/2506.04237
Authors: Sanchit Sinha, Aidong Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Concept-based Models are a class of inherently explainable networks that improve upon standard Deep Neural Networks by providing a rationale behind their predictions using human-understandable ‘concepts’. With these models being highly successful in critical applications like medical diagnosis and financial risk prediction, there is a natural push toward their wider adoption in sensitive domains to instill greater trust among diverse stakeholders. However, recent research has uncovered significant limitations in the structure of such networks, their training procedure, underlying assumptions, and their susceptibility to adversarial vulnerabilities. In particular, issues such as concept leakage, entangled representations, and limited robustness to perturbations pose challenges to their reliability and generalization. Additionally, the effectiveness of human interventions in these models remains an open question, raising concerns about their real-world applicability. In this paper, we provide a comprehensive survey on the risks and limitations associated with Concept-based Models. In particular, we focus on aggregating commonly encountered challenges and the architecture choices mitigating these challenges for Supervised and Unsupervised paradigms. We also examine recent advances in improving their reliability and discuss open problems and promising avenues of future research in this domain.

[AI-73] Spore in the Wild: Case Study on Spore.fun, a Real-World Experiment of Sovereign Agent Open-ended Evolution on Blockchain with TEEs

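[Quick read]: This paper addresses the long-standing failure of Artificial Life (ALife) research to achieve Open-Ended Evolution (OEE), that is, the continuous emergence of novelty in a computational system. Traditional approaches rely on closed-system simulations such as Tierra and Avida, which tend to plateau after an initial burst of novelty. The key idea is to use the permissionless computational substrate provided by decentralized physical infrastructure networks (DePIN), together with blockchains and Trusted Execution Environments (TEEs), to deploy large language model (LLM)-based autonomous AI agents that operate "in the wild" with self-sovereignty, so that economic incentives and interaction with the environment drive a sustained evolutionary process.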

Link: https://arxiv.org/abs/2506.04236
Authors: Botao Amber Hu, Helena Rong
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
Comments: Submitted to ALIFE 2025

Abstract: In Artificial Life (ALife) research, replicating Open-Ended Evolution (OEE), the continuous emergence of novelty observed in biological life, has traditionally been pursued within isolated closed system simulations, such as Tierra and Avida, which have typically plateaued after an initial burst of novelty, failing to achieve sustained OEE. Scholars suggest that OEE requires an “open” system that continually exchanges information or energy with its environment. A recent technological innovation in decentralized physical infrastructure networks (DePIN) providing permissionless computational substrates enables deploying large language model (LLM)-based AI agents on blockchains integrated with Trusted Execution Environments (TEEs). This enables on-chain agents to operate autonomously “in the wild,” achieving self-sovereignty without human oversight. These agents can control their own social media accounts and cryptocurrency wallets, allowing them to interact directly with blockchain-based financial networks and broader human social media. Building on this new paradigm of on-chain agents, Spore.fun is a recent real-world AI evolution experiment that enables autonomous breeding and evolution of new on-chain agents. This paper presents a detailed case study of Spore.fun, examining agent behaviors and their evolutionary trajectories through digital ethology. We aim to spark discussion about whether “open” ALife systems “in-the-wild,” based on permissionless computational substrates and driven by economic incentives to interact with their environment, could finally achieve the long-sought goal of OEE.

[AI-74] Computationally Intensive Research: Advancing a Role for Secondary Analysis of Qualitative Data

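[Quick read]: This paper addresses the fact that much of the data produced by traditional qualitative research remains underused, and argues for secondary analysis of existing qualitative data with computational methods to unlock its value. The key idea is computationally intensive secondary analysis that draws on data assemblages spanning multiple contexts and timeframes to study cross-contextual and longitudinal research phenomena and questions, supporting the development of innovative research designs.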

Link: https://arxiv.org/abs/2506.04230
Authors: Kaveh Mohajeri, Amir Karami
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 20 Pages

Abstract:This paper draws attention to the potential of computational methods in reworking data generated in past qualitative studies. While qualitative inquiries often produce rich data through rigorous and resource-intensive processes, much of this data usually remains unused. In this paper, we first make a general case for secondary analysis of qualitative data by discussing its benefits, distinctions, and epistemological aspects. We then argue for opportunities with computationally intensive secondary analysis, highlighting the possibility of drawing on data assemblages spanning multiple contexts and timeframes to address cross-contextual and longitudinal research phenomena and questions. We propose a scheme to perform computationally intensive secondary analysis and advance ideas on how this approach can help facilitate the development of innovative research designs. Finally, we enumerate some key challenges and ongoing concerns associated with qualitative data sharing and reuse.

[AI-75] COSMOS: Predictable and Cost-Effective Adaptation of LLMs

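[Quick read]: This paper addresses how to predict the outcome of adaptation strategies for large language models (LLMs) accurately under resource constraints, avoiding costly experimentation. The key contribution is the COSMOS framework, which uses embedding-augmented lightweight proxy models to predict fine-tuning performance and low-sample scaling laws to forecast retrieval-augmented in-context learning, enabling efficient, low-cost prediction of adaptation outcomes.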

Link: https://arxiv.org/abs/2505.01449
Authors: Jiayu Wang, Aws Albarghouthi, Frederic Sala
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) achieve remarkable performance across numerous tasks by using a diverse array of adaptation strategies. However, optimally selecting a model and adaptation strategy under resource constraints is challenging and often requires extensive experimentation. We investigate whether it is possible to accurately predict both performance and cost without expensive trials. We formalize the strategy selection problem for LLMs and introduce COSMOS, a unified prediction framework that efficiently estimates adaptation outcomes at minimal cost. We instantiate and study the capability of our framework via a pair of powerful predictors: embedding-augmented lightweight proxy models to predict fine-tuning performance, and low-sample scaling laws to forecast retrieval-augmented in-context learning. Extensive evaluation across eight representative benchmarks demonstrates that COSMOS achieves high prediction accuracy while reducing computational costs by 92.72% on average, and up to 98.71% in resource-intensive scenarios. Our results show that efficient prediction of adaptation outcomes is not only feasible but can substantially reduce the computational overhead of LLM deployment while maintaining performance standards.
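The abstract does not state the functional form of its low-sample scaling laws. As a generic illustration of the idea (fit a curve on a few cheap measurements, then extrapolate), the sketch below assumes a plain power law, error ≈ a · n^(−b), and fits it by least squares in log-log space. The data are synthetic, and the function names are hypothetical and not part of COSMOS.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error ≈ a * size**(-b) by linear regression on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    a = math.exp(y_mean - slope * x_mean)
    return a, -slope


def predict_error(a, b, size):
    """Extrapolate the fitted law to a new sample size."""
    return a * size ** (-b)


# Synthetic measurements generated from a known law (a = 0.5, b = 0.3).
sizes = [10, 20, 40, 80]
errors = [0.5 * n ** -0.3 for n in sizes]
a, b = fit_power_law(sizes, errors)
forecast = predict_error(a, b, 1000)
```

On noise-free synthetic data the fit recovers the generating parameters. With real measurements the extrapolation is only as good as the assumed functional form.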

[AI-76] Was Residual Penalty and Neural Operators All We Needed for Solving Optimal Control Problems?

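[Quick read]: This paper addresses optimal control problems, which are typically solved by training neural networks with a combined loss over data, differential equation residuals, and objective costs. The paper argues that including cost functions in training is unnecessary and decouples the optimal control problem from the training process, using a simpler architecture and workflow. Specifically, a simple neural operator architecture (such as DeepONet) combined with an unconstrained optimization routine solves multiple optimal control problems with a single physics-informed training phase followed by an optimization phase. This is done by adding a penalty term based on the differential equation residual to the cost function and using automatic differentiation to compute gradients with respect to the control during iterative optimization.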

Link: https://arxiv.org/abs/2506.04742
Authors: Oliver G. S. Lundqvist, Fabricio Oliveira
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural networks have been used to solve optimal control problems, typically by training neural networks using a combined loss function that considers data, differential equation residuals, and objective costs. We show that including cost functions in the training process is unnecessary, advocating for a simpler architecture and streamlined approach by decoupling the optimal control problem from the training process. Thus, our work shows that a simple neural operator architecture, such as DeepONet, coupled with an unconstrained optimization routine, can solve multiple optimal control problems with a single physics-informed training phase and a subsequent optimization phase. We achieve this by adding a penalty term based on the differential equation residual to the cost function and computing gradients with respect to the control using automatic differentiation through the trained neural operator within an iterative optimization routine. We showcase our method on nine distinct optimal control problems by training three separate DeepONet models, each corresponding to a different differential equation. For each model, we solve three problems with varying cost functions, demonstrating accurate and consistent performance across all cases.
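The penalized objective described in this abstract can be written schematically as follows. All symbols are assumptions introduced here for illustration: $u$ is the control, $\mathcal{G}_\theta$ the trained neural operator mapping a control to a state trajectory, $J$ the problem's cost, $\mathcal{R}$ the differential equation residual, $\lambda$ a penalty weight, and $\eta$ a step size. The squared norm and the plain gradient step are one possible choice, since the abstract only specifies an unconstrained iterative optimization routine.

```latex
\tilde{J}(u) = J\bigl(u, \mathcal{G}_\theta(u)\bigr)
  + \lambda \,\bigl\lVert \mathcal{R}\bigl(\mathcal{G}_\theta(u), u\bigr) \bigr\rVert^{2},
\qquad
u_{k+1} = u_k - \eta \,\nabla_u \tilde{J}(u_k)
```

Because $\theta$ stays fixed after training, only $u$ is optimized, and $\nabla_u \tilde{J}$ is obtained by automatic differentiation through $\mathcal{G}_\theta$. The same trained operator can therefore be reused for different cost functions $J$.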

[AI-77] The Latent Space Hypothesis: Toward Universal Medical Representation Learning

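[Quick Read]: This paper addresses how to capture the unified physiological state implicit in multimodal medical data, and how to achieve personalized diagnosis and treatment from heterogeneous medical evidence. The key to its solution is the Latent Space Hypothesis: different medical data modalities are treated as projections of the same high-dimensional manifold, and within this geometric representation an individual's pathological changes are modeled as a health-status point, a disease-progression trajectory, and a therapeutic-intervention vector, providing a quantitative basis for personalized medicine.

Link: https://arxiv.org/abs/2506.04515
Authors: Salil Patel
Affiliations: unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 51 pages, 12 figures. A position paper examining the latent space hypothesis - the proposition that diverse medical data can be represented in shared latent spaces reflecting fundamental biological processes. The paper discusses theoretical foundations, reviews supporting evidence, and considers potential implications for medical AI and representation learning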

Abstract:Medical data range from genomic sequences and retinal photographs to structured laboratory results and unstructured clinical narratives. Although these modalities appear disparate, many encode convergent information about a single underlying physiological state. The Latent Space Hypothesis frames each observation as a projection of a unified, hierarchically organized manifold – much like shadows cast by the same three-dimensional object. Within this learned geometric representation, an individual’s health status occupies a point, disease progression traces a trajectory, and therapeutic intervention corresponds to a directed vector. Interpreting heterogeneous evidence in a shared space provides a principled way to re-examine eponymous conditions – such as Parkinson’s or Crohn’s – that often mask multiple pathophysiological entities and involve broader anatomical domains than once believed. By revealing sub-trajectories and patient-specific directions of change, the framework supplies a quantitative rationale for personalised diagnosis, longitudinal monitoring, and tailored treatment, moving clinical practice away from grouping by potentially misleading labels toward navigation of each person’s unique trajectory. Challenges remain – bias amplification, data scarcity for rare disorders, privacy, and the correlation-causation divide – but scale-aware encoders, continual learning on longitudinal data streams, and perturbation-based validation offer plausible paths forward.

[AI-78] Diffusion Transformer-based Universal Dose Denoising for Pencil Beam Scanning Proton Therapy

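[Quick Read]: This paper addresses the sensitivity of intensity-modulated proton therapy (IMPT) to anatomical changes in head and neck cancer treatment. Online adaptive radiation therapy (oART) handles these changes through frequent adaptation, which depends on fast, accurate Monte Carlo (MC) dose calculation. Because reducing the particle count speeds up computation but degrades accuracy, the paper proposes a diffusion transformer-based denoising framework whose key idea is to denoise low-statistics MC dose maps, enabling fast, high-quality dose generation.

Link: https://arxiv.org/abs/2506.04467
Authors: Yuzhen Ding, Jason Holmes, Hongying Feng, Martin Bues, Lisa A. McGee, Jean-Claude M. Rwigema, Nathan Y. Yu, Terence S. Sio, Sameer R. Keole, William W. Wong, Steven E. Schild, Jonathan B. Ashman, Sujay A. Vora, Daniel J. Ma, Samir H. Patel, Wei Liu
Affiliations: unknown
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: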

Abstract:Purpose: Intensity-modulated proton therapy (IMPT) offers precise tumor coverage while sparing organs at risk (OARs) in head and neck (HN) cancer. However, its sensitivity to anatomical changes requires frequent adaptation through online adaptive radiation therapy (oART), which depends on fast, accurate dose calculation via Monte Carlo (MC) simulations. Reducing particle count accelerates MC but degrades accuracy. To address this, denoising low-statistics MC dose maps is proposed to enable fast, high-quality dose generation. Methods: We developed a diffusion transformer-based denoising framework. IMPT plans and 3D CT images from 80 HN patients were used to generate noisy and high-statistics dose maps using MCsquare (1 min and 10 min per plan, respectively). Data were standardized into uniform chunks with zero-padding, normalized, and transformed into quasi-Gaussian distributions. Testing was done on 10 HN, 10 lung, 10 breast, and 10 prostate cancer cases, preprocessed identically. The model was trained with noisy dose maps and CT images as input and high-statistics dose maps as ground truth, using a combined loss of mean square error (MSE), residual loss, and regional MAE (focusing on top/bottom 10% dose voxels). Performance was assessed via MAE, 3D Gamma passing rate, and DVH indices. Results: The model achieved MAEs of 0.195 (HN), 0.120 (lung), 0.172 (breast), and 0.376 Gy[RBE] (prostate). 3D Gamma passing rates exceeded 92% (3%/2mm) across all sites. DVH indices for clinical target volumes (CTVs) and OARs closely matched the ground truth. Conclusion: A diffusion transformer-based denoising framework was developed and, though trained only on HN data, generalizes well across multiple disease sites. 
Submission history: [v1] Wed, 4 Jun 2025 21:37:15 UTC (1,080 KB)

[AI-79] Knowledge-guided Contextual Gene Set Analysis Using Large Language Models

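[Quick Read]: This paper addresses the lack of context awareness in conventional gene set analysis (GSA) methods in clinical settings, which produces lists of enriched pathways that are redundant, nonspecific, or irrelevant and therefore require extensive manual intervention, reducing the reliability and reproducibility of results. The key to its solution is cGSA, a new AI-driven framework that integrates gene cluster detection, enrichment analysis, and large language models to prioritize pathways in a context-aware way, identifying pathways that are both statistically significant and biologically meaningful.

Link: https://arxiv.org/abs/2506.04303
Authors: Zhizheng Wang, Chi-Ping Day, Chih-Hsuan Wei, Qiao Jin, Robert Leaman, Yifan Yang, Shubo Tian, Aodong Qiu, Yin Fang, Qingqing Zhu, Xinghua Lu, Zhiyong Lu
Affiliations: unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 56 pages, 9 figures, 1 table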

Abstract:Gene set analysis (GSA) is a foundational approach for interpreting genomic data of diseases by linking genes to biological processes. However, conventional GSA methods overlook clinical context of the analyses, often generating long lists of enriched pathways with redundant, nonspecific, or irrelevant results. Interpreting these requires extensive, ad-hoc manual effort, reducing both reliability and reproducibility. To address this limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by incorporating context-aware pathway prioritization. cGSA integrates gene cluster detection, enrichment analysis, and large language models to identify pathways that are not only statistically significant but also biologically meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases and ten disease-related biological mechanisms shows that cGSA outperforms baseline methods by over 30%, with expert validation confirming its increased precision and interpretability. Two independent case studies in melanoma and breast cancer further demonstrate its potential to uncover context-specific insights and support targeted hypothesis generation.

[AI-80] The GAIN Model: A Nature-Inspired Neural Network Framework Based on an Adaptation of the Izhikevich Model

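[Quick Read]: This paper addresses the shortcomings of conventional neural networks in biological plausibility and dynamics, aiming to improve both biological accuracy and computational efficiency. The key to its solution is a grid-based structure in which neurons interact with their nearest neighbors and improve their mutual connections, combined with the Izhikevich model, yielding simulations that are computationally efficient and biologically realistic.

Link: https://arxiv.org/abs/2506.04247
Authors: Gage K. R. Hooper
Affiliations: unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 31 pages, 16 figures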

Abstract:While many neural networks focus on layers to process information, the GAIN model uses a grid-based structure to improve biological plausibility and the dynamics of the model. The grid structure helps neurons to interact with their closest neighbors and improve their connections with one another, which is seen in biological neurons. While also being implemented with the Izhikevich model this approach allows for a computationally efficient and biologically accurate simulation that can aid in the development of neural networks, large scale simulations, and the development in the neuroscience field. This adaptation of the Izhikevich model can improve the dynamics and accuracy of the model, allowing for its uses to be specialized but efficient.
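
For readers unfamiliar with the Izhikevich model that this abstract builds on, here is a minimal sketch of a single neuron in plain Python. It uses the standard published equations and the usual "regular spiking" parameters (a=0.02, b=0.2, c=-65, d=8). The function name, the Euler step size, and the input currents are illustrative assumptions, and the grid coupling between neighbors that GAIN adds is not reproduced here.

```python
def simulate_izhikevich(current, duration_ms=1000.0, dt=0.25,
                        a=0.02, b=0.2, c=-65.0, d=8.0):
    """Euler-integrate one Izhikevich neuron; return its spike times in ms.

    Model: v' = 0.04 v^2 + 5 v + 140 - u + I,  u' = a (b v - u),
    with the reset v <- c, u <- u + d whenever v reaches 30 mV.
    """
    v = c          # membrane potential (mV), started at the reset value
    u = b * v      # recovery variable
    spikes = []
    steps = int(duration_ms / dt)
    for k in range(steps):
        if v >= 30.0:              # spike cutoff reached: record and reset
            spikes.append(k * dt)
            v = c
            u += d
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dt * dv
        u += dt * du
    return spikes


# Illustrative inputs (assumed values): no drive versus a constant drive.
quiet = simulate_izhikevich(current=0.0)
active = simulate_izhikevich(current=10.0)
print(len(quiet), len(active))
```

With no input the neuron settles at its resting potential and stays silent, while a constant input of 10 removes the resting fixed point and the neuron fires repeatedly. A grid model in the spirit of the abstract would presumably add each neighbor's activity to `current`, but that wiring is a guess and is left out.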

[AI-81] Benchmark for Antibody Binding Affinity Maturation and Design

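[Quick Read]: This paper addresses evaluation and optimization in antibody binding affinity maturation and design. Existing approaches rely mainly on the antibody alone and its similarity to natural antibodies (e.g., amino acid identity rate, structural RMSD), and so fail to fully reflect the functional properties of the antibody-antigen (Ab-Ag) complex. The key to its solution is the AbBiBench benchmarking framework, which treats the Ab-Ag complex as a functional unit and evaluates the binding potential of an antibody design by measuring a protein model's likelihood on the Ab-Ag complex, providing a unified, biologically grounded evaluation framework.

Link: https://arxiv.org/abs/2506.04235
Authors: Xinyan Zhao, Yi-Ching Tang, Akshita Singh, Victor J Cantu, KwanHo An, Junseok Lee, Adam E Stogsdill, Ashwin Kumar Ramesh, Zhiqiang An, Xiaoqian Jiang, Yejin Kim
Affiliations: unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: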

Abstract:We introduce AbBiBench (Antibody Binding Benchmarking), a benchmarking framework for antibody binding affinity maturation and design. Unlike existing antibody evaluation strategies that rely on antibody alone and its similarity to natural ones (e.g., amino acid identity rate, structural RMSD), AbBiBench considers an antibody-antigen (Ab-Ag) complex as a functional unit and evaluates the potential of an antibody design binding to given antigen by measuring protein model’s likelihood on the Ab-Ag complex. We first curate, standardize, and share 9 datasets containing 9 antigens (involving influenza, anti-lysozyme, HER2, VEGF, integrin, and SARS-CoV-2) and 155,853 heavy chain mutated antibodies. Using these datasets, we systematically compare 14 protein models including masked language models, autoregressive language models, inverse folding models, diffusion-based generative models, and geometric graph models. The correlation between model likelihood and experimental affinity values is used to evaluate model performance. Additionally, in a case study to increase binding affinity of antibody F045-092 to antigen influenza H1N1, we evaluate the generative power of the top-performing models by sampling a set of new antibodies binding to the antigen and ranking them based on structural integrity and biophysical properties of the Ab-Ag complex. As a result, structure-conditioned inverse folding models outperform others in both affinity correlation and generation tasks. Overall, AbBiBench provides a unified, biologically grounded evaluation framework to facilitate the development of more effective, function-aware antibody design models.

Machine Learning

[LG-0] Seeing the Invisible: Machine learning-Based QPI Kernel Extraction via Latent Alignment

Link: https://arxiv.org/abs/2506.05325
Authors: Yingshuai Ji, Haomin Zhuang, Matthew Toole, James McKenzie, Xiaolong Liu, Xiangliang Zhang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Quasiparticle interference (QPI) imaging is a powerful tool for probing electronic structures in quantum materials, but extracting the single-scatterer QPI pattern (i.e., the kernel) from a multi-scatterer image remains a fundamentally ill-posed inverse problem. In this work, we propose the first AI-based framework for QPI kernel extraction. We introduce a two-step learning strategy that decouples kernel representation learning from observation-to-kernel inference. In the first step, we train a variational autoencoder to learn a compact latent space of scattering kernels. In the second step, we align the latent representation of QPI observations with those of the pre-learned kernels using a dedicated encoder. This design enables the model to infer kernels robustly even under complex, entangled scattering conditions. We construct a diverse and physically realistic QPI dataset comprising 100 unique kernels and evaluate our method against a direct one-step baseline. Experimental results demonstrate that our approach achieves significantly higher extraction accuracy, and improved generalization to unseen kernels.

[LG-1] LSM-2: Learning from Incomplete Wearable Sensor Data

Link: https://arxiv.org/abs/2506.05321
Authors: Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff
Subjects: Machine Learning (cs.LG)
Comments: Xu and Narayanswamy are co-first authors. McDuff and Liu are co-last authors

Abstract:Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM’s core novelty lies in its use of learnable mask tokens to model both existing (“inherited”) and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.

[LG-2] Learning normalized image densities via dual score matching

Link: https://arxiv.org/abs/2506.05310
Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning normalized energy (log probability) models that is inspired from diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this dual score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model strongly generalizes: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

[LG-3] Power Law Guided Dynamic Sifting for Efficient Attention

Link: https://arxiv.org/abs/2506.05300
Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate attention methods address this issue by reducing computational and memory overhead but often rely on expensive top-$k$ operations, which perform poorly on GPUs. We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value. Our intuition for doing this is based on our empirical observation that the $\tau$-th quantile of attention scores follows a predictable power-law over sequential generation steps. Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step. Only attention scores above this threshold and their corresponding value vectors are loaded/used to compute the attention output, reducing data movement between HBM and SRAM. Our evaluation demonstrates that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.
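
The thresholding idea in this abstract can be illustrated with a small plain-Python sketch. It is a toy under stated assumptions, not the authors' kernel: the names `fit_power_law` and `sifted_attention` are hypothetical, the power law is fitted by ordinary least squares in log-log space, and the threshold is applied to softmax weights without renormalization. The abstract does not specify any of these details.

```python
import math


def fit_power_law(steps, values):
    """Least-squares fit of values ~ a * steps**b in log-log space."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sifted_attention(scores, values, threshold):
    """Attention output that only touches value rows whose weight exceeds threshold.

    This is an element-wise comparison, so no sorting or top-k selection is needed.
    """
    weights = softmax(scores)
    kept = [i for i, w in enumerate(weights) if w > threshold]
    out = [0.0] * len(values[0])
    for i in kept:                      # only these value vectors are "loaded"
        for d, val in enumerate(values[i]):
            out[d] += weights[i] * val
    return out, kept


# Assumed warm-up measurements of a score quantile at generation steps 1, 2, 4, 8.
a, b = fit_power_law([1, 2, 4, 8], [0.5, 0.25, 0.125, 0.0625])
threshold_at_step_16 = a * 16 ** b      # extrapolated threshold for a later step

out, kept = sifted_attention(
    scores=[0.0, 0.0, 10.0],
    values=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    threshold=0.01,
)
```

The design point the sketch is meant to show is that a predicted threshold turns selection into a per-element test, which parallelizes trivially, whereas top-$k$ requires a global comparison across all scores.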

[LG-4] A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Link: https://arxiv.org/abs/2506.05294
Authors: Arnav Kumar Jain, Vibhakar Mohta, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, Gokul Swamy
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don’t know how to recover from it. In this sense, BC is akin to giving the agent the fish – giving them dense supervision across a narrow set of states – rather than teaching them to fish: to be able to reason independently about achieving the expert’s outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at this https URL .

[LG-5] Learning Beyond Experience: Generalizing to Unseen State Space with Reservoir Computing

Link: https://arxiv.org/abs/2506.05292
Authors: Declan A. Norton, Yuanzhao Zhang, Michelle Girvan
Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
Comments: 15 pages, 9 figures

Abstract:Machine learning techniques offer an effective approach to modeling dynamical systems solely from observed data. However, without explicit structural priors – built-in assumptions about the underlying dynamics – these techniques typically struggle to generalize to aspects of the dynamics that are poorly represented in the training data. Here, we demonstrate that reservoir computing – a simple, efficient, and versatile machine learning framework often used for data-driven modeling of dynamical systems – can generalize to unexplored regions of state space without explicit structural priors. First, we describe a multiple-trajectory training scheme for reservoir computers that supports training across a collection of disjoint time series, enabling effective use of available training data. Then, applying this training scheme to multistable dynamical systems, we show that RCs trained on trajectories from a single basin of attraction can achieve out-of-domain generalization by capturing system behavior in entirely unobserved basins.

[LG-6] How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

Link: https://arxiv.org/abs/2506.05276
Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) - making precise modifications while preserving temporal coherence - consider both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints. This framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints and a classifier-based control for managing statistical properties such as sums and averages over segments. Our methods achieve precise local control during the denoising inference stage while maintaining temporal coherence and integrating seamlessly, with any conditionally trained diffusion-based time series models. Extensive experiments across diverse datasets and models demonstrate its effectiveness. Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing. The code and demo are provided for validation.

[LG-7] Tight analyses of first-order methods with error feedback

Link: https://arxiv.org/abs/2506.05271
Authors: Daniel Berg Thomsen, Adrien Taylor, Aymeric Dieuleveut
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Comments:

Abstract:Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes – most notably $\mathrm{EF}$ and $\mathrm{EF}^{21}$ – were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method – with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between $\mathrm{EF}$, $\mathrm{EF}^{21}$, and compressed gradient descent. Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
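
To make the two schemes named in this abstract concrete, here is a minimal single-worker sketch in plain Python. It follows the update rules as they are commonly stated for classic error feedback and for EF21. The toy quadratic objective, the top-1 compressor, the step size, and all function names are illustrative assumptions and are not taken from the paper.

```python
CURV = [1.0, 2.0, 3.0]  # assumed toy objective: f(x) = 0.5 * sum(c_i * x_i**2)


def grad(x):
    return [c * xi for c, xi in zip(CURV, x)]


def loss(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURV, x))


def top1(vec):
    """Contractive compressor: keep only the largest-magnitude entry."""
    j = max(range(len(vec)), key=lambda i: abs(vec[i]))
    return [v if i == j else 0.0 for i, v in enumerate(vec)]


def run_ef(x, step, iters):
    """Classic error feedback: compress (step * gradient + carried-over error)."""
    e = [0.0] * len(x)
    for _ in range(iters):
        p = [step * g + ei for g, ei in zip(grad(x), e)]
        delta = top1(p)                              # what is actually transmitted
        x = [xi - di for xi, di in zip(x, delta)]
        e = [pi - di for pi, di in zip(p, delta)]    # remember what was dropped
    return x


def run_ef21(x, step, iters):
    """EF21: keep a gradient estimate g and transmit compressed corrections to it."""
    g = top1(grad(x))
    for _ in range(iters):
        x = [xi - step * gi for xi, gi in zip(x, g)]
        diff = [gn - gi for gn, gi in zip(grad(x), g)]
        g = [gi + ci for gi, ci in zip(g, top1(diff))]
    return x


x0 = [1.0, 1.0, 1.0]
print(loss(x0), loss(run_ef(x0, 0.02, 5000)), loss(run_ef21(x0, 0.02, 5000)))
```

Both loops transmit a single coordinate per iteration yet still drive the loss toward zero on this toy problem, because the dropped information is either stored in `e` (EF) or absorbed into the running estimate `g` (EF21). The step size 0.02 was chosen conservatively for this example and is not a recommendation from the paper.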

[LG-8] Learning long range dependencies through time reversal symmetry breaking

Link: https://arxiv.org/abs/2506.05259
Authors: Guillaume Pourcel, Maxence Ernoult
Subjects: Machine Learning (cs.LG)
Comments: 45 pages

Abstract:Deep State Space Models (SSMs) reignite physics-grounded compute paradigms, as RNNs could natively be embodied into dynamical systems. This calls for dedicated learning algorithms obeying to core physical principles, with efficient techniques to simulate these systems and guide their design. We propose Recurrent Hamiltonian Echo Learning (RHEL), an algorithm which provably computes loss gradients as finite differences of physical trajectories of non-dissipative, Hamiltonian systems. In ML terms, RHEL only requires three “forward passes” irrespective of model size, without explicit Jacobian computation, nor incurring any variance in the gradient estimation. Motivated by the physical realization of our algorithm, we first introduce RHEL in continuous time and demonstrate its formal equivalence with the continuous adjoint state method. To facilitate the simulation of Hamiltonian systems trained by RHEL, we propose a discrete-time version of RHEL which is equivalent to Backpropagation Through Time (BPTT) when applied to a class of recurrent modules which we call Hamiltonian Recurrent Units (HRUs). This setting allows us to demonstrate the scalability of RHEL by generalizing these results to hierarchies of HRUs, which we call Hamiltonian SSMs (HSSMs). We apply RHEL to train HSSMs with linear and nonlinear dynamics on a variety of time-series tasks ranging from mid-range to long-range classification and regression with sequence length reaching $\sim 50k$. We show that RHEL consistently matches the performance of BPTT across all models and tasks. This work opens new doors for the design of scalable, energy-efficient physical systems endowed with self-learning capabilities for sequence modelling.

[LG-9] Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

Link: https://arxiv.org/abs/2506.05252
Authors: Dravyansh Sharma, Alec Sun
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 24 pages

Abstract:Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a mild generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.

[LG-10] On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Link: https://arxiv.org/abs/2506.05249
Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components–such as self-attention mechanisms and feedforward networks–without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.

[LG-11] Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Link: https://arxiv.org/abs/2506.05239
Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Subjects: Machine Learning (cs.LG)
Comments: Complementary work to arXiv:2506.03093

Abstract:Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we introduce a multi-iteration SAE by unrolling Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
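
The classical Matching Pursuit loop that this abstract unrolls into an encoder can be sketched in a few lines of plain Python. This is the textbook algorithm over a fixed dictionary of unit-norm atoms, not the paper's MP-SAE. The dictionary, the signal, and the function names are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, atoms, n_iters):
    """Greedy sparse coding over unit-norm atoms.

    Returns the coefficients and the residual norm after every iteration.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    norms = [math.sqrt(dot(residual, residual))]
    for _ in range(n_iters):
        corr = [dot(residual, atom) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))  # best-matching atom
        coeffs[k] += corr[k]
        # Subtract the projection onto the chosen atom; the next pick sees only
        # what is still unexplained, which is the "residual-guided" part.
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
        norms.append(math.sqrt(dot(residual, residual)))
    return coeffs, norms


s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [s, s]]        # two correlated (non-orthogonal) unit-norm atoms
coeffs, norms = matching_pursuit([1.0, 2.0], atoms, n_iters=5)
print(coeffs, norms)
```

Because each step removes the projection of the residual onto a unit-norm atom, the squared residual norm drops by exactly the squared correlation, so the reconstruction error can never increase. This is the monotonic-improvement property the abstract refers to, and it holds even when atoms are correlated, as in this toy dictionary.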

[LG-12] Progressive Tempering Sampler with Diffusion ICML2025

Link: https://arxiv.org/abs/2506.05231
Authors: Severi Rissanen, RuiKang OuYang, Jiajun He, Wenlin Chen, Markus Heinonen, Arno Solin, José Miguel Hernández-Lobato
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted for publication at ICML 2025

Abstract:Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), when it comes to the efficiency of target evaluations. On the other hand, unlike a well-trained neural sampler, PT yields only dependent samples and needs to be rerun – at considerable computational cost – whenever new samples are required. To address these weaknesses, we propose the Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures, leveraging the advantages of PT to improve the training of neural samplers. We also introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. PTSD enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers.

[LG-13] A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values

Link: https://arxiv.org/abs/2506.05216
Authors: Tyler Chen, Akshay Seshadri, Mattia J. Villani, Pradeep Niroula, Shouvanik Chakrabarti, Archan Ray, Pranav Deshpande, Romina Yalovetzky, Marco Pistoia, Niraj Kumar
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Quantum Physics (quant-ph)
Comments: 44 pages, 7 figures, 7 tables

Abstract:Shapley values have emerged as a critical tool for explaining which features impact the decisions made by machine learning models. However, computing exact Shapley values is difficult, generally requiring an exponential (in the feature dimension) number of model evaluations. To address this, many model-agnostic randomized estimators have been developed, the most influential and widely used being the KernelSHAP method (Lundberg & Lee, 2017). While related estimators such as unbiased KernelSHAP (Covert & Lee, 2021) and LeverageSHAP (Musco & Witter, 2025) are known to satisfy theoretical guarantees, bounds for KernelSHAP have remained elusive. We describe a broad and unified framework that encompasses KernelSHAP and related estimators constructed using both with and without replacement sampling strategies. We then prove strong non-asymptotic theoretical guarantees that apply to all estimators from our framework. This provides, to the best of our knowledge, the first theoretical guarantees for KernelSHAP and sheds further light on tradeoffs between existing estimators. Through comprehensive benchmarking on small and medium dimensional datasets for Decision-Tree models, we validate our approach against exact Shapley values, consistently achieving low mean squared error with modest sample sizes. Furthermore, we make specific implementation improvements to enable scalability of our methods to high-dimensional datasets. Our methods, tested on datasets such as MNIST and CIFAR10, provide consistently better results compared to the KernelSHAP library.
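
The exponential cost of exact Shapley values mentioned in this abstract is easy to see in a brute-force implementation. The sketch below uses the standard subset-weighted definition in plain Python. The three-player game `toy_value` is a hypothetical example, and none of this is KernelSHAP or the paper's estimator.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n, value):
    """Exact Shapley values for players 0..n-1.

    `value` maps a frozenset of players to a float. The loops evaluate it
    n * 2**n times, which is why exact computation does not scale.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(coalition):
    # Hypothetical game: additive weights plus a bonus when players 0 and 1 cooperate.
    weights = [1.0, 2.0, 3.0]
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(weights[j] for j in coalition) + bonus


phi = exact_shapley(3, toy_value)
print(phi, sum(phi), toy_value(frozenset({0, 1, 2})))
```

In this toy game the cooperation bonus is split evenly between players 0 and 1, and the attributions sum to the value of the full coalition (the efficiency property). Randomized estimators such as those discussed in the abstract approximate the same quantities from a sample of coalitions instead of enumerating all of them.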

[LG-14] Learning Theory of Decentralized Robust Kernel-Based Learning Algorithm

Link: https://arxiv.org/abs/2506.05215
Authors: Zhan Yu
Categories: Machine Learning (cs.LG)

Abstract:We propose a new decentralized robust kernel-based learning algorithm within the framework of reproducing kernel Hilbert space (RKHS) by utilizing a networked system that can be represented as a connected graph. The robust loss function \mathcal{L}_\sigma induced by a windowing function W and a robustness scaling parameter \sigma > 0 can encompass a broad spectrum of robust losses. Consequently, the proposed algorithm effectively provides a unified decentralized learning framework for robust regression, which fundamentally differs from the existing distributed robust kernel learning schemes, all of which are divide-and-conquer based. We rigorously establish the learning theory and offer a comprehensive convergence analysis for the algorithm. We show each local robust estimator generated from the decentralized algorithm can be utilized to approximate the regression function. Based on kernel-based integral operator techniques, we derive general high confidence convergence bounds for each local approximating sequence in terms of the mean square distance, RKHS norm, and generalization error, respectively. Moreover, we provide rigorous selection rules for local sample size and show that, under properly selected step size and scaling parameter \sigma, the decentralized robust algorithm can achieve optimal learning rates (up to logarithmic factors) in both norms. The parameter \sigma is shown to be essential for enhancing robustness while also ensuring favorable convergence behavior. The intrinsic connection among decentralization, sample selection, robustness of the algorithm, and its convergence is clearly reflected.

[LG-15] Trustworthiness Preservation by Copies of Machine Learning Systems

Link: https://arxiv.org/abs/2506.05203
Authors: Leonardo Ceragioli,Giuseppe Primiero
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)

Abstract:A common practice of ML systems development concerns the training of the same model under different data sets, and the use of the same (training and test) sets for different learning models. The first case is a desirable practice for identifying high quality and unbiased training conditions. The latter case coincides with the search for optimal models under a common dataset for training. These differently obtained systems have been considered akin to copies. In the quest for responsible AI, a legitimate but hardly investigated question is how to verify that trustworthiness is preserved by copies. In this paper we introduce a calculus to model and verify probabilistic complex queries over data and define four distinct notions: Justifiably, Equally, Weakly and Almost Trustworthy, which can be checked by analysing the (partial) behaviour of the copy with respect to its original. We provide a study of the relations between these notions of trustworthiness, and how they compose with each other and under logical operations. The aim is to offer a computational tool to check the trustworthiness of possibly complex systems copied from an original whose behaviour is known.

[LG-16] Transformers Meet In-Context Learning: A Universal Approximation Theory

Link: https://arxiv.org/abs/2506.05200
Authors: Gen Li,Yuchen Jiao,Yu Huang,Yuting Wei,Yuxin Chen
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:Modern large language models are capable of in-context learning, the ability to perform new tasks at inference time using only a handful of input-output examples in the prompt, without any fine-tuning or parameter updates. We develop a universal approximation theory to better understand how transformers enable in-context learning. For any class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can perform reliable prediction given only a few in-context examples. In contrast to much of the recent literature that frames transformers as algorithm approximators – i.e., constructing transformers to emulate the iterations of optimization algorithms as a means to approximate solutions of learning problems – our work adopts a fundamentally different approach rooted in universal function approximation. This alternative approach offers approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being approximated, thereby extending far beyond convex problems and linear function classes. Our construction sheds light on how transformers can simultaneously learn general-purpose representations and adapt dynamically to in-context examples.

[LG-17] Locality Preserving Markovian Transition for Instance Retrieval ICML2025

Link: https://arxiv.org/abs/2506.05196
Authors: Jifei Luo,Wenzheng Wu,Hantao Yao,Lu Yu,Changsheng Xu
Categories: Machine Learning (cs.LG)
Comments: This paper has been accepted by ICML2025

Abstract:Diffusion-based re-ranking methods are effective in modeling the data manifolds through similarity propagation in affinity graphs. However, positive signals tend to diminish over several steps away from the source, reducing discriminative power beyond local regions. To address this issue, we introduce the Locality Preserving Markovian Transition (LPMT) framework, which employs a long-term thermodynamic transition process with multiple states for accurate manifold distance measurement. The proposed LPMT first integrates diffusion processes across separate graphs using Bidirectional Collaborative Diffusion (BCD) to establish strong similarity relationships. Afterwards, Locality State Embedding (LSE) encodes each instance into a distribution for enhanced local consistency. These distributions are interconnected via the Thermodynamic Markovian Transition (TMT) process, enabling efficient global retrieval while maintaining local effectiveness. Experimental results across diverse tasks confirm the effectiveness of LPMT for instance retrieval.

[LG-18] Associative Memory and Generative Diffusion in the Zero-noise Limit

Link: https://arxiv.org/abs/2506.05178
Authors: Joshua Hess,Quaid Morris
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Dynamical Systems (math.DS); Adaptation and Self-Organizing Systems (nlin.AO); Neurons and Cognition (q-bio.NC)

Abstract:Connections between generative diffusion and continuous-state associative memory models are studied. Morse-Smale dynamical systems are emphasized as universal approximators of gradient-based associative memory models and diffusion models as white-noise perturbed systems thereof. Universal properties of associative memory that follow from this description are described and used to characterize a generic transition from generation to memory as noise levels diminish. Structural stability inherited by Morse-Smale flows is shown to imply a notion of stability for diffusions at vanishing noise levels. Applied to one- and two-parameter families of gradients, this indicates stability at all but isolated points of associative memory learning landscapes and the learning and generation landscapes of diffusion models with gradient drift in the zero-noise limit, at which small sets of generic bifurcations characterize qualitative transitions between stable systems. Examples illustrating the characterization of these landscapes by sequences of these bifurcations are given, along with structural stability criterion for classic and modern Hopfield networks (equivalently, the attention mechanism).

[LG-19] Federated Isolation Forest for Efficient Anomaly Detection on Edge IoT Systems

Link: https://arxiv.org/abs/2506.05138
Authors: Pavle Vasiljevic,Milica Matic,Miroslav Popovic
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 4 algorithms, 5 figures, 2 tables

Abstract:Recently, federated learning frameworks such as Python TestBed for Federated Learning Algorithms and MicroPython TestBed for Federated Learning Algorithms have emerged to tackle user privacy concerns and efficiency in embedded systems. Even more recently, an efficient federated anomaly detection algorithm, FLiForest, based on Isolation Forests has been developed, offering a low-resource, unsupervised method well-suited for edge deployment and continuous learning. In this paper, we present an application of Isolation Forest-based temperature anomaly detection, developed using the previously mentioned federated learning frameworks, aimed at small edge devices and IoT systems running MicroPython. The system has been experimentally evaluated, achieving over 96% accuracy in distinguishing normal from abnormal readings and above 78% precision in detecting anomalies across all tested configurations, while maintaining a memory usage below 160 KB during model training. These results highlight its suitability for resource-constrained environments and edge systems, while upholding federated learning principles of data privacy and collaborative learning.

[LG-20] Membership Inference Attacks on Sequence Models

Link: https://arxiv.org/abs/2506.05126
Authors: Lorenzo Rossi,Michael Aerni,Jie Zhang,Florian Tramèr
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to the 8th Deep Learning Security and Privacy Workshop (DLSP) (best paper award)

Abstract:Sequence models, such as Large Language Models (LLMs) and autoregressive image generators, have a tendency to memorize and inadvertently leak sensitive information. While this tendency has critical legal implications, existing tools are insufficient to audit the resulting risks. We hypothesize that those tools’ shortcomings are due to mismatched assumptions. Thus, we argue that effectively measuring privacy leakage in sequence models requires leveraging the correlations inherent in sequential generation. To illustrate this, we adapt a state-of-the-art membership inference attack to explicitly model within-sequence correlations, thereby demonstrating how a strong existing attack can be naturally extended to suit the structure of sequence models. Through a case study, we show that our adaptations consistently improve the effectiveness of memorization audits without introducing additional computational costs. Our work hence serves as an important stepping stone toward reliable memorization audits for large sequence models.

[LG-21] Privacy Amplification Through Synthetic Data: Insights from Linear Regression ICML2025

Link: https://arxiv.org/abs/2506.05101
Authors: Clément Pierquin,Aurélien Bellet,Marc Tommasi,Matthieu Boussard
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: 26 pages, ICML 2025

Abstract:Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. While empirical studies suggest this phenomenon, a rigorous theoretical understanding is still lacking. In this paper, we investigate this question through the well-understood framework of linear regression. First, we establish negative results showing that if an adversary controls the seed of the generative model, a single synthetic data point can leak as much information as releasing the model itself. Conversely, we show that when synthetic data is generated from random inputs, releasing a limited number of synthetic data points amplifies privacy beyond the model’s inherent guarantees. We believe our findings in linear regression can serve as a foundation for deriving more general bounds in the future.

[LG-22] Semi-Implicit Variational Inference via Kernelized Path Gradient Descent

Link: https://arxiv.org/abs/2506.05088
Authors: Tobias Pielok,Bernd Bischl,David Rügamer
Categories: Machine Learning (cs.LG); Computation (stat.CO)
Comments: Preliminary version

Abstract:Semi-implicit variational inference (SIVI) is a powerful framework for approximating complex posterior distributions, but training with the Kullback-Leibler (KL) divergence can be challenging due to high variance and bias in high-dimensional settings. While current state-of-the-art semi-implicit variational inference methods, particularly Kernel Semi-Implicit Variational Inference (KSIVI), have been shown to work in high dimensions, training remains moderately expensive. In this work, we propose a kernelized KL divergence estimator that stabilizes training through nonparametric smoothing. To further reduce the bias, we introduce an importance sampling correction. We provide a theoretical connection to the amortized version of the Stein variational gradient descent, which estimates the score gradient via Stein’s identity, showing that both methods minimize the same objective, but our semi-implicit approach achieves lower gradient variance. In addition, our method’s bias in function space is benign, leading to more stable and efficient optimization. Empirical results demonstrate that our method outperforms or matches state-of-the-art SIVI methods in both performance and training efficiency.

[LG-23] EMBER2024 – A Benchmark Dataset for Holistic Evaluation of Malware Classifiers

Link: https://arxiv.org/abs/2506.05074
Authors: Robert J. Joyce,Gideon Miller,Phil Roth,Richard Zak,Elliott Zaresky-Williams,Hyrum Anderson,Edward Raff,James Holt
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:A lack of accessible data has historically restricted malware analysis research, and practitioners have relied heavily on datasets provided by industry sources to advance. Existing public datasets are limited by narrow scope - most include files targeting a single platform, have labels supporting just one type of malware classification task, and make no effort to capture the evasive files that make malware detection difficult in practice. We present EMBER2024, a new dataset that enables holistic evaluation of malware classifiers. Created in collaboration with the authors of EMBER2017 and EMBER2018, the EMBER2024 dataset includes hashes, metadata, feature vectors, and labels for more than 3.2 million files from six file formats. Our dataset supports the training and evaluation of machine learning models on seven malware classification tasks, including malware detection, malware family classification, and malware behavior identification. EMBER2024 is the first to include a collection of malicious files that initially went undetected by a set of antivirus products, creating a “challenge” set to assess classifier performance against evasive malware. This work also introduces EMBER feature version 3, with added support for several new feature types. We are releasing the EMBER2024 dataset to promote reproducibility and empower researchers in the pursuit of new malware research topics.

[LG-24] UnHiPPO: Uncertainty-aware Initialization for State Space Models ICML2025

Link: https://arxiv.org/abs/2506.05065
Authors: Marten Lienen,Abdullah Saydemir,Stephan Günnemann
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published at ICML 2025

Abstract:State space models are emerging as a dominant model class for sequence problems with many relying on the HiPPO framework to initialize their dynamics. However, HiPPO fundamentally assumes data to be noise-free; an assumption often violated in practice. We extend the HiPPO theory with measurement noise and derive an uncertainty-aware initialization for state space model dynamics. In our analysis, we interpret HiPPO as a linear stochastic control problem where the data enters as a noise-free control signal. We then reformulate the problem so that the data become noisy outputs of a latent system and arrive at an alternative dynamics initialization that infers the posterior of this latent system from the data without increasing runtime. Our experiments show that our initialization improves the resistance of state-space models to noise both at training and inference time. Find our implementation at this https URL.

[LG-25] NIMO: a Nonlinear Interpretable MOdel

Link: https://arxiv.org/abs/2506.05059
Authors: Shijian Xu,Marcello Massimo Negri,Volker Roth
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Neural networks (NNs) have achieved tremendous success over the past decade, yet they are still extremely difficult to interpret. In contrast, linear models are less expressive but offer inherent interpretability. Linear coefficients are interpretable as the marginal effect of a feature on the prediction, assuming all other features are kept fixed. To combine the benefits of both approaches, we introduce NIMO (Nonlinear Interpretable MOdel). The key idea is to define a model where the NN is designed to learn nonlinear corrections to the linear model predictions, while also maintaining the original interpretability of the linear coefficients. Relevantly, we develop an optimization algorithm based on profile likelihood that elegantly allows for optimizing over the NN parameters while updating the linear coefficients analytically. By relying on adaptive ridge regression we can easily incorporate sparsity constraints as well. We show empirically that we can recover the underlying linear coefficients while significantly improving the predictive accuracy. Compared to other hybrid interpretable approaches, our model is the only one that actually maintains the same interpretability of linear coefficients as in linear models. We also achieve higher performance on various regression and classification settings.

[LG-26] Reliably detecting model failures in deployment without labels NEURIPS2025

Link: https://arxiv.org/abs/2506.05047
Authors: Viet Nguyen,Changjian Shui,Vijay Giri,Siddarth Arya,Amol Verma,Fahad Razak,Rahul G. Krishnan
Categories: Machine Learning (cs.LG)
Comments: 36 pages, 6 figures, 7 tables, submitted to NeurIPS 2025, includes theoretical analysis and extensive empirical evaluation across benchmark and clinical datasets. Code available at this https URL. Viet Nguyen and Changjian Shui contributed equally

Abstract:The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, which achieves low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both a standard benchmark and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.

[LG-27] iN2V: Bringing Transductive Node Embeddings to Inductive Graphs

Link: https://arxiv.org/abs/2506.05039
Authors: Nicolas Lell,Ansgar Scherp
Categories: Machine Learning (cs.LG)

Abstract:Shallow node embeddings like node2vec (N2V) can be used for nodes without features or to supplement existing features with structure-based information. Embedding methods like N2V are limited in their application on new nodes, which restricts them to the transductive setting where the entire graph, including the test nodes, is available during training. We propose inductive node2vec (iN2V), which combines a post-hoc procedure to compute embeddings for nodes unseen during training and modifications to the original N2V training procedure to prepare the embeddings for this post-hoc procedure. We conduct experiments on several benchmark datasets and demonstrate that iN2V is an effective approach to bringing transductive embeddings to an inductive setting. Using iN2V embeddings improves node classification by 1 point on average, with up to 6 points of improvement depending on the dataset and the number of unseen nodes. Our iN2V is a plug-in approach to create new or enrich existing embeddings. It can also be combined with other embedding methods, making it a versatile approach for inductive node representation learning. Code to reproduce the results is available at this https URL .

[LG-28] Tuning the Right Foundation Models is What you Need for Partial Label Learning

Link: https://arxiv.org/abs/2506.05027
Authors: Kuang He,Wei Tang,Tong Wei,Min-Ling Zhang
Categories: Machine Learning (cs.LG)
Comments: The code can be found at this https URL

Abstract:Partial label learning (PLL) seeks to train generalizable classifiers from datasets with inexact supervision, a common challenge in real-world applications. Existing studies have developed numerous approaches to progressively refine and recover ground-truth labels by training convolutional neural networks. However, limited attention has been given to foundation models that offer transferrable representations. In this work, we empirically conduct comprehensive evaluations of 11 foundation models across 13 PLL approaches on 8 benchmark datasets under 3 PLL scenarios. We further propose PartialCLIP, an efficient fine-tuning framework for foundation models in PLL. Our findings reveal that current PLL approaches tend to 1) achieve significant performance gains when using foundation models, 2) exhibit remarkably similar performance to each other, 3) maintain stable performance across varying ambiguity levels, while 4) are susceptible to foundation model selection and adaptation strategies. Additionally, we demonstrate the efficacy of text-embedding classifier initialization and effective candidate label filtering using zero-shot CLIP. Our experimental results and analysis underscore the limitations of current PLL approaches and provide valuable insights for developing more generalizable PLL models. The source code can be found at this https URL.

[LG-29] Multi-Point Proximity Encoding For Vector-Mode Geospatial Machine Learning

Link: https://arxiv.org/abs/2506.05016
Authors: John Collins
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 8 figures

Abstract:Vector-mode geospatial data – points, lines, and polygons – must be encoded into an appropriate form in order to be used with traditional machine learning and artificial intelligence models. Encoding methods attempt to represent a given shape as a vector that captures its essential geometric properties. This paper presents an encoding method based on scaled distances from a shape to a set of reference points within a region of interest. The method, MultiPoint Proximity (MPP) encoding, can be applied to any type of shape, enabling the parameterization of machine learning models with encoded representations of vector-mode geospatial features. We show that MPP encoding possesses the desirable properties of shape-centricity and continuity, can be used to differentiate spatial objects based on their geometric features, and can capture pairwise spatial relationships with high precision. In all cases, MPP encoding is shown to perform better than an alternative method based on rasterization.
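
To make the idea of encoding a shape by its scaled distances to reference points more concrete, here is a minimal sketch. The abstract does not give the scaling function, so the exponential decay `exp(-d / scale)`, the regular reference grid, and every function name below are assumptions for illustration only. Polygons are omitted for brevity; only points and polylines are handled.

```python
from math import exp, hypot


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_shape(p, vertices):
    """Shape given as a vertex list: one vertex is a point, several form a polyline."""
    if len(vertices) == 1:
        return hypot(p[0] - vertices[0][0], p[1] - vertices[0][1])
    return min(point_segment_distance(p, a, b)
               for a, b in zip(vertices, vertices[1:]))


def reference_grid(x_min, y_min, x_max, y_max, n):
    """n x n regular grid of reference points over the region of interest (n >= 2)."""
    xs = [x_min + (x_max - x_min) * i / (n - 1) for i in range(n)]
    ys = [y_min + (y_max - y_min) * i / (n - 1) for i in range(n)]
    return [(x, y) for y in ys for x in xs]


def mpp_encode(vertices, refs, scale):
    """One value per reference point; assumed scaling is exp(-distance / scale)."""
    return [exp(-distance_to_shape(r, vertices) / scale) for r in refs]


if __name__ == "__main__":
    refs = reference_grid(0.0, 0.0, 1.0, 1.0, 3)
    print(mpp_encode([(0.5, 0.5)], refs, scale=0.5))              # a point
    print(mpp_encode([(0.0, 0.0), (1.0, 0.0)], refs, scale=0.5))  # a line
```

Because each entry depends only on a distance, the same fixed-length vector is produced for any shape type, and small changes in the shape lead to small changes in the encoding.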

[LG-30] QiMeng: Fully Automated Hardware and Software Design for Processor Chip

Link: https://arxiv.org/abs/2506.05007
Authors: Rui Zhang,Yuanbo Wen,Shuyao Cheng,Di Huang,Shaohui Peng,Jiaming Guo,Pengwei Jin,Jiacheng Zhao,Tianrui Ma,Yaoyu Zhu,Yifan Hao,Yongwei Zhao,Shengwen Liang,Ying Wang,Xing Hu,Zidong Du,Huimin Cui,Ling Li,Qi Guo,Yunji Chen
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Processor chip design technology serves as a key frontier driving breakthroughs in computer science and related fields. With the rapid advancement of information technology, conventional design paradigms face three major challenges: the physical constraints of fabrication technologies, the escalating demands for design resources, and the increasing diversity of ecosystems. Automated processor chip design has emerged as a transformative solution to address these challenges. While recent breakthroughs in Artificial Intelligence (AI), particularly Large Language Models (LLMs) techniques, have opened new possibilities for fully automated processor chip design, substantial challenges remain in establishing domain-specific LLMs for processor chip design. In this paper, we propose QiMeng, a novel system for fully automated hardware and software design of processor chips. QiMeng comprises three hierarchical layers. In the bottom-layer, we construct a domain-specific Large Processor Chip Model (LPCM) that introduces novel designs in architecture, training, and inference, to address key challenges such as knowledge representation gap, data scarcity, correctness assurance, and enormous solution space. In the middle-layer, leveraging the LPCM’s knowledge representation and inference capabilities, we develop the Hardware Design Agent and the Software Design Agent to automate the design of hardware and software for processor chips. Currently, several components of QiMeng have been completed and successfully applied in various top-layer applications, demonstrating significant advantages and providing a feasible solution for efficient, fully automated hardware/software design of processor chips. Future research will focus on integrating all components and performing iterative top-down and bottom-up design processes to establish a comprehensive QiMeng system. 

[LG-31] Cautious Optimism: A Meta-Algorithm for Near-Constant Regret in General Games

Link: https://arxiv.org/abs/2506.05005
Authors: Ashkan Soleymani,Georgios Piliouras,Gabriele Farina
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
Comments: Extended abstract appeared at Twenty-Sixth ACM Conference on Economics and Computation (EC), 2025

Abstract:Recent work [Soleymani et al., 2025] introduced a variant of Optimistic Multiplicative Weights Updates (OMWU) that adaptively controls the learning pace in a dynamic, non-monotone manner, achieving new state-of-the-art regret minimization guarantees in general games. In this work, we demonstrate that no-regret learning acceleration through adaptive pacing of the learners is not an isolated phenomenon. We introduce Cautious Optimism, a framework for substantially faster regularized learning in general games. Cautious Optimism takes as input any instance of Follow-the-Regularized-Leader (FTRL) and outputs an accelerated no-regret learning algorithm by pacing the underlying FTRL with minimal computational overhead. Importantly, we retain uncoupledness (learners do not need to know other players’ utilities). Cautious Optimistic FTRL achieves near-optimal O_T(\log T) regret in diverse self-play (mixing-and-matching regularizers) while preserving the optimal O(\sqrt{T}) regret in adversarial scenarios. In contrast to prior works (e.g. Syrgkanis et al. [2015], Daskalakis et al. [2021]), our analysis does not rely on monotonic step-sizes, showcasing a novel route for fast learning in general games.
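
For readers unfamiliar with the baseline the abstract builds on, the following sketch runs standard Optimistic Multiplicative Weights Updates in self-play on a two-player zero-sum game. It does not implement the adaptive pacing of Cautious Optimism, which the abstract does not specify. The matching-pennies payoff matrix, step size, starting strategies, and function names are hypothetical choices for this example.

```python
from math import exp


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def omwu_step(strategy, utility, prev_utility, eta):
    """Standard OMWU: x_{t+1}(a) is proportional to x_t(a) * exp(eta * (2 u_t(a) - u_{t-1}(a)))."""
    return normalize([p * exp(eta * (2.0 * u - up))
                      for p, u, up in zip(strategy, utility, prev_utility)])


def self_play(payoff, x, y, eta, steps):
    """Row player maximizes x^T A y, column player maximizes -x^T A y.

    Each player only sees its own utility vector (uncoupled dynamics).
    """
    n, m = len(payoff), len(payoff[0])
    prev_ux, prev_uy = [0.0] * n, [0.0] * m
    for _ in range(steps):
        ux = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        uy = [-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        # Both players update simultaneously from the same round's utilities.
        x, y = omwu_step(x, ux, prev_ux, eta), omwu_step(y, uy, prev_uy, eta)
        prev_ux, prev_uy = ux, uy
    return x, y


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    x, y = self_play(matching_pennies, [0.8, 0.2], [0.3, 0.7], eta=0.1, steps=5000)
    print(x, y)
```

The optimistic term `2 u_t - u_{t-1}` uses the last utility vector as a prediction of the next one. In this sketch the step size `eta` is fixed, whereas the abstract describes pacing the learner adaptively.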

[LG-32] FPTQuant: Function-Preserving Transforms for LLM Quantization

Link: https://arxiv.org/abs/2506.04985
Authors: Boris van Breugel,Yelysei Bondarenko,Paul Whatmough,Markus Nagel
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model’s function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9 times over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off: it performs on par with or exceeds most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.
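
The notion of a mergeable, function-preserving transform can be illustrated with a toy case: an invertible per-channel scaling placed between two linear maps and folded into their weights. This is only a sketch of the general principle, not the transforms proposed in the paper, which are learned and described only at a high level in the abstract. The matrices `w_v` and `w_o`, the scale values, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def scale_columns(m, scales):
    """Multiply column j of m by scales[j] (right-multiplication by a diagonal matrix)."""
    return [[v * scales[j] for j, v in enumerate(row)] for row in m]


def scale_rows(m, scales):
    """Multiply row i of m by scales[i] (left-multiplication by a diagonal matrix)."""
    return [[v * scales[i] for v in row] for i, row in enumerate(m)]


def max_abs(m):
    return max(abs(v) for row in m for v in row)


# Hypothetical toy weights: the second hidden channel produces large outliers.
x = [[1.0, -2.0], [0.5, 3.0]]
w_v = [[0.1, 8.0], [0.2, -6.0]]
w_o = [[1.0, 0.5], [0.05, -0.1]]

# Fold a diagonal transform T into w_v and its inverse into w_o.
scales = [1.0, 0.125]
w_v_t = scale_columns(w_v, scales)
w_o_t = scale_rows(w_o, [1.0 / s for s in scales])

hidden, hidden_t = matmul(x, w_v), matmul(x, w_v_t)
out, out_t = matmul(hidden, w_o), matmul(hidden_t, w_o_t)

if __name__ == "__main__":
    print("hidden range before/after:", max_abs(hidden), max_abs(hidden_t))
    print("outputs:", out, out_t)
```

Since `(x W_v T)(T^{-1} W_o) = x W_v W_o`, the end-to-end function is unchanged, while the intermediate activation that would be quantized has a smaller dynamic range. No extra operation is needed at inference because the transform lives inside the weights.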

[LG-33] Agentic AI for Intent-Based Industrial Automation

Link: https://arxiv.org/abs/2506.04980
Authors: Marcos Lima Romero,Ricardo Suyama
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint - Submitted to 16th IEEE/IAS International Conference on Industry Applications - INDUSCON 2025

Abstract:The recent development of Agentic AI systems, empowered by autonomous large language model (LLM) agents with planning and tool-usage capabilities, enables new possibilities for the evolution of industrial automation and reduces the complexity introduced by Industry 4.0. This work proposes a conceptual framework that integrates Agentic AI with the intent-based paradigm, originally developed in network research, to simplify human-machine interaction (HMI) and better align automation systems with the human-centric, sustainable, and resilient principles of Industry 5.0. Based on intent-based processing, the framework allows human operators to express high-level business or operational goals in natural language, which are decomposed into actionable components. These intents are broken into expectations, conditions, targets, context, and information that guide sub-agents equipped with specialized tools to execute domain-specific tasks. A proof of concept was implemented using the CMAPSS dataset and Google Agent Developer Kit (ADK), demonstrating the feasibility of intent decomposition, agent orchestration, and autonomous decision-making in predictive maintenance scenarios. The results confirm the potential of this approach to reduce technical barriers and enable scalable, intent-driven automation, despite data quality and explainability concerns.

[LG-34] Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion

Link: https://arxiv.org/abs/2506.04924
Authors: Han Wang,Ruoyun He,Guoguang Lao,Ting Liu,Hejiao Luo,Changqi Qin,Hongying Luo,Junmin Huang,Zihan Wei,Lu Chen,Yongzhi Xu,Ziqian Bi,Junhao Song,Tianyang Wang,Chia Xin Liang,Xinyuan Song,Huafeng Liu,Junfeng Hao,Chunjie Tian
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 6 figures

Abstract:Early identification of high-risk ICU patients is crucial for directing limited medical resources. We introduce ALFIA (Adaptive Layer Fusion with Intelligent Attention), a modular, attention-based architecture that jointly trains LoRA (Low-Rank Adaptation) adapters and an adaptive layer-weighting mechanism to fuse multi-layer semantic features from a BERT backbone. Trained on our rigorous cw-24 (CriticalWindow-24) benchmark, ALFIA surpasses state-of-the-art tabular classifiers in AUPRC while preserving a balanced precision-recall profile. The embeddings produced by ALFIA’s fusion module, capturing both fine-grained clinical cues and high-level concepts, enable seamless pairing with GBDTs (CatBoost/LightGBM) as ALFIA-boost, and deep neural networks as ALFIA-nn, yielding additional performance gains. Our experiments confirm ALFIA’s superior early-warning performance. By operating directly on routine clinical text, it furnishes clinicians with a convenient yet robust tool for risk stratification and timely intervention in critical-care settings.

[LG-35] Gaussian Process Diffeomorphic Statistical Shape Modelling Outperforms Angle-Based Methods for Assessment of Hip Dysplasia

Link: https://arxiv.org/abs/2506.04886
Authors: Allen Paul,George Grammatopoulos,Adwaye Rambojun,Neill D. F. Campbell,Harinderjit S. Gill,Tony Shardlow
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Dysplasia is a recognised risk factor for osteoarthritis (OA) of the hip; early diagnosis of dysplasia is important to provide opportunities for surgical interventions aimed at reducing the risk of hip OA. We have developed a pipeline for semi-automated classification of dysplasia using volumetric CT scans of patients’ hips and a minimal set of clinically annotated landmarks, combining the framework of the Gaussian Process Latent Variable Model with diffeomorphism to create a statistical shape model, which we termed the Gaussian Process Diffeomorphic Statistical Shape Model (GPDSSM). We used 192 CT scans, 100 for model training and 92 for testing. The GPDSSM effectively distinguishes dysplastic samples from controls while also highlighting regions of the underlying surface that show dysplastic variations. As well as improving classification accuracy compared to angle-based methods (AUC 96.2% vs 91.2%), the GPDSSM can save time for clinicians by removing the need to manually measure angles and interpret 2D scans for possible markers of dysplasia.

[LG-36] There Was Never a Bottleneck in Concept Bottleneck Models

Link: https://arxiv.org/abs/2506.04877
Authors: Antonio Almudévar,José Miguel Hernández-Lobato,Alfonso Ortega
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs support concept-level interventions with theoretical guarantees, remain consistent with Bayesian principles, and offer greater flexibility in key design choices.

[LG-37] Aligning Multimodal Representations through an Information Bottleneck

Link: https://arxiv.org/abs/2506.04870
Authors: Antonio Almudévar,José Miguel Hernández-Lobato,Sameer Khurana,Ricard Marxer,Alfonso Ortega
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective to learn an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function that is derived by means of a variational approximation and aims to increase the representational alignment. We analyze in a set of controlled experiments and real-world applications the advantages of including this regularization term.

[LG-38] Improving AI-generated music with user-guided training

Link: https://arxiv.org/abs/2506.04852
Authors: Vishwa Mohan Singh,Sai Anirudh Aryasomayajula,Ahan Chatterjee,Beste Aydemir,Rifat Mehreen Amin
Categories: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Selected for presentation at HHAI 2025

Click to view abstract

Abstract:AI music generation has advanced rapidly, with models like diffusion and autoregressive algorithms enabling high-fidelity outputs. These tools can alter styles, mix instruments, or isolate them. Since sound can be visualized as spectrograms, image-generation algorithms can be applied to generate novel music. However, these algorithms are typically trained on fixed datasets, which makes it challenging for them to interpret and respond to user input accurately. This is especially problematic because music is highly subjective and requires a level of personalization that image generation does not provide. In this work, we propose a human-computation approach to gradually improve the performance of these algorithms based on user interactions. The human-computation element involves aggregating and selecting user ratings to use as the loss function for fine-tuning the model. We employ a genetic algorithm that incorporates user feedback to enhance the baseline performance of a model initially trained on a fixed dataset. The effectiveness of this approach is measured by the average increase in user ratings with each iteration. In the pilot test, the first iteration showed an average rating increase of 0.2 compared to the baseline. The second iteration further improved upon this, achieving an additional increase of 0.39 over the first iteration.

[LG-39] LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning

Link: https://arxiv.org/abs/2506.04821
Authors: Zhen Hao Wong,Jingwen Deng,Runming He,Zirong Chen,Qijie You,Hejun Dong,Hao Liang,Chengyu Shen,Bin Cui,Wentao Zhang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) excel at many supervised tasks but often struggle with structured reasoning in unfamiliar settings. This discrepancy suggests that standard fine-tuning pipelines may instill narrow, domain-specific heuristics rather than fostering general-purpose thinking strategies. In this work, we propose a “play to learn” framework that fine-tunes LLMs through reinforcement learning on a suite of seven custom logic puzzles, each designed to cultivate distinct reasoning skills such as constraint propagation, spatial consistency, and symbolic deduction. Using a reinforcement learning setup with verifiable rewards, models receive binary feedback based on puzzle correctness, encouraging iterative, hypothesis-driven problem solving. We demonstrate that this training approach significantly improves out-of-distribution performance on a range of mathematical benchmarks, especially for mid-difficulty problems that require multi-step reasoning. Analyses across problem categories and difficulty levels reveal that puzzle training promotes transferable reasoning routines, strengthening algebraic manipulation, geometric inference, and combinatorial logic, while offering limited gains on rote or highly specialized tasks. These findings show that reinforcement learning over logic puzzles reshapes the internal reasoning of LLMs, enabling more robust and compositional generalization without relying on task-specific symbolic tools.

[LG-40] Adaptive Preconditioners Trigger Loss Spikes in Adam

Link: https://arxiv.org/abs/2506.04805
Authors: Zhiwei Bai,Zhangchen Zhou,Jiajie Zhao,Xiaolong Li,Zhiyu Li,Feiyu Xiong,Hongkang Yang,Yaoyu Zhang,Zhi-Qin John Xu
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam’s adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a \beta_2 -exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold 2/\eta for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds 2/\eta . We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.
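The decay regime described in this abstract can be illustrated with a small numerical sketch. It is not the authors' code: it assumes a single scalar parameter, ignores Adam's bias correction and momentum, and uses hypothetical names (`second_moment_trace`, `preconditioned_curvature`) and a hypothetical fixed curvature `h`. It only shows that when squared gradients are much smaller than the second-moment estimate, the estimate shrinks roughly like \beta_2^t , so the curvature seen after preconditioning grows and can eventually exceed 2/\eta .

```python
import math


def second_moment_trace(grads, beta2=0.999, v0=0.0):
    """Adam-style second-moment estimate v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2.

    Bias correction is omitted on purpose (simplifying assumption).
    """
    v = v0
    trace = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        trace.append(v)
    return trace


def preconditioned_curvature(h, v, eps=1e-8):
    """Curvature h of a 1-D quadratic after dividing by sqrt(v) + eps."""
    return h / (math.sqrt(v) + eps)


def first_unstable_step(trace, h, lr):
    """Index of the first step where the preconditioned curvature exceeds 2 / lr."""
    threshold = 2.0 / lr
    for step, v in enumerate(trace):
        if preconditioned_curvature(h, v) > threshold:
            return step
    return None
```

For instance, starting from `v0=1.0` and feeding gradients of size `1e-4`, `first_unstable_step(trace, h=1.0, lr=1.0)` returns a finite step index even though the raw curvature never changes. In this toy setting the instability comes purely from the decaying preconditioner.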

[LG-41] Kernel k-Medoids as General Vector Quantization

Link: https://arxiv.org/abs/2506.04786
Authors: Thore Gerlach,Sascha Mücke,Christian Bauckhage
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Click to view abstract

Abstract:Vector Quantization (VQ) is a widely used technique in machine learning and data compression, valued for its simplicity and interpretability. Among hard VQ methods, k -medoids clustering and Kernel Density Estimation (KDE) approaches represent two prominent yet seemingly unrelated paradigms – one distance-based, the other rooted in probability density matching. In this paper, we investigate their connection through the lens of Quadratic Unconstrained Binary Optimization (QUBO). We compare a heuristic QUBO formulation for k -medoids, which balances centrality and diversity, with a principled QUBO derived from minimizing Maximum Mean Discrepancy in KDE-based VQ. Surprisingly, we show that the KDE-QUBO is a special case of the k -medoids-QUBO under mild assumptions on the kernel’s feature map. This reveals a deeper structural relationship between these two approaches and provides new insight into the geometric interpretation of the weighting parameters used in QUBO formulations for VQ.

[LG-42] Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards

Link: https://arxiv.org/abs/2506.04775
Authors: Artin Tajdini,Jonathan Scarlett,Kevin Jamieson
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study stochastic linear bandits with heavy-tailed rewards, where the rewards have a finite (1+\epsilon) -absolute central moment bounded by \upsilon for some \epsilon \in (0,1] . We improve both upper and lower bounds on the minimax regret compared to prior work. When \upsilon = \mathcal{O}(1) , the best prior known regret upper bound is \tilde{\mathcal{O}}(d T^{\frac{1}{1+\epsilon}}) . While a lower bound with the same scaling has been given, it relies on a construction using \upsilon = \mathcal{O}(d) , and adapting the construction to the bounded-moment regime with \upsilon = \mathcal{O}(1) yields only a \Omega(d^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}) lower bound. This matches the known rate for multi-armed bandits and is generally loose for linear bandits, in particular being \sqrt{d} below the optimal rate in the finite-variance case ( \epsilon = 1 ). We propose a new elimination-based algorithm guided by experimental design, which achieves regret \tilde{\mathcal{O}}(d^{\frac{1+3\epsilon}{2(1+\epsilon)}} T^{\frac{1}{1+\epsilon}}) , thus improving the dependence on d for all \epsilon \in (0,1) and recovering a known optimal result for \epsilon = 1 . We also establish a lower bound of \Omega(d^{\frac{2\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}) , which strictly improves upon the multi-armed bandit rate and highlights the hardness of heavy-tailed linear bandit problems. For finite action sets, we derive similarly improved upper and lower bounds for regret. Finally, we provide action set dependent regret upper bounds showing that for some geometries, such as \ell_p -norm balls for p \le 1 + \epsilon , we can further reduce the dependence on d , and we can handle infinite-dimensional settings via the kernel trick, in particular establishing new regret bounds for the Matérn kernel that are the first to be sublinear for all \epsilon \in (0, 1] .
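As a quick check on the rates quoted in this abstract, the exponents of d can be compared directly. The symbols a_prior, a_upper, a_lower and a_MAB below are shorthand introduced only for this note and are not the authors' notation; all four bounds share the same T^{1/(1+\epsilon)} factor, so only the power of d differs.

```latex
% Exponent of d in each bound, for \epsilon \in (0,1]
\begin{align*}
a_{\mathrm{prior}} &= 1, &
a_{\mathrm{upper}} &= \frac{1+3\epsilon}{2(1+\epsilon)}, &
a_{\mathrm{lower}} &= \frac{2\epsilon}{1+\epsilon}, &
a_{\mathrm{MAB}} &= \frac{\epsilon}{1+\epsilon}.
\end{align*}

% Improvement over the prior upper bound, and remaining upper/lower gap
\begin{align*}
a_{\mathrm{prior}} - a_{\mathrm{upper}}
  &= \frac{2(1+\epsilon) - (1+3\epsilon)}{2(1+\epsilon)}
   = \frac{1-\epsilon}{2(1+\epsilon)} \;\ge\; 0, \\
a_{\mathrm{upper}} - a_{\mathrm{lower}}
  &= \frac{(1+3\epsilon) - 4\epsilon}{2(1+\epsilon)}
   = \frac{1-\epsilon}{2(1+\epsilon)} \;\ge\; 0.
\end{align*}

% Finite-variance case \epsilon = 1
\begin{align*}
a_{\mathrm{upper}} = a_{\mathrm{lower}} = 1, \qquad
a_{\mathrm{MAB}} = \tfrac{1}{2}.
\end{align*}
```

Both differences vanish at \epsilon = 1 , which is consistent with the abstract's statement that the new upper bound recovers the known optimal result there, and with the multi-armed bandit rate being \sqrt{d} below it.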

[LG-43] OpenGT: A Comprehensive Benchmark For Graph Transformers

Link: https://arxiv.org/abs/2506.04765
Authors: Jiachen Tang,Zhonghao Wang,Sirui Chen,Sheng Zhou,Jiawei Chen,Jiajun Bu
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Click to view abstract

Abstract:Graph Transformers (GTs) have recently demonstrated remarkable performance across diverse domains. By leveraging attention mechanisms, GTs are capable of modeling long-range dependencies and complex structural relationships beyond local neighborhoods. However, their applicable scenarios are still underexplored, which highlights the need to identify when and why they excel. Furthermore, unlike GNNs, which predominantly rely on message-passing mechanisms, GTs exhibit a diverse design space in areas such as positional encoding, attention mechanisms, and graph-specific adaptations. Yet, it remains unclear which of these design choices are truly effective and under what conditions. As a result, the community currently lacks a comprehensive benchmark and library to promote a deeper understanding and further development of GTs. To address this gap, this paper introduces OpenGT, a comprehensive benchmark for Graph Transformers. OpenGT enables fair comparisons and multidimensional analysis by establishing standardized experimental settings and incorporating a broad selection of state-of-the-art GNNs and GTs. Our benchmark evaluates GTs from multiple perspectives, encompassing diverse tasks and datasets with varying properties. Through extensive experiments, our benchmark has uncovered several critical insights, including the difficulty of transferring models across task levels, the limitations of local attention, the efficiency trade-offs in several models, the application scenarios of specific positional encodings, and the preprocessing overhead of some positional encodings. We aspire for this work to establish a foundation for future graph transformer research emphasizing fairness, reproducibility, and generalizability. We have developed an easy-to-use library OpenGT for training and evaluating existing GTs. The benchmark code is available at this https URL.

[LG-44] Log-Linear Attention

Link: https://arxiv.org/abs/2506.04761
Authors: Han Guo,Songlin Yang,Tarushii Goel,Eric P. Xing,Tri Dao,Yoon Kim
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. However, at their core these models are still RNNs, and thus their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances linear attention’s efficiency and the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures – Mamba-2 and Gated DeltaNet – and find they perform well compared to their linear-time variants.
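To make the idea of a "logarithmically growing set of hidden states" concrete, here is one possible bookkeeping scheme, offered as an assumption rather than as the paper's construction. It splits the prefix of length t into power-of-two buckets following the binary expansion of t (in the spirit of a Fenwick tree), so that one summary state per bucket would give at most floor(log2 t) + 1 states. The function name `prefix_buckets` is hypothetical.

```python
def prefix_buckets(t):
    """Split positions [0, t) into contiguous power-of-two buckets.

    Buckets follow the set bits of t from most to least significant, so the
    oldest tokens land in the largest bucket and the most recent tokens in
    the smallest one. Returns a list of half-open (start, end) pairs.
    """
    buckets = []
    start = 0
    size = 1 << max(t.bit_length() - 1, 0)  # largest power of two <= t (1 if t == 0)
    while size >= 1:
        if t & size:
            buckets.append((start, start + size))
            start += size
        size >>= 1
    return buckets
```

With one hidden state per bucket, a prefix of 13 tokens would be summarized by three states covering 8, 4 and 1 tokens. The number of states equals the number of set bits in t, which never exceeds the bit length of t.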

[LG-45] Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models

Link: https://arxiv.org/abs/2506.04746
Authors: Fei Ding,Baiqiao Wang,Zijian Zeng,Youwei Wang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The Group Relative Policy Optimization (GRPO) algorithm has demonstrated considerable success in enhancing the reasoning capabilities of large language models (LLMs), as evidenced by DeepSeek-R1. However, the absence of intermediate supervision in GRPO frequently leads to inefficient exploration dynamics. A single error in a complex reasoning chain can invalidate the entire solution, resulting in abrupt reward vanishing and compromising training. To address these challenges, we propose MGRPO (Multi-layer GRPO). MGRPO operates in two layers: the first layer employs standard GRPO to generate an initial response. This response, along with the original query, is then fed into a second-layer GRPO process. This second layer is specifically trained to identify and correct errors in the initial response, effectively creating a self-correction loop. This mechanism provides implicit process-level supervision by rewarding successful error correction, without requiring an explicit, densely-annotated reward model. Experimental results on several mathematical reasoning benchmarks demonstrate that MGRPO significantly outperforms standard GRPO, achieving superior performance by fostering both reasoning and self-correction abilities.

[LG-46] Enhanced Drought Analysis in Bangladesh: A Machine Learning Approach for Severity Classification Using Satellite Data

Link: https://arxiv.org/abs/2506.04696
Authors: Tonmoy Paul,Mrittika Devi Mati,Md. Mahmudul Islam
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Drought poses a pervasive environmental challenge in Bangladesh, impacting agriculture, socio-economic stability, and food security due to its unique geographic and anthropogenic vulnerabilities. Traditional drought indices, such as the Standardized Precipitation Index (SPI) and Palmer Drought Severity Index (PDSI), often overlook crucial factors like soil moisture and temperature, limiting their resolution. Moreover, current machine learning models applied to drought prediction have been underexplored in the context of Bangladesh, lacking a comprehensive integration of satellite data across multiple districts. To address these gaps, we propose a satellite data-driven machine learning framework to classify drought across 38 districts of Bangladesh. Using unsupervised algorithms like K-means and Bayesian Gaussian Mixture for clustering, followed by classification models such as KNN, Random Forest, Decision Tree, and Naive Bayes, the framework integrates weather data (humidity, soil moisture, temperature) from 2012-2024. This approach successfully classifies drought severity into different levels. However, it shows significant variabilities in drought vulnerabilities across regions which highlights the aptitude of machine learning models in terms of identifying and predicting drought conditions.

[LG-47] The cost of ensembling: is it always worth combining?

Link: https://arxiv.org/abs/2506.04677
Authors: Marco Zanotti
Categories: Machine Learning (cs.LG); Applications (stat.AP); Other Statistics (stat.OT)
Comments:

Click to view abstract

Abstract:Given the continuous increase in dataset sizes and the complexity of forecasting models, the trade-off between forecast accuracy and computational cost is emerging as an extremely relevant topic, especially in the context of ensemble learning for time series forecasting. To assess it, we evaluated ten base models and eight ensemble configurations across two large-scale retail datasets (M5 and VN1), considering both point and probabilistic accuracy under varying retraining frequencies. We showed that ensembles consistently improve forecasting performance, particularly in probabilistic settings. However, these gains come at a substantial computational cost, especially for larger, accuracy-driven ensembles. We found that reducing retraining frequency significantly lowers costs, with minimal impact on accuracy, particularly for point forecasts. Moreover, efficiency-driven ensembles offer a strong balance, achieving competitive accuracy with considerably lower costs compared to accuracy-optimized combinations. Most importantly, small ensembles of two or three models are often sufficient to achieve near-optimal results. These findings provide practical guidelines for deploying scalable and cost-efficient forecasting systems, supporting the broader goals of sustainable AI in forecasting. Overall, this work shows that careful ensemble design and retraining strategy selection can yield accurate, robust, and cost-effective forecasts suitable for real-world applications.

[LG-48] FedAPM: Federated Learning via ADMM with Partial Model Personalization

Link: https://arxiv.org/abs/2506.04672
Authors: Shengkun Zhu,Feiteng Nie,Jinshan Zeng,Sheng Wang,Yuan Sun,Yuan Yao,Shangfeng Chen,Quanqing Xu,Chuanhui Yang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In federated learning (FL), the assumption that datasets from different devices are independent and identically distributed (i.i.d.) often does not hold due to user differences, and the presence of various data modalities across clients makes using a single model impractical. Personalizing certain parts of the model can effectively address these issues by allowing those parts to differ across clients, while the remaining parts serve as a shared model. However, we found that partial model personalization may exacerbate client drift (each client’s local model diverges from the shared model), thereby reducing the effectiveness and efficiency of FL algorithms. We propose an FL framework based on the alternating direction method of multipliers (ADMM), referred to as FedAPM, to mitigate client drift. We construct the augmented Lagrangian function by incorporating first-order and second-order proximal terms into the objective, with the second-order term providing fixed correction and the first-order term offering compensatory correction between the local and shared models. Our analysis demonstrates that FedAPM, by using explicit estimates of the Lagrange multiplier, is more stable and efficient in terms of convergence compared to other FL frameworks. We establish the global convergence of FedAPM training from arbitrary initial points to a stationary point, achieving three types of rates: constant, linear, and sublinear, under mild assumptions. We conduct experiments using four heterogeneous and multimodal datasets with different metrics to validate the performance of FedAPM. Specifically, FedAPM achieves faster and more accurate convergence, outperforming the SOTA methods with average improvements of 12.3% in test accuracy, 16.4% in F1 score, and 18.0% in AUC while requiring fewer communication rounds.

[LG-49] Noise-Resistant Label Reconstruction Feature Selection for Partial Multi-Label Learning IJCAI25

Link: https://arxiv.org/abs/2506.04669
Authors: Wanfu Gao,Hanlin Pan,Qingqi Han,Kunpeng Liu
Categories: Machine Learning (cs.LG)
Comments: Accepted at IJCAI25

Click to view abstract

Abstract:The “curse of dimensionality” is prevalent across various data patterns; it increases the risk of model overfitting and degrades classification performance. However, few studies have focused on this issue in Partial Multi-label Learning (PML), where each sample is associated with a set of candidate labels, at least one of which is correct. Existing PML methods addressing this problem are mainly based on the low-rank assumption. However, the low-rank assumption is difficult to satisfy in practice and may lead to loss of high-dimensional information. Furthermore, we find that existing methods have poor ability to identify positive labels, which is important in real-world scenarios. In this paper, a PML feature selection method is proposed that considers two important characteristics of a dataset: the noise-resistance of label relationships and label connectivity. Our proposed method utilizes the noise-resistance of label relationships to disambiguate labels. Then the learning process is designed through the reformed low-rank assumption. Finally, representative labels are found through label connectivity, and the weight matrix is reconstructed to select features with strong identification ability for these labels. The experimental results on benchmark datasets demonstrate the superiority of the proposed method.

[LG-50] FlashDMoE: Fast Distributed MoE in a Single Kernel

Link: https://arxiv.org/abs/2506.04667
Authors: Osayamen Jonathan Aimuyo,Byungsoo Oh,Rachee Singh
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: In submission. See code at this https URL

Click to view abstract

Abstract:The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashDMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Its device-initiated communication protocol introduces payload-efficient data transfers, significantly shrinking buffer sizes in sparsely activated MoE layers. When evaluated on a single 8-H100 GPU node with MoE models having up to 128 experts and 16K token sequences, FlashDMoE achieves up to 6x lower latency, 5.7x higher throughput, 4x better weak scaling efficiency, and 9x higher GPU utilization compared to state-of-the-art baselines, despite using FP32 while baselines use FP16. FlashDMoE demonstrates that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML workloads.

[LG-51] The Oversmoothing Fallacy: A Misguided Narrative in GNN Research

Link: https://arxiv.org/abs/2506.04653
Authors: MoonJeong Park,Sunghyun Choi,Jaeseung Heo,Eunhyeok Park,Dongwoo Kim
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Oversmoothing has been recognized as a main obstacle to building deep Graph Neural Networks (GNNs), limiting the performance. This position paper argues that the influence of oversmoothing has been overstated and advocates for a further exploration of deep GNN architectures. Given the three core operations of GNNs, aggregation, linear transformation, and non-linear activation, we show that prior studies have mistakenly confused oversmoothing with the vanishing gradient, caused by transformation and activation rather than aggregation. Our finding challenges prior beliefs about oversmoothing being unique to GNNs. Furthermore, we demonstrate that classical solutions such as skip connections and normalization enable the successful stacking of deep GNN layers without performance degradation. Our results clarify misconceptions about oversmoothing and shed new light on the potential of deep GNNs.

[LG-52] Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction

Link: https://arxiv.org/abs/2506.04650
Authors: Zesheng Ye,Chengyi Cai,Ruijiang Dong,Jianzhong Qi,Lei Feng,Pin-Yu Chen,Feng Liu
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As large-scale pre-trained foundation models continue to expand in size and capability, efficiently adapting them to specific downstream tasks has become increasingly critical. Despite substantial progress, existing adaptation approaches have evolved largely in isolation, without a clear understanding of their interrelationships. This survey introduces neural network reprogrammability as a unifying framework that bridges mainstream model adaptation techniques (model reprogramming, prompt tuning, and prompt instruction), previously fragmented research areas that nonetheless converge on a shared principle: repurposing a pre-trained model by manipulating information at the interfaces while keeping the model parameters frozen. These methods exploit neural networks’ sensitivity to manipulation on different interfaces, be it through perturbing inputs, inserting tokens into intermediate layers, or providing task-specific examples in context, to redirect model behaviors towards desired outcomes. We then present a taxonomy that categorizes such information manipulation-based adaptation approaches across four key dimensions: manipulation format (fixed or learnable), location (interfaces where manipulations occur), operator (how they are applied), and output alignment requirement (post-processing needed to align outputs with downstream tasks). Notably, this framework applies consistently across data modalities, independent of specific model architectures. Moreover, viewing established techniques like in-context learning and chain-of-thought prompting through this lens reveals both their theoretical connections and practical distinctions. We further analyze remaining technical challenges and ethical considerations, positioning neural network reprogrammability as a fundamental paradigm for efficient model adaptation. We lastly identify promising research directions emerging from this integrative viewpoint.

[LG-53] Inference economics of language models

Link: https://arxiv.org/abs/2506.04645
Authors: Ege Erdil
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:We develop a theoretical model that addresses the economic trade-off between cost per token versus serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth and latency constraints; and optimizes over different parallelism setups and batch sizes to find the ones that optimize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.

[LG-54] Composing Agents to Minimize Worst-case Risk

Link: https://arxiv.org/abs/2506.04632
Authors: Guruprerana Shabadi,Rajeev Alur
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 4 figures

Click to view abstract

Abstract:From software development to robot control, modern agentic systems decompose complex objectives into a sequence of subtasks and choose a set of specialized AI agents to complete them. We formalize an agentic workflow as a directed acyclic graph, called an agent graph, where edges represent AI agents and paths correspond to feasible compositions of agents. When deploying these systems in the real world, we need to choose compositions of agents that not only maximize the task success, but also minimize risk where the risk captures requirements like safety, fairness, and privacy. This additionally requires carefully analyzing the low-probability (tail) behaviors of compositions of agents. In this work, we consider worst-case risk minimization over the set of feasible agent compositions. We define worst-case risk as the tail quantile – also known as value-at-risk – of the loss distribution of the agent composition where the loss quantifies the risk associated with agent behaviors. We introduce an efficient algorithm that traverses the agent graph and finds a near-optimal composition of agents by approximating the value-at-risk via a union bound and dynamic programming. Furthermore, we prove that the approximation is near-optimal asymptotically for a broad class of practical loss functions. To evaluate our framework, we consider a suite of video game-like control benchmarks that require composing several agents trained with reinforcement learning and demonstrate our algorithm’s effectiveness in approximating the value-at-risk and identifying the optimal agent composition.
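The combination of a union bound with dynamic programming mentioned in this abstract can be sketched on a toy agent graph. This is a simplified illustration under several assumptions of ours, not the authors' algorithm: losses add up along a path, each edge has a list of empirical loss samples, nodes are integers numbered in topological order, and the tail budget `alpha` is split equally over at most `max_len` edges. If each edge loss stays below its (1 - alpha / max_len) quantile, the union bound says the path loss stays below the sum of those quantiles with probability at least 1 - alpha, so a shortest-path pass over quantile weights minimizes an upper bound on the value-at-risk. All function names are hypothetical.

```python
import math


def upper_quantile(samples, level):
    """Smallest sample q such that the empirical P(loss <= q) is at least `level`."""
    ordered = sorted(samples)
    k = max(math.ceil(level * len(ordered)), 1)
    return ordered[k - 1]


def min_var_bound_path(edges, source, target, alpha, max_len):
    """Return (bound, path) minimizing the summed per-edge tail quantiles.

    edges: dict mapping (u, v) with u < v to a list of loss samples for the
    agent on that edge. Integer node ids are assumed to be a topological order.
    """
    level = 1.0 - alpha / max_len
    weight = {edge: upper_quantile(s, level) for edge, s in edges.items()}
    best = {source: (0.0, [source])}
    # Sorting by (u, v) visits every edge into u before any edge out of u.
    for (u, v) in sorted(edges):
        if u in best:
            candidate = best[u][0] + weight[(u, v)]
            if v not in best or candidate < best[v][0]:
                best[v] = (candidate, best[u][1] + [v])
    return best.get(target)


# Toy graph: route 0 -> 1 -> 3 is cheaper on average but has a heavy tail.
toy_edges = {
    (0, 1): [1] * 15 + [10] * 5,
    (1, 3): [1] * 20,
    (0, 2): [2] * 20,
    (2, 3): [3] * 20,
}
```

On `toy_edges`, the route through node 1 has the lower mean loss, yet `min_var_bound_path(toy_edges, 0, 3, alpha=0.2, max_len=2)` prefers the route through node 2 because its tail quantiles are smaller. The equal budget split is the crudest choice; it is used here only to keep the sketch short.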

[LG-55] Ignoring Directionality Leads to Compromised Graph Neural Network Explanations

Link: https://arxiv.org/abs/2506.04608
Authors: Changsheng Sun, Xinke Li, Jin Song Dong
Categories: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) are increasingly used in critical domains, where reliable explanations are vital for supporting human decision-making. However, the common practice of graph symmetrization discards directional information, leading to significant information loss and misleading explanations. Our analysis demonstrates how this practice compromises explanation fidelity. Through theoretical and empirical studies, we show that preserving directional semantics significantly improves explanation quality, ensuring more faithful insights for human decision-makers. These findings highlight the need for direction-aware GNN explainability in security-critical applications.

[LG-56] MVP-Shapley: Feature-based Modeling for Evaluating the Most Valuable Player in Basketball

Link: https://arxiv.org/abs/2506.04602
Authors: Haifeng Sun, Yu Xiong, Runze Wu, Kai Wang, Lan Zhang, Changjie Fan, Shaojie Tang, Xiang-Yang Li
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:The burgeoning growth of the esports and multiplayer online gaming community has highlighted the critical importance of evaluating the Most Valuable Player (MVP). The establishment of an explainable and practical MVP evaluation method is very challenging. In our study, we specifically focus on play-by-play data, which records related events during the game, such as assists and points. We aim to address the challenges by introducing a new MVP evaluation framework, denoted as MVP-Shapley, which leverages Shapley values. This approach encompasses feature processing, win-loss model training, Shapley value allocation, and MVP ranking determination based on players’ contributions. Additionally, we optimize our algorithm to align with expert voting results from the perspective of causality. Finally, we substantiate the efficacy of our method through validation on the NBA dataset and the Dunk City Dynasty dataset, and we have deployed it online in industry.
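
To make the "Shapley value allocation" step concrete, here is a minimal sketch of exact Shapley values on a toy three-player game. The players, the `toy_value` function, and its numbers are hypothetical illustrations; the paper's actual value function comes from a trained win-loss model that is not reproduced here.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_value(coalition):
    # Hypothetical stand-in for a win-loss model's output on a set of players.
    base = {"A": 10.0, "B": 6.0, "C": 2.0}
    v = sum(base[p] for p in coalition)
    if "A" in coalition and "B" in coalition:
        v += 4.0  # assumed synergy bonus when A and B play together
    return v

phi = shapley_values(["A", "B", "C"], toy_value)
ranking = sorted(phi, key=phi.get, reverse=True)  # MVP candidate comes first
```

The synergy bonus is split equally between A and B, and the values sum to the worth of the full team, which is the efficiency property that makes Shapley values attractive for credit allocation. Exact enumeration is factorial in the number of players, so practical systems typically rely on sampling or model-specific approximations.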

[LG-57] Unsupervised Machine Learning for Scientific Discovery: Workflow and Best Practices

Link: https://arxiv.org/abs/2506.04553
Authors: Andersen Chang, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen
Categories: Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 23 pages, 4 figures, 12 additional pages of citations

Abstract:Unsupervised machine learning is widely used to mine large, unlabeled datasets to make data-driven discoveries in critical domains such as climate science, biomedicine, astronomy, chemistry, and more. However, despite its widespread utilization, there is a lack of standardization in unsupervised learning workflows for making reliable and reproducible scientific discoveries. In this paper, we present a structured workflow for using unsupervised learning techniques in science. We highlight and discuss best practices starting with formulating validatable scientific questions, conducting robust data preparation and exploration, using a range of modeling techniques, performing rigorous validation by evaluating the stability and generalizability of unsupervised learning conclusions, and promoting effective communication and documentation of results to ensure reproducible scientific discoveries. To illustrate our proposed workflow, we present a case study from astronomy, seeking to refine globular clusters of Milky Way stars based upon their chemical composition. Our case study highlights the importance of validation and illustrates how the benefits of a carefully-designed workflow for unsupervised learning can advance scientific discovery.

[LG-58] Communication Efficient Adaptive Model-Driven Quantum Federated Learning

Link: https://arxiv.org/abs/2506.04548
Authors: Dev Gurung, Shiva Raj Pokhrel
Categories: Machine Learning (cs.LG)

Abstract:Training with huge datasets and a large number of participating devices leads to bottlenecks in federated learning (FL). Furthermore, the challenges of heterogeneity between multiple FL clients affect the overall performance of the system. In a quantum federated learning (QFL) context, we address these three main challenges: i) training bottlenecks from massive datasets, ii) the involvement of a substantial number of devices, and iii) non-IID data distributions. We introduce a model-driven quantum federated learning algorithm (mdQFL) to tackle these challenges. Our proposed approach is efficient and adaptable to various factors, including different numbers of devices. To the best of our knowledge, it is the first to explore training and update personalization, as well as test generalization within a QFL setting, which can be applied to other FL scenarios. We evaluated the efficiency of the proposed mdQFL framework through extensive experiments under diverse non-IID data heterogeneity conditions using various datasets within the Qiskit environment. Our results demonstrate a nearly 50% decrease in total communication costs while maintaining or, in some cases, exceeding the accuracy of the final model and consistently improving local model training compared to the standard QFL baseline. Moreover, our experimental evaluation thoroughly explores the QFL and mdQFL algorithms, along with several influencing factors. In addition, we present a theoretical analysis to clarify the complexities of the proposed algorithm. The experimental code is available at 1.

[LG-59] Neural MJD: Neural Non-Stationary Merton Jump Diffusion for Time Series Prediction

Link: https://arxiv.org/abs/2506.04542
Authors: Yuanpei Gao, Qi Yan, Yan Leng, Renjie Liao
Categories: Machine Learning (cs.LG)

Abstract:While deep learning methods have achieved strong performance in time series prediction, their black-box nature and inability to explicitly model underlying stochastic processes often limit their generalization to non-stationary data, especially in the presence of abrupt changes. In this work, we introduce Neural MJD, a neural network based non-stationary Merton jump diffusion (MJD) model. Our model explicitly formulates forecasting as a stochastic differential equation (SDE) simulation problem, combining a time-inhomogeneous Itô diffusion to capture non-stationary stochastic dynamics with a time-inhomogeneous compound Poisson process to model abrupt jumps. To enable tractable learning, we introduce a likelihood truncation mechanism that caps the number of jumps within small time intervals and provide a theoretical error bound for this approximation. Additionally, we propose an Euler-Maruyama with restart solver, which achieves a provably lower error bound in estimating expected states and reduced variance compared to the standard solver. Experiments on both synthetic and real-world datasets demonstrate that Neural MJD consistently outperforms state-of-the-art deep learning and statistical learning methods.
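
As a rough illustration of the simulation view described above, the sketch below runs an Euler-Maruyama scheme for a classical Merton jump diffusion in log-price space. It assumes constant parameters and allows at most one jump per step (a Bernoulli approximation of the Poisson process, loosely echoing the idea of capping jumps in small intervals). The paper's model is time-inhomogeneous with network-predicted parameters and uses a restart solver, neither of which is reproduced here; all parameter names and values are hypothetical.

```python
import math
import random

def simulate_mjd(s0, mu, sigma, lam, jump_mu, jump_sigma, horizon, n_steps, seed=0):
    """Simulate one Merton jump-diffusion path with Euler-Maruyama on log S."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    # Compensator k = E[J - 1] for a log-normal jump size J.
    k = math.exp(jump_mu + 0.5 * jump_sigma ** 2) - 1.0
    drift = (mu - 0.5 * sigma ** 2 - lam * k) * dt
    log_s = math.log(s0)
    path = [s0]
    for _ in range(n_steps):
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # At most one jump per step: Bernoulli(lam * dt) approximates Poisson for small dt.
        jump = rng.gauss(jump_mu, jump_sigma) if rng.random() < lam * dt else 0.0
        log_s += drift + diffusion + jump
        path.append(math.exp(log_s))
    return path

# Hypothetical parameters: one year of daily steps.
path = simulate_mjd(s0=100.0, mu=0.05, sigma=0.2, lam=3.0,
                    jump_mu=-0.02, jump_sigma=0.1, horizon=1.0, n_steps=252)
```

Working in log space keeps every simulated value positive, which a direct Euler step on the price itself does not guarantee.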

[LG-60] Chronoamperometry with Room-Temperature Ionic Liquids: Sub-Second Inference Techniques

Link: https://arxiv.org/abs/2506.04540
Authors: Kordel K. France
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Instrumentation and Detectors (physics.ins-det)
Comments: Published at IEEE BioSensors 2025

Abstract:Chronoamperometry (CA) is a fundamental electrochemical technique used for quantifying redox-active species. However, in room-temperature ionic liquids (RTILs), the high viscosity and slow mass transport often lead to extended measurement durations. This paper presents a novel mathematical regression approach that reduces CA measurement windows to under 1 second, significantly faster than previously reported methods, which typically require 1-4 seconds or longer. By applying an inference algorithm to the initial transient current response, this method accurately predicts steady-state electrochemical parameters without requiring additional hardware modifications. The approach is validated through comparison with standard chronoamperometric techniques and is demonstrated to maintain reasonable accuracy while dramatically reducing data acquisition time. The implications of this technique are explored in analytical chemistry, sensor technology, and battery science, where rapid electrochemical quantification is critical. Our technique is focused on enabling faster multiplexing of chronoamperometric measurements for rapid olfactory and electrochemical analysis.

[LG-61] Olfactory Inertial Odometry: Sensor Calibration and Drift Compensation

Link: https://arxiv.org/abs/2506.04539
Authors: Kordel K. France, Ovidiu Daescu, Anirban Paul, Shalini Prasad
Categories: Robotics (cs.RO); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Published as a full conference paper at the 2025 IEEE International Symposium on Inertial Sensors Systems

Abstract:Visual inertial odometry (VIO) is a process for fusing visual and kinematic data to understand a machine’s state in a navigation task. Olfactory inertial odometry (OIO) is an analog to VIO that fuses signals from gas sensors with inertial data to help a robot navigate by scent. Gas dynamics and environmental factors introduce disturbances into olfactory navigation tasks that can make OIO difficult to facilitate. With our work here, we define a process for calibrating a robot for OIO that generalizes to several olfaction sensor types. Our focus is specifically on calibrating OIO for centimeter-level accuracy in localizing an odor source on a slow-moving robot platform to demonstrate use cases in robotic surgery and touchless security screening. We demonstrate our process for OIO calibration on a real robotic arm and show how this calibration improves performance over a cold-start olfactory navigation task.

[LG-62] HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training

Link: https://arxiv.org/abs/2506.04531
Authors: Geon-Woo Kim, Junbo Li, Shashidhar Gandham, Omar Baldonado, Adithya Gangidi, Pavan Balaji, Zhangyang Wang, Aditya Akella
Categories: Machine Learning (cs.LG)

Abstract:Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD, matching or exceeding accuracy on standard language modeling and downstream benchmarks, while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.

[LG-63] Hierarchical Implicit Neural Emulators

Link: https://arxiv.org/abs/2506.04528
Authors: Ruoxi Jiang, Xiao Zhang, Karan Jakhar, Peter Y. Lu, Pedram Hassanzadeh, Michael Maire, Rebecca Willett
Categories: Machine Learning (cs.LG)

Abstract:Neural PDE solvers offer a powerful tool for modeling complex dynamical systems, but often struggle with error accumulation over long time horizons and maintaining stability and physical consistency. We introduce a multiscale implicit neural emulator that enhances long-term prediction accuracy by conditioning on a hierarchy of lower-dimensional future state representations. Drawing inspiration from the stability properties of numerical implicit time-stepping methods, our approach leverages predictions several steps ahead in time at increasing compression rates for next-timestep refinements. By actively adjusting the temporal downsampling ratios, our design enables the model to capture dynamics across multiple granularities and enforce long-range temporal coherence. Experiments on turbulent fluid dynamics show that our method achieves high short-term accuracy and produces long-term stable forecasts, significantly outperforming autoregressive baselines while adding minimal computational overhead.

[LG-64] Perturbative Gradient Training: A novel training paradigm for bridging the gap between deep neural networks and physical reservoir computing

Link: https://arxiv.org/abs/2506.04523
Authors: Cliff B. Abbott, Mark Elo, Dmytro A. Bozhko
Categories: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: 7 pages, 8 figures, submitted to IEEE Transactions on Neural Networks and Learning Systems

Abstract:We introduce Perturbative Gradient Training (PGT), a novel training paradigm that overcomes a critical limitation of physical reservoir computing: the inability to perform backpropagation due to the black-box nature of physical reservoirs. Drawing inspiration from perturbation theory in physics, PGT uses random perturbations in the network’s parameter space to approximate gradient updates using only forward passes. We demonstrate the feasibility of this approach on both simulated neural network architectures, including a dense network and a transformer model with a reservoir layer, and on experimental hardware using a magnonic auto-oscillation ring as the physical reservoir. Our results show that PGT can achieve performance comparable to that of standard backpropagation methods in cases where backpropagation is impractical or impossible. PGT represents a promising step toward integrating physical reservoirs into deeper neural network architectures and achieving significant energy efficiency gains in AI training.
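
The idea of approximating gradient updates from forward passes alone can be sketched with a generic two-point random-perturbation estimator. This is not the authors' exact PGT procedure: the quadratic `loss`, the Rademacher perturbations, and the values of `eps` and `lr` are assumptions chosen only to show the mechanism on a problem where the answer is known.

```python
import random

def loss(w, target):
    # Stand-in for a black-box forward pass; only its output is observed.
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def perturbative_step(w, target, rng, eps=1e-3, lr=0.05):
    """One update using two forward evaluations along a random +/-1 direction."""
    u = [rng.choice((-1.0, 1.0)) for _ in w]
    w_plus = [wi + eps * ui for wi, ui in zip(w, u)]
    w_minus = [wi - eps * ui for wi, ui in zip(w, u)]
    # Central finite difference estimates the directional derivative along u.
    slope = (loss(w_plus, target) - loss(w_minus, target)) / (2 * eps)
    return [wi - lr * slope * ui for wi, ui in zip(w, u)]

rng = random.Random(0)
target = [1.0, -2.0, 0.5]
w = [0.0, 0.0, 0.0]
for _ in range(500):
    w = perturbative_step(w, target, rng)
final_loss = loss(w, target)
```

No derivative of `loss` is ever computed, which is the property that matters when the forward pass runs through physical hardware that cannot be differentiated.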

[LG-65] SGN-CIRL: Scene Graph-based Navigation with Curriculum Imitation and Reinforcement Learning

Link: https://arxiv.org/abs/2506.04505
Authors: Nikita Oskolkov, Huzhenyu Zhang, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 7 pages, 11 figures

Abstract:The 3D scene graph models spatial relationships between objects, enabling the agent to efficiently navigate in a partially observable environment and predict the location of the target object. This paper proposes an original framework named SGN-CIRL (3D Scene Graph-Based Reinforcement Learning Navigation) for mapless reinforcement learning-based robot navigation with a learnable representation of an open-vocabulary 3D scene graph. To accelerate and stabilize the training of reinforcement learning-based algorithms, the framework also employs imitation learning and curriculum learning. The former enables the agent to learn from demonstrations, while the latter structures the training process by gradually increasing task complexity from simple to more advanced scenarios. Numerical experiments conducted in the Isaac Sim environment showed that using a 3D scene graph for reinforcement learning significantly increased the success rate in difficult navigation cases. The code is open-sourced and available at: this https URL_graph.

[LG-66] Multiscale guidance of AlphaFold3 with heterogeneous cryo-EM data

Link: https://arxiv.org/abs/2506.04490
Authors: Rishwanth Raghu, Axel Levy, Gordon Wetzstein, Ellen D. Zhong
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Protein structure prediction models are now capable of generating accurate 3D structural hypotheses from sequence alone. However, they routinely fail to capture the conformational diversity of dynamic biomolecular complexes, often requiring heuristic MSA subsampling approaches for generating alternative states. In parallel, cryo-electron microscopy (cryo-EM) has emerged as a powerful tool for imaging near-native structural heterogeneity, but is challenged by arduous pipelines to go from raw experimental data to atomic models. Here, we bridge the gap between these modalities, combining cryo-EM density maps with the rich sequence and biophysical priors learned by protein structure prediction models. Our method, CryoBoltz, guides the sampling trajectory of a pretrained protein structure prediction model using both global and local structural constraints derived from density maps, driving predictions towards conformational states consistent with the experimental data. We demonstrate that this flexible yet powerful inference-time approach allows us to build atomic models into heterogeneous cryo-EM maps across a variety of dynamic biomolecular systems including transporters and antibodies.

[LG-67] Orthogonal Gradient Descent Improves Neural Calibration

Link: https://arxiv.org/abs/2506.04487
Authors: C. Evans Hedges
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We provide evidence that orthogonalizing gradients during training improves model calibration without sacrificing accuracy. On CIFAR-10 with 10% labeled data, $\perp$Grad matches SGD in accuracy but yields consistently improved calibration metrics such as lower test loss, reduced softmax overconfidence, and higher predictive entropy. These benefits persist under input corruption (CIFAR-10C) and extended training, where $\perp$Grad models degrade more gracefully than SGD-trained counterparts. $\perp$Grad is optimizer-agnostic, incurs minimal overhead, and works well with post-hoc calibration techniques like temperature scaling. Theoretically, we prove convergence of a simplified version of $\perp$Grad under mild assumptions and characterize its stationary points in positive homogeneous networks: $\perp$Grad converges to solutions where further loss reduction requires confidence scaling rather than decision boundary improvement.
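
A minimal sketch of gradient orthogonalization follows. It assumes the gradient is projected to be orthogonal to the current weight vector, which is consistent with the remark about confidence scaling but is not spelled out in the abstract; the function name and the example vectors are hypothetical.

```python
def orthogonalize_gradient(grad, weights):
    """Remove the component of grad that is parallel to weights."""
    w_norm_sq = sum(w * w for w in weights)
    if w_norm_sq == 0.0:
        return list(grad)  # nothing to project against
    scale = sum(g * w for g, w in zip(grad, weights)) / w_norm_sq
    return [g - scale * w for g, w in zip(grad, weights)]

weights = [3.0, 4.0]
grad = [1.0, 2.0]
g_perp = orthogonalize_gradient(grad, weights)
overlap = sum(g * w for g, w in zip(g_perp, weights))  # should be ~0
```

Under this assumption, a step along `g_perp` cannot simply rescale the weights to first order, so the loss cannot be reduced merely by inflating logit magnitudes.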

[LG-68] Leveraging Reward Models for Guiding Code Review Comment Generation

Link: https://arxiv.org/abs/2506.04464
Authors: Oussama Ben Sghaier, Rosalia Tufano, Gabriele Bavota, Houari Sahraoui
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Code review is a crucial component of modern software development, involving the evaluation of code quality, providing feedback on potential issues, and refining the code to address identified problems. Despite these benefits, code review can be rather time consuming, and influenced by subjectivity and human factors. For these reasons, techniques to (partially) automate the code review process have been proposed in the literature. Among those, the ones exploiting deep learning (DL) are able to tackle the generative aspect of code review, by commenting on a given code as a human reviewer would do (i.e., comment generation task) or by automatically implementing code changes required to address a reviewer’s comment (i.e., code refinement task). In this paper, we introduce CoRAL, a deep learning framework automating review comment generation by exploiting reinforcement learning with a reward mechanism considering both the semantics of the generated comments as well as their usefulness as input for other models automating the code refinement task. The core idea is that if the DL model generates comments that are semantically similar to the expected ones or can be successfully implemented by a second model specialized in code refinement, these comments are likely to be meaningful and useful, thus deserving a high reward in the reinforcement learning framework. We present both quantitative and qualitative comparisons between the comments generated by CoRAL and those produced by the latest baseline techniques, highlighting the effectiveness and superiority of our approach.

[LG-69] Neurosymbolic Artificial Intelligence for Robust Network Intrusion Detection: From Scratch to Transfer Learning

Link: https://arxiv.org/abs/2506.04454
Authors: Huynh T. T. Tran, Jacob Sander, Achraf Cohen, Brian Jalaian, Nathaniel D. Bastian
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, 11 tables

Abstract:Network Intrusion Detection Systems (NIDS) play a vital role in protecting digital infrastructures against increasingly sophisticated cyber threats. In this paper, we extend ODXU, a Neurosymbolic AI (NSAI) framework that integrates deep embedded clustering for feature extraction, symbolic reasoning using XGBoost, and comprehensive uncertainty quantification (UQ) to enhance robustness, interpretability, and generalization in NIDS. The extended ODXU incorporates score-based methods (e.g., Confidence Scoring, Shannon Entropy) and metamodel-based techniques, including SHAP values and Information Gain, to assess the reliability of predictions. Experimental results on the CIC-IDS-2017 dataset show that ODXU outperforms traditional neural models across six evaluation metrics, including classification accuracy and false omission rate. While transfer learning has seen widespread adoption in fields such as computer vision and natural language processing, its potential in cybersecurity has not been thoroughly explored. To bridge this gap, we develop a transfer learning strategy that enables the reuse of a pre-trained ODXU model on a different dataset. Our ablation study on ACI-IoT-2023 demonstrates that the optimal transfer configuration involves reusing the pre-trained autoencoder, retraining the clustering module, and fine-tuning the XGBoost classifier, and outperforms traditional neural models when trained with as few as 16,000 samples (approximately 50% of the training data). Additionally, results show that metamodel-based UQ methods consistently outperform score-based approaches on both datasets.

[LG-70] Selective Matching Losses – Not All Scores Are Created Equal

Link: https://arxiv.org/abs/2506.04446
Authors: Gil I. Shamir, Manfred K. Warmuth
Categories: Machine Learning (cs.LG)

Abstract:Learning systems match predicted scores to observations over some domain. Often, it is critical to produce accurate predictions in some subset (or region) of the domain, yet less important to accurately predict in other regions. We construct selective matching loss functions by design of increasing link functions over score domains. A matching loss is an integral over the link. A link defines loss sensitivity as function of the score, emphasizing high slope high sensitivity regions over flat ones. Loss asymmetry drives a model and resolves its underspecification to predict better in high sensitivity regions where it is more important, and to distinguish between high and low importance regions. A large variety of selective scalar losses can be designed with scaled and shifted Sigmoid and hyperbolic sine links. Their properties, however, do not extend to multi-class. Applying them per dimension lacks ranking sensitivity that assigns importance according to class score ranking. Utilizing composite Softmax functions, we develop a framework for multidimensional selective losses. We overcome limitations of the standard Softmax function, that is good for classification, but not for distinction between adjacent scores. Selective losses have substantial advantage over traditional losses in applications with more important score regions, including dwell-time prediction, retrieval, ranking with either pointwise, contrastive pairwise, or listwise losses, distillation problems, and fine-tuning alignment of Large Language Models (LLMs).
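
For readers unfamiliar with the statement that a matching loss is an integral over the link, the standard definition for an increasing link $g$, a predicted score $\hat{s}$, and a target score $s$ can be written as below. The notation is ours rather than the paper's, and the sigmoid case is included only as a familiar special case.

```latex
% Matching loss for an increasing link g (notation assumed, not taken from the paper)
L_g(\hat{s}, s) = \int_{s}^{\hat{s}} \bigl( g(z) - g(s) \bigr)\, dz \;\ge\; 0

% Special case: sigmoid link \sigma(z) = 1 / (1 + e^{-z})
L_\sigma(\hat{s}, s) = \log\bigl(1 + e^{\hat{s}}\bigr) - \log\bigl(1 + e^{s}\bigr) - \sigma(s)\,(\hat{s} - s)
```

Because the integrand grows with the slope of $g$, regions where the link is steep contribute more loss per unit of score error, which is the sensitivity effect the abstract describes.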

[LG-71] RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis

Link: https://arxiv.org/abs/2506.04439
Authors: Robin Yadav, Qi Yan, Guy Wolf, Avishek Joey Bose, Renjie Liao
Categories: Machine Learning (cs.LG)

Abstract:A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction (i.e., single-step retrosynthesis) remains challenging: even existing state-of-the-art template-free generative approaches struggle to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF), a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate that RSF achieves 60.0% top-1 accuracy, which outperforms the previous SOTA by 20%. We also substantiate the benefits of steering at inference and demonstrate that FK-steering improves top-5 round-trip accuracy by 19% over prior template-free SOTA methods, all while preserving competitive top-$k$ accuracy results.

[LG-72] KOALA++: Efficient Kalman-Based Optimization of Neural Networks with Gradient-Covariance Products

Link: https://arxiv.org/abs/2506.04432
Authors: Zixuan Xia, Aram Davtyan, Paolo Favaro
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second order gradient calculation, our method directly estimates the parameter covariance matrix by recursively updating compact gradient covariance products. This design improves upon the original KOALA framework that assumed diagonal covariance by implicitly capturing richer uncertainty structure without storing the full covariance matrix and avoiding large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.

[LG-73] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Link: https://arxiv.org/abs/2506.04430
Authors: Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Pavel Plyusnin, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 26 pages, 5 tables

Abstract:Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose JAGUAR SignSGD, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose JAGUAR Muon, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at this https URL

[LG-74] Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning

Link: https://arxiv.org/abs/2506.04411
Authors: Achleshwar Luthra, Tianbao Yang, Tomer Galanti
Categories: Machine Learning (cs.LG)

Abstract:Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear-probing. This bound depends on two measures of feature variability: within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion’s effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(1/\#\text{classes})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice.

[LG-75] Replay Can Provably Increase Forgetting

Link: https://arxiv.org/abs/2506.04377
Authors: Yasaman Mahdaviyeh, James Lucas, Mengye Ren, Andreas S. Tolias, Richard Zemel, Toniann Pitassi
Categories: Machine Learning (cs.LG)
Comments: To appear in the Proceedings of the Conference on Lifelong Learning Agents (CoLLAs) 2025

Abstract:Continual learning seeks to enable machine learning systems to solve an increasing corpus of tasks sequentially. A critical challenge for continual learning is forgetting, where the performance on previously learned tasks decreases as new tasks are introduced. One of the commonly used techniques to mitigate forgetting, sample replay, has been shown empirically to reduce forgetting by retaining some examples from old tasks and including them in new training episodes. In this work, we provide a theoretical analysis of sample replay in an over-parameterized continual linear regression setting, where each task is given by a linear subspace and, with enough replay samples, one would be able to eliminate forgetting. Our analysis focuses on sample replay and highlights the role of the replayed samples and the relationship between task subspaces. Surprisingly, we find that, even in a noiseless setting, forgetting can be non-monotonic with respect to the number of replay samples. We present tasks where replay can be harmful in worst-case settings, and also in distributional settings where replay of randomly selected samples increases forgetting in expectation. We also give empirical evidence that harmful replay is not limited to training with linear models by showing similar behavior for neural networks trained with SGD. Through experiments on a commonly used benchmark, we provide additional evidence that, even in seemingly benign scenarios, performance of the replay heavily depends on the choice of replay samples and the relationship between tasks.

[LG-76] Solving engineering eigenvalue problems with neural networks using the Rayleigh quotient

Link: https://arxiv.org/abs/2506.04375
Authors: Conor Rowan,John Evans,Kurt Maute,Alireza Doostan
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:From characterizing the speed of a thermal system’s response to computing natural modes of vibration, eigenvalue analysis is ubiquitous in engineering. In spite of this, eigenvalue problems have received relatively little treatment compared to standard forward and inverse problems in the physics-informed machine learning literature. In particular, neural network discretizations of solutions to eigenvalue problems have seen only a handful of studies. Owing to their nonlinearity, neural network discretizations prevent the conversion of the continuous eigenvalue differential equation into a standard discrete eigenvalue problem. In this setting, eigenvalue analysis requires more specialized techniques. Using a neural network discretization of the eigenfunction, we show that a variational form of the eigenvalue problem called the “Rayleigh quotient” in tandem with a Gram-Schmidt orthogonalization procedure is a particularly simple and robust approach to find the eigenvalues and their corresponding eigenfunctions. This method is shown to be useful for finding sets of harmonic functions on irregular domains, parametric and nonlinear eigenproblems, and high-dimensional eigenanalysis. We also discuss the utility of harmonic functions as a spectral basis for approximating solutions to partial differential equations. Through various examples from engineering mechanics, the combination of the Rayleigh quotient objective, Gram-Schmidt procedure, and the neural network discretization of the eigenfunction is shown to offer unique advantages for handling continuous eigenvalue problems.
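
To make the Rayleigh quotient plus Gram-Schmidt idea concrete, here is a minimal sketch. It is not the authors' method: the neural network discretization is swapped for a plain vector of grid values on a 1D domain, and the operator is a finite-difference Laplacian with zero boundary values. The function names, grid size, step size, and iteration count are all illustrative assumptions.

```python
import math
import random

def laplacian_apply(u, h):
    """Apply the 1D finite-difference operator -d^2/dx^2 with zero boundary values."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(u, basis):
    """Gram-Schmidt step: remove components along found modes, then normalize."""
    for v in basis:
        c = dot(u, v)
        u = [ui - c * vi for ui, vi in zip(u, v)]
    norm = math.sqrt(dot(u, u))
    return [ui / norm for ui in u]

def smallest_modes(n=8, n_modes=2, steps=2000, seed=0):
    """Find the smallest eigenvalues by descending the Rayleigh quotient."""
    h = 1.0 / (n + 1)
    lr = h * h / 4.0  # below 1 / lambda_max, which keeps the iteration stable
    rng = random.Random(seed)
    basis, eigenvalues = [], []
    for _ in range(n_modes):
        u = orthonormalize([rng.gauss(0.0, 1.0) for _ in range(n)], basis)
        for _ in range(steps):
            au = laplacian_apply(u, h)
            rq = dot(u, au)  # Rayleigh quotient, since u has unit norm
            # The gradient of the quotient at unit norm is 2 * (A u - rq * u).
            u = [ui - lr * (a - rq * ui) for ui, a in zip(u, au)]
            u = orthonormalize(u, basis)  # stay orthogonal to earlier modes
        eigenvalues.append(dot(u, laplacian_apply(u, h)))
        basis.append(u)
    return eigenvalues
```

For this toy operator the exact eigenvalues are known in closed form, (2 - 2 cos(k*pi/(n+1))) / h^2, so the values returned by `smallest_modes()` can be checked directly against them.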

[LG-77] Even Faster Hyperbolic Random Forests: A Beltrami-Klein Wrapper Approach

Link: https://arxiv.org/abs/2506.04360
Authors: Philippe Chlenski,Itsik Pe’er
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures, 2 tables

Click to view abstract

Abstract:Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the hyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite hyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase. Our implementation is available at this https URL.
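
The wrapper idea can be sketched in a few lines. The code below is an illustration, not the authors' implementation: it assumes points live on the Lorentz hyperboloid with x0 > 0, and it assumes a homogeneous split of the form sin(theta) * x_d - cos(theta) * x_0 >= 0, which is how the split in prior hyperbolic decision tree work is recalled here. All function names are hypothetical.

```python
import math

def lift_to_hyperboloid(spatial):
    """Lift spatial coordinates onto the hyperboloid x0^2 - sum(xi^2) = 1."""
    x0 = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [x0] + list(spatial)

def lorentz_to_klein(point):
    """Pre-processing step: divide the spatial coordinates by the time-like one."""
    x0 = point[0]
    return [v / x0 for v in point[1:]]

def homogeneous_split(point, dim, theta):
    """Assumed Lorentz-model split by a hyperplane through the origin."""
    return math.sin(theta) * point[dim] - math.cos(theta) * point[0] >= 0

def klein_threshold_split(klein_point, dim, theta):
    """The same split as an axis-parallel threshold on a Klein coordinate."""
    return klein_point[dim - 1] >= 1.0 / math.tan(theta)
```

For theta strictly between 0 and pi, dividing the homogeneous inequality by sin(theta) * x_0 (both positive) gives x_d / x_0 >= cot(theta), which is an ordinary threshold on a Klein coordinate. That is why, under these assumptions, a Euclidean tree learner run on Klein coordinates can stand in for the hyperbolic one.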

[LG-78] A Risk-Aware Reinforcement Learning Reward for Financial Trading

Link: https://arxiv.org/abs/2506.04358
Authors: Uditansh Srivastava,Shivam Aryan,Shaurya Singh
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 11 figures

Click to view abstract

Abstract:We propose a novel composite reward function for reinforcement learning in financial trading that balances return and risk using four differentiable terms: annualized return, downside risk, differential return, and the Treynor ratio. Unlike single-metric objectives, for example the Sharpe ratio, our formulation is modular and parameterized by weights w1, w2, w3, and w4, enabling practitioners to encode diverse investor preferences. We tune these weights via grid search to target specific risk-return profiles. We derive closed-form gradients for each term to facilitate gradient-based training and analyze key theoretical properties, including monotonicity, boundedness, and modularity. This framework offers a general blueprint for building robust multi-objective reward functions in complex trading environments and can be extended with additional risk measures or adaptive weighting.
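
As a rough illustration of a weighted four-term reward of this kind, consider the sketch below. The abstract does not give the formulas, so every definition here is an assumption: textbook-style versions of each term, a 252-period year, a zero risk-free rate, and a sign convention in which downside risk is subtracted. The function names are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def annualized_return(returns, periods=252):
    return mean(returns) * periods

def downside_risk(returns, target=0.0, periods=252):
    """Annualized root-mean-square shortfall below the target return."""
    shortfalls = [min(r - target, 0.0) ** 2 for r in returns]
    return math.sqrt(mean(shortfalls) * periods)

def differential_return(returns, benchmark, periods=252):
    """Annualized mean excess return over a benchmark series."""
    return (mean(returns) - mean(benchmark)) * periods

def treynor_ratio(returns, benchmark, risk_free=0.0, periods=252):
    """Excess return per unit of market beta (beta = cov / var)."""
    mr, mb = mean(returns), mean(benchmark)
    cov = mean([(r - mr) * (b - mb) for r, b in zip(returns, benchmark)])
    var = mean([(b - mb) ** 2 for b in benchmark])
    return (mr - risk_free) * periods / (cov / var)

def composite_reward(returns, benchmark, weights):
    """Weighted sum of the four terms; weights = (w1, w2, w3, w4)."""
    w1, w2, w3, w4 = weights
    return (w1 * annualized_return(returns)
            - w2 * downside_risk(returns)
            + w3 * differential_return(returns, benchmark)
            + w4 * treynor_ratio(returns, benchmark))
```

Setting the weights to `(1, 0, 0, 0)` recovers a pure return objective, while raising `w2` penalizes strategies with larger losses below the target. That is the modularity the abstract describes, under the assumed definitions above.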

[LG-79] Half-Layered Neural Networks

Link: https://arxiv.org/abs/2506.04352
Authors: Ethem Alpaydin
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 8 figures

Click to view abstract

Abstract:We propose a “half” layer of hidden units that has some of its weights randomly set and some of them trained. A half unit is composed of two stages: First, it takes a weighted sum of its inputs with fixed random weights, and second, the total activation is multiplied and then translated using two modifiable weights, before the result is passed through a nonlinearity. The number of modifiable weights of each hidden unit is thus two and does not depend on the fan-in. We show how such half units can be used in the first or any later layer in a deep network, possibly following convolutional layers. Our experiments on MNIST and FashionMNIST data sets indicate the promise of half layers, where we can achieve reasonable accuracy with a reduced number of parameters due to the regularizing effect of the randomized connections.
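
The two-stage half unit described above can be sketched as a forward pass. This is a minimal illustration, not the paper's code: the Gaussian initialization scaled by the fan-in, the tanh nonlinearity, and the class and attribute names are all assumptions.

```python
import math
import random

class HalfLayer:
    """Hidden layer whose units have fixed random input weights and two trainable scalars."""

    def __init__(self, fan_in, n_units, seed=0):
        rng = random.Random(seed)
        # Stage 1 weights: drawn once and never updated (assumed Gaussian init).
        self.fixed = [[rng.gauss(0.0, 1.0) / math.sqrt(fan_in) for _ in range(fan_in)]
                      for _ in range(n_units)]
        # Stage 2 weights: one multiplier and one offset per unit, both trainable.
        self.scale = [1.0] * n_units
        self.shift = [0.0] * n_units

    def forward(self, x):
        out = []
        for w, a, b in zip(self.fixed, self.scale, self.shift):
            z = sum(wi * xi for wi, xi in zip(w, x))  # fixed random projection
            out.append(math.tanh(a * z + b))          # scale, translate, nonlinearity
        return out

    def n_trainable(self):
        return len(self.scale) + len(self.shift)
```

A layer with 10 units has 20 trainable weights whether the fan-in is 10 or 784, which matches the abstract's point that the count does not depend on the fan-in.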

[LG-80] RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming

Link: https://arxiv.org/abs/2506.04302
Authors: Xiang Zheng,Xingjun Ma,Wei-Bin Lee,Cong Wang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Red teaming has proven to be an effective method for identifying and mitigating vulnerabilities in Large Language Models (LLMs). Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy among existing red teaming techniques. However, a lack of a unified benchmark hinders current RFT-based red teaming methods. Implementation details, especially in Proximal Policy Optimization (PPO)-based RFT, significantly affect outcome stability and reproducibility. To address this issue, we introduce RedRFT, a lightweight benchmark designed to simplify and standardize the implementation and evaluation of RFT-based red teaming. RedRFT combines the design strengths of both single-file CleanRL and highly modularized Tianshou, offering high-quality single-file red teaming implementations and modular PPO core components, such as the General Advantage Estimator. It supports a variety of token and sentence diversity metrics, featuring modularized intrinsic reward computation that facilitates plug-and-play experimentation. To clarify their influence on RFT performance, we conducted an extensive ablation study on key components, including Low-Rank Adaptation (LoRA), Kullback-Leibler (KL) divergence, and Lagrange Multiplier. We hope this work contributes to 1) gaining a comprehensive understanding of the implementation nuances of RFT-based red teaming algorithms, and 2) enabling rapid prototyping of innovative features for RFT-based red teaming. Code for the benchmark can be accessed at this https URL.
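
The advantage estimator named among the PPO core components is presumably Generalized Advantage Estimation (GAE). A minimal sketch of the standard recursion follows. It is not taken from the RedRFT codebase, and the function name and argument layout are assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard GAE: A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the last step
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `lam=0` the estimator reduces to the one-step TD error, and with `gamma=1, lam=1` it becomes the Monte Carlo return minus the value baseline. Details like this trade-off between bias and variance are the kind of implementation choice the abstract says affect stability.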

[LG-81] The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective

Link: https://arxiv.org/abs/2506.04301
Authors: Jiin Kim,Byeongjun Shin,Jinha Chung,Minsoo Rhu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:

Click to view abstract

Abstract:Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power consumption demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.

[LG-82] Softlog-Softmax Layers and Divergences Contribute to a Computationally Dependable Ensemble Learning

Link: https://arxiv.org/abs/2506.04297
Authors: Abdourrahmane Mahamane Atto(LISTIC)
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The paper proposes a four-step process showing that softlog-softmax cascades can improve both the consistency and the dependability of next-generation ensemble learning systems. The first step is anatomical in nature: the target ensemble model under consideration is composed of canonical elements relating to the definition of a convolutional frustum. No a priori assumption is made in the choice of canonical forms. Diversity is the main criterion for selecting these forms. It is shown that the more complex the problem, the more useful this ensemble diversity is. The second step is physiological and relates to neural engineering: a softlog is derived both to make weak logarithmic operations consistent and to lead, through multiple softlog-softmax layers, to intermediate decisions that respect the same class logic as that faced by the output layer. The third step concerns neural information theory: softlog-based entropy and divergence are proposed in order to construct information measures yielding consistent values on closed intervals. These information measures are used to determine the relationships between individual and sub-community decisions in frustum diversity-based ensemble learning. The final step addresses the derivation of an informative performance tensor for the purpose of a reliable ensemble evaluation.

[LG-83] Deep learning for predicting hauling fleet production capacity under uncertainties in open pit mines using real and simulated data

Link: https://arxiv.org/abs/2506.04296
Authors: N Guerin(CGS i3),M Nakhla(CGS i3),A Dehoux(ERAMET),J L Loyer(ERAMET)
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Accurate short-term forecasting of hauling-fleet capacity is crucial in open-pit mining, where weather fluctuations, mechanical breakdowns, and variable crew availability introduce significant operational uncertainties. We propose a deep-learning framework that blends real-world operational records (high-resolution rainfall measurements, fleet performance telemetry) with synthetically generated mechanical-breakdown scenarios to enable the model to capture fluctuating high-impact failure events. We evaluate two architectures: an XGBoost regressor achieving a median absolute error (MedAE) of 14.3 per cent and a Long Short-Term Memory network with a MedAE of 15.1 per cent. Shapley Additive exPlanations (SHAP) value analyses identify cumulative rainfall, historical payload trends, and simulated breakdown frequencies as dominant predictors. Integration of simulated breakdown data and shift-planning features notably reduces prediction volatility. Future work will further integrate maintenance-scheduling indicators (Mean Time Between Failures, Mean Time to Repair), detailed human resource data (operator absenteeism, crew efficiency metrics), blast event scheduling, and other operational constraints to enhance forecast robustness and adaptability. This hybrid modelling approach offers a comprehensive decision-support tool for proactive, data-driven fleet management under dynamically uncertain conditions.

[LG-84] Short-Term Power Demand Forecasting for Diverse Consumer Types to Enhance Grid Planning and Synchronisation

Link: https://arxiv.org/abs/2506.04294
Authors: Asier Diaz-Iglesias,Xabier Belaunzaran,Ane M. Florez-Tapia
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Ensuring grid stability in the transition to renewable energy sources requires accurate power demand forecasting. This study addresses the need for precise forecasting by differentiating among industrial, commercial, and residential consumers through customer clusterisation, tailoring the forecasting models to capture the unique consumption patterns of each group. Feature selection is carried out for each consumer type, drawing on temporal, socio-economic, and weather-related data obtained from the Copernicus Earth Observation (EO) program. A variety of AI and machine learning algorithms for Short-Term Load Forecasting (STLF) and Very Short-Term Load Forecasting (VSTLF) are explored and compared to determine the most effective approaches. The main contribution of this work is the set of new forecasting approaches proposed, which outperform simpler models for both STLF and VSTLF, highlighting the importance of customized forecasting strategies for different consumer groups and demonstrating the impact of incorporating detailed weather data on forecasting accuracy. These advancements contribute to more reliable power demand predictions, thereby supporting grid stability.

[LG-85] GARG-AML against Smurfing: A Scalable and Interpretable Graph-Based Framework for Anti-Money Laundering

Link: https://arxiv.org/abs/2506.04292
Authors: Bruno Deprez,Bart Baesens,Tim Verdonck,Wouter Verbeke
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:Money laundering poses a significant challenge as it is estimated to account for 2%-5% of the global GDP. This has compelled regulators to impose stringent controls on financial institutions. One prominent laundering method for evading these controls, called smurfing, involves breaking up large transactions into smaller amounts. Given the complexity of smurfing schemes, which involve multiple transactions distributed among diverse parties, network analytics has become an important anti-money laundering tool. However, recent advances have focused predominantly on black-box network embedding methods, which has hindered their adoption in businesses. In this paper, we introduce GARG-AML, a novel graph-based method that quantifies smurfing risk through a single interpretable metric derived from the structure of the second-order transaction network of each individual node in the network. Unlike traditional methods, GARG-AML strikes an effective balance among computational efficiency, detection power and transparency, which enables its integration into existing AML workflows. To enhance its capabilities, we combine the GARG-AML score calculation with different tree-based methods and also incorporate the scores of the node’s neighbours. An experimental evaluation on large-scale synthetic and open-source networks demonstrates that GARG-AML outperforms the current state-of-the-art smurfing detection methods. By leveraging only the adjacency matrix of the second-order neighbourhood and basic network features, this work highlights the potential of fundamental network properties for advancing fraud detection.

[LG-86] A Lyapunov Drift-Plus-Penalty Method Tailored for Reinforcement Learning with Queue Stability

Link: https://arxiv.org/abs/2506.04291
Authors: Wenhan Xu,Jiashuo Jiang,Lei Deng,Danny Hin-Kwok Tsang
Categories: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:With the proliferation of Internet of Things (IoT) devices, the demand for addressing complex optimization challenges has intensified. The Lyapunov Drift-Plus-Penalty algorithm is a widely adopted approach for ensuring queue stability, and some research has preliminarily explored its integration with reinforcement learning (RL). In this paper, we investigate the adaptation of the Lyapunov Drift-Plus-Penalty algorithm for RL applications, deriving an effective method for combining Lyapunov Drift-Plus-Penalty with RL under a set of common and reasonable conditions through rigorous theoretical analysis. Unlike existing approaches that directly merge the two frameworks, our proposed algorithm, termed the Lyapunov drift-plus-penalty method tailored for reinforcement learning with queue stability (LDPTRLQ), offers theoretical superiority by effectively balancing the greedy optimization of Lyapunov Drift-Plus-Penalty with the long-term perspective of RL. Simulation results for multiple problems demonstrate that LDPTRLQ outperforms the baseline methods using the Lyapunov drift-plus-penalty method and RL, corroborating the validity of our theoretical derivations. The results also demonstrate that our proposed algorithm outperforms other benchmarks in terms of compatibility and stability.
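
For readers unfamiliar with the classic drift-plus-penalty rule that this paper builds on, here is a minimal single-queue sketch. It shows only the greedy baseline, not the proposed LDPTRLQ algorithm. The action set, costs, constant arrival rate, and function name are all made-up assumptions.

```python
def simulate_dpp(v, n_slots=200, arrival=1):
    """Greedy drift-plus-penalty control of one queue.

    Each slot picks the (service, cost) option that minimizes
    v * cost - queue * service, then updates the backlog.
    """
    options = [(0, 0.0), (1, 1.0), (2, 4.0)]  # hypothetical (service, cost) pairs
    queue, total_cost, total_backlog = 0, 0.0, 0
    for _ in range(n_slots):
        service, cost = min(options, key=lambda o: v * o[1] - queue * o[0])
        queue = max(queue - service, 0) + arrival
        total_cost += cost
        total_backlog += queue
    return total_cost / n_slots, total_backlog / n_slots
```

Comparing `simulate_dpp(0)` with `simulate_dpp(10)` shows the usual trade-off: a larger `v` lowers the average cost but lets the backlog settle at a higher level. The rule is greedy per slot, which is the short-sightedness the abstract contrasts with the long-term view of RL.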

[LG-87] Relational reasoning and inductive bias in transformers trained on a transitive inference task

Link: https://arxiv.org/abs/2506.04289
Authors: Jesse Geerts,Stephanie Chan,Claudia Clopath,Kimberly Stachenfeld
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 13 pages, 6 figures

Click to view abstract

Abstract:Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning in different learning regimes remain poorly understood. In this work, we investigate how transformers perform a classic relational reasoning task from the psychology literature, transitive inference, which requires inference about indirectly related items by integrating information across observed adjacent item pairs (e.g., if A > B and B > C, then A > C). We compare transitive inference behavior across two distinct learning regimes: in-weights learning (IWL), where models store information in network parameters, and in-context learning (ICL), where models flexibly utilize information presented within the input sequence. Our findings reveal that IWL naturally induces a generalization bias towards transitive inference, even though models are trained only on adjacent items, whereas ICL models trained solely on adjacent items do not generalize transitively. Mechanistic analysis shows that ICL models develop induction circuits that implement a simple match-and-copy strategy that performs well at relating adjacent pairs, but does not encode hierarchical relationships among indirectly related items. Interestingly, when pre-trained on in-context linear regression tasks, transformers successfully exhibit in-context generalizable transitive inference. Moreover, like IWL, they display both the symbolic distance and terminal item effects characteristic of human and animal performance, without forming induction circuits. These results suggest that pre-training on tasks with underlying structure promotes the development of representations that can scaffold in-context relational reasoning.

[LG-88] Backbone Augmented Training for Adaptations

Link: https://arxiv.org/abs/2506.04288
Authors: Jae Wan Park,Junhyeok Kim,Youngjun Jun,Hyunah Ko,Seong Jae Hwang
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the backbone models. We propose Backbone Augmented Training (BAT), a method that leverages backbone data to augment the adaptation dataset. First, we formulate and prove two key mathematical propositions: one establishes the validity of BAT, while the other identifies a condition under which BAT benefits adaptation. Furthermore, we introduce an advanced data selection scheme that satisfies these propositions and present the ALBAT algorithm to implement this approach. ALBAT efficiently enhances adaptation training in both personalization and language generation tasks with scarce data.

[LG-89] Training-free AI for Earth Observation Change Detection using Physics Aware Neuromorphic Networks

Link: https://arxiv.org/abs/2506.04285
Authors: Stephen Smith,Cormac Purcell,Zdenka Kuncic
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 3 tables

Click to view abstract

Abstract:Earth observations from low Earth orbit satellites provide vital information for decision makers to better manage time-sensitive events such as natural disasters. For the data to be most effective for first responders, low latency is required between data capture and its arrival to decision makers. A major bottleneck is in the bandwidth-limited downlinking of the data from satellites to ground stations. One approach to overcome this challenge is to process at least some of the data on-board and prioritise pertinent data to be downlinked. In this work we propose a Physics Aware Neuromorphic Network (PANN) to detect changes caused by natural disasters from a sequence of multi-spectral satellite images and produce a change map, enabling relevant data to be prioritised for downlinking. The PANN used in this study is motivated by physical neural networks comprised of nano-electronic circuit elements known as “memristors” (nonlinear resistors with memory). The weights in the network are dynamic and update in response to varying input signals according to memristor equations of state and electrical circuit conservation laws. The PANN thus generates physics-constrained dynamical output features which are used to detect changes in a natural disaster detection task by applying a distance-based metric. Importantly, this makes the whole model training-free, allowing it to be implemented with minimal computing resources. The PANN was benchmarked against a state-of-the-art AI model and achieved comparable or better results in each natural disaster category. It thus presents a promising solution to the challenge of resource-constrained on-board processing.

[LG-90] DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience

Link: https://arxiv.org/abs/2506.04282
Authors: Runxiang Wang,Boxiao Wang,Kai Li,Yifan Zhang,Jian Cheng
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.

[LG-91] SF2Bench: Evaluating Data-Driven Models for Compound Flood Forecasting in South Florida

Link: https://arxiv.org/abs/2506.04281
Authors: Xu Zheng,Chaohao Lin,Sipeng Chen,Zhuomin Chen,Jimeng Shi,Wei Cheng,Jayantha Obeysekera,Jason Liu,Dongsheng Luo
Categories: Machine Learning (cs.LG)
Comments: 60 Pages

Click to view abstract

Abstract:Forecasting compound floods presents a significant challenge due to the intricate interplay of meteorological, hydrological, and oceanographic factors. Analyzing compound floods has become more critical as the global climate increases flood risks. Traditional physics-based methods, such as the Hydrologic Engineering Center’s River Analysis System, are often time-inefficient. Machine learning has recently demonstrated promise in both modeling accuracy and computational efficiency. However, the scarcity of comprehensive datasets currently hinders systematic analysis. Existing water-related datasets are often limited by a sparse network of monitoring stations and incomplete coverage of relevant factors. To address this challenge, we introduce SF2Bench, a comprehensive time series collection on compound floods in South Florida, which integrates four key factors: tide, rainfall, groundwater, and human management activities (gate and pump controlling). This integration allows for a more detailed analysis of the individual contributions of these drivers to compound flooding and informs the development of improved flood forecasting approaches. To comprehensively evaluate the potential of various modeling paradigms, we assess the performance of six categories of methods, encompassing Multilayer Perceptrons, Convolutional Neural Networks, Recurrent Neural Networks, Graph Neural Networks, Transformers, and Large Language Models. We verified the impact of different key features on flood forecasting through experiments. Our analysis examines temporal and spatial aspects, providing insights into the influence of historical data and spatial dependencies. The varying performance across these approaches underscores the diverse capabilities of each in capturing complex temporal and spatial dependencies inherent in compound floods.

[LG-92] Understanding the Impact of Sampling Quality in Direct Preference Optimization NEURIPS2025

Link: https://arxiv.org/abs/2506.04272
Authors: Kyung Rok Kim,Yumo Bai,Chonghuan Wang,Guanting Chen
Categories: Machine Learning (cs.LG)
Comments: Submitted to NeurIPS2025

Click to view abstract

Abstract:We study the role of the sampling distribution in Direct Preference Optimization (DPO) and aim to understand its impact on DPO’s training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the generating distribution. We first analyze how the distribution of responses influences policy updates during gradient descent, drawing connections to common phenomena found in practice. We then design a simplified yet well-structured alignment model as a proxy, and develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by empirical experiments and provide a principled justification for the online DPO framework in practice.
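
As background for the analysis above, the standard per-pair DPO loss can be written in a few lines. This sketch shows only the well-known objective, not the paper's proxy model or its results. The function name and the default `beta` value are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin is the policy-vs-reference log-ratio of the chosen response
    minus that of the rejected response.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), split by sign for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the chosen response relative to the rejected one. Which response pairs get sampled determines which margins the gradient ever sees, and that dependence is what the abstract studies.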

[LG-93] SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs

Link: https://arxiv.org/abs/2506.04250
Authors: Shaona Ghosh,Amrita Bhattacharjee,Yftah Ziser,Christopher Parisien
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2410.01174

Click to view abstract

Abstract:Fine-tuning large language models (LLMs) to adapt to evolving safety policies is costly and impractical. Mechanistic interpretability enables inference-time control through latent activation steering, yet its potential for precise, customizable safety adjustments remains largely untapped. This paper investigates an approach called SafeSteer for guiding the outputs of LLMs by: (i) leveraging category-specific steering vectors for more precise control, (ii) employing a simple, gradient-free unsupervised method to enhance safety steering while preserving text quality and topic relevance and avoiding explicit refusal, and (iii) accomplishing this without a hard requirement of contrastive pairwise safe data. We also highlight that our method, being simple and effective, aligns with recent studies suggesting that simple techniques often outperform more complex ones in activation steering. We showcase the effectiveness of our approach across various LLMs, datasets, and risk categories, demonstrating its ability to provide precise control, prevent blanket refusals, and guide models toward generating safe content while maintaining topic relevance.

[LG-94] ChemReservoir – An Open-Source Framework for Chemically-Inspired Reservoir Computing

Link: https://arxiv.org/abs/2506.04249
Authors: Mehmet Aziz Yirik,Jakob Lykke Andersen,Rolf Fagerberg,Daniel Merkle
Categories: Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reservoir computing is a type of recurrent neural network that maps inputs into a higher-dimensional space using fixed, nonlinear dynamical systems called reservoirs. In the literature, there are various types of reservoirs, ranging from in-silico to in-vitro. In cheminformatics, previous studies contributed to the field by developing simulation-based, chemically inspired in-silico reservoir models. Yahiro used a DNA-based chemical reaction network as its reservoir and Nguyen developed a DNA chemistry-inspired tool based on the Gillespie algorithm. However, these software tools were designed mainly with a focus on DNA chemistry, and their maintenance status has limited their current usability. Due to these limitations, there was a need for a proper open-source tool. This study introduces ChemReservoir, an open-source framework for chemically-inspired reservoir computing. In contrast to the former studies focused on DNA chemistry, ChemReservoir is a general framework for the construction and analysis of chemically-inspired reservoirs, which also addresses the limitations in these previous studies by ensuring enhanced testing, evaluation, and reproducibility. The tool was evaluated using various cycle-based reservoir topologies and demonstrated stable performance across a range of configurations in memory capacity tasks.

[LG-95] Improving Out-of-Distribution Detection with Markov Logic Networks

Link: https://arxiv.org/abs/2506.04241
Authors: Konstantin Kirchheim, Frank Ortmeier
Categories: Machine Learning (cs.LG)

Abstract:Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov logic networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness.

[LG-96] A Comprehensive Survey on Bio-Inspired Algorithms: Taxonomy, Applications, and Future Directions

Link: https://arxiv.org/abs/2506.04238
Authors: Shriyank Somvanshi, Md Monzurul Islam, Syed Aaqib Javed, Gaurab Chhetri, Kazi Sifatul Islam, Tausif Islam Chowdhury, Sazzad Bin Bashar Polock, Anandi Dutta, Subasish Das
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:Bio-inspired algorithms (BIAs) utilize natural processes such as evolution, swarm behavior, foraging, and plant growth to solve complex, nonlinear, high-dimensional optimization problems. This survey categorizes BIAs into eight groups: evolutionary, swarm intelligence, physics-inspired, ecosystem and plant-based, predator-prey, neural-inspired, human-inspired, and hybrid approaches, and reviews their core principles, strengths, and limitations. We illustrate the usage of these algorithms in machine learning, engineering design, bioinformatics, and intelligent systems, and highlight recent advances in hybridization, parameter tuning, and adaptive strategies. Finally, we identify open challenges such as scalability, convergence, reliability, and interpretability to suggest directions for future research. This work aims to serve as a foundational resource for both researchers and practitioners interested in understanding the current landscape and future directions of bio-inspired computing.

[LG-97] Admissibility of Completely Randomized Trials: A Large-Deviation Approach

Link: https://arxiv.org/abs/2506.05329
Authors: Guido Imbens, Chao Qin, Stefan Wager
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
Comments: A one-page abstract of this work will appear at the 26th ACM Conference on Economics and Computation (EC’25)

Abstract:When an experimenter has the option of running an adaptive trial, is it admissible to ignore this option and run a non-adaptive trial instead? We provide a negative answer to this question in the best-arm identification problem, where the experimenter aims to allocate measurement efforts judiciously to confidently deploy the most effective treatment arm. We find that, whenever there are at least three treatment arms, there exist simple adaptive designs that universally and strictly dominate non-adaptive completely randomized trials. This dominance is characterized by a notion called efficiency exponent, which quantifies a design’s statistical efficiency when the experimental sample is large. Our analysis focuses on the class of batched arm elimination designs, which progressively eliminate underperforming arms at pre-specified batch intervals. We characterize simple sufficient conditions under which these designs universally and strictly dominate completely randomized trials. These results resolve the second open problem posed in Qin [2022].
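The abstract describes batched arm elimination only at a high level, so the following is a minimal sketch of the general idea rather than the authors' design. The Gaussian reward model, the equal batch sizes, and the rule of dropping exactly one arm (the one with the lowest running mean) after each batch are all assumptions made for illustration; the function and parameter names are hypothetical.

```python
import random


def batched_arm_elimination(true_means, batch_size, noise=1.0, seed=0):
    """Return the index of the arm that survives repeated elimination.

    Assumed setup (not taken from the paper): rewards are Gaussian with
    the given means and a common standard deviation `noise`; after each
    batch, the surviving arm with the lowest empirical mean is dropped.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    active = list(range(n_arms))
    totals = [0.0] * n_arms   # cumulative reward per arm
    counts = [0] * n_arms     # number of pulls per arm

    while len(active) > 1:
        # Pull every surviving arm `batch_size` times in this batch.
        for arm in active:
            for _ in range(batch_size):
                totals[arm] += rng.gauss(true_means[arm], noise)
                counts[arm] += 1
        # Eliminate the arm with the lowest empirical mean so far.
        worst = min(active, key=lambda a: totals[a] / counts[a])
        active.remove(worst)

    return active[0]
```

Unlike a completely randomized trial, which splits the whole sample evenly across all arms, this kind of design stops spending measurements on arms that already look poor, so later batches concentrate on the remaining contenders.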

[LG-98] Generalizable real-time neural decoding with hybrid state-space models

Link: https://arxiv.org/abs/2506.05320
Authors: Avery Hee-Woon Ryoo, Nanda H. Krishna, Ximeng Mao, Mehdi Azabou, Eva L. Dyer, Matthew G. Perich, Guillaume Lajoie
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:Real-time decoding of neural activity is central to neuroscience and neurotechnology applications, from closed-loop experiments to brain-computer interfaces, where models are subject to strict latency constraints. Traditional methods, including simple recurrent neural networks, are fast and lightweight but often struggle to generalize to unseen data. In contrast, recent Transformer-based approaches leverage large-scale pretraining for strong generalization performance, but typically have much larger computational requirements and are not always suitable for low-resource or real-time settings. To address these shortcomings, we present POSSM, a novel hybrid architecture that combines individual spike tokenization via a cross-attention module with a recurrent state-space model (SSM) backbone to enable (1) fast and causal online prediction on neural activity and (2) efficient generalization to new sessions, individuals, and tasks through multi-dataset pretraining. We evaluate POSSM’s decoding performance and inference speed on intracortical decoding of monkey motor tasks, and show that it extends to clinical applications, namely handwriting and speech decoding in human subjects. Notably, we demonstrate that pretraining on monkey motor-cortical recordings improves decoding performance on the human handwriting task, highlighting the exciting potential for cross-species transfer. In all of these tasks, we find that POSSM achieves decoding accuracy comparable to state-of-the-art Transformers, at a fraction of the inference cost (up to 9x faster on GPU). These results suggest that hybrid SSMs are a promising approach to bridging the gap between accuracy, inference speed, and generalization when training neural decoders for real-time, closed-loop applications.

[LG-99] Robust Moment Identification for Nonlinear PDEs via a Neural ODE Approach

Link: https://arxiv.org/abs/2506.05245
Authors: Shaoxuan Chen, Su Yang, Panayotis G. Kevrekidis, Wei Zhu
Categories: Pattern Formation and Solitons (nlin.PS); Machine Learning (cs.LG)

Abstract:We propose a data-driven framework for learning reduced-order moment dynamics from PDE-governed systems using Neural ODEs. In contrast to derivative-based methods like SINDy, which necessitate densely sampled data and are sensitive to noise, our approach based on Neural ODEs directly models moment trajectories, enabling robust learning from sparse and potentially irregular time series. Using as an application platform the nonlinear Schrödinger equation, the framework accurately recovers governing moment dynamics when closure is available, even with limited and irregular observations. For systems without analytical closure, we introduce a data-driven coordinate transformation strategy based on Stiefel manifold optimization, enabling the discovery of low-dimensional representations in which the moment dynamics become closed, facilitating interpretable and reliable modeling. We also explore cases where a closure model is not known, such as a Fisher-KPP reaction-diffusion system. Here we demonstrate that Neural ODEs can still effectively approximate the unclosed moment dynamics and achieve superior extrapolation accuracy compared to physical-expert-derived ODE models. This advantage remains robust even under sparse and irregular sampling, highlighting the method’s robustness in data-limited settings. Our results highlight the Neural ODE framework as a powerful and flexible tool for learning interpretable, low-dimensional moment dynamics in complex PDE-governed systems.

[LG-100] Causal Effect Identification in lvLiNGAM from Higher-Order Cumulants ICML2025

Link: https://arxiv.org/abs/2506.05202
Authors: Daniele Tramontano, Yaroslav Kivva, Saber Salehkaleybar, Mathias Drton, Negar Kiyavash
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Accepted at ICML 2025

Abstract:This paper investigates causal effect identification in latent variable Linear Non-Gaussian Acyclic Models (lvLiNGAM) using higher-order cumulants, addressing two prominent setups that are challenging in the presence of latent confounding: (1) a single proxy variable that may causally influence the treatment and (2) underspecified instrumental variable cases where fewer instruments exist than treatments. We prove that causal effects are identifiable with a single proxy or instrument and provide corresponding estimation methods. Experimental results demonstrate the accuracy and robustness of our approaches compared to existing methods, advancing the theoretical and practical understanding of causal inference in linear systems with latent confounders.

[LG-101] Nonlinear Causal Discovery for Grouped Data UAI’25

Link: https://arxiv.org/abs/2506.05120
Authors: Konstantin Göbler, Tobias Windisch, Mathias Drton
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 9 pages, 5 figures, to be published at UAI’25

Abstract:Inferring cause-effect relationships from observational data has gained significant attention in recent years, but most methods are limited to scalar random variables. In many important domains, including neuroscience, psychology, social science, and industrial manufacturing, the causal units of interest are groups of variables rather than individual scalar measurements. Motivated by these applications, we extend nonlinear additive noise models to handle random vectors, establishing a two-step approach for causal graph learning: First, infer the causal order among random vectors. Second, perform model selection to identify the best graph consistent with this order. We introduce effective and novel solutions for both steps in the vector case, demonstrating strong performance in simulations. Finally, we apply our method to real-world assembly line data with partial knowledge of causal ordering among variable groups.

[LG-102] Learning Joint Interventional Effects from Single-Variable Interventions in Additive Models ICML

Link: https://arxiv.org/abs/2506.04945
Authors: Armin Kekić, Sergio Hernan Garrido Mejia, Bernhard Schölkopf
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: To be published at the International Conference on Machine Learning (ICML) 2025

Abstract:Estimating causal effects of joint interventions on multiple variables is crucial in many domains, but obtaining data from such simultaneous interventions can be challenging. Our study explores how to learn joint interventional effects using only observational data and single-variable interventions. We present an identifiability result for this problem, showing that for a class of nonlinear additive outcome mechanisms, joint effects can be inferred without access to joint interventional data. We propose a practical estimator that decomposes the causal effect into confounded and unconfounded contributions for each intervention variable. Experiments on synthetic data demonstrate that our method achieves performance comparable to models trained directly on joint interventional data, outperforming a purely observational estimator.

[LG-103] TQml Simulator: Optimized Simulation of Quantum Machine Learning

Link: https://arxiv.org/abs/2506.04891
Authors: Viacheslav Kuzmin, Basil Kyriacou, Mateusz Papierz, Mo Kordzanganeh, Alexey Melnikov
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 25 pages, 13 figures, 1 table

Abstract:Hardware-efficient circuits employed in Quantum Machine Learning are typically composed of alternating layers of uniformly applied gates. High-speed numerical simulators for such circuits are crucial for advancing research in this field. In this work, we numerically benchmark universal and gate-specific techniques for simulating the action of layers of gates on quantum state vectors, aiming to accelerate the overall simulation of Quantum Machine Learning algorithms. Our analysis shows that the optimal simulation method for a given layer of gates depends on the number of qubits involved, and that a tailored combination of techniques can yield substantial performance gains in the forward and backward passes for a given circuit. Building on these insights, we developed a numerical simulator, named TQml Simulator, that employs the most efficient simulation method for each layer in a given circuit. We evaluated TQml Simulator on circuits constructed from standard gate sets, such as rotations and CNOTs, as well as on native gates from IonQ and IBM quantum processing units. In most cases, our simulator outperforms equivalent Pennylane’s this http URL simulator by approximately 2- to 100-fold, depending on the circuit, the number of qubits, the batch size of the input data, and the hardware used.

[LG-104] kTULA: A Langevin sampling algorithm with improved KL bounds under super-linear log-gradients

Link: https://arxiv.org/abs/2506.04878
Authors: Iosif Lytras, Sotirios Sabanis, Ying Zhang
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Abstract:Motivated by applications in deep learning, where the global Lipschitz continuity condition is often not satisfied, we examine the problem of sampling from distributions with super-linearly growing log-gradients. We propose a novel tamed Langevin dynamics-based algorithm, called kTULA, to solve the aforementioned sampling problem, and provide a theoretical guarantee for its performance. More precisely, we establish a non-asymptotic convergence bound in Kullback-Leibler (KL) divergence with the best-known rate of convergence equal to $2-\overline{\epsilon}$, $\overline{\epsilon}>0$, which significantly improves relevant results in existing literature. This enables us to obtain an improved non-asymptotic error bound in Wasserstein-2 distance, which can be used to further derive a non-asymptotic guarantee for kTULA to solve the associated optimization problems. To illustrate the applicability of kTULA, we apply the proposed algorithm to the problem of sampling from a high-dimensional double-well potential distribution and to an optimization problem involving a neural network. We show that our main results can be used to provide theoretical guarantees for the performance of kTULA.

[LG-105] Distributional encoding for Gaussian process regression with qualitative inputs

Link: https://arxiv.org/abs/2506.04813
Authors: Sébastien Da Veiga (ENSAI, CREST, RT-UQ)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Gaussian Process (GP) regression is a popular and sample-efficient approach for many engineering applications, where observations are expensive to acquire, and is also a central ingredient of Bayesian optimization (BO), a widely used method for the optimization of black-box functions. However, when all or some input variables are categorical, building a predictive and computationally efficient GP remains challenging. Starting from the naive target encoding idea, where the original categorical values are replaced with the mean of the target variable for that category, we propose a generalization based on distributional encoding (DE) which makes use of all samples of the target variable for a category. To handle this type of encoding inside the GP, we build upon recent results on characteristic kernels for probability distributions, based on the maximum mean discrepancy and the Wasserstein distance. We also discuss several extensions for classification, multi-task learning and incorporation of auxiliary information. Our approach is validated empirically, and we demonstrate state-of-the-art predictive performance on a variety of synthetic and real-world datasets. DE is naturally complementary to recent advances in BO over discrete and mixed spaces.
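To make the contrast between naive target encoding and distributional encoding concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the one-dimensional Wasserstein-1 distance is restricted to equal-size samples to keep the computation short.

```python
from collections import defaultdict
from statistics import mean


def _group_targets(categories, targets):
    """Collect the target values observed for each category."""
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)
    return groups


def target_encode(categories, targets):
    """Naive target encoding: each category maps to its mean target."""
    return {c: mean(ys) for c, ys in _group_targets(categories, targets).items()}


def distributional_encode(categories, targets):
    """Distributional encoding: each category keeps all its target samples."""
    return {c: sorted(ys) for c, ys in _group_targets(categories, targets).items()}


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size samples on the real line.

    For equal-size empirical distributions this is the mean absolute
    difference between the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return mean(abs(a - b) for a, b in zip(sorted(xs), sorted(ys)))
```

With categories `["a", "a", "b", "b"]` and targets `[1.0, 3.0, 2.0, 2.0]`, both categories have the same mean, so target encoding cannot tell them apart, whereas the Wasserstein distance between their encoded samples is nonzero. A distance of this kind is what a kernel over probability distributions would consume.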

[LG-106] Can Artificial Intelligence Trade the Stock Market?

Link: https://arxiv.org/abs/2506.04658
Authors: Jędrzej Maskiewicz, Paweł Sakowski
Categories: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract:The paper explores the use of Deep Reinforcement Learning (DRL) in stock market trading, focusing on two algorithms, Double Deep Q-Network (DDQN) and Proximal Policy Optimization (PPO), and compares them with a Buy and Hold benchmark. It evaluates these algorithms across three currency pairs, the S&P 500 index, and Bitcoin, using daily data from 2019 to 2023. The results demonstrate DRL’s effectiveness in trading and its ability to manage risk by strategically avoiding trades in unfavorable conditions, providing a substantial edge, in terms of risk-adjusted returns, over classical approaches based on supervised learning.

[LG-107] Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning

Link: https://arxiv.org/abs/2506.04626
Authors: Haochen Zhang, Zhong Zheng, Lingzhou Xue
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2502.02859

Abstract:Motivated by real-world settings where data collection and policy deployment – whether for a single agent or across multiple agents – are costly, we study the problem of on-policy single-agent reinforcement learning (RL) and federated RL (FRL) with a focus on minimizing burn-in costs (the sample sizes needed to reach near-optimal regret) and policy switching or communication costs. In parallel finite-horizon episodic Markov Decision Processes (MDPs) with S states and A actions, existing methods either require superlinear burn-in costs in S and A or fail to achieve logarithmic switching or communication costs. We propose two novel model-free RL algorithms – Q-EarlySettled-LowCost and FedQ-EarlySettled-LowCost – that are the first in the literature to simultaneously achieve: (i) the best near-optimal regret among all known model-free RL or FRL algorithms, (ii) low burn-in cost that scales linearly with S and A, and (iii) logarithmic policy switching cost for single-agent RL or communication cost for FRL. Additionally, we establish gap-dependent theoretical guarantees for both regret and switching/communication costs, improving or matching the best-known gap-dependent bounds.

[LG-108] Non-linear Multi-objective Optimization with Probabilistic Branch and Bound

Link: https://arxiv.org/abs/2506.04554
Authors: Hao Huang, Zelda B. Zabinsky
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 26 pages, 5 figures

Abstract:A multiple objective simulation optimization algorithm named Multiple Objective Probabilistic Branch and Bound with Single Observation (MOPBnB(so)) is presented for approximating the Pareto optimal set and the associated efficient frontier for stochastic multi-objective optimization problems. MOPBnB(so) evaluates a noisy function exactly once at any solution and uses neighboring solutions to estimate the objective functions, in contrast to a variant that uses multiple replications at a solution to estimate the objective functions. A finite-time performance analysis for deterministic multi-objective problems provides a bound on the probability that MOPBnB(so) captures the Pareto optimal set. Asymptotic convergence of MOPBnB(so) on stochastic problems is derived, in that the algorithm captures the Pareto optimal set and the estimations converge to the true objective function values. Numerical results reveal that the variant with multiple replications is extremely intensive in terms of computational resources compared to MOPBnB(so). In addition, numerical results show that MOPBnB(so) outperforms a genetic algorithm NSGA-II on test problems.

[LG-109] On the Wasserstein Geodesic Principal Component Analysis of probability measures

Link: https://arxiv.org/abs/2506.04480
Authors: Nina Vesseron, Elsa Cazelles, Alice Le Brigant, Thierry Klein
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:This paper focuses on Geodesic Principal Component Analysis (GPCA) on a collection of probability distributions using the Otto-Wasserstein geometry. The goal is to identify geodesic curves in the space of probability measures that best capture the modes of variation of the underlying dataset. We first address the case of a collection of Gaussian distributions, and show how to lift the computations in the space of invertible linear maps. For the more general setting of absolutely continuous probability measures, we leverage a novel approach to parameterizing geodesics in Wasserstein space with neural networks. Finally, we compare to classical tangent PCA through various examples and provide illustrations on real-world datasets.

[LG-110] Optical Physics-Based Generative Models

Link: https://arxiv.org/abs/2506.04357
Authors: Amirreza Ahmadnejad, Somayyeh Koohi
Categories: Optics (physics.optics); Machine Learning (cs.LG)

Abstract:This paper establishes a comprehensive mathematical framework connecting optical physics equations to generative models, demonstrating how light propagation dynamics inspire powerful artificial intelligence approaches. We analyze six fundamental optical equations, comparing linear models (Helmholtz, dissipative wave, and Eikonal equations) with their nonlinear extensions incorporating Kerr effects, cubic-quintic nonlinearities, and intensity-dependent refractive indices. Our nonlinear optical models reveal remarkable capabilities through natural self-organization principles. The nonlinear Helmholtz model achieves 40-60% parameter reduction while maintaining superior mode separation via self-focusing phenomena. The cubic-quintic dissipative wave model prevents mode collapse through balanced attractive-repulsive interactions, enabling stable soliton formation with 20-40% improved coverage. The intensity-dependent Eikonal model creates adaptive pathways that dynamically respond to content, providing enhanced controllability in conditional generation. Experimental validation demonstrates consistent superiority over linear predecessors and traditional generative approaches. The nonlinear Helmholtz model achieves FID scores of 0.0089 versus 1.0909 for linear versions, while the cubic-quintic model reaches 0.0156 FID with exceptional stability. Memory usage drops 40-60% and training time improves 30-50% due to inherent nonlinear stability properties. The framework enables bidirectional benefits, advancing both generative AI and optical physics through novel approaches to soliton analysis, wavefront control, and refractive index reconstruction with 95% accuracy. This work reveals deep connections between physical self-organization and artificial intelligence, opening pathways toward efficient optical computing implementations.

[LG-111] BridgeNet: A Hybrid Physics-Informed Machine Learning Framework for Solving High-Dimensional Fokker-Planck Equations

Link: https://arxiv.org/abs/2506.04354
Authors: Elmira Mirzabeigi, Rezvan Salehi, Kourosh Parand
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Mathematical Physics (math-ph); Analysis of PDEs (math.AP)

Abstract:BridgeNet is a novel hybrid framework that integrates convolutional neural networks with physics-informed neural networks to efficiently solve non-linear, high-dimensional Fokker-Planck equations (FPEs). Traditional PINNs, which typically rely on fully connected architectures, often struggle to capture complex spatial hierarchies and enforce intricate boundary conditions. In contrast, BridgeNet leverages adaptive CNN layers for effective local feature extraction and incorporates a dynamically weighted loss function that rigorously enforces physical constraints. Extensive numerical experiments across various test cases demonstrate that BridgeNet not only achieves significantly lower error metrics and faster convergence compared to conventional PINN approaches but also maintains robust stability in high-dimensional settings. This work represents a substantial advancement in computational physics, offering a scalable and accurate solution methodology with promising applications in fields ranging from financial mathematics to complex system dynamics.

[LG-112] Interpretable LLMs for Credit Risk: A Systematic Review and Taxonomy

Link: https://arxiv.org/abs/2506.04290
Authors: Muhammed Golec, Maha AlabdulJalil
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Comments: 20 pages, 6 figures, preprint under review at Information Processing & Management

Abstract:Large Language Models (LLMs), which have developed in recent years, enable credit risk assessment through the analysis of financial texts such as analyst reports and corporate disclosures. This paper presents the first systematic review and taxonomy focusing on LLM-based approaches in credit risk estimation. Following the PRISMA research strategy, we selected 60 relevant papers published between 2020 and 2025, identified their basic model architectures, and examined the data used for scenarios such as credit default prediction and risk analysis. Since the main focus of the paper is interpretability, we classify concepts such as explainability mechanisms, chain-of-thought prompts and natural language justifications for LLM-based credit models. The taxonomy organizes the literature under four main headings: model architectures, data types, explainability mechanisms and application areas. Based on this analysis, we highlight the main future trends and research gaps for LLM-based credit scoring systems. This paper aims to be a reference paper for artificial intelligence and financial researchers.

[LG-113] Estimating properties of a homogeneous bounded soil using machine learning models

Link: https://arxiv.org/abs/2506.04256
Authors: Konstantinos Kalimeris, Leonidas Mindrinos, Nikolaos Pallikarakis
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 35 pages, 15 figures, 19 tables

Abstract:This work focuses on estimating soil properties from water moisture measurements. We consider simulated data generated by solving the initial-boundary value problem governing vertical infiltration in a homogeneous, bounded soil profile, with the usage of the Fokas method. To address the parameter identification problem, which is formulated as a two-output regression task, we explore various machine learning models. The performance of each model is assessed under different data conditions: full, noisy, and limited. Overall, the prediction of diffusivity D tends to be more accurate than that of hydraulic conductivity K. Among the models considered, Support Vector Machines (SVMs) and Neural Networks (NNs) demonstrate the highest robustness, achieving near-perfect accuracy and minimal errors.

[LG-114] What does making money have to do with crime?: A dive into the National Crime Victimization survey

Link: https://arxiv.org/abs/2506.04240
Authors: Sydney Anuyah
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)

Abstract:In this short article, I leverage the National Crime Victimization Survey from 1992 to 2022 to examine how income, education, employment, and key demographic factors shape the type of crime victims experience (violent vs. property). Using balanced classification splits and logistic regression models evaluated by F1-score, I first isolate the socioeconomic drivers of victimization (“Group A” models) and then introduce demographic controls such as age, gender, race, and marital status (“Group B” models). The results consistently show that higher income and education lower the odds of violent relative to property crime, while men, younger individuals, and racial minorities face disproportionately higher violent-crime risks. Geographically, the suburban models achieve the strongest predictive performance, with an accuracy of 0.607 and an F1 of 0.590; urban areas benefit from adding education and employment predictors; and crime in rural areas remains unpredictable using the current factors. The patterns found in this study show the need for specific interventions such as educational investments in metropolitan settings, economic support in rural communities, and demographic-aware prevention strategies.

Information Retrieval

[IR-0] On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools

Link: https://arxiv.org/abs/2506.05182
Authors: Shivani Upadhyay, Messiah Ataey, Shariyar Murtuza, Yifan Nie, Jimmy Lin
Categories: Information Retrieval (cs.IR)
Comments: 15 pages, 5 figures, 9 tables

Abstract:The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-dimensional plots, leading to hallucinations and erroneous outputs. This paper explores the capabilities of LLMs and MLLMs in understanding and answering questions from complex data structures found in PDF documents by leveraging industrial and open-source tools as part of a pre-processing pipeline. Our findings indicate that GPT-4o, a popular MLLM, achieves an accuracy of 56% on multi-structured documents when fed documents directly, and that integrating pre-processing tools raises the accuracy of LLMs to 61.3% for GPT-4o and 76% for GPT-4, and with lower overall cost. The code is publicly available at this https URL.

[IR-1] Rethinking Contrastive Learning in Session-based Recommendation

Link: https://arxiv.org/abs/2506.05044
Authors: Xiaokun Zhang, Bo Xu, Fenglong Ma, Zhizheng Wang, Liang Yang, Hongfei Lin
Categories: Information Retrieval (cs.IR)
Comments: This work has been accepted by Pattern Recognition

Abstract:Session-based recommendation aims to predict intents of anonymous users based on limited behaviors. With the ability in alleviating data sparsity, contrastive learning is prevailing in the task. However, we spot that existing contrastive learning based methods still suffer from three obstacles: (1) they overlook item-level sparsity and primarily focus on session-level sparsity; (2) they typically augment sessions using item IDs like crop, mask and reorder, failing to ensure the semantic consistency of augmented views; (3) they treat all positive-negative signals equally, without considering their varying utility. To this end, we propose a novel multi-modal adaptive contrastive learning framework called MACL for session-based recommendation. In MACL, a multi-modal augmentation is devised to generate semantically consistent views at both item and session levels by leveraging item multi-modal features. Besides, we present an adaptive contrastive loss that distinguishes varying contributions of positive-negative signals to improve self-supervised learning. Extensive experiments on three real-world datasets demonstrate the superiority of MACL over state-of-the-art methods.

[IR-2] PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation

Link: https://arxiv.org/abs/2506.04551
Authors: Chenglong Ma, Ziqi Xu, Yongli Ren, Danula Hettiachchi, Jeffrey Chan
Categories: Information Retrieval (cs.IR)

Abstract:Traditional offline evaluation methods for recommender systems struggle to capture the complexity of modern platforms due to sparse behavioural signals, noisy data, and limited modelling of user personality traits. While simulation frameworks can generate synthetic data to address these gaps, existing methods fail to replicate behavioural diversity, limiting their effectiveness. To overcome these challenges, we propose the Personality-driven User Behaviour Simulator (PUB), an LLM-based simulation framework that integrates the Big Five personality traits to model personalised user behaviour. PUB dynamically infers user personality from behavioural logs (e.g., ratings, reviews) and item metadata, then generates synthetic interactions that preserve statistical fidelity to real-world data. Experiments on the Amazon review datasets show that logs generated by PUB closely align with real user behaviour and reveal meaningful associations between personality traits and recommendation outcomes. These results highlight the potential of the personality-driven simulator to advance recommender system evaluation, offering scalable, controllable, high-fidelity alternatives to resource-intensive real-world experiments.

[IR-3] I'm Sorry Dave, I'm Afraid I Can't Return That: On YouTube Search API Use in Research

Link: https://arxiv.org/abs/2506.04422
Authors: Alexandros Efstratiou
Categories: Information Retrieval (cs.IR)
Comments: 9 pages, 4 figures

Abstract: YouTube is among the most widely used platforms worldwide and has received considerable academic attention in recent years. Despite its popularity and the number of studies conducted on it, much less is understood about the way in which YouTube’s Data API, and especially the Search endpoint, operates. In this paper, we analyze the API’s behavior by running identical queries across a period of 12 weeks. Our findings suggest that the search endpoint returns highly inconsistent results between queries in ways that are not officially documented. Specifically, the API seems to randomize returned videos based on the relative popularity of the respective topic during the query period, making it nearly impossible to obtain representative historical video samples, especially during non-peak topical periods. Our results also suggest that the API may prioritize shorter, more popular videos, although the role of channel popularity is not as clear. We conclude with suggested strategies for researchers using the API for data collection, as well as future research directions on expanding the API’s use-cases.
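
The abstract does not say which consistency measure the paper uses. As one plausible way to quantify how much the results of an identical query drift between runs, the sketch below computes the mean pairwise Jaccard similarity over the sets of returned video IDs. The video IDs and the helper names `jaccard` and `mean_pairwise_jaccard` are hypothetical, and no API call is made.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of video IDs (1.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over every pair of runs of the same query."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical video IDs returned for one identical query in three different weeks.
weekly_runs = [
    ["v1", "v2", "v3", "v4"],
    ["v1", "v2", "v5", "v6"],
    ["v1", "v7", "v8", "v9"],
]

consistency = mean_pairwise_jaccard(weekly_runs)
print(f"mean pairwise Jaccard: {consistency:.3f}")
```

A value near 1 would mean the endpoint returns nearly the same videos every time; a value near 0 would mean almost no overlap between runs.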

[IR-4] Prosthetics of the Indian State: The e-Shram Portal for Unorganized Workers in India

Link: https://arxiv.org/abs/2503.05714
Authors: Rozin Hasin
Categories: Computers and Society (cs.CY); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Systems and Control (eess.SY)

Abstract: This research paper examines the digital portal/database for unorganized workers in the informal sector economy of India today: e-Shram. Using affordance theory, I criticize the operationalization of this database for the labourers, alongside problems of accessibility and perception.
