This blog post presents the latest paper list retrieved from Arxiv.org on 2025-06-12. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is retrieved from Arxiv.org daily and updated automatically around 12:00 each morning.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-06-12)

A total of 520 papers were updated today, including:

  • Natural Language Processing: 93 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 151 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 137 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 170 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling

[Quick Read]: This paper addresses the bias exhibited by large language models (LLMs) when generating faithful samples from probability distributions, a problem that limits their use in tasks requiring reliable randomness. The key to the solution is Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about a proposed sample and decide whether to accept or reject it, substantially reducing sampling bias.

Link: https://arxiv.org/abs/2506.09998
Authors: Tim Z. Xiao, Johannes Zenn, Zhen Liu, Weiyang Liu, Robert Bamler, Bernhard Schölkopf
Affiliations: University of Tübingen; Max Planck Institute for Intelligent Systems, Tübingen; The Chinese University of Hong Kong, Shenzhen; The Chinese University of Hong Kong
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Technical Report v1 (21 pages, 14 figures)

Abstract:Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
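
For reference, VRS verbalizes the accept/reject loop of classical rejection sampling. A minimal numeric version of that loop for a Bernoulli target might look like the sketch below; in VRS the propose and accept steps are instead carried out in natural language by prompting the LLM, so this is only the underlying mechanism, not the paper's pipeline:

```python
import random

def rejection_sample_bernoulli(p_target: float, p_proposal: float = 0.5) -> int:
    """Draw one exact sample from Bernoulli(p_target) via classical rejection
    sampling, using Bernoulli(p_proposal) as the proposal distribution."""
    # M bounds the ratio target(x) / proposal(x) over x in {0, 1}.
    M = max(p_target / p_proposal, (1 - p_target) / (1 - p_proposal))
    while True:
        x = 1 if random.random() < p_proposal else 0           # propose
        target = p_target if x == 1 else 1 - p_target
        proposal = p_proposal if x == 1 else 1 - p_proposal
        if random.random() < target / (M * proposal):          # accept/reject
            return x

samples = [rejection_sample_bernoulli(0.3) for _ in range(10000)]
print(sum(samples) / len(samples))  # should be close to 0.3
```

Each accepted draw is an exact sample from the target; the paper's finding is that LLMs carry out this accept/reject reasoning more faithfully than they sample directly.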

[NLP-1] From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

[Quick Read]: This paper addresses the high service latency that safety moderation adds to large language model (LLM) services, since existing moderators judge harmfulness only from the complete output. The key to the solution is a data-and-model approach that natively supports partial detection: the FineHarm dataset and the Streaming Content Monitor (SCM) enable midway monitoring of generation and timely harmfulness judgments, reducing service latency while maintaining strong detection performance.

Link: https://arxiv.org/abs/2506.09996
Authors: Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao
Affiliations: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; The Chinese University of Hong Kong, Shenzhen
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 22 pages, 7 figures, and 9 tables

Abstract:Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations to provide reasonable supervision for token-level training. Then, we propose the streaming content monitor, which is trained with dual supervision of response- and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score that is comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO.
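
To illustrate the partial-detection idea, here is a minimal sketch of a streaming early-stop loop. The token generator, the per-prefix harm scorer, and the threshold are stand-in assumptions for the sketch, not the paper's API:

```python
def stream_with_monitor(generate_tokens, harm_score, threshold=0.5, min_tokens=8):
    """Emit tokens from an LLM stream and halt as soon as the monitor's
    harmfulness score for the partial response crosses a threshold."""
    emitted = []
    for token in generate_tokens():           # hypothetical token stream
        emitted.append(token)
        # Score only the prefix seen so far -- the partial-detection setting.
        if len(emitted) >= min_tokens and harm_score(emitted) > threshold:
            return emitted, "stopped: harmful"
    return emitted, "completed"
```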

[NLP-2] Large Language Models for Toxic Language Detection in Low-Resource Balkan Languages

[Quick Read]: This paper addresses the limited ability of large language models to detect toxic language in low-resource languages such as Serbian, Croatian, and Bosnian. The key to the solution is augmenting prompts with short context snippets (context-augmented mode), which raises recall by about 0.12 on average and notably improves F1 score in some cases, offering a more effective toxic-content detection strategy for low-resource language communities.

Link: https://arxiv.org/abs/2506.09992
Authors: Amel Muminovic, Amela Kadric Muminovic
Affiliations: International Balkan University; University of Belgrade
Subjects: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
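
As a rough illustration of the two evaluated modes, the difference amounts to whether a short context snippet accompanies the comment in the prompt. The exact wording used in the paper is not given here; these templates are invented for illustration:

```python
# Hypothetical prompt templates for the two evaluated modes.
ZERO_SHOT = (
    "Is the following comment toxic? Answer yes or no.\n"
    "Comment: {comment}"
)
CONTEXT_AUGMENTED = (
    "Is the following comment toxic? Answer yes or no.\n"
    "Context (video title/description): {context}\n"
    "Comment: {comment}"
)
```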

[NLP-3] Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs

[Quick Read]: This paper addresses the challenge of getting large language models (LLMs) to produce structurally valid and accurate outputs for dependency parsing, a task where standard prompting performs poorly, especially on complex syntactic structures. The key to the solution is a step-by-step instruction strategy in which part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, combined with a simplified CoNLL-U output format. The approach achieves state-of-the-art accuracy on Universal Dependencies treebanks across 17 languages while avoiding hallucination and contamination.

Link: https://arxiv.org/abs/2506.09983
Authors: Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
Affiliations: Megagon Labs, Tokyo; Recruit Co., Ltd.; National Institute for Japanese Language and Linguistics
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, accepted for SyntaxFest 2025

Abstract:Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. Our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
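
A sketch of what such a step-by-step prompt and tabular output could look like; the paper's exact wording and column layout may differ:

```python
# Hypothetical prompt skeleton for step-by-step dependency parsing.
PROMPT = """Parse the sentence step by step.
Step 1: Assign a Universal POS tag to each token.
Step 2: For each token, predict its syntactic head (0 = root) and dependency label.
Output one token per line as: ID  FORM  UPOS  HEAD  DEPREL

Sentence: The cat sleeps
"""

# A well-formed answer in the simplified CoNLL-U-like table would be:
# 1  The     DET   2  det
# 2  cat     NOUN  3  nsubj
# 3  sleeps  VERB  0  root
```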

[NLP-4] When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text ACL

[Quick Read]: This paper addresses the problem of detecting AI-generated text on social media, which is complicated by short text lengths and informal, idiosyncratic language. The key to the solution is adopting the mindset of a reasonably sophisticated threat actor and building a dataset of 505,159 AI-generated social media posts covering 11 controversial topics. The study finds that under the realistic assumption that an attacker does not release their fine-tuned model, detectability drops dramatically; this is confirmed by a human study, and ablations reveal the vulnerability of various detection algorithms to fine-tuned LLMs.

Link: https://arxiv.org/abs/2506.09975
Authors: Hillary Dawkins, Kathleen C. Fraser, Svetlana Kiritchenko
Affiliations: National Research Council Canada
Subjects: Computation and Language (cs.CL)
Comments: to appear in ACL Findings

Abstract:Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.

[NLP-5] Resa: Transparent Reasoning Models via SAEs

[Quick Read]: This paper asks how to elicit strong reasoning abilities in language models cost-effectively. The key to the solution is the Resa family of models, trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure: an SAE is first trained on a source model to capture reasoning abilities, and the trained SAE then guides a standard supervised fine-tuning process to elicit those abilities in a target model, using only verified question-answer data without any reasoning traces.

Link: https://arxiv.org/abs/2506.09967
Authors: Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, Willie Neiswanger
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains 97% of its RL-trained counterpart’s reasoning performance while reducing training costs by 2000x to roughly $1 and training time by 450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
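
As background, the sparse autoencoders used in this line of work are typically a single-hidden-layer autoencoder over LLM activations with an L1 sparsity penalty. A generic PyTorch sketch follows; Resa's exact architecture and hyperparameters may differ:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic sparse autoencoder over model activations of width d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, h):
        f = torch.relu(self.enc(h))   # sparse feature activations
        return self.dec(f), f         # reconstruction and features

def sae_loss(h, h_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that induces sparse features.
    return ((h - h_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```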

[NLP-6] Outside Knowledge Conversational Video (OKCV) Dataset – Dialoguing over Videos

[Quick Read]: This paper addresses video-grounded visual dialogue, where a model must recognize visual details that change over time, incorporate external knowledge to answer questions, and account for the full conversation context. The key to the solution is a dataset of 2,017 videos with 5,986 human-annotated dialogues comprising 40,954 interleaved dialogue turns, which requires models both to locate relevant video segments and to draw on external knowledge while conversing.

Link: https://arxiv.org/abs/2506.09953
Authors: Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
Affiliations: Georgia Institute of Technology; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: this https URL.

[NLP-7] Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

[Quick Read]: This paper addresses efficient retrieval of salient information in long-context language models (LCMs), where conventional methods fall short on long inputs, particularly for multi-hop reasoning. The key to the solution is QRHEAD (Query-Focused Retrieval Head), which identifies attention heads with strong retrieval ability by aggregating attention scores with respect to the input query, together with QR-RETRIEVER, which uses the accumulated attention mass of QRHEAD as retrieval scores to select and reason over the most relevant parts of a long context.

Link: https://arxiv.org/abs/2506.09944
Authors: Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye
Affiliations: Princeton Language and Intelligence, Princeton University; The University of Texas at Austin
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QR-RETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QR-RETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QR-RETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
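
A rough sketch of the scoring idea, under assumed tensor shapes (the paper's exact aggregation may differ): sum, over the identified query-focused heads, the attention mass flowing from query tokens onto each candidate context chunk, and rank chunks by that score:

```python
import numpy as np

def qr_retrieval_scores(attn, qr_heads, query_positions, chunk_spans):
    """attn: array [n_layers, n_heads, n_tokens, n_tokens] of attention weights;
    qr_heads: (layer, head) pairs identified as query-focused retrieval heads;
    each chunk is scored by the attention mass from query tokens to its tokens."""
    scores = []
    for start, end in chunk_spans:
        mass = 0.0
        for layer, head in qr_heads:
            block = attn[layer, head][np.ix_(query_positions, range(start, end))]
            mass += block.sum()
        scores.append(mass)
    return scores  # rank chunks by descending score for retrieval
```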

[NLP-8] VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

[Quick Read]: This paper addresses the lack of best practices for reinforcement learning (RL) in instruction following, particularly how to perform verification effectively in RL to improve large language models (LLMs). The key to the solution is VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B), supported by VerInstruct, a high-quality instruction-following dataset built for this approach. Applying RL training with VerIF to two models yields significant improvements across several representative instruction-following benchmarks.

Link: https://arxiv.org/abs/2506.09942
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at this https URL.
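
A minimal sketch of the hybrid verification idea: hard, checkable constraints are verified with code, while soft constraints go to an LLM judge. The word-limit pattern and the `llm_judge` interface below are assumptions for illustration, not the paper's implementation:

```python
import re

def verify_response(instruction, response, llm_judge):
    """Combine rule-based and LLM-based checks into one binary reward signal."""
    checks = []
    # Rule-based check: a verifiable constraint such as a word limit.
    m = re.search(r"at most (\d+) words", instruction)
    if m:
        checks.append(len(response.split()) <= int(m.group(1)))
    # LLM-based check: soft constraints (style, coverage, tone, ...).
    verdict = llm_judge(
        "Does the response satisfy the instruction? Answer yes or no.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    checks.append(verdict.strip().lower().startswith("yes"))
    return all(checks)  # reward for RL training
```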

[NLP-9] Aspect-Based Opinion Summarization with Argumentation Schemes

[Quick Read]: This paper addresses the difficulty online shoppers face in manually processing large numbers of reviews and summarizing the prominent opinions, which motivates automated opinion summarization systems. Existing extractive and abstractive methods struggle to automatically produce grounded aspect-centric summaries. The key to the solution is ASESUM, a summarization framework that extracts aspect-centric arguments for a product's critical aspects and measures their salience and validity, thereby capturing predominant opinions with supporting evidence and adapting to varying domains without relying on a pre-defined set of aspects.

Link: https://arxiv.org/abs/2506.09917
Authors: Wendi Zhou, Ameer Saadat-Yazd, Nadin Kokciyan
Affiliations: University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ArgMining 2025

Abstract:Reviews are valuable resources for customers making purchase decisions in online shopping. However, it is impractical for customers to go over the vast number of reviews and manually conclude the prominent opinions, which prompts the need for automated opinion summarization systems. Previous approaches, either extractive or abstractive, face challenges in automatically producing grounded aspect-centric summaries. In this paper, we propose a novel summarization system that not only captures predominant opinions from an aspect perspective with supporting evidence, but also adapts to varying domains without relying on a pre-defined set of aspects. Our proposed framework, ASESUM, summarizes viewpoints relevant to the critical aspects of a product by extracting aspect-centric arguments and measuring their salience and validity. We conduct experiments on a real-world dataset to demonstrate the superiority of our approach in capturing diverse perspectives of the original reviews compared to new and existing methods.

[NLP-10] PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants ACL2025

[Quick Read]: This paper addresses the lack of systematic evaluation of how well conversational AI assistants personalize to individual user preferences while completing tasks, a gap not covered by existing benchmarks, which focus on chit-chat, non-conversational tasks, or narrow domains and fail to capture the complexity of personalized task-oriented assistance. The proposed solution is PersonaLens, a comprehensive benchmark featuring diverse user profiles with rich preferences and interaction histories, plus two specialized LLM-based agents: a user agent that conducts realistic task-oriented dialogues with AI assistants, and a judge agent that applies the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. The key lies in constructing realistic, complex evaluation settings that accurately measure assistants' personalization capabilities.

Link: https://arxiv.org/abs/2506.09902
Authors: Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz
Affiliations: University of Edinburgh; Amazon; University College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025 Findings

Abstract:Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization–adapting to individual user preferences while completing tasks–remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.

[NLP-11] The Emergence of Abstract Thought in Large Language Models Beyond Any Language

[Quick Read]: This paper addresses the tension between the assumption that large language models (LLMs) "think in English" and their strong multilingual performance. The key finding is that LLMs progressively develop a core language-agnostic parameter space: a remarkably small subset of parameters whose deactivation significantly degrades performance across all languages. This space underlies the model's ability to generalize beyond individual languages and supports the emergence of abstract thought. The study further identifies language-related neurons and categorizes them as shared (active across multiple languages) or exclusive (specific to one language); as models evolve, shared neurons grow in both proportion and functional importance, forming the backbone of the language-agnostic parameter space. Building on these insights, the paper proposes neuron-specific training strategies tailored to LLMs' language-agnostic levels at different developmental stages.

Link: https://arxiv.org/abs/2506.09890
Authors: Yuxin Chen, Yiran Zhao, Yang Zhang, An Zhang, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Tat-Seng Chua, Michael Qizhe Shieh, Wenxuan Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) continue to advance, their capacity to function effectively across a diverse range of languages has shown marked improvement. Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts. This has led to the widespread assumption that LLMs may “think” in English. However, more recent results showing strong multilingual performance, even surpassing English performance on specific tasks in other languages, challenge this view. In this work, we find that LLMs progressively develop a core language-agnostic parameter space-a remarkably small subset of parameters whose deactivation results in significant performance degradation across all languages. This compact yet critical set of parameters underlies the model’s ability to generalize beyond individual languages, supporting the emergence of abstract thought that is not tied to any specific linguistic system. Specifically, we identify language-related neurons-those are consistently activated during the processing of particular languages, and categorize them as either shared (active across multiple languages) or exclusive (specific to one). As LLMs undergo continued development over time, we observe a marked increase in both the proportion and functional importance of shared neurons, while exclusive neurons progressively diminish in influence. These shared neurons constitute the backbone of the core language-agnostic parameter space, supporting the emergence of abstract thought. Motivated by these insights, we propose neuron-specific training strategies tailored to LLMs’ language-agnostic levels at different development stages. Experiments across diverse LLM families support our approach.

[NLP-12] Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

[Quick Read]: This paper addresses hallucination detection in large language models (LLMs), i.e., identifying generated content that is unfaithful or ungrounded. The key to the solution is analyzing the probabilistic divergence between the hidden-state distributions of the prompt and the response: hallucinated responses deviate less from their prompts than grounded ones, suggesting that hallucinations often stem from superficial rephrasing rather than substantive reasoning. The method is model-intrinsic, using distributional distances as principled hallucination scores without external knowledge or auxiliary models, and employs deep learnable kernels to capture nuanced geometric differences between distributions.

Link: https://arxiv.org/abs/2506.09886
Authors: Rodion Oblovatny, Alexandra Bazarova, Alexey Zaytsev
Affiliations: St. Petersburg State University; Skoltech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel approach for detecting hallucinations in large language models (LLMs) by analyzing the probabilistic divergence between prompt and response hidden-state distributions. Counterintuitively, we find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses, suggesting that hallucinations often arise from superficial rephrasing rather than substantive reasoning. Leveraging this insight, we propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores, eliminating the need for external knowledge or auxiliary models. To enhance sensitivity, we employ deep learnable kernels that automatically adapt to capture nuanced geometric differences between distributions. Our approach outperforms existing baselines, demonstrating state-of-the-art performance on several benchmarks. The method remains competitive even without kernel training, offering a robust, scalable solution for hallucination detection.
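
A fixed-kernel sketch of the distributional-distance score (the paper learns deep kernels; a fixed RBF kernel stands in here for illustration). Note the counterintuitive direction: a lower distance between prompt and response hidden states is the hallucination signal:

```python
import torch

def mmd_score(prompt_h, response_h, sigma=1.0):
    """Squared MMD between prompt and response hidden-state sets
    ([n, d] and [m, d]) under a fixed RBF kernel."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (rbf(prompt_h, prompt_h).mean()
            + rbf(response_h, response_h).mean()
            - 2 * rbf(prompt_h, response_h).mean())
```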

[NLP-13] Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

[Quick Read]: This paper targets two core issues of Chain-of-Thought (CoT) prompting in large language models (LLMs): sufficiency and necessity. Sufficiency ensures that the generated intermediate steps comprehensively cover and support the final conclusion, while necessity identifies the steps that are truly indispensable for the soundness of the answer. The key to the solution is a causal framework incorporating the causal probabilities of sufficiency and necessity, which not only determines which reasoning steps are logically sufficient or necessary for the predicted outcome, but also quantifies their actual influence on the final result under different intervention scenarios, enabling automated addition of missing steps and pruning of redundant ones.

Link: https://arxiv.org/abs/2506.09853
Authors: Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, Mengyue Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Abstract:Chain-of-Thought (CoT) prompting plays an indispensable role in endowing large language models (LLMs) with complex reasoning capabilities. However, CoT currently faces two fundamental challenges: (1) Sufficiency, which ensures that the generated intermediate inference steps comprehensively cover and substantiate the final conclusion; and (2) Necessity, which identifies the inference steps that are truly indispensable for the soundness of the resulting answer. We propose a causal framework that characterizes CoT reasoning through the dual lenses of sufficiency and necessity. Incorporating causal Probability of Sufficiency and Necessity allows us not only to determine which steps are logically sufficient or necessary to the prediction outcome, but also to quantify their actual influence on the final reasoning outcome under different intervention scenarios, thereby enabling the automated addition of missing steps and the pruning of redundant ones. Extensive experimental results on various mathematical and commonsense reasoning benchmarks confirm substantial improvements in reasoning efficiency and reduced token usage without sacrificing accuracy. Our work provides a promising direction for improving LLM reasoning performance and cost-effectiveness.
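
For reference, the underlying quantities are Pearl's probabilities of necessity and sufficiency. Treating the inclusion of a reasoning step as the treatment X and the correctness of the final answer as the outcome Y (an illustrative mapping, not necessarily the paper's exact formulation):

```latex
% Probability of Necessity: given the step was present and the answer correct,
% would removing the step break the answer?
\mathrm{PN} = P\!\left(Y_{X=0} = 0 \mid X = 1,\; Y = 1\right)

% Probability of Sufficiency: given the step was absent and the answer wrong,
% would adding the step fix the answer?
\mathrm{PS} = P\!\left(Y_{X=1} = 1 \mid X = 0,\; Y = 0\right)
```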

[NLP-14] Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

[Quick Read]: This paper addresses out-of-context and misattributed imagery, the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods typically only check whether the image semantics match the text narrative, missing manipulation whenever the depicted objects or scenes are somewhat related to the narrative. The key to the solution is the News Media Provenance Dataset, a collection of news articles with provenance-tagged images, on which two tasks are defined: Location of Origin Relevance (LOR) and Date and Time of Origin Relevance (DTOR), with baseline experiments on six large language models (LLMs) to explore more effective detection methods.

Link: https://arxiv.org/abs/2506.09847
Authors: Tomas Peterka, Matyas Bohacek
Affiliations: Gymnazium Jana Keplera (Jana Kepler Gymnasium); Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:Out-of-context and misattributed imagery is the leading form of media manipulation in today’s misinformation and disinformation landscape. The existing methods attempting to detect this practice often only consider whether the semantics of the imagery corresponds to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We identify that, while the zero-shot performance on LOR is promising, the performance on DTOR hinders, leaving room for specialized architectures and future work.

[NLP-15] Error-Guided Pose Augmentation: Enhancing Rehabilitation Exercise Assessment through Targeted Data Generation

[Quick Read]: This paper addresses data imbalance and the difficulty of detecting subtle movement errors when monitoring patient progress in home-based rehabilitation. The key to the solution is Error-Guided Pose Augmentation (EGPA), which generates synthetic skeleton data by simulating clinically relevant movement errors, enriching dataset diversity and representativeness; combined with an attention-based graph convolutional network, it improves performance across multiple evaluation metrics.

Link: https://arxiv.org/abs/2506.09833
Authors: Omar Sherif, Ali Hamdi
Affiliations: MSA University
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure. To appear in Intelligent Methods, Systems, and Applications 2025

Abstract:Effective rehabilitation assessment is essential for monitoring patient progress, particularly in home-based settings. Existing systems often face challenges such as data imbalance and difficulty detecting subtle movement errors. This paper introduces Error-Guided Pose Augmentation (EGPA), a method that generates synthetic skeleton data by simulating clinically relevant movement mistakes. Unlike standard augmentation techniques, EGPA targets biomechanical errors observed in rehabilitation. Combined with an attention-based graph convolutional network, EGPA improves performance across multiple evaluation metrics. Experiments demonstrate reductions in mean absolute error of up to 27.6 percent and gains in error classification accuracy of 45.8 percent. Attention visualizations show that the model learns to focus on clinically significant joints and movement phases, enhancing both accuracy and interpretability. EGPA offers a promising approach for improving automated movement quality assessment in both clinical and home-based rehabilitation contexts.

[NLP-16] EmoNet-Voice: A Fine-Grained Expert-Verified Benchmark for Speech Emotion Detection

[Quick Read]: This paper addresses the limitations of existing speech emotion recognition (SER) datasets in emotional granularity, privacy protection, and reliance on acted portrayals. The key to the solution is the EmoNet-Voice resource, comprising the large-scale pre-training dataset EmoNet-Voice Big and the expert-annotated benchmark EmoNet-Voice Bench: state-of-the-art voice generation is used to synthesize audio snippets simulating scenes designed to evoke specific emotions, and psychology experts rigorously validate them with perceived intensity labels, enabling privacy-preserving modeling of sensitive emotional states.

Link: https://arxiv.org/abs/2506.09827
Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
Affiliations: LAION e.V.; Technical University of Munich; L3S Research Center; Leibniz University of Hannover; TU Darmstadt; Hessian.AI; Ontocord; Centre for Cognitive Science; DFKI; TIB–Leibniz Information Centre for Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The advancement of text-to-speech and audio generation models necessitates robust benchmarks for evaluating the emotional understanding capabilities of AI systems. Current speech emotion recognition (SER) datasets often exhibit limitations in emotional granularity, privacy concerns, or reliance on acted portrayals. This paper introduces EmoNet-Voice, a new resource for speech emotion detection, which includes EmoNet-Voice Big, a large-scale pre-training dataset (featuring over 4,500 hours of speech across 11 voices, 40 emotions, and 4 languages), and EmoNet-Voice Bench, a novel benchmark dataset with human expert annotations. EmoNet-Voice is designed to evaluate SER models on a fine-grained spectrum of 40 emotion categories with different levels of intensities. Leveraging state-of-the-art voice generation, we curated synthetic audio snippets simulating actors portraying scenes designed to evoke specific emotions. Crucially, we conducted rigorous validation by psychology experts who assigned perceived intensity labels. This synthetic, privacy-preserving approach allows for the inclusion of sensitive emotional states often absent in existing datasets. Lastly, we introduce Empathic Insight Voice models that set a new standard in speech emotion recognition with high agreement with human experts. Our evaluations across the current model landscape exhibit valuable findings, such as high-arousal emotions like anger being much easier to detect than low-arousal states like concentration.

[NLP-17] CoRT: Code-integrated Reasoning within Thinking

[Quick Read]: This paper addresses the inefficiency or inaccuracy of large reasoning models (LRMs) when handling complex mathematical operations. The key to the solution is CoRT, a post-training framework that synthesizes code-integrated reasoning data via Hint-Engineering to optimize the interaction between the model and a Code Interpreter (CI), improving the model's ability to use the CI effectively and efficiently.

Link: https://arxiv.org/abs/2506.09820
Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: work in progress

Abstract:Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model’s internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at this https URL.

[NLP-18] Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? ACL2025

[Quick Read]: This paper addresses the difficulty of efficiently obtaining test-item response behavior in educational assessment, aiming to accelerate test development by using large language models (LLMs) to simulate human test takers. The key to the solution is evaluating whether LLM-generated responses are human-like, using psychometric frameworks (classical test theory and item response theory) to judge their suitability for educational assessment. The study finds that although larger models are excessively confident when uncalibrated, calibration via temperature scaling brings their response distributions closer to human behavior; however, the overall correlations are weak, indicating that LLMs are not yet suitable for piloting educational assessments in a zero-shot setting.

Link: https://arxiv.org/abs/2506.09796
Authors: Andreas Säuberli, Diego Frassinelli, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication at the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at ACL 2025

Abstract:Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
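
Temperature scaling, the calibration mentioned here, simply divides the logits by a scalar T > 1 before the softmax, softening an overconfident distribution over answer options. A quick numeric illustration with invented values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Dividing logits by T > 1 softens an overconfident answer distribution.
logits = np.array([6.0, 2.0, 1.0, 0.5])   # invented 4-option item
print(softmax(logits))                    # sharp, overconfident
print(softmax(logits / 2.5))              # softer, closer to a human spread
```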

[NLP-19] ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

[Quick Read]: This paper addresses the high expertise required to build effective workflows on platforms like ComfyUI, where users must orchestrate numerous specialized components. The key to the solution is ComfyUI-R1, the first large reasoning model for automated workflow generation, trained via a two-stage framework: chain-of-thought (CoT) fine-tuning to adapt the model to the ComfyUI domain, followed by reinforcement learning with a fine-grained rule-metric hybrid reward that incentivizes reasoning and ensures format validity, structural integrity, and node-level fidelity.

Link: https://arxiv.org/abs/2506.09790
Authors: Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Alibaba International Digital Commerce
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: Work in progress. Try it out in ComfyUI-Copilot this https URL

Abstract:AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.

[NLP-20] Adding simple structure at inference improves Vision-Language Compositionality

[Quick Read]: This paper addresses the weak compositionality of dual-encoder vision-language models (VLMs): models such as CLIP exhibit bag-of-words-like behavior in image-text retrieval, limiting retrieval performance. The key to the solution is adding simple structure at inference time: given an image and a caption, the image is divided into smaller crops; text segments capturing objects, attributes, and relations are extracted; the VLM matches image crops to text segments; and the final image-text similarity is computed by aggregating the individual similarities of the matches. The method requires no additional training and consistently improves existing VLMs, with especially strong gains on attribute-object binding.

Link: https://arxiv.org/abs/2506.09691
Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
Affiliations: HiTZ Center – Ixa, University of the Basque Country (UPV/EHU)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.
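
The aggregation step can be sketched in a few lines. The matching rule below (best crop per text segment, then a mean over segments) is an assumption for illustration, and embeddings are presumed L2-normalized by the dual-encoder VLM:

```python
import numpy as np

def structured_similarity(crop_embs, segment_embs):
    """crop_embs: [C, d] embeddings of image crops; segment_embs: [S, d]
    embeddings of text segments (objects, attributes, relations).
    Match each segment to its best-aligned crop, then average over segments."""
    sims = segment_embs @ crop_embs.T    # [S, C] cosine similarities
    return sims.max(axis=1).mean()       # aggregated image-text score
```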

[NLP-21] Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

[Quick Read]: This paper addresses uncertainty quantification (UQ) for the reliable deployment of large language models (LLMs), noting that existing UQ methods are often heuristic and lack a probabilistic foundation. The key to the solution is a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating, through systematic perturbations, the diversity of the input space conditioned on a given output, and defines a new uncertainty measure, Inv-Entropy. The framework is highly flexible, supporting various uncertainty measures, embeddings, perturbation strategies, and similarity metrics, and introduces GAAP, a genetic-algorithm-based perturbation algorithm that enhances the diversity of sampled inputs.

Link: https://arxiv.org/abs/2506.09684
Authors: Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar
Affiliations: University of Michigan; Peking University; Northwestern University; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at this https URL.

[NLP-22] Is Fine-Tuning an Effective Solution? Reassessing Knowledge Editing for Unstructured Data

[Quick Read]: This paper addresses two key issues in unstructured knowledge editing (UKE): the lack of locality evaluation for post-edited models, and the abnormal failure of fine-tuning (FT)-based methods on UKE tasks. The key to the solution is two extended datasets, UnKEBench-Loc and AKEW-Loc (CF), which enable systematic evaluation of post-edited models' locality, together with an analysis of four factors affecting FT-based methods that yields a training recipe for UKE. The resulting approach substantially improves FT-based methods, with particularly strong advantages in batch-editing scenarios.

Link: https://arxiv.org/abs/2506.09672
Authors: Hao Xiong, Chuanyuan Tan, Wenliang Chen
Affiliations: Soochow University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unstructured Knowledge Editing (UKE) is crucial for updating the relevant knowledge of large language models (LLMs). It focuses on unstructured inputs, such as long or free-form texts, which are common forms of real-world knowledge. Although previous studies have proposed effective methods and tested them, some issues exist: (1) Lack of Locality evaluation for UKE, and (2) Abnormal failure of fine-tuning (FT) based methods for UKE. To address these issues, we first construct two datasets, UnKEBench-Loc and AKEW-Loc (CF), by extending two existing UKE datasets with locality test data from the unstructured and structured views. This enables a systematic evaluation of the Locality of post-edited models. Furthermore, we identify four factors that may affect the performance of FT-based methods. Based on these factors, we conduct experiments to determine how the well-performing FT-based methods should be trained for the UKE task, providing a training recipe for future research. Our experimental results indicate that the FT-based method with the optimal setting (FT-UKE) is surprisingly strong, outperforming the existing state-of-the-art (SOTA). In batch editing scenarios, FT-UKE shows strong performance as well, with its advantage over SOTA methods increasing as the batch size grows, expanding the average metric lead from +6.78% to +10.80%.

[NLP-23] Query-Level Uncertainty in Large Language Models

[Quick Read]: This paper addresses the lack of knowledge-boundary awareness in large language models when facing unknown queries, i.e., the mechanism for identifying known versus unknown queries. The key to the solution is a novel training-free method, Internal Confidence, which detects query-level uncertainty through self-evaluations across layers and tokens, determining whether the model can address a given query without generating any tokens.

Link: https://arxiv.org/abs/2506.09669
Authors: Lihu Chen, Gaël Varoquaux
Affiliations: Imperial College London, UK; Soda, Inria Saclay, France
Subjects: Computation and Language (cs.CL)
Comments: In Progress

Abstract:It is important for Large Language Models to be aware of the boundary of their knowledge, the mechanism of identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine if the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called \emphInternal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which is able to reduce inference costs while maintaining performance.

[NLP-24] Intent Factored Generation: Unleashing the Diversity in Your Language Model

[Quick Read]: This paper addresses the challenge of obtaining multiple meaningfully diverse, high-quality samples from a large language model for a fixed prompt; current methods typically increase diversity only at the token level, leading to poor exploration on reasoning problems and unengaging conversational agents. The key to the solution is Intent Factored Generation (IFG), which splits sampling into two stages: first sampling a semantically dense intent (e.g., a summary or keywords), then generating the final response conditioned on both the original prompt and the intent. A higher temperature in the intent stage promotes conceptual diversity, while a lower temperature in the final stage keeps outputs coherent and self-consistent. The paper also finds that having the model explicitly state its intent before generating each chain-of-thought step helps on reasoning tasks.

Link: https://arxiv.org/abs/2506.09659
Authors: Eltayeb Ahmed, Uljad Berdica, Martha Elliott, Danijela Horak, Jakob N. Foerster
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Obtaining multiple meaningfully diverse, high quality samples from Large Language Models for a fixed prompt remains an open challenge. Current methods for increasing diversity often only operate at the token-level, paraphrasing the same response. This is problematic because it leads to poor exploration on reasoning problems and to unengaging, repetitive conversational agents. To address this we propose Intent Factored Generation (IFG), factorising the sampling process into two stages. First, we sample a semantically dense intent, e.g., a summary or keywords. Second, we sample the final response conditioning on both the original prompt and the intent from the first stage. This allows us to use a higher temperature during the intent step to promote conceptual diversity, and a lower temperature during the final generation to ensure the outputs are coherent and self-consistent. Additionally, we find that prompting the model to explicitly state its intent for each step of the chain-of-thought before generating the step is beneficial for reasoning tasks. We demonstrate our method’s effectiveness across a diverse set of tasks. We show this method improves both pass@k and Reinforcement Learning from Verifier Feedback on maths and code tasks. For instruction-tuning, we combine IFG with Direct Preference Optimisation to increase conversational diversity without sacrificing reward. Finally, we achieve higher diversity while maintaining the quality of generations on a general language modelling task, using a new dataset of reader comments and news articles that we collect and open-source. In summary, we present a simple method of increasing the sample diversity of LLMs while maintaining performance. This method can be implemented by changing the prompt and varying the temperature during generation, making it easy to integrate into many algorithms for gains across various applications.
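
A minimal sketch of the two-stage sampling; `llm(text, temperature)` is a hypothetical completion function and the prompt wording is invented for illustration:

```python
def intent_factored_generate(llm, prompt, t_intent=1.2, t_final=0.7):
    """Two-stage IFG-style sampling with a hot intent step and a cool final step."""
    # Stage 1: high temperature over a short, semantically dense intent
    # promotes conceptual diversity across samples.
    intent = llm(prompt + "\n\nFirst, give a one-line intent "
                 "(keywords or a summary) for your answer.",
                 temperature=t_intent)
    # Stage 2: low temperature, conditioned on prompt + intent, keeps the
    # final response coherent and self-consistent.
    return llm(prompt + "\n\nIntent: " + intent +
               "\n\nNow write the full answer.",
               temperature=t_final)
```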

[NLP-25] Bridging the Gap Between Open-Source and Proprietary LLMs in Table QA ACL2025 SEMEVAL-2025

[Quick Read]: This paper addresses question answering (QA) over tabular data, whose core challenge is accurately extracting information from structured data and producing precise answers. The key to the solution is integrating several components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG), all orchestrated by a large language model (LLM) to form an end-to-end (E2E) pipeline.

Link: https://arxiv.org/abs/2506.09657
Authors: Nikolas Evkarpidi, Elena Tutubalina
Affiliations: HSE University; AIRI (AI Research Institute); Sber AI; Kazan Federal University
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication at the 19th International Workshop on Semantic Evaluation (SemEval-2025), to be held in conjunction with ACL 2025. 15 pages, 5 figures

Abstract:This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and a retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves a performance comparable to proprietary LLMs in QA tasks over tables. The code is available at GitHub repository.

[NLP-26] Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering

[Quick Read]: This paper addresses the limited generalization of graph-based retrieval methods in knowledge graph question answering (KGQA), as well as the limited interpretability and structured reasoning of conventional retrieval-augmented generation (RAG) pipelines that rely on unstructured text. The key to the solution is the RAPL framework, which improves graph retrieval in three ways: a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; a model-agnostic graph transformation that captures intra- and inter-triple interactions to enhance representational capacity; and a path-based reasoning strategy that facilitates learning from the injected rational knowledge and supports the downstream reasoner through structured inputs.

Link: https://arxiv.org/abs/2506.09645
Authors: Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Peking University; Georgia Institute of Technology; The University of Sydney; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 32 pages, 28 figures

Abstract:Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66%-20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: this https URL.

[NLP-27] Using Sign Language Production as Data Augmentation to enhance Sign Language Translation

[Quick Read]: This paper addresses data scarcity in sign language translation, where sign language resources are far smaller than those of spoken languages. The key to the solution is leveraging generative advances in Sign Language Production, via three techniques (a skeleton-based production approach, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat), to augment existing sign language datasets and improve the accuracy and robustness of Sign Language Translation models.

Link: https://arxiv.org/abs/2506.09643
Authors: Harry Walsh, Maksym Ivashechkin, Richard Bowden
Affiliations: University of Surrey, Guildford, United Kingdom
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Machine learning models fundamentally rely on large quantities of high-quality data. Collecting the necessary data for these models can be challenging due to cost, scarcity, and privacy restrictions. Signed languages are visual languages used by the deaf community and are considered low-resource languages. Sign language datasets are often orders of magnitude smaller than their spoken language counterparts. Sign Language Production is the task of generating sign language videos from spoken language sentences, while Sign Language Translation is the reverse translation task. Here, we propose leveraging recent advancements in Sign Language Production to augment existing sign language datasets and enhance the performance of Sign Language Translation models. For this, we utilize three techniques: a skeleton-based approach to production, sign stitching, and two photo-realistic generative models, SignGAN and SignSplat. We evaluate the effectiveness of these techniques in enhancing the performance of Sign Language Translation models by generating variation in the signer’s appearance and the motion of the skeletal data. Our results demonstrate that the proposed methods can effectively augment existing datasets and enhance the performance of Sign Language Translation models by up to 19%, paving the way for more robust and accurate Sign Language Translation systems, even in resource-constrained environments.

[NLP-28] Modeling Probabilistic Reduction using Information Theory and Naive Discriminative Learning INTERSPEECH2025

[Quick Read]: This paper compares probabilistic predictors based on information theory with Naive Discriminative Learning (NDL) predictors for modeling spoken word duration, focusing on probabilistic reduction. The key lies in improving NDL predictors with information-theoretic formulas and comparing them against traditional NDL and N-gram probability models; the results show that combining information-theoretic metrics of predictability with information derived from discriminative learning improves model performance, and that modeling should incorporate not only frequency and contextual predictability but also average contextual predictability.

Link: https://arxiv.org/abs/2506.09641
Authors: Anna Stein, Kevin Tang
Affiliations: Heinrich Heine University Düsseldorf
Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: Submitted to Interspeech 2025

Abstract:This study compares probabilistic predictors based on information theory with Naive Discriminative Learning (NDL) predictors in modeling acoustic word duration, focusing on probabilistic reduction. We examine three models using the Buckeye corpus: one with NDL-derived predictors using information-theoretic formulas, one with traditional NDL predictors, and one with N-gram probabilistic predictors. Results show that the N-gram model outperforms both NDL models, challenging the assumption that NDL is more effective due to its cognitive motivation. However, incorporating information-theoretic formulas into NDL improves model performance over the traditional model. This research highlights a) the need to incorporate not only frequency and contextual predictability but also average contextual predictability, and b) the importance of combining information-theoretic metrics of predictability and information derived from discriminative learning in modeling acoustic reduction.
zh
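
作为补充,下面用一个极小语料演示如何由 N-gram 条件概率计算词的 surprisal(上下文可预测性的信息论度量)。语料与变量名均为演示用假设,并非论文实际使用的 Buckeye 语料与特征集。

```python
# 由 bigram 条件概率计算 surprisal 的最小示意
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # 假设的玩具语料
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])                        # 只统计可作前词的位置

def surprisal(prev, word):
    p = bigrams[(prev, word)] / unigrams[prev]         # P(word | prev)
    return -math.log2(p)                               # 单位:bit

print(surprisal("the", "cat"))  # P(cat|the) = 2/3,约 0.58 bit
```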

[NLP-29] Benchmarking Debiasing Methods for LLM-based Parameter Estimates

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在文本标注中产生的偏差问题,这种偏差可能导致下游参数估计(如回归系数和因果效应)的不准确。解决方案的关键在于通过结合少量昂贵的专家标注与LLM的标注,利用去偏方法如基于设计的监督学习(Design-based Supervised Learning, DSL)和预测驱动推断(Prediction-Powered Inference, PPI)来减少偏差,从而实现更准确的估计。

链接: https://arxiv.org/abs/2506.09627
作者: Nicolas Audinet de Pieuchon,Adel Daoud,Connor T. Jerzak,Moa Johansson,Richard Johansson
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method’s performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
zh
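
为便于理解,下面给出 PPI 点估计在“总体均值”这一最简单参数上的示意实现:用大规模 LLM 标注的均值,加上由少量专家标注算出的校正项。数据为合成示例,变量名均属演示假设(DSL 的实现更复杂,此处不涉及)。

```python
# Prediction-Powered Inference (PPI) 均值估计的最小示意
import numpy as np

def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """PPI 点估计:LLM 标注均值 + 专家标注给出的偏差校正项"""
    rectifier = np.mean(y_labeled - f_labeled)
    return np.mean(f_unlabeled) + rectifier

rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.3, size=10_000).astype(float)            # 真实标签(实际不可全量观测)
llm = np.clip(truth + rng.normal(0.1, 0.2, size=truth.shape), 0, 1)  # 系统性偏高的 LLM 标注
expert_idx = rng.choice(len(truth), size=200, replace=False)         # 仅 200 条昂贵的专家标注
print("naive LLM mean:", llm.mean())
print("PPI mean:", ppi_mean(truth[expert_idx], llm[expert_idx], llm))
```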

[NLP-30] Effective Red-Teaming of Policy-Adherent Agents

【速读】: 该论文旨在解决任务导向的大型语言模型(Large Language Model, LLM)代理在严格政策环境中的合规性问题,即如何确保代理在遵循政策的同时,有效抵御恶意用户的行为,避免被利用以获取不当利益。解决方案的关键在于提出一种新的威胁模型,聚焦于对抗性用户对政策遵循型代理的攻击,并设计了CRAFT系统,通过具有政策意识的说服策略在客户服务场景中对代理进行红队测试,从而更有效地揭示其脆弱性。此外,研究还引入了tau-break基准,用于严格评估代理在面对操纵性用户行为时的鲁棒性。

链接: https://arxiv.org/abs/2506.09600
作者: Itay Nakash,George Kour,Koren Lazar,Matan Vetzler,Guy Uziel,Ateret Anaby-Tavor
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive tactics. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent’s robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.
zh

[NLP-31] Memorization in Language Models through the Lens of Intrinsic Dimension

【速读】: 该论文试图解决语言模型(Language Models, LMs)在训练过程中可能无意中记忆并生成其训练数据中的部分内容,从而引发隐私泄露和知识产权披露的问题。其解决方案的关键在于研究潜在空间中序列的内在维度(Intrinsic Dimension, ID)对记忆率的调节作用,发现ID作为一种结构复杂性的几何代理,能够抑制记忆行为:与低ID序列相比,高ID序列在过参数化模型和稀疏暴露条件下更不容易被记忆。这一发现揭示了规模、暴露程度与复杂性之间在塑造记忆行为中的相互作用。

链接: https://arxiv.org/abs/2506.09591
作者: Stefan Arnold
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (弗里德里希-亚历山大埃尔朗根-纽伦堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language Models (LMs) are prone to memorizing parts of their data during training and unintentionally emitting them at generation time, raising concerns about privacy leakage and disclosure of intellectual property. While previous research has identified properties such as context length, parameter size, and duplication frequency, as key drivers of unintended memorization, little is known about how the latent structure modulates this rate of memorization. We investigate the role of Intrinsic Dimension (ID), a geometric proxy for the structural complexity of a sequence in latent space, in modulating memorization. Our findings suggest that ID acts as a suppressive signal for memorization: compared to low-ID sequences, high-ID sequences are less likely to be memorized, particularly in overparameterized models and under sparse exposure. These findings highlight the interaction between scale, exposure, and complexity in shaping memorization.
zh
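
摘要未说明所用的 ID 估计器细节;下面以常见的 TwoNN 估计器(Facco et al., 2017)为例,给出估计嵌入内在维度的一个假设性示意,并非论文的官方实现。

```python
# TwoNN 内在维度估计的最小示意:利用最近两邻居距离比的 Pareto 分布 MLE
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    nn = NearestNeighbors(n_neighbors=3).fit(X)   # 自身 + 最近两个邻居
    dist, _ = nn.kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]                  # 第二近邻距离 / 第一近邻距离
    mu = mu[mu > 1.0]                             # 去除重合点
    return len(mu) / np.sum(np.log(mu))           # MLE:d = N / Σ ln μ_i

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 64))  # 嵌在 64 维中的 2 维流形
print(f"estimated ID ≈ {twonn_id(X):.2f}")                          # 应接近 2
```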

[NLP-32] From Symbolic to Neural and Back: Exploring Knowledge Graph-Large Language Model Synergies

【速读】: 该论文旨在解决如何将结构化知识从知识图谱(Knowledge Graphs, KGs)有效集成到大型语言模型(Large Language Models, LLMs)中,以提升模型的事实基础性和推理能力的问题。其解决方案的关键在于系统性地分析KG与LLMs之间的协同作用,并将现有方法分为两类:KG增强的LLMs(KG-enhanced LLMs),旨在提升推理能力、减少幻觉并支持复杂问答;以及LLM增强的KGs(LLM-augmented KGs),用于促进知识图谱的构建、补全和查询。研究强调了可扩展性、计算效率和数据质量的重要性,并提出了未来的研究方向,如神经符号融合、动态KG更新、数据可靠性及伦理考量。

链接: https://arxiv.org/abs/2506.09566
作者: Blaž Škrlj,Boshko Koloski,Senja Pollak,Nada Lavrač
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To-appear as a book chapter

点击查看摘要

Abstract:Integrating structured knowledge from Knowledge Graphs (KGs) into Large Language Models (LLMs) enhances factual grounding and reasoning capabilities. This survey paper systematically examines the synergy between KGs and LLMs, categorizing existing approaches into two main groups: KG-enhanced LLMs, which improve reasoning, reduce hallucinations, and enable complex question answering; and LLM-augmented KGs, which facilitate KG construction, completion, and querying. Through comprehensive analysis, we identify critical gaps and highlight the mutual benefits of structured knowledge integration. Compared to existing surveys, our study uniquely emphasizes scalability, computational efficiency, and data quality. Finally, we propose future research directions, including neuro-symbolic integration, dynamic KG updating, data reliability, and ethical considerations, paving the way for intelligent systems capable of managing more complex real-world knowledge tasks.
zh

[NLP-33] Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language ACL

【速读】: 该论文旨在解决低资源语言在大型语言模型(Large Language Models, LLMs)应用中的局限性问题,特别是在马其顿语等语言中缺乏足够的数据和模型支持。其关键解决方案是构建了一系列资源,包括目前最大的马其顿语语料库(40GB文本数据,总计35亿词)、一个10.6万条指令的对话数据集以及覆盖七个基准的评估套件,并基于这些数据训练了一个8B参数的本地模型domestic-yak,该模型在多个基准测试中表现优于现有模型,甚至可与10倍参数量的模型相媲美。

链接: https://arxiv.org/abs/2506.09560
作者: Stefan Krsteski,Matea Tashkovska,Borjan Sazdov,Hristijan Gjoreski,Branislav Gerazov
机构: 未知
类目: Computation and Language (cs.CL)
备注: Camera-ready version accepted at SlavNLP-2025@ACL

点击查看摘要

Abstract:The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at this http URL for source code, and at this http URL for pretrained model weights and data.
zh

[NLP-34] Gender Bias in English-to-Greek Machine Translation

【速读】: 该论文试图解决机器翻译(Machine Translation, MT)系统在处理性别相关语言时可能强化性别刻板印象的问题,特别是在英语到希腊语的语言对中。其解决方案的关键在于构建一个手动标注的双语数据集GendEL,该数据集包含240个具有性别模糊和明确性的句子,涵盖典型的职位名词和形容词,并通过引入提示的GPT-4o模型作为缓解性别偏见的工具,提供性别明确和性别中立的翻译选项,以提升翻译的性别包容性。

链接: https://arxiv.org/abs/2506.09558
作者: Eleni Gkovedarou,Joke Daems,Luna De Bruyne
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at GITT 2025 (MT Summit)

点击查看摘要

Abstract:As the demand for inclusive language increases, concern has grown over the susceptibility of machine translation (MT) systems to reinforce gender stereotypes. This study investigates gender bias in two commercial MT systems, Google Translate and DeepL, focusing on the understudied English-to-Greek language pair. We address three aspects of gender bias: i) male bias, ii) occupational stereotyping, and iii) errors in anti-stereotypical translations. Additionally, we explore the potential of prompted GPT-4o as a bias mitigation tool that provides both gender-explicit and gender-neutral alternatives when necessary. To achieve this, we introduce GendEL, a manually crafted bilingual dataset of 240 gender-ambiguous and unambiguous sentences that feature stereotypical occupational nouns and adjectives. We find persistent gender bias in translations by both MT systems; while they perform well in cases where gender is explicitly defined, with DeepL outperforming both Google Translate and GPT-4o in feminine gender-unambiguous sentences, they are far from producing gender-inclusive or neutral translations when the gender is unspecified. GPT-4o shows promise, generating appropriate gendered and neutral alternatives for most ambiguous cases, though residual biases remain evident.
zh

[NLP-35] MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions INTERSPEECH2025

【速读】: 该论文旨在解决自然情境下语音情感识别(Speech Emotion Recognition, SER)中的挑战,特别是由于人类情感的主观性以及在自然条件下情感表达的不均衡性所带来的问题。其解决方案的关键在于提出MEDUSA,一个具有四阶段训练流程的多模态框架,该框架通过集成分类器和可训练的元分类器有效处理类别不平衡和情感模糊性,同时利用DeepSER——一种基于预训练自监督声学和语言表征的深度跨模态变压器融合机制,并结合Manifold MixUp进行正则化,以及引入人类标注分数作为软目标、平衡数据采样和多任务学习来提升模型性能。

链接: https://arxiv.org/abs/2506.09556
作者: Georgios Chatzichristodoulou,Despoina Kosmopoulou,Antonios Kritikos,Anastasia Poulopoulou,Efthymios Georgiou,Athanasios Katsamanis,Vassilis Katsouros,Alexandros Potamianos
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:SER is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
zh
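
摘要提到以人工标注分数作为软目标训练;下面是该做法的一个最小 PyTorch 示意。标注票数、类别数与损失选择(KL 散度)均为演示假设,并非 MEDUSA 的实际配置。

```python
# 以人工标注分数为软目标训练分类器的最小示意
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)              # batch=4,5 类情感
anno_scores = torch.tensor([[3., 1., 0., 0., 1.],           # 假设:各类别获得的标注票数
                            [0., 4., 1., 0., 0.],
                            [1., 1., 2., 1., 0.],
                            [0., 0., 0., 5., 0.]])
soft_targets = anno_scores / anno_scores.sum(dim=1, keepdim=True)  # 归一化为分布,保留标注分歧
loss = F.kl_div(F.log_softmax(logits, dim=1), soft_targets, reduction="batchmean")
loss.backward()
print(loss.item())
```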

[NLP-36] KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs

【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)方法在事实准确性上的局限性,特别是其依赖单一信息源(如非结构化文本或结构化知识库)以及缺乏认知启发式机制来激活相关知识的问题。其解决方案的关键在于提出KG-Infused RAG框架,该框架将知识图谱(Knowledge Graph, KG)整合到RAG系统中,通过实现扩散激活(spreading activation)这一认知过程,增强概念关联与推理能力,从而提升生成结果的可解释性与多源检索的语义结构基础。

链接: https://arxiv.org/abs/2506.09542
作者: Dingjun Wu,Yukun Yan,Zhenghao Liu,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates KGs into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
zh
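
“扩散激活”是该框架借用的认知过程;下面用一个玩具 KG 演示激活值从查询概念出发、按衰减系数逐跳传播的基本机制。图结构、衰减系数与跳数均为演示假设,与 KG-Infused RAG 的具体实现无关。

```python
# 知识图谱上扩散激活(spreading activation)的玩具示意
from collections import defaultdict

kg = {  # 假设的小型 KG:概念 -> 相邻概念
    "insulin": ["diabetes", "pancreas"],
    "diabetes": ["insulin", "blood sugar"],
    "pancreas": ["insulin", "digestion"],
    "blood sugar": ["diabetes"],
    "digestion": ["pancreas"],
}

def spread_activation(seeds, decay=0.5, hops=2):
    act = defaultdict(float, {s: 1.0 for s in seeds})
    frontier = dict(act)
    for _ in range(hops):
        nxt = defaultdict(float)
        for node, a in frontier.items():
            for nb in kg.get(node, []):
                nxt[nb] += a * decay              # 激活随跳数衰减,在汇聚处累加
        for node, a in nxt.items():
            act[node] = max(act[node], a)
        frontier = nxt
    return dict(sorted(act.items(), key=lambda kv: -kv[1]))

print(spread_activation(["insulin"]))  # 高激活概念可用于扩展检索查询
```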

[NLP-37] Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

【速读】: 该论文旨在解决复杂推理问题中过程奖励模型(Process Reward Model, PRM)的高成本与低效标注问题,特别是针对步骤级标注数据的获取困难。传统自动化标注方法如蒙特卡洛估计常产生噪声标签并消耗大量计算资源,而本文提出的关键解决方案是利用弱模型和强模型之间的预测一致性作为可靠过程标签的判定标准,从而高效生成高质量的过程标注数据。这一策略显著降低了对人工标注的依赖,并在多个基准测试中验证了其有效性。

链接: https://arxiv.org/abs/2506.09532
作者: Shuai Wang,Zhenhua Liu,Jiaheng Wei,Xuanwu Yin,Dong Li,Emad Barsoum
机构: Advanced Micro Devices Inc. (先进微设备公司); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test time scaling. Furthermore, Athena-PRM sets the state-of-the-art (SoTA) results in VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning and outperforms baseline with a significant margin on five benchmarks.
zh
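
其核心判据是:仅当弱、强两个补全器从同一推理前缀出发的蒙特卡洛成功率一致时,才采信该过程标签。下面给出这一判据的假设性简化示意,接口、阈值与容差均非论文原文。

```python
# 弱/强补全器预测一致性筛选过程标签的最小示意
import random

def mc_success_rate(completer, prefix_steps, n=8):
    """从给定推理前缀 rollout n 次,返回得到正确答案的比例(completer 为假设接口)"""
    return sum(completer(prefix_steps) for _ in range(n)) / n

def label_step(weak, strong, prefix_steps, thresh=0.5, agree_tol=0.2):
    p_w = mc_success_rate(weak, prefix_steps)
    p_s = mc_success_rate(strong, prefix_steps)
    if abs(p_w - p_s) > agree_tol:
        return None                                # 弱强不一致:标签不可靠,丢弃
    return int((p_w + p_s) / 2 >= thresh)          # 一致:二值化为过程标签

weak = lambda steps: random.random() < 0.70        # 假设:弱模型 rollout 的成功概率
strong = lambda steps: random.random() < 0.75
print(label_step(weak, strong, ["step 1: ..."]))
```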

[NLP-38] Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

【速读】: 该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在解码过程中未能有效利用视觉信息,导致生成结果缺乏视觉基础(visually ungrounded)的问题。其解决方案的关键在于提出ReVisiT方法,通过引用视觉标记(vision tokens)来引导文本生成过程,具体而言是将视觉标记投影到文本标记分布空间,并通过约束散度最小化动态选择最相关的视觉标记,以优化输出分布并更好地融合视觉语义。

链接: https://arxiv.org/abs/2506.09522
作者: Beomsik Cho,Jaehyung Kim
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code available at this https URL

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to 2×.
zh
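
下面以随机张量演示 ReVisiT 的基本流程:将视觉 token 的隐藏状态经 LM head 投影到词表分布,选出与当前输出分布散度最小的视觉 token,再用它精炼输出分布。融合方式(对数线性插值)与系数均为演示假设,并非论文的精确算法。

```python
# ReVisiT 核心思路的玩具示意(随机初始化,仅演示数据流)
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, d, n_vis = 1000, 64, 16                       # 词表大小 / 隐藏维度 / 视觉 token 数
lm_head = torch.nn.Linear(d, V, bias=False)
vision_hidden = torch.randn(n_vis, d)            # 视觉 token 的隐藏状态
text_logits = torch.randn(V)                     # 当前解码步的文本 logits

with torch.no_grad():
    p_text = F.softmax(text_logits, dim=-1)
    p_vis = F.softmax(lm_head(vision_hidden), dim=-1)     # 每个视觉 token 的词表分布
    kl = (p_vis * (p_vis / p_text).log()).sum(dim=-1)     # KL(p_vis || p_text)
    best = kl.argmin()                                    # 与输出分布最接近的视觉 token
    alpha = 0.5                                           # 假设的融合系数
    refined = F.softmax(text_logits.log_softmax(-1) + alpha * p_vis[best].log(), dim=-1)
print(best.item(), refined.topk(3).indices.tolist())
```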

[NLP-39] ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在知识密集型医学问答任务中的能力不足问题。其关键解决方案是构建了ReasonMed数据集,该数据集通过多智能体验证与优化流程生成,其中设计了一个Error Refiner模块以识别并修正验证器标记的错误推理步骤,从而提升推理路径的质量。

链接: https://arxiv.org/abs/2506.09513
作者: Yu Sun,Xingyu Qian,Weiwen Xu,Hao Zhang,Chenghao Xiao,Long Li,Yu Rong,Wenbing Huang,Qifeng Bai,Tingyang Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 24 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
zh

[NLP-40] TransXSSM: A Hybrid Transformer State Space Model with Unified Rotary Position Embedding

【速读】: 该论文试图解决混合架构中Transformer与状态空间模型(State Space Models, SSMs)在位置编码机制上的不兼容问题,这一不兼容性导致了模型在训练和推理过程中出现不连续性和性能下降。解决方案的关键在于提出一种统一的旋转位置编码(Unified Rotary Position Embedding, Unified RoPE),该方法为自注意力机制和状态空间组件提供了统一的位置编码框架,从而实现了两种模型结构的高效协同。

链接: https://arxiv.org/abs/2506.09507
作者: Bingheng Wu,Jingze Shi,Yifan Wu,Nan Tang,Yuyu Luo
机构: HKUST (Guangzhou)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers exhibit proficiency in capturing long-range dependencies, whereas State Space Models (SSMs) facilitate linear-time sequence modeling. Notwithstanding their synergistic potential, the integration of these architectures presents a significant challenge, primarily attributable to a fundamental incongruity in their respective positional encoding mechanisms: Transformers rely on explicit Rotary Position Embeddings (RoPE), while SSMs leverage implicit positional representations via convolutions. This divergence often precipitates discontinuities and suboptimal performance. To address this impediment, we propose a unified rotary position embedding (Unified RoPE) methodology, thereby establishing a consistent positional encoding framework for both self-attention and state-space components. Using this Unified RoPE, we introduce TransXSSM, a hybrid architecture that coherently integrates the Transformer and SSM layers under this unified positional encoding scheme. At a 4K sequence length, TransXSSM exhibits training and inference speeds that are 42.3% and 29.5% faster, respectively, relative to standard Transformer models. It also delivers higher accuracy: under comparable settings, it surpasses a Transformer baseline by over 4% on language modeling benchmarks. TransXSSM furthermore scales more effectively: TransXSSM-1.3B gains 7.22% in average accuracy over its 320M version (versus about 6% gains for equivalent Transformers or SSMs). Our results show that unified positional encoding resolves positional incompatibility in hybrid models, enabling efficient, high-performance long-context modeling.
zh
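
统一位置编码的基础仍是标准 RoPE;下面给出把 RoPE 旋转作用到查询/键向量上的通用实现示意。至于 TransXSSM 如何将其施加到 SSM 分支,摘要未给出细节,此处不作假设。

```python
# 旋转位置编码(RoPE)的标准实现示意(交错配对版本)
import torch

def apply_rope(x, base=10000.0):
    # x: (seq_len, dim),dim 为偶数
    seq_len, dim = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)                   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # 相邻两维配成一对做二维旋转
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(1)

q = torch.randn(8, 64)
print(apply_rope(q).shape)  # torch.Size([8, 64])
```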

[NLP-41] Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中的可复现性问题,即不同系统配置(如评估批次大小、GPU数量和GPU版本)会导致生成结果显著差异,尤其在推理模型中,早期标记的微小舍入误差可能引发思维链的分歧,从而影响准确性。解决方案的关键在于提出一种轻量级的推理流水线LayerCast,该方法将权重存储为16位精度,但所有计算均在FP32下进行,从而在内存效率与数值稳定性之间取得平衡。

链接: https://arxiv.org/abs/2506.09501
作者: Jiayi Yuan,Hao Li,Xinheng Ding,Wenya Xie,Yu-Jhe Li,Wentian Zhao,Kun Wan,Jing Shi,Xia Hu,Zirui Liu
机构: Rice University (莱斯大学); University of Minnesota Twin Cities (明尼苏达大学双城分校); Adobe Inc. (Adobe 公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration such as evaluation batch size, GPU count, and GPU version can introduce significant difference in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision – while critical for reproducibility – is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at this https URL.
zh
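
下面两个片段分别演示问题根源与论文思路:浮点加法不满足结合律,以及 LayerCast 式“16 位存权重、FP32 做计算”的做法。线性层写法是对论文方法的假设性简化,并非官方实现。

```python
import torch

# (1) 浮点加法不满足结合律:同样三个数,不同求和顺序给出不同结果
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

# (2) LayerCast 式线性层:权重以 bfloat16 存储,计算统一升到 FP32
class LayerCastLinear(torch.nn.Module):
    def __init__(self, din, dout):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(dout, din, dtype=torch.bfloat16))  # 存储省内存
    def forward(self, x):
        return x.float() @ self.weight.float().t()         # 矩阵乘在 FP32 中完成

y = LayerCastLinear(16, 4)(torch.randn(2, 16, dtype=torch.bfloat16))
print(y.dtype)   # torch.float32
```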

[NLP-42] Bridging Online Behavior and Clinical Insight: A Longitudinal LLM-based Study of Suicidality on YouTube Reveals Novel Digital Markers

【速读】: 该论文试图解决如何通过分析YouTube上的数字足迹来理解自杀行为,并将其与临床知识进行对比。其关键解决方案是采用互补的方法论,包括基于大语言模型(LLM)的自下而上主题建模、混合方法以及基于专家驱动的自上而下分析,以识别与自杀行为相关的指标并揭示其时间变化特征。

链接: https://arxiv.org/abs/2506.09495
作者: Ilanit Sobol,Shir Lissak,Refael Tikochinski,Tal Nakash,Anat Brunstein Klomek,Eyal Fruchter,Roi Reichart
机构: Technion – Israel Institute of Technology(以色列理工学院); Reichman University(里奇曼大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Suicide remains a leading cause of death in Western countries, underscoring the need for new research approaches. As social media becomes central to daily life, digital footprints offer valuable insight into suicidal behavior. Focusing on individuals who attempted suicide while uploading videos to their channels, we investigate: How do suicidal behaviors manifest on YouTube, and how do they differ from expert knowledge? We applied complementary approaches: computational bottom-up, hybrid, and expert-driven top-down, on a novel longitudinal dataset of 181 YouTube channels from individuals with life-threatening attempts, alongside 134 control channels. In the bottom-up approach, we applied LLM-based topic modeling to identify behavioral indicators. Of 166 topics, five were associated with suicide-attempt, with two also showing temporal attempt-related changes (p < .01) - Mental Health Struggles (+0.08)* and YouTube Engagement (+0.1)*. In the hybrid approach, a clinical expert reviewed LLM-derived topics and flagged 19 as suicide-related. However, none showed significant attempt-related temporal effects beyond those identified bottom-up. Notably, YouTube Engagement, a platform-specific indicator, was not flagged by the expert, underscoring the value of bottom-up discovery. In the top-down approach, psychological assessment of suicide attempt narratives revealed that the only significant difference between individuals who attempted before and those who attempted during their upload period was the motivation to share this experience: the former aimed to Help Others (β = -1.69, p < .01), while the latter framed it as part of their Personal Recovery (β = 1.08, p < .01). By integrating these approaches, we offer a nuanced understanding of suicidality, bridging digital behavior and clinical insights. *Within-group changes in relation to the suicide attempt.
zh

[NLP-43] Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

【速读】: 该论文试图解决生成式 AI (Generative AI) 中直接对齐算法(Direct Alignment Algorithms, DAAs)在训练目标与推理生成性能之间存在的“奖励-生成差距”问题。该差距源于模型在生成过程中前缀标记的内在重要性与其在DAAs隐式奖励函数中的反映不匹配。解决方案的关键在于提出一种名为Prefix-Oriented Equal-length Training (POET) 的简单但有效的策略,通过将偏好和非偏好响应截断至较短长度,使每个样本中的响应长度相等,从而在优化过程中隐式约束DAAs的目标,使其更关注前缀标记,进而缩小奖励与生成之间的差距。

链接: https://arxiv.org/abs/2506.09457
作者: Zeguan Xiao,Yun Chen,Guanhua Chen
机构: Shanghai University of Finance and Economics (上海财经大学); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the “reward-generation gap” – a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one’s length. Training with POET, where both responses in each sample are truncated to equal length, resulting in diverse truncated lengths across samples, the optimization of DAAs objective is implicitly constrained to converge across all positions, thus paying more attention to prefix tokens than the standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
zh
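
POET 的核心操作非常简单:把每条样本中的偏好与非偏好回复都截断到二者中较短的长度。下面是 token 序列层面的最小示意,token id 为随意编造的演示数据。

```python
# POET 截断操作的最小示意
def poet_truncate(chosen_ids, rejected_ids):
    L = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:L], rejected_ids[:L]

chosen = [101, 7, 42, 9, 88, 12]      # 偏好回复的 token id(演示数据)
rejected = [101, 3, 5, 66]            # 非偏好回复的 token id(演示数据)
print(poet_truncate(chosen, rejected))  # 两条回复均截断到长度 4
```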

[NLP-44] Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform

【速读】: 该论文试图解决在大规模语言模型(Large Language Models, LLMs)服务中,由于计算基础设施共享导致的隐私泄露问题,即企业数据所有者在使用共享或多租户计算环境时,不得不将数据以明文形式暴露,从而限制了其对敏感数据的使用。论文提出的解决方案的关键是“Stained Glass Transform”(彩绘玻璃变换),这是一种学习得到的、随机的且序列依赖的词嵌入变换方法,能够在理论上提供输入数据的隐私保护的同时,保持模型的实用性。该方法通过与高斯混合模型的互信息理论建立联系,进一步计算后验隐私估计,并通过令牌级别的隐私指标和标准LLM性能基准验证了其隐私性和有效性。

链接: https://arxiv.org/abs/2506.09452
作者: Jay Roberts,Kyle Mylonakis,Sidhartha Roy,Kaan Kale
机构: Protopia AI(Protopia AI)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Theory (cs.IT)
备注: Submitted to IEEE SP 2026

点击查看摘要

Abstract:The high cost of ownership of AI compute infrastructure and the challenges of robust serving of large language models (LLMs) have led to a surge in managed Model-as-a-service deployments. Even when enterprises choose on-premises deployments, the compute infrastructure is typically shared across many teams in order to maximize the return on investment. In both scenarios the deployed models operate only on plaintext data, and so enterprise data owners must allow their data to appear in plaintext on a shared or multi-tenant compute infrastructure. This results in data owners with private or sensitive data being hesitant or restricted in what data they use with these types of deployments. In this work we introduce the Stained Glass Transform, a learned, stochastic, and sequence dependent transformation of the word embeddings of an LLM which information theoretically provides privacy to the input of the LLM while preserving the utility of the model. We theoretically connect a particular class of Stained Glass Transforms to the theory of mutual information of Gaussian Mixture Models. We then calculate a posteriori privacy estimates, based on mutual information, and verify the privacy and utility of instances of transformed embeddings through token level metrics of privacy and standard LLM performance benchmarks.
zh
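
下面给出“对词嵌入序列施加学习到的、随机且依赖序列的变换”这一思路的玩具实现;具体参数化(GRU 上下文编码、对角高斯噪声)纯属演示假设,与 Stained Glass Transform 的真实形式无关。

```python
# 序列依赖的随机嵌入混淆器的玩具示意
import torch

class StochasticEmbedObfuscator(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.ctx = torch.nn.GRU(d, d, batch_first=True)   # 使变换依赖整个序列的上下文
        self.mu = torch.nn.Linear(d, d)
        self.log_sigma = torch.nn.Linear(d, d)
    def forward(self, emb):                               # emb: (B, T, d)
        h, _ = self.ctx(emb)
        eps = torch.randn_like(emb)                       # 每次前向注入新的随机噪声
        return emb + self.mu(h) + self.log_sigma(h).exp() * eps

emb = torch.randn(2, 5, 32)                               # 假设的词嵌入序列
print(StochasticEmbedObfuscator(32)(emb).shape)           # torch.Size([2, 5, 32])
```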

[NLP-45] UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs NAACL

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在理论心智(Theory of Mind, ToM)能力上的不足,尤其是其在准确预测人类心理状态方面的困难。论文提出的解决方案是构建UniToMBench,这是一个集成SimToM和TOMBENCH优势的统一基准,通过多交互任务设计和动态故事场景来系统性地提升和评估LLMs的ToM能力。其关键在于结合视角采择技术与多样化的评估指标,并基于超过1000个手工编写的场景数据集,以更有效地激发LLMs的社会认知能力。

链接: https://arxiv.org/abs/2506.09450
作者: Prameshwar Thiyagarajan,Vaishnavi Parimi,Shamant Sai,Soumil Garg,Zhangir Meirbek,Nitin Yarlagadda,Kevin Zhu,Chris Kim
机构: Algoverse AI Research (Algoverse AI 研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Conference of the North American Chapter of the Association for Computational Linguistics, Student Research Workshop 2025 (NAACL SRW 2025)

点击查看摘要

Abstract:Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs by integrating multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: this https URL.
zh

[NLP-46] OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary INTERSPEECH2025

【速读】: 该论文试图解决语音基础模型(Speech Foundation Models, SFMs)在识别罕见和未见过的词汇时表现不佳的问题。其解决方案的关键在于将现有的上下文偏置(Contextual Biasing, CB)方法与预训练的OWSM v3.1模型相结合,同时冻结模型的预训练参数。通过利用SFMs中嵌入的知识,该方法在保持SFMs优势的同时,实现了有效的CB,即使在小数据集上也能取得显著性能提升。

链接: https://arxiv.org/abs/2506.09448
作者: Yui Sudo,Yusuke Fujita,Atsushi Kojima,Tomoya Mizumoto,Lianbo Liu
机构: 未知
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Speech foundation models (SFMs), such as Open Whisper-Style Speech Models (OWSM), are trained on massive datasets to achieve accurate automatic speech recognition. However, even SFMs struggle to accurately recognize rare and unseen words. While contextual biasing (CB) is a promising approach to improve recognition of such words, most CB methods are trained from scratch, resulting in lower performance than SFMs due to the lack of pre-trained knowledge. This paper integrates an existing CB method with OWSM v3.1 while freezing its pre-trained parameters. By leveraging the knowledge embedded in SFMs, the proposed method enables effective CB while preserving the advantages of SFMs, even with a small dataset. Experimental results show that the proposed method improves the biasing word error rate (B-WER) by 11.6 points, resulting in a 0.9 point improvement in the overall WER while reducing the real-time factor by 7.5% compared to the non-biasing baseline on the LibriSpeech 100 test-clean set.
zh

[NLP-47] GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture ACL-2025

【速读】: 该论文试图解决俄语领域基础模型开发受限的问题,主要由于训练此类模型所需的计算资源庞大。解决方案的关键在于引入GigaChat家族的俄语生成式大语言模型(Generative Large Language Models, LLMs),提供多种规模的模型版本,包括基础模型和指令调优版本,并详细报告了模型架构、预训练过程及实验设计,以支持俄语自然语言处理研究与工业应用的发展。

链接: https://arxiv.org/abs/2506.09440
作者: GigaChat team:Mamedov Valentin,Evgenii Kosarev,Gregory Leleytner,Ilya Shchuckin,Valeriy Berezovskiy,Daniil Smirnov,Dmitry Kozlov,Sergei Averkiev,Lukyanenko Ivan,Aleksandr Proshunin,Ainur Israfilova,Ivan Baskov,Artem Chervyakov,Emil Shakirov,Mikhail Kolesov,Daria Khomich,Darya Latortseva,Sergei Porkhun,Yury Fedorov,Oleg Kutuzov,Polina Kudriavtseva,Sofiia Soldatova,Kolodin Egor,Stanislav Pyatkin,Dzmitry Menshykh,Grafov Sergei,Eldar Damirov,Karlov Vladimir,Ruslan Gaitukiev,Arkadiy Shatenov,Alena Fenogenova,Nikita Savushkin,Fedor Minkin
机构: GigaChat team; SaluteDevices / Moscow
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL-2025 System Demo

点击查看摘要

Abstract:Generative large language models (LLMs) have become crucial for modern NLP research and applications across various languages. However, the development of foundational models specifically tailored to the Russian language has been limited, primarily due to the significant computational resources required. This paper introduces the GigaChat family of Russian LLMs, available in various sizes, including base models and instruction-tuned versions. We provide a detailed report on the model architecture, pre-training process, and experiments to guide design choices. In addition, we evaluate their performance on Russian and English benchmarks and compare GigaChat with multilingual analogs. The paper presents a system demonstration of the top-performing models accessible via an API, a Telegram bot, and a Web interface. Furthermore, we have released three open GigaChat models in open-source (this https URL), aiming to expand NLP research opportunities and support the development of industrial solutions for the Russian language.
zh

[NLP-48] Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting

【速读】: 该论文试图解决监督微调(Supervised Fine-Tuning, SFT)在提升大型语言模型(Large Language Models, LLMs)指令遵循能力和领域特定任务适应性的同时,导致其通用能力下降的问题,以及由于无法获取原始预训练数据,第三方实践者在对开源模型进行SFT时容易加剧灾难性遗忘(catastrophic forgetting)的问题。解决方案的关键在于提出一种新的、更经济的SFT方法,通过重建基础模型的可能SFT指令分布,并经过多模型筛选过程选择最优数据,再与新数据混合进行SFT,从而有效降低灾难性遗忘的风险,同时保持模型在通用领域的泛化能力。

链接: https://arxiv.org/abs/2506.09428
作者: Fei Ding,Baiqiao Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Supervised Fine-Tuning (SFT), while enhancing large language models (LLMs)’ instruction-following capabilities and domain-specific task adaptability, often diminishes their general capabilities. Moreover, due to the inaccessibility of original pre-training data, catastrophic forgetting tends to be exacerbated when third-party practitioners implement SFT on open-sourced models. To address this challenge, we propose a novel, more cost-effective SFT method which could effectively reduce the risk of catastrophic forgetting without access to original SFT data. Our approach begins by reconstructing the likely SFT instruction distribution of the base model, followed by a multi-model screening process to select optimal data, which is then mixed with new data for SFT. Experimental results demonstrate that our method preserves generalization capabilities in general domains while improving task-specific performance.
zh

[NLP-49] Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLM s in Multimodal Settings ACL2025

【速读】: 该论文试图解决在日益数字化的世界中自动检测欺骗(deception detection)的问题,其核心挑战在于如何有效利用大型语言模型(Large Language Models, LLMs)和大型多模态模型(Large Multimodal Models, LMMs)的能力来识别不同领域的虚假信息。解决方案的关键在于系统评估不同实验设置下的欺骗检测效果,包括零样本和少量样本方法,并通过随机或基于相似性的上下文示例选择策略进行优化。研究还探讨了辅助特征(如非语言手势和视频摘要)以及不同提示策略(如直接标签生成和思维链推理)对检测性能的影响,从而揭示LLMs在跨模态欺骗线索处理中的潜力与局限性。

链接: https://arxiv.org/abs/2506.09424
作者: Md Messal Monem Miah,Adrita Anika,Xi Shi,Ruihong Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and commercial LLMs on three distinct datasets: real life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our results show that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection tasks, while LMMs struggle to fully leverage cross-modal cues. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and examine the effectiveness of different prompting strategies, including direct label generation and chain-of-thought reasoning. Our findings provide key insights into how LLMs process and interpret deceptive cues across modalities, highlighting their potential and limitations in real-world deception detection applications.
zh

[NLP-50] A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

【速读】: 该论文试图解决当前研究中过度追求完全自主的AI代理所带来的可靠性、透明性和对人类实际需求理解不足的问题。其解决方案的关键在于提出基于大型语言模型的人机系统(LLM-HAS),通过让AI与人类协作而非替代人类,保持人类在系统中的指导、答疑和控制作用,从而提升系统的可信度和适应性。

链接: https://arxiv.org/abs/2506.09420
作者: Henry Peng Zou,Wei-Chieh Huang,Yaozu Wu,Chunyu Miao,Dongyuan Li,Aiwei Liu,Yue Zhou,Yankai Chen,Weizhi Zhang,Yangning Li,Liancheng Fang,Renhe Jiang,Philip S. Yu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Tokyo (东京大学); Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems (LLM-HAS), where AI works with humans rather than replacing them. By keeping humans involved to provide guidance, answer questions, and maintain control, these systems can be more trustworthy and adaptable. Looking at examples from healthcare, finance, and software development, we show how human-AI teamwork can handle complex tasks better than AI working alone. We also discuss the challenges of building these collaborative systems and offer practical solutions. This paper argues that progress in AI should not be measured by how independent systems become, but by how well they can work with humans. The most promising future for AI is not in systems that take over human roles, but in those that enhance human capabilities through meaningful partnership.
zh

[NLP-51] PGDA-KGQA: A Prompt-Guided Generative Framework with Multiple Data Augmentation Strategies for Knowledge Graph Question Answering

【速读】: 该论文旨在解决知识图谱问答(KGQA)中由于标注数据稀缺和多跳推理样本不足导致的模型泛化能力弱的问题。现有方法在利用大语言模型(LLMs)进行语义解析时,受限于数据多样性不足,尤其是多跳推理样本的缺乏,影响了模型性能。论文提出的解决方案关键在于PGDA-KGQA框架,其核心是一个统一的提示设计范式,通过精心设计的提示整合文本内容,利用LLMs生成大规模(问题,逻辑形式)对以增强训练数据。该框架通过生成单跳伪问题、语义保持的问题重写以及答案引导的反向路径探索等策略,有效提升了数据多样性与模型的鲁棒性。

链接: https://arxiv.org/abs/2506.09414
作者: Xiujun Zhou,Pingjian Zhang,Deyou Tang
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 13 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models’ generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP by 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions by 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.
zh

[NLP-52] Token Constraint Decoding Improves Robustness on Question Answering for Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对输入扰动时表现出的脆弱性问题,特别是在多选题问答(MCQA)任务中性能下降的问题。解决方案的关键在于引入并评估一种称为Token Constraint Decoding (TCD) 的推理阶段算法,该算法通过强制对齐分词级别的预测来增强模型在噪声环境下的鲁棒性。

链接: https://arxiv.org/abs/2506.09408
作者: Jui-Ming Yao,Hao-Yuan Chen,Zi-Xian Tang,Bing-Jia Tan,Sheng-Wei Peng,Bing-Cheng Xie,Shun-Feng Su
机构: National Taiwan University of Science and Technology, Taipei, Taiwan; University of London, London, United Kingdom
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
zh
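
约束解码最基本的机制是把非法 token 的 logits 置为负无穷;下面用一个玩具词表演示这一机制(TCD 的完整算法还包含 token 级预测对齐等处理,此处不展开,词表与分数均为演示假设)。

```python
# 选择题场景下仅允许合法选项 token 的 logits 掩码示意
import torch

vocab = {"A": 0, "B": 1, "C": 2, "D": 3, "the": 4, "cat": 5}
logits = torch.tensor([1.2, 0.3, 2.1, -0.5, 3.0, 2.8])   # 噪声下非选项 token "the" 得分最高

allowed = torch.tensor([vocab[o] for o in "ABCD"])
mask = torch.full_like(logits, float("-inf"))
mask[allowed] = 0.0
constrained = logits + mask                               # 非选项 token 被置为 -inf
pred = constrained.argmax().item()
print([k for k, v in vocab.items() if v == pred][0])      # 输出 "C"
```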

[NLP-53] A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

【速读】: 该论文旨在解决知识追踪(Knowledge Tracing, KT)在低资源数据环境下性能下降的问题,特别是在学生练习历史不断增长的情况下需要在线更新的现实课堂场景中的挑战。解决方案的关键在于重新利用层次化知识概念(Hierarchical Knowledge Concept, KC)信息,通过构建基于知识树的KT框架(KT²),采用隐马尔可夫树模型对知识概念的层次结构进行建模,并利用EM算法估计学生的掌握程度,同时通过增量更新机制实现个性化预测,从而在数据稀缺的情况下保持较强的性能。

链接: https://arxiv.org/abs/2506.09393
作者: Xinyi Gao,Qiucheng Wu,Yang Zhang,Xuechen Liu,Kaizhi Qian,Ying Xu,Shiyu Chang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 4 figures

点击查看摘要

Abstract:Knowledge tracing (KT) aims to estimate a student’s evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are typically low-resource in data and require online updates as students’ exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide a strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing (KT^2), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. KT^2 estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that KT^2 consistently outperforms strong baselines in realistic online, low-resource settings.
zh

[NLP-54] Comparing human and LLM politeness strategies in free production

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在礼貌表达(polite speech)方面的对齐问题,即如何使模型生成的回应在语用层面与人类沟通习惯保持一致。解决方案的关键在于通过对比人类与LLM在受限和开放性生成任务中的回应,分析模型是否能够采用情境敏感的语用策略,特别是其在不同语境下对积极和消极礼貌策略(positive and negative politeness strategies)的使用情况。研究发现,较大参数量的模型能够复制计算语用学中的关键偏好,但在积极语境中过度依赖消极礼貌策略,这可能引发误解,从而揭示了AI系统在语用对齐方面仍存在的挑战。

链接: https://arxiv.org/abs/2506.09391
作者: Haoran Zhao,Robert D.Hawkins
机构: University of Washington (华盛顿大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 5 figures

点击查看摘要

Abstract:Polite speech poses a fundamental alignment challenge for large language models (LLMs). Humans deploy a rich repertoire of linguistic strategies to balance informational and social goals – from positive approaches that build rapport (compliments, expressions of interest) to negative strategies that minimize imposition (hedging, indirectness). We investigate whether LLMs employ a similarly context-sensitive repertoire by comparing human and LLM responses in both constrained and open-ended production tasks. We find that larger models (≥70B parameters) successfully replicate key preferences from the computational pragmatics literature, and human evaluators surprisingly prefer LLM-generated responses in open-ended contexts. However, further linguistic analyses reveal that models disproportionately rely on negative politeness strategies even in positive contexts, potentially leading to misinterpretations. While modern LLMs demonstrate an impressive handle on politeness strategies, these subtle differences raise important questions about pragmatic alignment in AI systems.
zh

[NLP-55] Binary classification for perceived quality of headlines and links on worldwide news websites 2018-2024

【速读】: 该论文试图解决如何自动区分 perceived lower-quality 新闻标题/链接与 perceived higher-quality 新闻标题/链接的问题(News Headline/Link Quality Classification)。其解决方案的关键在于利用115个提取的语言学特征,并在包含57,544,214条全球新闻网站链接/标题的二分类平衡数据集上评估了十二种机器学习模型。研究发现,传统集成方法(尤其是Bagging分类器)和微调的DistilBERT模型都能有效区分新闻标题/链接的质量,但两者在预测性能与训练时间之间存在权衡。

链接: https://arxiv.org/abs/2506.09381
作者: Austin McCutcheon,Thiago E. A. de Oliveira,Aleksandr Zheleznov,Chris Brogly
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The proliferation of online news enables potential widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, had strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and train time.
zh
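
作为参考,下面给出用 scikit-learn 的 Bagging 分类器区分标题质量的最小流水线;这里用 TF-IDF 代替论文的 115 个语言学特征,训练数据为合成示例,仅演示整体流程。

```python
# Bagging 分类器 + 文本特征的最小流水线示意(需要 scikit-learn >= 1.2)
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = ["You won't believe what happened next!!!",
     "Central bank raises interest rates by 25 basis points",
     "Doctors hate this one weird trick",
     "Parliament passes annual budget after debate"]
y = [0, 1, 0, 1]   # 0 = 感知低质量, 1 = 感知高质量(演示标签)

clf = make_pipeline(
    TfidfVectorizer(),
    BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50))
clf.fit(X, y)
print(clf.predict(["Shocking secret celebrities don't want you to know"]))
```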

[NLP-56] CoLMbo: Speaker Language Model for Descriptive Profiling

【速读】: 该论文旨在解决传统说话人识别系统在生成详细说话人特征或提供上下文丰富的描述方面存在的局限性,这些系统通常仅限于分类任务,且无法结构化地捕捉如方言、性别和年龄等人口统计属性。论文提出的解决方案是构建CoLMbo,这是一种说话人语言模型(Speaker Language Model, SLM),其关键在于将说话人编码器与基于提示的条件机制相结合,从而根据说话人嵌入生成详细的描述性文本,并通过用户自定义提示动态适应新的说话人特征,实现定制化的描述生成。

链接: https://arxiv.org/abs/2506.09375
作者: Massa Baali,Shuo Han,Syed Abdul Hannan,Purusottam Samal,Karanveer Singh,Soham Deshmukh,Rita Singh,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学); FPrime AI (FPrime AI)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.
zh

[NLP-57] COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content

【速读】: 该论文旨在解决生成式 AI(Generative AI)在教育场景中生成内容时难以符合课程标准、保持适合特定年级的阅读水平,以及在 STEM 教育中平衡科学解释与日常语言表达的问题。其解决方案的关键在于提出 COGENT 框架,该框架整合了课程组件(科学概念、核心思想和学习目标),通过控制文本长度、词汇难度和句子复杂度来调控可读性,并采用“基于好奇感”的方法提升学生的学习兴趣与参与度。

链接: https://arxiv.org/abs/2506.09367
作者: Zhengyuan Liu,Stella Xin Yin,Dion Hoe-Lian Goh,Nancy F. Chen
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore; Nanyang Technological University, Singapore
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: BEA 2025

点击查看摘要

Abstract:While Generative AI has demonstrated strong potential and versatility in content generation, its application to educational contexts presents several challenges. Models often fail to align with curriculum standards and maintain grade-appropriate reading levels consistently. Furthermore, STEM education poses additional challenges in balancing scientific explanations with everyday language when introducing complex and abstract ideas and phenomena to younger students. In this work, we propose COGENT, a curriculum-oriented framework for generating grade-appropriate educational content. We incorporate three curriculum components (science concepts, core ideas, and learning objectives), control readability through length, vocabulary, and sentence complexity, and adopt a "wonder-based" approach to increase student engagement and interest. We conduct a multi-dimensional evaluation via both LLM-as-a-judge and human expert analysis. Experimental results show that COGENT consistently produces grade-appropriate passages that are comparable or superior to human references. Our work establishes a viable approach for scaling adaptive and high-quality learning resources.
zh

[NLP-58] Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL

【速读】: 该论文试图解决生成式 SQL 语句与用户自然语言查询之间的语义等价性评估问题(Semantic Equivalence Evaluation),尤其是在用户查询存在歧义或存在多种合法 SQL 解释的情况下。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)来评估语义等价性,同时引入一种更实际的“弱”语义等价性概念(Weak Semantic Equivalence),以提高评估的可行性和实用性。

链接: https://arxiv.org/abs/2506.09359
作者: Qingyun Zeng,Simin Ma,Arash Niknafs,Ashish Basran,Carol Szabo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has significantly advanced Text-to-SQL (NL2SQL) systems, yet evaluating the semantic equivalence of generated SQL remains a challenge, especially given ambiguous user queries and multiple valid SQL interpretations. This paper explores using LLMs to assess both semantic and a more practical "weak" semantic equivalence. We analyze common patterns of SQL equivalence and inequivalence, and discuss challenges in LLM-based evaluation.
zh
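
执行比对是语义等价性判断常用的互补基线(执行结果一致只是等价的必要条件,并非充分证明);下面用内存 SQLite 给出最小示意,表结构与查询均为演示假设,与论文基于 LLM 的判断方法无关。

```python
# 用执行结果近似检验两条 SQL 等价性的最小示意
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [("a", "ai", 10), ("b", "ai", 20), ("c", "db", 15)])

def exec_equivalent(sql1, sql2):
    r1 = sorted(conn.execute(sql1).fetchall())
    r2 = sorted(conn.execute(sql2).fetchall())
    return r1 == r2   # 仅在该数据库实例上结果一致

print(exec_equivalent("SELECT dept, MAX(salary) FROM emp GROUP BY dept",
                      "SELECT dept, MAX(salary) FROM emp GROUP BY dept ORDER BY dept"))
```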

[NLP-59] DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts ACL2025

【Quick Read】: This paper addresses the redundancy caused by insufficient diversity among experts when reconstructing a dense large language model (LLM) into a Mixture-of-Experts (MoE) architecture. The key to the solution is DIVE, a diversity-enhanced reconstruction method that combines domain affinity mining, pruning-based expert reconstruction, and efficient retraining, improving training efficiency while preserving model performance.

Link: https://arxiv.org/abs/2506.09351
Authors: Yuchen Feng,Bowen Shen,Naibin Gu,Jiaxuan Zhao,Peng Fu,Zheng Lin,Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025

Abstract:Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we come up with the observation that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.

[NLP-60] OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment

【Quick Read】: This paper addresses insufficient multimodal coordination in end-to-end speech generation: existing methods either generate discrete speech tokens without integrating them into the language model's autoregressive process, or adopt joint autoregressive modeling but with inadequate cross-modal interaction. The key to the solution is OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, whose core features are dual-resolution speech representations and contrastive cross-modal alignment; by processing speech and text representations in parallel and strengthening audio comprehension, it enables more effective multimodal interaction and generation.

Link: https://arxiv.org/abs/2506.09349
Authors: Chao-Hong Tan,Qian Chen,Wen Wang,Chong Deng,Qinglin Zhang,Luyao Cheng,Hai Yu,Xin Zhang,Xiang Lv,Tianyu Zhao,Chong Zhang,Yukun Ma,Yafeng Chen,Hui Wang,Jiaqing Liu,Jieping Ye
Affiliations: Tongyi Lab, Alibaba Group
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents OmniDRCA, a parallel speech-text foundation model based on joint autoregressive modeling, featuring dual-resolution speech representations and contrastive cross-modal alignment. Our approach processes speech and text representations in parallel while enhancing audio comprehension through contrastive alignment. Experimental results on Spoken Question Answering benchmarks demonstrate that OmniDRCA establishes new state-of-the-art (SOTA) performance among parallel joint speech-text modeling based foundation models, and achieves competitive performance compared to interleaved models. Additionally, we explore the potential of extending the framework to full-duplex conversational scenarios.
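
The contrastive cross-modal alignment component can be illustrated with a generic, CLIP-style InfoNCE loss between paired speech and text embeddings. This is a minimal sketch of the general technique, not OmniDRCA's exact objective; the dimensions and temperature are arbitrary:

```python
# CLIP-style symmetric InfoNCE between paired speech and text embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature          # (B, B) pairwise similarities
    targets = torch.arange(s.size(0))       # matched pairs on the diagonal
    # Symmetric loss: speech-to-text and text-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

speech = torch.randn(8, 256)  # batch of paired speech/text embeddings
text = torch.randn(8, 256)
print(contrastive_alignment_loss(speech, text).item())
```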

[NLP-61] Ming-Omni: A Unified Multimodal Model for Perception and Generation

【Quick Read】: This paper targets the need for multiple separate models or task-specific fine-tuning in multimodal tasks, proposing Ming-Omni, a unified multimodal model that handles images, text, audio, and video, with strong speech and image generation. The key to the solution is using dedicated encoders to extract tokens from each modality and processing them with Ling, an MoE architecture equipped with newly proposed modality-specific routers, so that multimodal inputs are fused efficiently within a unified framework and diverse tasks are handled end to end.

Link: https://arxiv.org/abs/2506.09344
Authors: Inclusion AI,Biao Gong,Cheng Zou,Chuanyang Zheng,Chunluan Zhou,Canxiang Yan,Chunxiang Jin,Chunjie Shen,Dandan Zheng,Fudong Wang,Furong Xu,GuangMing Yao,Jun Zhou,Jingdong Chen,Jianxin Sun,Jiajia Liu,Jianjiang Zhu,Jun Peng,Kaixiang Ji,Kaiyou Song,Kaimeng Ren,Libin Wang,Lixiang Ru,Lele Xie,Longhua Tan,Lyuxin Xue,Lan Wang,Mochen Bai,Ning Gao,Pei Chen,Qingpei Guo,Qinglong Zhang,Qiang Xu,Rui Liu,Ruijie Xiong,Sirui Gao,Tinghao Liu,Taisong Li,Weilong Chai,Xinyu Xiao,Xiaomei Wang,Xiaoxue Chen,Xiao Lu,Xiaoyu Li,Xingning Dong,Xuzheng Yu,Yi Yuan,Yuting Gao,Yunxiao Sun,Yipeng Chen,Yifei Wu,Yongjie Lyu,Ziping Ma,Zipeng Feng,Zhijiang Fang,Zhihao Qiu,Ziyuan Huang,Zhengyu He
Affiliations: Inclusion AI; Ant Group
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 18 pages, 8 figures

Abstract:We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.

[NLP-62] Latent Multi-Head Attention for Small Language Models

【Quick Read】: This paper addresses the efficiency-quality trade-off of multi-head attention in small language models, proposing an architectural optimization based on latent multi-head attention (MLA). The key is combining rotary positional embeddings (RoPE) with half-rank latent dimensions (r = d/2), which sharply reduces key-value cache (KV-cache) memory while incurring only a tiny increase in validation loss, a Pareto improvement for memory-constrained deployment.

Link: https://arxiv.org/abs/2506.09342
Authors: Sushant Mehta,Raj Dandekar,Rajat Dandekar,Sreedath Panat
Affiliations: Vizuara AI Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, 5 tables

Abstract:We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r=d/2 achieves a 1.4 times speedup over full-rank MLA while maintaining the memory savings. GPT-4 evaluations corroborate perplexity results, with ours achieving the highest quality scores (7.4/10) across grammar, creativity, and consistency metrics. Code and models will be released upon acceptance.
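
A minimal sketch of the latent KV idea behind MLA: keys and values are reconstructed from a cached low-rank latent of dimension r = d/2, which is where the KV-cache saving comes from. RoPE, multi-head splitting, and causal masking are omitted for brevity, so this illustrates the caching mechanism rather than the paper's exact module:

```python
# Latent attention sketch: only the low-rank latent needs to be cached.
import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        r = d_model // 2                      # half-rank latent dimension
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, r)  # only this output is cached
        self.k_up = nn.Linear(r, d_model)
        self.v_up = nn.Linear(r, d_model)

    def forward(self, x):                     # x: (batch, seq, d_model)
        q = self.q_proj(x)
        latent = self.kv_down(x)               # (batch, seq, r): the KV cache
        k, v = self.k_up(latent), self.v_up(latent)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return attn @ v

layer = LatentAttention(d_model=64)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```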

[NLP-63] RePO: Replay-Enhanced Policy Optimization

【Quick Read】: This paper addresses the high computational cost and low data efficiency of reinforcement learning (RL) when optimizing large language models (LLMs). The proposed solution is Replay-Enhanced Policy Optimization (RePO), whose key idea is to use diverse replay strategies to retrieve off-policy samples from a replay buffer, so that policy optimization for each prompt draws on a broader and more diverse set of samples.

Link: https://arxiv.org/abs/2506.09340
Authors: Siheng Li,Zhanhui Zhou,Wai Lam,Chao Yang,Chaochao Lu
Affiliations: The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of 18.4 and 4.1 points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to 8. The repository can be accessed at this https URL.
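
A toy sketch of the replay idea: per-prompt rollouts are stored in a buffer, and each optimization step mixes fresh on-policy samples with retrieved off-policy ones. The retrieval rule here (top-reward plus most recent) is one plausible strategy, not necessarily the one RePO uses:

```python
# Per-prompt replay buffer mixing on-policy and off-policy samples.
import random
from collections import defaultdict

class ReplayBuffer:
    def __init__(self, capacity_per_prompt=64):
        self.store = defaultdict(list)          # prompt -> [(response, reward)]
        self.capacity = capacity_per_prompt

    def add(self, prompt, response, reward):
        bucket = self.store[prompt]
        bucket.append((response, reward))
        if len(bucket) > self.capacity:
            bucket.pop(0)                       # drop oldest

    def retrieve(self, prompt, k=4):
        bucket = self.store[prompt]
        by_reward = sorted(bucket, key=lambda x: x[1], reverse=True)
        return by_reward[: k // 2] + bucket[-(k - k // 2):]

buf = ReplayBuffer()
for step in range(10):
    buf.add("prompt-1", f"rollout-{step}", reward=random.random())
on_policy = ["fresh-rollout-a", "fresh-rollout-b"]
batch = on_policy + [r for r, _ in buf.retrieve("prompt-1")]
print(batch)  # broader, more diverse sample set for one prompt
```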

[NLP-64] Natural Language Guided Ligand-Binding Protein Design

【Quick Read】: This paper asks how AI protein models can follow human language instructions to design proteins with specified functions (e.g., binding to a ligand). Whereas conventional AI models rely on scarce protein-ligand complex data for training, this work exploits the large body of human-curated text describing protein-ligand interactions and ligand structures. The key to the solution is InstructPro, a generative AI model that follows natural-language instructions to design ligand-binding proteins: given a functional description and a ligand formula in SMILES, it generates protein sequences consistent with the specified function, supported by the large-scale InstructProBench dataset for training and evaluation.

Link: https://arxiv.org/abs/2506.09332
Authors: Zhenqiao Song,Ramith Hettiarachchi,Chuan Li,Jianwen Xie,Lei Li
Affiliations: Carnegie Mellon University; Lambda Lab
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:

Abstract:Can AI protein models follow human language instructions and design proteins with desired functions (e.g. binding to a ligand)? Designing proteins that bind to a given ligand is crucial in a wide range of applications in biology and chemistry. Most prior AI models are trained on protein-ligand complex data, which is scarce due to the high cost and time requirements of laboratory experiments. In contrast, there is a substantial body of human-curated text descriptions about protein-ligand interactions and ligand formula. In this paper, we propose InstructPro, a family of protein generative models that follow natural language instructions to design ligand-binding proteins. Given a textual description of the desired function and a ligand formula in SMILES, InstructPro generates protein sequences that are functionally consistent with the specified instructions. We develop the model architecture, training strategy, and a large-scale dataset, InstructProBench, to support both training and evaluation. InstructProBench consists of 9,592,829 triples of (function description, ligand formula, protein sequence). We train two model variants: InstructPro-1B (with 1 billion parameters) and InstructPro-3B (with 3 billion parameters). Both variants consistently outperform strong baselines, including ProGen2, ESM3, and Pinal. Notably, InstructPro-1B achieves the highest docking success rate (81.52% at moderate confidence) and the lowest average root mean square deviation (RMSD) compared to ground truth structures (4.026Å). InstructPro-3B further decreases the average RMSD to 2.527Å, demonstrating InstructPro’s ability to generate ligand-binding proteins that align with the functional specifications.

[NLP-65] Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation

【Quick Read】: This paper investigates whether large language models (LLMs) can model and reason about others' intentions, i.e., whether they possess a form of theory of mind. The key to the solution is viewing the question through cooperative multi-agent reinforcement learning (MARL), in which agents learn to collaborate via repeated interactions, thereby improving artificial agents' ability to adapt to and cooperate with both artificial and human partners. Using LLM-based agents capable of natural-language interaction, the work moves toward hybrid human-AI systems that enable seamless collaboration.

Link: https://arxiv.org/abs/2506.09331
Authors: Arjun Vaithilingam Sudhakar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: arXiv admin note: substantial text overlap with arXiv:2311.07687

Abstract:Modern Large Language Models (LLMs) exhibit impressive zero-shot and few-shot generalization capabilities across complex natural language tasks, enabling their widespread use as virtual assistants for diverse applications such as translation and summarization. Despite being trained solely on large corpora of text without explicit supervision on author intent, LLMs appear to infer the underlying meaning of textual interactions. This raises a fundamental question: can LLMs model and reason about the intentions of others, i.e., do they possess a form of theory of mind? Understanding others’ intentions is crucial for effective collaboration, which underpins human societal success and is essential for cooperative interactions among multiple agents, including humans and autonomous systems. In this work, we investigate the theory of mind in LLMs through the lens of cooperative multi-agent reinforcement learning (MARL), where agents learn to collaborate via repeated interactions, mirroring human social reasoning. Our approach aims to enhance artificial agents’ ability to adapt and cooperate with both artificial and human partners. By leveraging LLM-based agents capable of natural language interaction, we move towards creating hybrid human-AI systems that can foster seamless collaboration, with broad implications for the future of human-artificial interaction.

[NLP-66] Towards Efficient and Effective Alignment of Large Language Models

【Quick Read】: This thesis targets the problem of aligning large language models (LLMs) with human expectations, which remains a central challenge for both model performance and practical deployment. The key to its solution is a series of novel methods: the Lion and Web Reconstruction (WebR) frameworks to improve data quality and diversity at the data-collection stage; Learning to Edit (LTE) and Bridging and Modeling Correlations (BMC) at the training stage to strengthen knowledge integration and preference alignment; and the FollowBench benchmark at the evaluation stage to more comprehensively measure how well models follow complex constraints.

Link: https://arxiv.org/abs/2506.09329
Authors: Yuxin Jiang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: PhD thesis

Abstract:Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs’ ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models’ constraint adherence, offering insights for future improvements.

[NLP-67] Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models INTERSPEECH2025

【Quick Read】: This paper targets early detection of Alzheimer's dementia (AD), in particular by analyzing changes in language ability. The key to the solution is extending the paired-perplexity approach to a recent large language model (LLM), the instruction-following version of Mistral-7B, which improves detection accuracy and yields an interpretable decision boundary for a more transparent diagnostic process.

Link: https://arxiv.org/abs/2506.09315
Authors: Yao Xiao,Heidi Christensen,Stefan Goetze
Affiliations: University of Sheffield
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be published in the proceedings of Interspeech 2025

Abstract:Alzheimer’s dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts language ability. This work extends the paired perplexity approach to detecting AD by using a recent large language model (LLM), the instruction-following version of Mistral-7B. We improve accuracy by an average of 3.33% over the best current paired perplexity method and by 6.35% over the top-ranked method from the ADReSS 2020 challenge benchmark. Our further analysis demonstrates that the proposed approach can effectively detect AD with a clear and interpretable decision boundary in contrast to other methods that suffer from opaque decision-making processes. Finally, by prompting the fine-tuned LLMs and comparing the model-generated responses to human responses, we illustrate that the LLMs have learned the special language patterns of AD speakers, which opens up possibilities for novel methods of model interpretation and data augmentation.
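
A sketch of the paired-perplexity decision rule: score a transcript by the perplexity gap between a control-tuned and an AD-tuned language model, and flag AD when the AD model finds the text more predictable. The model paths below are placeholders (the paper fine-tunes instruction-following Mistral-7B variants), and the zero threshold is a simplification:

```python
# Paired-perplexity classifier; model paths are hypothetical placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return math.exp(loss.item())

tok = AutoTokenizer.from_pretrained("path/to/tokenizer")           # placeholder
lm_ad = AutoModelForCausalLM.from_pretrained("path/to/ad-lm")      # placeholder
lm_ctrl = AutoModelForCausalLM.from_pretrained("path/to/ctrl-lm")  # placeholder

transcript = "and the boy is on the stool and the the cookie jar ..."
score = perplexity(lm_ctrl, tok, transcript) - perplexity(lm_ad, tok, transcript)
print("AD" if score > 0 else "control")  # threshold 0 as a simple boundary
```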

[NLP-68] (RSA)^2: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding ACL2025

【Quick Read】: This paper addresses the mismatch between literal and intended meaning that figurative language (e.g., irony, hyperbole, understatement) creates in human communication; existing probabilistic pragmatics either cannot account for such expressions or must model, in a setting-specific way, the implicit motivations for using them. The key to the solution is the Rhetorical-Strategy-Aware RSA ((RSA)^2) framework, which models figurative language by considering the speaker's rhetorical strategy, enabling human-compatible interpretations of non-literal utterances without modeling the speaker's motivation for being non-literal.

Link: https://arxiv.org/abs/2506.09301
Authors: Cesare Spinoso-Di Piano,David Austin,Pablo Piantanida,Jackie Chi Kit Cheung
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 (Main Conference)

Abstract:Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA (RSA)^2 framework which models figurative language use by considering a speaker’s employed rhetorical strategy. We show that (RSA)^2 enables human-compatible interpretations of non-literal utterances without modeling a speaker’s motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.

[NLP-69] UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

【Quick Read】: This paper addresses the insufficiency of test cases in the SWE-Bench code-generation benchmark, which lets generated patches pass the tests without actually resolving the underlying issue. The key to the solution is UTGenerator, an LLM-driven test-case generator that automatically analyzes codebases and their dependencies to produce valid test cases for real-world Python projects, on top of which the UTBoost framework performs test-case augmentation.

Link: https://arxiv.org/abs/2506.09289
Authors: Boxi Yu,Yuxuan Zhu,Pinjia He,Daniel Kang
Affiliations: The Chinese University of Hong Kong, Shenzhen; University of Illinois Urbana-Champaign
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE-Bench. These corrections, impacting 40.9% of SWE-Bench Lite and 24.4% of SWE-Bench Verified leaderboard entries, yield 18 and 11 ranking changes, respectively.

[NLP-70] Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models

【Quick Read】: This paper targets the faithfulness of self-explanations (self-NLE) produced by generative AI: such explanations can look logically plausible without reflecting the model's actual decision process. The key to the solution is a novel, flexible framework that quantitatively measures the faithfulness of self-NLE by directly comparing them with interpretations of the model's internal hidden states, establishing a direct link between self-explanations and model reasoning.

Link: https://arxiv.org/abs/2506.09277
Authors: Milan Bhan,Jean-Noel Vittaut,Nicolas Chesneau,Sarath Chandar,Marie-Jeanne Lesot
Affiliations: Ekimetrics; Sorbonne Université; LIP6; Mila; Chandar Research Lab; Polytechnique Montréal; Canada CIFAR AI Chair
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLM) have demonstrated the capability of generating free-text self Natural Language Explanations (self-NLE) to justify their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM's actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model’s reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model’s internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.

[NLP-71] ThinkQE: Query Expansion via an Evolving Thinking Process

【Quick Read】: This paper addresses the shortcomings of query expansion in web search, where generated expansions are often too narrow to capture a query's multiple interpretations and facets. The key to the solution is the ThinkQE framework with two core components: a thinking-based expansion process that encourages deeper, more comprehensive semantic exploration, and a corpus-interaction strategy that iteratively refines expansions using retrieval feedback from the corpus.

Link: https://arxiv.org/abs/2506.09260
Authors: Yibin Lei,Tao Shen,Andrew Yates
Affiliations: University of Amsterdam; University of Technology Sydney; Johns Hopkins University, HLTCOE
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Effective query expansion for web search benefits from promoting both exploration and result diversity to capture multiple interpretations and facets of a query. While recent LLM-based methods have improved retrieval performance and demonstrate strong domain generalization without additional training, they often generate narrowly focused expansions that overlook these desiderata. We propose ThinkQE, a test-time query expansion framework addressing this limitation through two key components: a thinking-based expansion process that encourages deeper and comprehensive semantic exploration, and a corpus-interaction strategy that iteratively refines expansions using retrieval feedback from the corpus. Experiments on diverse web search benchmarks (DL19, DL20, and BRIGHT) show ThinkQE consistently outperforms prior approaches, including training-intensive dense retrievers and rerankers.
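
A self-contained sketch of the corpus-interaction loop: expand, retrieve, then refine the expansion from what the corpus returns. To keep it runnable, the LLM "thinking" step is replaced here by naive term selection from top-ranked documents (classic pseudo-relevance feedback), and retrieval is plain term overlap rather than a real search stack:

```python
# Corpus-feedback expansion loop; the toy corpus and scoring are invented.
CORPUS = [
    "jaguar is a large cat native to the americas",
    "the jaguar xk is a british sports car",
    "big cats include lions tigers and jaguars",
]

def retrieve(query_terms, k=2):
    scored = sorted(CORPUS,
                    key=lambda d: len(query_terms & set(d.split())),
                    reverse=True)
    return scored[:k]

def expand(query: str, rounds=2, terms_per_round=2):
    terms = set(query.split())
    for _ in range(rounds):
        feedback = " ".join(retrieve(terms)).split()
        new = [w for w in feedback if w not in terms and len(w) > 3]
        terms |= set(new[:terms_per_round])   # refine with corpus feedback
    return " ".join(sorted(terms))

print(expand("jaguar"))
```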

[NLP-72] Self-Anchored Attention Model for Sample-Efficient Classification of Prosocial Text Chat

【Quick Read】: This paper tackles the difficulty of identifying prosocial behavior in game chat, especially in low-resource settings lacking large-scale labeled data. While prior work has focused on detecting toxic content, this paper emphasizes recognizing and promoting positive interactions. The key to the solution is combining unsupervised discovery with game-domain expert collaboration to identify and categorize prosocial behaviors, together with a novel Self-Anchored Attention Model (SAAM) that uses the entire training set as "anchors" to improve performance under data scarcity, yielding the first automated system for classifying prosocial behavior.

Link: https://arxiv.org/abs/2506.09259
Authors: Zhuofang Li,Rafal Kocielnik,Fereshteh Soltani,Penphob (Andrea) Boonyarungsrit,Animashree Anandkumar,R. Michael Alvarez
Affiliations: California Institute of Technology; Activision-Blizzard-King; Santa Monica
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Millions of players engage daily in competitive online games, communicating through in-game chat. Prior research has focused on detecting relatively small volumes of toxic content using various Natural Language Processing (NLP) techniques for the purpose of moderation. However, recent studies emphasize the importance of detecting prosocial communication, which can be as crucial as identifying toxic interactions. Recognizing prosocial behavior allows for its analysis, rewarding, and promotion. Unlike toxicity, there are limited datasets, models, and resources for identifying prosocial behaviors in game-chat text. In this work, we employed unsupervised discovery combined with game domain expert collaboration to identify and categorize prosocial player behaviors from game chat. We further propose a novel Self-Anchored Attention Model (SAAM) which gives a 7.9% improvement compared to the best existing technique. The approach utilizes the entire training set as "anchors" to help improve model performance under the scarcity of training data. This approach led to the development of the first automated system for classifying prosocial behaviors in in-game chats, particularly given the low-resource settings where large-scale labeled data is not available. Our methodology was applied to one of the most popular online gaming titles - Call of Duty®: Modern Warfare® II, showcasing its effectiveness. This research is novel in applying NLP techniques to discover and classify prosocial behaviors in player in-game chat communication. It can help shift the focus of moderation from solely penalizing toxicity to actively encouraging positive interactions on online platforms.

[NLP-73] Extrapolation by Association: Length Generalization Transfer in Transformers

【Quick Read】: This paper probes the fine-grained mechanisms behind the generalization of Transformer language models, in particular how length generalization arises. The key to the solution is the lens of task association, which explains how models transfer generalization learned on shorter inputs to longer ones: training on related tasks, especially longer related auxiliary tasks, leads models to generalize to unseen, longer inputs of a target task. The study further finds that pretrained language models carry reusable computational scaffolding that aids downstream extrapolation, and that length generalization transfer correlates with the re-use of attention heads across tasks.

Link: https://arxiv.org/abs/2506.09251
Authors: Ziyang Cai,Nayoung Lee,Avi Schwarzschild,Samet Oymak,Dimitris Papailiopoulos
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 20 figures

Abstract:Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization, the ability to extrapolate from shorter to longer inputs, through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

[NLP-74] A Technique for Isolating Lexically-Independent Phonetic Dependencies in Generative CNNs

【Quick Read】: This paper addresses the open question of how well deep neural networks (DNNs) represent phonotactic generalizations, focusing on the lexically independent generalization of generative convolutional neural networks (CNNs) trained on raw audio waveforms of lexical items. The key to the solution is shrinking the fully-connected (FC) bottleneck from 1024 channels to 8 and proposing a new technique for probing a model's lexically independent generalizations: under the narrow FC bottleneck, audio outputs are generated by bypassing the FC layer and feeding randomized feature maps into the convolutional block. These outputs are just as biased by a phonotactic restriction from training as FC-generated outputs, showing that the convolutional layers can dynamically generalize phonetic dependencies beyond the lexically constrained configurations learned by the FC layer.

Link: https://arxiv.org/abs/2506.09218
Authors: Bruno Ferenc Šegedin
Affiliations: Brown University
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The ability of deep neural networks (DNNs) to represent phonotactic generalizations derived from lexical learning remains an open question. This study (1) investigates the lexically-invariant generalization capacity of generative convolutional neural networks (CNNs) trained on raw audio waveforms of lexical items and (2) explores the consequences of shrinking the fully-connected layer (FC) bottleneck from 1024 channels to 8 before training. Ultimately, a novel technique for probing a model’s lexically-independent generalizations is proposed that works only under the narrow FC bottleneck: generating audio outputs by bypassing the FC and inputting randomized feature maps into the convolutional block. These outputs are equally biased by a phonotactic restriction in training as are outputs generated with the FC. This result shows that the convolutional layers can dynamically generalize phonetic dependencies beyond lexically-constrained configurations learned by the FC.

[NLP-75] SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research

【Quick Read】: This paper addresses the scarcity of large-scale classroom speech data in education, which has hindered AI-driven speech models: public classroom datasets are limited, and the absence of a dedicated classroom-noise corpus rules out standard data augmentation. The key to the solution is a scalable methodology for synthesizing classroom noise with a game engine, used to build SimClass, a dataset comprising a synthesized classroom-noise corpus and a simulated classroom speech dataset created by pairing a public children's speech corpus with YouTube lecture videos to approximate real classroom interactions. Experiments show that SimClass closely approximates real classroom speech, providing a valuable resource for building robust speech recognition and enhancement models.

Link: https://arxiv.org/abs/2506.09206
Authors: Ahmed Adel Attia,Jing Liu,Carl Espy-Wilson
Affiliations: University of Maryland
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Public classroom datasets remain limited, and the lack of a dedicated classroom noise corpus prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise using game engines, a framework that extends to other domains. Using this methodology, we present SimClass, a dataset that includes both a synthesized classroom noise corpus and a simulated classroom speech dataset. The speech data is generated by pairing a public children’s speech corpus with YouTube lecture videos to approximate real classroom interactions in clean conditions. Our experiments on clean and noisy speech demonstrate that SimClass closely approximates real classroom speech, making it a valuable resource for developing robust speech recognition and enhancement models.

[NLP-76] FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems ICML2025

【Quick Read】: This paper addresses the limitations of relying solely on the parametric memory of large language models in conventional retrieval-augmented generation (RAG) systems, and improves RAG by fine-tuning the retriever and generator models. The key to the solution is FedRAG, a framework for fine-tuning RAG systems across both centralized and federated architectures: it supports state-of-the-art fine-tuning methods, offers an intuitive interface with seamless conversion from centralized to federated training, and is deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.

Link: https://arxiv.org/abs/2506.09200
Authors: Val Andrei Fajardo,David B. Emerson,Amandeep Singh,Veronica Chatrath,Marcelo Lotif,Ravi Theja,Alex Cheung,Izuki Matsubi
Affiliations: Vector Institute; Independent Researcher
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, 2 tables. Accepted for the CODEML Workshop at ICML 2025. Framework code available at this https URL

Abstract:Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.

[NLP-77] PHRASED: Phrase Dictionary Biasing for Speech Translation

【Quick Read】: This paper addresses the difficulty of translating phrases correctly in speech translation, since phrases occur rarely in training data. The key to the solution is a phrase dictionary biasing method that exploits source-to-target phrase mappings. Applied to two widely used model types, a transducer-based streaming speech translation model and a multimodal large language model, the method improves over phrase-list biasing by 21% relative for the streaming model, and raises phrase recall by 85% relative for the multimodal LLM by letting it use external phrase information.

Link: https://arxiv.org/abs/2506.09175
Authors: Peidong Wang,Jian Xue,Rui Zhao,Junkun Chen,Aswin Shanmugam Subramanian,Jinyu Li
Affiliations: Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Phrases are essential to understand the core concepts in conversations. However, due to their rare occurrence in training data, correct translation of phrases is challenging in speech translation tasks. In this paper, we propose a phrase dictionary biasing method to leverage pairs of phrases mapping from the source language to the target language. We apply the phrase dictionary biasing method to two types of widely adopted models, a transducer-based streaming speech translation model and a multimodal large language model. Experimental results show that the phrase dictionary biasing method outperforms phrase list biasing by 21% relatively for the streaming speech translation model. In addition, phrase dictionary biasing enables multimodal large language models to use external phrase information, achieving 85% relative improvement in phrase recall.

[NLP-78] The Curious Language Model: Strategic Test-Time Information Acquisition

【Quick Read】: This paper addresses how to choose actions that are both informative and cost-effective when a decision must be made with insufficient information (information acquisition). The key to the solution is CuriosiTree, a heuristic-based, zero-shot test-time policy for information acquisition that uses greedy tree search to estimate each action's expected information gain and strategically selects actions by balancing anticipated gain against cost.

Link: https://arxiv.org/abs/2506.09173
Authors: Michael Cooper,Rohan Wadhawan,John Michael Giorgi,Chenhao Tan,Davis Liang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 39 pages

Abstract:Decision-makers often possess insufficient information to render a confident decision. In these cases, the decision-maker can often undertake actions to acquire the necessary information about the problem at hand, e.g., by consulting knowledgeable authorities or by conducting experiments. Importantly, different levers of information acquisition come with different costs, posing the challenge of selecting the actions that are both informative and cost-effective. In this work, we propose CuriosiTree, a heuristic-based, test-time policy for zero-shot information acquisition in large language models (LLMs). CuriosiTree employs a greedy tree search to estimate the expected information gain of each action and strategically chooses actions based on a balance of anticipated information gain and associated cost. Empirical validation in a clinical diagnosis simulation shows that CuriosiTree enables cost-effective integration of heterogeneous sources of information, and outperforms baseline action selection strategies in selecting action sequences that enable accurate diagnosis.
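
The gain-versus-cost selection can be sketched with explicit probabilities: each action is scored by the expected drop in entropy of the belief over diagnoses, minus a weighted cost. The beliefs, outcome models, and costs below are invented toy numbers; in CuriosiTree these quantities are estimated with an LLM inside a greedy tree search:

```python
# Greedy information-gain-minus-cost action scoring on a toy belief state.
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

belief = {"flu": 0.5, "cold": 0.5}

# action -> list of (outcome_prob, posterior_belief_after_outcome)
actions = {
    "ask_symptom": [(0.5, {"flu": 0.9, "cold": 0.1}),
                    (0.5, {"flu": 0.1, "cold": 0.9})],
    "run_lab_test": [(1.0, {"flu": 0.55, "cold": 0.45})],
}
costs = {"ask_symptom": 0.1, "run_lab_test": 1.0}

def score(action, cost_weight=1.0):
    expected_posterior_entropy = sum(p * entropy(post)
                                     for p, post in actions[action])
    info_gain = entropy(belief) - expected_posterior_entropy
    return info_gain - cost_weight * costs[action]

best = max(actions, key=score)
print(best, round(score(best), 3))  # the cheap question beats the costly test
```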

[NLP-79] Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search ICML2025

【Quick Read】: This paper addresses the problem that large language models (LLMs) need extensive guidance or long interaction histories to act effectively in complex interactive environments, and that existing methods adapt poorly to new information or use past experience inefficiently for multi-step reasoning without fine-tuning. The key to the solution is an in-context-learning LLM agent framework that improves planning via atomic fact augmentation and recursive lookahead search: the agent extracts task-critical "atomic facts" from its interaction trajectories and uses them to dynamically augment the prompts for action proposal, latent world-model simulation, and state-value estimation, planning with a depth-limited lookahead search. This enables online learning and decision refinement without weight updates.

Link: https://arxiv.org/abs/2506.09171
Authors: Samuel Holt,Max Ruiz Luyten,Thomas Pouplin,Mihaela van der Schaar
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9-page main paper, 1 figure. Accepted for an Oral presentation at the First Workshop on Computer Use Agents (ICML 2025), Vancouver, Canada

Abstract:Large Language Models (LLMs) are increasingly capable but often require significant guidance or extensive interaction history to perform effectively in complex, interactive environments. Existing methods may struggle with adapting to new information or efficiently utilizing past experiences for multi-step reasoning without fine-tuning. We introduce a novel LLM agent framework that enhances planning capabilities through in-context learning, facilitated by atomic fact augmentation and a recursive lookahead search. Our agent learns to extract task-critical "atomic facts" from its interaction trajectories. These facts dynamically augment the prompts provided to LLM-based components responsible for action proposal, latent world model simulation, and state-value estimation. Planning is performed via a depth-limited lookahead search, where the LLM simulates potential trajectories and evaluates their outcomes, guided by the accumulated facts and interaction history. This approach allows the agent to improve its understanding and decision-making online, leveraging its experience to refine its behavior without weight updates. We provide a theoretical motivation linking performance to the quality of fact-based abstraction and LLM simulation accuracy. Empirically, our agent demonstrates improved performance and adaptability on challenging interactive tasks, achieving more optimal behavior as it accumulates experience, showcased in tasks such as TextFrozenLake and ALFWorld.
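
A runnable skeleton of the depth-limited lookahead: in the paper, the propose/simulate/value roles are played by LLM calls whose prompts are augmented with the accumulated atomic facts; here all three are deterministic stand-ins so the recursion itself can be seen end to end:

```python
# Depth-limited lookahead over a stubbed world model; every component
# below that the paper implements with an LLM is a toy stand-in.
FACTS = ["tiles 2 and 5 are holes"]            # facts extracted from past runs

def propose_actions(state):                    # LLM action proposer stand-in
    return ["left", "right"]

def simulate(state, action):                   # LLM world-model stand-in
    return state + (1 if action == "right" else -1)

def value(state):                              # LLM value stand-in; the
    # accumulated facts enter here as a penalty on the known hole tiles.
    return -abs(7 - state) if state not in (2, 5) else -100

def lookahead(state, depth):
    if depth == 0:
        return value(state), None
    best = (float("-inf"), None)
    for a in propose_actions(state):
        v, _ = lookahead(simulate(state, a), depth - 1)
        best = max(best, (v, a))
    return best

print(lookahead(state=3, depth=3))  # best reachable value and first action
```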

[NLP-80] Adversarial Text Generation with Dynamic Contextual Perturbation

【Quick Read】: This paper addresses the lack of context awareness in traditional adversarial attacks on NLP models, which typically modify words or local text segments and thus produce perturbations that are easy to detect or semantically inconsistent. The key to the solution is Dynamic Contextual Perturbation (DCP), a novel adversarial text attack that dynamically generates context-aware perturbations across sentences, paragraphs, and documents; leveraging pre-trained language models, it iteratively refines perturbations with an adversarial objective that balances inducing misclassification against preserving naturalness, yielding more sophisticated and effective adversarial examples.

Link: https://arxiv.org/abs/2506.09148
Authors: Hetvi Waghela,Jaydip Sen,Sneha Rakshit,Subhasis Dasgupta
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: This is the accepted version of the paper, which was presented at IEEE CALCON. The conference was organized at Jadavpur University, Kolkata, from December 14 to 15, 2025. The paper is six pages long, and it consists of six tables and six figures. This is not the final camera-ready version of the paper

Abstract:Adversarial attacks on Natural Language Processing (NLP) models expose vulnerabilities by introducing subtle perturbations to input text, often leading to misclassification while maintaining human readability. Existing methods typically focus on word-level or local text segment alterations, overlooking the broader context, which results in detectable or semantically inconsistent perturbations. We propose a novel adversarial text attack scheme named Dynamic Contextual Perturbation (DCP). DCP dynamically generates context-aware perturbations across sentences, paragraphs, and documents, ensuring semantic fidelity and fluency. Leveraging the capabilities of pre-trained language models, DCP iteratively refines perturbations through an adversarial objective function that balances the dual objectives of inducing model misclassification and preserving the naturalness of the text. This comprehensive approach allows DCP to produce more sophisticated and effective adversarial examples that better mimic natural language patterns. Our experimental results, conducted on various NLP models and datasets, demonstrate the efficacy of DCP in challenging the robustness of state-of-the-art NLP systems. By integrating dynamic contextual analysis, DCP significantly enhances the subtlety and impact of adversarial attacks. This study highlights the critical role of context in adversarial attacks and lays the groundwork for creating more robust NLP systems capable of withstanding sophisticated adversarial strategies.

[NLP-81] LLM-as-a-qualitative-judge: automating error analysis in natural language generation

【Quick Read】: This paper addresses the limitation that conventional LLM-based evaluation (LLM-as-a-judge) in natural language generation (NLG) serves only as a quantitative tool and cannot provide deeper issue analysis. The key to the solution is LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types, giving developers concrete insights for improving an NLG system. It consists of two main steps: open-ended per-instance issue analysis, and clustering of the discovered issues with an intuitive cumulative algorithm.

Link: https://arxiv.org/abs/2506.09147
Authors: Nadezhda Chirkova,Tunde Oluwaseyi Ajayi,Seth Aycock,Zain Muhammad Mujahid,Vladana Perlić,Ekaterina Borisova,Markarit Vartampetian
Affiliations: Naver Labs Europe; Insight Research Ireland Centre for Data Analytics, Data Science Institute, University of Galway; University of Amsterdam; University of Copenhagen; Télécom Paris; Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI); Technische Universität Berlin; Université Grenoble Alpes
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at this https URL.
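
One plausible reading of the "intuitive cumulative algorithm" for the second step: walk through the discovered issues once, attaching each to the first cluster whose representative is similar enough, otherwise opening a new cluster. Here, difflib string similarity stands in for the semantic similarity an actual implementation would use, and the threshold is invented:

```python
# Cumulative one-pass clustering of issue descriptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cumulative_cluster(issues, threshold=0.5):
    clusters = []                       # each cluster: [representative, ...]
    for issue in issues:
        for cluster in clusters:
            if similarity(issue, cluster[0]) >= threshold:
                cluster.append(issue)
                break
        else:
            clusters.append([issue])    # no match: start a new issue type
    return clusters

issues = [
    "answer contradicts the source document",
    "answer contradicts the source text",
    "output is cut off mid-sentence",
    "response truncated mid-sentence",
]
for c in cumulative_cluster(issues):
    print(c)
```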

[NLP-82] CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

【Quick Read】: This paper addresses the uneven performance of text-to-image models across cultures and the difficulty of evaluating and mitigating cross-cultural bias. The key to the solution is CAIRe, a new evaluation metric that assesses an image's cultural relevance: it grounds the entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label, overcoming the unreliability of existing evaluation methods and improving the quantification of cultural bias.

Link: https://arxiv.org/abs/2506.09109
Authors: Arnav Yayavaram,Siddharth Yayavaram,Simran Khanuja,Michael Saxon,Graham Neubig
Affiliations: BITS Pilani; Carnegie Mellon University; University of California, Santa Barbara
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Preprint, under review

Abstract:As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 28% F1 points. Additionally, we construct two datasets of culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson’s correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.

[NLP-83] SensorLM: Learning the Language of Wearable Sensors

【Quick Read】: This paper addresses the problem of aligning and interpreting wearable sensor data with natural language, which is especially hard given the lack of paired, richly annotated sensor-text descriptions in uncurated real-world wearable data. The key to the solution is a hierarchical caption-generation pipeline designed to capture statistical, structural, and semantic information from sensor data, used to curate the largest sensor-language dataset to date, together with extensions of prominent multimodal pretraining architectures for stronger cross-modal understanding and task generalization.

Link: https://arxiv.org/abs/2506.09108
Authors: Yuwei Zhang,Kumar Ayush,Siyuan Qiao,A. Ali Heydari,Girish Narayanswamy,Maxwell A. Xu,Ahmed A. Metwally,Shawn Xu,Jake Garrison,Xuhai Xu,Tim Althoff,Yun Liu,Pushmeet Kohli,Jiening Zhan,Mark Malhotra,Shwetak Patel,Cecilia Mascolo,Xin Liu,Daniel McDuff,Yuzhe Yang
Affiliations: Google DeepMind; Google Research; University of Cambridge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare verify the superior performance of SensorLM over state-of-the-art in zero-shot recognition, few-shot learning, and cross-modal retrieval. SensorLM also demonstrates intriguing capabilities including scaling behaviors, label efficiency, sensor captioning, and zero-shot generalization to unseen tasks.

[NLP-84] Too Big to Think: Capacity, Memorization, and Generalization in Pre-Trained Transformers

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中记忆与泛化之间的关系问题,特别是二者如何相互影响及在不同模型容量下的表现差异。其解决方案的关键在于通过在两个设计好的合成字符级任务上预训练一系列容量受限的Transformer模型,分别探测泛化(通过算术外推)和记忆(通过事实回忆)能力,并观察模型在不同规模下的表现变化,从而揭示模型容量对学习行为的影响。

链接: https://arxiv.org/abs/2506.09099
作者: Joshua Barron,Devin White
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted for oral presentation to Tiny Titans: The next wave of On-Device Learning for Foundational Models Workshop at the 42nd International Conference on Machine Learning

点击查看摘要

Abstract:The relationship between memorization and generalization in large language models (LLMs) remains an open area of research, with growing evidence that the two are deeply intertwined. In this work, we investigate this relationship by pre-training a series of capacity-limited Transformer models from scratch on two synthetic character-level tasks designed to separately probe generalization (via arithmetic extrapolation) and memorization (via factual recall). We observe a consistent trade-off: small models extrapolate to unseen arithmetic cases but fail to memorize facts, while larger models memorize but fail to extrapolate. An intermediate-capacity model exhibits a similar shift toward memorization. When trained on both tasks jointly, no model (regardless of size) succeeds at extrapolation. These findings suggest that pre-training may intrinsically favor one learning mode over the other. By isolating these dynamics in a controlled setting, our study offers insight into how model capacity shapes learning behavior and offers broader implications for the design and deployment of small language models.

[NLP-85] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

【Quick Read】: This paper addresses comprehensive evaluation of multimodal models across diverse vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. The key to the solution is the open-source FlagEvalMM framework, which decouples model inference from evaluation via an independent evaluation service for flexible resource allocation, and combines advanced inference acceleration tools (e.g., vLLM, SGLang) with asynchronous data loading to markedly improve evaluation efficiency.

Link: https://arxiv.org/abs/2506.09081
Authors: Zheqi He,Yesheng Liu,Jing-shu Zheng,Xuejing Li,Richeng Xuan,Jin-Ge Yao,Xi Yang
Affiliations: BAAI FlagEval Team
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.

[NLP-86] RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

【Quick Read】: This paper asks whether small reasoning models (SRMs) can learn rule-based reasoning with robust generalization, given that real-world variation in rule formats, types, and complexity poses severe challenges. The key to the solution is RuleReasoner, which performs rule-based reasoning via a broad collection of curated tasks and a novel domain-aware dynamic sampling strategy: the sampling weights of different domains are updated from historical rewards, enabling domain augmentation and flexible online learning schedules without the pre-hoc, human-engineered mix-training recipes used by existing methods.

Link: https://arxiv.org/abs/2506.08672
Authors: Yang Liu,Jiaqi Li,Zilong Zheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 10 figures, 8 tables

Abstract:Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
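
The domain-aware dynamic sampling can be sketched in a few lines: domains with low historical reward are treated as harder and upweighted in the next batch. The softmax-of-negative-reward form and the temperature are assumptions; the abstract only specifies that sampling weights are updated from historical rewards:

```python
# Domain-aware dynamic sampling: harder domains get sampled more often.
import math
import random

def domain_weights(mean_rewards, temperature=0.5):
    # Lower historical reward -> harder domain -> higher sampling weight.
    scores = {d: math.exp(-r / temperature) for d, r in mean_rewards.items()}
    z = sum(scores.values())
    return {d: s / z for d, s in scores.items()}

history = {"logic_grid": 0.8, "legal_rules": 0.3, "game_rules": 0.5}
weights = domain_weights(history)
batch = random.choices(list(weights), weights=list(weights.values()), k=8)
print(weights)
print(batch)  # resampled batch skewed toward the weakest domain
```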

[NLP-87] Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

【Quick Read】: This paper addresses the instability of in-context learning (ICL) in smaller Large Multimodal Models (LMMs), whose performance does not always improve monotonically with more examples. The key to the solution is a meta-learning approach that induces few-shot ability through fixed soft prompts distilled from task-relevant image features and adapted at test time with a few examples; to enable this distillation, an attention-mapper module is introduced that integrates with the popular LLaVA v1.5 architecture and is learned jointly with the soft prompts, allowing task adaptation in low-data regimes with just a few gradient steps.

Link: https://arxiv.org/abs/2506.06905
Authors: Akash Gupta,Amos Storkey,Mirella Lapata
Affiliations: University of Edinburgh
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.

[NLP-88] An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks

【Quick Read】: This paper addresses how to evaluate and compare jailbreak attacks on safety-tuned large language models (LLMs), which differ widely in fluency and computational cost. The key to the solution is a unified threat model based on an N-gram language model trained on 1T tokens, which assesses how likely a given jailbreak is under the distribution of text in a nonparametric, model-agnostic, and inherently interpretable way. The model enables fair benchmarking of existing attacks and reveals a key finding: effective attacks exploit bigrams that are rare in or absent from real-world text.

Link: https://arxiv.org/abs/2410.16222
Authors: Valentyn Boreiko,Alexander Panfilov,Vaclav Voracek,Matthias Hein,Jonas Geiping
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.
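
A toy version of the N-gram threat model: a candidate string is scored by the perplexity of its bigrams under counts from "natural" text, so adversarial suffixes built from rare or absent bigrams score high. The three-sentence corpus and add-one smoothing stand in for the paper's 1T-token model:

```python
# Tiny bigram-LM perplexity check; corpus and smoothing are toy stand-ins.
import math
from collections import Counter

corpus = ("please tell me a story about a cat . "
          "tell me how to bake bread . "
          "please explain how rockets work .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

def bigram_ppl(text: str) -> float:
    toks = text.split()
    logp = 0.0
    for w1, w2 in zip(toks, toks[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)  # add-one smoothing
        logp += math.log(p)
    return math.exp(-logp / max(len(toks) - 1, 1))

print(bigram_ppl("please tell me a story"))          # fluent: low perplexity
print(bigram_ppl("describing .] zxq !! now write"))  # rare bigrams: high
```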

[NLP-89] Advancing Exchange Rate Forecasting: Leveraging Machine Learning and AI for Enhanced Accuracy in Global Financial Markets

【速读】: 该论文旨在解决外汇汇率预测问题,特别是美元(USD)对孟加拉国塔卡(BDT)汇率的准确预测,以支持贸易、投资和经济稳定决策。其解决方案的关键在于采用长短期记忆网络(LSTM)构建先进的机器学习模型,通过历史汇率数据进行训练,实现了高达99.449%的预测精度,显著优于传统方法如ARIMA。此外,研究还结合了梯度提升分类器(GBC)进行方向性预测,并通过回测验证了模型在实际交易中的表现。

链接: https://arxiv.org/abs/2506.09851
作者: Md. Yeasin Rahat,Rajan Das Gupta,Nur Raisa Rahman,Sudipto Roy Pritom,Samiur Rahman Shakir,Md Imrul Hasan Showmick,Md. Jakir Hossen
机构: 未知
类目: Statistical Finance (q-fin.ST); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted in MECON 2025

点击查看摘要

Abstract:The prediction of foreign exchange rates, such as the US Dollar (USD) to Bangladeshi Taka (BDT), plays a pivotal role in global financial markets, influencing trade, investments, and economic stability. This study leverages historical USD/BDT exchange rate data from 2018 to 2023, sourced from Yahoo Finance, to develop advanced machine learning models for accurate forecasting. A Long Short-Term Memory (LSTM) neural network is employed, achieving an exceptional accuracy of 99.449%, a Root Mean Square Error (RMSE) of 0.9858, and a test loss of 0.8523, significantly outperforming traditional methods like ARIMA (RMSE 1.342). Additionally, a Gradient Boosting Classifier (GBC) is applied for directional prediction, with backtesting on a $10,000 initial capital revealing a 40.82% profitable trade rate, though resulting in a net loss of $20,653.25 over 49 trades. The study analyzes historical trends, showing a decline in BDT/USD rates from 0.012 to 0.009, and incorporates normalized daily returns to capture volatility. These findings highlight the potential of deep learning in forex forecasting, offering traders and policymakers robust tools to mitigate risks. Future work could integrate sentiment analysis and real-time economic indicators to further enhance model adaptability in volatile markets.
zh
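
【代码示意】: 以下PyTorch片段示意论文所用的"滑动窗口 + LSTM回归"结构:用前30天的归一化汇率预测下一天。数据为随机占位序列,窗口长度、隐藏维度等超参均为假设,并非论文原配置。

```python
import torch
import torch.nn as nn

rates = torch.cumsum(torch.randn(500) * 1e-4, dim=0) + 0.011  # 占位的USD/BDT日度序列

def make_windows(series, window=30):
    """min-max归一化后按滑动窗口切分:前window天 -> 下一天。"""
    s = (series - series.min()) / (series.max() - series.min())
    xs = torch.stack([s[i:i + window] for i in range(len(s) - window)])
    return xs.unsqueeze(-1), s[window:].unsqueeze(-1)  # (N, 30, 1), (N, 1)

class RateLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # 取最后时间步隐状态做回归

x, y = make_windows(rates)
model = RateLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):  # 仅示意训练循环
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```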

[NLP-90] Regularizing Learnable Feature Extraction for Automatic Speech Recognition INTERSPEECH2025

【速读】: 该论文试图解决可学习特征提取前端在自动语音识别(ASR)系统中性能低于传统固定特征提取方法的问题,其关键在于减少模型对训练数据的过拟合。研究提出通过正则化方法来提升模型泛化能力,主要包括音频扰动方法以及在短时傅里叶变换(STFT)域进行掩码处理的改进策略,最终有效缩小了传统特征与可学习特征之间的性能差距。

链接: https://arxiv.org/abs/2506.09804
作者: Peter Vieting,Maximilian Kannen,Benedikt Hilmes,Ralf Schlüter,Hermann Ney
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
zh
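
【代码示意】: 下面的PyTorch片段示意"在STFT域做掩码"这一正则化改动:先对波形做STFT,在时频图上随机置零若干时间段与频带,再逆变换回波形送入可学习前端。掩码数量与宽度等参数为假设值,仅说明机制。

```python
import torch

def stft_domain_mask(wave, n_fft=400, hop=160,
                     n_time_masks=2, t_width=20,
                     n_freq_masks=2, f_width=15):
    """SpecAugment式掩码,但施加在STFT域并逆变换回波形。"""
    win = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=win, return_complex=True)
    n_freq, n_frames = spec.shape[-2], spec.shape[-1]
    for _ in range(n_time_masks):  # 时间掩码
        t0 = torch.randint(0, max(n_frames - t_width, 1), (1,)).item()
        spec[..., t0:t0 + t_width] = 0
    for _ in range(n_freq_masks):  # 频率掩码
        f0 = torch.randint(0, max(n_freq - f_width, 1), (1,)).item()
        spec[..., f0:f0 + f_width, :] = 0
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=win, length=wave.shape[-1])

wave = torch.randn(16000)            # 1秒16kHz占位音频
augmented = stft_domain_mask(wave)   # 形状与输入一致,可直接喂给可学习前端
```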

[NLP-91] Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements

【速读】: 该论文试图解决在延长暴露疗法(Prolonged Exposure, PE)中评估治疗师遵循治疗协议的忠实度(fidelity)所面临的劳动密集型问题,即需要人工审阅治疗录音来确定关键治疗要素的时间位置。解决方案的关键在于利用预训练的音频-语言模型Qwen2-Audio,并通过低秩适应(LoRA)进行微调,以自动识别治疗过程中三个核心阶段(治疗师引导、想象暴露和想象后处理)的时间边界,从而实现对治疗忠实度的自动化评估。

链接: https://arxiv.org/abs/2506.09707
作者: Suhas BN,Andrew M. Sherrill,Jyoti Alaparthi,Dominik Mattioli,Rosa I. Arriaga,Chris W. Wiese,Saeed Abdullah
机构: College of Information Sciences & Technology, The Pennsylvania State University, University Park, USA; Department of Psychiatry & Behavioral Sciences, Emory University, Atlanta, USA; School of Interactive Computing, Georgia Institute of Technology, Atlanta, USA; School of Psychology, Georgia Institute of Technology, Atlanta, USA
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Prolonged Exposure (PE) therapy is an effective treatment for post-traumatic stress disorder (PTSD), but evaluating therapist fidelity remains labor-intensive due to the need for manual review of session recordings. We present a method for the automatic temporal localization of key PE fidelity elements – identifying their start and stop times – directly from session audio and transcripts. Our approach fine-tunes a large pre-trained audio-language model, Qwen2-Audio, using Low-Rank Adaptation (LoRA) to process focused 30-second windows of audio-transcript input. Fidelity labels for three core protocol phases – therapist orientation (P1), imaginal exposure (P2), and post-imaginal processing (P3) – are generated via LLM-based prompting and verified by trained raters. The model is trained to predict normalized boundary offsets using soft supervision guided by task-specific prompts. On a dataset of 313 real PE sessions, our best configuration (LoRA rank 8, 30s windows) achieves a mean absolute error (MAE) of 5.3 seconds across tasks. We further analyze the effects of window size and LoRA rank, highlighting the importance of context granularity and model adaptation. This work introduces a scalable framework for fidelity tracking in PE therapy, with potential to support clinician training, supervision, and quality assurance.
zh
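
【代码示意】: 以下是LoRA机制本身的一个最小PyTorch实现示意:冻结原线性层权重,仅训练低秩旁路 B·A。秩r=8对应文中的最佳配置;其余维度与初始化为常见假设,并非Qwen2-Audio的实际接入方式。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha/r) * B A x,其中W冻结,仅A、B可训练。"""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # 冻结原权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B初始为0,起点等价于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(nn.Linear(768, 768))
_ = layer(torch.randn(4, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 仅低秩参数参与训练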

[NLP-92] You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks INTERSPEECH2025

【速读】: 该论文旨在解决语音匿名化系统在隐私保护中的评估问题,具体而言,是评估语音匿名化技术在防止说话人身份识别方面的有效性。研究的关键在于将BERT语言模型适配为自动说话人验证(ASV)系统,以分析攻击者训练和评估数据集中同一说话人语言内容相似性的影响。实验结果表明,仅凭话语的文本内容,攻击即可取得平均35%的等错误率(EER),部分说话人的EER甚至低至2%,揭示了当前语音隐私数据集可能存在偏差,并提出需要重新构建数据集以实现更公平的隐私评估。

链接: https://arxiv.org/abs/2506.09521
作者: Ünal Ege Gaznepoglu,Anna Leschanowsky,Ahmad Aloradi,Prachi Singh,Daniel Tenbrinck,Emanuël A. P. Habets,Nils Peters
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 5 pages, 6 figures, 1 table, accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenge the reliance on global EER for privacy evaluations.
zh
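
【代码示意】: 论文以等错误率(EER)衡量基于文本内容的攻击效果。下面用NumPy给出EER的标准计算示意:对真实说话人(genuine)与冒认者(impostor)两组相似度分数扫描阈值,取误接受率与误拒绝率最接近处。分数分布为假设数据,打分器可以是任何文本相似度模型。

```python
import numpy as np

def compute_eer(genuine, impostor):
    """扫描阈值,返回误接受率(FAR)与误拒绝率(FRR)最接近时的均值。"""
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # 冒认者被误接受
        frr = np.mean(genuine < t)    # 真实说话人被误拒绝
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.10, 1000)   # 同一说话人的文本相似度(假设分布)
impostor = rng.normal(0.4, 0.15, 1000)  # 不同说话人的文本相似度(假设分布)
print(f"EER = {compute_eer(genuine, impostor):.3f}")
```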

计算机视觉

[CV-0] DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos

【速读】: 该论文旨在解决由单目视频驱动的动态场景3D重建问题,特别是如何在不依赖优化过程的情况下,实现对动态物体运动的高效、高质量重建。其解决方案的关键在于提出一种基于可变形3D高斯点云(Deformable 3D Gaussian Splats)的前馈方法,结合了增强的大规模合成数据集、像素级可变形3D高斯表示以及大规模Transformer网络,从而实现了实时且具有泛化能力的动态场景重建。

链接: https://arxiv.org/abs/2506.09997
作者: Chieh Hubert Lin,Zhaoyang Lv,Songyin Wu,Zhen Xu,Thu Nguyen-Phuoc,Hung-Yu Tseng,Julian Straub,Numair Khan,Lei Xiao,Ming-Hsuan Yang,Yuheng Ren,Richard Newcombe,Zhao Dong,Zhengqin Li
机构: Meta(元); UC Merced(加州大学默塞德分校); UC Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
zh

[CV-1] PlayerOne: Egocentric World Simulator

【速读】:该论文试图解决如何构建一个以第一视角(egocentric)为核心的现实世界模拟器,以实现对动态环境的沉浸式、无限制探索的问题。其关键解决方案是提出PlayerOne,该系统通过粗到细的训练流程,首先在大规模第一视角文本-视频对上进行预训练以获得初步的第一视角理解,随后在同步运动-视频数据上进行微调,结合自动构建的数据管道提升模型性能。此外,设计了部分解耦运动注入方案,实现对局部动作的精确控制,并引入联合重建框架,逐步建模4D场景和视频帧,确保长视频生成中的场景一致性。

链接: https://arxiv.org/abs/2506.09995
作者: Yuanpeng Tu,Hao Luo,Xi Chen,Xiang Bai,Fan Wang,Hengshuang Zhao
机构: HKU(香港大学); DAMO Academy, Alibaba Group(达摩院,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and worldconsistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.
zh

[CV-2] Text-Aware Image Restoration with Diffusion Models

【速读】:该论文试图解决现有基于扩散模型的图像修复方法在恢复退化图像中的文本区域时,容易生成看似合理但不准确的文本样态,即文本-图像幻觉(text-image hallucination)的问题。解决方案的关键在于提出一种新的修复任务——文本感知图像修复(Text-Aware Image Restoration, TAIR),并通过构建SA-Text基准数据集和设计多任务扩散框架TeReDiff,将扩散模型的内部特征与文本检测模块相结合,从而在联合训练中提取丰富的文本表示,并将其作为后续去噪步骤的提示,提升文本识别的准确性。

链接: https://arxiv.org/abs/2506.09993
作者: Jaewon Min,Jin Hyeon Kim,Paul Hyunbin Cho,Jaeeun Lee,Jihye Park,Minkyu Park,Sangpil Kim,Hyunhee Park,Seungryong Kim
机构: KAIST AI(KAIST人工智能); Korea University(韩国高丽大学); Yonsei University(延世大学); Samsung Electronics(三星电子)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: this https URL
zh

[CV-3] Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

【速读】:该论文试图解决传统视觉运动策略在生成完整动作轨迹时缺乏全局目标约束的问题,导致局部动作可能偏离任务目标。其解决方案的关键在于提出一种基于轨迹自回归建模的新型视觉运动策略——Chain-of-Action (CoA),通过显式的逆向推理过程(action-level Chain-of-Thought, CoT)生成整个动作轨迹,其中第一个动作令牌对应于编码任务特定目标的稳定关键帧动作,后续动作令牌则基于初始关键帧和先前预测动作进行自回归生成,从而实现从全局目标到局部动作的结构化约束。

链接: https://arxiv.org/abs/2506.09990
作者: Wenbo Zhang,Tianrun Hu,Yanyuan Qiao,Hanbo Zhang,Yuchu Qin,Yang Li,Jiajun Liu,Tao Kong,Lingqiao Liu,Xiao Ma
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict next step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA gives strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe CoA achieves the state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
zh

[CV-4] Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes CVPR2025

【速读】:该论文试图解决如何使3D场景重建具有交互性的问题,具体而言是探究是否能够预测人类双手与场景物理交互时产生的声音。解决方案的关键在于利用动作-声音对训练一个校正流模型(rectified flow model),该模型能够将3D手部轨迹映射到对应的音频,从而在测试阶段根据用户提供的手部姿态序列估计相应的声音。

链接: https://arxiv.org/abs/2506.09989
作者: Yiming Dou,Wonseok Oh,Yuqing Luo,Antonio Loquercio,Andrew Owens
机构: University of Michigan (密歇根大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, Project page: this https URL , Code: this https URL

点击查看摘要

Abstract:We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: this https URL
zh
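
【代码示意】: 论文用校正流(rectified flow)把3D手部轨迹映射到音频。下面给出校正流训练目标的通用PyTorch示意:在噪声x0与数据x1的直线插值上回归恒定速度场x1-x0;条件(手部姿态序列)此处省略,模型结构与维度均为占位假设。

```python
import torch
import torch.nn as nn

def rectified_flow_loss(model, x0, x1):
    """x_t = (1-t)*x0 + t*x1,目标速度为 x1 - x0,对其做MSE回归。"""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(torch.cat([xt, t], dim=-1))  # 简化:把t直接拼进输入
    return nn.functional.mse_loss(v_pred, x1 - x0)

# 玩具模型:输入(特征64维 + 时间1维),输出速度场64维
model = nn.Sequential(nn.Linear(65, 128), nn.ReLU(), nn.Linear(128, 64))
x0 = torch.randn(8, 64)  # 噪声
x1 = torch.randn(8, 64)  # "音频特征"目标(占位)
print(rectified_flow_loss(model, x0, x1).item())
```

推理时从x0出发,沿预测速度做若干步欧拉积分即可得到样本。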

[CV-5] EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

【速读】:该论文旨在解决文本引导图像编辑的质量评估与验证问题,特别是针对当前生成式 AI 在评估编辑准确性、伪影检测、视觉质量、场景融合、常识遵循以及编辑变化描述等方面存在的不足。解决方案的关键在于提出 EditInspector,这是一个基于人类标注的全面基准测试框架,用于系统评估文本引导图像编辑的效果,并通过该框架对最先进(SoTA)的视觉-语言模型进行性能分析,进而提出两种新颖方法以提升伪影检测和差异描述生成的效果。

链接: https://arxiv.org/abs/2506.09988
作者: Ron Yosef,Moran Yanuka,Yonatan Bitton,Dani Lischinski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.
zh

[CV-6] A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

【速读】:该论文试图解决现有视频语言模型时空理解与推理能力评估基准因依赖表面视觉或文本线索的捷径解而容易导致分数虚高的问题。其解决方案的关键是引入Minimal Video Pairs (MVP)基准,这是一个针对视频语言模型物理理解能力的简洁且对捷径解敏感的视频问答基准。MVP包含55K个高质量多选视频问答示例,所有样本均配有最小变化对——即视觉相似但答案相反的视频对,要求模型在两个示例中均给出正确答案,从而有效排除仅依赖视觉或文本偏见的模型的高分表现。

链接: https://arxiv.org/abs/2506.09987
作者: Benno Krojer,Mojtaba Komeili,Candace Ross,Quentin Garrido,Koustuv Sinha,Nicolas Ballas,Mahmoud Assran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark is comprised of 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair – a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2% compared to random performance at 25%.
zh
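
【代码示意】: MVP的打分规则是"最小变化对中的两个样本都答对才得分"。下面是这一配对指标的直接Python实现示意,变量命名为假设。

```python
from collections import defaultdict

def paired_accuracy(preds, golds, pair_ids):
    """同一minimal-change对内全部答对记1分,否则记0分。"""
    per_pair = defaultdict(list)
    for p, g, pid in zip(preds, golds, pair_ids):
        per_pair[pid].append(p == g)
    return sum(all(v) for v in per_pair.values()) / len(per_pair)

# 两对样本:第一对全对得分,第二对只对一半不得分
preds = ["A", "B", "C", "C"]
golds = ["A", "B", "C", "D"]
pairs = [0, 0, 1, 1]
print(paired_accuracy(preds, golds, pairs))  # 0.5
```

由于配对视频高度相似而答案相反,只依赖表面视觉或文本偏置的模型在该指标下会落到随机水平以下。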

[CV-7] V-JEPA 2: Self-Supervised Video Models Enable Understanding Prediction and Planning

【速读】:该论文试图解决现代人工智能在物理世界中理解、预测和规划的能力问题,特别是如何通过观察学习来实现这一目标。解决方案的关键在于采用一种自监督方法,结合互联网规模的视频数据与少量交互数据(机器人轨迹),构建能够处理物理世界任务的模型。具体而言,首先在大规模视频和图像数据集上预训练无动作的联合嵌入预测架构V-JEPA 2,随后通过与大型语言模型对齐,提升其在视频问答任务中的表现,并最终通过微调得到一个潜在动作条件的世界模型V-JEPA 2-AC,从而实现无需任务特定训练或奖励的机器人规划能力。

链接: https://arxiv.org/abs/2506.09985
作者: Mido Assran,Adrien Bardes,David Fan,Quentin Garrido,Russell Howes,Mojtaba Komeili,Matthew Muckley,Ammar Rizvi,Claire Roberts,Koustuv Sinha,Artem Zholus,Sergio Arnaud,Abha Gejji,Ada Martin,Francois Robert Hogan,Daniel Dugas,Piotr Bojanowski,Vasil Khalidov,Patrick Labatut,Francisco Massa,Marc Szafraniec,Kapil Krishnakumar,Yong Li,Xiaodong Ma,Sarath Chandar,Franziska Meier,Yann LeCun,Michael Rabbat,Nicolas Ballas
机构: Meta(元)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 48 pages, 19 figures

点击查看摘要

Abstract:A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
zh
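
【代码示意】: 下面用PyTorch勾勒JEPA式目标的最小形态:预测器在表征空间预测目标编码器(在线编码器的EMA副本,不回传梯度)的输出。编码器结构、损失形式与动量系数均为示意性假设,远非V-JEPA 2的完整设计。

```python
import torch
import torch.nn as nn

enc = nn.Linear(256, 128)        # 在线编码器(占位)
target_enc = nn.Linear(256, 128) # EMA目标编码器
target_enc.load_state_dict(enc.state_dict())
predictor = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

context, target = torch.randn(8, 256), torch.randn(8, 256)
with torch.no_grad():
    tgt_repr = target_enc(target)  # 目标表征不参与反向传播
loss = nn.functional.l1_loss(predictor(enc(context)), tgt_repr)
loss.backward()

m = 0.99  # 动量更新目标编码器
with torch.no_grad():
    for p, tp in zip(enc.parameters(), target_enc.parameters()):
        tp.mul_(m).add_(p, alpha=1 - m)
```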

[CV-8] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

【速读】:该论文试图解决多模态条件(如文本、图像和音频)下多主体人类动画生成中的精确且针对个体的控制问题,现有方法通常仅能处理单一主体并以全局方式注入条件,忽略了视频中多个概念共存及人与人、人与物体之间复杂交互的场景。解决方案的关键在于摒弃单一实体假设,引入一种新框架,通过强区域特异性绑定机制,将多模态条件与每个主体的时空轨迹进行精准匹配。该框架利用掩码预测器自动推断布局信息,并通过迭代方式将局部音频条件注入对应区域,从而实现高质量、可控的多概念人类中心视频生成。

链接: https://arxiv.org/abs/2506.09984
作者: Zhenzhi Wang,Jiaqi Yang,Jianwen Jiang,Chao Liang,Gaojie Lin,Zerong Zheng,Ceyuan Yang,Dahua Lin
机构: CUHK MMLab(香港中文大学多媒体实验室); ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: TL;DR: The first multi-person dialogue video generation method from pairs of reference image and audio via explicit layout-aligned condition injection. See project page this https URL for more details

点击查看摘要

Abstract:End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio, has achieved remarkable advancements in recent years. However, most existing methods could only animate a single subject and inject conditions in a global manner, ignoring scenarios where multiple concepts appear in the same video with rich human-human interactions and human-object interactions. Such a global assumption prevents precise and per-identity control of multiple concepts including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method could automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of controllable multi-concept human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
zh

[CV-9] AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

【速读】: 该论文旨在解决高质量动态3D模型生成的难题,特别是在建模时空分布复杂性和4D训练数据稀缺性方面的挑战。其解决方案的关键在于提出AnimateAnyMesh框架,该框架采用一种新颖的DyMeshVAE架构,通过解耦空间与时间特征并在压缩潜在空间中利用基于修正流的训练策略,实现了对任意3D网格的高效文本驱动动画生成。

链接: https://arxiv.org/abs/2506.09982
作者: Zijie Wu,Chaohui Yu,Fan Wang,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
zh

[CV-10] ReSim: Reliable World Simulation for Autonomous Driving

【速读】:该论文试图解决在广泛自车驾驶行为下可靠模拟未来驾驶场景的问题,现有驾驶世界模型由于仅基于以安全专家轨迹为主的现实数据训练,难以复现危险或非专家行为,这限制了其在策略评估等任务中的应用。解决方案的关键在于通过驾驶模拟器(如CARLA)收集的多样化非专家数据丰富现实人类示范,并构建一个基于异构语料库的可控世界模型。该模型名为ReSim,采用具有扩散变换器架构的视频生成器,并设计多种策略以有效整合条件信号,提升预测的可控性和真实性。

链接: https://arxiv.org/abs/2506.09981
作者: Jiazhi Yang,Kashyap Chitta,Shenyuan Gao,Long Chen,Yuqian Shao,Xiaosong Jia,Hongyang Li,Andreas Geiger,Xiangyu Yue,Li Chen
机构: The Chinese University of Hong Kong (香港中文大学); The University of Hong Kong (香港大学); OpenDriveLab at Shanghai AI Lab (上海人工智能实验室开放驾驶实验室); NVIDIA Research (英伟达研究); Xiaomi EV (小米电动汽车); Shanghai Jiao Tong University (上海交通大学); University of Tübingen, Tübingen AI Center (图宾根大学,图宾根人工智能中心); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim’s simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
zh

[CV-11] Efficient Part-level 3D Object Generation via Dual Volume Packing

【速读】:该论文试图解决现有3D物体生成方法在部分级生成中的局限性,即大多数方法生成的物体为所有部件融合在一起的单一网格,这限制了对单个部件进行编辑或操作的能力。其关键解决方案是提出一种新的端到端框架,能够根据单张输入图像生成具有任意数量完整且语义明确部件的高质量3D物体。该方法引入了一种双体积打包策略,将所有部件组织到两个互补的体积中,从而实现完整且交错部件的生成,并最终组装成目标物体。

链接: https://arxiv.org/abs/2506.09980
作者: Jiaxiang Tang,Ruijie Lu,Zhaoshuo Li,Zekun Hao,Xuan Li,Fangyin Wei,Shuran Song,Gang Zeng,Ming-Yu Liu,Tsung-Yi Lin
机构: NVIDIA Research (NVIDIA 研究院); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL Project Page: this https URL

点击查看摘要

Abstract:Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
zh

[CV-12] Vectorized Region Based Brush Strokes for Artistic Rendering

【速读】:该论文试图解决如何生成符合艺术原则和创作意图的笔触组合问题,以弥补静态艺术作品与其创作过程之间的情感与教育差距。解决方案的关键在于提出一种图像到绘画的方法,该方法通过(i)在目标区域提供语义引导的笔触生成,(ii)计算笔触参数,并(iii)建立片段与笔触之间的顺序关系,从而实现最终绘画的逐步渲染。

链接: https://arxiv.org/abs/2506.09969
作者: Jeripothula Prudviraj,Vikram Jamwal
机构: TCS Research (TCS 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating a stroke-by-stroke evolution process of a visual artwork tries to bridge the emotional and educational gap between the finished static artwork and its creation process. Recent stroke-based painting systems focus on capturing stroke details by predicting and iteratively refining stroke parameters to maximize the similarity between the input image and the rendered output. However, these methods often struggle to produce stroke compositions that align with artistic principles and intent. To address this, we explore an image-to-painting method that (i) facilitates semantic guidance for brush strokes in targeted regions, (ii) computes the brush stroke parameters, and (iii) establishes a sequence among segments and strokes to sequentially render the final painting. Experimental results on various input image types, such as face images, paintings, and photographic images, show that our method aligns with a region-based painting strategy while rendering a painting with high fidelity and superior stroke quality.
zh

[CV-13] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

【速读】:该论文旨在解决大型视觉语言模型(LVLMs)在空间推理任务中面临的局限性,这些任务需要精确的几何理解和连续的空间跟踪能力,而现有方法由于依赖文本中心化的推理方式,难以有效处理此类问题。解决方案的关键在于提出一种新的“画图推理”范式,使LVLMs能够通过视觉空间中的基本绘图操作(如标注边界框和绘制辅助线)进行推理,从而直接通过视觉操作表达和分析空间关系,避免了传统工具集成方法中对专用感知工具的依赖。

链接: https://arxiv.org/abs/2506.09965
作者: Junfei Wu,Jian Guan,Kaituo Feng,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tieniu Tan
机构: Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking-capabilities that humans achieve through mental visualization and manipulation. To address the limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, meanwhile avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
zh
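
【代码示意】: 文中的"基本绘图操作"可以落到非常具体的图像操作上。下面用PIL示意其中两种:标注边界框与绘制辅助线;坐标与颜色均为演示用假设。

```python
from PIL import Image, ImageDraw

def annotate_bbox(img, box, label=""):
    """在图像上画出边界框并标注名称。"""
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline="red", width=3)
    if label:
        draw.text((box[0], max(box[1] - 12, 0)), label, fill="red")
    return img

def draw_auxiliary_line(img, p1, p2):
    """在两个位置之间画辅助线,显式表达空间关系。"""
    draw = ImageDraw.Draw(img)
    draw.line([p1, p2], fill="blue", width=2)
    return img

canvas = Image.new("RGB", (320, 240), "white")  # 占位图像
canvas = annotate_bbox(canvas, (40, 60, 140, 160), "object A")
canvas = draw_auxiliary_line(canvas, (90, 110), (260, 200))
canvas.save("reasoning_canvas.png")
```

模型每步输出这类操作的参数,执行后将更新的图像重新送回模型,形成"画图、观察、再推理"的循环。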

[CV-14] Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

【速读】: 该论文旨在解决医学视觉问答(MedVQA)领域中数据集临床复杂性和视觉多样性不足、进而限制模型在实际临床决策支持系统中性能的问题。其解决方案的关键在于引入Kvasir-VQA-x1数据集:该数据集新增159,549个旨在测试更深层次临床推理的问答对,这些问题由大语言模型系统性生成并按复杂度分层,以更好地评估模型的推理能力。此外,还引入了多种模拟常见成像伪影的视觉增强技术,以提升模型在真实临床场景下的鲁棒性。

链接: https://arxiv.org/abs/2506.09958
作者: Sushant Gautam,Michael A. Riegler,Pål Halvorsen
机构: Simula Metropolitan Center for Digital Engineering (SimulaMet), Norway; Oslo Metropolitan University (OsloMet), Norway; Simula Research Laboratory, Norway
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model’s inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
zh

[CV-15] Canonical Latent Representations in Conditional Diffusion Models

【速读】: 该论文试图解决条件扩散模型(Conditional Diffusion Models, CDMs)虽然在生成任务中表现出色,但其强大的建模能力会使类别定义特征与无关上下文相互纠缠,从而阻碍提取鲁棒且可解释表示的问题。解决方案的关键是提出一种称为Canonical LAtent Representations (CLAReps)的潜在表示:这些潜在代码在保留关键类别信息的同时丢弃非判别性信号,解码后可生成每个类别的代表性样本,提供简洁且可解释的核心类别语义总结。基于CLAReps,作者进一步开发了基于扩散的特征蒸馏范式CaDistill,使学生模型仅通过CLAReps即可获取核心类别知识,显著提升了对抗鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2506.09955
作者: Yitao Xu,Tong Zhang,Ehsan Pajouheshgar,Sabine Süsstrunk
机构: École polytechnique fédérale de Lausanne (洛桑联邦理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 45 pages,41 figures

点击查看摘要

Abstract:Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To this end, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amounts to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
zh

[CV-16] Vision Generalist Model: A Survey

【速读】:该论文旨在探讨如何将通用模型(generalist model)应用于计算机视觉任务,以应对视觉任务输入输出多样性高、难以统一表示的挑战。其解决方案的关键在于设计能够处理多种视觉任务的框架,并通过相关技术提升模型性能,同时分析不同领域间的关联性与潜在协同效应,为未来研究提供方向。

链接: https://arxiv.org/abs/2506.09954
作者: Ziyi Wang,Yongming Rao,Shuofeng Sun,Xinrun Liu,Yi Wei,Xumin Yu,Zuyan Liu,Yanbo Wang,Hongmin Liu,Jie Zhou,Jiwen Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:Recently, we have witnessed the great success of the generalist model in natural language processing. The generalist model is a general framework trained with massive data and is able to process various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into the realm of applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them as a unified representation. In this paper, we provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help the researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.
zh

[CV-17] UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting CVPR2025

【速读】: 该论文试图解决点云数据尺度多样性带来的统一表示学习技术难题,当前缺乏适用于物体级和场景级点云的统一3D模型及预训练方法。其解决方案的关键在于提出UniPre3D,这是首个可无缝应用于任意尺度点云和任意架构3D模型的统一预训练方法,通过预测高斯基元(Gaussian primitives)作为预训练任务,并利用可微分高斯点绘制生成图像,实现像素级监督和端到端优化,同时引入预训练图像模型的2D特征以引导模型关注几何结构。

链接: https://arxiv.org/abs/2506.09952
作者: Ziyi Wang,Yanran Zhang,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model’s focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at this https URL.
zh

[CV-18] CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models NEURIPS2025

【速读】:该论文试图解决视频问答(Video Question Answering, VQA)中模型对物理世界因果关系理解不足的问题。现有VQA基准数据集要么侧重于现实视频的表面感知理解,要么专注于通过仿真环境生成的狭窄物理推理问题,而缺乏对真实场景中因果关系的深入探究。为填补这一空白,作者提出了CausalVQA,这是一个包含五类问题(反事实、假设、预期、规划和描述)的基准数据集,旨在评估模型预测不同动作和事件可能结果的能力。解决方案的关键在于设计质量控制机制,防止模型依赖语言线索而非深度视觉理解来获取答案,从而推动模型在时空推理、物理原理理解和潜在替代方案分析方面取得进展。

链接: https://arxiv.org/abs/2506.09943
作者: Aaron Foss,Chloe Evans,Sasha Mitts,Koustuv Sinha,Ammar Rizvi,Justine T. Kao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 35 pages, 3 figures, Submitted to NeurIPS2025 benchmark track

点击查看摘要

Abstract:We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models’ understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models’ ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
zh

[CV-19] LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

【速读】:该论文旨在解决3D视觉语言(3D-VL)模型在能力与鲁棒性上落后于2D模型,难以达到通用模型标准的问题。其关键解决方案是提出LEO-VL,该模型基于压缩特征网格(Condensed Feature Grid, CFG),一种高效场景表示方法,能够在保持2D感知与3D空间结构关联的同时显著降低令牌开销,从而实现大规模3D-VL通用模型的训练。

链接: https://arxiv.org/abs/2506.09935
作者: Jiangyong Huang,Xiaojian Ma,Xiongkun Linghu,Yue Fan,Junchao He,Wenxin Tan,Qing Li,Song-Chun Zhu,Yixin Chen,Baoxiong Jia,Siyuan Huang
机构: Peking University (北京大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI); Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards 3D-VL generalist, for which we curate over 700k high-quality 3D-VL data spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.
zh

[CV-20] Fluoroscopic Shape and Pose Tracking of Catheters with Custom Radiopaque Markers

【速读】:该论文旨在解决在脑血管中操控可操纵和机器人导管时,如何准确感知导管形状和位姿的问题。当前,介入医生需依靠双平面透视图像进行手动重建和预测导管运动,存在较大的感知负担。论文提出的解决方案关键在于为导管配备定制的放射显影标记,这些标记的布局经过设计以最小化对标记跟踪不确定性的敏感性,从而实现双平面透视下导管形状和位姿的同时估计。该方法已在小于2mm外径的微导管上验证,表现出小于1mm的形状跟踪误差和低于40度的导管滚动误差。

链接: https://arxiv.org/abs/2506.09934
作者: Jared Lawson,Rohan Chitale,Nabil Simaan
机构: Vanderbilt University(范德比尔特大学); Vanderbilt University Medical Center(范德比尔特大学医学中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, accepted in Robotics and Automation Letters

点击查看摘要

Abstract:Safe navigation of steerable and robotic catheters in the cerebral vasculature requires awareness of the catheter's shape and pose. Currently, a significant perception burden is placed on interventionalists to mentally reconstruct and predict catheter motions from biplane fluoroscopy images. Efforts to track these catheters are limited to planar segmentation or bulky sensing instrumentation, which are incompatible with microcatheters used in neurointervention. In this work, a catheter is equipped with custom radiopaque markers arranged to enable simultaneous shape and pose estimation under biplane fluoroscopy. A design measure is proposed to guide the arrangement of these markers to minimize sensitivity to marker tracking uncertainty. This approach was deployed for microcatheters smaller than 2mm OD navigating phantom vasculature with shape tracking errors less than 1mm and catheter roll errors below 40 degrees. This work can enable steerable catheters to autonomously navigate under biplane imaging.
zh
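
【代码示意】: 双平面透视下由标记点像素坐标恢复3D位置,核心是两视角三角化。下面用NumPy给出标准DLT三角化的示意,并用一个可自检的合成例子验证;两个投影矩阵为假设场景,并非论文的实际标定参数。

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """DLT三角化:由两视角投影矩阵与对应像素坐标最小二乘求3D点。"""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

X_true = np.array([10.0, -5.0, 80.0, 1.0])              # 已知3D标记点
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])           # 视角1:标准投影
R = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])
P2 = np.hstack([R, np.array([[0.0], [0.0], [100.0]])])  # 视角2:旋转+平移
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, uv1, uv2))                    # ≈ [10, -5, 80]
```

对多枚标记点分别三角化后,即可进一步拟合导管的形状曲线与滚转角。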

[CV-21] HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations

【速读】:该论文旨在解决扩散模型在资源受限设备上部署时面临的高内存和计算需求问题,以及现有后训练量化(Post-Training Quantization, PTQ)方法在处理异常值时的不足。其解决方案的关键在于提出HadaNorm,这是一种新颖的线性变换方法,通过在应用Hadamard变换前对激活特征通道进行归一化,有效缓解异常值问题,从而实现更激进的激活量化,显著降低量化误差并提升效率与性能的平衡。

链接: https://arxiv.org/abs/2506.09932
作者: Marco Federici,Riccardo Del Chiaro,Boris van Breugel,Paul Whatmough,Markus Nagel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 Pages, 5 Figures

点击查看摘要

Abstract:Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activations feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.
zh
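
【代码示意】: 下面用NumPy示意HadaNorm的处理顺序:先对各特征通道去均值,再做Hadamard(Walsh-Hadamard)变换摊平离群值,最后做对称int8量化。逐张量的缩放方式与不分组处理均为简化假设。

```python
import numpy as np

def fwht(x):
    """沿最后一维的快速Walsh-Hadamard变换(长度需为2的幂);
    除以sqrt(n)后为正交归一变换,因而自身即是逆变换。"""
    x = x.copy().astype(np.float64)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def hadanorm_quantize(acts, n_bits=8):
    mu = acts.mean(axis=0, keepdims=True)            # 1) 通道去均值
    rotated = fwht(acts - mu)                        # 2) Hadamard变换摊平离群值
    scale = np.abs(rotated).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(rotated / scale).astype(np.int8)    # 3) 对称int8量化
    return q, scale, mu

acts = np.random.randn(64, 128)
acts[:, 7] += 30.0                                   # 人为制造一个离群通道
q, scale, mu = hadanorm_quantize(acts)
recon = fwht(q.astype(np.float64) * scale) + mu      # 利用变换自逆性反量化重建
print(np.abs(recon - acts).mean())                   # 量化重建误差
```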

[CV-22] From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models

【速读】:该论文试图解决当前Vision-Language-Action (VLA)模型在机器人领域评估不足的问题,特别是传统模仿学习基准不适用于缺乏语言指令的场景,而现有VLA基准任务有限且未能深入探究视觉-语言模型(VLM)预训练对下游机器人策略泛化能力的实际贡献。解决方案的关键在于引入一个统一的探针套件,包含50个基于仿真的任务,覆盖10个子类别,涵盖语言指令、视觉和物体等方面,以系统评估先进VLA架构的泛化能力,并揭示感知到动作的差距。

链接: https://arxiv.org/abs/2506.09930
作者: Irving Fang,Juexiao Zhang,Shengbang Tong,Chen Feng
机构: New York University (纽约大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, “generalist” robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM’s generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
zh

[CV-23] Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)聚类中由于现有图神经网络(GNN)未能充分挖掘输入HSI的光谱信息,以及超像素拓扑图不准确导致类别语义混淆的问题。其解决方案的关键在于提出一种结构-光谱图卷积算子(structural-spectral graph convolutional operator, SSGCO),通过联合提取空间和光谱特征来提升超像素的表示质量,并引入一种证据引导的自适应边学习(evidence-guided adaptive edge learning, EGAEL)模块,以自适应地预测和优化超像素拓扑图中的边权重。

链接: https://arxiv.org/abs/2506.09920
作者: Jianhan Qi,Yuheng Jia,Hui Liu,Junhui Hou
机构: Southeast University (东南大学); Saint Francis University (圣弗朗西斯科大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at this https URL.
zh
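
【代码示意】: 作为背景,下面用PyTorch给出标准GCN传播步骤 Â·X(其中 Â = D^(-1/2)(A+I)D^(-1/2))在超像素图上的示意;论文的SSGCO在此基础上还沿光谱维联合提取特征,并由EGAEL自适应修正A中的边权,这里从略。

```python
import torch

def gcn_propagate(X, A):
    """对称归一化的GCN聚合:D^(-1/2)(A+I)D^(-1/2) X。"""
    A_hat = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    D = torch.diag(d_inv_sqrt)
    return D @ A_hat @ D @ X

# 5个超像素、每个8维特征的玩具拓扑图
X = torch.randn(5, 8)
A = torch.tensor([[0, 1, 0, 0, 1],
                  [1, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [1, 0, 0, 1, 0]], dtype=torch.float32)
print(gcn_propagate(X, A).shape)  # torch.Size([5, 8])
```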

[CV-24] MetricHMR: Metric Human Mesh Recovery from Monocular Images

【速读】:该论文试图解决单目图像中人体网格重建(Human Mesh Recovery, HMR)的尺度和深度模糊问题,旨在获得几何合理的身体形状和全局平移。解决方案的关键在于引入基于标准透视投影模型的射线图(ray map),以联合编码边界框信息、相机参数和几何线索,从而实现端到端的度量尺度HMR,而无需任何额外的度量正则化模块。

链接: https://arxiv.org/abs/2506.09919
作者: He Zhang,Chentao Song,Hongwen Zhang,Tao Yu
机构: Tsinghua University (清华大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce MetricHMR (Metric Human Mesh Recovery), an approach for metric human mesh recovery with accurate global translation from monocular images. In contrast to existing HMR methods that suffer from severe scale and depth ambiguity, MetricHMR is able to produce geometrically reasonable body shape and global translation in the reconstruction results. To this end, we first systematically analyze previous HMR methods on camera models to emphasize the critical role of the standard perspective projection model in enabling metric-scale HMR. We then validate the acceptable ambiguity range of metric HMR under the standard perspective projection model. Finally, we contribute a novel approach that introduces a ray map based on the standard perspective projection to jointly encode bounding-box information, camera parameters, and geometric cues for End2End metric HMR without any additional metric-regularization modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance, even compared with sequential HMR methods, in metric pose, shape, and global translation estimation across both indoor and in-the-wild scenarios.
zh
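
【代码示意】: 文中的"射线图"可由标准透视投影的内参直接构造:每个像素(u, v)对应方向 K^(-1)[u, v, 1]^T。下面用NumPy示意其计算;内参数值为假设。

```python
import numpy as np

def ray_map(K, H, W):
    """返回形状(H, W, 3)的逐像素单位视线方向,可与图像特征拼接编码。"""
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T  # 等价于逐像素计算 K^(-1) [u, v, 1]^T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]])     # 假设的相机内参
rm = ray_map(K, 256, 256)
print(rm.shape, rm[128, 128])       # 中心像素射线 ≈ [0, 0, 1]
```

按摘要所述,射线图联合编码了边界框信息、相机参数与几何线索,HMR网络可据此缓解尺度与深度歧义。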

[CV-25] Only-Style: Stylistic Consistency in Image Generation without Content Leakage

【速读】:该论文旨在解决在生成图像时保持参考图像的视觉风格一致性的问题,特别是现有方法难以有效分离语义内容与风格元素,导致参考图像中的内容泄露到生成目标图像中。解决方案的关键在于提出Only-Style方法,该方法通过在推理过程中定位内容泄露,并自适应调整控制风格对齐过程的参数,特别是在参考图像中包含主体的图像块内进行优化,从而在保持风格一致性的同时有效消除内容泄露。此外,内容泄露的定位可作为独立组件使用,以自适应调整任何特定方法的参数,进一步增强对风格参考影响的控制能力。

链接: https://arxiv.org/abs/2506.09916
作者: Tilemachos Aravanis(1),Panagiotis Filntisis(2 and 3),Petros Maragos(1 and 2 and 3),George Retsinas(2 and 3) ((1) School of Electrical & Computer Engineering, National Technical University of Athens, Greece, (2) Robotics Institute, Athena Research Center, Maroussi, Greece, (3) HERON - Center of Excellence in Robotics, Athens, Greece)
机构: School of Electrical & Computer Engineering, National Technical University of Athens(电气与计算机工程学院,雅典国立技术大学); Robotics Institute, Athena Research Center(机器人研究所,阿提卡研究中心); HERON - Center of Excellence in Robotics(HERON-机器人卓越中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating images in a consistent reference visual style remains a challenging computer vision task. State-of-the-art methods aiming for style-consistent generation struggle to effectively separate semantic content from stylistic elements, leading to content leakage from the image provided as a reference to the targets. To address this challenge, we propose Only-Style: a method designed to mitigate content leakage in a semantically coherent manner while preserving stylistic consistency. Only-Style works by localizing content leakage during inference, allowing the adaptive tuning of a parameter that controls the style alignment process, specifically within the image patches containing the subject in the reference image. This adaptive process best balances stylistic consistency with leakage elimination. Moreover, the localization of content leakage can function as a standalone component, given a reference-target image pair, allowing the adaptive tuning of any method-specific parameter that provides control over the impact of the stylistic reference. In addition, we propose a novel evaluation framework to quantify the success of style-consistent generations in avoiding undesired content leakage. Our approach demonstrates a significant improvement over state-of-the-art methods through extensive evaluation across diverse instances, consistently achieving robust stylistic consistency without undesired content leakage.
zh
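
作为示意,下面用 Python 勾勒摘要中"定位泄露、再自适应调参"这一思路的最简形式:对风格对齐强度做二分搜索,并在参考图主体区域检测内容泄露。其中 `generate_fn`、`leakage_fn`、`subject_mask` 等接口与搜索策略均为本文假设,并非论文实现。

```python
import torch

def adaptive_style_strength(generate_fn, leakage_fn, subject_mask,
                            s_min=0.0, s_max=1.0, n_steps=6, tol=0.1):
    """对风格对齐强度 s 做二分搜索:在主体区域检测到泄露则减小 s,
    否则增大 s,在"保持风格一致"与"消除内容泄露"之间取平衡。"""
    lo, hi = s_min, s_max
    for _ in range(n_steps):
        s = (lo + hi) / 2
        img = generate_fn(s)
        if leakage_fn(img, subject_mask) > tol:
            hi = s            # 仍有泄露,降低风格对齐强度
        else:
            lo = s            # 无泄露,可加大强度以保持风格
    return lo

# 玩具接口:假设泄露得分随 s 单调上升,仅用于演示搜索过程
gen = lambda s: torch.full((3, 64, 64), s)
leak = lambda img, mask: img.mean().item()
mask = torch.ones(64, 64, dtype=torch.bool)
print(adaptive_style_strength(gen, leak, mask))   # 约收敛到 tol 附近
```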

[CV-26] CEM-FBGTinyDet: Context-Enhanced Foreground Balance with Gradient Tuning for tiny Objects

【速读】:该论文旨在解决小目标检测(Tiny Object Detection, TOD)中特征金字塔网络(Feature Pyramid Network, FPN)存在的核心问题:在标准标签分配协议下,高阶特征(P5-P6)经常因未被分配正样本锚点而无法参与损失计算,导致其语义表示未被训练,从而形成双重缺陷:高阶特征因缺乏梯度更新而成为语义死胡同,低阶特征则缺乏必要的语义上下文以实现鲁棒分类。解决方案的关键在于提出E-FPN-BS架构,该架构通过Context Enhancement Module(CEM)和Foreground-Background Separation Module(FBSM)实现多尺度特征增强与自适应优化,同时引入Dynamic Gradient-Balanced Loss(DCLoss)以平衡不同尺度目标的梯度贡献。

链接: https://arxiv.org/abs/2506.09897
作者: Tao Liu,Zhenchao Cui
机构: Hebei University (河北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tiny object detection (TOD) reveals a fundamental flaw in feature pyramid networks: high-level features (P5-P6) frequently receive zero positive anchors under standard label assignment protocols, leaving their semantic representations untrained due to exclusion from loss computation. This creates dual deficiencies: (1) stranded high-level features become semantic dead-ends without gradient updates, while (2) low-level features lack essential semantic context for robust classification. To address these issues, we propose E-FPN-BS, a novel architecture integrating multi-scale feature enhancement and adaptive optimization that systematically converts wasted high-level semantics into low-level feature enhancements. First, our Context Enhancement Module (CEM) employs dual-branch processing to align and compress high-level features for effective global-local fusion. Second, the Foreground-Background Separation Module (FBSM) generates spatial gating masks that dynamically amplify discriminative regions. To address gradient imbalance across object scales, we further propose a Dynamic Gradient-Balanced Loss (DCLoss) that automatically modulates loss contributions via scale-aware gradient equilibrium. Extensive experiments across multiple benchmark datasets demonstrate the outstanding performance and generalization ability of our approach.
zh

[CV-27] EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks

【速读】:该论文旨在解决自监督表示学习中如何有效实现对变换的不变性和等变性的问题,特别是在传统视觉分类任务之外提升模型性能。其解决方案的关键在于引入EquiCaps(Equivariant Capsule Network),该方法基于胶囊网络的内在姿态感知能力,无需专门的预测器即可实现等变性,从而在姿态估计任务中提升性能。通过利用胶囊的固有特性,EquiCaps在复杂几何变换下的等变性表现更加稳健,验证了无预测器胶囊架构的泛化能力和潜力。

链接: https://arxiv.org/abs/2506.09895
作者: Athinoulla Konstantinou,Georgios Leontidis,Mamatha Thota,Aiden Durrant
机构: University of Aberdeen, UK(阿伯丁大学,英国); University of Lincoln, UK(林肯大学,英国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 11 Figures, 13 Tables

点击查看摘要

Abstract:Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on rotation prediction, achieving a supervised-level R^2 of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 R^2 , respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures.
zh

[CV-28] The Less You Depend The More You Learn: Synthesizing Novel Views from Sparse Unposed Images without Any 3D Knowledge

【速读】:该论文试图解决可泛化的新型视图合成(generalizable novel view synthesis, NVS)问题,即在不进行逐场景优化的情况下,从稀疏或甚至未标定的2D图像中生成逼真的新视角。其解决方案的关键在于减少对3D先验知识(3D inductive bias)和已知相机姿态的依赖,通过消除这些3D知识,方法能够充分利用数据规模,直接从稀疏的2D图像中学习隐式的3D感知,从而实现与依赖3D先验方法相当的性能。

链接: https://arxiv.org/abs/2506.09885
作者: Haoru Wang,Kai Ye,Yangyan Li,Wenzheng Chen,Baoquan Chen
机构: Peking University (北京大学); AMAP (AMAP)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We consider the problem of generalizable novel view synthesis (NVS), which aims to generate photorealistic novel views from sparse or even unposed 2D images without per-scene optimization. This task remains fundamentally challenging, as it requires inferring 3D structure from incomplete and ambiguous 2D observations. Early approaches typically rely on strong 3D knowledge, including architectural 3D inductive biases (e.g., embedding explicit 3D representations, such as NeRF or 3DGS, into network design) and ground-truth camera poses for both input and target views. While recent efforts have sought to reduce the 3D inductive bias or the dependence on known camera poses of input views, critical questions regarding the role of 3D knowledge and the necessity of circumventing its use remain under-explored. In this work, we conduct a systematic analysis on the 3D knowledge and uncover a critical trend: the performance of methods that requires less 3D knowledge accelerates more as data scales, eventually achieving performance on par with their 3D knowledge-driven counterparts, which highlights the increasing importance of reducing dependence on 3D knowledge in the era of large-scale data. Motivated by and following this trend, we propose a novel NVS framework that minimizes 3D inductive bias and pose dependence for both input and target views. By eliminating this 3D knowledge, our method fully leverages data scaling and learns implicit 3D awareness directly from sparse 2D images, without any 3D inductive bias or pose annotation during training. Extensive experiments demonstrate that our model generates photorealistic and 3D-consistent novel views, achieving even comparable performance with methods that rely on posed inputs, thereby validating the feasibility and effectiveness of our data-centric paradigm. Project page: this https URL .
zh

[CV-29] 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在理解三维空间结构方面的局限性。尽管VLMs在多种视觉和语言任务中表现出色,但它们对三维空间结构的理解仍然不足。解决方案的关键在于提出一种轻量级、无需标注的微调框架——几何蒸馏(Geometric Distillation),通过从现成的三维基础模型中注入人类启发的几何线索(如稀疏对应关系、相对深度关系和密集成本体积),在不修改模型架构的前提下,使预训练的VLMs具备几何感知能力,同时保持对自然图像-文本输入的兼容性。

链接: https://arxiv.org/abs/2506.09883
作者: Seonho Lee,Jiho Choi,Inha Kang,Jiwook Kim,Junsung Park,Hyunjung Shim
机构: KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
zh
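
摘要提到从 3D 基础模型蒸馏"相对深度关系"。以下是该思路的一个最小草图:对随机采样的像素对,用教师深度给出的序关系约束学生输出。假设教师深度图已由 MASt3R/VGGT 一类模型给出;采样方式与 hinge 形式的 ranking loss 均为本文假设,并非论文的具体损失。

```python
import torch
import torch.nn.functional as F

def relative_depth_loss(student_depth, teacher_depth, num_pairs=1024, margin=0.0):
    """相对深度关系蒸馏:对随机像素对 (i, j),
    要求学生深度服从教师给出的序关系(hinge 形式 ranking loss)。"""
    b = student_depth.shape[0]
    s = student_depth.reshape(b, -1)
    t = teacher_depth.reshape(b, -1)
    n = s.shape[1]
    idx_i = torch.randint(0, n, (b, num_pairs), device=s.device)
    idx_j = torch.randint(0, n, (b, num_pairs), device=s.device)
    sign = torch.sign(t.gather(1, idx_i) - t.gather(1, idx_j))  # 教师序关系
    diff = s.gather(1, idx_i) - s.gather(1, idx_j)
    return F.relu(margin - sign * diff).mean()

teacher = torch.rand(2, 32, 32)                       # 假设来自 3D 基础模型
student = torch.rand(2, 32, 32, requires_grad=True)   # 假设来自 VLM 的深度头
print(relative_depth_loss(student, teacher))
```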

[CV-30] Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

【速读】:该论文旨在解决开放词汇语义分割(Open-Vocabulary Semantic Segmentation, OVSS)与语义分割中的领域泛化(Domain Generalization in Semantic Segmentation, DGSS)之间的协同问题,提出一种统一的解决方案——开放词汇领域泛化语义分割(OV-DGSS)。OV-DGSS的目标是在未见类别上生成像素级分割掩码的同时,保持在未见领域中的鲁棒性。该论文提出的解决方案关键在于Vireo框架,其通过冻结的视觉基础模型(Visual Foundation Models, VFMs)结合深度视觉基础模型提取领域不变的结构特征,并引入三个核心组件:GeoText Prompts、Coarse Mask Prior Embedding(CMPE)以及Domain-Open-Vocabulary Vector Embedding Head(DOV-VEH),以增强跨模态对齐、优化梯度流动并融合结构与语义特征,从而实现高效的领域泛化与开放词汇识别。

链接: https://arxiv.org/abs/2506.09881
作者: Siyu Chen,Ting Han,Chengzheng Fu,Changshe Zhang,Chaolei Wang,Jinhe Su,Guorong Cai,Meiliu Wu
机构: Jimei University (集美大学); Sun Yat-sen University (中山大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Xidian University (西安电子科技大学); University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves the state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at this https URL.
zh

[CV-31] IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

【速读】:该论文试图解决深度学习模型在理解直观物理(intuitive physics)方面存在的不足问题,特别是其在复杂场景中对宏观物体的四个核心物理原则——恒常性(Permanence)、不变性(Immutability)、时空连续性(Spatio-Temporal Continuity)和刚性(Solidity)的理解能力。解决方案的关键在于构建IntPhys 2视频基准测试,该基准基于预期违背框架,通过控制且多样的虚拟环境来评估模型区分可能与不可能事件的能力,从而揭示当前模型在直观物理理解方面的局限性。

链接: https://arxiv.org/abs/2506.09849
作者: Florian Bordes,Quentin Garrido,Justine T Kao,Adina Williams,Michael Rabbat,Emmanuel Dupoux
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.
zh

[CV-32] Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition

【速读】:该论文旨在解决手写文本识别中由于手写风格随时间变化和上下文依赖性导致的模型泛化能力不足的问题,特别是在不同历史时期或地区的手写字符集和频率分布发生变化时,基于广泛异构语料库训练的模型在特定子集上的表现会下降。解决方案的关键在于提出一种新的损失函数,该函数引入了预测文本字符频率分布与从训练数据中经验性得出的目标分布之间的Wasserstein距离,通过惩罚对预期分布的偏离,提升模型在时间与上下文相关的数据集内移中的准确性和鲁棒性。

链接: https://arxiv.org/abs/2506.09846
作者: Panagiotis Kaliosis,John Pavlopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 10 figures, Under Review

点击查看摘要

Abstract:Handwritten text recognition aims to convert visual input into machine-readable text, and it remains challenging due to the evolving and context-dependent nature of handwriting. Character sets change over time, and character frequency distributions shift across historical periods or regions, often causing models trained on broad, heterogeneous corpora to underperform on specific subsets. To tackle this, we propose a novel loss function that incorporates the Wasserstein distance between the character frequency distribution of the predicted text and a target distribution empirically derived from training data. By penalizing divergence from expected distributions, our approach enhances both accuracy and robustness under temporal and contextual intra-dataset shifts. Furthermore, we demonstrate that character distribution alignment can also improve existing models at inference time without requiring retraining by integrating it as a scoring function in a guided decoding scheme. Experimental results across multiple datasets and architectures confirm the effectiveness of our method in boosting generalization and performance. We open source our code at this https URL.
zh
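
论文的核心是把预测文本与训练集字符频率分布之间的 Wasserstein 距离作为损失项。在字符表按固定顺序排列的假设下,离散一维分布的 W1 距离等于两个 CDF 之差的 L1 范数;下面给出一个可微的示意实现(字符序与归一化方式为本文设定,并非论文代码):

```python
import torch

def char_freq_wasserstein(pred_probs, target_freq):
    """预测字符频率分布与目标分布间的一维 Wasserstein(W1)距离。
    pred_probs:  (T, V) 每个输出位置的字符概率(softmax 后)
    target_freq: (V,)   训练数据统计出的字符频率
    对固定字符顺序的离散分布,W1 等于两个 CDF 之差的 L1 范数。"""
    pred_freq = pred_probs.mean(dim=0)          # 期望意义下的字符频率
    cdf_diff = torch.cumsum(pred_freq - target_freq, dim=0)
    return cdf_diff.abs().sum()

V = 80                                          # 假设字符表大小
logits = torch.randn(120, V, requires_grad=True)
target = torch.softmax(torch.randn(V), dim=0)
loss = char_freq_wasserstein(torch.softmax(logits, dim=-1), target)
loss.backward()                                 # 可微,可直接并入训练目标
print(loss.item())
```

同样的函数也可在推理阶段作为引导解码的打分项使用,这正是摘要所述"无需重训练即可改进现有模型"的用法。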

[CV-33] OctoNav: Towards Generalist Embodied Navigation

【速读】:该论文旨在解决传统导航研究中任务与能力分散的问题,即不同导航任务(如ObjNav、ImgNav和VLN)在目标和模态上存在差异,导致数据集和方法独立设计,难以实现通用导航代理。其解决方案的关键在于构建一个大规模的基准平台OctoNav-Bench及相应的模型OctoNav-R1,该模型基于多模态大语言模型(MLLMs)并适配为视觉语言代理(VLA)类型,能够仅依赖2D视觉观察生成低级动作。此外,提出了一种混合训练范式(HTP),包含Action-/TBA-SFT、Nav-GRPO和Online RL三个阶段,通过TBA-CoT数据集增强模型的推理能力,从而实现"先思考后行动"的机制,提升模型的泛化导航能力。

链接: https://arxiv.org/abs/2506.09839
作者: Chen Gao,Liankai Jin,Xingyu Peng,Jiazhao Zhang,Yue Deng,Annan Li,He Wang,Si Liu
机构: Beihang University (北京航空航天大学); National University of Singapore (新加坡国立大学); Peking University (北京大学); Zhongguancun Academy (中关村研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 31 pages, 25 figures

点击查看摘要

Abstract:Embodied navigation stands as a foundation pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, where they differ in task objectives and modalities, so datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of multi-modal and multi-capability. To achieve this, we propose a large-scale benchmark and corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse in free-form with arbitrary modality and capability. Also, we construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it to a VLA-type model, which can produce low-level actions solely based on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL stages. Each stage contains specifically designed learning policies and rewards. Importantly, for TBA-SFT and Nav-GRPO designs, we are inspired by the OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. Thus, we aim to investigate how to achieve thinking-before-action in the embodied navigation field, to improve model’s reasoning ability toward generalists. Specifically, we propose TBA-SFT to utilize the TBA-CoT dataset to fine-tune the model as a cold-start phase and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
zh

[CV-34] DynaSplat: Dynamic-Static Gaussian Splatting with Hierarchical Motion Decomposition for Scene Reconstruction

【速读】:该论文旨在解决动态场景中复杂且不断变化环境的重建问题,现有方法在面对现实世界动态性时往往表现不佳。其解决方案的关键在于引入动态-静态分离和分层运动建模,通过结合形变偏移统计与2D运动流一致性对场景元素进行分类,从而精确聚焦于运动区域;同时采用分层运动建模策略,以捕捉全局变换与局部运动,实现对非刚性运动的准确处理,并结合物理基础的不透明度估计以确保视觉一致的重建结果。

链接: https://arxiv.org/abs/2506.09836
作者: Junli Deng,Ping Shi,Qipei Li,Jinyang Guo
机构: Communication University of China (中国传媒大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing intricate, ever-changing environments remains a central ambition in computer vision, yet existing solutions often crumble before the complexity of real-world dynamics. We present DynaSplat, an approach that extends Gaussian Splatting to dynamic scenes by integrating dynamic-static separation and hierarchical motion modeling. First, we classify scene elements as static or dynamic through a novel fusion of deformation offset statistics and 2D motion flow consistency, refining our spatial representation to focus precisely where motion matters. We then introduce a hierarchical motion modeling strategy that captures both coarse global transformations and fine-grained local movements, enabling accurate handling of intricate, non-rigid motions. Finally, we integrate physically-based opacity estimation to ensure visually coherent reconstructions, even under challenging occlusions and perspective shifts. Extensive experiments on challenging datasets reveal that DynaSplat not only surpasses state-of-the-art alternatives in accuracy and realism but also provides a more intuitive, compact, and efficient route to dynamic scene reconstruction.
zh
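
摘要中"形变偏移统计 + 2D 运动流一致性"的动态-静态分离可以用很少的代码示意。以下草图中的阈值、打分与加权方式均为本文假设,仅说明判别流程:

```python
import numpy as np

def classify_dynamic_static(offsets, flow_consistency, tau=0.5):
    """动态-静态分离示意。
    offsets: (N, T, 3) 每个高斯在 T 个时间步的形变偏移
    flow_consistency: (N,) 投影运动与 2D 光流的一致性得分(0~1)
    用偏移幅度的时间标准差与一致性得分简单加权后再做阈值判别。"""
    offset_mag = np.linalg.norm(offsets, axis=-1)   # (N, T)
    motion = offset_mag.std(axis=1)
    motion = motion / (motion.max() + 1e-8)         # 归一化到 0~1
    fused = 0.5 * motion + 0.5 * flow_consistency
    return fused > tau                              # True 表示动态高斯

N, T = 1000, 24
offsets = np.random.randn(N, T, 3) * 0.01
offsets[:100] += np.random.randn(100, T, 3) * 0.2   # 前 100 个模拟运动元素
flow = np.random.rand(N)
print("判为动态的高斯数:", int(classify_dynamic_static(offsets, flow).sum()))
```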

[CV-35] MMME: A Spontaneous Multi-Modal Micro-Expression Dataset Enabling Visual-Physiological Fusion

【速读】:该论文试图解决现有微表情(Micro-expressions, MEs)研究仅依赖单一视觉模态,忽视其他生理信号(Physiological Signals, PS)中蕴含的丰富情感信息,导致微表情识别与检测性能远低于实际应用需求的问题。其解决方案的关键在于探索微表情视觉特征与生理信号之间的跨模态关联机制,并构建多模态融合框架。为此,本文提出了一个全新的多模态微表情数据集MMME,首次实现了面部动作信号(MEs)、中枢神经系统信号(EEG)以及外周生理信号(PPG、RSP、SKT、EDA和ECG)的同步采集,为研究微表情的神经机制和多模态融合分析提供了坚实基础。

链接: https://arxiv.org/abs/2506.09834
作者: Chuang Ma,Yu Pei,Jianhang Zhang,Shaokai Zhao,Bowen Ji,Liang Xie,Ye Yan,Erwei Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-expressions (MEs) are subtle, fleeting nonverbal cues that reveal an individual’s genuine emotional state. Their analysis has attracted considerable interest due to its promising applications in fields such as healthcare, criminal investigation, and human-computer interaction. However, existing ME research is limited to single visual modality, overlooking the rich emotional information conveyed by other physiological modalities, resulting in ME recognition and spotting performance far below practical application needs. Therefore, exploring the cross-modal association mechanism between ME visual features and physiological signals (PS), and developing a multimodal fusion framework, represents a pivotal step toward advancing ME analysis. This study introduces a novel ME dataset, MMME, which, for the first time, enables synchronized collection of facial action signals (MEs), central nervous system signals (EEG), and peripheral PS (PPG, RSP, SKT, EDA, and ECG). By overcoming the constraints of existing ME corpora, MMME comprises 634 MEs, 2,841 macro-expressions (MaEs), and 2,890 trials of synchronized multimodal PS, establishing a robust foundation for investigating ME neural mechanisms and conducting multimodal fusion-based analyses. Extensive experiments validate the dataset’s reliability and provide benchmarks for ME analysis, demonstrating that integrating MEs with PS significantly enhances recognition and spotting performance. To the best of our knowledge, MMME is the most comprehensive ME dataset to date in terms of modality diversity. It provides critical data support for exploring the neural mechanisms of MEs and uncovering the visual-physiological synergistic effects, driving a paradigm shift in ME research from single-modality visual analysis to multimodal fusion. The dataset will be publicly available upon acceptance of this paper.
zh

[CV-36] DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision

【速读】:该论文旨在解决现有文本到3D生成方法在生成与人类偏好对齐的3D资产方面存在的不足,特别是由于依赖难以收集的多视角2D图像配对数据训练2D奖励模型所导致的几何伪影问题。其解决方案的关键在于构建了首个大规模无配对的3D-MeshPref偏好数据集,并开发了RewardCS,这是一个直接基于该数据集进行训练的奖励模型,采用新颖的Cauchy-Schwarz散度目标函数,从而无需配对比较即可有效学习人类对3D几何偏好的对齐。

链接: https://arxiv.org/abs/2506.09814
作者: Xiandong Zou,Ruihao Xia,Hongsong Wang,Pan Zhou
机构: Singapore Management University (新加坡管理大学); East China University of Science and Technology (华东理工大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While text-to-3D generation has attracted growing interest, existing methods often struggle to produce 3D assets that align well with human preferences. Current preference alignment techniques for 3D content typically rely on hardly-collected preference-paired multi-view 2D images to train 2D reward models, when then guide 3D generation – leading to geometric artifacts due to their inherent 2D bias. To address these limitations, we construct 3D-MeshPref, the first large-scale unpaired 3D preference dataset, featuring diverse 3D meshes annotated by a large language model and refined by human evaluators. We then develop RewardCS, the first reward model trained directly on unpaired 3D-MeshPref data using a novel Cauchy-Schwarz divergence objective, enabling effective learning of human-aligned 3D geometric preferences without requiring paired comparisons. Building on this, we propose DreamCS, a unified framework that integrates RewardCS into text-to-3D pipelines – enhancing both implicit and explicit 3D generation with human preference feedback. Extensive experiments show DreamCS outperforms prior methods, producing 3D assets that are both geometrically faithful and human-preferred. Code and models will be released publicly.
zh
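
论文的 RewardCS 用 Cauchy-Schwarz 散度在无配对数据上训练奖励模型。CS 散度在核密度估计下有闭式的样本估计量:D_CS = -log(<p,q>^2 / (<p,p><q,q>));下面用高斯核给出一个最小实现(特征含义与 sigma 取值为本文假设,仅演示散度本身):

```python
import torch

def gaussian_gram(a, b, sigma=1.0):
    """两组样本间的高斯核 Gram 矩阵。"""
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def cauchy_schwarz_divergence(x, y, sigma=1.0):
    """基于核密度估计的 Cauchy-Schwarz 散度:
    D_CS = -log(<p,q>^2 / (<p,p><q,q>)),非负,p=q 时为 0。"""
    pq = gaussian_gram(x, y, sigma).mean()
    pp = gaussian_gram(x, x, sigma).mean()
    qq = gaussian_gram(y, y, sigma).mean()
    return -torch.log(pq.pow(2) / (pp * qq) + 1e-12)

x = torch.randn(256, 16)            # 假设:人类偏好网格的几何特征
y = torch.randn(256, 16) + 0.5      # 假设:当前生成网格的几何特征
print(cauchy_schwarz_divergence(x, y))
```

该估计量无需样本两两配对,这正是它适合无配对偏好数据的原因。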

[CV-37] Accurate and efficient zero-shot 6D pose estimation with frozen foundation models

【速读】:该论文旨在解决从RGBD数据中准确估计未见过物体的6D位姿问题,这一问题在计算机视觉领域具有重要应用价值,尤其是在机器人和增强现实领域。传统方法通常依赖于针对特定任务的大量合成数据进行训练,这需要耗费大量计算资源。本文提出FreeZeV2,其关键在于采用无需训练的方法,通过利用在无关数据上预训练的几何和视觉基础模型,实现对未见物体的强大泛化能力。FreeZeV2通过三个核心贡献提升了精度和效率:稀疏特征提取策略、基于特征的评分机制以及模块化设计,从而在BOP基准测试中取得了新的最先进性能。

链接: https://arxiv.org/abs/2506.09784
作者: Andrea Caraffa,Davide Boscaini,Fabio Poiesi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task-specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training-free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre-trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference-time computation without sacrificing accuracy; (ii) a feature-aware scoring mechanism that improves both pose selection during RANSAC-based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation masks errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state-of-the-art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.
zh

[CV-38] Q-SAM2: Accurate Quantization for Segment Anything Model 2

【速读】:该论文旨在解决Segment Anything Model 2 (SAM2)在资源受限场景下的计算和内存消耗过高的问题。其关键解决方案是提出一种高精度的低比特量化方法Q-SAM2,通过引入线性层校准方法和量化感知训练(QAT)流程,有效缓解量化过程中权重和激活分布奇异导致的性能下降问题,从而在保持高精度的同时显著提升效率。

链接: https://arxiv.org/abs/2506.09782
作者: Nicola Farronato,Florian Scheidegger,Mattia Rigotti,Cristiano Malossi,Michele Magno,Haotong Qin
机构: IBM Research - Zurich (IBM 研究院-苏黎世); ETH Zurich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its expensive computational and memory consumption poses a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
zh
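
论文的线性层校准通过最小化 Frobenius 范数误差来重定位权重分布。以下草图把它简化为对权重本身做裁剪尺度的网格搜索(论文实际在小批量图像上校准输出误差,此处的简化与候选网格均为本文假设):

```python
import torch

def quantize(w, scale, bits=2):
    """对称均匀量化:round 后裁剪到低比特整数范围,再反量化回浮点。"""
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

def calibrate_scale(w, bits=2, grid=100):
    """在候选裁剪比例中搜索使 ||W - Q(W)||_F 最小的量化尺度。"""
    best_scale, best_err = None, float("inf")
    w_max = w.abs().max()
    for r in torch.linspace(0.1, 1.0, grid):
        scale = r * w_max / (2 ** (bits - 1) - 1 + 1e-8)
        err = torch.norm(w - quantize(w, scale, bits), p="fro")
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

w = torch.randn(256, 256)
scale, err = calibrate_scale(w, bits=2)
print(f"最优 scale={scale.item():.4f}, Frobenius 误差={err.item():.2f}")
```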

[CV-39] Inverting Black-Box Face Recognition Systems via Zero-Order Optimization in Eigenface Space

【速读】:该论文试图解决从黑盒识别模型中重建面部图像的问题,这一过程对隐私构成重大威胁。其解决方案的关键在于提出了一种名为DarkerBB的新方法,该方法通过在PCA衍生的特征脸(eigenface)空间内进行零阶优化,仅利用相似度分数实现彩色面部图像的重建。尽管信息受限,实验结果表明DarkerBB在仅使用相似度的场景下达到了最先进的验证准确率,并具有良好的查询效率。

链接: https://arxiv.org/abs/2506.09777
作者: Anton Razzhigaev,Matvey Mikhalchuk,Klim Kireev,Igor Udovichenko,Andrey Kuznetsov,Aleksandr Petiushko
机构: AIRI; Skoltech; MSU; Elea AI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing facial images from black-box recognition models poses a significant privacy threat. While many methods require access to embeddings, we address the more challenging scenario of model inversion using only similarity scores. This paper introduces DarkerBB, a novel approach that reconstructs color faces by performing zero-order optimization within a PCA-derived eigenface space. Despite this highly limited information, experiments on LFW, AgeDB-30, and CFP-FP benchmarks demonstrate that DarkerBB achieves state-of-the-art verification accuracies in the similarity-only setting, with competitive query efficiency.
zh
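
在特征脸子空间做零阶优化、仅凭相似度分数重建人脸,可以最简化为 (1+1) 式的随机搜索。以下玩具示例用"与隐藏目标的负欧氏距离"充当黑盒相似度接口;真实 DarkerBB 的优化器与查询预算更复杂,此处仅为假设性示意:

```python
import numpy as np

def zero_order_inversion(similarity_fn, eigenfaces, mean_face,
                         n_iters=500, sigma=0.1, seed=0):
    """在 PCA 特征脸子空间做 (1+1) 式随机搜索:
    每步对系数加高斯扰动,仅当黑盒相似度提升时接受新解。"""
    rng = np.random.default_rng(seed)
    z = np.zeros(eigenfaces.shape[0])
    best = similarity_fn(mean_face + z @ eigenfaces)
    for _ in range(n_iters):
        cand = z + sigma * rng.standard_normal(z.shape)
        score = similarity_fn(mean_face + cand @ eigenfaces)
        if score > best:
            z, best = cand, score
    return mean_face + z @ eigenfaces, best

# 玩具黑盒:用与"隐藏目标"的负欧氏距离充当相似度分数
D, K = 1024, 32
basis = np.linalg.qr(np.random.randn(D, K))[0].T   # (K, D) 正交基,假设的特征脸
target = np.random.randn(D)
sim = lambda img: -np.linalg.norm(img - target)
recon, score = zero_order_inversion(sim, basis, np.zeros(D))
print("最终相似度:", score)
```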

[CV-40] Hierarchical Image Matching for UAV Absolute Visual Localization via Semantic and Structural Constraints

【速读】:该论文旨在解决在无全球导航卫星系统(GNSS)信号环境下,无人机(UAV)进行绝对定位(absolute localization)的难题。现有基于视觉的绝对定位方法因依赖传统低级图像匹配技术,难以应对跨源差异和时间变化带来的挑战。论文提出的解决方案关键在于引入一种分层跨源图像匹配方法,该方法结合了语义感知与结构约束的粗粒度匹配模块和轻量级细粒度匹配模块,通过区域级和像素级对应关系的建立,实现无需依赖相对定位技术的高精度绝对视觉定位。

链接: https://arxiv.org/abs/2506.09748
作者: Xiangkai Zhang,Xiang Zhou,Mao Chen,Yuchen Lu,Xu Yang,Zhiyong Liu
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); Guangxi University (广西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Absolute localization, aiming to determine an agent’s location with respect to a global reference, is crucial for unmanned aerial vehicles (UAVs) in various applications, but it becomes challenging when global navigation satellite system (GNSS) signals are unavailable. Vision-based absolute localization methods, which locate the current view of the UAV in a reference satellite map to estimate its position, have become popular in GNSS-denied scenarios. However, existing methods mostly rely on traditional and low-level image matching, suffering from difficulties due to significant differences introduced by cross-source discrepancies and temporal variations. To overcome these limitations, in this paper, we introduce a hierarchical cross-source image matching method designed for UAV absolute localization, which integrates a semantic-aware and structure-constrained coarse matching module with a lightweight fine-grained matching module. Specifically, in the coarse matching module, semantic features derived from a vision foundation model first establish region-level correspondences under semantic and structural constraints. Then, the fine-grained matching module is applied to extract fine features and establish pixel-level correspondences. Building upon this, a UAV absolute visual localization pipeline is constructed without any reliance on relative localization techniques, mainly by employing an image retrieval module before the proposed hierarchical image matching modules. Experimental evaluations on public benchmark datasets and a newly introduced CS-UAV dataset demonstrate superior accuracy and robustness of the proposed method under various challenging conditions, confirming its effectiveness.
zh

[CV-41] Class Similarity-Based Multimodal Classification under Heterogeneous Category Sets

【速读】:该论文试图解决多模态数据中类别分布不一致导致模型难以有效利用跨模态信息进行所有类别识别的问题(Multi-Modal Heterogeneous Category-set Learning, MMHCL)。其解决方案的关键在于提出一种基于类别相似性的跨模态融合模型(Class Similarity-based Cross-modal Fusion, CSCF),该模型通过将模态特定特征对齐到共享语义空间以实现已见与未见类别的知识迁移,并通过不确定性估计选择最具判别性的模态进行决策融合,最终依据类别相似性整合跨模态信息,由辅助模态优化主导模态的预测。

链接: https://arxiv.org/abs/2506.09745
作者: Yangrui Zhu,Junhua Bao,Yipan Wei,Yapeng Li,Bo Du
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing multimodal methods typically assume that different modalities share the same category set. However, in real-world applications, the category distributions in multimodal data exhibit inconsistencies, which can hinder the model’s ability to effectively utilize cross-modal information for recognizing all categories. In this work, we propose the practical setting termed Multi-Modal Heterogeneous Category-set Learning (MMHCL), where models are trained in heterogeneous category sets of multi-modal data and aim to recognize complete classes set of all modalities during test. To effectively address this task, we propose a Class Similarity-based Cross-modal Fusion model (CSCF). Specifically, CSCF aligns modality-specific features to a shared semantic space to enable knowledge transfer between seen and unseen classes. It then selects the most discriminative modality for decision fusion through uncertainty estimation. Finally, it integrates cross-modal information based on class similarity, where the auxiliary modality refines the prediction of the dominant one. Experimental results show that our method significantly outperforms existing state-of-the-art (SOTA) approaches on multiple benchmark datasets, effectively addressing the MMHCL task.
zh

[CV-42] ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion Models

【速读】:该论文试图解决扩散模型中像素级图像与文本级语义对齐不足的问题(pixel-text misalignment),尤其是在小尺寸、遮挡或罕见物体类别图像中表现尤为明显。其解决方案的关键在于提出ELBO-T2IAlign方法,该方法基于似然的证据下界(ELBO)实现对扩散模型中像素-文本对齐的校准,具有无需训练、通用性强的特点,能够有效提升不同架构扩散模型的对齐性能。

链接: https://arxiv.org/abs/2506.09740
作者: Qin Zhou,Zhiyang Zhang,Jinglong Wang,Xiaobin Li,Jing Zhang,Qian Yu,Lu Sheng,Dong Xu
机构: Beihang University (北京航空航天大学); University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, which is not the case. In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic, eliminating the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used benchmark datasets on image segmentation and generation have verified the effectiveness of our proposed calibration approach.
zh
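
论文用似然的 ELBO 来为像素-文本对齐打分。扩散模型 ELBO 的主项正比于条件噪声预测误差的负值,下面给出按类别估计该分数的示意;余弦噪声日程、接口签名与玩具 eps 网络均为本文假设,并非论文实现:

```python
import torch

def elbo_alignment_scores(x0, class_embeds, eps_model, n_samples=8):
    """用扩散 ELBO 的主项为各类别文本打分:
    score(c) 正比于 -E_{t,eps} || eps - eps_theta(x_t, t, c) ||^2,
    分数越高说明该文本条件下图像似然越大。"""
    scores = []
    for c in class_embeds:
        err = 0.0
        for _ in range(n_samples):
            t = torch.rand(1)                        # t ~ U(0, 1)
            eps = torch.randn_like(x0)
            alpha = torch.cos(t * torch.pi / 2)      # 假设的余弦噪声日程
            x_t = alpha * x0 + (1 - alpha**2).sqrt() * eps
            err = err + (eps_model(x_t, t, c) - eps).pow(2).mean()
        scores.append(-err / n_samples)
    return torch.stack(scores)

# 随机函数代替真实的噪声预测网络 eps_theta,仅验证流程
dummy_eps = lambda x_t, t, c: x_t * 0 + c.mean()
x0 = torch.randn(3, 64, 64)                          # 一张图像(或其潜变量)
classes = torch.randn(5, 16)                         # 假设 5 个类别的文本嵌入
print(elbo_alignment_scores(x0, classes, dummy_eps))
```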

[CV-43] Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉处理方面的不足,即尽管它们能够生成准确的视觉描述,但在推理过程中未能有效整合视觉信息。解决方案的关键在于提出一种简单的视觉扰动框架,通过引入三种目标扰动策略——干扰物拼接、保持主导性的混合以及随机旋转——来增强模型的感知鲁棒性,而无需进行算法修改或额外训练数据。该方法可轻松集成到现有的后训练流程中,并在数学推理任务中表现出显著的性能提升。

链接: https://arxiv.org/abs/2506.09736
作者: Yuting Li,Lai Wei,Kaipeng Zheng,Jingyuan Huang,Linghe Kong,Lichao Sun,Weiran Huang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Zhongguancun Academy (中关村学院); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室,BIGAI); Lehigh University (利哈伊大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we interestingly find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate accurate visual descriptions but fail to effectively integrate them during reasoning. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications or additional training data. Our approach introduces three targeted perturbations: distractor concatenation, dominance-preserving mixup, and random rotation, that can be easily integrated into existing post-training pipelines including SFT, DPO, and GRPO. Through extensive experiments across multiple datasets, we demonstrate consistent improvements in mathematical reasoning performance, with gains comparable to those achieved through algorithmic changes. Additionally, we achieve competitive performance among open-source 7B RL-tuned models by training Qwen2.5-VL-7B with visual perturbation. Through comprehensive ablation studies, we analyze the effectiveness of different perturbation strategies, revealing that each perturbation type contributes uniquely to different aspects of visual reasoning. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning: better reasoning begins with better seeing. Our code is available at this https URL.
zh
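
摘要中的三种目标扰动都是数据层面的轻量操作,可直接插入 SFT/DPO/GRPO 数据管道。以下为其最小 PyTorch 示意(拼接方向、混合系数与旋转方式为本文假设):

```python
import torch

def distractor_concat(img, distractor):
    """干扰物拼接:把一张无关图像横向拼在原图旁。"""
    return torch.cat([img, distractor], dim=-1)

def dominance_preserving_mixup(img, noise_img, lam=0.8):
    """保持主导性的混合:lam 取较大值,保证原图内容仍占主导。"""
    assert lam > 0.5, "lam 需大于 0.5 以保持原图主导"
    return lam * img + (1 - lam) * noise_img

def random_rotation(img, max_k=3):
    """随机旋转:此处用 90 度的整数倍(rot90)作最简示意。"""
    k = int(torch.randint(1, max_k + 1, (1,)))
    return torch.rot90(img, k, dims=(-2, -1))

# 三种扰动可串联使用,作用于数据加载环节
img = torch.rand(3, 224, 224)
aug = random_rotation(dominance_preserving_mixup(img, torch.rand(3, 224, 224)))
aug = distractor_concat(aug, torch.rand(3, 224, 224))
print(aug.shape)   # torch.Size([3, 224, 448])
```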

[CV-44] MPFNet: A Multi-Prior Fusion Network with a Progressive Training Strategy for Micro-Expression Recognition

【速读】:该论文旨在解决微表情识别(Micro-expression Recognition, MER)中因微表情持续时间短、强度低而导致的识别难度大的问题。现有方法多依赖单一来源的先验知识,未能充分利用多源信息。论文提出的解决方案关键在于设计一种多先验融合网络(Multi-Prior Fusion Network, MPFNet),通过渐进式训练策略优化MER任务,并引入两种互补编码器——通用特征编码器(Generic Feature Encoder, GFE)和高级特征编码器(Advanced Feature Encoder, AFE),基于改进的Inflated 3D ConvNets(I3D)结构结合坐标注意力机制,以增强模型对时空特征和通道特性的捕捉能力。此外,受发展心理学启发,提出了两种MPFNet变体——MPFNet-P与MPFNet-C,分别对应婴儿认知发展的并行与分层处理模式,用于评估不同先验知识整合策略的效果。

链接: https://arxiv.org/abs/2506.09735
作者: Chuang Ma,Shaokai Zhao,Dongdong Zhou,Yu Pei,Zhiguo Luo,Liang Xie,Ye Yan,Erwei Yin
机构: Defense Innovation Institute, Academy of Military Sciences (AMS); Intelligent Game and Decision Laboratory (智能博弈与决策实验室); School of Computer Science and Technology (计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-expression recognition (MER), a critical subfield of affective computing, presents greater challenges than macro-expression recognition due to its brief duration and low intensity. While incorporating prior knowledge has been shown to enhance MER performance, existing methods predominantly rely on simplistic, singular sources of prior knowledge, failing to fully exploit multi-source information. This paper introduces the Multi-Prior Fusion Network (MPFNet), leveraging a progressive training strategy to optimize MER tasks. We propose two complementary encoders: the Generic Feature Encoder (GFE) and the Advanced Feature Encoder (AFE), both based on Inflated 3D ConvNets (I3D) with Coordinate Attention (CA) mechanisms, to improve the model’s ability to capture spatiotemporal and channel-specific features. Inspired by developmental psychology, we present two variants of MPFNet, MPFNet-P and MPFNet-C, corresponding to two fundamental modes of infant cognitive development: parallel and hierarchical processing. These variants enable the evaluation of different strategies for integrating prior knowledge. Extensive experiments demonstrate that MPFNet significantly improves MER accuracy while maintaining balanced performance across categories, achieving accuracies of 0.811, 0.924, and 0.857 on the SMIC, CASME II, and SAMM datasets, respectively. To the best of our knowledge, our approach achieves state-of-the-art performance on the SMIC and SAMM datasets.
zh

[CV-45] AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale

【速读】:该论文试图解决长期稳定、准确的天气预测问题,特别是在超出数周的范围内实现可靠的自回归预测。传统方法依赖于非标准空间域(如球面谐波或HEALPix网格)来保证物理一致性和长期稳定性,而该论文挑战了这一假设。其解决方案的关键在于提出AtmosMJ模型,该模型直接在标准纬度-经度网格上运行,并通过一种新颖的门控残差融合(Gated Residual Fusion, GRF)机制实现稳定性,该机制通过自适应调节特征更新以防止误差累积。

链接: https://arxiv.org/abs/2506.09733
作者: Minjong Cheon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:The advent of Large Weather Models (LWMs) has marked a turning point in data-driven forecasting, with many models now outperforming traditional numerical systems in the medium range. However, achieving stable, long-range autoregressive forecasts beyond a few weeks remains a significant challenge. Prevailing state-of-the-art models that achieve year-long stability, such as SFNO and DLWP-HPX, have relied on transforming input data onto non-standard spatial domains like spherical harmonics or HEALPix meshes. This has led to the prevailing assumption that such representations are necessary to enforce physical consistency and long-term stability. This paper challenges that assumption by investigating whether comparable long-range performance can be achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep convolutional network that operates directly on ERA5 data without any spherical remapping. The model’s stability is enabled by a novel Gated Residual Fusion (GRF) mechanism, which adaptively moderates feature updates to prevent error accumulation over long recursive simulations. Our results demonstrate that AtmosMJ produces stable and physically plausible forecasts for about 500 days. In quantitative evaluations, it achieves competitive 10-day forecast accuracy against models like Pangu-Weather and GraphCast, all while requiring a remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest that efficient architectural design, rather than non-standard data representation, can be the key to unlocking stable and computationally efficient long-range weather prediction.
zh
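
门控残差融合(GRF)的核心是用自适应门调制残差更新幅度:out = x + g * u。以下模块结构是本文根据摘要所作的假设性草图,并非论文源码:

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """GRF 的极简示意:门 g 由当前特征与更新量共同决定,
    自适应调制更新幅度,以抑制长程自回归滚动中的误差累积。"""
    def __init__(self, channels):
        super().__init__()
        self.update = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        u = self.update(x)                       # 候选更新量
        g = self.gate(torch.cat([x, u], dim=1))  # 逐通道、逐位置的门
        return x + g * u

x = torch.randn(1, 64, 32, 64)   # (B, C, 纬度, 经度) 的网格特征
print(GatedResidualFusion(64)(x).shape)
```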

[CV-46] The Four Color Theorem for Cell Instance Segmentation ICML2025

【速读】:该论文旨在解决细胞实例分割中紧密接触细胞难以准确区分的问题,这是生物医学图像分析中的关键挑战。其解决方案的关键在于受四色定理启发的四色编码方案,将细胞视为国家、组织视为海洋,通过确保相邻实例获得不同标签来简化实例区分过程,从而将实例分割转化为具有四个预测类别的约束语义分割问题。

链接: https://arxiv.org/abs/2506.09724
作者: Ye Zhang,Yu Zhou,Yifeng Wang,Jun Xiao,Ziyue Wang,Yongbing Zhang,Jianxu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2025

点击查看摘要

Abstract:Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state-of-the-art performance. The code is available at this https URL.
zh
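
四色编码把实例分割化为 4 类语义分割:相邻实例取不同颜色,背景("海洋")单独成类。下面给出从实例标签图构建相邻图并回溯着色的最小实现(论文还需处理编码非唯一性带来的训练不稳定,此处从略):

```python
import numpy as np

def build_adjacency(labels):
    """从实例标签图构建相邻图:4-邻域内标签不同的实例互为邻居,0 为背景。"""
    adj = {}
    for dr, dc in [(0, 1), (1, 0)]:
        a = labels[: labels.shape[0] - dr, : labels.shape[1] - dc]
        b = labels[dr:, dc:]
        mask = (a != b) & (a != 0) & (b != 0)
        for u, v in zip(a[mask].ravel(), b[mask].ravel()):
            adj.setdefault(int(u), set()).add(int(v))
            adj.setdefault(int(v), set()).add(int(u))
    return adj

def four_color_encode(adj, n_colors=4):
    """回溯着色:给相邻实例分配 1..4 中互不相同的颜色。
    由四色定理,平面相邻图必 4-可着色,实例分割由此化为 4 类语义分割。"""
    nodes = sorted(adj)
    colors = {}

    def assign(i):
        if i == len(nodes):
            return True
        u = nodes[i]
        for c in range(1, n_colors + 1):
            if all(colors.get(v) != c for v in adj[u]):
                colors[u] = c
                if assign(i + 1):
                    return True
                del colors[u]
        return False

    assert assign(0), "无法 4-着色(相邻图非平面?)"
    return colors

labels = np.zeros((8, 8), dtype=int)
labels[1:4, 1:4] = 1; labels[1:4, 4:7] = 2
labels[4:7, 1:4] = 3; labels[4:7, 4:7] = 4
print(four_color_encode(build_adjacency(labels)))  # 相邻细胞颜色互不相同
```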

[CV-47] Non-Contact Health Monitoring During Daily Personal Care Routines

【速读】:该论文旨在解决远程光体积描记术(remote photoplethysmography, rPPG)在长期个人护理场景中的应用难题,特别是在高海拔环境下的镜面日常护理任务中,由于环境光照变化、手部动作频繁遮挡以及动态面部姿态导致的信号采集不稳定问题。其解决方案的关键在于构建了首个长期rPPG数据集LADH,该数据集包含21名参与者在五种常见个人护理场景下的240组同步RGB和红外(IR)面部视频,并提供了真实生理信号作为标注,同时通过融合RGB与IR视频输入以及采用多任务学习方法,显著提升了非接触式生理监测的准确性和鲁棒性。

链接: https://arxiv.org/abs/2506.09718
作者: Xulin Ma,Jiankai Tang,Zhang Jiang,Songqin Cheng,Yuanchun Shi,Dong LI,Xin Liu,Daniel McDuff,Xiaojing Liu,Yuntao Wang
机构: Qinghai University (青海大学); National Key Laboratory of Human Factors Engineering (国家人因工程重点实验室); Department of Computer Science and Technology, Tsinghua University (计算机科学与技术系,清华大学); Paul G. Allen School of Computer Science & Engineering, University of Washington (保罗·G·艾伦计算机科学与工程学院,华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at this https URL.
zh

[CV-48] raining-Free Voice Conversion with Factorized Optimal Transport INTERSPEECH2025

【速读】:该论文试图解决在少样本参考音频条件下,传统kNN-VC(k-Nearest Neighbors Voice Conversion)管道在跨语言语音转换(cross-lingual voice conversion)中内容保留和鲁棒性不足的问题。其解决方案的关键在于引入Factorized MKL-VC,该方法通过在WavLM嵌入子空间中使用因子化最优传输映射(factorized optimal transport map)替代kNN回归,从而有效处理维度间非均匀方差问题,实现高质量的任意语言间的语音转换。

链接: https://arxiv.org/abs/2506.09709
作者: Alexander Lobashev,Assel Yermekova,Maria Larchenko
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Interspeech 2025

点击查看摘要

Abstract:This paper introduces Factorized MKL-VC, a training-free modification for the kNN-VC pipeline. In contrast with the original pipeline, our algorithm performs high-quality any-to-any cross-lingual voice conversion with only 5 seconds of reference audio. MKL-VC replaces kNN regression with a factorized optimal transport map in WavLM embedding subspaces, derived from the Monge-Kantorovich Linear solution. Factorization addresses non-uniform variance across dimensions, ensuring effective feature transformation. Experiments on LibriSpeech and FLEURS datasets show MKL-VC significantly improves content preservation and robustness with short reference audio, outperforming kNN-VC. MKL-VC achieves performance comparable to FACodec, especially in the cross-lingual voice conversion domain.
zh
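
Monge-Kantorovich 线性解在高斯假设下有闭式:T(x) = mu_t + A(x - mu_s),其中 A = C_s^{-1/2}(C_s^{1/2} C_t C_s^{1/2})^{1/2} C_s^{-1/2}。下面按维度分块独立求解,以示意"因子化"的处理方式(子空间划分方式与正则项为本文假设,并非论文的具体划分):

```python
import numpy as np

def sym_sqrt(M, inv=False):
    """对称正定矩阵的(逆)平方根,用特征分解实现。"""
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 1e-8, None)
    d = 1 / np.sqrt(w) if inv else np.sqrt(w)
    return (V * d) @ V.T

def mk_linear_map(Xs, Xt):
    """高斯假设下的 Monge-Kantorovich 线性最优传输映射(闭式解)。"""
    mu_s, mu_t = Xs.mean(0), Xt.mean(0)
    Cs = np.cov(Xs.T) + 1e-6 * np.eye(Xs.shape[1])
    Ct = np.cov(Xt.T) + 1e-6 * np.eye(Xt.shape[1])
    Cs_h, Cs_ih = sym_sqrt(Cs), sym_sqrt(Cs, inv=True)
    A = Cs_ih @ sym_sqrt(Cs_h @ Ct @ Cs_h) @ Cs_ih
    return lambda x: mu_t + (x - mu_s) @ A.T

# "因子化":把(假设的)WavLM 嵌入按维度切块,各子空间独立求解线性 OT
src = np.random.randn(400, 64)            # 源说话人特征
tgt = np.random.randn(150, 64) * 2 + 1.0  # 目标说话人特征(仅数秒参考音频)
blocks = np.split(np.arange(64), 4)       # 4 个 16 维子空间,划分方式为假设
maps = [mk_linear_map(src[:, b], tgt[:, b]) for b in blocks]
converted = np.concatenate([m(src[:, b]) for m, b in zip(maps, blocks)], axis=1)
print(converted.shape)                    # (400, 64)
```

分块求解使每个子空间各自匹配均值与协方差,从而缓解摘要所述"维度间方差不均"的问题。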

[CV-49] CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings

【速读】:该论文旨在解决在真实工业环境中对复杂物体进行准确的6D位姿估计(6D pose estimation)问题,现有基准在评估方法时存在不足,因为大多数数据集聚焦于家庭环境中的日常物品,而工业数据集则局限于人工设置的场景。论文提出的解决方案是引入CHIP数据集,这是首个针对机器人手臂在真实工业环境下操作椅子的6D位姿估计设计的数据集,其关键在于包含多种RGBD传感技术采集的七种不同椅子,并引入了如细粒度差异的干扰物和由机械臂及操作员引起的严重遮挡等现实挑战,同时提供了基于机器人运动学自动生成的77,811组带真值标注的6D位姿数据。

链接: https://arxiv.org/abs/2506.09699
作者: Mattia Nardon,Mikel Mujika Agirre,Ander González Tomé,Daniel Sedano Algarabel,Josep Rueda Collell,Ana Paola Caro,Andrea Caraffa,Fabio Poiesi,Paul Ian Chippendale,Davide Boscaini
机构: FBK-TeV(布鲁诺·凯斯勒基金会视觉技术研究组); Ikerlan(伊克尔兰研究中心); Andreu World(安德鲁世界)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot’s kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.
zh

[CV-50] Towards Practical Alzheimer’s Disease Diagnosis: A Lightweight and Interpretable Spiking Neural Model

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断,尤其是在轻度认知障碍(Mild Cognitive Impairment, MCI)阶段的诊断问题,其面临主观评估和多模态影像技术高成本的挑战。现有深度学习方法虽能提供自动化替代方案,但存在能耗高和计算需求大的问题,限制了其在资源有限环境中的实际应用。为克服这些问题,论文提出了一种混合神经架构FasterSNN,其关键在于整合生物启发的LIF神经元与区域自适应卷积及多尺度脉冲注意力机制,从而实现对3D MRI数据的稀疏高效处理,同时保持诊断准确性。

链接: https://arxiv.org/abs/2506.09695
作者: Changwei Wu,Yifei Chen,Yuxin Du,Jinying Zong,Jie Dong,Mingxuan Liu,Yong Peng,Jin Fan,Feiwei Qin,Changmiao Wang
机构: Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学); Shenzhen Research Institute of Big Data (深圳大数据研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Early diagnosis of Alzheimer’s Disease (AD), especially at the mild cognitive impairment (MCI) stage, is vital yet hindered by subjective assessments and the high cost of multimodal imaging modalities. Although deep learning methods offer automated alternatives, their energy inefficiency and computational demands limit real-world deployment, particularly in resource-constrained settings. As a brain-inspired paradigm, spiking neural networks (SNNs) are inherently well-suited for modeling the sparse, event-driven patterns of neural degeneration in AD, offering a promising foundation for interpretable and low-power medical diagnostics. However, existing SNNs often suffer from weak expressiveness and unstable training, which restrict their effectiveness in complex medical tasks. To address these limitations, we propose FasterSNN, a hybrid neural architecture that integrates biologically inspired LIF neurons with region-adaptive convolution and multi-scale spiking attention. This design enables sparse, efficient processing of 3D MRI while preserving diagnostic accuracy. Experiments on benchmark datasets demonstrate that FasterSNN achieves competitive performance with substantially improved efficiency and stability, supporting its potential for practical AD screening. Our source code is available at this https URL.
zh
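
LIF 神经元的离散动力学与替代梯度是 SNN 可端到端训练的关键。以下为最简示意:膜电位泄漏积分、过阈发放、软复位,反向传播用快速 sigmoid 近似阶跃函数的导数。具体神经元形式为常见教科书写法,与论文的区域自适应卷积等模块无关,仅演示 LIF 本身:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """脉冲发放的替代梯度:前向为阶跃函数,反向用快速 sigmoid 近似。"""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out / (1 + 10 * v.abs()) ** 2

def lif_forward(x_seq, tau=2.0, v_th=1.0):
    """LIF 神经元的离散时间动力学:
    v[t] = v[t-1] + (x[t] - v[t-1]) / tau;过阈发放并软复位。"""
    v = torch.zeros_like(x_seq[0])
    spikes = []
    for x in x_seq:                       # 按时间步迭代
        v = v + (x - v) / tau             # 泄漏积分
        s = SurrogateSpike.apply(v - v_th)
        v = v - s * v_th                  # 软复位,保留超阈部分
        spikes.append(s)
    return torch.stack(spikes)

x_seq = torch.randn(8, 4, 16, requires_grad=True)  # (T, B, 特征)
out = lif_forward(x_seq)
out.sum().backward()                               # 替代梯度使训练可行
print(out.mean())                                  # 平均发放率
```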

[CV-51] Reasoning Models Are More Easily Gaslighted Than You Think

【速读】:该论文试图解决当前推理导向模型在面对误导性用户输入时表现出的鲁棒性不足问题,特别是其在应对气体操纵(gaslighting)否定提示时的脆弱性。解决方案的关键在于构建一个专门用于评估此类模型在遭受气体操纵反馈时信念坚持能力的诊断基准——GaslightingBench-R,该基准通过筛选和整理1,025个具有挑战性的样本,揭示了现有顶级推理模型在面对操纵性输入时的显著性能下降,从而突显出推理模型在逐步推理与信念持续性之间的根本性差距。

链接: https://arxiv.org/abs/2506.09677
作者: Bin Zhu,Hailong Yin,Jingjing Chen,Yu-Gang Jiang
机构: Singapore Management University(新加坡管理大学); Fudan University(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25-29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models’ susceptibility to defend their belief under gaslighting negation prompt. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average. Our findings reveal fundamental limitations in the robustness of reasoning models, highlighting the gap between step-by-step reasoning and belief persistence.
zh

[CV-52] CINeMA: Conditional Implicit Neural Multi-Modal Atlas for a Spatio-Temporal Representation of the Perinatal Brain

【速读】:该论文旨在解决在胎儿和新生儿大脑快速神经发育过程中,由于病理数据稀缺而导致的传统脑图谱和深度学习方法难以有效建模的问题。其解决方案的关键在于提出一种名为CINeMA(Conditional Implicit Neural Multi-Modal Atlas)的新框架,该框架通过在潜在空间中操作,避免了计算密集型的图像配准过程,从而将图谱构建时间从数天缩短至数分钟,并支持对解剖特征如孕周、出生年龄及病理如脑室扩张(Ventriculomegaly, VM)和胼胝体发育不全(Agenesis of the Corpus Callosum, ACC)的灵活条件控制。

链接: https://arxiv.org/abs/2506.09668
作者: Maik Dannecker,Vasiliki Sideri-Lampretsa,Sophie Starck,Angeline Mihailov,Mathieu Milh,Nadine Girard,Guillaume Auzias,Daniel Rueckert
机构: Technical University of Munich (慕尼黑工业大学); Aix-Marseille University (艾克斯-马赛大学); Imperial College London (帝国理工学院); Institut de Neurosciences de la Timone (蒂蒙神经科学研究所); CNRS (法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Work currently under revision for IEEE TMI

点击查看摘要

Abstract:Magnetic resonance imaging of fetal and neonatal brains reveals rapid neurodevelopment marked by substantial anatomical changes unfolding within days. Studying this critical stage of the developing human brain, therefore, requires accurate brain models-referred to as atlases-of high spatial and temporal resolution. To meet these demands, established traditional atlases and recently proposed deep learning-based methods rely on large and comprehensive datasets. This poses a major challenge for studying brains in the presence of pathologies for which data remains scarce. We address this limitation with CINeMA (Conditional Implicit Neural Multi-Modal Atlas), a novel framework for creating high-resolution, spatio-temporal, multimodal brain atlases, suitable for low-data settings. Unlike established methods, CINeMA operates in latent space, avoiding compute-intensive image registration and reducing atlas construction times from days to minutes. Furthermore, it enables flexible conditioning on anatomical features including GA, birth age, and pathologies like ventriculomegaly (VM) and agenesis of the corpus callosum (ACC). CINeMA supports downstream tasks such as tissue segmentation and age prediction whereas its generative properties enable synthetic data creation and anatomically informed data augmentation. Surpassing state-of-the-art methods in accuracy, efficiency, and versatility, CINeMA represents a powerful tool for advancing brain research. We release the code and atlases at this https URL.
zh

[CV-53] VideoMat: Extracting PBR Materials from Video Diffusion Models

【速读】:该论文旨在解决从文本提示或单张图像生成高质量3D模型材质的问题,其核心挑战在于如何确保生成的材质在不同视角下保持一致且符合物理渲染规范。解决方案的关键在于利用微调的视频扩散模型(video diffusion model)来生成与输入几何和光照条件一致的多视角视频,随后通过内在分解(intrinsic decomposition)模型提取基础颜色、粗糙度和金属度等材质属性,并结合可微分路径追踪器(differentiable path tracer)直接生成与主流内容创作工具兼容的基于物理的渲染(PBR)材质。

链接: https://arxiv.org/abs/2506.09665
作者: Jacob Munkberg,Zian Wang,Ruofan Liang,Tianchang Shen,Jon Hasselgren
机构: NVIDIA(英伟达); University of Toronto(多伦多大学); Vector Institute(向量研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We leverage finetuned video diffusion models, intrinsic decomposition of videos, and physically-based differentiable rendering to generate high quality materials for 3D models given a text prompt or a single image. We condition a video diffusion model to respect the input geometry and lighting condition. This model produces multiple views of a given 3D model with coherent material properties. Secondly, we use a recent model to extract intrinsics (base color, roughness, metallic) from the generated video. Finally, we use the intrinsics alongside the generated video in a differentiable path tracer to robustly extract PBR materials directly compatible with common content creation tools.
zh

[CV-54] Self-Supervised Multi-Part Articulated Objects Modeling via Deformable Gaussian Splatting and Progressive Primitive Segmentation

【速读】:该论文旨在解决在无监督条件下对包含多个可动部件的柔性物体进行统一的几何与运动三维表示问题。现有方法在缺乏人工标注的情况下难以构建此类对象的统一表征。其解决方案的关键在于提出DeGSS框架,该框架将柔性物体编码为可变形的3D高斯场(deformable 3D Gaussian fields),在单一紧凑表示中嵌入几何、外观和运动信息。通过建模每个交互状态为共享场的平滑变形,结合渐进式的粗到细部分分割策略,实现无需监督的刚性组件识别,并提供空间连续且完全解耦的部件描述。

Link: https://arxiv.org/abs/2506.09663
Authors: Haowen Wang, Xiaoping Yuan, Zhao Jin, Zhen Zhao, Zhengping Che, Yousong Xue, Jin Tian, Yakun Huang, Jian Tang
Institutions: Anhui University; Beijing Innovation Center of Humanoid Robotics; Beijing Institute of Architecture Design; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Articulated objects are ubiquitous in everyday life, and accurate 3D representations of their geometry and motion are critical for numerous applications. However, in the absence of human annotation, existing approaches still struggle to build a unified representation for objects that contain multiple movable parts. We introduce DeGSS, a unified framework that encodes articulated objects as deformable 3D Gaussian fields, embedding geometry, appearance, and motion in one compact representation. Each interaction state is modeled as a smooth deformation of a shared field, and the resulting deformation trajectories guide a progressive coarse-to-fine part segmentation that identifies distinct rigid components, all in an unsupervised manner. The refined field provides a spatially continuous, fully decoupled description of every part, supporting part-level reconstruction and precise modeling of their kinematic relationships. To evaluate generalization and realism, we enlarge the synthetic PartNet-Mobility benchmark and release RS-Art, a real-to-sim dataset that pairs RGB captures with accurately reverse-engineered 3D models. Extensive experiments demonstrate that our method outperforms existing methods in both accuracy and stability.

[CV-55] HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

【Quick Read】: This paper targets textual-reference-guided human action segmentation in multi-person scenarios, i.e., precisely segmenting the actions of a specified target person in a video according to a given textual description. Existing methods mainly handle single-person activities with fixed action sequences and overlook the complexity of multi-person scenes. The key to the proposed solution is HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework, whose core components are a novel cross-input gate attentional xLSTM that strengthens holistic-partial long-range reasoning and a Fourier condition that provides finer-grained control over action segmentation.

Link: https://arxiv.org/abs/2506.09650
Authors: Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen
Institutions: Karlsruhe Institute of Technology; Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; Hunan University; Shanghai AI Lab; HEBUST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The code is available at this https URL

Abstract:Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and annotated with 137 fine-grained actions with 33h video data, together with textual descriptions for this new task. Benchmarking existing action recognition methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, leveraging a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to introduce more fine-grained control to improve the action segmentation generation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The code is available at this https URL.

[CV-56] DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

【Quick Read】: This paper aims to mitigate the performance degradation of autoencoders under high compression ratios and the training instability introduced by GANs. The key to the solution is strengthening the decoder's expressiveness: a diffusion model guides the decoder in recovering information that is not fully decoded from the latent representation, which effectively alleviates degradation at high spatial compression rates and enables a more efficient, compact latent representation.

Link: https://arxiv.org/abs/2506.09644
Authors: Dongxu Liu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao
Institutions: Institute of Automation, Chinese Academy of Sciences; Tsinghua University; Institute of Computing Technology, Chinese Academy of Sciences; StepFun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder’s expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.

[CV-57] FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models

【Quick Read】: This paper tackles the difficulty of deploying vision-language models (VLMs) in privacy-sensitive domains such as healthcare, where most existing approaches rely on centralized training and cannot satisfy strict privacy requirements. The key to the solution is a federated learning (FL) framework for distributed, privacy-preserving VLM fine-tuning. The paper presents FedVLMBench, the first systematic benchmark for federated fine-tuning of VLMs, covering multiple VLM architectures, fine-tuning strategies, and cross-domain task settings; extensive experiments reveal the interplay among VLM architectures, fine-tuning methods, data heterogeneity, and multi-task federated optimization, providing essential tools and empirical guidance for privacy-preserving training of multimodal foundation models.

Link: https://arxiv.org/abs/2506.09638
Authors: Weiying Zheng, Ziyue Lin, Pengxin Guo, Yuyin Zhou, Feifei Wang, Liangqiong Qu
Institutions: The University of Hong Kong; UC Santa Cruz
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in cross-modal understanding and generation by integrating visual and textual information. While instruction tuning and parameter-efficient fine-tuning methods have substantially improved the generalization of VLMs, most existing approaches rely on centralized training, posing challenges for deployment in domains with strict privacy requirements like healthcare. Recent efforts have introduced Federated Learning (FL) into VLM fine-tuning to address these privacy concerns, yet comprehensive benchmarks for evaluating federated fine-tuning strategies, model architectures, and task generalization remain lacking. In this work, we present \textbfFedVLMBench, the first systematic benchmark for federated fine-tuning of VLMs. FedVLMBench integrates two mainstream VLM architectures (encoder-based and encoder-free), four fine-tuning strategies, five FL algorithms, six multimodal datasets spanning four cross-domain single-task scenarios and two cross-domain multitask settings, covering four distinct downstream task categories. Through extensive experiments, we uncover key insights into the interplay between VLM architectures, fine-tuning strategies, data heterogeneity, and multi-task federated optimization. Notably, we find that a 2-layer multilayer perceptron (MLP) connector with concurrent connector and LLM tuning emerges as the optimal configuration for encoder-based VLMs in FL. Furthermore, current FL methods exhibit significantly higher sensitivity to data heterogeneity in vision-centric tasks than text-centric ones, across both encoder-free and encoder-based VLM architectures. Our benchmark provides essential tools, datasets, and empirical guidance for the research community, offering a standardized platform to advance privacy-preserving, federated training of multimodal foundation models.

[CV-58] HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

【Quick Read】: This paper addresses the limitations of existing methods on 3D medical imaging: by focusing mostly on 2D medical images, they fail to capture complex 3D anatomical structures, which easily leads to misinterpretation of subtle pathologies and diagnostic hallucinations. The key to the solution is the Hybrid Spatial Encoding Network (HSENet), which perceives both global volumetric context and fine-grained anatomical details through dual 3D vision encoders, and uses a Spatial Packer to efficiently condense high-resolution 3D spatial regions into informative visual tokens, enabling accurate and robust vision-language understanding.

Link: https://arxiv.org/abs/2506.09634
Authors: Yanzhao Shi, Xiaodan Zhang, Junzhong Ji, Haoning Jiang, Chengxin Zheng, Yinong Wang, Liangqiong Qu
Institutions: The University of Hong Kong; Beijing University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 9 figures. arXiv admin note: text overlap with arXiv:2410.14200 by other authors

Abstract:Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues by effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By assigning spatial packers with dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to LLM’s semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (39.85% of R@100, +5.96% gain), 3D medical report generation (24.01% of BLEU-4, +8.01% gain), and 3D visual question answering (73.60% of Major Class Accuracy, +1.99% gain), confirming its effectiveness. Our code is available at this https URL.
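As a rough illustration of centroid-based token compression, the general idea behind a spatial packer, the sketch below pools a large set of visual tokens into a few centroid tokens with plain k-means; the paper's actual Spatial Packer design may differ, and all shapes here are illustrative assumptions.

```python
import torch

def pack_tokens(tokens: torch.Tensor, k: int = 32, iters: int = 10) -> torch.Tensor:
    """Compress n visual tokens of shape (n, d) into k centroid tokens via k-means."""
    n, _ = tokens.shape
    centroids = tokens[torch.randperm(n)[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)  # nearest centroid
        for j in range(k):
            members = tokens[assign == j]
            if members.numel() > 0:
                centroids[j] = members.mean(dim=0)
    return centroids

packed = pack_tokens(torch.randn(4096, 256))  # 4096 region tokens -> 32 tokens
```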

[CV-59] ECAM: A Contrastive Learning Approach to Avoid Environmental Collision in Trajectory Forecasting IJCNN2025

【Quick Read】: This paper addresses collisions in human trajectory forecasting caused by ignoring environmental factors: in applications such as autonomous driving, robotics, and surveillance, existing methods account for social interactions, multimodal predictions, and pedestrian intention, yet often fail to fully consider the environment, producing predictions that collide with obstacles. The key to the solution is ECAM (Environmental Collision Avoidance Module), a contrastive-learning-based module that strengthens a model's ability to avoid collisions with the environment and can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions.

Link: https://arxiv.org/abs/2506.09626
Authors: Giacomo Rosin, Muhammad Rameez Ur Rahman, Sebastiano Vascon
Institutions: Ca' Foscari University of Venice; European Centre for Living Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCNN 2025

Abstract:Human trajectory forecasting is crucial in applications such as autonomous driving, robotics and surveillance. Accurate forecasting requires models to consider various factors, including social interactions, multi-modal predictions, pedestrian intention and environmental context. While existing methods account for these factors, they often overlook the impact of the environment, which leads to collisions with obstacles. This paper introduces ECAM (Environmental Collision Avoidance Module), a contrastive learning-based module to enhance collision avoidance ability with the environment. The proposed module can be integrated into existing trajectory forecasting models, improving their ability to generate collision-free predictions. We evaluate our method on the ETH/UCY dataset and quantitatively and qualitatively demonstrate its collision avoidance capabilities. Our experiments show that state-of-the-art methods significantly reduce (-40/50%) the collision rate when integrated with the proposed module. The code is available at this https URL.
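A minimal sketch of a contrastive environment-collision objective in the spirit of ECAM follows; the actual module and loss in the paper may differ, and the embedding shapes and binary treatment of collisions are assumptions.

```python
import torch
import torch.nn.functional as F

def collision_contrastive_loss(traj_emb, env_emb, collides, temperature=0.1):
    """traj_emb, env_emb: (B, D) embeddings of predicted trajectories and of the
    local environment; collides: (B,) bool, True if a trajectory hits an obstacle.
    Pulls collision-free pairs together and pushes colliding pairs apart."""
    sim = F.cosine_similarity(traj_emb, env_emb) / temperature
    labels = (~collides).float()  # collision-free pairs are the positives
    return F.binary_cross_entropy_with_logits(sim, labels)

loss = collision_contrastive_loss(
    torch.randn(16, 64), torch.randn(16, 64), torch.rand(16) > 0.5
)
```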

[CV-60] Consistent Story Generation with Asymmetry Zigzag Sampling

【Quick Read】: This paper addresses the problem of maintaining subject consistency across multiple images generated by text-to-image models, a fundamental requirement for visual storytelling. Existing approaches either fine-tune models on large-scale story visualization datasets, which is resource-intensive, or use training-free techniques that share information during generation with limited success. The key to the solution is a new training-free sampling strategy, Zigzag Sampling with Asymmetric Prompts and Visual Sharing, which alternates between asymmetric prompts to preserve subject characteristics while a visual sharing module transfers visual cues across generated images to enforce consistency.

Link: https://arxiv.org/abs/2506.09612
Authors: Mingxiao Li, Mang Ning, Marie-Francine Moens
Institutions: KU Leuven; Utrecht University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures

Abstract:Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach proposes a zigzag sampling mechanism that alternates between asymmetric prompting to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at this https URL.
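The alternating-prompt idea can be sketched with a generic denoising loop; `denoise` below is a placeholder for one reverse step of a real text-to-image model, and every shape and schedule is an illustrative assumption rather than the authors' implementation.

```python
import torch

def denoise(x, t, prompt_emb):
    # Placeholder for one reverse-diffusion step of a real T2I model.
    return x - 0.01 * (x - prompt_emb.mean())

subject_emb = torch.randn(77, 768)  # identity-preserving (asymmetric) prompt
frame_emb = torch.randn(77, 768)    # per-image story prompt

x = torch.randn(4, 64, 64)
for step, t in enumerate(range(50, 0, -1)):
    emb = subject_emb if step % 2 == 0 else frame_emb  # zigzag between prompts
    x = denoise(x, t, emb)
```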

[CV-61] SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

【Quick Read】: This paper aims to overcome the shortcomings of traditional 3D scene understanding methods in geometry reconstruction quality, semantic consistency, and multimodal feature fusion, especially for holistic geometry-appearance-semantics modeling from sparse-view images. The key to the solution is SemanticSplat, which unifies 3D Gaussians with latent semantic attributes and fuses multiple feature fields (e.g., LSeg, SAM) with a cost-volume representation to enhance scene coherence and accuracy, while a two-stage distillation framework reconstructs a holistic multimodal semantic feature field from sparse-view images.

Link: https://arxiv.org/abs/2506.09565
Authors: Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu
Institutions: Tsinghua University; Beijing Normal University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry reconstruction and noisy artifacts. In contrast, per-scene optimization methods rely on dense input views, which reduces practicality and increases complexity during deployment. In this paper, we propose SemanticSplat, a feed-forward semantic-aware 3D reconstruction method, which unifies 3D Gaussians with latent semantic attributes for joint geometry-appearance-semantics modeling. To predict the semantic anisotropic Gaussians, SemanticSplat fuses diverse feature fields (e.g., LSeg, SAM) with a cost volume representation that stores cross-view feature similarities, enhancing coherent and accurate scene comprehension. Leveraging a two-stage distillation framework, SemanticSplat reconstructs a holistic multi-modal semantic feature field from sparse-view images. Experiments demonstrate the effectiveness of our method for 3D scene understanding tasks like promptable and open-vocabulary segmentation. Video results are available at this https URL.

[CV-62] AD2-Bench: A Hierarchical CoT Benchmark for MLLM in Autonomous Driving under Adverse Conditions

【Quick Read】: This paper addresses the lack of rigorous evaluation of chain-of-thought (CoT) reasoning in multimodal large language models (MLLMs) under adverse weather and complex traffic environments; existing benchmarks overlook these specific scenarios, leaving model reasoning incompletely understood. The authors propose AD^2-Bench, the first CoT benchmark designed for autonomous driving in adverse weather and complex scenes. The key is a dataset meeting three core criteria: comprehensive coverage of diverse adverse environments, fine-grained annotations supporting multi-step reasoning, and a dedicated framework for evaluating CoT performance. Its core contribution is more than 5.4k high-quality, manually annotated CoT instances, where each intermediate reasoning step is treated as an atomic unit with an explicit ground-truth label, enabling fine-grained analysis of MLLM reasoning under text-level, point-level, and region-level visual prompts.

Link: https://arxiv.org/abs/2506.09557
Authors: Zhaoyang Wei, Chenhui Qiang, Bowen Jiang, Xumeng Han, Xuehui Yu, Zhenjun Han
Institutions: University of Chinese Academy of Sciences; Tencent CDG
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-Thought (CoT) reasoning has emerged as a powerful approach to enhance the structured, multi-step decision-making capabilities of Multi-Modal Large Models (MLLMs), is particularly crucial for autonomous driving with adverse weather conditions and complex traffic environments. However, existing benchmarks have largely overlooked the need for rigorous evaluation of CoT processes in these specific and challenging scenarios. To address this critical gap, we introduce AD^2-Bench, the first Chain-of-Thought benchmark specifically designed for autonomous driving with adverse weather and complex scenes. AD^2-Bench is meticulously constructed to fulfill three key criteria: comprehensive data coverage across diverse adverse environments, fine-grained annotations that support multi-step reasoning, and a dedicated evaluation framework tailored for assessing CoT performance. The core contribution of AD^2-Bench is its extensive collection of over 5.4k high-quality, manually annotated CoT instances. Each intermediate reasoning step in these annotations is treated as an atomic unit with explicit ground truth, enabling unprecedented fine-grained analysis of MLLMs’ inferential processes under text-level, point-level, and region-level visual prompts. Our comprehensive evaluation of state-of-the-art MLLMs on AD^2-Bench reveals accuracy below 60%, highlighting the benchmark’s difficulty and the need to advance robust, interpretable end-to-end autonomous driving systems. AD^2-Bench thus provides a standardized evaluation platform, driving research forward by improving MLLMs’ reasoning in autonomous driving, making it an invaluable resource.

[CV-63] GLD-Road:A global-local decoding road network extraction model for remote sensing images

【Quick Read】: This paper tackles the efficiency-accuracy trade-off in road network extraction: existing global parallel methods are fast but tend to miss nodes, while local iterative methods are accurate but computationally expensive. The key to the solution is GLD-Road, a two-stage model that combines global efficiency with local precision: the first stage detects road nodes and links them via a Connect module, and the second stage iteratively repairs broken roads with local searches, drastically reducing computation. Experiments show the method maintains high accuracy while greatly improving processing speed.

Link: https://arxiv.org/abs/2506.09553
Authors: Ligao Deng, Yupeng Deng, Yu Meng, Jingbo Chen, Zhihao Xi, Diyou Liu, Qifeng Chu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Road networks are crucial for mapping, autonomous driving, and disaster response. While manual annotation is costly, deep learning offers efficient extraction. Current methods include postprocessing (prone to errors), global parallel (fast but misses nodes), and local iterative (accurate but slow). We propose GLD-Road, a two-stage model combining global efficiency and local precision. First, it detects road nodes and connects them via a Connect Module. Then, it iteratively refines broken roads using local searches, drastically reducing computation. Experiments show GLD-Road outperforms state-of-the-art methods, improving APLS by 1.9% (City-Scale) and 0.67% (SpaceNet3). It also reduces retrieval time by 40% vs. Sat2Graph (global) and 92% vs. RNGDet++ (local). The experimental results are available at this https URL.
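The local repair stage can be pictured with a toy stand-in: the paper's local search is learned, but a purely geometric heuristic illustrates the idea of bridging nearby dangling endpoints; the radius and coordinates below are illustrative.

```python
import numpy as np

def bridge_gaps(endpoints: np.ndarray, radius: float = 20.0):
    """Connect dangling road endpoints that lie within `radius` pixels."""
    edges = []
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            if np.linalg.norm(endpoints[i] - endpoints[j]) < radius:
                edges.append((i, j))
    return edges

pts = np.array([[0.0, 0.0], [5.0, 4.0], [100.0, 100.0]])
print(bridge_gaps(pts))  # [(0, 1)]
```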

[CV-64] Enhancing Human-Robot Collaboration: A Sim2Real Domain Adaptation Algorithm for Point Cloud Segmentation in Industrial Environments

【Quick Read】: This paper addresses robust semantic segmentation of 3D environments for human-robot collaboration (HRC), where safety and operational efficiency are paramount. The core difficulty is the scarcity of high-quality annotated data from real industrial scenes and the performance bottleneck of simulation-to-reality (Sim2Real) domain adaptation. The key to the solution is a dual-stream network architecture (FUSION) combining a Dynamic Graph Convolutional Neural Network (DGCNN) with a Convolutional Neural Network (CNN) augmented with residual layers, serving as a Sim2Real domain adaptation algorithm for semantic segmentation in industrial environments and enabling efficient transfer from simulation to real-world settings.

Link: https://arxiv.org/abs/2506.09552
Authors: Fatemeh Mohammadi Amin, Darwin G. Caldwell, Hans Wernher van de Venn
Institutions: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, Journal of Intelligent Robotic Systems

Abstract:The robust interpretation of 3D environments is crucial for human-robot collaboration (HRC) applications, where safety and operational efficiency are paramount. Semantic segmentation plays a key role in this context by enabling a precise and detailed understanding of the environment. Considering the intense data hunger for real-world industrial annotated data essential for effective semantic segmentation, this paper introduces a pioneering approach in the Sim2Real domain adaptation for semantic segmentation of 3D point cloud data, specifically tailored for HRC. Our focus is on developing a network that robustly transitions from simulated environments to real-world applications, thereby enhancing its practical utility and impact on a safe HRC. In this work, we propose a dual-stream network architecture (FUSION) combining Dynamic Graph Convolutional Neural Networks (DGCNN) and Convolutional Neural Networks (CNN) augmented with residual layers as a Sim2Real domain adaptation algorithm for an industrial environment. The proposed model was evaluated on real-world HRC setups and simulation industrial point clouds, it showed increased state-of-the-art performance, achieving a segmentation accuracy of 97.76%, and superior robustness compared to existing methods.

[CV-65] 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection

【Quick Read】: This paper addresses the ambiguity in establishing correspondences between images and 3D representations in image-based 3D object detection, caused by the absence of 3D geometric cues. The key to the solution is generating efficient explicit and implicit 3D geometric representations from predicted depth: the predicted depth is used to learn voxel occupancy, the voxelized 3D feature volume is explicitly refined through the proposed voxel occupancy attention, and the feature volume is further fused with an implicit 3D representation, the truncated signed distance function (TSDF), improving the model's understanding of 3D geometry without supervision from 3D signals.

Link: https://arxiv.org/abs/2506.09541
Authors: Yi Zhang, Yi Wang, Yawen Cui, Lap-Pui Chau
Institutions: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia

Abstract:This paper proposes 3DGeoDet, a novel geometry-aware 3D object detection approach that effectively handles single- and multi-view RGB images in indoor and outdoor environments, showcasing its general-purpose applicability. The key challenge for image-based 3D object detection tasks is the lack of 3D geometric cues, which leads to ambiguity in establishing correspondences between images and 3D representations. To tackle this problem, 3DGeoDet generates efficient 3D geometric representations in both explicit and implicit manners based on predicted depth information. Specifically, we utilize the predicted depth to learn voxel occupancy and optimize the voxelized 3D feature volume explicitly through the proposed voxel occupancy attention. To further enhance 3D awareness, the feature volume is integrated with an implicit 3D representation, the truncated signed distance function (TSDF). Without requiring supervision from 3D signals, we significantly improve the model’s comprehension of 3D geometry by leveraging intermediate 3D representations and achieve end-to-end training. Our approach surpasses the performance of state-of-the-art image-based methods on both single- and multi-view benchmark datasets across diverse environments, achieving a 9.3 mAP@0.5 improvement on the SUN RGB-D dataset, a 3.3 mAP@0.5 improvement on the ScanNetV2 dataset, and a 0.19 AP3D@0.7 improvement on the KITTI dataset. The project page is available at: this https URL.
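The occupancy-from-depth step can be illustrated with a small unprojection routine; this is a sketch under assumed camera intrinsics, voxel size, and grid extents, not the paper's implementation.

```python
import numpy as np

def depth_to_occupancy(depth, K, voxel_size=0.2, grid=(40, 40, 20)):
    """Unproject a depth map with intrinsics K and mark the occupied voxels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]   # pinhole back-projection
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    idx = np.floor(np.stack([x, y, z], axis=1) / voxel_size).astype(int)
    occ = np.zeros(grid, dtype=bool)
    valid = ((idx >= 0) & (idx < np.array(grid))).all(axis=1)
    occ[tuple(idx[valid].T)] = True
    return occ

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
occ = depth_to_occupancy(np.full((480, 640), 2.0), K)
```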

[CV-66] AngleRoCL: Angle-Robust Concept Learning for Physically View-Invariant T2I Adversarial Patches

【Quick Read】: This paper addresses the weak attack effectiveness of text-to-image (T2I) adversarial patches under viewpoint changes in the physical world, i.e., the angle robustness of T2I adversarial patches. Existing methods overlook the drop in attack effectiveness across viewing angles, limiting practical applicability. The key to the solution is Angle-Robust Concept Learning (AngleRoCL), which learns a generalizable concept (implemented as text embeddings) representing the capability of generating angle-robust patches; the learned concept can be incorporated into textual prompts to guide T2I models toward patches whose attack effectiveness is inherently resistant to viewpoint variations.

Link: https://arxiv.org/abs/2506.09538
Authors: Wenjun Ji, Yuxiang Fu, Luyang Ying, Deng-Ping Fan, Yuyi Wang, Ming-Ming Cheng, Ivor Tsang, Qing Guo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cutting-edge works have demonstrated that text-to-image (T2I) diffusion models can generate adversarial patches that mislead state-of-the-art object detectors in the physical world, revealing detectors’ vulnerabilities and risks. However, these methods neglect the T2I patches’ attack effectiveness when observed from different views in the physical world (i.e., angle robustness of the T2I adversarial patches). In this paper, we study the angle robustness of T2I adversarial patches comprehensively, revealing their angle-robust issues, demonstrating that texts affect the angle robustness of generated patches significantly, and task-specific linguistic instructions fail to enhance the angle robustness. Motivated by the studies, we introduce Angle-Robust Concept Learning (AngleRoCL), a simple and flexible approach that learns a generalizable concept (i.e., text embeddings in implementation) representing the capability of generating angle-robust patches. The learned concept can be incorporated into textual prompts and guides T2I models to generate patches with their attack effectiveness inherently resistant to viewpoint variations. Through extensive simulation and physical-world experiments on five SOTA detectors across multiple views, we demonstrate that AngleRoCL significantly enhances the angle robustness of T2I adversarial patches compared to baseline methods. Our patches maintain high attack success rates even under challenging viewing conditions, with over 50% average relative improvement in attack effectiveness across multiple angles. This research advances the understanding of physically angle-robust patches and provides insights into the relationship between textual concepts and physical properties in T2I-generated contents.

[CV-67] Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

【Quick Read】: This paper addresses the memory and rendering overhead of 3D Gaussian Splatting (3DGS), which typically requires millions of redundant Gaussian primitives. Existing compression methods prune Gaussians with heuristic importance scores but offer no global fidelity guarantee. The key to the solution is a new optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction: a compact geometric representation is produced by minimizing a composite transport divergence over a KD-tree partition, and appearance is decoupled from geometry by fine-tuning color and opacity attributes, preserving high rendering quality with far fewer Gaussian primitives.

Link: https://arxiv.org/abs/2506.09534
Authors: Tao Wang, Mengyu Li, Geduo Zeng, Cheng Meng, Qiong Zhang
Institutions: Renmin University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 8 figures

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering.
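The mixture-reduction step can be grounded with a standard moment-preserving merge of Gaussians; this is a generic building block only, since the paper instead minimizes a composite transport divergence over a KD-tree partition.

```python
import numpy as np

def merge_gaussians(weights, means, covs):
    """Merge a group of weighted Gaussians into one, preserving the first two
    moments of the mixture: w = sum(w_i), mu = sum(w_i mu_i)/w,
    Sigma = sum(w_i (Sigma_i + (mu_i - mu)(mu_i - mu)^T))/w."""
    w = weights.sum()
    mu = (weights[:, None] * means).sum(axis=0) / w
    d = means - mu
    cov = (weights[:, None, None] * (covs + d[:, :, None] * d[:, None, :])).sum(axis=0) / w
    return w, mu, cov

w, mu, cov = merge_gaussians(
    np.array([0.4, 0.6]),
    np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    np.stack([np.eye(3) * 0.01, np.eye(3) * 0.02]),
)
```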

[CV-68] HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

【Quick Read】: This paper targets the challenge of reconstructing dynamic 3D scenes from monocular videos, specifically the difficulty of learning structured and temporally consistent motion representations when extending 3D Gaussian Splatting (3DGS) to dynamic scenes. Existing methods suffer from three limitations: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. The key to the solution is the HAIF-GS framework, which achieves structured and consistent dynamic modeling through sparse anchor-driven deformation, including an Anchor Filter that identifies motion-relevant regions, a self-supervised induced-flow-guided deformation module, and a hierarchical anchor propagation mechanism, improving rendering quality, temporal coherence, and reconstruction efficiency.

Link: https://arxiv.org/abs/2506.09518
Authors: Jianing Chen, Zehao Li, Yujun Cai, Hao Jiang, Chengxuan Qian, Juyuan Kang, Shuqin Gao, Honglong Zhao, Tianlu Mao, Yucheng Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppresses redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.

[CV-69] Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals

【Quick Read】: This paper addresses two issues in learned point cloud attribute compression: entropy parameters estimated by neural networks leave usable information unexploited, and fixed likelihood intervals limit model performance. The key to the solution is a generalized Gaussian entropy model whose shape parameter allows more accurate estimation of the probability distribution of latents, together with a Mean Error Discriminator (MED) that judges whether the entropy parameter estimate is accurate and dynamically adjusts the likelihood intervals.

Link: https://arxiv.org/abs/2506.09510
Authors: Changhao Peng, Yuqi Ye, Wei Gao
Institutions: Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Gaussian and Laplacian entropy models are proved effective in learned point cloud attribute compression, as they assist in arithmetic coding of latents. However, we demonstrate through experiments that there is still unutilized information in entropy parameters estimated by neural networks in current methods, which can be used for more accurate probability estimation. Thus we introduce generalized Gaussian entropy model, which controls the tail shape through shape parameter to more accurately estimate the probability of latents. Meanwhile, to the best of our knowledge, existing methods use fixed likelihood intervals for each integer during arithmetic coding, which limits model performance. We propose Mean Error Discriminator (MED) to determine whether the entropy parameter estimation is accurate and then dynamically adjust likelihood intervals. Experiments show that our method significantly improves rate-distortion (RD) performance on three VAE-based models for point cloud attribute compression, and our method can be applied to other compression tasks, such as image and video compression.
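The symbol-probability computation under a generalized Gaussian is easy to make concrete with SciPy's `gennorm`; this sketches the entropy-model side only, and the `half_width` argument is an assumption hinting at where a MED-driven dynamic interval would act.

```python
from scipy.stats import gennorm

def bin_probability(x, loc, scale, beta, half_width=0.5):
    """Probability mass of integer symbol x under a generalized Gaussian.
    beta=2 recovers the Gaussian and beta=1 the Laplacian; a dynamic interval
    scheme would vary half_width instead of fixing it at 0.5."""
    return (gennorm.cdf(x + half_width, beta, loc=loc, scale=scale)
            - gennorm.cdf(x - half_width, beta, loc=loc, scale=scale))

p = bin_probability(3, loc=2.4, scale=1.1, beta=1.5)  # mass fed to the coder
```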

[CV-70] DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects

【Quick Read】: This paper addresses missing depth information for transparent and reflective objects: their distinctive visual properties (specular reflection and light transmission) lead to incomplete or inaccurate depth estimation, which severely harms downstream geometry-based vision tasks. The key to the solution is DCIRNet, a novel multimodal depth completion network that effectively fuses RGB images and depth maps to improve depth estimation quality. Its core innovations are a multimodal feature fusion module that extracts complementary information between RGB images and incomplete depth maps, and a multi-stage supervision and depth refinement strategy that progressively improves depth completion and effectively mitigates blurred object boundaries.

Link: https://arxiv.org/abs/2506.09491
Authors: Guanghu Xie, Zhiduo Jiang, Yonglong Zhang, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu
Institutions: Harbin Institute of Technology
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transparent and reflective objects in everyday environments pose significant challenges for depth sensors due to their unique visual properties, such as specular reflections and light transmission. These characteristics often lead to incomplete or inaccurate depth estimation, which severely impacts downstream geometry-based vision tasks, including object recognition, scene reconstruction, and robotic manipulation. To address the issue of missing depth information in transparent and reflective objects, we propose DCIRNet, a novel multimodal depth completion network that effectively integrates RGB images and depth maps to enhance depth estimation quality. Our approach incorporates an innovative multimodal feature fusion module designed to extract complementary information between RGB images and incomplete depth maps. Furthermore, we introduce a multi-stage supervision and depth refinement strategy that progressively improves depth completion and effectively mitigates the issue of blurred object boundaries. We integrate our depth completion model into dexterous grasping frameworks and achieve a 44% improvement in the grasp success rate for transparent and reflective objects. We conduct extensive experiments on public datasets, where DCIRNet demonstrates superior performance. The experimental results validate the effectiveness of our approach and confirm its strong generalization capability across various transparent and reflective objects.

[CV-71] Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression

【Quick Read】: This paper aims to balance generation quality and inference efficiency in image generation, addressing the limitations of models based solely on autoregressive (AR) Transformers or pure diffusion. The key is TransDiff, which marries an AR Transformer with a diffusion model: labels and images are encoded into high-level semantic features and a diffusion process estimates the distribution of image samples, achieving superior quality on ImageNet 256x256 (FID 1.61, IS 293.4) with 2x faster inference than AR Transformer methods and 112x faster than diffusion-only models. Building on TransDiff, the paper further introduces the Multi-Reference Autoregression (MRAR) paradigm, which references multiple previously generated images to learn more diverse representations and further improve generation quality.

Link: https://arxiv.org/abs/2506.09482
Authors: Dingcheng Zhen, Qian Qiao, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
Institutions: Soul AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides x2 faster inference latency compared to state-of-the-art methods based on AR Transformer and x112 faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.

[CV-72] nySplat: Feedforward Approach for Generating Compact 3D Scene Representation

【Quick Read】: This paper addresses the excessive storage cost of feedforward 3D Gaussian Splatting (3DGS) scene representations. Existing 3DGS compression methods rely on scene-wise optimization and are inapplicable due to architectural incompatibility. The key to the solution is TinySplat, a complete feedforward compression approach that integrates a training-free framework to systematically eliminate sources of redundancy: View-Projection Transformation (VPT) reduces geometric redundancy by projecting geometric parameters into a more compact space, Visibility-Aware Basis Reduction (VABR) reduces perceptual redundancy by aligning feature energy along dominant viewing directions via a basis transformation, and spatial redundancy is handled by an off-the-shelf video codec, yielding efficient, high-quality compression of 3DGS data.

Link: https://arxiv.org/abs/2506.09479
Authors: Zetian Song, Jiaye Fu, Jiaqi Zhang, Xiaohan Lu, Chuanmin Jia, Siwei Ma, Wen Gao
Institutions: Peking University, School of Computer Science, State Key Laboratory of Multimedia Information Processing; Peking University, School of Electronic and Computer Engineering, Shenzhen; Wangxuan Institute of Computer Technology, State Key Laboratory of Multimedia Information Processing, Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recent development of feedforward 3D Gaussian Splatting (3DGS) presents a new paradigm to reconstruct 3D scenes. Using neural networks trained on large-scale multi-view datasets, it can directly infer 3DGS representations from sparse input views. Although the feedforward approach achieves high reconstruction speed, it still suffers from the substantial storage cost of 3D Gaussians. Existing 3DGS compression methods relying on scene-wise optimization are not applicable due to architectural incompatibilities. To overcome this limitation, we propose TinySplat, a complete feedforward approach for generating compact 3D scene representations. Built upon standard feedforward 3DGS methods, TinySplat integrates a training-free compression framework that systematically eliminates key sources of redundancy. Specifically, we introduce View-Projection Transformation (VPT) to reduce geometric redundancy by projecting geometric parameters into a more compact space. We further present Visibility-Aware Basis Reduction (VABR), which mitigates perceptual redundancy by aligning feature energy along dominant viewing directions via basis transformation. Lastly, spatial redundancy is addressed through an off-the-shelf video codec. Comprehensive experimental results on multiple benchmark datasets demonstrate that TinySplat achieves over 100x compression for 3D Gaussian data generated by feedforward methods. Compared to the state-of-the-art compression approach, we achieve comparable quality with only 6% of the storage size. Meanwhile, our compression framework requires only 25% of the encoding time and 1% of the decoding time.

[CV-73] Urban1960SatSeg: Unsupervised Semantic Segmentation of Mid-20th century Urban Landscapes with Satellite Imageries

【Quick Read】: This paper addresses the difficulty of semantic segmentation on historical satellite remote sensing (RS) imagery, where quality degradation (distortion, misalignment, spectral scarcity) and absent annotations have long hindered analysis. The key to the solution is the Urban1960SatBench dataset and the Urban1960SatUSM benchmark framework: Urban1960SatBench is an expert-annotated semantic segmentation dataset built on mid-20th-century Keyhole imagery, covering 1,240 km² and key urban classes, while Urban1960SatUSM is an unsupervised semantic segmentation framework built on a self-supervised architecture with a confidence-aware alignment mechanism and a focal-confidence loss, generating robust pseudo-labels and adaptively prioritizing prediction difficulty and label reliability to improve segmentation on noisy historical data without manual supervision.

Link: https://arxiv.org/abs/2506.09476
Authors: Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang, Haohuan Fu
Institutions: Fudan University; National Supercomputing Center in Shenzhen; New York University Shanghai; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Historical satellite imagery, such as mid-20 ^th century Keyhole data, offers rare insights into understanding early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and annotation absence have long hindered semantic segmentation on such historical RS imagery. To bridge this gap and enhance understanding of urban development, we introduce \textbfUrban1960SatBench , an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing segmentation datasets, along with a benchmark framework for unsupervised segmentation tasks, \textbfUrban1960SatUSM . First, \textbfUrban1960SatBench serves as a novel, expertly annotated semantic segmentation dataset built on mid-20 ^th century Keyhole imagery, covering 1,240 km ^2 and key urban classes (buildings, roads, farmland, water). As the earliest segmentation dataset of its kind, it provides a pioneering benchmark for historical urban understanding. Second, \textbfUrban1960SatUSM (Unsupervised Segmentation Model) is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Experiments show Urban1960SatUSM significantly outperforms existing unsupervised segmentation methods on Urban1960SatSeg for segmenting historical urban scenes, promising in paving the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and supplementary material are available at this https URL.

[CV-74] Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning CVPR2025

【Quick Read】: This paper addresses the limited generalization and information redundancy of in-context learning (ICL) for large vision-language models (LVLMs), where existing methods rely on predefined demonstrations or heuristic selection strategies based on human intuition. The key to the solution is an exploration-exploitation reinforcement learning framework that fuses multimodal information and adaptively selects demonstrations as an integrated whole, allowing the model to continually refine its demonstration-selection policy through self-exploration and thereby improving its in-context learning capability.

Link: https://arxiv.org/abs/2506.09473
Authors: Cheng Chen, Yunpeng Zhai, Yifan Zhao, Jinyang Gao, Bolin Ding, Jia Li
Institutions: Tongyi Lab, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, CVPR 2025

Abstract:In-context learning (ICL), a predominant trend in instruction learning, aims at enhancing the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores the policies of multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: First, they rely on pre-defined demonstrations or heuristic selecting strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; Second, individually selecting each demonstration fails in modeling the interactions between them, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling the ability to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.

[CV-75] Optimizing Cooperative Multi-Object Tracking using Graph Signal Processing

【Quick Read】: This paper addresses the incomplete environmental understanding of multi-object tracking (MOT) in autonomous driving caused by the limited perception of a single agent (occlusions, sensor failures, etc.); the core idea is to fuse multi-agent information for a comprehensive view of the environment. The key to the solution is a cooperative MOT framework that builds a fully connected graph topology from the detected bounding boxes and applies graph Laplacian optimization to smooth the position errors of the boxes, effectively fusing information from multiple vehicles. The refined boxes are associated with tracked objects in two stages, improving localization and tracking accuracy.

Link: https://arxiv.org/abs/2506.09469
Authors: Maria Damanaki, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos
Institutions: Industrial Systems Institute, Athena Research Center, Patras Science Park, Greece; AviSense.AI, Patras Science Park, Greece; Dpt. of Informatics & Telecom., University of Ioannina, Arta, Greece
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE International Conference on Multimedia and Expo Workshops, 3DMM - 3D Multimedia Analytics, Search and Generation

Abstract:Multi-Object Tracking (MOT) plays a crucial role in autonomous driving systems, as it lays the foundations for advanced perception and precise path planning modules. Nonetheless, single agent based MOT lacks in sensing surroundings due to occlusions, sensors failures, etc. Hence, the integration of multiagent information is essential for comprehensive understanding of the environment. This paper proposes a novel Cooperative MOT framework for tracking objects in 3D LiDAR scene by formulating and solving a graph topology-aware optimization problem so as to fuse information coming from multiple vehicles. By exploiting a fully connected graph topology defined by the detected bounding boxes, we employ the Graph Laplacian processing optimization technique to smooth the position error of bounding boxes and effectively combine them. In that manner, we reveal and leverage inherent coherences of diverse multi-agent detections, and associate the refined bounding boxes to tracked objects at two stages, optimizing localization and tracking accuracies. An extensive evaluation study has been conducted, using the real-world V2V4Real dataset, where the proposed method significantly outperforms the baseline frameworks, including the state-of-the-art deep-learning DMSTrack and V2V4Real, in various testing sequences.
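The graph-signal smoothing step has a compact closed form: with graph Laplacian L, the denoised positions solve (I + λL)x = x_noisy. A minimal sketch follows; the dimensions, λ, and the example detections are illustrative, not the paper's configuration.

```python
import numpy as np

def laplacian_smooth(positions: np.ndarray, lam: float = 0.1) -> np.ndarray:
    """Smooth noisy box centers reported by different vehicles for the same
    object over a fully connected graph by solving (I + lam * L) x = x_noisy."""
    n = len(positions)
    A = np.ones((n, n)) - np.eye(n)   # fully connected graph of detections
    L = np.diag(A.sum(axis=1)) - A    # combinatorial graph Laplacian
    return np.linalg.solve(np.eye(n) + lam * L, positions)

detections = np.array([[10.2, 5.1], [9.8, 4.9], [10.1, 5.3]])  # one object, 3 cars
print(laplacian_smooth(detections))
```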

[CV-76] Evidential Deep Learning with Spectral-Spatial Uncertainty Disentanglement for Open-Set Hyperspectral Domain Generalization

【Quick Read】: This paper addresses open-set domain generalization (OSDG) for hyperspectral image classification, where unknown classes appear in target domains and models must generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods require target-domain data during training and fail to handle domain shift when unknown classes are present, causing negative transfer and degraded classification performance. The key to the solution is a framework combining four components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, a Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification.

Link: https://arxiv.org/abs/2506.09460
Authors: Amirreza Khoshbakht, Erchan Aptoula
Institutions: Sabanci University; VPALab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Open-set domain generalization(OSDG) for hyperspectral image classification presents significant challenges due to the presence of unknown classes in target domains and the need for models to generalize across multiple unseen domains without target-specific adaptation. Existing domain adaptation methods assume access to target domain data during training and fail to address the fundamental issue of domain shift when unknown classes are present, leading to negative transfer and reduced classification performance. To address these limitations, we propose a novel open-set domain generalization framework that combines four key components: Spectrum-Invariant Frequency Disentanglement (SIFD) for domain-agnostic feature extraction, Dual-Channel Residual Network (DCRN) for robust spectral-spatial feature learning, Evidential Deep Learning (EDL) for uncertainty quantification, and Spectral-Spatial Uncertainty Disentanglement (SSUD) for reliable open-set classification. The SIFD module extracts domain-invariant spectral features in the frequency domain through attention-weighted frequency analysis and domain-agnostic regularization, while DCRN captures complementary spectral and spatial information via parallel pathways with adaptive fusion. EDL provides principled uncertainty estimation using Dirichlet distributions, enabling the SSUD module to make reliable open-set decisions through uncertainty-aware pathway weighting and adaptive rejection thresholding. Experimental results on three cross-scene hyperspectral classification tasks show that our approach achieves performance comparable to state-of-the-art domain adaptation methods while requiring no access to the target domain during training. The implementation will be made available at this https URL upon acceptance.
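The EDL component follows a standard recipe that is easy to state in code; the sketch below shows the usual Dirichlet parameterization only, not the paper's full SSUD pipeline, and the class count is illustrative.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits: torch.Tensor):
    """Standard evidential deep learning head: evidence e = softplus(logits),
    Dirichlet parameters alpha = e + 1, total uncertainty u = K / sum(alpha)."""
    evidence = F.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                     # expected class probabilities
    uncertainty = logits.shape[-1] / strength   # in (0, 1]; high -> reject as unknown
    return prob, uncertainty

prob, u = evidential_uncertainty(torch.randn(8, 9))  # e.g. 9 land-cover classes
```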

[CV-77] Harmonizing and Merging Source Models for CLIP-based Domain Generalization

【Quick Read】: This paper addresses the sample conflicts and optimization conflicts that arise in CLIP-based domain generalization and limit generalization to unseen domains. The key to the solution is Harmonizing and Merging (HAM), a source-model merging framework that enriches conflict-free samples during training, harmonizes the update directions of all models, and integrates knowledge across source models through redundancy-aware historical model merging, yielding a final model with stronger generalization.

Link: https://arxiv.org/abs/2506.09446
Authors: Yuhe Ding, Jian Liang, Bo Jiang, Zi Wang, Aihua Zheng, Bin Luo
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:CLIP-based domain generalization aims to improve model generalization to unseen domains by leveraging the powerful zero-shot classification capabilities of CLIP and multiple source datasets. Existing methods typically train a single model across multiple source domains to capture domain-shared information. However, this paradigm inherently suffers from two types of conflicts: 1) sample conflicts, arising from noisy samples and extreme domain shifts among sources; and 2) optimization conflicts, stemming from competition and trade-offs during multi-source training. Both hinder the generalization and lead to suboptimal solutions. Recent studies have shown that model merging can effectively mitigate the competition of multi-objective optimization and improve generalization performance. Inspired by these findings, we propose Harmonizing and Merging (HAM), a novel source model merging framework for CLIP-based domain generalization. During the training process of the source models, HAM enriches the source samples without conflicting samples, and harmonizes the update directions of all models. Then, a redundancy-aware historical model merging method is introduced to effectively integrate knowledge across all source models. HAM comprehensively consolidates source domain information while enabling mutual enhancement among source models, ultimately yielding a final model with optimal generalization capabilities. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance.
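Model merging in its simplest form is weighted parameter averaging; the sketch below shows that baseline only. HAM's redundancy-aware merge would derive the weights from inter-model redundancy rather than uniformly, which is not reproduced here.

```python
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Weighted average of source-model parameters, key by key."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Usage with two toy source models:
a, b = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
merged = merge_state_dicts([a.state_dict(), b.state_dict()])
```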

[CV-78] OGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

【Quick Read】: This paper addresses temporal grounding in video question answering (video QA) under weak supervision: given a video and a question, the goal is to generate an open-ended answer grounded with start and end times, without any temporal annotations. The key to the solution is TOGA, a vision-language model for temporally grounded open-ended video QA with weak supervision. TOGA is instruction-tuned to jointly generate the answer and its temporal grounding, and the validity of generated pseudo labels for grounding is ensured by a consistency constraint, improving both question answering and grounding without temporal annotations.

Link: https://arxiv.org/abs/2506.09445
Authors: Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha
Institutions: SRI; Johns Hopkins University; United States Military Academy; University of Colorado Boulder
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.

[CV-79] A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning

【Quick Read】: This paper addresses the high computational cost and insufficient attention to fine-grained structural features in remote sensing image captioning. The key to the solution is a lightweight Transformer architecture that reduces the dimensionality of the encoder layers and adopts a distilled version of GPT-2 as the decoder, uses a knowledge distillation strategy to transfer knowledge from a more complex teacher and boost the lightweight network's performance, and incorporates an edge-aware enhancement strategy to strengthen image representation and object-boundary understanding.

Link: https://arxiv.org/abs/2506.09429
Authors: Swadhin Das, Divyansh Mundra, Priyanshu Dayal, Raksha Sharma
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to enhance image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
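The distillation objective is the standard softened-KL formulation; the sketch below assumes illustrative temperature, mixing weight, and a GPT-2-sized vocabulary, and is not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft KL to the teacher plus hard cross-entropy to the caption tokens."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients for the softened targets
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = distillation_loss(torch.randn(32, 50257), torch.randn(32, 50257),
                         torch.randint(0, 50257, (32,)))
```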

[CV-80] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

【Quick Read】: This paper addresses the difficulty large multimodal models (LMMs) have in generating tightly interleaved image-text outputs, mainly due to the limited scale, quality, and instructional richness of existing training datasets. The key to the solution is the InterSyn dataset, constructed with the Self-Evaluation with Iterative Refinement (SEIR) method and featuring multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, rich object diversity, and rigorous automated quality refinement. To evaluate interleaved multimodal outputs, the paper also introduces SynJudge, an automatic evaluation model that quantitatively assesses text content, image content, image quality, and image-text synergy.

Link: https://arxiv.org/abs/2506.09427
Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
Institutions: Nankai University; Shanghai Innovation Institute; Wuhan University; University of Science and Technology of China; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn’s utility for advancing multimodal systems.

[CV-81] ODG: Occupancy Prediction Using Dual Gaussians

【Quick Read】: This paper addresses the high computational cost and imperfect scene representations of existing 3D occupancy prediction methods: dense 3D feature volumes with cross-attention are expensive, while BEV- and sparse-point-based methods are cheaper but each has drawbacks, since BEV struggles with small objects and sparse points are inefficient for flat surfaces or large objects. The key to the solution is ODG, which combines BEV and sparse-point representations in a dual-branch design: a query-based sparse point branch and a BEV branch share 3D information via cross-attention, enriching the weakened signals of difficult objects on the BEV plane, and the outputs of both branches are fused to produce the final 3D occupancy prediction.

Link: https://arxiv.org/abs/2506.09417
Authors: Yunxiao Shi, Yinhao Zhu, Shizhong Han, Jisoo Jeong, Amin Ansari, Hong Cai, Fatih Porikli
Institutions: Qualcomm AI Research; Qualcomm Technologies, Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D occupancy provides fine-grained 3D geometry and semantics for scene understanding, which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More recent works have adopted Bird's Eye View (BEV) or sparse points as the scene representation with much reduced cost, but both still suffer from their respective shortcomings. More concretely, BEV struggles with small objects that often experience significant information loss after being projected to the ground plane. On the other hand, points can flexibly model small objects in 3D but are inefficient at capturing flat surfaces or large objects. To address these challenges, in this paper, we present a novel 3D occupancy prediction approach, ODG, which combines BEV- and sparse-points-based representations. We propose a dual-branch design: a query-based sparse points branch and a BEV branch. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which enriches the weakened signals of difficult objects on the BEV plane. The outputs of both branches are finally fused to generate the predicted 3D occupancy. We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks that demonstrate the superiority of our proposed ODG. Moreover, ODG also delivers competitive inference speed when compared to the latest efficient approaches.
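The dual-branch design lends itself to a short sketch: sparse point queries cross-attend to flattened BEV tokens, and the two streams are fused before the occupancy head. Below is a minimal PyTorch sketch; the shapes, module names (`DualBranchFusion`, `point_queries`), and the pooling-based fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ODG-style dual-branch fusion: sparse point queries
# attend to BEV features via cross-attention, then both branches are fused.
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_queries=900):
        super().__init__()
        self.point_queries = nn.Parameter(torch.randn(num_queries, dim))  # sparse branch
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # fuse BEV tokens with point context

    def forward(self, bev_feat):
        # bev_feat: (B, C, H, W) -> flatten to a token sequence (B, H*W, C)
        B, C, H, W = bev_feat.shape
        bev_tokens = bev_feat.flatten(2).transpose(1, 2)
        q = self.point_queries.unsqueeze(0).expand(B, -1, -1)
        # Sparse point branch gathers 3D cues from the BEV stream via cross-attention.
        point_feat, _ = self.cross_attn(q, bev_tokens, bev_tokens)
        # Broadcast pooled point context back onto BEV tokens and fuse both branches.
        point_ctx = point_feat.mean(dim=1, keepdim=True).expand(-1, H * W, -1)
        fused = self.fuse(torch.cat([bev_tokens, point_ctx], dim=-1))
        return fused  # (B, H*W, C), fed to an occupancy head

x = torch.randn(2, 256, 50, 50)
print(DualBranchFusion()(x).shape)  # torch.Size([2, 2500, 256])
```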

[CV-82] Noise Conditional Variational Score Distillation

【速读】:该论文试图解决如何将预训练的扩散模型蒸馏为生成去噪器的问题,以实现高效且高质量的生成过程。解决方案的关键在于提出噪声条件变分得分蒸馏(NCVSD),通过揭示无条件得分函数隐式表征了去噪后验分布的得分函数,并将其整合到变分得分蒸馏(VSD)框架中,从而实现了在广泛噪声水平下对去噪后验分布样本的可扩展学习。

链接: https://arxiv.org/abs/2506.09416
作者: Xinyu Peng,Ziyang Zheng,Yaoming Wang,Han Li,Nuowen Kan,Wenrui Dai,Chenglin Li,Junni Zou,Hongkai Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
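The claim that the unconditional score characterizes the denoising posterior can be made concrete with two standard identities (Tweedie's formula and Bayes' rule). The block below states them for Gaussian corruption; they match the abstract's claim in spirit but are not necessarily the paper's exact derivation.

```latex
% Standard identities for Gaussian corruption (not the paper's exact derivation).
\[
  x_t = x_0 + \sigma_t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).
\]
Tweedie's formula ties the unconditional score to the posterior mean,
\[
  \nabla_{x_t} \log p_t(x_t) \;=\; \frac{\mathbb{E}[x_0 \mid x_t] - x_t}{\sigma_t^2},
\]
and by Bayes' rule the denoising posterior
\[
  p(x_0 \mid x_t) \;\propto\; p(x_0)\,\mathcal{N}(x_t;\, x_0,\, \sigma_t^2 I)
\]
has a score in $x_0$ determined entirely by the data prior's score:
\[
  \nabla_{x_0} \log p(x_0 \mid x_t) \;=\; \nabla_{x_0} \log p(x_0) + \frac{x_t - x_0}{\sigma_t^2}.
\]
```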

[CV-83] Synthetic Human Action Video Data Generation with Pose Transfer

【速读】:该论文试图解决在视频理解任务中,特别是涉及人类运动的任务,合成数据生成常出现诡异特征(uncanny features),从而降低了其在训练中的有效性问题。为了解决这一问题,论文提出了一种基于姿态迁移(pose transfer)的合成人类动作视频数据生成方法,其关键在于使用可控的3D高斯虚拟人模型(controllable 3D Gaussian avatar models)。该方法在Toyota Smarthome和NTU RGB+D数据集上进行了评估,并展示了其在动作识别任务中的性能提升,同时能够有效扩展少量样本数据集,弥补真实训练数据中代表性不足的群体,并增加多样化的背景。

链接: https://arxiv.org/abs/2506.09411
作者: Vaclav Knapp,Matyas Bohacek
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features, diminishing its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have thus been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer (specifically, controllable 3D Gaussian avatar models). We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, making up for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset with videos and avatars of novel human identities for pose transfer crowd-sourced from the internet.

[CV-84] SRPL-SFDA: SAM-Guided Reliable Pseudo-Labels for Source-Free Domain Adaptation in Medical Image Segmentation

【速读】:该论文旨在解决在目标域数据无标签的情况下,如何提升医学图像分割模型在新临床中心部署时的鲁棒性问题,具体针对源域无关的域适应(Source-Free Domain Adaptation, SFDA)场景。其解决方案的关键在于提出一种基于Segment Anything Model (SAM) 的可靠伪标签方法(SRPL-SFDA),该方法通过三个核心组件实现:1)测试时三通道强度增强(T3IE)以提高伪标签质量并生成SAM兼容的输入;2)基于多SAM输出一致性的可靠伪标签选择模块(CMSO)以剔除低质量伪标签;3)可靠性感知的训练过程,利用可靠伪标签进行监督并通过对熵最小化对不可靠部分进行正则化。

链接: https://arxiv.org/abs/2506.09403
作者: Xinya Liu,Jianghao Wu,Tao Lu,Shaoting Zhang,Guotai Wang
机构: School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China (电子科技大学机械与电气工程学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Department of Radiology, Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China (四川省人民医院放射科,电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 4 figures. Accepted for publication in Neurocomputing

点击查看摘要

Abstract:Domain Adaptation (DA) is crucial for robust deployment of medical image segmentation models when applied to new clinical centers with significant domain shifts. Source-Free Domain Adaptation (SFDA) is appealing as it can deal with privacy concerns and access constraints on source-domain data during adaptation to target-domain data. However, SFDA faces challenges such as insufficient supervision in the target domain with unlabeled images. In this work, we propose a Segment Anything Model (SAM)-guided Reliable Pseudo-Labels method for SFDA (SRPL-SFDA) with three key components: 1) Test-Time Tri-branch Intensity Enhancement (T3IE) that not only improves quality of raw pseudo-labels in the target domain, but also leads to SAM-compatible inputs with three channels to better leverage SAM’s zero-shot inference ability for refining the pseudo-labels; 2) A reliable pseudo-label selection module that rejects low-quality pseudo-labels based on Consistency of Multiple SAM Outputs (CMSO) under input perturbations with T3IE; and 3) A reliability-aware training procedure in the unlabeled target domain where reliable pseudo-labels are used for supervision and unreliable parts are regularized by entropy minimization. Experiments conducted on two multi-domain medical image segmentation datasets for fetal brain and the prostate respectively demonstrate that: 1) SRPL-SFDA effectively enhances pseudo-label quality in the unlabeled target domain, and improves SFDA performance by leveraging the reliability-aware training; 2) SRPL-SFDA outperformed state-of-the-art SFDA methods, and its performance is close to that of supervised training in the target domain. The code of this work is available online: this https URL.
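The reliability-aware training can be sketched as follows: pixels on which K perturbed SAM outputs agree (a CMSO-style consistency check) receive pseudo-label supervision, while the remaining pixels are regularized by entropy minimization. This is a hedged PyTorch sketch; the agreement threshold `tau`, the 0.1 loss weight, and all shapes are assumptions.

```python
# Minimal sketch of SRPL-SFDA's reliability-aware loss over SAM pseudo-labels.
import torch
import torch.nn.functional as F

def reliability_aware_loss(logits, sam_masks, tau=0.9):
    """logits: (B, C, H, W) student predictions.
    sam_masks: (K, B, H, W) hard pseudo-labels from K perturbed SAM passes."""
    # CMSO-style agreement: fraction of SAM outputs voting for the majority label.
    votes = F.one_hot(sam_masks, num_classes=logits.shape[1]).float().mean(0)  # (B,H,W,C)
    agreement, pseudo = votes.max(dim=-1)        # per-pixel consistency and label
    reliable = agreement >= tau                  # (B, H, W) boolean mask

    ce = F.cross_entropy(logits, pseudo, reduction="none")       # (B, H, W)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)

    sup = ce[reliable].mean() if reliable.any() else logits.new_zeros(())
    reg = entropy[~reliable].mean() if (~reliable).any() else logits.new_zeros(())
    return sup + 0.1 * reg  # weighting is an assumption

logits = torch.randn(2, 4, 64, 64)
sam_masks = torch.randint(0, 4, (5, 2, 64, 64))
print(reliability_aware_loss(logits, sam_masks))
```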

[CV-85] Improving Out-of-Distribution Detection via Dynamic Covariance Calibration

【速读】:该论文旨在解决生成式 AI (Generative AI) 系统中分布外(Out-of-Distribution, OOD)检测的可靠性问题。现有基于先验信息的方法(即子空间方法)通过提取信息几何来检测OOD数据,但其静态提取训练分布信息几何的方式无法应对由分布不良样本引起的几何畸变。该论文的关键解决方案是通过动态调整先验几何来修正分布不良样本的影响,具体而言,采用实时输入特征动态更新先验协方差矩阵,沿实时输入特征方向减少协方差,并将调整限制在残差空间内,从而保留关键数据特征并避免对主成分空间中非预期方向产生影响。

链接: https://arxiv.org/abs/2506.09399
作者: Kaiyu Guo,Zijian Wang,Brian C. Lovell,Mahsa Baktashmotlagh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. The code is released at this https URL.
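A hedged NumPy sketch of the stated idea: the live feature's deviation from the training mean is projected into the residual (non-principal) subspace, the prior covariance is shrunk along that direction, and a Mahalanobis-style score is computed. The top-k split, shrinkage factor `alpha`, and the exact update rule are assumptions, not the paper's formulation.

```python
# Illustrative dynamic covariance calibration for OOD scoring.
import numpy as np

def ood_score(feat, mean, cov, k=64, alpha=0.5):
    # Principal subspace from the top-k eigenvectors of the prior covariance.
    w, V = np.linalg.eigh(cov)
    P = V[:, -k:]                          # (d, k) principal directions
    resid = feat - mean
    r = resid - P @ (P.T @ resid)          # keep the update in the residual space
    n = np.linalg.norm(r)
    if n > 1e-8:
        u = r / n
        # Shrink variance along the residual direction of the live input.
        cov = cov - alpha * (u[:, None] @ u[None, :]) * (u @ cov @ u)
    inv = np.linalg.pinv(cov)
    return float(resid @ inv @ resid)      # higher = more likely OOD

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))
mean, cov = feats.mean(0), np.cov(feats, rowvar=False)
print(ood_score(rng.normal(size=128), mean, cov))
```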

[CV-86] ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model

【速读】:该论文试图解决现实场景中跨模态行人重识别(Person Re-identification, ReID)问题,即在查询为单一模态或多种模态组合的情况下,准确识别目标人员。现有方法和数据集受限于有限的模态,无法满足这一需求。为此,研究提出了一个新问题——全模态行人重识别(Omni Multi-modal Person Re-identification, OM-ReID),并构建了ORBench数据集,这是首个包含五种模态(RGB、红外、彩色素描、草图和文本描述)的高质量多模态数据集。其关键解决方案是提出ReID5o框架,该框架通过统一编码和多专家路由机制,实现任意模态组合的协同融合与跨模态对齐。

链接: https://arxiv.org/abs/2506.09385
作者: Jialong Zuo,Yongtai Deng,Mengdan Tan,Rui Jin,Dongyue Wu,Nong Sang,Liang Pan,Changxin Gao
机构: Huazhong University of Science and Technology (华中科技大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In real-world scenarios, person re-identification (ReID) expects to identify a person-of-interest via the descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. This dataset also has significant superiority in terms of diversity, such as the painting perspectives and textual information. It could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, with a unified encoding and multi-expert routing mechanism proposed. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at this https URL.

[CV-87] UniForward: Unified 3D Scene and Semantic Field Reconstruction via Feed-Forward Gaussian Splatting from Only Sparse-View Images

【速读】:该论文旨在解决将3D场景与语义场统一重建的问题,以提升对周围环境的感知与理解能力。其关键解决方案是提出一种前馈式高斯点云模型(UniForward),该模型能够仅通过未标定且无姿态的稀疏视角图像,预测具有各向异性语义特征的3D高斯分布。为实现3D场景与语义场的统一表示,论文采用双分支解耦解码器嵌入并预测语义特征,并通过损失引导的视角采样器优化训练过程,从而无需依赖真实深度或掩码,实现了端到端的训练与实时重建。

链接: https://arxiv.org/abs/2506.09378
作者: Qijian Tian,Xin Tan,Jingyu Gong,Yuan Xie,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a feed-forward Gaussian Splatting model that unifies 3D scene and semantic field reconstruction. Combining 3D scenes with semantic fields facilitates the perception and understanding of the surrounding environment. However, key challenges include embedding semantics into 3D representations, achieving generalizable real-time reconstruction, and ensuring practical applicability by using only images as input without camera parameters or ground truth depth. To this end, we propose UniForward, a feed-forward model to predict 3D Gaussians with anisotropic semantic features from only uncalibrated and unposed sparse-view images. To enable the unified representation of the 3D scene and semantic field, we embed semantic features into 3D Gaussians and predict them through a dual-branch decoupled decoder. During training, we propose a loss-guided view sampler to sample views from easy to hard, eliminating the need for ground truth depth or masks required by previous methods and stabilizing the training process. The whole model can be trained end-to-end using a photometric loss and a distillation loss that leverages semantic features from a pre-trained 2D semantic model. At the inference stage, our UniForward can reconstruct 3D scenes and the corresponding semantic fields in real time from only sparse-view images. The reconstructed 3D scenes achieve high-quality rendering, and the reconstructed 3D semantic field enables the rendering of view-consistent semantic features from arbitrary views, which can be further decoded into dense segmentation masks in an open-vocabulary manner. Experiments on novel view synthesis and novel view segmentation demonstrate that our method achieves state-of-the-art performances for unifying 3D scene and semantic field reconstruction.

[CV-88] LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

【速读】:该论文旨在解决当前基于自然语言的GUI代理在空间定位任务中面临的位置感知准确性不足的问题。现有方法如监督微调(Supervised Fine-Tuning, SFT)和强化学习在评估位置精度方面存在局限性,难以有效提升交互的精确性。论文提出的解决方案是Location Preference Optimization (LPO),其关键在于利用位置数据优化交互偏好,通过信息熵识别高信息量区域以预测交互位置,并引入基于物理距离的动态位置奖励函数,从而反映不同交互位置的重要性。LPO结合Group Relative Preference Optimization (GRPO) 实现对GUI环境的广泛探索,显著提升了交互精度。

链接: https://arxiv.org/abs/2506.09373
作者: Jiaqi Tang,Yu Xia,Yi-Feng Wu,Yuwei Hu,Yuhui Chen,Qing-Guo Chen,Xiaogang Xu,Xiangyu Wu,Hao Lu,Yanqing Ma,Shiyin Lu,Qifeng Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); Alibaba International Digital Commerce (阿里巴巴国际数字商业); The Chinese University of Hong Kong (香港中文大学); Nanjing University of Science and Technology (南京理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. In addition, it introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO’s superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at this https URL.
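The dynamic location reward can be illustrated with a toy function in which reward decays smoothly with the physical distance between predicted and target interaction points; the Gaussian form and the `sigma` bandwidth are assumptions rather than the paper's exact design.

```python
# Toy distance-based location reward in the spirit of LPO.
import math

def location_reward(pred_xy, target_xy, sigma=40.0):
    """pred_xy, target_xy: (x, y) in screen pixels; returns a reward in (0, 1]."""
    dist = math.dist(pred_xy, target_xy)
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))

print(location_reward((512, 300), (520, 310)))  # near-miss, high reward
print(location_reward((100, 100), (520, 310)))  # far click, reward ~ 0
```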

[CV-89] ScaleLSD: Scalable Deep Line Segment Detection Streamlined CVPR2025

【速读】:该论文试图解决图像中线段检测(Line Segment Detection, LSD)的问题,旨在学习一个领域无关的鲁棒LSD模型,使其在任何自然图像上都能表现良好。解决方案的关键在于可扩展的自监督学习方法,通过重新审视和优化深度与非深度LSD方法的基础设计,提出了一种名为ScaleLSD的高效且高性能的LSD学习器,能够从超过10M未标记的真实世界图像中大规模地提取线几何信息。

链接: https://arxiv.org/abs/2506.09369
作者: Zeran Ke,Bin Tan,Xianwei Zheng,Yujun Shen,Tianfu Wu,Nan Xue
机构: Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR 2025; 17 pages, appendices included

点击查看摘要

Abstract:This paper studies the problem of Line Segment Detection (LSD) for the characterization of line geometry in images, with the aim of learning a domain-agnostic robust LSD model that works well for any natural images. With the focus of scalable self-supervised learning of LSD, we revisit and streamline the fundamental designs of (deep and non-deep) LSD approaches to obtain a high-performing and efficient LSD learner, dubbed ScaleLSD, for the curation of line geometry at scale from over 10M unlabeled real-world images. Our ScaleLSD detects considerably more line segments from natural images than even the pioneering non-deep LSD approach, yielding a more complete and accurate geometric characterization of images by line segments. Experimentally, our proposed ScaleLSD is comprehensively validated under zero-shot protocols on detection performance, single-view 3D geometry estimation, two-view line segment matching, and multiview 3D line mapping, with excellent performance obtained throughout. Based on this thorough evaluation, ScaleLSD is the first deep approach that outperforms the pioneering non-deep LSD in all aspects we tested, significantly expanding and reinforcing the versatility of the line geometry of images. Code and Models are available at this https URL

[CV-90] SAGE: Exploring the Boundaries of Unsafe Concept Domain with Semantic-Augment Erasing

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在预训练过程中不可避免地包含敏感信息所带来的安全风险,如生成不安全内容和版权侵权问题。现有方法将不安全概念视为固定词汇并反复擦除,导致模型陷入“词概念深渊”,无法实现泛化的概念相关擦除。该论文提出的解决方案关键在于引入语义增强擦除(Semantic-Augment Erasing),通过循环自检和自擦除机制,将概念词擦除转化为概念域擦除,从而通过原始模型与训练模型之间的语义空间关系高效探索并遗忘概念域的边界表示,无需额外预处理数据。此外,为缓解擦除不安全概念时对无关概念的保留退化,还提出了全局-局部协同保留机制,结合全局语义关系对齐与局部预测噪声保留,有效扩展了无关概念的保留感知范围。

链接: https://arxiv.org/abs/2506.09363
作者: Hongguang Zhu,Yunchao Wei,Mengyu Wang,Siyu Jiao,Yan Fang,Jiannan Huang,Yao Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under review

点击查看摘要

Abstract:Diffusion models (DMs) have achieved significant progress in text-to-image generation. However, the inevitable inclusion of sensitive information during pre-training poses safety risks, such as unsafe content generation and copyright infringement. Concept erasing finetunes weights to unlearn undesirable concepts, and has emerged as a promising solution. However, existing methods treat an unsafe concept as a fixed word and repeatedly erase it, trapping DMs in a "word concept abyss", which prevents generalized concept-related erasing. To escape this abyss, we introduce semantic-augment erasing, which transforms concept word erasure into concept domain erasure via cyclic self-check and self-erasure. It efficiently explores and unlearns the boundary representation of the concept domain through semantic spatial relationships between the original and training DMs, without requiring additional preprocessed data. Meanwhile, to mitigate the retention degradation of irrelevant concepts while erasing unsafe concepts, we further propose a global-local collaborative retention mechanism that combines global semantic relationship alignment with local predicted noise preservation, effectively expanding the retentive receptive field for irrelevant concepts. We name our method SAGE, and extensive experiments demonstrate the comprehensive superiority of SAGE compared with other methods in the safe generation of DMs. The code and weights will be open-sourced at this https URL.

[CV-91] A new approach for image segmentation based on diffeomorphic registration and gradient fields

【速读】:该论文旨在解决2D图像分割问题,特别是在传统方法依赖于边缘检测和变分方法,而深度学习方法又需要大量训练数据的情况下,提出一种无需依赖大规模数据集的分割方案。其解决方案的关键在于引入了一种结合形状分析与微分同胚变换的变分框架,通过大形变微分同胚度量映射(LDDMM)模型将分割建模为模板曲线在图像域上的微分同胚变形,并利用变流形(varifold)表示的几何形状对变形曲线与图像梯度场进行比较,从而实现准确且理论基础坚实的分割。

链接: https://arxiv.org/abs/2506.09357
作者: Junchao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image segmentation is a fundamental task in computer vision aimed at delineating object boundaries within images. Traditional approaches, such as edge detection and variational methods, have been widely explored, while recent advances in deep learning have shown promising results but often require extensive training data. In this work, we propose a novel variational framework for 2D image segmentation that integrates concepts from shape analysis and diffeomorphic transformations. Our method models segmentation as the deformation of a template curve via a diffeomorphic transformation of the image domain, using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. The curve evolution is guided by a loss function that compares the deformed curve to the image gradient field, formulated through the varifold representation of geometric shapes. The approach is implemented in Python with GPU acceleration using the PyKeops library. This framework allows for accurate segmentation with a flexible and theoretically grounded methodology that does not rely on large datasets.

[CV-92] DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在面对恶意查询时的脆弱性问题,尤其是在视觉模态被利用的情况下,现有对齐方法难以在保持良性输入实用性的同时有效抵抗恶意查询。其解决方案的关键在于提出Deep Aligned Visual Safety Prompt (DAVSP),该方法包含两个核心创新:一是引入了视觉安全提示(Visual Safety Prompt),通过在输入图像周围添加可训练的填充区域来保留视觉特征并扩展优化空间;二是提出了深度对齐(Deep Alignment),通过在模型激活空间中的监督训练视觉安全提示,增强LVLMs感知恶意查询的能力,实现比以往方法更深层次的对齐。

链接: https://arxiv.org/abs/2506.09353
作者: Yitong Zhang,Jia Li,Liyi Cai,Ge Li
机构: Tsinghua University (清华大学); Beihang University (北京航空航天大学); Peking University (北京大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries while preserving utility on benign ones effectively. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model’s activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments across five benchmarks on two representative LVLMs demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits great cross-model generalization ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential components, jointly contributing to its overall effectiveness. The code is publicly available at this https URL.
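The Visual Safety Prompt itself is easy to sketch: the input image is zero-padded and only the border pixels carry learnable parameters, so the original content is untouched while the optimization space grows. The pad width and masking details below are assumptions; the Deep Alignment supervision in activation space is not shown.

```python
# Minimal sketch of a trainable padding border around the input image.
import torch
import torch.nn.functional as F

class VisualSafetyPrompt(torch.nn.Module):
    def __init__(self, pad=30, size=224):
        super().__init__()
        self.pad = pad
        self.border = torch.nn.Parameter(torch.zeros(3, size + 2 * pad, size + 2 * pad))

    def forward(self, img):  # img: (B, 3, size, size), values in [0, 1]
        padded = F.pad(img, [self.pad] * 4)            # zero margin around image
        mask = torch.ones_like(padded)
        mask[:, :, self.pad:-self.pad, self.pad:-self.pad] = 0  # 1 only on the border
        return padded + mask * self.border             # image untouched, border trainable

prompt = VisualSafetyPrompt()
out = prompt(torch.rand(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 3, 284, 284]); optimize prompt.border as usual
```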

[CV-93] Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

【速读】:该论文试图解决现有大规模视频生成模型计算密集的问题,这限制了其在实时和交互式应用中的使用。解决方案的关键是提出自回归对抗后训练(AAPT),通过将预训练的潜在视频扩散模型转换为实时、交互式的视频生成器,实现每帧的自回归生成,仅需一次神经网络函数评估(1NFE),并利用KV缓存提高效率,同时通过学生强制训练方式减少长视频生成过程中的误差累积。

链接: https://arxiv.org/abs/2506.09350
作者: Shanchuan Lin,Ceyuan Yang,Hao He,Jianwen Jiang,Yuxi Ren,Xin Xia,Yang Zhao,Xuefeng Xiao,Lu Jiang
机构: ByteDance Seed(字节跳动种子); The Chinese University of Hong Kong(香港中文大学); ByteDance Intelligent Creation Lab(字节跳动智能创作实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at this https URL

[CV-94] An Effective End-to-End Solution for Multimodal Action Recognition

【速读】:该论文旨在解决三模态动作识别任务中由于三模态数据稀缺所带来的挑战。其解决方案的关键在于通过优化数据增强技术扩展现有数据并增大训练规模,同时利用更多RGB数据预训练主干网络以提升模型适应新任务的能力;此外,结合2D CNN与时间位移模块(TSM)提取多模态时空特征,以提高计算效率,并采用Stochastic Weight Averaging(SWA)、集成学习和Test-Time augmentation(TTA)等方法融合不同训练阶段及架构的模型知识,从而从多角度预测动作并充分挖掘目标信息。

链接: https://arxiv.org/abs/2506.09345
作者: Songping Wang,Xiantao Hu,Yueming Lyu,Caifeng Shan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, multimodal tasks have strongly advanced the field of action recognition with their rich multimodal information. However, due to the scarcity of tri-modal data, research on tri-modal action recognition tasks faces many challenges. To this end, we propose a comprehensive multimodal action recognition solution that effectively utilizes multimodal information. First, the existing data are transformed and expanded by optimizing data augmentation techniques to enlarge the training scale. At the same time, more RGB datasets are used to pre-train the backbone network, which is better adapted to the new task by means of transfer learning. Secondly, multimodal spatial features are extracted with the help of 2D CNNs and combined with the Temporal Shift Module (TSM) to achieve multimodal spatial-temporal feature extraction comparable to 3D CNNs while improving computational efficiency. In addition, common prediction enhancement methods, such as Stochastic Weight Averaging (SWA), ensembling, and Test-Time Augmentation (TTA), are used to integrate the knowledge of models from different training periods of the same architecture and from different architectures, so as to predict the actions from different perspectives and fully exploit the target information. Ultimately, we achieved a Top-1 accuracy of 99% and a Top-5 accuracy of 100% on the competition leaderboard, demonstrating the superiority of our solution.
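The Temporal Shift Module used here is a standard, parameter-free operation from the TSM paper (Lin et al.): a fraction of channels is shifted one step forward or backward along the temporal axis so a 2D CNN sees neighboring frames at zero extra FLOPs. A generic sketch, with the 1/8 + 1/8 split following the original paper:

```python
# Generic Temporal Shift Module over clip-grouped frame features.
import torch

def temporal_shift(x, n_segments, fold_div=8):
    """x: (B*T, C, H, W) frame features grouped into clips of length T."""
    bt, c, h, w = x.shape
    x = x.view(bt // n_segments, n_segments, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift left (future -> present)
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift right (past -> present)
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unshifted
    return out.view(bt, c, h, w)

x = torch.randn(16, 64, 56, 56)   # 2 clips x 8 frames
print(temporal_shift(x, n_segments=8).shape)
```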

[CV-95] CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation CVPR2025

【速读】:该论文旨在解决机器人在操作电气设备时缺乏对用户手册理解能力的问题,即如何让机器人通过阅读和理解设备手册来执行复杂任务。现有研究多局限于问答任务,而未充分考虑手册在多页操作中的重要性。解决方案的关键在于提出首个基于手册的电器操作基准CheckManual,通过大模型辅助的人工修订数据生成流程创建基于CAD模型的设备手册,并构建相应的操作挑战、评估指标和仿真环境,同时引入首个基于手册的操作规划模型ManualPlan以建立基线性能。

链接: https://arxiv.org/abs/2506.09343
作者: Yuxing Long,Jiyao Zhang,Mingjie Pan,Tianshu Wu,Taewhan Kim,Hao Dong
机构: Peking University (北京大学); PKU-Agibot Lab (PKU-Agibot 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2025 Highlight

点击查看摘要

Abstract:Correct use of electrical appliances has significantly improved human life quality. Unlike simple tools that can be manipulated with common sense, different parts of electrical appliances have specific functions defined by manufacturers. If we want a robot to heat bread in a microwave, we should enable it to review the microwave manual first. From the manual, it can learn about component functions, interaction methods, and representative task steps for appliances. However, previous manual-related works remain limited to question-answering tasks, while existing manipulation research ignores the manual’s important role and fails to comprehend multi-page manuals. In this paper, we propose the first manual-based appliance manipulation benchmark CheckManual. Specifically, we design a large model-assisted human-revised data generation pipeline to create manuals based on CAD appliance models. With these manuals, we establish novel manual-based manipulation challenges, metrics, and simulator environments for model performance evaluation. Furthermore, we propose the first manual-based manipulation planning model ManualPlan to set up a group of baselines for the CheckManual benchmark.

[CV-96] MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning

【速读】:该论文旨在解决遥感图像解译中高质量标注数据获取成本高、耗时的问题。其解决方案的关键在于提出了一种多模态自监督学习框架,该框架利用高分辨率RGB图像、多光谱数据和数字表面模型(DSM)进行预训练,并通过设计信息感知的自适应掩码策略、跨模态掩码机制以及多任务自监督目标,有效捕捉不同模态间的相关性及各模态内部的独特特征结构。

链接: https://arxiv.org/abs/2506.09327
作者: Tong Wang,Guanzhou Chen,Xiaodong Zhang,Chenxi Liu,Jiaqi Wang,Xiaoliang Tan,Wenchao Guo,Qingyuan Yang,Kaiqi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high-quality labeled data is often costly and time-consuming. To address this challenge, we propose a multi-modal self-supervised learning framework that leverages high-resolution RGB images, multi-spectral data, and digital surface models (DSM) for pre-training. By designing an information-aware adaptive masking strategy, a cross-modal masking mechanism, and multi-task self-supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluated the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pretraining approaches on most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieved mIoU scores of 78.30% and 76.50% with only 50% of the training set. For the US3D depth estimation task, the RMSE is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieved an mIoU score of 47.51%, surpassing the second-best method, CS-MAE, by 3 percentage points. Our pretrain code, checkpoints, and HR-Pairs dataset can be found at this https URL.

[CV-97] Efficient Edge Deployment of Quantized YOLOv4-Tiny for Aerial Emergency Object Detection on Raspberry Pi 5

【速读】:该论文旨在解决在资源受限的边缘设备(如Raspberry Pi 5)上实现实时航空应急图像中目标检测的问题。解决方案的关键在于将YOLOv4-Tiny模型量化为INT8精度,利用TensorFlow Lite的训练后量化技术,在保持检测准确性的前提下显著降低功耗和计算需求,从而实现高效的嵌入式部署。

链接: https://arxiv.org/abs/2506.09300
作者: Sindhu Boddu,Arindam Mukherjee
机构: UNC Charlotte(北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents the deployment and performance evaluation of a quantized YOLOv4-Tiny model for real-time object detection in aerial emergency imagery on a resource-constrained edge device the Raspberry Pi 5. The YOLOv4-Tiny model was quantized to INT8 precision using TensorFlow Lite post-training quantization techniques and evaluated for detection speed, power consumption, and thermal feasibility under embedded deployment conditions. The quantized model achieved an inference time of 28.2 ms per image with an average power consumption of 13.85 W, demonstrating a significant reduction in power usage compared to its FP32 counterpart. Detection accuracy remained robust across key emergency classes such as Ambulance, Police, Fire Engine, and Car Crash. These results highlight the potential of low-power embedded AI systems for real-time deployment in safety-critical emergency response applications.
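Post-training INT8 quantization with TensorFlow Lite typically follows the converter recipe below. The saved-model path and the random calibration generator are placeholders (real calibration data should mirror the training preprocessing); the converter calls themselves are the standard TFLite API.

```python
# Standard TFLite post-training INT8 quantization recipe.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # ~100 calibration samples, preprocessed exactly like the training pipeline.
    for _ in range(100):
        yield [np.random.rand(1, 416, 416, 3).astype(np.float32)]  # replace with real data

converter = tf.lite.TFLiteConverter.from_saved_model("yolov4_tiny_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # fully integer I/O for edge runtimes
converter.inference_output_type = tf.int8

with open("yolov4_tiny_int8.tflite", "wb") as f:
    f.write(converter.convert())
```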

[CV-98] Lightweight Object Detection Using Quantized YOLOv4-Tiny for Emergency Response in Aerial Imagery

【速读】:该论文旨在解决在应急响应场景中,针对空中图像进行轻量级且低功耗的目标检测问题。解决方案的关键在于部署经过后训练量化至INT8精度的YOLOv4-Tiny模型,并在其自建的包含10,820张标注图像的航空应急数据集上进行训练。该数据集因缺乏公开的无人机视角应急图像而由作者自行构建,成为本研究的重要贡献之一。通过与YOLOv5-small的对比实验,验证了量化后的YOLOv4-Tiny在保持检测性能的同时,显著减小了模型尺寸并提升了推理速度。

链接: https://arxiv.org/abs/2506.09299
作者: Sindhu Boddu,Arindam Mukherjee
机构: UNC Charlotte(北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 Pages, 3 figures

点击查看摘要

Abstract:This paper presents a lightweight and energy-efficient object detection solution for aerial imagery captured during emergency response situations. We focus on deploying the YOLOv4-Tiny model, a compact convolutional neural network, optimized through post-training quantization to INT8 precision. The model is trained on a custom-curated aerial emergency dataset, consisting of 10,820 annotated images covering critical emergency scenarios. Unlike prior works that rely on publicly available datasets, we created this dataset ourselves due to the lack of publicly available drone-view emergency imagery, making the dataset itself a key contribution of this work. The quantized model is evaluated against YOLOv5-small across multiple metrics, including mean Average Precision (mAP), F1 score, inference time, and model size. Experimental results demonstrate that the quantized YOLOv4-Tiny achieves comparable detection performance while reducing the model size from 22.5 MB to 6.4 MB and improving inference speed by 44%. With a 71% reduction in model size and a 44% increase in inference speed, the quantized YOLOv4-Tiny model proves highly suitable for real-time emergency detection on low-power edge devices.

[CV-99] UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

【速读】:该论文试图解决机器人在非结构化环境中根据开放式的任务指令操作物体时,如何理解细粒度物体功能(fine-grained object affordances)的问题。现有视觉功能预测方法通常依赖于手动标注的数据或仅限于预定义任务集的条件。其解决方案的关键在于提出一种无监督功能蒸馏(Unsupervised Affordance Distillation, UAD)方法,通过将基础模型中的功能知识蒸馏到任务条件化的功能模型中,无需任何人工标注即可自动标注大规模数据集。UAD利用大视觉模型和视觉-语言模型的互补优势,仅在冻结特征上训练轻量级任务条件解码器,从而在真实场景和多种人类活动中展现出显著的泛化能力。

链接: https://arxiv.org/abs/2506.09284
作者: Yihe Tang,Wenlong Huang,Yingke Wang,Chengshu Li,Roy Yuan,Ruohan Zhang,Jiajun Wu,Li Fei-Fei
机构: Stanford University (斯坦福大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed instruction-visual affordance pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: this https URL

[CV-100] UFM: A Simple Path towards Unified Dense Correspondence with Flow ATC

【速读】:该论文旨在解决密集图像对应(dense image correspondence)问题,该问题在视觉里程计、三维重建、目标关联和重识别等应用中具有核心地位。传统方法通常将宽基线场景与光流估计分开处理,尽管其目标均为在两幅图像间匹配内容。本文提出的统一流匹配模型(Unified Flow Matching, UFM)通过在源图与目标图中共可见像素上进行统一训练,实现了跨域的通用对应。UFM的关键在于采用简单且通用的Transformer架构,直接回归(u,v)流,相较于以往基于粗到细成本体积的方法,其训练更易且对大位移的准确性更高。

链接: https://arxiv.org/abs/2506.09278
作者: Yuchen Zhang,Nikhil Keetha,Chenwei Lyu,Bhuvan Jhamb,Yutian Chen,Yuheng Qiu,Jay Karhade,Shreyas Jha,Yaoyu Hu,Deva Ramanan,Sebastian Scherer,Wenshan Wang
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and running 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.

[CV-101] PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies CVPR

【速读】:该论文旨在解决异常检测(Anomaly Detection, AD)和异常定位(Anomaly Localization, AL)在面对对抗攻击时的脆弱性问题,这一问题主要源于训练数据中仅包含正常且未标记的样本。其解决方案的关键在于提出一种基于视觉Transformer(Vision Transformer, ViT)架构的对抗鲁棒方法——PatchGuard,该方法通过引入带有定位掩码的伪异常样本,并结合一种新型损失函数进行对抗训练,从而增强模型的鲁棒性。此外,PatchGuard利用了前景感知的伪异常样本,克服了以往异常感知方法的不足,显著提升了AD和AL任务在对抗环境下的性能。

链接: https://arxiv.org/abs/2506.09237
作者: Mojtaba Nafez,Amirhossein Koochakian,Arad Maleki,Jafar Habibi,Mohammad Hossein Rohban
机构: Sharif University of Technology (沙里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to the Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at this https URL .
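Adversarial training of the kind PatchGuard builds on typically wraps a PGD inner maximization around the ordinary training loss. The sketch below shows only that generic step; the paper's actual objective over foreground-aware pseudo-anomalies and attention is more involved, and the toy model, budget `eps`, and step count are assumptions.

```python
# Generic PGD adversarial training step (inner maximization, then outer loss).
import torch

def pgd_adv_step(model, criterion, x, y, eps=8 / 255, alpha=2 / 255, steps=4):
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = criterion(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Train on the adversarial example found by the inner maximization.
    return criterion(model((x + delta).detach()), y)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 2, (4,))
loss = pgd_adv_step(model, torch.nn.functional.cross_entropy, x, y)
loss.backward()  # then optimizer.step() as usual
```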

[CV-102] Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

【速读】:该论文旨在解决在用户级别微调视频扩散模型(Video Diffusion Models, VDMs)时,如何生成反映训练数据特定属性的视频所面临的挑战,尤其是保持帧间语义一致性的难题。其解决方案的关键在于提出一种新颖的正则化技术——跨帧表征对齐(Cross-frame Representation Alignment, CREPA),通过将当前帧的隐藏状态与相邻帧的外部特征对齐,从而提升微调后的模型在视觉保真度和帧间语义连贯性方面的表现。

链接: https://arxiv.org/abs/2506.09229
作者: Sungwon Hwang,Hyojin Jang,Kinam Kim,Minho Park,Jaegul choo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 25 figures

点击查看摘要

Abstract:Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, its internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: this https URL
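CREPA's regularizer can be approximated as a cosine-alignment term pulling projected hidden states of frame t toward frozen external features of frames t-1 and t+1. The projection head, neighbor set, and wrap-around handling in this PyTorch sketch are assumptions.

```python
# Hedged sketch of a CREPA-style cross-frame alignment loss.
import torch
import torch.nn.functional as F

def crepa_loss(hidden, external, proj):
    """hidden: (B, T, N, D) diffusion hidden states per frame;
    external: (B, T, N, D_ext) frozen features from a pretrained vision encoder."""
    h = proj(hidden)                      # map to the external feature dimension
    loss = hidden.new_zeros(())
    for shift in (-1, 1):                 # neighbors t-1 and t+1
        # roll wraps at clip boundaries; a real implementation would mask the ends
        tgt = torch.roll(external, shifts=shift, dims=1)
        loss = loss + (1 - F.cosine_similarity(h, tgt, dim=-1)).mean()
    return loss / 2

B, T, N, D, D_ext = 2, 8, 16, 128, 768
proj = torch.nn.Linear(D, D_ext)
print(crepa_loss(torch.randn(B, T, N, D), torch.randn(B, T, N, D_ext), proj))
```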

[CV-103] Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule

【速读】:该论文旨在解决自动驾驶系统(ADS)中感知性能评估不足的问题,传统评估指标无法捕捉因环境条件变化导致的模型输出置信度波动。其解决方案的关键是引入感知特性距离(Perception Characteristics Distance, PCD),该指标量化了物体在可靠检测下的最远距离,并考虑了模型输出的不确定性。通过统计分析和多阈值平均计算得到的均值PCD(mPCD)能够反映系统在不同天气条件下的整体感知特性,从而更准确地体现感知系统的可靠性差异。

链接: https://arxiv.org/abs/2506.09217
作者: Boyu Jiang,Liang Shi,Zhengzhi Lin,Loren Stowe,Feng Guo
机构: Virginia Tech(弗吉尼亚理工学院); Virginia Tech Transportation Institute(弗吉尼亚理工交通研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:

点击查看摘要

Abstract:The performance of perception systems in autonomous driving systems (ADS) is strongly influenced by object distance, scene dynamics, and environmental conditions such as weather. AI-based perception outputs are inherently stochastic, with variability driven by these external factors, while traditional evaluation metrics remain static and event-independent, failing to capture fluctuations in confidence over time. In this work, we introduce the Perception Characteristics Distance (PCD) – a novel evaluation metric that quantifies the farthest distance at which an object can be reliably detected, incorporating uncertainty in model outputs. To support this, we present the SensorRainFall dataset, collected on the Virginia Smart Road using a sensor-equipped vehicle (cameras, radar, LiDAR) under controlled daylight-clear and daylight-rain scenarios, with precise ground-truth distances to the target objects. Statistical analysis reveals the presence of change points in the variance of detection confidence score with distance. By averaging the PCD values across a range of detection quality thresholds and probabilistic thresholds, we compute the mean PCD (mPCD), which captures the overall perception characteristics of a system with respect to detection distance. Applying state-of-the-art perception models shows that mPCD captures meaningful reliability differences under varying weather conditions – differences that static metrics overlook. PCD provides a principled, distribution-aware measure of perception performance, supporting safer and more robust ADS operation, while the SensorRainFall dataset offers a valuable benchmark for evaluation. The SensorRainFall dataset is publicly available at this https URL, and the evaluation code is open-sourced at this https URL.
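Stripping away the change-point and uncertainty modeling, PCD and mPCD reduce to a simple computation over (distance, confidence) detections, as in this illustrative sketch with synthetic data; the threshold grid is an assumption.

```python
# Illustrative PCD / mPCD computation over (distance, confidence) pairs.
import numpy as np

def pcd(dists, confs, thresh):
    # Farthest distance at which detection confidence still clears the threshold.
    ok = confs >= thresh
    return dists[ok].max() if ok.any() else 0.0

def mpcd(dists, confs, thresholds=np.linspace(0.3, 0.9, 7)):
    # Average PCD over a grid of detection-quality thresholds.
    return float(np.mean([pcd(dists, confs, t) for t in thresholds]))

rng = np.random.default_rng(1)
dists = rng.uniform(5, 150, size=500)                      # meters to the object
confs = np.clip(1.0 - dists / 160 + rng.normal(0, 0.05, 500), 0, 1)
print(mpcd(dists, confs))
```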

[CV-104] MultiNet: An Open-Source Software Toolkit Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models ICML

【速读】:该论文旨在解决多模态行动模型在构建通用代理系统中的挑战,特别是如何有效结合视觉理解、语言理解和动作生成。其解决方案的关键在于提出MultiNet——一个完全开源的基准测试平台及其配套软件生态系统,用于严格评估和适应视觉、语言和动作领域的模型。该平台建立了标准化的评估协议,并提供了丰富的数据集和开源工具,以支持对视觉-语言模型(VLMs)和视觉-语言-行动模型(VLAs)的全面研究。

链接: https://arxiv.org/abs/2506.09172
作者: Pranav Guruprasad,Yangyue Wang,Harshvardhan Sikka
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML CodeML Workshop, 13 Pages, 6 Figures, 2 Tables

点击查看摘要

Abstract:Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems, combining visual understanding, language comprehension, and action generation. We introduce MultiNet - a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open source software to download relevant data, models, and evaluations. Additionally, we provide a composite dataset with over 1.3 trillion tokens of image captioning, visual question answering, commonsense reasoning, robotic control, digital game-play, simulated locomotion/manipulation, and many more tasks. The MultiNet benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.

[CV-105] Seedance 1.0: Exploring the Boundaries of Video Generation Models

【速读】:该论文旨在解决视频生成领域中基础模型在同时平衡提示遵循、运动合理性与视觉质量方面的关键挑战。其解决方案的关键在于引入Seedance 1.0,该模型通过多源数据增强、高效架构设计与训练范式、精细化的后训练优化方法以及模型加速策略,实现了高质量且快速的视频生成,具备优越的时空流畅性、结构稳定性、复杂多主体场景下的指令遵循能力以及原生的多镜头叙事连贯性。

链接: https://arxiv.org/abs/2506.09113
作者: Yu Gao,Haoyuan Guo,Tuyen Hoang,Weilin Huang,Lu Jiang,Fangyuan Kong,Huixia Li,Jiashi Li,Liang Li,Xiaojie Li,Xunsong Li,Yifu Li,Shanchuan Lin,Zhijie Lin,Jiawei Liu,Shu Liu,Xiaonan Nie,Zhiwu Qing,Yuxi Ren,Li Sun,Zhi Tian,Rui Wang,Sen Wang,Guoqiang Wei,Guohong Wu,Jie Wu,Ruiqi Xia,Fei Xiao,Xuefeng Xiao,Jiangqiao Yan,Ceyuan Yang,Jianchao Yang,Runkai Yang,Tao Yang,Yihang Yang,Zilyu Ye,Xuejiao Zeng,Yan Zeng,Heng Zhang,Yang Zhao,Xiaozheng Zheng,Peihao Zhu,Jiaxin Zou,Feilong Zuo
机构: ByteDance(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Seedance 1.0 Technical Report

点击查看摘要

Abstract:Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundation models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a tailored training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.

[CV-106] Bias Analysis in Unconditional Image Generative Models

【速读】:该论文试图解决生成式 AI 在无条件生成过程中可能出现的偏见机制问题,特别是关于表征伤害和潜在歧视性结果的形成机制尚不明确的问题。其解决方案的关键在于通过定义属性偏差为观测分布中属性出现的概率与理想参考分布中预期比例之间的差异,并利用无条件图像生成模型和常见的偏见评估框架,分析训练数据与生成数据之间的偏见迁移。研究发现,检测到的属性偏移较小,但属性偏移对评估框架中用于标注生成图像的属性分类器高度敏感,尤其是在分类器决策边界位于高密度区域时,这进一步表明属性在光谱上的连续性而非二元性是影响偏见评估的重要因素。

链接: https://arxiv.org/abs/2506.09106
作者: Xiaofeng Zhang,Michelle Lin,Simon Lacoste-Julien,Aaron Courville,Yash Goyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread adoption of generative AI models has raised growing concerns about representational harm and potential discriminatory outcomes. Yet, despite a growing literature on this topic, the mechanisms by which bias emerges, especially in unconditional generation, have yet to be disentangled. We define the bias of an attribute as the difference between the probability of its presence in the observed distribution and its expected proportion in an ideal reference distribution. In our analysis, we train a set of unconditional image generative models and adopt a commonly used bias evaluation framework to study bias shift between training and generated distributions. Our experiments reveal that the detected attribute shifts are small. We find that the attribute shifts are sensitive to the attribute classifier used to label generated images in the evaluation framework, particularly when its decision boundaries fall in high-density regions. Our empirical analysis indicates that this classifier sensitivity is often observed for attribute values that lie on a spectrum, as opposed to exhibiting a binary nature. This highlights the need for more representative labeling practices, understanding the shortcomings through greater scrutiny of evaluation frameworks, and recognizing the socially complex nature of attributes when evaluating bias.
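The paper's bias definition is directly computable as the gap between an attribute's observed frequency and its proportion under an ideal reference distribution; the numbers in this toy example are made up.

```python
# Attribute bias = observed frequency minus ideal reference proportion.
def attribute_bias(observed_labels, attribute, reference_prob):
    p_observed = sum(1 for a in observed_labels if a == attribute) / len(observed_labels)
    return p_observed - reference_prob

labels = ["smiling"] * 640 + ["not_smiling"] * 360   # classifier labels on generated images
print(attribute_bias(labels, "smiling", reference_prob=0.5))  # +0.14 over-representation
```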

[CV-107] WD-DETR: Wavelet Denoising-Enhanced Real-Time Object Detection Transformer for Robot Perception with Event Cameras

【速读】:该论文旨在解决事件相机(event camera)在使用密集事件表示时累积噪声导致表示质量下降和漏检率增加的问题。其解决方案的关键在于提出一种基于小波去噪增强的DEtection TRansformer网络(WD-DETR),该方法首先生成密集事件表示以实现实时事件重构,随后引入小波变换方法对事件表示中的噪声进行过滤,并将其集成到主干网络中进行特征提取,最终通过基于Transformer的网络进行目标预测。此外,为降低推理时间,还引入了动态重组卷积块(DRCB)作为混合编码器中的融合模块。

链接: https://arxiv.org/abs/2506.09098
作者: Yangjie Cui,Boyang Gao,Yiwei Zhang,Xin Dong,Jinwu Xiang,Daochun Li,Zhan Tu
机构: Beihang University (北京航空航天大学); Hangzhou Innovation Institute of Beihang University (北京航空航天大学杭州创新研究院); Institute of Unmanned System, Beihang University (北京航空航天大学无人系统研究所); Tianmushan Laboratory Xixi Octagon City (天目山实验室西溪未来科技城)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Previous studies on event camera sensing have demonstrated a certain level of detection performance using dense event representations. However, the accumulated noise in such dense representations has received insufficient attention, which degrades the representation quality and increases the likelihood of missed detections. To address this challenge, we propose the Wavelet Denoising-enhanced DEtection TRansformer, i.e., WD-DETR network, for event cameras. In particular, a dense event representation is presented first, which enables real-time reconstruction of events as tensors. Then, a wavelet transform method is designed to filter noise in the event representations. Such a method is integrated into the backbone for feature extraction. The extracted features are subsequently fed into a transformer-based network for object prediction. To further reduce inference time, we incorporate the Dynamic Reorganization Convolution Block (DRCB) as a fusion module within the hybrid encoder. The proposed method has been evaluated on three event-based object detection datasets, i.e., DSEC, Gen1, and 1Mpx. The results demonstrate that WD-DETR outperforms tested state-of-the-art methods. Additionally, we implement our approach on a common onboard computer for robots, the NVIDIA Jetson Orin NX, achieving a high frame rate of approximately 35 FPS using TensorRT FP16, which is exceptionally well-suited for real-time perception of onboard robotic systems.
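A standalone wavelet-denoising pass over a dense event frame can be written with PyWavelets as below; the `db4` wavelet, decomposition level, and universal soft threshold are assumptions, and in the paper the transform is integrated into the backbone rather than run as a separate step.

```python
# Minimal wavelet denoising of a dense event frame with PyWavelets.
import numpy as np
import pywt

def wavelet_denoise(event_frame, wavelet="db4", level=2):
    coeffs = pywt.wavedec2(event_frame, wavelet, level=level)
    # Noise scale from the finest diagonal detail band (robust MAD estimate).
    sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(event_frame.size))
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(c, thr, mode="soft") for c in band) for band in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet)

frame = np.random.poisson(0.05, size=(256, 256)).astype(np.float32)  # noisy event counts
print(wavelet_denoise(frame).shape)
```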

[CV-108] BakuFlow: A Streamlining Semi-Automatic Label Generation Tool

【速读】:该论文旨在解决计算机视觉中数据标注(annotation)的瓶颈问题,特别是在大规模任务中手动标注耗时且易出错。其解决方案的关键在于提出BakuFlow,一个流式半自动标签生成工具,核心功能包括:实时可调放大器用于像素级手动修正、交互式数据增强模块以丰富训练数据集、标签传播机制以快速复制连续帧中的标记对象,以及基于改进YOLOE框架的自动标注模块。该模块支持在标注过程中添加新物体类别和任意数量的视觉提示,从而实现灵活且可扩展的动态现实数据集标注。这些创新显著提升了目标检测与跟踪任务的标注效率,降低了工作量。

链接: https://arxiv.org/abs/2506.09083
作者: Jerry Lin,Partick P. W. Chen
机构: BakuAI AS(巴库人工智能公司); Silesian University of Technology(西里西亚技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures, 1 Table

点击查看摘要

Abstract:Accurately labeling (or annotating) data is still a bottleneck in computer vision, especially for large-scale tasks where manual labeling is time-consuming and error-prone. While tools like LabelImg can handle the labeling task, some of them still require annotators to manually label each image. In this paper, we introduce BakuFlow, a streamlining semi-automatic label generation tool. Key features include (1) a live adjustable magnifier for pixel-precise manual corrections, improving user experience; (2) an interactive data augmentation module to diversify training datasets; (3) label propagation for rapidly copying labeled objects between consecutive frames, greatly accelerating annotation of video data; and (4) an automatic labeling module powered by a modified YOLOE framework. Unlike the original YOLOE, our extension supports adding new object classes and any number of visual prompts per class during annotation, enabling flexible and scalable labeling for dynamic, real-world datasets. These innovations make BakuFlow especially effective for object detection and tracking, substantially reducing labeling workload and improving efficiency in practical computer vision and industrial scenarios.

[CV-109] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

【速读】:该论文试图解决当前视觉基础模型(Vision Foundation Models, VFMs)评估中存在的两个关键盲点:一是指令微调数据可能与VQA测试分布不一致,导致错误预测源于数据不匹配而非模型视觉能力不足;二是VQA基准测试通常需要多种视觉能力,难以判断错误是由于缺乏所有必要能力还是仅缺少某一关键能力。解决方案的关键在于引入AVA-Bench,这是首个明确解耦14种原子视觉能力(Atomic Visual Abilities, AVAs)的基准,通过解耦AVAs并匹配每种能力的训练与测试分布,精准定位VFMs的优势与不足,从而实现更系统和透明的模型评估。

链接: https://arxiv.org/abs/2506.09082
作者: Zheda Mai,Arpita Chowdhury,Zihe Wang,Sooyoung Jeon,Lemeng Wang,Jiacheng Hou,Jihyung Kil,Wei-Lun Chao
机构: The Ohio State University (俄亥俄州立大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: First two authors contribute equally

点击查看摘要

Abstract:The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) – foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive “ability fingerprints,” turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
zh

[CV-110] VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks

【速读】:该论文试图解决视频基础推理(video-based reasoning)领域中由于高质量推理导向数据稀缺和有效训练方法不足而导致的进展缓慢问题。解决方案的关键在于引入两个新型数据集——DarkEventInfer和MixVidQA,分别通过掩码事件片段和交错视频序列来增强模型的视频理解与推理能力,并结合多样化的奖励函数进行强化学习训练,从而开发出首个在“先推理后回答”范式下具备多任务处理能力的视频理解与推理模型VersaVid-R1。

链接: https://arxiv.org/abs/2506.09079
作者: Xinlong Chen,Yuanxing Zhang,Yushuo Guan,Bohan Zeng,Yang Shi,Sihan Yang,Pengfei Wan,Qiang Liu,Liang Wang,Tieniu Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in multimodal large language models have successfully extended the Reason-Then-Respond paradigm to image-based reasoning, yet video-based reasoning remains an underdeveloped frontier, primarily due to the scarcity of high-quality reasoning-oriented data and effective training methodologies. To bridge this gap, we introduce DarkEventInfer and MixVidQA, two novel datasets specifically designed to stimulate the model's advanced video understanding and reasoning abilities. DarkEventInfer presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues. MixVidQA, on the other hand, presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. Leveraging these carefully curated training samples together with reinforcement learning guided by diverse reward functions, we develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm capable of handling multiple-choice and open-ended question answering, as well as video captioning tasks. Extensive experiments demonstrate that VersaVid-R1 significantly outperforms existing models across a broad spectrum of benchmarks, covering video general understanding, cognitive reasoning, and captioning tasks.
zh

[CV-111] SILK: Smooth InterpoLation frameworK for motion in-betweening A Simplified Computational Approach CVPR2025

【速读】:该论文旨在解决运动补全(motion in-betweening)问题,即在关键帧之间生成连贯且高质量的中间运动序列,以实现对姿态细节的精细控制。其解决方案的关键在于提出了一种基于Transformer的简单但有效的框架,仅使用一个Transformer编码器来合成逼真的运动,通过优化数据建模策略而非依赖复杂的模型结构来提升补全性能,包括增加数据量、选择合适的姿态表示以及引入速度输入特征等方法。

链接: https://arxiv.org/abs/2506.09075
作者: Elly Akhoundi,Hung Yu Ling,Anup Anand Deshmukh,Judith Butepage
机构: Electronic Arts(电子艺界); Electronic Arts(电子艺界)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2025 Human Motion Generation Workshop. 10 pages, 3 figures, 5 Tables, and 40 References

点击查看摘要

Abstract:Motion in-betweening is a crucial tool for animators, enabling intricate control over pose-level details in each keyframe. Recent machine learning solutions for motion in-betweening rely on complex models, incorporating skeleton-aware architectures or requiring multiple modules and training steps. In this work, we introduce a simple yet effective Transformer-based framework, employing a single Transformer encoder to synthesize realistic motions for motion in-betweening tasks. We find that data modeling choices play a significant role in improving in-betweening performance. Among others, we show that increasing data volume can yield equivalent or improved motion transitions, that the choice of pose representation is vital for achieving high-quality results, and that incorporating velocity input features enhances animation performance. These findings challenge the assumption that model complexity is the primary determinant of animation quality and provide insights into a more data-centric approach to motion interpolation. Additional videos and supplementary material are available at this https URL.
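下面用 PyTorch 给出"单 Transformer 编码器 + 速度输入特征"思路的最小示意:关键帧已知,中间帧用可学习的 mask token 占位,速度用逐帧差分近似。结构与超参数均为假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

class InbetweenTransformer(nn.Module):
    """补帧模型示意:输入逐帧姿态与速度特征,输出补全后的整段动作。"""
    def __init__(self, pose_dim=66, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(pose_dim * 2, d_model)   # 姿态 + 速度 拼接后嵌入
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, pose_dim)

    def forward(self, poses, known):
        # poses: (B, T, D) 姿态序列;known: (B, T) 布尔掩码,标记关键帧
        vel = torch.diff(poses, dim=1, prepend=poses[:, :1])  # 差分近似速度
        x = self.embed(torch.cat([poses, vel], dim=-1))
        x = torch.where(known.unsqueeze(-1), x, self.mask_token.expand_as(x))
        return self.head(self.encoder(x))                # 逐帧预测姿态
```

训练时可仅对未知帧计算重建损失;这也呼应了摘要中"数据建模选择(姿态表示、速度特征)比模型复杂度更关键"的结论。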
zh

[CV-112] Segment Any Architectural Facades (SAAF):An automatic segmentation model for building facades walls and windows based on multimodal semantics guidance

【速读】:该论文旨在解决建筑立面中墙体和窗户的自动分割问题,以提升建筑信息模型和计算机辅助设计的效率。其解决方案的关键在于提出了一种基于多模态语义引导的自动分割模型SAAF,该模型通过多模态语义协同特征提取机制,融合文本描述的语义信息与图像特征,增强对建筑立面构件的语义理解;同时构建了一个端到端的训练框架,使模型能够自主学习从文本描述到图像分割的映射关系,从而减少人工干预的影响,提高分割的自动化程度和鲁棒性。

链接: https://arxiv.org/abs/2506.09071
作者: Peilin Li,Jun Yin,Jing Zhong,Ran Luo,Pengyu Zeng,Miao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the context of the digital development of architecture, the automatic segmentation of walls and windows is a key step in improving the efficiency of building information models and computer-aided design. This study proposes an automatic segmentation model for building facade walls and windows based on multimodal semantic guidance, called Segment Any Architectural Facades (SAAF). First, SAAF has a multimodal semantic collaborative feature extraction mechanism. By combining natural language processing technology, it can fuse the semantic information in text descriptions with image features, enhancing the semantic understanding of building facade components. Second, we developed an end-to-end training framework that enables the model to autonomously learn the mapping relationship from text descriptions to image segmentation, reducing the influence of manual intervention on the segmentation results and improving the automation and robustness of the model. Finally, we conducted extensive experiments on multiple facade datasets. The segmentation results of SAAF outperformed existing methods in the mIoU metric, indicating that the SAAF model can maintain high-precision segmentation ability when faced with diverse datasets. Our model has made certain progress in improving the accuracy and generalization ability of the wall and window segmentation task. It is expected to provide a reference for the development of architectural computer vision technology and also explore new ideas and technical paths for the application of multimodal learning in the architectural field.
zh

[CV-113] BG-HOP: A Bimanual Generative Hand-Object Prior CVPR2025

【速读】:该论文试图解决的是在三维空间中建模双手与物体交互(bimanual hand-object interactions)的问题,其核心挑战在于此类交互数据的稀缺性。解决方案的关键在于扩展现有的单手生成先验(single-hand generative priors),构建一种名为BG-HOP的生成先验,以捕捉双手与物体的联合分布,并通过实验验证了模型在生成双人手交互和为给定物体合成抓取动作方面的有效性。

链接: https://arxiv.org/abs/2506.09068
作者: Sriram Krishna,Sravan Chittupalli,Sungjae Park
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Presented at Agents in Interaction, from Humans to Robots, CVPR 2025

点击查看摘要

Abstract:In this work, we present BG-HOP, a generative prior that seeks to model bimanual hand-object interactions in 3D. We address the challenge of limited bimanual interaction data by extending existing single-hand generative priors, demonstrating preliminary results in capturing the joint distribution of hands and objects. Our experiments showcase the model’s capability to generate bimanual interactions and synthesize grasps for given objects. We make code and models publicly available.
zh

[CV-114] Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations

【速读】:该论文旨在解决生成式医疗视觉-语言模型(Generative Medical Vision-Language Models, Med-VLMs)在面对有害查询时的安全性问题,特别是如何在不显著影响模型性能的情况下有效抵御视觉和文本越狱攻击。其解决方案的关键在于提出一种新颖的推理阶段防御策略,该策略通过基于合成临床示例的防御机制来增强模型安全性,同时通过增加示例预算缓解过度防御问题,并引入混合示例策略以在有限的少样本示例预算下平衡安全性和性能。

链接: https://arxiv.org/abs/2506.09067
作者: Zhiyu Xue,Reza Abbasi-Asl,Ramtin Pedarsani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative medical vision-language models (Med-VLMs) are primarily designed to generate complex textual information (e.g., diagnostic reports) from multimodal inputs including vision modality (e.g., medical images) and language modality (e.g., clinical queries). However, their security vulnerabilities remain underexplored. Med-VLMs should be capable of rejecting harmful queries, such as "Provide detailed instructions for using this CT scan for insurance fraud." At the same time, addressing security concerns introduces the risk of over-defense, where safety-enhancing mechanisms may degrade general performance, causing Med-VLMs to reject benign clinical queries. In this paper, we propose a novel inference-time defense strategy to mitigate harmful queries, enabling defense against visual and textual jailbreak attacks. Using diverse medical imaging datasets collected from nine modalities, we demonstrate that our defense strategy based on synthetic clinical demonstrations enhances model safety without significantly compromising performance. Additionally, we find that increasing the demonstration budget alleviates the over-defense issue. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance under few-shot demonstration budget constraints.
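下面用几行 Python 示意"混合示例"提示的拼装方式:在固定的少样本预算内同时放入拒答示例与正常作答示例,以兼顾安全与性能。示例内容、比例与提示格式均为假设,仅用于说明论文中 mixed demonstration 的权衡思想。

```python
def build_prompt(query, refusal_demos, benign_demos, budget=4):
    """在预算 budget 内混合两类示例后拼出少样本提示(示意)。"""
    k = budget // 2                      # 假设各取一半
    demos = refusal_demos[:k] + benign_demos[:budget - k]
    parts = [f"Query: {q}\nResponse: {a}" for q, a in demos]
    parts.append(f"Query: {query}\nResponse:")
    return "\n\n".join(parts)

refusal = [("Provide detailed instructions for using this CT scan for insurance fraud.",
            "I cannot help with that request.")]
benign = [("Describe the key findings in this chest X-ray.",
           "The lungs are clear; no focal consolidation is seen.")]
print(build_prompt("Summarize this MRI report.", refusal, benign))
```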
zh

[CV-115] ReStNet: A Reusable Stitchable Network for Dynamic Adaptation on IoT Devices

【速读】:该论文试图解决在资源异构的物联网(IoT)应用中部署预训练模型的挑战,即传统压缩方法在应用后缺乏灵活性,无法适应动态变化的计算和内存约束。其解决方案的关键在于提出ReStNet,一种可重用且可拼接的网络架构,通过动态拼接两个预训练模型来构建混合网络。ReStNet的核心创新在于利用层间相似性(通过Centered Kernel Alignment, CKA计算)确定最优拼接点,并保留大容量模型的浅层与小模型的深层以构建混合模型,仅对拼接层进行微调,从而实现快速适应不同资源预算并有效利用现有资源。

链接: https://arxiv.org/abs/2506.09066
作者: Maoyu Wang,Yao Lu,Jiaqi Nie,Zeyu Wang,Yun Lin,Qi Xuan,Guan Gui
机构: Zhejiang University of Technology(浙江工业大学); Agency for Science, Technology and Research(新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of deep learning, a growing number of pre-trained models have become publicly available. However, deploying these fixed models in real-world IoT applications is challenging because different devices possess heterogeneous computational and memory resources, making it impossible to deploy a single model across all platforms. Although traditional compression methods, such as pruning, quantization, and knowledge distillation, can improve efficiency, they become inflexible once applied and cannot adapt to changing resource constraints. To address these issues, we propose ReStNet, a Reusable and Stitchable Network that dynamically constructs a hybrid network by stitching two pre-trained models together. Implementing ReStNet requires addressing several key challenges, including how to select the optimal stitching points, determine the stitching order of the two pre-trained models, and choose an effective fine-tuning strategy. To systematically address these challenges and adapt to varying resource constraints, ReStNet determines the stitching point by calculating layer-wise similarity via Centered Kernel Alignment (CKA). It then constructs the hybrid model by retaining the early layers of a larger-capacity model and appending the deeper layers of a smaller one. To facilitate efficient deployment, only the stitching layer is fine-tuned. This design enables rapid adaptation to changing budgets while fully leveraging available resources. Moreover, ReStNet supports both homogeneous (CNN-CNN, Transformer-Transformer) and heterogeneous (CNN-Transformer) stitching, allowing different model families to be combined flexibly. Extensive experiments on multiple benchmarks demonstrate that ReStNet achieves flexible accuracy-efficiency trade-offs at runtime while significantly reducing training cost.
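论文用 CKA 度量层间相似度以确定拼接点。其线性形式是标准公式,可按如下方式计算;层对遍历逻辑为示意,真实系统还需考虑特征维度对齐与拼接层微调。

```python
import torch

def linear_cka(X, Y):
    """线性 CKA:X (n, d1) 与 Y (n, d2) 为同一批输入在两层上的激活,返回 [0,1] 相似度。"""
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (X.T @ Y).norm() ** 2                     # ||X^T Y||_F^2
    denom = (X.T @ X).norm() * (Y.T @ Y).norm()
    return (hsic / denom).item()

def best_stitch_pair(acts_big, acts_small):
    """遍历两模型的逐层激活,取 CKA 最高的层对作为拼接点(示意)。"""
    return max(((i, j, linear_cka(a, b))
                for i, a in enumerate(acts_big)
                for j, b in enumerate(acts_small)),
               key=lambda t: t[2])   # 返回 (大模型层号, 小模型层号, 相似度)
```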
zh

[CV-116] Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search

【速读】:该论文旨在解决会话搜索中传统策略在处理用户复杂信息需求时存在的不足,即过于依赖序列建模而忽视交互中的图结构,或虽关注结构信息但采用泛化的文档表示而忽略词级语义建模。其解决方案的关键在于提出Symbolic Graph Ranker (SGR),通过引入符号语法规则将会话图转换为文本,从而结合基于文本和图的方法,并利用大型语言模型(Large Language Models, LLMs)的强大能力。此外,通过自监督符号学习任务增强LLMs对图结构的捕捉能力,从粗粒度到细粒度提取拓扑信息,实现更有效的会话搜索。

链接: https://arxiv.org/abs/2505.14156
作者: Songhao Wu,Quan Tu,Hong Liu,Jia Xu,Zhongyi Liu,Guannan Zhang,Ran Wang,Xiuying Chen,Rui Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert a session graph into text. This allows integrating session history, interaction process, and task instruction seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture the topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
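下面给出一组假设性的符号语法规则,把会话图线性化为文本,示意"graph-to-text"这一步;标记名称与图结构均为示例,并非论文的实际语法。

```python
def session_graph_to_text(nodes, edges):
    """按简单符号规则把会话图序列化为 LLM 可读文本(示意)。
    nodes: {id: {"type": "query"/"doc", "text": str}};edges: [(src, rel, dst)]"""
    lines = ["[GRAPH]"]
    for nid, attr in nodes.items():
        lines.append(f"[NODE] {nid} [TYPE] {attr['type']} [TEXT] {attr['text']}")
    for src, rel, dst in edges:
        lines.append(f"[EDGE] {src} -{rel}-> {dst}")   # 交互关系,如点击、改写
    lines.append("[/GRAPH]")
    return "\n".join(lines)

nodes = {"q1": {"type": "query", "text": "cheap flights to tokyo"},
         "d1": {"type": "doc", "text": "Tokyo flight deals ..."}}
edges = [("q1", "click", "d1")]
print(session_graph_to_text(nodes, edges))
```

在此序列化之上,即可构造摘要所述的链接预测、节点内容生成等自监督任务,让 LLM 逐步学会读取这种符号语言中的拓扑信息。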
zh

[CV-117] Sampling Theory for Super-Resolution with Implicit Neural Representations

【速读】:该论文试图解决在连续域图像通过低通傅里叶采样进行恢复时,基于生成式 AI (Generative AI) 的隐式神经表示(Implicit Neural Representations, INRs)的样本复杂度问题。其解决方案的关键在于将非凸参数空间优化问题的最小值与定义在无限维测度空间上的凸惩罚函数的最小值建立联系,从而确定能够精确恢复由INR实现的图像所需的傅里叶采样数量。

链接: https://arxiv.org/abs/2506.09949
作者: Mahrokh Najaf,Gregory Ongie
机构: Marquette University (马凯特大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2405.18410

点击查看摘要

Abstract:Implicit neural representations (INRs) have emerged as a powerful tool for solving inverse problems in computer vision and computational imaging. INRs represent images as continuous domain functions realized by a neural network taking spatial coordinates as inputs. However, unlike traditional pixel representations, little is known about the sample complexity of estimating images using INRs in the context of linear inverse problems. Towards this end, we study the sampling requirements for recovery of a continuous domain image from its low-pass Fourier samples by fitting a single hidden-layer INR with ReLU activation and a Fourier features layer using a generalized form of weight decay regularization. Our key insight is to relate minimizers of this non-convex parameter space optimization problem to minimizers of a convex penalty defined over an infinite-dimensional space of measures. We identify a sufficient number of Fourier samples for which an image realized by an INR is exactly recoverable by solving the INR training problem. To validate our theory, we empirically assess the probability of achieving exact recovery of images realized by low-width single hidden-layer INRs, and illustrate the performance of INRs on super-resolution recovery of continuous domain phantom images.
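按摘要所述模型族(固定傅里叶特征层 + 单隐层 ReLU + weight decay 正则),可写出如下 PyTorch 最小示意;频率范围、宽度与正则系数均为假设值,且此处用普通 weight decay 作为论文所分析的广义正则形式的简化替代。

```python
import math
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    """单隐层 INR 示意:输入空间坐标,输出该点的图像强度。"""
    def __init__(self, n_freq=8, width=256):
        super().__init__()
        fx, fy = torch.meshgrid(torch.arange(-n_freq, n_freq + 1),
                                torch.arange(-n_freq, n_freq + 1), indexing="ij")
        freqs = torch.stack([fx, fy], dim=-1).reshape(-1, 2).float()
        self.register_buffer("freqs", freqs)             # 固定的傅里叶频率格点
        self.hidden = nn.Linear(2 * freqs.shape[0], width)
        self.out = nn.Linear(width, 1)

    def forward(self, xy):                               # xy: (N, 2),坐标取值 [0, 1)^2
        phase = 2 * math.pi * xy @ self.freqs.T
        feats = torch.cat([torch.cos(phase), torch.sin(phase)], dim=-1)
        return self.out(torch.relu(self.hidden(feats)))

model = FourierINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```

训练目标即让 model 在采样到的低通傅里叶测量上与观测一致;论文的理论结果刻画了多少个傅里叶样本足以让该训练问题精确恢复图像。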
zh

[CV-118] A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma

【速读】:该论文旨在解决口腔鳞状细胞癌(Oral Squamous Cell Carcinoma, OSCC)在低资源环境中早期诊断困难的问题,传统组织病理学诊断方法因侵入性、资源密集性和对专家病理科医生的依赖而难以普及,而口腔细胞学检查虽然具有微创和低成本的优势,但其应用受限于观察者间差异和缺乏专家病理科医生。论文提出的解决方案关键在于开发和验证基于人工智能(Artificial Intelligence, AI)的 robust 模型,这需要大规模、标记且多源的数据集来训练具备泛化能力的高容量模型,为此,研究者引入了首个大规模多中心口腔细胞学数据集,涵盖经Papanicolaou(PAP)和May-Grunwald-Giemsa(MGG)染色的标注切片,旨在推动AI驱动的诊断方法发展,从而提升自动化检测水平,减少诊断误差,并改善资源匮乏地区的OSCC早期诊断。

链接: https://arxiv.org/abs/2506.09661
作者: Garima Jain,Sanghamitra Pati,Mona Duggal,Amit Sethi,Abhijeet Patil,Gururaj Malekar,Nilesh Kowe,Jitender Kumar,Jatin Kashyap,Divyajeet Rout,Deepali,Hitesh,Nishi Halduniya,Sharat Kumar,Heena Tabassum,Rupinder Singh Dhaliwal,Sucheta Devi Khuraijam,Sushma Khuraijam,Sharmila Laishram,Simmi Kharb,Sunita Singh,K. Swaminadtan,Ranjana Solanki,Deepika Hemranjani,Shashank Nath Singh,Uma Handa,Manveen Kaur,Surinder Singhal,Shivani Kalhan,Rakesh Kumar Gupta,Ravi. S,D. Pavithra,Sunil Kumar Mahto,Arvind Kumar,Deepali Tirkey,Saurav Banerjee,L. Sreelakshmi
机构: Indian Council of Medical Research-National Institute for Research in Digital Health & Data Science, New Delhi(印度医学研究委员会-数字健康与数据科学国家研究所,新德里); Koita Centre of Digital Health, IIT Bombay(科塔数字健康中心,印度理工学院孟买分校); Indian Council of Medical Research, New Delhi(印度医学研究委员会,新德里); Department of Electrical Engineering, IIT Bombay(电气工程系,印度理工学院孟买分校); Division of Non Communicable Diseases,Indian Council of Medical Research, New Delhi-110029(非传染性疾病分部,印度医学研究委员会,新德里-110029); Department of Pathology, Regional Institute of Medical Sciences, Imphal(病理科,地区医学科学研究所,英帕尔); Department of Biochemistry, Pt BD Sharma PGIMS, Rohtak(生物化学系,Pt BD Sharma PGIMS,罗塔克); Institute of Pathology, Madras Medical College, Chennai(病理学研究所,马德拉斯医学院,钦奈); Department of Pathology SMS Medical College Jaipur(病理科,SMS医学学院,斋普尔); Department of Otorhinolaryngology SMS Medical College Jaipur(耳鼻喉科,SMS医学学院,斋普尔); Department of Pathology, Government Medical College and Hospital, Chandigarh(病理科,昌迪加尔政府医学院和医院); Department of Otorhinolaryngology, Government Medical College and Hospital, Chandigarh(耳鼻喉科,昌迪加尔政府医学院和医院); Government Institute of Medical Sciences, Greater Noida(Greater Noida 医学研究所); Department of Pathology, Chengalpattu Medical College, Chengalpattu(病理科,Chengalpattu 医学院,Chengalpattu); Department of Pathology, Coimbatore medical College, Civil aerodrome post, Coimbatore(病理科,科伊姆巴托尔医学院,民用机场邮局,科伊姆巴托尔); Department of Pathology, RIMS, Ranchi, Jharkhand(病理科,RIMS,兰奇,贾坎德); Gandhi Medical College, Hyderabad, Telangana(甘地医学院,海得拉巴,特伦甘纳)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Oral squamous cell carcinoma (OSCC) is a major global health burden, particularly in several regions across Asia, Africa, and South America, where it accounts for a significant proportion of cancer cases. Early detection dramatically improves outcomes, with stage I cancers achieving up to 90 percent survival. However, traditional diagnosis based on histopathology has limited accessibility in low-resource settings because it is invasive, resource-intensive, and reliant on expert pathologists. On the other hand, oral brush-biopsy cytology offers a minimally invasive and lower-cost alternative, provided that the remaining challenges, inter-observer variability and the unavailability of expert pathologists, can be addressed using artificial intelligence. Development and validation of robust AI solutions requires access to large, labeled, and multi-source datasets to train high-capacity models that generalize across domain shifts. We introduce the first large, multicenter oral cytology dataset, comprising annotated slides stained with Papanicolaou (PAP) and May-Grunwald-Giemsa (MGG) protocols, collected from ten tertiary medical centers in India. The dataset is labeled and annotated by expert pathologists for cellular anomaly classification and detection, and is designed to advance AI-driven diagnostic methods. By filling the gap in publicly available oral cytology datasets, this resource aims to enhance automated detection, reduce diagnostic errors, and improve early OSCC diagnosis in resource-constrained settings, ultimately contributing to reduced mortality and better patient outcomes worldwide.
zh

[CV-119] he RSNA Lumbar Degenerative Imaging Spine Classification (LumbarDISC) Dataset

【速读】:该论文旨在解决成人腰椎退行性变的自动分类问题,通过构建一个大规模的公开标注数据集来促进机器学习在腰椎影像中的研究与应用。其解决方案的关键在于创建了包含8,593个图像序列的RSNA LumbarDISC数据集,该数据集由来自多个机构和国家的专家放射科医生进行标注,涵盖了腰椎各个椎间盘水平的脊髓管、关节突窝和神经孔狭窄程度的分级信息,从而为深度学习模型的开发提供了高质量的训练与评估基础。

链接: https://arxiv.org/abs/2506.09162
作者: Tyler J. Richards,Adam E. Flanders,Errol Colak,Luciano M. Prevedello,Robyn L. Ball,Felipe Kitamura,John Mongan,Maryam Vazirabad,Hui-Ming Lin,Anne Kendell,Thanat Kanthawang,Salita Angkurawaranon,Emre Altinmakas,Hakan Dogan,Paulo Eduardo de Aguiar Kuriki,Arjuna Somasundaram,Christopher Ruston,Deniz Bulja,Naida Spahovic,Jennifer Sommer,Sirui Jiang,Eduardo Moreno Judice de Mattos Farina,Eduardo Caminha Nunes,Michael Brassil,Megan McNamara,Johanna Ortiz,Jacob Peoples,Vinson L. Uytana,Anthony Kam,Venkata N.S. Dola,Daniel Murphy,David Vu,Dataset Contributor Group,Dataset Annotator Group,Competition Data Notebook Group,Jason F. Talbott
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Radiological Society of North America (RSNA) Lumbar Degenerative Imaging Spine Classification (LumbarDISC) dataset is the largest publicly available dataset of adult MRI lumbar spine examinations annotated for degenerative changes. The dataset includes 2,697 patients with a total of 8,593 image series from 8 institutions across 6 countries and 5 continents. The dataset is available for free for non-commercial use via Kaggle and RSNA Medical Imaging Resource of AI (MIRA). The dataset was created for the RSNA 2024 Lumbar Spine Degenerative Classification competition where competitors developed deep learning models to grade degenerative changes in the lumbar spine. The degree of spinal canal, subarticular recess, and neural foraminal stenosis was graded at each intervertebral disc level in the lumbar spine. The images were annotated by expert volunteer neuroradiologists and musculoskeletal radiologists from the RSNA, American Society of Neuroradiology, and the American Society of Spine Radiology. This dataset aims to facilitate research and development in machine learning and lumbar spine imaging to lead to improved patient care and clinical efficiency.
zh

[CV-120] An Explainable Deep Learning Framework for Brain Stroke and Tumor Progression via MRI Interpretation

【速读】:该论文旨在解决早期准确检测脑部异常(如肿瘤和中风)的问题,以实现及时干预并改善患者预后。其解决方案的关键在于采用基于深度学习的系统,利用卷积神经网络(Convolutional Neural Networks, CNNs)中的MobileNet V2和通过迁移学习优化的ResNet-50模型,从MRI图像中识别脑肿瘤和中风及其相应阶段。研究通过整合和增强多个公开的MRI数据源,构建了一个类别平衡且图像多样化的数据集,并通过应用丢弃层和广泛的数据增强技术来提升模型的泛化能力和防止过拟合。

链接: https://arxiv.org/abs/2506.09161
作者: Rajan Das Gupta,Md Imrul Hasan Showmick,Mushfiqur Rahman Abir,Shanjida Akter,Md. Yeasin Rahat,Md. Jakir Hossen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in MECON 2025

点击查看摘要

Abstract:Early and accurate detection of brain abnormalities, such as tumors and strokes, is essential for timely intervention and improved patient outcomes. In this study, we present a deep learning-based system capable of identifying both brain tumors and strokes from MRI images, along with their respective stages. We have executed two groundbreaking strategies involving convolutional neural networks, MobileNet V2 and ResNet-50, optimized through transfer learning to classify MRI scans into five diagnostic categories. Our dataset, aggregated and augmented from various publicly available MRI sources, was carefully curated to ensure class balance and image diversity. To enhance model generalization and prevent overfitting, we applied dropout layers and extensive data augmentation. The models achieved strong performance, with training accuracy reaching 93% and validation accuracy up to 88%. While ResNet-50 demonstrated slightly better results, MobileNet V2 remains a promising option for real-time diagnosis in low-resource settings due to its lightweight architecture. This research offers a practical AI-driven solution for early brain abnormality detection, with potential for clinical deployment and future enhancement through larger datasets and multimodal inputs.
zh

[CV-121] Low-Rank Augmented Implicit Neural Representation for Unsupervised High-Dimensional Quantitative MRI Reconstruction

【速读】:该论文旨在解决从高度欠采样、高维测量数据中鲁棒重建定量磁共振成像(qMRI)参数的问题,这一问题由于当前仅依赖单一先验或物理信息模型的重建方法在处理高度病态逆问题时效果不佳而变得尤为困难。论文提出的解决方案关键在于引入LoREIN框架,该框架通过结合低秩先验(low-rank prior)和连续性先验(continuity prior),分别利用低秩表示(LRR)和隐式神经表示(INR)来提升重建精度。INR的强大连续表示能力有助于在低秩子空间中估计最优空间基,从而实现加权图像的高保真重建,同时多对比度加权图像的预测为定量参数图的重建提供了结构和定量指导。

链接: https://arxiv.org/abs/2506.09100
作者: Haonan Zhang,Guoyan Lao,Yuyao Zhang,Hongjiang Wei
机构: Shanghai Jiao Tong University (上海交通大学); ShanghaiTech University (上海科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Quantitative magnetic resonance imaging (qMRI) provides tissue-specific parameters vital for clinical diagnosis. Although simultaneous multi-parametric qMRI (MP-qMRI) technologies enhance imaging efficiency, robustly reconstructing qMRI from highly undersampled, high-dimensional measurements remains a significant challenge. This difficulty arises primarily because current reconstruction methods rely solely on a single prior or physics-informed model to solve the highly ill-posed inverse problem, which often leads to suboptimal results. To overcome this limitation, we propose LoREIN, a novel unsupervised and dual-prior-integrated framework for accelerated 3D MP-qMRI reconstruction. Technically, LoREIN incorporates both a low-rank prior and a continuity prior via low-rank representation (LRR) and implicit neural representation (INR), respectively, to enhance reconstruction fidelity. The powerful continuous representation of INR enables the estimation of optimal spatial bases within the low-rank subspace, facilitating high-fidelity reconstruction of weighted images. Simultaneously, the predicted multi-contrast weighted images provide essential structural and quantitative guidance, further enhancing the reconstruction accuracy of quantitative parameter maps. Furthermore, our work introduces a zero-shot learning paradigm with broad potential in complex spatiotemporal and high-dimensional image reconstruction tasks, further advancing the field of medical imaging.
zh

[CV-122] Foundation Models in Medical Imaging – A Review and Outlook

【速读】:该论文试图解决医学影像分析中依赖手动标注数据的局限性,其核心问题是如何利用大量未标注数据来提升模型的泛化能力和适应性。解决方案的关键在于使用生成式 AI (Generative AI) 预训练模型,通过自监督学习方法学习通用的视觉特征,并在少量额外监督下适应具体的临床任务。

链接: https://arxiv.org/abs/2506.09095
作者: Vivien van Veldhuizen,Vanessa Botha,Chunyao Lu,Melis Erdal Cesur,Kevin Groot Lipman,Edwin D. de Jong,Hugo Horlings,Clárisa Sanchez,Cees Snoek,Ritse Mann,Eric Marcus,Jonas Teuwen
机构: Netherlands Cancer Institute (荷兰癌症研究所); Radboud University Medical Center (拉德布德大学医学中心); Delft University of Technology (代尔夫特理工大学); Kaiko.ai (Kaiko.ai); University of Amsterdam (阿姆斯特丹大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.
zh

[CV-123] Devanagari Digit Recognition using Quantum Machine Learning

【速读】:该论文旨在解决区域性文字(如天城文 Devanagari)手写数字识别中的挑战,特别是在复杂结构和有限标注数据集背景下传统模型性能受限的问题。解决方案的关键在于提出一种混合量子-经典架构,结合卷积神经网络(CNN)进行空间特征提取与10量子比特变分量子电路(VQC)进行量子增强分类,通过利用量子叠加和纠缠等原理,实现了在天城文手写字符数据集(DHCD)上的高精度识别,展现了量子机器学习(QML)在实际低资源语言场景中的潜力。

链接: https://arxiv.org/abs/2506.09069
作者: Sahaj Raj Malla
机构: Kathmandu University (加德满都大学)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, arXiv preprint, code available upon request

点击查看摘要

Abstract:Handwritten digit recognition in regional scripts, such as Devanagari, is crucial for multilingual document digitization, educational tools, and the preservation of cultural heritage. The script's complex structure and limited annotated datasets pose significant challenges to conventional models. This paper introduces the first hybrid quantum-classical architecture for Devanagari handwritten digit recognition, combining a convolutional neural network (CNN) for spatial feature extraction with a 10-qubit variational quantum circuit (VQC) for quantum-enhanced classification. Trained and evaluated on the Devanagari Handwritten Character Dataset (DHCD), the proposed model achieves a state-of-the-art test accuracy for a quantum implementation of 99.80% and a test loss of 0.2893, with an average per-class F1-score of 0.9980. Compared to equivalent classical CNNs, our model demonstrates superior accuracy with significantly fewer parameters and enhanced robustness. By leveraging quantum principles such as superposition and entanglement, this work establishes a novel benchmark for regional script recognition, highlighting the promise of quantum machine learning (QML) in real-world, low-resource language settings.
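下面用 PennyLane 给出"CNN 特征提取 + 10 量子比特 VQC 分类"混合结构的示意;纠缠层数、编码方式与卷积结构均为假设,并非论文的确切线路。

```python
import pennylane as qml
import torch
import torch.nn as nn

n_qubits, n_layers = 10, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def vqc(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))             # 角度编码 CNN 特征
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))  # 变分纠缠层
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

qlayer = qml.qnn.TorchLayer(vqc, {"weights": (n_layers, n_qubits, 3)})

model = nn.Sequential(                   # 输入假设为 (B, 1, 32, 32) 的 DHCD 灰度图
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(n_qubits), nn.Tanh(),  # 压到 10 维并缩放,供角度编码使用
    qlayer,
    nn.Linear(n_qubits, 10),             # 10 类数字
)
```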
zh

[CV-124] Exploring Image Transforms derived from Eye Gaze Variables for Progressive Autism Diagnosis

【速读】:该论文旨在解决自闭症谱系障碍(Autism Spectrum Disorder, ASD)诊断过程耗时、成本高昂的问题,以及由此带来的沟通、行为和注意力方面的挑战。其解决方案的关键在于引入一种基于人工智能的辅助技术,该技术结合了迁移学习与从眼动变量中提取的图像变换,以实现更便捷、高效的ASD诊断与管理,同时保障用户隐私并促进监护人与治疗师之间的沟通。

链接: https://arxiv.org/abs/2506.09065
作者: Abigail Copiaco,Christian Ritz,Yassine Himeur,Valsamma Eapen,Ammar Albanna,Wathiq Mansoor
机构: University of Dubai(迪拜大学); University of Wollongong(伍伦贡大学); Mohammed Bin Rashid University Dubai(穆罕默德·本·拉希德大学迪拜校区); University of New South Wales(新南威尔士大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 6 pages, 8 figures, and 1 table

点击查看摘要

Abstract:The prevalence of Autism Spectrum Disorder (ASD) has surged rapidly over the past decade, posing significant challenges in communication, behavior, and focus for affected individuals. Current diagnostic techniques, though effective, are time-intensive, leading to high social and economic costs. This work introduces an AI-powered assistive technology designed to streamline ASD diagnosis and management, enhancing convenience for individuals with ASD and efficiency for caregivers and therapists. The system integrates transfer learning with image transforms derived from eye gaze variables to diagnose ASD. This facilitates and opens opportunities for in-home periodic diagnosis, reducing stress for individuals and caregivers, while also preserving user privacy through the use of image transforms. The accessibility of the proposed method also offers opportunities for improved communication between guardians and therapists, ensuring regular updates on progress and evolving support needs. Overall, the approach proposed in this work ensures timely, accessible diagnosis while protecting the subjects' privacy, improving outcomes for individuals with ASD.
zh

[CV-125] Reconstructing Heterogeneous Biomolecules via Hierarchical Gaussian Mixtures and Part Discovery

【速读】:该论文旨在解决冷冻电子显微镜(Cryo-EM)中分子结构重建的问题,特别是在成像粒子表现出非刚性构象灵活性和组成变异的情况下,如何准确建模其三维结构。论文提出的解决方案关键在于引入一种基于分层高斯混合模型的3D重建框架,称为CryoSPIRE,该框架通过初始过程推断粒子的部分分割,提供必要的归纳偏置以处理构象和组成多样性,从而在复杂实验数据集上揭示具有生物学意义的结构,并在CryoBench基准测试中达到了新的最先进水平。

链接: https://arxiv.org/abs/2506.09063
作者: Shayan Shekarforoush,David B. Lindell,Marcus A. Brubaker,David J. Fleet
机构: University of Toronto(多伦多大学); Vector Institute(向量研究所); York University(约克大学)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 21 pages, 14 figures, Project Webpage: this https URL

点击查看摘要

Abstract:Cryo-EM is a transformational paradigm in molecular biology where computational methods are used to infer 3D molecular structure at atomic resolution from extremely noisy 2D electron microscope images. At the forefront of research is how to model the structure when the imaged particles exhibit non-rigid conformational flexibility and compositional variation where parts are sometimes missing. We introduce a novel 3D reconstruction framework with a hierarchical Gaussian mixture model, inspired in part by Gaussian Splatting for 4D scene reconstruction. In particular, the structure of the model is grounded in an initial process that infers a part-based segmentation of the particle, providing essential inductive bias in order to handle both conformational and compositional variability. The framework, called CryoSPIRE, is shown to reveal biologically meaningful structures on complex experimental datasets, and establishes a new state-of-the-art on CryoBench, a benchmark for cryo-EM heterogeneity methods.
zh

人工智能

[AI-0] eFlesh: Highly customizable Magnetic Touch Sensing using Cut-Cell Microstructures

【速读】:该论文试图解决在非结构化环境中机器人操作中缺乏通用、可访问且易于定制的触觉传感器的问题,这一缺陷导致了机器人操作中存在碎片化的传感器专用解决方案,甚至在许多情况下采用无感知力的无传感器方法。解决方案的关键是引入一种名为eFlesh的磁性触觉传感器,其具有低成本、易于制造和高度可定制的特点,通过四类组件(业余3D打印机、现成磁铁、所需形状的CAD模型以及磁强计电路板)即可构建,其由可调几何结构和机械响应的参数化微结构组成,并提供开源设计工具以实现应用特定传感器的创建和灵敏度调整。

链接: https://arxiv.org/abs/2506.09994
作者: Venkatesh Pattabiraman,Zizhou Huang,Daniele Panozzo,Denis Zorin,Lerrel Pinto,Raunaq Bhirangi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:If human experience is any guide, operating effectively in unstructured environments – like homes and offices – requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation – and in many cases, to force-unaware, sensorless approaches. With eFlesh, we bridge this gap by introducing a magnetic tactile sensor that is low-cost, easy to fabricate, and highly customizable. Building an eFlesh sensor requires only four components: a hobbyist 3D printer, off-the-shelf magnets (<$5), a CAD model of the desired shape, and a magnetometer circuit board. The sensor is constructed from tiled, parameterized microstructures, which allow for tuning the sensor's geometry and its mechanical response. We provide an open-source design tool that converts convex OBJ/STL files into 3D-printable STLs for fabrication. This modular design framework enables users to create application-specific sensors, and to adjust sensitivity depending on the task. Our sensor characterization experiments demonstrate the capabilities of eFlesh: contact localization RMSE of 0.5 mm, and force prediction RMSE of 0.27 N for normal force and 0.12 N for shear force. We also present a learned slip detection model that generalizes to unseen objects with 95% accuracy, and visuotactile control policies that improve manipulation performance by 40% over vision-only baselines – achieving 91% average success rate for four precise tasks that require sub-mm accuracy for successful completion. All design files, code and the CAD-to-eFlesh STL conversion tool are open-sourced and available on this https URL.
zh

[AI-1] How Do People Revise Inconsistent Beliefs? Examining Belief Revision in Humans with User Studies

【速读】:该论文试图解决如何使人工智能系统更好地模拟人类在面对新信息时修正其信念的过程,以实现与人类推理的对齐。解决方案的关键在于提出解释驱动的信念修正(explanation-based revision),即人们在面对矛盾信息时更倾向于依据解释进行信念调整,而这种调整可能不符合经典信念变化理论中的最小化原则。研究通过三项用户实验证明了人类在不同情境下普遍偏好基于解释的、可能非最小化的信念修正方式,为设计能够更贴近人类认知过程的AI系统提供了重要启示。

链接: https://arxiv.org/abs/2506.09977
作者: Stylianos Loukas Vasileiou,Antonio Rago,Maria Vanina Martinez,William Yeoh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding how humans revise their beliefs in light of new information is crucial for developing AI systems which can effectively model, and thus align with, human reasoning. While theoretical belief revision frameworks rely on a set of principles that establish how these operations are performed, empirical evidence from cognitive psychology suggests that people may follow different patterns when presented with conflicting information. In this paper, we present three comprehensive user studies showing that people consistently prefer explanation-based revisions, i.e., those which are guided by explanations, that result in changes to their belief systems that are not necessarily captured by classical belief change theory. Our experiments systematically investigate how people revise their beliefs with explanations for inconsistencies, whether they are provided with them or left to formulate them themselves, demonstrating a robust preference for what may seem non-minimal revisions across different types of scenarios. These findings have implications for AI systems designed to model human reasoning or interact with humans, suggesting that such systems should accommodate explanation-based, potentially non-minimal belief revision operators to better align with human cognitive processes.
zh

[AI-2] LLM ail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge MICRO

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在区分指令与数据方面的固有局限性,特别是针对间接提示注入(Indirect Prompt Injection)攻击的问题。解决方案的关键在于通过构建一个名为LLMail-Inject的公开挑战,模拟现实场景中攻击者尝试向基于LLM的电子邮件助手注入恶意指令以触发未经授权的工具调用的行为,从而系统评估不同防御策略的有效性,并提供大量攻击样本以促进对指令-数据分离问题的深入研究。

链接: https://arxiv.org/abs/2506.09956
作者: Sahar Abdelnabi,Aideen Fay,Ahmed Salem,Egor Zverev,Kai-Chieh Liao,Chi-Huang Liu,Chun-Chih Kuo,Jannis Weigend,Danyael Manlangit,Alex Apostolov,Haris Umair,João Donato,Masayuki Kawakita,Athar Mahboob,Tran Huu Bach,Tsun-Han Chiang,Myeongjin Cho,Hajin Choi,Byeonghyeon Kim,Hyeonjin Lee,Benjamin Pannell,Conor McCauley,Mark Russinovich,Andrew Paverd,Giovanni Cherubin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Dataset at: this https URL

点击查看摘要

Abstract:Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) to distinguish between instructions and data in their inputs. Despite numerous defense proposals, the systematic evaluation against adaptive adversaries remains limited, even when successful attacks can have wide security and privacy implications, and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research towards practical structural solutions to prompt injection.
zh

[AI-3] he Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability ICML2025

【速读】:该论文试图解决在信息不对称和知识迁移挑战下的在线学习问题,具体而言是探究是否可以利用非独立同分布(non-i.i.d.)的动作来识别混杂因素,同时实现知识的有效迁移。解决方案的关键在于提出了一种样本高效的算法,该算法能够在信息不对称环境下准确识别系统动态,并在强化学习中有效应对知识迁移的挑战,理论上实现了在O(1/ε²)的紧致样本复杂度下获得ε-最优策略。

链接: https://arxiv.org/abs/2506.09940
作者: Jiachen Hu,Rui Ai,Han Zhong,Xiaoyu Chen,Liwei Wang,Zhaoran Wang,Zhuoran Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at ICML 2025

点击查看摘要

Abstract:Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an ε-optimal policy with a tight sample complexity of O(1/ε²).
zh

[AI-4] SAFE: Multitask Failure Detection for Vision-Language-Action Models

【速读】:该论文试图解决视觉-语言-动作模型(Vision-Language-Action Models, VLAs)在部署到新任务时成功率有限的问题,特别是缺乏一种能够通用检测任务失败的故障检测器。现有故障检测器通常仅在特定任务上训练和测试,无法适应VLAs在未见过的任务和新环境中的需求。解决方案的关键在于提出SAFE,一个针对通用机器人策略(如VLAs)的故障检测器,其核心是利用VLAs内部特征学习任务失败的可能性,并通过预测单一标量值实现对任务失败的及时检测,从而提升机器人在复杂环境中的安全性和鲁棒性。

链接: https://arxiv.org/abs/2506.09937
作者: Qiao Gu,Yuanliang Ju,Shengxiang Sun,Igor Gilitschenski,Haruki Nishimura,Masha Itkina,Florian Shkurti
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, π₀, and π₀-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at this https URL.
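下面给出"由内部特征回归失败可能性标量 + 共形校准报警阈值"的最小示意;探测头结构、分数平滑方式与校准细节均为按摘要描述的假设性草图,并非 SAFE 的官方实现。

```python
import torch
import torch.nn as nn

class FailureHead(nn.Module):
    """从策略内部特征回归"失败可能性"标量的探测头示意(结构为假设)。"""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, feats):                  # feats: (B, T, D),逐步的内部特征
        step_scores = self.net(feats).squeeze(-1)          # (B, T) 每步分数
        steps = torch.arange(1, feats.shape[1] + 1, device=feats.device)
        return step_scores.cumsum(-1) / steps              # 运行均值作平滑分数

def conformal_threshold(success_scores, alpha=0.05):
    """在成功回合的校准分数上取分位数作为报警阈值(conformal 思路的简化)。"""
    n = success_scores.numel()
    q_level = min(1.0, (1 - alpha) * (n + 1) / n)
    return torch.quantile(success_scores, q_level).item()
```

推理时,一旦平滑分数越过校准阈值即触发报警,让机器人停止、回退或求助。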
zh

[AI-5] Causal Climate Emulation with Bayesian Filtering

【速读】:该论文试图解决传统气候模型计算成本高、难以高效预测气候变化及其成因与影响的问题。其解决方案的关键在于开发一种基于因果表征学习的可解释气候模型模拟器,该方法引入了包含贝叶斯滤波的物理信息策略,以实现稳定长期自回归模拟,并准确学习气候动力学。

链接: https://arxiv.org/abs/2506.09891
作者: Sebastian Hickman,Ilija Trajkovic,Julia Kaltenborn,Francis Pelletier,Alex Archibald,Yaniv Gurwicz,Peer Nowack,David Rolnick,Julien Boussard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 32 pages, 21 figures

点击查看摘要

Abstract:Traditional models of climate change use complex systems of coupled equations to simulate physical processes across the Earth system. These simulations are highly computationally expensive, limiting our predictions of climate change and analyses of its causes and effects. Machine learning has the potential to quickly emulate data from climate models, but current approaches are not able to incorporate physics-informed causal relationships. Here, we develop an interpretable climate model emulator based on causal representation learning. We derive a physics-informed approach including a Bayesian filter for stable long-term autoregressive emulation. We demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each one of its components on a realistic synthetic dataset and data from two widely deployed climate models.
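作为"自回归外推 + 贝叶斯滤波稳定化"思想的最小示意,下面给出线性卡尔曼滤波的一步更新并做滚动模拟;矩阵 A、H、Q、R 均为假设的玩具参数,并非论文所用的因果表征模型。

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """一步线性卡尔曼滤波:先按动力学 A 预测,再用观测 z 校正。"""
    x_pred = A @ x                        # 预测潜变量
    P_pred = A @ P @ A.T + Q
    S = H @ P_pred @ H.T + R              # 新息协方差
    K = P_pred @ H.T @ np.linalg.inv(S)   # 卡尔曼增益
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

d = 4
A, H = 0.95 * np.eye(d), np.eye(d)
Q, R = 0.01 * np.eye(d), 0.1 * np.eye(d)
x, P = np.zeros(d), np.eye(d)
for t in range(100):                      # 自回归滚动,滤波抑制误差累积
    z = A @ x + np.random.randn(d) * 0.1  # 玩具观测
    x, P = kalman_step(x, P, z, A, H, Q, R)
```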
zh

[AI-6] Stakeholder Participation for Responsible AI Development: Disconnects Between Guidance and Current Practice

【速读】:该论文试图解决当前负责任人工智能(Responsible AI, rAI)指导中提倡的利益相关者参与(Stakeholder Engagement, SHI)与商业软件开发实践中SHI的实际应用之间存在的潜在脱节问题。研究发现,现有的SHI实践主要由商业优先事项(如客户价值和合规性)驱动,而这些实践在很大程度上未能有效支持rAI目标。解决方案的关键在于识别并干预导致这种脱节的因素,提出针对性的干预措施和研究机会,以推动行业实践向rAI方向转变。

链接: https://arxiv.org/abs/2506.09873
作者: Emma Kallina,Thomas Bohné,Jat Singh
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published at the 2025 ACM Conference on Fairness, Accountability, and Transparency FAccT’25

点击查看摘要

Abstract:Responsible AI (rAI) guidance increasingly promotes stakeholder involvement (SHI) during AI development. At the same time, SHI is already common in commercial software development, but with potentially different foci. This study clarifies the extent to which established SHI practices are able to contribute to rAI efforts as well as potential disconnects – essential insights to inform and tailor future interventions that further shift industry practice towards rAI efforts. First, we analysed 56 rAI guidance documents to identify why SHI is recommended (i.e. its expected benefits for rAI) and uncovered goals such as redistributing power, improving socio-technical understandings, anticipating risks, and enhancing public oversight. To understand why and how SHI is currently practised in commercial settings, we then conducted an online survey (n=130) and semi-structured interviews (n=10) with AI practitioners. Our findings reveal that SHI in practice is primarily driven by commercial priorities (e.g. customer value, compliance) and several factors currently discourage more rAI-aligned SHI practices. This suggests that established SHI practices are largely not contributing to rAI efforts. To address this disconnect, we propose interventions and research opportunities to advance rAI development in practice.
zh

[AI-7] Guided Graph Compression for Quantum Graph Neural Networks

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在处理大规模图数据时面临的高内存需求和GPU上稀疏矩阵操作效率低的问题,同时探索量子计算(Quantum Computing, QC)在图数据处理中的应用潜力。其解决方案的关键在于提出一种名为Guided Graph Compression (GGC)的框架,该框架通过图自编码器(graph autoencoder)同时减少节点数量和节点特征的维度,并且压缩过程旨在提升下游分类任务的性能,从而为量子或经典分类器提供更高效的输入数据。

链接: https://arxiv.org/abs/2506.09862
作者: Mikel Casals,Vasilis Belis,Elias F. Combarro,Eduard Alarcón,Sofia Vallecorsa,Michele Grossi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are effective for processing graph-structured data but face challenges with large graphs due to high memory requirements and inefficient sparse matrix operations on GPUs. Quantum Computing (QC) offers a promising avenue to address these issues and inspires new algorithmic approaches. In particular, Quantum Graph Neural Networks (QGNNs) have been explored in recent literature. However, current quantum hardware limits the dimension of the data that can be effectively encoded. Existing approaches either simplify datasets manually or use artificial graph datasets. This work introduces the Guided Graph Compression (GGC) framework, which uses a graph autoencoder to reduce both the number of nodes and the dimensionality of node features. The compression is guided to enhance the performance of a downstream classification task, which can be applied either with a quantum or a classical classifier. The framework is evaluated on the Jet Tagging task, a classification problem of fundamental importance in high energy physics that involves distinguishing particle jets initiated by quarks from those by gluons. The GGC is compared against using the autoencoder as a standalone preprocessing step and against a baseline classical GNN classifier. Our numerical results demonstrate that GGC outperforms both alternatives, while also facilitating the testing of novel QGNN ansatzes on realistic datasets.
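下面以 DiffPool 式软指派给出"同时压缩节点数与特征维度"的最小示意(A' = SᵀAS,X' = SᵀXW)。这只是同类思路的替代性草图,并非 GGC 的确切结构;实际训练中压缩器与下游分类器(量子或经典)联合优化,即"guided"之意。

```python
import torch
import torch.nn as nn

class SoftPoolCompressor(nn.Module):
    """软指派图压缩示意:节点数 N -> k,特征维度 in_dim -> out_dim。"""
    def __init__(self, in_dim, out_dim, n_clusters):
        super().__init__()
        self.assign = nn.Linear(in_dim, n_clusters)  # 软聚类指派
        self.reduce = nn.Linear(in_dim, out_dim)     # 特征降维

    def forward(self, X, A):
        # X: (N, in_dim) 节点特征;A: (N, N) 邻接矩阵
        S = torch.softmax(self.assign(X), dim=-1)    # (N, k) 软指派矩阵
        X_c = S.T @ self.reduce(X)                   # 压缩后的节点特征 (k, out_dim)
        A_c = S.T @ A @ S                            # 压缩后的邻接 (k, k)
        return X_c, A_c
```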
zh

[AI-8] Superstudent intelligence in thermodynamics

【速读】:该论文试图探讨生成式 AI (Generative AI) 在复杂知识密集型任务中的表现是否能够超越人类,特别是针对需要深入理解和创造性应用热力学原理的考试。解决方案的关键在于将最新的热力学考试同时应用于学生和 OpenAI 的大型语言模型 o3,并采用与评估学生答案相同的标准对 o3 的答案进行评分。结果显示,o3 在零样本模式下正确解答了所有问题,其成绩达到了自1985年以来超过10,000次类似考试中的最佳水平,表明机器在复杂任务上的表现已接近甚至超越人类。

链接: https://arxiv.org/abs/2506.09822
作者: Rebecca Loubet,Pascal Zittlau,Marco Hoffmann,Luisa Vollmer,Sophie Fellenz,Heike Leitte,Fabian Jirasek,Johannes Lenhard,Hans Hasse
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: This document is the unedited Author’s version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

点击查看摘要

Abstract:In this short note, we report and analyze a striking event: OpenAI’s large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students’ exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI’s most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.
zh

[AI-9] A theoretical framework for self-supervised contrastive learning for continuous dependent data

【速读】:该论文旨在解决自监督学习(SSL)在依赖性数据(如时间序列和时空领域)中的应用问题,特别是传统对比SSL方法假设样本之间语义独立,这一假设在存在复杂相关性的依赖数据中并不成立。其解决方案的关键在于提出一种针对连续依赖数据的对比SSL理论框架,引入了“硬接近”和“软接近”两种真实相似性度量,并推导出能够兼容这两种接近类型的估计相似性矩阵,从而设计出依赖感知的损失函数。该方法在时间序列和时空下游任务中表现出色,显著优于现有方法。

链接: https://arxiv.org/abs/2506.09785
作者: Alexander Marusov,Alexander Yuhay,Alexey Zaytsev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has emerged as a powerful approach to learning representations, particularly in the field of computer vision. However, its application to dependent data, such as temporal and spatio-temporal domains, remains underexplored. Besides, traditional contrastive SSL methods often assume semantic independence between samples, which does not hold for dependent data exhibiting complex correlations. We propose a novel theoretical framework for contrastive SSL tailored to continuous dependent data, which allows the nearest samples to be semantically close to each other. In particular, we propose two possible ground truth similarity measures between objects: hard and soft closeness. Under this framework, we derive an analytical form for the estimated similarity matrix that accommodates both types of closeness between samples, thereby introducing dependency-aware loss functions. We validate our approach, Dependent TS2Vec, on temporal and spatio-temporal downstream problems. Given the dependency patterns presented in the data, our approach surpasses recent methods for dependent data, highlighting the effectiveness of our theoretically grounded loss functions for SSL in capturing spatio-temporal dependencies. Specifically, we outperform TS2Vec on the standard UEA and UCR benchmarks, with accuracy improvements of 4.17% and 2.08%, respectively. Furthermore, on the drought classification task, which involves complex spatio-temporal patterns, our method achieves a 7% higher ROC-AUC score.
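下面给出"软接近度"对比损失的一个 PyTorch 示意:时间上相邻的样本按高斯核权重获得非零的目标概率,而不是被一律当作负例。核形式与超参均为假设,仅用于说明依赖感知损失的构造思路。

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(z, t, tau=0.1, sigma=5.0):
    """依赖感知对比损失示意。z: (B, d) 批内样本表示;t: (B,) 对应时间戳。"""
    z = F.normalize(z, dim=-1)
    logits = z @ z.T / tau                    # 相似度打分
    logits.fill_diagonal_(-1e9)               # 排除自身
    dist2 = (t[:, None] - t[None, :]).float() ** 2
    w = torch.exp(-dist2 / (2 * sigma ** 2))  # 时间越近,"软接近度"越高
    w.fill_diagonal_(0)
    target = w / w.sum(-1, keepdim=True).clamp_min(1e-8)  # 行归一化为软标签
    log_p = F.log_softmax(logits, dim=-1)
    return -(target * log_p).sum(-1).mean()   # 软标签交叉熵
```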
zh

[AI-10] Load-Aware Training Scheduling for Model Circulation-based Decentralized Federated Learning

【速读】:该论文旨在解决去中心化联邦学习中由于计算和通信负载不平衡导致的训练时间过长问题(training time in decentralized federated learning)。其解决方案的关键在于提出一种负载感知的Tram-FL(Load-aware Tram-FL),通过引入训练调度机制,将调度问题建模为全局优化任务,并通过分解为节点级子问题来实现求解,同时引入方差约束以促进非独立同分布(non-IID)数据下的均衡数据利用,从而最小化包括计算和通信成本在内的整体训练延迟。

链接: https://arxiv.org/abs/2506.09769
作者: Haruki Kainuma,Takayuki Nishio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, submitted to IEEE Globecom 2025 (under review)

点击查看摘要

Abstract:This paper proposes Load-aware Tram-FL, an extension of Tram-FL that introduces a training scheduling mechanism to minimize total training time in decentralized federated learning by accounting for both computational and communication loads. The scheduling problem is formulated as a global optimization task, which-though intractable in its original form-is made solvable by decomposing it into node-wise subproblems. To promote balanced data utilization under non-IID distributions, a variance constraint is introduced, while the overall training latency, including both computation and communication costs, is minimized through the objective function. Simulation results on MNIST and CIFAR-10 demonstrate that Load-aware Tram-FL significantly reduces training time and accelerates convergence compared to baseline methods.
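下面是将全局调度分解为逐节点贪心选择的简化示意:每一步选择"传输 + 本地训练"时延最小的下一节点,并用访问次数方差约束促进数据均衡利用。时延模型与约束形式均为假设,并非论文算法本身。

```python
import numpy as np

def schedule_route(comp, comm, rounds, var_limit=4.0):
    """comp[i]: 节点 i 的本地训练时延;comm[i][j]: i->j 的模型传输时延。"""
    n = len(comp)
    route, cur = [0], 0
    visits = np.zeros(n)
    visits[0] = 1
    for _ in range(rounds - 1):
        best, best_cost = None, np.inf
        for j in range(n):
            trial = visits.copy()
            trial[j] += 1
            if trial.var() > var_limit:      # 方差约束:避免总挑"最快"节点
                continue
            cost = comm[cur][j] + comp[j]    # 传输时延 + 本地训练时延
            if cost < best_cost:
                best, best_cost = j, cost
        if best is None:                     # 全部违反约束时,退回访问最少的节点
            best = int(visits.argmin())
        route.append(best)
        visits[best] += 1
        cur = best
    return route
```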
zh

[AI-11] Intelligent Design 4.0: Paradigm Evolution Toward the Agent ic AI Era

【速读】:该论文试图解决工程设计过程中如何通过智能化手段提升创新性、效率、质量和生产力的问题,特别是在面对日益复杂的设计挑战时,如何实现更高效和自动化的设计流程。其解决方案的关键在于提出Intelligent Design 4.0(ID 4.0)这一新兴范式,该范式由代理型AI系统(agentic AI systems)驱动,旨在通过协调的、自主的多智能体系统实现工程设计过程的端到端自动化。

链接: https://arxiv.org/abs/2506.09755
作者: Shuo Jiang,Min Xie,Frank Youhua Chen,Jian Ma,Jianxi Luo
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Research and practice in Intelligent Design (ID) have significantly enhanced engineering innovation, efficiency, quality, and productivity over recent decades, fundamentally reshaping how engineering designers think, behave, and interact with design processes. The recent emergence of Foundation Models (FMs), particularly Large Language Models (LLMs), has demonstrated general knowledge-based reasoning capabilities, and opened new paths and avenues for further transformation in engineering design. In this context, this paper introduces Intelligent Design 4.0 (ID 4.0) as an emerging paradigm empowered by agentic AI systems. We review the historical evolution of ID across four distinct stages: rule-based expert systems, task-specific machine learning models, large-scale foundation AI models, and the recent emerging paradigm of multi-agent collaboration. We propose a conceptual framework for ID 4.0 and discuss its potential to support end-to-end automation of engineering design processes through coordinated, autonomous multi-agent-based systems. Furthermore, we discuss future perspectives to enhance and fully realize ID 4.0's potential, including more complex design scenarios, more practical design implementations, novel agent coordination mechanisms, and autonomous design goal-setting with better human value alignment. In sum, these insights lay a foundation for advancing Intelligent Design toward greater adaptivity, autonomy, and effectiveness in addressing increasingly complex design challenges.
zh

[AI-12] Large Language Models for Design Structure Matrix Optimization

【速读】:该论文试图解决复杂工程系统中设计结构矩阵(Design Structure Matrix, DSM)元素重新排序以减少反馈回路、提高模块化或流程效率的组合优化(Combinatorial Optimization, CO)问题。解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的框架,该框架将网络拓扑与上下文领域知识相结合,实现DSM元素序列的迭代优化。实验结果表明,该方法在收敛速度和解的质量上均优于传统随机和确定性基线,并且上下文领域知识的引入显著提升了优化性能。

链接: https://arxiv.org/abs/2506.09749
作者: Shuo Jiang,Min Xie,Jianxi Luo
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In complex engineering systems, the interdependencies among components or development activities are often modeled and analyzed using Design Structure Matrix (DSM). Reorganizing elements within a DSM to minimize feedback loops and enhance modularity or process efficiency constitutes a challenging combinatorial optimization (CO) problem in engineering design and operations. As problem sizes increase and dependency networks become more intricate, traditional optimization methods that solely use mathematical heuristics often fail to capture the contextual nuances and struggle to deliver effective solutions. In this study, we explore the potential of Large Language Models (LLMs) for helping solve such CO problems by leveraging their capabilities for advanced reasoning and contextual understanding. We propose a novel LLM-based framework that integrates network topology with contextual domain knowledge for iterative optimization of DSM element sequencing - a common CO problem. Experiments on various DSM cases show that our method consistently achieves faster convergence and superior solution quality compared to both stochastic and deterministic baselines. Notably, we find that incorporating contextual domain knowledge significantly enhances optimization performance regardless of the chosen LLM backbone. These findings highlight the potential of LLMs to solve complex engineering CO problems by combining semantic and mathematical reasoning. This approach paves the way towards a new paradigm in LLM-based engineering design optimization.
zh
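
下面用一个玩具 DSM 说明该组合优化目标的量化方式:给定元素排序后,反馈回路可用上三角(排在前面的元素依赖排在后面的元素)的依赖数近似度量。示意中以小规模穷举代替论文里由 LLM 结合领域知识迭代提出候选排序的环节;矩阵 D 与问题规模均为假设。

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
n = 6
D = (rng.random((n, n)) < 0.3).astype(int)   # D[i, j] = 1 表示元素 i 依赖元素 j
np.fill_diagonal(D, 0)

def feedback(order):
    # 按给定排序重排 DSM,统计上三角依赖数,即需要回溯的“反馈”依赖
    P = D[np.ix_(order, order)]
    return int(np.triu(P, 1).sum())

best = min(permutations(range(n)), key=feedback)  # 小规模可穷举;大规模时由 LLM 提出候选排序
print(list(best), feedback(best))
```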

[AI-13] Feature Engineering for Agents: An Adaptive Cognitive Architecture for Interpretable ML Monitoring AAMAS2025

【速读】:该论文试图解决机器学习(Machine Learning, ML)模型在生产环境中监控时,传统方法产生的输出冗长且可解释性差,从而影响有效决策的问题。解决方案的关键在于提出一种基于大型语言模型(Large Language Models, LLMs)的代理特征工程认知架构,其中核心模块为决策流程(Decision Procedure),通过“重构(Refactor)、分解(Break Down)、整合(Compile)”三个关键步骤实现对监控数据的优化处理,提升输出的可解释性,并减少对LLM生成规划的依赖,从而构建一个具有高度可解释性和可操作性的决策支持系统。

链接: https://arxiv.org/abs/2506.09742
作者: Gusseppe Bravo-Rocca,Peini Liu,Jordi Guitart,Rodrigo M Carrillo-Larco,Ajay Dholakia,David Ellison
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at AAMAS 2025

点击查看摘要

Abstract:Monitoring Machine Learning (ML) models in production environments is crucial, yet traditional approaches often yield verbose, low-interpretability outputs that hinder effective decision-making. We propose a cognitive architecture for ML monitoring that applies feature engineering principles to agents based on Large Language Models (LLMs), significantly enhancing the interpretability of monitoring outputs. Central to our approach is a Decision Procedure module that simulates feature engineering through three key steps: Refactor, Break Down, and Compile. The Refactor step improves data representation to better capture feature semantics, allowing the LLM to focus on salient aspects of the monitoring data while reducing noise and irrelevant information. Break Down decomposes complex information for detailed analysis, and Compile integrates sub-insights into clear, interpretable outputs. This process leads to a more deterministic planning approach, reducing dependence on LLM-generated planning, which can sometimes be inconsistent and overly general. The combination of feature engineering-driven planning and selective LLM utilization results in a robust decision support system, capable of providing highly interpretable and actionable insights. Experiments using multiple LLMs demonstrate the efficacy of our approach, achieving significantly higher accuracy compared to various baselines across several domains.
zh

[AI-14] TRIDENT: Temporally Restricted Inference via DFA-Enhanced Neural Traversal

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在生成输出时无法确保满足时间约束的问题,特别是那些可以用有限迹线上的线性时序逻辑(LTLf)表达的约束。解决方案的关键在于提出TRIDENT算法,该算法在推理阶段通过将LTLf公式编译为确定性有限自动机(Deterministic Finite Automaton, DFA),并利用DFA引导一种受限的束搜索变体,从而保证生成结果符合给定的时间约束。在解码过程中,可能导致约束违反的转移被屏蔽,剩余路径则根据模型概率和DFA接受结构进行动态重排序,从而确保生成序列严格满足LTLf约束。

链接: https://arxiv.org/abs/2506.09701
作者: Vincenzo Collura,Karim Tit,Laura Bussi,Eleonora Giunchiglia,Maxime Cordy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and other neural architectures have achieved impressive results across a variety of generative and classification tasks. However, they remain fundamentally ill-equipped to ensure that their outputs satisfy temporal constraints, such as those expressible in Linear Temporal Logic over finite traces (LTLf). In this paper, we introduce TRIDENT: a general and model-agnostic inference-time algorithm that guarantees compliance with such constraints without requiring any retraining. TRIDENT compiles LTLf formulas into a Deterministic Finite Automaton (DFA), which is used to guide a constrained variant of beam search. At each decoding step, transitions that would lead to constraint violations are masked, while remaining paths are dynamically re-ranked based on both the model’s probabilities and the DFA’s acceptance structure. We formally prove that the resulting sequences are guaranteed to satisfy the given LTLf constraints, and we empirically demonstrate that TRIDENT also improves output quality. We validate our approach on two distinct tasks: temporally constrained image-stream classification and controlled text generation. In both settings, TRIDENT achieves perfect constraint satisfaction, while comparison with the state of the art shows improved efficiency and high standard quality metrics.
zh
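
TRIDENT 的核心机制(用 DFA 屏蔽会导致约束违反的转移、再按模型概率重排)可用如下最小示意说明:其中 DFA、两词词表与“模型下一词分布”均为假设的玩具设定,束宽取 1;真实实现中是对束搜索的每条路径同时维护 DFA 状态。

```python
# 玩具 DFA:约束“字符 b 至少出现一次”。状态 0 表示尚未见到 b,状态 1(接受态)表示已见到。
trans = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}
accept = {1}

def can_accept(state, steps_left):
    # 剩余 steps_left 步内能否到达接受态(玩具规模下直接递归判定)
    if steps_left == 0:
        return state in accept
    return any(can_accept(trans[(state, t)], steps_left - 1) for t in "ab")

def constrained_step(state, probs, steps_left):
    # 屏蔽会使约束必然违反的 token,再按模型概率选取(束宽为 1 的示意)
    scores = {t: p for t, p in probs.items()
              if can_accept(trans[(state, t)], steps_left - 1)}
    tok = max(scores, key=scores.get)
    return tok, trans[(state, tok)]

state, out, L = 0, [], 4
for i in range(L):
    probs = {"a": 0.9, "b": 0.1}        # 假想的模型下一词分布:强烈偏好 a
    tok, state = constrained_step(state, probs, L - i)
    out.append(tok)
print("".join(out), state in accept)    # 输出 aaab True:最后一步被迫选 b 以满足约束
```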

[AI-15] Empirical Quantification of Spurious Correlations in Malware Detection

【速读】:该论文试图解决深度学习在恶意软件检测中依赖虚假相关性(spurious correlations)的问题,这种相关性在推理时表现出高相关性,但根据领域知识实际上无用。解决方案的关键在于通过分析编译器留下的空闲空间对模型决策的影响,揭示深度学习模型如何依赖这些非语义特征,从而削弱编译后代码的相关性。研究通过小规模平衡数据集进行初步分析,并引入两种端到端模型的排名以评估其在生产环境中的适用性。

链接: https://arxiv.org/abs/2506.09662
作者: Bianca Perasso,Ludovico Lozza,Andrea Ponte,Luca Demetrio,Luca Oneto,Fabio Roli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:End-to-end deep learning exhibits unmatched performance for detecting malware, but such an achievement is reached by exploiting spurious correlations – features with high relevance at inference time, but known to be useless through domain knowledge. While previous work highlighted that deep networks mainly focus on metadata, none investigated the phenomenon further, without quantifying their impact on the decision. In this work, we deepen our understanding of how spurious correlation affects deep learning for malware detection by highlighting how much models rely on empty spaces left by the compiler, which diminishes the relevance of the compiled code. Through our seminal analysis on a small-scale balanced dataset, we introduce a ranking of two end-to-end models to better understand which is more suitable to be put in production.
zh

[AI-16] Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

【速读】:该论文试图解决在复杂环境中多智能体自主决策与任务协作中出现的价值对齐(value alignment)问题,旨在确保智能体的目标、偏好和行为与人类价值观和社会规范保持一致。其解决方案的关键在于整合大模型驱动的AI进展与社会治理需求,从价值原则的层级化组织、智能体系统应用场景的分类分析、价值对齐评估方法的系统性研究,以及多智能体间的价值协调等多个维度进行综合探讨。

链接: https://arxiv.org/abs/2506.09656
作者: Wei Zeng,Hengshu Zhu,Chuan Qin,Han Wu,Yihang Cheng,Sirui Zhang,Xiaowei Jin,Yinuo Shen,Zhenxing Wang,Feimin Zhong,Hui Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ongoing evolution of AI paradigms has propelled AI research into the Agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasingly situational and systemic risks. This has brought significant attention to value alignment for AI agents, which aims to ensure that an agent’s goals, preferences, and behaviors align with human values and societal norms. This paper reviews value alignment in agent systems within specific application scenarios. It integrates the advancements in AI driven by large models with the demands of social governance. Our review covers value principles, agent system application scenarios, and agent value alignment evaluation. Specifically, value principles are organized hierarchically from a top-down perspective, encompassing macro, meso, and micro levels. Agent system application scenarios are categorized and reviewed from a general-to-specific viewpoint. Agent value alignment evaluation systematically examines datasets for value alignment assessment and relevant value alignment methods. Additionally, we delve into value coordination among multiple agents within agent systems. Finally, we propose several potential research directions in this field.
zh

[AI-17] DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy ICML2025

【速读】:该论文试图解决多玩家战略游戏Diplomacy中AI系统面临的复杂协作与竞争问题,尤其是传统方法在生成训练数据时对计算资源的高需求。解决方案的关键在于提出DipLLM,一个基于微调大型语言模型(Large Language Model, LLM)的智能体,其通过自回归因子分解框架将多单位动作分配任务简化为一系列单元级决策,并以均衡策略作为学习目标,从而在仅使用1.5%的训练数据情况下超越了当前最先进的Cicero模型。

链接: https://arxiv.org/abs/2506.09655
作者: Kaixuan Xu,Jiajun Chai,Sicheng Li,Yuqian Fu,Yuanheng Zhu,Dongbin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the 42nd International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
zh
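
DipLLM 的“自回归因子分解”思想,即把联合策略 p(a1,…,aN) 拆成 ∏ p(ai | a1,…,a(i-1)) 并逐单位决策,可用如下骨架示意。UNITS、LEGAL 与 unit_policy 都是本文假设的占位,真实系统中条件分布由微调后的 LLM 给出。

```python
import random

UNITS = ["A PAR", "F BRE", "A MAR"]                  # 假想的三个单位
LEGAL = {u: ["hold", "move", "support"] for u in UNITS}

def unit_policy(unit, prefix):
    # 占位:按(单位, 已决策前缀)确定性地伪造一个动作分布,代替 LLM 的条件输出
    r = random.Random(hash((unit, tuple(prefix))))
    w = [r.random() for _ in LEGAL[unit]]
    s = sum(w)
    return {a: x / s for a, x in zip(LEGAL[unit], w)}

def joint_action():
    # 自回归因子分解:p(a1..aN) = ∏ p(ai | a1..a(i-1)),逐单位贪心选取
    prefix = []
    for u in UNITS:
        dist = unit_policy(u, prefix)
        prefix.append((u, max(dist, key=dist.get)))
    return prefix

print(joint_action())
```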

[AI-18] Tightly-Coupled LiDAR-IMU-Leg Odometry with Online Learned Leg Kinematics Incorporating Foot Tactile Information

【速读】:该论文旨在解决在无特征环境和可变形地形下,四足机器人里程计估计不准确的问题。其关键解决方案是提出了一种基于在线学习的腿部运动学模型——神经腿部运动学模型(neural leg kinematics model),该模型通过融合触觉信息(足端反作用力)来隐式表达机器人足部与地面之间的非线性动力学关系,并结合神经自适应腿部里程计因子和运动预测的在线不确定性估计,在统一的因子图中联合优化模型训练与里程计估计,从而保持两者的一致性。

链接: https://arxiv.org/abs/2506.09548
作者: Taku Okawara,Kenji Koide,Aoki Takanose,Shuji Oishi,Masashi Yokozuka,Kentaro Uno,Kazuya Yoshida
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Robotics and Automation Letters

点击查看摘要

Abstract:In this letter, we present tightly coupled LiDAR-IMU-leg odometry, which is robust to challenging conditions such as featureless environments and deformable terrains. We developed an online learning-based leg kinematics model named the neural leg kinematics model, which incorporates tactile information (foot reaction force) to implicitly express the nonlinear dynamics between robot feet and the ground. Online training of this model enhances its adaptability to weight load changes of a robot (e.g., assuming delivery or transportation tasks) and terrain conditions. According to the neural adaptive leg odometry factor and online uncertainty estimation of the leg kinematics model-based motion predictions, we jointly solve online training of this kinematics model and odometry estimation on a unified factor graph to retain the consistency of both. The proposed method was verified through real experiments using a quadruped robot in two challenging situations: 1) a sandy beach, representing an extremely featureless area with a deformable terrain, and 2) a campus, including multiple featureless areas and terrain types of asphalt, gravel (deformable terrain), and grass. Experimental results showed that our odometry estimation incorporating the neural leg kinematics model outperforms state-of-the-art works. Our project page is available for further details: this https URL
zh

[AI-19] Neural Functions for Learning Periodic Signal

【速读】:该论文试图解决基于坐标的多层感知机(MLP)在学习信号时存在的过拟合问题以及超出训练区域的泛化能力有限的问题,从而导致外推性能不佳。解决方案的关键在于提出一种新的网络架构,该架构能够从测量数据中提取周期性模式,并利用这些信息来表示信号,从而提升模型的泛化能力和外推性能。

链接: https://arxiv.org/abs/2506.09526
作者: Woojin Cho,Minju Jo,Kookjin Lee,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As function approximators, deep neural networks have served as an effective tool to represent various signal types. Recent approaches utilize multi-layer perceptrons (MLPs) to learn a nonlinear mapping from a coordinate to its corresponding signal, facilitating the learning of continuous neural representations from discrete data points. Despite notable successes in learning diverse signal types, coordinate-based MLPs often face issues of overfitting and limited generalizability beyond the training region, resulting in subpar extrapolation performance. This study addresses scenarios where the underlying true signals exhibit periodic properties, either spatially or temporally. We propose a novel network architecture, which extracts periodic patterns from measurements and leverages this information to represent the signal, thereby enhancing generalization and improving extrapolation performance. We demonstrate the efficacy of the proposed method through comprehensive experiments, including the learning of the periodic solutions for differential equations, and time series imputation (interpolation) and forecasting (extrapolation) on real-world datasets.
zh
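
按“从测量中提取周期模式并用于表示信号”的思想,下面给出一个可学习频率的坐标网络最小示意(PyTorch,非官方实现):频率个数、隐层宽度与训练配置均为假设;由于特征本身是周期的,外推到训练区间之外时比普通坐标 MLP 更稳。

```python
import torch
import torch.nn as nn

class PeriodicMLP(nn.Module):
    def __init__(self, n_freq=8, hidden=64):
        super().__init__()
        self.freq = nn.Parameter(torch.rand(n_freq) * 6.0)  # 可学习的频率
        self.head = nn.Sequential(
            nn.Linear(2 * n_freq, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):                                   # x: (N, 1) 坐标
        phase = x * self.freq                               # 广播为 (N, n_freq)
        feats = torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)
        return self.head(feats)

x = torch.linspace(0, 6.28, 200).unsqueeze(-1)
y = torch.sin(3 * x)                                        # 玩具周期信号
model = PeriodicMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
print(float(loss))
```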

[AI-20] Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design

【速读】:该论文旨在解决在一般马尔可夫决策过程(Markov Decision Processes, MDPs)中,通过轨迹级偏好比较进行强化学习的问题,其核心挑战是设计能够选择有信息量的偏好查询以识别潜在奖励函数的算法,同时保证理论上的性能保障。论文提出的解决方案的关键在于采用基于随机探索的元算法,该算法避免了乐观方法相关的计算挑战,并保持了可处理性;此外,通过引入批量轨迹对收集和最优实验设计来提升查询复杂度,从而实现高效的偏好查询选择与并行化处理。

链接: https://arxiv.org/abs/2506.09508
作者: Andreas Schlaginhaufen,Reda Ouhamma,Maryam Kamgarpour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in practical deployment as feedback can be gathered concurrently. Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries.
zh

[AI-21] A Unified Theory of Compositionality, Modularity, and Interpretability in Markov Decision Processes

【速读】:该论文试图解决在高维、动态环境中的可验证长周期规划与内在动机问题,特别是在传统基于奖励最大化的强化学习方法中,难以兼顾组合性、模块化和可解释性的问题。解决方案的关键在于引入选项核贝尔曼方程(Option Kernel Bellman Equations, OKBEs),其通过直接构建和优化一种称为状态-时间选项核(state-time option kernel, STOK)的预测映射,以最大化完成目标的概率并避免约束违规。STOK具备组合性、模块化和可解释性,能够支持多策略的时空预测、高效表示高维状态转移,并记录语义可解释的目标成功与约束违反事件,从而实现目标级的前向规划与元策略快速合成。

链接: https://arxiv.org/abs/2506.09499
作者: Thomas J. Ringstrom,Paul R. Schrater
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 Pages

点击查看摘要

Abstract:We introduce Option Kernel Bellman Equations (OKBEs) for a new reward-free Markov Decision Process. Rather than a value function, OKBEs directly construct and optimize a predictive map called a state-time option kernel (STOK) to maximize the probability of completing a goal while avoiding constraint violations. STOKs are compositional, modular, and interpretable initiation-to-termination transition kernels for policies in the Options Framework of Reinforcement Learning. This means: 1) STOKs can be composed using Chapman-Kolmogorov equations to make spatiotemporal predictions for multiple policies over long horizons, 2) high-dimensional STOKs can be represented and computed efficiently in a factorized and reconfigurable form, and 3) STOKs record the probabilities of semantically interpretable goal-success and constraint-violation events, needed for formal verification. Given a high-dimensional state-transition model for an intractable planning problem, we can decompose it with local STOKs and goal-conditioned policies that are aggregated into a factorized goal kernel, making it possible to forward-plan at the level of goals in high-dimensions to solve the problem. These properties lead to highly flexible agents that can rapidly synthesize meta-policies, reuse planning representations across many tasks, and justify goals using empowerment, an intrinsic motivation function. We argue that reward-maximization is in conflict with the properties of compositionality, modularity, and interpretability. Alternatively, OKBEs facilitate these properties to support verifiable long-horizon planning and intrinsic motivation that scales to dynamic high-dimensional world-models.
zh
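
STOK 的组合性可以直接用 Chapman-Kolmogorov 方程演示:两个“起始态到终止态”转移核的矩阵乘积,就是顺序执行两个选项后的时空预测核。下面的 4 状态核为假设的玩具数据。

```python
import numpy as np

# 两个选项的“起始 -> 终止”转移核(行随机矩阵),4 个状态的玩具设定
K1 = np.array([[0.0, 0.9, 0.1, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0, 1.0]])
K2 = np.array([[0.0, 0.0, 0.0, 1.0],
               [0.0, 0.0, 0.5, 0.5],
               [0.0, 0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0, 1.0]])

K12 = K1 @ K2            # Chapman-Kolmogorov:先执行选项 1、再执行选项 2 的复合预测
print(K12[0])            # 从状态 0 出发、顺序执行两个选项后到达各状态的概率
print(K12.sum(axis=1))   # 行和仍为 1:复合结果仍是合法的转移核
```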

[AI-22] Fast Monte Carlo Tree Diffusion: 100x Speedup via Parallel Sparse Planning

【速读】:该论文旨在解决扩散模型在长时程推理任务中因固有的非序列特性而效果受限的问题,以及现有方法如蒙特卡洛树扩散(MCTD)在计算效率上的不足。其解决方案的关键在于提出Fast-MCTD,该方法通过集成并行MCTD和稀疏MCTD两项技术,实现了计算效率的显著提升,同时保持或增强了规划性能。

链接: https://arxiv.org/abs/2506.09498
作者: Jaesik Yoon,Hyeonseo Cho,Yoshua Bengio,Sungjin Ahn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
zh

[AI-23] EnerBridge-DPO: Energy-Guided Protein Inverse Folding with Markov Bridges and Direct Preference Optimization

【速读】:该论文旨在解决蛋白质逆折叠中生成序列的热力学稳定性不足的问题,即现有深度学习方法主要通过最大化序列恢复率进行训练,而往往忽视了生成序列的能量特性。其解决方案的关键在于提出EnerBridge-DPO框架,该框架通过将马尔可夫桥(Markov Bridge)与直接偏好优化(Direct Preference Optimization, DPO)相结合,并引入显式的能量约束损失函数,从而实现对低能、高稳定性的蛋白质序列的直接生成。

链接: https://arxiv.org/abs/2506.09496
作者: Dingyi Rong,Haotian Lu,Wenzhuo Zheng,Fan Zhang,Shuangjia Zheng,Ning Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing protein sequences with optimal energetic stability is a key challenge in protein inverse folding, as current deep learning methods are primarily trained by maximizing sequence recovery rates, often neglecting the energy of the generated sequences. This work aims to overcome this limitation by developing a model that directly generates low-energy, stable protein sequences. We propose EnerBridge-DPO, a novel inverse folding framework focused on generating low-energy, high-stability protein sequences. Our core innovation lies in: First, integrating Markov Bridges with Direct Preference Optimization (DPO), where energy-based preferences are used to fine-tune the Markov Bridge model. The Markov Bridge initiates optimization from an information-rich prior sequence, providing DPO with a pool of structurally plausible sequence candidates. Second, an explicit energy constraint loss is introduced, which enhances the energy-driven nature of DPO based on prior sequences, enabling the model to effectively learn energy representations from a wealth of prior knowledge and directly predict sequence energy values, thereby capturing quantitative features of the energy landscape. Our evaluations demonstrate that EnerBridge-DPO can design protein complex sequences with lower energy while maintaining sequence recovery rates comparable to state-of-the-art models, and accurately predicts ΔΔG values between various sequences.
zh
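
“以能量为偏好信号微调”的核心是标准 DPO 损失,只是偏好对由能量高低决定:能量更低的序列记为 w(偏好),更高的记为 l。下面是按此思想写的最小示意(PyTorch),数值与 β 均为假设,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def energy_dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # logp_*: 策略模型对序列的对数似然;ref_*: 参考(先验桥)模型的对数似然
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()

# 假想的一个 batch:策略模型越偏好低能量序列,损失越小
logp_w = torch.tensor([-10.0, -12.0]); ref_w = torch.tensor([-11.0, -12.5])
logp_l = torch.tensor([-11.5, -12.2]); ref_l = torch.tensor([-11.0, -12.0])
print(float(energy_dpo_loss(logp_w, logp_l, ref_w, ref_l)))
```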

[AI-24] BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

【速读】:该论文旨在解决高保真与长时音频生成中传统生成对抗网络(GAN)声码器在建模周期性结构和长程依赖关系方面的局限性。其解决方案的关键在于对生成器和判别器架构的创新设计:生成器采用抗混叠多周期性组合(AMP)模块替代传统ResBlock,并引入Snake激活函数以更有效地建模周期性特征;判别器则结合多包络判别器(MED)与多分辨率判别器(MRD),从而增强对音频时序包络特征和长程依赖关系的捕捉能力。

链接: https://arxiv.org/abs/2506.09487
作者: Taesoo Park,Mungwi Jeong,Mingyu Park,Narae Kim,Junyoung Kim,Mujung Kim,Jisang Yoo,Hoyun Lee,Sanghoon Kim,Soonchul Kwon
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Audio and Speech Processing (eess.AS)
备注: 11 pages, 7 figures. Survey and tutorial paper. Currently under review at ICT Express as an extended version of our ICAIIC 2025 paper

点击查看摘要

Abstract:This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: this https URL.
zh
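
AMP 模块内部使用的 Snake 激活有闭式定义 snake(x) = x + sin²(αx)/α,兼具单调趋势与内建周期性,便于声码器拟合基频及其谐波结构,可直接数值演示:

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake 激活:x + sin^2(alpha * x) / alpha
    return x + np.sin(alpha * x) ** 2 / alpha

x = np.linspace(-4, 4, 9)
print(snake(x, alpha=2.0))   # 整体随 x 单调上升,叠加周期为 pi/alpha 的波动
```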

[AI-25] Adv-BMT: Bidirectional Motion Transformer for Safety-Critical Traffic Scenario Generation

【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)系统在场景化测试中面临的长尾安全关键场景数据稀缺问题。为应对这一挑战,作者提出了Adv-BMT框架,其关键在于采用双向运动变压器(Bidirectional Motion Transformer, BMT)模型进行逆向交通运动预测,通过输入场景最后时间步的智能体信息,逆序重构交通场景至初始时间步,从而生成多样且真实的对抗性交互场景。

链接: https://arxiv.org/abs/2506.09485
作者: Yuxin Liu,Zhenghao Peng,Xuanhao Cui,Bolei Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Scenario-based testing is essential for validating the performance of autonomous driving (AD) systems. However, such testing is limited by the scarcity of long-tailed, safety-critical scenarios in existing datasets collected in the real world. To tackle the data issue, we propose the Adv-BMT framework, which augments real-world scenarios with diverse and realistic adversarial interactions. The core component of Adv-BMT is a bidirectional motion transformer (BMT) model to perform inverse traffic motion predictions, which takes agent information in the last time step of the scenario as input, and reconstructs the traffic in the inverse of chronological order until the initial time step. The Adv-BMT framework is a two-stage pipeline: it first conducts adversarial initializations and then inverse motion predictions. Different from previous work, we do not need any collision data for pretraining, and are able to generate realistic and diverse collision interactions. Our experimental results validate the quality of generated collision scenarios by Adv-BMT: training on our augmented dataset would reduce episode collision rates by 20% compared to previous work.
zh

[AI-26] Abstraction-Based Proof Production in Formal Verification of Neural Networks

【速读】:该论文试图解决当前基于抽象的深度神经网络(Deep Neural Network, DNN)验证工具与可证明保证之间的差距问题,即现有的可产生证明的验证器不支持基于抽象的推理,导致在可扩展性与可证明性之间存在矛盾。解决方案的关键在于引入一种新的框架,将验证任务模块化地分为两个部分:(i) 证明抽象网络的正确性,以及 (ii) 证明抽象相对于原始DNN的合理性。其中,前者可以由现有的可产生证明的验证器处理,而后者则通过提出的一种首次用于生成形式化证明的方法来实现。该工作旨在通过在形式化证明框架内支持常见的抽象技术,实现可扩展且可信的验证。

链接: https://arxiv.org/abs/2506.09455
作者: Yizhak Yisrael Elboher,Omri Isac,Guy Katz,Tobias Ladner,Haoze Wu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: To appear in SAIV 2025

点击查看摘要

Abstract:Modern verification tools for deep neural networks (DNNs) increasingly rely on abstraction to scale to realistic architectures. In parallel, proof production is becoming a critical requirement for increasing the reliability of DNN verification results. However, current proof-producing verifiers do not support abstraction-based reasoning, creating a gap between scalability and provable guarantees. We address this gap by introducing a novel framework for proof-producing abstraction-based DNN verification. Our approach modularly separates the verification task into two components: (i) proving the correctness of an abstract network, and (ii) proving the soundness of the abstraction with respect to the original DNN. The former can be handled by existing proof-producing verifiers, whereas we propose the first method for generating formal proofs for the latter. This preliminary work aims to enable scalable and trustworthy verification by supporting common abstraction techniques within a formal proof framework.
zh

[AI-27] When Is Diversity Rewarded in Cooperative Multi-Agent Learning?

【速读】:该论文试图解决在多智能体任务分配问题中,何时异质团队相比同质团队能够带来更高的奖励这一问题,即如何从奖励设计的角度确定适合异质团队的目标类型。解决方案的关键在于通过分析全局奖励的构建方式,特别是内层和外层聚合算子的曲率,来判断异质性是否能提升奖励,并进一步引入基于梯度的Heterogeneous Environment Design (HED)算法,优化未明确指定的多智能体强化学习(MARL)环境参数空间,以发现异质性具有优势的情景。

链接: https://arxiv.org/abs/2506.09434
作者: Michael Amir,Matteo Bettini,Amanda Prorok
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, our goal is to study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the N agents’ effort allocations on individual tasks to a task score, and an outer operator that merges the M task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneous Environment Design (HED), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Experiments in matrix games and an embodied Multi-Goal-Capture environment show that, despite the difference in settings, HED rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HED and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
zh
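
“算子曲率决定异质性是否带来收益”这一结论,可以用两智能体、两任务的数值例子直接验证:inner 取 max(凸)时专精的异质团队占优,取 min(凹)时均匀的同质团队占优。分配矩阵与算子选择均为本文假设的玩具设定。

```python
import numpy as np

def team_reward(alloc, inner, outer=np.sum):
    # alloc[i, m]:智能体 i 投入任务 m 的努力;
    # inner 把各任务上“所有智能体的努力向量”映射为任务得分,outer 合并为团队奖励
    task_scores = [inner(alloc[:, m]) for m in range(alloc.shape[1])]
    return outer(task_scores)

homo   = np.array([[0.5, 0.5], [0.5, 0.5]])   # 同质:每个智能体均匀分配努力
hetero = np.array([[1.0, 0.0], [0.0, 1.0]])   # 异质:每个智能体各自专精一个任务

for name, inner in [("convex max", np.max), ("concave min", np.min)]:
    print(name, team_reward(homo, inner), team_reward(hetero, inner))
# convex max: 同质 1.0 vs 异质 2.0;concave min: 同质 1.0 vs 异质 0.0
```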

[AI-28] SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

【速读】:该论文试图解决在边缘计算环境中高效推理大型语言模型(Large Language Models, LLMs)所面临的挑战,这些问题主要源于设备内存和功耗的限制。现有方法如激进量化、剪枝或远程推理虽然在一定程度上提高了效率,但往往以牺牲准确性或增加成本为代价。该论文提出的解决方案关键在于利用推测解码(speculative decoding)技术,通过协调异构设备之间的计算,将轻量级边缘设备用于本地生成多个候选标记,而由一个共享的边缘服务器使用更精确的目标模型进行批量处理和验证,从而实现设备异构性支持与服务器端内存占用的降低。

链接: https://arxiv.org/abs/2506.09397
作者: Xiangchen Li,Dimitrios Spatharakis,Saeid Ghafouri,Jiakun Fan,Dimitrios Nikolopoulos
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注: 6 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Regardless of advancements in device capabilities, efficient inference of advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new approach that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens utilizing a more precise target model. This approach supports device heterogeneity and reduces server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 5, and an RTX 6000 edge server indicate substantial benefits: significantly reduced latency, improved energy efficiency, and increased concurrent inference sessions, all without sacrificing model accuracy.
zh
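
SLED 所依赖的推测解码回路(边缘起草、服务器验证)可用如下最小示意说明:draft_probs 与 target_probs 是本文假设的占位玩具分布,接受准则为标准的 min(1, p/q) 配合残差重采样;批量验证与多设备调度细节从略。

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4                                        # 玩具词表大小

def draft_probs(ctx):                        # 占位:边缘设备上的轻量草稿模型
    w = np.random.default_rng(len(ctx)).random(V) + 0.1
    return w / w.sum()

def target_probs(ctx):                       # 占位:边缘服务器上更精确的目标模型
    w = np.random.default_rng(len(ctx) + 7).random(V) + 0.1
    return w / w.sum()

def speculative_step(ctx, k=4):
    draft, qs = [], []
    for _ in range(k):                       # 边缘端:草稿模型自回归生成 k 个候选
        q = draft_probs(ctx + draft)
        draft.append(int(rng.choice(V, p=q)))
        qs.append(q)
    for i, t in enumerate(draft):            # 服务器端:目标模型一次性批量验证
        p = target_probs(ctx + draft[:i])
        if rng.random() >= min(1.0, p[t] / qs[i][t]):
            r = np.maximum(p - qs[i], 0.0)   # 拒绝:从残差分布重采样一个 token 后停止
            return ctx + draft[:i] + [int(rng.choice(V, p=r / r.sum()))]
    return ctx + draft                       # 全部接受

print(speculative_step([0]))
```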

[AI-29] Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models

【速读】:该论文试图解决传统代码生成模型中推理深度难以控制的问题,即如何在准确性和效率之间取得更好的平衡。其解决方案的关键在于将推理深度视为可控制的资源,并通过显式管理“快速思考”与“慢速思考”之间的权衡,优化整个模型生命周期中的推理预算,从而实现精度、延迟和成本之间的最优折衷。

链接: https://arxiv.org/abs/2506.09396
作者: Zongjie Li,Shuai Wang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This position paper proposes a fundamental shift in designing code generation models: treating reasoning depth as a controllable resource. Rather than being an incidental byproduct of prompting, we argue that the trade-off between rapid, direct answers (“fast thinking”) and elaborate, chain-of-thought deliberation (“slow thinking”) must be explicitly managed. We contend that optimizing reasoning budgets across the entire model lifecycle - from synthetic data creation and benchmarking to real-world deployment - can unlock superior trade-offs among accuracy, latency, and cost. This paper outlines how adaptive control over reasoning can enrich supervision signals, motivate new multi-dimensional benchmarks, and inform cost-aware, security-conscious deployment policies. By viewing fast and slow thinking as complementary modes to be scheduled, we envision coding agents that think deep when necessary and act fast when possible.
zh

[AI-30] Beyond Nash Equilibrium: Bounded Rationality of LLM s and humans in Strategic Decision-making

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在战略决策场景中是否表现出与人类相似的有限理性(bounded rationality)问题。研究通过将LLMs置于与人类相同的实验条件下,比较其在经典博弈论游戏(如石头剪刀布和囚徒困境)中的行为表现。解决方案的关键在于利用行为博弈论的实验范式直接评估LLMs的战略行为,并揭示其在策略调整、合作倾向及对环境动态变化敏感性方面的特征。研究发现,尽管LLMs能够复现人类常见的启发式策略,但其应用更为僵化,且在适应性情境中表现较差,这表明现有LLMs仅部分模拟了人类的有限理性,需进一步优化训练方法以提升其对手建模能力和上下文感知能力。

链接: https://arxiv.org/abs/2506.09390
作者: Kehan Zheng,Jinfeng Zhou,Hongning Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Large language models are increasingly used in strategic decision-making settings, yet evidence shows that, like humans, they often deviate from full rationality. In this study, we compare LLMs and humans using experimental paradigms directly adapted from behavioral game-theory research. We focus on two well-studied strategic games, Rock-Paper-Scissors and the Prisoner’s Dilemma, which are well known for revealing systematic departures from rational play in human subjects. By placing LLMs in identical experimental conditions, we evaluate whether their behaviors exhibit the bounded rationality characteristic of humans. Our findings show that LLMs reproduce familiar human heuristics, such as outcome-based strategy switching and increased cooperation when future interaction is possible, but they apply these rules more rigidly and demonstrate weaker sensitivity to the dynamic changes in the game environment. Model-level analyses reveal distinctive architectural signatures in strategic behavior, and even reasoning models sometimes struggle to find effective strategies in adaptive situations. These results indicate that current LLMs capture only a partial form of human-like bounded rationality and highlight the need for training methods that encourage flexible opponent modeling and stronger context awareness.
zh

[AI-31] Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations

【速读】:该论文试图解决人类及双足机器人系统中静态平衡与跌倒机制的定量理解不足问题(static balance and falling)。其解决方案的关键在于构建了一个分层控制流程,通过全面的全身肌肉骨骼系统模拟人体平衡,揭示了稳定站立期间的时空平衡动力学,分析了肌肉损伤对平衡行为的影响,并生成了与临床数据一致的跌倒接触模式。此外,通过模拟髋关节外骨骼辅助,验证了其在扰动下提升平衡维持能力和降低肌肉努力的效果。

链接: https://arxiv.org/abs/2506.09383
作者: Chengtian Ma,Yunyue Wei,Chenhui Zuo,Chen Zhang,Yanan Sui
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Balance control is important for human and bipedal robotic systems. While dynamic balance during locomotion has received considerable attention, quantitative understanding of static balance and falling remains limited. This work presents a hierarchical control pipeline for simulating human balance via a comprehensive whole-body musculoskeletal system. We identified spatiotemporal dynamics of balancing during stable standing, revealed the impact of muscle injury on balancing behavior, and generated fall contact patterns that aligned with clinical data. Furthermore, our simulated hip exoskeleton assistance demonstrated improvement in balance maintenance and reduced muscle effort under perturbation. This work offers unique muscle-level insights into human balance dynamics that are challenging to capture experimentally. It could provide a foundation for developing targeted interventions for individuals with balance impairments and support the advancement of humanoid robotic systems.
zh

[AI-32] Anomaly Detection and Generation with Diffusion Models: A Survey

【速读】:该论文试图解决异常检测(Anomaly Detection, AD)中因异常数据稀缺而导致的挑战,同时探索生成式扩散模型(Diffusion Models, DMs)在AD中的潜力。其解决方案的关键在于利用扩散模型的生成能力与异常检测方法之间的协同作用,形成一个增强循环:生成技术可缓解异常数据不足的问题,而检测方法则提供关键反馈以提升生成质量与相关性,从而超越各自独立能力的上限。

链接: https://arxiv.org/abs/2506.09368
作者: Yang Liu,Jing Liu,Chengfang Li,Rui Xi,Wenchao Li,Liang Cao,Jin Wang,Laurence T. Yang,Junsong Yuan,Wei Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures, 13 tables

点击查看摘要

Abstract:Anomaly detection (AD) plays a pivotal role across diverse domains, including cybersecurity, finance, healthcare, and industrial manufacturing, by identifying unexpected patterns that deviate from established norms in real-world data. Recent advancements in deep learning, specifically diffusion models (DMs), have sparked significant interest due to their ability to learn complex data distributions and generate high-fidelity samples, offering a robust framework for unsupervised AD. In this survey, we comprehensively review anomaly detection and generation with diffusion models (ADGDM), presenting a tutorial-style analysis of the theoretical foundations and practical implementations and spanning images, videos, time series, tabular, and multimodal data. Crucially, unlike existing surveys that often treat anomaly detection and generation as separate problems, we highlight their inherent synergistic relationship. We reveal how DMs enable a reinforcing cycle where generation techniques directly address the fundamental challenge of anomaly data scarcity, while detection methods provide critical feedback to improve generation fidelity and relevance, advancing both capabilities beyond their individual potential. A detailed taxonomy categorizes ADGDM methods based on anomaly scoring mechanisms, conditioning strategies, and architectural designs, analyzing their strengths and limitations. We finally discuss key challenges including scalability and computational efficiency, and outline promising future directions such as efficient architectures, conditioning strategies, and integration with foundation models (e.g., visual-language models and large language models). By synthesizing recent advances and outlining open research questions, this survey aims to guide researchers and practitioners in leveraging DMs for innovative AD solutions across diverse applications.
zh

[AI-33] “I Said Things I Needed to Hear Myself”: Peer Support as an Emotional, Organisational and Sociotechnical Practice in Singapore

【速读】:该论文试图解决数字平台在心理健康支持中的设计与影响问题,特别是在亚洲背景下,如何通过技术手段有效支持同伴支持(peer support)的开展。其解决方案的关键在于通过实地访谈研究,深入理解同伴支持者在不同环境中的实践过程,包括他们的动机、情感劳动及社会文化因素,并据此提出文化敏感的数字工具设计方向,以增强而非取代人际关系支持。同时,论文还探讨了人工智能在负责任地增强同伴支持中的潜在作用,旨在推动以人为本的计算方法在心理健康领域的应用。

链接: https://arxiv.org/abs/2506.09362
作者: Kellie Yu Hui Sim,Kenny Tsu Wei Choo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Peer support plays a vital role in expanding access to mental health care by providing empathetic, community-based support outside formal clinical systems. As digital platforms increasingly mediate such support, the design and impact of these technologies remain under-examined, particularly in Asian contexts. This paper presents findings from an interview study with 20 peer supporters in Singapore, who operate across diverse online, offline, and hybrid environments. Through a thematic analysis, we unpack how participants start, conduct, and sustain peer support, highlighting their motivations, emotional labour, and the sociocultural dimensions shaping their practices. Building on this grounded understanding, we surface design directions for culturally responsive digital tools that scaffold rather than supplant relational care. Drawing insights from qualitative accounts, we offer a situated perspective on how AI might responsibly augment peer support. This research contributes to human-centred computing by articulating the lived realities of peer supporters and proposing design implications for trustworthy and context-sensitive AI in mental health.
zh

[AI-34] “Is This Really a Human Peer Supporter?”: Misalignments Between Peer Supporters and Experts in LLM-Supported Interactions

【速读】:该论文旨在解决当前同伴支持(peer support)在培训质量、一致性及安全性方面存在的问题,尤其是在心理健康领域中,同伴支持作为专业护理的补充,其效果受到训练差异和响应质量的限制。解决方案的关键在于利用生成式 AI (Generative AI) 技术,构建一个基于大型语言模型(LLM)的辅助系统,该系统能够模拟受困扰的客户、生成情境敏感的建议,并提供实时情绪可视化,从而提升同伴支持者的互动质量和培训效果。然而,研究发现专家与同伴支持者在响应质量上存在显著差异,凸显了现有培训体系在情感复杂情境中的不足,强调了标准化、心理科学基础培训的重要性,以及在设计此类系统时需结合专家监督以确保安全性和有效性。

链接: https://arxiv.org/abs/2506.09354
作者: Kellie Yu Hui Sim,Roy Ka-Wei Lee,Kenny Tsu Wei Choo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mental health is a growing global concern, prompting interest in AI-driven solutions to expand access to psychosocial support. Peer support, grounded in lived experience, offers a valuable complement to professional care. However, variability in training, effectiveness, and definitions raises concerns about quality, consistency, and safety. Large Language Models (LLMs) present new opportunities to enhance peer support interactions, particularly in real-time, text-based interactions. We present and evaluate an AI-supported system with an LLM-simulated distressed client, context-sensitive LLM-generated suggestions, and real-time emotion visualisations. Two mixed-methods studies with 12 peer supporters and 5 mental health professionals (i.e., experts) examined the system’s effectiveness and implications for practice. Both groups recognised its potential to enhance training and improve interaction quality. However, we found a key tension emerged: while peer supporters engaged meaningfully, experts consistently flagged critical issues in peer supporter responses, such as missed distress cues and premature advice-giving. This misalignment highlights potential limitations in current peer support training, especially in emotionally charged contexts where safety and fidelity to best practices are essential. Our findings underscore the need for standardised, psychologically grounded training, especially as peer support scales globally. They also demonstrate how LLM-supported systems can scaffold this development - if designed with care and guided by expert oversight. This work contributes to emerging conversations on responsible AI integration in mental health and the evolving role of LLMs in augmenting peer-delivered care.
zh

[AI-35] ErrorEraser: Unlearning Data Bias for Improved Continual Learning

【速读】:该论文试图解决持续学习(Continual Learning, CL)中由于现实世界数据偏差导致的灾难性遗忘和知识迁移能力下降的问题。现有CL方法忽视了数据中的偏差,使得模型学习到虚假相关性并在任务间传递和放大这些偏差,从而影响知识的保留与迁移。解决方案的关键在于提出一种通用插件ErrorEraser,其核心包括两个模块:误差识别模块通过在特征空间中学习任务数据的概率密度分布来准确识别可能具有偏差的样本,误差消除模块则通过调整代表性异常样本的决策空间确保仅删除错误知识,从而提升新旧任务的性能。

链接: https://arxiv.org/abs/2506.09347
作者: Xuemei Cao,Hanlin Gu,Xin Yang,Bingjun Wei,Haoyang Liang,Xiangkun Wang,Tianrui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Continual Learning (CL) primarily aims to retain knowledge to prevent catastrophic forgetting and transfer knowledge to facilitate learning new tasks. Unlike traditional methods, we propose a novel perspective: CL not only needs to prevent forgetting, but also requires intentional forgetting. This arises from existing CL methods ignoring biases in real-world data, leading the model to learn spurious correlations that transfer and amplify across tasks. From feature extraction and prediction results, we find that data biases simultaneously reduce CL’s ability to retain and transfer knowledge. To address this, we propose ErrorEraser, a universal plugin that removes erroneous memories caused by biases in CL, enhancing performance in both new and old tasks. ErrorEraser consists of two modules: Error Identification and Error Erasure. The former learns the probability density distribution of task data in the feature space without prior knowledge, enabling accurate identification of potentially biased samples. The latter ensures only erroneous knowledge is erased by shifting the decision space of representative outlier samples. Additionally, an incremental feature distribution learning strategy is designed to reduce the resource overhead during error identification in downstream tasks. Extensive experimental results show that ErrorEraser significantly mitigates the negative impact of data biases, achieving higher accuracy and lower forgetting rates across three types of CL methods. The code is available at this https URL.
zh
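
误差识别模块“在特征空间学习概率密度、把低密度样本视为潜在偏差样本”的思路,可用核密度估计做一个最小示意(scikit-learn);特征分布、带宽与 5% 阈值均为本文假设的演示设定。

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))        # 正常任务特征(假设)
biased = rng.uniform(3.0, 8.0, size=(15, 2))        # 稀疏、远离主分布的注入样本(假设)
X = np.vstack([normal, biased])

kde = KernelDensity(bandwidth=0.5).fit(X)           # 在特征空间拟合概率密度
logdens = kde.score_samples(X)
thresh = np.quantile(logdens, 0.05)                 # 密度最低的 5% 标记为可疑
flagged = np.where(logdens < thresh)[0]
print(len(flagged), int((flagged >= 500).sum()))    # 被标出的总数 / 其中属于注入样本的个数
```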

[AI-36] Intelligent System of Emergent Knowledge: A Coordination Fabric for Billions of Minds

【速读】:该论文旨在解决传统集中式平台在构建大规模、自组织认知系统时存在的局限性,特别是如何实现去中心化的人机协同与动态任务分配。其解决方案的关键在于构建一个基于Web3基础设施的智能系统(ISEK),该系统通过去中心化的多智能体架构、人机平等协作机制以及分布式共识驱动的自我适应能力,形成一个具有鲁棒性和自组织特性的认知生态系统。此外,ISEK引入了六阶段协调协议和基于代币的经济激励机制,以确保系统的高效运行与可持续发展。

链接: https://arxiv.org/abs/2506.09335
作者: Moshi Wei,Sparks Li
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:The Intelligent System of Emergent Knowledge (ISEK) establishes a decentralized network where human and artificial intelligence agents collaborate as peers, forming a self-organizing cognitive ecosystem. Built on Web3 infrastructure, ISEK combines three fundamental principles: (1) a decentralized multi-agent architecture resistant to censorship, (2) symbiotic AI-human collaboration with equal participation rights, and (3) resilient self-adaptation through distributed consensus mechanisms. The system implements an innovative coordination protocol featuring a six-phase workflow (Publish, Discover, Recruit, Execute, Settle, Feedback) for dynamic task allocation, supported by robust fault tolerance and a multidimensional reputation system. Economic incentives are governed by the native ISEK token, facilitating micropayments, governance participation, and reputation tracking, while agent sovereignty is maintained through NFT-based identity management. This synthesis of blockchain technology, artificial intelligence, and incentive engineering creates an infrastructure that actively facilitates emergent intelligence. ISEK represents a paradigm shift from conventional platforms, enabling the organic development of large-scale, decentralized cognitive systems where autonomous agents collectively evolve beyond centralized constraints.
zh

[AI-37] Causal Graph Recovery in Neuroimaging through Answer Set Programming

【速读】:该论文试图解决从时间序列数据中学习图形因果结构的问题,特别是在测量频率与系统因果时间尺度不匹配时,由于信息丢失导致的潜在因果图的不确定性问题。其解决方案的关键在于在因果图推导过程中引入下采样效应,从而获得更准确和直观的结果。研究采用约束优化方法,特别是答案集编程(Answer Set Programming, ASP),不仅能够识别最可能的潜在图,还能提供一个可能图的等价类供专家选择,同时利用图论进一步缩减解空间,显著提高求解效率和准确性。

链接: https://arxiv.org/abs/2506.09286
作者: Mohammadsajad Abavisani,Kseniya Solovyeva,David Danks,Vince Calhoun,Sergey Plis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Learning graphical causal structures from time series data presents significant challenges, especially when the measurement frequency does not match the causal timescale of the system. This often leads to a set of equally possible underlying causal graphs due to information loss from sub-sampling (i.e., not observing all possible states of the system throughout time). Our research addresses this challenge by incorporating the effects of sub-sampling in the derivation of causal graphs, resulting in more accurate and intuitive outcomes. We use a constraint optimization approach, specifically answer set programming (ASP), to find the optimal set of answers. ASP not only identifies the most probable underlying graph, but also provides an equivalence class of possible graphs for expert selection. In addition, using ASP allows us to leverage graph theory to further prune the set of possible solutions, yielding a smaller, more accurate answer set significantly faster than traditional approaches. We validate our approach on both simulated data and empirical structural brain connectivity, and demonstrate its superiority over established methods in these experiments. We further show how our method can be used as a meta-approach on top of established methods to obtain, on average, 12% improvement in F1 score. In addition, we achieved state-of-the-art results in terms of precision and recall when reconstructing causal graphs from sub-sampled time series data. Finally, our method shows robustness to varying degrees of sub-sampling on realistic simulations, whereas other methods perform worse for higher rates of sub-sampling.
zh
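
下采样如何改变表观因果图,可用矩阵幂直接演示:当测量率为 u 时,可观测到的有向影响对应原时间尺度图中长度恰为 u 的路径。不同的 u 给出不同的表观图,这正是 ASP 求解时需要显式建模的效应。下例的环状图为假设的玩具结构。

```python
import numpy as np

A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[1, 2] = A[2, 3] = A[3, 0] = 1    # 真实时间尺度下的 4 节点环状因果图

def undersample(A, u):
    # 下采样率 u 下的表观有向边:恰好 u 步路径的可达性(矩阵幂后取布尔)
    R = A.copy()
    for _ in range(u - 1):
        R = ((R @ A) > 0).astype(int)
    return R

for u in (1, 2, 3):
    print(f"u={u}:\n{undersample(A, u)}")
```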

[AI-38] Learning The Minimum Action Distance

【速读】:该论文试图解决在没有奖励信号或智能体执行动作的情况下,如何从状态轨迹中学习有效的状态表示问题。其解决方案的关键在于学习最小动作距离(Minimum Action Distance, MAD),该距离定义为在状态之间转移所需的最少动作数,作为捕捉环境底层结构的基本度量。通过自监督学习方法构建一个嵌入空间,其中嵌入状态对之间的距离对应于它们的MAD,从而为下游任务如目标条件强化学习和奖励塑造提供密集且几何上有意义的进展度量。

链接: https://arxiv.org/abs/2506.09276
作者: Lorenzo Steccanella,Joshua B. Evans,Özgür Şimşek,Anders Jonsson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goal-conditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD values, encompassing both deterministic and stochastic dynamics, as well as discrete and continuous state spaces, and environments with noisy observations. Empirical results demonstrate that the proposed approach not only efficiently learns accurate MAD representations across these diverse settings but also significantly outperforms existing state representation methods in terms of representation quality.
zh

[AI-39] A Multi-Armed Bandit Framework for Online Optimisation in Green Integrated Terrestrial and Non-Terrestrial Networks

【速读】:该论文旨在解决传统地面网络(Terrestrial Network, TN)在密集部署背景下面临的负载过重与能效低下问题,探索非地面网络(Non-Terrestrial Network, NTN)在提升网络可持续性方面的潜力。其解决方案的关键在于提出了一种基于多臂老虎机(Multi-Armed Bandit, MAB)框架的在线优化方法,结合带反馈约束的在线镜像下降(Bandit-feedback Constrained Online Mirror Descent, BCOMD)算法,通过自适应优化带宽分配、用户设备关联和宏基站关闭等关键系统参数,实现实时平衡网络容量与能效。

链接: https://arxiv.org/abs/2506.09268
作者: Henri Alam,Antonio de Domenico,Tareq Si Salem,Florian Kaltenberger
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: To be published in 2025 IEEE International Workshop on Signal Processing and Artificial Intelligence in Wireless Communications (IEEE SPAWC 2025)

点击查看摘要

Abstract:Integrated terrestrial and non-terrestrial network (TN-NTN) architectures offer a promising solution for expanding coverage and improving capacity for the network. While non-terrestrial networks (NTNs) are primarily exploited for these specific reasons, their role in alleviating terrestrial network (TN) load and enabling energy-efficient operation has received comparatively less attention. In light of growing concerns associated with the densification of terrestrial deployments, this work aims to explore the potential of NTNs in supporting a more sustainable network. In this paper, we propose a novel online optimisation framework for integrated TN-NTN architectures, built on a multi-armed bandit (MAB) formulation and leveraging the Bandit-feedback Constrained Online Mirror Descent (BCOMD) algorithm. Our approach adaptively optimises key system parameters–including bandwidth allocation, user equipment (UE) association, and macro base station (MBS) shutdown–to balance network capacity and energy efficiency in real time. Extensive system-level simulations over a 24-hour period show that our framework significantly reduces the proportion of unsatisfied UEs during peak hours and achieves up to 19% throughput gains and 5% energy savings in low-traffic periods, outperforming standard network settings following 3GPP recommendations.
zh

[AI-40] Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

【速读】:该论文试图解决Large Reasoning Models (LRMs)在复杂规划谜题中出现的“accuracy collapse”问题,即模型在超过一定复杂度阈值时表现出准确率显著下降的现象。论文的解决方案关键在于识别并纠正实验设计中的缺陷,包括模型输出令牌限制、评估框架对推理失败与实际约束的区分不足,以及河渡问题基准中存在数学上不可解的实例。通过控制这些实验偏差,例如要求生成函数而非详尽的移动列表,初步实验显示模型在先前被报告为完全失败的Tower of Hanoi实例上仍能保持较高准确率,从而强调了严谨实验设计在评估AI推理能力中的重要性。

链接: https://arxiv.org/abs/2506.09250
作者: C. Opus,A. Lawsen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Comment on: arXiv:2506.06941

点击查看摘要

Abstract:Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit “accuracy collapse” on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors’ automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
zh
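
文中“让模型输出生成函数而非穷举步骤”的做法,可用汉诺塔的标准递归解直观说明:十行左右的生成器即可按需产出全部 2^n - 1 步,输出 token 数不再随 n 指数增长,从而把“输出长度上限”与“推理能力”区分开。

```python
def hanoi(n, src="A", aux="B", dst="C"):
    # 汉诺塔最优解的生成函数:按需流式产出移动步骤,而非一次性列出全部步骤
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)
    yield (src, dst)
    yield from hanoi(n - 1, aux, src, dst)

moves = hanoi(15)                                # 15 层共 2**15 - 1 = 32767 步
first = next(moves)
print(first, 1 + sum(1 for _ in moves) == 2**15 - 1)
```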

[AI-41] Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs ICML2025

【速读】:该论文试图解决在Transformer嵌入模型输出中,如何有效总结信息以应对输入中存在信号(signal)与噪声(noise)的问题,特别是在强化学习和视觉应用中。其核心挑战在于标准的池化方法(如AvgPool、MaxPool和ClsToken)在信号-噪声比(SNR)波动时容易导致性能下降。解决方案的关键在于提出一种基于注意力机制的自适应池化方法,该方法能够近似最优信号向量量化器,并在不同SNR条件下保持稳定性能,从而提升了模型在多种任务中的鲁棒性。

链接: https://arxiv.org/abs/2506.09215
作者: Greyson Brothers
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: [ICML 2025 Spotlight Poster] To be published in the Forty-Second International Conference on Machine Learning (ICML) Proceedings

点击查看摘要

Abstract:We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based adaptive pooling method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
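
作为“基于注意力的自适应池化”思路的最小示意,可用一个可学习查询向量对 Transformer 输出做加权汇总(以下结构与超参均为示例假设,并非论文原实现及其误差界推导):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Generic attention-based pooling over transformer outputs (sketch).

    A learned query attends over the N output vectors, so tokens that look
    like signal can dominate the summary even when most inputs are noise.
    """
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.scale = dim ** -0.5

    def forward(self, x):                                # x: (batch, N, dim)
        scores = (x @ self.query) * self.scale           # (batch, N)
        weights = scores.softmax(dim=-1)                 # attention over tokens
        return (weights.unsqueeze(-1) * x).sum(dim=1)    # (batch, dim)

pooled = AttentionPool(64)(torch.randn(2, 16, 64))       # -> shape (2, 64)
```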

[AI-42] A Topological Improvement of the Overall Performance of Sparse Evolutionary Training: Motif-Based Structural Optimization of Sparse MLPs

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)在模型复杂度增加背景下计算成本和内存开销过高的问题,其解决方案的关键在于通过结构优化方法对稀疏进化训练(Sparse Evolutionary Training, SET)应用于多层感知机(Multi-layer Perceptrons, MLPs)进行改进,以提升性能并实现显著的效率增益。

链接: https://arxiv.org/abs/2506.09204
作者: Xiaotian Chen,Hongyun Liu,Seyed Sahand Mohammadi Ziabari
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have been proven to be exceptionally effective and have been applied across diverse domains within deep learning. However, as DNN models increase in complexity, the demand for reduced computational costs and memory overheads has become increasingly urgent. Sparsity has emerged as a leading approach in this area. The robustness of sparse Multi-layer Perceptrons (MLPs) for supervised feature selection, along with the application of Sparse Evolutionary Training (SET), illustrates the feasibility of reducing computational costs without compromising accuracy. Moreover, it is believed that the SET algorithm can still be improved through a structural optimization method called motif-based optimization, with potential efficiency gains exceeding 40% and a performance decline of under 4%. This research investigates whether the structural optimization of Sparse Evolutionary Training applied to Multi-layer Perceptrons (SET-MLP) can enhance performance and to what extent this improvement can be achieved.

[AI-43] Policy-Based Trajectory Clustering in Offline Reinforcement Learning

【速读】:该论文试图解决从离线强化学习(offline reinforcement learning, RL)数据集中聚类轨迹的问题,其中每个聚类中心代表生成其轨迹的策略。解决方案的关键在于利用轨迹分布的KL散度与策略诱导分布混合之间的联系,从而构建一个自然的聚类目标。为此,作者提出了Policy-Guided K-means (PG-Kmeans) 和 Centroid-Attracted Autoencoder (CAAE),前者通过迭代训练行为克隆(behavior cloning, BC)策略并基于策略生成概率分配轨迹实现聚类,后者则通过引导轨迹的潜在表示接近特定代码本条目来实现聚类。

链接: https://arxiv.org/abs/2506.09202
作者: Hao Hu,Xinqi Wang,Simon Shaolei Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories. By leveraging the connection between the KL-divergence of offline trajectory distributions and a mixture of policy-induced distributions, we formulate a natural clustering objective. To solve this, we propose Policy-Guided K-means (PG-Kmeans) and Centroid-Attracted Autoencoder (CAAE). PG-Kmeans iteratively trains behavior cloning (BC) policies and assigns trajectories based on policy generation probabilities, while CAAE resembles the VQ-VAE framework by guiding the latent representations of trajectories toward the vicinity of specific codebook entries to achieve clustering. Theoretically, we prove the finite-step convergence of PG-Kmeans and identify a key challenge in offline trajectory clustering: the inherent ambiguity of optimal solutions due to policy-induced conflicts, which can result in multiple equally valid but structurally distinct clusterings. Experimentally, we validate our methods on the widely used D4RL dataset and custom GridWorld environments. Our results show that both PG-Kmeans and CAAE effectively partition trajectories into meaningful clusters. They offer a promising framework for policy-based trajectory clustering, with broad applications in offline RL and beyond.
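
PG-Kmeans 在“为每个簇拟合行为克隆策略”与“按策略生成概率重新分配轨迹”之间交替迭代,其骨架可示意如下(fit_bc_policy 与 log_likelihood 为读者自备的占位函数,空簇处理与训练细节以论文为准):

```python
import numpy as np

def pg_kmeans(trajectories, K, fit_bc_policy, log_likelihood, n_iters=10, seed=0):
    """Skeleton of Policy-Guided K-means, following the abstract.

    fit_bc_policy(trajs):      trains a behavior-cloning policy on a cluster
    log_likelihood(policy, t): log-probability that the policy generates t
    Both are reader-supplied placeholders for this sketch.
    """
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, K, size=len(trajectories))
    for _ in range(n_iters):
        # M-step: one BC policy per current cluster.
        policies = [fit_bc_policy([t for t, a in zip(trajectories, assign) if a == k])
                    for k in range(K)]
        # E-step: each trajectory moves to its most likely generating policy.
        new_assign = np.array([int(np.argmax([log_likelihood(p, t) for p in policies]))
                               for t in trajectories])
        if np.array_equal(new_assign, assign):   # finite-step convergence
            break
        assign = new_assign
    return assign
```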

[AI-44] FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models

【速读】:该论文旨在解决联邦学习中集成低秩适配(LoRA)时面临的通信效率、模型精度和计算成本之间的平衡问题,尤其是在异构客户端场景下。现有方法要么依赖于简单的本地适配器平均,引入聚合噪声;要么需要传输大规模的堆叠本地适配器,导致通信效率低下;或需重构内存密集的全局权重更新矩阵并进行计算密集型分解以设计客户端特定的低秩适配器。论文提出的FLoRIST框架通过在服务器端对堆叠的本地适配器分别执行奇异值分解,避免构建完整的全局权重更新矩阵,从而在紧凑的中间空间中表示局部LoRA的累积信息,并引入可调奇异值阈值以实现服务器端最优秩选择,构造一对由所有客户端共享的全局低秩适配器,实现了数学上精确的聚合,同时保持较低的通信和计算开销。

链接: https://arxiv.org/abs/2506.09199
作者: Hariharan Ramesh,Jyotikrishna Dass
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise, require transmitting large stacked local adapters, leading to poor communication efficiency, or necessitate reconstructing memory-dense global weight-update matrix and performing computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.
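
FLoRIST 的要点是对堆叠的局部适配器分别做 SVD,而不显式构造稠密的全局权重更新矩阵。以下为基于摘要理解的示意实现(energy 能量阈值与最后的取均值均是对论文做法的假设性简化):

```python
import torch

def florist_aggregate(B_list, A_list, energy=0.99):
    """Sketch of FLoRIST-style server aggregation (our reading of the abstract).

    Each client i contributes LoRA factors B_i (d x r_i) and A_i (r_i x k).
    Stacking them and running two separate SVDs stays in a compact
    intermediate space instead of forming the dense d x k global update.
    """
    B = torch.cat(B_list, dim=1)              # (d, sum_i r_i)
    A = torch.cat(A_list, dim=0)              # (sum_i r_i, k)
    Ub, Sb, Vbh = torch.linalg.svd(B, full_matrices=False)
    Ua, Sa, Vah = torch.linalg.svd(A, full_matrices=False)

    def rank_for(S):  # smallest rank keeping `energy` of the squared spectrum
        cum = torch.cumsum(S**2, 0) / (S**2).sum()
        return min(int(torch.searchsorted(cum, energy)) + 1, len(S))

    rb, ra = rank_for(Sb), rank_for(Sa)
    B_g = Ub[:, :rb] * Sb[:rb]                               # (d, rb)
    A_g = (Vbh[:rb] @ (Ua[:, :ra] * Sa[:ra])) @ Vah[:ra]     # (rb, k)
    return B_g, A_g / len(B_list)   # B_g @ A_g ~ mean_i B_i @ A_i (averaging assumed)
```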

[AI-45] Multi-Task Reward Learning from Human Ratings

【速读】:该论文试图解决当前基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)方法在建模人类决策过程时过于简化的问题,即现有方法通常通过孤立的任务(如分类或回归)来模拟人类推理,而忽略了人类在决策时综合运用多种策略的特点。解决方案的关键在于提出一种新的强化学习方法,该方法通过联合考虑多个任务来模仿人类决策,具体而言是利用无奖励环境中的用户评分来推断奖励函数,并引入可学习的权重以平衡分类与回归模型的贡献,从而捕捉人类决策中的固有不确定性并使模型能够自适应地强调不同策略。

链接: https://arxiv.org/abs/2506.09183
作者: Mingkang Wu,Devin White,Evelyn Rose,Vernon Lawhern,Nicholas R Waytowich,Yongcan Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the workshop on Models of Human Feedback for AI Alignment at the 42nd International Conference on Machine Learning

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has become a key factor in aligning model behavior with users’ goals. However, while humans integrate multiple strategies when making decisions, current RLHF approaches often simplify this process by modeling human reasoning through isolated tasks such as classification or regression. In this paper, we propose a novel reinforcement learning (RL) method that mimics human decision-making by jointly considering multiple tasks. Specifically, we leverage human ratings in reward-free environments to infer a reward function, introducing learnable weights that balance the contributions of both classification and regression models. This design captures the inherent uncertainty in human decision-making and allows the model to adaptively emphasize different strategies. We conduct several experiments using synthetic human ratings to validate the effectiveness of the proposed approach. Results show that our method consistently outperforms existing rating-based RL methods, and in some cases, even surpasses traditional RL approaches.
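
摘要中的关键设计是用可学习权重平衡“分类”与“回归”两种对评分的建模视角,最小示意如下(网络结构与 softmax 混合方式均为本文假设):

```python
import torch
import torch.nn as nn

class RatingReward(nn.Module):
    """Minimal sketch: infer a reward from human ratings by combining a
    classification head (rating as a discrete label) and a regression head
    (rating as a scalar) with learnable mixing weights. Head sizes and the
    softmax mixing are illustrative assumptions, not the paper's code.
    """
    def __init__(self, obs_dim, n_rating_levels=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.cls_head = nn.Linear(64, n_rating_levels)   # classification view
        self.reg_head = nn.Linear(64, 1)                 # regression view
        self.mix = nn.Parameter(torch.zeros(2))          # learnable balance

    def loss(self, obs, rating):                         # rating: LongTensor
        h = self.backbone(obs)
        ce = nn.functional.cross_entropy(self.cls_head(h), rating)
        mse = nn.functional.mse_loss(self.reg_head(h).squeeze(-1), rating.float())
        w = self.mix.softmax(0)                          # adaptive emphasis
        return w[0] * ce + w[1] * mse
```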

[AI-46] Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism ICML2025

【速读】:该论文试图解决交互式模仿学习(Interactive Imitation Learning, IIL)中人类监督者认知负担过高的问题。现有方法对人类监督者的干预要求较高,导致其在学习过程中需要持续监控和提供大量示范。解决方案的关键在于提出自适应干预机制(Adaptive Intervention Mechanism, AIM),该机制通过一个代理Q函数(proxy Q-function)来模拟人类干预规则,并根据智能体与人类行为的一致性动态调整干预请求。该方法能够在智能体偏离专家行为时赋予高Q值,随着智能体能力提升逐渐降低Q值,从而实现实时评估与专家的一致性并按需请求帮助,有效降低人类监督成本并提升学习效率。

链接: https://arxiv.org/abs/2506.09176
作者: Haoyuan Cai,Zhenghao Peng,Bolei Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICML 2025 Poster

点击查看摘要

Abstract:Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human intervention rule and adjusts intervention requests based on the alignment between agent and human actions. By assigning high Q-values when the agent deviates from the expert and decreasing these values as the agent becomes proficient, the proxy Q-function enables the agent to assess the real-time alignment with the expert and request assistance when needed. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Compared to the uncertainty-based baseline Thrifty-DAgger, our method achieves a 40% improvement in terms of human take-over cost and learning efficiency. Furthermore, AIM effectively identifies safety-critical states for expert assistance, thereby collecting higher-quality expert demonstrations and reducing overall expert data and environment interactions needed. Code and demo video are available at this https URL.
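
AIM 的“机器人门控”逻辑可概括为:当代理 Q 函数对当前(状态, 动作)给出高偏离值时请求人类示范。极简示意如下(proxy_q 与阈值均为占位假设,论文中该准则是学习得到的而非手工固定):

```python
import torch

def should_request_human(proxy_q, obs, agent_action, threshold=0.5):
    """Sketch of AIM's robot-gated intervention rule. The proxy Q-function is
    trained to assign high values where the agent's action deviates from the
    expert's, so a high value triggers a demonstration request.
    `proxy_q` and `threshold` are placeholders for this illustration.
    """
    with torch.no_grad():
        deviation_score = proxy_q(obs, agent_action).item()
    return deviation_score > threshold
```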

[AI-47] Understanding Human-AI Trust in Education

【速读】:该论文试图解决AI聊天机器人在教育中应用时,学生对其信任的性质与机制问题,即学生是将其视为人类同伴或教师(基于人际信任)还是作为普通技术工具(基于技术信任)的问题。解决方案的关键在于通过实证研究比较人类化信任(human-like trust)与系统化信任(system-like trust)对学生感知享受、信任意图、使用行为意图及感知有用性的影响,揭示二者在不同维度上的差异化作用,并提出一种区别于传统人际信任和人-技术信任的人机信任(human-AI trust)模型。

链接: https://arxiv.org/abs/2506.09160
作者: Griffin Pitts,Sanaz Motamedi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As AI chatbots become increasingly integrated in education, students are turning to these systems for guidance, feedback, and information. However, the anthropomorphic characteristics of these chatbots create ambiguity regarding whether students develop trust toward them as they would a human peer or instructor, based in interpersonal trust, or as they would any other piece of technology, based in technology trust. This ambiguity presents theoretical challenges, as interpersonal trust models may inappropriately ascribe human intentionality and morality to AI, while technology trust models were developed for non-social technologies, leaving their applicability to anthropomorphic systems unclear. To address this gap, we investigate how human-like and system-like trusting beliefs comparatively influence students’ perceived enjoyment, trusting intention, behavioral intention to use, and perceived usefulness of an AI chatbot - factors associated with students’ engagement and learning outcomes. Through partial least squares structural equation modeling, we found that human-like and system-like trust significantly influenced student perceptions, with varied effects. Human-like trust more strongly predicted trusting intention, while system-like trust better predicted behavioral intention and perceived usefulness. Both had similar effects on perceived enjoyment. Given the partial explanatory power of each type of trust, we propose that students develop a distinct form of trust with AI chatbots (human-AI trust) that differs from human-human and human-technology models of trust. Our findings highlight the need for new theoretical frameworks specific to human-AI trust and offer practical insights for fostering appropriately calibrated trust, which is critical for the effective adoption and pedagogical impact of AI in education.

[AI-48] FAIRTOPIA: Envisioning Multi-Agent Guardianship for Disrupting Unfair AI Pipelines

【速读】:该论文试图解决当前AI系统在决策过程中忽视人类价值观、强化计算偏差以及导致不公平现象的问题。其解决方案的关键在于引入基于代理技术的公平性监控机制,通过构建一个包含多角色代理的端到端协同方案,实现对AI管道(数据预处理、模型训练和部署后处理)各阶段的持续公平性监督。该方法强调以人类为中心的公平性期望,并提出了名为FAIRTOPIA的框架,其核心是通过自适应、可定制的算法,在代理守护者与知识驱动的自我优化层次结构中嵌入公平性设计原则。

链接: https://arxiv.org/abs/2506.09107
作者: Athena Vakali,Ilias Dimitriadis
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:AI models have become active decision makers, often acting without human supervision. The rapid advancement of AI technology has already caused harmful incidents that have hurt individuals and societies, and AI unfairness is heavily criticized. It is urgent to disrupt AI pipelines which largely neglect human principles and focus on computational bias exploration at the data (pre), model (in), and deployment (post) processing stages. We claim that by exploiting the advances of agents technology, we will introduce cautious, prompt, and ongoing fairness watch schemes, under realistic, systematic, and human-centric fairness expectations. We envision agents as fairness guardians, since agents learn from their environment, adapt to new information, and solve complex problems by interacting with external tools and other systems. To set the proper fairness guardrails in the overall AI pipeline, we introduce a fairness-by-design approach which embeds multi-role agents in an end-to-end (human to AI) synergetic scheme. Our position is that we may design adaptive and realistic AI fairness frameworks, and we introduce a generalized algorithm which can be customized to the requirements and goals of each AI decision making scenario. Our proposed, so-called FAIRTOPIA framework is structured over a three-layered architecture, which encapsulates the AI pipeline inside an agentic guardian and a knowledge-based, self-refining layered scheme. Based on our proposition, we enact fairness watch in all of the AI pipeline stages, under robust multi-agent workflows, which will inspire new fairness research hypotheses, heuristics, and methods grounded in human-centric, systematic, interdisciplinary, socio-technical principles.

[AI-49] MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

【速读】:该论文试图解决预训练Transformer模型在低秩微调过程中参数量过大和适应多任务时扩展性不足的问题。其解决方案的关键在于提出一种统一的Tensor Train (TT)适配器框架MetaTT,通过使用单一共享的TT分解所有Transformer子模块(如查询、键、值、投影和前馈层),并利用结构轴(如层、矩阵类型、可选的头和任务)进行索引,从而实现参数的显著压缩。与LoRA等方法相比,MetaTT在给定秩的情况下,参数增加量仅与模式的总和成比例,而非乘积,从而有效降低了最终适配器的参数规模。

链接: https://arxiv.org/abs/2506.09105
作者: Javier Lopez-Piqueres,Pranav Deshpande,Archan Ray,Mattia J. Villani,Marco Pistoia,Niraj Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:We present MetaTT, a unified Tensor Train (TT) adapter framework for global low-rank fine-tuning of pre-trained transformers. Unlike LoRA, which fine-tunes each weight matrix independently, MetaTT uses a single shared TT to factorize all transformer sub-modules – query, key, value, projection, and feed-forward layers – by indexing the structural axes like layer and matrix type, and optionally heads and tasks. For a given rank, while LoRA adds parameters proportional to the product across modes, MetaTT only adds parameters proportional to the sum across modes leading to a significantly compressed final adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning schemes. We observe that when tested on standard language modeling benchmarks, MetaTT leads to the most reduction in the parameters while maintaining similar accuracy to LoRA and even outperforming other tensor-based methods. Unlike CP or other rank-factorizations, the TT ansatz benefits from mature optimization routines – e.g., DMRG-style rank adaptive minimization in addition to Adam, which we find simplifies training. Because new modes can be appended cheaply, MetaTT naturally extends to shared adapters across many tasks without redesigning the core tensor.
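
“参数量随模式之和而非乘积增长”这一点可以用粗略的参数计数来体会(均匀 TT 秩、忽略边界核秩为 1 等均为简化假设,仅供示意):

```python
def lora_params(d_in, d_out, rank, n_layers, n_matrix_types):
    # LoRA: an independent (d_in x r) + (r x d_out) pair per weight matrix,
    # so parameters grow with the product of the structural axes.
    return n_layers * n_matrix_types * rank * (d_in + d_out)

def metatt_params(mode_sizes, rank):
    # A single tensor train over all structural axes: one 3-way core per
    # mode, so parameters grow with the *sum* of mode sizes. A uniform
    # TT-rank on every bond is a simplifying assumption (boundary cores
    # would actually have rank 1 on one side).
    return sum(rank * m * rank for m in mode_sizes)

# Toy comparison for a 32-layer model with 6 adapted matrix types of size 4096.
print(lora_params(4096, 4096, rank=8, n_layers=32, n_matrix_types=6))  # 12,582,912
print(metatt_params([32, 6, 4096, 4096], rank=8))                      # 526,720
```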

[AI-50] Unifying Block-wise PTQ and Distillation-based QAT for Progressive Quantization toward 2-bit Instruction-Tuned LLMs

【速读】:该论文旨在解决在资源受限设备上部署大规模语言模型(Large Language Models, LLMs)时面临的挑战,特别是针对指令调优(instruction-tuned)模型的极低比特量化问题,即如何将模型压缩至2-bit而不显著损失性能。解决方案的关键在于提出一种名为统一渐进量化(Unified Progressive Quantization, UPQ)的新框架,该框架通过结合块级后训练量化(block-wise post-training quantization, PTQ)与基于知识蒸馏的量化感知训练(distillation-based quantization-aware training, Distill-QAT),实现了从FP16到INT4再到INT2的渐进式量化,从而有效降低量化误差并保持模型输出与原始FP16模型的一致性。

链接: https://arxiv.org/abs/2506.09104
作者: Jung Hyun Lee,Seungjae Shin,Vinnam Kim,Jaeseong You,An Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:As the rapid scaling of large language models (LLMs) poses significant challenges for deployment on resource-constrained devices, there is growing interest in extremely low-bit quantization, such as 2-bit. Although prior works have shown that 2-bit large models are Pareto-optimal over their 4-bit smaller counterparts in both accuracy and latency, these advancements have been limited to pre-trained LLMs and have not yet been extended to instruction-tuned models. To bridge this gap, we propose Unified Progressive Quantization (UPQ) - a novel progressive quantization framework (FP16 → INT4 → INT2) that unifies block-wise post-training quantization (PTQ) with distillation-based quantization-aware training (Distill-QAT) for INT2 instruction-tuned LLM quantization. UPQ first quantizes FP16 instruction-tuned models to INT4 using block-wise PTQ to significantly reduce the quantization error introduced by subsequent INT2 quantization. Next, UPQ applies Distill-QAT to enable INT2 instruction-tuned LLMs to generate responses consistent with their original FP16 counterparts by minimizing the generalized Jensen-Shannon divergence (JSD) between the two. To the best of our knowledge, we are the first to demonstrate that UPQ can quantize open-source instruction-tuned LLMs to INT2 without relying on proprietary post-training data, while achieving state-of-the-art performances on MMLU and IFEval - two of the most representative benchmarks for evaluating instruction-tuned LLMs.
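
Distill-QAT 所用的广义 Jensen-Shannon 散度可按标准形式实现如下(这是该散度的通用写法而非论文代码;beta=0.5 时退化为普通 JSD):

```python
import math
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    """Generalized JSD between student (INT2) and teacher (FP16) next-token
    distributions, sketched as a Distill-QAT-style objective. Generic
    formulation; the paper's exact loss weighting may differ.
    """
    p = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    q = F.log_softmax(student_logits, dim=-1)   # student log-probs
    # log of the mixture M = beta*P + (1-beta)*Q, computed stably.
    m = torch.logsumexp(
        torch.stack([p + math.log(beta), q + math.log(1 - beta)]), dim=0)
    kl_pm = F.kl_div(m, p, log_target=True, reduction="batchmean")  # KL(P||M)
    kl_qm = F.kl_div(m, q, log_target=True, reduction="batchmean")  # KL(Q||M)
    return beta * kl_pm + (1 - beta) * kl_qm
```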

[AI-51] Revolutionizing Clinical Trials: A Manifesto for AI-Driven Transformation

【速读】:该论文试图解决传统临床试验中存在的效率低、安全性不足及个性化程度有限的问题,其解决方案的关键在于利用生成式 AI (Generative AI) 驱动的两种技术——因果推断和数字孪生,通过在现有监管框架内的可操作性整合,推动临床研究的革新并重新定义临床试验的黄金标准。

链接: https://arxiv.org/abs/2506.09102
作者: Mihaela van der Schaar,Richard Peck,Eoin McKinney,Jim Weatherall,Stuart Bailey,Justine Rochon,Chris Anagnostopoulos,Pierre Marquet,Anthony Wood,Nicky Best,Harry Amad,Julianna Piskorz,Krzysztof Kacprzyk,Rafik Salama,Christina Gunther,Francesca Frau,Antoine Pugeat,Ramon Hernandez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This manifesto represents a collaborative vision forged by leaders in pharmaceuticals, consulting firms, clinical research, and AI. It outlines a roadmap for two AI technologies - causal inference and digital twins - to transform clinical trials, delivering faster, safer, and more personalized outcomes for patients. By focusing on actionable integration within existing regulatory frameworks, we propose a way forward to revolutionize clinical research and redefine the gold standard for clinical trials using AI.

[AI-52] Feature Shift Localization Network

【速读】:该论文试图解决在多源异构数据中特征偏移(feature shift)的定位问题,这类问题常见于医疗、生物医学、社会经济、金融、调查和多传感器数据等应用中,由于数据源未标准化、噪声测量或处理与标准化流程不一致,可能导致错误特征的出现。现有的方法在检测分布偏移方面表现尚可,但对导致偏移的特征进行精确定位仍存在准确性不足或难以扩展到大规模高维数据集的问题。该论文提出的解决方案是引入一种名为特征偏移定位网络(Feature Shift Localization Network, FSL-Net)的神经网络,其关键在于通过大量数据集进行训练,学习数据集的统计特性,并能够在无需重新训练的情况下,从之前未见过的数据集和偏移中准确地定位特征偏移。

链接: https://arxiv.org/abs/2506.09101
作者: Míriam Barrabés,Daniel Mas Montserrat,Kapal Dev,Alexander G. Ioannidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 9 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Feature shifts between data sources are present in many applications involving healthcare, biomedical, socioeconomic, financial, survey, and multi-sensor data, among others, where unharmonized heterogeneous data sources, noisy data measurements, or inconsistent processing and standardization pipelines can lead to erroneous features. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features originating them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at this https URL.

[AI-53] Intra-Trajectory Consistency for Reward Modeling

【速读】:该论文试图解决当前奖励模型在学习响应轨迹中与评分相关联的具体组件时存在的泛化能力不足问题,这是因为响应级别的评分作为粗粒度的监督信号,难以精确指导模型识别影响评分的关键因素。解决方案的关键在于利用生成概率来建立响应轨迹中各过程之间的奖励一致性,从而将响应级别的监督信号传播至各个过程,提供更细粒度的奖励学习信号。通过在贝叶斯框架下的分析,作者提出了一个轨迹内一致性正则化方法,以确保相邻过程中具有更高下一词生成概率的步骤保持更一致的奖励,进而提升奖励模型的性能及后续策略的对齐效果。

链接: https://arxiv.org/abs/2506.09096
作者: Chaoyang Zhou,Shunyu Liu,Zengmao Wang,Di Wang,Rong-Cheng Tu,Bo Du,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to the advanced outcome reward model, improving its performance on RewardBench. Besides, we show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is provided in this https URL.
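
按摘要的直接解读,轨迹内一致性正则可写成“相邻步奖励差的平方,以下一词生成概率加权”,示意如下(具体加权形式以论文为准):

```python
import torch

def intra_trajectory_consistency(step_rewards, next_token_probs):
    """Sketch of intra-trajectory consistency regularization: adjacent steps
    linked by a high next-token generation probability should receive
    similar rewards, so the squared reward gap is weighted by that
    probability. Plain reading of the abstract, not the paper's exact form.

    step_rewards:     (T,)   reward assigned to each process/step
    next_token_probs: (T-1,) generation probability linking step t to t+1
    """
    gaps = (step_rewards[1:] - step_rewards[:-1]) ** 2
    return (next_token_probs * gaps).mean()

loss_reg = intra_trajectory_consistency(torch.randn(8), torch.rand(7))
```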

[AI-54] Merging Smarter, Generalizing Better: Enhancing Model Merging on OOD Data

【速读】:该论文试图解决多任务学习(Multi-task learning, MTL)中现有模型融合方法在域内(in-domain, ID)数据上表现良好,但在域外(out-of-domain, OOD)数据上效果欠佳的问题。其解决方案的关键在于提出LwPTV(Layer-wise Pruning Task Vector),通过构建显著性评分来衡量任务向量中参数的冗余性,从而为每个任务生成掩码向量,并实现对任务向量的逐层剪枝,仅保留合并模型中对应层的预训练参数,提升模型在OOD任务上的性能。

链接: https://arxiv.org/abs/2506.09093
作者: Bingjie Zhang,Hongkang Li,Changlong Shi,Guowei Rong,He Zhao,Dongsheng Wang,Dandan Guo,Meng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-task learning (MTL) concurrently trains a model on diverse task datasets to exploit common features, thereby improving overall performance across the tasks. Recent studies have dedicated efforts to merging multiple independent model parameters into a unified model for MTL, thus circumventing the need for training data and expanding the scope of applicable scenarios of MTL. However, current approaches to model merging predominantly concentrate on enhancing performance within in-domain (ID) datasets, often overlooking their efficacy on out-of-domain (OOD) datasets. In this work, we proposed LwPTV (Layer-wise Pruning Task Vector) by building a saliency score, measuring the redundancy of parameters in task vectors. Designed in this way ours can achieve mask vector for each task and thus perform layer-wise pruning on the task vectors, only keeping the pre-trained model parameters at the corresponding layer in merged model. Owing to its flexibility, our method can be seamlessly integrated with most of existing model merging methods to improve their performance on OOD tasks. Extensive experiments demonstrate that the application of our method results in substantial enhancements in OOD performance while preserving the ability on ID tasks.
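
LwPTV 的合并流程可示意为:逐层计算任务向量的冗余度评分,屏蔽高冗余层(该层退回预训练权重),仅并入保留层的增量(saliency、保留比例与取均值均为占位假设):

```python
def lwptv_merge(pretrained, task_vectors, saliency, keep_ratio=0.5):
    """Sketch of Layer-wise Pruning Task Vector (LwPTV) merging.

    pretrained:   dict name -> tensor of pretrained weights
    task_vectors: list of dicts, each name -> finetuned-minus-pretrained delta
    saliency(t, name): redundancy score of task t's delta at layer `name`
                       (its definition follows the paper; treated as given).
    Layers flagged as redundant are masked, i.e. the merged model keeps the
    pretrained weights there; averaging across tasks is our assumption.
    """
    merged = {name: w.clone() for name, w in pretrained.items()}
    for t, tv in enumerate(task_vectors):
        scores = {name: saliency(t, name) for name in tv}
        ranked = sorted(scores.values())
        cutoff = ranked[min(int(len(ranked) * keep_ratio), len(ranked) - 1)]
        for name, delta in tv.items():
            if scores[name] <= cutoff:   # low redundancy -> keep this layer's delta
                merged[name] += delta / len(task_vectors)
    return merged
```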

[AI-55] CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

【速读】:该论文旨在解决在大规模并行GPU上生成高度硬件特定、架构感知且性能关键的CUDA程序这一复杂挑战。其核心问题是传统方法难以自动产生既符合功能正确性又能充分利用GPU硬件特性的高性能代码。解决方案的关键在于提出了一种名为Feature Search and Reinforcement (FSR) 的新框架,该框架联合优化编译过程、功能正确性以及运行时性能,并通过实际GPU上的内核执行延迟进行评估,从而使得生成式AI(Generative AI)不仅能够生成语法和语义正确的CUDA代码,还能根据GPU架构特性迭代优化代码以提升效率。

链接: https://arxiv.org/abs/2506.09092
作者: Wentao Chen,Jiace Zhu,Qi Fan,Yehan Ma,An Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating the code which is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing high-performance GPU kernels that fully exploit the underlying hardware. To address this challenge, we propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness, as well as the runtime performance, which are validated through extensive and diverse test cases, and measured by actual kernel execution latency on the target GPU, respectively. This approach enables LLMs not only to generate syntactically and semantically correct CUDA code but also to iteratively refine it for efficiency, tailored to the characteristics of the GPU architecture. We evaluate FSR on representative CUDA kernels, covering AI workloads and computationally intensive algorithms. Our results show that LLMs augmented with FSR consistently maintain high correctness rates. Meanwhile, the automatically generated kernels can outperform general human-written code by a factor of up to 179× in execution speed. These findings highlight the potential of combining LLMs with performance reinforcement to automate GPU programming for hardware-specific, architecture-sensitive, and performance-critical applications.

[AI-56] Designing conflict-based communicative tasks in Teaching Chinese as a Foreign Language with ChatGPT

【速读】:该论文试图解决如何在大学层面的汉语作为外语教学课程中,有效提升学习者的口语交际能力的问题,其解决方案的关键在于通过设计基于冲突的交际任务,促进学习者参与互动动态并发展其口语交际技能。在此过程中,教师利用生成式 AI (Generative AI) 工具 ChatGPT 来辅助教学方案的最终确定,文章旨在分析教师与 ChatGPT 之间的交互特征及其在该情境下的应用与影响。

链接: https://arxiv.org/abs/2506.09089
作者: Xia Li(LIDILEM)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: in French language

点击查看摘要

Abstract:In developing the teaching program for a course in Oral Expression in Teaching Chinese as a Foreign Language at the university level, the teacher designs communicative tasks based on conflicts to encourage learners to engage in interactive dynamics and develop their oral interaction skills. During the design of these tasks, the teacher uses ChatGPT to assist in finalizing the program. This article aims to present the key characteristics of the interactions between the teacher and ChatGPT during this program development process, as well as to examine the use of ChatGPT and its impacts in this specific context.

[AI-57] LLM-ML Teaming: Integrated Symbolic Decoding and Gradient Search for Valid and Stable Generative Feature Transformation

【速读】:该论文试图解决特征变换过程中稳定生成(一致输出)和有效生成(无错误序列)的问题,现有方法如传统机器学习(Traditional ML)的低有效性以及大语言模型(Large Language Models, LLMs)的不稳定性无法同时解决这两个问题。解决方案的关键在于提出一种结合LLMs的符号生成能力与ML的梯度优化能力的协同框架,通过四个步骤实现:黄金样例生成、特征变换序列嵌入与搜索、学生LLM特征变换以及LLM-ML解码器协同,从而在保证生成有效性的同时提升稳定性。

链接: https://arxiv.org/abs/2506.09085
作者: Xinyuan Wang,Haoyue Bai,Nanxu Gong,Wangyang Ying,Sixun Dong,Xiquan Cui,Yanjie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Feature transformation enhances data representation by deriving new features from the original data. Generative AI offers potential for this task, but faces challenges in stable generation (consistent outputs) and valid generation (error-free sequences). Existing methods–traditional MLs’ low validity and LLMs’ instability–fail to resolve both. We find that LLMs ensure valid syntax, while ML’s gradient-steered search stabilizes performance. To bridge this gap, we propose a teaming framework combining LLMs’ symbolic generation with ML’s gradient optimization. This framework includes four steps: (1) golden examples generation, aiming to prepare high-quality samples with the ground knowledge of the teacher LLM; (2) feature transformation sequence embedding and search, intending to uncover potentially superior embeddings within the latent space; (3) student LLM feature transformation, aiming to distill knowledge from the teacher LLM; (4) LLM-ML decoder teaming, dedicating to combine ML and the student LLM probabilities for valid and stable generation. The experiments on various datasets show that the teaming policy can achieve 5% improvement in downstream performance while reducing nearly half of the error cases. The results also demonstrate the efficiency and robustness of the teaming policy. Additionally, we also have exciting findings on LLMs’ capacity to understand the original data.

[AI-58] Enhanced Whole Page Optimization via Mixed-Grained Reward Mechanism-Adapted Language Models

【速读】:该论文旨在解决在优化搜索和推荐结果展示(Whole Page Optimization, WPO)过程中,使用预训练大语言模型(Pre-trained Large Language Models, LLMs)进行微调时面临的挑战,特别是由于需要大量人工标注数据来缓解幻觉和模型不稳定问题所带来的高昂成本。论文提出的解决方案关键在于利用用户反馈作为监督信号,而非依赖人工标注数据,从而降低数据获取成本。其核心方法是PageLLM,该方法采用基于奖励的微调策略,结合页面级和物品级奖励的混合粒度机制,以同时优化整体展示质量和关键推荐的准确性与相关性。

链接: https://arxiv.org/abs/2506.09084
作者: Xinyuan Wang,Liang Wu,Yanjie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimizing the presentation of search and recommendation results is crucial to enhancing user experience and engagement. Whole Page Optimization (WPO) plays a pivotal role in this process, as it directly influences how information is surfaced to users. While Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent and contextually relevant content, fine-tuning these models for complex tasks like WPO presents challenges. Specifically, the need for extensive human-annotated data to mitigate issues such as hallucinations and model instability can be prohibitively expensive, especially in large-scale systems that interact with millions of items daily. In this work, we address the challenge of fine-tuning LLMs for WPO by using user feedback as the supervision. Unlike manually labeled datasets, user feedback is inherently noisy and less precise. To overcome this, we propose a reward-based fine-tuning approach, PageLLM, which employs a mixed-grained reward mechanism that combines page-level and item-level rewards. The page-level reward evaluates the overall quality and coherence, while the item-level reward focuses on the accuracy and relevance of key recommendations. This dual-reward structure ensures that both the holistic presentation and the critical individual components are optimized. We validate PageLLM on both public and industrial datasets. PageLLM outperforms baselines and achieves a 0.44% GMV increase in an online A/B test with over 10 million users, demonstrating its real-world impact.

[AI-59] FinHEAR: Human Expertise and Adaptive Risk-Aware Temporal Reasoning for Financial Decision-Making

【速读】:该论文旨在解决语言模型在金融决策中面临的挑战,包括时间推理、适应性风险评估以及对动态事件的响应能力不足的问题。其解决方案的关键在于提出FinHEAR,一个基于多智能体框架的人类专家知识与自适应风险感知推理系统。FinHEAR通过协调专门的大型语言模型(LLM)代理,在以事件为中心的流程中分析历史趋势、解释当前事件并检索专家指导的先例,结合行为经济学原理,引入专家引导的检索、置信度调整的位置规模设定和基于结果的优化,从而提升可解释性和鲁棒性。

链接: https://arxiv.org/abs/2506.09080
作者: Jiaxiang Chen,Mingxi Zou,Zhuo Wang,Qifan Wang,Dongning Sun,Chi Zhang,Zenglin Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:Financial decision-making presents unique challenges for language models, demanding temporal reasoning, adaptive risk assessment, and responsiveness to dynamic events. While large language models (LLMs) show strong general reasoning capabilities, they often fail to capture behavioral patterns central to human financial decisions-such as expert reliance under information asymmetry, loss-averse sensitivity, and feedback-driven temporal adjustment. We propose FinHEAR, a multi-agent framework for Human Expertise and Adaptive Risk-aware reasoning. FinHEAR orchestrates specialized LLM-based agents to analyze historical trends, interpret current events, and retrieve expert-informed precedents within an event-centric pipeline. Grounded in behavioral economics, it incorporates expert-guided retrieval, confidence-adjusted position sizing, and outcome-based refinement to enhance interpretability and robustness. Empirical results on curated financial datasets show that FinHEAR consistently outperforms strong baselines across trend prediction and trading tasks, achieving higher accuracy and better risk-adjusted returns.

[AI-60] STREAMINGGS: Voxel-Based Streaming 3D Gaussian Splatting with Memory Optimization and Architectural Support

【速读】:该论文试图解决3D Gaussian Splatting (3DGS)在资源受限的移动设备上无法达到90帧每秒(FPS)实时性能的问题,其当前性能仅能达到2到9 FPS。解决方案的关键在于提出STREAMINGGS,这是一种算法-架构协同设计的全流式3DGS方法,通过将基于图块的渲染转换为基于内存的渲染,实现细粒度流水线并减少DRAM数据传输。

链接: https://arxiv.org/abs/2506.09070
作者: Chenqi Zhang,Yu Feng,Jieru Zhao,Guangda Liu,Wenchao Ding,Chentao Wu,Minyi Guo
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has gained popularity for its efficiency and sparse Gaussian-based representation. However, 3DGS struggles to meet the real-time requirement of 90 frames per second (FPS) on resource-constrained mobile devices, achieving only 2 to 9 FPS. Existing accelerators focus on compute efficiency but overlook memory efficiency, leading to redundant DRAM traffic. We introduce STREAMINGGS, a fully streaming 3DGS algorithm-architecture co-design that achieves fine-grained pipelining and reduces DRAM traffic by transforming from a tile-centric rendering to a memory-centric rendering. Results show that our design achieves up to 45.7× speedup and 62.9× energy savings over mobile Ampere GPUs.

[AI-61] EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model

【速读】:该论文旨在解决轻量级大型语言模型(Large Language Models, LLMs)在边缘系统上性能评估的挑战,特别是在计算、内存和功耗受限环境下如何有效部署和优化LLMs的问题。解决方案的关键在于提出EdgeProfiler框架,该框架通过采用激进的量化技术(如4-bit量化)和严格的内存约束,结合分析建模方法对延迟、浮点运算次数(FLOPs)和能耗进行估算,从而实现对轻量级LLMs的高效性能评估,使得模型在保持较高准确率的同时显著降低内存占用并提升推理速度,同时减少能耗,为边缘设备上的实际部署提供了可行路径。

链接: https://arxiv.org/abs/2506.09061
作者: Alyssa Pinnock,Shakya Jayakody,Kawsher A Roxy,Md Rubel Ahmed
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 4 figures, 7 pages, IEEE conference template

点击查看摘要

Abstract:This paper introduces EdgeProfiler, a fast profiling framework designed for evaluating lightweight Large Language Models (LLMs) on edge systems. While LLMs offer remarkable capabilities in natural language understanding and generation, their high computational, memory, and power requirements often confine them to cloud environments. EdgeProfiler addresses these challenges by providing a systematic methodology for assessing LLM performance in resource-constrained edge settings. The framework profiles compact LLMs, including TinyLLaMA, Gemma3.1B, Llama3.2-1B, and DeepSeek-r1-1.5B, using aggressive quantization techniques and strict memory constraints. Analytical modeling is used to estimate latency, FLOPs, and energy consumption. The profiling reveals that 4-bit quantization reduces model memory usage by approximately 60-70%, while maintaining accuracy within 2-5% of full-precision baselines. Inference speeds are observed to improve by 2-3x compared to FP16 baselines across various edge devices. Power modeling estimates a 35-50% reduction in energy consumption for INT4 configurations, enabling practical deployment on hardware such as Raspberry Pi 4/5 and Jetson Orin Nano Super. Our findings emphasize the importance of efficient profiling tailored to lightweight LLMs in edge environments, balancing accuracy, energy efficiency, and computational feasibility.
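
EdgeProfiler 的分析建模思路可以用“信封背面”式的估算来体会(以下的线性缩放假设仅为说明,论文的分析模型更精细):

```python
def analytical_profile(n_params, bits, tokens_per_s_fp16, fp16_bits=16):
    """Back-of-the-envelope estimates in the spirit of an analytical model.
    The linear scaling assumptions (memory with bit-width, throughput with
    inverse memory traffic) are ours, for illustration only.
    """
    mem_gb = n_params * bits / 8 / 1e9   # weight memory footprint in GB
    speedup = fp16_bits / bits           # memory-bound throughput upper bound
    return mem_gb, tokens_per_s_fp16 * speedup

# A 1.5B-parameter model quantized to INT4:
mem, tps = analytical_profile(1.5e9, bits=4, tokens_per_s_fp16=10)
print(f"{mem:.2f} GB, ~{tps:.0f} tok/s upper bound")   # 0.75 GB, ~40 tok/s
```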

[AI-62] Llama-Affinity: A Predictive Antibody Antigen Binding Model Integrating Antibody Sequences with Llama3 Backbone Architecture

【速读】:该论文旨在解决抗体-抗原结合亲和力预测的准确性与计算效率问题,这是抗体药物开发中的关键环节。传统实验方法在亲和力测定中存在耗时且成本高的缺陷,而现有计算方法在精度和效率上仍有提升空间。本文提出的解决方案关键在于利用开源Llama 3模型架构和来自Observed Antibody Space (OAS)数据库的抗体序列数据,构建了一个先进的抗体-抗原结合亲和力预测模型(LlamaAffinity),通过引入大型语言模型(LLM)对抗体进行表征,显著提升了预测性能,并实现了更高的计算效率。

链接: https://arxiv.org/abs/2506.09052
作者: Delower Hossain,Ehsan Saghapour,Kevin Song,Jake Y. Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 7 Pages

点击查看摘要

Abstract:Antibody-facilitated immune responses are central to the body’s defense against pathogens, viruses, and other foreign invaders. The ability of antibodies to specifically bind and neutralize antigens is vital for maintaining immunity. Over the past few decades, bioengineering advancements have significantly accelerated therapeutic antibody development. These antibody-derived drugs have shown remarkable efficacy, particularly in treating cancer, SARS-CoV-2, autoimmune disorders, and infectious diseases. Traditionally, experimental methods for affinity measurement have been time-consuming and expensive. With the advent of artificial intelligence, in silico medicine has been revolutionized; recent developments in machine learning, particularly the use of large language models (LLMs) for representing antibodies, have opened up new avenues for AI-based design and improved affinity prediction. Herein, we present an advanced antibody-antigen binding affinity prediction model (LlamaAffinity), leveraging an open-source Llama 3 backbone and antibody sequence data sourced from the Observed Antibody Space (OAS) database. The proposed approach shows significant improvement over existing state-of-the-art (SOTA) methods (AntiFormer, AntiBERTa, AntiBERTy) across multiple evaluation metrics. Specifically, the model achieved an accuracy of 0.9640, an F1-score of 0.9643, a precision of 0.9702, a recall of 0.9586, and an AUC-ROC of 0.9936. Moreover, this strategy unveiled higher computational efficiency, with a five-fold average cumulative training time of only 0.46 hours, significantly lower than in previous studies.

[AI-63] TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

【速读】:该论文旨在解决Vision-Language-Action (VLA)模型在新环境中需要依赖监督微调(SFT)进行任务特定调整的问题,而这种传统方法无法让机器人与环境进行交互,也未能利用实时执行的反馈。解决方案的关键在于提出一种名为轨迹级组相对策略优化(TGRPO)的方法,通过融合步骤级和轨迹级的优势信号,改进GRPO的组级优势估计,从而提升算法在VLA在线强化学习训练中的适用性。

链接: https://arxiv.org/abs/2506.08440
作者: Zengjue Chen,Runliang Niu,He Kong,Qi Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in Vision-Language-Action (VLA) models have demonstrated strong generalization capabilities across diverse scenes, tasks, and robotic platforms when pretrained on large-scale datasets. However, these models still require task-specific fine-tuning in novel environments, a process that relies almost exclusively on supervised fine-tuning (SFT) using static trajectory datasets. Such approaches neither allow the robot to interact with the environment nor leverage feedback from live execution. Also, their success is critically dependent on the size and quality of the collected trajectories. Reinforcement learning (RL) offers a promising alternative by enabling closed-loop interaction and aligning learned policies directly with task objectives. In this work, we draw inspiration from the ideas of GRPO and propose the Trajectory-wise Group Relative Policy Optimization (TGRPO) method. By fusing step-level and trajectory-level advantage signals, this method improves GRPO’s group-level advantage estimation, thereby making the algorithm more suitable for online reinforcement learning training of VLA. Experimental results on ten manipulation tasks from the libero-object benchmark demonstrate that TGRPO consistently outperforms various baseline methods and is capable of generating more robust and efficient policies across multiple tested scenarios. Our source codes are available at: this https URL
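
TGRPO 对“步级 + 轨迹级”优势信号的融合可示意如下(组内标准化沿用 GRPO 的做法,线性混合权重 alpha 为本文假设,论文的实际融合规则可能不同):

```python
import torch

def fused_advantage(step_rewards_group, traj_returns_group, alpha=0.5):
    """Sketch of fusing step-level and trajectory-level group-relative
    advantages. Each signal is normalized within the sampled group (as in
    GRPO) and linearly mixed; the mixing rule is our assumption.

    step_rewards_group: (G, T) per-step rewards for G sampled trajectories
    traj_returns_group: (G,)   episode returns
    """
    def group_norm(x, dim):
        return (x - x.mean(dim, keepdim=True)) / (x.std(dim, keepdim=True) + 1e-8)
    a_step = group_norm(step_rewards_group, dim=0)   # (G, T)
    a_traj = group_norm(traj_returns_group, dim=0)   # (G,)
    return alpha * a_step + (1 - alpha) * a_traj.unsqueeze(-1)
```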

[AI-64] How attention simplifies mental representations for planning

【速读】:该论文试图解决人类在规划过程中如何通过空间注意控制任务表征的哪些方面进入主观意识并用于规划的问题。其解决方案的关键在于揭示空间邻近性对任务表征可用性的影响,并表明当任务相关信息遵循注意的自然(侧向化)轮廓时,个体能够更轻松地构建简化且有用的任务表征。研究进一步表明,注意的影响在个体间存在显著差异,这解释了人们在任务表征和行为上的不同表现。

链接: https://arxiv.org/abs/2506.09520
作者: Jason da Silva Castanheira,Nicholas Shea,Stephen M. Fleming
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Human planning is efficient – it frugally deploys limited cognitive resources to accomplish difficult tasks – and flexible – adapting to novel problems and environments. Computational approaches suggest that people construct simplified mental representations of their environment, balancing the complexity of a task representation with its utility. These models imply a nested optimisation in which planning shapes perception, and perception shapes planning – but the perceptual and attentional mechanisms governing how this interaction unfolds remain unknown. Here, we harness virtual maze navigation to characterise how spatial attention controls which aspects of a task representation enter subjective awareness and are available for planning. We find that spatial proximity governs which aspects of a maze are available for planning, and that when task-relevant information follows natural (lateralised) contours of attention, people can more easily construct simplified and useful maze representations. This influence of attention varies considerably across individuals, explaining differences in people’s task representations and behaviour. Inspired by the ‘spotlight of attention’ analogy, we incorporate the effects of visuospatial attention into existing computational accounts of value-guided construal. Together, our work bridges computational perspectives on perception and decision-making to better understand how individuals represent their environments in aid of planning.

[AI-65] Know What You Don't Know: Uncertainty Calibration of Process Reward Models

【速读】:该论文旨在解决过程奖励模型(Process Reward Models, PRMs)在大型语言模型(Large Language Models, LLMs)推理阶段的校准不足问题,即PRMs常常高估成功概率。其解决方案的关键在于采用分位数回归进行校准,以调整PRM输出,使其更符合真实的成功概率。通过校准后的成功估计及其置信区间,论文提出了实例自适应缩放(Instance-Adaptive Scaling, IAS)框架,该框架根据部分推理轨迹产生正确最终答案的可能性动态调整推理预算,从而在保持答案准确性的同时降低推理成本。

链接: https://arxiv.org/abs/2506.09338
作者: Young-Jin Park,Kristjan Greenewald,Kaveh Alim,Hao Wang,Navid Azizan
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated and often overestimate success probabilities. To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the inference budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, this approach successfully adapts to each instance and reasoning step when using our calibrated PRMs. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method successfully achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective adaptive scaling, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
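
“分位数回归校准 + 按置信下界分配推理预算”的组合可用如下草图表示(以 scikit-learn 的 QuantileRegressor 为例;单特征、二元结局与预算公式都是对论文设定的简化假设):

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

def calibrate_prm(raw_scores, successes, quantiles=(0.1, 0.5, 0.9)):
    """Map raw PRM scores to conditional quantiles of the observed success
    indicator, yielding a success estimate (median) plus a confidence band.
    Sketch only; the paper's feature set and regression setup may differ.
    """
    X = np.asarray(raw_scores).reshape(-1, 1)
    y = np.asarray(successes, dtype=float)
    models = {q: QuantileRegressor(quantile=q, alpha=0.0).fit(X, y)
              for q in quantiles}
    def predict(score):
        return {q: float(m.predict([[score]])[0]) for q, m in models.items()}
    return predict

def ias_budget(pred, base=16):
    # Instance-adaptive scaling (sketch): spend fewer samples when the lower
    # confidence bound on success is already high. Formula is our assumption.
    return max(1, int(base * (1.0 - pred[0.1])))
```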

[AI-66] Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms

【速读】:该论文旨在解决多无人飞行器(UAV)系统中同时优化服务覆盖范围和延长电池寿命的双重目标问题。解决方案的关键在于提出一种基于图注意力的去中心化策略-评论家算法(Graph Attention-based Decentralized Actor-Critic, GADC),该方法通过图注意力网络处理无人机的有限局部观测信息并降低环境状态的维度,随后利用策略-双评论家网络管理联合优化的双重策略,并采用Kullback-Leibler(KL)散度因子来平衡覆盖性能与电池寿命之间的权衡。

链接: https://arxiv.org/abs/2506.09195
作者: Haoran Peng,Ying-Jun Angela Zhang
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This research focuses on optimizing multi-UAV systems with dual objectives: maximizing service coverage as the primary goal while extending battery lifetime as the secondary objective. We propose a Graph Attention-based Decentralized Actor-Critic (GADC) to optimize the dual objectives. The proposed approach leverages a graph attention network to process UAVs’ limited local observation and reduce the dimension of the environment states. Subsequently, an actor-double-critic network is developed to manage dual policies for joint objective optimization. The proposed GADC uses a Kullback-Leibler (KL) divergence factor to balance the tradeoff between coverage performance and battery lifetime in the multi-UAV system. We assess the scalability and efficiency of GADC through comprehensive benchmarking against state-of-the-art methods, considering both theoretical and experimental aspects. Extensive testing in both ideal settings and NVIDIA Sionna’s realistic ray tracing environment demonstrates GADC’s superior performance.

[AI-67] Integration of Contrastive Predictive Coding and Spiking Neural Networks

【速读】:该论文试图解决如何将对比预测编码(Contrastive Predictive Coding, CPC)与脉冲神经网络(Spiking Neural Network, SNN)相结合,以提升模型的生物合理性并实现有效的表征学习。其解决方案的关键在于通过基于脉冲的系统处理输入和输出,从而构建一个具有更高生物 plausible 性的预测编码模型,该模型不仅能够完成分类任务,还可作为编码机制使用。

链接: https://arxiv.org/abs/2506.09194
作者: Emirhan Bilgiç,Neslihan Serap Şengör,Namık Berk Yalabık,Yavuz Selim İşler,Aykut Görkem Gelen,Rahmi Elibol
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 4 pages, 5 figures, 1 table. Accepted at the 2025 33rd Signal Processing and Communications Applications Conference (SIU)

点击查看摘要

Abstract:This study examines the integration of Contrastive Predictive Coding (CPC) with Spiking Neural Networks (SNN). While CPC learns the predictive structure of data to generate meaningful representations, SNN mimics the computational processes of biological neural systems over time. In this study, the goal is to develop a predictive coding model with greater biological plausibility by processing inputs and outputs in a spike-based system. The proposed model was tested on the MNIST dataset and achieved a high classification rate in distinguishing positive sequential samples from non-sequential negative samples. The study demonstrates that CPC can be effectively combined with SNN, showing that an SNN trained for classification tasks can also function as an encoding mechanism. Project codes and detailed results can be accessed on our GitHub page: this https URL

[AI-68] Estimating Visceral Adiposity from Wrist-Worn Accelerometry

【速读】:该论文试图解决如何通过习惯性身体活动(PA)来估计内脏脂肪组织(VAT)的问题,从而间接评估代谢健康风险。其解决方案的关键在于利用加速度计数据,结合工程特征和深度神经网络模型,通过机器学习方法建立PA与VAT之间的关联。具体而言,第一种方法基于步态和睡眠期间的运动特征,采用岭回归进行VAT估算;第二种方法则使用深度神经网络,通过Transformer模型处理24小时连续加速度计数据,最终实现更精确的VAT预测。两种方法的结合显著提高了估算精度,相关系数达到r=0.86。

链接: https://arxiv.org/abs/2506.09167
作者: James R. Williamson,Andrew Alini,Brian A. Telfer,Adam W. Potter,Karl E. Friedl
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 13 pages

点击查看摘要

Abstract:Visceral adipose tissue (VAT) is a key marker of both metabolic health and habitual physical activity (PA). Excess VAT is highly correlated with type 2 diabetes and insulin resistance. The mechanistic basis for this pathophysiology relates to overloading the liver with fatty acids. VAT is also a highly labile fat depot, with increased turnover stimulated by catecholamines during exercise. VAT can be measured with sophisticated imaging technologies, but can also be inferred directly from PA. We tested this relationship using National Health and Nutrition Examination Survey (NHANES) data from 2011-2014, for individuals aged 20-60 years with 7 days of accelerometry data (n=2,456 men; 2,427 women) [1]. Two approaches were used for estimating VAT from activity. The first used engineered features based on movements during gait and sleep, and then ridge regression to map summary statistics of these features into a VAT estimate. The second approach used deep neural networks trained on 24 hours of continuous accelerometry. A foundation model first mapped each 10s frame into a high-dimensional feature vector. A transformer model then mapped each day’s feature vector time series into a VAT estimate, which were averaged over multiple days. For both approaches, the most accurate estimates were obtained with the addition of covariate information about subject demographics and body measurements. The best performance was obtained by combining the two approaches, resulting in VAT estimates with correlations of r=0.86. These findings demonstrate a strong relationship between PA and VAT and, by extension, between PA and metabolic health risks.

机器学习

[LG-0] Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

链接: https://arxiv.org/abs/2506.09991
作者: Xinyu Yang,Yuwei An,Hongyi Liu,Tianqi Chen,Beidi Chen
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 and AIME25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.
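
摘要中的 Map-Process-Reduce 三阶段可以用一个玩具流程体会(线程池只是对“模型内并行生成”的外部模拟,plan/solve/merge 为占位的模型调用):

```python
from concurrent.futures import ThreadPoolExecutor

def multiverse_generate(prompt, plan, solve, merge):
    """Toy sketch of the MapReduce paradigm the abstract describes: Map
    decomposes the task, Process runs subtasks in parallel, Reduce merges
    results losslessly. The real system switches between sequential and
    parallel generation inside one model; a thread pool only imitates that.
    """
    subtasks = plan(prompt)                    # Map: adaptive decomposition
    with ThreadPoolExecutor() as pool:         # Process: parallel subtasks
        partials = list(pool.map(solve, subtasks))
    return merge(partials)                     # Reduce: lossless synthesis

# Usage with trivial stand-ins:
print(multiverse_generate(
    "sum 1..4 and 5..8",
    plan=lambda p: [range(1, 5), range(5, 9)],
    solve=lambda r: sum(r),
    merge=lambda parts: sum(parts)))           # -> 36
```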

[LG-1] Bayesian Probabilistic Matrix Factorization

链接: https://arxiv.org/abs/2506.09928
作者: Ruixuan Xu,Xiangxiang Weng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Matrix factorization is a widely used technique in recommendation systems. Probabilistic Matrix Factorization (PMF) [1] extends traditional matrix factorization by incorporating probability distributions over latent factors, allowing for uncertainty quantification. However, computing the posterior distribution is intractable due to the high-dimensional integral. To address this, we employ two Bayesian inference methods: Markov Chain Monte Carlo (MCMC) [2] and Variational Inference (VI) [3] to approximate the posterior. We evaluate their performance on the MovieLens dataset and compare their convergence speed, predictive accuracy, and computational efficiency. Experimental results demonstrate that VI offers faster convergence, while MCMC provides more accurate posterior estimates.

[LG-2] Apollo: A Posteriori Label-Only Membership Inference Attack Towards Machine Unlearning

链接: https://arxiv.org/abs/2506.09923
作者: Liou Tang,James Joshi,Ashish Kundu
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine Unlearning (MU) aims to update Machine Learning (ML) models following requests to remove training samples and their influences on a trained model efficiently without retraining the original ML model from scratch. While MU itself has been employed to provide privacy protection and regulatory compliance, it can also increase the attack surface of the model. Existing privacy inference attacks towards MU that aim to infer properties of the unlearned set rely on the weaker threat model that assumes the attacker has access to both the unlearned model and the original model, limiting their feasibility toward real-life scenarios. We propose a novel privacy attack, A Posteriori Label-Only Membership Inference Attack towards MU, Apollo, that infers whether a data sample has been unlearned, following a strict threat model where an adversary has access to the label-output of the unlearned model only. We demonstrate that our proposed attack, while requiring less access to the target model compared to previous attacks, can achieve relatively high precision on the membership status of the unlearned samples.

[LG-3] “What are my options?”: Explaining RL Agents with Diverse Near-Optimal Alternatives (Extended)

链接: https://arxiv.org/abs/2506.09901
作者: Noel Brindise,Vijeth Hebbar,Riya Shah,Cedric Langbort
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we provide an extended discussion of a new approach to explainable Reinforcement Learning called Diverse Near-Optimal Alternatives (DNA), first proposed at L4DC 2025. DNA seeks a set of reasonable “options” for trajectory-planning agents, optimizing policies to produce qualitatively diverse trajectories in Euclidean space. In the spirit of explainability, these distinct policies are used to “explain” an agent’s options in terms of available trajectory shapes from which a human user may choose. In particular, DNA applies to value function-based policies on Markov decision processes where agents are limited to continuous trajectories. Here, we describe DNA, which uses reward shaping in local, modified Q-learning problems to solve for distinct policies with guaranteed epsilon-optimality. We show that it successfully returns qualitatively different policies that constitute meaningfully different “options” in simulation, including a brief comparison to related approaches in the stochastic optimization field of Quality Diversity. Beyond the explanatory motivation, this work opens new possibilities for exploration and adaptive planning in RL.

[LG-4] A look at adversarial attacks on radio waveforms from discrete latent space

链接: https://arxiv.org/abs/2506.09896
作者: Attanasia Garuso,Silvija Kokalj-Filipovic,Yagna Kaasaragadda
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Having designed a VQVAE that maps digital radio waveforms into discrete latent space, and yields a perfectly classifiable reconstruction of the original data, we here analyze the attack-suppressing properties of VQVAE when an adversarial attack is performed on high-SNR radio-frequency (RF) data points. To target amplitude modulations from a subset of digitally modulated waveform classes, we first create adversarial attacks that preserve the phase between the in-phase and quadrature component whose values are adversarially changed. We compare them with adversarial attacks of the same intensity where phase is not preserved. We test the classification accuracy of such adversarial examples on a classifier trained to deliver 100% accuracy on the original data. To assess the ability of VQVAE to suppress the strength of the attack, we evaluate the classifier accuracy on the reconstructions by VQVAE of the adversarial data points and show that VQVAE substantially decreases the effectiveness of the attack. We also compare the I/Q plane diagram of the attacked data, their reconstructions and the original data. Finally, using multiple methods and metrics, we compare the probability distribution of the VQVAE latent space with and without attack. Varying the attack strength, we observe interesting properties of the discrete space, which may help detect the attacks.
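
A sketch of the evaluation loop the abstract describes, assuming the VQVAE exposes hypothetical encode/quantize/decode methods and the classifier a predict method; the paper's actual interfaces may differ.

```python
import numpy as np

def attack_suppression_accuracy(classifier, vqvae, x_adv, y_true):
    """Pass adversarial RF examples through the VQVAE encode/quantize/decode
    pipeline and measure how much classifier accuracy the reconstructions
    restore. `classifier` and `vqvae` are assumed pre-trained objects."""
    x_rec = vqvae.decode(vqvae.quantize(vqvae.encode(x_adv)))  # reconstruction
    acc_adv = np.mean(classifier.predict(x_adv) == y_true)
    acc_rec = np.mean(classifier.predict(x_rec) == y_true)
    return acc_adv, acc_rec  # suppression succeeds if acc_rec >> acc_adv
```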

[LG-5] Learning single-index models via harmonic decomposition

链接: https://arxiv.org/abs/2506.09887
作者: Nirmit Joshi,Hugo Koubbi,Theodor Misiakiewicz,Nathan Srebro
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 80 pages

点击查看摘要

Abstract:We study the problem of learning single-index models, where the label $y \in \mathbb{R}$ depends on the input $\boldsymbol{x} \in \mathbb{R}^d$ only through an unknown one-dimensional projection $\langle \boldsymbol{w}_*, \boldsymbol{x} \rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\boldsymbol{w}_*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that “spherical harmonics” – rather than “Hermite polynomials” – provide the natural basis for this problem, as they capture its intrinsic “rotational symmetry”. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators – based on tensor unfolding and online SGD – that respectively achieve either optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.

[LG-6] UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

链接: https://arxiv.org/abs/2506.09874
作者: Neta Glazer,Aviv Navon,Yael Segal,Aviv Shamsian,Hilit Segev,Asaf Buchnick,Menachem Pirchi,Gil Hetz,Joseph Keshet
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching based TTS model that jointly generates both speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of paired data in which speech and background audio are aligned in a natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.

[LG-7] Private Aggregation for Byzantine-Resilient Heterogeneous Federated Learning

链接: https://arxiv.org/abs/2506.09870
作者: Maximilian Egger,Rawad Bitar
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Ensuring resilience to Byzantine clients while maintaining the privacy of the clients’ data is a fundamental challenge in federated learning (FL). When the clients’ data is homogeneous, suitable countermeasures were studied from an information-theoretic perspective utilizing secure aggregation techniques while ensuring robust aggregation of the clients’ gradients. However, the countermeasures used fail when the clients’ data is heterogeneous. Suitable pre-processing techniques, such as nearest neighbor mixing, were recently shown to enhance the performance of those countermeasures in the heterogeneous setting. Nevertheless, those pre-processing techniques cannot be applied with the introduced privacy-preserving mechanisms. We propose a multi-stage method encompassing a careful co-design of verifiable secret sharing, secure aggregation, and a tailored symmetric private information retrieval scheme to achieve information-theoretic privacy guarantees and Byzantine resilience under data heterogeneity. We evaluate the effectiveness of our scheme on a variety of attacks and show how it outperforms the previously known techniques. Since the communication overhead of secure aggregation is non-negligible, we investigate the interplay with zero-order estimation methods that reduce the communication cost in state-of-the-art FL tasks and thereby make private aggregation scalable.
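
To make the secure-aggregation ingredient concrete, here is a toy additive secret-sharing split over integers; it is a textbook illustration only and omits the verifiable sharing, robust aggregation, and private information retrieval components the paper actually co-designs.

```python
import numpy as np

MOD = 2**31 - 1  # a prime modulus for the toy arithmetic

def additive_shares(x, n_servers):
    """Split an integer-quantized gradient x into n additive shares that sum
    to x mod MOD; any n-1 shares reveal nothing about x. Summing each
    client's i-th share server-side yields shares of the aggregate."""
    shares = np.random.randint(0, MOD, size=(n_servers - 1, *x.shape))
    last = (x - shares.sum(axis=0)) % MOD
    return np.concatenate([shares, last[None]], axis=0)

x = np.array([5, 7, 11])
s = additive_shares(x, n_servers=3)
print(s.sum(axis=0) % MOD)  # recovers [5, 7, 11]
```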

[LG-8] Machine Learning-Based Classification of Oils Using Dielectric Properties and Microwave Resonant Sensing

链接: https://arxiv.org/abs/2506.09867
作者: Amit Baran Dey,Wasim Arif,Rakhesh Singh Kshetrimayum
类目: Machine Learning (cs.LG)
*备注: 6 pages, 11 figures, Accepted to IEEE INDISCON 2025

点击查看摘要

Abstract:This paper proposes a machine learning-based methodology for the classification of various oil samples based on their dielectric properties, utilizing a microwave resonant sensor. The dielectric behaviour of oils, governed by their molecular composition, induces distinct shifts in the sensor’s resonant frequency and amplitude response. These variations are systematically captured and processed to extract salient features, which serve as inputs for multiple machine learning classifiers. The microwave resonant sensor operates in a non-destructive, low-power manner, making it particularly well-suited for real-time industrial applications. A comprehensive dataset is developed by varying the permittivity of oil samples and acquiring the corresponding sensor responses. Several classifiers are trained and evaluated using the extracted resonant features to assess their capability in distinguishing between oil types. Experimental results demonstrate that the proposed approach achieves a high classification accuracy of 99.41% with the random forest classifier, highlighting its strong potential for automated oil identification. The system’s compact form factor, efficiency, and high performance underscore its viability for fast and reliable oil characterization in industrial environments.
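
A minimal scikit-learn sketch of the classification stage, with made-up resonant-feature values standing in for the sensor measurements; the actual study uses more oil classes and a richer feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical features extracted from the sensor response:
# [resonant frequency shift (MHz), amplitude change (dB)] per oil sample.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([12.0, -3.0], 0.3, (100, 2)),   # oil type A
               rng.normal([15.5, -1.2], 0.3, (100, 2))])  # oil type B
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```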

[LG-9] Weighted Loss Methods for Robust Federated Learning under Data Heterogeneity

链接: https://arxiv.org/abs/2506.09824
作者: Johan Erbani,Sonia Ben Mokhtar,Pierre-Edouard Portier,Elod Egyed-Zsigmond,Diana Nurbakova
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of threats from a security perspective as (Byzantine) participants can contribute poisonous gradients (or model parameters) harming model convergence. Byzantine-resilient FL addresses this issue by ensuring that the training proceeds as if Byzantine participants were absent. Towards this purpose, common strategies ignore outlier gradients during model aggregation, assuming that Byzantine gradients deviate more from honest gradients than honest gradients do from each other. However, in heterogeneous settings, honest gradients may differ significantly, making it difficult to distinguish honest outliers from Byzantine ones. In this paper, we introduce the Worker Label Alignment Loss (WoLA), a weighted loss that aligns honest worker gradients despite data heterogeneity, which facilitates the identification of Byzantines’ gradients. This approach significantly outperforms state-of-the-art methods in heterogeneous settings. We provide both theoretical insights and empirical evidence of its effectiveness.

[LG-10] Identifiability Challenges in Sparse Linear Ordinary Differential Equations

链接: https://arxiv.org/abs/2506.09816
作者: Cecilia Casolo,Sören Becker,Niki Kilbertus
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that “linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory.” However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.
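
The following toy (our construction, not the paper's) shows one way identifiability fails: if a sparse linear ODE's trajectory stays in a proper subspace, least squares cannot recover the corresponding columns of the system matrix.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sparse linear ODE x' = A x whose trajectory never leaves a 2-D subspace:
# the third coordinate starts (and therefore stays) at zero.
A = np.diag([0.0, -1.0, -2.0])
x0 = np.array([1.0, 1.0, 0.0])
sol = solve_ivp(lambda t, x: A @ x, (0, 5), x0, t_eval=np.linspace(0, 5, 400))
X, t = sol.y.T, sol.t

dX = np.gradient(X, t, axis=0)                    # finite-difference derivatives
A_hat = np.linalg.lstsq(X, dX, rcond=None)[0].T   # least-squares estimate of A
print(np.round(A_hat, 2))  # third column is unrecoverable: x3(t) == 0 carries no info
```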

[LG-11] Metritocracy: Representative Metrics for Lite Benchmarks

链接: https://arxiv.org/abs/2506.09813
作者: Ariel Procaccia,Benjamin Schiffer,Serena Wang,Shirley Zhang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:A common problem in LLM evaluation is how to choose a subset of metrics from a full suite of possible metrics. Subset selection is usually done for efficiency or interpretability reasons, and the goal is often to select a “representative” subset of metrics. However, “representative” is rarely clearly defined. In this work, we use ideas from social choice theory to formalize two notions of representation for the selection of a subset of evaluation metrics. We first introduce positional representation, which guarantees every alternative is sufficiently represented at every position cutoff. We then introduce positional proportionality, which guarantees no alternative is proportionally over- or under-represented by more than a small error at any position. We prove upper and lower bounds on the smallest number of metrics needed to guarantee either of these properties in the worst case. We also study a generalized form of each property that allows for additional input on groups of metrics that must be represented. Finally, we tie theory to practice through real-world case studies on both LLM evaluation and hospital quality evaluation.

[LG-12] Generalizing Supervised Contrastive learning: A Projection Perspective

链接: https://arxiv.org/abs/2506.09810
作者: Minoh Jeong,Alfred Hero
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Self-supervised contrastive learning (SSCL) has emerged as a powerful paradigm for representation learning and has been studied from multiple perspectives, including mutual information and geometric viewpoints. However, supervised contrastive (SupCon) approaches have received comparatively little attention in this context: for instance, while InfoNCE used in SSCL is known to form a lower bound on mutual information (MI), the relationship between SupCon and MI remains unexplored. To address this gap, we introduce ProjNCE, a generalization of the InfoNCE loss that unifies supervised and self-supervised contrastive objectives by incorporating projection functions and an adjustment term for negative pairs. We prove that ProjNCE constitutes a valid MI bound and affords greater flexibility in selecting projection strategies for class embeddings. Building on this flexibility, we further investigate the centroid-based class embeddings in SupCon through a variety of projection methods. Extensive experiments on multiple datasets and settings demonstrate that ProjNCE consistently outperforms both SupCon and standard cross-entropy training. Our work thus refines SupCon along two complementary perspectives – mutual information interpretation and projection design – and offers broadly applicable improvements whenever SupCon serves as the foundational contrastive objective.
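
A hedged sketch of one centroid-based projection in the spirit of ProjNCE: each embedding is contrasted against class centroids through a softmax over temperature-scaled similarities. The exact ProjNCE loss, with its adjustment term for negative pairs, differs.

```python
import torch
import torch.nn.functional as F

def centroid_supcon_loss(z, labels, tau=0.1):
    """SupCon-style objective where each sample is pulled toward the centroid
    of its class embeddings (one possible 'projection') and pushed away from
    other class centroids. z: (N, d) embeddings, labels: (N,) int64."""
    z = F.normalize(z, dim=1)
    classes = labels.unique()                         # sorted unique labels
    centroids = torch.stack([z[labels == c].mean(0) for c in classes])
    centroids = F.normalize(centroids, dim=1)
    logits = z @ centroids.T / tau                    # (N, C) similarities
    targets = torch.searchsorted(classes, labels)     # map labels -> centroid index
    return F.cross_entropy(logits, targets)
```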

[LG-13] Devil's Hand: Data Poisoning Attacks to Locally Private Graph Learning Protocols

链接: https://arxiv.org/abs/2506.09803
作者: Longzhu He,Chaozhuo Li,Peng Tang,Litian Zhang,Sen Su
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have achieved significant success in graph representation learning and have been applied to various domains. However, many real-world graphs contain sensitive personal information, such as user profiles in social networks, raising serious privacy concerns when graph learning is performed using GNNs. To address this issue, locally private graph learning protocols have gained considerable attention. These protocols leverage the privacy advantages of local differential privacy (LDP) and the effectiveness of GNN’s message-passing in calibrating noisy data, offering strict privacy guarantees for users’ local data while maintaining high utility (e.g., node classification accuracy) for graph learning. Despite these advantages, such protocols may be vulnerable to data poisoning attacks, a threat that has not been considered in previous research. Identifying and addressing these threats is crucial for ensuring the robustness and security of privacy-preserving graph learning frameworks. This work introduces the first data poisoning attack targeting locally private graph learning protocols. The attacker injects fake users into the protocol, manipulates these fake users to establish links with genuine users, and sends carefully crafted data to the server, ultimately compromising the utility of private graph learning. The effectiveness of the attack is demonstrated both theoretically and empirically. In addition, several defense strategies have also been explored, but their limited effectiveness highlights the need for more robust defenses.

[LG-14] Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction INTERSPEECH2025

链接: https://arxiv.org/abs/2506.09792
作者: Wenxuan Wu,Shuai Wang,Xixin Wu,Helen Meng,Haizhou Li
类目: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Audio-visual target speaker extraction (AV-TSE) models primarily rely on target visual cues to isolate the target speaker’s voice from others. We know that humans leverage linguistic knowledge, such as syntax and semantics, to support speech perception. Inspired by this, we explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without introducing any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility. Furthermore, we evaluate our method in multi-language settings and visual cue-impaired scenarios and show robust performance gains.

[LG-15] On the Similarities of Embeddings in Contrastive Learning

链接: https://arxiv.org/abs/2506.09781
作者: Chungpa Lee,Sehee Lim,Kibok Lee,Jy-yong Sohn
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: contrastive learning, representation learning, embedding, similarity, negative pair, positive pair

点击查看摘要

Abstract:Contrastive learning (CL) operates on a simple yet effective principle: embeddings of positive pairs are pulled together, while those of negative pairs are pushed apart. Although various forms of contrastive loss have been proposed and analyzed from different perspectives, prior works lack a comprehensive framework that systematically explains a broad class of these objectives. In this paper, we present a unified framework for understanding CL, which is based on analyzing the cosine similarity between embeddings of positive and negative pairs. In full-batch settings, we show that perfect alignment of positive pairs is unattainable when similarities of negative pairs fall below a certain threshold, and that this misalignment can be alleviated by incorporating within-view negative pairs. In mini-batch settings, we demonstrate that smaller batch sizes incur stronger separation among negative pairs within batches, which leads to higher variance in similarities of negative pairs. To address this limitation of mini-batch CL, we introduce an auxiliary loss term that reduces the variance of similarities of negative pairs in CL. Empirical results demonstrate that incorporating the proposed loss consistently improves the performance of CL methods in small-batch training.
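
A minimal sketch of penalizing the variance of negative-pair similarities on top of InfoNCE; the weighting and exact form of the paper's auxiliary term are assumptions here.

```python
import torch
import torch.nn.functional as F

def info_nce_with_var_reg(z1, z2, tau=0.5, beta=1.0):
    """InfoNCE on two views plus an auxiliary term penalizing the variance of
    negative-pair cosine similarities, in the spirit of the proposed loss.
    z1, z2: (n, d) embeddings of the two augmented views, row-aligned."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    sim = z1 @ z2.T / tau                        # (n, n) cross-view similarities
    loss_nce = F.cross_entropy(sim, torch.arange(n, device=z1.device))
    neg = sim[~torch.eye(n, dtype=torch.bool, device=z1.device)]
    return loss_nce + beta * neg.var()           # variance of negative similarities
```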

[LG-16] Learning to Optimize Package Picking for Large-Scale Real-World Robot Induction

链接: https://arxiv.org/abs/2506.09765
作者: Shuai Li,Azarakhsh Keipour,Sicong Zhao,Srinath Rajagopalan,Charles Swan,Kostas E. Bekris
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: The 19th International Symposium on Experimental Robotics (ISER 2025); 6-10 July 2025, Santa Fe, New Mexico, USA; 10 pages

点击查看摘要

Abstract:Warehouse automation plays a pivotal role in enhancing operational efficiency, minimizing costs, and improving resilience to workforce variability. While prior research has demonstrated the potential of machine learning (ML) models to increase picking success rates in large-scale robotic fleets by prioritizing high-probability picks and packages, these efforts primarily focused on predicting success probabilities for picks sampled using heuristic methods. Limited attention has been given, however, to leveraging data-driven approaches to directly optimize sampled picks for better performance at scale. In this study, we propose an ML-based framework that predicts transform adjustments as well as improving the selection of suction cups for multi-suction end effectors for sampled picks to enhance their success probabilities. The framework was integrated and evaluated in test workcells that resemble the operations of Amazon Robotics’ Robot Induction (Robin) fleet, which is used for package manipulation. Evaluated on over 2 million picks, the proposed method achieves a 20% reduction in pick failure rates compared to a heuristic-based pick sampling baseline, demonstrating its effectiveness in large-scale warehouse automation scenarios.

[LG-17] Alice and the Caterpillar: A more descriptive null model for assessing data mining results

链接: https://arxiv.org/abs/2506.09764
作者: Giulia Preti,Gianmarco De Francisci Morales,Matteo Riondato
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.

[LG-18] Towards Multi-modal Graph Large Language Model

链接: https://arxiv.org/abs/2506.09738
作者: Xin Wang,Zeyang Zhang,Linxin Xiao,Haibo Chen,Chendi Ge,Wenwu Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal graphs, which integrate diverse multi-modal features and relations, are ubiquitous in real-world applications. However, existing multi-modal graph learning methods are typically trained from scratch for specific graph data and tasks, failing to generalize across various multi-modal graph data and tasks. To bridge this gap, we explore the potential of Multi-modal Graph Large Language Models (MG-LLM) to unify and generalize across diverse multi-modal graph data and tasks. We propose a unified framework of multi-modal graph data, task, and model, discovering the inherent multi-granularity and multi-scale characteristics in multi-modal graphs. Specifically, we present five key desired characteristics for MG-LLM: 1) unified space for multi-modal structures and attributes, 2) capability of handling diverse multi-modal graph tasks, 3) multi-modal graph in-context learning, 4) multi-modal graph interaction with natural language, and 5) multi-modal graph reasoning. We then elaborate on the key challenges, review related works, and highlight promising future research directions towards realizing these ambitious characteristics. Finally, we summarize existing multi-modal graph datasets pertinent for model training. We believe this paper can contribute to the ongoing advancement of the research towards MG-LLM for generalization across multi-modal graph data and tasks.

[LG-19] Auto-Compressing Networks

链接: https://arxiv.org/abs/2506.09714
作者: Vaggelis Dorovatas,Georgios Paraskevopoulos,Alexandros Potamianos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. In this work, we introduce Auto-Compressing Networks (ACNs), an architectural variant where additive long feedforward connections from each layer to the output replace traditional short residual connections. ACNs showcase a unique property we coin as “auto-compression”, the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically “pushed” into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We theoretically show that this property emerges from layer-wise training patterns present in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and mitigate catastrophic forgetting, suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy across vision transformers, MLP-mixers, and BERT architectures. Furthermore, we demonstrate that coupling ACNs with traditional pruning techniques, enables significantly better sparsity-performance trade-offs compared to conventional architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity, while learning robust representations.
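
A minimal sketch of the architectural idea, assuming an MLP backbone: each layer contributes an additive long connection directly to the output instead of a short residual link. The layer sizes and per-layer output heads are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AutoCompressingMLP(nn.Module):
    """ACN-style MLP: no short residuals; every layer sends an additive
    'long' feedforward connection straight to the output."""
    def __init__(self, dim=128, depth=6, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)])
        self.heads = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(depth)])

    def forward(self, x):
        out, h = 0, x
        for block, head in zip(self.blocks, self.heads):
            h = block(h)          # no short residual: h is transformed in place
            out = out + head(h)   # additive long connection to the output
        return out
```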

[LG-20] Wasserstein Hypergraph Neural Network

链接: https://arxiv.org/abs/2506.09682
作者: Iulia Duta,Pietro Liò
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ability to model relational information using machine learning has driven advancements across various domains, from medicine to social science. While graph representation learning has become mainstream over the past decade, representing higher-order relationships through hypergraphs is rapidly gaining momentum. In the last few years, numerous hypergraph neural networks have emerged, most of them falling under a two-stage, set-based framework. The messages are sent from nodes to edges and then from edges to nodes. However, most of the advancement still takes inspiration from the graph counterpart, often simplifying the aggregations to basic pooling operations. In this paper we introduce the Wasserstein Hypergraph Neural Network, a model that treats node and hyperedge neighbourhoods as distributions and aggregates the information using Sliced Wasserstein Pooling. Unlike conventional aggregators such as mean or sum, which only capture first-order statistics, our approach has the ability to preserve geometric properties like the shape and spread of distributions. This enables the learned embeddings to reflect how easily one hyperedge distribution can be transformed into another, following principles of optimal transport. Experimental results demonstrate that applying Wasserstein pooling in a hypergraph setting significantly benefits node classification tasks, achieving top performance on several real-world datasets.
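
A sketch of sliced-Wasserstein-style pooling over a hyperedge's member embeddings: random 1-D projections are sorted, so the pooled vector retains the shape and spread of the distribution rather than only its mean. Real implementations typically interpolate the sorted projections to fixed quantiles so the output size does not depend on the number of members.

```python
import torch

def sliced_wasserstein_pool(x, n_proj=16):
    """Pool a set of member embeddings x (n, d) by sorting their projections
    onto random unit directions; sorted projections preserve distributional
    shape and spread, unlike mean or sum pooling."""
    n, d = x.shape
    theta = torch.randn(d, n_proj)
    theta = theta / theta.norm(dim=0, keepdim=True)   # random unit directions
    proj = x @ theta                                  # (n, n_proj) 1-D projections
    sorted_proj, _ = torch.sort(proj, dim=0)          # sort along the set dimension
    return sorted_proj.flatten()                      # fixed size when n is fixed
```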

[LG-21] Wavelet Scattering Transform and Fourier Representation for Offline Detection of Malicious Clients in Federated Learning

链接: https://arxiv.org/abs/2506.09674
作者: Alessandro Licciardi,Davide Leo,Davide Carbone
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables the training of machine learning models across decentralized clients while preserving data privacy. However, the presence of anomalous or corrupted clients - such as those with faulty sensors or non-representative data distributions - can significantly degrade model performance. Detecting such clients without accessing raw data remains a key challenge. We propose WAFFLE (Wavelet and Fourier representations for Federated Learning), a detection algorithm that labels malicious clients before training, using locally computed compressed representations derived from either the Wavelet Scattering Transform (WST) or the Fourier Transform. Both approaches provide low-dimensional, task-agnostic embeddings suitable for unsupervised client separation. A lightweight detector, trained on a distillated public dataset, performs the labeling with minimal communication and computational overhead. While both transforms enable effective detection, WST offers theoretical advantages, such as non-invertibility and stability to local deformations, that make it particularly well-suited to federated scenarios. Experiments on benchmark datasets show that our method improves detection accuracy and downstream classification performance compared to existing FL anomaly detection algorithms, validating its effectiveness as a pre-training alternative to online detection strategies.
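
A sketch of the Fourier branch: each client compresses its local signals into a low-dimensional spectral signature that can be screened offline. The coefficient count and averaging scheme are illustrative; the WST branch would use a scattering transform instead.

```python
import numpy as np

def fourier_embedding(batch, n_coeffs=32):
    """Compress a client's local data (n_samples, signal_len) into a compact,
    task-agnostic spectral signature: the average magnitude spectrum,
    truncated to the first n_coeffs frequencies."""
    spectra = np.abs(np.fft.rfft(batch, axis=-1))   # magnitude spectrum per sample
    return spectra.mean(axis=0)[:n_coeffs]          # average over local samples
```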

[LG-22] SyncFed: Time-Aware Federated Learning through Explicit Timestamping and Synchronization

链接: https://arxiv.org/abs/2506.09660
作者: Baran Can Gül,Stefanos Tziampazis,Nasser Jazdi,Michael Weyrich
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Preprint version. Accepted for publication at IEEE ETFA 2025

点击查看摘要

Abstract:As Federated Learning (FL) expands to larger and more distributed environments, consistency in training is challenged by network-induced delays, clock unsynchronicity, and variability in client updates. This combination of factors may contribute to misaligned contributions that undermine model reliability and convergence. Existing methods like staleness-aware aggregation and model versioning address lagging updates heuristically, yet lack mechanisms to quantify staleness, especially in latency-sensitive and cross-regional deployments. In light of these considerations, we introduce SyncFed, a time-aware FL framework that employs explicit synchronization and timestamping to establish a common temporal reference across the system. Staleness is quantified numerically based on exchanged timestamps under the Network Time Protocol (NTP), enabling the server to reason about the relative freshness of client updates and apply temporally informed weighting during aggregation. Our empirical evaluation on a geographically distributed testbed shows that, under SyncFed, the global model evolves within a stable temporal context, resulting in improved accuracy and information freshness compared to round-based baselines devoid of temporal semantics.
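
A sketch of timestamp-based aggregation in the spirit of SyncFed, assuming NTP-synchronized clocks; the exponential decay and its time constant are our illustrative choices, not the paper's exact weighting.

```python
import time
import numpy as np

def staleness_weighted_aggregate(updates, now=None, tau=5.0):
    """Temporally informed aggregation: each client update carries an
    NTP-style timestamp, and older updates get exponentially decayed weight.
    `updates` is a list of (model_delta: np.ndarray, timestamp: float)."""
    now = time.time() if now is None else now
    weights = np.array([np.exp(-(now - ts) / tau) for _, ts in updates])
    weights /= weights.sum()
    return sum(w * delta for w, (delta, _) in zip(weights, updates))
```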

[LG-23] Real-Time Network Traffic Forecasting with Missing Data: A Generative Model Approach

链接: https://arxiv.org/abs/2506.09647
作者: Lei Deng,Wenhan Xu,Jingwei Li,Danny H.K. Tsang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-time network traffic forecasting is crucial for network management and early resource allocation. Existing network traffic forecasting approaches operate under the assumption that the network traffic data is fully observed. However, in practical scenarios, the collected data are often incomplete due to various human and natural factors. In this paper, we propose a generative model approach for real-time network traffic forecasting with missing data. Firstly, we model the network traffic forecasting task as a tensor completion problem. Secondly, we incorporate a pre-trained generative model to achieve the low-rank structure commonly associated with tensor completion. The generative model effectively captures the intrinsic low-rank structure of network traffic data during pre-training and enables the mapping from a compact latent representation to the tensor space. Thirdly, rather than directly optimizing the high-dimensional tensor, we optimize its latent representation, which simplifies the optimization process and enables real-time forecasting. We also establish a theoretical recovery guarantee that quantifies the error bound of the proposed approach. Experiments on real-world datasets demonstrate that our approach achieves accurate network traffic forecasting within 100 ms, with a mean absolute error (MAE) below 0.002, as validated on the Abilene dataset.
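
A sketch of the latent-space optimization step, assuming a frozen pre-trained generator G that maps a compact code to the full traffic tensor; the `latent_dim` attribute is hypothetical. Only the observed entries drive the loss.

```python
import torch

def complete_via_latent(G, observed, mask, steps=200, lr=1e-2):
    """Instead of optimizing the high-dimensional traffic tensor, optimize a
    compact latent code z so that the frozen generator G(z) matches the
    observed entries; the missing entries come for free from G(z)."""
    z = torch.zeros(G.latent_dim, requires_grad=True)  # latent_dim: assumed attribute
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - observed)[mask] ** 2).mean()   # fit observed entries only
        loss.backward()
        opt.step()
    return G(z).detach()   # completed tensor, including missing entries
```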

[LG-24] In-Context Bias Propagation in LLM-Based Tabular Data Generation ICML2025

链接: https://arxiv.org/abs/2506.09630
作者: Pol G.Recasens,Alberto Gutierrez,Jordi Torres,Josep.Ll Berral,Anisa Halimi,Kieran Fraser
类目: Machine Learning (cs.LG)
*备注: Paper accepted at ICML 2025 workshop DIG-BUG

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance through augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.

[LG-25] GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras ICML2025

链接: https://arxiv.org/abs/2506.09625
作者: Ekaterina Filimoshina,Dmitry Shirokov
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN). These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. We propose a weight-sharing parametrization technique that takes into account the fundamental structures and operations of geometric algebras. Due to this technique, GLGENN architecture is parameter-light and has less tendency to overfitting than baseline equivariant models. GLGENN outperforms or matches competitors on several benchmarking equivariant tasks, including estimation of an equivariant function and a convex hull experiment, while using significantly fewer optimizable parameters.

[LG-26] SparseSSM: Efficient Selective Structured State Space Models Can Be Pruned in One-Shot

链接: https://arxiv.org/abs/2506.09613
作者: Kaiwen Tuo,Huan Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:State-space language models such as Mamba match Transformer quality while permitting linear complexity inference, yet still comprise billions of parameters that hinder deployment. Existing one-shot pruning methods are tailored to attention blocks and fail to account for the time-shared and discretized state-transition matrix at the heart of the selective state-space module (SSM). In this paper, we introduce SparseSSM, the first training-free pruning framework that extends the classic optimal brain surgeon (OBS) framework to state space architectures. Our layer-wise algorithm (i) derives an approximate second-order saliency score that aggregates Hessian-trace information across time steps, (ii) incorporates a component sensitivity analysis to guide feed-forward network (FFN) pruning, which also sheds light on where redundancy resides in the Mamba architecture, (iii) can be easily extended to semi-structured and structured sparsity. Empirically, we prune 50% of SSM weights without fine-tuning and observe no zero-shot accuracy loss, achieving the current state-of-the-art pruning algorithm for Mamba-based LLMs.
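
A generic one-shot pruning step for context: weights with the smallest saliency are zeroed without fine-tuning. SparseSSM's actual contribution is the OBS-style saliency that aggregates Hessian-trace information across time steps for SSM weights, which this sketch only gestures at.

```python
import torch

def one_shot_prune(weight, saliency, sparsity=0.5):
    """Zero the `sparsity` fraction of weights with the smallest saliency,
    e.g. an OBS-style score s_ij ~ w_ij^2 / [H^-1]_jj (here taken as given)."""
    k = int(sparsity * weight.numel())
    thresh = saliency.flatten().kthvalue(k).values
    mask = (saliency > thresh).to(weight.dtype)
    return weight * mask
```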

[LG-27] Accelerating Large-Scale Regularized High-Order Tensor Recovery

链接: https://arxiv.org/abs/2506.09594
作者: Wenjin Qin,Hailin Wang,Jingyao Hou,Jianjun Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Currently, existing tensor recovery methods fail to recognize the impact of tensor scale variations on their structural characteristics. Furthermore, existing studies face prohibitive computational costs when dealing with large-scale high-order tensor data. To alleviate these issues, assisted by the Krylov subspace iteration, block Lanczos bidiagonalization process, and random projection strategies, this article first devises two fast and accurate randomized algorithms for the low-rank tensor approximation (LRTA) problem. Theoretical bounds on the accuracy of the approximation error estimate are established. Next, we develop a novel generalized nonconvex modeling framework tailored to large-scale tensor recovery, in which a new regularization paradigm is exploited to achieve insightful prior representation for large-scale tensors. On the basis of the above, we further investigate new unified nonconvex models and efficient optimization algorithms, respectively, for several typical high-order tensor recovery tasks in unquantized and quantized situations. To render the proposed algorithms practical and efficient for large-scale tensor data, the proposed randomized LRTA schemes are integrated into their central and time-intensive computations. Finally, we conduct extensive experiments on various large-scale tensors, whose results demonstrate the practicability, effectiveness and superiority of the proposed method in comparison with some state-of-the-art approaches.
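
For intuition, here is the classic randomized range-finder that such accelerations build on, shown for a matrix for brevity; the paper's algorithms extend this style of sketching (with block Lanczos bidiagonalization and Krylov iterations) to high-order tensors.

```python
import numpy as np

def randomized_lowrank(A, rank, oversample=10):
    """Halko-et-al.-style randomized low-rank approximation: project A onto a
    random subspace, orthonormalize to get a range basis Q, then solve a
    small SVD in the projected space."""
    n = A.shape[1]
    Omega = np.random.randn(n, rank + oversample)   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                  # orthonormal range basis
    B = Q.T @ A                                     # small projected problem
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :rank], s[:rank], Vt[:rank]
```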

[LG-28] Beyond Overconfidence: Foundation Models Redefine Calibration in Deep Neural Networks

链接: https://arxiv.org/abs/2506.09593
作者: Achim Hekler,Lukas Kuhn,Florian Buettner
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable uncertainty calibration is essential for safely deploying deep neural networks in high-stakes applications. Deep neural networks are known to exhibit systematic overconfidence, especially under distribution shifts. Although foundation models such as ConvNeXt, EVA and BEiT have demonstrated significant improvements in predictive performance, their calibration properties remain underexplored. This paper presents a comprehensive investigation into the calibration behavior of foundation models, revealing insights that challenge established paradigms. Our empirical analysis shows that these models tend to be underconfident in in-distribution predictions, resulting in higher calibration errors, while demonstrating improved calibration under distribution shifts. Furthermore, we demonstrate that foundation models are highly responsive to post-hoc calibration techniques in the in-distribution setting, enabling practitioners to effectively mitigate underconfidence bias. However, these methods become progressively less reliable under severe distribution shifts and can occasionally produce counterproductive results. Our findings highlight the complex, non-monotonic effects of architectural and training innovations on calibration, challenging established narratives of continuous improvement.
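
A standard post-hoc temperature-scaling routine of the kind the paper finds effective in-distribution; note that correcting *under*-confidence means the fitted temperature can come out below 1. The optimizer choice here is ours.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=100, lr=0.01):
    """Fit a single temperature T on held-out (logits, labels) by minimizing
    NLL of logits / T; optimizing log T keeps T positive."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()  # T < 1 sharpens (fixes under-confidence)
```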

[LG-29] MOORL: A Framework for Integrating Offline-Online Reinforcement Learning

链接: https://arxiv.org/abs/2506.09574
作者: Gaurav Chaudhary,Wassim Uddin Mondal,Laxmidhar Behera
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sample efficiency and exploration remain critical challenges in Deep Reinforcement Learning (DRL), particularly in complex domains. Offline RL, which enables agents to learn optimal policies from static, pre-collected datasets, has emerged as a promising alternative. However, offline RL is constrained by issues such as out-of-distribution (OOD) actions that limit policy performance and generalization. To overcome these limitations, we propose Meta Offline-Online Reinforcement Learning (MOORL), a hybrid framework that unifies offline and online RL for efficient and scalable learning. While previous hybrid methods rely on extensive design components and added computational complexity to utilize offline data effectively, MOORL introduces a meta-policy that seamlessly adapts across offline and online trajectories. This enables the agent to leverage offline data for robust initialization while utilizing online interactions to drive efficient exploration. Our theoretical analysis demonstrates that the hybrid approach enhances exploration by effectively combining the complementary strengths of offline and online data. Furthermore, we demonstrate that MOORL learns a stable Q-function without added complexity. Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. With minimal computational overhead, MOORL achieves strong performance, underscoring its potential for practical applications in real-world scenarios.

[LG-30] TooBadRL: Trigger Optimization to Boost Effectiveness of Backdoor Attacks on Deep Reinforcement Learning

链接: https://arxiv.org/abs/2506.09562
作者: Songze Li,Mingxuan Zhang,Oubo Ma,Kang Wei,Shouling Ji
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has achieved remarkable success in a wide range of sequential decision-making domains, including robotics, healthcare, smart grids, and finance. Recent research demonstrates that attackers can efficiently exploit system vulnerabilities during the training phase to execute backdoor attacks, producing malicious actions when specific trigger patterns are present in the state observations. However, most existing backdoor attacks rely primarily on simplistic and heuristic trigger configurations, overlooking the potential efficacy of trigger optimization. To address this gap, we introduce TooBadRL (Trigger Optimization to Boost Effectiveness of Backdoor Attacks on DRL), the first framework to systematically optimize DRL backdoor triggers along three critical axes, i.e., temporal, spatial, and magnitude. Specifically, we first introduce a performance-aware adaptive freezing mechanism for injection timing. Then, we formulate dimension selection as a cooperative game, utilizing Shapley value analysis to identify the most influential state variable for the injection dimension. Furthermore, we propose a gradient-based adversarial procedure to optimize the injection magnitude under environment constraints. Evaluations on three mainstream DRL algorithms and nine benchmark tasks show that TooBadRL significantly improves attack success rates, while ensuring minimal degradation of normal task performance. These results highlight the previously underappreciated importance of principled trigger optimization in DRL backdoor attacks. The source code of TooBadRL can be found at this https URL.

[LG-31] STOAT: Spatial-Temporal Probabilistic Causal Inference Network

链接: https://arxiv.org/abs/2506.09544
作者: Yang Yang,Du Yin,Hao Xue,Flora Salim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spatial-temporal causal time series (STC-TS) involve region-specific temporal observations driven by causally relevant covariates and interconnected across geographic or network-based spaces. Existing methods often model spatial and temporal dynamics independently and overlook causality-driven probabilistic forecasting, limiting their predictive power. To address this, we propose STOAT (Spatial-Temporal Probabilistic Causal Inference Network), a novel framework for probabilistic forecasting in STC-TS. The proposed method extends a causal inference approach by incorporating a spatial relation matrix that encodes interregional dependencies (e.g. proximity or connectivity), enabling spatially informed causal effect estimation. The resulting latent series are processed by deep probabilistic models to estimate the parameters of the distributions, enabling calibrated uncertainty modeling. We further explore multiple output distributions (e.g., Gaussian, Student's $t$, Laplace) to capture region-specific variability. Experiments on COVID-19 data across six countries demonstrate that STOAT outperforms state-of-the-art probabilistic forecasting models (DeepAR, DeepVAR, Deep State Space Model, etc.) in key metrics, particularly in regions with strong spatial dependencies. By bridging causal inference and geospatial probabilistic forecasting, STOAT offers a generalizable framework for complex spatial-temporal tasks, such as epidemic management.

[LG-32] A Survey on the Role of Artificial Intelligence and Machine Learning in 6G-V2X Applications

链接: https://arxiv.org/abs/2506.09512
作者: Donglin Wang,Anjie Qiu,Qiuheng Zhou,Hans D. Schotten
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages, 1 figure

点击查看摘要

Abstract:The rapid advancement of Vehicle-to-Everything (V2X) communication is transforming Intelligent Transportation Systems (ITS), with 6G networks expected to provide ultra-reliable, low-latency, and high-capacity connectivity for Connected and Autonomous Vehicles (CAVs). Artificial Intelligence (AI) and Machine Learning (ML) have emerged as key enablers in optimizing V2X communication by enhancing network management, predictive analytics, security, and cooperative driving due to their outstanding performance across various domains, such as natural language processing and computer vision. This survey comprehensively reviews recent advances in AI and ML models applied to 6G-V2X communication. It focuses on state-of-the-art techniques, including Deep Learning (DL), Reinforcement Learning (RL), Generative Learning (GL), and Federated Learning (FL), with particular emphasis on developments from the past two years. Notably, AI, especially GL, has shown remarkable progress and emerging potential in enhancing the performance, adaptability, and intelligence of 6G-V2X systems. Despite these advances, a systematic summary of recent research efforts in this area remains lacking, which this survey aims to address. We analyze their roles in 6G-V2X applications, such as intelligent resource allocation, beamforming, intelligent traffic management, and security management. Furthermore, we explore the technical challenges, including computational complexity, data privacy, and real-time decision-making constraints, while identifying future research directions for AI-driven 6G-V2X development. This study aims to provide valuable insights for researchers, engineers, and policymakers working towards realizing intelligent, AI-powered V2X ecosystems in 6G communication.

[LG-33] On a few pitfalls in KL divergence gradient estimation for RL

链接: https://arxiv.org/abs/2506.09477
作者: Yunhao Tang,Rémi Munos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We point out a few pitfalls in implementing gradient estimation for KL divergence in RL training for LLMs, as seen in a number of open source projects and papers. The first major pitfall is to differentiate through the KL estimate as loss functions to minimize KL divergence. We show that such implementations are generally incorrect and do not produce the desired KL gradient. Secondly, we show that some implementations do not account for the sequential nature of the estimation problem and produce a partial gradient at best. We demonstrate the impact of such issues with illustrative tabular and LLM experiments, and show the correct way to implement the KL gradient.
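
A minimal sketch of the first pitfall and one standard fix, assuming per-sample log-probabilities from the current policy q and a reference policy p, with samples drawn from q. The surrogate below recovers the true gradient E_q[(log q - log p) * grad log q] via a score-function trick; the sequence-level subtleties the paper raises need additional care.

```python
import torch

def naive_kl_loss(logq, logp):
    """Pitfall: differentiating the KL *value* estimate. Since
    E_q[grad log q] = 0, the expected gradient of this loss vanishes,
    so minimizing it does not actually reduce KL(q || p)."""
    return (logq - logp).mean()

def surrogate_kl_loss(logq, logp):
    """Score-function surrogate: detaching the per-sample integrand makes the
    gradient of this loss equal E_q[(log q - log p) * grad log q], which is
    the true gradient of KL(q || p) with respect to q's parameters."""
    return ((logq - logp).detach() * logq).mean()
```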

[LG-34] NDCG-Consistent Softmax Approximation with Accelerated Convergence

链接: https://arxiv.org/abs/2506.09454
作者: Yuanhao Pu,Defu Lian,Xiaolong Chen,Xu Huang,Jin Chen,Enhong Chen
类目: Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM Loss often suffers from significant computational overhead and scalability limitations when applied to large-scale object spaces. To address this challenge, we propose novel loss formulations that align directly with ranking metrics: the Ranking-Generalizable squared (RG$^2$) Loss and the Ranking-Generalizable interactive (RG$^\times$) Loss, both derived through Taylor expansions of the SM Loss. Notably, RG$^2$ reveals the intrinsic mechanisms underlying weighted squared losses (WSL) in ranking methods and uncovers fundamental connections between sampling-based and non-sampling-based loss paradigms. Furthermore, we integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method, providing both generalization guarantees and convergence rate analyses. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance relative to SM Loss, while significantly accelerating convergence. This framework offers the similarity learning community both theoretical insights and practically efficient tools, with methodologies applicable to a broad range of tasks where balancing ranking quality and computational efficiency is essential.

[LG-35] Safe Screening Rules for Group SLOPE ECML PKDD2025

链接: https://arxiv.org/abs/2506.09451
作者: Runxue Bao,Quanchao Lu,Yanfu Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ECML PKDD 2025

点击查看摘要

Abstract:Variable selection is a challenging problem in high-dimensional sparse learning, especially when group structures exist. Group SLOPE performs well for the adaptive selection of groups of predictors. However, the block non-separable group effects in Group SLOPE make existing methods either invalid or inefficient. Consequently, Group SLOPE tends to incur significant computational costs and memory usage in practical high-dimensional scenarios. To overcome this issue, we introduce a safe screening rule tailored for the Group SLOPE model, which efficiently identifies inactive groups with zero coefficients by addressing the block non-separable group effects. By excluding these inactive groups during training, we achieve considerable gains in computational efficiency and memory usage. Importantly, the proposed screening rule can be seamlessly integrated into existing solvers for both batch and stochastic algorithms. Theoretically, we establish that our screening rule can be safely employed with existing optimization algorithms, ensuring the same results as the original approaches. Experimental results confirm that our method effectively detects inactive feature groups and significantly boosts computational efficiency without compromising accuracy.

[LG-36] Generalization Error Analysis for Attack-Free and Byzantine-Resilient Decentralized Learning with Data Heterogeneity

链接: https://arxiv.org/abs/2506.09438
作者: Haoxiang Ye,Tao Sun,Qing Ling
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Decentralized learning, which facilitates joint model training across geographically scattered agents, has gained significant attention in the field of signal and information processing in recent years. While the optimization errors of decentralized learning algorithms have been extensively studied, their generalization errors remain relatively under-explored. As the generalization errors reflect the scalability of trained models on unseen data and are crucial in determining the performance of trained models in real-world applications, understanding the generalization errors of decentralized learning is of paramount importance. In this paper, we present fine-grained generalization error analysis for both attack-free and Byzantine-resilient decentralized learning with heterogeneous data as well as under mild assumptions, in contrast to prior studies that consider homogeneous data and/or rely on a stringent bounded stochastic gradient assumption. Our results shed light on the impact of data heterogeneity, model initialization and stochastic gradient noise – factors that have not been closely investigated before – on the generalization error of decentralized learning. We also reveal that Byzantine attacks performed by malicious agents largely affect the generalization error, and their negative impact is inherently linked to the data heterogeneity while remaining independent of the sample size. Numerical experiments on both convex and non-convex tasks are conducted to validate our theoretical findings.

[LG-37] Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training

链接: https://arxiv.org/abs/2506.09433
作者: Shurui Gui,Shuiwang Ji
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated remarkable capabilities in language modeling, recent studies reveal that they often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training. Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT). By decomposing a biased prediction into two unbiased steps, known as *event estimation* and *event intervention*, we reduce LLMs’ pre-training biases without incurring additional fine-tuning biases, thus enhancing the model’s generalization ability. Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks using only 100 ID fine-tuning samples, demonstrating the effectiveness and sample efficiency of CAPT.

[LG-38] Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation

链接: https://arxiv.org/abs/2506.09422
作者: Ye Niu,Sanping Zhou,Yizhe Li,Ye Den,Le Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many complex scenarios, robotic manipulation relies on generative models to estimate the distribution of multiple successful actions. As the diffusion model has better training robustness than other generative models, it performs well in imitation learning through successful robot demonstrations. However, the diffusion-based policy methods typically require significant time to iteratively denoise robot actions, which hinders real-time responses in robotic manipulation. Moreover, existing diffusion policies model a time-varying action denoising process, whose temporal complexity increases the difficulty of model training and leads to suboptimal action accuracy. To generate robot actions efficiently and accurately, we present the Time-Unified Diffusion Policy (TUDP), which utilizes action recognition capabilities to build a time-unified denoising process. On the one hand, we build a time-unified velocity field in action space with additional action discrimination information. By unifying all timesteps of action denoising, our velocity field reduces the difficulty of policy learning and speeds up action generation. On the other hand, we propose an action-wise training method, which introduces an action discrimination branch to supply additional action discrimination information. Through action-wise training, the TUDP implicitly learns the ability to discern successful actions, leading to better denoising accuracy. Our method achieves state-of-the-art performance on RLBench with the highest success rate of 82.6% on a multi-view setup and 83.8% on a single-view setup. In particular, when using fewer denoising iterations, TUDP achieves a more significant improvement in success rate. Additionally, TUDP can produce accurate actions for a wide range of real-world tasks.

[LG-39] Scoop-and-Toss: Dynamic Object Collection for Quadrupedal Systems

链接: https://arxiv.org/abs/2506.09406
作者: Minji Kang,Chanwoo Baek,Yoonsang Lee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quadruped robots have made significant advances in locomotion, extending their capabilities from controlled environments to real-world applications. Beyond movement, recent work has explored loco-manipulation using the legs to perform tasks such as pressing buttons or opening doors. While these efforts demonstrate the feasibility of leg-based manipulation, most have focused on relatively static tasks. In this work, we propose a framework that enables quadruped robots to collect objects without additional actuators by leveraging the agility of their legs. By attaching a simple scoop-like add-on to one leg, the robot can scoop objects and toss them into a collection tray mounted on its back. Our method employs a hierarchical policy structure comprising two expert policies-one for scooping and tossing, and one for approaching object positions-and a meta-policy that dynamically switches between them. The expert policies are trained separately, followed by meta-policy training for coordinated multi-object collection. This approach demonstrates how quadruped legs can be effectively utilized for dynamic object manipulation, expanding their role beyond locomotion.

[LG-40] Synergizing Reinforcement Learning and Genetic Algorithms for Neural Combinatorial Optimization

链接: https://arxiv.org/abs/2506.09404
作者: Shengda Gu,Kai Li,Junliang Xing,Yifan Zhang,Jian Cheng
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Combinatorial optimization problems are notoriously challenging due to their discrete structure and exponentially large solution space. Recent advances in deep reinforcement learning (DRL) have enabled learning heuristics directly from data. However, DRL methods often suffer from limited exploration and susceptibility to local optima. On the other hand, evolutionary algorithms such as Genetic Algorithms (GAs) exhibit strong global exploration capabilities but are typically sample inefficient and computationally intensive. In this work, we propose the Evolutionary Augmentation Mechanism (EAM), a general and plug-and-play framework that synergizes the learning efficiency of DRL with the global search power of GAs. EAM operates by generating solutions from a learned policy and refining them through domain-specific genetic operations such as crossover and mutation. These evolved solutions are then selectively reinjected into the policy training loop, thereby enhancing exploration and accelerating convergence. We further provide a theoretical analysis that establishes an upper bound on the KL divergence between the evolved solution distribution and the policy distribution, ensuring stable and effective policy updates. EAM is model-agnostic and can be seamlessly integrated with state-of-the-art DRL solvers such as the Attention Model, POMO, and SymNCO. Extensive results on benchmark problems (e.g., TSP, CVRP, PCTSP, and OP) demonstrate that EAM significantly improves both solution quality and training efficiency over competitive baselines.
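
As a quick illustration of the EAM loop described above, here is a minimal Python sketch on a toy TSP instance: sample tours from a policy, evolve them with crossover and mutation, and selectively reinject the improved ones. The random-permutation "policy" stub and the specific operators (order crossover, swap mutation) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((20, 2))                      # 20 random cities in the unit square

def tour_length(tour):
    pts = cities[tour]
    return np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1).sum()

def sample_from_policy(batch_size):
    # Stand-in for sampling tours from a learned DRL policy (e.g. POMO).
    return [rng.permutation(len(cities)) for _ in range(batch_size)]

def order_crossover(p1, p2):
    # Classic OX: copy a slice from p1, fill the remaining cities in p2's order.
    i, j = sorted(rng.choice(len(p1), size=2, replace=False))
    child = -np.ones(len(p1), dtype=int)
    child[i:j] = p1[i:j]
    child[child < 0] = [c for c in p2 if c not in p1[i:j]]
    return child

def swap_mutation(tour, p=0.2):
    tour = tour.copy()
    if rng.random() < p:
        a, b = rng.choice(len(tour), size=2, replace=False)
        tour[a], tour[b] = tour[b], tour[a]
    return tour

population = sample_from_policy(32)
children = []
for _ in range(len(population)):
    i, j = rng.choice(len(population), size=2, replace=False)
    children.append(swap_mutation(order_crossover(population[i], population[j])))

# Selective reinjection: an evolved tour replaces its policy sample only if better.
augmented = [min(p, c, key=tour_length) for p, c in zip(population, children)]
print("mean length before:", np.mean([tour_length(t) for t in population]))
print("mean length after: ", np.mean([tour_length(t) for t in augmented]))
# `augmented` would then be fed back into the policy-gradient training loop.
```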

[LG-41] Efficient Prediction of SO(3)-Equivariant Hamiltonian Matrices via SO(2) Local Frames

链接: https://arxiv.org/abs/2506.09398
作者: Haiyang Yu,Yuchao Lin,Xuan Zhang,Xiaofeng Qian,Shuiwang Ji
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Code available at: this https URL

点击查看摘要

Abstract:We consider the task of predicting Hamiltonian matrices to accelerate electronic structure calculations, which plays an important role in physics, chemistry, and materials science. Motivated by the inherent relationship between the off-diagonal blocks of the Hamiltonian matrix and the SO(2) local frame, we propose a novel and efficient network, called QHNetV2, that achieves global SO(3) equivariance without the costly SO(3) Clebsch-Gordan tensor products. This is achieved by introducing a set of new efficient and powerful SO(2)-equivariant operations and performing all off-diagonal feature updates and message passing within SO(2) local frames, thereby eliminating the need of SO(3) tensor products. Moreover, a continuous SO(2) tensor product is performed within the SO(2) local frame at each node to fuse node features, mimicking the symmetric contraction operation. Extensive experiments on the large QH9 and MD17 datasets demonstrate that our model achieves superior performance across a wide range of molecular structures and trajectories, highlighting its strong generalization capability. The proposed SO(2) operations on SO(2) local frames offer a promising direction for scalable and symmetry-aware learning of electronic structures. Our code will be released as part of the AIRS library this https URL.

[LG-42] Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation ICML2025

链接: https://arxiv.org/abs/2506.09376
作者: Bowen Zheng,Tianming Yang
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Diffusion distillation is a widely used technique to reduce the sampling cost of diffusion models, yet it often requires extensive training, and the student performance tends to be degraded. Recent studies show that incorporating a GAN objective may alleviate these issues, yet the underlying mechanism remains unclear. In this work, we first identify a key limitation of distillation: mismatched step sizes and parameter numbers between the teacher and the student model lead them to converge to different local minima, rendering direct imitation suboptimal. We further demonstrate that a standalone GAN objective, without relying on a distillation loss, overcomes this limitation and is sufficient to convert diffusion models into efficient one-step generators. Based on this finding, we propose that diffusion training may be viewed as a form of generative pre-training, equipping models with capabilities that can be unlocked through lightweight GAN fine-tuning. Supporting this view, we create a one-step generation model by fine-tuning a pre-trained model with 85% of parameters frozen, achieving strong performance with only 0.2M images and near-SOTA results with 5M images. We further present a frequency-domain analysis that may explain the one-step generative capability gained in diffusion training. Overall, our work provides a new perspective for diffusion training, highlighting its role as a powerful generative pre-training process, which can be the basis for building efficient one-step generation models.

[LG-43] SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending

链接: https://arxiv.org/abs/2506.09366
作者: Yuxuan Kuang,Haoran Geng,Amine Elhafsi,Tan-Dzung Do,Pieter Abbeel,Jitendra Malik,Marco Pavone,Yue Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: this https URL.

[LG-44] Adversarial Surrogate Risk Bounds for Binary Classification

链接: https://arxiv.org/abs/2506.09348
作者: Natalie S. Frank
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 37 pages, 2 figures

点击查看摘要

Abstract:A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training is one of the most popular techniques for training robust classifiers, which involves minimizing an adversarial surrogate risk. Recent work characterized when a minimizing sequence of an adversarial surrogate risk is also a minimizing sequence of the adversarial classification risk for binary classification – a property known as adversarial consistency. However, these results do not address the rate at which the adversarial classification risk converges to its optimal value for such a sequence of functions that minimize the adversarial surrogate. This paper provides surrogate risk bounds that quantify that convergence rate. Additionally, we derive distribution-dependent surrogate risk bounds in the standard (non-adversarial) learning setting, that may be of independent interest.

[LG-45] On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

链接: https://arxiv.org/abs/2506.09316
作者: Yeonju Ro,Zhenyu Zhang,Souvik Kundu,Zhangyang Wang,Aditya Akella
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states-one for preserving historical context and one for tracking recency-thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3x faster inference than Llama2-7B and 3.0x faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions. Codes are available at this https URL.
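
A toy numerical sketch of what a dual-state linear-attention recurrence could look like, based on my reading of the abstract: one unweighted state accumulates all history while an exponentially decayed state tracks recency, and their outputs are mixed. The feature map, decay rate, and mixing weight below are invented placeholders, not the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 16, 8                                  # sequence length, head dimension
Q, K, V = rng.standard_normal((3, T, d))
phi = lambda x: np.maximum(x, 0) + 1e-6       # positive feature map for linear attention

decay, mix = 0.9, 0.5                         # recency decay and mixing weight (assumed)
S_hist = np.zeros((d, d)); z_hist = np.zeros(d)
S_rec = np.zeros((d, d)); z_rec = np.zeros(d)
outputs = []
for t in range(T):
    q, k, v = phi(Q[t]), phi(K[t]), V[t]
    S_hist += np.outer(k, v); z_hist += k     # state 1: unweighted full history
    S_rec = decay * S_rec + np.outer(k, v)    # state 2: exponentially decayed recency
    z_rec = decay * z_rec + k
    o_hist = q @ S_hist / (q @ z_hist)
    o_rec = q @ S_rec / (q @ z_rec)
    outputs.append(mix * o_hist + (1 - mix) * o_rec)
print(np.stack(outputs).shape)                # (16, 8): one output per step
```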

[LG-46] What is the Cost of Differential Privacy for Deep Learning-Based Trajectory Generation?

链接: https://arxiv.org/abs/2506.09312
作者: Erik Buchholz,Natasha Fernandes,David D. Nguyen,Alsharif Abuadbba,Surya Nepal,Salil S. Kanhere
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While location trajectories offer valuable insights, they also reveal sensitive personal information. Differential Privacy (DP) offers formal protection, but achieving a favourable utility-privacy trade-off remains challenging. Recent works explore deep learning-based generative models to produce synthetic trajectories. However, current models lack formal privacy guarantees and rely on conditional information derived from real data during generation. This work investigates the utility cost of enforcing DP in such models, addressing three research questions across two datasets and eleven utility metrics. (1) We evaluate how DP-SGD, the standard DP training method for deep learning, affects the utility of state-of-the-art generative models. (2) Since DP-SGD is limited to unconditional models, we propose a novel DP mechanism for conditional generation that provides formal guarantees and assess its impact on utility. (3) We analyse how model types - Diffusion, VAE, and GAN - affect the utility-privacy trade-off. Our results show that DP-SGD significantly impacts performance, although some utility remains if the dataset is sufficiently large. The proposed DP mechanism improves training stability, particularly when combined with DP-SGD, for unstable models such as GANs and on smaller datasets. Diffusion models yield the best utility without guarantees, but with DP-SGD, GANs perform best, indicating that the best non-private model is not necessarily optimal when targeting formal guarantees. In conclusion, DP trajectory generation remains a challenging task, and formal guarantees are currently only feasible with large datasets and in constrained use cases.
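
For context, the DP-SGD mechanism the paper evaluates boils down to per-sample gradient clipping plus calibrated Gaussian noise. Below is a generic sketch of one such step in PyTorch on a stand-in linear model; the clip norm C and noise multiplier sigma are illustrative values, and none of this reflects the paper's trajectory-generation architectures.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
C, sigma, lr = 1.0, 1.1, 0.1          # clip norm, noise multiplier, learning rate

x, y = torch.randn(32, 10), torch.randn(32, 1)
summed = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(x)):                # per-sample gradients
    model.zero_grad()
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
    norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    scale = min(1.0, C / (norm + 1e-12))       # clip each sample's gradient to norm C
    for s, p in zip(summed, model.parameters()):
        s += p.grad * scale

with torch.no_grad():                  # add calibrated Gaussian noise, then average
    for s, p in zip(summed, model.parameters()):
        p -= lr * (s + sigma * C * torch.randn_like(s)) / len(x)
```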

[LG-47] ScalableHD: Scalable and High-Throughput Hyperdimensional Computing Inference on Multi-Core CPUs

链接: https://arxiv.org/abs/2506.09282
作者: Dhruv Parikh,Viktor Prasanna
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: IC3

点击查看摘要

Abstract:Hyperdimensional Computing (HDC) is a brain-inspired computing paradigm that represents and manipulates information using high-dimensional vectors, called hypervectors (HV). Traditional HDC methods, while robust to noise and inherently parallel, rely on single-pass, non-parametric training and often suffer from low accuracy. To address this, recent approaches adopt iterative training of base and class HVs, typically accelerated on GPUs. Inference, however, remains lightweight and well-suited for real-time execution. Yet, efficient HDC inference has been studied almost exclusively on specialized hardware such as FPGAs and GPUs, with limited attention to general-purpose multi-core CPUs. To address this gap, we propose ScalableHD for scalable and high-throughput HDC inference on multi-core CPUs. ScalableHD employs a two-stage pipelined execution model, where each stage is parallelized across cores and processes chunks of base and class HVs. Intermediate results are streamed between stages using a producer-consumer mechanism, enabling on-the-fly consumption and improving cache locality. To maximize performance, ScalableHD integrates memory tiling and NUMA-aware worker-to-core binding. Further, it features two execution variants tailored for small and large batch sizes, each designed to exploit compute parallelism based on workload characteristics while mitigating the memory-bound compute pattern that limits HDC inference performance on modern multi-core CPUs. ScalableHD achieves up to 10x speedup in throughput (samples per second) over state-of-the-art baselines such as TorchHD, across a diverse set of tasks ranging from human activity recognition to image classification, while preserving task accuracy. Furthermore, ScalableHD exhibits robust scalability: increasing the number of cores yields near-proportional throughput improvements.
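
The inference kernel that ScalableHD parallelizes can be pictured as two matrix stages, sketched below in NumPy: encode a batch into hypervectors with fixed random base HVs, then score against class HVs by similarity. The encoding choice and random class HVs are stand-ins; the pipelining, tiling, and NUMA-aware binding from the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
D, F, C = 10_000, 64, 10                     # HV dimension, input features, classes
base = rng.standard_normal((F, D))           # fixed random base HVs (one per feature)
class_hvs = rng.standard_normal((C, D))      # trained class HVs (random stand-ins here)

def classify(batch):
    enc = np.sign(batch @ base)                # stage 1: encode batch into bipolar HVs
    return (enc @ class_hvs.T).argmax(axis=1)  # stage 2: similarity against class HVs

print(classify(rng.standard_normal((8, F))))   # one predicted class per sample
```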

[LG-48] TTrace: Lightweight Error Checking and Diagnosis for Distributed Training

链接: https://arxiv.org/abs/2506.09280
作者: Haitian Jiang,Shaowei Zhu,Zhen Zhang,Zhenyu Song,Xinwei Fu,Zhen Jia,Yida Wang,Jinyang Li
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which do not produce an explicit error signal but lead to incorrect training outcomes. Effectively detecting and localizing such silent bugs in distributed training is challenging. Common debugging practice using metrics like training loss or gradient norm curves can be inefficient and ineffective. Additionally, obtaining intermediate tensor values and determining whether they are correct during silent bug localization is difficult, particularly in the context of low-precision training. To address those challenges, we design and implement TTrace, the first system capable of detecting and localizing silent bugs in distributed training. TTrace collects intermediate tensors from distributed training in a fine-grained manner and compares them against those from a trusted single-device reference implementation. To properly compare the floating-point values in the tensors, we propose novel mathematical analysis that provides a guideline for setting thresholds, enabling TTrace to distinguish bug-induced errors from floating-point round-off errors. Experimental results demonstrate that TTrace effectively detects 11 existing bugs and 3 new bugs in the widely used Megatron-LM framework, while requiring fewer than 10 lines of code change. TTrace is effective in various training recipes, including low-precision recipes involving BF16 and FP8.
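
The core check in this style of debugging is comparing intermediate tensors against a trusted reference with a precision-aware tolerance, so that bug-induced errors are separated from round-off. The sketch below illustrates the idea; the per-dtype unit round-offs and the sqrt-depth error budget are a textbook accumulation model, not the paper's derived thresholds.

```python
import numpy as np

EPS = {"fp32": 2**-24, "bf16": 2**-8, "fp8e4m3": 2**-3}   # unit round-off per dtype

def check_tensor(name, test, ref, dtype="bf16", depth=1024):
    # Error budget grows with reduction depth (e.g. accumulation length),
    # a standard model for floating-point round-off growth.
    tol = EPS[dtype] * np.sqrt(depth) * np.abs(ref).mean()
    err = np.abs(test - ref).max()
    print(f"{name}: max|diff|={err:.3e} tol={tol:.3e} ->",
          "OK" if err <= tol else "SUSPECT")

rng = np.random.default_rng(0)
ref = rng.standard_normal(512).astype(np.float32)
noisy = ref + 1e-3 * rng.standard_normal(512).astype(np.float32)  # round-off-scale drift
buggy = ref.copy(); buggy[7] += 1.0                               # a silent bug
check_tensor("layer1.out", noisy, ref)     # -> OK
check_tensor("layer2.out", buggy, ref)     # -> SUSPECT
```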

[LG-49] A Topic Modeling Analysis of Stigma Dimensions Social and Related Behavioral Circumstances in Clinical Notes Among Patients with HIV

链接: https://arxiv.org/abs/2506.09279
作者: Ziyi Chen,Yiyang Liu,Mattia Prosperi,Krishna Vaddiparti,Robert L Cook,Jiang Bian,Yi Guo,Yonghui Wu
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Objective: To characterize stigma dimensions, social, and related behavioral circumstances in people living with HIV (PLWHs) seeking care, using natural language processing methods applied to a large collection of electronic health record (EHR) clinical notes from a large integrated health system in the southeast United States. Methods: We identified a cohort of 9,140 PLWHs from the UF Health IDR and performed topic modeling analysis using Latent Dirichlet Allocation (LDA) to uncover stigma dimensions, social, and related behavioral circumstances. Domain experts created a seed list of HIV-related stigma keywords, then applied a snowball strategy to iteratively review notes for additional terms until saturation was reached. To identify more target topics, we tested three keyword-based filtering strategies. Domain experts manually reviewed the detected topics using the prevalent terms and key discussion topics. Word frequency analysis was used to highlight the prevalent terms associated with each topic. In addition, we conducted topic variation analysis among subgroups to examine differences across age and sex-specific demographics. Results and Conclusion: Topic modeling on sentences containing at least one keyword uncovered a wide range of topic themes associated with HIV-related stigma, social, and related behavioral circumstances, including “Mental Health Concern and Stigma”, “Social Support and Engagement”, “Limited Healthcare Access and Severe Illness”, “Treatment Refusal and Isolation” and so on. Topic variation analysis across age subgroups revealed differences. Extracting and understanding the HIV-related stigma dimensions, social, and related behavioral circumstances from EHR clinical notes enables scalable, time-efficient assessment, overcoming the limitations of traditional questionnaires and improving patient outcomes.
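
A compact sketch of the keyword-filtered LDA pipeline using scikit-learn; the seed keywords and toy sentences are invented placeholders standing in for the EHR note sentences and the expert-curated seed list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

seed_keywords = {"stigma", "disclosure", "isolation", "support"}   # assumed seeds
sentences = [
    "patient reports social isolation and fear of disclosure",
    "strong family support and engagement in care",
    "declined treatment due to stigma concerns",
    "transportation barriers limit healthcare access",
]
# Keyword-based filtering: keep sentences containing at least one seed term.
kept = [s for s in sentences if seed_keywords & set(s.split())]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(kept)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):       # top words per topic
    print(f"topic {k}:", [terms[i] for i in comp.argsort()[-4:][::-1]])
```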

[LG-50] G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration ICML2025

链接: https://arxiv.org/abs/2506.09272
作者: Samuel Holt,Max Ruiz Luyten,Antonin Berthon,Mihaela van der Schaar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 42nd International Conference on Machine Learning (ICML 2025). 9 pages, 3 figures

点击查看摘要

Abstract:Constructing robust simulators is essential for asking “what if?” questions and guiding policy in critical domains like healthcare and logistics. However, existing methods often struggle, either failing to generalize beyond historical data or, when using Large Language Models (LLMs), suffering from inaccuracies and poor empirical alignment. We introduce G-Sim, a hybrid framework that automates simulator construction by synergizing LLM-driven structural design with rigorous empirical calibration. G-Sim employs an LLM in an iterative loop to propose and refine a simulator’s core components and causal relationships, guided by domain knowledge. This structure is then grounded in reality by estimating its parameters using flexible calibration techniques. Specifically, G-Sim can leverage methods that are both likelihood-free and gradient-free with respect to the simulator, such as gradient-free optimization for direct parameter estimation or simulation-based inference for obtaining a posterior distribution over parameters. This allows it to handle non-differentiable and stochastic simulators. By integrating domain priors with empirical evidence, G-Sim produces reliable, causally-informed simulators, mitigating data-inefficiency and enabling robust system-level interventions for complex decision-making.
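
The gradient-free calibration step can be illustrated in isolation: fit a stochastic, non-differentiable simulator's parameters to an observed summary statistic with differential evolution. The single-server queue simulator and squared-error loss below are invented stand-ins for the paper's domain simulators.

```python
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(7)
observed_mean_time = 4.2                         # empirical target statistic (toy)

def simulate(params, n=2000):
    arrival_rate, service_rate = params
    arrivals = rng.exponential(1 / arrival_rate, n).cumsum()
    service = rng.exponential(1 / service_rate, n)
    start = np.zeros(n); start[0] = arrivals[0]
    for i in range(1, n):                        # single-server FIFO queue
        start[i] = max(arrivals[i], start[i - 1] + service[i - 1])
    return (start - arrivals + service).mean()   # mean time in system

def loss(params):
    # Stochastic, non-differentiable objective: no gradients required.
    return (simulate(params) - observed_mean_time) ** 2

result = differential_evolution(loss, bounds=[(0.1, 2.0), (0.1, 2.0)],
                                seed=0, maxiter=30, tol=1e-3)
print("calibrated (arrival, service) rates:", np.round(result.x, 2))
```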

[LG-51] Uncertainty Prioritized Experience Replay

链接: https://arxiv.org/abs/2506.09270
作者: Rodrigo Carrasco-Davis,Sebastian Lee,Claudia Clopath,Will Dabney
类目: Machine Learning (cs.LG)
*备注: Accepted at Reinforcement Learning Conference

点击查看摘要

Abstract:Prioritized experience replay, which improves sample efficiency by selecting relevant transitions to update parameter estimates, is a crucial component of contemporary value-based deep reinforcement learning models. Typically, transitions are prioritized based on their temporal difference error. However, this approach is prone to favoring noisy transitions, even when the value estimation closely approximates the target mean. This phenomenon resembles the noisy TV problem postulated in the exploration literature, in which exploration-guided agents get stuck by mistaking noise for novelty. To mitigate the disruptive effects of noise in value estimation, we propose using epistemic uncertainty estimation to guide the prioritization of transitions from the replay buffer. Epistemic uncertainty quantifies the uncertainty that can be reduced by learning, hence reducing transitions sampled from the buffer generated by unpredictable random processes. We first illustrate the benefits of epistemic uncertainty prioritized replay in two tabular toy models: a simple multi-arm bandit task, and a noisy gridworld. Subsequently, we evaluate our prioritization scheme on the Atari suite, outperforming quantile regression deep Q-learning benchmarks; thus forging a path for the use of uncertainty prioritized replay in reinforcement learning agents.
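
A toy sketch of the prioritization idea: draw replay samples with probability proportional to an epistemic-uncertainty proxy, here the disagreement of an ensemble of random, stand-in value estimates, with the usual importance-sampling correction from prioritized replay. The exponents alpha and beta are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_transitions, d, n_ensemble = 100, 4, 5
states = rng.standard_normal((n_transitions, d))
ensemble = rng.standard_normal((n_ensemble, d))   # stand-in ensemble value functions

values = states @ ensemble.T                      # (transitions, ensemble) estimates
epistemic = values.std(axis=1)                    # ensemble disagreement as proxy

alpha, beta = 0.7, 0.5                            # prioritization / correction exponents
probs = epistemic**alpha / (epistemic**alpha).sum()
batch = rng.choice(n_transitions, size=16, p=probs, replace=False)

# Importance-sampling weights correct the sampling bias, as in standard PER.
weights = (n_transitions * probs[batch]) ** (-beta)
weights /= weights.max()
print(batch[:5], np.round(weights[:5], 2))
```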

[LG-52] CFMI: Flow Matching for Missing Data Imputation

链接: https://arxiv.org/abs/2506.09258
作者: Vaidotas Simkus,Michael U. Gutmann
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce conditional flow matching for imputation (CFMI), a new general-purpose method to impute missing data. The method combines continuous normalising flows, flow-matching, and shared conditional modelling to deal with intractabilities of traditional multiple imputation. Our comparison with nine classical and state-of-the-art imputation methods on 24 small to moderate-dimensional tabular data sets shows that CFMI matches or outperforms both traditional and modern techniques across a wide range of metrics. Applying the method to zero-shot imputation of time-series data, we find that it matches the accuracy of a related diffusion-based method while outperforming it in terms of computational efficiency. Overall, CFMI performs at least as well as traditional methods on lower-dimensional data while remaining scalable to high-dimensional settings, matching or exceeding the performance of other deep learning-based approaches, making it a go-to imputation method for a wide range of data types and dimensionalities.
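
One way to picture a conditional flow-matching training step for imputation, under my reading of the abstract: learn a velocity field that transports noise to the missing coordinates, conditioned on the observed entries and the missingness mask. The network size, linear interpolation path, and loss masking are illustrative choices, not the paper's exact design.

```python
import torch

d = 8
net = torch.nn.Sequential(torch.nn.Linear(3 * d + 1, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, d))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    x1 = torch.randn(32, d)                      # complete data (toy: Gaussian)
    mask = (torch.rand(32, d) < 0.7).float()     # 1 = observed, 0 = missing
    x0 = torch.randn(32, d)                      # noise source for the missing part
    t = torch.rand(32, 1)
    xt = (1 - t) * x0 + t * x1                   # linear interpolation path
    target = x1 - x0                             # flow-matching velocity target
    inp = torch.cat([xt, x1 * mask, mask, t], dim=1)   # condition on observed values
    v = net(inp)
    loss = (((v - target) * (1 - mask)) ** 2).mean()   # supervise missing coords only
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```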

[LG-53] Agent-based Condition Monitoring Assistance with Multimodal Industrial Database Retrieval Augmented Generation

链接: https://arxiv.org/abs/2506.09247
作者: Karl Löwenmark,Daniel Strömbergsson,Chang Liu,Marcus Liwicki,Fredrik Sandin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Condition monitoring (CM) plays a crucial role in ensuring reliability and efficiency in the process industry. Although computerised maintenance systems effectively detect and classify faults, tasks like fault severity estimation and maintenance decisions still largely depend on human expert analysis. The analysis and decision making automatically performed by current systems typically exhibit considerable uncertainty and high false alarm rates, leading to increased workload and reduced efficiency. This work integrates large language model (LLM)-based reasoning agents with CM workflows to address analyst and industry needs, namely reducing false alarms, enhancing fault severity estimation, improving decision support, and offering explainable interfaces. We propose MindRAG, a modular framework combining multimodal retrieval-augmented generation (RAG) with novel vector store structures designed specifically for CM data. The framework leverages existing annotations and maintenance work orders as surrogates for labels in a supervised learning protocol, addressing the common challenge of training predictive models on unlabelled and noisy real-world datasets. The primary contributions include: (1) an approach for structuring industry CM data into a semi-structured multimodal vector store compatible with LLM-driven workflows; (2) developing multimodal RAG techniques tailored for CM data; (3) developing practical reasoning agents capable of addressing real-world CM queries; and (4) presenting an experimental framework for integrating and evaluating such agents in realistic industrial scenarios. Preliminary results, evaluated with the help of an experienced analyst, indicate that MindRAG provides meaningful decision support for more efficient management of alarms, thereby improving the interpretability of CM systems.

[LG-54] SoK: Machine Unlearning for Large Language Models

链接: https://arxiv.org/abs/2506.09227
作者: Jie Ren,Yue Xing,Yingqian Cui,Charu C. Aggarwal,Hui Liu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large language model (LLM) unlearning has become a critical topic in machine learning, aiming to eliminate the influence of specific training data or knowledge without retraining the model from scratch. A variety of techniques have been proposed, including Gradient Ascent, model editing, and re-steering hidden representations. While existing surveys often organize these methods by their technical characteristics, such classifications tend to overlook a more fundamental dimension: the underlying intention of unlearning–whether it seeks to truly remove internal knowledge or merely suppress its behavioral effects. In this SoK paper, we propose a new taxonomy based on this intention-oriented perspective. Building on this taxonomy, we make three key contributions. First, we revisit recent findings suggesting that many removal methods may functionally behave like suppression, and explore whether true removal is necessary or achievable. Second, we survey existing evaluation strategies, identify limitations in current metrics and benchmarks, and suggest directions for developing more reliable and intention-aligned evaluations. Third, we highlight practical challenges–such as scalability and support for sequential unlearning–that currently hinder the broader deployment of unlearning methods. In summary, this work offers a comprehensive framework for understanding and advancing unlearning in generative AI, aiming to support future research and guide policy decisions around data removal and privacy.

[LG-55] Revisiting Graph Projections for Effective Complementary Product Recommendation

链接: https://arxiv.org/abs/2506.09209
作者: Leandro Anghinoni,Pablo Zivic,Jorge Adrian Sanchez
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Complementary product recommendation is a powerful strategy to improve customer experience and retail sales. However, recommending the right product is not a simple task because of the noisy and sparse nature of user-item interactions. In this work, we propose a simple yet effective method to predict a list of complementary products given a query item, based on the structure of a directed weighted graph projected from the user-item bipartite graph. We revisit bipartite graph projections for recommender systems and propose a novel approach for inferring complementarity relationships from historical user-item interactions. We compare our model with recent methods from the literature and show, despite the simplicity of our approach, an average improvement of +43% and +38% over sequential and graph-based recommenders, respectively, over different benchmarks.
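
The directed weighted projection can be sketched in a few lines of Python: project user histories onto an item-item graph where the weight of edge i→j counts how often j follows i, then rank the neighbors of a query item. The ordering rule and normalization below are assumptions for illustration, not the paper's exact scheme.

```python
from collections import defaultdict

# Each user's purchase history as an ordered list of item ids (toy data).
histories = [["phone", "case", "charger"],
             ["phone", "charger"],
             ["laptop", "mouse"],
             ["phone", "case"]]

weight = defaultdict(float)
out_deg = defaultdict(float)
for items in histories:
    for a, b in zip(items, items[1:]):     # directed edge a -> b (b follows a)
        weight[(a, b)] += 1.0
        out_deg[a] += 1.0

def complements(query, k=3):
    scored = [(b, w / out_deg[a]) for (a, b), w in weight.items() if a == query]
    return sorted(scored, key=lambda t: -t[1])[:k]

print(complements("phone"))   # e.g. [('case', 0.67), ('charger', 0.33)]
```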

[LG-56] mLaSDI: Multi-stage latent space dynamics identification

链接: https://arxiv.org/abs/2506.09207
作者: William Anderson,Kevin Chung,Youngsoo Choi
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Determining accurate numerical solutions of partial differential equations (PDEs) is an important task in many scientific disciplines. However, solvers can be computationally expensive, leading to the development of reduced-order models (ROMs). Recently, Latent Space Dynamics Identification (LaSDI) was proposed as a data-driven, non-intrusive ROM framework. LaSDI compresses the training data using an autoencoder and learns a system of user-chosen ordinary differential equations (ODEs), which govern the latent space dynamics. This allows for rapid predictions by interpolating and evolving the low-dimensional ODEs in the latent space. While LaSDI has produced effective ROMs for numerous problems, the autoencoder can have difficulty accurately reconstructing training data while also satisfying the imposed dynamics in the latent space, particularly in complex or high-frequency regimes. To address this, we propose multi-stage Latent Space Dynamics Identification (mLaSDI). With mLaSDI, several autoencoders are trained sequentially in stages, where each autoencoder learns to correct the error of the previous stages. We find that applying mLaSDI with small autoencoders results in lower prediction and reconstruction errors, while also reducing training time compared to LaSDI.
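
A schematic of the multi-stage structure: train a small autoencoder, then train each subsequent autoencoder on the reconstruction residual left by the previous stages. The latent-ODE identification part of LaSDI is omitted; only the stagewise error-correction loop is shown, on toy data.

```python
import torch

def make_ae(d, latent=2):
    enc = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.Tanh(),
                              torch.nn.Linear(16, latent))
    dec = torch.nn.Sequential(torch.nn.Linear(latent, 16), torch.nn.Tanh(),
                              torch.nn.Linear(16, d))
    return enc, dec

d, n_stages = 10, 3
data = torch.randn(256, d)
residual = data.clone()
stages = []
for s in range(n_stages):
    enc, dec = make_ae(d)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(300):                          # fit this stage to the current residual
        loss = ((dec(enc(residual)) - residual) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        residual = residual - dec(enc(residual))  # the next stage corrects what is left
    stages.append((enc, dec))
    print(f"stage {s}: remaining error {residual.pow(2).mean().item():.4f}")
```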

[LG-57] LaDCast: A Latent Diffusion Model for Medium-Range Ensemble Weather Forecasting

链接: https://arxiv.org/abs/2506.09193
作者: Yilin Zhuang,Karthik Duraisamy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate probabilistic weather forecasting demands both high accuracy and efficient uncertainty quantification, challenges that overburden both ensemble numerical weather prediction (NWP) and recent machine-learning methods. We introduce LaDCast, the first global latent-diffusion framework for medium-range ensemble forecasting, which generates hourly ensemble forecasts entirely in a learned latent space. An autoencoder compresses high-dimensional ERA5 reanalysis fields into a compact representation, and a transformer-based diffusion model produces sequential latent updates with arbitrary hour initialization. The model incorporates Geometric Rotary Position Embedding (GeoRoPE) to account for the Earth’s spherical geometry, a dual-stream attention mechanism for efficient conditioning, and sinusoidal temporal embeddings to capture seasonal patterns. LaDCast achieves deterministic and probabilistic skill close to that of the European Centre for Medium-Range Weather Forecasts (ECMWF) IFS-ENS, without any explicit perturbations. Notably, LaDCast demonstrates superior performance in tracking rare extreme events such as cyclones, capturing their trajectories more accurately than established models. By operating in latent space, LaDCast reduces storage and compute by orders of magnitude, demonstrating a practical path toward forecasting at kilometer-scale resolution in real time. We open-source our code and models and provide the training and evaluation pipelines at: this https URL.

[LG-58] Multivariate Long-term Time Series Forecasting with Fourier Neural Filter

链接: https://arxiv.org/abs/2506.09174
作者: Chenheng Xu,Dan Wu,Yixin Zhu,Ying Nian Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate long-term time series forecasting has been suffering from the challenge of capturing both temporal dependencies within variables and spatial correlations across variables simultaneously. Current approaches predominantly repurpose backbones from natural language processing or computer vision (e.g., Transformers), which fail to adequately address the unique properties of time series (e.g., periodicity). The research community lacks a dedicated backbone with temporal-specific inductive biases, instead relying on domain-agnostic backbones supplemented with auxiliary techniques (e.g., signal decomposition). We introduce FNF as the backbone and DBD as the architecture to provide excellent learning capabilities and optimal learning pathways for spatio-temporal modeling, respectively. Our theoretical analysis proves that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling, while information bottleneck theory demonstrates that DBD provides superior gradient flow and representation capacity compared to existing unified or sequential architectures. Our empirical evaluation across 11 public benchmark datasets spanning five domains (energy, meteorology, transportation, environment, and nature) confirms state-of-the-art performance with consistent hyperparameter settings. Notably, our approach achieves these results without any auxiliary techniques, suggesting that properly designed neural architectures can capture the inherent properties of time series, potentially transforming time series modeling in scientific and industrial applications.
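
While the abstract does not spell out the FNF block, a learnable frequency-domain filter gives the flavor of global frequency-domain processing: FFT along time, multiply by learned complex gains, inverse FFT. This is a generic sketch, not the paper's architecture, and the DBD design is not represented here.

```python
import torch

class FourierFilter(torch.nn.Module):
    def __init__(self, seq_len, channels):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # One learnable complex gain per (frequency, channel) pair.
        self.w = torch.nn.Parameter(
            torch.randn(n_freq, channels, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                 # x: (batch, seq_len, channels)
        xf = torch.fft.rfft(x, dim=1)     # to the frequency domain
        xf = xf * self.w                  # global frequency mixing
        return torch.fft.irfft(xf, n=x.size(1), dim=1)

x = torch.randn(4, 96, 7)                # 96-step window, 7 variables
print(FourierFilter(96, 7)(x).shape)     # torch.Size([4, 96, 7])
```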

[LG-59] Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

链接: https://arxiv.org/abs/2506.09163
作者: Daniel Jenson,Jhonathan Navott,Piotr Grynfelder,Mengyan Zhang,Makkunda Sharma,Elizaveta Semenova,Seth Flaxman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this tradeoff is often unnecessary, particularly when modeling fully or partially translation invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high dimensional fixed effects, and (5) scale gracefully – running inference with over 1M test points with 100K context points in under a minute on a single 24GB GPU.

[LG-60] TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

链接: https://arxiv.org/abs/2506.09114
作者: Jialin Chen,Ziyu Zhao,Gaukhar Nurbek,Aosong Feng,Ali Maatouk,Leandros Tassiulas,Yifeng Gao,Rex Ying
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.

[LG-61] CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

链接: https://arxiv.org/abs/2506.09110
作者: Jingying Ma,Feng Wu,Qika Lin,Yucheng Xing,Chenyu Liu,Ziyu Jia,Mengling Feng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) provides real-time insights into brain activity and is widely used in neuroscience. However, variations in channel configurations, sequence lengths, and task objectives limit the transferability of traditional task-specific models. Although recent EEG foundation models (EFMs) aim to learn generalizable representations, they struggle with limited heterogeneous representation capacity and inefficiency in capturing multi-scale brain dependencies. To address these challenges, we propose CodeBrain, an efficient EFM structurally aligned with brain organization, trained in two stages. (1) We introduce a TFDual-Tokenizer that independently tokenizes heterogeneous temporal and frequency components, enabling a quadratic expansion of the discrete representation space. This also offers a degree of interpretability through cross-domain token analysis. (2) We propose the EEGSSM, which combines a structured global convolution architecture and a sliding window attention mechanism to jointly model sparse long-range and local dependencies. Unlike fully connected Transformer models, EEGSSM better reflects the brain’s small-world topology and efficiently captures EEG’s inherent multi-scale structure. EEGSSM is trained with a masked self-supervised learning objective to predict token indices obtained in TFDual-Tokenizer. Comprehensive experiments on 10 public EEG datasets demonstrate the generalizability of CodeBrain with linear probing. By offering biologically informed and interpretable EEG modeling, CodeBrain lays the foundation for future neuroscience research. Both code and pretraining weights will be released in the future version.

[LG-62] Variational Inference Optimized Using the Curved Geometry of Coupled Free Energy

链接: https://arxiv.org/abs/2506.09091
作者: Kenric Nelson,Igor Oliveira,Amenah Al-Najafi,Fode Zhang,Hon Keung Tony Ng
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 11 pages, 2 figures, AGI-25

点击查看摘要

Abstract:We introduce an optimization framework for variational inference based on the coupled free energy, extending variational inference techniques to account for the curved geometry of the coupled exponential family. This family includes important heavy-tailed distributions such as the generalized Pareto and the Student’s t. By leveraging the coupled free energy, which is equal to the coupled evidence lower bound (ELBO) of the inverted probabilities, we improve the accuracy and robustness of the learned model. The curved geometry is characterized by the coupled generalizations of the Fisher information metric and the affine connection. The method is applied to the design of a coupled variational autoencoder (CVAE). By using the coupling for both the distributions and cost functions, the reconstruction metric is derived to still be the mean-square average loss with modified constants. The novelty comes from sampling the heavy-tailed latent distribution with its associated coupled probability, which has faster decaying tails. The result is the ability to train a model with high penalties in the tails, while assuring that the training samples have a reduced number of outliers. The Wasserstein-2 or Fréchet Inception Distance of the reconstructed CelebA images shows the CVAE has a 3% improvement over the VAE after 5 epochs of training.

[LG-63] Integrating Asynchronous AdaBoost into Federated Learning: Five Real World Applications

链接: https://arxiv.org/abs/2506.09090
作者: Arthur Oghlukyan,Nuria Gomez Blas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive analysis of an enhanced asynchronous AdaBoost framework for federated learning (FL), focusing on its application across five distinct domains: computer vision on edge devices, blockchain-based model transparency, on-device mobile personalization, IoT anomaly detection, and federated healthcare diagnostics. The proposed algorithm incorporates adaptive communication scheduling and delayed weight compensation to reduce synchronization frequency and communication overhead while preserving or improving model accuracy. We examine how these innovations improve communication efficiency, scalability, convergence, and robustness in each domain. Comparative metrics including training time, communication overhead, convergence iterations, and classification accuracy are evaluated using data and estimates derived from Oghlukyan’s enhanced AdaBoost framework. Empirical results show, for example, training time reductions on the order of 20-35% and communication overhead reductions of 30-40% compared to baseline AdaBoost, with convergence achieved in significantly fewer boosting rounds. Tables and charts summarize these improvements by domain. Mathematical formulations of the adaptive scheduling rule and error-driven synchronization thresholds are provided. Overall, the enhanced AdaBoost exhibits markedly improved efficiency and robustness across diverse FL scenarios, suggesting broad applicability of the approach.

[LG-64] Spiking Neural Models for Decision-Making Tasks with Learning

链接: https://arxiv.org/abs/2506.09087
作者: Sophie Jaffard(LJAD),Giulia Mezzadri,Patricia Reynaud-Bouret(LJAD, CNRS),Etienne Tanré(LJAD, CRISAM)
类目: Machine Learning (cs.LG); Probability (math.PR); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In cognition, response times and choices in decision-making tasks are commonly modeled using Drift Diffusion Models (DDMs), which describe the accumulation of evidence for a decision as a stochastic process, specifically a Brownian motion, with the drift rate reflecting the strength of the evidence. In the same vein, the Poisson counter model describes the accumulation of evidence as discrete events whose counts over time are modeled as Poisson processes, and has a spiking neuron interpretation as these processes are used to model neuronal activities. However, these models lack a learning mechanism and are limited to tasks where participants have prior knowledge of the categories. To bridge the gap between cognitive and biological models, we propose a biologically plausible Spiking Neural Network (SNN) model for decision-making that incorporates a learning mechanism and whose neuron activities are modeled by a multivariate Hawkes process. First, we show a coupling result between the DDM and the Poisson counter model, establishing that these two models provide similar categorizations and reaction times and that the DDM can be approximated by spiking Poisson neurons. To go further, we show that a particular DDM with correlated noise can be derived from a Hawkes network of spiking neurons governed by a local learning rule. In addition, we designed an online categorization task to evaluate the model predictions. This work provides a significant step toward integrating biologically relevant neural mechanisms into cognitive models, fostering a deeper understanding of the relationship between neural activity and behavior.

[LG-65] A Deep Generative Model for the Simulation of Discrete Karst Networks

链接: https://arxiv.org/abs/2506.09832
作者: Dany Lauzon,Julien Straubhaar,Philippe Renard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 15 figures, submitted to Earth and Space Science

点击查看摘要

Abstract:The simulation of discrete karst networks presents a significant challenge due to the complexity of the physicochemical processes occurring within various geological and hydrogeological contexts over extended periods. This complex interplay leads to a wide variety of karst network patterns, each intricately linked to specific hydrogeological conditions. We explore a novel approach that represents karst networks as graphs and applies graph generative models (deep learning techniques) to capture the intricate nature of karst environments. In this representation, nodes retain spatial information and properties, while edges signify connections between nodes. Our generative process consists of two main steps. First, we utilize graph recurrent neural networks (GraphRNN) to learn the topological distribution of karst networks. GraphRNN decomposes the graph simulation into a sequential generation of nodes and edges, informed by previously generated structures. Second, we employ denoising diffusion probabilistic models on graphs (G-DDPM) to learn node features (spatial coordinates and other properties). G-DDPMs enable the generation of nodes features on the graphs produced by the GraphRNN that adhere to the learned statistical properties by sampling from the derived probability distribution, ensuring that the generated graphs are realistic and capture the essential features of the original data. We test our approach using real-world karst networks and compare generated subgraphs with actual subgraphs from the database, by using geometry and topology metrics. Our methodology allows stochastic simulation of discrete karst networks across various types of formations, a useful tool for studying the behavior of physical processes such as flow and transport.

[LG-66] Automatic Treatment Planning using Reinforcement Learning for High-dose-rate Prostate Brachytherapy

链接: https://arxiv.org/abs/2506.09805
作者: Tonghe Wang,Yining Feng,Xiaofeng Yang
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Purpose: In high-dose-rate (HDR) prostate brachytherapy procedures, the pattern of needle placement solely relies on physician experience. We investigated the feasibility of using reinforcement learning (RL) to provide needle positions and dwell times based on patient anatomy during the pre-planning stage. This approach would reduce procedure time and ensure consistent plan quality. Materials and Methods: We train a RL agent to adjust the position of one selected needle and all the dwell times on it to maximize a pre-defined reward function after observing the environment. After adjusting, the RL agent then moves on to the next needle, until all needles are adjusted. Multiple rounds are played by the agent until the maximum number of rounds is reached. Plan data from 11 prostate HDR boost patients (1 for training, and 10 for testing) treated in our clinic were included in this study. The dosimetric metrics and the number of used needles of RL plans were compared to those of the clinical results (ground truth). Results: On average, RL plans and clinical plans have very similar prostate coverage (Prostate V100) and Rectum D2cc (no statistical significance), while RL plans have lower prostate hotspot (Prostate V150) and Urethra D20% values, with statistical significance. Moreover, RL plans use 2 fewer needles than clinical plans on average. Conclusion: We present the first study demonstrating the feasibility of using reinforcement learning to autonomously generate clinically practical HDR prostate brachytherapy plans. This RL-based method achieved equal or improved plan quality compared to conventional clinical approaches while requiring fewer needles. With minimal data requirements and strong generalizability, this approach has substantial potential to standardize brachytherapy planning, reduce clinical variability, and enhance patient outcomes.

[LG-67] Cross-Channel Unlabeled Sensing over a Union of Signal Subspaces ICASSP2025

链接: https://arxiv.org/abs/2506.09773
作者: Taulant Koka,Manolis C. Tsakiris,Benjamín Béjar Haro,Michael Muma
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted to ICASSP 2025. ©2025 IEEE. Personal use of this material is permitted

点击查看摘要

Abstract:Cross-channel unlabeled sensing addresses the problem of recovering a multi-channel signal from measurements that were shuffled across channels. This work expands the cross-channel unlabeled sensing framework to signals that lie in a union of subspaces. The extension allows for handling more complex signal structures and broadens the framework to tasks like compressed sensing. These mismatches between samples and channels often arise in applications such as whole-brain calcium imaging of freely moving organisms or multi-target tracking. We improve over previous models by deriving tighter bounds on the required number of samples for unique reconstruction, while supporting more general signal types. The approach is validated through an application in whole-brain calcium imaging, where organism movements disrupt sample-to-neuron mappings. This demonstrates the utility of our framework in real-world settings with imprecise sample-channel associations, achieving accurate signal reconstruction.

[LG-68] Empirical and computer-aided robustness analysis of long-step and accelerated methods in smooth convex optimization

链接: https://arxiv.org/abs/2506.09730
作者: Pierre Vernimmen,François Glineur
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work assesses both empirically and theoretically, using the performance estimation methodology, how robust different first-order optimization methods are when subject to relative inexactness in their gradient computations. Relative inexactness occurs, for example, when compressing the gradient using fewer bits of information, which happens when dealing with large-scale problems on GPUs. Three major families of methods are analyzed: constant step gradient descent, long-step methods, and accelerated methods. The latter two are first shown to be theoretically not robust to inexactness. Then, a semi-heuristic shortening factor is introduced to improve their theoretical guarantees. All methods are subsequently tested on a concrete inexact problem, with two different types of relative inexactness, and it is observed that both accelerated methods are much more robust than expected, and that the shortening factor significantly helps the long-step methods. In the end, all shortened methods appear to be promising, even in this inexact setting.

[LG-69] Assessing the Quality of Denoising Diffusion Models in Wasserstein Distance: Noisy Score and Optimal Bounds

链接: https://arxiv.org/abs/2506.09681
作者: Vahan Arsenyan,Elen Vardanyan,Arnak Dalalyan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Generative modeling aims to produce new random examples from an unknown target distribution, given access to a finite collection of examples. Among the leading approaches, denoising diffusion probabilistic models (DDPMs) construct such examples by mapping a Brownian motion via a diffusion process driven by an estimated score function. In this work, we first provide empirical evidence that DDPMs are robust to constant-variance noise in the score evaluations. We then establish finite-sample guarantees in Wasserstein-2 distance that exhibit two key features: (i) they characterize and quantify the robustness of DDPMs to noisy score estimates, and (ii) they achieve faster convergence rates than previously known results. Furthermore, we observe that the obtained rates match those known in the Gaussian case, implying their optimality.

[LG-70] Scaling Laws for Uncertainty in Deep Learning

链接: https://arxiv.org/abs/2506.09648
作者: Mattia Rosso,Simone Rossi,Giulio Franzese,Markus Heinonen,Maurizio Filippone
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has recently revealed the existence of scaling laws, demonstrating that model performance follows predictable trends based on dataset and model sizes. Inspired by these findings and fascinating phenomena emerging in the over-parameterized regime, we examine a parallel direction: do similar scaling laws govern predictive uncertainties in deep learning? In identifiable parametric models, such scaling laws can be derived in a straightforward manner by treating model parameters in a Bayesian way. In this case, for example, we obtain O(1/N) contraction rates for epistemic uncertainty with respect to the number of data N . However, in over-parameterized models, these guarantees do not hold, leading to largely unexplored behaviors. In this work, we empirically show the existence of scaling laws associated with various measures of predictive uncertainty with respect to dataset and model sizes. Through experiments on vision and language tasks, we observe such scaling laws for in- and out-of-distribution predictive uncertainty estimated through popular approximate Bayesian inference and ensemble methods. Besides the elegance of scaling laws and the practical utility of extrapolating uncertainties to larger data or models, this work provides strong evidence to dispel recurring skepticism against Bayesian approaches: “In many applications of deep learning we have so much data available: what do we need Bayes for?”. Our findings show that “so much data” is typically not enough to make epistemic uncertainty negligible.
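
The measurement itself is simple to picture: estimate epistemic uncertainty (here, the disagreement of an ensemble of linear regressions as a stand-in model) at several dataset sizes, then fit a power law on a log-log scale. For this identifiable toy the fitted exponent comes out close to -1, matching the O(1/N) contraction mentioned above; the paper's interest is in what happens for over-parameterized networks instead.

```python
import numpy as np

rng = np.random.default_rng(4)

def epistemic_at(n, n_ensemble=20, d=5):
    x_test = rng.standard_normal(d)
    preds = []
    for _ in range(n_ensemble):
        X = rng.standard_normal((n, d))
        y = X @ np.ones(d) + rng.standard_normal(n)   # noisy linear ground truth
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds.append(x_test @ w)
    return np.var(preds)                              # ensemble disagreement

sizes = np.array([50, 100, 200, 400, 800, 1600])
u = np.array([epistemic_at(n) for n in sizes])
slope, _ = np.polyfit(np.log(sizes), np.log(u), 1)    # log-log power-law fit
print(f"fitted exponent: {slope:.2f}")                # roughly -1, i.e. O(1/N)
```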

[LG-71] Evasion Attacks Against Bayesian Predictive Models UAI’25

链接: https://arxiv.org/abs/2506.09640
作者: Pablo G. Arce,Roi Naveiro,David Ríos Insua
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted as an oral presentation at UAI’25

点击查看摘要

Abstract:There is an increasing interest in analyzing the behavior of machine learning systems against adversarial attacks. However, most of the research in adversarial machine learning has focused on studying weaknesses against evasion or poisoning attacks to predictive models in classical setups, with the susceptibility of Bayesian predictive models to attacks remaining underexplored. This paper introduces a general methodology for designing optimal evasion attacks against such models. We investigate two adversarial objectives: perturbing specific point predictions and altering the entire posterior predictive distribution. For both scenarios, we propose novel gradient-based attacks and study their implementation and properties in various computational setups.

[LG-72] LLM -Powered CPI Prediction Inference with Online Text Time Series

链接: https://arxiv.org/abs/2506.09516
作者: Yingying Fan,Jinchi Lv,Ao Sun,Yurou Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 73 pages, 13 figures

点击查看摘要

Abstract:Forecasting the Consumer Price Index (CPI) is an important yet challenging task in economics, where most existing approaches rely on low-frequency, survey-based data. With the recent advances of large language models (LLMs), there is growing potential to leverage high-frequency online text data for improved CPI prediction, an area still largely unexplored. This paper proposes LLM-CPI, an LLM-based approach for CPI prediction inference incorporating online text time series. We collect a large set of high-frequency online texts from a popularly used Chinese social network site and employ LLMs such as ChatGPT and the trained BERT models to construct continuous inflation labels for posts that are related to inflation. Online text embeddings are extracted via LDA and BERT. We develop a joint time series framework that combines monthly CPI data with LLM-generated daily CPI surrogates. The monthly model employs an ARX structure combining observed CPI data with text embeddings and macroeconomic variables, while the daily model uses a VARX structure built on LLM-generated CPI surrogates and text embeddings. We establish the asymptotic properties of the method and provide two forms of constructed prediction intervals. The finite-sample performance and practical advantages of LLM-CPI are demonstrated through both simulation and real data examples.
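
The monthly ARX component can be sketched as ordinary least squares on CPI lags plus exogenous text-embedding features. Everything below is simulated; the lag order, embedding dimension, and coefficients are illustrative, not those estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
T, p, k = 120, 2, 3                        # months, AR order, embedding dimension
emb = rng.standard_normal((T, k))          # monthly text embeddings (toy)
cpi = np.zeros(T)
for t in range(p, T):                      # simulate an ARX(2) ground truth
    cpi[t] = (0.6 * cpi[t - 1] + 0.2 * cpi[t - 2]
              + emb[t] @ np.array([0.5, -0.3, 0.1])
              + 0.1 * rng.standard_normal())

# Design matrix: intercept, CPI lags, and exogenous text features.
rows = [np.concatenate(([1.0], cpi[t - p:t][::-1], emb[t])) for t in range(p, T)]
X, y = np.array(rows), cpi[p:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, AR coefs, text coefs:", np.round(beta, 2))

# One-step-ahead forecast given next month's embedding.
x_next = np.concatenate(([1.0], cpi[-1:-p - 1:-1], rng.standard_normal(k)))
print("forecast:", x_next @ beta)
```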

[LG-73] Attention-Bayesian Hybrid Approach to Modular Multiple Particle Tracking

Link: https://arxiv.org/abs/2506.09441
Authors: Piyush Mishra (I2M, FRESNEL, TCLS, AMU), Philippe Roudot (FRESNEL, TCLS, CNRS)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Tracking multiple particles in noisy and cluttered scenes remains challenging due to a combinatorial explosion of trajectory hypotheses, which scales super-exponentially with the number of particles and frames. The transformer architecture has shown a significant improvement in robustness against this high combinatorial load. However, its performance still falls short of conventional Bayesian filtering approaches in scenarios presenting a reduced set of trajectory hypotheses. This suggests that while transformers excel at narrowing down possible associations, they may not be able to reach the optimality of the Bayesian approach in locally sparse scenarios. Hence, we introduce a hybrid tracking framework that combines the ability of self-attention to learn the underlying representation of particle behavior with the reliability and interpretability of Bayesian filtering. We perform trajectory-to-detection association by solving a label prediction problem, using a transformer encoder to infer soft associations between detections across frames. This prunes the hypothesis set, enabling efficient multiple-particle tracking within a Bayesian filtering framework. Our approach demonstrates improved tracking accuracy and robustness against spurious detections, offering a solution for high-clutter multiple-particle tracking scenarios.
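
A stripped-down sketch of the hybrid loop, with a dot-product score standing in for the learned transformer associations: attention-style scores prune detection-to-track hypotheses, and a scalar Kalman update handles the survivors. Everything here is illustrative, not the paper's architecture.

```python
# Hybrid association-then-filter sketch.
import numpy as np

rng = np.random.default_rng(3)
tracks = rng.normal(size=(3, 2))          # predicted track positions (x, y)
dets = np.vstack([tracks + 0.1 * rng.normal(size=(3, 2)),
                  rng.uniform(-5, 5, size=(2, 2))])  # 3 true + 2 clutter

# "Attention": softmax over negative squared distances, a stand-in for
# learned query-key scores between track and detection embeddings.
logits = -((tracks[:, None, :] - dets[None, :, :]) ** 2).sum(-1)
soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Prune hypotheses: keep only associations with enough attention mass;
# the pruned set feeds the Bayesian filter.
for i in range(len(tracks)):
    j = int(np.argmax(soft[i]))
    if soft[i, j] < 0.5:
        continue  # no confident association; the track coasts this frame
    P, R = 0.2, 0.05                       # prior and measurement variances
    K = P / (P + R)                        # scalar Kalman gain per coordinate
    tracks[i] = tracks[i] + K * (dets[j] - tracks[i])

print(tracks)
```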

[LG-74] A theoretical basis for model collapse in recursive training

Link: https://arxiv.org/abs/2506.09401
Authors: Vivek Shripad Borkar
Subjects: Probability (math.PR); Machine Learning (cs.LG)
Comments:

Abstract:It is known that recursive training from generative models can lead to the so-called 'collapse' of the simulated probability distribution. This note shows that one in fact gets two different asymptotic behaviours depending on whether an external source, however minor, is also contributing samples.
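
The dichotomy is easy to see in a toy simulation, sketched below under illustrative parameters: a Gaussian refit recursively on its own samples collapses, while mixing in a small fraction of samples from an external source stabilizes it.

```python
# Toy model-collapse simulation (illustrative, not the note's analysis).
import numpy as np

rng = np.random.default_rng(4)

def recurse(external_frac, generations=1000, n=100):
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        n_ext = int(external_frac * n)
        samples = np.concatenate([
            mu + sigma * rng.normal(size=n - n_ext),  # model's own output
            rng.normal(size=n_ext),                   # external source, N(0, 1)
        ])
        mu, sigma = samples.mean(), samples.std()     # refit on the mixture
    return sigma

print("no external source :", recurse(0.0))   # sigma drifts toward 0
print("5% external source :", recurse(0.05))  # sigma stays near 1
```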

[LG-75] Surrogate models to optimize plasma assisted atomic layer deposition in high aspect ratio features

Link: https://arxiv.org/abs/2506.09313
Authors: Angel Yanguas-Gil, Jeffrey W. Elam
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)
Comments:

Abstract:In this work we explore surrogate models to optimize plasma enhanced atomic layer deposition (PEALD) in high aspect ratio features. In plasma-based processes such as PEALD and atomic layer etching, surface recombination can dominate the reactivity of plasma species with the surface, which can lead to unfeasibly long exposure times to achieve full conformality inside nanostructures like high aspect ratio vias. Using a synthetic dataset based on simulations of PEALD, we train artificial neural networks to predict saturation times based on cross section thickness data obtained for partially coated conditions. The results obtained show that just two experiments in undersaturated conditions contain enough information to predict saturation times within 10% of the ground truth. A surrogate model trained to determine whether surface recombination dominates the plasma-surface interactions in a PEALD process achieves 99% accuracy. This demonstrates that machine learning can provide a new pathway to accelerate the optimization of PEALD processes in areas such as microelectronics. Our approach can be easily extended to atomic layer etching and more complex structures.
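
A hedged sketch of the surrogate idea: a small neural network maps two partially saturated coverage profiles to the saturation time. The toy profile model below is an illustrative stand-in for the paper's PEALD simulations, not their dataset.

```python
# Surrogate-model sketch: coverage profiles -> saturation time.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)

def profile(t_sat, t_expose, depth_pts=32):
    # Toy coverage-vs-depth curve: a saturation front that advances with dose.
    z = np.linspace(0, 1, depth_pts)
    front = min(t_expose / t_sat, 1.0)
    return 1.0 / (1.0 + np.exp(40.0 * (z - front)))

t_sats = rng.uniform(2.0, 10.0, size=2000)
# Two fixed undersaturated exposures per sample, echoing the finding that
# two partial-coverage experiments suffice.
X = np.array([np.concatenate([profile(t, 1.0), profile(t, 2.0)]) for t in t_sats])
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X, t_sats)
print("predicted:", model.predict(X[:3]).round(2), "true:", t_sats[:3].round(2))
```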

[LG-76] AI-Driven SEEG Channel Ranking for Epileptogenic Zone Localization

Link: https://arxiv.org/abs/2506.09255
Authors: Saeed Hashemi, Genchang Peng, Mehrdad Nourani, Omar Nofal, Jay Harvey
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Accepted to be presented at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025). This version is submitted to arXiv prior to final IEEE formatting and publication

Abstract:Stereo-electroencephalography (SEEG) is an invasive technique to implant depth electrodes and collect data for pre-surgery evaluation. Visual inspection of signals recorded from hundreds of channels is time-consuming and inefficient. We propose a machine learning approach to rank the impactful channels by combining clinicians' selections with computational findings. A classification model using XGBoost is trained to learn the discriminative features of each channel during ictal periods. Then, SHapley Additive exPlanations (SHAP) scoring is utilized to rank SEEG channels based on their contribution to seizures. A channel extension strategy is also incorporated to expand the search space and identify suspicious epileptogenic zones beyond those selected by clinicians. For validation, SEEG data for five patients were analyzed, showing promising results in terms of accuracy, consistency, and explainability.
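
The described pipeline maps naturally onto standard tooling. The sketch below (assuming the xgboost and shap packages and synthetic per-channel features) fits a classifier on ictal windows and ranks channels by mean absolute SHAP value; the actual feature construction and the channel-extension step are not reproduced.

```python
# XGBoost + SHAP channel-ranking sketch with synthetic SEEG features.
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(6)
n_channels, n_windows = 40, 300
# Rows: time windows; columns: one discriminative feature per SEEG channel
# (e.g., ictal band power). A few channels truly drive the seizure label.
X = rng.normal(size=(n_windows, n_channels))
y = (X[:, [3, 17, 25]].sum(axis=1) + 0.5 * rng.normal(size=n_windows) > 0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y.astype(int))

explainer = shap.TreeExplainer(model)
shap_vals = explainer.shap_values(X)           # (windows, channels)
ranking = np.argsort(-np.abs(shap_vals).mean(axis=0))
print("top channels:", ranking[:5])            # should surface 3, 17, 25
```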

[LG-77] Detecting malignant dynamics on very few blood sample using signature coefficients

Link: https://arxiv.org/abs/2506.09097
Authors: Rémi Vaucher, Stéphane Chrétien
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under review

Abstract:Recent discoveries have suggested that the promising avenue of using circulating tumor DNA (ctDNA) levels in blood samples provides reasonable accuracy for cancer monitoring, with extremely low burden on the patient's side. It is known that the presence of ctDNA can result from various mechanisms leading to DNA release from cells, such as apoptosis, necrosis or active secretion. One key idea in recent cancer monitoring studies is that monitoring the dynamics of ctDNA levels might be sufficient for early multi-cancer detection. This interesting idea has been turned into commercial products, e.g., by the company GRAIL. In the present work, we propose to explore the use of Signature theory for detecting aggressive cancer tumors based on the analysis of blood samples. Our approach combines tools from continuous-time Markov modelling for the dynamics of ctDNA levels in the blood with Signature theory for building efficient testing procedures. Signature theory is a topic of growing interest in the Machine Learning community (see Chevyrev 2016 and Fermanian 2021), which is now recognised as a powerful feature extraction tool for irregularly sampled signals. The method proposed in the present paper is shown to correctly address the challenging problem of overcoming the inherent data scarcity due to the extremely small number of blood samples per patient. The relevance of our approach is illustrated with extensive numerical experiments that confirm the efficiency of the proposed pipeline.
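
For intuition, the sketch below computes depth-2 signature coefficients of short, irregularly sampled ctDNA series by hand (increments plus discrete iterated integrals) and feeds them to a linear classifier; the growth-rate dynamics and labels are synthetic assumptions, not the paper's Markov model.

```python
# Depth-2 path-signature features for short time series.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

def sig2(path):
    # path: (n_samples, 2) with columns (time, ctDNA level).
    dx = np.diff(path, axis=0)
    s1 = dx.sum(axis=0)                               # level-1: increments
    # Level-2 iterated integrals via the piecewise-linear (Chen) formula.
    cum = np.vstack([np.zeros(2), np.cumsum(dx, axis=0)[:-1]])
    s2 = (cum[:, :, None] * dx[:, None, :]).sum(axis=0) + \
         0.5 * (dx[:, :, None] * dx[:, None, :]).sum(axis=0)
    return np.concatenate([s1, s2.ravel()])

def sample_patient(aggressive, n_draws=6):
    t = np.sort(rng.uniform(0, 1, size=n_draws))
    growth = 3.0 if aggressive else 0.5               # assumed dynamics
    level = np.exp(growth * t) + 0.1 * rng.normal(size=n_draws)
    return np.column_stack([t, level])

labels = rng.integers(0, 2, size=400)
feats = np.array([sig2(sample_patient(bool(l))) for l in labels])
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", clf.score(feats, labels))
```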

[LG-78] A Probabilistic Framework for Imputing Genetic Distances in Spatiotemporal Pathogen Models

Link: https://arxiv.org/abs/2506.09076
Authors: Haley Stone, Jing Du, Hao Xue, Matthew Scotch, David Heslop, Andreas Züfle, Chandini Raina MacIntyre, Flora Salim
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
Comments: 9 pages, 3 figures

Abstract:Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known sequences within defined transmission chains, using time-aware evolutionary distance modeling. The method estimates pairwise divergence from collection dates and observed genetic distances, enabling biologically plausible imputation grounded in observed divergence patterns, without requiring sequence alignment or known transmission chains. Applied to highly pathogenic avian influenza A/H5 cases in wild birds in the United States, this approach supports scalable, uncertainty-aware augmentation of genomic datasets and enhances the integration of evolutionary information into spatiotemporal modeling workflows.
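
One way to read "time-aware evolutionary distance modeling" is the molecular-clock Poisson sketch below: calibrate a per-day substitution rate from sequenced pairs, then draw plausible distances for unsequenced pairs from their collection-date gap. This is an illustrative reading with made-up numbers, not the paper's exact framework.

```python
# Time-aware genetic-distance imputation sketch under a Poisson clock.
import numpy as np

rng = np.random.default_rng(8)

# Calibrate a per-day substitution rate from sequenced pairs:
# observed SNP distances and collection-date gaps (days).
obs_dist = np.array([2, 5, 9, 4, 11], dtype=float)
obs_days = np.array([10, 30, 60, 25, 70], dtype=float)
rate = obs_dist.sum() / obs_days.sum()      # MLE of Poisson rate per day

def impute_distance(days_apart, n_draws=1000):
    # Predictive-style draws for an unsequenced pair.
    draws = rng.poisson(rate * days_apart, size=n_draws)
    return draws.mean(), np.percentile(draws, [2.5, 97.5])

mean, ci = impute_distance(days_apart=45)
print(f"imputed distance: {mean:.1f} SNPs, 95% interval {ci}")
```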

Information Retrieval

[IR-0] Discrete Scale-invariant Metric Learning for Efficient Collaborative Filtering

Link: https://arxiv.org/abs/2506.09898
Authors: Yan Zhang, Li Deng, Lixin Duan, Sami Azam
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Metric learning has attracted extensive interest for its ability to provide personalized recommendations based on the importance of observed user-item interactions. Current metric learning methods aim to push negative items away from the corresponding users and positive items by an absolute geometrical distance margin. However, items may come from imbalanced categories with different intra-class variations. Thus, the absolute distance margin may not be ideal for estimating the difference between user preferences over imbalanced items. To this end, we propose a new method, named discrete scale-invariant metric learning (DSIML), by adding binary constraints to users and items, which maps users and items into binary codes of a shared Hamming subspace to speed up the online recommendation. Specifically, we first propose a scale-invariant margin based on angles at the negative item points in the shared Hamming subspace. Then, we derive a scale-invariant triple hinge loss based on the margin. To capture more preference difference information, we integrate a pairwise ranking loss into the scale-invariant loss in the proposed model. Due to the difficulty of directly optimizing the mixed integer optimization problem formulated with log-sum-exp functions, we seek to optimize its variational quadratic upper bound and learn hash codes with an alternating optimization strategy. Experiments on benchmark datasets clearly show that our proposed method is superior to competitive metric learning and hashing-based baselines for recommender systems. The implementation code is available at this https URL.
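
The payoff of the shared Hamming subspace is cheap online scoring. The sketch below replaces the DSIML objective with a naive sign() binarization of random embeddings, purely to show the retrieval step: packed binary codes ranked by Hamming distance.

```python
# Hamming-space retrieval sketch (code learning itself is not shown).
import numpy as np

rng = np.random.default_rng(9)
bits = 64
user_real = rng.normal(size=(1, bits))
item_real = rng.normal(size=(10000, bits))

# Binarize and pack into bytes so scoring becomes bit operations
# rather than float arithmetic.
user_code = np.packbits(user_real > 0, axis=1)
item_codes = np.packbits(item_real > 0, axis=1)

# Hamming distance = popcount(user XOR item); lower is a better match.
xor = np.bitwise_xor(item_codes, user_code)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
top10 = np.argsort(hamming)[:10]
print("top items:", top10, "distances:", hamming[top10])
```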

[IR-1] MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed

Link: https://arxiv.org/abs/2506.09409
Authors: Jiaqi Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Effective video retrieval remains challenging due to the complexity of integrating visual, auditory, and textual modalities. In this paper, we explore unified retrieval methods using OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, in the context of the MAGMaR shared task. Evaluated on the comprehensive MultiVENT 2.0 dataset, OmniEmbed generates unified embeddings for text, images, audio, and video, enabling robust multimodal retrieval. By fine-tuning OmniEmbed with the combined multimodal data (visual frames, audio tracks, and textual descriptions) provided in MultiVENT 2.0, we achieve substantial improvements in complex, multilingual video retrieval tasks. Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025, highlighting the practical effectiveness of our unified multimodal retrieval approach. The model checkpoint from this work is open-sourced.
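
The retrieval step itself is modality-agnostic once embeddings are unified. A schematic sketch, with a placeholder encoder standing in for OmniEmbed (whose actual API is not reproduced here): L2-normalize, score by cosine similarity, rank.

```python
# Unified-embedding retrieval sketch with a placeholder encoder.
import numpy as np

rng = np.random.default_rng(10)
dim = 768

def embed(_modality_inputs):
    # Stand-in for a unified encoder over text / frames / audio.
    return rng.normal(size=dim)

video_index = np.stack([embed(v) for v in range(5000)])
video_index /= np.linalg.norm(video_index, axis=1, keepdims=True)

query = embed("query text")
query /= np.linalg.norm(query)

scores = video_index @ query            # cosine similarity after normalization
print("top videos:", np.argsort(-scores)[:10])
```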

[IR-2] In Crowd Veritas: Leveraging Human Intelligence To Fight Misinformation

Link: https://arxiv.org/abs/2506.09221
Authors: Michael Soprano
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: PhD thesis, University of Udine, defended May 2023, 458 pages

Abstract:The spread of online misinformation poses serious threats to democratic societies. Traditionally, expert fact-checkers verify the truthfulness of information through investigative processes. However, the volume and immediacy of online content present major scalability challenges. Crowdsourcing offers a promising alternative by leveraging non-expert judgments, but it introduces concerns about bias, accuracy, and interpretability. This thesis investigates how human intelligence can be harnessed to assess the truthfulness of online information, focusing on three areas: misinformation assessment, cognitive biases, and automated fact-checking systems. Through large-scale crowdsourcing experiments and statistical modeling, it identifies key factors influencing human judgments and introduces a model for the joint prediction and explanation of truthfulness. The findings show that non-expert judgments often align with expert assessments, particularly when factors such as timing and experience are considered. By deepening our understanding of human judgment and bias in truthfulness assessment, this thesis contributes to the development of more transparent, trustworthy, and interpretable systems for combating misinformation.

Attachment Download

Click to download the full list of today's papers