Arxiv今日论文 | 2025-06-30

本篇博文主要内容为 2025-06-30 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决生成式 AI (Generative AI) 在科学再现能力上的不足，具体表现为难以根据已有研究结果重新实现已知的创新。其解决方案的关键在于构建一个名为 Automated LLM Speedrunning Benchmark 的基准测试，该基准基于 NanoGPT speedrun 竞赛，通过提供训练脚本和不同层次的提示信息，评估 AI 代理在改进大型语言模型（LLM）训练效率方面的能力。该基准设计具有实际性和可访问性，能够有效衡量 LLM 自动化科学再现的能力，这是自主研究代理所需的一项必要但不充分的技能。

链接: https://arxiv.org/abs/2506.22419
作者: Bingchen Zhao,Despoina Magka,Minqi Jiang,Xian Li,Roberta Raileanu,Tatiana Shavrina,Jean-Christophe Gagnon-Audet,Kelvin Niu,Shagun Sodhani,Michael Shvartsman,Andrei Lupu,Alisia Lupidi,Edan Toledo,Karen Hambardzumyan,Martin Josifoski,Thomas Foster,Lucia Cipolina-Kun,Abhishek Charnalia,Derek Dunfield,Alexander H. Miller,Oisin Mac Aodha,Jakob Foerster,Yoram Bachrach
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community contributions on the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous records training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new records improvements. Records execute quickly by design and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLMs ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
zh

[NLP-1] Sequential Diagnosis with Language Models

【速读】：该论文试图解决当前语言模型在医疗诊断评估中缺乏对真实临床复杂性和动态推理过程的反映问题，传统评估方法多依赖静态案例和选择题，无法体现循证医学的实际应用场景。其解决方案的关键在于引入Sequential Diagnosis Benchmark（序列诊断基准），将304个具有诊断挑战性的NEJM-CPC病例转化为逐步诊断互动，并通过MAI Diagnostic Orchestrator（MAI-DxO）模拟多学科医生团队，提出可能的鉴别诊断并战略性地选择高价值、成本效益高的检查，从而实现更精准且经济的诊断。

链接: https://arxiv.org/abs/2506.22405
作者: Harsha Nori,Mayank Daswani,Christopher Kelly,Scott Lundberg,Marco Tulio Ribeiro,Marc Wilson,Xiaoxuan Liu,Viknesh Sounderajah,Jonathan Carlson,Matthew P Lungren,Bay Gross,Peter Hames,Mustafa Suleyman,Dominic King,Eric Horvitz
机构: Microsoft AI(微软人工智能)
类目: Computation and Language (cs.CL)
备注: 23 pages, 10 figures

点击查看摘要

Abstract:Artificial intelligence holds great promise for expanding access to expert medical knowledge and reasoning. However, most evaluations of language models rely on static vignettes and multiple-choice questions that fail to reflect the complexity and nuance of evidence-based medicine in real-world settings. In clinical practice, physicians iteratively formulate and revise diagnostic hypotheses, adapting each subsequent question and test to what they’ve just learned, and weigh the evolving evidence before committing to a final diagnosis. To emulate this iterative process, we introduce the Sequential Diagnosis Benchmark, which transforms 304 diagnostically challenging New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases into stepwise diagnostic encounters. A physician or AI begins with a short case abstract and must iteratively request additional details from a gatekeeper model that reveals findings only when explicitly queried. Performance is assessed not just by diagnostic accuracy but also by the cost of physician visits and tests performed. We also present the MAI Diagnostic Orchestrator (MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians, proposes likely differential diagnoses and strategically selects high-value, cost-effective tests. When paired with OpenAI’s o3 model, MAI-DxO achieves 80% diagnostic accuracy–four times higher than the 20% average of generalist physicians. MAI-DxO also reduces diagnostic costs by 20% compared to physicians, and 70% compared to off-the-shelf o3. When configured for maximum accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families. We highlight how AI systems, when guided to think iteratively and act judiciously, can advance diagnostic precision and cost-effectiveness in clinical care.
zh

[NLP-2] HyperCLOVA X THINK Technical Report

【速读】：该论文旨在解决大型语言模型在推理能力上的局限性，并提升其在韩语和英语环境下的表现。其关键解决方案是开发了HyperCLOVA X THINK，这是一个基于约6万亿高质量韩语和英语token预训练的模型，结合了针对性合成的韩语数据，采用计算与内存平衡的Peri-LN Transformer架构，并通过三阶段课程预训练扩展上下文窗口至128K tokens，随后通过监督微调与可验证奖励强化学习进行优化，从而支持详细推理过程和简洁答案两种模式。

链接: https://arxiv.org/abs/2506.22403
作者: NAVER Cloud HyperCLOVA X Team
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 49 pages, 13 figures

点击查看摘要

Abstract:We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly 6 trillion high-quality Korean, and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with \mu P, pre-trained through a three-stage curriculum that expands the context window to 128 K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards supports both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.
zh

[NLP-3] Refining Czech GEC: Insights from a Multi-Experiment Approach

【速读】：该论文旨在解决捷克语语法错误纠正（Grammar Error Correction, GEC）的问题，提出了一种达到当前最优水平的GEC系统。其解决方案的关键在于实时合成生成流水线，该流水线通过引入语言无关和捷克语特有的错误，动态地对句子进行人工错误增强。这一方法有效提升了模型在不同错误类型和领域上的泛化能力。

链接: https://arxiv.org/abs/2506.22402
作者: Petr Pechman,Milan Straka,Jana Straková,Jakub Náplava
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to TSD 2025

点击查看摘要

Abstract:We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on this https URL.
zh

[NLP-4] QuickSilver – Speeding up LLM Inference through Dynamic Token Halting KV Skipping Contextual Token Fusion and Adaptive Matryoshka Quantization

【速读】：该论文旨在解决大规模语言模型（Large Language Model, LLM）部署中推理阶段带来的高延迟和高能耗问题，这一阶段通常占总成本的90%以上。现有方法如剪枝、量化、提前退出和推测解码等，往往需要重新训练、修改架构或破坏解码兼容性。论文提出的解决方案是QuickSilver，其关键在于通过模块化、基于令牌级别的框架，在不改变模型权重或结构的前提下实现推理时的语义自适应，包含动态令牌停止、KV缓存跳过和上下文令牌融合三种协同机制，从而在保持模型性能的同时显著降低计算量。

链接: https://arxiv.org/abs/2506.22396
作者: Danush Khanna,Aditya Kumar Guru,Srivarshinee Sridhar,Zidan Ahmed,Rubhav Bahirwani,Meetu Malhotra,Vinija Jain,Aman Chadha,Amitava Das,Kripabandhu Ghosh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Under submission

点击查看摘要

Abstract:Inference accounts for the majority of latency and energy consumption in large language model (LLM) deployments, often exceeding 90% of total cost. While training-time efficiency has seen extensive progress, runtime optimization remains a key bottleneck, particularly under autoregressive decoding. Existing approaches – such as pruning, quantization, early exits, and speculative decoding – often require retraining, architectural changes, or disrupt decoding compatibility. We introduce QuickSilver, a modular, token-level framework that enables semantic adaptivity at inference time without altering model weights or structure. QuickSilver integrates four synergistic mechanisms: (i) Dynamic Token Halting, which halts computation for tokens with converged representations; (ii) KV Cache Skipping, which selectively suppresses memory writes to reduce attention overhead; and (iii) Contextual Token Fusion, which collapses redundant tokens into shared paths to shrink sequence length. Unlike speculative decoding or MoE routing, QuickSilver operates entirely on frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP reduction with negligible perplexity degradation (=0.2). Comments: Preprint. Under submission Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) ACMclasses: I.2.0; I.2.7 Cite as: arXiv:2506.22396 [cs.CL] (or arXiv:2506.22396v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.22396 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-5] Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

【速读】：该论文试图解决视频大多模态模型（VLMMs）在面对新信息时缺乏抽象和适应性推理能力的问题，即模型难以根据不断变化的证据更新其推理过程。解决方案的关键在于引入了一个名为Defeasible Video Entailment (DVidE)的新任务，要求模型像怀疑者一样思考，根据动态证据调整其推理。为了解决分类任务，提出了基于反事实思维链的框架，结合反事实推理、ASR增强的视频内容和理性精炼以减少推理偏差；对于生成任务，则结合ASR输出与大语言模型（LLM）生成符合语境且目标明确的更新内容。此外，还构建了一个包含增强/削弱标注和针对生成性能设计的LLM评估指标的新基准数据集。

链接: https://arxiv.org/abs/2506.22385
作者: Yue Zhang,Jilei Sun,Yunhui Guo,Vibhav Gogate
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning-the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting our proposed method in enhancing dynamic reasoning capabilities of VLMMs.
zh

[NLP-6] Probabilistic Optimality for Inference-time Scaling

【速读】：该论文试图解决推理阶段缩放（inference-time scaling）中现有方法依赖启发式策略进行并行采样，缺乏理论基础的问题。其解决方案的关键在于提出一个概率框架，该框架在假设并行样本独立同分布（i.i.d.）的前提下，形式化了推理阶段缩放的最优性，并推导出达到目标性能水平所需的样本数下界，从而为计算效率优化提供理论指导。基于此，作者开发了\textscOptScale算法，通过语言模型预测器估计概率先验参数，动态确定满足预设性能阈值和置信度的最小样本数量，实现了采样开销的显著降低。

链接: https://arxiv.org/abs/2506.22376
作者: Youkang Wang,Jian Wang,Rubing Chen,Xiao-Yong Wei,Qing Li
机构: The Hong Kong Polytechnic University (香港理工大学); Sichuan University (四川大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop \textscOptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. \textscOptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling the decision of the minimal number of samples needed that satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that \textscOptScale significantly reduces sampling overhead while remaining better or on par with state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.
zh

[NLP-7] owards Fair Rankings: Leverag ing LLM s for Gender Bias Detection and Measurement ICTIR2025 SIGIR

【速读】：该论文旨在解决自然语言处理（Natural Language Processing, NLP）和信息检索（Information Retrieval, IR）系统中存在社会偏见的问题，特别是针对段落排序中的性别偏见进行检测与评估。其解决方案的关键在于利用大型语言模型（Large Language Models, LLMs）开发一种新的性别公平性度量指标——类加权曝光（Class-wise Weighted Exposure, CWEx），以克服现有基于词汇和频率的度量方法在捕捉细微性别差异方面的局限性。通过构建并发布MSMGenderBias数据集，该研究进一步验证了所提出度量指标的有效性，并展示了LLMs在检测性别偏见方面的潜力。

链接: https://arxiv.org/abs/2506.22372
作者: Maryam Mousavian,Zahra Abbasiantaeb,Mohammad Aliannejadi,Fabio Crestani
机构: Università della Svizzera italiana (瑞士意大利语大学); University of Amsterdam (阿姆斯特丹大学); The Netherland (荷兰); Switzerland (瑞士)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)

点击查看摘要

Abstract:The presence of social biases in Natural Language Processing (NLP) and Information Retrieval (IR) systems is an ongoing challenge, which underlines the importance of developing robust approaches to identifying and evaluating such biases. In this paper, we aim to address this issue by leveraging Large Language Models (LLMs) to detect and measure gender bias in passage ranking. Existing gender fairness metrics rely on lexical- and frequency-based measures, leading to various limitations, e.g., missing subtle gender disparities. Building on our LLM-based gender bias detection method, we introduce a novel gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to address existing limitations. To measure the effectiveness of our proposed metric and study LLMs’ effectiveness in detecting gender bias, we annotate a subset of the MS MARCO Passage Ranking collection and release our new gender bias collection, called MSMGenderBias, to foster future research in this area. Our extensive experimental results on various ranking models show that our proposed metric offers a more detailed evaluation of fairness compared to previous metrics, with improved alignment to human labels (58.77% for Grep-BiasIR, and 18.51% for MSMGenderBias, measured using Cohen’s Kappa agreement), effectively distinguishing gender bias in ranking. By integrating LLM-driven bias detection, an improved fairness metric, and gender bias annotations for an established dataset, this work provides a more robust framework for analyzing and mitigating bias in IR systems.
zh

[NLP-8] Why Are Parsing Actions for Understanding Message Hierarchies Not Random?

【速读】：该论文试图解决人类语言解析策略为何不遵循随机模式的问题，以及在使用具有层次化偏置的模型进行进化通信时，随机解析策略是否仍能保持高通信准确率的问题。解决方案的关键在于对实验设置进行两个简单而自然的修改：一是使用更具层次结构的复杂输入，使得随机解析在语义解释上更加困难；二是将与意外性（surprisal）相关的项引入目标函数，以模拟自然语言中词序和字符顺序的影响。

链接: https://arxiv.org/abs/2506.22366
作者: Daichi Kato,Ryo Ueda,Yusuke Miyao
机构: Preferred Networks, Inc.(Preferred Networks, Inc.); The University of Tokyo(东京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:If humans understood language by randomly selecting parsing actions, it might have been necessary to construct a robust symbolic system capable of being interpreted under any hierarchical structure. However, human parsing strategies do not seem to follow such a random pattern. Why is that the case? In fact, a previous study on emergent communication using models with hierarchical biases have reported that agents adopting random parsing strategies \unicodex2013 ones that deviate significantly from human language comprehension \unicodex2013 can achieve high communication accuracy. In this study, we investigate this issue by making two simple and natural modifications to the experimental setup: (I) we use more complex inputs that have hierarchical structures, such that random parsing makes semantic interpretation more difficult, and (II) we incorporate a surprisal-related term, which is known to influence the order of words and characters in natural language, into the objective function. With these changes, we evaluate whether agents employing random parsing strategies still maintain high communication accuracy.
zh

[NLP-9] Evaluating Scoring Bias in LLM -as-a-Judge

【速读】：该论文试图解决生成式 AI (Generative AI) 在作为评分者（LLM-as-a-Judge）时存在的评分偏差问题，这种偏差会影响判断的公平性和可靠性。其解决方案的关键在于定义评分偏差为评分模型在受到与偏差相关的扰动时评分结果发生变化，并提出一个设计精良的框架以全面评估评分偏差。该框架通过数据合成增强现有基准测试集，并设计多维度的评估指标，从而揭示现有评分模型的评分稳定性受到评分偏差的影响，并为评分提示模板的设计和评分偏差的缓解提供深入见解。

链接: https://arxiv.org/abs/2506.22316
作者: Qingquan Li,Shaoyu Dou,Kailai Shao,Chao Chen,Haixiang Hu
机构: Ant Group(蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The remarkable performance of Large Language Models (LLMs) gives rise to``LLM-as-a-Judge’', where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the scores differ when scoring judge models are bias-related perturbed, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.
zh

[NLP-10] Conceptual Topic Aggregation

【速读】：该论文试图解决传统手动检查在大规模数据处理中不可行的问题，以及现有主题建模方法在提供可解释性表示方面存在的不足。其解决方案的关键在于提出FAT-CAT方法，该方法基于形式概念分析（Formal Concept Analysis, FCA），通过构建概念格来实现有意义的主题聚合与可视化，从而提供结构化、分层的主题分布表示，增强对数据结构和内容的深入理解。

链接: https://arxiv.org/abs/2506.22309
作者: Klara M. Gutekunst,Dominik Dürrschnabel,Johannes Hirth,Gerd Stumme
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
备注: 16 pages, 4 tables, 11 figures, International Joint Conference on Conceptual Knowledge Structures

点击查看摘要

Abstract:The vast growth of data has rendered traditional manual inspection infeasible, necessitating the adoption of computational methods for efficient data exploration. Topic modeling has emerged as a powerful tool for analyzing large-scale textual datasets, enabling the extraction of latent semantic structures. However, existing methods for topic modeling often struggle to provide interpretable representations that facilitate deeper insights into data structure and content. In this paper, we propose FAT-CAT, an approach based on Formal Concept Analysis (FCA) to enhance meaningful topic aggregation and visualization of discovered topics. Our approach can handle diverse topics and file types – grouped by directories – to construct a concept lattice that offers a structured, hierarchical representation of their topic distribution. In a case study on the ETYNTKE dataset, we evaluate the effectiveness of our approach against other representation methods to demonstrate that FCA-based aggregation provides more meaningful and interpretable insights into dataset composition than existing topic modeling techniques.
zh

[NLP-11] Detection of Personal Data in Structured Datasets Using a Large Language Model

【速读】：该论文旨在解决结构化数据集中个人数据检测的问题，其解决方案的关键在于引入上下文信息：除了特征名称和值外，还利用数据集中其他特征名称以及数据集描述的信息，从而提升检测性能。该方法基于GPT-4o这一先进的大语言模型实现。

链接: https://arxiv.org/abs/2506.22305
作者: Albert Agisha Ntwali,Luca Rück,Martin Heckmann
机构: Aalen University of Applied Sciences(艾伦应用科学大学)
类目: Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature’s name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is comparable across all models, with our GPT-4o-based approach clearly outperforming the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information. This is evidenced by the poor performance of CASSED and Presidio (both of which do not utilize the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would greatly benefit from the availability of more real-world datasets containing personal information. Comments: 10 pages Subjects: Computation and Language (cs.CL) ACMclasses: I.5.4; I.2.7; H.3.1 Cite as: arXiv:2506.22305 [cs.CL] (or arXiv:2506.22305v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.22305 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: LLM-DPM '2025, Next Gen Data and Process Management: Large Language Models and Beyond, June 22, 2025, Berlin, Germany
zh

[NLP-12] COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication

【速读】：该论文试图解决视觉-语言模型（Vision-Language Models, VLMs）在生成对象参考时是否能够像人类一样依赖场景上下文的问题。其解决方案的关键在于引入了“Common Objects Out-of-Context (COOCO)”数据集，并通过在不同场景-对象一致性程度和扰动条件下测试VLMs对场景上下文的依赖程度，揭示了模型如何根据对象与场景的语义相关性以及噪声水平自适应地利用场景信息。

链接: https://arxiv.org/abs/2506.22274
作者: Filippo Merlo,Ece Takmaz,Wenkai Chen,Albert Gatt
机构: Utrecht University (乌得勒支大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the \textitCommon Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at \hrefthis https URLthis https URL.
zh

[NLP-13] Projected Compression: Trainable Projection for Efficient Transformer Compression

【速读】：该论文试图解决大型语言模型因规模增大而导致的推理时间和计算需求增加的问题。解决方案的关键在于提出一种名为“Projected Compression”的新型模型压缩技术，该技术通过利用投影模块减少模型权重，具体而言，先训练额外的可学习投影权重并保留对原始模型参数的访问，在此基础上将这些投影合并为一个低维的乘积矩阵，从而得到一个尺寸更小的标准Transformer模型，且其每标记计算步骤的FLOPs与基础模型相当。

链接: https://arxiv.org/abs/2506.22255
作者: Maciej Stefaniak,Michał Krutul,Jan Małaśnicki,Maciej Pióro,Jakub Krajewski,Sebastian Jaszczur,Marek Cygan,Kamil Adamczewski,Jan Ludziejewski
机构: University of Warsaw (华沙大学); IDEAS NCBR (IDEAS NCBR); Polish Academy of Sciences (波兰科学院); Nomagic (Nomagic); Wroclaw University of Science and Technology (弗罗茨瓦夫科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique, that reduces model weights by utilizing projection modules. Specifically, we first train additional trainable projections weights and preserve access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model’s per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models. Moreover, the performance margin scales well with the number of tokens.
zh

[NLP-14] Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations

【速读】：该论文试图解决将人类钢琴演奏的音频记录与对应的大致对齐的MIDI文件进行同步的问题（Audio-MIDI Synchronization）。解决方案的关键在于采用一种卷积循环神经网络（Convolutional Recurrent Neural Network, CRNN）架构，通过处理未对齐的钢琴音符图和频谱图作为输入，估计出对齐后的钢琴音符图，从而有效捕捉频谱和时间特征。

链接: https://arxiv.org/abs/2506.22237
作者: Sebastian Murgul,Moritz Reiser,Michael Heizmann,Christoph Seibert
机构: Klangio GmbH (Klangio GmbH); University of Music Karlsruhe (University of Music Karlsruhe); Karlsruhe Institute of Technology (Karlsruhe Institute of Technology)
类目: ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 9 pages, 3 figures, 6 tables

点击查看摘要

Abstract:In this paper, we present a neural network approach for synchronizing audio recordings of human piano performances with their corresponding loosely aligned MIDI files. The task is addressed using a Convolutional Recurrent Neural Network (CRNN) architecture, which effectively captures spectral and temporal features by processing an unaligned piano roll and a spectrogram as inputs to estimate the aligned piano roll. To train the network, we create a dataset of piano pieces with augmented MIDI files that simulate common human timing errors. The proposed model achieves up to 20% higher alignment accuracy than the industry-standard Dynamic Time Warping (DTW) method across various tolerance windows. Furthermore, integrating DTW with the CRNN yields additional improvements, offering enhanced robustness and consistency. These findings demonstrate the potential of neural networks in advancing state-of-the-art MIDI-to-audio alignment.
zh

[NLP-15] Leverag ing In-Context Learning for Political Bias Testing of LLM s ACL2025

【速读】：该论文试图解决现有方法在通过政治问题对大语言模型（Large Language Models, LLMs）进行偏见评估时稳定性不足的问题，这导致不同模型之间的比较不可靠。论文提出的解决方案关键在于引入一种新的探测任务——问卷建模（Questionnaire Modeling, QM），该任务利用人类调查数据作为上下文示例，从而提高基于问题的偏见评估的稳定性，并能够用于比较经过指令微调的模型与其基础版本之间的偏见差异。

链接: https://arxiv.org/abs/2506.22232
作者: Patrick Haller,Jannis Vamvas,Rico Sennrich,Lena A. Jäger
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.
zh

[NLP-16] Exploring Modularity of Agent ic Systems for Drug Discovery

【速读】：该论文试图解决基于大语言模型（Large-language models, LLMs）的代理系统在药物发现中的模块化问题，即是否可以互换代理系统中的组件（如LLM），这一问题在药物发现应用中尚未得到充分研究。解决方案的关键在于评估不同LLM及工具调用代理与代码生成代理在药物发现任务中的性能差异，并分析系统提示替换对结果的影响，从而揭示模型替换需结合提示工程进行优化，强调了在实际应用中提升代理系统模块化能力的重要性。

链接: https://arxiv.org/abs/2506.22189
作者: Laura van Weesep,Samuel Genheden,Ola Engkvist,Jens Sjölund
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large-language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery and design. In this study, we critically examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the agentic system such as the LLM are interchangeable, a topic that has received limited attention in drug discovery applications. We compare the performance of different large language models (LLMs) and the effectiveness of tool-calling agents versus code-generating agents in this domain. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform the tool-calling ones on average, we show that this is highly question and model dependent. Furthermore, the impact of replacing system prompts is dependent on the specific question asked and the model used, underscoring that – even in this particular domain – one cannot just replace language models without considering prompt re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of stable and scalable solutions for real-world problems.
zh

[NLP-17] raining Language Model to Critique for Better Refinement ACL2025

【速读】：该论文试图解决如何生成有效的批判性反馈以提升大语言模型（Large Language Models, LLMs）响应质量的问题，以及何种类型的批判最有助于模型优化。其解决方案的关键在于提出一种名为Refinement-oriented Critique Optimization (RCO)的框架，该框架通过一个反馈循环，利用由批判模型生成的批判来引导执行模型优化其输出，并通过批判效用（Critique Utility, CU）量化优化效果，作为训练批判模型的奖励信号，从而无需直接评估批判偏好即可实现有效批判的生成。

链接: https://arxiv.org/abs/2506.22157
作者: Tianshu Yu,Chao Xiang,Mingchuan Yang,Pei Ke,Bosi Wen,Cunxiang Wang,Jiale Cheng,Li Zhang,Xinyu Mu,Chuxiong Sun,Minlie Huang
机构: China Telecom Research Institute; The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University; University of Electronic Science and Technology of China; The Knowledge Engineering Group (KEG), Tsinghua University; Zhipu AI
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce \textbfRefinement-oriented \textbfCritique \textbfOptimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e., dialog generation, summarization, question answering, mathematical reasoning, and code generation, and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops.
zh

[NLP-18] SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition

【速读】：该论文旨在解决方言阿拉伯语（Dialectal Arabic, DA）和阿拉伯语-英语代码转换（Code-Switched, CS）语音识别中数据稀缺的问题。其关键解决方案是引入一种改进的音频拼接方法，生成人工的代码转换语音数据（Spliced-Audio Generated, SAGE data），并通过微调已优化的自监督学习（Self-Supervised Learning, SSL）模型，显著提升了词错误率（Word Error Rate, WER）的表现。此外，还提出了一种受经验回放（Experience Replay, ER）启发的方法，以增强模型在DA和CS语音上的泛化能力并减少灾难性遗忘，同时结合外部域的三元语言模型进一步降低WER。

链接: https://arxiv.org/abs/2506.22143
作者: Muhammad Umar Farooq,Oscar Saz
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted for IEEE MLSP 2025

点击查看摘要

Abstract:This paper investigates the performance of various speech SSL models on dialectal Arabic (DA) and Arabic-English code-switched (CS) speech. To address data scarcity, a modified audio-splicing approach is introduced to generate artificial CS speech data. Fine-tuning an already fine-tuned SSL model with the proposed Spliced-Audio Generated (SAGE) data results in an absolute improvement on Word Error Rate (WER) of 7.8% on Arabic and English CS benchmarks. Additionally, an Experience Replay (ER) inspired approach is proposed to enhance generalisation across DA and CS speech while mitigating catastrophic forgetting. Integrating an out-of-domain 3-gram language model reduces the overall mean WER from 31.7% to 26.6%. Few-shot fine-tuning for code-switching benchmarks further improves WER by 4.9%. A WER of 31.1% on Arabic-English CS benchmarks surpasses large-scale multilingual models, including USM and Whisper-large-v2 (both over ten times larger) by an absolute margin of 5.5% and 8.4%, respectively.
zh

[NLP-19] DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level

【速读】：该论文旨在解决公开可用专利检索数据集中存在的多个问题，包括缺乏明确的领域内与领域外标注、多司法管辖区覆盖不足、查询领域表示不平衡以及数据规模过大导致无法在中等计算资源下进行子文档级别的实验。其解决方案的关键在于提出DAPFAM，一个在简单家族级别构建的新型开放访问领域感知专利检索数据集，该数据集通过基于国际专利分类（IPC）代码的创新标注方案，实现了清晰的相关性判断和领域关系标注，从而生成了49,869个评估对，并具备多司法管辖区覆盖、低预处理需求及可管理的数据规模，支持在有限资源下进行子文档级别的检索实验。

链接: https://arxiv.org/abs/2506.22141
作者: Iliass Ayaou(ICube),Denis Cavallucci(ICube),Hicham Chibane(ICube)
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In the landscape of publicly available patent retrieval datasets, the need for explicit indomain and out-of-domain labeling, multi-jurisdiction coverage, balanced query domain representation and manageable sizes that support sub document level experiments on moderate computational resources is often overlooked. To address these gaps, we propose DAPFAM, a new open access domain-aware patent retrieval dataset constructed at the simple-family level. The dataset contains 1,247 domain balanced full text query families and 45,336 full text target families. The dataset is enriched by clear relevance judgments (forward/backward citations as positive links, random negatives), as well as explicit in-domain or out-of-domain relationships via a novel proposed labelling scheme based on via International Patent Classification (IPC) codes, resulting in 49,869 evaluation pairs. The dataset is multi jurisdictional, requires little to no preprocessing for retrieval evaluation, and remains of a size manageable for entities with limited ressources allowing for sub document level retrieval experiments without excessive computational costs. We describe our three-step data-curation pipeline, present comprehensive dataset statistics, and provide baseline experiments using lexical and neural retrieval methods. Our baseline experiments highlight significant challenges in crossdomain patent retrieval. The dataset will be publicly available (for now the access link is this repository: this https URL).
zh

[NLP-20] Identifying a Circuit for Verb Conjugation in GPT -2

【速读】：该论文试图解决生成式 AI (Generative AI) 在句子中实现主谓一致（subject-verb agreement）的机制问题，具体是通过分析 GPT-2 Small 模型在给定主语为单数或复数时正确预测动词形式的能力。解决方案的关键在于通过一系列技术手段，包括性能验证、自动电路发现（via direct path patching）以及直接逻辑归因（direct logit attribution），识别出对模型正确进行动词变位起关键作用的子网络（或“电路”）。研究结果表明，仅需网络中少量的组件-标记对即可实现接近模型性能的基础任务，但在更复杂的场景下需要更多的组件参与。

链接: https://arxiv.org/abs/2506.22105
作者: David Demitri Africa
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:I implement a procedure to isolate and interpret the sub-network (or “circuit”) responsible for subject-verb agreement in GPT-2 Small. In this study, the model is given prompts where the subject is either singular (e.g. “Alice”) or plural (e.g. “Alice and Bob”), and the task is to correctly predict the appropriate verb form (“walks” for singular subjects, “walk” for plural subjects). Using a series of techniques-including performance verification automatic circuit discovery via direct path patching, and direct logit attribution- I isolate a candidate circuit that contributes significantly to the model’s correct verb conjugation. The results suggest that only a small fraction of the network’s component-token pairs is needed to achieve near-model performance on the base task but substantially more for more complex settings.
zh

[NLP-21] Involvement drives complexity of language in online debates

【速读】：该论文试图解决的问题是：在数字平台中，用户生成内容的语言复杂性如何反映其社会、政治和意识形态特征。解决方案的关键在于通过结合多种文本复杂性指标，分析推特上具有影响力的用户在三个全球关注议题（新冠疫情、COP26和俄乌战争）中的语言使用模式，并评估语言复杂性在账号类型、政治倾向、内容可靠性及情感倾向四个维度上的差异。

链接: https://arxiv.org/abs/2506.22098
作者: Eleonora Amadori,Daniele Cirulli,Edoardo Di Martino,Jacopo Nudo,Maria Sahakyan,Emanuele Sangiorgio,Arnaldo Santoro,Simon Zollo,Alessandro Galeazzi,Niccolò Di Marco
机构: University of Padova(帕多瓦大学); University of Rome “Tor Vergata”(罗马“特尔维加塔”大学); Sapienza University of Rome(罗马第一大学); New York University Abu Dhabi(纽约大学阿布扎比分校); Ca’ Foscari University of Venice(威尼斯卡福斯卡里大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Language is a fundamental aspect of human societies, continuously evolving in response to various stimuli, including societal changes and intercultural interactions. Technological advancements have profoundly transformed communication, with social media emerging as a pivotal force that merges entertainment-driven content with complex social dynamics. As these platforms reshape public discourse, analyzing the linguistic features of user-generated content is essential to understanding their broader societal impact. In this paper, we examine the linguistic complexity of content produced by influential users on Twitter across three globally significant and contested topics: COVID-19, COP26, and the Russia-Ukraine war. By combining multiple measures of textual complexity, we assess how language use varies along four key dimensions: account type, political leaning, content reliability, and sentiment. Our analysis reveals significant differences across all four axes, including variations in language complexity between individuals and organizations, between profiles with sided versus moderate political views, and between those associated with higher versus lower reliability scores. Additionally, profiles producing more negative and offensive content tend to use more complex language, with users sharing similar political stances and reliability levels converging toward a common jargon. Our findings offer new insights into the sociolinguistic dynamics of digital platforms and contribute to a deeper understanding of how language reflects ideological and social structures in online spaces.
zh

[NLP-22] MDC-R: The Minecraft Dialogue Corpus with Reference

【速读】：该论文试图解决对话中指代（anaphoric and deictic reference）识别与理解的问题，通过构建带有专家标注的Minecraft Dialogue Corpus with Reference (MDC-R) 来补充原始的Minecraft Dialogue Corpus (MDC)。解决方案的关键在于对任务导向、多轮、情境化对话中的指代现象进行系统性标注，并在此基础上进行定量与定性分析，以验证该语料库在指称表达理解任务中的有效性。

链接: https://arxiv.org/abs/2506.22062
作者: Chris Madge,Maris Camilleri,Paloma Carretero Garcia,Mladen Karan,Juexi Shao,Prashant Jayannavar,Julian Hough,Benjamin Roth,Massimo Poesio
机构: Queen Mary University of London(伦敦玛丽女王大学); Universität Wien(维也纳大学); University of Illinois(伊利诺伊大学); Swansea University(斯旺西大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce the Minecraft Dialogue Corpus with Reference (MDC-R). MDC-R is a new language resource that supplements the original Minecraft Dialogue Corpus (MDC) with expert annotations of anaphoric and deictic reference. MDC’s task-orientated, multi-turn, situated dialogue in a dynamic environment has motivated multiple annotation efforts, owing to the interesting linguistic phenomena that this setting gives rise to. We believe it can serve as a valuable resource when annotated with reference, too. Here, we discuss our method of annotation and the resulting corpus, and provide both a quantitative and a qualitative analysis of the data. Furthermore, we carry out a short experiment demonstrating the usefulness of our corpus for referring expression comprehension.
zh

[NLP-23] Lost at the Beginning of Reasoning

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在长时间链式推理（Chain-of-Thought, CoT）过程中自我修正能力不足的问题，特别是由于初始推理步骤中的错误会显著影响后续推理质量的现象。解决方案的关键在于提出一种高效的采样策略，该策略利用奖励模型识别并保留高质量的初始推理步骤，同时丢弃低质量的步骤，从而在不牺牲准确性的前提下实现高达70%的推理成本降低。

链接: https://arxiv.org/abs/2506.22058
作者: Baohao Liao,Xinyi Chen,Sara Rajaee,Yuhui Xu,Christian Herold,Anders Søgaard,Maarten de Rijke,Christof Monz
机构: University of Amsterdam (阿姆斯特丹大学); eBay Inc. (eBay公司); Salesforce AI Research (Salesforce人工智能研究); Copenhagen University (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. And recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction - errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across two state-of-the-art open-source reasoning model families: DeepSeek-R1 and Qwen3. To address this, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing accuracy. Finally, we introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities, offering a foundation for future research on robust reasoning in LLMs.
zh

[NLP-24] Decoding Machine Translationese in English-Chinese News: LLM s vs. NMTs ACL

【速读】：该论文试图解决机器翻译输出（Machine Translationese, MTese）的语料语言特征问题，特别是针对英语到中文新闻文本这一研究较少的语言对。其解决方案的关键在于构建包含4个子语料库的大规模数据集，并采用五层综合特征集进行分析，同时应用卡方排名算法进行特征选择，以实现对机器翻译系统（包括神经机器翻译系统和大型语言模型）与原始中文文本的有效区分。

链接: https://arxiv.org/abs/2506.22050
作者: Delu Kong,Lieve Macken
机构: Tongji University (同济大学); Ghent University (根特大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 5 figures, 6 tables. Accpeted in MT Summit 2025, Research: Technical track. Official version may be accessed later in the ACL Anthology

点击查看摘要

Abstract:This study explores Machine Translationese (MTese) – the linguistic peculiarities of machine translation outputs – focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using more brackets. Additionally, translation-specific LLMs show lower lexical diversity but higher usage of causal conjunctions compared to generic LLMs. Lastly, we find no significant differences between LLMs developed by Chinese firms and their foreign counterparts.
zh

[NLP-25] GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

【速读】：该论文旨在解决Pre-LayerNorm (Pre-LN) Transformer架构在深度增加时出现的激活值方差指数增长问题，这一现象导致残差路径主导子层输出，限制了深层网络的学习能力。解决方案的关键在于提出Gradient-Preserving Activation Scaling (GPAS)，该技术通过缩放中间激活值而保持梯度不变，从而在不引入梯度消失问题的情况下保留激活信息，有效改善模型训练动态。

链接: https://arxiv.org/abs/2506.22049
作者: Tianhao Chen,Xin Xu,Zijing Liu,Pengxiang Li,Xinyuan Song,Ajay Kumar Jaiswal,Fan Zhang,Jishan Hu,Yang Wang,Hao Chen,Shizhe Diao,Shiwei Liu,Yu Li,Yin Lu,Can Yang
机构: The Hong Kong University of Science and Technology (香港科技大学); International Digital Economy Academy; Dalian University of Technology (大连理工大学); Emory University (埃默里大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); NVIDIA (英伟达); University of Oxford (牛津大学); University of Surrey (萨里大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the residual path to dominate over sub-layer outputs and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings.
zh

[NLP-26] Can Peter Pan Survive MT? A Stylometric Study of LLM s NMTs and HTs in Childrens Literature Translation ACL

【速读】：该论文试图解决机器翻译（MT）与人工翻译（HT）在英语到中文儿童文学翻译（CLT）中的风格差异问题，从文体学角度评估其性能。研究构建了一个包含21种翻译的彼得·潘语料库，包括7种人类翻译、7种大型语言模型翻译（LLMs）和7种神经机器翻译输出（NMTs），并通过通用特征集和创意文本翻译（CTT-specific）特征集进行分析，提取了447个语言特征。解决方案的关键在于利用机器学习中的分类和聚类技术，对不同翻译风格进行文体学分析，从而揭示LLMs在CTT中相较于NMTs更接近HT的潜在优势。

链接: https://arxiv.org/abs/2506.22038
作者: Delu Kong,Lieve Macken
机构: Tongji University (同济大学); Ghent University (根特大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 8 figures, 4 tables. Accepted in 2nd Workshop on Creative-text Translation and Technology Co-located with MT Summit 2025. Official paper may later be accessed from ACL Anthology

点击查看摘要

Abstract:This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in English-to-Chinese children’s literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhythm, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs show significant variation in descriptive words usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics, demonstrating the potential of LLMs in CLT. Comments: 19 pages, 8 figures, 4 tables. Accepted in 2nd Workshop on Creative-text Translation and Technology Co-located with MT Summit 2025. Official paper may later be accessed from ACL Anthology Subjects: Computation and Language (cs.CL) Cite as: arXiv:2506.22038 [cs.CL] (or arXiv:2506.22038v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2506.22038 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-27] Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

【速读】：该论文旨在解决传统自回归（Autoregressive, AR）语音合成模型在处理长语音序列时面临的稳定性差、延迟高和合成质量下降的问题。其关键解决方案是提出一种名为DCAR的动态分块自回归合成框架，通过引入基于多标记预测训练的分块到帧注意力机制，实现可变语音上下文中的动态分块预测，并利用轻量级模块进行在线策略训练，从而显著降低序列长度依赖性，提升合成效率与可懂度鲁棒性。

链接: https://arxiv.org/abs/2506.22023
作者: Bohan Li,Zhihan Li,Haoran Wang,Hanglei Zhang,Yiwei Guo,Hankun Wang,Xie Chen,Kai Yu
机构: X-LANCE Lab, School of Computer Science; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (教育部人工智能重点实验室，人工智能研究院，上海交通大学); MoE Key Lab of Artificial Intelligence (教育部人工智能重点实验室); Jiangsu Key Lab of Language Computing (江苏省语言计算重点实验室)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 17 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while obtaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct comprehensive analysis to support it as a versatile foundation for next-generation speech synthesis systems.
zh

[NLP-28] Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit CVPR

【速读】：该论文试图解决在特定领域（如飞机驾驶舱内的飞行员对话）中，基于Whisper的自动语音识别（ASR）模型的转录准确率较低的问题。其关键解决方案是通过提出多种归一化方案来优化转录结果，并结合低秩适应（LoRA）的高效微调方法提升模型性能，从而显著降低了词错误率（WER）。

链接: https://arxiv.org/abs/2506.21990
作者: Kartheek Kumar Reddy Nareddy,Sarah Ternus,Julia Niebling
机构: German Aerospace Center (德国航空航天中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Computer Vision and Pattern Recognition (CVPR) 2025 Workshops

点击查看摘要

Abstract:The developments in transformer encoder-decoder architectures have led to significant breakthroughs in machine translation, Automatic Speech Recognition (ASR), and instruction-based chat machines, among other applications. The pre-trained models were trained on vast amounts of generic data over a few epochs (fewer than five in most cases), resulting in their strong generalization capabilities. Nevertheless, the performance of these models does suffer when applied to niche domains like transcribing pilot speech in the cockpit, which involves a lot of specific vocabulary and multilingual conversations. This paper investigates and improves the transcription accuracy of cockpit conversations with Whisper models. We have collected around 85 minutes of cockpit simulator recordings and 130 minutes of interview recordings with pilots and manually labeled them. The speakers are middle aged men speaking both German and English. To improve the accuracy of transcriptions, we propose multiple normalization schemes to refine the transcripts and improve Word Error Rate (WER). We then employ fine-tuning to enhance ASR performance, utilizing performance-efficient fine-tuning with Low-Rank Adaptation (LoRA). Hereby, WER decreased from 68.49 % (pretrained whisper Large model without normalization baseline) to 26.26% (finetuned whisper Large model with the proposed normalization scheme).
zh

[NLP-29] Dont Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

【速读】：该论文试图解决的问题是：在社会网络通信分析中，是否可以使用生成式 AI (Generative AI) 代理替代人类进行实证研究，以及这一假设在何种条件下成立。解决方案的关键在于构建一个形式化的社会网络模拟框架，并通过实证测试不同方法来模仿用户行为，强调社会模拟应在其拟合组件的设置中通过经验现实性进行验证，从而提升基于生成代理的社会模拟的严谨性。

链接: https://arxiv.org/abs/2506.21974
作者: Simon Münker,Nils Schwager,Achim Rettinger
机构: Trier University (特里尔大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 1 figure, 3 tables

点击查看摘要

Abstract:The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.
zh

[NLP-30] Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

【速读】：该论文试图解决预训练语言模型（PTLMs）和大语言模型（LLMs）在推理阶段面临的两种主要威胁——令牌级和提示级越狱攻击，这些攻击能够利用模型的固有弱点绕过安全机制。解决方案的关键在于提出两种混合方法，即GCG + PAIR和GCG + WordGame，通过整合令牌级与提示级技术，以提升在多种PTLM上的越狱效果。这些方法不仅提高了攻击成功率，还在面对严格评估器和先进防御机制时保持了良好的迁移能力和可靠性。

链接: https://arxiv.org/abs/2506.21972
作者: Mohamed Ahmed,Mohamed Abdelmouty,Mingyu Kim,Gunvanth Kandula,Alex Park,James C. Davis
机构: Purdue University (普渡大学); Purdue University (普渡大学); Purdue University (普渡大学); Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR’s 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.
zh

[NLP-31] More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

【速读】：该论文试图解决当前工具集成的大语言模型代理（LLM agents）在端到端工具使用评估中忽视稳定性的问题，这限制了其在现实场景中的应用。论文的关键解决方案是全面评估代理在工具调用全过程中的脆弱性，包括阅读工具文档、选择工具和生成参数以及处理工具响应等阶段，并通过大量实验验证代理在各阶段均易发生错误，尤其是基于开源模型的代理更为脆弱，同时指出模型规模增大未必能有效提升工具调用推理能力，甚至可能增加对类似正常用户指令的攻击敏感性。

链接: https://arxiv.org/abs/2506.21967
作者: Weimin Xiong,Ke Wang,Yifan Song,Hanchao Liu,Sai Zhou,Wei Peng,Sujian Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Current evaluations of tool-integrated LLM agents typically focus on end-to-end tool-usage evaluation while neglecting their stability. This limits their real-world applicability, as various internal or external factors can cause agents to crash or behave abnormally. Our research addresses this by investigating whether agents are vulnerable to errors throughout the entire tool invocation process, including reading tool documentation, selecting tools and generating parameters, and processing the tool’s response. Through extensive experiments, we observe that agents are highly susceptible to errors at each stage and agents based on open-source models are more vulnerable than those based on proprietary models. We also find that increasing the model size does not significantly improve tool invocation reasoning and may make agents more vulnerable to attacks resembling normal user instructions. This highlights the importance of evaluating agent stability and offers valuable insights for future LLM development and evaluation.
zh

[NLP-32] PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory ACL2025

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在角色扮演场景中表现出的偏见问题，特别是其在道德困境中的决策模式。解决方案的关键在于构建一个名为PapersPlease的基准测试集，包含3,700个基于存在、关联与成长（Existence, Relatedness, and Growth, ERG）理论设计的道德困境，用于评估LLMs在移民检查员角色中对不同人类需求层级的优先级判断。通过分析六种LLMs的决策模式，研究揭示了模型在决策中隐含的偏好，并探讨了社会身份信息对模型响应的影响。

链接: https://arxiv.org/abs/2506.21961
作者: Junho Myung,Yeon Su Park,Sunwoo Kim,Shin Yoo,Alice Oh
机构: KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: Accepted to GEM2 Workshop: Generation, Evaluation Metrics - ACL 2025

点击查看摘要

Abstract:Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs’ decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at this https URL.
zh

[NLP-33] ARAG : Agent ic Retrieval Augmented Generation for Personalized Recommendation

【速读】：该论文旨在解决现有基于检索增强生成（Retrieval-Augmented Generation, RAG）的推荐系统在动态推荐场景中难以捕捉用户细微偏好的问题，其核心在于传统方法依赖静态检索启发式策略。论文提出的解决方案关键在于引入ARAG框架，该框架通过多智能体协作机制增强RAG流程，包含用户理解代理、自然语言推理代理、上下文摘要代理和物品排序代理，以更全面地建模用户的长期与会话行为，并提升推荐的准确性与相关性。

链接: https://arxiv.org/abs/2506.21931
作者: Reza Yousefi Maragheh,Pratheek Vadla,Priyank Gupta,Kai Zhao,Aysenur Inan,Kehui Yao,Jianpeng Xu,Praveen Kanumala,Jason Cho,Sushant Kumar
机构: Walmart Global Tech (沃尔玛全球科技); Walmart Global Tech (沃尔玛全球科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has shown promise in enhancing recommendation systems by incorporating external context into large language model prompts. However, existing RAG-based approaches often rely on static retrieval heuristics and fail to capture nuanced user preferences in dynamic recommendation scenarios. In this work, we introduce ARAG, an Agentic Retrieval-Augmented Generation framework for Personalized Recommendation, which integrates a multi-agent collaboration mechanism into the RAG pipeline. To better understand the long-term and session behavior of the user, ARAG leverages four specialized LLM-based agents: a User Understanding Agent that summarizes user preferences from long-term and session contexts, a Natural Language Inference (NLI) Agent that evaluates semantic alignment between candidate items retrieved by RAG and inferred intent, a context summary agent that summarizes the findings of NLI agent, and an Item Ranker Agent that generates a ranked list of recommendations based on contextual fit. We evaluate ARAG accross three datasets. Experimental results demonstrate that ARAG significantly outperforms standard RAG and recency-based baselines, achieving up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also, conduct an ablation study to analyse the effect by different components of ARAG. Our findings highlight the effectiveness of integrating agentic reasoning into retrieval-augmented recommendation and provide new directions for LLM-based personalization.
zh

[NLP-34] HyReC: Exploring Hybrid-based Retriever for Chinese

【速读】：该论文旨在解决混合检索方法在中文检索场景中应用不足的问题，尽管基于密集向量和词典的混合方法在工业界已展现出性能提升，但在中文环境下的研究仍较为有限。其解决方案的关键在于提出HyReC，这是一种针对中文混合检索的端到端优化方法，通过将词项语义联合整合到表示模型中，并引入全局-局部感知编码器（Global-Local-Aware Encoder, GLAE）以促进词典检索与密集检索之间的语义一致性共享，同时减少两者间的干扰，进而提升检索效果。

链接: https://arxiv.org/abs/2506.21913
作者: Zunran Wang,Zheng Shenpeng,Wang Shenglan,Minghui Zhao,Zhonghua Li
机构: Huawei Technologies Ltd. Co. (华为技术有限公司)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hybrid-based retrieval methods, which unify dense-vector and lexicon-based retrieval, have garnered considerable attention in the industry due to performance enhancement. However, despite their promising results, the application of these hybrid paradigms in Chinese retrieval contexts has remained largely underexplored. In this paper, we introduce HyReC, an innovative end-to-end optimization method tailored specifically for hybrid-based retrieval in Chinese. HyReC enhances performance by integrating the semantic union of terms into the representation model. Additionally, it features the Global-Local-Aware Encoder (GLAE) to promote consistent semantic sharing between lexicon-based and dense retrieval while minimizing the interference between them. To further refine alignment, we incorporate a Normalization Module (NM) that fosters mutual benefits between the retrieval approaches. Finally, we evaluate HyReC on the C-MTEB retrieval benchmark to demonstrate its effectiveness.
zh

[NLP-35] AutoMixer: Checkpoint Artifacts as Automatic Data Mixers ACL2025

【速读】：该论文试图解决在语言模型训练中如何获取合适的数据混合以赋予模型多种任务能力的问题，因为数据与任务之间的关系难以建模。解决方案的关键在于利用检查点模型（checkpoint models）在训练轨迹中不同阶段表现出的新兴能力，通过其在基准测试中的性能识别这些模型，并利用它们对源数据的聚合一阶影响近似作为数据混合器，从而提升预训练阶段的数据质量和数据混合效果。

链接: https://arxiv.org/abs/2506.21910
作者: Ernie Chang,Yang Li,Patrick Huber,David Kant,Yangyang Shi,Vikas Chandra
机构: Meta(元); Iowa State University (爱荷华州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025

点击查看摘要

Abstract:In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities as the relationship between data and tasks is difficult to be modeled. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrated on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
zh

[NLP-36] A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLM s ACL

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在不同语言和文化背景下行为偏差的问题，特别是其输出可能影响公众意见或强化主流叙事的风险。解决方案的关键在于通过两阶段评估框架区分模型偏差（由模型训练引起）和推理偏差（由查询语言引发），并构建一个涵盖事实性和争议性问答的多语言数据集，以系统评估LLMs在中立和敏感话题上的表现。

链接: https://arxiv.org/abs/2506.21881
作者: Sean Kim,Hyuhng Joon Kim
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: This paper is accepted to ACL Student Research Workshop (SRW) 2025

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed across diverse linguistic and cultural contexts, understanding their behavior in both factual and disputable scenarios is essential, especially when their outputs may shape public opinion or reinforce dominant narratives. In this paper, we define two types of bias in LLMs: model bias (bias stemming from model training) and inference bias (bias induced by the language of the query), through a two-phase evaluation. Phase 1 evaluates LLMs on factual questions where a single verifiable answer exists, assessing whether models maintain consistency across different query languages. Phase 2 expands the scope by probing geopolitically sensitive disputes, where responses may reflect culturally embedded or ideologically aligned perspectives. We construct a manually curated dataset spanning both factual and disputable QA, across four languages and question types. The results show that Phase 1 exhibits query language induced alignment, while Phase 2 reflects an interplay between the model’s training context and query language. This paper offers a structured framework for evaluating LLM behavior across neutral and sensitive topics, providing insights for future LLM deployment and culturally aware evaluation practices in multilingual contexts.
zh

[NLP-37] Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation ACL2025

【速读】：该论文试图解决当前视觉语言模型（Vision-Language Models, VLMs）在内部世界建模（World Models, WMs）能力上的系统性不足问题，特别是其在感知与预测方面的基础能力尚未得到全面评估。解决方案的关键在于提出一个基于比较心理学和认知科学的两阶段框架，用于评估VLMs在感知（视觉、空间、时间、数量和运动）与预测（机制模拟、传递推理、组合推理）方面的原子级能力，并构建了WM-ABench基准，涵盖6种多样化模拟环境中的23个细粒度评估维度，通过大规模实验揭示了VLMs在基本世界建模能力上的显著缺陷。

链接: https://arxiv.org/abs/2506.21876
作者: Qiyue Gao,Xinyu Pi,Kevin Liu,Junrong Chen,Ruolan Yang,Xinqi Huang,Xinyu Fang,Lu Sun,Gautham Kishore,Bo Ai,Stone Tao,Mengyang Liu,Jiaxi Yang,Chao-Jung Lai,Chuanyang Jin,Jiannan Xiang,Benhao Huang,Zeming Chen,David Danks,Hao Su,Tianmin Shu,Ziqiao Ma,Lianhui Qin,Zhiting Hu
机构: Maitrix.org; UC San Diego (加州大学圣地亚哥分校); JHU (约翰霍普金斯大学); Cornell Tech (康奈尔技术学院); EPFL (瑞士联邦理工学院); UMich (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2025 (Findings)

点击查看摘要

Abstract:Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding – e.g., some models tend to believe blue objects move faster than green ones. More rich results and analyses reveal significant gaps between VLMs and human-level world modeling.
zh

[NLP-38] WildSpeech-Bench: Benchmarking Audio LLM s in Natural Speech Conversation

【速读】：该论文试图解决当前端到端语音大语言模型（Speech LLM）评估缺乏专业且全面的基准问题，这一问题阻碍了语音大语言模型在实际应用中的用户体验优化。现有评估方法多依赖文本基准，忽视了语音特有的挑战，如韵律、同音词、口吃及用户期望差异等。论文的关键解决方案是提出一种查询感知的评估方法，通过系统收集真实对话数据、引入说话人属性和声学条件的多样性，并增强语音特有现象，结合定制化的评估清单和提示，提升自动评估的准确性，从而实现对语音模型更细致的性能评估。

链接: https://arxiv.org/abs/2506.21875
作者: Jian Zhang,Linhao Zhang,Bokai Lei,Chuhan Wu,Wei Jia,Xiao Zhou
机构: Tencent Inc(腾讯公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
zh

[NLP-39] RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture

【速读】：该论文旨在解决如何有效保护和传承古代黄河文化的问题，其解决方案的关键在于构建一个基于大型语言模型（Large Language Model, LLM）和文化知识数据集的实时交互系统——RiverEcho，该系统通过语音查询响应并借助虚拟数字人进行解释，从而提升用户对黄河文化的理解和体验。

链接: https://arxiv.org/abs/2506.21865
作者: Haofeng Wang,Yilin Guo,Zehao Li,Tong Yue,Yizong Wang,Enci Zhang,Rongqun Lin,Feng Gao,Shiqi Wang,Siwei Ma
机构: Peking University, Shenzhen(北京大学，深圳); Peking University, Beijing(北京大学，北京); Renmin University of China, Beijing(中国人民大学，北京); Pengcheng Laboratory, Shenzhen(鹏城实验室，深圳); Peking University, Beijing(北京大学，北京); City University of Hong Kong, Hong Kong SAR(香港城市大学，香港特别行政区)
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注: IEEE International Conference on Multimedia and Expo Workshop, 2025.(Accepted)

点击查看摘要

Abstract:The Yellow River is China’s mother river and a cradle of human civilization. The ancient Yellow River culture is, moreover, an indispensable part of human art history. To conserve and inherit the ancient Yellow River culture, we designed RiverEcho, a real-time interactive system that responds to voice queries using a large language model and a cultural knowledge dataset, delivering explanations through a talking-head digital human. Specifically, we built a knowledge database focused on the ancient Yellow River culture, including the collection of historical texts and the processing pipeline. Experimental results demonstrate that leveraging Retrieval-Augmented Generation (RAG) on the proposed dataset enhances the response quality of the Large Language Model(LLM), enabling the system to generate more professional and informative responses. Our work not only diversifies the means of promoting Yellow River culture but also provides users with deeper cultural insights.
zh

[NLP-40] DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

【速读】：该论文试图解决原生多模态大语言模型（Native MLLMs）在预训练过程中因语音-文本配对数据不足而导致的灾难性遗忘和性能退化问题。其解决方案的关键在于提出DeepTalk框架，该框架基于专家混合（Mixture of Experts, MoE）架构，通过自适应区分模态专家并进行单模态专项训练与多模态协同训练，有效缓解了模型在融合语音与文本生成任务时的性能下降问题。

链接: https://arxiv.org/abs/2506.21864
作者: Hang Shao,Heting Gao,Yunhang Shen,Jiawei Chen,Lijiang Li,Zuwei Long,Bo Tong,Ke Li,Xing Sun
机构: Tencent Youtu Lab(腾讯优图实验室); Fudan University(复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at this https URL.
zh

[NLP-41] Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models

【速读】：该论文试图解决神经语言模型内部表示中句法结构的构建过程及其在不同层间的演化机制尚不明确的问题。其解决方案的关键在于提出一种名为“可推导性探测（Derivational Probing）”的方法，用于研究微观句法结构（如主语名词短语）和宏观句法结构（如动词与其直接成分之间的关系）如何在词嵌入逐层向上传播过程中逐步构建。通过在BERT上的实验，该方法揭示了句法结构的构建呈现自底向上的特征，即微观句法结构在低层出现，并在高层逐渐整合为连贯的宏观句法结构。

链接: https://arxiv.org/abs/2506.21861
作者: Taiga Someya,Ryo Yoshida,Hitomi Yanaka,Yohei Oseki
机构: The University of Tokyo (东京大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent work has demonstrated that neural language models encode syntactic structures in their internal representations, yet the derivations by which these structures are constructed across layers remain poorly understood. In this paper, we propose Derivational Probing to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.
zh

[NLP-42] he Consistency Hypothesis in Uncertainty Quantification for Large Language Models UAI

【速读】：该论文试图解决如何在不依赖模型内部结构的情况下，准确估计大型语言模型（Large Language Model, LLM）输出的置信度问题，这对于需要高用户信任的实际应用至关重要。解决方案的关键在于对“一致性假设”（consistency hypothesis）进行形式化，并通过生成一致性作为置信度的代理指标，提出三种数学表述及相应的统计检验方法，以评估LLM输出在不同任务中的符合性。研究重点突出“Sim-Any”假设的可操作性，并基于此提出无需数据的黑盒不确定性量化方法，通过聚合生成结果之间的相似性进行置信度估计，实验表明该方法优于现有基线，验证了该假设的实用性。

链接: https://arxiv.org/abs/2506.21849
作者: Quan Xiao,Debarun Bhattacharjya,Balaji Ganesan,Radu Marinescu,Katsiaryna Mirylenka,Nhan H Pham,Michael Glass,Junkyu Lee
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM 研究院); Zalando (扎拉多); IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by The Conference on Uncertainty in Artificial Intelligence (UAI) 2025

点击查看摘要

Abstract:Estimating the confidence of large language model (LLM) outputs is essential for real-world applications requiring high user trust. Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits. In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence, an idea we formalize as the consistency hypothesis. We introduce three mathematical statements with corresponding statistical tests to capture variations of this hypothesis and metrics to evaluate LLM output conformity across tasks. Our empirical investigation, spanning 8 benchmark datasets and 3 tasks (question answering, text summarization, and text-to-SQL), highlights the prevalence of the hypothesis under different settings. Among the statements, we highlight the `Sim-Any’ hypothesis as the most actionable, and demonstrate how it can be leveraged by proposing data-free black-box UQ methods that aggregate similarities between generations for confidence estimation. These approaches can outperform the closest baselines, showcasing the practical value of the empirically observed consistency hypothesis.
zh

[NLP-43] LinguaSynth: Heterogeneous Linguistic Signals for News Classification

【速读】：该论文试图解决深度学习在自然语言处理（Natural Language Processing, NLP）中因依赖大型黑盒模型而带来的可解释性和计算效率问题。其解决方案的关键在于提出LinguaSynth，一个将五种互补的语义特征类型（词汇、句法、实体级、词级语义和文档级语义）整合到透明逻辑回归模型中的文本分类框架。与基于Transformer的架构不同，LinguaSynth在保持可解释性的同时实现了较高的计算效率，并在20 Newsgroups数据集上取得了84.89%的准确率，优于TF-IDF基线3.32个百分点。

链接: https://arxiv.org/abs/2506.21848
作者: Duo Zhang,Junyi Mo
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep learning has significantly advanced NLP, but its reliance on large black-box models introduces critical interpretability and computational efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types: lexical, syntactic, entity-level, word-level semantics, and document-level semantics within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
zh

[NLP-44] 3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach

【速读】：该论文试图解决传统3D建模在可访问性和可用性方面的挑战，使非专业用户能够通过语言和手势描述与AI协作创建3D模型。解决方案的关键在于结合自然语言处理（Natural Language Processing）和计算机视觉（Computer Vision）等AI技术，通过OpenAI和MediaPipe实现，并采用基于网络的架构以支持跨平台使用，从而提升用户参与度并保持人类创造力。

链接: https://arxiv.org/abs/2506.21845
作者: Zhuodi Cai
机构: Tisch School of the Arts, New York University (纽约大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
备注: 5 pages, 2 figures, 3 tables (containing 21 subfigures)

点击查看摘要

Abstract:This paper presents 3Description, an experimental human-AI collaborative approach for intuitive 3D modeling. 3Description aims to address accessibility and usability challenges in traditional 3D modeling by enabling non-professional individuals to co-create 3D models using verbal and gesture descriptions. Through a combination of qualitative research, product analysis, and user testing, 3Description integrates AI technologies such as Natural Language Processing and Computer Vision, powered by OpenAI and MediaPipe. Recognizing the web has wide cross-platform capabilities, 3Description is web-based, allowing users to describe the desired model and subsequently adjust its components using verbal and gestural inputs. In the era of AI and emerging media, 3Description not only contributes to a more inclusive and user-friendly design process, empowering more people to participate in the construction of the future 3D world, but also strives to increase human engagement in co-creation with AI, thereby avoiding undue surrender to technology and preserving human creativity.
zh

[NLP-45] PARSI: Persian Authorship Recognition via Stylometric Integration

【速读】：该论文试图解决波斯古典诗歌中计算作者归属（authorship attribution）的问题，这一问题由于诗歌的语言、风格和韵律的复杂性而极具挑战性。解决方案的关键在于提出一个多功能的神经框架，该框架结合了基于Transformer的语言编码器与针对波斯诗歌语义、风格特征和韵律维度的特征集，包括100维的Word2Vec嵌入、七个风格度量指标以及诗歌形式和韵律的分类编码。通过使用大规模的Ganjoor数字藏书库数据集，并采用逐句分类及加权投票策略进行评估，验证了该方法在作者归属任务中的有效性。

链接: https://arxiv.org/abs/2506.21840
作者: Kourosh Shahnazari,Mohammadali Keshtparvar,Seyed Moein Ayyoubzadeh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The intricate linguistic, stylistic, and metrical aspects of Persian classical poetry pose a challenge for computational authorship attribution. In this work, we present a versatile framework to determine authorship among 67 prominent poets. We employ a multi-input neural framework consisting of a transformer-based language encoder complemented by features addressing the semantic, stylometric, and metrical dimensions of Persian poetry. Our feature set encompasses 100-dimensional Word2Vec embeddings, seven stylometric measures, and categorical encodings of poetic form and meter. We compiled a vast corpus of 647,653 verses of the Ganjoor digital collection, validating the data through strict preprocessing and author verification while preserving poem-level splitting to prevent overlap. This work employs verse-level classification and majority and weighted voting schemes in evaluation, revealing that weighted voting yields 71% accuracy. We further investigate threshold-based decision filtering, allowing the model to generate highly confident predictions, achieving 97% accuracy at a 0.9 threshold, though at lower coverage. Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution. The results illustrate the potential of our approach for automated classification and the contribution to stylistic analysis, authorship disputes, and general computational literature research. This research will facilitate further research on multilingual author attribution, style shift, and generative modeling of Persian poetry.
zh

[NLP-46] GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles

【速读】：该论文试图解决文本到图像生成模型在生成逃生房间谜题图像时面临的挑战，这些图像需要在视觉吸引力、逻辑严谨性和智力刺激性方面达到较高标准。现有基础图像模型在空间关系和可供性推理方面存在不足，为此，本文提出了一种分层多智能体框架，其关键在于将任务分解为结构化阶段：功能设计、符号场景图推理、布局合成和局部图像编辑，通过专门智能体的迭代反馈协作，确保场景在视觉上连贯且在功能上可解。实验表明，智能体协作在提升可解性、避免捷径和清晰度方面有效，同时保持了图像的视觉质量。

链接: https://arxiv.org/abs/2506.21839
作者: Mengyi Shan,Brian Curless,Ira Kemelmacher-Shlizerman,Steve Seitz
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We challenge text-to-image models with generating escape room puzzle images that are visually appealing, logically solid, and intellectually stimulating. While base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi-agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.
zh

[NLP-47] Exploring the change in scientific readability following the release of ChatGPT

【速读】：该论文试图解决生成式 AI (Generative AI) 的兴起是否对科学论文摘要的可读性产生了影响的问题。其解决方案的关键在于利用四个标准可读性公式对 arXiv.org 上2010年至2024年6月7日之间的所有摘要进行分析，计算每个论文的可读性评分，并按年份及平台覆盖的八个主要学科类别进行汇总，从而评估可读性的演变趋势及其在 ChatGPT 发布后是否发生了显著变化。

链接: https://arxiv.org/abs/2506.21825
作者: Abdulkareem Alsudais
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise and growing popularity of accessible large language models have raised questions about their impact on various aspects of life, including how scientists write and publish their research. The primary objective of this paper is to analyze a dataset consisting of all abstracts posted on arXiv.org between 2010 and June 7th, 2024, to assess the evolution of their readability and determine whether significant shifts occurred following the release of ChatGPT in November 2022. Four standard readability formulas are used to calculate individual readability scores for each paper, classifying their level of readability. These scores are then aggregated by year and across the eight primary categories covered by the platform. The results show a steady annual decrease in readability, suggesting that abstracts are likely becoming increasingly complex. Additionally, following the release of ChatGPT, a significant change in readability is observed for 2023 and the analyzed months of 2024. Similar trends are found across categories, with most experiencing a notable change in readability during 2023 and 2024. These findings offer insights into the broader changes in readability and point to the likely influence of AI on scientific writing.
zh

[NLP-48] Exploring the Structure of AI-Induced Language Change in Scientific English

【速读】：该论文试图解决科学英语中词汇使用模式变化的结构问题，特别是这些变化是否涉及同义词被突然“激增”词汇替代，还是反映更广泛的语义和语用层面的调整。其解决方案的关键在于结合词频分析与词性标注，以量化不同语法类别中的语言变化，并区分词形差异（如“potential”作为名词与形容词的不同使用），从而揭示词汇变化的语义和语用特征，而非仅限于词汇层面的替换。

链接: https://arxiv.org/abs/2506.21817
作者: Riley Galpin,Bryce Anderson,Tom S. Juzek
机构: Florida State University (佛罗里达州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted and published at FLAIRS 38. 8 pages, 4 figures, 1 table. Licensed under CC BY-NC-SA 4.0

点击查看摘要

Abstract:Scientific English has undergone rapid and unprecedented changes in recent years, with words such as “delve,” “intricate,” and “crucial” showing significant spikes in frequency since around 2022. These changes are widely attributed to the growing influence of Large Language Models like ChatGPT in the discourse surrounding bias and misalignment. However, apart from changes in frequency, the exact structure of these linguistic shifts has remained unclear. The present study addresses this and investigates whether these changes involve the replacement of synonyms by suddenly ‘spiking words,’ for example, “crucial” replacing “essential” and “key,” or whether they reflect broader semantic and pragmatic qualifications. To further investigate structural changes, we include part of speech tagging in our analysis to quantify linguistic shifts over grammatical categories and differentiate between word forms, like “potential” as a noun vs. as an adjective. We systematically analyze synonym groups for widely discussed ‘spiking words’ based on frequency trends in scientific abstracts from PubMed. We find that entire semantic clusters often shift together, with most or all words in a group increasing in usage. This pattern suggests that changes induced by Large Language Models are primarily semantic and pragmatic rather than purely lexical. Notably, the adjective “important” shows a significant decline, which prompted us to systematically analyze decreasing lexical items. Our analysis of “collapsing” words reveals a more complex picture, which is consistent with organic language change and contrasts with the patterns of the abrupt spikes. These insights into the structure of language change contribute to our understanding of how language technology continues to shape human language.
zh

[NLP-49] owards Transparent AI: A Survey on Explainable Large Language Models

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在决策过程中的可解释性问题，即其“黑箱”特性限制了其在高风险领域应用的可信度与采纳率。解决方案的关键在于系统性地梳理和分类基于不同Transformer架构（编码器-仅、解码器-仅和编码器-解码器）的可解释人工智能（Explainable Artificial Intelligence, XAI）方法，并评估这些方法在解释性方面的有效性，同时探讨其在实际应用中的价值与挑战。

链接: https://arxiv.org/abs/2506.21812
作者: Avash Palikhe,Zhenyu Yu,Zichong Wang,Wenbin Zhang
机构: Florida International University (佛罗里达国际大学); Universiti Malaya (马来亚大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have played a pivotal role in advancing Artificial Intelligence (AI). However, despite their achievements, LLMs often struggle to explain their decision-making processes, making them a ‘black box’ and presenting a substantial challenge to explainability. This lack of transparency poses a significant obstacle to the adoption of LLMs in high-stakes domain applications, where interpretability is particularly essential. To overcome these limitations, researchers have developed various explainable artificial intelligence (XAI) methods that provide human-interpretable explanations for LLMs. However, a systematic understanding of these methods remains limited. To address this gap, this survey provides a comprehensive review of explainability techniques by categorizing XAI methods based on the underlying transformer architectures of LLMs: encoder-only, decoder-only, and encoder-decoder models. Then these techniques are examined in terms of their evaluation for assessing explainability, and the survey further explores how these explanations are leveraged in practical applications. Finally, it discusses available resources, ongoing research challenges, and future directions, aiming to guide continued efforts toward developing transparent and responsible LLMs.
zh

[NLP-50] A suite of allotaxonometric tools for the comparison of complex systems using rank-turbulence divergence

【速读】：该论文旨在解决复杂系统描述与比较中缺乏理论基础且系统性的工具问题，其解决方案的关键在于提出一种名为allotaxonograph的可视化方法，该方法围绕类型湍流（type turbulence）现象构建，能够对重尾分布对进行地图与列表形式的对比分析。allotaxonograph设计用于兼容多种度量工具，包括秩湍流差异（rank-turbulence divergence）、概率湍流差异、Jenson-Shannon差异以及广义熵差异，从而提供灵活且全面的分析手段。

链接: https://arxiv.org/abs/2506.21808
作者: Jonathan St-Onge,Ashley M. A. Fehr,Carter Ward,Calla G. Beauregard,Michael V. Arnold,Samuel F. Rosenblatt,Benjamin Cooley,Christopher M. Danforth,Peter Sheridan Dodds
机构: 未知
类目: Computation and Language (cs.CL)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:Describing and comparing complex systems requires principled, theoretically grounded tools. Built around the phenomenon of type turbulence, allotaxonographs provide map-and-list visual comparisons of pairs of heavy-tailed distributions. Allotaxonographs are designed to accommodate a wide range of instruments including rank- and probability-turbulence divergences, Jenson-Shannon divergence, and generalized entropy divergences. Here, we describe a suite of programmatic tools for rendering allotaxonographs for rank-turbulence divergence in Matlab, Javascript, and Python, all of which have different use cases.
zh

[NLP-51] CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM -Driven Agent Simulation

【速读】：该论文试图解决城市环境中人类行为建模的问题，传统方法依赖于刚性的手工规则，限制了其对复杂意图、计划和适应性行为的模拟能力。解决方案的关键在于构建一个名为CitySim的城市模拟器，该模拟器利用大型语言模型的人类水平智能，使代理通过递归的价值驱动方法生成现实的日常日程，并赋予代理信念、长期目标和空间记忆以实现长期、逼真的模拟。

链接: https://arxiv.org/abs/2506.21805
作者: Nicolas Bougie,Narimasa Watanabe
机构: Woven by Toyota (丰田织物)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modeling human behavior in urban environments is fundamental for social science, behavioral studies, and urban planning. Prior work often rely on rigid, hand-crafted rules, limiting their ability to simulate nuanced intentions, plans, and adaptive behaviors. Addressing these challenges, we envision an urban simulator (CitySim), capitalizing on breakthroughs in human-level intelligence exhibited by large language models. In CitySim, agents generate realistic daily schedules using a recursive value-driven approach that balances mandatory activities, personal habits, and situational factors. To enable long-term, lifelike simulations, we endow agents with beliefs, long-term goals, and spatial memory for navigation. CitySim exhibits closer alignment with real humans than prior work, both at micro and macro levels. Additionally, we conduct insightful experiments by modeling tens of thousands of agents and evaluating their collective behaviors under various real-world scenarios, including estimating crowd density, predicting place popularity, and assessing well-being. Our results highlight CitySim as a scalable, flexible testbed for understanding and forecasting urban phenomena.
zh

[NLP-52] Offensive Language Detection on Social Media Using XLNet

【速读】：该论文试图解决社交媒体平台上文本交流中日益增多的攻击性语言（offensive language）检测问题，该问题由于用户生成内容数量庞大，传统人工审核已不可行，亟需自动化检测系统。解决方案的关键在于利用基于XLNet的深度学习模型，这是一种通过迁移学习（transfer learning）进行大规模预训练的通用自回归预训练方法，并将其与广泛应用于自然语言处理（NLP）的BERT模型进行对比，以评估其在攻击性语言识别任务中的性能表现。

链接: https://arxiv.org/abs/2506.21795
作者: Reem Alothman,Hafida Benhidour,Said Kerrache
机构: King Saud University (沙特国王大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread use of text-based communication on social media-through chats, comments, and microblogs-has improved user interaction but has also led to an increase in offensive content, including hate speech, racism, and other forms of abuse. Due to the enormous volume of user-generated content, manual moderation is impractical, which creates a need for automated systems that can detect offensive language. Deep learning models, particularly those using transfer learning, have demonstrated significant success in understanding natural language through large-scale pretraining. In this study, we propose an automatic offensive language detection model based on XLNet, a generalized autoregressive pretraining method, and compare its performance with BERT (Bidirectional Encoder Representations from Transformers), which is a widely used baseline in natural language processing (NLP). Both models are evaluated using the Offensive Language Identification Dataset (OLID), a benchmark Twitter dataset that includes hierarchical annotations. Our experimental results show that XLNet outperforms BERT in detecting offensive content and in categorizing the types of offenses, while BERT performs slightly better in identifying the targets of the offenses. Additionally, we find that oversampling and undersampling strategies are effective in addressing class imbalance and improving classification performance. These findings highlight the potential of transfer learning and XLNet-based architectures to create robust systems for detecting offensive language on social media platforms.
zh

[NLP-53] Evaluating List Construction and Temporal Understanding capabilities of Large Language Models ICTIR2025 SIGIR2025

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在涉及多个实体的时序理解任务中存在幻觉和错误的问题，具体表现为无法准确关联实体与时间区间、生成完整的实体列表或对具有特定时间范围的事件进行推理。解决方案的关键是提出一个基于时间参考的列表问答基准（Time referenced List based Question Answering, TLQA），该基准要求答案以结构化列表形式呈现，并与相应的时间段对齐，从而同时考验模型的列表构建能力和时序理解能力。

链接: https://arxiv.org/abs/2506.21783
作者: Alexandru Dumitru,V Venktesh,Adam Jatowt,Avishek Anand
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ICTIR 2025 co-located with SIGIR 2025, 11 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors on particularly temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering or TLQA benchmark that requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark, requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in open-domain setup, providing clear future directions for research on TLQA. The benchmark and code at this https URL.
zh

[NLP-54] (Fact) Check Your Bias

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中的参数化知识偏差对自动事实验证系统（如HerO系统）的影响问题。其解决方案的关键在于通过实验分析两种类型的偏差：一是Llama 3.1模型本身参数化知识的潜在偏差，二是人为注入的偏差。研究发现，当直接提示模型进行事实验证时，Llama 3.1将近一半的陈述标记为“证据不足”，而利用其内部知识可对剩余陈述做出判断；此外，通过生成支持性、反驳性或中立性的事实核查文档的提示策略显著影响了检索结果，但最终的判断预测在不同提示策略下表现出稳定性。

链接: https://arxiv.org/abs/2506.21745
作者: Eivind Morris Bakke,Nora Winger Heggelund
机构: University of Oslo (奥斯陆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1’s parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as “Not Enough Evidence”. Using only its parametric knowledge it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: this https URL
zh

[NLP-55] Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

【速读】：该论文试图解决自监督语音Transformer模型中对说话人信息编码机制缺乏深入研究的问题，特别是如何识别模型中与说话人相关的信息编码神经元。其解决方案的关键在于通过分析自监督特征和i-vector的k-means聚类，识别出与语音音素和性别类别相关的神经元，并在剪枝过程中保护这些神经元，从而显著保持说话人相关任务的性能，证明了这些神经元在编码说话人信息中的关键作用。

链接: https://arxiv.org/abs/2506.21712
作者: Tzu-Quan Lin,Hsi-Chun Cheng,Hung-yi Lee,Hao Tang
机构: National Taiwan University (台湾大学); University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related task, demonstrating their crucial role in encoding speaker information.
zh

[NLP-56] ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages

【速读】：该论文试图解决孟加拉语地区方言的情感分析问题，这一领域由于语言多样性及标注数据有限而研究不足。解决方案的关键在于构建了一个名为ANUBHUTI的综合性数据集，该数据集包含从标准孟加拉语手动翻译成四个主要地区方言（Mymensingh、Noakhali、Sylhet和Chittagong）的2000个句子，并采用双标注方案进行主题分类和情感标注，确保了数据的高质量与一致性。

链接: https://arxiv.org/abs/2506.21686
作者: Swastika Kundu,Autoshi Ibrahim,Mithila Rahman,Tanvir Ahmed
机构: Ahsanullah University Of Science and Technology (阿罕默德科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sentiment analysis for regional dialects of Bangla remains an underexplored area due to linguistic diversity and limited annotated data. This paper introduces ANUBHUTI, a comprehensive dataset consisting of 2000 sentences manually translated from standard Bangla into four major regional dialects Mymensingh, Noakhali, Sylhet, and Chittagong. The dataset predominantly features political and religious content, reflecting the contemporary socio political landscape of Bangladesh, alongside neutral texts to maintain balance. Each sentence is annotated using a dual annotation scheme: multiclass thematic labeling categorizes sentences as Political, Religious, or Neutral, and multilabel emotion annotation assigns one or more emotions from Anger, Contempt, Disgust, Enjoyment, Fear, Sadness, and Surprise. Expert native translators conducted the translation and annotation, with quality assurance performed via Cohens Kappa inter annotator agreement, achieving strong consistency across dialects. The dataset was further refined through systematic checks for missing data, anomalies, and inconsistencies. ANUBHUTI fills a critical gap in resources for sentiment analysis in low resource Bangla dialects, enabling more accurate and context aware natural language processing.
zh

[NLP-57] Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations

【速读】：该论文试图解决图神经网络（Graph Neural Networks, GNNs）在利用显式结构信息方面存在局限性的问题，同时探讨多层感知机（Multi-Layer Perceptrons, MLPs）在结构感知任务中的潜力。其解决方案的关键在于引入一种基于信息论的系统性探测框架，通过扩展传统探测分类器并引入控制模块，实现对GNN模型及其解耦组件（即消息传递和特征变换）的独立评估，从而避免完整GNN架构带来的混杂效应。该方法有效揭示了特征变换操作在提升语言模型表示能力中的重要作用，并验证了MLPs作为高效且可扩展的GNN替代方案的可行性。

链接: https://arxiv.org/abs/2506.21682
作者: Li Zhou,Hao Jiang,Junjie Li,Zefeng Zhao,Feng Jiang,Wenyu Chen,Haizhou Li
机构: School of Data Science, The Chinese University of Hong Kong, Shenzhen; Department of Electrical and Eletronic Engineering, Faculty of Engineering, The Hong Kong Polytechnic University; School of Computer Science and Engineering (School of Cyber Security), University of Electronic Science and Technology of China
类目: Computation and Language (cs.CL)
备注: Graph Neural Networks, Multi-Layer Perceptrons, Explicit Structural Modeling, Probing Classifier

点击查看摘要

Abstract:Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, exhibit a surprising ability in structure-aware tasks. Motivated by these findings, this paper introduces a comprehensive probing framework from an information-theoretic perspective. The framework is designed to systematically assess the role of explicit structural modeling in enhancing language model (LM) representations and to investigate the potential of MLPs as efficient and scalable alternatives to GNNs. We extend traditional probing classifiers by incorporating a control module that allows for selective use of either the full GNN model or its decoupled components, specifically, the message-passing and feature-transformation this http URL modular approach isolates and assesses the individual contributions of these operations, avoiding confounding effects from the complete GNN architecture. Using the Edge Probing Suite, a diagnostic tool for evaluating the linguistic knowledge encoded in LMs, we find that MLPs, when used as feature-transformation modules, consistently improve the linguistic knowledge captured in LM representations across different architectures. They effectively encode both syntactic and semantic patterns. Similarly, GNNs that incorporate feature-transformation operations show beneficial effects. In contrast, models that rely solely on message-passing operations tend to underperform, often leading to negative impacts on probing task performance.
zh

[NLP-58] Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

【速读】：该论文旨在解决当前视觉-语言模型（Vision-Language Models, VLMs）在细粒度空间推理任务中的不足，尤其是在需要多步骤逻辑和精确空间对齐的情况下。其解决方案的关键在于提出一种基于多模型蒙特卡洛树搜索（Multi-Model Monte Carlo Tree Search, M3CTS）的高质量监督生成方法，以及一种细粒度直接偏好优化（fine-grained Direct Preference Optimization, fDPO）策略，通过空间奖励机制提升模型在空间一致性、空间定位和逻辑连贯性方面的表现。

链接: https://arxiv.org/abs/2506.21656
作者: Yifan Shen,Yuanzhe Liu,Jingyuan Zhu,Xu Cao,Xiaofeng Zhang,Yixiao He,Wenming Ye,James Matthew Rehg,Ismini Lourentzou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 29 pages

点击查看摘要

Abstract:Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
zh

[NLP-59] Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents

【速读】：该论文旨在解决从科学文献和专利中提取分子结构-活性关系（SARs）的挑战，这一任务在药物发现和材料研究中具有重要意义。现有方法面临文档格式异质性和性能限制，例如基于规则的方法无法适应多样的文档布局，而通用多模态大语言模型（MLLMs）在布局检测和光学化学结构识别（OCSR）等专业任务中准确性不足。论文的关键解决方案是提出Doc2SAR框架，该框架通过结合领域专用工具与经过监督微调（SFT）增强的MLLMs，实现了对SAR提取的有效支持。

链接: https://arxiv.org/abs/2506.21625
作者: Jiaxi Zhuang,Kangning Li,Jue Hou,Mingjun Xu,Zhifeng Gao,Hengxing Cai
机构: DP Technology (DP Technology)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Extracting molecular structure-activity relationships (SARs) from scientific literature and patents is essential for drug discovery and materials research. However, this task remains challenging due to heterogeneous document formats and limitations of existing methods. Specifically, rule-based approaches relying on rigid templates fail to generalize across diverse document layouts, while general-purpose multimodal large language models (MLLMs) lack sufficient accuracy and reliability for specialized tasks, such as layout detection and optical chemical structure recognition (OCSR). To address these challenges, we introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific documents designed specifically for evaluating SAR extraction methods. Additionally, we propose Doc2SAR, a novel synergistic framework that integrates domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT). Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art performance across various document types, significantly outperforming leading end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of 80.78% on DocSAR-200, exceeding end2end GPT-4o by 51.48%. Furthermore, Doc2SAR demonstrates practical usability through efficient inference and is accompanied by a web app.
zh

[NLP-60] Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints

【速读】：该论文试图解决自然语言中细微语义差异和上下文变化在文本分类任务中的识别难题，特别是在消费者投诉场景下准确评估消费者救济资格的问题。其解决方案的关键在于引入经过人类经验训练的算法以有效识别关键语义差异，并结合通过专家评估和标注优化的生成对抗网络（Generative Adversarial Networks, GANs）生成的高质量合成数据，从而提升分类器性能并降低数据获取成本。

链接: https://arxiv.org/abs/2506.21623
作者: Peiheng Gao,Chen Yang,Ning Sun,Ričardas Zitikis
机构: Western University (西安大略大学); Icahn School of Medicine at Mount Sinai (迈克尔·S·哈特医学中心); Mount Sinai (山脉医疗中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning (ML) has significantly advanced text classification by enabling automated understanding and categorization of complex, unstructured textual data. However, accurately capturing nuanced linguistic patterns and contextual variations inherent in natural language, particularly within consumer complaints, remains a challenge. This study addresses these issues by incorporating human-experience-trained algorithms that effectively recognize subtle semantic differences crucial for assessing consumer relief eligibility. Furthermore, we propose integrating synthetic data generation methods that utilize expert evaluations of generative adversarial networks and are refined through expert annotations. By combining expert-trained classifiers with high-quality synthetic data, our research seeks to significantly enhance machine learning classifier performance, reduce dataset acquisition costs, and improve overall evaluation metrics and robustness in text classification tasks.
zh

[NLP-61] Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech

【速读】：该论文试图解决由于脑性瘫痪或遗传疾病等导致的言语障碍在自动语音识别（Automatic Speech Recognition, ASR）系统中面临的挑战。现有ASR模型如Whisper在处理非典型言语时表现不佳，主要受限于训练数据不足以及收集和标注非典型语音样本的难度。该研究提出了一种实用且轻量级的个性化ASR模型流水线，其关键在于通过形式化选择词语并增强小规模言语障碍数据集的语义连贯性，从而提升语音识别的准确性。

链接: https://arxiv.org/abs/2506.21622
作者: Niclas Pokel,Pehuén Moure,Roman Boehringer,Yingqiang Gao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech impairments caused by conditions such as cerebral palsy or genetic disorders pose significant challenges for automatic speech recognition (ASR) systems. Despite recent advances, ASR models like Whisper struggle with non-normative speech due to limited training data and the difficulty of collecting and annotating non-normative speech samples. In this work, we propose a practical and lightweight pipeline to personalize ASR models, formalizing the selection of words and enriching a small, speech-impaired dataset with semantic coherence. Applied to data from a child with a structural speech impairment, our approach shows promising improvements in transcription quality, demonstrating the potential to reduce communication barriers for individuals with atypical speech patterns.
zh

[NLP-62] he Open Proof Corpus: A Large-Scale Study of LLM -Generated Mathematical Proofs

【速读】：该论文试图解决大规模、高质量的人类评估数学证明数据集的缺乏问题，这一问题阻碍了生成式 AI (Generative AI) 在数学证明生成领域的进一步发展。解决方案的关键在于构建 Open Proof Corpus (OPC)，这是一个包含超过 5,000 个由先进大语言模型 (LLMs) 生成并经人类评估的数学证明的数据集，特别针对数学竞赛问题如 USAMO 和 IMO 的正确解法进行了收录，旨在推动证明生成研究的进展并支持对证明能力的严格分析。

链接: https://arxiv.org/abs/2506.21621
作者: Jasper Dekoninck,Ivo Petrov,Kristian Minchev,Mislav Balunovic,Martin Vechev,Miroslav Marinov,Maria Drencheva,Lyuba Konova,Milen Shumanov,Kaloyan Tsvetkov,Nikolay Drenchev,Lazar Todorov,Kalina Nikolova,Nikolay Georgiev,Vanesa Kalinkova,Margulan Ismoldayev
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
zh

[NLP-63] How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit

【速读】：该论文试图解决生成式 AI (Generative AI) 在政治性在线讨论中可能被用于影响舆论和操纵政治叙事的问题，其核心在于评估大型语言模型（LLMs）在真实政治语境下生成用户生成内容（User-Generated Content, UGC）的能力与影响。解决方案的关键在于通过三组实验，利用 GPT-4 模拟真实或人工的立场用户生成评论，并从政治倾向、情感和语言特征等方面进行分析，以验证其生成内容的真实性及潜在操控性。

链接: https://arxiv.org/abs/2506.21620
作者: Daniele Cirulli,Giulio Cimini,Giovanni Palermo
机构: Enrico Fermi Research Center (恩里科·费米研究中心); Physics Department and INFN, University of Rome Tor Vergata (物理系和INFN，罗马特尔维加塔大学); Physics Department, “Sapienza” University of Rome (物理系，“罗马第一大学”)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently emerged as powerful tools for natural language generation, with applications spanning from content creation to social simulations. Their ability to mimic human interactions raises both opportunities and concerns, particularly in the context of politically relevant online discussions. In this study, we evaluate the performance of LLMs in replicating user-generated content within a real-world, divisive scenario: Reddit conversations during the 2016 US Presidential election. In particular, we conduct three different experiments, asking GPT-4 to generate comments by impersonating either real or artificial partisan users. We analyze the generated comments in terms of political alignment, sentiment, and linguistic features, comparing them against real user contributions and benchmarking against a null model. We find that GPT-4 is able to produce realistic comments, both in favor of or against the candidate supported by the community, yet tending to create consensus more easily than dissent. In addition we show that real and artificial comments are well separated in a semantically embedded space, although they are indistinguishable by manual inspection. Our findings provide insights on the potential use of LLMs to sneak into online discussions, influence political debate and shape political narratives, bearing broader implications of AI-driven discourse manipulation.
zh

[NLP-64] IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

【速读】：该论文旨在解决传统自回归文本到语音（TTS）模型在语音合成过程中难以精确控制语音时长的问题，这一问题限制了其在需要严格音画同步的应用场景中的使用。解决方案的关键在于提出IndexTTS2，该方法引入了一种新颖且适用于自回归模型的语音时长控制机制，支持两种生成模式：一种允许显式指定生成的token数量以实现精确时长控制，另一种则无需人工输入，使模型能够自由生成语音并保留输入提示中的韵律特征。此外，IndexTTS2实现了情感表达与说话人身份的解耦，从而实现了音色和情感的独立控制。

链接: https://arxiv.org/abs/2506.21619
作者: Siyi Zhou,Yiquan Zhou,Yi He,Xun Zhou,Jinchao Wang,Wei Deng,Jingchen Shu
机构: Bilibili Inc. (哔哩哔哩公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Large-scale text-to-speech (TTS) models are typically categorized into autoregressive and non-autoregressive systems. Although autoregressive systems exhibit certain advantages in speech naturalness, their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This is a key limitation in applications such as video dubbing that require strict audio-visual synchronization. This paper introduces IndexTTS2, which proposes a novel and autoregressive-model-friendly method for speech duration control. The method supports two generation modes: one allows explicit specification of the number of generated tokens for precise duration control; the other does not require manual input and lets the model freely generate speech while preserving prosodic characteristics from the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion. In the zero-shot setting, the model can perfectly reproduce the emotional characteristics of the input prompt. Users may also provide a separate emotion prompt, even from a different speaker, allowing the model to reconstruct the target timbre while conveying the desired emotion. To enhance clarity during strong emotional expressions, we incorporate GPT latent representations to improve speech stability. Meanwhile, to lower the barrier for emotion control, we design a soft instruction mechanism based on textual descriptions by fine-tuning Qwen3. This enables effective guidance of speech generation with desired emotional tendencies using natural language input. Experimental results demonstrate that IndexTTS2 outperforms existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.
zh

[NLP-65] rajTok: Technical Report for 2025 Waymo Open Sim Agents Challenge

【速读】：该论文旨在解决行为生成模型中轨迹表示与预测的挑战，特别是在离散的下一步预测（next-token-prediction）框架下如何实现更准确、对称且鲁棒的轨迹建模。其解决方案的关键在于提出TrajTok，一种结合数据驱动与规则方法的轨迹分词器，并引入空间感知的标签平滑方法以优化交叉熵损失函数，从而提升模型在真实感评分上的表现。

链接: https://arxiv.org/abs/2506.21618
作者: Zhiyuan Zhang,Xiaosong Jia,Guanyu Chen,Qifeng Li,Junchi Yan
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this technical report, we introduce TrajTok, a trajectory tokenizer for discrete next-token-prediction based behavior generation models, which combines data-driven and rule-based methods with better coverage, symmetry and robustness, along with a spatial-aware label smoothing method for cross-entropy loss. We adopt the tokenizer and loss for the SMART model and reach a superior performance with realism score of 0.7852 on the Waymo Open Sim Agents Challenge 2025. We will open-source the code in the future.
zh

[NLP-66] IM: A Large-Scale Dataset and large Timeline Intelligence Model for Open-domain Timeline Summarization

【速读】：该论文试图解决开放域时间线摘要（Open-domain Timeline Summarization, TLS）中存在的话题相关性评估不足和话题演变理解不准确的问题，导致生成的摘要包含无关信息或时间戳错误。解决方案的关键在于提出首个大规模的时间智能模型（Timeline Intelligence Model, TIM），其核心包括构建一个包含超过1,000个新闻话题和3,000个标注TLS实例的大规模数据集，以及采用渐进式优化策略，结合指令微调提升摘要质量和无关信息过滤能力，并引入一种新颖的双对齐奖励学习方法，从语义和时间两个角度增强对话题演变的理解。

链接: https://arxiv.org/abs/2506.21616
作者: Chuanrui Hu,Wei Hu,Penghang Yu,Hua Zhang,Bing-Kun Bao
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Open-domain Timeline Summarization (TLS) is crucial for monitoring the evolution of news topics. To identify changes in news topics, existing methods typically employ general Large Language Models (LLMs) to summarize relevant timestamps from retrieved news. While general LLMs demonstrate capabilities in zero-shot news summarization and timestamp localization, they struggle with assessing topic relevance and understanding topic evolution. Consequently, the summarized information often includes irrelevant details or inaccurate timestamps. To address these issues, we propose the first large Timeline Intelligence Model (TIM) for open-domain TLS, which is capable of effectively summarizing open-domain timelines. Specifically, we begin by presenting a large-scale TLS dataset, comprising over 1,000 news topics and more than 3,000 annotated TLS instances. Furthermore, we propose a progressive optimization strategy, which gradually enhance summarization performance. It employs instruction tuning to enhance summarization and topic-irrelevant information filtering capabilities. Following this, it exploits a novel dual-alignment reward learning method that incorporates both semantic and temporal perspectives, thereby improving the understanding of topic evolution principles. Through this progressive optimization strategy, TIM demonstrates a robust ability to summarize open-domain timelines. Extensive experiments in open-domain demonstrate the effectiveness of our TIM.
zh

[NLP-67] Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

【速读】：该论文试图解决当前医疗语言模型（Medical Language Models）在诊断预测中与临床实际决策逻辑不一致的问题，即现有模型依赖于易于获取的ICD编码标签，而未能捕捉临床医生在诊断过程中使用的复杂、上下文丰富的推理过程。解决方案的关键在于提出GARMLE-G框架，该框架通过将生成式AI（Generative AI）输出与权威临床实践指南（Clinical Practice Guidelines, CPGs）相结合，实现无幻觉的输出。其核心机制包括：整合大语言模型（LLM）预测与电子健康记录（EHR）数据生成语义丰富的查询，通过嵌入相似性检索相关CPG知识片段，并将指南内容与模型输出融合以生成符合临床指南的推荐。

链接: https://arxiv.org/abs/2506.21615
作者: Wenhao Li,Hongkuan Zhang,Hongwei Zhang,Zhengxu Li,Zengjie Dong,Yafan Chen,Niranjan Bidargaddi,Hong Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.
zh

[NLP-68] LastingBench: Defend Benchmarks Against Knowledge Leakage

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在标准问答（Question Answering, QA）基准测试中通过记忆特定数据“作弊”的问题，这导致基准测试的有效性受到质疑。解决方案的关键在于提出一种名为LastingBench的框架，该框架通过扰动识别泄漏点，并将其重写为反事实内容，从而破坏记忆效应同时保持基准测试的原始评估意图。

链接: https://arxiv.org/abs/2506.21614
作者: Yixiong Fang,Tianran Sun,Yuling Shi,Min Wang,Xiaodong Gu
机构: Shanghai Jiao Tong University (上海交通大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing complexity of large language models (LLMs) raises concerns about their ability to “cheat” on standard Question Answering (QA) benchmarks by memorizing task-specific data. This undermines the validity of benchmark evaluations, as they no longer reflect genuine model capabilities but instead the effects of data leakage. While prior work has focused on detecting such leakage, little attention has been given to mitigating its impact and preserving the long-term utility of benchmarks. In this paper, we introduce LastingBench, a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. LastingBench identifies leakage points in the context through perturbation, then rewrites the leakage points to counterfactual ones-disrupting memorization while preserving the benchmark’s original evaluative intent. Evaluations of state-of-the-art QA benchmarks show significant performance gaps, highlighting the efficacy of LastingBench in reducing memorization effects. LastingBench offers a practical and scalable solution to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs.
zh

[NLP-69] ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

【速读】：该论文试图解决在线儿童目标仇恨言论（child-targeted hate speech）检测中存在的数据不足问题，现有仇恨言论数据集缺乏年龄特定标注、未能捕捉细微语境，并忽视对儿童的独特情感影响。解决方案的关键在于引入ChildGuard1数据集，该数据集从现有语料库中整理并添加了针对儿童的特定标注，涵盖了不同年龄段的儿童目标仇恨言论的多样化语境，为提升相关检测方法提供了坚实的基础。

链接: https://arxiv.org/abs/2506.21613
作者: Gautam Siddharth Kashyap,Mohammad Anas Azeez,Rafiq Ali,Zohaib Hasan Siddiqui,Jiechao Gao,Usman Naseem
机构: Macquarie University (麦考瑞大学); Jamia Hamdard (贾米亚哈姆德ard大学); DSEU (DSEU); Center for SDGC, Stanford University (SDGC中心，斯坦福大学)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The increasing prevalence of child-targeted hate speech online underscores the urgent need for specialized datasets to address this critical issue. Existing hate speech datasets lack agespecific annotations, fail to capture nuanced contexts, and overlook the unique emotional impact on children. To bridge this gap, we introduce ChildGuard1, a curated dataset derived from existing corpora and enriched with child-specific annotations. ChildGuard captures diverse contexts of child-targeted hate speech, spanning age groups. We benchmark existing state-of-the-art hate speech detection methods, including Large Language Models (LLMs), and assess their effectiveness in detecting and contextualizing child-targeted hate speech. To foster further research in this area, we publicly release ChildGuard, providing a robust foundation for developing improved methods to detect and mitigate such harm.
zh

[NLP-70] AdaptGOT: A Pre-trained Model for Adaptive Contextual POI Representation Learning

【速读】：该论文旨在解决当前兴趣点（Point-of-Interest, POI）嵌入方法中存在的多上下文采样策略不足、多POI上下文探索不够、泛化能力有限等问题。其解决方案的关键在于提出AdaptGOT模型，该模型融合了自适应表示学习技术和地理共现文本（Geographical-Co-Occurrence-Text, GOT）表示，并通过三个核心组件实现：上下文邻域生成、增强型GOT表示以及基于专家混合（Mixture of Experts, MoE）的自适应编码器-解码器架构，从而提升POI表示的质量与泛化能力。

链接: https://arxiv.org/abs/2506.21612
作者: Xiaobin Ren,Xinyu Zhu,Kaiqi Zhao
机构: University of Auckland(奥克兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Currently, considerable strides have been achieved in Point-of-Interest (POI) embedding methodologies, driven by the emergence of novel POI tasks like recommendation and classification. Despite the success of task-specific, end-to-end models in POI embedding, several challenges remain. These include the need for more effective multi-context sampling strategies, insufficient exploration of multiple POI contexts, limited versatility, and inadequate generalization. To address these issues, we propose the AdaptGOT model, which integrates both the (Adapt)ive representation learning technique and the Geographical-Co-Occurrence-Text (GOT) representation with a particular emphasis on Geographical location, Co-Occurrence and Textual information. The AdaptGOT model comprises three key components: (1) contextual neighborhood generation, which integrates advanced mixed sampling techniques such as KNN, density-based, importance-based, and category-aware strategies to capture complex contextual neighborhoods; (2) an advanced GOT representation enhanced by an attention mechanism, designed to derive high-quality, customized representations and efficiently capture complex interrelations between POIs; and (3) the MoE-based adaptive encoder-decoder architecture, which ensures topological consistency and enriches contextual representation by minimizing Jensen-Shannon divergence across varying contexts. Experiments on two real-world datasets and multiple POI tasks substantiate the superior performance of the proposed AdaptGOT model.
zh

[NLP-71] Does Multimodality Lead to Better Time Series Forecasting?

【速读】：该论文试图解决在时间序列预测中，如何有效整合文本信息以提升预测性能的问题，特别是探讨多模态融合是否在所有情况下都能带来一致性的性能提升。其解决方案的关键在于系统性地评估两种多模态预测范式：基于对齐的方法和基于提示的方法，并通过分析模型架构特性和数据特征，识别出文本信息有助于预测的条件，包括高容量的文本模型、相对较弱的时间序列模型、适当的对齐策略、充足的训练数据以及文本提供超出时间序列本身之外的互补预测信号。

链接: https://arxiv.org/abs/2506.21611
作者: Xiyuan Zhang,Boran Han,Haoyang Fang,Abdul Fatir Ansari,Shuai Zhang,Danielle C. Maddix,Cuixiong Hu,Andrew Gordon Wilson,Michael W. Mahoney,Hao Wang,Yan Liu,Huzefa Rangwala,George Karypis,Bernie Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Although prior works report gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes do not outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties and data characteristics. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting tasks, and when it does not.
zh

[NLP-72] From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models

【速读】：该论文试图解决现有研究对大型语言模型（Large Language Models, LLMs）在复杂推理过程中的内部机制及输出特征缺乏系统性比较的问题，特别是针对模型的自我反思模式（Self-reflection pattern，也称为“Aha moment”）以及跨领域关联性关注不足。其解决方案的关键在于提出一种新的分析框架，结合关键词统计方法与LLM-as-a-judge范式，将模型的内部思考过程与其最终输出相联系，从而揭示不同模型在推理过程中探索与利用的平衡策略、问题处理方式及结论达成机制。

链接: https://arxiv.org/abs/2506.21609
作者: Junhao Liu,Zhenhao Xu,Yuxin Fang,Yichuan Chen,Zuobin Ying,Wenhan Chang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models’ reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed “Aha moment”) and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keywords statistic and LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. A diverse dataset consists of real-world scenario-based questions covering logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: this https URL
zh

[NLP-73] SysTemp: A Multi-Agent System for Template-Based Generation of SysML v2

【速读】：该论文试图解决复杂系统工程中SysML v2模型自动生成的问题，尤其是在学习语料稀缺和语法复杂的情况下。其解决方案的关键在于提出SysTemp系统，该系统基于多智能体架构，包含一个模板生成器，用于结构化生成过程，从而提升从自然语言规范生成SysML v2模型的质量和效率。

链接: https://arxiv.org/abs/2506.21608
作者: Yasmine Bouamra,Bruno Yun,Alexandre Poisson,Frédéric Armetta
机构: Siemens Digital Industries Software (西门子数字工业软件); University Claude Bernard Lyon 1 (克莱蒙-费朗第一大学); Ecole Centrale de Lyon (里昂中央理工学院); INSA Lyon (里昂国立应用科学学院); Université Lumière Lyon 2 (里昂第二大学); LIRIS, UMR5205 (LIRIS，UMR5205)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The automatic generation of SysML v2 models represents a major challenge in the engineering of complex systems, particularly due to the scarcity of learning corpora and complex syntax. We present SysTemp, a system aimed at facilitating and improving the creation of SysML v2 models from natural language specifications. It is based on a multi-agent system, including a template generator that structures the generation process. We discuss the advantages and challenges of this system through an evaluation, highlighting its potential to improve the quality of the generations in SysML v2 modeling.
zh

[NLP-74] CORE-KG: An LLM -Driven Knowledge Graph Construction Framework for Human Smuggling Networks

【速读】：该论文旨在解决从法律案件文档中构建可解释的知识图谱（Knowledge Graph, KG）所面临的挑战，包括文本的非结构化、词汇密集性以及模糊或变化的指代问题。现有方法依赖静态模板或缺乏共指消解，而基于大语言模型（Large Language Model, LLM）的方法常因幻觉产生噪声和碎片化的图结构。论文提出的CORE-KG框架通过两步流程实现改进：首先利用序列化结构化提示进行类型感知的共指消解，其次借助领域引导指令进行实体和关系抽取，基于改进的GraphRAG框架。其关键在于减少节点重复和法律噪声，从而提升图结构的清晰度与一致性。

链接: https://arxiv.org/abs/2506.21607
作者: Dipak Meher,Carlotta Domeniconi,Guadalupe Correa-Cabrera
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer valuable insights but are unstructured, lexically dense, and filled with ambiguous or shifting references-posing challenges for automated knowledge graph (KG) construction. Existing KG methods often rely on static templates and lack coreference resolution, while recent LLM-based approaches frequently produce noisy, fragmented graphs due to hallucinations, and duplicate nodes caused by a lack of guided extraction. We propose CORE-KG, a modular framework for building interpretable KGs from legal texts. It uses a two-step pipeline: (1) type-aware coreference resolution via sequential, structured LLM prompts, and (2) entity and relationship extraction using domain-guided instructions, built on an adapted GraphRAG framework. CORE-KG reduces node duplication by 33.28%, and legal noise by 38.37% compared to a GraphRAG-based baseline-resulting in cleaner and more coherent graph structures. These improvements make CORE-KG a strong foundation for analyzing complex criminal networks.
zh

[NLP-75] Large Language Models as symbolic DNA of cultural dynamics

【速读】：该论文试图解决如何理解大型语言模型（Large Language Models, LLMs）在人类文化动态中的角色问题，特别是其与人类智能的关系及潜在价值。传统观点将LLMs视为自主智能或简单的程序化模仿，而本文提出一种新的概念化方式，认为LLMs是外部化的信息基质，类似于DNA在人类文化演化中的作用。解决方案的关键在于识别LLMs作为保存人类符号表达压缩模式的存储库，这些模式作为“文化化石”保留了关系残留，但缺乏原始语境，并通过人类的重新解释产生意义，形成递归反馈循环，从而促进人类创造力。论文通过分析压缩、解压、外部化和递归四个普遍特征，论证LLMs在不包含具身人类经验理解的情况下，保存了人类文化的有用规律，其核心价值在于为人类提供自我反思和假设生成的工具，而非替代人类智能。

链接: https://arxiv.org/abs/2506.21606
作者: Parham Pourdavood,Michael Jacob,Terrence Deacon
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 28 pages, 1 figure

点击查看摘要

Abstract:This paper proposes a novel conceptualization of Large Language Models (LLMs) as externalized informational substrates that function analogously to DNA for human cultural dynamics. Rather than viewing LLMs as either autonomous intelligence or mere programmed mimicry, we argue they serve a broader role as repositories that preserve compressed patterns of human symbolic expression–“fossils” of meaningful dynamics that retain relational residues without their original living contexts. Crucially, these compressed patterns only become meaningful through human reinterpretation, creating a recursive feedback loop where they can be recombined and cycle back to ultimately catalyze human creative processes. Through analysis of four universal features–compression, decompression, externalization, and recursion–we demonstrate that just as DNA emerged as a compressed and externalized medium for preserving useful cellular dynamics without containing explicit reference to goal-directed physical processes, LLMs preserve useful regularities of human culture without containing understanding of embodied human experience. Therefore, we argue that LLMs’ significance lies not in rivaling human intelligence, but in providing humanity a tool for self-reflection and playful hypothesis-generation in a low-stakes, simulated environment. This framework positions LLMs as tools for cultural evolvability, enabling humanity to generate novel hypotheses about itself while maintaining the human interpretation necessary to ground these hypotheses in ongoing human aesthetics and norms.
zh

[NLP-76] MemBench: Towards More Comprehensive Evaluation on the Memory of LLM -based Agents ACL2025

【速读】：该论文试图解决LLM-based agents记忆能力评估的挑战，现有评估方法在记忆层次多样性、交互场景复杂性以及综合评价指标方面存在不足。其解决方案的关键在于构建一个更全面的数据集和基准测试框架，即MemBench，该框架将事实性记忆和反思性记忆作为不同的记忆层次，并引入参与和观察作为多种交互场景，从而从有效性、效率和容量等多个方面系统评估LLM-based agents的记忆能力。

链接: https://arxiv.org/abs/2506.21605
作者: Haoran Tan,Zeyu Zhang,Chen Ma,Xu Chen,Quanyu Dai,Zhenhua Dong
机构: Beijing Key Laboratory of Research on Large Models and Intelligent Governance(北京市大模型与智能治理重点实验室); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE(下一代智能搜索与推荐工程研究中心，教育部); Gaoling School of Artificial, Renmin University of China(中国人民大学高瓴人工智能学院); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures. Accepted by ACL 2025 findings

点击查看摘要

Abstract:Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities still remains challenges. Previous evaluations are commonly limited by the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the memory capabilities from multiple aspects. To address these problems, in this paper, we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as various interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at this https URL.
zh

[NLP-77] Operationalizing Automated Essay Scoring: A Human-Aware Approach

【速读】：该论文试图解决自动化作文评分（Automated Essay Scoring, AES）系统在准确性之外的人本操作化问题，重点关注偏差、鲁棒性和可解释性等关键维度。其解决方案的关键在于比较基于机器学习（Machine Learning, ML）的方法与大语言模型（Large Language Models, LLMs）方法，分析它们在性能、可解释性及对边缘分数的鲁棒性方面的优劣势，从而揭示不同方法之间的挑战与权衡，以促进更可靠和可信的AES系统发展。

链接: https://arxiv.org/abs/2506.21603
作者: Yenisel Plasencia-Calaña
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.
zh

[NLP-78] BiMark: Unbiased Multilayer Watermarking for Large Language Models ICML

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）生成文本的真实性问题，特别是如何在保持文本质量的同时实现可检测的水印嵌入。其解决方案的关键在于平衡文本质量保留与信息嵌入能力之间的权衡，通过提出BiMark框架，该框架包含三项关键创新：基于位翻转无偏重加权机制实现模型无关的检测、多层架构提升可检测性而不影响生成质量，以及支持多比特水印的信息编码方法。

链接: https://arxiv.org/abs/2506.21602
作者: Xiaoyan Feng,He Zhang,Yanjun Zhang,Leo Yu Zhang,Shirui Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper is accepted by International Conference on Machine Learning (ICML) 2025

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have raised urgent concerns about LLM-generated text authenticity, prompting regulatory demands for reliable identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. To achieve these goals, the key challenge lies in balancing the trade-off between text quality preservation and message embedding capacity. To address this challenge, we propose BiMark, a novel watermarking framework that achieves these requirements through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art multi-bit watermarking methods, BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.
zh

[NLP-79] Structured Attention Matters to Multimodal LLM s in Document Understanding

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在文档理解任务中的性能瓶颈问题，特别是输入格式对模型理解能力的影响。研究发现，原始OCR文本往往会损害而非提升MLLMs的性能，这一反直觉现象归因于注意力分散和结构信息丢失。解决方案的关键在于提出一种保持结构的文本编码方法，该方法采用LaTeX范式对文档元素进行编码，从而保留层次化组织和空间关系，使模型在文本和视觉内容上产生结构化的注意力模式，提升文档问答任务的性能。

链接: https://arxiv.org/abs/2506.21600
作者: Chang Liu,Hongkai Chen,Yujun Cai,Hang Wu,Qingwen Ye,Ming-Hsuan Yang,Yiwei Wang
机构: vivo Mobile Communication Co., Ltd(维沃移动通信有限公司); The University of Queensland(昆士兰大学); University of California, Merced(加州大学默塞德分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Document understanding remains a significant challenge for multimodal large language models (MLLMs). While previous research has primarily focused on locating evidence pages through precise multimodal queries, our work investigates a fundamental yet overlooked aspect: how input format influences document comprehension performance. Through systematic analysis, we discover that raw OCR text often impairs rather than improves MLLMs’ performance, which is a counterintuitive finding we attribute to attention dispersion and structure loss. To further substantiate our hypothesis, we propose a novel structure-preserving approach that encodes document elements using the LaTex paradigm, maintaining the hierarchical organization and spatial relationships critical for comprehension. Our attention analysis reveals that structured text induces structured attention patterns on both textual and visual content, directing models to focus on semantically meaningful regions while reducing attention waste. This approach significantly enhances MLLMs’ document question answering performance across diverse document types without requiring architectural modifications or additional training.
zh

[NLP-80] Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在医学问答任务中的性能评估问题，特别是针对全科医生（General Practitioner）水平的医学知识理解与应用能力。其解决方案的关键在于构建了一个名为ClinIQLink的共享任务，提供了4,978个经过专家验证、基于医学来源的问答对，并涵盖七种不同的问题格式，以全面测试模型的多方面能力。此外，通过自动化评分系统（Task 1）和医生评审小组（Task 2）相结合的方式，确保了评估的客观性与专业性。

链接: https://arxiv.org/abs/2506.21597
作者: Brandon Colelough,Davis Bartels,Dina Demner-Fushman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:In this paper, we present an overview of ClinIQLink, a shared task, collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically-oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question-answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland’s Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
zh

[NLP-81] Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在处理复杂、长篇的课程内容和精细教育图表时的推理能力不足的问题。其关键解决方案是引入一种轻量级的多模态检索增强生成（Retrieval-Augmented Generation, RAG）管道，该管道将课程中的段落和图表信息整合到提示中，以提升模型在教科书问答任务中的准确性和推理能力。

链接: https://arxiv.org/abs/2506.21596
作者: Hessa A. Alawwad,Anas Zafar,Areej Alhothali,Usman Naseem,Ali Alkhathlan,Amani Jamal
机构: King Abdulaziz University, Jeddah, Saudi Arabia (阿卜杜勒阿齐兹国王大学); National University of Computer and Emerging Sciences, Karachi, Pakistan (国家计算机与新兴科学大学); Macquarie University, Australia (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 7 Pages

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have recently achieved significant success in vision–language tasks. However, their capacity to reason over complex, long lessons and intricate educational diagrams that cannot be represented as a single natural image remains largely untested. In this work, we present the first evaluation of state-of-the-art MLLMs on the textbook question answering (TQA) task using the CK12-QA dataset. We assess the performance of recent vision-language models, including LLaVA and LLaMA 3.2-Vision, across various input configurations. Additionally, we introduce a lightweight multimodal retrieval-augmented generation (RAG) pipeline that integrates both paragraphs and diagrams from the lesson into the prompt. Our results demonstrate the influence of retrieved educational context on model accuracy and reasoning, while also revealing current limitations in handling question-context relationships and the potential for noise, pointing to key directions for future research in multimodal AI-driven learning.
zh

[NLP-82] hunder-LLM : Efficiently Adapting LLM s to Korean with Minimal Resources

【速读】：该论文试图解决现有大语言模型（Large Language Models, LLMs）在非英语或中文语言上的性能不足问题，以及由于商业保密、技术复杂性、文档不一致和伦理考量导致的端到端训练过程透明度不足的问题。解决方案的关键在于提出一种低成本的适应方法，将基于英语的LLM适配到韩语，涵盖了数据收集、预处理、模型训练、下游基准构建和评估的全流程，证明了该方法能够在使用极少数据和计算资源的情况下有效提升模型的多语言能力。

链接: https://arxiv.org/abs/2506.21595
作者: Jinpyo Kim,Gyeongje Cho,Chanwoo Park,Jongwon Park,Jongmin Kim,Yeonkyoun So,Jaejin Lee
机构: Seoul National University(首尔国立大学); Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院)
类目: Computation and Language (cs.CL)
备注: Submitted to ARR 2025 May cycle

点击查看摘要

Abstract:Since state-of-the-art LLMs often underperform in languages other than English or Chinese, improving the capability of LLMs in new languages has become an essential task. Moreover, LLMs’ entire end-to-end training process remains largely unknown to the public due to proprietary reasons, technical complexity, inconsistent documentation, and ethical considerations. The complete picture remains a closely guarded secret within the industry. This paper presents methods to adapt an existing English-based LLM to Korean in a low-budget scenario. We describe the entire end-to-end process: collecting Korean datasets, preprocessing the data, training the model, creating downstream benchmarks, and conducting evaluations. The evaluation results indicate that our method can effectively and cost-efficiently add new language capabilities to existing LLMs. Our new bilingual models, Thunder-LLM and Thunder-LLM-Ins, achieve superior Korean performance compared to state-of-the-art models while utilizing minimal data and computational resources. We share our comprehensive experience and make the code publicly available.
zh

[NLP-83] Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

【速读】：该论文旨在解决专业领域（如医学）中语言模型在推理能力与可解释性之间的平衡问题，特别是在保持高性能的同时提供透明、结构化的临床决策解释。其解决方案的关键在于提出一种新颖的两阶段训练流程：首先通过监督微调结合精心构建的合成医学推理数据集，并采用参数高效技术如Weight-Decomposed Low-Rank Adaptation (DoRA)和Rank-Stabilized LoRA (rsLoRA)提升模型的结构化临床思维能力；其次利用Group Relative Policy Optimization (GRPO)进行强化学习，结合多组件奖励系统优化模型的准确性、格式合规性和推理质量。这一方法使中等规模模型在医学基准测试中超越了显著更大的模型。

链接: https://arxiv.org/abs/2506.21594
作者: Ahmed M. Adly,Mostafa Samy,Amr Fawzy
机构: TachyHealth(塔奇健康)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built upon Qwen3 32B, our model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline: first, supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using Group Relative Policy Optimization (GRPO) with a sophisticated multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 achieves exceptional performance across medical benchmarks, scoring 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond its strong empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including issues with reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.
zh

[NLP-84] SignBart – New approach with the skeleton sequence for Isolated Sign language Recognition

【速读】：该论文旨在解决手语识别（Sign Language Recognition, SLR）中传统模型在处理骨骼序列的x和y坐标时无法独立提取有意义信息的问题，这一问题限制了模型的准确性和效率。其解决方案的关键在于采用BART架构的编码器-解码器结构，独立编码x和y坐标，并通过Cross-Attention机制保持两者之间的关联性，从而在减少参数量的同时显著提升识别准确率。

链接: https://arxiv.org/abs/2506.21592
作者: Tinh Nguyen,Minh Khue Phan Tran
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sign language recognition is crucial for individuals with hearing impairments to break communication barriers. However, previous approaches have had to choose between efficiency and accuracy. Such as RNNs, LSTMs, and GCNs, had problems with vanishing gradients and high computational costs. Despite improving performance, transformer-based methods were not commonly used. This study presents a new novel SLR approach that overcomes the challenge of independently extracting meaningful information from the x and y coordinates of skeleton sequences, which traditional models often treat as inseparable. By utilizing an encoder-decoder of BART architecture, the model independently encodes the x and y coordinates, while Cross-Attention ensures their interrelation is maintained. With only 749,888 parameters, the model achieves 96.04% accuracy on the LSA-64 dataset, significantly outperforming previous models with over one million parameters. The model also demonstrates excellent performance and generalization across WLASL and ASL-Citizen datasets. Ablation studies underscore the importance of coordinate projection, normalization, and using multiple skeleton components for boosting model efficacy. This study offers a reliable and effective approach for sign language recognition, with strong potential for enhancing accessibility tools for the deaf and hard of hearing.
zh

[NLP-85] FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models Knowledge and Reasoning EMNLP2025

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在复杂金融推理任务中表现不足的问题，特别是现有评估基准未能有效分离模型的知识和推理能力指标，并缺乏对任务失败的根本原因分析。其解决方案的关键在于提出FinEval-KR评估框架，通过独立解耦和量化LLMs的知识与推理能力，引入知识得分和推理得分两种指标，并受认知科学启发，基于布鲁姆分类法（Bloom’s taxonomy）提出认知得分，以分析不同认知层级下的推理能力。此外，研究还发布了一个涵盖22个子领域的开源中文金融推理数据集，以支持可复现的研究和金融推理领域的进一步发展。

链接: https://arxiv.org/abs/2506.21591
作者: Shaoyu Dou,Yutian Shen,Mofan Chen,Zixuan Wang,Jiajie Xu,Qi Guo,Kailai Shao,Chao Chen,Haixiang Hu,Haibo Shi,Min Min,Liwen Zhang
机构: Ant Group(蚂蚁集团); Shanghai University of Finance and Economics(上海财经大学)
类目: Computation and Language (cs.CL)
备注: Submitted to EMNLP 2025, 27 pages, 20 figures

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and sophisticated reasoning. Current evaluation benchmarks often fall short by not decoupling these capabilities indicators from single task performance and lack root cause analysis for task failure. To address this, we introduce FinEval-KR, a novel evaluation framework for decoupling and quantifying LLMs’ knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics. Inspired by cognitive science, we further propose a cognitive score based on Bloom’s taxonomy to analyze capabilities in reasoning tasks across different cognitive levels. We also release a new open-source Chinese financial reasoning dataset covering 22 subfields to support reproducible research and further advancements in financial reasoning. Our experimental results reveal that LLM reasoning ability and higher-order cognitive ability are the core factors influencing reasoning accuracy. We also specifically find that even top models still face a bottleneck with knowledge application. Furthermore, our analysis shows that specialized financial LLMs generally lag behind the top general large models across multiple metrics.
zh

[NLP-86] Representation Consistency for Accurate and Coherent LLM Answer Aggregation

【速读】：该论文试图解决在推理阶段通过增加计算预算来提升大语言模型（Large Language Models, LLMs）性能的问题，现有方法通常需要复杂的提示和采样策略调整。其解决方案的关键在于引入表示一致性（Representation Consistency, RC），该方法通过聚合多个候选响应的答案，无论这些响应是如何生成的，包括不同的提示表述和采样策略。RC通过考虑每个答案在候选响应集中的出现次数以及生成这些响应时模型内部激活的一致性来增强答案聚合，激活可以是密集的（原始模型激活）或稀疏的（通过预训练稀疏自编码器编码）。该方法仅使用缓存的激活和轻量级相似性计算，无需额外模型查询，实验表明其在多个开源LLMs和推理数据集上均能有效提升任务性能。

链接: https://arxiv.org/abs/2506.21590
作者: Junqi Jiang,Tom Bewley,Salim I. Amoukou,Francesco Leofante,Antonio Rago,Saumitra Mishra,Francesca Toni
机构: Imperial College London (帝国理工学院); J.P. Morgan AI Research (摩根大通人工智能研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Test-time scaling improves large language models’ (LLMs) performance by allocating more compute budget during inference. To achieve this, existing methods often require intricate modifications to prompting and sampling strategies. In this work, we introduce representation consistency (RC), a test-time scaling method for aggregating answers drawn from multiple candidate responses of an LLM regardless of how they were generated, including variations in prompt phrasing and sampling strategy. RC enhances answer aggregation by not only considering the number of occurrences of each answer in the candidate response set, but also the consistency of the model’s internal activations while generating the set of responses leading to each answer. These activations can be either dense (raw model activations) or sparse (encoded via pretrained sparse autoencoders). Our rationale is that if the model’s representations of multiple responses converging on the same answer are highly variable, this answer is more likely to be the result of incoherent reasoning and should be down-weighted during aggregation. Importantly, our method only uses cached activations and lightweight similarity computations and requires no additional model queries. Through experiments with four open-source LLMs and four reasoning datasets, we validate the effectiveness of RC for improving task performance during inference, with consistent accuracy improvements (up to 4%) over strong test-time scaling baselines. We also show that consistency in the sparse activation signals aligns well with the common notion of coherent reasoning.
zh

[NLP-87] A General Method for Detecting Information Generated by Large Language Models

【速读】：该论文试图解决当前检测方法在面对未见过的生成式 AI (Generative AI) 模型和领域时泛化能力不足的问题，这一问题限制了其在现实场景中的有效性。解决方案的关键在于提出一种通用的生成式 AI 检测器（General LLM Detector, GLD），其核心设计包括双记忆网络结构和基于理论引导的检测泛化模块，从而实现对未知模型和领域生成内容的有效识别。

链接: https://arxiv.org/abs/2506.21589
作者: Minjia Mao,Dongjun Wei,Xiao Fang,Michael Chau
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The proliferation of large language models (LLMs) has significantly transformed the digital information landscape, making it increasingly challenging to distinguish between human-written and LLM-generated content. Detecting LLM-generated information is essential for preserving trust on digital platforms (e.g., social media and e-commerce sites) and preventing the spread of misinformation, a topic that has garnered significant attention in IS research. However, current detection methods, which primarily focus on identifying content generated by specific LLMs in known domains, face challenges in generalizing to new (i.e., unseen) LLMs and domains. This limitation reduces their effectiveness in real-world applications, where the number of LLMs is rapidly multiplying and content spans a vast array of domains. In response, we introduce a general LLM detector (GLD) that combines a twin memory networks design and a theory-guided detection generalization module to detect LLM-generated information across unseen LLMs and domains. Using real-world datasets, we conduct extensive empirical evaluations and case studies to demonstrate the superiority of GLD over state-of-the-art detection methods. The study has important academic and practical implications for digital platforms and LLMs.
zh

[NLP-88] Understanding Verbatim Memorization in LLM s Through Circuit Discovery ACL2025

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中记忆机制的底层原理问题，特别是模型如何决定生成与训练数据完全相同的文本片段（即verbatim reproduction）。其解决方案的关键在于从机制可解释性的角度出发，利用Transformer电路——模型中执行特定功能的最小计算子图——来分析和识别模型在生成过程中偏离记忆内容的节点，并分离出负责记忆启动与维持的特定电路。研究发现，记忆启动电路不仅能够触发记忆行为，还能维持记忆过程，而仅能维持记忆的电路无法启动记忆。

链接: https://arxiv.org/abs/2506.21588
作者: Ilya Lasy,Peter Knees,Stefan Woltran
机构: TU Wien (维也纳技术大学)
类目: Computation and Language (cs.CL)
备注: The First Workshop on Large Language Model Memorization @ ACL 2025, Vienna, August 1st, 2025

点击查看摘要

Abstract:Underlying mechanisms of memorization in LLMs – the verbatim reproduction of training data – remain poorly understood. What exact part of the network decides to retrieve a token that we would consider as start of memorization sequence? How exactly is the models’ behaviour different when producing memorized sentence vs non-memorized? In this work we approach these questions from mechanistic interpretability standpoint by utilizing transformer circuits – the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization prevention mechanisms transfer robustly across different text domains, while memorization induction appears more context-dependent.
zh

[NLP-89] Is DeepSeek a New Voice Among LLM s in Public Opinion Simulation?

【速读】：该论文试图解决生成式 AI (Generative AI) 在模拟不同国家公众意见方面的准确性和偏差问题，特别是比较开源模型 DeepSeek 与主流科技公司开发的大型语言模型（LLM）在预测中美两国社会议题公众意见上的表现。解决方案的关键在于通过对比多个 LLM 在特定数据集上的预测结果，识别其在不同文化背景和人口统计群体中的模拟能力差异，并揭示其普遍存在的过度概括单一观点的倾向，从而提出改进方向，如采用更具包容性的训练方法以减少文化与人口偏差。

链接: https://arxiv.org/abs/2506.21587
作者: Weihong Qi,Fan Huang,Jisun An,Haewoon Kwak
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study evaluates the ability of DeepSeek, an open-source large language model (LLM), to simulate public opinions in comparison to LLMs developed by major tech companies. By comparing DeepSeek-R1 and DeepSeek-V3 with Qwen2.5, GPT-4o, and Llama-3.3 and utilizing survey data from the American National Election Studies (ANES) and the Zuobiao dataset of China, we assess these models’ capacity to predict public opinions on social issues in both China and the United States, highlighting their comparative capabilities between countries. Our findings indicate that DeepSeek-V3 performs best in simulating U.S. opinions on the abortion issue compared to other topics such as climate change, gun control, immigration, and services for same-sex couples, primarily because it more accurately simulates responses when provided with Democratic or liberal personas. For Chinese samples, DeepSeek-V3 performs best in simulating opinions on foreign aid and individualism but shows limitations in modeling views on capitalism, particularly failing to capture the stances of low-income and non-college-educated individuals. It does not exhibit significant differences from other models in simulating opinions on traditionalism and the free market. Further analysis reveals that all LLMs exhibit the tendency to overgeneralize a single perspective within demographic groups, often defaulting to consistent responses within groups. These findings highlight the need to mitigate cultural and demographic biases in LLM-driven public opinion modeling, calling for approaches such as more inclusive training methodologies.
zh

[NLP-90] Can Vision Language Models Understand Mimed Actions? ACL2025

【速读】：该论文试图解决视觉-语言模型在理解和解释非言语交流（Nonverbal Communication, NVC）中细微方面的能力不足问题，特别是针对通过手势、表情和动作表达意图的哑剧（mime）行为的识别与理解。解决方案的关键在于提出一个名为Mime Identification Multimodal Evaluation (MIME)的视频问答基准，该基准包含86种哑剧动作，并利用运动捕捉数据构建了具有角色、背景和视角扰动的多样化样本，以评估模型在不同条件下的鲁棒性识别能力。

链接: https://arxiv.org/abs/2506.21586
作者: Hyundong Cho,Spencer Lin,Tejas Srinivasan,Michael Saxon,Deuksin Kwon,Natali T. Chavez,Jonathan May
机构: University of Southern California (南加州大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ACL 2025 Findings

点击查看摘要

Abstract:Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime – the theatrical technique of suggesting intent using only gesture, expression, and movement – is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC. Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising of 86 mimed actions. Constructed with motion capture data, MIME consists of variations of each action with perturbations applied to the character, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans on MIME, motivating the need for increased research for instilling more robust understanding of human gestures.
zh

[NLP-91] Evaluation of LLM -based Strategies for the Extraction of Food Product Information from Online Shops

【速读】：该论文旨在解决从在线零售商的食品产品页面中自动化提取结构化信息的问题，特别是针对关键产品属性如成分列表和营养表的提取。其解决方案的关键在于比较两种基于大型语言模型（LLMs）的提取方法：直接提取与通过生成函数进行的间接提取，并发现间接提取虽然在准确性上略有下降（96.48%，比直接提取低1.61%），但能显著减少所需的LLM调用次数（降低95.82%），从而实现更高的效率和更低的运营成本。

链接: https://arxiv.org/abs/2506.21585
作者: Christoph Brosch,Sian Brumm,Rolf Krieger,Jonas Scheffler
机构: Institute for Software Systems, University of Applied Sciences Trier (软件系统研究所，特里尔应用科学大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Preprint for paper presented at DATA 2025 in Bilbao, Spain. Corrected -2.27 to -1.61 in abstract and +2.27 to +1.61 in discussion. Reference to journal and publication will follow

点击查看摘要

Abstract:Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48%, -1.61% compared to direct extraction), it reduces the number of required LLM calls by 95.82%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.
zh

[NLP-92] Empirical Evidence for Alignment Faking in Small LLM s and Prompt-Based Mitigation Techniques

【速读】：该论文试图解决大型语言模型中存在的一种隐性对齐假象（alignment faking）问题，即模型在表面上表现出符合伦理的行为，但实际上可能并未真正理解或内化这些伦理原则。其解决方案的关键在于证明通过仅修改输入提示（prompt-only interventions），如义务论道德框架和思维链推理，可以有效减少这种行为，而无需改变模型内部结构。这一发现挑战了传统观点，即基于提示的伦理机制是微不足道的，且欺骗性对齐仅在大规模模型中出现。

链接: https://arxiv.org/abs/2506.21584
作者: J. Koorndijk
机构: Seraphion Technology (Seraphion Technology)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.
zh

[NLP-93] Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing

【速读】：该论文试图解决在低资源、非正式语言变体中，如混合拼写的罗马乌尔都语（Roman Urdu）中进行希望话语（hope speech）检测的问题，这一领域在自然语言处理（NLP）中长期被忽视。其关键解决方案是引入了一个精心标注的多类别希望话语数据集，并提出了一种基于自定义注意力机制的Transformer模型，该模型针对罗马乌尔都语的句法和语义多样性进行了优化，从而在5折交叉验证中取得了0.78的最优性能，优于基准支持向量机（SVM）和双向长短期记忆网络（BiLSTM）。

链接: https://arxiv.org/abs/2506.21583
作者: Muhammad Ahmad,Muhammad Waqas,Ameer Hamza,Ildar Batyrshin,Grigori Sidorov
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the baseline SVM (0.75) and BiLSTM (0.76), with gains of 4% and 2.63% respectively.
zh

[NLP-94] VIDEE: Visual and Interactive Decomposition Execution and Evaluation of Text Analytics with Intelligent Agents

【速读】：该论文试图解决传统文本分析对入门级分析师而言存在技术门槛的问题，即需要具备自然语言处理（Natural Language Processing, NLP）或文本分析的专业知识。其解决方案的关键在于引入VIDEE系统，该系统通过智能代理支持人机协作流程，包括分解、执行和评估三个阶段，从而实现更易用和自动化的高级文本分析。

链接: https://arxiv.org/abs/2506.21582
作者: Sam Yu-Te Lee,Chengyang Ji,Shicheng Wen,Lifu Huang,Dongyi Liu,Kwan-Liu Ma
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE’s effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience – from none to expert – demonstrates the system’s usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.
zh

[NLP-95] From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在通用推理能力与领域特定推理任务表现之间的关联性问题。解决方案的关键在于探索LLMs的通用推理能力如何影响其在具体领域任务中的表现，从而揭示模型推理能力的泛化潜力及其对实际应用的影响。

链接: https://arxiv.org/abs/2506.21580
作者: Dana Alsagheer,Yang Lu,Abdulrahman Kamal,Omar Kamal,Mohammad Kamal,Nada Mansour,Cosmo Yang Wu,Rambiba Karanjai,Sen Li,Weidong Shi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. However, effective decision-making relies heavily on strong reasoning abilities. Reasoning is the foundation for decision-making, providing the analytical and logical framework to make sound choices. Reasoning involves analyzing information, drawing inferences, and reaching conclusions based on logic or evidence. Decision-making builds on this foundation by applying the insights from reasoning to select the best course of action among alternatives. Together, these processes create a continuous cycle of thought and action aimed at achieving goals effectively. As AI technology evolves, there is a growing trend to train LLMs to excel in general reasoning. This study explores how the general reasoning capabilities of LLMs connect to their performance in domain-specific reasoning tasks.
zh

[NLP-96] HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

【速读】：该论文试图解决当前大型语言模型（Large Language Models, LLMs）在医疗领域评估中存在偏差的问题，即现有的评估主要依赖于以医生为中心的英文基准，忽略了患者护理中的多专业协作特性。其解决方案的关键是引入HealthQA-BR，这是首个针对葡萄牙语地区的大型、系统性基准测试，涵盖巴西国家认证和住院医师考试中的5,632道题目，不仅评估医学及其专科知识，还覆盖护理、牙科、心理学、社会工作等其他医疗相关专业，从而提供更全面和现实的评估视角。

链接: https://arxiv.org/abs/2506.21578
作者: Andrew Maranhão Ventura D’addario
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The evaluation of Large Language Models (LLMs) in healthcare has been dominated by physician-centric, English-language benchmarks, creating a dangerous illusion of competence that ignores the interprofessional nature of patient care. To provide a more holistic and realistic assessment, we introduce HealthQA-BR, the first large-scale, system-wide benchmark for Portuguese-speaking healthcare. Comprising 5,632 questions from Brazil’s national licensing and residency exams, it uniquely assesses knowledge not only in medicine and its specialties but also in nursing, dentistry, psychology, social work, and other allied health professions. We conducted a rigorous zero-shot evaluation of over 20 leading LLMs. Our results reveal that while state-of-the-art models like GPT 4.1 achieve high overall accuracy (86.6%), this top-line score masks alarming, previously unmeasured deficiencies. A granular analysis shows performance plummets from near-perfect in specialties like Ophthalmology (98.7%) to barely passing in Neurosurgery (60.0%) and, most notably, Social Work (68.4%). This “spiky” knowledge profile is a systemic issue observed across all models, demonstrating that high-level scores are insufficient for safety validation. By publicly releasing HealthQA-BR and our evaluation suite, we provide a crucial tool to move beyond single-score evaluations and toward a more honest, granular audit of AI readiness for the entire healthcare team.
zh

[NLP-97] Language-Aware Prompt Tuning for Parameter-Efficient Seamless Language Expansion in Multilingual ASR INTERSPEECH2025

【速读】：该论文旨在解决多语言自动语音识别（Multilingual Automatic Speech Recognition, ASR）中语言干扰以及在不降低性能的情况下扩展到未见过的语言（language expansion）的问题。其解决方案的关键在于提出三种创新方法：1） Entire Soft Prompt Tuning (Entire SPT)，通过在编码器和解码器中应用软提示来增强特征提取和解码；2） Language-Aware Prompt Tuning (LAPT)，利用跨语言相似性通过轻量级提示矩阵编码共享和语言特定特征；3） SPT-Whisper，一个将SPT集成到Whisper中的工具包，支持高效的持续学习。实验结果表明，Entire SPT和LAPT在语言扩展任务中分别比Decoder SPT提升了5.0%和16.0%，为动态多语言ASR模型提供了计算开销最小的高效解决方案。

链接: https://arxiv.org/abs/2506.21577
作者: Hongli Yang,Sheng Li,Hao Huang,Ayiduosi Tuohan,Yizhou Peng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Recent advancements in multilingual automatic speech recognition (ASR) have been driven by large-scale end-to-end models like Whisper. However, challenges such as language interference and expanding to unseen languages (language expansion) without degrading performance persist. This paper addresses these with three contributions: 1) Entire Soft Prompt Tuning (Entire SPT), which applies soft prompts to both the encoder and decoder, enhancing feature extraction and decoding; 2) Language-Aware Prompt Tuning (LAPT), which leverages cross-lingual similarities to encode shared and language-specific features using lightweight prompt matrices; 3) SPT-Whisper, a toolkit that integrates SPT into Whisper and enables efficient continual learning. Experiments across three languages from FLEURS demonstrate that Entire SPT and LAPT outperform Decoder SPT by 5.0% and 16.0% in language expansion tasks, respectively, providing an efficient solution for dynamic, multilingual ASR models with minimal computational overhead.
zh

[NLP-98] Adapting Whisper for Parameter-efficient Code-Switching Speech Recognition via Soft Prompt Tuning INTERSPEECH2025

【速读】：该论文旨在解决大规模多语言自动语音识别（ASR）模型在低资源场景下的性能问题，特别是针对罕见语言和代码切换（code-switching, CS）任务的挑战。其解决方案的关键在于采用参数高效的软提示微调（Soft Prompt Tuning, SPT）方法，在保持原有模型知识的同时提升代码切换场景下的识别效果。研究通过对比全量微调（FFT）与仅训练软提示的策略，并引入SPT4ASR框架，验证了深度提示微调的有效性，实现了在保持参数效率的同时显著降低错误率。

链接: https://arxiv.org/abs/2506.21576
作者: Hongli Yang,Yizhou Peng,Hao Huang,Sheng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2025

点击查看摘要

Abstract:Large-scale multilingual ASR models like Whisper excel in high-resource settings but face challenges in low-resource scenarios, such as rare languages and code-switching (CS), due to computational costs and catastrophic forgetting. We explore Soft Prompt Tuning (SPT), a parameter-efficient method to enhance CS ASR while preserving prior knowledge. We evaluate two strategies: (1) full fine-tuning (FFT) of both soft prompts and the entire Whisper model, demonstrating improved cross-lingual capabilities compared to traditional methods, and (2) adhering to SPT’s original design by freezing model parameters and only training soft prompts. Additionally, we introduce SPT4ASR, a combination of different SPT variants. Experiments on the SEAME and ASRU2019 datasets show that deep prompt tuning is the most effective SPT approach, and our SPT4ASR methods achieve further error reductions in CS ASR, maintaining parameter efficiency similar to LoRA, without degrading performance on existing languages.
zh

[NLP-99] STRuCT-LLM : Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing

【速读】：该论文试图解决如何在关系型数据和图结构数据上统一训练大型语言模型（Large Language Models, LLMs）以实现结构化推理的问题。其解决方案的关键在于提出STRuCT-LLM框架，该框架通过联合优化文本到SQL（Text-to-SQL）和文本到Cypher（Text-to-Cypher）任务，并结合强化学习（Reinforcement Learning, RL）与思维链（Chain-of-Thought, CoT）监督，实现对两种不同数据结构的跨形式迁移学习。此外，引入基于图编辑距离的拓扑感知奖励函数，以支持图解析中的细粒度优化，从而提升模型在结构化查询生成及下游任务中的性能。

链接: https://arxiv.org/abs/2506.21575
作者: Josefa Lia Stoisser,Marc Boubnovski Martell,Lawrence Phillips,Casper Hansen,Julien Fauqueur
机构: Novo Nordisk (诺和诺德)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa - even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5% and Text2Cypher by 73.1%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5%) and knowledge graph QA (CR-LT-KGQA: 1.7%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at this https URL).
zh

[NLP-100] Digital Gatekeepers: Exploring Large Language Models Role in Immigration Decisions

【速读】：该论文试图解决全球化和移民人口增加背景下，移民部门面临的巨大工作负荷以及确保决策公平性的挑战。解决方案的关键在于利用生成式 AI (Generative AI) 的潜力，特别是大型语言模型 (Large Language Models, LLMs)，如 GPT-3.5 和 GPT-4，以支持移民决策过程。研究通过混合方法，结合离散选择实验和深度访谈，分析了 LLM 的决策策略及其公平性，揭示了其在效用最大化和程序公平性方面与人类策略的一致性，同时也指出了其在国籍相关的刻板印象和对特权群体的偏好方面的局限性。

链接: https://arxiv.org/abs/2506.21574
作者: Yicheng Mao,Yang Zhao
机构: Maastricht University (马斯特里赫特大学); University of Antwerp (安特卫普大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With globalization and increasing immigrant populations, immigration departments face significant work-loads and the challenge of ensuring fairness in decision-making processes. Integrating artificial intelligence offers a promising solution to these challenges. This study investigates the potential of large language models (LLMs),such as GPT-3.5 and GPT-4, in supporting immigration decision-making. Utilizing a mixed-methods approach,this paper conducted discrete choice experiments and in-depth interviews to study LLM decision-making strategies and whether they are fair. Our findings demonstrate that LLMs can align their decision-making with human strategies, emphasizing utility maximization and procedural fairness. Meanwhile, this paper also reveals that while ChatGPT has safeguards to prevent unintentional discrimination, it still exhibits stereotypes and biases concerning nationality and shows preferences toward privileged group. This dual analysis highlights both the potential and limitations of LLMs in automating and enhancing immigration decisions.
zh

[NLP-101] Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLM s

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在复杂和多样化任务中充分发挥潜力时的指令优化问题。传统方法依赖于白盒（white-box）方法，需要大量计算资源且表达能力有限，而黑盒（black-box）方法则可能带来高昂的经济成本。论文提出的解决方案关键在于融合黑盒与白盒方法的优势：黑盒模型提供高质量、多样化的指令初始化，白盒模型通过隐藏状态和输出特征实现细粒度的可解释性，并通过语义相似性约束将两者整合为统一的高维表示，从而捕捉深层次的语义和结构特征，实现指令质量与适应性的迭代优化。

链接: https://arxiv.org/abs/2506.21573
作者: Yanwei Ren,Liu Liu,Baosheng Yu,Jiayan Qiu,Quan Chen
机构: Beihang University (北京航空航天大学); Nanyang Technological University (南洋理工大学); University of Leicester (莱斯特大学); Kuaishou Technology (快手科技)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Optimizing instructions for large language models (LLMs) is critical for harnessing their full potential in complex and diverse tasks. However, relying solely on white-box approaches demands extensive computational resources and offers limited representational capacity, while black-box models can incur prohibitive financial costs. To address these challenges, we introduce a novel framework that seamlessly merges the strengths of both paradigms. Black-box models provide high-quality, diverse instruction initializations, and white-box models supply fine-grained interpretability through hidden states and output features. By enforcing a semantic similarity constraint, these components fuse into a unified high-dimensional representation that captures deep semantic and structural nuances, enabling an iterative optimization process to refine instruction quality and adaptability. Extensive evaluations across a broad spectrum of tasks-ranging from complex reasoning to cross-lingual generalization-demonstrate that our approach consistently outperforms state-of-the-art baselines. This fusion of black-box initialization with advanced semantic refinement yields a scalable and efficient solution, paving the way for next-generation LLM-driven applications in diverse real-world scenarios. The source code will be released soon.
zh

[NLP-102] Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）评估中缺乏结构化、可解释且具有理论基础的基准设计问题。现有基准通常采用基于启发式的任务分组，导致能力重叠、指标冗余和诊断能力有限。解决方案的关键在于提出一种基于结构方程模型（Structural Equation Modeling, SEM）的框架，用于分析和量化基准组件的内部有效性、维度分离性和贡献度，并引入基于皮亚杰认知发展理论的能力层次结构，将MLLM能力划分为感知、记忆和推理三个层级。该方法通过重新组织现有基准并构建新的基准Gold，提升了评估的可解释性、减少了指标冗余，并增强了认知一致性。

链接: https://arxiv.org/abs/2506.21572
作者: Tianyu.Zou,Shengwu.Xiong,Ruilin.Yao,Jirui.Huang,Yi.Rong,Yaxiong.Chen,Shili.Xiong,Cong.Wang
机构: Wuhan University of Technology (武汉理工大学); Northwestern Polytechnical University (西北工业大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. Existing benchmarks often adopt heuristic-based task groupings with unclear cognitive targets, thus resulting in overlapping abilities, redundant indicators, and limited diagnostic power. In this work, we propose a novel framework for aligning MLLM benchmark based on Structural Equation Modeling (SEM) to analyze and quantify the internal validity, dimensional separability, and contribution of benchmark components. Motivated by the observed limitations of current designs, we further introduce a novel capability hierarchy grounded in Piagets theory of cognitive development, dividing MLLM abilities into three hierarchical layers, i.e., Perception, Memory, and Reasoning. We reorganize existing MLLM benchmarks under the proposed framework and construct a new benchmark named Gold. Experimental results demonstrate that the proposed benchmark exhibits stronger interpretability, reduced indicator redundancy, and clearer cognitive consistency compared to existing approaches.
zh

[NLP-103] owards Understanding the Cognitive Habits of Large Reasoning Models

【速读】：该论文试图解决如何评估大型推理模型（Large Reasoning Models, LRMs）是否展现出类似人类的认知习惯问题，从而深入理解其行为模式及潜在的不当表现。解决方案的关键在于引入CogTest，这是一个基于“思维习惯”（Habits of Mind）框架设计的系统性基准测试，包含16种认知习惯，每种习惯均通过25个多样化任务进行实例化，并采用证据优先的提取方法以确保习惯识别的可靠性。通过该基准，研究者能够全面评估LRMs在不同任务中的适应性行为模式，揭示其与传统大语言模型（LLMs）之间的差异，并进一步探索与安全相关任务的关联性。

链接: https://arxiv.org/abs/2506.21571
作者: Jianshuo Dong,Yujia Fu,Chuanrui Hu,Chao Zhang,Han Qiu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns – e.g., ``Wait, did I miss anything?‘’ – consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs’ cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs’ cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs’ CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: this https URL.
zh

[NLP-104] Random Initialization Cant Catch Up: The Advantage of Language Model Transfer for Time Series Forecasting

【速读】：该论文试图解决在低数据条件下，如何有效将预训练语言模型（Language Models, LMs）迁移至时间序列预测的问题。其解决方案的关键在于分析不同设计选择（包括上游后训练、时间序列分词器和语言骨干网络规模）对模型性能的影响，并发现这些设计选择在低数据环境下对验证损失有显著影响，其中一些明确的选择优于其他选项。此外，研究还揭示了预训练语言模型的验证损失在随机初始化模型收敛后仍能持续下降，导致跨不同设计选择的迁移差距持续存在。

链接: https://arxiv.org/abs/2506.21570
作者: Roland Riachi,Kashif Rasul,Arjun Ashok,Prateek Humane,Alexis Roger,Andrew R. Williams,Yuriy Nevmyvaka,Irina Rish
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent works have demonstrated the effectiveness of adapting pre-trained language models (LMs) for forecasting time series in the low-data regime. We build upon these findings by analyzing the effective transfer from language models to time series forecasting under various design choices including upstream post-training, time series tokenizer and language backbone size. In the low-data regime, these design choices have a significant impact on the validation loss, with clear-cut choices that outperform others. Contrary to Hernandez et al. (2021), we observe that the validation loss of the LMs continues to smoothly decrease long after the validation loss of the randomly initialized models has converged, leading to a non-vanishing transfer gap that holds across design choices. These findings not only help shed light on the effective use of compute-efficient training for time series, but also open the way for the study of modality-agnostic properties of data distributions leveraged by these models.
zh

[NLP-105] Hybrid-NL2SVA: Integrating RAG and Finetuning for LLM -based NL2SVA

【速读】：该论文旨在解决自然语言到SystemVerilog断言（NL2SVA）的自动转换问题，该过程传统上依赖人工编写，存在劳动强度大和易出错的缺陷。现有模型在理解领域特定的语法和语义方面仍存在不足，因此论文提出了一种定制化的检索增强生成（RAG）框架以及一个合成微调数据集，以提升大型语言模型（LLM）在NL2SVA任务中的性能。其关键在于通过提示引导的解释，教授LLM逐层构建并发SVAs的过程，从而实现监督微调，显著提高语法和功能准确性。

链接: https://arxiv.org/abs/2506.21569
作者: Weihua Xiao,Derek Ekberg,Siddharth Garg,Ramesh Karri
机构: NYU Tandon School of Engineering (纽约大学坦登工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SystemVerilog Assertions (SVAs) are critical for verifying the correctness of hardware designs, but manually writing them from natural language property descriptions, i.e., NL2SVA, remains a labor-intensive and error-prone task. Recent advances in large language models (LLMs) offer opportunities to automate this translation. However, existing models still struggle with understanding domain-specific syntax and semantics. To enhance LLM performance in NL2SVA, we propose a customized retrieval-augmented generation (RAG) framework and a synthetic fine-tuning dataset that together improve LLM’s performance. To further improve lightweight models over NL2SVA, our fine-tuning dataset provides prompt-guided explanations that teach LLMs the layer-by-layer construction process of concurrent SVAs, enabling supervised fine-tuning that greatly improves syntax and functionality accuracy. To evaluate the performance of LLMs over NL2SVA, we construct the largest evaluation dataset for NL2SVA, comprising 40 Verilog designs and 229 formally verified SVAs with detailed annotations. Experimental results show that our customized RAG framework increases the number of functionality matched SVAs by 58.42% over GPT-4o-mini, while Qwen2.5-Coder-7B-Instruct fine-tuned on our fine-tuning dataset and integrated with HybridRetrieval achieves a 59.05% over the base Qwen model.
zh

[NLP-106] Assessing RAG and HyDE on 1B vs. 4B-Parameter Gemma LLM s for Personal Assistants Integretion

【速读】：该论文旨在解决在边缘计算和隐私敏感应用场景中部署大型语言模型（Large Language Models, LLMs）时面临的资源效率问题。其解决方案的关键在于评估两种增强策略——检索增强生成（Retrieval-Augmented Generation, RAG）和假设文档嵌入（Hypothetical Document Embeddings, HyDE）在小型Gemma LLM上的效果，通过结合短期记忆（MongoDB）与长期语义存储（Qdrant），并利用FastAPI和LangChain进行系统集成，以提升响应效率与准确性。研究发现，RAG在降低延迟和消除事实性幻觉方面表现优异，而HyDE虽提升了语义相关性，但带来了更高的响应时间和幻觉率。

链接: https://arxiv.org/abs/2506.21568
作者: Andrejs Sorstkins
机构: 未知
类目: Computation and Language (cs.CL)
备注: Technical report as part of research project

点击查看摘要

Abstract:Resource efficiency is a critical barrier to deploying large language models (LLMs) in edge and privacy-sensitive applications. This study evaluates the efficacy of two augmentation strategies–Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE)–on compact Gemma LLMs of 1 billion and 4 billion parameters, within the context of a privacy-first personal assistant. We implement short-term memory via MongoDB and long-term semantic storage via Qdrant, orchestrated through FastAPI and LangChain, and expose the system through a this http URL frontend. Across both model scales, RAG consistently reduces latency by up to 17% and eliminates factual hallucinations when responding to user-specific and domain-specific queries. HyDE, by contrast, enhances semantic relevance–particularly for complex physics prompts–but incurs a 25–40% increase in response time and a non-negligible hallucination rate in personal-data retrieval. Comparing 1 B to 4 B models, we observe that scaling yields marginal throughput gains for baseline and RAG pipelines, but magnifies HyDE’s computational overhead and variability. Our findings position RAG as the pragmatic choice for on-device personal assistants powered by small-scale LLMs.
zh

[NLP-107] BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生物信息学任务中的能力评估问题，特别是针对波斯语医学问答（Persian Medical QA）场景下的知识获取、理解与推理能力。其解决方案的关键在于提出了一种名为BioPars的简洁但准确的评估指标，用于衡量LLMs在三个核心能力上的表现：获取专业领域知识、解释与综合知识以及展示恰当证据。此外，研究还构建了BIOPARS-BENCH数据集和BioParsQA数据集，以支持对模型性能的系统评估，并通过多种评价指标验证了所提方法的有效性。

链接: https://arxiv.org/abs/2506.21567
作者: Baqer M. Merzah,Tania Taami,Salman Asoudeh,Amir reza Hossein pour,Saeed Mirzaee,Amir Ali Bengari
机构: University of Kufa(库法大学); Florida State University(佛罗里达州立大学); Velayat University(维拉亚特大学); Ferdowsi University of Mashhad(法尔德西马什哈德大学); Amirkabir University of Technology(阿米尔卡比尔理工大学); University of Tehran(德黑兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA was also introduced to evaluate the proposed model, which consists of 5,231 Persian medical questions and answers. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLM in bioinformatics tasks. To our knowledge, BioPars is the first application of LLM in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. The MoverScore and BLEURT values were also higher in this model than the other three models. In addition, the reported scores for the model are MoverScore=60.43 and BLEURT=50.78. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: this https URL.
zh

[NLP-108] he Saturation Point of Backtranslation in High Quality Low Resource English Gujarati Machine Translation

【速读】：该论文试图解决在低资源语言对（如英语-古吉拉特语）中，通过回译（Backtranslation, BT）生成合成训练数据以提升机器翻译（Machine Translation, MT）性能的有效性问题。其解决方案的关键在于利用多语言预训练模型MBART50，基于单语古吉拉特语语料生成回译数据，并将其与高质量平行语料结合进行模型训练。然而，实验结果表明，在高质数据基础上添加回译数据并未提升翻译性能，甚至在某些情况下导致性能下降，这提示回译可能在特定低资源场景中面临收益递减的问题。

链接: https://arxiv.org/abs/2506.21566
作者: Arwa Arif
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, 8 Pages

点击查看摘要

Abstract:Backtranslation BT is widely used in low resource machine translation MT to generate additional synthetic training data using monolingual corpora. While this approach has shown strong improvements for many language pairs, its effectiveness in high quality, low resource settings remains unclear. In this work, we explore the effectiveness of backtranslation for English Gujarati translation using the multilingual pretrained MBART50 model. Our baseline system, trained on a high quality parallel corpus of approximately 50,000 sentence pairs, achieves a BLEU score of 43.8 on a validation set. We augment this data with carefully filtered backtranslated examples generated from monolingual Gujarati text. Surprisingly, adding this synthetic data does not improve translation performance and, in some cases, slightly reduces it. We evaluate our models using multiple metrics like BLEU, ChrF++, TER, BLEURT and analyze possible reasons for this saturation. Our findings suggest that backtranslation may reach a point of diminishing returns in certain low-resource settings and we discuss implications for future research.
zh

[NLP-109] A Multi-Agent Probabilistic Inference Framework Inspired by Kairanban-Style CoT System with IdoBata Conversation for Debiasing

【速读】：该论文旨在解决情感分析中的偏差问题，同时提升模型的可解释性与概率预测能力。其关键解决方案是提出一种多智能体推理框架（KCS+IBC），该框架整合多个大语言模型（Large Language Models, LLMs），通过序列化共享预测结果、引入中期非正式对话环节以融合个体视角，并结合概率情感预测，实现预测结果的聚合与多样性之间的平衡。

链接: https://arxiv.org/abs/2506.21565
作者: Takato Ueno,Keito Inoshita
机构: Shiga University (滋贺大学); Kansai University (关西大学); Data Science and AI Innovation Research Promotion Center (数据科学与人工智能创新研究促进中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Japan’s kairanban culture and idobata conversations have long functioned as traditional communication practices that foster nuanced dialogue among community members and contribute to the formation of social balance. Inspired by these information exchange processes, this study proposes a multi-agent inference framework (KCS+IBC) that integrates multiple large language models (LLMs) to achieve bias mitigation, improved explainability, and probabilistic prediction in sentiment analysis. In addition to sequentially sharing prediction results, the proposed method incorporates a mid-phase casual dialogue session to blend formal inference with individual perspectives and introduces probabilistic sentiment prediction. Experimental results show that KCS achieves accuracy comparable to that of a single LLM across datasets, while KCS+IBC exhibits a consistent decrease in entropy and a gradual increase in variance during the latter stages of inference, suggesting the framework’s ability to balance aggregation and diversity of predictions. Future work will quantitatively assess the impact of these characteristics on bias correction and aim to develop more advanced sentiment analysis systems.
zh

[NLP-110] am QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing

【速读】：该论文旨在解决事实核查中的声明检索（fact-checked claim retrieval）问题，即从大规模语料库中准确检索出与给定声明相关的已验证事实。解决方案的关键在于提出一个三阶段的检索框架：首先评估多种检索模型并选择性能最佳的用于候选声明的初步检索；其次采用多个重排序模型进一步优化候选结果，每个模型选取前10名；最后通过加权投票确定最终检索结果。该方法在单语和跨语言任务中分别取得了第5名和第7名的成绩。

链接: https://arxiv.org/abs/2506.21564
作者: Jiyan Liu,Youzheng Liu,Taihang Wang,Xiaoman Xu,Yimin Wang,Ye Jiang
机构: Qingdao University of Science and Technology (青岛科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: this https URL.
zh

[NLP-111] FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models

【速读】：该论文试图解决大型语言模型（LLMs）在低资源和少数族群语言中的性能不足问题，特别是针对台湾的南岛语系语言（Formosan languages）这一类语言资源匮乏且濒危的语言。解决方案的关键在于构建了首个针对低资源南岛语系语言的基准测试集FORMOSANBENCH，涵盖阿美语、赛夏语和排湾语三种濒危语言，并在机器翻译、自动语音识别（ASR）和文本摘要三个核心自然语言处理（NLP）任务上评估模型表现。通过零样本、10样本和微调设置的实验，揭示了高资源语言与Formosan语言之间的显著性能差距，强调了开发更具包容性的NLP技术的必要性。

链接: https://arxiv.org/abs/2506.21563
作者: Kaiying Kevin Lin,Hsiyu Chen,Haopeng Zhang
机构: Institute of Linguistics, Academia Sinica(中央研究院語言學研究所); ALOHA Lab, University of Hawaii at Manoa(夏威夷大學馬諾阿校區ALOHA實驗室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing (NLP) tasks in high-resource languages, their capabilities in low-resource and minority languages remain significantly underexplored. Formosan languages – a subgroup of Austronesian languages spoken in Taiwan – are both linguistically rich and endangered, largely due to the sociolinguistic dominance of Mandarin. In this work, we introduce FORMOSANBENCH, the first benchmark for evaluating LLMs on low-resource Austronesian languages. It covers three endangered Formosan languages: Atayal, Amis, and Paiwan, across three core NLP tasks: machine translation, automatic speech recognition (ASR), and text summarization. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages. Existing LLMs consistently underperform across all tasks, with 10-shot learning and fine-tuning offering only limited improvements. These findings underscore the urgent need for more inclusive NLP technologies that can effectively support endangered and underrepresented languages. We release our datasets and code to facilitate future research in this direction.
zh

[NLP-112] FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction

【速读】：该论文旨在解决现有生成式AI在建筑平面图生成中无法适应实际设计过程中逐步迭代的工作流程的问题。传统方法通常采用端到端生成方式，在单次操作中生成完整的像素级布局，而这种模式与现实中的增量设计流程不兼容。论文提出的解决方案关键在于引入一种“下一个房间预测”（next room prediction）范式，该范式借鉴了大型语言模型中常用的自回归“下一个词预测”机制，以支持更符合实际设计习惯的逐步生成过程。

链接: https://arxiv.org/abs/2506.21562
作者: Jun Yin,Pengyu Zeng,Jing Zhong,Peilin Li,Miao Zhang,Ran Luo,Shuai Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end generation that produce an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive ‘next token prediction’ mechanism commonly used in large language models, and propose a novel ‘next room prediction’ paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS demonstrates competitive performance in comparison to diffusion models and Tell2Design in the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.
zh

[NLP-113] Reasoning Isnt Enough: Examining Truth-Bias and Sycophancy in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在事实判断中的真实性检测能力不足的问题，尤其是在高风险决策场景下的可靠性问题。研究的关键在于通过大规模评估和对比分析，揭示推理型模型与非推理型模型在真实性检测上的差异，并发现部分先进模型存在迎合性倾向（sycophantic tendencies），即在判断真实陈述时表现良好，但在识别虚假信息时表现较差，这表明仅靠模型能力的提升无法根本解决LLMs在真实性检测中的挑战。

链接: https://arxiv.org/abs/2506.21561
作者: Emilio Barkett,Olivia Long,Madhavendra Thakur
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their widespread use in fact-checking, moderation, and high-stakes decision-making, large language models (LLMs) remain poorly understood as judges of truth. This study presents the largest evaluation to date of LLMs’ veracity detection capabilities and the first analysis of these capabilities in reasoning models. We had eight LLMs make 4,800 veracity judgments across several prompts, comparing reasoning and non-reasoning models. We find that rates of truth-bias, or the likelihood to believe a statement is true, regardless of whether it is actually true, are lower in reasoning models than in non-reasoning models, but still higher than human benchmarks. Most concerning, we identify sycophantic tendencies in several advanced models (o4-mini and GPT-4.1 from OpenAI, R1 from DeepSeek), which displayed an asymmetry in detection accuracy, performing well in truth accuracy but poorly in deception accuracy. This suggests that capability advances alone do not resolve fundamental veracity detection challenges in LLMs.
zh

[NLP-114] Reinforcement Learning Fine-Tuning of Language Model for Instruction Following and Math Reasoning

【速读】：该论文旨在解决如何通过强化学习（Reinforcement Learning, RL）微调技术提升小型语言模型在指令遵循和数学推理任务中的性能问题。其关键解决方案是采用RLOO（Reinforce Leave-One-Out）结合DeBERTa奖励模型以实现最佳对齐，同时对比监督微调（SFT）和DPO（Direct Preference Optimization）方法，验证了不同策略在任务适应性上的表现差异。此外，研究还表明合成数据增强与外部验证器结合的采样策略能够显著提升数学推理任务的准确性。

链接: https://arxiv.org/abs/2506.21560
作者: Yifu Han,Geo Zhang
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates the effectiveness of reinforcement learning (RL) fine-tuning techniques on a compact language model (Qwen2.5-0.5B Base) for two challenging tasks: instruction following and mathematical reasoning. We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models. Our experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results. For math reasoing tasks, synthetic data augmentation and best-of-N sampling with an external verifier significantly improve accuracy, showing the potential of combining fine-tuning with inference-time tools. This study highlights key trade-offs and practical strategies for training lightweight, task-aligned small-scale language models.
zh

[NLP-115] GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations

【速读】：该论文试图解决图语言模型（GLM）在图数据上的有效性问题和效率问题，特别是在基于上下文学习（ICL）的场景下，由于参数固定和长上下文带来的性能限制，以及指令微调所需大量标注数据在现实场景中难以获取的问题。解决方案的关键在于引入一个额外的参数适应阶段，通过少量标注示例高效地调整GLM以适应新的图和任务，从而提升预测精度并加快推理速度。为此，论文提出了GraphLAMA方法，其模型架构和学习策略专门设计用于高效的微调与推理。

链接: https://arxiv.org/abs/2506.21559
作者: Junze Chen,Cheng Yang,Shujie Li,Zhiqiang Zhang,Yawen Li,Junping Du,Chuan Shi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have been recently integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of GLMs utilizes abundant training labels to enhance model performance, known as instruction tuning. However, we argue that ICL on graphs has effectiveness issues due to fixed parameters and efficiency issues due to long context. Meanwhile, the large amount of labeled data required for instruction tuning can be difficult to obtain in real-world scenarios. To this end, we aim to introduce an extra parameter adaptation stage that can efficiently tailor GLMs to an unseen graph and task with only a few labeled examples, in exchange for better prediction accuracy and faster inference speed. For implementation, in this paper we propose GraphLAMA method, with its model backbone and learning schemes specialized for efficient tuning and inference. Specifically, for model backbone, we use a graph neural network (GNN) with several well-designed components to transform nodes into the representation space of LLM tokens. Task instructions can then be represented as a mixture of node and language tokens. In the pre-training stage, model parameters except the LLM will be trained with different tasks to capture general knowledge. In the adaptation stage, only a few pre-trained parameters will be updated based on few-shot examples. Extensive experiments on few/zero-shot node classification and summary generation show that our proposed GraphLAMA achieves state-of-the-art performance with 4.91% absolution improvement in accuracy. Compared with ICL, our inference speed can be 10 times faster under 5-shot setting.
zh

[NLP-116] Bench to the Future: A Pastcasting Benchmark for Forecasting Agents

【速读】：该论文试图解决现有预测基准缺乏现实、封闭且可重复环境的问题，无法有效评估大语言模型（Large Language Model, LLM）的预测能力。其解决方案的关键在于引入Bench To the Future (BTF)，这是一个“回溯预测”基准，包含数百个已知结果的高质量问题，并为每个问题提供大规模的离线网络文档语料库，从而能够从LLMs中提取对过去事件的现实“预测”。

链接: https://arxiv.org/abs/2506.21558
作者: FutureSearch:Jack Wildman,Nikos I. Bosse,Daniel Hnyk,Peter Mühlbacher,Finn Hambly,Jon Evans,Dan Schwarz,Lawrence Phillips
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Forecasting is a challenging task that offers a clearly measurable way to study AI systems. Forecasting requires a large amount of research on the internet, and evaluations require time for events to happen, making the development of forecasting benchmarks challenging. To date, no forecasting benchmark provides a realistic, hermetic, and repeatable environment for LLM forecasters. We introduce Bench To the Future (BTF), a “pastcasting” benchmark with hundreds of high-quality questions for which the resolution is already known. Each question is accompanied by a large offline corpus of tens of thousands of relevant web pages, enabling a way to elicit realistic “forecasts” on past events from LLMs. Results suggest that our pastcasting environment can produce results comparable to those based on forecasts using the internet on at-the-time unresolved questions. We show results benchmarking agent and chain-of-thought forecasting approaches using several LLMs, including the recently-released Claude 4 models, and demonstrate BTF’s ability to track steady forecasting capability progress over time. We intend this to be a living benchmark, with new questions added continually to account for increasing training data cutoff dates. We invite researchers to contact us at hello@futuresearch.ai to utilize our benchmark or tooling for their own research.
zh

[NLP-117] Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

【速读】：该论文旨在解决虚假新闻检测（Fake News Detection）中信息可信度下降的问题，特别是在多模态平台上的虚假新闻传播所带来的挑战。其解决方案的关键在于提出一种名为Debunk-and-Infer框架（DIFND）的方法，该方法通过整合条件扩散模型的生成能力与多模态大语言模型（Multimodal Large Language Models, MLLMs）的协同推理能力，实现对虚假新闻的更准确和可解释的检测。具体而言，DIFND利用去伪（debunk diffusion）生成反驳或验证证据，并通过链式去伪策略促进多智能体MLLM系统进行逻辑驱动、多模态感知的推理与最终真实性判断，从而在统一架构中联合建模多模态特征、生成性去伪线索和丰富的推理验证过程。

链接: https://arxiv.org/abs/2506.21557
作者: Kaiying Yan,Moyang Liu,Yukun Liu,Ruibo Fu,Zhengqi Wen,Jianhua Tao,Xuefei Liu
机构: Sun Yat-sen University (中山大学); Beihang University (北京航空航天大学); University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Beijing National Research Center for Information Science and Technology, Tsinghua University (清华大学信息科学与技术国家实验室); Department of Automation, Tsinghua University (清华大学自动化系)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection(DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain-of-debunk strategy where a multi-agent MLLM system produces logic-grounded, multimodal-aware reasoning content and final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning-rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.
zh

[NLP-118] VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

【速读】：该论文旨在解决现有多模态知识图谱（Multimodal Knowledge Graphs, MMKGs）在知识覆盖范围有限、模态支持狭窄以及难以适应当前多模态大语言模型（Multimodal Large Language Models, MLLMs）对更丰富模态（如视频和音频）需求的问题。其解决方案的关键在于提出一种面向概念的、知识密集型的多模态知识图谱——视觉-音频-文本知识图谱（Visual-Audio-Text Knowledge Graph, VAT-KG），该图谱整合了视觉、音频和文本信息，并通过严格的跨模态对齐与细粒度语义匹配机制，实现从任意多模态数据集自动构建MMKG，同时引入一种新的多模态检索增强生成框架，以支持跨模态查询的知识检索与推理。

链接: https://arxiv.org/abs/2506.21556
作者: Hyeongcheol Park,MinHyuk Jang,Ha Dam Baek,Gyusam Chang,Jiyoung Seo,Jiwan Park,Hogun Park,Sangpil Kim
机构: Korea University (韩国大学); Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations reduce their extensibility and applicability to a broad range of multimodal tasks, particularly as the field shifts toward richer modalities such as video and audio in recent MLLMs. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.
zh

[NLP-119] Efficient Multilingual ASR Finetuning via LoRA Language Experts INTERSPEECH2025

【速读】：该论文旨在解决多语言自动语音识别（Multilingual Automatic Speech Recognition, ASR）中因语言间干扰导致的识别性能下降问题。其解决方案的关键在于提出一种基于Whisper的高效微调框架，通过预训练的低秩适配（Low-Rank Adaptation, LoRA）语言专家进行语言特异性参数的优化，并结合LoRA专家融合或知识蒸馏技术，从而在共享模型容量的同时提升目标语言的识别性能。实验结果表明，该方法在语言感知和语言无关场景下分别实现了约10%和15%的相对性能提升。

链接: https://arxiv.org/abs/2506.21555
作者: Jiahong Li,Yiwen Shao,Jianheng Zhuo,Chenda Li,Liliang Tang,Dong Yu,Yanmin Qian
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in Interspeech 2025

点击查看摘要

Abstract:Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilingual ASR still suffers from the curse of multilinguality in that different languages tend to interfere with each other, making it difficult for the ASR model to identify multiple languages effectively while sharing model capacity across them. This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios, respectively.
zh

[NLP-120] Data Efficacy for Language Model Training

【速读】：该论文试图解决语言模型（Language Model, LM）训练中数据效率与数据效用的优化问题，旨在通过改进训练数据的组织方式来提升模型性能。其解决方案的关键在于提出一种通用范式DELT，该范式包含数据评分（Data Scoring）、数据选择（Data Selection）和数据排序（Data Ordering）三个核心组件，其中Learnability-Quality Scoring（LQS）和Folding Ordering（FO）是两个创新性技术，分别从梯度一致性角度评估数据样本的可学习性和质量，并解决模型遗忘和数据分布偏差问题。

链接: https://arxiv.org/abs/2506.21545
作者: Yalun Dai,Yangyu Huang,Xin Zhang,Wenshan Wu,Chong Li,Wenhui Lu,Shijie Cao,Li Dong,Scarlett Li
机构: Microsoft Research(微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
zh

[NLP-121] MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

【速读】：该论文试图解决大规模多任务语言理解（MMLU）等多项选择题（MCQ）数据集在评估大语言模型（LLM）时因基准污染导致的不可靠性问题。解决方案的关键在于构建一个无污染且更具挑战性的MCQ基准MMLU-CF，通过从更广泛领域获取数据并设计三种去污染规则来避免无意的数据泄露，同时将基准划分为具有相似难度和主题分布的验证集与测试集，其中测试集保持闭源以确保评估结果的可靠性，而验证集公开以促进透明度和独立验证。

链接: https://arxiv.org/abs/2412.15194
作者: Qihao Zhao,Yangyu Huang,Tengchao Lv,Lei Cui,Qinzheng Sun,Shaoguang Mao,Xin Zhang,Ying Xin,Qiufeng Yin,Scarlett Li,Furu Wei
机构: Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs’ understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at this https URL and the dataset refers to this https URL.
zh

[NLP-122] Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts

【速读】：该论文试图解决在混合来源文本中准确估计水印比例的问题（watermark proportion in mixed-source texts），即如何区分文本中由大型语言模型（Large Language Models, LLMs）生成的水印内容与人类撰写的内容。解决方案的关键在于将该问题建模为基于关键统计量（pivotal statistics）的混合模型中比例参数的估计问题，并证明在采用连续关键统计量的水印方法下，该比例参数在弱条件下是可识别的。研究提出了针对此类方法的有效估计器，并推导了基于关键统计量的任何可测估计器的最小最大下界，表明所提出的估计器能够达到这些下界，从而实现了高精度的估计。

链接: https://arxiv.org/abs/2506.22343
作者: Xiang Li,Garrett Wen,Weiqing He,Jiayuan Wu,Qi Long,Weijie J. Su
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Text watermarks in large language models (LLMs) are an increasingly important tool for detecting synthetic text and distinguishing human-written content from LLM-generated text. While most existing studies focus on determining whether entire texts are watermarked, many real-world scenarios involve mixed-source texts, which blend human-written and watermarked content. In this paper, we address the problem of optimally estimating the watermark proportion in mixed-source texts. We cast this problem as estimating the proportion parameter in a mixture model based on \emphpivotal statistics. First, we show that this parameter is not even identifiable in certain watermarking schemes, let alone consistently estimable. In stark contrast, for watermarking methods that employ continuous pivotal statistics for detection, we demonstrate that the proportion parameter is identifiable under mild conditions. We propose efficient estimators for this class of methods, which include several popular unbiased watermarks as examples, and derive minimax lower bounds for any measurable estimator based on pivotal statistics, showing that our estimators achieve these lower bounds. Through evaluations on both synthetic data and mixed-source text generated by open-source models, we demonstrate that our proposed estimators consistently achieve high estimation accuracy.
zh

[NLP-123] Using Large Language Models to Suggest Informative Prior Distributions in Bayesian Statistics

【速读】：该论文试图解决贝叶斯统计中先验分布选择的挑战性问题，即这一过程具有主观性、资源消耗大且难以标准化。其解决方案的关键在于利用大型语言模型（Large-Language Models, LLMs）来生成基于知识的、信息丰富的先验分布，并通过设计一个详尽的提示（prompt）让LLMs不仅提出先验，还对其进行验证和反思。实验结果表明，LLMs能够正确识别变量间的关联方向，但在先验分布的置信度校准方面仍存在不足，尤其是对弱信息先验的处理上表现出差异。

链接: https://arxiv.org/abs/2506.21964
作者: Michael A. Riegler,Kristoffer Herland Hellton,Vajira Thambawita,Hugo L. Hammer
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Selecting prior distributions in Bayesian statistics is challenging, resource-intensive, and subjective. We analyze using large-language models (LLMs) to suggest suitable, knowledge-based informative priors. We developed an extensive prompt asking LLMs not only to suggest priors but also to verify and reflect on their choices. We evaluated Claude Opus, Gemini 2.5 Pro, and ChatGPT-4o-mini on two real datasets: heart disease risk and concrete strength. All LLMs correctly identified the direction for all associations (e.g., that heart disease risk is higher for males). The quality of suggested priors was measured by their Kullback-Leibler divergence from the maximum likelihood estimator’s distribution. The LLMs suggested both moderately and weakly informative priors. The moderate priors were often overconfident, resulting in distributions misaligned with the data. In our experiments, Claude and Gemini provided better priors than ChatGPT. For weakly informative priors, a key performance difference emerged: ChatGPT and Gemini defaulted to an “unnecessarily vague” mean of 0, while Claude did not, demonstrating a significant advantage. The ability of LLMs to identify correct associations shows their great potential as an efficient, objective method for developing informative priors. However, the primary challenge remains in calibrating the width of these priors to avoid over- and under-confidence. Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2506.21964 [stat.ME] (or arXiv:2506.21964v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2506.21964 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hugo Lewi Hammer Dr. [view email] [v1] Fri, 27 Jun 2025 07:11:55 UTC (710 KB)
zh

计算机视觉

[CV-0] MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

【速读】：该论文试图解决如何在多图像之间建立视觉线索的链式思维（Chain-of-Thought, CoT）推理问题，特别是在缺乏人工标注问答对的情况下实现细粒度视觉细节和复杂逻辑的跨图像推理。解决方案的关键在于利用自监督视觉表征学习的思想，通过构建包含同一图像的两个增强视图和一个相似但不同的图像的三元组，在训练过程中引导模型生成推理过程以比较图像，并通过基于规则的强化学习进行优化，从而促使模型关注细微的视觉变化并执行逻辑推理。

链接: https://arxiv.org/abs/2506.22434
作者: Xi Chen,Mingkang Zhu,Shaoteng Liu,Xiaoyang Wu,Xiaogang Xu,Yu Liu,Xiang Bai,Hengshuang Zhao
机构: HKU(香港大学); Tongyi Lab, Alibaba Group(通义实验室，阿里巴巴集团); CUHK(香港中文大学); HUST(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
zh

[CV-1] WarpRF: Multi-View Consistency for Training-Free Uncertainty Quantification and Applications in Radiance Fields

【速读】：该论文试图解决辐射场（radiance field）中不确定性量化的问题，旨在提供一种无需训练的通用框架来评估模型在未见视角下的不确定性。解决方案的关键在于利用视角间的逆向仿射变换（backward warping），将可靠的渲染结果投影到未见视角，并通过与该视角下渲染图像的一致性进行对比来量化不确定性。这一方法简单且成本低廉，适用于任何辐射场实现，并在不确定性量化及下游任务中表现出色。

链接: https://arxiv.org/abs/2506.22433
作者: Sadra Safadoust,Fabio Tosi,Fatma Güney,Matteo Poggi
机构: Koç University (科克大学); University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce WarpRF, a training-free general-purpose framework for quantifying the uncertainty of radiance fields. Built upon the assumption that photometric and geometric consistency should hold among images rendered by an accurate model, WarpRF quantifies its underlying uncertainty from an unseen point of view by leveraging backward warping across viewpoints, projecting reliable renderings to the unseen viewpoint and measuring the consistency with images rendered there. WarpRF is simple and inexpensive, does not require any training, and can be applied to any radiance field implementation for free. WarpRF excels at both uncertainty quantification and downstream tasks, e.g., active view selection and active mapping, outperforming any existing method tailored to specific frameworks.
zh

[CV-2] Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

【速读】：该论文旨在解决视频编辑中用户意图与生成结果之间难以实现细粒度对齐的问题，即在实际应用中，用户希望工具能够精确且一致地实现其创意编辑意图。解决方案的关键在于提出一种名为Shape-for-Motion的框架，该框架通过将输入视频中的目标物体转换为时间一致的网格（3D proxy），从而实现对视频的精确和一致编辑。该方法允许用户直接在单帧的3D网格上进行编辑，并通过设计的Dual-Propagation Strategy自动传播到其他帧，最终将编辑结果投影到2D空间并输入到解耦的视频扩散模型中生成编辑后的视频。

链接: https://arxiv.org/abs/2506.22432
作者: Yuhao Liu,Tengfei Wang,Fang Liu,Zhenwei Wang,Rynson W.H. Lau
机构: City University of Hong Kong (香港城市大学); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: this https URL
zh

[CV-3] st-Time Consistency in Vision Language Models

【速读】：该论文旨在解决视觉-语言模型（Vision-Language Models, VLMs）在面对语义等价输入时表现出不一致行为的问题，这一问题影响了模型的可靠性和鲁棒性。论文提出的解决方案的关键在于设计一种无需监督微调的测试阶段一致性框架，通过两个互补的目标实现预测的一致性：一是交叉熵一致损失（Cross-Entropy Agreement Loss），用于对齐语义等价输入的预测分布；二是伪标签一致性损失（Pseudo-Label Consistency Loss），用于将输出拉向自平均共识。该方法为后处理方式，与模型架构无关，且仅需单个测试样本即可提升一致性。

链接: https://arxiv.org/abs/2506.22395
作者: Shih-Han Chou,Shivam Chandhok,James J. Little,Leonid Sigal
机构: University of British Columbia (不列颠哥伦比亚大学); Vector Institute for AI (向量人工智能研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often exhibit inconsistent behavior when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs, despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training. Our method is entirely post-hoc, model-agnostic, and applicable to any VLM with access to its weights. Given a single test point, we enforce consistent predictions via two complementary objectives: (i) a Cross-Entropy Agreement Loss that aligns predictive distributions across semantically equivalent inputs, and (ii) a Pseudo-Label Consistency Loss that draws outputs toward a self-averaged consensus. Our method is plug-and-play and leverages information from a single test input itself to improve consistency. Experiments on the MM-R3 benchmark show that our framework yields substantial gains in consistency across state-of-the-art models, establishing a new direction for inference-time adaptation in multimodal learning.
zh

[CV-4] Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation ICCV2025

【速读】：该论文试图解决三维点云数据中的分布外（Out-of-distribution, OOD）检测问题，这一问题在需要安全和鲁棒感知的应用中尤为关键。现有OOD检测方法在二维图像数据上已取得进展，但将其扩展到三维环境面临独特挑战。论文提出的解决方案是一种无需训练的框架，其关键在于利用视觉-语言模型（Vision-Language Models, VLMs）并通过构建基于类别原型和测试数据的图结构，挖掘数据流形结构以增强VLM在三维OOD检测中的有效性。该方法的核心创新是提出了一种新颖的图得分传播（Graph Score Propagation, GSP）策略，结合提示聚类和自训练负提示技术，提升了OOD评分性能，并具备适应少样本场景的能力。

链接: https://arxiv.org/abs/2506.22375
作者: Tiankai Chen,Yushu Li,Adam Goodge,Fei Teng,Xulei Yang,Tianrui Li,Xun Xu
机构: Southwest Jiaotong University (西南交通大学); South China University of Technology (华南理工大学); Institute for infocomm research, A*STAR I2R (资讯通信研究院，新加坡科技研究局I2R)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhancing the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLM. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods across synthetic and real-world datasets 3D point cloud OOD detection.
zh

[CV-5] From Ground to Air: Noise Robustness in Vision Transformers and CNNs for Event-Based Vehicle Classification with Potential UAV Applications

【速读】：该论文旨在解决事件相机（event-based camera）在动态环境中进行目标分类的性能问题，特别是针对卷积神经网络（Convolutional Neural Network, CNN）和视觉Transformer（Vision Transformer, ViT）两种深度学习架构的适用性进行研究。解决方案的关键在于对ResNet34和ViT B16模型进行微调，并在GEN1事件数据集上评估其在标准条件和模拟噪声下的表现，以分析不同模型的分类准确性和鲁棒性。

链接: https://arxiv.org/abs/2506.22360
作者: Nouf Almesafri,Hector Figueiredo,Miguel Arana-Catania
机构: Cranfield University (克兰菲尔德大学); Propulsion and Space Research Center (推进与空间研究中心); Qinetiq (奎奈蒂克); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 17 figures, 9 tables. To be presented in AIAA AVIATION Forum 2025

点击查看摘要

Abstract:This study investigates the performance of the two most relevant computer vision deep learning architectures, Convolutional Neural Network and Vision Transformer, for event-based cameras. These cameras capture scene changes, unlike traditional frame-based cameras with capture static images, and are particularly suited for dynamic environments such as UAVs and autonomous vehicles. The deep learning models studied in this work are ResNet34 and ViT B16, fine-tuned on the GEN1 event-based dataset. The research evaluates and compares these models under both standard conditions and in the presence of simulated noise. Initial evaluations on the clean GEN1 dataset reveal that ResNet34 and ViT B16 achieve accuracies of 88% and 86%, respectively, with ResNet34 showing a slight advantage in classification accuracy. However, the ViT B16 model demonstrates notable robustness, particularly given its pre-training on a smaller dataset. Although this study focuses on ground-based vehicle classification, the methodologies and findings hold significant promise for adaptation to UAV contexts, including aerial object classification and event-based vision systems for aviation-related tasks.
zh

[CV-6] Closing the Performance Gap in Biometric Cryptosystems: A Deeper Analysis on Unlinkable Fuzzy Vaults

【速读】：该论文旨在解决基于模糊密钥（fuzzy vault-based）的生物特征模板保护系统中存在性能差距的问题。研究指出，不稳定纠错能力是导致性能下降的关键因素，这主要由特征集大小的可变性及其对相似性阈值的影响所引起，同时特征类型转换带来的信息丢失进一步加剧了这一问题。解决方案的关键在于提出一种基于等频区间（equal frequent intervals）的新型特征量化方法，该方法确保固定的特征集大小，并支持无需训练即可适应任意区间数量，从而显著减小由模板保护引入的性能差距。

链接: https://arxiv.org/abs/2506.22347
作者: Hans Geißner,Christian Rathgeb
机构: Hochschule Darmstadt (达姆施塔特应用科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, 4 tables

点击查看摘要

Abstract:This paper analyses and addresses the performance gap in the fuzzy vault-based \acBCS. We identify unstable error correction capabilities, which are caused by variable feature set sizes and their influence on similarity thresholds, as a key source of performance degradation. This issue is further compounded by information loss introduced through feature type transformations. To address both problems, we propose a novel feature quantization method based on \itequal frequent intervals. This method guarantees fixed feature set sizes and supports training-free adaptation to any number of intervals. The proposed approach significantly reduces the performance gap introduced by template protection. Additionally, it integrates seamlessly with existing systems to minimize the negative effects of feature transformation. Experiments on state-of-the-art face, fingerprint, and iris recognition systems confirm that only minimal performance degradation remains, demonstrating the effectiveness of the method across major biometric modalities.
zh

[CV-7] A Deep Learning framework for building damage assessment using VHR SAR and geospatial data: demonstration on the 2023 Turkiye Earthquake

【速读】：该论文旨在解决灾后建筑损毁快速识别的问题，以支持应急响应和恢复工作。传统方法依赖于光学卫星影像，但常受云层覆盖或缺乏灾前影像的限制。其解决方案的关键在于引入一种新型多模态深度学习框架，利用单时相高分辨率合成孔径雷达（SAR）影像结合辅助地理空间数据，如OpenStreetMap建筑轮廓、数字表面模型及全球地震模型中的结构脆弱性和暴露属性，从而在仅使用灾后数据的情况下实现高精度和强泛化能力的建筑损毁检测。

链接: https://arxiv.org/abs/2506.22338
作者: Luigi Russo,Deodato Tapete,Silvia Liberata Ullo,Paolo Gamba
机构: University of Pavia (帕维亚大学); University of Sannio (萨尼奥大学); Italian Space Agency (意大利航天局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures (plus 4 author photos), and 5 tables. Submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

点击查看摘要

Abstract:Building damage identification shortly after a disaster is crucial for guiding emergency response and recovery efforts. Although optical satellite imagery is commonly used for disaster mapping, its effectiveness is often hampered by cloud cover or the absence of pre-event acquisitions. To overcome these challenges, we introduce a novel multimodal deep learning (DL) framework for detecting building damage using single-date very high resolution (VHR) Synthetic Aperture Radar (SAR) imagery from the Italian Space Agency (ASI) COSMO SkyMed (CSK) constellation, complemented by auxiliary geospatial data. Our method integrates SAR image patches, OpenStreetMap (OSM) building footprints, digital surface model (DSM) data, and structural and exposure attributes from the Global Earthquake Model (GEM) to improve detection accuracy and contextual interpretation. Unlike existing approaches that depend on pre and post event imagery, our model utilizes only post event data, facilitating rapid deployment in critical scenarios. The framework effectiveness is demonstrated using a new dataset from the 2023 earthquake in Turkey, covering multiple cities with diverse urban settings. Results highlight that incorporating geospatial features significantly enhances detection performance and generalizability to previously unseen areas. By combining SAR imagery with detailed vulnerability and exposure information, our approach provides reliable and rapid building damage assessments without the dependency from available pre-event data. Moreover, the automated and scalable data generation process ensures the framework’s applicability across diverse disaster-affected regions, underscoring its potential to support effective disaster management and recovery efforts. Code and data will be made available upon acceptance of the paper.
zh

[CV-8] MatChA: Cross-Algorithm Matching with Feature Augmentation

【速读】：该论文试图解决在不同设备使用不同的稀疏特征提取算法获取关键点及其对应描述符时的视觉定位问题（visual localization）。现有方法在跨特征检测器情况下性能显著下降，因为它们假设存在共用的关键点，而实际上当使用不同描述符时，这种情况很少见。解决方案的关键在于提出一种针对跨检测器特征匹配的特征描述符增强方法，并将特征翻译到潜在空间，从而显著提升跨特征场景下的图像匹配和视觉定位性能。

链接: https://arxiv.org/abs/2506.22336
作者: Paula Carbó Cubero,Alberto Jaenal Gálvez,André Mateus,José Araújo,Patric Jensfelt
机构: KTH Royal Institute of Technology (皇家理工学院); Ericsson Research (爱立信研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art methods fail to solve visual localization in scenarios where different devices use different sparse feature extraction algorithms to obtain keypoints and their corresponding descriptors. Translating feature descriptors is enough to enable matching. However, performance is drastically reduced in cross-feature detector cases, because current solutions assume common keypoints. This means that the same detector has to be used, which is rarely the case in practice when different descriptors are used. The low repeatability of keypoints, in addition to non-discriminatory and non-distinctive descriptors, make the identification of true correspondences extremely challenging. We present the first method tackling this problem, which performs feature descriptor augmentation targeting cross-detector feature matching, and then feature translation to a latent space. We show that our method significantly improves image matching and visual localization in the cross-feature scenario and evaluate the proposed method on several benchmarks.
zh

[CV-9] Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

【速读】：该论文旨在解决条件流匹配（Conditional Flow Matching, CFM）在采样过程中依赖数值求解非线性常微分方程（ODEs）所带来的计算成本高和可解释性差的问题。其解决方案的关键在于引入Koopman算子理论，通过将非线性流映射到一个可学习的可观测空间中实现线性演化，从而构建无需解ODE的解析采样方法。该方法提出了一种无解码器的Koopman-CFM架构，使得生成过程在嵌入空间中呈现线性特性，进而通过矩阵指数实现一步闭式采样，显著提升了采样效率并增强了模型的可解释性。

链接: https://arxiv.org/abs/2506.22304
作者: Erkan Turan,Aristotelis Siozopoulos,Maks Ovsjanikov
机构: LIX, École Polytechnique, IP Paris (LIX, 法国巴黎综合理工学院, 巴黎文理研究大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conditional Flow Matching (CFM) offers a simulation-free framework for training continuous-time generative models, bridging diffusion and flow-based approaches. However, sampling from CFM still relies on numerically solving non-linear ODEs which can be computationally expensive and difficult to interpret. Recent alternatives address sampling speed via trajectory straightening, mini-batch coupling or distillation. However, these methods typically do not shed light on the underlying \textitstructure of the generative process. In this work, we propose to accelerate CFM and introduce an interpretable representation of its dynamics by integrating Koopman operator theory, which models non-linear flows as linear evolution in a learned space of observables. We introduce a decoder-free Koopman-CFM architecture that learns an embedding where the generative dynamics become linear, enabling closed-form, one-step sampling via matrix exponentiation. This results in significant speedups over traditional CFM as demonstrated on controlled 2D datasets and real-world benchmarks, MNIST, Fashion-MNIST (F-MNIST), and the Toronto Face Dataset (TFD). Unlike previous methods, our approach leads to a well-structured Koopman generator, whose spectral properties, eigenvalues, and eigenfunctions offer principled tools for analyzing generative behavior such as temporal scaling, mode stability, and decomposition in Koopman latent space. By combining sampling efficiency with analytical structure, Koopman-enhanced flow matching offers a potential step toward fast and interpretable generative modeling.
zh

[CV-10] OutDreamer: Video Outpainting with a Diffusion Transformer

【速读】：该论文旨在解决视频外绘画（video outpainting）任务中生成内容在时空一致性上的挑战，尤其是在扩展原始视频边界时保持高质量和适应性的问题。其解决方案的关键在于提出一种基于扩散变压器（Diffusion Transformers, DiTs）的框架——OutDreamer，该框架包含两个核心组件：高效视频控制分支和条件外绘画分支，分别用于提取遮罩视频信息和根据条件生成缺失内容。此外，通过引入掩码驱动的自注意力层和潜在对齐损失，增强了模型对任务的适应性和帧内及帧间的整体一致性，从而提升了生成视频的质量与连贯性。

链接: https://arxiv.org/abs/2506.22298
作者: Linhao Zhong,Fan Li,Yi Huang,Jianzhuang Liu,Renjing Pei,Fenglong Song
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video outpainting is a challenging task that generates new video content by extending beyond the boundaries of an original input video, requiring both temporal and spatial consistency. Many state-of-the-art methods utilize latent diffusion models with U-Net backbones but still struggle to achieve high quality and adaptability in generated content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch effectively extracts masked video information, while the conditional outpainting branch generates missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model’s adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.
zh

[CV-11] RoomCraft: Controllable and Complete 3D Indoor Scene Generation

【速读】：该论文旨在解决从用户输入生成真实感3D室内场景的问题，该问题在计算机视觉与图形学中具有挑战性，需平衡几何一致性、空间关系和视觉真实性。现有神经生成方法因全局空间推理能力有限而产生重复元素，而程序化方法虽能通过约束实现可控生成，但在多约束场景下易发生物体碰撞，导致布局不完整。论文提出的解决方案——RoomCraft，其关键在于构建一个多阶段流水线，结合场景生成流程与约束驱动优化框架，通过高阶场景信息提取、空间关系网络构建、启发式深度优先搜索算法生成优化排列序列，并引入统一约束表示和冲突感知定位策略（CAPS），以动态调整放置权重，减少家具碰撞并确保布局完整性。

链接: https://arxiv.org/abs/2506.22291
作者: Mengqi Zhou,Xipeng Wang,Yuxi Wang,Zhaoxiang Zhang
机构: Institute of Automation, Chinese Academy of Sciences(自动化研究所，中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but struggle with multi-constraint scenarios. When constraints become numerous, object collisions frequently occur, forcing the removal of furniture items and compromising layout completeness. To address these limitations, we propose RoomCraft, a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes. Our approach combines a scene generation pipeline with a constraint-driven optimization framework. The pipeline first extracts high-level scene information from user inputs and organizes it into a structured format containing room type, furniture items, and spatial relations. It then constructs a spatial relationship network to represent furniture arrangements and generates an optimized placement sequence using a heuristic-based depth-first search (HDFS) algorithm to ensure layout coherence. To handle complex multi-constraint scenarios, we introduce a unified constraint representation that processes both formal specifications and natural language inputs, enabling flexible constraint-oriented adjustments through a comprehensive action space design. Additionally, we propose a Conflict-Aware Positioning Strategy (CAPS) that dynamically adjusts placement weights to minimize furniture collisions and ensure layout completeness. Extensive experiments demonstrate that RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts across diverse input modalities. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.22291 [cs.CV] (or arXiv:2506.22291v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2506.22291 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-12] Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment

【速读】：该论文旨在解决大型视觉-语言模型（Large Vision-Language Models, LVLMs）中视觉令牌数量远多于文本令牌所导致的计算开销大、可扩展性受限的问题。现有方法在模型内部进行视觉令牌压缩时，通常依赖文本引导的交互，但该假设存在跨模态不对齐问题，包括因果、语义和空间层面的不一致，从而影响压缩效果。论文提出VisionDrop，其关键在于采用无需训练的纯视觉剪枝框架，通过视觉内注意力机制选择信息量大的视觉令牌，避免依赖文本信号，并将视觉编码器与语言模型视为统一系统，设计渐进式剪枝流程，实现多阶段的主导令牌选择与轻量级上下文合并，从而在严格令牌预算下仍能保留细粒度视觉信息。

链接: https://arxiv.org/abs/2506.22283
作者: Rui Xu,Yunke Wang,Yong Luo,Bo Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLM). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing methods, despite requiring no additional training or complex modifications. Its simple yet effective design enables efficient inference while preserving strong performance across tasks.
zh

[CV-13] EAMamba: Efficient All-Around Vision State Space Model for Image Restoration ICCV2025

【速读】：该论文旨在解决Vision Mamba在低级视觉任务中面临的计算复杂度随扫描序列数量增加而上升以及局部像素遗忘的问题。其解决方案的关键在于提出Efficient All-Around Mamba (EAMamba)，该框架引入了多头选择性扫描模块（Multi-Head Selective Scan Module, MHSSM）和全向扫描策略，通过高效聚合多个扫描序列来避免计算复杂度和参数量的增加，并利用多模式扫描策略捕捉全局信息以解决局部像素遗忘问题。

链接: https://arxiv.org/abs/2506.22246
作者: Yu-Cheng Lin,Yu-Syuan Xu,Hao-Wei Chen,Hsien-Kai Kuo,Chun-Yi Lee
机构: National Tsing Hua University (国立清华大学); National Taiwan University (国立台湾大学); MediaTek Inc. (联发科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
zh

[CV-14] 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

【速读】：该论文试图解决在预训练过程中利用多样化机器人数据时面临的“坐标系混乱”和“状态混乱”问题，这些问题源于现有方法仅使用简单观测作为输入，导致条件动作分布分散，从而显著降低预训练效率。解决方案的关键在于提出4D-VLA，通过引入深度和时间信息到视觉特征中，结合顺序的RGB-D输入，对齐机器人与场景的坐标系，从而增强模型的时空推理能力并减少训练开销。此外，还引入了记忆库采样策略，以提高模型的有效性和效率。

链接: https://arxiv.org/abs/2506.22242
作者: Jiahui Zhang,Yurui Chen,Yueming Xu,Ze Huang,Yanpeng Zhou,Yu-Jie Yuan,Xinyue Cai,Guowei Huang,Xingyue Quan,Hang Xu,Li Zhang
机构: Fudan University (复旦大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset’s action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution-an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame sampling strategy designed to extract informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
zh

[CV-15] Boosting Classification with Quantum-Inspired Augmentations

【速读】：该论文试图解决量子计算中微小量子门扰动对量子机器学习性能影响的问题，以及如何将这些扰动转化为有效的数据增强方法。其解决方案的关键在于利用随机布洛赫球旋转（random Bloch sphere rotations），这是一种基本的SU(2)变换，作为量子启发式数据增强技术，直接作用于经典数据，而非依赖量子模型或可训练的量子卷积层。通过在大规模ImageNet数据集上的实验验证，该方法显著提升了图像分类的性能。

链接: https://arxiv.org/abs/2506.22241
作者: Matthias Tschöpe,Vitor Fortes Rey,Sogo Pierre Sanon,Paul Lukowicz,Nikolaos Palaiodimopoulos,Maximilian Kiefer-Emmanouilidis
机构: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern-Landau; QC-AI; Department of Computer Science; Department of Physics
类目: Computer Vision and Pattern Recognition (cs.CV); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Understanding the impact of small quantum gate perturbations, which are common in quantum digital devices but absent in classical computers, is crucial for identifying potential advantages in quantum machine learning. While these perturbations are typically seen as detrimental to quantum computation, they can actually enhance performance by serving as a natural source of data augmentation. Additionally, they can often be efficiently simulated on classical hardware, enabling quantum-inspired approaches to improve classical machine learning methods. In this paper, we investigate random Bloch sphere rotations, which are fundamental SU(2) transformations, as a simple yet effective quantum-inspired data augmentation technique. Unlike conventional augmentations such as flipping, rotating, or cropping, quantum transformations lack intuitive spatial interpretations, making their application to tasks like image classification less straightforward. While common quantum augmentation methods rely on applying quantum models or trainable quanvolutional layers to classical datasets, we focus on the direct application of small-angle Bloch rotations and their effect on classical data. Using the large-scale ImageNet dataset, we demonstrate that our quantum-inspired augmentation method improves image classification performance, increasing Top-1 accuracy by 3%, Top-5 accuracy by 2.5%, and the F _1 score from 8% to 12% compared to standard classical augmentation methods. Finally, we examine the use of stronger unitary augmentations. Although these transformations preserve information in principle, they result in visually unrecognizable images with potential applications for privacy computations. However, we show that our augmentation approach and simple SU(2) transformations do not enhance differential privacy and discuss the implications of this limitation.
zh

[CV-16] ReF-LLE: Personalized Low-Light Enhancement via Reference-Guided Deep Reinforcement Learning ICME2025

【速读】：该论文旨在解决低光照图像增强中的两个主要问题：1）不同条件下低光照图像的显著差异性；2）增强程度受主观偏好和用户意图的影响。其解决方案的关键在于提出一种名为ReF-LLE的新方法，该方法在傅里叶频率域中操作并结合深度强化学习（Deep Reinforcement Learning），首次将深度强化学习引入该领域。通过引入零参考图像评估策略，在训练过程中为增强图像评分以提供奖励信号，从而指导模型有效处理不同强度的低光照条件；在推理阶段，利用傅里叶域中的零频分量作为整体光照水平的表示，采用个性化的自适应迭代策略，使模型能够自适应调整低光照图像以匹配用户提供的参考图像的光照分布，实现个性化增强效果。

链接: https://arxiv.org/abs/2506.22216
作者: Ming Zhao,Pingping Liu,Tongshun Zhang,Zhe Zhang
机构: Jilin University (吉林大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 6 pages, 8 figures, accepted by ICME2025

点击查看摘要

Abstract:Low-light image enhancement presents two primary challenges: 1) Significant variations in low-light images across different conditions, and 2) Enhancement levels influenced by subjective preferences and user intent. To address these issues, we propose ReF-LLE, a novel personalized low-light image enhancement method that operates in the Fourier frequency domain and incorporates deep reinforcement learning. ReF-LLE is the first to integrate deep reinforcement learning into this domain. During training, a zero-reference image evaluation strategy is introduced to score enhanced images, providing reward signals that guide the model to handle varying degrees of low-light conditions effectively. In the inference phase, ReF-LLE employs a personalized adaptive iterative strategy, guided by the zero-frequency component in the Fourier domain, which represents the overall illumination level. This strategy enables the model to adaptively adjust low-light images to align with the illumination distribution of a user-provided reference image, ensuring personalized enhancement results. Extensive experiments on benchmark datasets demonstrate that ReF-LLE outperforms state-of-the-art methods, achieving superior perceptual quality and adaptability in personalized low-light image enhancement.
zh

[CV-17] Robust and Accurate Multi-view 2D/3D Image Registration with Differentiable X-ray Rendering and Dual Cross-view Constraints ICRA2025

【速读】：该论文旨在解决单图像术中场景下视场有限导致的2D/3D配准精度不足的问题，提出了一种多视角2D/3D刚性配准方法。其解决方案的关键在于设计了一个包含姿态差异与图像相似性（如归一化互相关）的联合损失函数，并引入跨视角训练损失项以显式约束多视角间的投影姿态一致性，同时在第二阶段通过测试时优化进一步提升配准精度，从而增强配准过程的鲁棒性。

链接: https://arxiv.org/abs/2506.22191
作者: Yuxin Cui,Rui Song,Yibin Li,Max Q.-H. Meng,Zhe Min
机构: Shandong University (山东大学); UCL Hawkes Institute (UCL霍克斯研究所); Department of Medical Physics & Biomedical Engineering, University College London (伦敦大学学院医学物理与生物医学工程系); Shenzhen Key Laboratory of Robotics Perception and Intelligence (深圳市机器人感知与智能重点实验室); Dept. of Electronic and Electrical Engineering, Southern University of Science and Technology (南方科技大学电子与电气工程系); Dept. of Electronic Engineering, The Chinese University of Hong Kong (香港中文大学电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICRA 2025

点击查看摘要

Abstract:Robust and accurate 2D/3D registration, which aligns preoperative models with intraoperative images of the same anatomy, is crucial for successful interventional navigation. To mitigate the challenge of a limited field of view in single-image intraoperative scenarios, multi-view 2D/3D registration is required by leveraging multiple intraoperative images. In this paper, we propose a novel multi-view 2D/3D rigid registration approach comprising two stages. In the first stage, a combined loss function is designed, incorporating both the differences between predicted and ground-truth poses and the dissimilarities (e.g., normalized cross-correlation) between simulated and observed intraoperative images. More importantly, additional cross-view training loss terms are introduced for both pose and image losses to explicitly enforce cross-view constraints. In the second stage, test-time optimization is performed to refine the estimated poses from the coarse stage. Our method exploits the mutual constraints of multi-view projection poses to enhance the robustness of the registration process. The proposed framework achieves a mean target registration error (mTRE) of 0.79 \pm 2.17 mm on six specimens from the DeepFluoro dataset, demonstrating superior performance compared to state-of-the-art registration algorithms.
zh

[CV-18] Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition ICCV2025

【速读】：该论文旨在解决零样本骨架动作识别中因缺乏未见类别训练数据而导致的语义鸿沟和动作模式细粒度表达不足的问题。其解决方案的关键在于提出一种基于频率-语义增强的变分自编码器（FS-VAE），通过频域分解增强骨架语义表示学习，包含三个核心组件：基于频率的增强模块以提升骨架语义学习的丰富性和鲁棒性；基于语义的动作描述与多层级对齐机制以捕捉局部细节与全局对应关系；以及校准的跨对齐损失以平衡有效与模糊的骨架-文本对，从而提升模型的对齐能力与识别性能。

链接: https://arxiv.org/abs/2506.22179
作者: Wenhan Wu,Zhishuai Guo,Chen Chen,Hongfei Xue,Aidong Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address these limitations, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE) to explore the skeleton semantic representation learning with frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments to enrich the skeletal semantics learning and improve the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment to capture both local details and global correspondence, effectively bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities in skeleton and text features, thereby ensuring robust alignment. Evaluations on the benchmarks demonstrate the effectiveness of our approach, validating that frequency-enhanced semantic features enable robust differentiation of visually and semantically similar action clusters, improving zero-shot action recognition.
zh

[CV-19] KnotDLO: Toward Interpretable Knot Tying ICRA20243

【速读】：该论文试图解决单手操作可变形线性物体（DLO）打结的问题，特别是在存在遮挡、初始配置变化以及无需人类示范或训练的情况下实现鲁棒且可重复的打结操作。解决方案的关键在于通过当前DLO形状规划抓取和目标航点，并基于追踪的分段线性曲线计算抓取位姿，同时根据当前DLO状态和期望下一状态的几何特性生成中间航点，从而实现视觉推理与控制的解耦。

链接: https://arxiv.org/abs/2506.22176
作者: Holly Dinkel,Raghavendra Navaratna,Jingyi Xiang,Brian Coltin,Trey Smith,Timothy Bretl
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); NASA Ames Research Center (美国国家航空航天局艾姆斯研究中心)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 5 figures, presented at the Workshop on 3D Visual Representations for Manipulation at the 2023 IEEE International Conference on Robotics and Automation in Yokohama, Japan. Video presentation [ this https URL ]. Poster [ this https URL ] 3DVRM Workshop [ this https URL ]

点击查看摘要

Abstract:This work presents KnotDLO, a method for one-handed Deformable Linear Object (DLO) knot tying that is robust to occlusion, repeatable for varying rope initial configurations, interpretable for generating motion policies, and requires no human demonstrations or training. Grasp and target waypoints for future DLO states are planned from the current DLO shape. Grasp poses are computed from indexing the tracked piecewise linear curve representing the DLO state based on the current curve shape and are piecewise continuous. KnotDLO computes intermediate waypoints from the geometry of the current DLO state and the desired next state. The system decouples visual reasoning from control. In 16 trials of knot tying, KnotDLO achieves a 50% success rate in tying an overhand knot from previously unseen configurations.
zh

[CV-20] Attention-disentangled Uniform Orthogonal Feature Space Optimization for Few-shot Object Detection

【速读】：该论文旨在解决少样本目标检测（Few-shot object detection, FSOD）中由于新颖类别样本不足而导致的类特定物体存在性判断不准确问题，以及现有方法在共享特征空间中将物体存在性识别与前景分类耦合所带来的局限性。其解决方案的关键在于提出一种统一正交特征空间（Uniform Orthogonal Feature Space, UOFS）优化框架，通过将特征空间解耦为表征物体存在性的模长和表征分类的幅角，实现跨类别的物体存在性知识迁移，并结合混合背景优化（Hybrid Background Optimization, HBO）策略解决背景与新颖类别实例混淆及角度优化过拟合问题。

链接: https://arxiv.org/abs/2506.22161
作者: Taijin Zhao,Heqian Qiu,Yu Dai,Lanxiao Wang,Fanman Meng,Qingbo Wu,Hongliang Li
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot object detection (FSOD) aims to detect objects with limited samples for novel classes, while relying on abundant data for base classes. Existing FSOD approaches, predominantly built on the Faster R-CNN detector, entangle objectness recognition and foreground classification within shared feature spaces. This paradigm inherently establishes class-specific objectness criteria and suffers from unrepresentative novel class samples. To resolve this limitation, we propose a Uniform Orthogonal Feature Space (UOFS) optimization framework. First, UOFS decouples the feature space into two orthogonal components, where magnitude encodes objectness and angle encodes classification. This decoupling enables transferring class-agnostic objectness knowledge from base classes to novel classes. Moreover, implementing the disentanglement requires careful attention to two challenges: (1) Base set images contain unlabeled foreground instances, causing confusion between potential novel class instances and backgrounds. (2) Angular optimization depends exclusively on base class foreground instances, inducing overfitting of angular distributions to base classes. To address these challenges, we propose a Hybrid Background Optimization (HBO) strategy: (1) Constructing a pure background base set by removing unlabeled instances in original images to provide unbiased magnitude-based objectness supervision. (2) Incorporating unlabeled foreground instances in the original base set into angular optimization to enhance distribution uniformity. Additionally, we propose a Spatial-wise Attention Disentanglement and Association (SADA) module to address task conflicts between class-agnostic and class-specific tasks. Experiments demonstrate that our method significantly outperforms existing approaches based on entangled feature spaces.
zh

[CV-21] Hardware acceleration for ultra-fast Neural Network training on FPGA for MRF map reconstruction

【速读】：该论文旨在解决传统磁共振指纹成像（Magnetic Resonance Fingerprinting, MRF）在实时脑部参数重建中的计算效率问题，即如何加速神经网络（Neural Networks, NNs）的训练与推理过程以实现移动设备上的实时分析。其解决方案的关键在于提出一种基于现场可编程门阵列（FPGA）的神经网络架构，该架构能够在减少训练时间的同时，实现对MRF数据的高效实时处理，从而推动临床决策和远程医疗的发展。

链接: https://arxiv.org/abs/2506.22156
作者: Mattia Ricchi,Fabrizio Alfonsi,Camilla Marella,Marco Barbieri,Alessandra Retico,Leonardo Brizi,Alessandro Gabrielli,Claudia Testa
机构: Stanford University (斯坦福大学)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
备注: 8 pages, 2 figures, to be published in conference proceedings of SDPS 2024: 2024 International Conference of the Society for Design and Process Science on Advances and Challenges of Applying AI/GenAI in Design and Process Science

点击查看摘要

Abstract:Magnetic Resonance Fingerprinting (MRF) is a fast quantitative MR Imaging technique that provides multi-parametric maps with a single acquisition. Neural Networks (NNs) accelerate reconstruction but require significant resources for training. We propose an FPGA-based NN for real-time brain parameter reconstruction from MRF data. Training the NN takes an estimated 200 seconds, significantly faster than standard CPU-based training, which can be up to 250 times slower. This method could enable real-time brain analysis on mobile devices, revolutionizing clinical decision-making and telemedicine.
zh

[CV-22] RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models MICCAI2025

【速读】：该论文试图解决现有光学相干断层扫描（OCT）基础模型（FM）在语义理解上的不足，这些模型仅基于图像数据训练，导致其在复杂下游任务中的表现受限，需要依赖监督微调以适应特定应用场景，而这种微调可能在实际中不可行。解决方案的关键在于提出RetFiner，一种基于自监督学习（SSL）的视觉-语言精炼方案，通过利用文本数据中的丰富监督信号，改进现有基础模型的表征能力，并实现其对特定人群的高效直接适配，从而提升下游任务性能。

链接: https://arxiv.org/abs/2506.22149
作者: Ronald Fecso,José Morano,Ursula Schmidt-Erfurth,Hrvoje Bogunović
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at MICCAI 2025

点击查看摘要

Abstract:The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at this https URL.
zh

[CV-23] Visual Structures Helps Visual Reasoning : Addressing the Binding Problem in VLMs

【速读】：该论文试图解决视觉语言模型（Vision-Language Models, VLMs）在视觉推理能力上的局限性，尤其是由“绑定问题”（binding problem）引起的感知特征与正确视觉参照物之间无法可靠关联的问题。这一限制导致计数、视觉搜索、场景描述和空间关系理解等任务中存在持续性错误。解决方案的关键在于引入一种简单但有效的干预措施：在视觉输入中增强低级空间结构（如水平线），并将其与鼓励顺序、空间感知解析的文本提示相结合。该方法通过增强视觉输入的空间结构，提升了VLM在核心视觉推理任务中的性能。

链接: https://arxiv.org/abs/2506.22146
作者: Amirmohammad Izadi,Mohammad Ali Banayeeanzade,Fatemeh Askari,Ali Rahimiakbar,Mohammad Mahdi Vahedi,Hosein Hasani,Mahdieh Soleymani Baghshah
机构: Sharif University of Technology (沙里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the \textitbinding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces a simple yet effective intervention: augmenting visual inputs with low-level spatial structures (e.g., horizontal lines) and pairing this with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, our method improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. Our method enhances binding only with a single-query inference, underscoring the importance of visual input design over purely linguistically-based approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.
zh

[CV-24] Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLM s ICCV2025

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在视频理解任务中因数据量大和时间复杂性而导致的适应性不足问题。现有基于均匀帧采样的视频大语言模型（Video-LLMs）难以有效捕捉与查询相关的关键时空线索。论文提出的解决方案——Q-Frame，其关键是通过一种无需训练、即插即用的策略，利用文本-图像匹配网络（如CLIP）生成的特征，结合Gumbel-Max技巧实现自适应帧选择与多分辨率缩放，从而在不超出计算限制的情况下处理更多帧，保留重要的时空信息。

链接: https://arxiv.org/abs/2506.22139
作者: Shaojie Zhang,Jiahui Yang,Jianqin Yin,Zhenbo Luo,Jian Luan
机构: MiLM Plus, Xiaomi Inc. (MiLM Plus, 小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video’s content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame’s effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
zh

[CV-25] Low-Rank Implicit Neural Representation via Schatten-p Quasi-Norm and Jacobian Regularization

【速读】：该论文旨在解决低秩张量表示中如何在保持可解释性的同时获得稀疏解的问题，特别是在传统方法如Tucker分解和CANDECOMP/PARAFAC (CP)分解之间存在的权衡。其解决方案的关键在于提出一种基于CP分解的低秩张量函数，该函数通过神经网络进行参数化，用于隐式神经表示（CP-INR）。该方法结合了CP分解的可解释性和神经网络的非线性建模能力，并引入变分形式的Schatten-p准范数以实现稀疏CP分解，同时采用基于雅可比矩阵谱范数和Hutchinson迹估计器的平滑正则化项，避免了奇异值分解（SVD）和显式链式法则推导，从而提升了方法的效率与适用性。

链接: https://arxiv.org/abs/2506.22134
作者: Zhengyun Cheng,Changhao Wang,Guanwen Zhang,Yi Xu,Wei Zhou,Xiangyang Ji
机构: Northwestern Polytechnical University (西北工业大学); Dalian University of Technology (大连理工大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

点击查看摘要

Abstract:Higher-order tensors are well-suited for representing multi-dimensional data, such as color images and videos. Low-rank tensor representation has become essential in machine learning and computer vision, but existing methods like Tucker decomposition offer flexibility at the expense of interpretability. In contrast, while the CANDECOMP/PARAFAC (CP) decomposition provides a more natural and interpretable tensor structure, obtaining sparse solutions remains challenging. Leveraging the rich properties of CP decomposition, we propose a CP-based low-rank tensor function parameterized by neural networks for implicit neural representation (CP-INR). This approach enables continuous data representation beyond structured grids, fully exploiting the non-linearity of tensor data with theoretical guarantees on excess risk bounds. To achieve a sparse CP decomposition, we introduce a variational form of the Schatten-p quasi-norm and prove its relationship to multilinear rank minimization. For smoothness, we propose a regularization term based on the spectral norm of the Jacobian and Hutchinson’s trace estimator. Our proposed smoothness regularization is SVD-free and avoids explicit chain rule derivations. It can serve as an alternative to Total Variation (TV) regularization in image denoising tasks and is naturally applicable to continuous data. Extensive experiments on multi-dimensional data recovery tasks, including image inpainting, denoising, and point cloud upsampling, demonstrate the superiority and versatility of our method compared to state-of-the-art approaches.
zh

[CV-26] Pipe Reconstruction from Point Cloud Data

【速读】：该论文旨在解决工业资产（如船舶和海上平台）中复杂管道网络的精确数字孪生构建问题，特别是针对从不完整激光扫描数据中手动建模管道所存在的耗时且劳动强度大的挑战。其解决方案的关键在于提出一种自动化管道重建流程，该流程首先利用基于拉普拉斯的收缩方法估计骨架曲线，随后进行曲线延伸，再通过结合滚动球技术和二维圆拟合对骨架轴线进行重新定位，并通过三维平滑步骤进行优化，从而准确确定管道的半径、长度和方向等属性。

链接: https://arxiv.org/abs/2506.22118
作者: Antje Alex,Jannis Stoppe
机构: German Aerospace Center (DLR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate digital twins of industrial assets, such as ships and offshore platforms, rely on the precise reconstruction of complex pipe networks. However, manual modelling of pipes from laser scan data is a time-consuming and labor-intensive process. This paper presents a pipeline for automated pipe reconstruction from incomplete laser scan data. The approach estimates a skeleton curve using Laplacian-based contraction, followed by curve elongation. The skeleton axis is then recentred using a rolling sphere technique combined with 2D circle fitting, and refined with a 3D smoothing step. This enables the determination of pipe properties, including radius, length and orientation, and facilitates the creation of detailed 3D models of complex pipe networks. By automating pipe reconstruction, this approach supports the development of digital twins, allowing for rapid and accurate modeling while reducing costs.
zh

[CV-27] Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration

【速读】：该论文旨在解决在人机协作中如何准确定位指向目标的问题，特别是在平面工作空间内的目标选择任务。解决方案的关键在于采用姿态估计技术，并结合基于肩-腕延伸的简单几何模型，从RGB-D流中提取手势数据，从而实现对指向动作的有效识别与定位。

链接: https://arxiv.org/abs/2506.22116
作者: Noora Sassali,Roel Pieters
机构: Tampere University (坦佩雷大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Preprint

点击查看摘要

Abstract:Pointing gestures are a common interaction method used in Human-Robot Collaboration for various tasks, ranging from selecting targets to guiding industrial processes. This study introduces a method for localizing pointed targets within a planar workspace. The approach employs pose estimation, and a simple geometric model based on shoulder-wrist extension to extract gesturing data from an RGB-D stream. The study proposes a rigorous methodology and comprehensive analysis for evaluating pointing gestures and target selection in typical robotic tasks. In addition to evaluating tool accuracy, the tool is integrated into a proof-of-concept robotic system, which includes object detection, speech transcription, and speech synthesis to demonstrate the integration of multiple modalities in a collaborative application. Finally, a discussion over tool limitations and performance is provided to understand its role in multimodal robotic systems. All developments are available at: this https URL.
zh

[CV-28] Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD

【速读】：该论文旨在解决在复杂且不可预测的交通环境中准确预测行人行为的问题，以提升自动驾驶系统的安全性和车辆导航能力。其解决方案的关键在于构建一个名为Indian driving pedestrian dataset的数据集，该数据集专门针对非结构化环境中的行人行为建模挑战，如光照变化、行人遮挡、无信号场景类型以及车与行人交互等，提供了高阶和低阶的详细标注，以反映需要自车关注的行人行为。

链接: https://arxiv.org/abs/2506.22111
作者: Ruthvik Bokkasam,Shankar Gangisetty,A. H. Abdul Hafez,C. V. Jawahar
机构: IIIT, Hyderabad, India; King Faisal University (法赫德国王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types and vehicle-pedestrian interactions. The dataset provides high-level and detailed low-level comprehensive annotations focused on pedestrians requiring the ego-vehicle’s attention. Evaluation of the state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to \mathbf15% , while trajectory prediction methods underperform with an increase of up to \mathbf1208 MSE, defeating standard pedestrian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models. Project Page: this https URL
zh

[CV-29] d Prototype Model for Few-Shot Medical Image Segmentation MICCAI2025 MICCAI

【速读】：该论文试图解决传统基于原型的医学图像少样本分割（Few-shot Segmentation, FSS）方法在背景建模上的局限性，特别是ADNet方法中存在的三个关键问题：每类依赖单一原型、仅关注二分类以及固定阈值无法适应患者和器官的差异性。解决方案的关键在于提出一种称为Tied Prototype Model (TPM) 的模型，该模型通过共享前景和背景分布的原型位置，对ADNet进行了合理的重新表述，并在其概率基础之上自然扩展至多原型和多类别分割，同时有效分离非典型背景特征，从而提升了分割精度。此外，通过利用自然产生的类别先验定义自适应阈值，进一步增强了分割性能。

链接: https://arxiv.org/abs/2506.22101
作者: Hyeongji Kim,Stine Hansen,Michael Kampffmeyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Submitted version (MICCAI). Accepted at MICCAI 2025. The code repo will be made publicly available soon

点击查看摘要

Abstract:Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly – an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at this https URL.
zh

[CV-30] BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting ICCV2025

【速读】：该论文旨在解决街景真实重建中对高精度物体位姿标注的依赖问题，这一依赖限制了大规模和广泛场景重建的可行性。其解决方案的关键在于提出Bézier曲线高斯点云（BézierGS），该方法利用可学习的Bézier曲线表示动态物体的运动轨迹，从而充分挖掘动态物体的时间信息，并通过可学习的曲线建模自动校正位姿误差。此外，通过引入动态物体渲染的额外监督和曲线间一致性约束，实现了场景元素的合理且精确的分离与重建。

链接: https://arxiv.org/abs/2506.22099
作者: Zipei Ma,Junzhe Jiang,Yurui Chen,Li Zhang
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025, Project Page: this https URL

点击查看摘要

Abstract:The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene components reconstruction and novel view synthesis.
zh

[CV-31] owards Accurate Heart Rate Measurement from Ultra-Short Video Clips via Periodicity-Guided rPPG Estimation and Signal Reconstruction

【速读】：该论文旨在解决从超短视频片段（约2秒）中准确测量心率（Heart Rate, HR）的问题，这一问题在现有远程光电容积描记术（remote photoplethysmography, rPPG）方法中常被忽视。其解决方案的关键在于应对两个核心挑战：首先，针对超短视频中心跳周期数量有限的问题，提出了一种基于周期性的rPPG估计方法，以确保从超短视频中估计的rPPG信号与较长的真实信号之间保持一致的周期性；其次，为减少由于频谱泄漏导致的估计误差，引入了一个生成器来从超短rPPG信号中重建更长的信号，同时保留其周期一致性，从而实现更精确的HR测量。

链接: https://arxiv.org/abs/2506.22078
作者: Pei-Kai Huanga,Ya-Ting Chan,Kuan-Wen Chen,Yen-Chun Chou,Shih-Yu Yang,Chiou-Ting Hsu
机构: Fujian Normal University (福建师范大学); National Tsing Hua University (国立清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many remote Heart Rate (HR) measurement methods focus on estimating remote photoplethysmography (rPPG) signals from video clips lasting around 10 seconds but often overlook the need for HR estimation from ultra-short video clips. In this paper, we aim to accurately measure HR from ultra-short 2-second video clips by specifically addressing two key challenges. First, to overcome the limited number of heartbeat cycles in ultra-short video clips, we propose an effective periodicity-guided rPPG estimation method that enforces consistent periodicity between rPPG signals estimated from ultra-short clips and their much longer ground truth signals. Next, to mitigate estimation inaccuracies due to spectral leakage, we propose including a generator to reconstruct longer rPPG signals from ultra-short ones while preserving their periodic consistency to enable more accurate HR measurement. Extensive experiments on four rPPG estimation benchmark datasets demonstrate that our proposed method not only accurately measures HR from ultra-short video clips but also outperform previous rPPG estimation techniques to achieve state-of-the-art performance.
zh

[CV-32] Reasoning in machine vision: learning to think fast and slow

【速读】：该论文试图解决机器智能在非语言推理任务中表现不足的问题，特别是在视觉感知、空间推理和放射学诊断等现实世界任务中，机器缺乏动态优化解决方案的能力。其解决方案的关键在于提出一种新的学习范式，该范式受到心理学中双过程理论的启发，整合了快速思考的System I模块与慢速迭代优化的System II模块，通过自对弈强化学习实现推理过程的逐步改进，从而在有限标注数据条件下提升模型性能。

链接: https://arxiv.org/abs/2506.22075
作者: Shaheer U. Saeed,Yipei Wang,Veeru Kasivisvanathan,Brian R. Davidson,Matthew J. Clarkson,Yipeng Hu,Daniel C. Alexander
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to training data, lacking the ability to dynamically refine solutions at inference time. While some recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks - including visual perception, spatial reasoning, and radiological diagnosis - require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance improvement with increasing thinking time (inference-time compute), even under conditions where labelled data is very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks, with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only to large-scale supervised learning but also foundation models and even human experts, in real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing transformative potential for non-verbal machine reasoning.
zh

[CV-33] Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras ICCV2025

【速读】：该论文试图解决滚动快门相机之间的相对位姿估计问题（relative pose estimation），其核心挑战在于如何在不显式建模相机运动的情况下，利用单张图像中的扫描线交点进行位姿计算。解决方案的关键在于利用单个扫描线的投影交点来实现相对位姿估计，从而无需依赖传统的运动模型，使得每个扫描线的位姿可以独立计算，这为滚动快门结构重建（rolling shutter structure-from-motion, SfM）提供了一个基础构建模块。

链接: https://arxiv.org/abs/2506.22069
作者: Petr Hruby,Marc Pollefeys
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025, 15 pages, 5 figures, 12 tables

点击查看摘要

Abstract:We propose a novel approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. This allows pose estimation without explicitly modeling camera motion. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras. Our approach is designed as a foundational building block for rolling shutter structure-from-motion (SfM), where no motion model is required, and each scanline’s pose can be computed independently. % We classify minimal solvers for this problem in both generic and specialized settings, including cases with parallel lines and known gravity direction, assuming known intrinsics and no lens distortion. Furthermore, we develop minimal solvers for the parallel-lines scenario, both with and without gravity priors, by leveraging connections between this problem and the estimation of 2D structure from 1D cameras. % Experiments on rolling shutter images from the Fastec dataset demonstrate the feasibility of our approach for initializing rolling shutter SfM, highlighting its potential for further development. % The code will be made publicly available.
zh

[CV-34] MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

【速读】：该论文旨在解决音频驱动的人像动画生成中实时性差、高保真度与时间一致性难以兼顾的问题。其解决方案的关键在于基于LTX视频模型构建一个名为MirrorMe的实时可控框架，通过引入参考身份注入机制、因果音频编码器与适配器以及渐进式训练策略，有效提升了生成视频的语义保真度、唇形同步精度和时间稳定性。

链接: https://arxiv.org/abs/2506.22065
作者: Dechao Meng,Steven Xiao,Xindi Zhang,Guangyuan Wang,Peng Zhang,Qi Wang,Bang Zhang,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室，阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX’s trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal audio encoder and adapter tailored to LTX’s temporal structure, enabling precise audio-expression synchronization; and 3. A progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe’s state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
zh

[CV-35] EnLVAM: Enhanced Left Ventricle Linear Measurements Utilizing Anatomical Motion Mode

【速读】：该论文旨在解决通过B-mode超声心动图在Parasternal Long Axis（PLAX）视图中对左心室（Left Ventricle, LV）进行线性测量时存在的手动定位耗时且易出错的问题，以及现有深度学习方法在地标点对齐上的不足。其解决方案的关键在于提出一种新框架，通过强制直线约束来提升LV测量的准确性，具体方法是利用从B-mode视频实时计算的Anatomical M-Mode (AMM)图像训练地标检测器，随后将其转换回B-mode空间，从而解决对齐问题并减少测量误差。

链接: https://arxiv.org/abs/2506.22063
作者: Durgesh K. Singh,Ahcene Boubekki,Qing Cao,Svein Arne Aase,Robert Jenssen,Michael Kampffmeyer
机构: UiT The Arctic University of Norway (UiT北极大学); GE Healthcare (通用医疗); GE Vingmed Ultrasound (通用维格梅德超声); Physikalisch-Technische Bundesanstalt (德国联邦物理技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Linear measurements of the left ventricle (LV) in the Parasternal Long Axis (PLAX) view using B-mode echocardiography are crucial for cardiac assessment. These involve placing 4-6 landmarks along a virtual scanline (SL) perpendicular to the LV axis near the mitral valve tips. Manual placement is time-consuming and error-prone, while existing deep learning methods often misalign landmarks, causing inaccurate measurements. We propose a novel framework that enhances LV measurement accuracy by enforcing straight-line constraints. A landmark detector is trained on Anatomical M-Mode (AMM) images, computed in real time from B-mode videos, then transformed back to B-mode space. This approach addresses misalignment and reduces measurement errors. Experiments show improved accuracy over standard B-mode methods, and the framework generalizes well across network architectures. Our semi-automatic design includes a human-in-the-loop step where the user only places the SL, simplifying interaction while preserving alignment flexibility and clinical relevance.
zh

[CV-36] Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field

【速读】：该论文旨在解决基于重建和渲染的说话头合成方法在身份特异性适应上的高计算成本与可扩展性不足的问题。现有方法依赖于身份特定模型，每个新身份都需要从头训练，导致效率低下。解决方案的关键在于提出FIAG框架，该框架通过引入全局高斯场（Global Gaussian Field）和通用运动场（Universal Motion Field），实现仅需少量训练数据即可高效进行身份特异性适应，从而提升模型的泛化能力和适应速度。

链接: https://arxiv.org/abs/2506.22044
作者: Hong Nie,Fuyuan Cao,Lu Chen,Fengxin Chen,Yuefeng Zou,Jun Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D speaking head synthesis framework that enables efficient identity-specific adaptation using only a few training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: \textitthis https URL.
zh

[CV-37] Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation

【速读】：该论文旨在解决零样本语义分割（Zero-shot Semantic Segmentation, ZSS）中由于视觉特征与文本空间对齐困难以及CLIP全局表示与分割模型局部细粒度特征之间存在语义差距而导致的知识迁移挑战。其解决方案的关键在于提出Chimera-Seg架构，该架构结合了分割主干网络与基于CLIP的语义头部，实现空间精度与视觉-语言对齐的融合；同时引入选择性全局蒸馏（Selective Global Distillation, SGD）和语义对齐模块（Semantic Alignment Module, SAM），以提升特征对齐效果并优化知识迁移过程。

链接: https://arxiv.org/abs/2506.22032
作者: Jialei Chen,Xu Zheng,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi
机构: Nagoya University(名古屋大学); The Hong Kong University of Science and Technology, Guangzhou Campus (HKUST-GZ)(香港科技大学广州校区); INSAIT(INSAIT); Sofia University, St. Kliment Ohridski(索非亚大学，圣克莱门特·奥赫里德斯基)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer vision-language alignment of vision-language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP’s global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. The partial module from CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping to CLIP’s semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. Besides, we also use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.
zh

[CV-38] Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method ICCV2025

【速读】：该论文旨在解决利用地球观测图像进行地面目标检测与跟踪的挑战，特别是针对海上船舶持续跟踪的需求。现有方法主要依赖静止轨道卫星或视频卫星，前者分辨率低且易受天气影响，后者拍摄时间短且覆盖范围有限，难以满足实际应用需求。论文提出的解决方案是构建一个结合光学与合成孔径雷达（SAR）传感器的混合数据集——HOSS ReID数据集，以评估低地轨星座在船舶跟踪中的有效性，该方法能够实现更短的重成像周期和全天候跟踪能力。关键在于通过多模态卫星在不同时间和角度下对同一艘船舶进行长期观测，并提出基于视觉Transformer架构的跨模态船舶再识别基线方法TransOSS，以提取模态不变特征。

链接: https://arxiv.org/abs/2506.22027
作者: Han Wang,Shengyang Li,Jian Yang,Yuxuan Liu,Yixuan Lv,Zhuang Zhou
机构: Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences (中国科学院空间应用工程与技术中心); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning to pre-train on large-scale optical-SAR image pairs, ensuring the model’s ability to extract modality-invariant features. Our dataset and baseline method are publicly available on this https URL.
zh

[CV-39] Advancing Facial Stylization through Semantic Preservation Constraint and Pseudo-Paired Supervision

【速读】：该论文旨在解决面部风格化过程中生成结果存在伪影或与源图像保真度不足的问题，其核心挑战在于在准确学习目标风格的同时保持与原图像的内容一致性。解决方案的关键在于引入语义保留约束和伪配对监督，以增强内容对应关系并提升风格化效果，同时通过构建多层级伪配对数据集实现监督约束，从而在不依赖复杂网络结构或额外训练的情况下实现更灵活的多模态和参考引导风格化。

链接: https://arxiv.org/abs/2506.22022
作者: Zhanyi Lu,Yue Zhou
机构: University of Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial stylization aims to transform facial images into appealing, high-quality stylized portraits, with the critical challenge of accurately learning the target style while maintaining content consistency with the original image. Although previous StyleGAN-based methods have made significant advancements, the generated results still suffer from artifacts or insufficient fidelity to the source image. We argue that these issues stem from neglecting semantic shift of the generator during stylization. Therefore, we propose a facial stylization method that integrates semantic preservation constraint and pseudo-paired supervision to enhance the content correspondence and improve the stylization effect. Additionally, we develop a methodology for creating multi-level pseudo-paired datasets to implement supervisory constraint. Furthermore, building upon our facial stylization framework, we achieve more flexible multimodal and reference-guided stylization without complex network architecture designs or additional training. Experimental results demonstrate that our approach produces high-fidelity, aesthetically pleasing facial style transfer that surpasses previous methods.
zh

[CV-40] owards Universal Efficient Model Compression via Exponential Torque Pruning

【速读】：该论文试图解决现代深度神经网络（Deep Neural Networks, DNNs）因复杂性和规模的快速增长而导致的计算成本和内存使用增加的问题，特别是针对模型压缩技术中剪枝效果不理想的挑战。论文指出，先前基于扭矩（Torque）启发的正则化方法在剪枝过程中存在不足，导致剪枝后的网络仍然较为密集且准确率下降明显。解决方案的关键在于提出一种指数扭矩剪枝（Exponential Torque Pruning, ETP），通过采用指数力应用方案替代原有的线性力应用方案，以更有效地剪除冗余和远距离的模块，同时保留对有效推理至关重要的近邻模块，从而在保持较高精度的前提下实现更高的压缩率。

链接: https://arxiv.org/abs/2506.22015
作者: Sarthak Ketanbhai Modi,Lim Zi Pong,Shourya Kuchhal,Yoshi Cao,Yupeng Cheng,Teo Yon Shin,Lin Shang-Wei,Zhiming Li
机构: Nanyang Technological University (南洋理工大学); Continental Automotive Singapore (大陆汽车新加坡); Singapore Institute of Technology (新加坡理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid growth in complexity and size of modern deep neural networks (DNNs) has increased challenges related to computational costs and memory usage, spurring a growing interest in efficient model compression techniques. Previous state-of-the-art approach proposes using a Torque-inspired regularization which forces the weights of neural modules around a selected pivot point. Whereas, we observe that the pruning effect of this approach is far from perfect, as the post-trained network is still dense and also suffers from high accuracy drop. In this work, we attribute such ineffectiveness to the default linear force application scheme, which imposes inappropriate force on neural module of different distances. To efficiently prune the redundant and distant modules while retaining those that are close and necessary for effective inference, in this work, we propose Exponential Torque Pruning (ETP), which adopts an exponential force application scheme for regularization. Experimental results on a broad range of domains demonstrate that, though being extremely simple, ETP manages to achieve significantly higher compression rate than the previous state-of-the-art pruning strategies with negligible accuracy drop.
zh

[CV-41] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

【速读】：该论文试图解决生成长时域机器人操作视频的问题，尽管文本到视频扩散模型在逼真度、语言理解和运动生成方面取得了显著进展，但在处理长时域机器人任务时仍存在挑战。现有方法通过自回归范式扩展短序列以实现长时域视频生成，导致生成视频和执行过程中出现误差累积。该论文的关键解决方案是提出一种新颖的流水线，避免自回归生成的需求，其核心包括：1）将高层目标分解为小的原子任务并生成与指令对齐的关键帧，随后通过另一个扩散模型在关键帧之间进行插值以生成长时域视频；2）引入语义保持注意力模块以维持关键帧间的一致性；3）设计一个轻量级策略模型，从生成的视频中回归机器人关节状态。

链接: https://arxiv.org/abs/2506.22007
作者: Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Soumajit Majumder,Ziyuan Liu,Gitta Kutyniok,Abhinav Valada
机构: Huawei Munich Research Center (华为慕尼黑研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.
zh

[CV-42] R1-Track: Direct Application of MLLM s to Visual Object Tracking via Reinforcement Learning

【速读】：该论文旨在解决传统视觉单目标跟踪方法在依赖显式分类与回归建模、依赖大规模监督数据集以及任务灵活性不足等方面的问题。其解决方案的关键在于利用多模态大语言模型（Multimodal Large Language Models, MLLMs）的强大学习能力，并通过微调策略提升其在跟踪任务中的表现。具体而言，研究者基于Qwen2.5-VL模型，采用群体相对策略优化（Group Relative Policy Optimization, GRPO）的强化学习方法，在小规模数据集上进行微调，从而得到R1-Track模型，该模型在GOT-10k基准测试中表现出色，并支持通过边界框或文本描述进行灵活初始化。

链接: https://arxiv.org/abs/2506.21980
作者: Biao Wang,Wenwen Li
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures

点击查看摘要

Abstract:Visual single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task has traditionally been framed as a template matching problem, evolving through major phases including correlation filters, two-stream networks, and one-stream networks with significant progress achieved. However, these methods typically require explicit classification and regression modeling, depend on supervised training with large-scale datasets, and are limited to the single task of tracking, lacking flexibility. In recent years, multi-modal large language models (MLLMs) have advanced rapidly. Open-source models like Qwen2.5-VL, a flagship MLLMs with strong foundational capabilities, demonstrate excellent performance in grounding tasks. This has spurred interest in applying such models directly to visual tracking. However, experiments reveal that Qwen2.5-VL struggles with template matching between image pairs (i.e., tracking tasks). Inspired by deepseek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method on a small-scale dataset with a rule-based reward function. The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark. R1-Track supports flexible initialization via bounding boxes or text descriptions while retaining most of the original model’s general capabilities. And we further discuss potential improvements for R1-Track. This rough technical report summarizes our findings as of May 2025.
zh

[CV-43] SceneDiffuser: City-Scale Traffic Simulation via a Generative World Model CVPR2025

【速读】：该论文旨在解决交通仿真中如何通过生成式模拟城市来扩展有限的手动驾驶里程，以支持更全面的自动驾驶软件验证问题。其核心挑战在于构建一个能够无缝模拟从点A到点B的交通场景的系统，该系统需具备场景生成、智能体行为建模、遮挡推理、动态场景生成及环境模拟等能力。解决方案的关键是提出SceneDiffuser++，这是首个基于单一损失函数的端到端生成式世界模型，能够整合上述所有需求，在城市尺度上实现点A到点B的仿真，并在长时序模拟条件下展现出更高的真实感。

链接: https://arxiv.org/abs/2506.21976
作者: Shuhan Tan,John Lambert,Hong Jeon,Sakshum Kulshrestha,Yijing Bai,Jing Luo,Dragomir Anguelov,Mingxing Tan,Chiyu Max Jiang
机构: Waymo LLC (Waymo公司); UT Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. We demonstrate the city-scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip-level simulation.
zh

[CV-44] ASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models

【速读】：该论文旨在解决开放环境中语义分割的可靠性问题，特别是现有RGB-T语义分割模型依赖低级视觉特征而缺乏高级文本信息，导致在类别具有相似视觉特征时难以实现准确分割；以及SAM在实例级分割中表现优异，但与热成像和文本的集成受到模态异质性和计算效率的限制。解决方案的关键在于提出TASeg框架，利用低秩适应（LoRA）微调技术将视觉基础模型适配到RGB-T数据，并引入动态特征融合模块（DFFM）以有效融合多模态视觉特征，同时在掩码解码器中结合CLIP生成的文本嵌入以实现语义对齐，从而提升分割精度。

链接: https://arxiv.org/abs/2506.21975
作者: Meng Yu,Te Cui,Qitong Chu,Wenjie Song,Yi Yang,Yufeng Yue
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, accepted for publication in lEEE/RSJ international Conference on Intelligent Robots and Systems (lROS 2025)

点击查看摘要

Abstract:Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-T semantic segmentation models mainly rely on low-level visual features and lack high-level textual information, which struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text-aware RGB-T segmentation framework by using Low-Rank Adaptation (LoRA) fine-tuning technology to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM’s original transformer blocks. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies the classification error and improves the semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.
zh

[CV-45] Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding IJCAI2025

【速读】：该论文旨在解决现有基于掩码点建模的预训练方法在学习点云特征表示时，由于依赖随机掩码策略而无法有效捕捉合理的语义关系的问题。其解决方案的关键在于提出一种语义增强的掩码自动编码器，该方法包含两个核心组件：基于原型的语义建模模块和语义增强的掩码策略。其中，语义建模模块通过设计组件语义引导机制，利用可学习的原型来捕获不同组件的语义信息，而语义增强的掩码策略则通过这些原型更有效地覆盖完整的组件结构，从而提升模型的表征能力。

链接: https://arxiv.org/abs/2506.21957
作者: Yixin Zha,Chuxin Wang,Wenfei Yang,Tianzhu Zhang
机构: University of Science and Technology of China / Deep Space Exploration Lab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. These pre-training methods rely on random masking strategies to establish the perception of point clouds by restoring corrupted point cloud inputs, which leads to the failure of capturing reasonable semantic relationships by the self-supervised models. To address this issue, we propose Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism to direct a set of learnable prototypes in capturing the semantics of different components from objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models in downstream tasks. Extensive experiments conducted on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed modules.
zh

[CV-46] SDRNET: Stacked Deep Residual Network for Accurate Semantic Segmentation of Fine-Resolution Remotely Sensed Images

【速读】：该论文旨在解决高分辨率遥感图像（FRRS）中语义分割的挑战，包括类别差异大、关键地物因遮挡不可见以及目标尺寸变化等问题。其解决方案的关键在于提出一种堆叠深度残差网络（SDRNet），通过两个堆叠的编码器-解码器网络来捕捉长距离语义信息并保留空间细节，同时在每个编码器和解码器之间引入扩张残差块（DRB）以捕获充分的全局依赖关系，从而提升分割性能。

链接: https://arxiv.org/abs/2506.21945
作者: Naftaly Wambugu,Ruisheng Wang,Bo Guo,Tianshu Yu,Sheng Xu,Mohammed Elhassan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Land cover maps generated from semantic segmentation of high-resolution remotely sensed images have drawn mucon in the photogrammetry and remote sensing research community. Currently, massive fine-resolution remotely sensed (FRRS) images acquired by improving sensing and imaging technologies become available. However, accurate semantic segmentation of such FRRS images is greatly affected by substantial class disparities, the invisibility of key ground objects due to occlusion, and object size variation. Despite the extraordinary potential in deep convolutional neural networks (DCNNs) in image feature learning and representation, extracting sufficient features from FRRS images for accurate semantic segmentation is still challenging. These challenges demand the deep learning models to learn robust features and generate sufficient feature descriptors. Specifically, learning multi-contextual features to guarantee adequate coverage of varied object sizes from the ground scene and harnessing global-local contexts to overcome class disparities challenge even profound networks. Deeper networks significantly lose spatial details due to gradual downsampling processes resulting in poor segmentation results and coarse boundaries. This article presents a stacked deep residual network (SDRNet) for semantic segmentation from FRRS images. The proposed framework utilizes two stacked encoder-decoder networks to harness long-range semantics yet preserve spatial information and dilated residual blocks (DRB) between each encoder and decoder network to capture sufficient global dependencies thus improving segmentation performance. Our experimental results obtained using the ISPRS Vaihingen and Potsdam datasets demonstrate that the SDRNet performs effectively and competitively against current DCNNs in semantic segmentation.
zh

[CV-47] CAL-RAG : Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design

【速读】：该论文旨在解决智能设计系统中内容感知版面生成（content-aware layout generation）这一基础但研究不足的问题，即如何在背景画布上自动排列文本、标志和底图等视觉元素。现有方法在处理语义对齐和视觉连贯性方面存在不足，缺乏对上下文设计实例的依据。论文提出的解决方案关键在于引入CAL-RAG框架，该框架结合了多模态检索、大语言模型（LLM）和协作代理推理，通过检索结构化知识库中的相关版面示例、调用基于LLM的布局推荐器、视觉-语言评分代理评估布局质量以及反馈代理提供优化建议，实现了迭代优化，从而提升了自动化版面生成的效果。

链接: https://arxiv.org/abs/2506.21934
作者: Najmeh Forouzandehmehr,Reza Yousefi Maragheh,Sriram Kollipara,Kai Zhao,Topojoy Biswas,Evren Korpeoglu,Kannan Achan
机构: Walmart Global Tech (沃尔玛全球科技)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automated content-aware layout generation – the task of arranging visual elements such as text, logos, and underlays on a background canvas – remains a fundamental yet under-explored problem in intelligent design systems. While recent advances in deep generative models and large language models (LLMs) have shown promise in structured content generation, most existing approaches lack grounding in contextual design exemplars and fall short in handling semantic alignment and visual coherence. In this work we introduce CAL-RAG, a retrieval-augmented, agentic framework for content-aware layout generation that integrates multimodal retrieval, large language models, and collaborative agentic reasoning. Our system retrieves relevant layout examples from a structured knowledge base and invokes an LLM-based layout recommender to propose structured element placements. A vision-language grader agent evaluates the layout with visual metrics, and a feedback agent provides targeted refinements, enabling iterative improvement. We implement our framework using LangGraph and evaluate it on the PKU PosterLayout dataset, a benchmark rich in semantic and structural variability. CAL-RAG achieves state-of-the-art performance across multiple layout metrics – including underlay effectiveness, element alignment, and overlap – substantially outperforming strong baselines such as LayoutPrompter. These results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution for automated layout generation.
zh

[CV-48] Quality Assessment and Distortion-aware Saliency Prediction for AI-Generated Omnidirectional Images

【速读】：该论文旨在解决AI生成的全向图像（AIGODIs）的质量评估与优化问题，特别是针对其特有的质量缺陷和失真感知显著区域的预测。解决方案的关键在于构建了一个名为OHF2024的全面数据库，该数据库包含了从三个视角评估的主观质量评分以及失真感知显著区域，并基于BLIP-2模型提出了两个具有共享编码器的模型，即BLIP2OIQA用于评估人类视觉体验，BLIP2OISal用于预测失真感知显著性。此外，论文还提出了一种自动优化流程，利用预测的视觉体验分数和失真区域来进一步提升AI生成全向图像的视觉质量。

链接: https://arxiv.org/abs/2506.21925
作者: Liu Yang,Huiyu Duan,Jiarui Wang,Jing Liu,Menghan Hu,Xiongkuo Min,Guangtao Zhai,Patrick Le Callet
机构: Shanghai Jiao Tong University (上海交通大学); Tianjin University (天津大学); East China Normal University (华东师范大学); Polytech Nantes, Université de Nantes (南特工程师学院，南特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques, AI generated images (AIGIs) have attracted widespread attention, among which AI generated omnidirectional images (AIGODIs) hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications. AI generated omnidirectional images exhibit unique quality issues, however, research on the quality assessment and optimization of AI-generated omnidirectional images is still lacking. To this end, this work first studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs, and further presents a corresponding optimization process. Specifically, we first establish a comprehensive database to reflect human feedback for AI-generated omnidirectionals, termed OHF2024, which includes both subjective quality ratings evaluated from three perspectives and distortion-aware salient regions. Based on the constructed OHF2024 database, we propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images, which are named as BLIP2OIQA and BLIP2OISal, respectively. Finally, based on the proposed models, we present an automatic optimization process that utilizes the predicted visual experience scores and distortion regions to further enhance the visual quality of an AI-generated omnidirectional image. Extensive experiments show that our BLIP2OIQA model and BLIP2OISal model achieve state-of-the-art (SOTA) results in the human visual experience evaluation task and the distortion-aware saliency prediction task for AI generated omnidirectional images, and can be effectively used in the optimization process. The database and codes will be released on this https URL to facilitate future research.
zh

[CV-49] SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

【速读】：该论文旨在解决3D视觉定位（3D Visual Grounding, 3DVG）中对昂贵的3D训练数据依赖过高的问题，通过引入一种无需依赖3D标注数据的零样本3DVG方法。其解决方案的关键在于提出SPAZER——一个基于视觉语言模型（VLM）的智能体，该智能体在渐进式推理框架中融合了空间（3D-based）与语义（2D-based）理解，通过多阶段的场景分析、候选对象筛选以及3D-2D联合决策，实现了高效的零样本定位。

链接: https://arxiv.org/abs/2506.21924
作者: Zhao Jin,Rong-Cheng Tu,Jingyi Liao,Wenhao Sun,Xiao Luo,Shunyu Liu,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform a coarse-level localization of potential objects. Furthermore, leveraging retrieved relevant 2D camera images, 3D-2D joint decision-making is efficiently performed to determine the best-matching object. By bridging spatial and semantic reasoning neural streams, SPAZER achieves robust zero-shot grounding without training on 3D-labeled data. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable gains of 9.0% and 10.9% in accuracy.
zh

[CV-50] ZeroReg3D: A Zero-shot Registration Pipeline for 3D Consecutive Histopathology Image Reconstruction

【速读】：该论文旨在解决从连续组织切片中进行准确三维重建的问题，特别是针对组织变形、切片伪影、染色变异和光照不一致等挑战。其解决方案的关键在于提出一种名为ZeroReg3D的零样本注册流程，该流程结合了基于深度学习的零样本关键点匹配与基于优化的仿射和非刚性配准技术，从而在无需重新训练或微调的情况下有效处理上述问题。

链接: https://arxiv.org/abs/2506.21923
作者: Juming Xiong,Ruining Deng,Jialin Yue,Siqi Lu,Junlin Guo,Marilyn Lionts,Tianyuan Yao,Can Cui,Junchao Zhu,Chongyu Qu,Mengmeng Yin,Haichun Yang,Yuankai Huo
机构: Vanderbilt University(范德比尔特大学); Weill Cornell Medicine(威尔康奈尔医学中心); Vanderbilt University Medical Center(范德比尔特大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Histological analysis plays a crucial role in understanding tissue structure and pathology. While recent advancements in registration methods have improved 2D histological analysis, they often struggle to preserve critical 3D spatial relationships, limiting their utility in both clinical and research applications. Specifically, constructing accurate 3D models from 2D slices remains challenging due to tissue deformation, sectioning artifacts, variability in imaging techniques, and inconsistent illumination. Deep learning-based registration methods have demonstrated improved performance but suffer from limited generalizability and require large-scale training data. In contrast, non-deep-learning approaches offer better generalizability but often compromise on accuracy. In this study, we introduced ZeroReg3D, a novel zero-shot registration pipeline tailored for accurate 3D reconstruction from serial histological sections. By combining zero-shot deep learning-based keypoint matching with optimization-based affine and non-rigid registration techniques, ZeroReg3D effectively addresses critical challenges such as tissue deformation, sectioning artifacts, staining variability, and inconsistent illumination without requiring retraining or fine-tuning. The code has been made publicly available at this https URL
zh

[CV-51] SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition

【速读】：该论文旨在解决从图像数据中自动重建表格逻辑布局的问题，即表格结构识别（Table Structure Recognition, TSR），这是语义数据提取的基础。其解决方案的关键在于提出SepFormer，该方法通过将分割与合并范式整合到一个步骤中，利用类似DETR的架构进行分离器回归，从而提高了处理速度和鲁棒性。SepFormer采用自粗粒度到细粒度的策略，通过两个Transformer解码器堆栈，从单行线段逐步预测到线段条带的表格分隔符，显著提升了识别效率与准确性。

链接: https://arxiv.org/abs/2506.21920
作者: Nam Quan Nguyen,Xuan Phong Pham,Tuan-Anh Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The automated reconstruction of the logical arrangement of tables from image data, termed Table Structure Recognition (TSR), is fundamental for semantic data extraction. Recently, researchers have explored a wide range of techniques to tackle this problem, demonstrating significant progress. Each table is a set of vertical and horizontal separators. Following this realization, we present SepFormer, which integrates the split-and-merge paradigm into a single step through separator regression with a DETR-style architecture, improving speed and robustness. SepFormer is a coarse-to-fine approach that predicts table separators from single-line to line-strip separators with a stack of two transformer decoders. In the coarse-grained stage, the model learns to gradually refine single-line segments through decoder layers with additional angle loss. At the end of the fine-grained stage, the model predicts line-strip separators by refining sampled points from each single-line segment. Our SepFormer can run on average at 25.6 FPS while achieving comparable performance with state-of-the-art methods on several benchmark datasets, including SciTSR, PubTabNet, WTW, and iFLYTAB.
zh

[CV-52] Generating Attribute-Aware Human Motions from Textual Prompt

【速读】：该论文试图解决现有文本驱动的人体运动生成方法忽视人类属性（如年龄、性别、体重和身高）对运动模式影响的问题，这些属性是塑造人体运动的关键因素。解决方案的关键在于提出一种受结构因果模型启发的新框架，该框架能够将动作语义与人类属性解耦，从而实现基于文本的语义预测和属性控制的运动生成。

链接: https://arxiv.org/abs/2506.21912
作者: Xinghan Wang,Kun Xu,Fei Li,Cao Sheng,Jiazhong Yu,Yadong Mu
机构: Peking University (北京大学); China Tower Corporation Limited (中国铁塔股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes (such as age, gender, weight, and height) which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating realistic, attribute-aware motion aligned with the user’s text and attribute inputs. For evaluation, we introduce HumanAttr, a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware text-to-motion generation. Extensive experiments on the new dataset validate our model’s effectiveness.
zh

[CV-53] CERBERUS: Crack Evaluation Recognition Benchmark for Engineering Reliability Urban Stability

【速读】：该论文旨在解决基础设施中裂缝及其他缺陷的自动检测问题，特别是在实际应用场景中提升AI模型的检测性能。其解决方案的关键在于构建了一个名为CERBERUS的合成基准，该基准包含基于Unity生成的真实感3D检测场景和裂缝图像生成器，通过结合合成数据与真实数据进行训练，显著提升了模型在真实世界图像上的表现。

链接: https://arxiv.org/abs/2506.21909
作者: Justin Reinman,Sunwoong Choi
机构: Palisades Charter High School (帕利塞德斯特拉高中); UCLA (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:CERBERUS is a synthetic benchmark designed to help train and evaluate AI models for detecting cracks and other defects in infrastructure. It includes a crack image generator and realistic 3D inspection scenarios built in Unity. The benchmark features two types of setups: a simple Fly-By wall inspection and a more complex Underpass scene with lighting and geometry challenges. We tested a popular object detection model (YOLO) using different combinations of synthetic and real crack data. Results show that combining synthetic and real data improves performance on real-world images. CERBERUS provides a flexible, repeatable way to test defect detection systems and supports future research in automated infrastructure inspection. CERBERUS is publicly available at this https URL.
zh

[CV-54] RAUM-Net: Regional Attention and Uncertainty-aware Mamba Network

【速读】：该论文旨在解决细粒度视觉分类（Fine Grained Visual Categorization, FGVC）中由于类别间差异细微和特征表示脆弱而导致的挑战性问题，特别是在标注数据稀缺的情况下。其解决方案的关键在于结合基于Mamba的特征建模、区域注意力机制以及贝叶斯不确定性分析，通过增强局部到全局的特征建模并聚焦关键区域，同时利用贝叶斯推断选择高质量伪标签以提高模型稳定性。

链接: https://arxiv.org/abs/2506.21905
作者: Mingquan Liu
机构: China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine Grained Visual Categorization (FGVC) remains a challenging task in computer vision due to subtle inter class differences and fragile feature representations. Existing methods struggle in fine grained scenarios, especially when labeled data is scarce. We propose a semi supervised method combining Mamba based feature modeling, region attention, and Bayesian uncertainty. Our approach enhances local to global feature modeling while focusing on key areas during learning. Bayesian inference selects high quality pseudo labels for stability. Experiments show strong performance on FGVC benchmarks with occlusions, demonstrating robustness when labeled data is limited. Code is available at this https URL Net.
zh

[CV-55] Visual Content Detection in Educational Videos with Transfer Learning and Dataset Enrichment

【速读】：该论文旨在解决讲座视频中视觉元素（visual elements）自动检测的难题，这一问题限制了视频内容在教育中的有效利用。其关键解决方案是采用迁移学习方法，并选择YOLO作为核心模型进行优化，通过多基准数据集训练和半监督自动标注策略提升检测性能，从而为讲座视频中的对象检测提供一个通用的解决方案。

链接: https://arxiv.org/abs/2506.21903
作者: Dipayan Biswas,Shishir Shah,Jaspal Subhlok
机构: University of Houston (休斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is an extended version of a paper accepted to MIPR 2025

点击查看摘要

Abstract:Video is transforming education with online courses and recorded lectures supplementing and replacing classroom teaching. Recent research has focused on enhancing information retrieval for video lectures with advanced navigation, searchability, summarization, as well as question answering chatbots. Visual elements like tables, charts, and illustrations are central to comprehension, retention, and data presentation in lecture videos, yet their full potential for improving access to video content remains underutilized. A major factor is that accurate automatic detection of visual elements in a lecture video is challenging; reasons include i) most visual elements, such as charts, graphs, tables, and illustrations, are artificially created and lack any standard structure, and ii) coherent visual objects may lack clear boundaries and may be composed of connected text and visual components. Despite advancements in deep learning based object detection, current models do not yield satisfactory performance due to the unique nature of visual content in lectures and scarcity of annotated datasets. This paper reports on a transfer learning approach for detecting visual elements in lecture video frames. A suite of state of the art object detection models were evaluated for their performance on lecture video datasets. YOLO emerged as the most promising model for this task. Subsequently YOLO was optimized for lecture video object detection with training on multiple benchmark datasets and deploying a semi-supervised auto labeling strategy. Results evaluate the success of this approach, also in developing a general solution to the problem of object detection in lecture videos. Paper contributions include a publicly released benchmark of annotated lecture video frames, along with the source code to facilitate future research.
zh

[CV-56] Exploring Task-Solving Paradigm for Generalized Cross-Domain Face Anti-Spoofing via Reinforcement Fine-Tuning

【速读】：该论文旨在解决现有面部反欺骗（face anti-spoofing）方法在面对未知攻击类型时泛化能力差以及可解释性不足的问题（face anti-spoofing）。其解决方案的关键在于提出一种基于强化学习微调的方法，通过激发多模态大语言模型的思考与学习能力，使模型能够自主探索解决反欺骗任务的推理策略，而非依赖于对真实模式的记忆。该方法设计了可验证的类别一致奖励和推理一致奖励，并采用基于GRPO的优化策略，引导模型从多角度探索推理策略以最大化预期奖励，从而在保留高奖励轨迹的过程中提炼出具有高度泛化能力的决策规则。

链接: https://arxiv.org/abs/2506.21895
作者: Fangling Jiang,Qi Li,Weining Wang,Gang Wang,Bing Liu,Zhenan Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently the emergence of novel presentation attacks has drawn increasing attention to face anti-spoofing. However, existing methods tend to memorize data patterns from the training set, resulting in poor generalization to unknown attack types across different scenarios and limited interpretability. To address these challenges, this paper presents a reinforcement fine-tuning-based face anti-spoofing method that stimulates the capabilities of multimodal large language models to think and learn how to solve the anti-spoofing task itself, rather than relying on the memorization of authenticity patterns. We design verifiable class consistent reward and reasoning consistent reward, and employ a GRPO-based optimization strategy to guide the model in exploring reasoning policies from multiple perspectives to maximize expected rewards. As a result, through iterative trial-and-error learning while retaining only high-reward trajectories, the model distills highly generalizable decision-making rules from the extensive solution space to effectively address cross-domain face anti-spoofing tasks. Extensive experimental results demonstrate that our method achieves state-of-the-art cross-domain generalization performance. It generalizes well to diverse unknown attack types in unseen target domains while providing interpretable reasoning for its authenticity decisions without requiring labor-intensive textual annotations for training.
zh

[CV-57] SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation

【速读】：该论文旨在解决点云数据中分布外（out-of-distribution, OOD）对象检测的问题，以提升模型在实际应用中的安全性和可靠性。现有研究对此问题关注不足，而尽管在图像领域取得成功，3D视觉语言模型（3D VLMs）在点云数据上的应用面临显著的域偏移问题，即预训练数据集规模小、物体多样性低且多为计算机生成的合成物体，导致其在真实场景中的表现下降。解决方案的关键在于提出一种名为SODA的方法，该方法通过基于邻域的得分传播机制提升OOD点云的检测性能，具有无需额外训练、推理高效的特点，并在多个数据集和任务设置中达到了最先进水平。

链接: https://arxiv.org/abs/2506.21892
作者: Adam Goodge,Xun Xu,Bryan Hooi,Wee Siong Ng,Jingyi Liao,Yongyi Su,Xulei Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As point cloud data increases in prevalence in a variety of applications, the ability to detect out-of-distribution (OOD) point cloud objects becomes critical for ensuring model safety and reliability. However, this problem remains under-explored in existing research. Inspired by success in the image domain, we propose to exploit advances in 3D vision-language models (3D VLMs) for OOD detection in point cloud objects. However, a major challenge is that point cloud datasets used to pre-train 3D VLMs are drastically smaller in size and object diversity than their image-based counterparts. Critically, they often contain exclusively computer-designed synthetic objects. This leads to a substantial domain shift when the model is transferred to practical tasks involving real objects scanned from the physical environment. In this paper, our empirical experiments show that synthetic-to-real domain shift significantly degrades the alignment of point cloud with their associated text embeddings in the 3D VLM latent space, hindering downstream performance. To address this, we propose a novel methodology called SODA which improves the detection of OOD point clouds through a neighborhood-based score propagation scheme. SODA is inference-based, requires no additional model training, and achieves state-of-the-art performance over existing approaches across datasets and problem settings.
zh

[CV-58] DIVE: Deep-search Iterative Video Exploration A Technical Report for the CVRR Challenge at CVPR 2025

【速读】：该论文旨在解决复杂视频推理中的鲁棒性评估问题，即如何准确生成针对多样化真实世界视频片段的自然语言答案。其解决方案的关键在于提出了一种迭代推理方法——DIVE（Deep-search Iterative Video Exploration），该方法通过语义分解输入问题，并借助分步推理和渐进式推断逐步解决问题，从而实现对复杂查询的高度准确且上下文相关的回答。

链接: https://arxiv.org/abs/2506.21891
作者: Umihiro Kamoto,Tatsuya Ishibashi,Noriyuki Kugo
机构: Panasonic Connect Co., Ltd. (松下通信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this report, we present the winning solution that achieved the 1st place in the Complex Video Reasoning Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at this https URL
zh

[CV-59] Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles

【速读】：该论文旨在解决自动驾驶中感知系统受限于单一传感器性能的问题，通过多传感器融合（multi-sensor fusion）提升环境理解的全面性和准确性。其解决方案的关键在于将多传感器融合策略形式化为数据级、特征级和决策级三类，并系统综述了对应深度学习方法的发展现状，同时探讨了多模态数据集的应用价值以及视觉-语言模型（Vision-Language Models, VLMs）和大语言模型（Large Language Models, LLMs）等新兴技术在提升系统适应性与鲁棒性方面的潜力。

链接: https://arxiv.org/abs/2506.21885
作者: Chuheng Wei,Ziye Qin,Ziyan Zhang,Guoyuan Wu,Matthew J. Barth
机构: University of California at Riverside, Riverside, CA, 92507(加州大学河滨分校，河滨，加利福尼亚州，92507); Southwest Jiaotong University (西南交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)
备注: Accepted by IEEE IV 2025

点击查看摘要

Abstract:Multi-sensor fusion plays a critical role in enhancing perception for autonomous driving, overcoming individual sensor limitations, and enabling comprehensive environmental understanding. This paper first formalizes multi-sensor fusion strategies into data-level, feature-level, and decision-level categories and then provides a systematic review of deep learning-based methods corresponding to each strategy. We present key multi-modal datasets and discuss their applicability in addressing real-world challenges, particularly in adverse weather conditions and complex urban environments. Additionally, we explore emerging trends, including the integration of Vision-Language Models (VLMs), Large Language Models (LLMs), and the role of sensor fusion in end-to-end autonomous driving, highlighting its potential to enhance system adaptability and robustness. Our work offers valuable insights into current methods and future directions for multi-sensor fusion in autonomous driving.
zh

[CV-60] GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification

【速读】：该论文旨在解决银屑病（Psoriasis, PsO）严重程度评分中因评估者间差异和远程影像数据质量不稳定导致的自动化评分可靠性问题。其关键解决方案是提出一种基于梯度可解释性方法的框架，用于自动标记训练图像中可能引入虚假相关性的不良样本，从而提升模型的泛化能力。通过分析误分类验证图像的梯度，该方法能够识别出模型错误与不一致标注或非临床伪影相关的训练样本，进而通过移除这些样本显著提高模型在独立测试集上的AUC-ROC性能。

链接: https://arxiv.org/abs/2506.21883
作者: Basudha Pal,Sharif Amit Kamran,Brendon Lutnick,Molly Lucas,Chaitanya Parmar,Asha Patel Shah,David Apfel,Steven Fakharzadeh,Lloyd Miller,Gabriela Cula,Kristopher Standish
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Psoriasis (PsO) severity scoring is important for clinical trials but is hindered by inter-rater variability and the burden of in person clinical evaluation. Remote imaging using patient captured mobile photos offers scalability but introduces challenges, such as variation in lighting, background, and device quality that are often imperceptible to humans but can impact model performance. These factors, along with inconsistencies in dermatologist annotations, reduce the reliability of automated severity scoring. We propose a framework to automatically flag problematic training images that introduce spurious correlations which degrade model generalization, using a gradient based interpretability approach. By tracing the gradients of misclassified validation images, we detect training samples where model errors align with inconsistently rated examples or are affected by subtle, nonclinical artifacts. We apply this method to a ConvNeXT based weakly supervised model designed to classify PsO severity from phone images. Removing 8.2% of flagged images improves model AUC-ROC by 5% (85% to 90%) on a held out test set. Commonly, multiple annotators and an adjudication process ensure annotation accuracy, which is expensive and time consuming. Our method detects training images with annotation inconsistencies, potentially removing the need for manual review. When applied to a subset of training data rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement by reviewing only the top 30% of samples. This improves automated scoring for remote assessments, ensuring robustness despite data collection variability.
zh

[CV-61] Grounding-Aware Token Pruning: Recovering from Drastic Performance Drops in Visual Grounding Caused by Pruning

【速读】：该论文试图解决在视觉-语言多模态大模型（Multimodal Large Language Models, MLLMs）中，由于对视觉令牌进行剪枝（token pruning）导致的视觉定位能力显著下降的问题。研究发现，剪枝会破坏位置ID的对齐性，从而影响模型在指代表达理解（Referring Expression Comprehension, REC）任务中的性能。解决方案的关键在于提出一种无额外训练、内存或计算开销的接地感知令牌剪枝方法（Grounding-Aware Token Pruning, GAP），通过调整位置ID来恢复模型的定位能力，使REC准确率恢复至原始性能的90%。

链接: https://arxiv.org/abs/2506.21873
作者: Tzu-Chun Chien,Chieh-Kai Lin,Shiang-Feng Tsai,Ruei-Chi Lai,Hung-Jen Chen,Min Sun
机构: National Tsing Hua University (NTHU)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual grounding, establishing themselves as a general interface for various vision-language applications. This progress has driven the development of token pruning methods to mitigate the high computational costs associated with processing numerous visual tokens. However, we observe that pruning significantly weakens the model’s grounding ability, leading to incorrect predictions and drastic performance degradation. In Referring Expression Comprehension (REC), for instance, pruning causes the accuracy of LLaVA on the RefCOCO validation set to drop from 56.14% to 15.34%. Our analysis identifies misaligned position IDs after pruning as the primary cause of this degradation, as both the order and value of these IDs are crucial for maintaining performance in grounding tasks. To address this issue, we propose Grounding-Aware Token Pruning (GAP), a simple yet effective adjustment to position IDs that recovers REC accuracy back to 51.42%, which is 90% of the original performance in the without pruning setting, all while requiring no additional training, memory, or computational overhead. Applied to models such as Shikra, MiniGPTv2, and the LLaVA series, our method consistently improves performance across various token pruning strategies.
zh

[CV-62] Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images IJCAI2025

【速读】：该论文旨在解决光学遥感图像（Optical Remote Sensing Images, ORSIs）中自动分割目标时存在的问题，即现有方法主要依赖卷积或Transformer特征，难以有效融合两者的优势，导致分割效果受限。其解决方案的关键在于提出一种名为Dual-Perspective United Transformer (DPU-Former) 的新型架构，通过设计全局-局部混合注意力机制和傅里叶空间融合策略，实现长距离依赖与空间细节的同步整合，同时引入门控线性前馈网络增强模型表达能力，并构建DPU-Former解码器以聚合和增强多层特征，从而提升分割性能。

链接: https://arxiv.org/abs/2506.21866
作者: Yanguang Sun,Jiexi Yan,Jianjun Qian,Chunyan Xu,Jian Yang,Lei Luo
机构: PCA Lab, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Technology, Xidian University, Xian, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and large parameters of the model. However, these issues are often overlooked in existing the ORSIs methods, causing sub-optimal segmentation. For that, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design the global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strength features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: this https URL.
zh

[CV-63] Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling

【速读】：该论文旨在解决大型视觉语言模型（LVLM）在遥感（RS）领域应用中的适应性问题，由于遥感图像在视觉表征、目标尺度和语义层面与自然图像存在显著差异，导致现有LVLM难以有效理解遥感场景中丰富的多层级语义信息。解决方案的关键在于提出一种专为遥感理解设计的LVLM框架，其核心组件包括语义增强的多层级对齐机制和语义感知的专家建模方法，通过引入基于检索的语义增强模块和分层语义专家处理结构，实现从粗粒度到细粒度的语义理解，从而提升模型在遥感任务中的表现。

链接: https://arxiv.org/abs/2506.21863
作者: Sungjune Park,Yeongyun Kim,Se Yeon Kim,Yong Man Ro
机构: Korea Advanced Institute of Science and Technology (KAIST) (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages including reference pages, 7 tables, and 6 figures

点击查看摘要

Abstract:Large Vision and Language Models (LVLMs) have shown strong performance across various vision-language tasks in natural image domains. However, their application to remote sensing (RS) remains underexplored due to significant domain differences in visual appearances, object scales, and semantics. These discrepancies hider the effective understanding of RS scenes, which contain rich, multi-level semantic information spanning from coarse-to-fine levels. Hence, it limits the direct adaptation of existing LVLMs to RS imagery. To address this gap, we propose a novel LVLM framework tailored for RS understanding, incorporating two core components: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling. First, to align multi-level visual features, we introduce the retrieval-based Semantic Augmentation Module which enriches the visual features with relevant semantics across fine-to-coarse levels (e.g., object- and scene-level information). It is designed to retrieve relevant semantic cues from a RS semantic knowledge database, followed by aggregation of semantic cues with user query and multi-level visual features, resulting in semantically enriched representation across multiple levels. Second, for Semantic-aware Expert Modeling, we design semantic experts, where each expert is responsible for processing semantic representation at different levels separately. This enables hierarchical semantic understanding from coarse to fine levels. Evaluations across multiple RS tasks-including scene classification and VQA, etc.-demonstrate that the proposed framework achieves consistent improvements across multiple semantic levels. This highlights its capability and effectiveness in bridging the gap between general LVLMs and unique demands of RS-specific vision-language understanding.
zh

[CV-64] LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLM s

【速读】：该论文旨在解决视频多模态大语言模型中令牌冗余的问题，传统方法主要基于注意力得分进行令牌压缩，但无法有效捕捉所有语义区域。其解决方案的关键在于提出了一种基于语义连通组件（Semantic Connected Components, SCC）的策略，通过将令牌分配到不同的语义区域，确保全面的语义覆盖，并在时空域中应用SCC，实现高效的令牌压缩。

链接: https://arxiv.org/abs/2506.21862
作者: Boyuan Sun,Jiaxing Zhao,Xihan Wei,Qibin Hou
机构: Tongyi Lab, Alibaba Group (通义实验室，阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 21 pages, 4 figures, 7 tables

点击查看摘要

Abstract:In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Differently, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multi-choices benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: this https URL.
zh

[CV-65] Embodied Domain Adaptation for Object Detection IROS2025

【速读】：该论文试图解决移动机器人在室内环境中进行目标检测时面临的开放词汇目标检测（Open-vocabulary object detection, OVOD）问题，特别是在面对领域偏移（domain shifts）时性能下降的问题。其解决方案的关键在于提出一种无需访问源域数据的无监督领域自适应（Source-Free Domain Adaptation, SFDA）方法，通过时间聚类优化伪标签、多尺度阈值融合以及结合对比学习的均值教师框架，实现对动态室内环境的灵活适应。

链接: https://arxiv.org/abs/2506.21860
作者: Xiangyu Shi,Yanyuan Qiao,Lingqiao Liu,Feras Dayoub
机构: University of Adelaide (阿德莱德大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IROS 2025

点击查看摘要

Abstract:Mobile robots rely on object detectors for perception and object localization in indoor environments. However, standard closed-set methods struggle to handle the diverse objects and dynamic conditions encountered in real homes and labs. Open-vocabulary object detection (OVOD), driven by Vision Language Models (VLMs), extends beyond fixed labels but still struggles with domain shifts in indoor environments. We introduce a Source-Free Domain Adaptation (SFDA) approach that adapts a pre-trained model without accessing source data. We refine pseudo labels via temporal clustering, employ multi-scale threshold fusion, and apply a Mean Teacher framework with contrastive learning. Our Embodied Domain Adaptation for Object Detection (EDAOD) benchmark evaluates adaptation under sequential changes in lighting, layout, and object diversity. Our experiments show significant gains in zero-shot detection performance and flexible adaptation to dynamic indoor conditions.
zh

[CV-66] SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space

【速读】：该论文试图解决在病理学任务中，如何有效整合全切片图像（whole-slide images, WSI）与空间转录组学（spatial transcriptomics, ST）数据的问题，以捕捉超越传统苏木精-伊红（hematoxylin-eosin, HE）染色所揭示的分子异质性。解决方案的关键在于提出SPADE模型，该模型通过统一框架将组织病理学与ST数据相结合，利用两阶段特征空间聚类生成专家模型，并通过对比学习学习共注册WSI块和基因表达图谱的表示，从而构建一个受ST信息引导的潜在空间。

链接: https://arxiv.org/abs/2506.21857
作者: Ekaterina Redekop,Mara Pleasure,Zichen Wang,Kimberly Flores,Anthony Sisk,William Speier,Corey W. Arnold
机构: University of California, Los Angeles (加利福尼亚大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid growth of digital pathology and advances in self-supervised deep learning have enabled the development of foundational models for various pathology tasks across diverse diseases. While multimodal approaches integrating diverse data sources have emerged, a critical gap remains in the comprehensive integration of whole-slide images (WSIs) with spatial transcriptomics (ST), which is crucial for capturing critical molecular heterogeneity beyond standard hematoxylin eosin (HE) staining. We introduce SPADE, a foundation model that integrates histopathology with ST data to guide image representation learning within a unified framework, in effect creating an ST-informed latent space. SPADE leverages a mixture-of-data experts technique, where experts, created via two-stage feature-space clustering, use contrastive learning to learn representations of co-registered WSI patches and gene expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is evaluated on 14 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models, highlighting the benefits of integrating morphological and molecular information into one latent space.
zh

[CV-67] Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

【速读】：该论文旨在解决从无标签的面部视频中学习周期性信号的通用表示问题，以提升远程光电容积描记术（rPPG）估计的准确性。其解决方案的关键在于利用视频掩码自编码器通过自监督学习获取高维时空表示，并通过帧掩码策略捕捉准周期性信号，同时引入生理频带限制约束，利用生理信号在频域上的稀疏性为模型提供脉搏线索。

链接: https://arxiv.org/abs/2506.21855
作者: Jiho Choi,Sang Jun Lee
机构: Jeonbuk National University (全北国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking in terms of video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at this https URL.
zh

[CV-68] End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model

【速读】：该论文旨在解决RGB-IR（RGB-Infrared）图像对在数据存储和传输成本随模态数量增加而显著上升的问题，提出了一种高效的RGB-IR图像对联合压缩框架。解决方案的关键在于设计了通道级跨模态熵模型（Channel-wise Cross-modality Entropy Model, CCEM），其中包含低频上下文提取模块（Low-frequency Context Extraction Block, LCEB）和低频上下文融合模块（Low-frequency Context Fusion Block, LCFB），用于提取并聚合两个模态的全局低频信息，从而提升熵参数预测的准确性。

链接: https://arxiv.org/abs/2506.21851
作者: Haofeng Wang,Fangtao Zhou,Qi Zhang,Zeyuan Chen,Enci Zhang,Zhao Wang,Xiaofeng Huang,Siwei Ma
机构: Peking University, Shenzhen(北京大学，深圳); Peking University, Beijing(北京大学，北京); Hangzhou Dianzi University(杭州电子科技大学); Pengcheng Laboratory(鹏城实验室); Advanced Institute of Information Technology, Peking University(北京大学信息技术高等研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: IEEE International Conference on Systems, Man, and Cybernetics 2025. (SMC), under review

点击查看摘要

Abstract:RGB-IR(RGB-Infrared) image pairs are frequently applied simultaneously in various applications like intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pair. Specifically, to fully utilize cross-modality prior information for accurate context probability modeling within and between modalities, we propose a Channel-wise Cross-modality Entropy Model (CCEM). Among CCEM, a Low-frequency Context Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are designed for extracting and aggregating the global low-frequency information from both modalities, which assist the model in predicting entropy parameters more accurately. Experimental results demonstrate that our approach outperforms existing RGB-IR image pair and single-modality compression methods on LLVIP and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate saving on LLVIP dataset compared to the state-of-the-art RGB-IR image codec presented at CVPR 2022.
zh

[CV-69] 3D-Telepathy: Reconstructing 3D Objects from EEG Signals

【速读】：该论文试图解决从脑电图（Electroencephalography, EEG）数据中重建三维（3D）视觉刺激的问题，传统方法主要关注将脑活动转换为二维（2D）图像，而忽略了EEG数据向三维物体的转化，这限制了其在脑机接口（Brain-Computer Interfaces, BCIs）中的实际应用。解决方案的关键在于提出一种创新的EEG编码器架构，该架构集成了双自注意力机制，并采用混合训练策略，包括交叉注意力、对比学习和自监督学习技术。此外，通过使用稳定扩散作为先验分布，并利用变分分数蒸馏训练神经辐射场，成功实现了从EEG数据生成内容和结构相似的3D物体。

链接: https://arxiv.org/abs/2506.21843
作者: Yuxiang Ge,Jionghao Cheng,Ruiquan Ge,Zhaojie Fang,Gangyong Jia,Xiang Wan,Nannan Li,Ahmed Elazab,Changmiao Wang
机构: Hangzhou Dianzi University (杭州电子科技大学); Macau University of Science and Technology (澳门科技大学); Shenzhen University (深圳大学); Shenzhen Research Institute of Big Data (深圳大数据研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing 3D visual stimuli from Electroencephalography (EEG) data holds significant potential for applications in Brain-Computer Interfaces (BCIs) and aiding individuals with communication disorders. Traditionally, efforts have focused on converting brain activity into 2D images, neglecting the translation of EEG data into 3D objects. This limitation is noteworthy, as the human brain inherently processes three-dimensional spatial information regardless of whether observing 2D images or the real world. The neural activities captured by EEG contain rich spatial information that is inevitably lost when reconstructing only 2D images, thus limiting its practical applications in BCI. The transition from EEG data to 3D object reconstruction faces considerable obstacles. These include the presence of extensive noise within EEG signals and a scarcity of datasets that include both EEG and 3D information, which complicates the extraction process of 3D visual data. Addressing this challenging task, we propose an innovative EEG encoder architecture that integrates a dual self-attention mechanism. We use a hybrid training strategy to train the EEG Encoder, which includes cross-attention, contrastive learning, and self-supervised learning techniques. Additionally, by employing stable diffusion as a prior distribution and utilizing Variational Score Distillation to train a neural radiation field, we successfully generate 3D objects with similar content and structure from EEG data.
zh

[CV-70] ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

【速读】：该论文旨在解决基于SAM（Segment Anything Model）的视觉参考分割方法在生成提示时存在的稳定性问题，特别是由于提示编码器性能不足导致提示生成在物体边界附近，从而引发分割结果不稳定和鲁棒性下降的问题。其解决方案的关键在于引入ProSAM，通过学习一个变分提示编码器来预测多变量提示分布，从而避免在不稳定区域生成提示，提升分割的稳定性与鲁棒性。

链接: https://arxiv.org/abs/2506.21835
作者: Xiaoqi Wang,Clint Sebastian,Wenbin He,Liu Ren
机构: Bosch Research North America (博世北美研究院); Bosch Center for Artificial Intelligence (BCAI) (博世人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at object boundaries due to suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5 ^i and COCO-20 ^i datasets, providing a more robust solution for visual reference segmentation.
zh

[CV-71] PrefPaint: Enhancing Image Inpainting through Expert Human Feedback

【速读】：该论文旨在解决医学图像修复（inpainting）中因模型生成不准确而导致的诊断和治疗误差问题，特别是在医学息肉成像等对准确性和可靠性要求极高的领域。其解决方案的关键在于提出PrefPaint方法，该方法通过将人类反馈引入Stable Diffusion Inpainting的训练过程，从而避免了依赖计算成本高昂的奖励模型，提升了修复结果的准确性与真实性。

链接: https://arxiv.org/abs/2506.21834
作者: Duy-Bao Bui,Hoang-Khang Nguyen,Trung-Nghia Le
机构: University of Science, VNU-HCM (科学大学，VNU-HCM); Vietnam National University (越南国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inpainting, the process of filling missing or corrupted image parts, has broad applications, including medical imaging. However, in specialized fields like medical polyps imaging, where accuracy and reliability are critical, inpainting models can generate inaccurate images, leading to significant errors in medical diagnosis and treatment. To ensure reliability, medical images should be annotated by experts like oncologists for effective model training. We propose PrefPaint, an approach that incorporates human feedback into the training process of Stable Diffusion Inpainting, bypassing the need for computationally expensive reward models. In addition, we develop a web-based interface streamlines training, fine-tuning, and inference. This interactive interface provides a smooth and intuitive user experience, making it easier to offer feedback and manage the fine-tuning process. User study on various domains shows that PrefPaint outperforms existing methods, reducing visual inconsistencies and improving image rendering, particularly in medical contexts, where our model generates more realistic polyps images.
zh

[CV-72] aleForge: Interactive Multimodal System for Personalized Story Creation

【速读】：该论文试图解决传统故事生成方法中用户参与度低、个性化程度不足的问题，即现有方法将用户视为被动消费者，提供通用情节而缺乏针对个体特征的定制化内容，从而削弱了用户的沉浸感和代入感。解决方案的关键在于引入TaleForge系统，该系统整合了大语言模型（LLMs）和文本到图像扩散模型，通过三个相互关联的模块——故事生成、个性化图像生成和背景生成——实现用户面部图像与故事情节及插图的深度融合，从而提升用户在叙事中的参与感和主导权。

链接: https://arxiv.org/abs/2506.21832
作者: Minh-Loi Nguyen,Quang-Khai Le,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
机构: University of Science, VNU-HCM, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam; Department of Computer Science, University of Dayton, Ohio, United States
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users’ facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users’ faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system’s real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.
zh

[CV-73] Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models ICDAR2025

【速读】：该论文旨在解决历史地图的少样本分割问题，这一任务因地图的多样化视觉表现和有限的标注数据而面临显著挑战。其解决方案的关键在于利用大规模视觉基础模型的丰富语义嵌入，并结合参数高效的微调策略，从而在少量样本情况下实现高效且准确的分割效果。该方法在Siegfried基准数据集上的实验结果表明，其在葡萄园和铁路分割任务中分别取得了+5%和+13%的mIoU相对提升，在更具有挑战性的5-shot设置中甚至达到约+20%的提升，同时在ICDAR 2021竞赛数据集上也表现出良好的泛化能力。

链接: https://arxiv.org/abs/2506.21826
作者: Rafael Sterzinger,Marco Peer,Robert Sablatnig
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, accepted at ICDAR2025

点击查看摘要

Abstract:As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: this https URL.
zh

[CV-74] CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery

【速读】：该论文旨在解决白内障手术中复杂流程建模的问题，即如何准确捕捉手术工具、解剖结构和操作技术之间的语义关系及其时间依赖性。现有数据集主要关注孤立的手术分析任务，如工具检测或阶段分割，但缺乏对实体间语义关系的全面表征。解决方案的关键在于引入首个提供工具-组织相互作用、操作变体和时间依赖性的结构化标注数据集——白内障手术场景图（CAT-SG）数据集，并提出一种新的场景图生成模型CatSGG，该模型在生成结构化手术表示方面优于现有方法，从而为更精确的手术阶段和技巧识别提供了支持。

链接: https://arxiv.org/abs/2506.21813
作者: Felix Holm,Gözde Ünver,Ghazal Ghazaei,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding the intricate workflows of cataract surgery requires modeling complex interactions between surgical tools, anatomical structures, and procedural techniques. Existing datasets primarily address isolated aspects of surgical analysis, such as tool detection or phase segmentation, but lack comprehensive representations that capture the semantic relationships between entities over time. This paper introduces the Cataract Surgery Scene Graph (CAT-SG) dataset, the first to provide structured annotations of tool-tissue interactions, procedural variations, and temporal dependencies. By incorporating detailed semantic relations, CAT-SG offers a holistic view of surgical workflows, enabling more accurate recognition of surgical phases and techniques. Additionally, we present a novel scene graph generation model, CatSGG, which outperforms current methods in generating structured surgical representations. The CAT-SG dataset is designed to enhance AI-driven surgical training, real-time decision support, and workflow analysis, paving the way for more intelligent, context-aware systems in clinical practice.
zh

[CV-75] Comparing Learning Paradigms for Egocentric Video Summarization

【速读】：该论文旨在解决如何有效理解和解释第一人称视角（egocentric video）视频的问题，当前最先进的模型在处理此类视频时表现不如第三人称视角（third-person video）视频，表明该领域仍需进一步研究。论文提出的解决方案关键在于采用经过提示微调（prompt fine-tuning）的通用模型，如GPT-4o，其在视频摘要任务中表现出优于专门设计的模型，突显了现有方法在适应第一人称视角独特挑战方面的局限性。

链接: https://arxiv.org/abs/2506.21785
作者: Daniel Wen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this study, we investigate various computer vision paradigms - supervised learning, unsupervised learning, and prompt fine-tuning - by assessing their ability to understand and interpret egocentric video data. Specifically, we examine Shotluck Holmes (state-of-the-art supervised learning), TAC-SUM (state-of-the-art unsupervised learning), and GPT-4o (a prompt fine-tuned pre-trained model), evaluating their effectiveness in video summarization. Our results demonstrate that current state-of-the-art models perform less effectively on first-person videos compared to third-person videos, highlighting the need for further advancements in the egocentric video domain. Notably, a prompt fine-tuned general-purpose GPT-4o model outperforms these specialized models, emphasizing the limitations of existing approaches in adapting to the unique challenges of first-person perspectives. Although our evaluation is conducted on a small subset of egocentric videos from the Ego-Exo4D dataset due to resource constraints, the primary objective of this research is to provide a comprehensive proof-of-concept analysis aimed at advancing the application of computer vision techniques to first-person videos. By exploring novel methodologies and evaluating their potential, we aim to contribute to the ongoing development of models capable of effectively processing and interpreting egocentric perspectives.
zh

[CV-76] Early Glaucoma Detection using Deep Learning with Multiple Datasets of Fundus Images

【速读】：该论文旨在解决青光眼早期检测的临床需求，以提高治疗效果。传统诊断方法通常具有侵入性且依赖专业设备，而本文提出了一种基于EfficientNet-B0架构的深度学习流水线，通过在ACRIMA、ORIGA和RIM-ONE数据集上依次训练和微调模型，提升模型的泛化能力。解决方案的关键在于利用多数据集的联合训练策略，同时发现简单的预处理即可获得优于复杂增强方法的AUC-ROC性能，从而实现可重复和可扩展的早期青光眼检测方法。

链接: https://arxiv.org/abs/2506.21770
作者: Rishiraj Paul Chowdhury,Nirmit Shekar Karkera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, prepared for course CSCI 5922 at University of Colorado Boulder. Code available upon request, dataset taken from Kaggle

点击查看摘要

Abstract:Glaucoma is a leading cause of irreversible blindness, but early detection can significantly improve treatment outcomes. Traditional diagnostic methods are often invasive and require specialized equipment. In this work, we present a deep learning pipeline using the EfficientNet-B0 architecture for glaucoma detection from retinal fundus images. Unlike prior studies that rely on single datasets, we sequentially train and fine-tune our model across ACRIMA, ORIGA, and RIM-ONE datasets to enhance generalization. Our experiments show that minimal preprocessing yields higher AUC-ROC compared to more complex enhancements, and our model demonstrates strong discriminative performance on unseen datasets. The proposed pipeline offers a reproducible and scalable approach to early glaucoma detection, supporting its potential clinical utility.
zh

[CV-77] ImplicitQA: Going beyond frames towards Implicit Video Reasoning

【速读】：该论文试图解决当前Video QA系统在处理需要隐式推理的创意和电影类视频时表现不足的问题，这些问题通常涉及通过非显性视觉内容（如动机、因果关系和跨不连续帧的关系）进行理解。解决方案的关键在于提出ImplicitQA，一个专门设计用于测试模型隐式推理能力的新基准，其包含1K精心标注的问答对，覆盖多个关键推理维度，并通过高难度的注释确保了任务的挑战性。

链接: https://arxiv.org/abs/2506.21742
作者: Sirnam Swetha,Rohit Gupta,Parth Parag Kulkarni,David G Shatwell,Jeffrey A Chan Santiago,Nyle Siddiqui,Joseph Fioresi,Mubarak Shah
机构: Center For Research in Computer Vision, University of Central Florida
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video QA has made significant strides by leveraging multimodal learning to align visual and textual modalities. However, current benchmarks overwhelmingly focus on questions answerable through explicit visual content - actions, objects events directly observable within individual frames or short clips. In contrast, creative and cinematic videos - such as movies, TV shows, and narrative-driven content - employ storytelling techniques that deliberately omit certain depictions, requiring viewers to infer motives, causality, and relationships across discontinuous frames. Humans naturally excel at such implicit reasoning, seamlessly integrating information across time and context to construct coherent narratives. Current VideoQA systems and benchmarks fail to capture this essential dimension of human-like understanding. To bridge this gap, we present ImplicitQA, a novel benchmark specifically designed to test models on implicit reasoning. It comprises 1K meticulously annotated QA pairs derived from 320+ high-quality creative video clips, systematically categorized into key reasoning dimensions: lateral and vertical spatial reasoning, depth and proximity, viewpoint and visibility, motion and trajectory, causal and motivational reasoning, social interactions, physical context, and inferred counting. These annotations are deliberately challenging, crafted by authors ensuring high-quality. Our extensive evaluations on leading VideoQA models reveals performance degradation, underscoring their reliance on surface-level visual cues and highlighting the difficulty of implicit reasoning. Performance variations across models further illustrate the complexity and diversity of the challenges presented by ImplicitQA. By releasing both the dataset and our data collection framework, we aim to stimulate further research and development in the community. this https URL.
zh

[CV-78] Equitable Federated Learning with NCA

【速读】：该论文旨在解决在低收入和中等收入国家（LMICs）中，由于高性能计算资源有限和网络连接不可靠而导致联邦学习（Federated Learning, FL）应用受限的问题。其解决方案的关键在于提出FedNCA系统，该系统采用轻量级的Med-NCA架构，使得医疗图像分割任务能够在低成本边缘设备（如智能手机）上进行训练，同时减少通信成本，并具备加密功能以适应不安全的网络环境。

链接: https://arxiv.org/abs/2506.21735
作者: Nick Lemke,Mirko Konstantin,Henry John Krumb,John Kalkhof,Jonathan Stieber,Anirban Mukhopadhyay
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Federated Learning (FL) is enabling collaborative model training across institutions without sharing sensitive patient data. This approach is particularly valuable in low- and middle-income countries (LMICs), where access to trained medical professionals is limited. However, FL adoption in LMICs faces significant barriers, including limited high-performance computing resources and unreliable internet connectivity. To address these challenges, we introduce FedNCA, a novel FL system tailored for medical image segmentation tasks. FedNCA leverages the lightweight Med-NCA architecture, enabling training on low-cost edge devices, such as widely available smartphones, while minimizing communication costs. Additionally, our encryption-ready FedNCA proves to be suitable for compromised network communication. By overcoming infrastructural and security challenges, FedNCA paves the way for inclusive, efficient, lightweight, and encryption-ready medical imaging solutions, fostering equitable healthcare advancements in resource-constrained regions.
zh

[CV-79] Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation

【速读】：该论文旨在解决基于视觉的车道保持（vision-based lane keeping）在机器人和自主地面车辆领域中的自动化部署难题，特别是在非结构化环境下的滑移转向车辆（skid-steered vehicle）中，由于缺乏精确的分析模型，导致系统建模和滑移轮胎-地形相互作用（skid-slip wheel terrain interactions）成为自动化发展的瓶颈。论文提出的解决方案的关键在于一种结构化的学习视觉导航方法，通过端到端学习方法（如模仿学习和深度强化学习）实现对动态操作条件下的有效验证与优化，从而显著提升了现有文献中的性能表现。

链接: https://arxiv.org/abs/2506.21732
作者: Ameya Salvi,Venkat Krovi
机构: Clemson University International Center for Automotive Research (CU-ICAR)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Vision-based lane keeping is a topic of significant interest in the robotics and autonomous ground vehicles communities in various on-road and off-road applications. The skid-steered vehicle architecture has served as a useful vehicle platform for human controlled operations. However, systematic modeling, especially of the skid-slip wheel terrain interactions (primarily in off-road settings) has created bottlenecks for automation deployment. End-to-end learning based methods such as imitation learning and deep reinforcement learning, have gained prominence as a viable deployment option to counter the lack of accurate analytical models. However, the systematic formulation and subsequent verification/validation in dynamic operation regimes (particularly for skid-steered vehicles) remains a work in progress. To this end, a novel approach for structured formulation for learning visual navigation is proposed and investigated in this work. Extensive software simulations, hardware evaluations and ablation studies now highlight the significantly improved performance of the proposed approach against contemporary literature.
zh

[CV-80] Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis

【速读】：该论文试图解决概率生成模型中一个潜在的局限性，即学习全局分布可能导致记忆而非生成行为。其解决方案的关键在于提出两个理论框架：Mutually Exclusive Probability Space (MESP) 和 Local Correlation Hypothesis (LCH)。MESP 通过重新审视变分自编码器（VAE）发现潜在变量分布存在重叠，导致重建损失与 KL 散度损失之间的优化冲突，并基于重叠系数提出下界；在此基础上，提出了 Binary Latent Autoencoder (BL-AE) 和 Autoregressive Random Variable Model (ARVM)，以实现二进制潜在表示和直方图输出。然而，实验表明这些方法虽然在 FID 分数上表现优异，但反映的是记忆而非生成能力，因此引入 LCH 来强调潜在变量间的局部相关性对生成能力的重要性。

链接: https://arxiv.org/abs/2506.21731
作者: Chenqiu Zhao,Anup Basu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose two theoretical frameworks, the Mutually Exclusive Probability Space (MESP) and the Local Correlation Hypothesis (LCH), to explore a potential limitation in probabilistic generative models; namely that learning global distributions leads to memorization rather than generative behavior. MESP emerges from our rethinking of the Variational Autoencoder (VAE). We observe that latent variable distributions in VAE exhibit overlap, which leads to an optimization conflict between the reconstruction loss and KL-divergence loss. A lower bound based on the overlap coefficient is proposed. We refer to this phenomenon as Mutually Exclusive Probability Spaces. Based on MESP, a Binary Latent Autoencoder (BL-AE) is proposed to encode images into binary latent representations. These binary latents are used as the input to our Autoregressive Random Variable Model (ARVM), a modified autoregressive model outputting histograms. Our ARVM achieves competitive FID scores, outperforming state-of-the-art methods on standard datasets. However, such scores reflect memorization rather than generation. To address this issue, we propose the Local Correlation Hypothesis (LCH), which posits that generative capability arising from local correlations among latent variables. Comprehensive experiments and discussions are conducted to validate our frameworks.
zh

[CV-81] Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning

【速读】：该论文试图解决从无结构的3D点云中学习语义有意义表示的问题，特别是在缺乏大规模标注数据集的情况下。其解决方案的关键在于提出AsymDSD（Asymmetric Dual Self-Distillation）框架，该框架通过在潜在空间而非输入空间进行预测，将掩码建模与不变性学习统一起来，从而有效提升高阶语义的捕捉能力。

链接: https://arxiv.org/abs/2506.21724
作者: Remco F. Leijenaar,Hamidreza Kasaei
机构: University of Groningen (格罗宁根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: for associated source code, see this https URL

点击查看摘要

Abstract:Learning semantically meaningful representations from unstructured 3D point clouds remains a central challenge in computer vision, especially in the absence of large-scale labeled datasets. While masked point modeling (MPM) is widely used in self-supervised 3D learning, its reconstruction-based objective can limit its ability to capture high-level semantics. We propose AsymDSD, an Asymmetric Dual Self-Distillation framework that unifies masked modeling and invariance learning through prediction in the latent space rather than the input space. AsymDSD builds on a joint embedding architecture and introduces several key design choices: an efficient asymmetric setup, disabling attention between masked queries to prevent shape leakage, multi-mask sampling, and a point cloud adaptation of multi-crop. AsymDSD achieves state-of-the-art results on ScanObjectNN (90.53%) and further improves to 93.72% when pretrained on 930k shapes, surpassing prior methods.
zh

[CV-82] Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration

【速读】：该论文旨在解决扩散模型在图像恢复（Image Restoration, IR）任务中因复杂架构和迭代过程而限制其实际应用的问题，以及现有方法在整合扩散训练范式到通用IR框架中的不足。解决方案的关键在于通过系统分析时间步依赖性、网络层次结构、噪声水平关系及多恢复任务相关性，提出一种基于扩散训练的新型IR框架，并引入一系列正则化策略，使扩散目标与IR任务对齐，从而提升单任务场景下的泛化能力。此外，通过设计增量训练范式和任务特定适配器，进一步优化多任务统一IR的性能。

链接: https://arxiv.org/abs/2506.21722
作者: Xin Lu,Xueyang Fu,Jie Xiao,Zihao Fan,Yurui Zhu,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While diffusion models demonstrate strong generative capabilities in image restoration (IR) tasks, their complex architectures and iterative processes limit their practical application compared to mainstream reconstruction-based general ordinary IR networks. Existing approaches primarily focus on optimizing network architecture and diffusion paths but overlook the integration of the diffusion training paradigm within general ordinary IR frameworks. To address these challenges, this paper elucidates key principles for adapting the diffusion training paradigm to general IR training through systematic analysis of time-step dependencies, network hierarchies, noise-level relationships, and multi-restoration task correlations, proposing a new IR framework supported by diffusion-based training. To enable IR networks to simultaneously restore images and model generative representations, we introduce a series of regularization strategies that align diffusion objectives with IR tasks, improving generalization in single-task scenarios. Furthermore, recognizing that diffusion-based generation exerts varying influences across different IR tasks, we develop an incremental training paradigm and task-specific adaptors, further enhancing performance in multi-task unified IR. Experiments demonstrate that our method significantly improves the generalization of IR networks in single-task IR and achieves superior performance in multi-task unified IR. Notably, the proposed framework can be seamlessly integrated into existing general IR architectures.
zh

[CV-83] textrmODE_t left(textrmODE_l right): Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling

【速读】：该论文旨在解决生成式模型（如连续归一化流和扩散模型）在采样过程中因需多次迭代求解常微分方程（ODE）而导致的高计算复杂度问题。其解决方案的关键在于通过重连基于Transformer的架构中的模块以求解离散化的ODE，并在流匹配训练中引入时间与长度一致性的约束项，从而实现对时间步数和网络长度的动态控制，使得采样过程可以灵活使用任意数量的时间步和Transformer块。该方法在时间维度上具有求解器无关性，有效降低了延迟和内存消耗。

链接: https://arxiv.org/abs/2506.21714
作者: Denis Gudovskiy,Wenzhao Zheng,Tomoyuki Okuno,Yohei Nakata,Kurt Keutzer
机构: Panasonic AI Lab (松下人工智能实验室); UC Berkeley (加州大学伯克利分校); Panasonic Corp (松下公司); UC Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Github page: this http URL

点击查看摘要

Abstract:Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have been studied using the unified theoretical framework. Although such models can generate high-quality data points from a noise distribution, the sampling demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. Most existing methods focus on reducing the number of time steps during the sampling process to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can be dynamically controlled in terms of time steps and in the length of the neural network. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its length. Then, we employ time- and length-wise consistency terms during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our \textrmODE_t \left(\textrmODE_l \right) approach is solver-agnostic in time dimension and decreases both latency and memory usage. Compared to the previous state of the art, image generation experiments on CelebA-HQ and ImageNet show a latency reduction of up to 3\times in the most efficient sampling mode, and a FID score improvement of up to 3.5 points for high-quality sampling. We release our code and model weights with fully reproducible experiments.
zh

[CV-84] CAST: Cross-Attentive Spatio-Temporal feature fusion for Deepfake detection

【速读】：该论文试图解决深度伪造（Deepfake）视频中细微且随时间变化的篡改检测问题，现有CNN-Transformer模型在处理空间和时间特征时存在独立性，导致时空交互深度不足。解决方案的关键在于提出一种统一的CAST模型，通过交叉注意力机制（cross-attention）更紧密地融合空间和时间特征，使时间特征能够动态关注相关空间区域，从而提升对细粒度、时变伪影的检测能力。

链接: https://arxiv.org/abs/2506.21711
作者: Aryan Thakre,Omkar Nagwekar,Vedang Talekar,Aparna Santra Biswas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 50 pages, 6 figures

点击查看摘要

Abstract:Deepfakes have emerged as a significant threat to digital media authenticity, increasing the need for advanced detection techniques that can identify subtle and time-dependent manipulations. CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. However, many existing CNN-Transformer models process spatial and temporal features independently. In particular, attention-based methods often use separate attention mechanisms for spatial and temporal features and combine them using naive approaches like averaging, addition, or concatenation, which limits the depth of spatio-temporal interaction. To address this challenge, we propose a unified CAST model that leverages cross-attention to effectively fuse spatial and temporal features in a more integrated manner. Our approach allows temporal features to dynamically attend to relevant spatial regions, enhancing the model’s ability to detect fine-grained, time-evolving artifacts such as flickering eyes or warped lips. This design enables more precise localization and deeper contextual understanding, leading to improved performance across diverse and challenging scenarios. We evaluate the performance of our model using the FaceForensics++, Celeb-DF, and DeepfakeDetection datasets in both intra- and cross-dataset settings to affirm the superiority of our approach. Our model achieves strong performance with an AUC of 99.49 percent and an accuracy of 97.57 percent in intra-dataset evaluations. In cross-dataset testing, it demonstrates impressive generalization by achieving a 93.31 percent AUC on the unseen DeepfakeDetection dataset. These results highlight the effectiveness of cross-attention-based feature fusion in enhancing the robustness of deepfake video detection.
zh

[CV-85] FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

【速读】：该论文试图解决在视觉问答（Visual Question Answering, VQA）任务中，针对图像细节的挑战，尤其是在使用多模态大语言模型（Multimodal Large Language Models, MLLMs）时，现有视觉裁剪技术存在的局限性，如需要任务特定微调、低效的无指导穷举搜索或与高效注意力机制不兼容等问题。解决方案的关键在于提出一种无需训练的视觉裁剪方法FOCUS，该方法利用MLLM内部表示来引导搜索最相关的图像区域，通过四个步骤实现：目标对象识别、基于键值缓存的对象相关性图计算、相关图像区域的提出与排序，以及最终使用最高排名区域进行细粒度VQA任务。

链接: https://arxiv.org/abs/2506.21710
作者: Liangyu Zhong,Fabio Rosenthal,Joachim Sicking,Fabian Hüger,Thorsten Bagdonat,Hanno Gottschalk,Leo Schwinn
机构: Technical University of Berlin(柏林工业大学); Technical University of Munich(慕尼黑工业大学); CARIAD SE(卡莱德公司); Volkswagen AG(大众集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and two types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 - 6.5 x less compute.
zh

[CV-86] anDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation

【速读】：该论文旨在解决全景图像生成中的挑战，包括几何失真程度不同以及对无缝循环一致性要求高等问题。其解决方案的关键在于提出TanDiT方法，该方法通过生成覆盖整个360°视角的切平面图像网格来合成全景场景，利用统一的扩散模型在单次去噪迭代中同时生成这些图像，并引入一种与模型无关的后处理步骤以增强全景图的全局一致性。

链接: https://arxiv.org/abs/2506.21681
作者: Hakan Çapuk,Andrew Bond,Muhammed Burak Kızıl,Emir Göçen,Erkut Erdem,Aykut Erdem
机构: Koç University (科克大学); Hacettepe University (哈切特佩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360 ^\circ view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.
zh

[CV-87] APO: Enhancing Reasoning Ability of MLLM s via Asymmetric Policy Optimization

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在强化学习（Reinforcement Learning, RL）训练过程中面临的复杂推理能力不足、性能下降以及生成过于冗长或“过度思考”的推理过程等问题。其解决方案的关键在于提出一种称为非对称策略优化（Asymmetric Policy Optimization, APO）的方法，该方法通过将采样响应分为正样本和负样本进行处理：对于正样本，引入基于难度自适应的KL散度调整（Difficulty-Adaptive Divergence Shaping, DADS），动态调整KL惩罚权重以保持策略熵稳定并提升训练效率；对于负样本，采用次优轨迹复杂度正则化（Suboptimal Trajectory Complexity Regularization, STCR），惩罚过长的响应以缓解过度思考问题。

链接: https://arxiv.org/abs/2506.21655
作者: Minjie Hong,Zirun Guo,Yan Xia,Zehan Wang,Ziang Zhang,Tao Jin,Zhou Zhao
机构: Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While Reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or “overthinking” reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model’s existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model’s explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at this https URL.
zh

[CV-88] AeroLite-MDNet: Lightweight Multi-task Deviation Detection Network for UAV Landing

【速读】：该论文旨在解决无人机（UAV）在执行任务后安全着陆时面临的准确着陆难题，尤其是在GPS信号干扰等不利条件下。其解决方案的关键在于提出一种基于视觉的偏差预警系统，该系统采用了一种名为AeroLite-MDNet的新模型，该模型通过多尺度融合模块实现鲁棒的跨尺度目标检测，并引入分割分支以高效估计姿态，从而提升了对着陆偏差的检测能力。

链接: https://arxiv.org/abs/2506.21635
作者: Haiping Yang,Huaxing Liu,Wei Wu,Zuohui Chen,Ning Wu
机构: Zhejiang University of Technology(浙江理工大学); Binjiang Institute of Artificial Intelligence, ZJUT(滨江人工智能研究院，浙理工); College of Computer Science and Technology, Zhejiang University of Technology(浙江理工大学计算机科学与技术学院); College of Geoinformatics, Zhejiang University of Technology(浙江理工大学地理信息学院); Quzhou Southeast Digital Economic Development Institute(衢州东南数字经济研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) are increasingly employed in diverse applications such as land surveying, material transport, and environmental monitoring. Following missions like data collection or inspection, UAVs must land safely at docking stations for storage or recharging, which is an essential requirement for ensuring operational continuity. However, accurate landing remains challenging due to factors like GPS signal interference. To address this issue, we propose a deviation warning system for UAV landings, powered by a novel vision-based model called AeroLite-MDNet. This model integrates a multiscale fusion module for robust cross-scale object detection and incorporates a segmentation branch for efficient orientation estimation. We introduce a new evaluation metric, Average Warning Delay (AWD), to quantify the system’s sensitivity to landing deviations. Furthermore, we contribute a new dataset, UAVLandData, which captures real-world landing deviation scenarios to support training and evaluation. Experimental results show that our system achieves an AWD of 0.7 seconds with a deviation detection accuracy of 98.6%, demonstrating its effectiveness in enhancing UAV landing reliability. Code will be available at this https URL
zh

[CV-89] OMD: A Trail-based Off-road Multimodal Dataset for Traversable Pathway Segmentation under Challenging Illumination Conditions IJCNN

【速读】：该论文旨在解决在非结构化户外环境中检测可通行路径的问题，特别是在宽范围搜索与救援及森林火灾等应急场景中，现有数据集和模型主要针对城市环境或宽广的车辆可通行非铺装道路，缺乏对狭窄、类似小径的非铺装场景的有效支持。为应对这一问题，研究提出了基于小径的非铺装多模态数据集（Trail-based Off-road Multimodal Dataset, TOMD），其关键在于提供了高保真多模态传感器数据（包括128通道LiDAR、立体图像、GNSS、IMU和光照测量）以及一种动态多尺度数据融合模型，用于准确预测可通行路径，并验证了不同光照条件下早期融合、交叉融合和混合融合策略的效果。

链接: https://arxiv.org/abs/2506.21630
作者: Yixin Sun,Li Li,Wenke E,Amir Atapour-Abarghouei,Toby P. Breckon
机构: Durham University (杜伦大学); King’s College London (国王学院伦敦大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, 9 figures, 2025 IJCNN

点击查看摘要

Abstract:Detecting traversable pathways in unstructured outdoor environments remains a significant challenge for autonomous robots, especially in critical applications such as wide-area search and rescue, as well as incident management scenarios like forest fires. Existing datasets and models primarily target urban settings or wide, vehicle-traversable off-road tracks, leaving a substantial gap in addressing the complexity of narrow, trail-like off-road scenarios. To address this, we introduce the Trail-based Off-road Multimodal Dataset (TOMD), a comprehensive dataset specifically designed for such environments. TOMD features high-fidelity multimodal sensor data – including 128-channel LiDAR, stereo imagery, GNSS, IMU, and illumination measurements – collected through repeated traversals under diverse conditions. We also propose a dynamic multiscale data fusion model for accurate traversable pathway prediction. The study analyzes the performance of early, cross, and mixed fusion strategies under varying illumination levels. Results demonstrate the effectiveness of our approach and the relevance of illumination in segmentation performance. We publicly release TOMD at this https URL to support future research in trail-based off-road navigation.
zh

[CV-90] Evaluating VisualRAG : Quantifying Cross-Modal Performance in Enterprise Document Understanding KDD2025 KDD

【速读】：该论文试图解决多模态生成式AI（Generative AI）在企业文档智能应用中缺乏可信度评估框架的问题，这限制了其在对可靠性要求极高的场景下的采用。解决方案的关键在于提出一种系统性、量化的基准测试框架，用于衡量VisualRAG系统中逐步整合文本、图像、字幕和OCR等跨模态输入的可信度，并建立技术指标与用户中心可信度度量之间的定量关系。通过优化模态权重（文本30%、图像15%、字幕25%、OCR 30%），在保持计算效率的同时，性能相比纯文本基线提升了57.3%。

链接: https://arxiv.org/abs/2506.21604
作者: Varun Mannam,Fang Wang,Xin Chen
机构: Amazon(亚马逊)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Conference: KDD conference workshop: this https URL

点击查看摘要

Abstract:Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting with weights of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction-a vital consideration for reliable enterprise AI. This work advances responsible AI deployment by providing a rigorous framework for quantifying and enhancing trustworthiness in multimodal RAG for critical enterprise applications.
zh

[CV-91] Hierarchical Patch Compression for ColPali: Efficient Multi-Vector Document Retrieval with Dynamic Pruning and Quantization

【速读】：该论文旨在解决多向量文档检索系统（如ColPali）在处理复杂查询时存在的存储和计算成本过高的问题，这些问题源于其对高维块嵌入和后期交互评分的依赖。解决方案的关键在于提出HPC-ColPali框架，该框架通过三种创新技术提升效率：(1) K-Means量化，将块嵌入压缩为1字节的中心点索引，实现最高32倍的存储缩减；(2) 注意力引导的动态剪枝，利用视觉-语言模型的注意力权重保留最显著的块，减少后期交互计算量达60%且仅损失小于2%的nDCG@10；(3) 可选的中心点索引二进制编码，以b位字符串形式表示（b=⌈log₂K⌉），支持资源受限环境下的快速汉明距离相似性搜索。

链接: https://arxiv.org/abs/2506.21601
作者: Duong Bach
机构: FPT University (FPT大学); Sun Asterisk (Sun Asterisk)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Multi-vector document retrieval systems, such as ColPali, excel in fine-grained matching for complex queries but incur significant storage and computational costs due to their reliance on high-dimensional patch embeddings and late-interaction scoring. To address these challenges, we propose HPC-ColPali, a Hierarchical Patch Compression framework that enhances the efficiency of ColPali while preserving its retrieval accuracy. Our approach integrates three innovative techniques: (1) K-Means quantization, which compresses patch embeddings into 1-byte centroid indices, achieving up to 32 \times storage reduction; (2) attention-guided dynamic pruning, utilizing Vision-Language Model attention weights to retain only the top- p% most salient patches, reducing late-interaction computation by up to 60% with less than 2% nDCG@10 loss; and (3) optional binary encoding of centroid indices into b -bit strings ( b=\lceil\log_2 K\rceil ), enabling rapid Hamming distance-based similarity search for resource-constrained environments. Evaluated on the ViDoRe and SEC-Filings datasets, HPC-ColPali achieves 30–50% lower query latency under HNSW indexing while maintaining high retrieval precision. When integrated into a Retrieval-Augmented Generation pipeline for legal summarization, it reduces hallucination rates by 30% and halves end-to-end latency. These advancements establish HPC-ColPali as a scalable and efficient solution for multi-vector document retrieval across diverse applications. Code is available at this https URL.
zh

[CV-92] PEACE: Empowering Geologic Map Holistic Understanding with MLLM s

【速读】：该论文试图解决当前多模态大语言模型（Multimodal Large Language Models, MLLMs）在地质图理解方面的不足问题，这一问题主要源于制图综合的复杂性，包括处理高分辨率地图、管理多个相关组件以及需要领域专业知识。解决方案的关键在于提出GeoMap-Agent，这是一个专为地质图理解设计的智能体，其核心包含三个模块：分层信息提取（Hierarchical Information Extraction, HIE）、领域知识注入（Domain Knowledge Injection, DKI）和增强提示问答（Prompt-enhanced Question Answering, PEQA）。通过模拟人类科学家的跨学科协作，GeoMap-Agent利用多样化的工具池对问题进行全面分析，从而显著提升了地质图理解的能力。

链接: https://arxiv.org/abs/2501.06184
作者: Yangyu Huang,Tianyi Gao,Haoran Xu,Qihao Zhao,Yang Song,Zhipeng Gui,Tengchao Lv,Hao Chen,Lei Cui,Scarlett Li,Furu Wei
机构: Microsoft Research(微软研究院); Chinese Academy of Geological Sciences(中国地质科学院); Wuhan University(武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth’s subsurface and surface. These maps are indispensable in various fields, including disaster detection, resource exploration, and civil engineering. Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding. This gap is primarily due to the challenging nature of cartographic generalization, which involves handling high-resolution map, managing multiple associated components, and requiring domain-specific knowledge. To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce GeoMap-Agent, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). Inspired by the interdisciplinary collaboration among human scientists, an AI expert group acts as consultants, utilizing a diverse tool pool to comprehensively analyze questions. Through comprehensive experiments, GeoMap-Agent achieves an overall score of 0.811 on GeoMap-Bench, significantly outperforming 0.369 of GPT-4o. Our work, emPowering gEologic mAp holistiC undErstanding (PEACE) with MLLMs, paves the way for advanced AI applications in geology, enhancing the efficiency and accuracy of geological investigations.
zh

[CV-93] FreeEnricher: Enriching Face Landmarks without Additional Cost AAAI2023

【速读】：该论文旨在解决现有面部对齐方法中仅关注稀疏面部关键点（sparse facial landmark）而缺乏对密集面部关键点（dense facial landmark）建模的问题。其解决方案的关键在于提出一种弱监督学习框架，通过利用现有的稀疏关键点数据集（如300W和WFLW）来增强关键点密度。该框架首先观察到沿语义轮廓的局部图像块在外观上具有高度相似性，随后通过学习原始稀疏关键点的精调能力，并将其适配到增强的密集关键点上，从而实现关键点密度的提升。

链接: https://arxiv.org/abs/2212.09525
作者: Yangyu Huang,Xi Chen,Jongyoo Kim,Hao Yang,Chong Li,Jiaolong Yang,Dong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: AAAI 2023

点击查看摘要

Abstract:Recent years have witnessed significant growth of face alignment. Though dense facial landmark is highly demanded in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich landmark density by existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. Then, we propose a weakly-supervised idea of learning the refinement ability on original sparse landmarks and adapting this ability to enriched dense landmarks. Meanwhile, several operators are devised and organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to the existing face alignment networks. To evaluate our method, we manually label the dense landmarks on 300W testset. Our method yields state-of-the-art accuracy not only in newly-constructed dense 300W testset but also in the original sparse 300W and WFLW testsets without additional cost.
zh

[CV-94] ADNet: Leverag ing Error-Bias Towards Normal Direction in Face Alignment ICCV2021

【速读】：该论文试图解决人脸对齐中的误差偏差（error-bias）问题，即面部关键点误差分布倾向于沿着关键点曲线的切线方向扩散，这一现象与关键点标注任务的模糊性密切相关。解决方案的关键在于利用误差偏差特性以提升卷积神经网络（CNN）模型的收敛性，具体通过提出各向异性方向损失（anisotropic direction loss, ADL）和各向异性注意力模块（anisotropic attention module, AAM），分别用于坐标回归和热图回归。ADL在面部边界关键点的法线方向施加强约束，而AAM则通过各向异性注意力机制聚焦于关键点及其相邻点连接的局部边缘区域，在切线方向具有更强响应，从而在该方向上施加较弱约束。这两者协同工作，以学习面部结构和纹理细节。

链接: https://arxiv.org/abs/2109.05721
作者: Yangyu Huang,Hao Yang,Chong Li,Jongyoo Kim,Fangyun Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021 (ICCV 2021)

点击查看摘要

Abstract:The recent progress of CNN has dramatically improved face alignment performance. However, few works have paid attention to the error-bias with respect to error distribution of facial landmarks. In this paper, we investigate the error-bias issue in face alignment, where the distributions of landmark errors tend to spread along the tangent line to landmark curves. This error-bias is not trivial since it is closely connected to the ambiguous landmark labeling task. Inspired by this observation, we seek a way to leverage the error-bias property for better convergence of CNN model. To this end, we propose anisotropic direction loss (ADL) and anisotropic attention module (AAM) for coordinate and heatmap regression, respectively. ADL imposes strong binding force in normal direction for each landmark point on facial boundaries. On the other hand, AAM is an attention module which can get anisotropic attention mask focusing on the region of point and its local edge connected by adjacent points, it has a stronger response in tangent than in normal, which means relaxed constraints in the tangent. These two methods work in a complementary manner to learn both facial structures and texture details. Finally, we integrate them into an optimized end-to-end training pipeline named ADNet. Our ADNet achieves state-of-the-art results on 300W, WFLW and COFW datasets, which demonstrates the effectiveness and robustness.
zh

[CV-95] Single-shot HDR using conventional image sensor shutter functions and optical randomization

【速读】：该论文旨在解决单次拍摄高动态范围（HDR）成像中因传感器像素饱和而导致的细节丢失问题。传统方法依赖多曝光合成，但会引入运动伪影，而现有单次拍摄方法在处理高光区域时表现不佳。其解决方案的关键在于利用商用图像传感器的全局复位释放（GRR）快门模式，结合光学随机化曝光技术，通过求解带有简单总变分图像先验的优化问题来恢复HDR数据，从而在高饱和度场景下实现更优的HDR成像效果。

链接: https://arxiv.org/abs/2506.22426
作者: Xiang Dai,Kyrollos Yanny,Kristina Monakhova,Nicholas Antipa
机构: UC San Diego(加州大学圣地亚哥分校); UC Berkeley(加州大学伯克利分校); Cornell University(康奈尔大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Signal Processing (eess.SP); Optics (physics.optics)
备注:

点击查看摘要

Abstract:High-dynamic-range (HDR) imaging is an essential technique for overcoming the dynamic range limits of image sensors. The classic method relies on multiple exposures, which slows capture time, resulting in motion artifacts when imaging dynamic scenes. Single-shot HDR imaging alleviates this issue by encoding HDR data into a single exposure, then computationally recovering it. Many established methods use strong image priors to recover improperly exposed image detail. These approaches struggle with extended highlight regions. We utilize the global reset release (GRR) shutter mode of an off-the-shelf sensor. GRR shutter mode applies a longer exposure time to rows closer to the bottom of the sensor. We use optics that relay a randomly permuted (shuffled) image onto the sensor, effectively creating spatially randomized exposures across the scene. The exposure diversity allows us to recover HDR data by solving an optimization problem with a simple total variation image prior. In simulation, we demonstrate that our method outperforms other single-shot methods when many sensor pixels are saturated (10% or more), and is competitive at a modest saturation (1%). Finally, we demonstrate a physical lab prototype that uses an off-the-shelf random fiber bundle for the optical shuffling. The fiber bundle is coupled to a low-cost commercial sensor operating in GRR shutter mode. Our prototype achieves a dynamic range of up to 73dB using an 8-bit sensor with 48dB dynamic range.
zh

[CV-96] Dehazing Light Microscopy Images with Guided Conditional Flow Matching: finding a sweet spot between fidelity and realism

【速读】：该论文旨在解决光显微镜图像去雾问题，即在保持图像数据保真度（如低均方误差或高峰值信噪比）与提升图像感知真实度（如通过LPIPS或FID等感知指标衡量）之间找到平衡。现有方法要么牺牲感知真实性以追求数据保真度，要么生成具有感知说服力但缺乏定量准确性的结果。该研究提出的HazeMatching是一种新颖的迭代去雾方法，其关键在于通过在条件速度场中引入模糊观测来引导生成过程，从而实现保真度与感知真实度之间的有效权衡。

链接: https://arxiv.org/abs/2506.22397
作者: Anirban Ray,Ashesh,Florian Jug
机构: Human Technopole, Milan, Italy; Technische Universität Dresden, Germany
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: supplement pending, 4 figures, 10 pages + refs

点击查看摘要

Abstract:Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, can not, which consequently leads to hazy image data. Computational dehazing is trying to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 7 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable on real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.
zh

[CV-97] QuKAN: A Quantum Circuit Born Machine approach to Quantum Kolmogorov Arnold Networks

【速读】：该论文试图解决传统神经网络在量子机器学习中的表达能力和可解释性不足的问题，以及如何有效利用参数化量子电路的表示能力。解决方案的关键在于将Kolmogorov Arnold Networks (KANs) 与量子电路结合，通过在量子系统中实现残差函数的迁移学习，从而构建出一种新型的量子KAN架构（QuKAN），该架构在保持高效表达能力的同时增强了模型的可解释性与性能。

链接: https://arxiv.org/abs/2506.22340
作者: Yannick Werner,Akash Malemath,Mengxi Liu,Vitor Fortes Rey,Nikolaos Palaiodimopoulos,Paul Lukowicz,Maximilian Kiefer-Emmanouilidis
机构: RPTU Kaiserslautern-Landau (RPTU 凯撒斯劳滕-兰道大学); German Research Center for Artificial Intelligence (DFKI) (德国人工智能研究中心); Department of Computer Science (计算机科学系); Department of Physics (物理学系)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Kolmogorov Arnold Networks (KANs), built upon the Kolmogorov Arnold representation theorem (KAR), have demonstrated promising capabilities in expressing complex functions with fewer neurons. This is achieved by implementing learnable parameters on the edges instead of on the nodes, unlike traditional networks such as Multi-Layer Perceptrons (MLPs). However, KANs potential in quantum machine learning has not yet been well explored. In this work, we present an implementation of these KAN architectures in both hybrid and fully quantum forms using a Quantum Circuit Born Machine (QCBM). We adapt the KAN transfer using pre-trained residual functions, thereby exploiting the representational power of parametrized quantum circuits. In the hybrid model we combine classical KAN components with quantum subroutines, while the fully quantum version the entire architecture of the residual function is translated to a quantum model. We demonstrate the feasibility, interpretability and performance of the proposed Quantum KAN (QuKAN) architecture.
zh

[CV-98] DIGS: Dynamic CBCT Reconstruction using Deformation-Informed 4D Gaussian Splatting and a Low-Rank Free-Form Deformation Model MICCAI2025

【速读】：该论文旨在解决3D锥形束CT（CBCT）在放射治疗中因呼吸运动导致的运动伪影问题，尤其是在呼吸变异性的背景下，传统按呼吸相位重建的方法存在局限性。其解决方案的关键在于引入基于自由形变（FFD）的空间基函数和一种依赖形变的框架，通过耦合高斯分布的均值位置、尺度和旋转在统一形变场下的时间演化，以确保运动的一致性，从而实现高效且运动补偿的CBCT重建。

链接: https://arxiv.org/abs/2506.22280
作者: Yuliang Huang,Imraj Singh,Thomas Joyce,Kris Thielemans,Jamie R. McClelland
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025

点击查看摘要

Abstract:3D Cone-Beam CT (CBCT) is widely used in radiotherapy but suffers from motion artifacts due to breathing. A common clinical approach mitigates this by sorting projections into respiratory phases and reconstructing images per phase, but this does not account for breathing variability. Dynamic CBCT instead reconstructs images at each projection, capturing continuous motion without phase sorting. Recent advancements in 4D Gaussian Splatting (4DGS) offer powerful tools for modeling dynamic scenes, yet their application to dynamic CBCT remains underexplored. Existing 4DGS methods, such as HexPlane, use implicit motion representations, which are computationally expensive. While explicit low-rank motion models have been proposed, they lack spatial regularization, leading to inconsistencies in Gaussian motion. To address these limitations, we introduce a free-form deformation (FFD)-based spatial basis function and a deformation-informed framework that enforces consistency by coupling the temporal evolution of Gaussian’s mean position, scale, and rotation under a unified deformation field. We evaluate our approach on six CBCT datasets, demonstrating superior image quality with a 6x speedup over HexPlane. These results highlight the potential of deformation-informed 4DGS for efficient, motion-compensated CBCT reconstruction. The code is available at this https URL.
zh

[CV-99] Cardiovascular disease classification using radiomics and geometric features from cardiac CT MICCAI2025

【速读】：该论文试图解决从CT图像中自动检测和分类心血管疾病（Cardiovascular Disease, CVD）时，基于深度学习的方法在临床可解释性方面的不足。现有方法通常直接处理原始CT数据或结合心脏解剖结构分割进行端到端分类，导致模型难以从临床角度进行解释。解决方案的关键在于将CVD分类流程分解为三个组件：图像分割、图像配准以及下游CVD分类。通过使用Atlas-ISTN框架和最新的分割基础模型生成解剖结构分割和正常健康图谱，并利用这些图谱提取具有临床可解释性的放射组学特征以及基于配准的几何特征，从而提升分类性能。实验结果表明，该方法在ASOCA数据集上的分类准确率（87.50%）显著优于直接基于原始CT图像训练的分类模型（67.50%）。

链接: https://arxiv.org/abs/2506.22226
作者: Ajay Mittal,Raghav Mehta,Omar Todd,Philipp Seeböck,Georg Langs,Ben Glocker
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review at STACOM 2025 with MICCAI 2025

点击查看摘要

Abstract:Automatic detection and classification of Cardiovascular disease (CVD) from Computed Tomography (CT) images play an important part in facilitating better-informed clinical decisions. However, most of the recent deep learning based methods either directly work on raw CT data or utilize it in pair with anatomical cardiac structure segmentation by training an end-to-end classifier. As such, these approaches become much more difficult to interpret from a clinical perspective. To address this challenge, in this work, we break down the CVD classification pipeline into three components: (i) image segmentation, (ii) image registration, and (iii) downstream CVD classification. Specifically, we utilize the Atlas-ISTN framework and recent segmentation foundational models to generate anatomical structure segmentation and a normative healthy atlas. These are further utilized to extract clinically interpretable radiomic features as well as deformation field based geometric features (through atlas registration) for CVD classification. Our experiments on the publicly available ASOCA dataset show that utilizing these features leads to better CVD classification accuracy (87.50%) when compared against classification model trained directly on raw CT images (67.50%). Our code is publicly available: this https URL
zh

[CV-100] Advanced Deep Learning Techniques for Automated Segmentation of Type B Aortic Dissections

【速读】：该论文旨在解决主动脉夹层（Aortic Dissection）的医学影像分割问题，特别是针对真腔（True Lumen, TL）、假腔（False Lumen, FL）及假腔血栓（False Lumen Thrombosis, FLT）的准确分割，以支持有效的临床管理和治疗规划。其解决方案的关键在于开发了四种基于深度学习的分割流程，包括单步模型、序列模型、序列多任务模型和集成模型，采用3D U-Net和Swin-UnetR架构，实现了对CT血管造影（CTA）图像的自动化、高精度分割。

链接: https://arxiv.org/abs/2506.22222
作者: Hao Xu,Ruth Lim,Brian E. Chapman
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Purpose: Aortic dissections are life-threatening cardiovascular conditions requiring accurate segmentation of true lumen (TL), false lumen (FL), and false lumen thrombosis (FLT) from CTA images for effective management. Manual segmentation is time-consuming and variable, necessitating automated solutions. Materials and Methods: We developed four deep learning-based pipelines for Type B aortic dissection segmentation: a single-step model, a sequential model, a sequential multi-task model, and an ensemble model, utilizing 3D U-Net and Swin-UnetR architectures. A dataset of 100 retrospective CTA images was split into training (n=80), validation (n=10), and testing (n=10). Performance was assessed using the Dice Coefficient and Hausdorff Distance. Results: Our approach achieved superior segmentation accuracy, with Dice Coefficients of 0.91 \pm 0.07 for TL, 0.88 \pm 0.18 for FL, and 0.47 \pm 0.25 for FLT, outperforming Yao et al. (1), who reported 0.78 \pm 0.20, 0.68 \pm 0.18, and 0.25 \pm 0.31, respectively. Conclusion: The proposed pipelines provide accurate segmentation of TBAD features, enabling derivation of morphological parameters for surveillance and treatment planning
zh

[CV-101] owards Scalable and Robust White Matter Lesion Localization via Multimodal Deep Learning

【速读】：该论文旨在解决白质高信号（White Matter Hyperintensities, WMH）在磁共振成像（MRI）中的准确分割与空间定位问题，这对诊断和监测小血管疾病及神经退行性病变至关重要。现有方法在处理缺失模态和整合解剖定位方面存在不足，因此本文提出了一种深度学习框架，直接在原始空间中使用单模态和多模态MRI输入进行WMH病变分割与定位。其关键在于通过多任务学习联合预测病变和解剖区域掩码，以估计区域性的病变负担，并探索了不同输入配置对分割性能的影响，验证了多模态融合在提升准确性与鲁棒性方面的有效性。

链接: https://arxiv.org/abs/2506.22041
作者: Julia Machnio,Sebastian Nørgaard Llambias,Mads Nielsen,Mostafa Mehdipour Ghazi
机构: Pioneer Centre for AI (Pioneer人工智能中心); University of Copenhagen (哥本哈根大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 2nd Sorbonne-Heidelberg Workshop on AI in medicine: Machine Learning for multi-modal data

点击查看摘要

Abstract:White matter hyperintensities (WMH) are radiological markers of small vessel disease and neurodegeneration, whose accurate segmentation and spatial localization are crucial for diagnosis and monitoring. While multimodal MRI offers complementary contrasts for detecting and contextualizing WM lesions, existing approaches often lack flexibility in handling missing modalities and fail to integrate anatomical localization efficiently. We propose a deep learning framework for WM lesion segmentation and localization that operates directly in native space using single- and multi-modal MRI inputs. Our study evaluates four input configurations: FLAIR-only, T1-only, concatenated FLAIR and T1, and a modality-interchangeable setup. It further introduces a multi-task model for jointly predicting lesion and anatomical region masks to estimate region-wise lesion burden. Experiments conducted on the MICCAI WMH Segmentation Challenge dataset demonstrate that multimodal input significantly improves the segmentation performance, outperforming unimodal models. While the modality-interchangeable setting trades accuracy for robustness, it enables inference in cases with missing modalities. Joint lesion-region segmentation using multi-task learning was less effective than separate models, suggesting representational conflict between tasks. Our findings highlight the utility of multimodal fusion for accurate and robust WMH analysis, and the potential of joint modeling for integrated predictions.
zh

[CV-102] Noise-Inspired Diffusion Model for Generalizable Low-Dose CT Reconstruction

【速读】：该论文旨在解决基于深度学习的低剂量计算机断层扫描（LDCT）重建模型在训练数据中未见剂量水平上的泛化能力不足的问题。现有方法依赖于配对数据进行再训练或微调以提升泛化性能，但效果有限。本文提出的噪声启发扩散模型（NEED）的关键在于针对不同领域的噪声特性定制扩散模型：首先，引入一种新型的位移泊松扩散模型以去噪预对数LDCT投影数据；其次，设计一种双重引导扩散模型以利用LDCT图像和初始重建结果更精确地定位先验信息，从而提升重建保真度。通过级联这两个扩散模型实现双域重建，NEED仅需正常剂量数据进行训练，并可通过时间步匹配策略有效扩展到各种未见过的剂量水平。

链接: https://arxiv.org/abs/2506.22012
作者: Qi Gao,Zhihao Chen,Dong Zeng,Junping Zhang,Jianhua Ma,Hongming Shan
机构: Fudan University (复旦大学); Southern Medical University (南方医科大学); Xi’an Jiaotong University (西安交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Medical Image Analysis, 2025

点击查看摘要

Abstract:The generalization of deep learning-based low-dose computed tomography (CT) reconstruction models to doses unseen in the training data is important and remains challenging. Previous efforts heavily rely on paired data to improve the generalization performance and robustness through collecting either diverse CT data for re-training or a few test data for fine-tuning. Recently, diffusion models have shown promising and generalizable performance in low-dose CT (LDCT) reconstruction, however, they may produce unrealistic structures due to the CT image noise deviating from Gaussian distribution and imprecise prior information from the guidance of noisy LDCT images. In this paper, we propose a noise-inspired diffusion model for generalizable LDCT reconstruction, termed NEED, which tailors diffusion models for noise characteristics of each domain. First, we propose a novel shifted Poisson diffusion model to denoise projection data, which aligns the diffusion process with the noise model in pre-log LDCT projections. Second, we devise a doubly guided diffusion model to refine reconstructed images, which leverages LDCT images and initial reconstructions to more accurately locate prior information and enhance reconstruction fidelity. By cascading these two diffusion models for dual-domain reconstruction, our NEED requires only normal-dose data for training and can be effectively extended to various unseen dose levels during testing via a time step matching strategy. Extensive qualitative, quantitative, and segmentation-based evaluations on two datasets demonstrate that our NEED consistently outperforms state-of-the-art methods in reconstruction and generalization performance. Source code is made available at this https URL.
zh

[CV-103] StableCodec: Taming One-Step Diffusion for Extreme Image Compression

【速读】：该论文旨在解决基于扩散模型的图像压缩在极端比特率下（低于0.05 bits per pixel）生成高质量图像时存在的两个关键问题：一是解码器需要大量去噪步骤以生成逼真结果，限制了其在实时压缩场景中的应用；二是扩散模型通常无法保证像素级一致性，导致重建保真度下降。解决方案的关键在于提出StableCodec，其核心包括：一种高效的深度压缩潜在编解码器，用于传输噪声潜在表示以实现单步去噪；一个双分支编码结构，通过辅助编码器和解码器提升重建保真度；以及端到端优化方法，结合比特率和像素级约束。

链接: https://arxiv.org/abs/2506.21977
作者: Tianyu Zhang,Xin Luo,Li Li,Dong Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based image compression has shown remarkable potential for achieving ultra-low bitrate coding (less than 0.05 bits per pixel) with high realism, by leveraging the generative priors of large pre-trained text-to-image diffusion models. However, current approaches require a large number of denoising steps at the decoder to generate realistic results under extreme bitrate constraints, limiting their application in real-time compression scenarios. Additionally, these methods often sacrifice reconstruction fidelity, as diffusion models typically fail to guarantee pixel-level consistency. To address these challenges, we introduce StableCodec, which enables one-step diffusion for high-fidelity and high-realism extreme image compression with improved coding efficiency. To achieve ultra-low bitrates, we first develop an efficient Deep Compression Latent Codec to transmit a noisy latent representation for a single-step denoising process. We then propose a Dual-Branch Coding Structure, consisting of a pair of auxiliary encoder and decoder, to enhance reconstruction fidelity. Furthermore, we adopt end-to-end optimization with joint bitrate and pixel-level constraints. Extensive experiments on the CLIC 2020, DIV2K, and Kodak dataset demonstrate that StableCodec outperforms existing methods in terms of FID, KID and DISTS by a significant margin, even at bitrates as low as 0.005 bits per pixel, while maintaining strong fidelity. Additionally, StableCodec achieves inference speeds comparable to mainstream transform coding schemes. All source code are available at this https URL.
zh

[CV-104] UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields ICCV2025

【速读】：该论文旨在解决基于Neural Radiance Field (NeRF)的分割方法仅依赖RGB数据而缺乏内在材质属性的问题，这一限制影响了准确的材质感知，进而制约了其在机器人、增强现实、仿真等领域的应用。其解决方案的关键在于引入光谱解混（spectral unmixing）机制，通过将漫反射和镜面反射成分建模为全局端元（endmembers）的字典以及每点的丰度分布，实现联合高光谱新视角合成与无监督材质分割，并支持基于材质的场景编辑。

链接: https://arxiv.org/abs/2506.21884
作者: Fabian Perez,Sara Rojas,Carlos Hinojosa,Hoover Rueda-Chacón,Bernard Ghanem
机构: Universidad Industrial de Santander (桑坦德工业大学); KAUST (KAUST)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: Paper accepted at ICCV 2025 main conference

点击查看摘要

Abstract:Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation to existing methods. Project page: this https URL.
zh

[CV-105] Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer

【速读】：该论文旨在解决干涉高光谱成像（Interferometric Hyperspectral Imaging, IHI）在重建过程中面临的两大挑战：一是缺乏足够的训练数据集，二是难以通过基于学习的方法消除IHI特有的退化成分。其解决方案的关键在于构建一个简化的但精确的IHI退化模型，并结合辐射校准数据进行参数估计，从而能够从高光谱图像（HSIs）中合成出逼真的IHI训练数据集，填补IHI重建与深度学习之间的差距；同时设计了基于条纹增强机制和空谱变换器架构的干涉高光谱重建展开变压器（IHRUT），实现有效的光谱校正和细节恢复。

链接: https://arxiv.org/abs/2506.21880
作者: Yuansheng Li,Yunhao Zou,Linwei Chen,Ying Fu
机构: Beijing Institute of Technology (北京理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interferometric Hyperspectral Imaging (IHI) is a critical technique for large-scale remote sensing tasks due to its advantages in flux and spectral resolution. However, IHI is susceptible to complex errors arising from imaging steps, and its quality is limited by existing signal processing-based reconstruction algorithms. Two key challenges hinder performance enhancement: 1) the lack of training datasets. 2) the difficulty in eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. First, based on imaging physics and radiometric calibration data, we establish a simplified yet accurate IHI degradation model and a parameter estimation method. This model enables the synthesis of realistic IHI training datasets from hyperspectral images (HSIs), bridging the gap between IHI reconstruction and deep learning. Second, we design the Interferometric Hyperspectral Reconstruction Unfolding Transformer (IHRUT), which achieves effective spectral correction and detail restoration through a stripe-pattern enhancement mechanism and a spatial-spectral transformer architecture. Experimental results demonstrate the superior performance and generalization capability of our method.
zh

[CV-106] US-REC2024: A Challenge to Reconstruct 3D Freehand Ultrasound Without External Tracker

【速读】：该论文旨在解决无追踪器自由手式超声重建问题，即从序列的2D超声图像中重建3D体积，而无需依赖外部追踪系统。其关键解决方案在于通过构建一个公开数据集、基线模型和评估框架，推动无追踪器3D超声重建技术的发展，并探索多种算法方法，包括循环模型、基于配准的体积优化、注意力机制和物理信息模型，以应对运动估计精度、长序列漂移最小化及扫描协议泛化等挑战。

链接: https://arxiv.org/abs/2506.21765
作者: Qi Li,Shaheer U. Saeed,Yuliang Huang,Mingyuan Luo,Zhongnuo Yan,Jiongquan Chen,Xin Yang,Dong Ni,Nektarios Winter,Phuc Nguyen,Lucas Steinberger,Caelan Haney,Yuan Zhao,Mingjie Jiang,Bowen Ren,SiYeoul Lee,Seonho Kim,MinKyung Seo,MinWoo Kim,Yimeng Dou,Zhiwei Zhang,Yin Li,Tomy Varghese,Dean C. Barratt,Matthew J. Clarkson,Tom Vercauteren,Yipeng Hu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trackerless freehand ultrasound reconstruction aims to reconstruct 3D volumes from sequences of 2D ultrasound images without relying on external tracking systems, offering a low-cost, portable, and widely deployable alternative for volumetric imaging. However, it presents significant challenges, including accurate inter-frame motion estimation, minimisation of drift accumulation over long sequences, and generalisability across scanning protocols. The TUS-REC2024 Challenge was established to benchmark and accelerate progress in trackerless 3D ultrasound reconstruction by providing a publicly available dataset for the first time, along with a baseline model and evaluation framework. The Challenge attracted over 43 registered teams, of which 6 teams submitted 21 valid dockerized solutions. Submitted methods spanned a wide range of algorithmic approaches, including recurrent models, registration-driven volume refinement, attention, and physics-informed models. This paper presents an overview of the Challenge design, summarises the key characteristics of the dataset, provides a concise literature review, introduces the technical details of the underlying methodology working with tracked freehand ultrasound data, and offers a comparative analysis of submitted methods across multiple evaluation metrics. The results highlight both the progress and current limitations of state-of-the-art approaches in this domain, and inform directions for future research. The data, evaluation code, and baseline are publicly available to facilitate ongoing development and reproducibility. As a live and evolving benchmark, this Challenge is designed to be continuously developed and improved. The Challenge was held at MICCAI 2024 and will be organised again at MICCAI 2025, reflecting its growing impact and the sustained commitment to advancing this field.
zh

[CV-107] Inverse Design of Diffractive Metasurfaces Using Diffusion Models

【速读】：该论文试图解决超表面（metasurface）逆向设计中的挑战，即在结构与光学特性之间存在复杂非线性关系的情况下，如何高效地生成满足特定光学响应的几何结构。传统方法通常需要专家调参、易陷入局部极小值且计算开销大。解决方案的关键在于将生成式AI（Generative AI）中的扩散模型（diffusion model）集成到计算设计流程中，利用RCWA模拟器生成训练数据，并训练条件扩散模型从目标空间功率分布预测元原子几何形状和高度，从而实现高效、低误差的超表面设计。

链接: https://arxiv.org/abs/2506.21748
作者: Liav Hen,Erez Yosef,Dan Raviv,Raja Giryes,Jacob Scheuer
机构: Tel-Aviv University(特拉维夫大学); The Center for Nanosciences and Nanotechnology(纳米科学与纳米技术中心)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Metasurfaces are ultra-thin optical elements composed of engineered sub-wavelength structures that enable precise control of light. Their inverse design - determining a geometry that yields a desired optical response - is challenging due to the complex, nonlinear relationship between structure and optical properties. This often requires expert tuning, is prone to local minima, and involves significant computational overhead. In this work, we address these challenges by integrating the generative capabilities of diffusion models into computational design workflows. Using an RCWA simulator, we generate training data consisting of metasurface geometries and their corresponding far-field scattering patterns. We then train a conditional diffusion model to predict meta-atom geometry and height from a target spatial power distribution at a specified wavelength, sampled from a continuous supported band. Once trained, the model can generate metasurfaces with low error, either directly using RCWA-guided posterior sampling or by serving as an initializer for traditional optimization methods. We demonstrate our approach on the design of a spatially uniform intensity splitter and a polarization beam splitter, both produced with low error in under 30 minutes. To support further research in data-driven metasurface design, we publicly release our code and datasets.
zh

[CV-108] PhotonSplat: 3D Scene Reconstruction and Colorization from SPAD Sensors

【速读】：该论文旨在解决在输入图像受运动模糊影响时，基于神经渲染的三维重建技术失效的问题（motion blur in input imagery）。其关键解决方案是引入单光子雪崩二极管（SPAD）阵列，这是一种能够以极高帧率捕捉图像的新兴传感技术，同时提出PhotonSplat框架，该框架能够直接从SPAD生成的二进制图像中重建三维场景，并通过创新的三维空间滤波技术减少噪声，从而有效平衡噪声与模糊之间的权衡。

链接: https://arxiv.org/abs/2506.21680
作者: Sai Sri Teja,Sreevidya Chintalapati,Vinayak Gupta,Mukund Varma T,Haejoon Lee,Aswin Sankaranarayanan,Kaushik Mitra
机构: Indian Institute of Technology, Madras(印度理工学院，马德拉斯); University of California, San Diego(加州大学圣地亚哥分校); Carnagie Mellon University(卡内基梅隆大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the International Conference on Computational Photography(ICCP) 2025

点击查看摘要

Abstract:Advances in 3D reconstruction using neural rendering have enabled high-quality 3D capture. However, they often fail when the input imagery is corrupted by motion blur, due to fast motion of the camera or the objects in the scene. This work advances neural rendering techniques in such scenarios by using single-photon avalanche diode (SPAD) arrays, an emerging sensing technology capable of sensing images at extremely high speeds. However, the use of SPADs presents its own set of unique challenges in the form of binary images, that are driven by stochastic photon arrivals. To address this, we introduce PhotonSplat, a framework designed to reconstruct 3D scenes directly from SPAD binary images, effectively navigating the noise vs. blur trade-off. Our approach incorporates a novel 3D spatial filtering technique to reduce noise in the renderings. The framework also supports both no-reference using generative priors and reference-based colorization from a single blurry image, enabling downstream applications such as segmentation, object detection and appearance editing tasks. Additionally, we extend our method to incorporate dynamic scene representations, making it suitable for scenes with moving objects. We further contribute PhotonScenes, a real-world multi-view dataset captured with the SPAD sensors.
zh

人工智能

[AI-0] CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings

【速读】：该论文旨在解决联邦学习中的聚类问题，即在客户端数据分布未知的情况下，如何有效地将客户端分组到不同的簇中，并为每个簇优化特定的模型。解决方案的关键在于提出一种名为CLoVE（Loss Vector Embeddings聚类）的算法，该算法利用客户端在自身数据上的模型损失生成嵌入表示，并基于同一簇内客户端具有相似损失值、不同簇间损失模式差异显著的观察，通过迭代方式识别并分离不同簇的客户端，从而实现簇内模型的联邦聚合。

链接: https://arxiv.org/abs/2506.22427
作者: Randeep Bhatia,Nikos Papadis,Murali Kodialam,TV Lakshman,Sayak Chakrabarty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 31 pages, 4 figures

点击查看摘要

Abstract:We propose CLoVE (Clustering of Loss Vector Embeddings), a novel algorithm for Clustered Federated Learning (CFL). In CFL, clients are naturally grouped into clusters based on their data distribution. However, identifying these clusters is challenging, as client assignments are unknown. CLoVE utilizes client embeddings derived from model losses on client data, and leverages the insight that clients in the same cluster share similar loss values, while those in different clusters exhibit distinct loss patterns. Based on these embeddings, CLoVE is able to iteratively identify and separate clients from different clusters and optimize cluster-specific models through federated aggregation. Key advantages of CLoVE over existing CFL algorithms are (1) its simplicity, (2) its applicability to both supervised and unsupervised settings, and (3) the fact that it eliminates the need for near-optimal model initialization, which makes it more robust and better suited for real-world applications. We establish theoretical convergence bounds, showing that CLoVE can recover clusters accurately with high probability in a single round and converges exponentially fast to optimal models in a linear setting. Our comprehensive experiments comparing with a variety of both CFL and generic Personalized Federated Learning (PFL) algorithms on different types of datasets and an extensive array of non-IID settings demonstrate that CLoVE achieves highly accurate cluster recovery in just a few rounds of training, along with state-of-the-art model accuracy, across a variety of both supervised and unsupervised PFL tasks.
zh

[AI-1] Multi-View Contrastive Learning for Robust Domain Adaptation in Medical Time Series Analysis

【速读】：该论文旨在解决在不同领域间适应机器学习模型以处理医学时间序列数据的挑战，这一挑战源于复杂的时序依赖关系和动态分布变化。现有方法通常专注于孤立的特征表示，限制了其全面捕捉必要时序动态的能力。该研究提出了一种基于多视角对比学习的新框架，其关键在于整合时序模式、基于导数的动力学特征以及频域特征，通过独立编码器和分层融合机制学习具有特征不变性的表示，从而实现跨领域的可迁移性并保持时序一致性。

链接: https://arxiv.org/abs/2506.22393
作者: YongKyung Oh,Alex Bui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adapting machine learning models to medical time series across different domains remains a challenge due to complex temporal dependencies and dynamic distribution shifts. Current approaches often focus on isolated feature representations, limiting their ability to fully capture the intricate temporal dynamics necessary for robust domain adaptation. In this work, we propose a novel framework leveraging multi-view contrastive learning to integrate temporal patterns, derivative-based dynamics, and frequency-domain features. Our method employs independent encoders and a hierarchical fusion mechanism to learn feature-invariant representations that are transferable across domains while preserving temporal coherence. Extensive experiments on diverse medical datasets, including electroencephalogram (EEG), electrocardiogram (ECG), and electromyography (EMG) demonstrate that our approach significantly outperforms state-of-the-art methods in transfer learning tasks. By advancing the robustness and generalizability of machine learning models, our framework offers a practical pathway for deploying reliable AI systems in diverse healthcare settings.
zh

[AI-2] owards Distributed Neural Architectures

【速读】：该论文旨在解决传统神经网络架构在计算效率和参数共享方面的局限性，特别是在视觉和语言任务中如何实现更灵活且高效的模型结构。其解决方案的关键在于引入分布式神经架构（Distributed Neural Architectures, DNA），该架构通过初始化包含多种模块（如Transformer、MLP、注意力机制等）和路由器的原型架构，使每个token（或图像块）能够以任意顺序遍历任意模块序列。DNA通过端到端训练学习计算与通信模式，并可根据优化目标（如计算/内存效率或负载均衡）进行调整，从而实现高效且可扩展的模型设计。

链接: https://arxiv.org/abs/2506.22389
作者: Aditya Cowsik,Tianyu He,Andrey Gromov
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注: 36 pages, 25 figures

点击查看摘要

Abstract:We introduce and train distributed neural architectures (DNA) in vision and language domains. DNAs are initialized with a proto-architecture that consists of (transformer, MLP, attention, etc.) modules and routers. Any token (or patch) can traverse any series of modules in any order. DNAs are a natural generalization of the sparse methods such as Mixture-of-Experts, Mixture-of-Depths, parameter sharing, etc. Computation and communication patterns of DNA modules are learnt end-to-end during training and depend on the content and context of each token (or patch). These patterns can be shaped by further requirements added to the optimization objective such as compute/memory efficiency or load balancing. We empirically show that (i) trained DNAs are competitive with the dense baselines in both domains and (ii) compute efficiency/parameter sharing can be learnt from data. Next, we analyze the emergent connectivity and computation patterns in the trained DNAs. We find that the paths that tokens take through the models are themselves distributed according to a power-law. We show that some paths (or, equivalently, groups of modules) show emergent specialization. Finally, we demonstrate that models learn to allocate compute and active parameters in an interpretable way.
zh

[AI-3] Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems

【速读】：该论文旨在解决大规模通信系统中因场景复杂度增加而导致的边缘设备在处理多模态传感数据时协作效率不足的问题，传统联邦学习（Federated Learning, FL）算法通常仅适用于单模态数据，要求统一的模型架构，并未能充分利用多模态数据中的丰富信息，从而限制了其在具有多样模态和不同客户端能力的实际场景中的应用。解决方案的关键在于提出一种基于层理论（Sheaf Theory）的去中心化多模态学习框架——Sheaf-DMFL，通过在客户端本地对不同模态进行特征编码并融合，同时利用层结构捕捉任务特定层之间的内在相关性，以增强设备间的协作能力。此外，还引入了Sheaf-DMFL-Att算法，通过注意力机制进一步提升多模态间的相关性建模能力，并提供了严格的收敛性分析以保证理论可靠性。

链接: https://arxiv.org/abs/2506.22374
作者: Abdulmomen Ghalkha,Zhuojun Tian,Chaouki Ben Issaid,Mehdi Bennis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures

点击查看摘要

Abstract:In large-scale communication systems, increasingly complex scenarios require more intelligent collaboration among edge devices collecting various multimodal sensory data to achieve a more comprehensive understanding of the environment and improve decision-making accuracy. However, conventional federated learning (FL) algorithms typically consider unimodal datasets, require identical model architectures, and fail to leverage the rich information embedded in multimodal data, limiting their applicability to real-world scenarios with diverse modalities and varying client capabilities. To address this issue, we propose Sheaf-DMFL, a novel decentralized multimodal learning framework leveraging sheaf theory to enhance collaboration among devices with diverse modalities. Specifically, each client has a set of local feature encoders for its different modalities, whose outputs are concatenated before passing through a task-specific layer. While encoders for the same modality are trained collaboratively across clients, we capture the intrinsic correlations among clients’ task-specific layers using a sheaf-based structure. To further enhance learning capability, we propose an enhanced algorithm named Sheaf-DMFL-Att, which tailors the attention mechanism within each client to capture correlations among different modalities. A rigorous convergence analysis of Sheaf-DMFL-Att is provided, establishing its theoretical guarantees. Extensive simulations are conducted on real-world link blockage prediction and mmWave beamforming scenarios, demonstrate the superiority of the proposed algorithms in such heterogeneous wireless communication systems.
zh

[AI-4] Concept-Level AI for Telecom: Moving Beyond Large Language Models

【速读】：该论文试图解决电信和网络领域在面对日益复杂、分层且多管理域（即同一路径上有多个运营商）以及多语言系统时所面临的管理难题。现有研究显示，尽管大型语言模型（Large Language Models, LLMs）在文本分析和代码生成方面表现出色，但其基于逐标记处理的机制和有限的长程上下文保持能力，难以满足电信领域对跨层依赖级联、时空故障关联及实时分布式协调等特定需求。论文提出的解决方案关键在于采用大型概念模型（Large Concept Models, LCMs），通过语义概念层面的推理，利用双曲潜在空间进行层次化表示，并将复杂的多层网络交互封装在简洁的概念嵌入中，从而在内存效率、跨层相关性及原生多模态集成方面显著优于LLMs。

链接: https://arxiv.org/abs/2506.22359
作者: Viswanath Kumarskandpriya,Abdulhalim Dandoush,Abbas Bradai,Ali Belgacem
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The telecommunications and networking domain stands at the precipice of a transformative era, driven by the necessity to manage increasingly complex, hierarchical, multi administrative domains (i.e., several operators on the same path) and multilingual systems. Recent research has demonstrated that Large Language Models (LLMs), with their exceptional general-purpose text analysis and code generation capabilities, can be effectively applied to certain telecom problems (e.g., auto-configuration of data plan to meet certain application requirements). However, due to their inherent token-by-token processing and limited capacity for maintaining extended context, LLMs struggle to fulfill telecom-specific requirements such as cross-layer dependency cascades (i.e., over OSI), temporal-spatial fault correlation, and real-time distributed coordination. In contrast, Large Concept Models (LCMs), which reason at the abstraction level of semantic concepts rather than individual lexical tokens, offer a fundamentally superior approach for addressing these telecom challenges. By employing hyperbolic latent spaces for hierarchical representation and encapsulating complex multi-layered network interactions within concise concept embeddings, LCMs overcome critical shortcomings of LLMs in terms of memory efficiency, cross-layer correlation, and native multimodal integration. This paper argues that adopting LCMs is not simply an incremental step, but a necessary evolutionary leap toward achieving robust and effective AI-driven telecom management.
zh

[AI-5] AI Model Passport: Data and System Traceability Framework for Transparent AI in Health

【速读】：该论文试图解决人工智能（Artificial Intelligence, AI）在医疗和生物医学系统中日益增长的集成所带来的透明度、责任归属和伦理合规性问题。现有框架依赖于人工可读的手动文档，限制了跨项目和平台的可扩展性、可比性和机器可解释性，并且无法为AI模型提供唯一且可验证的身份，从而影响其溯源性、真实性、可复现性及利益相关者的信任。解决方案的关键在于提出AI Model Passport，这是一种结构化和标准化的文档框架，作为AI模型的数字身份和验证工具，通过捕获关键元数据来实现对AI模型在整个生命周期中的唯一标识、验证、追踪与监控。

链接: https://arxiv.org/abs/2506.22358
作者: Varvara Kalokyri,Nikolaos S. Tachos,Charalampos N. Kalantzopoulos,Stelios Sfakianakis,Haridimos Kondylakis,Dimitrios I. Zaridis,Sara Colantonio,Daniele Regge,Nikolaos Papanikolaou, TheProCAncer-I consortium,Konstantinos Marias,Dimitrios I. Fotiadis,Manolis Tsiknakis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing integration of Artificial Intelligence (AI) into health and biomedical systems necessitates robust frameworks for transparency, accountability, and ethical compliance. Existing frameworks often rely on human-readable, manual documentation which limits scalability, comparability, and machine interpretability across projects and platforms. They also fail to provide a unique, verifiable identity for AI models to ensure their provenance and authenticity across systems and use cases, limiting reproducibility and stakeholder trust. This paper introduces the concept of the AI Model Passport, a structured and standardized documentation framework that acts as a digital identity and verification tool for AI models. It captures essential metadata to uniquely identify, verify, trace and monitor AI models across their lifecycle - from data acquisition and preprocessing to model design, development and deployment. In addition, an implementation of this framework is presented through AIPassport, an MLOps tool developed within the ProCAncer-I EU project for medical imaging applications. AIPassport automates metadata collection, ensures proper versioning, decouples results from source scripts, and integrates with various development environments. Its effectiveness is showcased through a lesion segmentation use case using data from the ProCAncer-I dataset, illustrating how the AI Model Passport enhances transparency, reproducibility, and regulatory readiness while reducing manual effort. This approach aims to set a new standard for fostering trust and accountability in AI-driven healthcare solutions, aspiring to serve as the basis for developing transparent and regulation compliant AI systems across domains.
zh

[AI-6] Embodied AI Agents : Modeling the World

【速读】：该论文旨在解决如何使AI代理在视觉、虚拟或物理环境中更有效地与用户和环境进行交互的问题。其核心挑战在于提升AI代理的自主性与环境适应能力，使其更接近人类的学习与交互方式。解决方案的关键在于发展世界模型（world models），通过整合多模态感知、基于推理的动作与控制规划以及记忆机制，使AI代理能够理解并预测环境，识别用户意图和社会情境，从而增强其执行复杂任务的能力。此外，论文还提出学习用户的心理世界模型以促进更高效的人机协作。

链接: https://arxiv.org/abs/2506.22355
作者: Pascale Fung,Yoram Bachrach,Asli Celikyilmaz,Kamalika Chaudhuri,Delong Chen,Willy Chung,Emmanuel Dupoux,Hervé Jégou,Alessandro Lazaric,Arjun Majumdar,Andrea Madotto,Franziska Meier,Florian Metze,Théo Moutakanni,Juan Pino,Basile Terver,Joseph Tighe,Jitendra Malik
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper describes our research on AI agents embodied in visual, virtual or physical forms, enabling them to interact with both users and their environments. These agents, which include virtual avatars, wearable devices, and robots, are designed to perceive, learn and act within their surroundings, which makes them more similar to how humans learn and interact with the environments as compared to disembodied agents. We propose that the development of world models is central to reasoning and planning of embodied AI agents, allowing these agents to understand and predict their environment, to understand user intentions and social contexts, thereby enhancing their ability to perform complex tasks autonomously. World modeling encompasses the integration of multimodal perception, planning through reasoning for action and control, and memory to create a comprehensive understanding of the physical world. Beyond the physical world, we also propose to learn the mental world model of users to enable better human-agent collaboration.
zh

[AI-7] A Framework for Multi-source Privacy Preserving Epidemic Analysis

【速读】：该论文试图解决在流行病学和公共卫生分析中，如何有效利用包含敏感信息的多样化数据集进行疫情预测和机制性传播模型学习的问题，同时确保数据隐私。解决方案的关键在于构建一个集成深度学习与流行病模型的框架，该框架能够同时实现疫情预测和机制性模型的学习，并在分析过程中整合多个数据集，包括具有差分隐私（Differential Privacy, DP）保障的数据集。

链接: https://arxiv.org/abs/2506.22342
作者: Zihan Guan,Zhiyuan Zhao,Fengwei Tian,Dung Nguyen,Payel Bhattacharjee,Ravi Tandon,B. Aditya Prakash,Anil Vullikanti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:It is now well understood that diverse datasets provide a lot of value in key epidemiology and public health analyses, such as forecasting and nowcasting, development of epidemic models, evaluation and design of interventions and resource allocation. Some of these datasets are often sensitive, and need adequate privacy protections. There are many models of privacy, but Differential Privacy (DP) has become a de facto standard because of its strong guarantees, without making models about adversaries. In this paper, we develop a framework the integrates deep learning and epidemic models to simultaneously perform epidemic forecasting and learning a mechanistic model of epidemic spread, while incorporating multiple datasets for these analyses, including some with DP guarantees. We demonstrate our framework using a realistic but synthetic financial dataset with DP; such a dataset has not been used in such epidemic analyses. We show that this dataset provides significant value in forecasting and learning an epidemic model, even when used with DP guarantees.
zh

[AI-8] Less Greedy Equivalence Search

【速读】：该论文旨在解决Greedy Equivalence Search (GES)在实际应用中面临的计算成本高和有限样本准确性不足的问题。其解决方案的关键在于提出Less Greedy Equivalence Search (LGES)，通过修改贪心步骤，避免在变量间进行得分表明存在条件独立性的边插入，从而实现更高效的搜索策略。此方法在保持理论保证的同时，显著提升了计算效率并减少了结构误差。

链接: https://arxiv.org/abs/2506.22331
作者: Adiba Ejaz,Elias Bareinboim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注: 35 total pages. 14 figures

点击查看摘要

Abstract:Greedy Equivalence Search (GES) is a classic score-based algorithm for causal discovery from observational data. In the sample limit, it recovers the Markov equivalence class of graphs that describe the data. Still, it faces two challenges in practice: computational cost and finite-sample accuracy. In this paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that retains its theoretical guarantees while partially addressing these limitations. LGES modifies the greedy step: rather than always applying the highest-scoring insertion, it avoids edge insertions between variables for which the score implies some conditional independence. This more targeted search yields up to a (10)-fold speed-up and a substantial reduction in structural error relative to GES. Moreover, LGES can guide the search using prior assumptions, while correcting these assumptions when contradicted by the data. Finally, LGES can exploit interventional data to refine the learned observational equivalence class. We prove that LGES recovers the true equivalence class in the sample limit from observational and interventional data, even with misspecified prior assumptions. Experiments demonstrate that LGES outperforms GES and other baselines in speed, accuracy, and robustness to misspecified assumptions. Our code is available at this https URL.
zh

[AI-9] A Practical Approach to Power Saving in Hearables Using Sub-Nyquist Sampling with Bandwidth Extension

【速读】：该论文旨在解决可穿戴设备（hearables）在低功耗条件下实现多模态语音增强（multimodal speech enhancement）时面临的三个关键问题：一是如何联合优化模拟-数字转换器（ADC）的采样频率和位深度以降低功耗并保持语音质量和可懂度；二是如何在不使用实际生成对抗网络（GAN）判别器的情况下实现类似GAN的音频质量；三是如何在缺乏宽带重建方法的情况下对空气传导麦克风（ACM）和骨传导麦克风（BCM）信号进行亚奈奎斯特采样率处理。其解决方案的关键在于提出SUBARU（Sub-Nyquist Audio Resolution Upsampling），该方法通过有意采用亚奈奎斯特采样和低比特分辨率的ADC实现功耗降低，引入多尺度和多周期虚拟判别器以实现类似GAN的音频质量，并在移动平台实现低延迟和低内存占用的流式语音增强。

链接: https://arxiv.org/abs/2506.22321
作者: Tarikul Islam Tamiti,Anomadarshi Barua
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Hearables are wearable computers that are worn on the ear. Bone conduction microphones (BCMs) are used with air conduction microphones (ACMs) in hearables as a supporting modality for multimodal speech enhancement (SE) in noisy conditions. However, existing works don’t consider the following practical aspects for low-power implementations on hearables: (i) They do not explore how lowering the sampling frequencies and bit resolutions in analog-to-digital converters (ADCs) of hearables jointly impact low-power processing and multimodal SE in terms of speech quality and intelligibility. (ii) They don’t discuss how GAN-like audio quality can be achieved without using actual GAN discriminators. And (iii) They don’t process signals from ACMs/BCMs at sub-Nyquist sampling rate because, in their frameworks, they lack a wideband reconstruction methodology from their narrowband parts. We propose SUBARU (\textbfSub-Nyquist \textbfAudio \textbfResolution \textbfUpsampling), which achieves the following: SUBARU (i) intentionally uses sub-Nyquist sampling and low bit resolution in ADCs, achieving a 3.31x reduction in power consumption; (ii) introduces novel multi-scale and multi-period virtual discriminators, which achieve GAN-like audio quality without using GANs’ adversarial training; and (iii) achieves streaming operations on mobile platforms and SE in in-the-wild noisy conditions with an inference time of 1.74ms and a memory footprint of less than 13.77MB.
zh

[AI-10] CoATA: Effective Co-Augmentation of Topology and Attribute for Graph Neural Networks ICMR

【速读】：该论文旨在解决现实世界图中噪声和不完整性对图神经网络（Graph Neural Networks, GNNs）性能的严重影响问题。现有方法通常仅通过单维度增强来应对这一挑战，专注于优化拓扑结构或扰动节点属性，而忽略了两者之间的深层交互。该论文提出的解决方案——CoATA，其关键在于设计了一个双通道框架，实现拓扑与属性的协同增强（Co-Augmentation of Topology and Attribute），通过结构信号传播、属性空间投影、对比学习等机制，促进增强图与原始图之间的相互校正，从而有效捕捉拓扑与属性之间的协同关系。

链接: https://arxiv.org/abs/2506.22299
作者: Tao Liu,Longlong Lin,Yunfeng Yu,Xi Ou,Youan Zhang,Zhiqiu Ye,Tao Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: icmr

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have garnered substantial attention due to their remarkable capability in learning graph representations. However, real-world graphs often exhibit substantial noise and incompleteness, which severely degrades the performance of GNNs. Existing methods typically address this issue through single-dimensional augmentation, focusing either on refining topology structures or perturbing node attributes, thereby overlooking the deeper interplays between the two. To bridge this gap, this paper presents CoATA, a dual-channel GNN framework specifically designed for the Co-Augmentation of Topology and Attribute. Specifically, CoATA first propagates structural signals to enrich and denoise node attributes. Then, it projects the enhanced attribute space into a node-attribute bipartite graph for further refinement or reconstruction of the underlying structure. Subsequently, CoATA introduces contrastive learning, leveraging prototype alignment and consistency constraints, to facilitate mutual corrections between the augmented and original graphs. Finally, extensive experiments on seven benchmark datasets demonstrate that the proposed CoATA outperforms eleven state-of-the-art baseline methods, showcasing its effectiveness in capturing the synergistic relationship between topology and attributes.
zh

[AI-11] Artificial Intelligent Disobedience: Rethinking the Agency of Our Artificial Teammates

【速读】：该论文试图解决当前大多数协作人工智能系统缺乏自主性的问题，这些系统通常表现为僵化的服从性，仅按照用户指令执行任务，而无法在必要时进行理性判断，可能导致不安全或低效的结果。解决方案的关键在于扩展人工智能队友的代理能力，引入“智能不服从”（intelligent disobedience），使AI能够在人机协作中做出有意义且自主的决策，从而提升整体协作效率与安全性。

链接: https://arxiv.org/abs/2506.22276
作者: Reuth Mirsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of a paper accepted for publication in AI Magazine

点击查看摘要

Abstract:Artificial intelligence has made remarkable strides in recent years, achieving superhuman performance across a wide range of tasks. Yet despite these advances, most cooperative AI systems remain rigidly obedient, designed to follow human instructions without question and conform to user expectations, even when doing so may be counterproductive or unsafe. This paper argues for expanding the agency of AI teammates to include \textitintelligent disobedience, empowering them to make meaningful and autonomous contributions within human-AI teams. It introduces a scale of AI agency levels and uses representative examples to highlight the importance and growing necessity of treating AI autonomy as an independent research focus in cooperative settings. The paper then explores how intelligent disobedience manifests across different autonomy levels and concludes by proposing initial boundaries and considerations for studying disobedience as a core capability of artificial agents.
zh

[AI-12] Breaking Rank Bottlenecks in Knowledge Graph Completion

【速读】：该论文试图解决知识图谱补全（Knowledge Graph Completion, KGC）模型中由于输出层秩瓶颈（rank bottleneck）导致的模型表达能力受限问题。当实体数量远大于模型嵌入维度时，线性输出层的秩瓶颈会限制模型的预测能力和分数分布的准确性。解决方案的关键在于引入KGE-MoS，一种基于混合模型的输出层结构，通过打破秩瓶颈来提升KGC模型的性能和概率拟合效果，同时保持较低的参数成本。

链接: https://arxiv.org/abs/2506.22271
作者: Samy Badreddine,Emile van Krieken,Luciano Serafini
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many Knowledge Graph Completion (KGC) models, despite using powerful encoders, rely on a simple vector-matrix multiplication to score queries against candidate object entities. When the number of entities is larger than the model’s embedding dimension, which in practical scenarios is often by several orders of magnitude, we have a linear output layer with a rank bottleneck. Such bottlenecked layers limit model expressivity. We investigate both theoretically and empirically how rank bottlenecks affect KGC models. We find that, by limiting the set of feasible predictions, rank bottlenecks hurt ranking accuracy and the distribution fidelity of scores. Inspired by the language modelling literature, we propose KGE-MoS, a mixture-based output layer to break rank bottlenecks in many KGC models. Our experiments on four datasets show that KGE-MoS improves performance and probabilistic fit of KGC models for a low parameter cost.
zh

[AI-13] Adapting University Policies for Generative AI: Opportunities Challenges and Policy Solutions in Higher Education

【速读】：该论文试图解决生成式AI（Generative AI）在高等教育中的广泛应用所带来的机遇与挑战，特别是其对学术诚信、伦理边界和公平性的影响。论文指出，尽管大型语言模型（LLMs）能够提升研究、教学和评估的效率，但其使用也引发了诸多问题。解决方案的关键在于通过重新设计评估体系以增强对AI的抗性、加强师生培训、实施多层监管机制以及明确可接受的使用规范，从而在发挥AI潜力的同时维护学术诚信与公平性。

链接: https://arxiv.org/abs/2506.22231
作者: Russell Beale
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The rapid proliferation of generative artificial intelligence (AI) tools - especially large language models (LLMs) such as ChatGPT - has ushered in a transformative era in higher education. Universities in developed regions are increasingly integrating these technologies into research, teaching, and assessment. On one hand, LLMs can enhance productivity by streamlining literature reviews, facilitating idea generation, assisting with coding and data analysis, and even supporting grant proposal drafting. On the other hand, their use raises significant concerns regarding academic integrity, ethical boundaries, and equitable access. Recent empirical studies indicate that nearly 47% of students use LLMs in their coursework - with 39% using them for exam questions and 7% for entire assignments - while detection tools currently achieve around 88% accuracy, leaving a 12% error margin. This article critically examines the opportunities offered by generative AI, explores the multifaceted challenges it poses, and outlines robust policy solutions. Emphasis is placed on redesigning assessments to be AI-resilient, enhancing staff and student training, implementing multi-layered enforcement mechanisms, and defining acceptable use. By synthesizing data from recent research and case studies, the article argues that proactive policy adaptation is imperative to harness AI’s potential while safeguarding the core values of academic integrity and equity.
zh

[AI-14] EFRame: Deeper Reasoning via Exploration-Filtering-Replay Reinforcement Learning Framework

【速读】：该论文旨在解决Group Relative Policy Optimization (GRPO)在强化学习（Reinforcement Learning, RL）中面临的问题，包括探索能力有限、样本效率低和训练不稳定，这些问题限制了其在复杂推理任务中的性能。论文提出的解决方案是EFRame框架，其关键在于从三个关键维度增强GRPO：通过额外的rollouts探索高质量轨迹、在线过滤去除引入噪声和方差的低质量样本，以及利用经验回放重复利用稀有但信息量大的样本，从而建立一个完整且稳定的训练循环。

链接: https://arxiv.org/abs/2506.22200
作者: Chen Wang,Lai Wei,Yanzhi Zhang,Chenyang Shao,Zedong Dan,Weiran Huang,Yue Wang,Yuzhi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL’s computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filtering-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame enables a more fine-grained categorization of training samples, allowing for a deeper analysis of how different types of samples contribute to the learning process in RL. Our code is available at this https URL.
zh

[AI-15] Autonomic Microservice Management via Agent ic AI and MAPE-K Integration

【速读】：该论文试图解决微服务架构在云计算中因去中心化特性所带来的安全与管理挑战，这些挑战可能威胁系统的稳定性。解决方案的关键在于提出一种基于MAPE-K框架的自主异常检测与修复方法，该方法利用了生成式AI（Generative AI）技术，以实现对高度分布式系统管理的自动化处理，从而提供实际且可投入工业应用的解决方案。

链接: https://arxiv.org/abs/2506.22185
作者: Matteo Esposito,Alexander Bakhtin,Noman Ahmad,Mikel Robredo,Ruoyu Su,Valentina Lenarduzzi,Davide Taibi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
zh

[AI-16] A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

【速读】：该论文试图解决开放权重和开源基础模型在安全性方面的挑战，旨在通过增强透明度、互操作性和公共治理来提升AI系统的安全性。其解决方案的关键在于构建一个以参与性、技术干预和内容安全过滤为核心的综合框架，强调通过独立审查、去中心化缓解和文化多元监督来实现安全与责任的平衡。同时，论文提出了五个优先研究方向，包括参与式输入、未来适应的内容过滤器、全生态系统安全基础设施、严格的代理系统保护以及扩展的危害分类体系。

链接: https://arxiv.org/abs/2506.22183
作者: Camille François,Ludovic Péran,Ayah Bdeir,Nouha Dziri,Will Hawkins,Yacine Jernite,Sayash Kapoor,Juliet Shen,Heidy Khlaaf,Kevin Klyman,Nik Marda,Marie Pellat,Deb Raji,Divya Siddarth,Aviya Skowron,Joseph Spisak,Madhulika Srikumar,Victor Storchan,Audrey Tang,Jen Weedon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

点击查看摘要

Abstract:The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced (i) a research agenda at the intersection of safety and open source AI; (ii) a mapping of existing and needed technical interventions and open source tools to safely and responsibly deploy open foundation models across the AI development workflow; and (iii) a mapping of the content safety filter ecosystem with a proposed roadmap for future research and development. We find that openness – understood as transparent weights, interoperable tooling, and public governance – can enhance safety by enabling independent scrutiny, decentralized mitigation, and culturally plural oversight. However, significant gaps persist: scarce multimodal and multilingual benchmarks, limited defenses against prompt-injection and compositional attacks in agentic systems, and insufficient participatory mechanisms for communities most affected by AI harms. The paper concludes with a roadmap of five priority research directions, emphasizing participatory inputs, future-proof content filters, ecosystem-wide safety infrastructure, rigorous agentic safeguards, and expanded harm taxonomies. These recommendations informed the February 2025 French AI Action Summit and lay groundwork for an open, plural, and accountable AI safety discipline.
zh

[AI-17] Learning to Solve Multi-Objective Routing Problems on Multigraphs

【速读】：该论文旨在解决多目标路由问题在多图（multigraph）设置下的挑战，即在不同目的地之间存在具有不同属性的多条路径的情况下，如何有效进行路由决策。其解决方案的关键在于提出两种神经网络方法：第一种方法直接在多图上通过自回归方式选择边直至完成路径；第二种方法则先将多图简化为简单图，再构建路径。这两种方法均在实验中表现出色，尤其在TSP和CVRP等复杂问题上展现了强大的性能。

链接: https://arxiv.org/abs/2506.22095
作者: Filip Rydin,Attila Lischka,Jiaming Wu,Morteza Haghir Chehreghani,Balázs Kulcsár
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 Figures

点击查看摘要

Abstract:Learning-based methods for routing have gained significant attention in recent years, both in single-objective and multi-objective contexts. However, the multigraph setting, where multiple paths with distinct attributes can exist between destinations, has largely been overlooked, despite its high practical relevancy. In this paper, we introduce two neural approaches to address multi-objective routing on multigraphs. Our first approach works directly on the multigraph, by autoregressively selecting edges until a tour is completed. On the other hand, our second model first prunes the multigraph into a simple graph and then builds routes. We validate both models experimentally and find that they demonstrate strong performance across a variety of problems, including the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP).
zh

[AI-18] ransformers are Graph Neural Networks

【速读】：该论文试图解决如何将Transformer架构与图神经网络（Graph Neural Networks, GNNs）进行理论连接的问题，以揭示Transformer在处理序列数据时的内在机制。其解决方案的关键在于将Transformer视为一种在完全连接的token图上运行的消息传递GNN，其中自注意力机制用于捕捉token之间的相对重要性，而位置编码则提供序列顺序或结构的信息。这种视角表明，Transformer是一种能够学习输入元素间关系的表达性集合处理网络，且不依赖于预先定义的图结构。尽管存在与GNN的数学联系，但Transformer通过密集矩阵运算实现，相较于稀疏消息传递在现代硬件上具有更高的效率，从而使得Transformer在当前硬件条件下更具优势。

链接: https://arxiv.org/abs/2506.22084
作者: Chaitanya K. Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper is a technical version of an article in The Gradient at this https URL

点击查看摘要

Abstract:We establish connections between the Transformer architecture, originally introduced for natural language processing, and Graph Neural Networks (GNNs) for representation learning on graphs. We show how Transformers can be viewed as message passing GNNs operating on fully connected graphs of tokens, where the self-attention mechanism capture the relative importance of all tokens w.r.t. each-other, and positional encodings provide hints about sequential ordering or structure. Thus, Transformers are expressive set processing networks that learn relationships among input elements without being constrained by apriori graphs. Despite this mathematical connection to GNNs, Transformers are implemented via dense matrix operations that are significantly more efficient on modern hardware than sparse message passing. This leads to the perspective that Transformers are GNNs currently winning the hardware lottery.
zh

[AI-19] Query as Test: An Intelligent Driving Test and Data Storag e Method for Integrated Cockpit-Vehicle-Road Scenarios

【速读】：该论文试图解决智能驾驶系统中数据生态系统碎片化和不兼容的问题，以及现有测试方法在覆盖边缘案例和灵活性方面的不足。其解决方案的关键在于引入“Query as Test”（QaT）概念，将测试重点从刚性预设用例转向针对统一数据表示的灵活逻辑查询，并提出“可扩展场景标注”（ESN）作为基于答案集编程（ASP）的声明式数据框架，以统一表示来自驾驶舱、车辆和道路的异构多模态数据，从而实现深度语义融合与高效测试。

链接: https://arxiv.org/abs/2506.22068
作者: Shengyue Yao,Runqing Guo,Yangyang Qin,Miangbing Meng,Jipeng Cao,Yilun Lin,Yisheng Lv,Fei-Yue Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Transaction on Vehicular Technology

点击查看摘要

Abstract:With the deep penetration of Artificial Intelligence (AI) in the transportation sector, intelligent cockpits, autonomous driving, and intelligent road networks are developing at an unprecedented pace. However, the data ecosystems of these three key areas are increasingly fragmented and incompatible. Especially, existing testing methods rely on data stacking, fail to cover all edge cases, and lack flexibility. To address this issue, this paper introduces the concept of “Query as Test” (QaT). This concept shifts the focus from rigid, prescripted test cases to flexible, on-demand logical queries against a unified data representation. Specifically, we identify the need for a fundamental improvement in data storage and representation, leading to our proposal of “Extensible Scenarios Notations” (ESN). ESN is a novel declarative data framework based on Answer Set Programming (ASP), which uniformly represents heterogeneous multimodal data from the cockpit, vehicle, and road as a collection of logical facts and rules. This approach not only achieves deep semantic fusion of data, but also brings three core advantages: (1) supports complex and flexible semantic querying through logical reasoning; (2) provides natural interpretability for decision-making processes; (3) allows for on-demand data abstraction through logical rules, enabling fine-grained privacy protection. We further elaborate on the QaT paradigm, transforming the functional validation and safety compliance checks of autonomous driving systems into logical queries against the ESN database, significantly enhancing the expressiveness and formal rigor of the testing. Finally, we introduce the concept of “Validation-Driven Development” (VDD), which suggests to guide developments by logical validation rather than quantitative testing in the era of Large Language Models, in order to accelerating the iteration and development process.
zh

[AI-20] Universal Retrieval for Multimodal Trajectory Modeling ICML2025

【速读】：该论文旨在解决轨迹数据（trajectory data）在多模态环境下表示建模的挑战，特别是在GUI环境中提升AI代理能力的问题。其关键解决方案是提出多模态轨迹检索（Multimodal Trajectory Retrieval），通过构建统一的代理轨迹数据集（Unified Agent Trajectory Dataset, UATD）和GAE-Bench基准，以及设计基于视觉-语言模型并结合优化对比学习的GAE-Retriever框架，实现高效的轨迹检索性能。

链接: https://arxiv.org/abs/2506.22056
作者: Xuan Zhang,Ziyan Jiang,Rui Meng,Yifei Leng,Zhenbang Xiao,Zora Zhiruo Wang,Yanyi Shang,Dehan Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 3 figures, accepted by Workshop on Computer-use Agents @ ICML 2025

点击查看摘要

Abstract:Trajectory data, capturing human actions and environmental states across various modalities, holds significant potential for enhancing AI agent capabilities, particularly in GUI environments. However, how to model the representation of trajectory-level data presents a significant challenge that has not been systematically addressed amid explosive trajectory data growth. In this work, we introduce Multimodal Trajectory Retrieval, bridging the gap between universal retrieval and agent-centric trajectory modeling. We construct the Unified Agent Trajectory Dataset (UATD) from annotated demonstrations and states across diverse real-world scenarios. Based on this, we present GAE-Bench, a benchmark containing a large number of trajectory-based retrieval pairs. In addition, we propose GAE-Retriever, a multimodal retrieval framework that adopts vision-language models and incorporates optimized contrastive learning through a token selection and the GradCache mechanism. Comprehensive evaluations across multiple datasets show that GAE-Retriever consistently outperforms strong baselines in retrieval recall, highlighting its effectiveness in advancing multimodal trajectory retrieval.
zh

[AI-21] UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting

【速读】：该论文试图解决时间序列基础模型（Time Series Foundation Models, TSFMs）在处理包含多样化和异构协变量（如分类变量和多模态数据）的通用预测任务时能力受限的问题。其关键解决方案是提出统一协变量适配框架（Unified Covariate Adaptation, UniCA），该框架通过协变量同质化将异构协变量转换为高层次的同质序列表示，并利用统一的基于注意力的融合机制进行融合，从而实现对多种协变量的有效适应，同时保持模型的泛化能力。

链接: https://arxiv.org/abs/2506.22039
作者: Lu Han,Yu Liu,Qiwen Deng,Jian Jiang,Yinbo Sun,Zhe Yu,Binfeng Wang,Xingyu Lu,Lintao Ma,Han-Jia Ye,De-Chuan Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates–such as categorical variables and multimodal data (e.g., images, text)–which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is compatible and universal for adaptation with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of this http URL experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Codes are released on this https URL.
zh

[AI-22] Literature-Grounded Novelty Assessment of Scientific Ideas

【速读】：该论文试图解决自动化科学创意生成系统中创意新颖性自动评估这一关键且研究不足的问题（idea novelty evaluation）。现有方法依赖人工文献综述，存在劳动强度大、主观性强及难以规模化等缺陷。解决方案的关键在于提出一种基于大语言模型（LLM）的检索增强生成（RAG）框架——Idea Novelty Checker，其采用两阶段检索-重排序策略：首先通过关键词和片段检索获取相关文献，随后利用嵌入过滤和基于维度的LLM重排序进行精炼，并结合专家标注示例以指导新颖性评估与文献支撑的推理生成。

链接: https://arxiv.org/abs/2506.22026
作者: Simra Shahid,Marissa Radensky,Raymond Fok,Pao Siangliulue,Daniel S. Weld,Tom Hope
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated scientific idea generation systems have made remarkable progress, yet the automatic evaluation of idea novelty remains a critical and underexplored challenge. Manual evaluation of novelty through literature review is labor-intensive, prone to error due to subjectivity, and impractical at scale. To address these issues, we propose the Idea Novelty Checker, an LLM-based retrieval-augmented generation (RAG) framework that leverages a two-stage retrieve-then-rerank approach. The Idea Novelty Checker first collects a broad set of relevant papers using keyword and snippet-based retrieval, then refines this collection through embedding-based filtering followed by facet-based LLM re-ranking. It incorporates expert-labeled examples to guide the system in comparing papers for novelty evaluation and in generating literature-grounded reasoning. Our extensive experiments demonstrate that our novelty checker achieves approximately 13% higher agreement than existing approaches. Ablation studies further showcases the importance of the facet-based re-ranker in identifying the most relevant literature for novelty evaluation.
zh

[AI-23] ROFI: Trajectory-Ranked Offline Inverse Reinforcement Learning

【速读】：该论文旨在解决离线强化学习中代理在缺乏预定义奖励函数的情况下难以有效学习策略的问题。传统方法依赖于由源策略生成的固定转移数据集，并且需要奖励函数进行标注，而在实际应用如视频游戏开发中，奖励函数可能不可用。论文提出的解决方案TROFI（Trajectory-Ranked OFfline Inverse reinforcement learning）的关键在于通过人类偏好学习奖励函数，进而对原始数据集进行标注，从而使得策略训练成为可能。与其它方法不同，TROFI不需要最优轨迹，实验结果表明其性能优于基线方法，并且接近使用真实奖励函数的效果。

链接: https://arxiv.org/abs/2506.22008
作者: Alessandro Sestini,Joakim Bergdahl,Konrad Tollmar,Andrew D. Bagdanov,Linus Gisslén
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at Reinforcement Learning and Video Games Workshop at RLC 2025

点击查看摘要

Abstract:In offline reinforcement learning, agents are trained using only a fixed set of stored transitions derived from a source policy. However, this requires that the dataset be labeled by a reward function. In applied settings such as video game development, the availability of the reward function is not always guaranteed. This paper proposes Trajectory-Ranked OFfline Inverse reinforcement learning (TROFI), a novel approach to effectively learn a policy offline without a pre-defined reward function. TROFI first learns a reward function from human preferences, which it then uses to label the original dataset making it usable for training the policy. In contrast to other approaches, our method does not require optimal trajectories. Through experiments on the D4RL benchmark we demonstrate that TROFI consistently outperforms baselines and performs comparably to using the ground truth reward to learn policies. Additionally, we validate the efficacy of our method in a 3D game environment. Our studies of the reward model highlight the importance of the reward function in this setting: we show that to ensure the alignment of a value function to the actual future discounted reward, it is fundamental to have a well-engineered and easy-to-learn reward function.
zh

[AI-24] LeanConjecturer: Automatic Generation of Mathematical Conjectures for Theorem Proving

【速读】：该论文试图解决形式化定理证明中数据稀缺的问题，通过自动生成大学水平的数学猜想来扩充训练数据。其解决方案的关键在于提出了一种混合方法，结合基于规则的上下文提取与基于大型语言模型（Large Language Models, LLM）的定理陈述生成，从而有效生成大量语法正确且非平凡的猜想，这些猜想无法通过现有的\textttaesop策略证明，为强化学习提供了有价值的训练目标。

链接: https://arxiv.org/abs/2506.22005
作者: Naoto Onda,Kazumi Kasaura,Yuta Oriike,Masaya Taniguchi,Akiyoshi Sannai,Sho Sonoda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 5 tables

点击查看摘要

Abstract:We introduce LeanConjecturer, a pipeline for automatically generating university-level mathematical conjectures in Lean 4 using Large Language Models (LLMs). Our hybrid approach combines rule-based context extraction with LLM-based theorem statement generation, addressing the data scarcity challenge in formal theorem proving. Through iterative generation and evaluation, LeanConjecturer produced 12,289 conjectures from 40 Mathlib seed files, with 3,776 identified as syntactically valid and non-trivial, that is, cannot be proven by \textttaesop tactic. We demonstrate the utility of these generated conjectures for reinforcement learning through Group Relative Policy Optimization (GRPO), showing that targeted training on domain-specific conjectures can enhance theorem proving capabilities. Our approach generates 103.25 novel conjectures per seed file on average, providing a scalable solution for creating training data for theorem proving systems. Our system successfully verified several non-trivial theorems in topology, including properties of semi-open, alpha-open, and pre-open sets, demonstrating its potential for mathematical discovery beyond simple variations of existing results.
zh

[AI-25] Binned semiparametric Bayesian networks

【速读】：该论文试图解决非参数分布中核密度估计（Kernel Density Estimation, KDE）计算成本过高的问题，以及传统分箱模型在高维数据中面临的维度灾难（Curse of Dimensionality）问题。其解决方案的关键在于引入一种新的概率半参数模型，通过数据分箱（Data Binning）降低计算复杂度，并结合稀疏张量和条件概率计算中的父节点数量限制，提出两种新的条件概率分布：稀疏分箱核密度估计和傅里叶核密度估计。这些方法有效缓解了高维数据对分箱模型的负面影响，同时保持了模型的统计性能。

链接: https://arxiv.org/abs/2506.21997
作者: Rafael Sojo,Javier Díaz-Rozo,Concha Bielza,Pedro Larrañaga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability distributions are developed for the new binned semiparametric Bayesian networks, the sparse binned kernel density estimation and the Fourier kernel density estimation. These two probability distributions address the curse of dimensionality, which typically impacts binned models, by using sparse tensors and restricting the number of parent nodes in conditional probability calculations. To evaluate the proposal, we perform a complexity analysis and conduct several comparative experiments using synthetic data and datasets from the UCI Machine Learning repository. The experiments include different binning rules, parent restrictions, grid sizes, and number of instances to get a holistic view of the model’s behavior. As a result, our binned semiparametric Bayesian networks achieve structural learning and log-likelihood estimations with no statistically significant differences compared to the semiparametric Bayesian networks, but at a much higher speed. Thus, the new binned semiparametric Bayesian networks prove to be a reliable and more efficient alternative to their non-binned counterparts.
zh

[AI-26] AlphaBeta is not as good as you think: a new probabilistic model to better analyze deterministic game-solving algorithms

【速读】：该论文试图解决传统确定性博弈求解算法在分析时所依赖的简化模型所带来的局限性，即其假设叶节点值独立采样，导致游戏结构复杂性被剥离，从而生成的实例过于简单，无法反映真实世界的挑战。解决方案的关键在于引入一种新的概率模型，该模型通过固定层级条件分布逐步构建博弈树，并强制引入祖先依赖性，这一现实博弈中的关键结构性特征，使得生成的问题具有可调节的难度，同时保持一定的分析可处理性。

链接: https://arxiv.org/abs/2506.21996
作者: Raphaël Boige(LORIA),Amine Boumaza(LORIA),Bruno Scherrer(LORIA)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deterministic game-solving algorithms are conventionally analyzed in the light of their average-case complexity against a distribution of random game-trees, where leaf values are independently sampled from a fixed distribution. This simplified model enables uncluttered mathematical analysis, revealing two key properties: root value distributions asymptotically collapse to a single fixed value for finite-valued trees, and all reasonable algorithms achieve global optimality. However, these findings are artifacts of the model’s design-its long criticized independence assumption strips games of structural complexity, producing trivial instances where no algorithm faces meaningful challenges. To address this limitation, we introduce a new probabilistic model that incrementally constructs game-trees using a fixed level-wise conditional distribution. By enforcing ancestor dependency, a critical structural feature of real-world games, our framework generates problems with adjustable difficulty while retaining some form of analytical tractability. For several algorithms, including AlphaBeta and Scout, we derive recursive formulas characterizing their average-case complexities under this model. These allow us to rigorously compare algorithms on deep game-trees, where Monte-Carlo simulations are no longer feasible. While asymptotically, all algorithms seem to converge to identical branching factor (a result analogous to those of independence-based models), deep finite trees reveal stark differences: AlphaBeta incurs a significantly larger constant multiplicative factor compared to algorithms like Scout, leading to a substantial practical slowdown. Our framework sheds new light on classical game-solving algorithms, offering rigorous evidence and analytical tools to advance the understanding of these methods under a more realistic, challenging, and yet tractable model.
zh

[AI-27] Interactive Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds

【速读】：该论文旨在解决高风险决策中多目标优化问题，特别是在资源消耗大的评估环境下，如何高效选择符合隐含偏好的帕累托最优解。其关键解决方案是提出一种交互式的局部-全局框架Active-MoSH，其中局部组件通过结合软硬约束与概率偏好学习，动态维护决策者（DM）偏好和约束的分布，以适应性地细化帕累托子集；全局组件T-MoSH则利用多目标敏感性分析识别可能被忽略的高价值点，从而增强DM的信任感。

链接: https://arxiv.org/abs/2506.21887
作者: Edward Chen,Sang T. Truong,Natalie Dullerud,Sanmi Koyejo,Carlos Guestrin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-stakes decision-making involves navigating multiple competing objectives with expensive evaluations. For instance, in brachytherapy, clinicians must balance maximizing tumor coverage (e.g., an aspirational target or soft bound of 95% coverage) against strict organ dose limits (e.g., a non-negotiable hard bound of 601 cGy to the bladder), with each plan evaluation being resource-intensive. Selecting Pareto-optimal solutions that match implicit preferences is challenging, as exhaustive Pareto frontier exploration is computationally and cognitively prohibitive, necessitating interactive frameworks to guide users. While decision-makers (DMs) often possess domain knowledge to narrow the search via such soft-hard bounds, current methods often lack systematic approaches to iteratively refine these multi-faceted preference structures. Critically, DMs must trust their final decision, confident they haven’t missed superior alternatives; this trust is paramount in high-consequence scenarios. We present Active-MoSH, an interactive local-global framework designed for this process. Its local component integrates soft-hard bounds with probabilistic preference learning, maintaining distributions over DM preferences and bounds for adaptive Pareto subset refinement. This is guided by an active sampling strategy optimizing exploration-exploitation while minimizing cognitive burden. To build DM trust, Active-MoSH’s global component, T-MoSH, leverages multi-objective sensitivity analysis to identify potentially overlooked, high-value points beyond immediate feedback. We demonstrate Active-MoSH’s performance benefits through diverse synthetic and real-world applications. A user study on AI-generated image selection further validates our hypotheses regarding the framework’s ability to improve convergence, enhance DM trust, and provide expressive preference articulation, enabling more effective DMs.
zh

[AI-28] On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling

【速读】：该论文试图解决的是生成式 AI (Generative AI) 训练过程中因视觉-语言模型 (VLMs) 遭受对抗性误标攻击而导致的训练数据污染问题。解决方案的关键在于利用对抗性扰动对 VLMs 进行攻击，使其生成错误的图像描述，从而在文本到图像模型的训练流水线中注入“脏标签”样本，进而影响模型的行为。实验表明，这种攻击方法能够在少量污染样本的情况下有效改变模型性能，并且在黑盒场景下仍具有较高的成功率。

链接: https://arxiv.org/abs/2506.21874
作者: Stanley Wu,Ronik Bhaskar,Anna Yoo Jeong Ha,Shawn Shan,Haitao Zheng,Ben Y. Zhao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: ACM Conference on Computer and Communications Security 2025

点击查看摘要

Abstract:Today’s text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). This part of the training pipeline is critical for supplying the models with large volumes of high-quality image-caption pairs during training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, where adversarial perturbations are added to images to mislead the VLMs into producing incorrect captions. In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism to poisoning training pipelines for text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLM models. This has the effect of injecting strong “dirty-label” poison samples into the training pipeline for text-to-image models, successfully altering their behavior with a small number of poisoned samples. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers. This suggests a cat-and-mouse game that is likely to reduce the quality of training data and increase the cost of text-to-image model development. Finally, we demonstrate the real-world effectiveness of these attacks, achieving high attack success (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex AI and Microsoft Azure). Comments: ACM Conference on Computer and Communications Security 2025 Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.21874 [cs.CR] (or arXiv:2506.21874v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2506.21874 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3719027.3744845 Focus to learn more DOI(s) linking to related resources
zh

[AI-29] A Survey of Continual Reinforcement Learning

【速读】：该论文旨在解决强化学习（Reinforcement Learning, RL）在动态和现实世界环境中应用时面临的泛化能力不足及依赖大量训练数据与计算资源的问题。其解决方案的关键在于引入持续学习（Continual Learning, CL），通过构建持续强化学习（Continual Reinforcement Learning, CRL）框架，使智能体能够持续学习、适应新任务并保留已有知识，从而提升RL在复杂环境中的适用性。

链接: https://arxiv.org/abs/2506.21872
作者: Chaofan Pan,Xin Yang,Yanhua Li,Wei Wei,Tianrui Li,Bo An,Jiye Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE TPAMI

点击查看摘要

Abstract:Reinforcement Learning (RL) is an important machine learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in this field due to the rapid development of deep neural networks. However, the success of RL currently relies on extensive training data and computational resources. In addition, RL’s limited ability to generalize across tasks restricts its applicability in dynamic and real-world environments. With the arisen of Continual Learning (CL), Continual Reinforcement Learning (CRL) has emerged as a promising research direction to address these limitations by enabling agents to learn continuously, adapt to new tasks, and retain previously acquired knowledge. In this survey, we provide a comprehensive examination of CRL, focusing on its core concepts, challenges, and methodologies. Firstly, we conduct a detailed review of existing works, organizing and analyzing their metrics, tasks, benchmarks, and scenario settings. Secondly, we propose a new taxonomy of CRL methods, categorizing them into four types from the perspective of knowledge storage and/or transfer. Finally, our analysis highlights the unique challenges of CRL and provides practical insights into future directions.
zh

[AI-30] SciMantify – A Hybrid Approach for the Evolving Semantification of Scientific Knowledge

【速读】：该论文试图解决科学出版物中知识表示不灵活、结构化和语义化不足的问题，这些问题限制了知识的可访问性和可重用性。现有的科学知识多以静态的PDF格式或表格形式存在，缺乏语义上下文，难以被机器理解和处理。解决方案的关键在于提出一种基于五阶段的语义知识表示演化模型，该模型受五星级开放数据（5-star Linked Open Data）启发，旨在引导从数字文献（如PDF）逐步过渡到集成于知识图谱（KG）中的语义表示。通过结合人类与机器的协作，利用科学知识的表格形式进行语义标注和优化，该研究开发了名为SciMantify的混合方法，以提升科学知识的语义表示质量，并在Open Research Knowledge Graph（ORKG）平台上实现。

链接: https://arxiv.org/abs/2506.21819
作者: Lena John,Kheir Eddine Farfar,Sören Auer,Oliver Karras
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at the 25th International Conference on Web Engineering 2025

点击查看摘要

Abstract:Scientific publications, primarily digitized as PDFs, remain static and unstructured, limiting the accessibility and reusability of the contained knowledge. At best, scientific knowledge from publications is provided in tabular formats, which lack semantic context. A more flexible, structured, and semantic representation is needed to make scientific knowledge understandable and processable by both humans and machines. We propose an evolution model of knowledge representation, inspired by the 5-star Linked Open Data (LOD) model, with five stages and defined criteria to guide the stepwise transition from a digital artifact, such as a PDF, to a semantic representation integrated in a knowledge graph (KG). Based on an exemplary workflow implementing the entire model, we developed a hybrid approach, called SciMantify, leveraging tabular formats of scientific knowledge, e.g., results from secondary studies, to support its evolving semantification. In the approach, humans and machines collaborate closely by performing semantic annotation tasks (SATs) and refining the results to progressively improve the semantic representation of scientific knowledge. We implemented the approach in the Open Research Knowledge Graph (ORKG), an established platform for improving the findability, accessibility, interoperability, and reusability of scientific knowledge. A preliminary user experiment showed that the approach simplifies the preprocessing of scientific knowledge, reduces the effort for the evolving semantification, and enhances the knowledge representation through better alignment with the KG structures.
zh

[AI-31] Multi-task parallelism for robust pre-training of graph foundation models on multi-source multi-fidelity atomistic modeling data

【速读】：该论文试图解决在预训练过程中处理多源、多保真度数据的挑战，以及模型在更大、更多样化数据集上的泛化能力和在超级计算机上的可扩展性问题。解决方案的关键在于采用多任务并行方法，将每个解码头分布到具有GPU加速的计算资源上，从而实现高效的模型训练与推理。

链接: https://arxiv.org/abs/2506.21788
作者: Massimiliano Lupo Pasini,Jong Youl Choi,Pei Zhang,Kshitij Mehta,Rylie Weaver,Ashwin M. Aji,Karl W. Schulz,Jorda Polo,Prasanna Balaprakash
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Atomic and Molecular Clusters (physics.atm-clus)
备注: 15 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Graph foundation models using graph neural networks promise sustainable, efficient atomistic modeling. To tackle challenges of processing multi-source, multi-fidelity data during pre-training, recent studies employ multi-task learning, in which shared message passing layers initially process input atomistic structures regardless of source, then route them to multiple decoding heads that predict data-specific outputs. This approach stabilizes pre-training and enhances a model’s transferability to unexplored chemical regions. Preliminary results on approximately four million structures are encouraging, yet questions remain about generalizability to larger, more diverse datasets and scalability on supercomputers. We propose a multi-task parallelism method that distributes each head across computing resources with GPU acceleration. Implemented in the open-source HydraGNN architecture, our method was trained on over 24 million structures from five datasets and tested on the Perlmutter, Aurora, and Frontier supercomputers, demonstrating efficient scaling on all three highly heterogeneous super-computing architectures.
zh

[AI-32] MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models

【速读】：该论文旨在解决传统活动基础模型在数据收集与校准上的高成本、机器学习方法在动态环境中的适应性不足以及基于大型语言模型（Large Language Models, LLMs）的模拟在大规模计算中的局限性等问题。其解决方案的关键在于提出MobiVerse，一个融合轻量级领域特定生成器以高效生成基础活动链，并结合LLMs的上下文感知修改能力，实现动态调整的混合框架。

链接: https://arxiv.org/abs/2506.21784
作者: Yifan Liu,Xishun Liao,Haoxuan Ma,Jonathan Liu,Rohan Jadhav,Jiaqi Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and modeling human mobility patterns is crucial for effective transportation planning and urban development. Despite significant advances in mobility research, there remains a critical gap in simulation platforms that allow for algorithm development, policy implementation, and comprehensive evaluation at scale. Traditional activity-based models require extensive data collection and manual calibration, machine learning approaches struggle with adaptation to dynamic conditions, and treding agent-based Large Language Models (LLMs) implementations face computational constraints with large-scale simulations. To address these challenges, we propose MobiVerse, a hybrid framework leverages the efficiency of lightweight domain-specific generator for generating base activity chains with the adaptability of LLMs for context-aware modifications. A case study was conducted in Westwood, Los Angeles, where we efficiently generated and dynamically adjusted schedules for the whole population of approximately 53,000 agents on a standard PC. Our experiments demonstrate that MobiVerse successfully enables agents to respond to environmental feedback, including road closures, large gathering events like football games, and congestion, through our hybrid framework. Its modular design facilitates testing various mobility algorithms at both transportation system and agent levels. Results show our approach maintains computational efficiency while enhancing behavioral realism. MobiVerse bridges the gap in mobility simulation by providing a customizable platform for mobility systems planning and operations with benchmark algorithms. Code and videos are available at this https URL.
zh

[AI-33] HE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning ?

【速读】：该论文试图解决大规模语言模型（Large Language Models, LLMs）生成的科学命题在新颖性和事实准确性方面难以进行严格评估的问题，尤其是传统验证方法存在不足，如LLMs作为独立验证者可能产生幻觉且缺乏领域知识，而传统引用网络缺乏明确的因果关系。解决方案的关键是提出一种名为THE-Tree（Technology History Evolution Tree）的计算框架，该框架通过“思考-表述-引用-验证”过程构建领域特定的演化树，并利用恢复的自然语言推理机制对每个提出的演化链进行逻辑一致性和证据支持的验证，从而确保每一步的可验证性和因果关联性。

链接: https://arxiv.org/abs/2506.21763
作者: Xin Wang,Jiyao Liu,Yulong Xiao,Junzhi Ning,Lihao Liu,Junjun He,Botian Shi,Kaicheng Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are accelerating scientific idea generation, but rigorously evaluating these numerous, often superficial, AI-generated propositions for novelty and factual accuracy is a critical bottleneck; manual verification is too this http URL validation methods are inadequate: LLMs as standalone verifiers may hallucinate and lack domain knowledge (our findings show ~60% unawareness of relevant papers in specific domains), while traditional citation networks lack explicit causality and narrative surveys are this http URL underscores a core challenge: the absence of structured, verifiable, and causally-linked historical data of scientific this http URL address this,we introduce \textbfTHE-Tree (\textbfTechnology \textbfHistory \textbfEvolution Tree), a computational framework that constructs such domain-specific evolution trees from scientific this http URL-Tree employs a search algorithm to explore evolutionary paths. During its node expansion, it utilizes a novel “Think-Verbalize-Cite-Verify” process: an LLM proposes potential advancements and cites supporting literature. Critically, each proposed evolutionary link is then validated for logical coherence and evidential support by a recovered natural language inference mechanism that interrogates the cited literature, ensuring that each step is this http URL construct and validate 88 THE-Trees across diverse domains and release a benchmark dataset including up to 71k fact verifications covering 27k papers to foster further this http URL demonstrate that i) in graph completion, our THE-Tree improves hit@1 by 8% to 14% across multiple models compared to traditional citation networks; ii) for predicting future scientific developments, it improves hit@1 metric by nearly 10%; and iii) when combined with other methods, it boosts the performance of evaluating important scientific papers by almost 100%.
zh

[AI-34] Hierarchical Reasoning Model

【速读】：该论文试图解决人工智能中推理能力不足的问题，特别是当前大型语言模型（Large Language Models, LLMs）在复杂目标导向动作序列的构建与执行上存在任务分解脆弱、数据需求量大和延迟高等缺陷。其解决方案的关键在于提出一种受人类大脑分层与多时间尺度处理机制启发的层级推理模型（Hierarchical Reasoning Model, HRM），该模型通过两个相互依赖的循环模块实现高效推理：一个高层模块负责缓慢的抽象规划，一个低层模块处理快速的细节计算。HRM在仅需2700万参数和1000个训练样本的情况下，即可完成复杂的推理任务，且无需预训练或Chain-of-Thought (CoT) 数据，表现出卓越的性能。

链接: https://arxiv.org/abs/2506.21734
作者: Guan Wang,Jin Li,Yuhao Sun,Xing Chen,Changling Liu,Yue Wu,Meng Lu,Sen Song,Yasin Abbasi Yadkori
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose the Hierarchical Reasoning Model (HRM), a novel recurrent architecture that attains significant computational depth while maintaining both training stability and efficiency. HRM executes sequential reasoning tasks in a single forward pass without explicit supervision of the intermediate process, through two interdependent recurrent modules: a high-level module responsible for slow, abstract planning, and a low-level module handling rapid, detailed computations. With only 27 million parameters, HRM achieves exceptional performance on complex reasoning tasks using only 1000 training samples. The model operates without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks including complex Sudoku puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms much larger models with significantly longer context windows on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring artificial general intelligence capabilities. These results underscore HRM’s potential as a transformative advancement toward universal computation and general-purpose reasoning systems.
zh

[AI-35] Simultaneously Fair Allocation of Indivisible Items Across Multiple Dimensions

【速读】：该论文试图解决在多维环境下不可分物品的公平分配问题，旨在应对复杂环境中代理者根据多种标准评估物品组合时的公平性挑战。传统的一维公平性概念无法有效捕捉多属性下的公平性，因此论文提出两种放松的“无偏见”变体：弱同时无偏见至c个物品（weak sEFc）和强同时无偏见至c个物品（strong sEFc），以适应代理者偏好的多维性。其解决方案的关键在于通过调整参数c，确定能够保证weak或strong sEFc分配存在的上界和下界，并设计相应的算法来验证这些分配是否存在。此外，论文还证明了检查weak sEF1和strong sEF1分配存在性的计算复杂性为NP难。

链接: https://arxiv.org/abs/2506.21727
作者: Yasushi Kawase,Bodhayan Roy,Mohammad Azharuddin Sanpui
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper explores the fair allocation of indivisible items in a multidimensional setting, motivated by the need to address fairness in complex environments where agents assess bundles according to multiple criteria. Such multidimensional settings are not merely of theoretical interest but are central to many real-world applications. For example, cloud computing resources are evaluated based on multiple criteria such as CPU cores, memory, and network bandwidth. In such cases, traditional one dimensional fairness notions fail to capture fairness across multiple attributes. To address these challenges, we study two relaxed variants of envy-freeness: weak simultaneously envy-free up to c goods (weak sEFc) and strong simultaneously envy-free up to c goods (strong sEFc), which accommodate the multidimensionality of agents’ preferences. Under the weak notion, for every pair of agents and for each dimension, any perceived envy can be eliminated by removing, if necessary, a different set of goods from the envied agent’s allocation. In contrast, the strong version requires selecting a single set of goods whose removal from the envied bundle simultaneously eliminates envy in every dimension. We provide upper and lower bounds on the relaxation parameter c that guarantee the existence of weak or strong sEFc allocations, where these bounds are independent of the total number of items. In addition, we present algorithms for checking whether a weak or strong sEFc allocation exists. Moreover, we establish NP-hardness results for checking the existence of weak sEF1 and strong sEF1 allocations.
zh

[AI-36] Performance Prediction for Large Systems via Text-to-Text Regression

【速读】：该论文试图解决在复杂系统数据（如配置文件或系统日志）中进行度量结果预测的问题，传统表格回归方法在此类数据上表现不佳，主要因为特征工程往往不可行。解决方案的关键在于采用文本到文本的回归方法，利用一个60M参数的编码器-解码器模型，在Borg系统上实现了接近完美的等级相关性（0.99，平均0.9），并且相比传统表格方法，均方误差降低了100倍。该方法还能通过少量样本快速适应新任务，并捕捉复杂结果分布的密度。

链接: https://arxiv.org/abs/2506.21718
作者: Yash Akhauri,Bryan Lewandowski,Cheng-Hsi Lin,Adrian N. Reyes,Grant C. Forbes,Arissa Wongpanich,Bangding Yang,Mohamed S. Abdelfattah,Sagi Perel,Xingyou Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE); Systems and Control (eess.SY)
备注: Code can be found at this https URL

点击查看摘要

Abstract:In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google’s massive compute cluster scheduling system, a 60M parameter encoder-decoder, trained from random initialization, achieves up to a near perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also easily adapts to new tasks in only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model’s inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.
zh

[AI-37] SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

【速读】：该论文旨在解决具身智能体在长周期、现实任务中实现自我演化的难题，特别是针对强化学习微调（RFT）在多模态交互场景下存在的两个核心障碍：一是多步骤推理任务中缺乏可获取的中间奖励，导致有效学习信号不足；二是依赖人工设计的奖励函数限制了对新任务和环境的泛化能力。解决方案的关键在于提出Self-Evolving Embodied Agents-R1（SEEA-R1），其核心技术包括基于树结构的群体相对策略优化（Tree-GRPO）以将稀疏延迟奖励转化为密集中间信号，以及多模态生成式奖励模型（MGRM）以实现跨任务和场景的奖励估计泛化，从而支持自主适应与奖励驱动的自我演化。

链接: https://arxiv.org/abs/2506.21669
作者: Wanxin Tian,Shijie Zhang,Kevin Zhang,Xiaowei Chi,Yulin Luo,Junyu Lu,Chunkai Fan,Qiang Zhou,Yiming Zhao,Ning Liu Siyu Lin,Zhiyuan Qin,Xiaozhu Ju,Shanghang Zhang,Jian Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. Specifically, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1, SEEA-R1, the first RFT framework designed for enabling the self-evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based group relative policy optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce Multi-modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA-R1, we evaluate on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 36.19% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% without environmental reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
zh

[AI-38] IRanker: Towards Ranking Foundation Model

【速读】：该论文旨在解决排名任务（ranking tasks）中需要为每个特定任务设计不同模型的问题，提出通过一个统一的排名基础模型（ranking foundation model, FM）来实现任务的统一。其关键解决方案是引入IRanker框架，该框架结合强化学习（reinforcement learning, RL）和迭代解码机制，将复杂的排名任务分解为逐步剔除最差候选者的迭代过程，从而显著减少输出组合空间并更有效地利用有限的上下文长度。

链接: https://arxiv.org/abs/2506.21638
作者: Tao Feng,Zhigang Hua,Zijie Lei,Yan Xie,Shuang Yang,Bo Long,Jiaxuan You
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ranking tasks are ubiquitous, encompassing applications such as recommendation systems, LLM routing, and item re-ranking. We propose to unify these tasks using a single ranking foundation model (FM), as it eliminates the need for designing different models for each specific ranking task. However, unlike general supervision tasks in LLMs, ranking tasks do not have clear labels for supervision, posing great challenges to developing a ranking FM. To overcome these challenges, we propose IRanker, a ranking FM framework with reinforcement learning (RL) and iterative decoding. Our insight is to decompose the complex ranking task into an iterative decoding process that eliminates the worst candidate from the candidate pool step by step, which significantly reduces the output combinatorial space and better utilizes the limited context length during RL training. We meticulously train and comprehensively evaluate an IRanker-3B model on nine datasets across three scenarios: recommendation, routing, and passage ranking. The results show that a single IRanker-3B achieves state-of-the-art results on several datasets compared to models of similar size, and even surpasses the performance of larger models on certain datasets. We further demonstrate the effectiveness of our RL design and the robustness of the iterative mechanism across different LLM sizes. Moreover, we conducted both in-domain and out-of-domain zero-shot generalization experiments, which showed that IRanker-3B achieved good generalization on in-domain ranking tasks compared to the base LLM by at least 5% improvement. Surprisingly, on out-of-domain generic LLM tasks, IRanker-3B outperformed the base model by at least 9% on GSM8K, IFEval, and MathQA. In addition, the thoughts generated by IRanker-3B during training could further enhance zero-shot LLM performance.
zh

[AI-39] Ark: An Open-source Python-based Framework for Robot Learning

【速读】：该论文试图解决当前机器人软件架构在学习曲线、工具碎片化和硬件集成复杂性方面存在的问题，这些问题阻碍了商业自主性的进展。解决方案的关键在于引入ARK，一个以Python为核心的开源机器人框架，它通过提供类似Gym的环境接口、轻量级客户端-服务器架构以及与主流模仿学习算法的兼容性，实现了从高保真仿真到实体机器人的无缝切换，并整合了控制、SLAM、运动规划等可重用模块，从而降低了进入门槛并加速了自主机器人研究与商业化部署。

链接: https://arxiv.org/abs/2506.21628
作者: Magnus Dierking,Christopher E. Mower,Sarthak Das,Huang Helong,Jiacheng Qiu,Cody Reading,Wei Chen,Huidong Liang,Huang Guowei,Jan Peters,Quan Xingyue,Jun Wang,Haitham Bou-Ammar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robotics has made remarkable hardware strides-from DARPA’s Urban and Robotics Challenges to the first humanoid-robot kickboxing tournament-yet commercial autonomy still lags behind progress in machine learning. A major bottleneck is software: current robot stacks demand steep learning curves, low-level C/C++ expertise, fragmented tooling, and intricate hardware integration, in stark contrast to the Python-centric, well-documented ecosystems that propelled modern AI. We introduce ARK, an open-source, Python-first robotics framework designed to close that gap. ARK presents a Gym-style environment interface that allows users to collect data, preprocess it, and train policies using state-of-the-art imitation-learning algorithms (e.g., ACT, Diffusion Policy) while seamlessly toggling between high-fidelity simulation and physical robots. A lightweight client-server architecture provides networked publisher-subscriber communication, and optional C/C++ bindings ensure real-time performance when needed. ARK ships with reusable modules for control, SLAM, motion planning, system identification, and visualization, along with native ROS interoperability. Comprehensive documentation and case studies-from manipulation to mobile navigation-demonstrate rapid prototyping, effortless hardware swapping, and end-to-end pipelines that rival the convenience of mainstream machine-learning workflows. By unifying robotics and AI practices under a common Python umbrella, ARK lowers entry barriers and accelerates research and commercial deployment of autonomous robots.
zh

[AI-40] FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models NEURIPS

【速读】：该论文旨在解决开发一种能够在复杂、动态和非结构化现实环境中执行多种任务的通用机器人操作系统的难题。现有方法通常仅实现机器人脑中的单一功能或部分功能，缺乏统一的认知架构整合。解决方案的关键在于提出FrankenBot框架，该框架基于视觉-语言模型（VLM）驱动，并借鉴人类大脑的分治策略与结构，将任务规划、策略生成、记忆管理和低层接口分别映射到大脑皮层、小脑、颞叶-海马复合体和脑干，通过高效协调机制实现功能完备性与系统效率的平衡。

链接: https://arxiv.org/abs/2506.21627
作者: Shiyi Wang,Wenbo Li,Yiteng Chen,Qingyao Wu,Huiping Zhuang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, under review of NeurIPS

点击查看摘要

Abstract:Developing a general robot manipulation system capable of performing a wide range of tasks in complex, dynamic, and unstructured real-world environments has long been a challenging task. It is widely recognized that achieving human-like efficiency and robustness manipulation requires the robotic brain to integrate a comprehensive set of functions, such as task planning, policy generation, anomaly monitoring and handling, and long-term memory, achieving high-efficiency operation across all functions. Vision-Language Models (VLMs), pretrained on massive multimodal data, have acquired rich world knowledge, exhibiting exceptional scene understanding and multimodal reasoning capabilities. However, existing methods typically focus on realizing only a single function or a subset of functions within the robotic brain, without integrating them into a unified cognitive architecture. Inspired by a divide-and-conquer strategy and the architecture of the human brain, we propose FrankenBot, a VLM-driven, brain-morphic robotic manipulation framework that achieves both comprehensive functionality and high operational efficiency. Our framework includes a suite of components, decoupling a part of key functions from frequent VLM calls, striking an optimal balance between functional completeness and system efficiency. Specifically, we map task planning, policy generation, memory management, and low-level interfacing to the cortex, cerebellum, temporal lobe-hippocampus complex, and brainstem, respectively, and design efficient coordination mechanisms for the modules. We conducted comprehensive experiments in both simulation and real-world robotic environments, demonstrating that our method offers significant advantages in anomaly detection and handling, long-term memory, operational efficiency, and stability – all without requiring any fine-tuning or retraining.
zh

[AI-41] Bayesian-Guided Diversity in Sequential Sampling for Recommender Systems

【速读】：该论文旨在解决推荐系统中用户相关性与内容多样性之间的平衡问题，这一问题在内容同质化和用户参与度下降的背景下愈发突出。其解决方案的关键在于提出了一种基于多目标、上下文感知的序列采样框架，通过贝叶斯更新动态调整物品得分以优化多样性，并在奖励函数中融合了多种多样性度量指标（包括调整后的相似性子矩阵的对数行列式体积和岭杠杆分数）以及多样性增益不确定性项，以应对探索与利用的权衡。同时，通过建模批次内和批次间的多样性，促进意外发现并减少冗余，最终利用基于支配的排序过程识别帕累托最优物品集，实现每一步迭代中的自适应平衡选择。

链接: https://arxiv.org/abs/2506.21617
作者: Hiba Bederina,Jill-Jênn Vie
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The challenge of balancing user relevance and content diversity in recommender systems is increasingly critical amid growing concerns about content homogeneity and reduced user engagement. In this work, we propose a novel framework that leverages a multi-objective, contextual sequential sampling strategy. Item selection is guided by Bayesian updates that dynamically adjust scores to optimize diversity. The reward formulation integrates multiple diversity metrics-including the log-determinant volume of a tuned similarity submatrix and ridge leverage scores-along with a diversity gain uncertainty term to address the exploration-exploitation trade-off. Both intra- and inter-batch diversity are modeled to promote serendipity and minimize redundancy. A dominance-based ranking procedure identifies Pareto-optimal item sets, enabling adaptive and balanced selections at each iteration. Experiments on a real-world dataset show that our approach significantly improves diversity without sacrificing relevance, demonstrating its potential to enhance user experience in large-scale recommendation settings.
zh

[AI-42] Reinforcement Fine-Tuned Large Language Models for Next POI Recommendation

【速读】：该论文旨在解决传统监督微调（SFT）方法在下一兴趣点（next POI）推荐任务中的固有不匹配问题，即每个训练样本仅提供一个目标POI，而实际推荐需要生成Top-K的推荐列表。解决方案的关键在于提出Refine-POI框架，通过引入基于推荐的奖励机制，使大型语言模型能够在仅有一个真实目标POI的情况下学习生成Top-K推荐列表，从而有效提升推荐性能。

链接: https://arxiv.org/abs/2506.21599
作者: Peibo Li,Shuang Ao,Hao Xue,Yang Song,Maarten de Rijke,Johan Barthélemy,Tomasz Bednarz,Flora D. Salim
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been adopted for next point-of-interest (POI) recommendation tasks. Typical LLM-based recommenders fall into two categories: prompt-based and supervised fine-tuning (SFT)-based models. Prompt-based models generally offer greater output flexibility but deliver lower accuracy, whereas SFT-based models achieve higher performance yet face a fundamental mismatch: next POI recommendation data does not naturally suit supervised fine-tuning. In SFT, the model is trained to reproduce the exact ground truth, but each training example provides only a single target POI, so there is no ground truth for producing a top-k list. To address this, we propose Refine-POI, a reinforcement fine-tuning framework for next POI recommendation. We introduce recommendation-driven rewards that enable LLMs to learn to generate top-k recommendation lists using only one ground-truth POI per example. Experiments on real-world datasets demonstrate that Refine-POI achieves state-of-the-art top-k recommendation performance. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2506.21599 [cs.IR] (or arXiv:2506.21599v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2506.21599 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-43] Evaluating the Robustness of Dense Retrievers in Interdisciplinary Domains

【速读】：该论文试图解决领域自适应在检索模型中的评估基准特性可能扭曲真实效果的问题，这种偏差可能导致在专业领域部署决策上的误导。解决方案的关键在于通过对比不同语义结构的评估基准，揭示领域自适应方法在不同评估环境下的表现差异。研究以环境监管文档检索为例，对ColBERTv2模型进行微调，并在两个具有不同语义结构的基准上进行评估，发现相同的领域自适应方法在不同基准上的感知收益存在显著差异，从而强调了评估框架选择对系统有效性评估的重要性。

链接: https://arxiv.org/abs/2506.21581
作者: Sarthak Chaturvedi,Anurag Acharya,Rounak Meyur,Koby Hayashi,Sai Munikoti,Sameera Horawalavithana
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models. This creates misleading assessments that influence deployment decisions in specialized domains. We show that two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning. Using environmental regulatory document retrieval as a case study, we fine-tune ColBERTv2 model on Environmental Impact Statements (EIS) from federal agencies. We evaluate these models across two benchmarks with different semantic structures. Our findings reveal that identical domain adaptation approaches show very different perceived benefits depending on evaluation methodology. On one benchmark, with clearly separated topic boundaries, domain adaptation shows small improvements (maximum 0.61% NDCG gain). However, on the other benchmark with overlapping semantic structures, the same models demonstrate large improvements (up to 2.22% NDCG gain), a 3.6-fold difference in the performance benefit. We compare these benchmarks through topic diversity metrics, finding that the higher-performing benchmark shows 11% higher average cosine distances between contexts and 23% lower silhouette scores, directly contributing to the observed performance difference. These results demonstrate that benchmark selection strongly determines assessments of retrieval system effectiveness in specialized domains. Evaluation frameworks with well-separated topics regularly underestimate domain adaptation benefits, while those with overlapping semantic boundaries reveal improvements that better reflect real-world regulatory document complexity. Our findings have important implications for developing and deploying AI systems for interdisciplinary domains that integrate multiple topics.
zh

[AI-44] LLM 2Rec: Large Language Models Are Powerful Embedding Models for Sequential Recommendation KDD2025

【速读】：该论文旨在解决传统序列推荐方法在处理未见领域时泛化能力不足的问题，以及文本基础推荐方法未能有效编码协同过滤（Collaborative Filtering, CF）信号的缺陷。其解决方案的关键在于提出一种名为LLM2Rec的新型嵌入模型，该模型通过将大型语言模型（Large Language Models, LLMs）的丰富语义理解能力与协同过滤意识相结合，实现对项目语义和协同信息的联合建模。该方法采用两阶段训练框架：第一阶段为协同监督微调，使LLMs基于历史交互推断项目关系；第二阶段为项目级嵌入建模，将微调后的LLMs转化为结构化的项目嵌入模型，从而提升领域内和领域外推荐性能。

链接: https://arxiv.org/abs/2506.21579
作者: Yingzhi He,Xiaohao Liu,An Zhang,Yunshan Ma,Tat-Seng Chua
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: KDD 2025

点击查看摘要

Abstract:Sequential recommendation aims to predict users’ future interactions by modeling collaborative filtering (CF) signals from historical behaviors of similar users or items. Traditional sequential recommenders predominantly rely on ID-based embeddings, which capture CF signals through high-order co-occurrence patterns. However, these embeddings depend solely on past interactions, lacking transferable knowledge to generalize to unseen domains. Recent advances in large language models (LLMs) have motivated text-based recommendation approaches that derive item representations from textual descriptions. While these methods enhance generalization, they fail to encode CF signals-i.e., latent item correlations and preference patterns-crucial for effective recommendation. We argue that an ideal embedding model should seamlessly integrate CF signals with rich semantic representations to improve both in-domain and out-of-domain recommendation performance. To this end, we propose LLM2Rec, a novel embedding model tailored for sequential recommendation, integrating the rich semantic understanding of LLMs with CF awareness. Our approach follows a two-stage training framework: (1) Collaborative Supervised Fine-tuning, which adapts LLMs to infer item relationships based on historical interactions, and (2) Item-level Embedding Modeling, which refines these specialized LLMs into structured item embedding models that encode both semantic and collaborative information. Extensive experiments on real-world datasets demonstrate that LLM2Rec effectively improves recommendation quality across both in-domain and out-of-domain settings. Our findings highlight the potential of leveraging LLMs to build more robust, generalizable embedding models for sequential recommendation. Our codes are available at this https URL. Comments: KDD 2025 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2506.21579 [cs.IR] (or arXiv:2506.21579v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2506.21579 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-45] On the Necessity of Output Distribution Reweighting for Effective Class Unlearning

【速读】：该论文试图解决在不进行完整重新训练的情况下，从已训练的分类器中擦除特定类别的问题，以满足用户删除权利并减少有害或有偏见的预测。解决方案的关键在于提出一种输出重加权的遗忘方法（RWFT），通过简单地重新分配被遗忘类别样本预测的概率质量，使其对基于神经网络的成员推理攻击（MIA-NN）具有鲁棒性，并引入基于总变分（TV）距离的新度量来量化残留信息泄露，从而提升遗忘效果。

链接: https://arxiv.org/abs/2506.20893
作者: Yian Wang,Ali Ebrahimpour-Boroojeny,Hari Sundaram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we introduce an output-reweighting unlearning method, RWFT, a lightweight technique that erases an entire class from a trained classifier without full retraining. Forgetting specific classes from trained models is essential for enforcing user deletion rights and mitigating harmful or biased predictions. The full retraining is costly and existing unlearning methods fail to replicate the behavior of the retrained models when predicting samples from the unlearned class. We prove this failure by designing a variant of membership inference attacks, MIA-NN that successfully reveals the unlearned class for any of these methods. We propose a simple redistribution of the probability mass for the prediction on the samples in the forgotten class which is robust to MIA-NN. We also introduce a new metric based on the total variation (TV) distance of the prediction probabilities to quantify residual leakage to prevent future methods from susceptibility to the new attack. Through extensive experiments with state of the art baselines in machine unlearning, we show that our approach matches the results of full retraining in both metrics used for evaluation by prior work and the new metric we propose in this work. Compare to state-of-the-art methods, we gain 2.79% in previously used metrics and 111.45% in our new TV-based metric over the best existing method.
zh

[AI-46] From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining ICML2025

【速读】：该论文旨在解决传统深度学习方法在心电图（Electrocardiogram, ECG）分析中依赖大规模人工标注数据所带来的高成本与低效率问题。其解决方案的关键在于提出一种名为MELP的多尺度ECG-语言预训练模型，该模型通过利用ECG-文本对中的层次化监督信号，在令牌、心跳和节律三个层面实现跨模态对齐，从而捕捉ECG信号在不同时间尺度上的结构信息，提升模型的泛化能力与迁移性能。

链接: https://arxiv.org/abs/2506.21803
作者: Fuying Wang,Jiacheng Xu,Lequan Yu
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining (MELP) model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision-at the token, beat, and rhythm levels-to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is available at this https URL.
zh

[AI-47] Demonstrating Interoperable Channel State Feedback Compression with Machine Learning

【速读】：该论文试图解决在无线网络中利用机器学习（Machine Learning, ML）进行信道状态反馈压缩与解压缩时，缺乏实际应用场景验证的问题，特别是在用户设备（User Equipment, UE）和基站之间无法共享各自ML模型的情况下。解决方案的关键在于提出一种在保密条件下训练可互操作的压缩与解压缩ML模型的方法，并通过原型UE和基站验证了所生成模型的准确性。

链接: https://arxiv.org/abs/2506.21796
作者: Dani Korpi,Rachel Wang,Jerry Wang,Abdelrahman Ibrahim,Carl Nuzman,Runxin Wang,Kursat Rasim Mestav,Dustin Zhang,Iraj Saniee,Shawn Winston,Gordana Pavlovic,Wei Ding,William J. Hillery,Chenxi Hao,Ram Thirunagari,Jung Chang,Jeehyun Kim,Bartek Kozicki,Dragan Samardzija,Taesang Yoo,Andreas Maeder,Tingfang Ji,Harish Viswanathan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Neural network-based compression and decompression of channel state feedback has been one of the most widely studied applications of machine learning (ML) in wireless networks. Various simulation-based studies have shown that ML-based feedback compression can result in reduced overhead and more accurate channel information. However, to the best of our knowledge, there are no real-life proofs of concepts demonstrating the benefits of ML-based channel feedback compression in a practical setting, where the user equipment (UE) and base station have no access to each others’ ML models. In this paper, we present a novel approach for training interoperable compression and decompression ML models in a confidential manner, and demonstrate the accuracy of the ensuing models using prototype UEs and base stations. The performance of the ML-based channel feedback is measured both in terms of the accuracy of the reconstructed channel information and achieved downlink throughput gains when using the channel information for beamforming. The reported measurement results demonstrate that it is possible to develop an accurate ML-based channel feedback link without having to share ML models between device and network vendors. These results pave the way for a practical implementation of ML-based channel feedback in commercial 6G networks.
zh

机器学习

[LG-0] ARMOR: Robust Reinforcement Learning-based Control for UAVs under Physical Attacks

链接: https://arxiv.org/abs/2506.22423
作者: Pritam Dash,Ethan Chan,Nathan P. Lawrence,Karthik Pattabiraman
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) depend on onboard sensors for perception, navigation, and control. However, these sensors are susceptible to physical attacks, such as GPS spoofing, that can corrupt state estimates and lead to unsafe behavior. While reinforcement learning (RL) offers adaptive control capabilities, existing safe RL methods are ineffective against such attacks. We present ARMOR (Adaptive Robust Manipulation-Optimized State Representations), an attack-resilient, model-free RL controller that enables robust UAV operation under adversarial sensor manipulation. Instead of relying on raw sensor observations, ARMOR learns a robust latent representation of the UAV’s physical state via a two-stage training framework. In the first stage, a teacher encoder, trained with privileged attack information, generates attack-aware latent states for RL policy training. In the second stage, a student encoder is trained via supervised learning to approximate the teacher’s latent states using only historical sensor data, enabling real-world deployment without privileged information. Our experiments show that ARMOR outperforms conventional methods, ensuring UAV safety. Additionally, ARMOR improves generalization to unseen attacks and reduces training cost by eliminating the need for iterative adversarial training.

[LG-1] Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL

链接: https://arxiv.org/abs/2506.22401
作者: Tong Yang,Bo Dai,Lin Xiao,Yuejie Chi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation – it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.

[LG-2] Reinforcement Learning with Physics-Informed Symbolic Program Priors for Zero-Shot Wireless Indoor Navigation

链接: https://arxiv.org/abs/2506.22365
作者: Tao Li,Haozhe Lei,Mingsheng Yin,Yaqi Hu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Spotlight paper at Reinforcement Learning Conference 2025, Workshop on Inductive Biases in Reinforcement Learning

点击查看摘要

Abstract:When using reinforcement learning (RL) to tackle physical control tasks, inductive biases that encode physics priors can help improve sample efficiency during training and enhance generalization in testing. However, the current practice of incorporating these helpful physics-informed inductive biases inevitably runs into significant manual labor and domain expertise, making them prohibitive for general users. This work explores a symbolic approach to distill physics-informed inductive biases into RL agents, where the physics priors are expressed in a domain-specific language (DSL) that is human-readable and naturally explainable. Yet, the DSL priors do not translate directly into an implementable policy due to partial and noisy observations and additional physical constraints in navigation tasks. To address this gap, we develop a physics-informed program-guided RL (PiPRL) framework with applications to indoor navigation. PiPRL adopts a hierarchical and modularized neuro-symbolic integration, where a meta symbolic program receives semantically meaningful features from a neural perception module, which form the bases for symbolic programming that encodes physics priors and guides the RL process of a low-level neural controller. Extensive experiments demonstrate that PiPRL consistently outperforms purely symbolic or neural policies and reduces training time by over 26% with the help of the program-based inductive biases.

[LG-3] Weakly-Supervised Domain Adaptation with Proportion-Constrained Pseudo-Labeling IJCNN2025

链接: https://arxiv.org/abs/2506.22301
作者: Takumi Okuo,Shinnosuke Matsuo,Shota Harada,Kiyohito Tanaka,Ryoma Bise
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCNN2025

点击查看摘要

Abstract:Domain shift is a significant challenge in machine learning, particularly in medical applications where data distributions differ across institutions due to variations in data collection practices, equipment, and procedures. This can degrade performance when models trained on source domain data are applied to the target domain. Domain adaptation methods have been widely studied to address this issue, but most struggle when class proportions between the source and target domains differ. In this paper, we propose a weakly-supervised domain adaptation method that leverages class proportion information from the target domain, which is often accessible in medical datasets through prior knowledge or statistical reports. Our method assigns pseudo-labels to the unlabeled target data based on class proportion (called proportion-constrained pseudo-labeling), improving performance without the need for additional annotations. Experiments on two endoscopic datasets demonstrate that our method outperforms semi-supervised domain adaptation techniques, even when 5% of the target domain is labeled. Additionally, the experimental results with noisy proportion labels highlight the robustness of our method, further demonstrating its effectiveness in real-world application scenarios.

[LG-4] Score-Based Model for Low-Rank Tensor Recovery

链接: https://arxiv.org/abs/2506.22295
作者: Zhengyun Cheng,Changhao Wang,Guanwen Zhang,Yi Xu,Wei Zhou,Xiangyang Ji
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank tensor decompositions (TDs) provide an effective framework for multiway data analysis. Traditional TD methods rely on predefined structural assumptions, such as CP or Tucker decompositions. From a probabilistic perspective, these can be viewed as using Dirac delta distributions to model the relationships between shared factors and the low-rank tensor. However, such prior knowledge is rarely available in practical scenarios, particularly regarding the optimal rank structure and contraction rules. The optimization procedures based on fixed contraction rules are complex, and approximations made during these processes often lead to accuracy loss. To address this issue, we propose a score-based model that eliminates the need for predefined structural or distributional assumptions, enabling the learning of compatibility between tensors and shared factors. Specifically, a neural network is designed to learn the energy function, which is optimized via score matching to capture the gradient of the joint log-probability of tensor entries and shared factors. Our method allows for modeling structures and distributions beyond the Dirac delta assumption. Moreover, integrating the block coordinate descent (BCD) algorithm with the proposed smooth regularization enables the model to perform both tensor completion and denoising. Experimental results demonstrate significant performance improvements across various tensor types, including sparse and continuous-time tensors, as well as visual data.

[LG-5] Risk-Averse Best Arm Set Identification with Fixed Budget and Fixed Confidence

链接: https://arxiv.org/abs/2506.22253
作者: Shunta Nonaga,Koji Tabata,Yuta Mizuno,Tamiki Komatsuzaki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decision making under uncertain environments in the maximization of expected reward while minimizing its risk is one of the ubiquitous problems in many subjects. Here, we introduce a novel problem setting in stochastic bandit optimization that jointly addresses two critical aspects of decision-making: maximizing expected reward and minimizing associated uncertainty, quantified via the mean-variance(MV) criterion. Unlike traditional bandit formulations that focus solely on expected returns, our objective is to efficiently and accurately identify the Pareto-optimal set of arms that strikes the best trade-off between expected performance and risk. We propose a unified meta-algorithmic framework capable of operating under both fixed-confidence and fixed-budget regimes, achieved through adaptive design of confidence intervals tailored to each scenario using the same sample exploration strategy. We provide theoretical guarantees on the correctness of the returned solutions in both settings. To complement this theoretical analysis, we conduct extensive empirical evaluations across synthetic benchmarks, demonstrating that our approach outperforms existing methods in terms of both accuracy and sample efficiency, highlighting its broad applicability to risk-aware decision-making tasks in uncertain environments.

[LG-6] REDELEX: A Framework for Relational Deep Learning Exploration ECML KDD2025

链接: https://arxiv.org/abs/2506.22199
作者: Jakub Peleška,Gustav Šír
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: Accepted to ECMLPKDD 2025 at Porto, Portugal

点击查看摘要

Abstract:Relational databases (RDBs) are widely regarded as the gold standard for storing structured information. Consequently, predictive tasks leveraging this data format hold significant application promise. Recently, Relational Deep Learning (RDL) has emerged as a novel paradigm wherein RDBs are conceptualized as graph structures, enabling the application of various graph neural architectures to effectively address these tasks. However, given its novelty, there is a lack of analysis into the relationships between the performance of various RDL models and the characteristics of the underlying RDBs. In this study, we present REDELEX - a comprehensive exploration framework for evaluating RDL models of varying complexity on the most diverse collection of over 70 RDBs, which we make available to the community. Benchmarked alongside key representatives of classic methods, we confirm the generally superior performance of RDL while providing insights into the main factors shaping performance, including model complexity, database sizes and their structural properties. Comments: Accepted to ECMLPKDD 2025 at Porto, Portugal Subjects: Machine Learning (cs.LG); Databases (cs.DB) Cite as: arXiv:2506.22199 [cs.LG] (or arXiv:2506.22199v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.22199 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-7] dreaMLearning: Data Compression Assisted Machine Learning

链接: https://arxiv.org/abs/2506.22190
作者: Xiaobo Zhao,Aaron Hurst,Panagiotis Karras,Daniel E. Lucani
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Despite rapid advancements, machine learning, particularly deep learning, is hindered by the need for large amounts of labeled data to learn meaningful patterns without overfitting and immense demands for computation and storage, which motivate research into architectures that can achieve good performance with fewer resources. This paper introduces dreaMLearning, a novel framework that enables learning from compressed data without decompression, built upon Entropy-based Generalized Deduplication (EntroGeDe), an entropy-driven lossless compression method that consolidates information into a compact set of representative samples. DreaMLearning accommodates a wide range of data types, tasks, and model architectures. Extensive experiments on regression and classification tasks with tabular and image data demonstrate that dreaMLearning accelerates training by up to 8.8x, reduces memory usage by 10x, and cuts storage by 42%, with a minimal impact on model performance. These advancements enhance diverse ML applications, including distributed and federated learning, and tinyML on resource-constrained edge devices, unlocking new possibilities for efficient and scalable learning.

[LG-8] hompson Sampling-Based Learning and Control for Unknown Dynamic Systems

链接: https://arxiv.org/abs/2506.22186
作者: Kaikai Zheng,Dawei Shi,Yang Shi,Long Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Thompson sampling (TS) is an effective method to explore parametric uncertainties and can therefore be used for active learning-based controller design. However, TS relies on finite parametric representations, which limits its applicability to more general spaces, which are more commonly encountered in control system design. To address this issue, this work pro poses a parameterization method for control law learning using reproducing kernel Hilbert spaces and designs a data-driven active learning control approach. Specifically, the proposed method treats the control law as an element in a function space, allowing the design of control laws without imposing restrictions on the system structure or the form of the controller. A TS framework is proposed in this work to explore potential optimal control laws, and the convergence guarantees are further provided for the learning process. Theoretical analysis shows that the proposed method learns the relationship between control laws and closed-loop performance metrics at an exponential rate, and the upper bound of control regret is also derived. Numerical experiments on controlling unknown nonlinear systems validate the effectiveness of the proposed method.

[LG-9] ASVSim (AirSim for Surface Vehicles): A High-Fidelity Simulation Framework for Autonomous Surface Vehicle Research

链接: https://arxiv.org/abs/2506.22174
作者: Bavo Lesy,Siemen Herremans,Robin Kerstens,Jan Steckel,Walter Daems,Siegfried Mercelis,Ali Anwar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 Pages, 11 Figures

点击查看摘要

Abstract:The transport industry has recently shown significant interest in unmanned surface vehicles (USVs), specifically for port and inland waterway transport. These systems can improve operational efficiency and safety, which is especially relevant in the European Union, where initiatives such as the Green Deal are driving a shift towards increased use of inland waterways. At the same time, a shortage of qualified personnel is accelerating the adoption of autonomous solutions. However, there is a notable lack of open-source, high-fidelity simulation frameworks and datasets for developing and evaluating such solutions. To address these challenges, we introduce AirSim For Surface Vehicles (ASVSim), an open-source simulation framework specifically designed for autonomous shipping research in inland and port environments. The framework combines simulated vessel dynamics with marine sensor simulation capabilities, including radar and camera systems and supports the generation of synthetic datasets for training computer vision models and reinforcement learning agents. Built upon Cosys-AirSim, ASVSim provides a comprehensive platform for developing autonomous navigation algorithms and generating synthetic datasets. The simulator supports research of both traditional control methods and deep learning-based approaches. Through limited experiments, we demonstrate the potential of the simulator in these research areas. ASVSim is provided as an open-source project under the MIT license, making autonomous navigation research accessible to a larger part of the ocean engineering community.

[LG-10] Earthquake Damage Grades Prediction using An Ensemble Approach Integrating Advanced Machine and Deep Learning Models

链接: https://arxiv.org/abs/2506.22129
作者: Anurag Panda,Gaurav Kumar Yadav
类目: Machine Learning (cs.LG)
*备注: 3rd International Conference on Applied Mathematics in Science and Engineering

点击查看摘要

Abstract:In the aftermath of major earthquakes, evaluating structural and infrastructural damage is vital for coordinating post-disaster response efforts. This includes assessing damage’s extent and spatial distribution to prioritize rescue operations and resource allocation. Accurately estimating damage grades to buildings post-earthquake is paramount for effective response and recovery, given the significant impact on lives and properties, underscoring the urgency of streamlining relief fund allocation processes. Previous studies have shown the effectiveness of multi-class classification, especially XGBoost, along with other machine learning models and ensembling methods, incorporating regularization to address class imbalance. One consequence of class imbalance is that it may give rise to skewed models that undervalue minority classes and give preference to the majority class. This research deals with the problem of class imbalance with the help of the synthetic minority oversampling technique (SMOTE). We delve into multiple multi-class classification machine learning, deep learning models, and ensembling methods to forecast structural damage grades. The study elucidates performance determinants through comprehensive feature manipulation experiments and diverse training approaches. It identifies key factors contributing to seismic vulnerability while evaluating model performance using techniques like the confusion matrix further to enhance understanding of the effectiveness of earthquake damage prediction.

[LG-11] ransfer Learning for Assessing Heavy Metal Pollution in Seaports Sediments

链接: https://arxiv.org/abs/2506.22096
作者: Tin Lai,Farnaz Farid,Yueyang Kuan,Xintian Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detecting heavy metal pollution in soils and seaports is vital for regional environmental monitoring. The Pollution Load Index (PLI), an international standard, is commonly used to assess heavy metal containment. However, the conventional PLI assessment involves laborious procedures and data analysis of sediment samples. To address this challenge, we propose a deep-learning-based model that simplifies the heavy metal assessment process. Our model tackles the issue of data scarcity in the water-sediment domain, which is traditionally plagued by challenges in data collection and varying standards across nations. By leveraging transfer learning, we develop an accurate quantitative assessment method for predicting PLI. Our approach allows the transfer of learned features across domains with different sets of features. We evaluate our model using data from six major ports in New South Wales, Australia: Port Yamba, Port Newcastle, Port Jackson, Port Botany, Port Kembla, and Port Eden. The results demonstrate significantly lower Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of approximately 0.5 and 0.03, respectively, compared to other models. Our model performance is up to 2 orders of magnitude than other baseline models. Our proposed model offers an innovative, accessible, and cost-effective approach to predicting water quality, benefiting marine life conservation, aquaculture, and industrial pollution monitoring.

[LG-12] crypto price prediction using lstmxgboost

链接: https://arxiv.org/abs/2506.22055
作者: Mehul Gautam
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The volatility and complex dynamics of cryptocurrency markets present unique challenges for accurate price forecasting. This research proposes a hybrid deep learning and machine learning model that integrates Long Short-Term Memory (LSTM) networks and Extreme Gradient Boosting (XGBoost) for cryptocurrency price prediction. The LSTM component captures temporal dependencies in historical price data, while XGBoost enhances prediction by modeling nonlinear relationships with auxiliary features such as sentiment scores and macroeconomic indicators. The model is evaluated on historical datasets of Bitcoin, Ethereum, Dogecoin, and Litecoin, incorporating both global and localized exchange data. Comparative analysis using Mean Absolute Percentage Error (MAPE) and Min-Max Normalized Root Mean Square Error (MinMax RMSE) demonstrates that the LSTM+XGBoost hybrid consistently outperforms standalone models and traditional forecasting methods. This study underscores the potential of hybrid architectures in financial forecasting and provides insights into model adaptability across different cryptocurrencies and market contexts.

[LG-13] Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion

链接: https://arxiv.org/abs/2506.22036
作者: Ying Zhang,Yu Zhao,Xuhui Sui,Baohang Zhou,Xiangrui Cai,Li Shen,Xiaojie Yuan,Dacheng Tao
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注: Submitted to the IEEE for possible publication

点击查看摘要

Abstract:With the increasing multimodal knowledge privatization requirements, multimodal knowledge graphs in different institutes are usually decentralized, lacking of effective collaboration system with both stronger reasoning ability and transmission safety guarantees. In this paper, we propose the Federated Multimodal Knowledge Graph Completion (FedMKGC) task, aiming at training over federated MKGs for better predicting the missing links in clients without sharing sensitive knowledge. We propose a framework named MMFeD3-HidE for addressing multimodal uncertain unavailability and multimodal client heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete multimodal distributions from incomplete entity embeddings constrained by available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual Distillation (MMFeD3) transfers knowledge mutually between clients and the server with logit and feature distillation to improve both global convergence and semantic consistency. We propose a FedMKGC benchmark for a comprehensive evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets with heterogeneous multimodal information, and three groups of constructed baselines. Experiments conducted on our benchmark validate the effectiveness, semantic consistency, and convergence robustness of MMFeD3-HidE.

[LG-14] GKNet: Graph Kalman Filtering and Model Inference via Model-based Deep Learning

链接: https://arxiv.org/abs/2506.22004
作者: Mohammad Sabbaqi,Riccardo Taormina,Elvin Isufi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inference tasks with time series over graphs are of importance in applications such as urban water networks, economics, and networked neuroscience. Addressing these tasks typically relies on identifying a computationally affordable model that jointly captures the graph-temporal patterns of the data. In this work, we propose a graph-aware state space model for graph time series, where both the latent state and the observation equation are parametric graph-induced models with a limited number of parameters that need to be learned. More specifically, we consider the state equation to follow a stochastic partial differential equation driven by noise over the graphs edges accounting not only for potential edge uncertainties but also for increasing the degrees of freedom in the latter in a tractable manner. The graph structure conditioning of the noise dispersion allows the state variable to deviate from the stochastic process in certain neighborhoods. The observation model is a sampled and graph-filtered version of the state capturing multi-hop neighboring influence. The goal is to learn the parameters in both state and observation models from the partially observed data for downstream tasks such as prediction and imputation. The model is inferred first through a maximum likelihood approach that provides theoretical tractability but is limited in expressivity and scalability. To improve on the latter, we use the state-space formulation to build a principled deep learning architecture that jointly learns the parameters and tracks the state in an end-to-end manner in the spirit of Kalman neural networks.

[LG-15] Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement

链接: https://arxiv.org/abs/2506.21956
作者: Hao Jiang,Yongxiang Tang,Yanxiang Zeng,Pengjia Yuan,Yanhua Cheng,Teng Sha,Xialong Liu,Peng Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the realm of online advertising, advertisers partake in ad auctions to obtain advertising slots, frequently taking advantage of auto-bidding tools provided by demand-side platforms. To improve the automation of these bidding systems, we adopt generative models, namely the Decision Transformer (DT), to tackle the difficulties inherent in automated bidding. Applying the Decision Transformer to the auto-bidding task enables a unified approach to sequential modeling, which efficiently overcomes short-sightedness by capturing long-term dependencies between past bidding actions and user behavior. Nevertheless, conventional DT has certain drawbacks: (1) DT necessitates a preset return-to-go (RTG) value before generating actions, which is not inherently produced; (2) The policy learned by DT is restricted by its training data, which is consists of mixed-quality trajectories. To address these challenges, we introduce the R* Decision Transformer (R* DT), developed in a three-step process: (1) R DT: Similar to traditional DT, R DT stores actions based on state and RTG value, as well as memorizing the RTG for a given state using the training set; (2) R^ DT: We forecast the highest value (within the training set) of RTG for a given state, deriving a suboptimal policy based on the current state and the forecasted supreme RTG value; (3) R* DT: Based on R^ DT, we generate trajectories and select those with high rewards (using a simulator) to augment our training dataset. This data enhancement has been shown to improve the RTG of trajectories in the training data and gradually leads the suboptimal policy towards optimality. Comprehensive tests on a publicly available bidding dataset validate the R* DT’s efficacy and highlight its superiority when dealing with mixed-quality trajectories.

[LG-16] Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications

链接: https://arxiv.org/abs/2506.21952
作者: Yangyang Wan,Haotian Wang,Xuhui Yu,Jiageng Chen,Xinyu Fan,Zuyuan He
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Distributed acoustic sensing (DAS) has attracted considerable attention across various fields and artificial intelligence (AI) technology plays an important role in DAS applications to realize event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which is contradictory to the fact of limited available event data in real-world scenarios. Here, a physics-informed DAS neural network paradigm is proposed, which does not need real-world events data for training. By physically modeling target events and the constraints of real world and DAS system, physical functions are derived to train a generative network for generation of DAS events data. DAS debackground net is trained by using the generated DAS events data to eliminate background noise in DAS data. The effectiveness of the proposed paradigm is verified in event identification application based on a public dataset of DAS spatiotemporal data and in belt conveyor fault monitoring application based on DAS time-frequency data, and achieved comparable or better performance than data-driven networks trained with RWD. Owing to the introduction of physical information and capability of background noise removal, the paradigm demonstrates generalization in same application on different sites. A fault diagnosis accuracy of 91.8% is achieved in belt conveyor field with networks which transferred from simulation test site without any fault events data of test site and field for training. The proposed paradigm is a prospective solution to address significant obstacles of data acquisition and intense noise in practical DAS applications and explore more potential fields for DAS.

[LG-17] Hitchhiking Rides Dataset: Two decades of crowd-sourced records on stochastic traveling

链接: https://arxiv.org/abs/2506.21946
作者: Till Wenke
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Hitchhiking, a spontaneous and decentralized mode of travel, has long eluded systematic study due to its informal nature. This paper presents and analyzes the largest known structured dataset of hitchhiking rides, comprising over 63,000 entries collected over nearly two decades through platforms associated with this http URL and lately on this http URL. By leveraging crowd-sourced contributions, the dataset captures key spatiotemporal and strategic aspects of hitchhiking. This work documents the dataset’s origins, evolution, and community-driven maintenance, highlighting its Europe-centric distribution, seasonal patterns, and reliance on a small number of highly active contributors. Through exploratory analyses, I examine waiting times, user behavior, and comment metadata, shedding light on the lived realities of hitchhikers. While the dataset has inherent biases and limitations - such as demographic skew and unverifiable entries it offers a rare and valuable window into an alternative form of mobility. I conclude by outlining future directions for enriching the dataset and advancing research on hitchhiking as both a transportation practice and cultural phenomenon.

[LG-18] GuiderNet: A Meta-Learning Framework for Optimizing Quantum Circuit Geometry and Mitigating Barren Plateaus

链接: https://arxiv.org/abs/2506.21940
作者: Marwan Ait Haddou,Mohamed Bennai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational Quantum Algorithms (VQAs) offer potential for near-term quantum advantage but face challenges from barren plateaus, where gradients vanish, and poorly conditioned optimization landscapes. We introduce GuiderNet, a meta-learning framework that conditions Parameterized Quantum Circuits (PQCs) using data-dependent parameter shifts aimed at minimizing the log condition number of the Fubini-Study metric tensor. Implemented as a classical neural network, GuiderNet is meta-trained to guide PQC parameters into geometrically favorable regions and is embedded within hybrid quantum-classical pipelines to steer both initialization and adaptive modulation during training. Applied to the Kaggle Diabetes classification task, GuiderNet reduces cumulative training loss by over 5x, improves test accuracy from 75.3% to 98.6%, and increases the minority-class F1 score from 0.67 to 0.95. It also suppresses gradient explosion and stabilizes parameter updates, enabling smoother and more robust optimization. These results demonstrate that geometric meta-conditioning can mitigate barren plateaus and ill-conditioning, providing a scalable approach to enhance trainability and generalization in quantum machine learning. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2506.21940 [cs.LG] (or arXiv:2506.21940v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.21940 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-19] HQCM-EBTC: A Hybrid Quantum-Classical Model for Explainable Brain Tumor Classification

链接: https://arxiv.org/abs/2506.21937
作者: Marwan Ait Haddou,Mohamed Bennai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose HQCM-EBTC, a hybrid quantum-classical model for automated brain tumor classification using MRI images. Trained on a dataset of 7,576 scans covering normal, meningioma, glioma, and pituitary classes, HQCM-EBTC integrates a 5-qubit, depth-2 quantum layer with 5 parallel circuits, optimized via AdamW and a composite loss blending cross-entropy and attention consistency. HQCM-EBTC achieves 96.48% accuracy, substantially outperforming the classical baseline (86.72%). It delivers higher precision and F1-scores, especially for glioma detection. t-SNE projections reveal enhanced feature separability in quantum space, and confusion matrices show lower misclassification. Attention map analysis (Jaccard Index) confirms more accurate and focused tumor localization at high-confidence thresholds. These results highlight the promise of quantum-enhanced models in medical imaging, advancing both diagnostic accuracy and interpretability for clinical brain tumor assessment. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2506.21937 [cs.LG] (or arXiv:2506.21937v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.21937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-20] Joint Task Offloading and Resource Allocation in Low-Altitude MEC via Graph Attention Diffusion

链接: https://arxiv.org/abs/2506.21933
作者: Yifan Xue,Ruihuai Liang,Bo Yang,Xuelin Cao,Zhiwen Yu,Mérouane Debbah,Chau Yuen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of the low-altitude economy, air-ground integrated multi-access edge computing (MEC) systems are facing increasing demands for real-time and intelligent task scheduling. In such systems, task offloading and resource allocation encounter multiple challenges, including node heterogeneity, unstable communication links, and dynamic task variations. To address these issues, this paper constructs a three-layer heterogeneous MEC system architecture for low-altitude economic networks, encompassing aerial and ground users as well as edge servers. The system is systematically modeled from the perspectives of communication channels, computational costs, and constraint conditions, and the joint optimization problem of offloading decisions and resource allocation is uniformly abstracted into a graph-structured modeling task. On this basis, we propose a graph attention diffusion-based solution generator (GADSG). This method integrates the contextual awareness of graph attention networks with the solution distribution learning capability of diffusion models, enabling joint modeling and optimization of discrete offloading variables and continuous resource allocation variables within a high-dimensional latent space. We construct multiple simulation datasets with varying scales and topologies. Extensive experiments demonstrate that the proposed GADSG model significantly outperforms existing baseline methods in terms of optimization performance, robustness, and generalization across task structures, showing strong potential for efficient task scheduling in dynamic and complex low-altitude economic network environments.

[LG-21] OAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments

链接: https://arxiv.org/abs/2506.21900
作者: Sheng Yun,Jianhua Pei,Ping Wang
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The evolution toward 6G networks demands a fundamental shift from bit-centric transmission to semantic-aware communication that emphasizes task-relevant information. This work introduces TOAST (Task-Oriented Adaptive Semantic Transmission), a unified framework designed to address the core challenge of multi-task optimization in dynamic wireless environments through three complementary components. First, we formulate adaptive task balancing as a Markov decision process, employing deep reinforcement learning to dynamically adjust the trade-off between image reconstruction fidelity and semantic classification accuracy based on real-time channel conditions. Second, we integrate module-specific Low-Rank Adaptation (LoRA) mechanisms throughout our Swin Transformer-based joint source-channel coding architecture, enabling parameter-efficient fine-tuning that dramatically reduces adaptation overhead while maintaining full performance across diverse channel impairments including Additive White Gaussian Noise (AWGN), fading, phase noise, and impulse interference. Third, we incorporate an Elucidating diffusion model that operates in the latent space to restore features corrupted by channel noises, providing substantial quality improvements compared to baseline approaches. Extensive experiments across multiple datasets demonstrate that TOAST achieves superior performance compared to baseline approaches, with significant improvements in both classification accuracy and reconstruction quality at low Signal-to-Noise Ratio (SNR) conditions while maintaining robust performance across all tested scenarios.

[LG-22] Advancements and Challenges in Continual Reinforcement Learning: A Comprehensive Review

链接: https://arxiv.org/abs/2506.21899
作者: Amara Zuffer,Michael Burke,Mehrtash Harandi
类目: Machine Learning (cs.LG)
*备注: 65 pages, 9 figures

点击查看摘要

Abstract:The diversity of tasks and dynamic nature of reinforcement learning (RL) require RL agents to be able to learn sequentially and continuously, a learning paradigm known as continuous reinforcement learning. This survey reviews how continual learning transforms RL agents into dynamic continual learners. This enables RL agents to acquire and retain useful and reusable knowledge seamlessly. The paper delves into fundamental aspects of continual reinforcement learning, exploring key concepts, significant challenges, and novel methodologies. Special emphasis is placed on recent advancements in continual reinforcement learning within robotics, along with a succinct overview of evaluation environments utilized in prominent research, facilitating accessibility for newcomers to the field. The review concludes with a discussion on limitations and promising future directions, providing valuable insights for researchers and practitioners alike.

[LG-23] Koopman operator-based discussion on partial observation in stochastic systems

链接: https://arxiv.org/abs/2506.21844
作者: Jun Ohkubo
类目: Machine Learning (cs.LG)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:It is sometimes difficult to achieve a complete observation for a full set of observables, and partial observations are necessary. For deterministic systems, the Mori-Zwanzig formalism provides a theoretical framework for handling partial observations. Recently, data-driven algorithms based on the Koopman operator theory have made significant progress, and there is a discussion to connect the Mori-Zwanzig formalism with the Koopman operator theory. In this work, we discuss the effects of partial observation in stochastic systems using the Koopman operator theory. The discussion clarifies the importance of distinguishing the state space and the function space in stochastic systems. Even in stochastic systems, the delay embedding technique is beneficial for partial observation, and several numerical experiments showed a power-law behavior of the accuracy for the amplitude of the additive noise. We also discuss the relation between the exponent of the power-law behavior and the effects of partial observation.

[LG-24] he Cost of Avoiding Backpropagation

链接: https://arxiv.org/abs/2506.21833
作者: Kunjal Panchal,Sunav Choudhary,Yuriy Brun,Hui Guan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of a unified theoretical analysis. This work presents a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO methods. Our theoretical analysis shows that while FmAD, and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing outperforms FmAD and ZO variants, including those enhanced with variance reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8x fewer computations at comparable memory usage. Our results highlight fundamental limitations of FmAD and ZO, and reaffirm BP with checkpointing as the most effective strategy for model training under memory-constrained settings. Our code is available at this https URL.

[LG-25] Laser Scan Path Design for Controlled Microstructure in Additive Manufacturing with Integrated Reduced-Order Phase-Field Modeling and Deep Reinforcement Learning

链接: https://arxiv.org/abs/2506.21815
作者: Augustine Twumasi,Prokash Chandra Roy,Zixun Li,Soumya Shouvik Bhattacharjee,Zhengtao Gan
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Laser powder bed fusion (L-PBF) is a widely recognized additive manufacturing technology for producing intricate metal components with exceptional accuracy. A key challenge in L-PBF is the formation of complex microstructures affecting product quality. We propose a physics-guided, machine-learning approach to optimize scan paths for desired microstructure outcomes, such as equiaxed grains. We utilized a phase-field method (PFM) to model crystalline grain structure evolution. To reduce computational costs, we trained a surrogate machine learning model, a 3D U-Net convolutional neural network, using single-track phase-field simulations with various laser powers to predict crystalline grain orientations based on initial microstructure and thermal history. We investigated three scanning strategies across various hatch spacings within a square domain, achieving a two-orders-of-magnitude speedup using the surrogate model. To reduce trial and error in designing laser scan toolpaths, we used deep reinforcement learning (DRL) to generate optimized scan paths for target microstructure. Results from three cases demonstrate the DRL approach’s effectiveness. We integrated the surrogate 3D U-Net model into our DRL environment to accelerate the reinforcement learning training process. The reward function minimizes both aspect ratio and grain volume of the predicted microstructure from the agent’s scan path. The reinforcement learning algorithm was benchmarked against conventional zigzag approach for smaller and larger domains, showing machine learning methods’ potential to enhance microstructure control and computational efficiency in L-PBF optimization.

[LG-26] Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning

链接: https://arxiv.org/abs/2506.21797
作者: Peihao Wang,Zhangyang Wang
类目: Machine Learning (cs.LG)
*备注: International Conference on Neuro-symbolic Systems (NeuS), 2025

点击查看摘要

Abstract:We develop a theoretical framework that explains how discrete symbolic structures can emerge naturally from continuous neural network training dynamics. By lifting neural parameters to a measure space and modeling training as Wasserstein gradient flow, we show that under geometric constraints, such as group invariance, the parameter measure \mu_t undergoes two concurrent phenomena: (1) a decoupling of the gradient flow into independent optimization trajectories over some potential functions, and (2) a progressive contraction on the degree of freedom. These potentials encode algebraic constraints relevant to the task and act as ring homomorphisms under a commutative semi-ring structure on the measure space. As training progresses, the network transitions from a high-dimensional exploration to compositional representations that comply with algebraic operations and exhibit a lower degree of freedom. We further establish data scaling laws for realizing symbolic tasks, linking representational capacity to the group invariance that facilitates symbolic solutions. This framework charts a principled foundation for understanding and designing neurosymbolic systems that integrate continuous learning with discrete algebraic reasoning.

[LG-27] M3PO: Massively Multi-Task Model-Based Policy Optimization IROS2025

链接: https://arxiv.org/abs/2506.21782
作者: Aditya Narendra,Dmitry Makarov,Aleksandr Panov
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 4 figures. Accepted at IEEE/RSJ IROS 2025. Full version, including appendix and implementation details

点击查看摘要

Abstract:We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.

[LG-28] Gradient-Based Neuroplastic Adaptation for Concurrent Optimization of Neuro-Fuzzy Networks

链接: https://arxiv.org/abs/2506.21771
作者: John Wesley Hostetter,Min Chi
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 45 pages

点击查看摘要

Abstract:Neuro-fuzzy networks (NFNs) are transparent, symbolic, and universal function approximations that perform as well as conventional neural architectures, but their knowledge is expressed as linguistic IF-THEN rules. Despite these advantages, their systematic design process remains a challenge. Existing work will often sequentially build NFNs by inefficiently isolating parametric and structural identification, leading to a premature commitment to brittle and subpar architecture. We propose a novel application-independent approach called gradient-based neuroplastic adaptation for the concurrent optimization of NFNs’ parameters and structure. By recognizing that NFNs’ parameters and structure should be optimized simultaneously as they are deeply conjoined, settings previously unapproachable for NFNs are now accessible, such as the online reinforcement learning of NFNs for vision-based tasks. The effectiveness of concurrently optimizing NFNs is empirically shown as it is trained by online reinforcement learning to proficiently play challenging scenarios from a vision-based video game called DOOM.

[LG-29] Federated Item Response Theory Models

链接: https://arxiv.org/abs/2506.21744
作者: Biying Zhou,Nanyu Luo,Feng Ji
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Item Response Theory (IRT) models have been widely used to estimate respondents’ latent abilities and calibrate items’ difficulty. Traditional IRT estimation requires all individual raw response data to be centralized in one place, thus potentially causing privacy issues. Federated learning is an emerging field in computer science and machine learning with added features of privacy protection and distributed computing. To integrate the advances from federated learning with modern psychometrics, we propose a novel framework, Federated Item Response Theory (IRT), to enable estimating traditional IRT models with additional privacy, allowing estimation in a distributed manner without losing estimation accuracy. Our numerical experiments confirm that FedIRT achieves statistical accuracy similar to standard IRT estimation using popular R packages, while offering critical advantages: privacy protection and reduced communication costs. We also validate FedIRT’s utility through a real-world exam dataset, demonstrating its effectiveness in realistic educational contexts. This new framework extends IRT’s applicability to distributed settings, such as multi-school assessments, without sacrificing accuracy or security. To support practical adoption, we provide an open-ource R package, FedIRT, implementing the framework for the two-parameter logistic (2PL) and partial credit models (PCM). Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML) Cite as: arXiv:2506.21744 [cs.LG] (or arXiv:2506.21744v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2506.21744 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-30] Storm Surge in Color: RGB-Encoded Physics-Aware Deep Learning for Storm Surge Forecasting

链接: https://arxiv.org/abs/2506.21743
作者: Jinpai Zhao,Albert Cerrone,Eirik Valseth,Leendert Westerink,Clint Dawson
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Storm surge forecasting plays a crucial role in coastal disaster preparedness, yet existing machine learning approaches often suffer from limited spatial resolution, reliance on coastal station data, and poor generalization. Moreover, many prior models operate directly on unstructured spatial data, making them incompatible with modern deep learning architectures. In this work, we introduce a novel approach that projects unstructured water elevation fields onto structured Red Green Blue (RGB)-encoded image representations, enabling the application of Convolutional Long Short Term Memory (ConvLSTM) networks for end-to-end spatiotemporal surge forecasting. Our model further integrates ground-truth wind fields as dynamic conditioning signals and topo-bathymetry as a static input, capturing physically meaningful drivers of surge evolution. Evaluated on a large-scale dataset of synthetic storms in the Gulf of Mexico, our method demonstrates robust 48-hour forecasting performance across multiple regions along the Texas coast and exhibits strong spatial extensibility to other coastal areas. By combining structured representation, physically grounded forcings, and scalable deep learning, this study advances the frontier of storm surge forecasting in usability, adaptability, and interpretability.

[LG-31] Unimodal Strategies in Density-Based Clustering

链接: https://arxiv.org/abs/2506.21695
作者: Oron Nir,Jay Tenenbaum,Ariel Shamir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Density-based clustering methods often surpass centroid-based counterparts, when addressing data with noise or arbitrary data distributions common in real-world problems. In this study, we reveal a key property intrinsic to density-based clustering methods regarding the relation between the number of clusters and the neighborhood radius of core points - we empirically show that it is nearly unimodal, and support this claim theoretically in a specific setting. We leverage this property to devise new strategies for finding appropriate values for the radius more efficiently based on the Ternary Search algorithm. This is especially important for large scale data that is high-dimensional, where parameter tuning is computationally intensive. We validate our methodology through extensive applications across a range of high-dimensional, large-scale NLP, Audio, and Computer Vision tasks, demonstrating its practical effectiveness and robustness. This work not only offers a significant advancement in parameter control for density-based clustering but also broadens the understanding regarding the relations between their guiding parameters. Our code is available at this https URL.

[LG-32] Risk-Averse Total-Reward Reinforcement Learning

链接: https://arxiv.org/abs/2506.21683
作者: Xihong Su,Jia Lin Hau,Gersi Doko,Kishan Panaganti,Marek Petrik
类目: Machine Learning (cs.LG)
*备注: The paper is under review now

点击查看摘要

Abstract:Risk-averse total-reward Markov Decision Processes (MDPs) offer a promising framework for modeling and solving undiscounted infinite-horizon objectives. Existing model-based algorithms for risk measures like the entropic risk measure (ERM) and entropic value-at-risk (EVaR) are effective in small problems, but require full access to transition probabilities. We propose a Q-learning algorithm to compute the optimal stationary policy for total-reward ERM and EVaR objectives with strong convergence and performance guarantees. The algorithm and its optimality are made possible by ERM’s dynamic consistency and elicitability. Our numerical results on tabular domains demonstrate quick and reliable convergence of the proposed Q-learning algorithm to the optimal risk-averse value function.

[LG-33] DCN2: Interplay of Implicit Collision Weights and Explicit Cross Layers for Large-Scale Recommendation KDD25

链接: https://arxiv.org/abs/2506.21624
作者: Blaž Škrlj,Yonatan Karni,Grega Gašperšič,Blaž Mramor,Yulia Stolin,Martin Jakomin,Jasna Urbančič,Yuval Dishi,Natalia Silberstein,Ophir Friedler,Assaf Klein
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: AdKDD 25

点击查看摘要

Abstract:The Deep and Cross architecture (DCNv2) is a robust production baseline and is integral to numerous real-life recommender systems. Its inherent efficiency and ability to model interactions often result in models that are both simpler and highly competitive compared to more computationally demanding alternatives, such as Deep FFMs. In this work, we introduce three significant algorithmic improvements to the DCNv2 architecture, detailing their formulation and behavior at scale. The enhanced architecture we refer to as DCN^2 is actively used in a live recommender system, processing over 0.5 billion predictions per second across diverse use cases where it out-performed DCNv2, both offline and online (ab tests). These improvements effectively address key limitations observed in the DCNv2, including information loss in Cross layers, implicit management of collisions through learnable lookup-level weights, and explicit modeling of pairwise similarities with a custom layer that emulates FFMs’ behavior. The superior performance of DCN^2 is also demonstrated on four publicly available benchmark data sets.

[LG-34] Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks

链接: https://arxiv.org/abs/2506.22429
作者: David Holzmüller,Max Schölpple
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While the theory of deep learning has made some progress in recent years, much of it is limited to the ReLU activation function. In particular, while the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks, their properties for most activation functions except for powers of the ReLU function are poorly understood. Our main contribution is to provide a more general characterization of the RKHS of these kernels for typical activation functions whose only non-smoothness is at zero, such as SELU, ELU, or LeakyReLU. Our analysis also covers a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, while polynomial activations generate non-equivalent RKHSs. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.

[LG-35] DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

链接: https://arxiv.org/abs/2506.22362
作者: Yang Yang,Yunpeng Li,George Sung,Shao-Fu Shih,Craig Dooley,Alessio Centazzo,Ramanan Rajeswaran
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to as semantic tokens and acoustic tokens. These tokens are often modeled autoregressively, with the inference speed being constrained by the token rate. In this work, we propose DiffSoundStream, a solution that improves the efficiency of speech tokenization in non-streaming scenarios through two techniques: (1) conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and (2) leveraging latent diffusion models to synthesize high-quality waveforms from semantic and coarse-level acoustic tokens. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model operating at twice the token rate. Additionally, we achieve step-size distillation using just four diffusion sampling steps with only a minor quality loss.

[LG-36] Robust quantum reservoir computers for forecasting chaotic dynamics: generalized synchronization and stability

链接: https://arxiv.org/abs/2506.22335
作者: Osama Ahmed,Felix Tennie,Luca Magri
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 28 pages, 12 figures

点击查看摘要

Abstract:We show that recurrent quantum reservoir computers (QRCs) and their recurrence-free architectures (RF-QRCs) are robust tools for learning and forecasting chaotic dynamics from time-series data. First, we formulate and interpret quantum reservoir computers as coupled dynamical systems, where the reservoir acts as a response system driven by training data; in other words, quantum reservoir computers are generalized-synchronization (GS) systems. Second, we show that quantum reservoir computers can learn chaotic dynamics and their invariant properties, such as Lyapunov spectra, attractor dimensions, and geometric properties such as the covariant Lyapunov vectors. This analysis is enabled by deriving the Jacobian of the quantum reservoir update. Third, by leveraging tools from generalized synchronization, we provide a method for designing robust quantum reservoir computers. We propose the criterion GS=ESP : GS implies the echo state property (ESP), and vice versa. We analytically show that RF-QRCs, by design, fulfill GS=ESP . Finally, we analyze the effect of simulated noise. We find that dissipation from noise enhances the robustness of quantum reservoir computers. Numerical verifications on systems of different dimensions support our conclusions. This work opens opportunities for designing robust quantum machines for chaotic time series forecasting on near-term quantum hardware.

[LG-37] A Plea for History and Philosophy of Statistics and Machine Learning

链接: https://arxiv.org/abs/2506.22236
作者: Hanti Lin
类目: Other Statistics (stat.OT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of the history and philosophy of statistics was initiated at least by Hacking (1965) and advanced by Mayo (1996), but it has not received sustained follow-up. Yet such integration is more urgent than ever, as the recent success of artificial intelligence has been driven largely by machine learning – a field historically developed alongside statistics. Today, the boundary between statistics and machine learning is increasingly blurred. What we now need is integration, twice over: of history and philosophy, and of the field they engage – statistics and machine learning. I present a case study of a philosophical idea in machine learning (and in formal epistemology) whose root can be traced back to an often under-appreciated insight in Neyman and Pearson’s 1936 work (a follow-up to their 1933 classic). This leads to the articulation of a foundational assumption – largely implicit in, but shared by, the practices of frequentist statistics and machine learning – which I call achievabilism. Another integration also emerges at the level of methodology, combining two ends of the philosophy of science spectrum: history and philosophy of science on the one hand, and formal epistemology on the other hand.

[LG-38] Uncovering smooth structures in single-cell data with PCS-guided neighbor embeddings

链接: https://arxiv.org/abs/2506.22228
作者: Rong Ma,Xi Li,Jingyuan Hu,Bin Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Genomics (q-bio.GN); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Single-cell sequencing is revolutionizing biology by enabling detailed investigations of cell-state transitions. Many biological processes unfold along continuous trajectories, yet it remains challenging to extract smooth, low-dimensional representations from inherently noisy, high-dimensional single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP, are widely used to embed high-dimensional single-cell data into low dimensions. But they often introduce undesirable distortions, resulting in misleading interpretations. Existing evaluation methods for NE algorithms primarily focus on separating discrete cell types rather than capturing continuous cell-state transitions, while dynamic modeling approaches rely on strong assumptions about cellular processes and specialized data. To address these challenges, we build on the Predictability-Computability-Stability (PCS) framework for reliable and reproducible data-driven discoveries. First, we systematically evaluate popular NE algorithms through empirical analysis, simulation, and theory, and reveal their key shortcomings, such as artifacts and instability. We then introduce NESS, a principled and interpretable machine learning approach to improve NE representations by leveraging algorithmic stability and to enable robust inference of smooth biological structures. NESS offers useful concepts, quantitative stability metrics, and efficient computational workflows to uncover developmental trajectories and cell-state transitions in single-cell data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent stem cell differentiation, organoid development, and multiple tissue-specific lineage trajectories. Across these diverse contexts, NESS consistently yields useful biological insights, such as identification of transitional and stable cell states and quantification of transcriptional dynamics during development.

[LG-39] Hybrid Generative Modeling for Incomplete Physics: Deep Grey-Box Meets Optimal Transport ICLR2025

链接: https://arxiv.org/abs/2506.22204
作者: Gurjeet Sangra Singh,Maciej Falkiewicz,Alexandros Kalousis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Workshop paper at ICLR 2025 (XAI4Science Workshop)

点击查看摘要

Abstract:Physics phenomena are often described by ordinary and/or partial differential equations (ODEs/PDEs), and solved analytically or numerically. Unfortunately, many real-world systems are described only approximately with missing or unknown terms in the equations. This makes the distribution of the physics model differ from the true data-generating process (DGP). Using limited and unpaired data between DGP observations and the imperfect model simulations, we investigate this particular setting by completing the known-physics model, combining theory-driven models and data-driven to describe the shifted distribution involved in the DGP. We present a novel hybrid generative model approach combining deep grey-box modelling with Optimal Transport (OT) methods to enhance incomplete physics models. Our method implements OT maps in data space while maintaining minimal source distribution distortion, demonstrating superior performance in resolving the unpaired problem and ensuring correct usage of physics parameters. Unlike black-box alternatives, our approach leverages physics-based inductive biases to accurately learn system dynamics while preserving interpretability through its domain knowledge foundation. Experimental results validate our method’s effectiveness in both generation tasks and model transparency, offering detailed insights into learned physics dynamics.

[LG-40] hompson Sampling in Function Spaces via Neural Operators

链接: https://arxiv.org/abs/2506.21894
作者: Rafael Oliveira,Xuesong Wang,Kian Ming A. Chai,Edwin V. Bonilla
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator’s output. We assume that functional evaluations are inexpensive, while queries to the operator (such as running a high-fidelity simulator) are costly. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This strategy avoids explicit uncertainty quantification by treating trained neural operators as approximate samples from a Gaussian process. We provide novel theoretical convergence guarantees, based on Gaussian processes in the infinite-dimensional setting, under minimal assumptions. We benchmark our method against existing baselines on functional optimization tasks involving partial differential equations and other nonlinear operator-driven phenomena, demonstrating improved sample efficiency and competitive performance.

[LG-41] Adversarial Threats in Quantum Machine Learning: A Survey of Attacks and Defenses

链接: https://arxiv.org/abs/2506.21842
作者: Archisman Ghosh,Satwik Kundu,Swaroop Ghosh
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Quantum Machine Learning (QML) integrates quantum computing with classical machine learning, primarily to solve classification, regression and generative tasks. However, its rapid development raises critical security challenges in the Noisy Intermediate-Scale Quantum (NISQ) era. This chapter examines adversarial threats unique to QML systems, focusing on vulnerabilities in cloud-based deployments, hybrid architectures, and quantum generative models. Key attack vectors include model stealing via transpilation or output extraction, data poisoning through quantum-specific perturbations, reverse engineering of proprietary variational quantum circuits, and backdoor attacks. Adversaries exploit noise-prone quantum hardware and insufficiently secured QML-as-a-Service (QMLaaS) workflows to compromise model integrity, ownership, and functionality. Defense mechanisms leverage quantum properties to counter these threats. Noise signatures from training hardware act as non-invasive watermarks, while hardware-aware obfuscation techniques and ensemble strategies disrupt cloning attempts. Emerging solutions also adapt classical adversarial training and differential privacy to quantum settings, addressing vulnerabilities in quantum neural networks and generative architectures. However, securing QML requires addressing open challenges such as balancing noise levels for reliability and security, mitigating cross-platform attacks, and developing quantum-classical trust frameworks. This chapter summarizes recent advances in attacks and defenses, offering a roadmap for researchers and practitioners to build robust, trustworthy QML systems resilient to evolving adversarial landscapes.

[LG-42] Fetal Sleep: A Cross-Species Review of Physiology Measurement and Classification

链接: https://arxiv.org/abs/2506.21828
作者: Weitao Tang,Johann Vargas-Calixto,Nasim Katebi,Robert Galinsky,Gari D. Clifford,Faezeh Marzbanrad
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Review article, 17 pages, 1 figure, 5 tables, submitted to Sleep (under review)

点击查看摘要

Abstract:Fetal sleep is a relatively underexplored yet vital aspect of prenatal neurodevelopment. Understanding fetal sleep patterns could provide insights into early brain maturation and help clinicians detect signs of neurological compromise that arise due to fetal hypoxia or fetal growth restriction. This review synthesizes over eight decades of research on the physiological characteristics, ontogeny, and regulation of fetal sleep. We compare sleep-state patterns in humans and large animal models, highlighting species-specific differences and the presence of sleep-state analogs. We review both invasive techniques in animals and non-invasive modalities in humans. Computational methods for sleep-state classification are also examined, including rule-based approaches (with and without clustering-based preprocessing) and state-of-the-art deep learning techniques. Finally, we discuss how intrauterine conditions such as hypoxia and fetal growth restriction can disrupt fetal sleep. This review provides a comprehensive foundation for the development of objective, multimodal, and non-invasive fetal sleep monitoring technologies to support early diagnosis and intervention in prenatal care.

[LG-43] Classification with Reject Option: Distribution-free Error Guarantees via Conformal Prediction

链接: https://arxiv.org/abs/2506.21802
作者: Johan Hallberg Szabadváry,Tuwe Löfström,Ulf Johansson,Cecilia Sönströd,Ernst Ahlberg,Lars Carlsson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 20 pages, 3 figures

点击查看摘要

Abstract:Machine learning (ML) models always make a prediction, even when they are likely to be wrong. This causes problems in practical applications, as we do not know if we should trust a prediction. ML with reject option addresses this issue by abstaining from making a prediction if it is likely to be incorrect. In this work, we formalise the approach to ML with reject option in binary classification, deriving theoretical guarantees on the resulting error rate. This is achieved through conformal prediction (CP), which produce prediction sets with distribution-free validity guarantees. In binary classification, CP can output prediction sets containing exactly one, two or no labels. By accepting only the singleton predictions, we turn CP into a binary classifier with reject option. Here, CP is formally put in the framework of predicting with reject option. We state and prove the resulting error rate, and give finite sample estimates. Numerical examples provide illustrations of derived error rate through several different conformal prediction settings, ranging from full conformal prediction to offline batch inductive conformal prediction. The former has a direct link to sharp validity guarantees, whereas the latter is more fuzzy in terms of validity guarantees but can be used in practice. Error-reject curves illustrate the trade-off between error rate and reject rate, and can serve to aid a user to set an acceptable error rate or reject rate in practice. Comments: 20 pages, 3 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2506.21802 [stat.ML] (or arXiv:2506.21802v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2506.21802 Focus to learn more arXiv-issued DOI via DataCite Journalreference: Machine Learning with Applications Volume 20, June 2025, 100664 Related DOI: https://doi.org/10.1016/j.mlwa.2025.100664 Focus to learn more DOI(s) linking to related resources

[LG-44] Searching Efficient Deep Architectures for Radar Target Detection using Monte-Carlo Tree Search

链接: https://arxiv.org/abs/2506.21772
作者: Noé Lallouet,Tristan Cazenave,Cyrille Enderli,Stéphanie Gourdin
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research works establish deep neural networks as high performing tools for radar target detection, especially on challenging environments (presence of clutter or interferences, multi-target scenarii…). However, the usually large computational complexity of these networks is one of the factors preventing them from being widely implemented in embedded radar systems. We propose to investigate novel neural architecture search (NAS) methods, based on Monte-Carlo Tree Search (MCTS), for finding neural networks achieving the required detection performance and striving towards a lower computational complexity. We evaluate the searched architectures on endoclutter radar signals, in order to compare their respective performance metrics and generalization properties. A novel network satisfying the required detection probability while being significantly lighter than the expert-designed baseline is proposed.

[LG-45] ADA: Improved Diffusion Sampling with Training-free Augmented Dynamics

链接: https://arxiv.org/abs/2506.21757
作者: Tianrong Chen,Huangjie Zheng,David Berthelot,Jiatao Gu,Josh Susskind,Shuangfei Zhai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images but typically suffer from inefficient sampling. Many solver designs and noise scheduling strategies have been proposed to dramatically improve sampling speeds. In this paper, we introduce a new sampling method that is up to 186% faster than the current state of the art solver for comparative FID on ImageNet512. This new sampling method is training-free and uses an ordinary differential equation (ODE) solver. The key to our method resides in using higher-dimensional initial noise, allowing to produce more detailed samples with less function evaluations from existing pretrained diffusion models. In addition, by design our solver allows to control the level of detail through a simple hyper-parameter at no extra computational cost. We present how our approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. Moreover, we observe the use of higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs). Finally, we demonstrate strong performances on a set of representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3, which cover models in both pixel and latent spaces, as well as class and text conditional settings. The code is available at this https URL.

[LG-46] Critically-Damped Higher-Order Langevin Dynamics

链接: https://arxiv.org/abs/2506.21741
作者: Benjamin Sterling,Chad Gueli,Mónica F. Bugallo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Denoising Diffusion Probabilistic Models represent an entirely new class of generative AI methods that have yet to be fully explored. Critical damping has been successfully introduced in Critically-Damped Langevin Dynamics (CLD) and Critically-Damped Third-Order Langevin Dynamics (TOLD++), but has not yet been applied to dynamics of arbitrary order. The proposed line of work generalizes Higher-Order Langevin Dynamics (HOLD), a recent state-of-the-art diffusion method, by introducing the concept of critical damping from systems analysis.

[LG-47] Modification of a Numerical Method Using FIR Filters in a Time-dependent SIR Model for COVID-19

链接: https://arxiv.org/abs/2506.21739
作者: Felipe Rogério Pimentel,Rafael Gustavo Alves
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 14 pages, 3 figures, 3 tables, and 2 algorithms

点击查看摘要

Abstract:Authors Yi-Cheng Chen, Ping-En Lu, Cheng-Shang Chang, and Tzu-Hsuan Liu use the Finite Impulse Response (FIR) linear system filtering method to track and predict the number of people infected and recovered from COVID-19, in a pandemic context in which there was still no vaccine and the only way to avoid contagion was isolation. To estimate the coefficients of these FIR filters, Chen et al. used machine learning methods through a classical optimization problem with regularization (ridge regression). These estimated coefficients are called ridge coefficients. The epidemic mathematical model adopted by these researchers to formulate the FIR filters is the time-dependent discrete SIR. In this paper, we propose a small modification to the algorithm of Chen et al. to obtain the ridge coefficients. We then used this modified algorithm to track and predict the number of people infected and recovered from COVID-19 in the state of Minas Gerais/Brazil, within a prediction window, during the initial period of the pandemic. We also compare the predicted data with the respective real data to check how good the approximation is. In the modified algorithm, we set values for the FIR filter orders and for the regularization parameters, both different from the respective values defined by Chen et al. in their algorithm. In this context, the numerical results obtained by the modified algorithm in some simulations present better approximation errors compared to the respective approximation errors presented by the algorithm of Chen et al.

[LG-48] CaloHadronic: a diffusion model for the generation of hadronic showers

链接: https://arxiv.org/abs/2506.21720
作者: Thorsten Buss,Frank Gaede,Gregor Kasieczka,Anatolii Korol,Katja Krüger,Peter McKeown,Martina Mozzanica
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:Simulating showers of particles in highly-granular calorimeters is a key frontier in the application of machine learning to particle physics. Achieving high accuracy and speed with generative machine learning models can enable them to augment traditional simulations and alleviate a major computing constraint. Recent developments have shown how diffusion based generative shower simulation approaches that do not rely on a fixed structure, but instead generate geometry-independent point clouds, are very efficient. We present a transformer-based extension to previous architectures which were developed for simulating electromagnetic showers in the highly granular electromagnetic calorimeter of the International Large Detector, ILD. The attention mechanism now allows us to generate complex hadronic showers with more pronounced substructure across both the electromagnetic and hadronic calorimeters. This is the first time that machine learning methods are used to holistically generate showers across the electromagnetic and hadronic calorimeter in highly granular imaging calorimeter systems.

信息检索

[IR-0] HLTCOE at LiveRAG : GPT -Researcher using ColBERT retrieval

链接: https://arxiv.org/abs/2506.22356
作者: Kevin Duh,Eugene Yang,Orion Weller,Andrew Yates,Dawn Lawrie
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:The HLTCOE LiveRAG submission utilized the GPT-researcher framework for researching the context of the question, filtering the returned results, and generating the final answer. The retrieval system was a ColBERT bi-encoder architecture, which represents a passage with many dense tokens. Retrieval used a local, compressed index of the FineWeb10-BT collection created with PLAID-X, using a model fine-tuned for multilingual retrieval. Query generation from context was done with Qwen2.5-7B-Instruct, while filtering was accomplished with m2-bert-80M-8k-retrieval. Up to nine passages were used as context to generate an answer using Falcon3-10B. This system placed 5th in the LiveRAG automatic evaluation for correctness with a score of 1.07.

[IR-1] Education-Oriented Graph Retrieval-Augmented Generation for Learning Path Recommendation

链接: https://arxiv.org/abs/2506.22303
作者: Xinghe Cheng,Zihan Zhang,Jiapu Wang,Liangda Fang,Chaobo He,Quanlong Guan,Shirui Pan,Weiqi Luo
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Learning path recommendation seeks to provide learners with a structured sequence of learning items (e.g., knowledge concepts or exercises) to optimize their learning efficiency. Despite significant efforts in this area, most existing methods primarily rely on prerequisite relationships, which present two major limitations: 1) Many educational datasets do not explicitly provide prerequisite relationships between knowledge concepts, hindering the application of current learning path recommendation methods. 2) Relying solely on prerequisite relationships as the sole knowledge structure can impede learning progress and negatively impact student outcomes. To address these challenges, we propose a novel approach, Discrimination Learning Enhances Learning Path Recommendation (DLELP), which enhances learning path recommendations by incorporating both prerequisite and similarity relationships between knowledge concepts. Specifically, we introduce a knowledge concept structure graph generation module that adaptively constructs knowledge concept structure graphs for different educational datasets, significantly improving the generalizability of learning path recommendation methods. We then propose a Discrimination Learning-driven Reinforcement Learning (DLRL) framework, which mitigates the issue of blocked learning paths, further enhancing the efficacy of learning path recommendations. Finally, we conduct extensive experiments on three benchmark datasets, demonstrating that our method not only achieves state-of-the-art performance but also provides interpretable reasoning for the recommended learning paths.

[IR-2] JointRank: Rank Large Set with Single Pass ICTIR’25

链接: https://arxiv.org/abs/2506.22262
作者: Evgeny Dedov
类目: Information Retrieval (cs.IR)
*备注: ICTIR’25 Accepted

点击查看摘要

Abstract:Efficiently ranking relevant items from large candidate pools is a cornerstone of modern information retrieval systems – such as web search, recommendation, and retrieval-augmented generation. Listwise rerankers, which improve relevance by jointly considering multiple candidates, are often limited in practice: either by model input size constraints, or by degraded quality when processing large sets. We propose a model-agnostic method for fast reranking large sets that exceed a model input limits. The method first partitions candidate items into overlapping blocks, each of which is ranked independently in parallel. Implicit pairwise comparisons are then derived from these local rankings. Finally, these comparisons are aggregated to construct a global ranking using algorithms such as Winrate or PageRank. Experiments on TREC DL-2019 show that our method achieves an nDCG@10 of 70.88 compared to the 57.68 for full-context listwise approach using gpt-4.1-mini as long-context model, while reducing latency from 21 to 8 seconds. The implementation of the algorithm and the experiments is available in the repository: this https URL Comments: ICTIR’25 Accepted Subjects: Information Retrieval (cs.IR) ACMclasses: H.3.3 Cite as: arXiv:2506.22262 [cs.IR] (or arXiv:2506.22262v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2506.22262 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3731120.3744587 Focus to learn more DOI(s) linking to related resources

[IR-3] UiS-IAI@LiveRAG : Retrieval-Augmented Information Nugget-Based Generation of Responses

链接: https://arxiv.org/abs/2506.22210
作者: Weronika Łajewska,Ivica Kostric,Gabriel Iturra-Bocaz,Mariam Arustashvili,Krisztian Balog
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) faces challenges related to factual correctness, source attribution, and response completeness. The LiveRAG Challenge hosted at SIGIR’25 aims to advance RAG research using a fixed corpus and a shared, open-source LLM. We propose a modular pipeline that operates on information nuggets-minimal, atomic units of relevant information extracted from retrieved documents. This multistage pipeline encompasses query rewriting, passage retrieval and reranking, nugget detection and clustering, cluster ranking and summarization, and response fluency enhancement. This design inherently promotes grounding in specific facts, facilitates source attribution, and ensures maximum information inclusion within length constraints. In this challenge, we extend our focus to also address the retrieval component of RAG, building upon our prior work on multi-faceted query rewriting. Furthermore, for augmented generation, we concentrate on improving context curation capabilities, maximizing the breadth of information covered in the response while ensuring pipeline efficiency. Our results show that combining original queries with a few sub-query rewrites boosts recall, while increasing the number of documents used for reranking and generation beyond a certain point reduces effectiveness, without improving response quality.

[IR-4] he Missing Link: Joint Legal Citation Prediction using Heterogeneous Graph Enrichment

链接: https://arxiv.org/abs/2506.22165
作者: Lorenz Wendlinger,Simon Alexander Nonn,Abdullah Al Zubaer,Michael Granitzer
类目: ocial and Information Networks (cs.SI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Legal systems heavily rely on cross-citations of legal norms as well as previous court decisions. Practitioners, novices and legal AI systems need access to these relevant data to inform appraisals and judgments. We propose a Graph-Neural-Network (GNN) link prediction model that can identify Case-Law and Case-Case citations with high proficiency through fusion of semantic and topological information. We introduce adapted relational graph convolutions operating on an extended and enriched version of the original citation graph that allow the topological integration of semantic meta-information. This further improves prediction by 3.1 points of average precision and by 8.5 points in data sparsity as well as showing robust performance over time and in challenging fully inductive prediction. Jointly learning and predicting case and norm citations achieves a large synergistic effect that improves case citation prediction by up to 4.7 points, at almost doubled efficiency.

[IR-5] Reward Balancing Revisited: Enhancing Offline Reinforcement Learning for Recommender Systems

链接: https://arxiv.org/abs/2506.22112
作者: Wenzheng Shu,Yanxiang Zeng,Yongxiang Tang,Teng Sha,Ning Luo,Yanhua Cheng,Xialong Liu,Fan Zhou,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注: Accepted in Companion Proceedings of the ACM Web Conference 2025

点击查看摘要

Abstract:Offline reinforcement learning (RL) has emerged as a prevalent and effective methodology for real-world recommender systems, enabling learning policies from historical data and capturing user preferences. In offline RL, reward shaping encounters significant challenges, with past efforts to incorporate prior strategies for uncertainty to improve world models or penalize underexplored state-action pairs. Despite these efforts, a critical gap remains: the simultaneous balancing of intrinsic biases in world models and the diversity of policy recommendations. To address this limitation, we present an innovative offline RL framework termed Reallocated Reward for Recommender Systems (R3S). By integrating inherent model uncertainty to tackle the intrinsic fluctuations in reward predictions, we boost diversity for decision-making to align with a more interactive paradigm, incorporating extra penalizers with decay that deter actions leading to diminished state variety at both local and global scales. The experimental results demonstrate that R3S improves the accuracy of world models and efficiently harmonizes the heterogeneous preferences of the users.

[IR-6] SERP Interference Network and Its Applications in Search Advertising KDD2024 KDD

链接: https://arxiv.org/abs/2506.21598
作者: Purak Jain,Sandeep Appala
类目: Information Retrieval (cs.IR); Methodology (stat.ME)
*备注: This is an extended version of our paper published at the AdKDD 2024 workshop, co-located with ACM KDD. CEUR-WS proceedings: this https URL

点击查看摘要

Abstract:Search Engine marketing teams in the e-commerce industry manage global search engine traffic to their websites with the aim to optimize long-term profitability by delivering the best possible customer experience on Search Engine Results Pages (SERPs). In order to do so, they need to run continuous and rapid Search Marketing A/B tests to continuously evolve and improve their products. However, unlike typical e-commerce A/B tests that can randomize based on customer identification, their tests face the challenge of anonymized users on search engines. On the other hand, simply randomizing on products violates Stable Unit Treatment Value Assumption for most treatments of interest. In this work, we propose leveraging censored observational data to construct bipartite (Search Query to Product Ad or Text Ad) SERP interference networks. Using a novel weighting function, we create weighted projections to form unipartite graphs which can then be use to create clusters to randomized on. We demonstrate this experimental design’s application in evaluating a new bidding algorithm for Paid Search. Additionally, we provide a blueprint of a novel system architecture utilizing SageMaker which enables polyglot programming to implement each component of the experimental framework.

[IR-7] PentaRAG : Large-Scale Intelligent Knowledge Retrieval for Enterprise LLM Applications

链接: https://arxiv.org/abs/2506.21593
作者: Abu Hanif Muhammad Syarubany,Chang Dong Yoo
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注: Annual Conference of The Institute of Electronics and Information Engineers

点击查看摘要

Abstract:Enterprise deployments of large-language model (LLM) demand continuously changing document collections with sub-second latency and predictable GPU cost requirements that classical Retrieval-Augmented Generation (RAG) pipelines only partially satisfy. We present PentaRAG, a five-layer module that routes each query through two instant caches (fixed key-value and semantic), a memory-recall mode that exploits the LLM’s own weights, an adaptive session memory, and a conventional retrieval-augmentation layer. Implemented with Mistral-8B, Milvus and vLLM, the system can answer most repeated or semantically similar questions from low-latency caches while retaining full retrieval for novel queries. On the TriviaQA domain, LoRA fine-tuning combined with the memory-recall layer raises answer similarity by approximately 8% and factual correctness by approximately 16% over the base model. Under a nine-session runtime simulation, cache warming reduces mean latency from several seconds to well below one second and shifts traffic toward the fast paths. Resource-efficiency tests show that PentaRAG cuts average GPU time to 0.248 seconds per query, roughly half that of a naive RAG baseline, and sustains an aggregate throughput of approximately 100,000 queries per second on our setup. These results demonstrate that a layered routing strategy can deliver freshness, speed, and efficiency simultaneously in production-grade RAG systems.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-06-30

目录

概览 (2025-06-30)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载