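This post lists the latest papers retrieved from arXiv.org on 2025-05-28. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every morning.

Tip: if you would like to receive the daily paper data by email, please leave your email address in the comments.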

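Contents

Overview (2025-05-28)

751 papers were updated today, including:

  • Natural language processing: 182 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 242 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 179 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 240 papers (Machine Learning (cs.LG))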

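Natural Language Processing

[NLP-0] How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

[Quick read]: This paper seeks a deeper understanding of the internal mechanisms of multilingual large language models (LLMs) and of how their multilingual capabilities are enhanced. The key to its approach is a fine-grained neuron identification algorithm that detects language-specific neurons, language-related neurons, and language-agnostic neurons. Based on how these neuron types are distributed, the model's internal multilingual inference process is divided into four stages: multilingual understanding, shared semantic space reasoning, multilingual output space transformation, and vocabulary space output. The approach offers a new perspective and empirical evidence for analyzing multilingual alignment and the multilingual capabilities of LLMs.

Link: https://arxiv.org/abs/2505.21505
Authors: Shimao Zhang,Zhejian Lai,Xiang Liu,Shuaijie She,Xiao Liu,Yeyun Gong,Shujian Huang,Jiajun Chen
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; Microsoft Research Asia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: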

Abstract:Multilingual Alignment is an effective and representative paradigm to enhance LLMs’ multilingual capabilities, which transfers the capabilities from the high-resource languages to the low-resource languages. Meanwhile, some research on language-specific neurons reveals that there are language-specific neurons that are selectively activated in LLMs when processing different languages. This provides a new perspective to analyze and understand LLMs’ mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs’ internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of “Spontaneous Multilingual Alignment”. Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding multilingual alignment and multilingual capabilities of LLMs.

[NLP-1] Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

[Quick read]: This paper addresses the "Silent Agreement" problem that arises when large language models (LLMs) are used for clinical question answering: agents converge prematurely on a diagnosis without sufficient critical analysis, especially in complex or ambiguous cases. The key to the solution is the Catfish Agent, a role-specialized LLM that injects structured dissent to break silent agreement and stimulate deeper reasoning. The method has two core mechanisms: a complexity-aware intervention that depends on case difficulty, and a tone-calibrated intervention that balances critique and collaboration.

Link: https://arxiv.org/abs/2505.21503
Authors: Yihan Wang,Qiao Yan,Zhenghao Xing,Lihao Liu,Junjun He,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng
Affiliations: The Chinese University of Hong Kong; Amazon; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
Comments:

Abstract:Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the “catfish effect” in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical QA and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.

[NLP-2] ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

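[Quick read]: This paper addresses the limitations of vision-language models (VLMs) in cross-viewpoint understanding and spatial reasoning, in particular their poor performance when reasoning from a non-first-person viewpoint such as another person's perspective. The key to the solution is ViewSpatial-Bench, the first comprehensive benchmark for multi-viewpoint spatial localization recognition, combined with an automated 3D annotation pipeline that generates precise directional labels. Fine-tuning VLMs on the multi-perspective spatial dataset yields an overall performance improvement of 46.24% across tasks, which supports the claim that modeling 3D spatial relationships strengthens the spatial understanding of VLMs.

Link: https://arxiv.org/abs/2505.21500
Authors: Dingming Li,Hongxing Li,Zixuan Wang,Yuchen Yan,Hang Zhang,Siqi Chen,Guiyang Hou,Shengpei Jiang,Wenqi Zhang,Yongliang Shen,Weiming Lu,Yueting Zhuang
Affiliations: Zhejiang University; University of Electronic Science and Technology of China; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project: this https URL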

Abstract:Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera’s perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity’s spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs’ corresponding spatial comprehension capabilities.

[NLP-3] Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers

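[Quick read]: This paper tackles academic poster generation, a crucial yet challenging task in scientific communication: compressing a long document with interleaved text and figures into a single, visually coherent page. The key to the solution is the first benchmark and metric suite for poster generation, covering visual quality, textual coherence, holistic assessment, and PaperQuiz, together with PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline. A Parser, a Planner, and a Painter-Commenter loop handle structured content extraction, layout planning, and panel refinement, improving the poster's visual consistency and its ability to convey information.

Link: https://arxiv.org/abs/2505.21497
Authors: Wei Pang,Kevin Qinghong Lin,Xiangru Jian,Xi He,Philip Torr
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Project Page: this https URL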

Abstract:Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster’s ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at this https URL.

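[NLP-4] UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents

[Quick read]: This paper addresses two key problems in GUI agents: trajectory outcomes are hard to verify, and high-quality training data does not scale. The key to the solution is a reward model (UI-Genie-RM) and a self-improving pipeline. UI-Genie-RM uses an image-text interleaved architecture that processes historical context efficiently and unifies action-level and task-level rewards; its training is supported by data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. For data scalability, the self-improving pipeline progressively expands the set of solvable complex GUI tasks through reward-guided exploration and outcome verification in dynamic environments.

Link: https://arxiv.org/abs/2505.21496
Authors: Han Xiao,Guozhi Wang,Yuxiang Chai,Zimu Lu,Weifeng Lin,Hao He,Lue Fan,Liuyang Bian,Rui Hu,Liang Liu,Shuai Ren,Yafei Wen,Xiaoxin Chen,Aojun Zhou,Hongsheng Li
Affiliations: vivo AI Lab
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this https URL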

Abstract:In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research in this https URL.

[NLP-5] Reinforcing General Reasoning without Verifiers

[Quick read]: This paper addresses a limitation of training large language models (LLMs) with DeepSeek-R1-Zero-style reinforcement learning (RL): the reliance on verifiable rewards restricts the approach to tasks with rule-based answer checking and makes it hard to extend to real-world domains such as chemistry, healthcare, and engineering. The key to the solution is a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer, which preserves performance while reducing compute requirements and improving practicality.

Link: https://arxiv.org/abs/2505.21493
Authors: Xiangxin Zhou,Zichen Liu,Anya Sims,Haonan Wang,Tianyu Pang,Chongxuan Li,Liang Wang,Min Lin,Chao Du
Affiliations: Sea AI Lab; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; National University of Singapore; University of Oxford; Renmin University of China
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at this https URL.
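The verifier-free idea can be illustrated with a minimal sketch, which is not the authors' implementation. Instead of asking a verifier whether a sampled answer is correct, the reward is the probability that the policy itself assigns to the reference answer after its sampled reasoning trace. The function name `reference_answer_reward` and the toy per-token log-probabilities are hypothetical; a real setup would obtain them from the policy model.

```python
import math
from typing import Sequence


def reference_answer_reward(ref_token_logprobs: Sequence[float]) -> float:
    """Return p(reference answer | question, sampled reasoning).

    `ref_token_logprobs` holds the policy's log-probability of each
    reference-answer token, conditioned on the question, the sampled
    reasoning trace, and the preceding reference tokens. Computing these
    values is assumed to happen elsewhere, inside the policy model.
    """
    return math.exp(sum(ref_token_logprobs))


# Toy numbers (hypothetical): two reasoning traces for the same question.
good_trace = [math.log(0.9), math.log(0.8)]  # reference answer is likely
bad_trace = [math.log(0.2), math.log(0.1)]   # reference answer is unlikely

rewards = [reference_answer_reward(t) for t in (good_trace, bad_trace)]
```

A trace that makes the reference answer more probable receives the larger reward, so no separate verifier model has to be kept in memory during training.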

[NLP-6] Hardware-Efficient Attention for Fast Decoding

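[Quick read]: This paper addresses the performance bottleneck of large language model (LLM) decoding with large batches and long contexts: loading the key-value (KV) cache from high-bandwidth memory inflates per-token latency, and the sequential nature of decoding limits parallelism. The key to the solution is to redesign attention so that hardware efficiency is maximized without sacrificing parallel scalability. Two methods are proposed: Grouped-Tied Attention (GTA) combines and reuses key and value states to reduce memory transfers, and Grouped Latent Attention (GLA) pairs a parallel-friendly latent attention with low-level optimizations for fast decoding while maintaining high model quality.

Link: https://arxiv.org/abs/2505.21487
Authors: Ted Zadouri,Hubert Strauss,Tri Dao
Affiliations: Princeton University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 37 pages, 15 figures, 45 tables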

Abstract:LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2× faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2×.
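A back-of-the-envelope calculation shows why the KV cache dominates memory traffic and why tying keys and values roughly halves it. The sketch below uses the standard cache-size formula (layers × KV heads × head dimension × tokens × bytes per element × tensors per head). The model dimensions are hypothetical, and storing a single shared tensor per head is a simplifying assumption made for illustration, not a description of GTA's actual layout.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int,
                   bytes_per_elem: int = 2,
                   tensors_per_head: int = 2) -> int:
    """Total KV-cache size in bytes.

    tensors_per_head=2 stores separate K and V states (MHA/GQA style);
    tensors_per_head=1 models a tied state shared by K and V
    (a simplification assumed here for illustration).
    """
    return (layers * kv_heads * head_dim * seq_len * batch
            * bytes_per_elem * tensors_per_head)


# Hypothetical configuration: 32 layers, 8 KV heads, head_dim 128, fp16.
config = dict(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=16)

separate_kv = kv_cache_bytes(**config)                  # separate K and V
tied_kv = kv_cache_bytes(**config, tensors_per_head=1)  # one shared state

print(separate_kv / 2**30, "GiB vs", tied_kv / 2**30, "GiB")
```

Every decoding step must read this cache from memory, so halving its size halves the bytes loaded per generated token under these assumptions.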

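[NLP-7] Are Language Models Consequentialist or Deontological Moral Reasoners?

[Quick read]: This paper asks how large language models (LLMs) reason morally when they handle ethically complex scenarios. Prior work has focused mostly on the moral judgments of LLMs and has not analyzed the underlying reasoning process in depth. The key to the solution is a large-scale analysis of the moral reasoning traces produced by LLMs, using more than 600 distinct trolley problems as probes to reveal reasoning patterns across different models. The paper also introduces and tests a taxonomy of moral rationales based on two main normative ethical theories, consequentialism and deontology, so that reasoning traces can be classified systematically.

Link: https://arxiv.org/abs/2505.21479
Authors: Keenan Samway,Max Kleiman-Weiner,David Guzman Piedrahita,Rada Mihalcea,Bernhard Schölkopf,Zhijing Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: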

Abstract:As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at this https URL .

[NLP-8] Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration

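[Quick read]: This paper addresses hallucination in large vision-language models (LVLMs) on multimodal tasks, where the model confidently describes objects or attributes that are not present in the image. The key to the solution is the Confidence-Aware Attention Calibration (CAAC) framework, which intervenes on spatial perception bias and modality bias to balance the attention distribution and reinforce visual grounding, improving visual consistency and accuracy during generation.

Link: https://arxiv.org/abs/2505.21472
Authors: Mehrdad Fazli,Bowen Wei,Ziwei Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: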

Abstract:Large vision-language models (LVLMs) achieve impressive performance on multimodal tasks but often suffer from hallucination, and confidently describe objects or attributes not present in the image. Current inference-time interventions, while training-free, struggle to maintain accuracy in open-ended and long-form generation scenarios. We introduce the Confidence-Aware Attention Calibration (CAAC) framework to address this challenge by targeting two key biases: spatial perception bias, which distributes attention disproportionately across image tokens, and modality bias, which shifts focus from visual to textual inputs over time. CAAC employs a two-step approach: Visual-Token Calibration (VTC) to balance attention across visual tokens, and Adaptive Attention Re-Scaling (AAR) to reinforce visual grounding based on the model’s confidence. This confidence-driven adjustment ensures consistent visual alignment during generation. Experiments on CHAIR, AMBER, and POPE benchmarks demonstrate that CAAC outperforms baselines, particularly in long-form generations, effectively reducing hallucination.

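[NLP-9] Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

[Quick read]: This paper addresses the limited amount of external knowledge that large language models (LLMs) can take in when a task requires a lot of it, a limit imposed by the finite context window. Existing context window extension methods inevitably lose information, and LLM-based multi-agent methods, although they can process massive input in a distributed way, still have core bottlenecks in knowledge synchronization and reasoning. The proposed solution is a multi-agent framework named ExtAgents, whose key idea is to improve the knowledge synchronization mechanism and the reasoning process so that inference-time knowledge integration scales without longer-context training. Experiments show that, for the same amount of external knowledge input, ExtAgents significantly improves performance whether or not that input exceeds the context window, and that it stays efficient thanks to high parallelism.

Link: https://arxiv.org/abs/2505.21471
Authors: Zijun Liu,Zhennan Wan,Peng Li,Ming Yan,Ji Zhang,Fei Huang,Yang Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 9 figures. Code and data are available at this https URL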

Abstract:With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributional manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.

[NLP-10] Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

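[Quick read]: This paper addresses the high computational cost and latency of inference in diffusion language models (DLMs), as well as the token incoherence caused by parallel generation. The key to the solution is two training-free techniques. First, FreeCache reuses stable key-value (KV) projections across denoising steps to reduce computation. Second, Guided Diffusion uses a lightweight pretrained autoregressive model to supervise token unmasking, sharply reducing the number of denoising iterations without hurting quality. Together the two methods deliver up to a 34x end-to-end speedup, bringing DLM latency to a level comparable to, or even better than, that of autoregressive models.

Link: https://arxiv.org/abs/2505.21467
Authors: Zhanqiu Hu,Jian Meng,Yash Akhauri,Mohamed S. Abdelfattah,Jae-sun Seo,Zhiru Zhang,Udit Gupta
Affiliations: Cornell University
Categories: Computation and Language (cs.CL)
Comments: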

Abstract:Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains.

[NLP-11] ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

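[Quick read]: This paper addresses two problems in a common approach to improving vision-language models (VLMs), namely encoding both the high-resolution image and its thumbnail. First, the approach produces a large number of image tokens. Second, when it is combined with Rotary Position Embedding (RoPE), the long-term decay property of RoPE hinders interaction between high-resolution tokens and thumbnail tokens, and between text and image. The key to the solution is ID-Align, which reorders position IDs so that high-resolution tokens inherit the IDs of their corresponding thumbnail tokens while the overexpansion of positional indices is constrained.

Link: https://arxiv.org/abs/2505.21465
Authors: Bozhou Li,Wentao Zhang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: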

Abstract:Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench’s relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: this https URL.
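The position-ID inheritance described above can be sketched on a toy grid. This is an illustrative reading of the abstract, not the paper's code: it assumes a square thumbnail of `t × t` tokens numbered row by row, and a high-resolution grid that is an integer multiple `scale` larger, so each high-resolution token maps to the thumbnail token covering the same image region. The function names and the grid layout are hypothetical.

```python
from typing import List


def thumbnail_ids(t: int, start: int = 0) -> List[List[int]]:
    """Row-major position IDs for a t x t thumbnail token grid."""
    return [[start + r * t + c for c in range(t)] for r in range(t)]


def inherited_highres_ids(thumb_ids: List[List[int]],
                          scale: int) -> List[List[int]]:
    """Give each high-resolution token the ID of its thumbnail token.

    The high-resolution grid is assumed to be (t*scale) x (t*scale);
    token (r, c) inherits the ID at (r // scale, c // scale).
    """
    t = len(thumb_ids)
    size = t * scale
    return [[thumb_ids[r // scale][c // scale] for c in range(size)]
            for r in range(size)]


thumb = thumbnail_ids(2)                # [[0, 1], [2, 3]]
high = inherited_highres_ids(thumb, 2)  # 4 x 4 grid reusing IDs 0..3
for row in high:
    print(row)
```

In this toy case the 16 high-resolution tokens reuse the four thumbnail IDs, so the largest position index stays at 3 rather than growing with the number of image tokens, which is the effect the abstract attributes to constraining positional indices.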

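[NLP-12] Do LLMs Need to Think in One Language? Correlation between Latent Language and Task Performance

[Quick read]: This paper asks how the discrepancy between the latent language of large language models (LLMs) and the input/output language affects downstream task performance. The study tests whether a consistent latent language improves downstream results by varying the input prompt language across several downstream tasks and analyzing the correlation between latent language consistency and task performance. The experiments show that keeping the latent language consistent is not always necessary for optimal downstream performance, because the models adapt their internal representations near the final layers to match the target language, which reduces the impact of inconsistency.

Link: https://arxiv.org/abs/2505.21458
Authors: Shintaro Ozaki,Tatsuya Hiraoka,Hiroto Otake,Hiroki Ouchi,Masaru Isonuma,Benjamin Heinzerling,Kentaro Inui,Taro Watanabe,Yusuke Miyao,Yohei Oseki,Yu Takagi
Affiliations: NAIST; NII LLMC; MBZUAI; RIKEN; Tohoku University; The University of Tokyo; Nagoya Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: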

Abstract:Large Language Models (LLMs) are known to process information using a proficient internal language consistently, referred to as latent language, which may differ from the input or output languages. However, how the discrepancy between the latent language and the input and output language affects downstream task performance remains largely unexplored. While many studies research the latent language of LLMs, few address its importance in influencing task performance. In our study, we hypothesize that thinking in latent language consistently enhances downstream task performance. To validate this, our work varies the input prompt languages across multiple downstream tasks and analyzes the correlation between consistency in latent language and task performance. We create datasets consisting of questions from diverse domains such as translation and geo-culture, which are influenced by the choice of latent language. Experimental results across multiple LLMs on translation and geo-culture tasks, which are sensitive to the choice of language, indicate that maintaining consistency in latent language is not always necessary for optimal downstream task performance. This is because these models adapt their internal representations near the final layers to match the target language, reducing the impact of consistency on overall performance.

[NLP-13] Words Like Knives: Backstory-Personalized Modeling and Detection of Violent Communication

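[Quick read]: This paper addresses a gap in existing natural language processing (NLP) research on detecting conversational breakdowns in close relationships: the effect of relationship background on how a conversation is perceived is overlooked. The key to the solution is to bring in nonviolent communication (NVC) theory and build the PersonaConflicts Corpus, a dataset of simulated realistic dialogues used to evaluate how well large language models (LLMs) detect conflict when relationship background is taken into account. The study finds that the polarity of the relationship backstory significantly shifts human perception of conversational breakdowns, while models struggle to make meaningful use of that backstory.

Link: https://arxiv.org/abs/2505.21451
Authors: Jocelyn Shen,Akhila Yerukola,Xuhui Zhou,Cynthia Breazeal,Maarten Sap,Hae Won Park
Affiliations: Massachusetts Institute of Technology; Carnegie Mellon University; Allen Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: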

Abstract:Conversational breakdowns in close relationships are deeply shaped by personal histories and emotional context, yet most NLP research treats conflict detection as a general task, overlooking the relational dynamics that influence how messages are perceived. In this work, we leverage nonviolent communication (NVC) theory to evaluate LLMs in detecting conversational breakdowns and assessing how relationship backstory influences both human and model perception of conflicts. Given the sensitivity and scarcity of real-world datasets featuring conflict between familiar social partners with rich personal backstories, we contribute the PersonaConflicts Corpus, a dataset of N=5,772 naturalistic simulated dialogues spanning diverse conflict scenarios between friends, family members, and romantic partners. Through a controlled human study, we annotate a subset of dialogues and obtain fine-grained labels of communication breakdown types on individual turns, and assess the impact of backstory on human and model perception of conflict in conversation. We find that the polarity of relationship backstories significantly shifted human perception of communication breakdowns and impressions of the social partners, yet models struggle to meaningfully leverage those backstories in the detection task. Additionally, we find that models consistently overestimate how positively a message will make a listener feel. Our findings underscore the critical role of personalization to relationship contexts in enabling LLMs to serve as effective mediators in human communication for authentic connection.

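[NLP-14] Towards Better Instruction Following Retrieval Models

[Quick read]: This paper addresses the weak performance of modern information retrieval (IR) models on explicit user instructions: because they are trained only on standard query-passage pairs, they struggle to interpret and follow instructions. The key to the solution is InF-IR, a large-scale, high-quality training corpus containing more than 38,000 expressive (instruction, query, passage) triplets as positive samples. For each positive triplet, two additional hard negatives are generated by poisoning the instruction and the query, and these are rigorously validated by an advanced reasoning model (o3-mini) to ensure they remain semantically plausible while being instructionally incorrect. These highly contrastive positive and negative triplets enable efficient representation learning for smaller encoder-only models, which in turn supports direct embedding-based retrieval.

Link: https://arxiv.org/abs/2505.21439
Authors: Yuchen Zhuang,Aaron Trinh,Rushi Qiang,Haotian Sun,Chao Zhang,Hanjun Dai,Bo Dai
Affiliations: Georgia Institute of Technology; precur.ai
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Retrieval Models, Embedding, Retrieval with Instructions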

Abstract:Modern information retrieval (IR) models, trained exclusively on standard (query, passage) pairs, struggle to effectively interpret and follow explicit user instructions. We introduce InF-IR, a large-scale, high-quality training corpus tailored for enhancing retrieval models in Instruction-Following IR. InF-IR expands traditional training pairs into over 38,000 expressive (instruction, query, passage) triplets as positive samples. In particular, for each positive triplet, we generate two additional hard negative examples by poisoning both instructions and queries, then rigorously validated by an advanced reasoning model (o3-mini) to ensure semantic plausibility while maintaining instructional incorrectness. Unlike existing corpora that primarily support computationally intensive reranking tasks for decoder-only language models, the highly contrastive positive-negative triplets in InF-IR further enable efficient representation learning for smaller encoder-only models, facilitating direct embedding-based retrieval. Using this corpus, we train InF-Embed, an instruction-aware Embedding model optimized through contrastive learning and instruction-query attention mechanisms to align retrieval outcomes precisely with user intents. Extensive experiments across five instruction-based retrieval benchmarks demonstrate that InF-Embed significantly surpasses competitive baselines by 8.1% in p-MRR, measuring the instruction-following capabilities.
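To show how one positive and two hard negatives can drive contrastive learning, here is a generic InfoNCE-style loss in plain Python. The abstract confirms only that contrastive learning is used; the cosine similarity, the temperature value, the choice of the passage embedding as the anchor, and the toy vectors are all assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical values).
passage = [1.0, 0.0]
clean_pair = [0.9, 0.1]            # embedding of (instruction, query)
poisoned_instruction = [0.1, 0.9]  # hard negative 1
poisoned_query = [0.5, 0.5]        # hard negative 2

loss_aligned = info_nce(passage, clean_pair,
                        [poisoned_instruction, poisoned_query])
loss_misaligned = info_nce(passage, poisoned_instruction,
                           [clean_pair, poisoned_query])
```

The loss is small when the passage sits closest to the clean instruction-query pair and large when a poisoned variant is treated as the match, which is the signal hard negatives are meant to provide.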

[NLP-15] RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation

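[Quick read]: This paper addresses the difficulty large language models (LLMs) have in complex problem-solving tasks that lack predefined tools: limited internal knowledge prevents them from generating suitable tools on their own. The key to the solution is RefTool, a framework that uses structured external materials such as textbooks as references. It guides LLMs to generate executable tools from the reference content, validates them with illustrative examples, and organizes them hierarchically into a toolbox, so that tools can be selected and applied effectively and the model becomes more accurate and generalizes better across domains.

Link: https://arxiv.org/abs/2505.21413
Authors: Xiao Liu,Da Yin,Zirui Wu,Yansong Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL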

Abstract:Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models’ internal knowledge and would fail in domains beyond the LLMs’ knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.

[NLP-16] Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity

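[Quick read]: This paper addresses the computational load imbalance that arises in Mixture of Experts (MoE) large language models when some experts are activated far more often than others, which reduces system efficiency. The key to the solution is Mixture of Grouped Experts (MoGE), which groups the experts during selection and constrains each token to activate an equal number of experts within every predefined group, yielding better expert load balance. This design balances the computational load when the model is distributed across multiple devices and significantly improves throughput, particularly at inference.

Link: https://arxiv.org/abs/2505.21411
Authors: Yehui Tang,Xiaosong Li,Fangcheng Liu,Wei Guo,Hang Zhou,Yaoyuan Wang,Kai Han,Xianzhi Yu,Jinpeng Li,Hui Zang,Fei Mi,Xiaojun Meng,Zhicheng Liu,Hanting Chen,Binfan Zheng,Can Chen,Youliang Yan,Ruiming Tang,Peifeng Qin,Xinghao Chen,Dacheng Tao,Yunhe Wang(and Other Contributors)
Affiliations: Huawei
Categories: Computation and Language (cs.CL)
Comments: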

Abstract:The surgence of Mixture of Experts (MoE) in Large Language Models promises a small price of execution cost for a much larger model parameter count and learning capacity, because only a small fraction of parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when running the experts on different devices in parallel. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and balances the expert workload better than MoE in nature. It constrains tokens to activate an equal number of experts within each predefined expert group. When a model execution is distributed on multiple devices, this architectural design ensures a balanced computational load across devices, significantly enhancing throughput, particularly for the inference phase. Further, we build Pangu Pro MoE on Ascend NPUs, a sparse model based on MoGE with 72 billion total parameters, 16 billion of which are activated for each token. The configuration of Pangu Pro MoE is optimized for Ascend 300I Duo and 800I A2 through extensive system simulation studies. Our experiments indicate that MoGE indeed leads to better expert load balancing and more efficient execution for both model training and inference on Ascend NPUs. The inference performance of Pangu Pro MoE achieves 1148 tokens/s per card and can be further improved to 1528 tokens/s per card by speculative acceleration, outperforming comparable 32B and 72B Dense models. Furthermore, we achieve an excellent cost-to-performance ratio for model inference on Ascend 300I Duo. Our studies show that Ascend NPUs are capable of training Pangu Pro MoE with massive parallelization to make it a leading model within the sub-100B total parameter class, outperforming prominent open-source models like GLM-Z1-32B and Qwen3-32B.
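
The grouping constraint can be made concrete with a small sketch. It assumes experts are split into equal-sized contiguous groups and that each token picks the top-scoring experts inside every group; the function names and the toy scores are hypothetical, and the actual Pangu Pro MoE router may differ in its details.

```python
def plain_top_k(scores, k):
    """Standard MoE routing: pick the k highest-scoring experts overall."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def grouped_top_k(scores, num_groups, k_per_group):
    """Grouped routing: pick k_per_group experts inside each expert group."""
    num_experts = len(scores)
    if num_experts % num_groups != 0:
        raise ValueError("experts must divide evenly into groups")
    group_size = num_experts // num_groups
    if k_per_group > group_size:
        raise ValueError("k_per_group cannot exceed the group size")
    selected = []
    for group in range(num_groups):
        start = group * group_size
        members = range(start, start + group_size)
        best = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(sorted(best[:k_per_group]))
    return selected
```

With one group per device, grouped routing activates the same number of experts on every device for every token, whereas plain top-k can leave some devices idle and overload others.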

[NLP-17] RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

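[Quick read]: This paper addresses the factuality of large language models (LLMs) when they generate structured, multi-record tabular outputs, that is, their poor performance on relational fact retrieval. Existing benchmarks mainly assess short factual answers and overlook the ability to produce structured tables from parametric knowledge. The key contribution is RelationalFactQA, a new benchmark of diverse natural language questions (paired with SQL) and gold-standard tabular answers, designed specifically to assess knowledge retrieval in a structured format. Using this benchmark, the study shows that current state-of-the-art LLMs are significantly limited in generating relational outputs: factual accuracy does not exceed 25%, and performance degrades notably as output dimensionality grows.

Link: https://arxiv.org/abs/2505.21409
Authors: Dario Satriani,Enzo Veltri,Donatello Santoro,Paolo Papotti
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: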

Abstract:Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs’ ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.

[NLP-18] Factual Self-Awareness in Language Models: Representation Robustness and Scaling

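[Quick read]: This paper addresses factual errors in content generated by large language models (LLMs). The key finding is that LLMs possess an internal mechanism at generation time: linear features encoded in the Transformer's residual stream indicate whether factual recall will be correct, providing a detectable self-awareness signal for entity-relation-attribute triplets. This self-monitoring ability emerges rapidly early in training and peaks in intermediate layers, contributing to model interpretability and reliability.

Link: https://arxiv.org/abs/2505.21399
Authors: Hovhannes Tamoyan,Subhabrata Dutta,Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technical University of Darmstadt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: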

Abstract:Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of LLMs’ internal compass that dictates the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer’s residual stream that dictate whether it will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.

[NLP-19] DecisionFlow: Advancing Large Language Model as Principled Decision Maker

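[Quick read]: This paper addresses the lack of structured reasoning and transparent explanation in language-model decision-making in high-stakes domains such as healthcare and finance. Current language models typically produce decisions and justifications in an unstructured, post-hoc manner, which falls short of the explainability and transparency that high-stakes settings require. The key to the solution is the DecisionFlow framework, which guides the model to reason over structured representations of actions, attributes, and constraints, builds a semantically grounded decision space, and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner, producing decisions tightly coupled with interpretable reasoning.

Link: https://arxiv.org/abs/2505.21397
Authors: Xiusi Chen,Shanyong Wang,Cheng Qian,Hongru Wang,Peixuan Han,Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 13 figures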

Abstract:In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model’s reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. We release the data and code at this https URL.
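
To make the idea of a utility-driven choice over actions, attributes, and constraints more tangible, here is a minimal sketch. It assumes a linear utility with hand-set weights and simple predicate constraints; the function name, the weights, and the toy treatment options are hypothetical, and DecisionFlow itself infers its utility function with an LLM rather than fixing it by hand.

```python
def choose_action(actions, weights, constraints):
    """Pick the feasible action with the highest weighted-sum utility.

    actions:     dict mapping action name -> dict of attribute values
    weights:     dict mapping attribute name -> weight (hypothetical utility)
    constraints: list of predicates over an attribute dict
    """
    feasible = {
        name: attrs
        for name, attrs in actions.items()
        if all(check(attrs) for check in constraints)
    }
    if not feasible:
        raise ValueError("no action satisfies the constraints")

    def utility(attrs):
        return sum(weight * attrs[key] for key, weight in weights.items())

    # Return both the choice and the per-action scores that justify it.
    scores = {name: utility(attrs) for name, attrs in feasible.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Returning the score table alongside the chosen action keeps the rationale inspectable, which is the property the abstract emphasizes.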

[NLP-20] Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science

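[Quick read]: This paper addresses the shortcomings in feasibility and expected effectiveness of research ideas produced by generative AI during research topic selection. The key to the solution is to improve idea quality by bringing relevant data into the ideation process: providing metadata during the idea generation stage to steer the model toward more feasible directions, and adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses.

Link: https://arxiv.org/abs/2505.21396
Authors: Xiao Liu,Xinyi Dong,Xinyang Gao,Yansong Feng,Xun Pang
Affiliations: Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: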

Abstract:Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas with higher quality. Our work highlights the potential of data-driven research idea generation, and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

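[NLP-21] AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

[Quick read]: This paper addresses the high cost of evaluating multimodal large language models (MLLMs): as benchmarks grow in size and cross-modal complexity, the human and computational resources needed for evaluation rise significantly. The key to the solution is the AutoJudger framework, which uses Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions, enabling efficient and adaptive evaluation. AutoJudger also includes a semantic-aware retrieval mechanism and a dynamic memory module, so that the selected questions cover diverse and challenging scenarios across vision and language modalities while question selection stays contextually coherent and globally informed.

Link: https://arxiv.org/abs/2505.21389
Authors: Xuanwen Ding,Chengjun Pan,Zejun Li,Jiwen Zhang,Siyuan Wang,Zhongyu Wei
Affiliations: Fudan University; University of Southern California; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL)
Comments: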

Abstract:Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To tackle this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate the question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model’s real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses, i.e., AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy with the full benchmark evaluation on MMT-Bench.
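
For readers unfamiliar with IRT, the sketch below shows the classical one-parameter (Rasch) model and the textbook rule of picking the unasked question with the highest Fisher information at the current ability estimate. This is an assumption chosen for illustration: AutoJudger delegates selection to an LLM agent, and the abstract does not say which IRT variant it fits. All names and difficulty values are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: probability that a model of given ability answers correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of one item under the Rasch model: p * (1 - p)."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)


def most_informative(ability, difficulties, asked):
    """Index of the not-yet-asked question with maximal information."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    if not candidates:
        raise ValueError("every question has already been asked")
    return max(candidates, key=lambda i: item_information(ability, difficulties[i]))
```

Under this model the most informative question is the one whose difficulty is closest to the current ability estimate, which is why adaptive testing can rank models with a small fraction of the benchmark.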

[NLP-22] PHISH in MESH: Korean Adversarial Phonetic Substitution and Phonetic-Semantic Feature Integration Defense

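[Quick read]: This paper addresses malicious users who evade hate speech detection through phonetic substitution, focusing on the vulnerability of Korean to phonetic perturbations due to its phonographic writing system and on the fact that prior work has concentrated on dataset construction rather than architectural defenses. The solution consists of two methods: (1) PHonetic-Informed Substitution for Hangul (PHISH), which performs substitutions based on the phonological characteristics of the Korean writing system; and (2) Mixed Encoding of Semantic-pHonetic features (MESH), which makes the detector more robust by integrating phonetic information at the architectural level.

Link: https://arxiv.org/abs/2505.21380
Authors: Byungjun Kim,Minju Kim,Hyeonchu Park,Bugeun Kim
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments: Under review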

Abstract:As malicious users increasingly employ phonetic substitution to evade hate speech detection, researchers have investigated such strategies. However, two key challenges remain. First, existing studies have overlooked the Korean language, despite its vulnerability to phonetic perturbations due to its phonographic nature. Second, prior work has primarily focused on constructing datasets rather than developing architectural defenses. To address these challenges, we propose (1) PHonetic-Informed Substitution for Hangul (PHISH) that exploits the phonological characteristics of the Korean writing system, and (2) Mixed Encoding of Semantic-pHonetic features (MESH) that enhances the detector’s robustness by incorporating phonetic information at the architectural level. Our experimental results demonstrate the effectiveness of our proposed methods on both perturbed and unperturbed datasets, suggesting that they not only improve detection performance but also reflect realistic adversarial behaviors employed by malicious users.
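
Phonetic features for Hangul are usually computed at the level of the letters (jamo) inside each syllable block. The sketch below applies the standard Unicode arithmetic for splitting a precomposed syllable into lead consonant, vowel, and optional final consonant indices. The function names are hypothetical, and this shows only the decomposition step, not the PHISH substitution rules or the MESH encoder, which the abstract does not detail.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
NUM_LEADS = 19        # initial consonants
NUM_VOWELS = 21       # medial vowels
NUM_TAILS = 28        # 27 final consonants plus "no final consonant"
NUM_SYLLABLES = NUM_LEADS * NUM_VOWELS * NUM_TAILS  # 11172 syllable blocks


def decompose(syllable):
    """Return (lead, vowel, tail) indices for one precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    return lead, vowel, tail


def compose(lead, vowel, tail=0):
    """Inverse of decompose: rebuild the syllable from its three indices."""
    return chr(HANGUL_BASE + (lead * NUM_VOWELS + vowel) * NUM_TAILS + tail)
```

Because the mapping is invertible, a model can consume jamo-level indices as phonetic input while the original text remains recoverable.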

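[NLP-23] Analyzing values about gendered language reform in LLMs' revisions

[Quick read]: This paper examines how large language models (LLMs) handle gendered role nouns (e.g., outdoorsperson/woman/man) during text revision and whether their justifications align with feminist and trans-inclusive language reforms. The key to the approach is assessing whether LLMs, when applying these reforms, are sensitive to the same contextual effects that sociolinguistics has identified in human language use.

Link: https://arxiv.org/abs/2505.21378
Authors: Jules Watson,Xi Wang,Raymond Liu,Suzanne Stevenson,Barend Beekhuizen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages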

Abstract:Within the common LLM use case of text revision, we study LLMs’ revision of gendered role nouns (e.g., outdoorsperson/woman/man) and their justifications of such revisions. We evaluate their alignment with feminist and trans-inclusive language reforms for English. Drawing on insight from sociolinguistics, we further assess if LLMs are sensitive to the same contextual effects in the application of such reforms as people are, finding broad evidence of such effects. We discuss implications for value alignment.

[NLP-24] Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

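[Quick read]: This paper addresses how well large language models (LLMs) adapt their responses to users' sociodemographic characteristics, such as age, occupation, and education level. Existing evaluations mostly focus on single-turn prompts and overlook the role that multi-turn dialogue history plays in building context in real applications. The paper proposes an evaluation framework for analyzing LLM behavioral adaptation when user attributes are introduced either through an explicit user profile or implicitly through multi-turn dialogue history. The key to the approach is a synthetic dataset that pairs different user profiles with dialogue histories and uses questions from the Value Survey Module (VSM 2013) to probe the consistency of the model's expressed values, thereby assessing behavioral stability and adaptability across settings.

Link: https://arxiv.org/abs/2505.21362
Authors: Qishuai Zhong,Zongmin Li,Siqi Fan,Aixin Sun
Affiliations: Nanyang Technological University; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: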

Abstract:Effective engagement by large language models (LLMs) requires adapting responses to users’ sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs’ behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

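[NLP-25] Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

[Quick read]: This paper addresses the challenge of Bengali Math Word Problems (MWPs) in natural language processing (NLP), which stems from the language's low-resource status and the multi-step reasoning the problems require. The absence of a human-annotated Bengali dataset has limited progress in Bengali mathematical reasoning. The key to the solution is SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written step-by-step solutions, designed to support reasoning-focused evaluation and model development. The study also improves model performance with Chain of Thought (CoT) prompting and Low-Rank Adaptation (LoRA), achieving efficient adaptation and high accuracy on Bengali MWPs.

Link: https://arxiv.org/abs/2505.21354
Authors: Bidyarthi Paul,Jalisha Jashim Era,Mirazur Rahman Zim,Tahmid Sattar Aothoi,Faisal Muhammad Shah
Affiliations: Ahsanullah University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: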

Abstract:Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language’s low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.

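[NLP-26] The Multilingual Divide and Its Impact on Global AI Safety

[Quick read]: This paper addresses the "language gap" in AI: for many languages beyond the globally dominant ones, the capabilities and safety performance of large language models remain markedly weaker. The paper argues that this gap not only limits the reach and fairness of AI technology but also deepens inequality in global AI safety. The key to the solution is supporting multilingual dataset creation, greater transparency, and further research in order to narrow the language gap and reduce safety risks across languages.

Link: https://arxiv.org/abs/2505.21344
Authors: Aidan Peppin,Julia Kreutzer,Alice Schoenauer Sebag,Kelly Marchisio,Beyza Ermis,John Dang,Samuel Cahyawijaya,Shivalika Singh,Seraphina Goldfarb-Tarrant,Viraat Aryabumi,Aakanksha,Wei-Yin Ko,Ahmet Üstün,Matthias Gallé,Marzieh Fadaee,Sara Hooker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: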

Abstract:Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the “language gap” in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.

[NLP-27] PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

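[Quick read]: This paper addresses ambiguity in patent claims, which leads to rejected applications; in the US this is called "indefiniteness" and is among the most frequent reasons for rejection. The key to the solution is PEDANTIC, a new dataset of 14,000 US patent claims from applications related to Natural Language Processing (NLP), annotated with reasons for indefiniteness. PEDANTIC is built with a fully automatic pipeline that uses Large Language Models (LLMs) to extract the reasons for indefiniteness from USPTO office action documents, with a human validation study confirming annotation quality. The dataset is a valuable resource for patent AI research and supports the development of more advanced patent examination models.

Link: https://arxiv.org/abs/2505.21342
Authors: Valentin Knappich,Annemarie Friedrich,Anna Hätty,Simon Razniewski
Affiliations: Bosch Center for AI; University of Augsburg; ScaDS.AI & TU Dresden
Categories: Computation and Language (cs.CL)
Comments: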

Abstract:Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline’s accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.

[NLP-28] Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks ACL2025

[Quick read]: This paper addresses limitations in current Table Union Search (TUS) benchmarks that allow simple baselines to perform well, even better than more sophisticated models, indicating that existing benchmark scores are driven mainly by dataset-specific characteristics rather than genuine semantic understanding. The key to the solution is a set of essential criteria for future benchmarks, enabling a more realistic and reliable evaluation of progress in semantic table union search.

Link: https://arxiv.org/abs/2505.21329
Authors: Allaa Boutaleb,Bernd Amann,Hubert Naacke,Rafael Angarita
Affiliations: Sorbonne Université; CNRS; LIP6
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted @ ACL 2025’s Table Representation Learning Workshop (TRL)

Abstract:Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

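[NLP-29] Leveraging large language models and traditional machine learning ensembles for ADHD detection from narrative transcripts

[Quick read]: This paper addresses how to combine large language models (LLMs) with traditional supervised machine learning (ML) for classifying narrative data in psychiatry, specifically the automatic classification of Attention-Deficit/Hyperactivity Disorder (ADHD) diagnosis. The key to the solution is an ensemble framework that integrates three complementary models: LLaMA3 (an open-source LLM that captures long-range semantic structure), RoBERTa (a pre-trained Transformer fine-tuned on labeled clinical narratives), and a Support Vector Machine (SVM) classifier trained on TF-IDF lexical features. Their predictions are aggregated by majority voting to make the final prediction more robust.

Link: https://arxiv.org/abs/2505.21324
Authors: Yuxin Zhu,Yuting Guo,Noah Marchuck,Abeed Sarker,Yun Wang
Affiliations: Emory University
Categories: Computation and Language (cs.CL)
Comments: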

Abstract:Despite rapid advances in large language models (LLMs), their integration with traditional supervised machine learning (ML) techniques that have proven applicability to medical data remains underexplored. This is particularly true for psychiatric applications, where narrative data often exhibit nuanced linguistic and contextual complexity, and can benefit from the combination of multiple models with differing characteristics. In this study, we introduce an ensemble framework for automatically classifying Attention-Deficit/Hyperactivity Disorder (ADHD) diagnosis (binary) using narrative transcripts. Our approach integrates three complementary models: LLaMA3, an open-source LLM that captures long-range semantic structure; RoBERTa, a pre-trained transformer model fine-tuned on labeled clinical narratives; and a Support Vector Machine (SVM) classifier trained using TF-IDF-based lexical features. These models are aggregated through a majority voting mechanism to enhance predictive robustness. The dataset comprises 441 instances, with 352 for training and 89 for validation. Empirical results show that the ensemble outperforms individual models, achieving an F1 score of 0.71 (95% CI: [0.60-0.80]). Compared to the best-performing individual model (SVM), the ensemble improved recall while maintaining competitive precision. This indicates the strong sensitivity of the ensemble in identifying ADHD-related linguistic cues. These findings demonstrate the promise of hybrid architectures that leverage the semantic richness of LLMs alongside the interpretability and pattern recognition capabilities of traditional supervised ML, offering a new direction for robust and generalizable psychiatric text classification.
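
The majority voting step can be sketched in a few lines. The example assumes each of the three models has already produced one binary label per transcript; the function names and toy predictions are hypothetical and do not come from the paper's code.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most voters for one sample."""
    label, _count = Counter(labels).most_common(1)[0]
    return label


def ensemble_predict(per_model_predictions):
    """Combine per-model prediction lists into one list of voted labels.

    per_model_predictions: one list per model, all of the same length,
    e.g. [llama_preds, roberta_preds, svm_preds] (hypothetical names).
    """
    lengths = {len(preds) for preds in per_model_predictions}
    if len(lengths) != 1:
        raise ValueError("every model must predict every sample")
    # zip(*...) regroups the predictions sample by sample.
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]
```

With three voters and two classes a tie is impossible, so no tie-breaking rule is needed in this setting.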

[NLP-30] Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

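[Quick read]: This paper addresses the underrepresentation of African languages in current natural language processing (NLP) systems and large language models (LLMs). Africa has over 2,000 languages and potentially millions of speakers, yet existing NLP technology mainly supports a small set of high-resource languages, which limits its reach and risks widening the digital divide. The paper analyzes 734 research papers on NLP for African languages published over the past five years, summarizes progress in the field, and argues that the way forward lies in building multilingual language resources, community-led initiatives, and funding support, in order to foster more inclusive and sustainable NLP research for African languages.

Link: https://arxiv.org/abs/2505.21315
Authors: Jesujoba O. Alabi,Michael A. Hedderich,David Ifeoluwa Adelani,Dietrich Klakow
Affiliations: Saarland University; Saarland Informatics Campus; LMU Munich; Munich Center for Machine Learning; Mila - Quebec AI Institute; McGill University; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL)
Comments: Working paper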

Abstract:With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 734 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

[NLP-31] How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian ACL2025

[Quick read]: This paper asks whether exemplars generated by current large language models (LLMs) reflect how humans organize categories, particularly at the subordinate level. The key to the study is a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete nouns, which is used to evaluate how well textual and vision LLMs align with human category organization on three key tasks: exemplar generation, category induction, and typicality judgment.

Link: https://arxiv.org/abs/2505.21301
Authors: Andrea Pedrotti,Giulia Rambelli,Caterina Villani,Marianna Bolognesi
Affiliations: ISTI-CNR; Università di Bologna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2025

Abstract:People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then use these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.

[NLP-32] rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

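[Quick read]: This paper addresses the limited code reasoning ability of Large Language Models (LLMs), which stems mainly from the scarcity of high-quality, high-difficulty datasets, especially ones with verifiable input-output test cases. The key to the solution is rStar-Coder, a large-scale verified dataset of 418K competition-level code problems and 580K long-reasoning solutions, together with diverse test cases. The dataset rests on three core contributions: (1) curating competitive programming problems and their oracle solutions to synthesize new, solvable problems; (2) a reliable input-output test case synthesis pipeline that combines a three-step input generation method with a mutual verification mechanism for output labeling; and (3) augmenting problems with high-quality long-reasoning solutions verified against the test cases. Experiments show that models trained on rStar-Coder improve markedly on several code reasoning benchmarks.

Link: https://arxiv.org/abs/2505.21297
Authors: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
Affiliations: Microsoft Research Asia; Dalian University of Technology; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: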

Abstract:Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at this https URL.
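
The idea of labeling test outputs by mutual verification can be illustrated with a small sketch: several candidate solutions are run on the same input, and an output is accepted only when enough of them agree. The function name `label_by_agreement`, the `min_agree` threshold, and the toy solutions are assumptions made for illustration; the abstract does not spell out the paper's actual procedure.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def label_by_agreement(
    solutions: Sequence[Callable[[str], str]],
    test_input: str,
    min_agree: float = 0.6,  # hypothetical agreement threshold
) -> Optional[str]:
    """Return the majority output if enough solutions agree, else None."""
    outputs = []
    for solve in solutions:
        try:
            outputs.append(solve(test_input))
        except Exception:
            # A crashing solution simply casts no vote.
            continue
    if not outputs:
        return None
    answer, votes = Counter(outputs).most_common(1)[0]
    # Agreement is measured against all candidates, including crashed ones.
    return answer if votes / len(solutions) >= min_agree else None


# Toy candidates for "sum the integers on one line"; the third is buggy.
def sol_a(s: str) -> str:
    return str(sum(int(x) for x in s.split()))


def sol_b(s: str) -> str:
    return str(sum(map(int, s.split())))


def sol_c(s: str) -> str:
    return str(sum(int(x) for x in s.split()[1:]))  # drops the first number


label = label_by_agreement([sol_a, sol_b, sol_c], "1 2 3")
```

Here two of the three candidates agree, so the input receives a label; with only `sol_a` and `sol_c` the vote is split and the input would be discarded rather than labeled.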

[NLP-33] Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space ACL2025

[Quick read]: This paper targets weaknesses in the safety safeguards of Large Language Models (LLMs), using black-box jailbreak attacks to expose model vulnerabilities and thereby inform improvements in robustness. The key to the solution is a framework grounded in Elaboration Likelihood Model (ELM) theory that decomposes jailbreak strategies into essential components and combines genetic-algorithm-based optimization with an intention evaluation mechanism. This systematically expands the strategy space and overcomes the performance ceiling that predefined strategy spaces impose on earlier methods.

Link: https://arxiv.org/abs/2505.21277
Authors: Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei
Affiliations: Beihang University; Tsinghua University; Tsinghua-Bosch Joint ML Center; THBI Lab; BNRist Center; RealAI
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 20 figures, accepted by ACL 2025, Findings

Abstract:Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. To be striking, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: this https URL.

[NLP-34] Multilingual Pretraining for Pixel Language Models

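[Quick read]: This paper addresses the lack of research on multilingual pretraining for pixel language models, in particular their limited cross-lingual transfer to languages written in non-Latin scripts. The key to the solution is PIXEL-M4, a model pretrained on four visually and linguistically diverse languages (English, Hindi, Ukrainian, and Simplified Chinese). Multilingual pretraining strengthens the model's semantic representations across languages and yields a closely aligned semantic embedding space across the pretraining languages, which substantially improves support for a diverse set of languages.

Link: https://arxiv.org/abs/2505.21265
Authors: Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott
Affiliations: Department of Computer Science, University of Copenhagen; ROCKWOOL Foundation Research Unit
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 19 figures, 7 tables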

Abstract:Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.

[NLP-35] ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision ACL2025

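[Quick read]: This paper addresses the shortage of labeled documents in multi-hop question answering (MHQA), which arises because queries vary widely across reasoning steps. Conventional dense retrievers are fine-tuned on labeled query-document pairs, and such labels are hard to obtain for MHQA. The key to the solution is ReSCORE (Retriever Supervision with Consistency and Relevance), a method for training dense retrievers without labeled documents. It uses a large language model to capture each document's relevance to the question and its consistency with the correct answer, and trains the retriever within an iterative question-answering framework, improving both retrieval and MHQA performance.

Link: https://arxiv.org/abs/2505.21250
Authors: Dosung Lee, Wonjun Oh, Boyoung Kim, Minyoung Kim, Joonsuk Park, Paul Hongsuck Seo
Affiliations: Korea University; NAVER AI Lab; NAVER Cloud; University of Richmond
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, ACL 2025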

Abstract:Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries (reformulated questions) throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document’s relevance to the question and consistency with the correct answer and uses them to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval, and in turn, the state-of-the-art MHQA performance. Our implementation is available at: this https URL.

[NLP-36] Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings ACL2025

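[Quick read]: This paper addresses the significant performance drop of Large Language Models (LLMs) on medical text summarization when inputs contain many out-of-vocabulary (OOV) words or are highly novel. The key solution is vocabulary adaptation: updating the LLM's vocabulary with domain-specific (here, medical) words or subwords to reduce vocabulary mismatch. The study finds that even Llama-3.1, with a vocabulary of about 128K tokens, still over-fragments medical words, and that vocabulary adaptation improves summarization performance in these difficult settings.

Link: https://arxiv.org/abs/2505.21242
Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Affiliations: Indian Institute of Technology Kharagpur
Categories: Computation and Language (cs.CL)
Comments: 16 pages. Accepted for publication in the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)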

Abstract:Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at this https URL.
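
Over-fragmentation can be made concrete by counting subword tokens per word before and after adding domain words to the vocabulary. The sketch below is only a toy: it uses greedy longest-match segmentation as a stand-in for a real LLM tokenizer (which would typically be BPE-based), and the vocabularies, word list, and function names are all hypothetical.

```python
from typing import List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring first, falling back to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words: List[str], vocab: Set[str]) -> float:
    """Average number of subword tokens per word."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


base_vocab = {"hyper", "tension", "card", "io", "my", "pathy", "the", "patient"}
# Hypothetical vocabulary adaptation: add whole medical words.
adapted_vocab = base_vocab | {"cardiomyopathy", "hypertension"}

words = ["cardiomyopathy", "hypertension", "patient"]
before = fertility(words, base_vocab)    # medical words split into many pieces
after = fertility(words, adapted_vocab)  # each word is a single token
```

With the base vocabulary, "cardiomyopathy" splits into five pieces; after adaptation every word in the list maps to one token, which is the effect vocabulary adaptation aims for.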

[NLP-37] LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners

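[Quick read]: This paper addresses Cognitive Diagnosis (CD) in cold-start scenarios, where traditional CD models degrade because student-exercise interaction data are missing. The key to the solution is Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a framework with two main phases: Knowledge Diffusion, in which Large Language Models (LLMs) generate enriched content for exercises and knowledge concepts, and Semantic-Cognitive Fusion, in which causal attention integrates textual information with students' cognitive states to build comprehensive representations of both students and exercises.

Link: https://arxiv.org/abs/2505.21239
Authors: Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Affiliations: TAL Education Group
Categories: Computation and Language (cs.CL)
Comments: work in progress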

Abstract:Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students’ cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at this https URL

[NLP-38] A Representation Level Analysis of NMT Model Robustness to Grammatical Errors ACL2025

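[Quick read]: This paper studies robustness in machine translation, focusing on how models handle ungrammatical input. Unlike prior work, which mainly documents robustness failures or improves robustness, it takes a representation-level view, analyzing the model's internal representations of ungrammatical inputs and how they evolve across layers. The key to the approach is Grammatical Error Detection (GED) probing and representational similarity analysis, which reveal that the encoder first detects a grammatical error and then moves its representation toward the correct form. An analysis of the attention mechanism further identifies what the authors call Robustness Heads: attention heads that attend to interpretable linguistic units when responding to grammatical errors, and on which models fine-tuned for robustness rely more heavily when updating the representation of the ungrammatical word.

Link: https://arxiv.org/abs/2505.21224
Authors: Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
Affiliations: Maastricht University
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 Findings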

Abstract:Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.

[NLP-39] Pretrained LLMs Learn Multiple Types of Uncertainty

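[Quick read]: This paper addresses hallucinations in Large Language Models (LLMs), that is, generated text that is factually incorrect. The key to the approach is to study whether LLMs implicitly represent uncertainty without being explicitly trained to do so. The study finds that, if uncertainty is treated as a linear concept in the model's latent space, LLMs capture it after pretraining alone, and that they capture several distinct types of uncertainty, each of which can help predict correctness on a specific task or benchmark. It also suggests that unifying these uncertainty types through instruction-tuning or [IDK]-token tuning improves correctness prediction.

Link: https://arxiv.org/abs/2505.21218
Authors: Roi Cohen, Omri Fahn, Gerard de Melo
Affiliations: HPI / University of Potsdam; Tel Aviv University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: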

Abstract:Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we study how well LLMs capture uncertainty, without explicitly being trained for that. We show that, if considering uncertainty as a linear concept in the model’s latent space, it might indeed be captured, even after only pretraining. We further show that, though unintuitive, LLMs appear to capture several different types of uncertainty, each of which can be useful to predict the correctness for a specific task or benchmark. Furthermore, we provide in-depth results such as demonstrating a correlation between our correction prediction and the model’s ability to abstain from misinformation using words, and the lack of impact of model scaling for capturing uncertainty. Finally, we claim that unifying the uncertainty types as a single one using instruction-tuning or [IDK]-token tuning is helpful for the model in terms of correctness prediction.
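
Treating uncertainty as a linear concept means that a single direction in hidden-state space should separate cases where the model is right from cases where it is wrong. A minimal sketch of such a linear probe follows, assuming a simple difference-of-means direction and synthetic 3-dimensional vectors in place of real hidden states; the abstract does not say how the authors fit their probes, so every name and number here is illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_direction(correct: List[Vector], incorrect: List[Vector]) -> Tuple[List[float], float]:
    """Difference-of-means probe: a direction plus a bias at the class midpoint."""
    mu_c, mu_i = mean_vector(correct), mean_vector(incorrect)
    direction = [c - i for c, i in zip(mu_c, mu_i)]
    midpoint = [(c + i) / 2 for c, i in zip(mu_c, mu_i)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict_correct(h: Vector, direction: Vector, bias: float) -> bool:
    """Predict 'model answer is correct' when the projection is positive."""
    return sum(d * x for d, x in zip(direction, h)) + bias > 0


# Synthetic stand-ins; real inputs would be hidden states from a model layer.
correct_states = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.3], [1.2, -0.1, 0.1]]
incorrect_states = [[-1.0, 0.0, 0.2], [-0.7, 0.3, 0.0], [-1.1, 0.1, 0.1]]
direction, bias = fit_direction(correct_states, incorrect_states)
```

Fitting one such direction per task or benchmark, and comparing the directions, is one way to think about the "several different types of uncertainty" the abstract mentions.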

[NLP-40] Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities

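[Quick read]: This paper addresses the poorly understood computational mechanisms behind the improved instruction-following ability of fine-tuned Large Language Models (LLMs). The key to the solution is the HexaInst dataset and the SPARCOM analytical framework, which identify, evaluate, and compare instruction-specific sparse components (neurons in dense models, and both neurons and experts in Mixture-of-Experts (MoE) architectures). This reveals how fine-tuning reconfigures model computation and clarifies the critical role these sparse components play in instruction execution.

Link: https://arxiv.org/abs/2505.21191
Authors: Junyan Zhang, Yubo Gao, Yibo Yan, Jungang Li, Zhaorui Hou, Sicheng Tao, Shuliang Liu, Song Dai, Yonghua Hei, Junzhuo Li, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: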

Abstract:The finetuning of Large Language Models (LLMs) has significantly advanced their instruction-following capabilities, yet the underlying computational mechanisms driving these improvements remain poorly understood. This study systematically examines how fine-tuning reconfigures LLM computations by isolating and analyzing instruction-specific sparse components, i.e., neurons in dense models and both neurons and experts in Mixture-of-Experts (MoE) architectures. In particular, we introduce HexaInst, a carefully curated and balanced instructional dataset spanning six distinct categories, and propose SPARCOM, a novel analytical framework comprising three key contributions: (1) a method for identifying these sparse components, (2) an evaluation of their functional generality and uniqueness, and (3) a systematic comparison of their alterations. Through experiments, we demonstrate functional generality, uniqueness, and the critical role of these components in instruction execution. By elucidating the relationship between fine-tuning-induced adaptations and sparse computational substrates, this work provides deeper insights into how LLMs internalize instruction-following behavior for the trustworthy LLM community.

[NLP-41] Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

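[Quick read]: This paper addresses the limitations of evaluation for radiology report generation: existing methods apply only to single-report settings and rely on coarse metrics that miss fine-grained clinical semantics and temporal dependencies. The key to the solution is LUNGUAGE, a benchmark dataset that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies, together with LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute levels while modeling temporal consistency across patient timelines.

Link: https://arxiv.org/abs/2505.21190
Authors: Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: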

Abstract:Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: this https URL

[NLP-42] Exploring the Latent Capacity of LLMs for One-Step Text Generation

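[Quick read]: This paper asks whether long texts can be reconstructed without autoregressive generation. The key finding is that a frozen Large Language Model (LLM), given only two learned embeddings, can generate hundreds of accurate tokens in a single forward pass, demonstrating multi-token generation without iterative decoding. The work also examines what information these embeddings encode and shows that, although the representations are not unique for a given text, they form connected and local regions in embedding space, which suggests that a dedicated encoder into that space could be learned.

Link: https://arxiv.org/abs/2505.21189
Authors: Gleb Mezentsev, Ivan Oseledets
Affiliations: AIRI; Skoltech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: under review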

Abstract:A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.

[NLP-43] PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

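[Quick read]: This paper addresses the limited generation reliability and content diversity of harmful information synthesis. Existing approaches rely on Large Language Models (LLMs) to synthesize data, but the models' safety alignment makes it hard to produce diverse and reliable harmful data. The key to the proposed solution is PoisonSwarm, a framework that uses a model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, it generates abundant benign data as base templates in a counterfactual manner, then toxifies each template one semantic unit at a time and refines the result through dynamic model switching.

Link: https://arxiv.org/abs/2505.21184
Authors: Yu Yan, Sheng Sun, Zhifei Zheng, Ziji Hao, Teli Liu, Min Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: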

Abstract:To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the based templates in a counterfactual manner. Subsequently, we decompose each based template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.

[NLP-44] Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

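[Quick read]: This paper addresses overthinking in state-of-the-art reasoning models, that is, redundant or repetitive thinking patterns in long Chain-of-Thought (CoT) responses. The key to the solution is ConciseR, a simple yet effective two-stage reinforcement learning framework that optimizes reasoning ability and response conciseness in separate stages. The first stage uses more training steps and strengthens reasoning through Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++). The second stage uses fewer training steps and explicitly enforces conciseness and improves efficiency through Length-aware Group Relative Policy Optimization (L-GRPO). Following the "walk before you run" principle, the method optimizes response length only once all rollouts of a sample are correct.

Link: https://arxiv.org/abs/2505.21178
Authors: Mingyang Song, Mao Zheng
Affiliations: Tencent Hunyuan
Categories: Computation and Language (cs.CL)
Comments: Ongoing Work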

Abstract:As test-time scaling becomes a pivotal research frontier in Large Language Models (LLMs) development, contemporary and advanced post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, in this paper, we propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in LLMs, named ConciseR. Specifically, the first stage, using more training steps, aims to incentivize the model’s reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++), and the second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Significantly, ConciseR only optimizes response length once all rollouts of a sample are correct, following the “walk before you run” principle. Extensive experimental results demonstrate that our ConciseR model, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models with zero RL paradigm across AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
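
The "walk before you run" gating can be sketched in a few lines: rollouts for one prompt are scored for correctness, a brevity bonus is added only if every rollout in the group is correct, and rewards are then normalized within the group as in GRPO-style training. The linear bonus, the `max_len` parameter, and both function names are assumptions for illustration; the abstract does not give the L-GRPO reward formula, and clip-higher and dynamic sampling are omitted here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_aware_rewards(correct: List[bool], lengths: List[int], max_len: int) -> List[float]:
    """Accuracy reward, plus a brevity bonus only if every rollout is correct."""
    base = [1.0 if c else 0.0 for c in correct]
    if not all(correct):
        return base  # "walk": keep optimizing correctness only
    # "run": hypothetical linear bonus that favors shorter responses
    return [b + (1.0 - min(n, max_len) / max_len) for b, n in zip(base, lengths)]


# One incorrect rollout: lengths are ignored.
mixed = length_aware_rewards([True, False], [100, 50], max_len=1000)
# All rollouts correct: the shorter response earns the larger reward.
shaped = length_aware_rewards([True, True], [200, 800], max_len=1000)
advantages = group_relative_advantages(shaped)
```

In the all-correct group the shorter rollout gets a positive advantage and the longer one a negative advantage, so the policy is pushed toward concise answers only where correctness is already secured.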

[NLP-45] TAT-R1: Terminology-Aware Translation with Reinforcement Learning and Word Alignment

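[Quick read]: This paper addresses terminology translation, which remains unexplored in deep reasoning large language models (LLMs) and where existing models are not accurate enough on specialized terms. The key to the solution is TAT-R1, a terminology-aware translation model trained with reinforcement learning (RL) and word alignment. It extracts keyword translation pairs with a word alignment model and designs three types of rule-based alignment rewards, so that the model learns to focus on accurately translating key information in the source text, which improves terminology translation accuracy.

Link: https://arxiv.org/abs/2505.21172
Authors: Zheng Li, Mao Zheng, Mingyang Song, Wenjie Yang
Affiliations: Tencent Hunyuan
Categories: Computation and Language (cs.CL)
Comments: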

Abstract:Recently, deep reasoning large language models (LLMs) like DeepSeek-R1 have made significant progress in tasks such as mathematics and coding. Inspired by this, several studies have employed reinforcement learning (RL) to enhance models’ deep reasoning capabilities and improve machine translation (MT) quality. However, the terminology translation, an essential task in MT, remains unexplored in deep reasoning LLMs. In this paper, we propose TAT-R1, a terminology-aware translation model trained with reinforcement learning and word alignment. Specifically, we first extract the keyword translation pairs using a word alignment model. Then we carefully design three types of rule-based alignment rewards with the extracted alignment relationships. With those alignment rewards, the RL-trained translation model can learn to focus on the accurate translation of key information, including terminology in the source text. Experimental results show the effectiveness of TAT-R1. Our model significantly improves terminology translation accuracy compared to the baseline models while maintaining comparable performance on general translation tasks. In addition, we conduct detailed ablation studies of the DeepSeek-R1-like training paradigm for machine translation and reveal several key findings.
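
A rule-based alignment reward can be as simple as checking how many expected target-side terms appear in the model's translation. The sketch below shows one such rule under stated assumptions: the abstract mentions three reward types without defining them, so `terminology_reward`, the substring matching, and the example term pairs are hypothetical rather than the paper's rewards.

```python
from typing import Dict


def terminology_reward(translation: str, term_pairs: Dict[str, str]) -> float:
    """Fraction of expected target-side terms that appear in the translation."""
    if not term_pairs:
        return 0.0
    text = translation.lower()
    hits = sum(1 for target in term_pairs.values() if target.lower() in text)
    return hits / len(term_pairs)


# Hypothetical keyword pairs, as a word alignment model might extract them.
pairs = {"神经网络": "neural network", "梯度下降": "gradient descent"}

full = terminology_reward("The neural network is trained with gradient descent.", pairs)
partial = terminology_reward("The neural net is trained with gradient descent.", pairs)
```

A scalar of this kind can be added to the RL reward so that translations which preserve the aligned terms score higher than fluent translations that paraphrase them away.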

[NLP-46] M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

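[Quick read]: This paper addresses the performance loss that multilingual Large Language Models suffer under sparsification, in particular how to reduce model size while preserving multilingual ability. The key to the solution is M-Wanda, a pruning method that incorporates language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity according to cross-lingual importance, which better preserves multilingual performance.

Link: https://arxiv.org/abs/2505.21171
Authors: Rochelle Choenni, Ivan Titov
Affiliations: University of Amsterdam; University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: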

Abstract:Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.
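
For orientation, the original Wanda criterion scores each weight by its magnitude times the norm of the corresponding input activation, and prunes the lowest-scoring weights within each output row. The sketch below implements that criterion in plain Python and adds a hypothetical way to make it language-aware by averaging activation norms across languages. The pooling rule, the example numbers, and the function names are assumptions; the abstract does not specify how M-Wanda combines per-language statistics or sets layerwise sparsity.

```python
from typing import Dict, List

Matrix = List[List[float]]


def wanda_scores(weights: Matrix, act_norms: List[float]) -> Matrix:
    """Wanda-style importance: |weight| times the input feature's activation norm."""
    return [[abs(w) * n for w, n in zip(row, act_norms)] for row in weights]


def prune_rows(weights: Matrix, scores: Matrix, sparsity: float) -> Matrix:
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for row, srow in zip(weights, scores):
        k = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: srow[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


def pooled_norms(per_language: Dict[str, List[float]]) -> List[float]:
    """Hypothetical pooling: average each feature's norm across languages."""
    langs = list(per_language.values())
    return [sum(col) / len(langs) for col in zip(*langs)]


W = [[0.5, -0.1, 0.3, 0.05], [-0.2, 0.4, 0.01, 0.6]]
norms = pooled_norms({"en": [1.0, 4.0, 1.0, 1.0], "sw": [1.0, 6.0, 1.0, 3.0]})
W_pruned = prune_rows(W, wanda_scores(W, norms), sparsity=0.5)
```

In the first row the small weight `-0.1` survives while the larger `0.3` is pruned, because the second input feature carries much larger activations; which activations count as large is exactly what changes when statistics are gathered per language.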

[NLP-47] Leveraging GANs for citation intent classification and its impact on citation network analysis

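[Quick read]: This paper addresses citation intent classification in scientific literature and the effect of citation intent on paper centrality in citation networks. The key to the solution is a classifier based on generative adversarial networks (GANs) that achieves competitive classification performance with substantially fewer parameters, showing that GAN architectures combined with contextual embeddings are both effective and efficient for intent classification.

Link: https://arxiv.org/abs/2505.21162
Authors: Davi A. Bezerra, Filipi N. Silva, Diego R. Amancio
Affiliations: Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil; Observatory on Social Media, Indiana University, Bloomington, IN, USA
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: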

Abstract:Citations play a fundamental role in the scientific ecosystem, serving as a foundation for tracking the flow of knowledge, acknowledging prior work, and assessing scholarly influence. In scientometrics, they are also central to the construction of quantitative indicators. Not all citations, however, serve the same function: some provide background, others introduce methods, or compare results. Therefore, understanding citation intent allows for a more nuanced interpretation of scientific impact. In this paper, we adopted a GAN-based method to classify citation intents. Our results revealed that the proposed method achieves competitive classification performance, closely matching state-of-the-art results with substantially fewer parameters. This demonstrates the effectiveness and efficiency of leveraging GAN architectures combined with contextual embeddings in intent classification task. We also investigated whether filtering citation intents affects the centrality of papers in citation networks. Analyzing the network constructed from the unArXiv dataset, we found that paper rankings can be significantly influenced by citation intent. All four centrality metrics examined (degree, PageRank, closeness, and betweenness) were sensitive to the filtering of citation types. The betweenness centrality displayed the greatest sensitivity, showing substantial changes in ranking when specific citation intents were removed.
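
The effect of intent filtering on rankings is easy to see with the simplest of the four metrics, degree centrality. The sketch below ranks papers by in-degree over a toy edge list, once with all citations and once keeping only "method" citations. The edge list, intent labels, and function name are made up for illustration and are not taken from the paper's data.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Citation = Tuple[str, str, str]  # (citing paper, cited paper, intent)


def in_degree_ranking(citations: List[Citation], keep: Optional[Set[str]] = None) -> List[str]:
    """Rank cited papers by in-degree, optionally keeping only some intents."""
    counts = Counter(
        cited for _, cited, intent in citations if keep is None or intent in keep
    )
    # Sort by descending count, then by name for a deterministic order.
    return [p for p, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


edges: List[Citation] = [
    ("p1", "A", "background"), ("p2", "A", "background"),
    ("p3", "A", "background"), ("p4", "A", "method"),
    ("p1", "B", "method"), ("p2", "B", "method"), ("p3", "B", "method"),
]

all_rank = in_degree_ranking(edges)                      # A leads on raw citation count
method_rank = in_degree_ranking(edges, keep={"method"})  # B leads once background is removed
```

Paper A is cited more often overall, but mostly as background; restricting the network to method citations reverses the order, which is the kind of ranking shift the abstract reports.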

[NLP-48] Assessment of L2 Oral Proficiency using Speech Large Language Models INTERSPEECH

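[Quick read]: This paper addresses accuracy and generalization in automatic grading of second language (L2) English oral proficiency, that is, Spoken Language Assessment (SLA). Earlier approaches built on statistical models, text encoders, and self-supervised speech models either lose information in cascaded pipelines or, as End-to-End (E2E) graders, have limitations of their own. The key to the solution is to exploit the audio understanding of Multimodal Large Language Models (MLLMs) and to compare training strategies with regression and classification targets. The resulting graders outperform existing baselines on two datasets and generalize well in cross-part and cross-task evaluation.

Link: https://arxiv.org/abs/2505.21148
Authors: Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Affiliations: ALTA Institute, Department of Engineering
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to Interspeech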

Abstract:The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.

[NLP-49] Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

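[Quick read]: This paper addresses weak automatic speech recognition (ASR) performance in low-resource settings, specifically for Chinese dialects and accents. The key to the solution is to combine self-supervised pre-training with a large language model (LLM): a Data2vec2 model is pre-trained on 300,000 hours of unlabeled dialect and accented speech and then alignment-trained on a 40,000-hour supervised dataset, improving recognition of dialect and accented speech.

Link: https://arxiv.org/abs/2505.21138
Authors: Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei
Affiliations: Institute of Artificial Intelligence (TeleAI); Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: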

Abstract:Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.

[NLP-50] Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction INTERSPEECH

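[Quick read]: This paper addresses the challenges of Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF), in particular improving model performance when labelled data is limited. The key to the solution is a pseudo-labelling process that expands the training data from 77 hours to approximately 2500 hours, which improves performance. In addition, prompting an end-to-end (E2E) Whisper-based SGEC model with fluent transcriptions gives a slight improvement in SGEC and larger gains in feedback generation.

Link: https://arxiv.org/abs/2505.21137
Authors: Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Affiliations: ALTA Institute, Department of Engineering
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to Interspeech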

Abstract:Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.

[NLP-51] Creativity in LLM-based Multi-Agent Systems: A Survey

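[Quick read]: This paper addresses the neglect of the creativity dimension in current research on multi-agent systems (MAS): how novel outputs are generated and evaluated, how creativity informs agent persona design, and how creative workflows are coordinated. The key to the solution is a structured framework for creativity in MAS, comprising a taxonomy of agent proactivity and persona design; an overview of generation techniques (divergent exploration, iterative refinement, and collaborative synthesis) together with relevant datasets and evaluation metrics; and a discussion of key challenges such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.

Link: https://arxiv.org/abs/2505.21116
Authors: Yi-Cheng Lin, Kang-Chieh Chen, Zhe-Yan Li, Tzu-Heng Wu, Tzu-Hsuan Wu, Kuan-Yu Chen, Hung-yi Lee, Yun-Nung Chen
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages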

Abstract:Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

[NLP-52] Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

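[Quick read]: This paper addresses hallucination by Large Language Models (LLMs) in question answering (QA). The key factor it explores is the temporality of questions: whether a question is evergreen (its answer remains stable over time) or mutable (its answer changes). The paper introduces EverGreenQA, the first multilingual QA dataset with evergreen labels, for both evaluation and training, and uses it to benchmark 12 modern LLMs on whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). It also trains EG-E5, a lightweight multilingual classifier that achieves state-of-the-art (SoTA) performance on this task.

Link: https://arxiv.org/abs/2505.21115
Authors: Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: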

Abstract:Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions – whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.

[NLP-53] A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction

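[Quick read]: This paper addresses the high computational cost of domain adaptation techniques for large language models and the hallucinations in the text they generate, which matter especially in engineering contexts that demand well-structured, accurate output. The key to the solution is a lightweight adaptation method called the Small Language Graph (SLG): a graph in which each node is a small language model expert fine-tuned on specific, concise texts, which reduces computational cost and improves the accuracy of the generated text.

Link: https://arxiv.org/abs/2505.21109
Authors: Bogdan Bogachov, Yaoyao Fiona Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 10 pages, 4 Figures, 6 Tables. This paper has been accepted to be published in the proceedings of IDETC-CIE 2025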

Abstract:Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study have shown that SLG was able to surpass conventional fine-tuning methods on the Exact Match metric by 3 times. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially diverting the global need for expensive centralized compute clusters.

[NLP-54] Thinker: Learning to Think Fast and Slow

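[Quick read]: This paper addresses insufficient accuracy, verbose responses, and lack of confidence in Large Language Models (LLMs) on reasoning tasks. Inspired by Dual Process Theory in psychology, the key to the solution is a modification of the question-answering (QA) task into four stages: Fast Thinking, Verification, Slow Thinking, and Summarization. This staged, structured process improves the model's reasoning ability and response quality.

Link: https://arxiv.org/abs/2505.21097
Authors: Stephen Chung, Wenyu Du, Jie Fu
Affiliations: DualityRL; Shanghai AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages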

Abstract:Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.
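
The four stages named in the abstract can be sketched as a simple sequential flow. This is only a minimal illustration under stated assumptions, not the authors' training setup: the `generate` callable, the prompt wording, and both token budgets are hypothetical placeholders, and the reinforcement learning part is omitted entirely.

```python
from typing import Callable, Dict

# Hypothetical interface: any callable mapping (prompt, max_tokens) -> text.
Generate = Callable[[str, int], str]


def four_stage_episode(question: str, generate: Generate,
                       fast_budget: int = 1000,
                       slow_budget: int = 8000) -> Dict[str, str]:
    """Run the four stages in order and return each stage's output."""
    # 1. Fast Thinking: answer under a strict token budget.
    fast = generate(f"Question: {question}\nAnswer briefly.", fast_budget)
    # 2. Verification: the model evaluates its own initial response.
    verification = generate(
        f"Question: {question}\nInitial answer: {fast}\nIs it correct?",
        fast_budget)
    # 3. Slow Thinking: refine the initial response with a larger budget.
    slow = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        f"Verification: {verification}\nReason carefully and refine.",
        slow_budget)
    # 4. Summarization: distill the refinement into precise steps.
    summary = generate(f"Summarize the key steps:\n{slow}", fast_budget)
    return {"fast": fast, "verification": verification,
            "slow": slow, "summary": summary}
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; a stub function is enough to exercise the control flow.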

[NLP-55] BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

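[Quick read]: This paper addresses the evaluation of Large Language Models (LLMs) on Bengali linguistic understanding and cultural knowledge. The key to the solution is a new dataset, BLUCK, containing 2366 carefully curated multiple-choice questions (MCQs) across 23 categories covering Bangladesh's culture and history and Bengali linguistics. It is the first MCQ-based evaluation benchmark centred on native Bengali culture, history, and linguistics.

Link: https://arxiv.org/abs/2505.21092
Authors: Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat, Adnan Sadik, Arian Ahmed, Eunsu Kim, Alice Oh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: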

Abstract:In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job level examinations and spans 23 categories covering knowledge on Bangladesh’s culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs - including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs’ performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali’s status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.

[NLP-56] Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)

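[Quick read]: This paper addresses the lack of transparency of system prompts in Large Language Models (LLMs) and their effect on model outputs, in particular how the position of information in a system prompt shapes model behaviour and can introduce undetectable biases and downstream harms. The key to the solution is to compare how six commercially available LLMs process information about 50 demographic groups when it is placed in system prompts versus user prompts. This exposes the biases that arise from opaque system prompt configurations and supports the argument that system prompt analysis should be incorporated into AI auditing processes.

Link: https://arxiv.org/abs/2505.21091
Authors: Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh
Affiliations: Research Center Trust, UA Ruhr, University of Duisburg-Essen; Research Center Trust, UA Ruhr, Ruhr University Bochum; University of Cambridge
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Forthcoming in Proceedings of ACM FAccT 2025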

Abstract:System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others’ additions, while this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative and potential other biases and downstream harms beyond the user’s ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.

[NLP-57] LLMs Think But Not In Your Flow: Reasoning-Level Personalization for Black-Box Large Language Models

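[Quick read]: This paper addresses the failure of black-box LLMs to adapt their responses to users' personal preferences and reasoning styles. Existing methods focus on response-level personalization and do not model the user's personal thought process. The key to the solution is the RPM framework, which constructs user-specific statistical factors and personalized reasoning paths to achieve reasoning-level personalization, so that the model follows user-specific reasoning trajectories during inference, improving both predictive accuracy and interpretability.

Link: https://arxiv.org/abs/2505.21082
Authors: Jieyong Kim, Tongyoung Kim, Soonjin Yoon, Jaehyung Kim, Dongha Lee
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL)
Comments: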

Abstract:Large language models (LLMs) have recently achieved impressive performance across a wide range of natural language tasks and are now widely used in real-world applications. Among them, black-box LLMs–served via APIs without access to model internals–are especially dominant due to their scalability and ease of deployment. Despite their strong capabilities, these models typically produce generalized responses that overlook personal preferences and reasoning styles. This has led to growing interest in black-box LLM personalization, which aims to tailor model outputs to user-specific context without modifying model parameters. However, existing approaches primarily focus on response-level personalization, attempting to match final outputs without modeling personal thought process. To address this limitation, we propose RPM, a framework for reasoning-level personalization that aligns the model’s reasoning process with a user’s personalized logic. RPM first constructs statistical user-specific factors by extracting and grouping response-influential features from user history. It then builds personalized reasoning paths that reflect how these factors are used in context. In the inference stage, RPM retrieves reasoning-aligned examples for new queries via feature-level similarity and performs inference conditioned on the structured factors and retrieved reasoning paths, enabling the model to follow user-specific reasoning trajectories. This reasoning-level personalization enhances both predictive accuracy and interpretability by grounding model outputs in user-specific logic through structured information. Extensive experiments across diverse tasks show that RPM consistently outperforms response-level personalization methods, demonstrating the effectiveness of reasoning-level personalization in black-box LLMs.
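
The retrieval step, which picks reasoning-aligned examples by feature-level similarity, might look roughly like the sketch below. It is an illustration under assumptions rather than RPM's actual procedure: the abstract does not specify the similarity measure, so Jaccard overlap between feature sets is a stand-in, and the record layout (`features`, `reasoning`) is hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two feature sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_features: Set[str], history: List[Dict],
             k: int = 2) -> List[Dict]:
    """Return the k past examples whose features best match the query."""
    ranked = sorted(history,
                    key=lambda ex: jaccard(query_features, ex["features"]),
                    reverse=True)
    return ranked[:k]


# Toy user history; the features and reasoning strings are made up.
history = [
    {"features": {"price", "battery"}, "reasoning": "weighs cost first"},
    {"features": {"plot", "acting"}, "reasoning": "cares about story"},
    {"features": {"price", "design"}, "reasoning": "trades cost for looks"},
]
top = retrieve({"price", "battery"}, history, k=2)
```

The retrieved records would then be placed in the prompt alongside the structured factors, so the black-box model can condition on them without any parameter changes.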

[NLP-58] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

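[Quick read]: This paper addresses hallucination detection in Retrieval-Augmented Generation (RAG) systems, where generated content is factually incorrect either because of inconsistencies in the model's internal knowledge or because of incorrect use of the retrieved context. Existing methods often conflate factuality with faithfulness to the retrieved context, so factually correct statements that are not directly supported by the retrieval are misclassified as hallucinations. The key to the solution is FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), which applies different Uncertainty Quantification (UQ) techniques to estimate factuality depending on whether a statement is faithful to the retrieved context, and thereby detects factual errors in RAG outputs more accurately.

Link: https://arxiv.org/abs/2505.21072
Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Affiliations: ETH Zürich; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: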

Abstract:Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model’s internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.
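
One natural way to read "different UQ techniques depending on faithfulness" is as a mixture over the two cases, weighted by the probability that the statement is faithful. The sketch below assumes that reading; the abstract does not give the combination rule, so the function name, its three inputs, and the law-of-total-probability form are all assumptions rather than the paper's stated method.

```python
def factuality_score(p_faithful: float,
                     p_true_given_faithful: float,
                     p_true_given_unfaithful: float) -> float:
    """Combine two branch-specific estimates into one factuality score.

    p_faithful: estimated probability that the statement is supported
        by the retrieved context.
    p_true_given_faithful / p_true_given_unfaithful: factuality estimates
        from the UQ method used in each branch (hypothetical inputs).
    """
    for p in (p_faithful, p_true_given_faithful, p_true_given_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    return (p_faithful * p_true_given_faithful
            + (1.0 - p_faithful) * p_true_given_unfaithful)
```

Under this reading, a statement that is probably unfaithful but rated as likely true by the second estimator still receives a high score, which is exactly the case the abstract says existing methods misclassify.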

[NLP-59] Predicting Implicit Arguments in Procedural Video Instructions ACL2025

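[Quick read]: This paper addresses the inaccurate identification of implicit semantic roles in procedural texts, especially in multimodal cooking procedures, where existing Semantic Role Labeling (SRL) benchmarks often miss implicit arguments and so give an incomplete understanding. The key to the solution is the Implicit-VidSRL dataset, which requires models to infer both explicit and implicit semantic roles from multimodal context and to track entities through visual changes. The proposed iSRL-Qwen2-VL model improves F1 on implicit what and where/with roles over GPT-4o.

Link: https://arxiv.org/abs/2505.21068
Authors: Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
Affiliations: University of Edinburgh; TU Darmstadt; hessian.AI
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2025 Main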

Abstract:Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure such as (verb, what, where/with). Procedural instructions are highly elliptic: for instance, in (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step’s where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models’ contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% for where/with-implicit semantic roles over GPT-4o.

[NLP-60] Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

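[Quick read]: This paper addresses the prediction of natural conversational turn-taking in human-robot interaction; most existing models rely on speech alone and ignore other modalities. The key to the solution is MM-VAP, a multimodal predictive turn-taking model that combines speech with visual cues, including facial expression, head pose, and gaze, raising hold/shift prediction accuracy in videoconferencing interactions (84% vs. 79%). Grouping by the duration of silence between turns shows that visual features improve performance across all durations of speaker transitions, and facial expression features contribute the most.

Link: https://arxiv.org/abs/2505.21043
Authors: Sam O’Connor Russell, Naomi Harte
Affiliations: ADAPT Centre, School of Engineering, Trinity College Dublin, Ireland
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: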

Abstract:Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
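
Grouping hold/shift accuracy by the duration of silence between turns, rather than pooling every event, can be sketched as below. The bin edges and the event tuple layout are hypothetical choices for illustration; the abstract does not state which duration groups the authors use.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Event = Tuple[float, str, str]  # (silence_seconds, true_label, predicted_label)


def accuracy_by_silence(events: Iterable[Event],
                        bin_edges: Sequence[float] = (0.25, 0.5, 1.0)
                        ) -> Dict[int, float]:
    """Hold/shift accuracy per silence-duration bin.

    Bin i holds events whose silence is >= exactly i of the edges, so
    bin 0 is the shortest gap and bin len(bin_edges) the longest.
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for silence, truth, pred in events:
        b = sum(silence >= edge for edge in bin_edges)
        totals[b] += 1
        hits[b] += int(truth == pred)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up events, only to show the call shape.
events = [(0.1, "hold", "hold"), (0.2, "shift", "hold"),
          (0.6, "shift", "shift"), (1.5, "hold", "hold")]
per_bin = accuracy_by_silence(events)
```

Reporting one number per bin makes it visible when a model does well overall but fails on a particular range of gap lengths, which a single pooled accuracy would hide.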

[NLP-61] FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis

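[Quick read]: This paper addresses the effectiveness of cross-task knowledge transfer in Targeted Sentiment Analysis (TSA), in particular the insufficient fine-grained modelling of aspect-sentiment relationships. Most existing methods rely on coarse-grained knowledge transfer and fail to capture the subtle sentiment differences between aspects, leading to negative transfer. The key to the solution is the FCKT framework, which explicitly incorporates aspect-level information into sentiment prediction to achieve fine-grained cross-task knowledge transfer, mitigating negative transfer and improving task performance.

Link: https://arxiv.org/abs/2505.21040
Authors: Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Affiliations: Beihang University; Xidian University; Zhongguancun Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures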

Abstract:In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on this https URL.

[NLP-62] Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation ACL2025

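[Quick read]: This paper addresses chronic problems in Dialogue Topic Segmentation (DTS): data shortage, labeling ambiguity, and the increasing complexity of recently proposed methods. The key to the solution is Def-DTS, a multi-step deductive reasoning method based on Large Language Models (LLMs) that uses a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection, improving DTS performance and enabling case studies on intermediate results.

Link: https://arxiv.org/abs/2505.21033
Authors: Seungmin Lee, Yongsang Yoo, Minhwa Jung, Min Song
Affiliations: Yonsei University; LOTTE INNOVATE; LG Electronics; Onoma AI
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 3 figures, Accepted to Findings of the ACL 2025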

Abstract:Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and incremental complexity of recently proposed solutions. On the other hand, despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case studies using intermediate results. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose a generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 error. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.

[NLP-63] Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

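[Quick read]: This paper addresses the theoretical mechanism by which pause tokens improve Transformer performance, in particular their effect on computational expressivity. The key contribution is the first formal separation result: adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of AC⁰ functions, while a polynomial number of pause tokens lets them express the entire class. For logarithmic-precision Transformers, pause tokens bring expressivity to the level of TC⁰, matching known upper bounds.

Link: https://arxiv.org/abs/2505.21024
Authors: Charles London, Varun Kanade
Affiliations: University of Oxford
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: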

Abstract:Pause tokens, simple filler symbols such as “…”, consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.
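
To make the parity experiment concrete, the sketch below builds one training example: a bit string followed by pause tokens, labelled with its parity. The input layout (pauses appended after the bits), the pause symbol, and the function name are assumptions for illustration; the abstract does not describe the exact data format.

```python
import random
from typing import List, Tuple

PAUSE = "..."  # filler symbol; the exact token is an assumption


def parity_example(n_bits: int, n_pause: int,
                   rng: random.Random) -> Tuple[List[str], int]:
    """Return (input tokens, parity label) with pause tokens appended."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    tokens = [str(b) for b in bits] + [PAUSE] * n_pause
    label = sum(bits) % 2  # 1 if the number of ones is odd
    return tokens, label


tokens, label = parity_example(n_bits=8, n_pause=4, rng=random.Random(0))
```

The pause tokens carry no information about the label; they only give the model extra positions to compute in, which is the resource the expressivity result is about.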

[NLP-64] LLMs are Frequency Pattern Learners in Natural Language Inference

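[Quick read]: This paper addresses the unclear mechanism behind the inference gains that pre-trained Large Language Models (LLMs) obtain from fine-tuning on Natural Language Inference (NLI). The key to the solution is an analysis of predicate frequency distributions in premises and hypotheses, which finds a frequency bias: in positive instances, predicates in hypotheses occur more frequently than those in premises. The paper then verifies that models rely on this bias during inference, revealing the role of frequency pattern learning in the performance gains.

Link: https://arxiv.org/abs/2505.21011
Authors: Liang Cheng, Zhaowei Wang, Mark Steedman
Affiliations: University of Edinburgh; HKUST
Categories: Computation and Language (cs.CL)
Comments: 9 pages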

Abstract:While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments to investigate what LLMs actually learn during fine-tuning. We begin by analyzing predicate frequencies in premises and hypotheses across NLI datasets and identify a consistent frequency bias, where predicates in hypotheses occur more frequently than those in premises for positive instances. To assess the impact of this bias, we evaluate both standard and NLI fine-tuned LLMs on bias-consistent and bias-adversarial cases. We find that LLMs exploit frequency bias for inference and perform poorly on adversarial instances. Furthermore, fine-tuned LLMs exhibit significantly increased reliance on this bias, suggesting that they are learning these frequency patterns from datasets. Finally, we compute the frequencies of hyponyms and their corresponding hypernyms from WordNet, revealing a correlation between frequency bias and textual entailment. These findings help explain why learning frequency patterns can enhance model performance on inference tasks.
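
The split into bias-consistent and bias-adversarial cases can be sketched as follows. The criterion used here (an instance is consistent when "hypothesis predicate is more frequent than premise predicate" agrees with its entailment label) is an assumed operationalisation of the abstract's description, and the frequency counts are invented for the example.

```python
from typing import Dict, List, Tuple

Instance = Tuple[str, str, bool]  # (premise predicate, hypothesis predicate, entailed?)


def is_bias_consistent(inst: Instance, freq: Dict[str, int]) -> bool:
    """True when the frequency cue points the same way as the gold label."""
    premise_pred, hypothesis_pred, entailed = inst
    cue = freq.get(hypothesis_pred, 0) > freq.get(premise_pred, 0)
    return cue == entailed


def split_by_bias(instances: List[Instance], freq: Dict[str, int]
                  ) -> Tuple[List[Instance], List[Instance]]:
    """Partition instances into (bias-consistent, bias-adversarial)."""
    consistent = [i for i in instances if is_bias_consistent(i, freq)]
    adversarial = [i for i in instances if not is_bias_consistent(i, freq)]
    return consistent, adversarial


# Invented corpus counts, for illustration only.
freq = {"purchase": 120, "buy": 900, "own": 1500}
data = [("purchase", "own", True), ("own", "purchase", True),
        ("buy", "purchase", False)]
consistent, adversarial = split_by_bias(data, freq)
```

A model that leans on the frequency cue would score well on the first list and poorly on the second, which is the pattern the abstract reports.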

[NLP-65] Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models? ACL2025

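[Quick read]: This paper addresses an underexplored question in long-context in-context learning (ICL): how increasing the number of examples affects the trustworthiness of generated responses, specifically predictive uncertainty. The key to the solution is to systematically quantify uncertainty at varying shot counts and, through uncertainty decomposition, show how injecting task-specific knowledge reduces epistemic uncertainty (EU) and thereby improves performance. The paper further analyses how internal confidence evolves across layers.

Link: https://arxiv.org/abs/2505.21003
Authors: Yifei Wang, Yu Sheng, Linjing Li, Daniel Zeng
Affiliations: Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: Camera-ready versions for ACL 2025 Findings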

Abstract:Recent advances in handling long sequences have facilitated the exploration of long-context in-context learning (ICL). While much of the existing research emphasizes performance improvements driven by additional in-context examples, the influence on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty, an essential aspect in trustworthiness. We begin by systematically quantifying the uncertainty of ICL with varying shot counts, analyzing the impact of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, with a focus on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we explore the evolution of internal confidence across layers, unveiling the mechanisms driving the reduction in uncertainty.
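
A common entropy-based way to split predictive uncertainty into aleatoric and epistemic parts is sketched below. It is offered as background only: the abstract does not say which decomposition the authors use, and the idea of obtaining several predictive distributions (for example from different sets of in-context examples) is an assumption of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose(distributions: List[Sequence[float]]
              ) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) uncertainty.

    total     = entropy of the averaged distribution
    aleatoric = average entropy of the individual distributions
    epistemic = total - aleatoric (the disagreement between them)
    """
    n = len(distributions)
    n_classes = len(distributions[0])
    mean = [sum(d[i] for d in distributions) / n for i in range(n_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(d) for d in distributions) / n
    return total, aleatoric, total - aleatoric
```

When every distribution agrees, the epistemic term is zero and all remaining uncertainty is aleatoric; when confident predictions contradict each other, the epistemic term dominates.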

[NLP-66] Articulatory strategy in vowel production as a basis for speaker discrimination INTERSPEECH2025

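[Quick read]: This paper asks whether articulatory strategy in vowel production is sufficiently speaker-specific to serve as a basis for speaker discrimination. The study applies Generalised Procrustes Analysis to tongue shape data from 40 English speakers from the North West of England and assesses the speaker-discriminatory potential of orthogonal tongue shape features within a likelihood ratio framework. Tongue size is the single most discriminating dimension, and shape variation in the anterior part of the tongue generally outperforms variation in the posterior part. Shape-only information may offer speaker specificity comparable to size-and-shape information, but only when features do not exhibit speaker-level co-variation.

Link: https://arxiv.org/abs/2505.20995
Authors: Justin J. H. Lo, Patrycja Strycharczuk, Sam Kirkham
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to Interspeech 2025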

点击查看摘要

Abstract:The way speakers articulate is well known to be variable across individuals while at the same time subject to anatomical and biomechanical constraints. In this study, we ask whether articulatory strategy in vowel production can be sufficiently speaker-specific to form the basis for speaker discrimination. We conducted Generalised Procrustes Analyses of tongue shape data from 40 English speakers from the North West of England, and assessed the speaker-discriminatory potential of orthogonal tongue shape features within the framework of likelihood ratios. Tongue size emerged as the individual dimension with the strongest discriminatory power, while tongue shape variation in the more anterior part of the tongue generally outperformed tongue shape variation in the posterior part. When considered in combination, shape-only information may offer comparable levels of speaker specificity to size-and-shape information, but only when features do not exhibit speaker-level co-variation.

[NLP-67] Who Reasons in the Large Language Models?

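[Quick read]: This paper addresses the lack of theoretical grounding and transparency in endowing large language models (LLMs) with new capabilities such as mathematical reasoning. The central hypothesis is that reasoning ability stems mainly from the output projection module (oproj) of the Transformer's multi-head self-attention. The authors introduce Stethoscope for Networks (SfN), a suite of diagnostic tools, and use it to provide evidence that oproj plays a central role in reasoning while other modules contribute more to fluent dialogue.

Link: https://arxiv.org/abs/2505.20993
Authors: Jie Shao,Jianxin Wu
Affiliations: Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: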

Abstract:Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities–such as mathematical reasoning–remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (oproj) in the Transformer’s multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that oproj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.

[NLP-68] RefAV: Towards Planning-Centric Scenario Mining

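[Quick read]: This paper addresses the problem of identifying interesting and safety-critical scenarios in uncurated autonomous vehicle (AV) driving logs. Traditional scenario mining is error-prone and time-consuming and usually relies on hand-crafted structured queries. The key idea is to revisit spatio-temporal scenario mining with recent vision-language models (VLMs), using natural language queries to detect and precisely localize described scenarios in driving logs. To this end the authors introduce RefAV, a dataset of 10,000 diverse natural language queries describing complex multi-agent interactions relevant to motion planning, and evaluate several referential multi-object trackers.

Link: https://arxiv.org/abs/2505.20981
Authors: Cainan Davidson,Deva Ramanan,Neehar Peri
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: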

Abstract:Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Our code and dataset are available at this https URL and this https URL

[NLP-69] Evaluating and Steering Modality Preferences in Multimodal Large Language Model

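[Quick read]: This paper investigates whether multimodal large language models (MLLMs) exhibit modality preference when processing multimodal context, that is, whether they tend to rely on one modality over another when making decisions. The authors build the MC² benchmark under controlled evidence-conflict scenarios to evaluate modality preference systematically, and propose a probing and steering method based on representation engineering that explicitly controls the preference direction without additional fine-tuning or carefully crafted prompts. The method amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation.

Link: https://arxiv.org/abs/2505.20977
Authors: Yu Zhang,Jinlong Ma,Yongshuai Hou,Xuefeng Bai,Kehai Chen,Yang Xiang,Jun Yu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
Categories: Computation and Language (cs.CL)
Comments: Modality Preference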

Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build a MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
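
To make the idea of a "preference direction" in latent space concrete, here is a minimal sketch of difference-of-means steering, a generic representation-engineering technique. It is not claimed to be the paper's exact method; the function names, the plain-list representation of hidden states, and the scaling factor `alpha` are all assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def preference_direction(text_pref_states, image_pref_states):
    """Hypothetical steering direction: mean hidden state of examples where
    the model followed the text evidence, minus the mean hidden state of
    examples where it followed the image evidence."""
    mean_text = mean_vector(text_pref_states)
    mean_image = mean_vector(image_pref_states)
    return [a - b for a, b in zip(mean_text, mean_image)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the direction; alpha > 0 pushes toward
    the text-preferring side, alpha < 0 toward the image-preferring side."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In a real model the hidden states would come from a chosen layer and the shift would be applied during the forward pass; which layer and what scale work well are empirical questions this sketch does not answer.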

[NLP-70] Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing ACL2025

[Quick read]: This paper tackles cross-domain constituency parsing, which remains challenging because multi-domain constituency treebanks are scarce. The key contribution is a new treebank generation method, LLM back generation, which takes an incomplete cross-domain constituency tree containing only domain keyword leaf nodes and fills in the missing words to produce a cross-domain treebank. The authors also introduce a span-level contrastive learning pre-training strategy to make full use of the generated treebank.

Link: https://arxiv.org/abs/2505.20976
Authors: Peiming Guo,Meishan Zhang,Jianling Li,Min Zhang,Yue Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Tianjin University; Westlake University
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 main conference

Abstract:Cross-domain constituency parsing is still an unsolved challenge in computational linguistics since the available multi-domain constituency treebank is limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. The performance of LLMs on constituency parsing is poor, therefore we propose a novel treebank generation method, LLM back generation, which is similar to the reverse process of constituency parsing. LLM back generation takes the incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills the missing words to generate the cross-domain constituency treebank. Besides, we also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art performance on average results compared with various baselines.

[NLP-71] Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA

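[Quick read]: This paper addresses two complementary weaknesses: large language models (LLMs) hallucinate and lack reliable factual grounding in complex reasoning, while knowledge graphs (KGs) lack flexible reasoning ability. The proposed Reason-Align-Respond (RAR) framework combines the two through three components: a Reasoner that generates human-like reasoning chains, an Aligner that maps those chains to valid KG paths, and a Responser that synthesizes the final answer. The result is better performance and interpretability in knowledge graph question answering (KGQA).

Link: https://arxiv.org/abs/2505.20971
Authors: Xiangqing Shen,Fanfan Wang,Rui Xia
Affiliations: Nanjing University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: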

Abstract:LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit@1 scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
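
For readers unfamiliar with the Hit@1 metric quoted above, the following is a minimal sketch of how it is commonly computed for KGQA: a question counts as a hit when the top-ranked prediction is among the gold answers. The function name and input layout are assumptions for illustration, and the paper's own evaluation script may differ in details such as answer normalization.

```python
def hit_at_1(ranked_predictions, gold_answers):
    """Fraction of questions whose top-ranked prediction is a gold answer.

    ranked_predictions: one ranked list of answers per question
    gold_answers: one set of acceptable answers per question
    """
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold
    )
    return hits / len(gold_answers)
```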

[NLP-72] Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation KDD2025

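[Quick read]: This paper addresses two problems in query auto-completion (QAC): the need for hierarchical personalized user representations, and detoxification. Earlier methods typically treat a user's search behavior as a single overall representation, which is inadequate for more nuanced generative scenarios. In addition, query prefixes are short and may contain typos or sensitive information, which raises the risk of generating toxic content. The proposed model, LaD, captures personalized information hierarchically at coarse-grained and fine-grained levels and uses an online training method based on Reject Preference Optimization (RPO) with a special [Reject] token to achieve adaptive detoxification, producing completions that are both non-toxic and relevant.

Link: https://arxiv.org/abs/2505.20966
Authors: Zhibo Wang,Xiaoze Jiang,Zhiheng Qin,Enyun Yu,Han Li
Affiliations: Kuaishou Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: KDD 2025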

Abstract: Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users’ search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. To go a step further, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement in nearly two years of our product. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at this https URL.

[NLP-73] Context-Aware Content Moderation for German Newspaper Comments

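[Quick read]: This paper addresses automatic content moderation in German-language newspaper forums, where existing studies often neglect platform-specific context such as user history and article themes. The approach is to develop and evaluate binary classification models that incorporate contextual information. CNN and LSTM models benefit from context and perform competitively with state-of-the-art approaches, whereas zero-shot classification with ChatGPT-3.5 Turbo does not improve with added context and underperforms.

Link: https://arxiv.org/abs/2505.20963
Authors: Felix Krejca,Tobias Kietreiber,Alexander Buchelt,Sebastian Neumaier
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: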

Abstract:The increasing volume of online discussions requires advanced automatic content moderation to maintain responsible discourse. While hate speech detection on social media is well-studied, research on German-language newspaper forums remains limited. Existing studies often neglect platform-specific context, such as user history and article themes. This paper addresses this gap by developing and evaluating binary classification models for automatic content moderation in German newspaper forums, incorporating contextual information. Using LSTM, CNN, and ChatGPT-3.5 Turbo, and leveraging the One Million Posts Corpus from the Austrian newspaper Der Standard, we assess the impact of context-aware models. Results show that CNN and LSTM models benefit from contextual information and perform competitively with state-of-the-art approaches. In contrast, ChatGPT’s zero-shot classification does not improve with added context and underperforms.

[NLP-74] Research Community Perspectives on “Intelligence” and Large Language Models ACL

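[Quick read]: This paper addresses the unclear meaning of "intelligence" in Natural Language Processing (NLP) research and the role that notion plays in the research agenda. It reports a survey of 303 researchers from multiple fields, which identifies the three criteria of intelligence the community agrees on most (generalization, adaptability, and reasoning). Viewing current NLP systems as "intelligent" is a minority position (29%), and only 16.2% of respondents see the development of intelligent systems as a research goal.

Link: https://arxiv.org/abs/2505.20959
Authors: Bertram Højer,Terne Sasha Thorn Jakobsen,Anna Rogers,Stefan Heinrich
Affiliations: IT University of Copenhagen; University of Copenhagen
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: ACL Findings 2025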

Abstract: Despite the widespread use of "artificial intelligence" (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by "intelligence". To that end, we present the results of a survey on the notion of "intelligence" among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, reasoning. Our results suggest that the perception of the current NLP systems as "intelligent" is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.

[NLP-75] On VLMs for Diverse Tasks in Multimodal Meme Classification

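[Quick read]: This paper studies meme classification across different tasks, specifically sarcasm, offensiveness, and sentiment classification. The key idea is to combine vision-language models (VLMs) with large language models (LLMs): a VLM produces a detailed interpretation of the meme image, and that interpretation is used to fine-tune a smaller LLM. The combined strategy improves baseline performance by 8.34%, 3.52%, and 26.24% on sarcasm, offensive, and sentiment classification, respectively.

Link: https://arxiv.org/abs/2505.20937
Authors: Deepesh Gavit,Debajyoti Mazumder,Samiran Das,Jasabanta Patro
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages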

Abstract:In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduced a novel approach that generates a VLM-based understanding of meme images and fine-tunes the LLMs on textual understanding of the embedded meme text for improving the performance. Our contributions are threefold: (1) Benchmarking VLMs with diverse prompting strategies purposely to each sub-task; (2) Evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) Proposing a novel approach where detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. The strategy of combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.

[NLP-76] Information-Theoretic Complementary Prompts for Improved Continual Text Classification

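[Quick read]: This paper addresses catastrophic forgetting in Continual Text Classification (CTC), where a model forgets previously acquired knowledge as it learns new tasks. Existing methods usually focus on task-specific knowledge and overlook shared, task-agnostic knowledge. The proposed Information-Theoretic Complementary Prompts (InfoComp) explicitly learns two distinct prompt spaces, P-Prompt (task-specific knowledge) and S-Prompt (task-invariant knowledge), so that classification tasks can be learned sequentially without data replay. An information-theoretic framework maximizes mutual information between different parameters, and two new loss functions strengthen the accumulation of task-specific knowledge and the retention of task-invariant knowledge, mitigating catastrophic forgetting and improving forward knowledge transfer.

Link: https://arxiv.org/abs/2505.20933
Authors: Duzhen Zhang,Yong Ren,Chenxing Li,Dong Yu,Tielin Zhang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Chinese Academy of Sciences; Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: Accepted by Neural Networks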

Abstract:Continual Text Classification (CTC) aims to continuously classify new text data over time while minimizing catastrophic forgetting of previously acquired knowledge. However, existing methods often focus on task-specific knowledge, overlooking the importance of shared, task-agnostic knowledge. Inspired by the complementary learning systems theory, which posits that humans learn continually through the interaction of two systems – the hippocampus, responsible for forming distinct representations of specific experiences, and the neocortex, which extracts more general and transferable representations from past experiences – we introduce Information-Theoretic Complementary Prompts (InfoComp), a novel approach for CTC. InfoComp explicitly learns two distinct prompt spaces: P(rivate)-Prompt and S(hared)-Prompt. These respectively encode task-specific and task-invariant knowledge, enabling models to sequentially learn classification tasks without relying on data replay. To promote more informative prompt learning, InfoComp uses an information-theoretic framework that maximizes mutual information between different parameters (or encoded representations). Within this framework, we design two novel loss functions: (1) to strengthen the accumulation of task-specific knowledge in P-Prompt, effectively mitigating catastrophic forgetting, and (2) to enhance the retention of task-invariant knowledge in S-Prompt, improving forward knowledge transfer. Extensive experiments on diverse CTC benchmarks show that our approach outperforms previous state-of-the-art methods.

[NLP-77] Multi-objective Large Language Model Alignment with Hierarchical Experts

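[Quick read]: This paper addresses aligning large language models (LLMs) to satisfy multiple objectives at once. Given diverse and often conflicting human preferences, existing alignment methods struggle to balance trade-offs and often require costly retraining or yield suboptimal results across the Pareto frontier of preferences. The proposed HoE (Hierarchical Mixture-of-Experts) is a lightweight, parameter-efficient, plug-and-play approach that lets LLMs adapt across the entire Pareto frontier without model training and accommodate diverse user preferences. It consists of three hierarchical components (LoRA Experts, Router Experts, and Preference Routing) and achieves a trade-off between parameter size, training cost, and performance.

Link: https://arxiv.org/abs/2505.20925
Authors: Zhuo Li,Guodong Du,Weiyang Guo,Yigeng Zhou,Xiucheng Li,Wenya Wang,Fangming Liu,Yequan Wang,Deheng Ye,Min Zhang,Jing Li
Affiliations: Harbin Institute of Technology, Shenzhen, China; Nanyang Technological University, Singapore; Peng Cheng Laboratory, China; Beijing Academy of Artificial Intelligence, China; Tencent, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: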

Abstract: Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.

[NLP-78] Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models ACL2025

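[Quick read]: This paper addresses how to select a suitable large language model (LLM) tier for each subtask in Natural Language Processing (NLP) pipelines so as to balance cost and performance. The proposed training-free framework, LLM Automatic Transmission (LLM-AT), consists of a Starter, a Generator, and a Judge that together select an initial tier and iteratively upgrade to a higher tier until a valid response is produced. The paper also introduces an accuracy estimator that predicts the expected accuracy of each LLM tier without training, which improves the choice of the initial tier.

Link: https://arxiv.org/abs/2505.20921
Authors: Injae Na,Keonwoong Noh,Woohwan Jung
Affiliations: Hanyang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 (Findings)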

Abstract:LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge to balance between cost and performance. To address the problem, we introduce LLM Automatic Transmission (LLM-AT) framework that automatically selects LLM tiers without training. LLM-AT consists of Starter, Generator, and Judge. The starter selects the initial LLM tier expected to solve the given question, the generator produces a response using the LLM of the selected tier, and the judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose accuracy estimator, which enables the suitable initial LLM tier selection without training. Given an input question, accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
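
The Starter, Generator, and Judge loop described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the record layout, the `similarity`, `generate`, and `judge` callables, and the `threshold` used to pick the starting tier are all hypothetical, since the abstract does not specify them.

```python
def estimate_accuracy(history, question, tier, similarity, k=3):
    """Valid-response rate of `tier` over the top-k most similar past queries.

    history: list of dicts with keys "question", "tier", "valid"
    (assumed layout of the past inference records).
    """
    records = [r for r in history if r["tier"] == tier]
    records.sort(key=lambda r: similarity(question, r["question"]), reverse=True)
    top = records[:k]
    if not top:
        return 0.0
    return sum(1 for r in top if r["valid"]) / len(top)

def choose_start_tier(history, question, tiers, similarity, threshold=0.5, k=3):
    """Starter: index of the cheapest tier whose estimated accuracy reaches
    the threshold (the threshold rule is an assumption of this sketch)."""
    for i, tier in enumerate(tiers):  # tiers ordered from cheapest to strongest
        if estimate_accuracy(history, question, tier, similarity, k) >= threshold:
            return i
    return len(tiers) - 1

def answer(question, tiers, generate, judge, start=0):
    """Generator + Judge: upgrade tiers until the judge accepts a response.
    If no tier is accepted, the top tier's response is returned."""
    response = None
    for tier in tiers[start:]:
        response = generate(tier, question)
        if judge(question, response):
            return tier, response
    return tiers[-1], response
```

Starting from an estimated tier rather than always from the cheapest one avoids paying for low-tier calls that are likely to be rejected, which is the cost saving the abstract points to.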

[NLP-79] Automated Privacy Information Annotation in Large Language Model Interactions

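[Quick read]: This paper addresses the risk that users interacting with large language models (LLMs) under their real identifiers unknowingly disclose private information. The authors construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases, and design an automated privacy annotation pipeline that uses cloud-based strong LLMs to extract privacy phrases from dialogue datasets and annotate the leaked information. The dataset supports the development and evaluation of privacy detection models that can be deployed on local user devices.

Link: https://arxiv.org/abs/2505.20910
Authors: Hang Zeng,Xiangyu Liu,Yong Hu,Chaoyue Niu,Fan Wu,Shaojie Tang,Guihai Chen
Affiliations: Shanghai Jiao Tong University; WeChat AI Tencent; University at Buffalo
Categories: Computation and Language (cs.CL)
Comments: 9 content pages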

Abstract:Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application scenarios, typically tagging personally identifiable information (PII) in anonymous content. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with cloud-based strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using light-weight LLMs with both tuning-free and tuning-based methods, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.

[NLP-80] Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration? ACL2025

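[Quick read]: This paper addresses the poor calibration that often follows fine-tuning of large language models (LLMs), where confidence scores do not match actual performance. The study finds that known data, which is ubiquitous in real-world fine-tuning, harms calibration: data aligned with the model's prior knowledge induces overconfidence, whereas new knowledge improves calibration. The proposed CogCalib framework applies targeted learning strategies according to the model's prior knowledge. Experiments across 7 tasks show that CogCalib significantly improves calibration while maintaining performance, with an average 57% reduction in ECE compared to standard fine-tuning on Llama3-8B.

Link: https://arxiv.org/abs/2505.20903
Authors: Ziming Wang,Zeyu Shi,Haoyi Zhou,Shiqi Gao,Qingyun Sun,Jianxin Li
Affiliations: Beihang University; School of Software, Beihang University; Zhongguancun Laboratory, Beijing
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL2025 Main; The code will be released soon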

Abstract:Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs’ prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs’ prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs’ prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs’ encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model’s prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
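
Since the result above is reported as a reduction in ECE (expected calibration error), a minimal sketch of the standard equal-width-bin ECE may help. The function name and the choice of 10 bins are assumptions for illustration; the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the weighted average, over confidence bins,
    of |accuracy - mean confidence| within each bin.

    confidences: predicted confidence in [0, 1] for each example
    correct: whether each prediction was right (bool)
    Bins are [lo, hi), with a confidence of exactly 1.0 placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

An overconfident model, for example one that reports 0.9 confidence but is right only 75% of the time, gets an ECE of about 0.15 on those examples, while a model whose confidence matches its accuracy scores 0.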

[NLP-81] A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models

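[Quick read]: This paper addresses the concern that large vision language models (LVLMs) may learn and generate social biases and stereotypes. Previous evaluations have two main limitations: metrics that overlook the importance of content words, and datasets that overlook the effect of color. The paper introduces new evaluation metrics based on the Stereotype Content Model (SCM) and proposes BASIC, a benchmark for assessing gender, race, and color stereotypes. Using both, the study characterizes stereotypes in LVLMs and finds that the interaction between model architecture and parameter size appears to affect them.

Link: https://arxiv.org/abs/2505.20901
Authors: Junhyuk Choi,Minju Kim,Yeseon Hong,Bugeun Kim
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review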

Abstract: As large vision language models (LVLMs) rapidly advance, concerns about their potential to learn and generate social biases and stereotypes are increasing. Previous studies on stereotypes in LVLMs face two primary limitations: metrics that overlooked the importance of content words, and datasets that overlooked the effect of color. To address these limitations, this study introduces new evaluation metrics based on the Stereotype Content Model (SCM). We also propose BASIC, a benchmark for assessing gender, race, and color stereotypes. Using SCM metrics and BASIC, we conduct a study with eight LVLMs to discover stereotypes. We report three findings. (1) The SCM-based evaluation is effective in capturing stereotypes. (2) LVLMs exhibit color stereotypes in the output along with gender and race ones. (3) Interaction between model architecture and parameter sizes seems to affect stereotypes. We release BASIC publicly on [anonymized for review].

[NLP-82] Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

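[Quick read]: This paper addresses insufficient transfer of speech patterns in cross-lingual dubbing. Existing speech translation approaches achieve strong translation quality but often overlook speech characteristics, so the generated speech mismatches the source in duration, speaker identity, and speaking speed, which limits their suitability for dubbing. The proposed system uses a discrete diffusion-based speech-to-unit translation model with explicit duration control for time-aligned translation, synthesizes speech with a conditional flow matching model, and adds a unit-based speed adaptation mechanism that keeps the speaking rate consistent with the source without relying on any text.

Link: https://arxiv.org/abs/2505.20899
Authors: Jeongsoo Choi,Jaehun Kim,Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: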

Abstract:This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech’s duration and speaking pace, while achieving competitive translation performance.

[NLP-83] Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

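[Quick read]: This paper addresses the difficulty of aligning perception with language in Vision-and-Language Navigation (VLN) under partial observability. Existing methods mitigate this by imagining future scenes, but they rely on vision-based synthesis, which is computationally expensive and produces redundant details. The key idea is to adaptively imagine key environmental semantics in language form, giving a more reliable and efficient strategy. The proposed Adaptive Text Dreamer (ATD) is a dual-branch self-guided imagination policy built on a large language model (LLM) with a human-like left-right brain architecture: the left brain handles logical integration and the right brain handles imaginative prediction of future scenes. Only the Q-former in each branch is fine-tuned, which efficiently activates domain-specific knowledge in the LLM and allows logical reasoning and imagination to be updated dynamically during navigation.

Link: https://arxiv.org/abs/2505.20897
Authors: Pingrui Zhang,Yifei Su,Pengyuan Wu,Dong An,Li Zhang,Zhigang Wang,Dong Wang,Yan Ding,Bin Zhao,Xuelong Li
Affiliations: Fudan University; Shanghai AI Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation of Chinese Academy of Sciences; Mohamed bin Zayed University of Artificial Intelligence; University of Science and Technology of China; TeleAI, China Telecom Corp Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: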

Abstract:Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at this https URL.

[NLP-84] How Do Transformers Learn Variable Binding in Symbolic Programs? ICML2025

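[Quick read]: This paper asks how modern neural networks that lack built-in variable binding operations can nevertheless acquire variable binding. The key to the approach is to train a Transformer to resolve variable references in symbolic programs, so that it learns to track chains of variable assignments dynamically. The study finds that the model uses the residual stream as an addressable memory space and relies on specialized attention heads to route information across token positions, eventually developing a systematic mechanism for dereferencing variables. This mechanism lets the model implement structured variable binding without explicit architectural support, bridging connectionist and symbolic approaches.

Link: https://arxiv.org/abs/2505.20896
Authors: Yiwei Wu,Atticus Geiger,Raphaël Millière
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 10 figures, 1 table. To appear in the Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)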

Abstract:Variable binding – the ability to associate variables with values – is fundamental to symbolic computation and cognition. Although classical architectures typically implement variable binding via addressable memory, it is not well understood how modern neural networks lacking built-in binding operations may acquire this capacity. We investigate this by training a Transformer to dereference queried variables in symbolic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing assignment chains. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, resulting in accurate dereferencing. Our results show how Transformer models can learn to implement systematic variable binding without explicit architectural support, bridging connectionist and symbolic approaches.
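To make the task concrete, here is a minimal Python sketch of the dereferencing problem described in the abstract. The one-assignment-per-line syntax, the `dereference` helper, and the `max_steps` parameter are illustrative assumptions, not the authors' data format or code.

```python
def dereference(program: str, query: str, max_steps: int = 4) -> int:
    """Follow a chain of assignments from `query` down to a numerical constant.

    Hypothetical helper: the `name = value` line format is assumed for
    illustration and is not taken from the paper.
    """
    # Parse every assignment, including distractor chains, into a lookup table.
    env = {}
    for line in program.strip().splitlines():
        name, value = (part.strip() for part in line.split("="))
        env[name] = value

    current = env[query]
    steps = 1  # one lookup performed so far
    while not current.isdigit():
        if steps >= max_steps:
            raise ValueError(f"chain for {query!r} is deeper than {max_steps} steps")
        current = env[current]
        steps += 1
    return int(current)


PROGRAM = """
a = 7
x = 3
b = a
y = x
c = b
d = c
"""

print(dereference(PROGRAM, "d"))  # follows d -> c -> b -> a -> 7
print(dereference(PROGRAM, "y"))  # second chain (a distractor when querying d): y -> x -> 3
```

A symbolic interpreter solves this with an explicit lookup table. The paper's question is how a Transformer without such a table learns an equivalent procedure.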

[NLP-85] EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models

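[Quick read]: This paper addresses knowledge distillation (KD) of large language models (LLMs), in particular efficient model compression and performance optimization in both black-box and white-box settings. The key to the solution is EasyDistill, a comprehensive toolkit that integrates data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning to support KD for both System 1 (fast, intuitive) and System 2 (slow, analytical) models, while its modular design and user-friendly interface simplify experimenting with and implementing advanced KD strategies.

Link: https://arxiv.org/abs/2505.20888
Authors: Chengyu Wang,Junbing Yan,Wenrui Cai,Yuanhao Yue,Jun Huang
Affiliations: Alibaba Cloud Computing; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: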

Abstract:In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud’s Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.

[NLP-86] MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection

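[Quick read]: This paper addresses the detection of hallucination spans in text generated by instruction-tuned large language models (LLMs) across multiple languages. The key to the solution is combining task-specific prompt engineering with an LLM ensemble verification mechanism: a primary model extracts candidate hallucination spans, and three independent LLMs judge their validity through probability-based voting, simulating the human annotation workflow used for the shared task's validation and test data. Fuzzy matching is additionally used to refine span alignment.

Link: https://arxiv.org/abs/2505.20880
Authors: Baraa Hikal,Ahmed Nasreldin,Ali Hamdi
Affiliations: MSA University
Categories: Computation and Language (cs.CL)
Comments: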

Abstract:This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, fuzzy matching refines span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
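The verification and alignment steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract says "probability-based voting" without giving the rule, so averaging against a fixed threshold is a stand-in, and `accept_span`, `fuzzy_locate`, and the example scores are hypothetical names and values. The fuzzy matching uses Python's standard `difflib.SequenceMatcher`, which may differ from the matcher the authors used.

```python
from difflib import SequenceMatcher


def accept_span(verifier_probs, threshold=0.5):
    """Keep a candidate span if the mean verifier probability reaches the threshold.

    Assumption: mean-and-threshold is one simple reading of
    "probability-based voting"; the paper's exact rule is not given here.
    """
    return sum(verifier_probs) / len(verifier_probs) >= threshold


def fuzzy_locate(span, text):
    """Return (start, end) of the window in `text` most similar to `span`."""
    width = len(span)
    best_start, best_score = 0, -1.0
    for start in range(max(1, len(text) - width + 1)):
        score = SequenceMatcher(None, span, text[start:start + width]).ratio()
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + width


text = "The Eiffel Tower is located in Berlin, Germany."
candidate = "located in berlin"        # extractor output with a casing slip
if accept_span([0.9, 0.8, 0.4]):       # hypothetical scores from three verifier LLMs
    start, end = fuzzy_locate(candidate, text)
    print(start, end, text[start:end])
```

The sliding window keeps the located span the same length as the extracted one, which is a simplification; a real system would also need to handle spans whose length changes.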

[NLP-87] Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

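[Quick read]: This paper addresses the fact that large language models (LLMs) are evaluated predominantly on Standard American English (SAE) while the diversity of global English varieties is overlooked. Degraded performance on non-standard varieties can in turn raise fairness concerns. The key to the solution is Trans-EnV, a framework that combines linguistics expert knowledge with LLM-based transformations to automatically convert SAE datasets into multiple English varieties, enabling evaluation of LLMs' linguistic robustness.

Link: https://arxiv.org/abs/2505.20875
Authors: Jiyoung Lee,Seungho Kim,Jieun Han,Jun-Min Lee,Kitaek Kim,Alice Oh,Edward Choi
Affiliations: KAIST; Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 27 pages, 6 figures, 16 tables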

Abstract:Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code (this https URL) and datasets (this https URL) are publicly available.

[NLP-88] Can LLMs Learn to Map the World from Local Descriptions?

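[Quick read]: This paper addresses the underexplored potential of large language models (LLMs) to internalize structured spatial knowledge, asking specifically whether LLMs can build coherent global spatial cognition by integrating fragmented relational descriptions. The key to the approach is grounding LLMs in locally relative human observations and studying two core aspects: spatial perception, i.e. inferring a consistent global layout from local positional relationships, and spatial navigation, i.e. learning road connectivity from trajectory data in order to plan optimal paths. Experiments show that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also form latent representations aligned with real-world spatial distributions, enabling accurate path planning and dynamic spatial awareness.

Link: https://arxiv.org/abs/2505.20874
Authors: Sirui Xia,Aili Chen,Xintao Wang,Tinghui Zhu,Yikai Zhang,Jiangjie Chen,Yanghua Xiao
Affiliations: Fudan University; ByteDance Seed
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 11 figures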

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.

[NLP-89] Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG ACL2025

[Quick read]: This paper addresses a reliability problem in Retrieval-Augmented Generation (RAG) systems: when retrieval results are noisy, they over-generate answers instead of expressing uncertainty. The key solution is Divide-Then-Align (DTA), a post-training method that divides data samples into four knowledge quadrants and constructs tailored preference data for each, producing a curated dataset for Direct Preference Optimization (DPO). This lets the system answer "I don't know" when a query falls outside its knowledge boundary, balancing accuracy with appropriate abstention.

Link: https://arxiv.org/abs/2505.20871
Authors: Xin Sun,Jianan Xie,Zhongqi Chen,Qiang Liu,Shu Wu,Yuehe Chen,Bowen Song,Weiqiang Wang,Zilei Wang,Liang Wang
Affiliations: USTC; NLPR, MAIS, CASIA; SUSTech; Independent
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 main

Abstract:Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with “I don’t know” when the query is out of the knowledge boundary of both the retrieved passages and the model’s internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
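The divide step can be pictured with a small Python sketch. Everything below is an assumption made for illustration: the `Sample` fields, the exact-match and substring checks used to decide what the model or the retrieved passages "know", and the simple two-way preference rule. The paper constructs tailored preference data per quadrant, which this sketch does not reproduce.

```python
from dataclasses import dataclass

IDK = "I don't know"


@dataclass
class Sample:
    question: str
    gold: str
    closed_book_answer: str  # model answer without any retrieved passages
    passages: list


def quadrant(sample):
    """Return (model_knows, retrieval_knows), i.e. one of four knowledge quadrants."""
    gold = sample.gold.strip().lower()
    model_knows = sample.closed_book_answer.strip().lower() == gold
    retrieval_knows = any(gold in p.lower() for p in sample.passages)
    return model_knows, retrieval_knows


def preference_pair(sample):
    """Build one DPO-style (chosen, rejected) pair from the sample's quadrant."""
    model_knows, retrieval_knows = quadrant(sample)
    if model_knows or retrieval_knows:
        # Inside at least one knowledge boundary: answering beats abstaining.
        chosen, rejected = sample.gold, IDK
    else:
        # Outside both boundaries: abstaining beats a guessed answer.
        chosen, rejected = IDK, sample.closed_book_answer
    return {"prompt": sample.question, "chosen": chosen, "rejected": rejected}


known = Sample("Capital of France?", "Paris", "Paris", ["Lyon is a city in France."])
unknown = Sample("Who founded the village?", "Ann Lee", "John Doe", ["Unrelated passage."])
print(quadrant(known), preference_pair(unknown)["chosen"])
```

The resulting dictionaries follow the common prompt/chosen/rejected layout used by DPO training code, which is itself an assumption about the downstream format.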

[NLP-90] An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

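[Quick read]: This paper addresses how to accurately assess the correctness of software artifacts (such as code snippets, patches, and comments) generated by large language models (LLMs). Existing approaches are limited in either accuracy or scalability: human evaluation is accurate but costly, while existing automatic metrics often fail to reflect the actual correctness of the generated output. The proposed solution, SWE-Judge, defines five distinct evaluation strategies, each implemented as an independent judge, and uses a dynamic team selection mechanism to ensemble the most suitable judges into a final correctness score, improving the accuracy and reliability of evaluation.

Link: https://arxiv.org/abs/2505.20854
Authors: Xin Zhou,Kisub Kim,Ting Zhang,Martin Weyssow,Luis F. Gomes,Guang Yang,David Lo
Affiliations: Singapore Management University; Carnegie Mellon University; Nanjing University of Aeronautics and Astronautics
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages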

Abstract:Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge’s potential as a scalable and reliable alternative to human evaluation. 
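One plausible reading of "dynamic team selection" is an exhaustive search over judge subsets on a small human-labelled set, keeping the subset whose averaged score agrees best with the humans. The sketch below assumes exactly that. The judge names, the scores, and the use of mean absolute error as the agreement measure are hypothetical; the paper reports correlation with human judgments and may select teams differently.

```python
from itertools import combinations
from statistics import mean


def ensemble(scores_by_judge, team):
    """Average the selected judges' scores for every sample."""
    n_samples = len(next(iter(scores_by_judge.values())))
    return [mean(scores_by_judge[j][i] for j in team) for i in range(n_samples)]


def agreement(predicted, human):
    """Negative mean absolute error, so that higher means closer to the humans."""
    return -mean(abs(p - h) for p, h in zip(predicted, human))


def select_team(scores_by_judge, human):
    """Try every non-empty subset of judges and keep the best-agreeing one."""
    judges = sorted(scores_by_judge)
    teams = [t for k in range(1, len(judges) + 1) for t in combinations(judges, k)]
    return max(teams, key=lambda t: agreement(ensemble(scores_by_judge, t), human))


# Hypothetical correctness scores from three judges on four artifacts.
scores = {
    "judge_a": [1.0, 0.0, 1.0, 0.0],
    "judge_b": [1.0, 0.0, 0.0, 1.0],
    "judge_c": [0.0, 1.0, 1.0, 1.0],
}
human = [1.0, 0.0, 0.5, 0.5]
print(select_team(scores, human))
```

With five judges there are only 31 non-empty subsets, so the brute-force search stays cheap.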

[NLP-91] Concealment of Intent: A Game-Theoretic Analysis

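[Quick read]: This paper addresses malicious attacks on the safe deployment of large language models (LLMs), in particular adversarial prompting attacks against alignment mechanisms. The key contribution is a scalable attack strategy, intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills, together with a game-theoretic framework that models the interaction between such attacks and defense systems and reveals structural advantages for the attacker.

Link: https://arxiv.org/abs/2505.20841
Authors: Xinbo Wu,Abhishek Umrawal,Lav R. Varshney
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: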

Abstract:As large language models (LLMs) grow more capable, concerns about their safe deployment have also grown. Although alignment mechanisms have been introduced to deter misuse, they remain vulnerable to carefully designed adversarial prompts. In this work, we present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. We develop a game-theoretic framework to model the interaction between such attacks and defense systems that apply both prompt and response filtering. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. To counter these threats, we propose and analyze a defense mechanism tailored to intent-hiding attacks. Empirically, we validate the attack’s effectiveness on multiple real-world LLMs across a range of malicious behaviors, demonstrating clear advantages over existing adversarial prompting techniques.

[NLP-92] AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset ACL2025

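[Quick read]: This paper addresses how to identify the linguistic factors that make ad text attractive, with the aim of improving advertising effectiveness. The key to the solution is AdParaphrase v2.0, a dataset containing human preference data that supports both the analysis of linguistic features and the development of methods for generating attractive ad texts. It is 20 times larger than v1.0, comprising 16,460 ad text paraphrase pairs, each annotated by ten evaluators, enabling a more comprehensive and reliable analysis.

Link: https://arxiv.org/abs/2505.20826
Authors: Soichiro Murakami,Peinan Zhang,Hidetaka Kamigaito,Hiroya Takamura,Manabu Okumura
Affiliations: CyberAgent, Inc.; Nara Institute of Science and Technology; Institute of Science Tokyo
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL2025 Findings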

Abstract:Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: this https URL.

[NLP-93] Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

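[Quick read]: This paper addresses the challenges large language models face in long-form question answering (LFQA): the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in long outputs, and the lack of reliable evaluation metrics for factual completeness. The key to the solution is RioRAG, a reinforcement learning (RL) framework with two core innovations. The first is an RL training paradigm that directly optimizes informativeness, addressing the slow-thinking deficit of conventional RAG systems without expensive supervised data. The second is a nugget-centric hierarchical reward modeling approach that assesses long-form answers precisely through a three-stage process.

Link: https://arxiv.org/abs/2505.20825
Authors: Yuhao Wang,Ruiyang Ren,Yucheng Wang,Wayne Xin Zhao,Jing Liu,Hua Wu,Haifeng Wang
Affiliations: Baidu Inc.
Categories: Computation and Language (cs.CL)
Comments: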

Abstract:Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks LongFact and RAGChecker demonstrate the effectiveness of the proposed method. Our codes are available at this https URL.
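The three-stage reward (per-page nuggets, a merged checklist, a score based on factual alignment) can be sketched as below. This is a toy illustration: nugget extraction is assumed to have happened already, `keyword_supports` is a crude substring check standing in for whatever model-based judgment the authors use, and all function names are hypothetical.

```python
def build_checklist(nuggets_per_page):
    """Merge per-page nuggets into one de-duplicated claim checklist."""
    seen, checklist = set(), []
    for nuggets in nuggets_per_page:
        for nugget in nuggets:
            key = nugget.strip().lower()
            if key and key not in seen:
                seen.add(key)
                checklist.append(nugget.strip())
    return checklist


def keyword_supports(answer, claim):
    """Stand-in alignment check: does the answer contain the claim verbatim?"""
    return claim.lower() in answer.lower()


def informativeness_reward(answer, checklist, supports=keyword_supports):
    """Fraction of checklist claims that the answer covers."""
    if not checklist:
        return 0.0
    return sum(supports(answer, claim) for claim in checklist) / len(checklist)


pages = [
    ["Paris is the capital of France", "France is in Europe"],
    ["paris is the capital of france", "The Seine flows through Paris"],
]
checklist = build_checklist(pages)
answer = "Paris is the capital of France. The Seine flows through Paris."
print(len(checklist), informativeness_reward(answer, checklist))
```

A coverage ratio like this rewards answers for including more verified nuggets, which matches the informativeness goal stated in the abstract, although the actual reward is hierarchical and more detailed.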

[NLP-94] Tracing and Reversing Rank-One Model Edits

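[Quick read]: This paper addresses the dual-use risk of knowledge editing methods (KEs) for large language models (LLMs), namely that they can be used maliciously to tamper with a model's factual content. The key to the solution is analyzing distributional patterns in the edited weight matrices to make knowledge edits traceable and reversible. The study shows that Rank-One Model Editing (ROME) introduces distinctive distributional patterns in the weight matrices, which can be used to locate the edited weights, predict the edited factual relation, and even infer the edited object entity directly. It further shows that ROME edits can be reversed to recover the model's original outputs, offering an effective defense against adversarial edits.

Link: https://arxiv.org/abs/2505.20819
Authors: Paul Youssef,Zhixue Zhao,Christin Seifert,Jörg Schlötterer
Affiliations: Marburg University; University of Sheffield; University of Mannheim
Categories: Computation and Language (cs.CL)
Comments: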

Abstract:Knowledge editing methods (KEs) are a cost-effective way to update the factual content of large language models (LLMs), but they pose a dual-use risk. While KEs are beneficial for updating outdated or incorrect information, they can be exploited maliciously to implant misinformation or bias. In order to defend against these types of malicious manipulation, we need robust techniques that can reliably detect, interpret, and mitigate adversarial edits. This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method. We first show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights. Second, we show that these altered weights can reliably be used to predict the edited factual relation, enabling partial reconstruction of the modified fact. Building on this, we propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy. Finally, we demonstrate that ROME edits can be reversed, recovering the model’s original outputs with at least 80% accuracy. Our findings highlight the feasibility of detecting, tracing, and reversing edits based on the edited weights, offering a robust framework for safeguarding LLMs against adversarial manipulations.
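For orientation, the rank-one structure behind these results can be written out. The notation below follows the original ROME paper as best recalled, not this abstract: $W$ is the MLP projection being edited, $k_*$ the key for the edited subject, $v_*$ the value encoding the new fact, and $C$ the uncentered covariance of keys. Treat it as a hedged reminder of the setup rather than a statement of this paper's method.

```latex
\hat{W} = W + \Lambda \,(C^{-1} k_*)^{\top},
\qquad
\Lambda = \frac{v_* - W k_*}{(C^{-1} k_*)^{\top} k_*},
\qquad\text{so that}\qquad
\hat{W} k_* = v_* .
```

Since $\hat{W} - W$ is an outer product of two vectors, it has rank one, which is a plausible source of a detectable signature in the edited matrix. Reversal then amounts to estimating that term and subtracting it, $W = \hat{W} - \Lambda (C^{-1} k_*)^{\top}$. How the authors estimate it from the edited weights alone is not described in the abstract.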

[NLP-95] Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

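[Quick read]: This paper addresses a limitation of existing multimodal question answering methods: they rely on a single, generalized reasoning strategy and overlook the unique characteristics of each modality, which limits both accuracy and interpretability. The key to the solution is MAMMQA, a multi-agent framework with two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent that handles multimodal inputs through staged sub-question decomposition, cross-modal reasoning, and integration, keeping the reasoning process transparent.

Link: https://arxiv.org/abs/2505.20816
Authors: Krishna Singh Rajput,Tejas Anvekar,Chitta Baral,Vivek Gupta
Affiliations: Arizona State University
Categories: Computation and Language (cs.CL)
Comments: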

Abstract:Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.

[NLP-96] RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph ACL2025

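[Quick read]: This paper addresses the consistency of relation-specific entity transformations in knowledge graph embedding (KGE): the differences between embeddings before and after transformation are not consistent, which risks losing the inductive bias inherent in the embeddings. The key to the solution is a plug-in method, Relation-Semantics Consistent Filter (RSCF), characterized by three features: 1) an affine transformation shared across all relations; 2) a rooted entity transformation, in which the transformed vector is added to the entity embedding; and 3) normalization of the change to prevent scale reduction. RSCF also adds relation transformation and prediction modules to enhance semantics, and significantly outperforms existing methods on knowledge graph completion (KGC) tasks with distance-based and tensor decomposition models.

Link: https://arxiv.org/abs/2505.20813
Authors: Junsik Kim,Jinwook Park,Kangil Kim
Affiliations: Gwangju Institute of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025, 17 pages, 10 figures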

Abstract:In knowledge graph embedding, leveraging relation-specific entity-transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity-embeddings for similar relations. Second, a generalized plug-in approach such as SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF), containing more consistent entity-transformation characterized by three features: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity-transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.

[NLP-97] Improved Representation Steering for Language Models

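[Quick read]: This paper addresses the lack of fine-grained, interpretable control over language model (LM) generation, in particular that adjusting weights or representations is often less effective than prompting when introducing or suppressing a particular concept. The key to the solution is Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that performs concept steering and suppression jointly, improving representation-level control. RePS outperforms existing steering methods trained with a language modeling objective across several model sizes and, on suppression tasks, remains robust to prompt-based jailbreaking attacks.

Link: https://arxiv.org/abs/2505.20809
Authors: Zhengxuan Wu,Qinan Yu,Aryaman Arora,Christopher D. Manning,Christopher Potts
Affiliations: Stanford University
Categories: Computation and Language (cs.CL)
Comments: 46 pages, 23 figures, preprint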

Abstract:Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting – while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.

[NLP-98] CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature

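[Quick read]: This paper addresses the automated identification and modeling of idea recombination in scientific innovation, aiming to reveal how scientists produce original ideas by integrating existing mechanisms and concepts from different fields. The key to the solution is CHIMERA, a large-scale knowledge base (KB) built by automatically extracting recombination examples from the scientific literature with an LLM-based extraction model trained on manually annotated abstracts, which enables systematic analysis and application of recombination patterns.

Link: https://arxiv.org/abs/2505.20779
Authors: Noy Sternlicht,Tom Hope
Affiliations: The Hebrew University of Jerusalem; The Allen Institute for AI (AI2)
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL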

Abstract:A hallmark of human innovation is the process of recombination – creating original ideas by integrating elements of existing mechanisms and concepts. In this work, we automatically mine the scientific literature and build CHIMERA: a large-scale knowledge base (KB) of recombination examples. CHIMERA can be used to empirically explore at scale how scientists recombine concepts and take inspiration from different areas, or to train supervised machine learning models that learn to predict new creative cross-domain directions. To build this KB, we present a novel information extraction task of extracting recombination from scientific paper abstracts, collect a high-quality corpus of hundreds of manually annotated abstracts, and use it to train an LLM-based extraction model. The model is applied to a large corpus of papers in the AI domain, yielding a KB of over 28K recombination examples. We analyze CHIMERA to explore the properties of recombination in different subareas of AI. Finally, we train a scientific hypothesis generation model using the KB, which predicts new recombination directions that real-world researchers find inspiring. Our data and code are available at this https URL

[NLP-99] SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences EMNLP2025

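[Quick read]: This paper addresses the performance degradation of tree-based speculative decoding on long inputs, which shows up as increased attention cost and reduced draft model accuracy. The key to the solution is SpecExtend, which integrates efficient attention mechanisms (such as FlashAttention and Hybrid Tree Attention) to reduce latency across all stages, and introduces Cross-model Retrieval, a strategy that uses the target model's attention scores to dynamically select relevant context, improving the draft model's accuracy and speed.

Link: https://arxiv.org/abs/2505.20776
Authors: Jungyoub Cha,Hyunjong Kim,Sungzoon Cho
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures. Under review at EMNLP 2025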

Abstract:Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models, reducing latency across all stages. To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy that uses the target model’s attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. The code is available at this https URL .
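The idea behind Cross-model Retrieval, selecting context for the draft model from the target model's attention scores, can be illustrated with a short Python sketch. The fixed-size chunking, the sum-per-chunk scoring, and the `select_context` name are assumptions for illustration; the abstract does not specify how scores are aggregated or how the draft model's KV cache is actually updated.

```python
def select_context(attention, chunk_size, top_k):
    """Return the token positions of the `top_k` most-attended chunks.

    `attention` holds one target-model attention score per context token
    (assumed to be already aggregated over heads and layers).
    """
    # Split positions into contiguous chunks.
    chunks = [
        range(start, min(start + chunk_size, len(attention)))
        for start in range(0, len(attention), chunk_size)
    ]
    # Rank chunks by total attention mass, highest first.
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: sum(attention[j] for j in chunks[i]),
        reverse=True,
    )
    # Keep the winners in their original order so the context stays sequential.
    kept = sorted(ranked[:top_k])
    return [position for i in kept for position in chunks[i]]


scores = [0.01, 0.02, 0.30, 0.25, 0.01, 0.01, 0.20, 0.20]
print(select_context(scores, chunk_size=2, top_k=2))  # positions of chunks 1 and 3
```

Only the selected positions would then be kept in the draft model's cache, so the draft attends over a much shorter context while the target model still verifies against the full input.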

[NLP-100] CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL2025

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在生成认知性陈述时出现的忠实性幻觉(faithfulness hallucination)问题,即模型生成的陈述未得到上下文支持。现有基准测试仅包含对源材料进行重述的“事实性陈述”,而未标注基于上下文进行推理的“认知性陈述”,导致难以评估和优化认知性陈述的一致性。论文的关键解决方案是受立法领域证据评估的启发,设计了一个严谨的框架来评估不同层次的忠实性,并构建了一个基准数据集,同时设计了自动化的标注流程以生成更大规模的 CogniBench-L 数据集,用于训练准确的认知幻觉检测模型。

Link: https://arxiv.org/abs/2505.20767
Authors: Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Hunyuan AI Digital Human, Tencent; Beijing University of Posts and Telecommunications; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025

Abstract:Faithfulness hallucinations are claims generated by a Large Language Model (LLM) that are not supported by the contexts provided to the LLM. Lacking an assessment standard, existing benchmarks only contain “factual statements” that rephrase source materials without marking “cognitive statements” that make inferences from the given context, making the consistency evaluation and optimization of cognitive statements difficult. Inspired by how evidence is assessed in the legislative domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and create a benchmark dataset where we reveal insightful statistics. We design an annotation pipeline to create larger benchmarks for different LLMs automatically, and the resulting larger-scale CogniBench-L dataset can be used to train an accurate cognitive hallucination detection model. We release our model and dataset at: this https URL

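[NLP-101] Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

[Quick Read]: This paper addresses inflated performance when models are evaluated on benchmarks they generated themselves, termed self-bias, which is attributed to sub-biases arising from question domain, language style, and wrong labels. The key to the solution is Silencer, a framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias, producing high-quality benchmarks with self-bias suppressed.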

Link: https://arxiv.org/abs/2505.20738
Authors: Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Affiliations: Beijing Institute of Technology; Xiaohongshu Inc
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLM-as-Benchmark-Generator methods have been widely studied as a supplement to human annotators for scalable evaluation, while the potential biases within this paradigm remain underexplored. In this work, we systematically define and validate the phenomenon of inflated performance in models evaluated on their self-generated benchmarks, referred to as self-bias, and attribute it to sub-biases arising from question domain, language style, and wrong labels. On this basis, we propose Silencer, a general framework that leverages the heterogeneity between multiple generators at both the sample and benchmark levels to neutralize bias and generate high-quality, self-bias-silenced benchmark. Experimental results across various settings demonstrate that Silencer can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.

[NLP-102] SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution

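[Quick Read]: This paper addresses the difficulty of training LLM agents with reinforcement learning (RL) when rewards are delayed. In multi-step tasks, feedback is typically available only after the whole task is completed, which makes it hard to assign reward to earlier actions and gives the agent insufficient guidance about environmental constraints. The key to the solution is Stepwise Progress Attribution (SPA), which decomposes the final reward into per-step contributions, each reflecting that step's incremental progress toward task completion. A progress estimator is trained so that the accumulated step contributions match task completion, and during policy optimization each estimated contribution is combined with a grounding signal for the executed action to form a fine-grained intermediate reward.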

Link: https://arxiv.org/abs/2505.20732
Authors: Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at this https URL.
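The reward redistribution described in the abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's code: the squared-error objective, the additive combination, and the weight `beta` are all hypothetical choices, since the abstract only says that contributions are accumulated to match task completion and then combined with a grounding signal.

```python
def attribution_loss(step_contributions, final_reward):
    """Squared gap between the accumulated step contributions and the task outcome."""
    return (sum(step_contributions) - final_reward) ** 2


def intermediate_rewards(step_contributions, grounding_signals, beta=0.5):
    """Per-step reward: estimated contribution plus a weighted grounding signal.

    The additive form and the weight beta are assumptions made for illustration.
    """
    return [c + beta * g for c, g in zip(step_contributions, grounding_signals)]


# Made-up trajectory with three steps; the task succeeded (final reward 1.0).
contributions = [0.1, 0.3, 0.6]   # hypothetical progress-estimator outputs
grounded = [1.0, 0.0, 1.0]        # 1.0 if the action was executable, else 0.0
loss = attribution_loss(contributions, final_reward=1.0)
rewards = intermediate_rewards(contributions, grounded)
```

The point of the decomposition is that each step receives its own training signal, rather than every action sharing one delayed reward at the end of the episode.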

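[NLP-103] What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals

[Quick Read]: This paper examines whether large language models (LLMs) can effectively reason over the collaborative information in user-item interaction data, and how to improve their recommendation performance. The key to the solution is a simple retrieval-augmented generation (RAG) method that grounds the model's predictions in structured interaction data, which substantially improves recommendation quality.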

Link: https://arxiv.org/abs/2505.20730
Authors: Shahrooz Pouryousef
Affiliations: UMass Amherst
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs’ ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality, highlighting a promising direction for future LLM-based recommenders.

[NLP-104] MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

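[Quick Read]: This paper addresses the weakness of multimodal large language models (MLLMs) in fine-grained temporal reasoning over video. Existing methods perform poorly on time-sensitive video QA and temporal grounding, and although reinforcement learning (RL) has been tried, existing RL approaches remain limited in effectiveness. The key to the solution is MUSEG, an RL-based method that introduces timestamp-aware multi-segment grounding, enabling the model to align a query with multiple relevant video segments and thereby improving temporal reasoning.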

Link: https://arxiv.org/abs/2505.20715
Authors: Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Affiliations: Tsinghua University; AIR (Institute for AI Industry Research); Tongyi Lab; Zhejiang Sci-Tech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at this https URL.

[NLP-105] Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective

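[Quick Read]: This paper addresses the underexplored high school physics reasoning ability of small language models (SLMs), in particular their performance on complex reasoning tasks. The key is a comprehensive physics dataset built from the OpenStax High School Physics textbook and annotated according to Bloom's Taxonomy, along with a novel cultural contextualization approach that creates problems adapted to different cultural contexts so that SLM reasoning can be evaluated across settings. The study also uses an LLM-as-a-judge framework to evaluate the correctness of answers and reasoning chains as well as calculation accuracy.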

Link: https://arxiv.org/abs/2505.20707
Authors: Nicy Scaria, Silvester John Joseph Kennedy, Diksha Seth, Deepak Subramani
Affiliations: Indian Institute of Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Physics Education (physics.ed-ph)
Comments:

Abstract:Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom’s Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google’s Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high 'answer accuracy' (85%), but 'fully correct reasoning' was substantially low (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.

[NLP-106] Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration

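[Quick Read]: This paper addresses the distributional mismatch and limited model capacity that arise when transferring the reasoning abilities of large language models (LLMs) to small language models (SLMs). Existing reasoning datasets are typically designed for powerful LLMs and degrade performance when applied directly to weaker models. The proposed solution is Dynamic Adaptation of Reasoning Trajectories (DART). Its key is a selective imitation strategy guided by step-wise adaptability estimation via solution simulation: when an expert step exceeds the student model's capacity, the student autonomously explores alternative reasoning paths that preserve outcome consistency, improving generalization and data efficiency.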

Link: https://arxiv.org/abs/2505.20700
Authors: Yong Wu, Weihang Pan, Ke Li, Chen Binhui, Ping Li, Binbin Lin
Affiliations: Zhejiang University; Fullong Technology; Ningbo Zhoushan Port Co., Ltd.; Hangzhou Dianzi University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown remarkable reasoning capabilities, yet aligning such abilities to small language models (SLMs) remains a challenge due to distributional mismatches and limited model capacity. Existing reasoning datasets, typically designed for powerful LLMs, often lead to degraded performance when directly applied to weaker models. In this work, we introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework that bridges the capability gap between expert reasoning trajectories and diverse SLMs. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation via solution simulation. When expert steps surpass the student’s capacity – signaled by an Imitation Gap – the student autonomously explores alternative reasoning paths, constrained by outcome consistency. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency over static fine-tuning. Our method enhances supervision quality by aligning training signals with the student’s reasoning capabilities, offering a scalable solution for reasoning alignment in resource-constrained models.

[NLP-107] Phir Hera Fairy: An English Fairytaler is a Strong Faker of Fluent Speech in Low-Resource Indian Languages

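[Quick Read]: This paper addresses near-human text-to-speech (TTS) generation in low-resource languages, specifically by fine-tuning the English F5-TTS model on Indian languages and measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. The study compares training from scratch, fine-tuning on Indian data only, and fine-tuning on both Indian and English data, and finds that fine-tuning with only Indian data is most effective, yielding IN-F5, a near-human polyglot model. It also explores compute-optimal strategies under data constraints and synthesizes speech for zero-resource languages using a human-in-the-loop approach.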

Link: https://arxiv.org/abs/2505.20693
Authors: Praveen Srinivasa Varadhan, Srija Anand, Soma Siddhartha, Mitesh M. Khapra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:What happens when an English Fairytaler is fine-tuned on Indian languages? We evaluate how the English F5-TTS model adapts to 11 Indian languages, measuring polyglot fluency, voice-cloning, style-cloning, and code-mixing. We compare: (i) training from scratch, (ii) fine-tuning English F5 on Indian data, and (iii) fine-tuning on both Indian and English data to prevent forgetting. Fine-tuning with only Indian data proves most effective and the resultant IN-F5 is a near-human polyglot; that enables speakers of one language (e.g., Odia) to fluently speak in another (e.g., Hindi). Our results show English pretraining aids low-resource TTS in reaching human parity. To aid progress in other low-resource languages, we study data-constrained setups and arrive at a compute optimal strategy. Finally, we show IN-F5 can synthesize unseen languages like Bhojpuri and Tulu using a human-in-the-loop approach for zero-resource TTS via synthetic data generation.

[NLP-108] Can we Debias Social Stereotypes in AI-Generated Images? Examining Text-to-Image Outputs and User Perceptions

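[Quick Read]: This paper addresses the tendency of text-to-image (T2I) generation models to replicate and amplify social stereotypes in their outputs, particularly those related to gender, race, and culture, which raises important ethical concerns. The key to the solution is a theory-driven bias detection rubric and a Social Stereotype Index (SSI) for systematically evaluating social bias in T2I outputs, combined with targeted prompt refinement using LLMs, which significantly reduces bias.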

Link: https://arxiv.org/abs/2505.20692
Authors: Saharsh Barve, Andy Mao, Jiayue Melissa Shi, Prerna Juneja, Koustuv Saha
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in generative AI have enabled visual content creation through text-to-image (T2I) generation. However, despite their creative potential, T2I models often replicate and amplify societal stereotypes – particularly those related to gender, race, and culture – raising important ethical concerns. This paper proposes a theory-driven bias detection rubric and a Social Stereotype Index (SSI) to systematically evaluate social biases in T2I outputs. We audited three major T2I model outputs – DALL-E-3, Midjourney-6.1, and Stability AI Core – using 100 queries across three categories – geocultural, occupational, and adjectival. Our analysis reveals that initial outputs are prone to include stereotypical visual cues, including gendered professions, cultural markers, and western beauty norms. To address this, we adopted our rubric to conduct targeted prompt refinement using LLMs, which significantly reduced bias – SSI dropped by 61% for geocultural, 69% for occupational, and 51% for adjectival queries. We complemented our quantitative analysis through a user study examining perceptions, awareness, and preferences around AI-generated biased imagery. Our findings reveal a key tension – although prompt refinement can mitigate stereotypes, it can limit contextual alignment. Interestingly, users often perceived stereotypical images to be more aligned with their expectations. We discuss the need to balance ethical debiasing with contextual relevance and call for T2I systems that support global diversity and inclusivity while not compromising the reflection of real-world social complexity.

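[NLP-109] SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations ACL2025

[Quick Read]: This paper addresses the detection of mental manipulation in complex, multi-turn, multi-person conversations, which poses a significant challenge for large language models (LLMs) because manipulation is subtle and context-dependent. The key to the solution is SELF-PERCEPT, a two-stage prompting framework inspired by Self-Perception Theory that effectively identifies multi-person, multi-turn mental manipulation.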

Link: https://arxiv.org/abs/2505.20679
Authors: Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Affiliations: Manipal University Jaipur, India; Manipal Institute of Technology, India; AryaXAI Alignment Lab, AryaXAI.com, India; National Institute of Technology Karnataka, Surathkal, India; IISER Kolkata, India
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025 (Main)

Abstract:Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation’s nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at this https URL .

[NLP-110] Pretraining Language Models to Ponder in Continuous Space

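[Quick Read]: This paper addresses the lack of deeper cognitive processing when language models generate complex sentence elements, whereas humans ponder before articulating them. The key to the solution is to introduce this pondering process into language models by repeatedly invoking the forward pass within a single token generation step: instead of sampling an actual token from the predicted distribution, the model produces a weighted sum of all token embeddings according to that distribution and feeds it back as input to the next forward pass.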

Link: https://arxiv.org/abs/2505.20674
Authors: Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Affiliations: LUMIA Lab, Shanghai Jiao Tong University; Institute for Advanced Algorithms Research, Shanghai; Shanghai Innovation Institute; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at this https URL.
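The core pondering step, a probability-weighted sum of token embeddings, can be sketched as below. This is a toy illustration rather than the authors' code: the three-token vocabulary, the embedding values, and the function names are made up, and how the resulting embedding is fed back into the model for the next forward pass is left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pondering_embedding(logits, embedding_table):
    """Return the weighted sum of all token embeddings under the predicted distribution."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [0.0, 0.0, 0.0]  # a uniform prediction over the vocabulary
pondered = pondering_embedding(logits, embedding_table)
```

Because the output is a continuous mixture rather than a single sampled token, no discrete choice is made during the pondering pass.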

[NLP-111] Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

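[Quick Read]: This paper addresses the redundant token consumption of reasoning-augmented large language models (RLLMs), whose overly long reasoning chains on simple tasks lead to inefficient resource use. The key to the solution is Self-Route, a framework that uses a lightweight pre-inference stage to extract capability-aware embeddings, estimates in real time whether the model can solve the problem, and dynamically selects between general mode and reasoning mode accordingly. The study also constructs the Gradient-10K dataset to train the router to detect the model's capability boundary precisely, significantly reducing token consumption while maintaining accuracy.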

Link: https://arxiv.org/abs/2505.20664
Authors: Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model’s ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
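A capability-based router of the kind described could look roughly like this. It is only a sketch under assumptions: the abstract does not specify the router architecture, so the logistic probe, the `weights`, `bias`, and `threshold` values, and the function name `route` are all hypothetical.

```python
import math


def route(embedding, weights, bias, threshold=0.5):
    """Pick a decoding mode from a capability-aware embedding.

    A logistic probe (an assumed router form) estimates the probability that
    the model can solve the problem without extended reasoning.
    """
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    p_solvable = 1.0 / (1.0 + math.exp(-score))
    return "general" if p_solvable >= threshold else "reasoning"


# Made-up probe parameters and two made-up embeddings.
weights, bias = [2.0, -1.0], 0.0
easy_mode = route([1.0, 0.0], weights, bias)
hard_mode = route([0.0, 1.0], weights, bias)
```

The saving comes from the first branch: problems judged solvable skip the long reasoning chain entirely, so tokens are spent on extended reasoning only where the estimate says it is needed.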

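[NLP-112] TeroSeek: An AI-Powered Knowledge Base and Retrieval Generation Platform for Terpenoid Research

[Quick Read]: This paper addresses the difficulty of integrating knowledge in terpenoid research, which stems from the field's interdisciplinary nature (spanning chemistry, pharmacology, and biology). The key to the solution is TeroSeek, a curated knowledge base (KB) built from two decades of terpenoid literature and coupled with an AI-powered question-answering chatbot and web service. Using a retrieval-augmented generation (RAG) framework, it provides structured, high-quality information and outperforms general-purpose large language models (LLMs) on terpenoid-related queries.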

Link: https://arxiv.org/abs/2505.20663
Authors: Xu Kang, Siqi Jiang, Kangwei Xu, Jiahao Li, Ruibo Wu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 4 figures

Abstract:Terpenoids are a crucial class of natural products that have been studied for over 150 years, but their interdisciplinary nature (spanning chemistry, pharmacology, and biology) complicates knowledge integration. To address this, the authors developed TeroSeek, a curated knowledge base (KB) built from two decades of terpenoid literature, coupled with an AI-powered question-answering chatbot and web service. Leveraging a retrieval-augmented generation (RAG) framework, TeroSeek provides structured, high-quality information and outperforms general-purpose large language models (LLMs) in terpenoid-related queries. It serves as a domain-specific expert tool for multidisciplinary research and is publicly available at this http URL.

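[NLP-113] BacktrackAgent: Enhancing GUI Agent with Error Detection and Backtracking Mechanism

[Quick Read]: This paper addresses the lack of effective error detection and recovery mechanisms in existing Graphical User Interface (GUI) agents, which focus mainly on improving the accuracy of individual actions. The key to the solution is BacktrackAgent, a framework that introduces a backtracking mechanism and includes verifier, judger, and reflector modules for error detection and recovery, while applying judgment rewards to further improve agent performance.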

Link: https://arxiv.org/abs/2505.20660
Authors: Qinzhuo Wu, Pengzhi Gao, Wei Liu, Jian Luan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graphical User Interface (GUI) agents have gained substantial attention due to their impressive capabilities to complete tasks through multiple interactions within GUI environments. However, existing agents primarily focus on enhancing the accuracy of individual actions and often lack effective mechanisms for detecting and recovering from errors. To address these shortcomings, we propose the BacktrackAgent, a robust framework that incorporates a backtracking mechanism to improve task completion efficiency. BacktrackAgent includes verifier, judger, and reflector components as modules for error detection and recovery, while also applying judgment rewards to further enhance the agent’s performance. Additionally, we develop a training dataset specifically designed for the backtracking mechanism, which considers the outcome pages after action executions. Experimental results show that BacktrackAgent has achieved performance improvements in both task success rate and step accuracy on Mobile3M and Auto-UI benchmarks. Our data and code will be released upon acceptance.

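[NLP-114] Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge ACL

[Quick Read]: This paper addresses the automatic transformation of natural language (NL) into Signal Temporal Logic (STL), a process that has traditionally been done by hand and is time-consuming and error-prone. To overcome the lack of datasets, the paper proposes an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn). Its key is to manually build a seed set, select representative examples through clustering to guide large language models in generating more samples, and apply rule-based filtering and human validation to ensure diversity and accuracy. The paper also introduces the knowledge-guided KGST framework, which uses a generate-then-refine process based on external knowledge to improve transformation accuracy.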

Link: https://arxiv.org/abs/2505.20658
Authors: Yue Fang, Zhi Jin, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan
Affiliations: Peking University; Institute of Software, Chinese Academy of Sciences; JD.com; East China Normal University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 5 figures, published to ACL

Abstract:Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming NL into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), which comprises 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.
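As a rough illustration of what an NL-STL pair looks like (this example is not taken from STL-DivEn; the signal name and the numbers are made up), the sentence "the vehicle speed must stay below 60 at all times during the first 10 seconds" could be written in STL as:

```latex
\[
\varphi \;=\; \mathbf{G}_{[0,10]}\,\bigl(\mathit{speed}(t) < 60\bigr)
\]
```

Here $\mathbf{G}_{[a,b]}$ is the standard "always within the time interval $[a,b]$" operator, and the predicate inside the parentheses is a constraint on a real-valued signal.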

[NLP-115] Chinese Cyberbullying Detection: Dataset Method and Validation

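[Quick Read]: This paper addresses the problem that existing cyberbullying detection benchmarks are organized only by the polarity of speech (such as "offensive" versus "non-offensive") and do not reflect how cyberbullying in the real world draws widespread attention through specific incidents. The key to the solution is a new annotation method for constructing an incident-centered cyberbullying dataset, CHNCI, which contains 220,676 comments across 91 incidents. Three explanation-generation-based cyberbullying detection methods are combined as an ensemble to produce pseudo labels, which human annotators then verify, and evaluation criteria are proposed for judging whether an incident constitutes cyberbullying.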

Link: https://arxiv.org/abs/2505.20654
Authors: Yi Zhu, Xin Zou, Xindong Wu
Affiliations: Yangzhou University; Hefei University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing cyberbullying detection benchmarks are organized by the polarity of speech, such as “offensive” and “non-offensive”, which is essentially hate speech detection. However, in the real world, cyberbullying often attracts widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that is organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, which consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanation generation as an ensemble method to generate the pseudo labels, and then let human annotators judge these labels. Then we propose the evaluation criteria for validating whether it constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can be a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study for the Chinese cyberbullying incident detection task.

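[NLP-116] FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information

[Quick Read]: This paper addresses the evaluation of the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL (eXtensible Business Reporting Language) financial reporting. Existing benchmarks oversimplify XBRL tagging as flat multi-class classification and focus only on narrative text, so they do not reflect real-world use. The key to the solution is FinTagging, a full-scope, table-aware XBRL benchmark that decomposes XBRL tagging into two subtasks: FinNI (financial entity extraction) and FinCL (taxonomy-driven concept alignment). Models must jointly extract facts from unstructured text and structured tables and align them with the full 10k+ US-GAAP taxonomy, enabling more realistic, fine-grained evaluation.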

Link: https://arxiv.org/abs/2505.20650
Authors: Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Xiao-Yang Liu, Jimin Huang, Qianqian Xie
Affiliations: The Fin AI; Columbia University; Georgia Institute of Technology; Gustavus Adolphus College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.

[NLP-117] STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

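[Quick read]: This paper addresses the steerability of large language models (LLMs), that is, their ability to adapt to community-specific norms, perspectives, and communication styles. This ability is critical for real-world applications but remains under-evaluated. The key to the solution is Steer-Bench, a benchmark built on contrasting Reddit communities for assessing population-specific steering. Steer-Bench covers 30 contrasting subreddit pairs across 19 domains and includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with silver labels, which test how well models align with diverse community norms.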

Link: https://arxiv.org/abs/2505.20645
Authors: Kai Chen, Zihao He, Taiwei Shi, Kristina Lerman
Affiliations: University of Southern California; Information Sciences Institute, University of Southern California
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.

[NLP-118] Test-Time Learning for Large Language Models ICML2025

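[Quick read]: This paper aims to address the limited ability of large language models (LLMs) to generalize to specialized domains and diverse linguistic variations, a problem known as distribution shift. The key to the solution is a Test-Time Learning (TTL) paradigm, called TLM, that adapts the model to the target domain dynamically using only unlabeled test data. The method formulates test-time learning as input perplexity minimization, which yields self-supervised performance gains. A sample-efficient strategy selects high-perplexity samples to make optimization more effective, and Low-Rank Adaptation (LoRA) is used to mitigate catastrophic forgetting and keep adaptation stable.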

Link: https://arxiv.org/abs/2505.20633
Authors: Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML2025

Abstract:While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
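
The two ideas in this abstract, scoring inputs by perplexity and favoring high-perplexity samples for test-time updates, can be illustrated with a minimal sketch. It assumes per-token log-probabilities are already available from some model; the function names, the sample data, and the threshold are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the input tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_high_perplexity(samples, threshold):
    """Keep the samples whose input perplexity exceeds a threshold.

    The threshold is a hypothetical hyperparameter chosen for illustration.
    """
    return [name for name, lps in samples.items() if perplexity(lps) > threshold]

# Hypothetical per-token natural-log probabilities for two unlabeled test inputs.
samples = {
    "in_domain": [-0.1, -0.2, -0.1],
    "shifted": [-2.0, -1.5, -2.5],
}
selected = select_high_perplexity(samples, threshold=3.0)
```

In an actual test-time learning loop, the selected inputs would then drive gradient updates (the abstract mentions LoRA for this step); that part is omitted here.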

[NLP-119] SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

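[Quick read]: This paper addresses how to evaluate the reliability of large language models (LLMs) when they analyze source code vulnerabilities, in particular their weaknesses in structure and semantic reasoning. The key to the solution is the SV-TrustEval-C benchmark, which evaluates LLMs on vulnerability analysis of C code along two core dimensions, structure reasoning and semantic reasoning, and so gives a fuller measure of their logical consistency and their understanding of complex code relationships.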

Link: https://arxiv.org/abs/2505.20630
Authors: Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, Stephan Jou
Affiliations: University of Ottawa; OpenText
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs’ abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.

[NLP-120] Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration

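[Quick read]: This paper aims to address the challenges that modern large language models (LLMs) face when processing long contexts with agent-based divide-and-conquer methods: prohibitive accumulated latency, information loss amplified by excessive agent invocations, and over-partitioning that breaks inherent textual dependencies. The key to the solution is XpandA, a multi-agent framework that combines a question-driven workflow with dynamic partitioning. It partitions long texts dynamically to adapt the filling rate of the context window, uses a question-guided protocol to update the information set held in centralized shared memory, and selectively replays specific partitions based on state tracking of question-information pairs, which improves both the robustness and the efficiency of long-context processing.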

Link: https://arxiv.org/abs/2505.20625
Authors: Sibo Xiao, Zixin Lin, Wenyang Gao, Yue Zhang
Affiliations: Zhejiang University; Westlake University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. But these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework XpandA (Expand-Agent) coupled with question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selectively replaying specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.

[NLP-121] POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization

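[Quick read]: This paper addresses the growing challenge that online polarization poses for democratic discourse, where most computational social science research remains monolingual, culturally narrow, or event-specific. The key to the solution is POLAR, a large annotated dataset that is multilingual, multicultural, and multi-event. It contains more than 23,000 instances in seven languages, annotated for the presence, type, and manifestation of polarization across cultural contexts. The authors also fine-tune multilingual pretrained language models in monolingual and cross-lingual setups and evaluate a range of open and closed LLMs in few-shot and zero-shot scenarios, which confirms how complex and context-dependent the polarization task is.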

Link: https://arxiv.org/abs/2505.20624
Authors: Usman Naseem, Juan Ren, Saba Anwar, Sarah Kohail, Rudy Alexandro Garrido Veliz, Robert Geislinger, Aisha Jabr, Idris Abdulmumin, Laiba Qureshi, Aarushi Ajay Borkar, Maryam Ibrahim Mukhtar, Abinew Ali Ayele, Ibrahim Said Ahmad, Adem Ali, Martin Semmann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Affiliations: Macquarie University; University of Hamburg; Bahir Dar University; Imperial College London; University of Pretoria; Zayed University; Bayero University Kano; Northeastern University
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multievent dataset with over 23k instances in seven languages from diverse online platforms and real-world events. Polarization is annotated along three axes: presence, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) we fine-tune six multilingual pretrained language models in both monolingual and cross-lingual setups; and (2) we evaluate a range of open and closed large language models (LLMs) in few-shot and zero-shot scenarios. Results show that while most models perform well on binary polarization detection, they achieve substantially lower scores when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.

[NLP-122] SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation ACL2025

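[Quick read]: This paper aims to address the trade-off between translation quality and latency in Simultaneous Machine Translation (SiMT). It introduces SeqPO-SiMT, a new policy optimization framework that models SiMT as a sequential decision-making problem and uses a tailored reward to raise translation quality while lowering latency. The key to the solution is a policy optimization framework designed for the multi-step SiMT task, which lets SiMT LLMs simulate and refine the translation process. Compared with conventional supervised fine-tuning, it achieves higher translation quality and lower latency on multiple datasets.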

Link: https://arxiv.org/abs/2505.20622
Authors: Ting Xu, Zhichao Huang, Jiankai Sun, Shanbo Cheng, Wai Lam
Affiliations: The Chinese University of Hong Kong; Bytedance; Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Abstract:We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.
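
The latency metric named in this abstract, Average Lagging, can be sketched as follows. This is a minimal sketch of the commonly used definition, in which `g[t-1]` is the number of source tokens read before target token `t` is emitted. The abstract does not say how the authors compute it, so treat the details and the example schedules as assumptions.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging for one sentence pair.

    g[t-1]: source tokens read when target token t is written (assumed input).
    gamma:  target-to-source length ratio.
    tau:    first target position at which the whole source has been read.
    """
    gamma = tgt_len / src_len
    tau = next((t for t, read in enumerate(g, start=1) if read >= src_len), len(g))
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Hypothetical wait-2 schedule: read 2 tokens, then alternate write/read.
wait2_lag = average_lagging([2, 3, 4, 4], src_len=4, tgt_len=4)
# Offline translation: the whole source is read before the first target token.
offline_lag = average_lagging([4, 4, 4, 4], src_len=4, tgt_len=4)
```

With equal source and target lengths, a wait-k schedule lags by k tokens and an offline system lags by the full source length, which is the contrast the abstract draws between SiMT and offline translation.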

[NLP-123] REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

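[Quick read]: This paper aims to address the fact that formal theorem provers have made remarkable progress on high-school and competition-level mathematics but generalize poorly to more advanced mathematics. The key to the solution is REAL-Prover, a new open-source stepwise theorem prover built on a fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), which improves performance on college-level mathematics problems. The authors also develop the HERALD-AF data extraction pipeline and the Jixia-interactive environment to generate and collect training data for the model.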

Link: https://arxiv.org/abs/2505.20613
Authors: Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
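
The Pass@64 figures above count a problem as solved when at least one of 64 proof attempts succeeds. A minimal sketch of the widely used unbiased pass@k estimator follows; whether the authors use this estimator or a plain "any of 64 attempts" count is an assumption here, and the two coincide when exactly k attempts are sampled.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn without replacement
    from n sampled attempts of which c are correct, succeeds."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 64 the estimator is 1.0 if any attempt succeeded, else 0.0.
solved = pass_at_k(64, 1, 64)
unsolved = pass_at_k(64, 0, 64)
```

A benchmark-level score such as 23.7% would then be the mean of this per-problem value over all problems.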

[NLP-124] Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

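[Quick read]: This paper addresses the poor generalization of Vision-Language Models (VLMs) to out-of-distribution classes, tasks, and imaging modalities. State-of-the-art models perform well on common objects such as cars, trucks, and pedestrians, but their performance drops sharply in unseen settings. The key to the proposed solution is to align VLMs to new concepts with annotation instructions that contain a few visual examples and rich textual descriptions, rather than simply re-training on more visual data. The goal is to improve how well models adapt across data distributions.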

Link: https://arxiv.org/abs/2505.20612
Authors: Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Affiliations: Roboflow; Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The first two authors contributed equally

Abstract:Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Our code and dataset are available at this https URL and this https URL

[NLP-125] Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients

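[Quick read]: This paper aims to address the long consultation time, high cost, and limited diagnostic accuracy of conventional medical diagnosis. The key to the solution is a real-time compound diagnostic medical AI interface based on a Large Language Model (LLM). The interface is compared with physicians on common internal medicine cases to assess its diagnostic accuracy, efficiency, and cost.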

Link: https://arxiv.org/abs/2505.20609
Authors: Hyungjun Park (1,2), Chang-Yun Woo (3), Seungjo Lim (2), Seunghwan Lim (2), Keunho Kwak (2), Ju Young Jeong (4), Chong Hyun Suh (4)
Affiliations: (1) Department of Pulmonology, Shihwa Medical Center, Siheung, Republic of Korea; (2) Helpmedoc Inc., Republic of Korea; (3) Department of Internal Medicine, Asan Medical Center, Seoul, Republic of Korea; (4) Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Objective: To develop an LLM-based real-time compound diagnostic medical AI interface and to perform a clinical trial comparing this interface with physicians on common internal medicine cases based on United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) style exams. Methods: A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results: The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy rate of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The AI interface (0.08) also reduced costs by 98.1% compared to the physicians' average (4.2). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion: An LLM-based real-time compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.

[NLP-126] Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation

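[Quick read]: This paper addresses the poor generalization of automatic speech recognition (ASR) models to unseen data, especially when the training set is limited in size. The key to the solution is targeted acoustic augmentation to improve robustness. The study finds that transcription generalization is driven mainly by acoustic variation rather than linguistic richness, so acoustically focused data augmentation is a viable alternative to massive datasets.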

Link: https://arxiv.org/abs/2505.20606
Authors: Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: in submission

Abstract:Whisper’s robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.
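
Word-error rate, the metric behind the 19.24 percent figure above, is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below shows that computation plus a relative-reduction helper. The example sentences are hypothetical, and the abstract does not state whether its reduction is relative or absolute, so the helper illustrates only one reading.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Relative WER reduction, e.g. 0.10 -> 0.08 is a 20% reduction."""
    return (baseline - improved) / baseline

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```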

[NLP-127] Effectiveness of Prompt Optimization in NL2SQL Systems

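[Quick read]: This paper aims to address how to build high-precision, high-performance natural language to SQL (NL2SQL) systems for production, rather than focusing only on high-quality SQL generation as most current approaches do. The key is to carefully select a static set of exemplars that captures the intricacies of the query log, the target database, SQL constructs, and execution latencies, instead of selecting exemplars by similarity alone. To this end, the authors propose a prompt optimization framework that meets the high-precision requirement and also improves the performance of the generated SQL through multi-objective optimization.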

Link: https://arxiv.org/abs/2505.20591
Authors: Sairam Gurajada, Eser Kandogan, Sajjadur Rahman
Affiliations: Megagon Labs; Adobe
Subjects: Computation and Language (cs.CL); Databases (cs.DB)
Comments:

Abstract: NL2SQL approaches have greatly benefited from the impressive capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with sufficient contextual information, such as schema details and translation demonstrations. However, building an accurate system still requires the rigorous task of selecting the right context for each query, including identifying relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the go-to approach for identifying such context. While effective, these methods introduce additional inference-time costs due to the retrieval process. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems, rather than simply high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, the careful selection of a static set of exemplars, capturing the intricacies of the query log, target database, SQL constructs, and execution latencies, plays a more crucial role than exemplar selection based solely on similarity. The key challenge, however, lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.
Cite as: arXiv:2505.20591 [cs.CL]. Journal reference: NOVAS Workshop, SIGMOD 2025. Related DOI: https://doi.org/10.1145/3735079.3735325
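
One simple way to picture multi-objective selection over static exemplar sets is a Pareto filter on accuracy and execution latency. The sketch below is an illustration under stated assumptions, not the authors' framework: the candidate names, accuracy values, and latencies are invented, and the abstract does not describe the optimizer actually used.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated on (accuracy up, latency down).

    Each candidate is a (name, accuracy, latency_ms) tuple.
    """
    front = []
    for name, acc, lat in candidates:
        dominated = any(
            other_acc >= acc and other_lat <= lat
            and (other_acc > acc or other_lat < lat)
            for _, other_acc, other_lat in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical exemplar sets with measured accuracy and mean SQL latency (ms).
candidates = [
    ("set_a", 0.82, 120.0),
    ("set_b", 0.85, 200.0),
    ("set_c", 0.80, 150.0),  # worse than set_a on both objectives
]
front = pareto_front(candidates)
```

A production choice would then pick one point on the front according to the accuracy and latency budget of the deployment.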

[NLP-128] Emotion Classification In-Context in Spanish

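[Quick read]: This paper addresses the classification of Spanish customer feedback into emotion categories (positive, neutral, and negative), with the goal of more accurate sentiment analysis and better customer experience. Traditional methods translate feedback from widely spoken languages into less common ones, which loses semantic integrity and contextual nuance. The key to the solution is a hybrid approach that combines TF-IDF with BERT embeddings and uses a Custom Stacking Ensemble (CSE) strategy, turning Spanish text into numerical representations that preserve the semantic depth of the original language and improving classification performance.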

Link: https://arxiv.org/abs/2505.20571
Authors: Bipul Thapa, Gabriel Cofre
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This paper has been accepted and presented at the 4th International Conference on Applied Intelligence and Informatics (AII 2024). The final version will appear in the official conference proceedings. This preprint is provided to ensure the timely dissemination of the research prior to formal publication

Abstract:Classifying customer feedback into distinct emotion categories is essential for understanding sentiment and improving customer experience. In this paper, we classify customer feedback in Spanish into three emotion categories–positive, neutral, and negative–using advanced NLP and ML techniques. Traditional methods translate feedback from widely spoken languages to less common ones, resulting in a loss of semantic integrity and contextual nuances inherent to the original language. To address this limitation, we propose a hybrid approach that combines TF-IDF with BERT embeddings, effectively transforming Spanish text into rich numerical representations that preserve the semantic depth of the original language by using a Custom Stacking Ensemble (CSE) approach. To evaluate emotion classification, we utilize a range of models, including Logistic Regression, KNN, Bagging classifier with LGBM, and AdaBoost. The CSE model combines these classifiers as base models and uses a one-vs-all Logistic Regression as the meta-model. Our experimental results demonstrate that CSE significantly outperforms the individual and BERT model, achieving a test accuracy of 93.3% on the native Spanish dataset–higher than the accuracy obtained from the translated version. These findings underscore the challenges of emotion classification in Spanish and highlight the advantages of combining vectorization techniques like TF-IDF with BERT for improved accuracy. Our results provide valuable insights for businesses seeking to leverage emotion classification to enhance customer feedback analysis and service improvements.

[NLP-129] The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages INTERSPEECH2025

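[Quick read]: This paper addresses the shortage of data for African languages in speech technology, in particular the lack of large-scale, high-quality speech-text datasets for Igbo, Hausa, and Yoruba. The key to the solution is the NaijaVoices dataset, a speech-text dataset with 1,800 hours of speech from more than 5,000 speakers. Its data collection approach yields acoustically diverse data, and fine-tuning on it brings substantial improvements in automatic speech recognition.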

Link: https://arxiv.org/abs/2505.20564
Authors: Chris Emezue, The NaijaVoices Community, Busayo Awobade, Abraham Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication at Interspeech 2025

Abstract: The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages, including our focus languages Igbo, Hausa, and Yoruba, remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving on average 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR) WER improvements. These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.

[NLP-130] Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning

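[Quick read]: This paper addresses whether reflective reasoning behaviors emerge in LLMs during Markovian reinforcement learning (RL) training, and why such behaviors help at test time. Conventional Markovian RL confines exploration to the training phase and uses the history context only through the current state, which limits reflective behavior. To remedy this, the authors propose Bayes-Adaptive RL (BARL), which optimizes the expected return under a posterior distribution over Markov decision processes and so explicitly incentivizes both reward-maximizing exploitation and information-gathering exploration. The key to BARL is belief updating: it guides the model to switch and stitch strategies based on observed outcomes, which makes reflective exploration principled.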

Link: https://arxiv.org/abs/2505.20561
Authors: Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Affiliations: Google DeepMind; Google
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why they are beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at this https URL.
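
The belief-update idea in this abstract can be shown with a toy example: keep a posterior over candidate MDP hypotheses, update it with Bayes' rule after an observed outcome, and pick the strategy with the highest posterior-expected return. This is only a sketch of the general Bayes-adaptive principle. The hypotheses, likelihoods, and returns are invented for illustration, and the code is not the BARL algorithm itself.

```python
def update_belief(belief, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: belief[h] * likelihoods[h] for h in belief}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def expected_return(belief, returns):
    """Expected return of one strategy under the current belief."""
    return sum(belief[h] * returns[h] for h in belief)

# Two hypothetical MDP hypotheses, equally likely a priori.
prior = {"mdp_a": 0.5, "mdp_b": 0.5}
# Assumed likelihood, under each hypothesis, of observing that strategy_1 failed.
likelihood_of_failure = {"mdp_a": 0.1, "mdp_b": 0.9}
posterior = update_belief(prior, likelihood_of_failure)

# Assumed return of each strategy under each hypothesis.
returns = {
    "strategy_1": {"mdp_a": 1.0, "mdp_b": 0.0},
    "strategy_2": {"mdp_a": 0.0, "mdp_b": 1.0},
}
best = max(returns, key=lambda s: expected_return(posterior, returns[s]))
```

Under the prior both strategies look equally good. After the observed failure, the posterior favors `mdp_b`, so the selection switches to `strategy_2`, which mirrors the strategy switching described in the abstract.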

[NLP-131] Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline

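[Quick read]: This paper addresses the factual inconsistency of multilingual large language models (LLMs) across languages, where factual recall in non-English languages is markedly worse than in English. Using mechanistic analysis, the authors uncover the underlying pipeline: the model processes multilingual queries with an English-centric factual recall mechanism and then translates the English answer back into the target language. They identify two main sources of error: insufficient engagement of the reliable English-centric recall mechanism, and incorrect translation from English back into the target language. To address both, they introduce two vector interventions, independent of language and dataset, that steer the model toward better internal paths and improve factual consistency.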

Link: https://arxiv.org/abs/2505.20546
Authors: Meng Lu, Ruochen Zhang, Ellie Pavlick, Carsten Eickhoff
Affiliations: Brown University; University of Tübingen
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Multilingual large language models (LLMs) often exhibit factual inconsistencies across languages, with significantly better performance in factual recall tasks in English than in other languages. The causes of these failures, however, remain poorly understood. Using mechanistic analysis techniques, we uncover the underlying pipeline that LLMs employ, which involves using the English-centric factual recall mechanism to process multilingual queries and then translating English answers back into the target language. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language for the final answer. To address these vulnerabilities, we introduce two vector interventions, both independent of languages and datasets, to redirect the model toward better internal paths for higher factual consistency. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language. Our findings demonstrate how mechanistic insights can be used to unlock latent multilingual capabilities in LLMs.

[NLP-132] AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

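[Quick read]: This paper addresses how to evaluate the ability of Large Language Models (LLMs) to produce scientific insight in astronomy, specifically in data processing, analysis, and visualization. Prior work offers no effective way to judge whether an LLM-assisted scientific workflow conveys the correct scientific insights. The key to the solution is AstroVisBench, the first benchmark for scientific computing and visualization in the astronomy domain. It evaluates a model's ability to build astronomy-specific data processing workflows and to visualize the results through complex plots. Visualizations are scored with a novel LLM-as-a-judge workflow validated against annotations from five professional astronomers, which gives AI scientists an end-to-end evaluation framework.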

Link: https://arxiv.org/abs/2505.20538
Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments:

Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.

[NLP-133] Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models

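Quick read: This paper addresses the cost-benefit trade-off between resource allocation and performance gains in test-time scaling of Large Reasoning Models (LRMs). Its key contribution is the Test-Time Scaling Performance Model (TTSPM). Using probabilistic modeling, the authors analyze two fundamental paradigms, parallel scaling and sequential scaling, derive the saturation point of the scaling budget for each strategy, show that additional computation yields diminishing marginal returns, and find that the upper bounds of the two paradigms share a unified mathematical structure. Together these give a theoretical basis and practical guidance for test-time resource allocation.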

Link: https://arxiv.org/abs/2505.20522
Authors: Jian Wang, Boyan Zhu, Chak Tou Leong, Yongqi Li, Wenjie Li
Affiliations: The Hong Kong Polytechnic University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
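
The diminishing-returns idea behind a saturation point can be illustrated with a textbook model of parallel sampling. The sketch below is not the paper's TTSPM, whose exact form the abstract does not give. It assumes, purely for illustration, that each sample solves the problem independently with probability `p`; the function names and the tolerance `eps` are hypothetical.

```python
import math


def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra coverage from drawing one more sample after n samples."""
    return p * (1.0 - p) ** n


def saturation_budget(p: float, eps: float) -> int:
    """Smallest n at which the next sample adds less than eps coverage.

    Illustrative assumption only: independent samples with fixed success
    probability p. This is not the bound derived in the paper.
    """
    if not 0.0 < p < 1.0 or eps <= 0.0:
        raise ValueError("need 0 < p < 1 and eps > 0")
    # Closed-form starting guess from p * (1 - p)^n < eps.
    n = max(0, math.ceil(math.log(eps / p) / math.log(1.0 - p)))
    # Guard against floating-point rounding at the boundary.
    while marginal_gain(p, n) >= eps:
        n += 1
    while n > 0 and marginal_gain(p, n - 1) < eps:
        n -= 1
    return n


if __name__ == "__main__":
    print(saturation_budget(0.5, 0.01))
```

Under these assumptions, with `p = 0.5` and `eps = 0.01`, the next sample adds less than the tolerance once 6 samples have been drawn.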

[NLP-134] Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

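Quick read: This paper addresses the shortcomings of conventional conversational AI in emotional reasoning and multimodal interaction, aiming to make dialogue systems more natural and emotionally appropriate by simulating reasoning influenced by emotional states. The key to the solution is a multimodal, multi-model architecture built on five basic emotion agents (Joy, Sadness, Fear, Anger, and Disgust). The agents generate, criticise, and iteratively refine responses through structured multi-round dialogue, and a final reasoning mechanism synthesises their contributions into a coherent output.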

Link: https://arxiv.org/abs/2505.20521
Authors: Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luis Frazão, Nuno Costa, António Pereira
Affiliations: Computer Science and Communication Research Centre, Polytechnic University of Leiria
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages, 5 figures. Submitted for review to Information Fusion

Abstract:This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar’s Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

[NLP-135] Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges, and Prospects

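Quick read: This paper addresses multimodal emotion recognition in conversational systems, that is, how to fuse text, speech, and visual information to make emotion understanding more accurate and natural. The key to the solution is building effective multimodal fusion mechanisms that perceive and recognize emotions comprehensively, advancing emotional intelligence in human-computer interaction.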

Link: https://arxiv.org/abs/2505.20511
Authors: Chengyan Wu, Yiqiang Cai, Yang Liu, Pengxu Zhu, Yun Xue, Ziwei Gong, Julia Hirschberg, Bolei Ma
Affiliations: South China Normal University; Guangdong Provincial Key Laboratory of Quantum Engineering and Quantum Materials; School of Electronic Science and Engineering (School of Microelectronics); North Carolina Central University; Georgia Institute of Technology; Columbia University; LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL)
Comments:

Abstract: While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.

[NLP-136] ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis UAI INTERSPEECH2025

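Quick read: This paper addresses the lack of high-quality speech data with diacritized transcriptions for multi-speaker Modern Standard Arabic (MSA) speech synthesis and related tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. The key to the solution is ArVoice, a multi-speaker corpus that combines professionally recorded speech, a modified subset of an existing Arabic speech corpus, and high-quality synthetic speech. It contains 83.52 hours of speech across 11 voices, about 10 hours of which are human speech, and supports research on a range of speech-processing tasks.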

Link: https://arxiv.org/abs/2505.20506
Authors: Hawau Olamide Toyin, Rufael Marew, Humaid Alblooshi, Samar M. Magdy, Hanan Aldarmaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at INTERSPEECH 2025. The dataset is available at this https URL

Abstract:We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions, intended for multi-speaker speech synthesis, and can be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection. ArVoice comprises: (1) a new professionally recorded set from six voice talents with diverse demographics, (2) a modified subset of the Arabic Speech Corpus; and (3) high-quality synthetic speech from two commercial systems. The complete corpus consists of a total of 83.52 hours of speech across 11 voices; around 10 hours consist of human voices from 7 speakers. We train three open-source TTS and two voice conversion systems to illustrate the use cases of the dataset. The corpus is available for research use.

[NLP-137] Large Language Models for IT Automation Tasks: Are We There Yet?

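Quick read: This paper addresses the limited effectiveness of current Large Language Models (LLMs) on IT automation tasks, in particular code generation for tools such as Ansible, which remains understudied. Existing benchmarks rely mainly on synthetic tasks that do not reflect practitioners' needs. The proposed solution is ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks, each involving state reconciliation, a property unique to IT automation tools. ITAB evaluates LLMs' ability to generate functional Ansible scripts through dynamic execution. The key lies in realistic task design and execution-based validation, which expose LLMs' limitations in state reasoning and module-specific execution knowledge.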

Link: https://arxiv.org/abs/2505.20505
Authors: Md Mahadi Hassan, John Salvador, Akond Rahman, Santu Karmaker
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 8 pages

Abstract: LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs’ ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which accomplish pass@10 at a rate beyond 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state reconciliation related reasoning (44.87% combined from variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37% combined from attribute and parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs’ ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.
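
The abstract reports pass@10 but does not say how it is computed. A common choice in code-generation evaluation is the unbiased estimator over `n` generated samples of which `c` pass. The sketch below assumes that estimator and uses hypothetical names, so it may differ from ITAB's own scoring.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated per task, c: samples that passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-task outcomes: (samples drawn, samples that passed).
    print(benchmark_pass_at_k([(20, 1), (20, 0)], k=10))
```

Averaging per-task estimates, as `benchmark_pass_at_k` does, is one plausible way to arrive at a single benchmark-level rate like the one quoted above.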

[NLP-138] Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

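Quick read: This paper addresses the key challenges facing embodied AI in mobile service robots: multimodal sensor fusion, real-time decision-making under uncertainty, task generalization, and effective human-robot interaction (HRI). The key to the solution is combining foundation models with the principles of embodied AI, using generative AI for real-time sensor fusion, language-conditioned control, and adaptive task execution, so that robots can better understand, adapt to, and execute tasks in dynamic real-world environments.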

Link: https://arxiv.org/abs/2505.20503
Authors: Matthew Lisondra, Beno Benhabib, Goldie Nejat
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interactions, robots can improve understanding, adapt to, and execute complex tasks in dynamic real-world environments. However, embodied AI in mobile service robots continues to face key challenges, including multimodal sensor fusion, real-time decision-making under uncertainty, task generalization, and effective human-robot interactions (HRI). In this paper, we present the first systematic review of the integration of foundation models in mobile service robotics, identifying key open challenges in embodied AI and examining how foundation models can address them. Namely, we explore the role of such models in enabling real-time sensor fusion, language-conditioned control, and adaptive task execution. Furthermore, we discuss real-world applications in the domestic assistance, healthcare, and service automation sectors, demonstrating the transformative impact of foundation models on service robotics. We also include potential future research directions, emphasizing the need for predictive scaling laws, autonomous long-term adaptation, and cross-embodiment generalization to enable scalable, efficient, and robust deployment of foundation models in human-centric robotic systems.

[NLP-139] Gatsby Without the E: Crafting Lipograms with LLMs

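Quick read: This paper addresses generating coherent, meaning-preserving text under strict linguistic constraints such as excluding a particular letter. The case study rewrites F. Scott Fitzgerald's The Great Gatsby as a text that contains no letter 'e'. The key to the solution is combining modern Large Language Models (LLMs) with several techniques, including synonym replacement, beam search, and named entity analysis, to enforce the letter-exclusion constraint while preserving the text's meaning.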

Link: https://arxiv.org/abs/2505.20501
Authors: Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: 7.5 pages

Abstract:Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter ‘e’. In this study, we explore the power of modern large language models (LLMs) by transforming the novel F. Scott Fitzgerald’s The Great Gatsby into a fully ‘e’-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter ‘u’) had minimal impact on the text’s meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.
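
The synonym-replacement baseline mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the synonym table `SYNONYMS` is a hypothetical toy lexicon, all function names are invented for the example, and capitalization is not preserved.

```python
import re

# Hypothetical toy synonym table; a real system would need a large lexicon.
SYNONYMS = {"great": "grand", "the": "a", "house": "mansion"}

WORD = re.compile(r"[A-Za-z']+")


def is_lipogram(text: str, banned: str = "e") -> bool:
    """True if the banned letter never occurs in the text (case-insensitive)."""
    return banned.lower() not in text.lower()


def replace_banned_words(text: str, synonyms: dict, banned: str = "e") -> str:
    """Swap each word containing the banned letter for a compliant synonym.

    Words with no compliant synonym are left unchanged, so the result is
    not guaranteed to be a lipogram.
    """
    banned = banned.lower()

    def swap(match: re.Match) -> str:
        word = match.group(0)
        if banned not in word.lower():
            return word
        candidate = synonyms.get(word.lower())
        if candidate is None or banned in candidate.lower():
            return word
        return candidate

    return WORD.sub(swap, text)


def remaining_violations(text: str, banned: str = "e") -> list:
    """Words that still contain the banned letter."""
    return [w for w in WORD.findall(text) if banned.lower() in w.lower()]


if __name__ == "__main__":
    rewritten = replace_banned_words("the great house", SYNONYMS)
    print(rewritten, is_lipogram(rewritten))
```

Words left in `remaining_violations` are the cases where a lookup table runs out, which is presumably where the generative methods described in the abstract come in.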

[NLP-140] Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

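Quick read: This paper addresses the limitations of Large Language Models (LLMs) in identifying nuanced ableism directed at autistic people. The key approach is to evaluate how well four LLMs recognize autism-related language and to analyze the gap between their understanding of terminology and their actual detection of harmful or offensive content. The study finds that although LLMs can identify autism-related language, they often miss its negative or offensive connotations, and their explanations rely mainly on surface-level keyword matching rather than on context, speaker identity, and potential impact.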

Link: https://arxiv.org/abs/2505.20500
Authors: Naba Rizvi, Harper Strickland, Saleha Ahmedi, Aekta Kallepalli, Isha Khirwadkar, William Wu, Imani N. S. Munyaka, Nedjma Ousidhoum
Affiliations: University of California, San Diego; Cardiff University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.

[NLP-141] Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

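Quick read: This paper addresses the information loss that occurs in conventional Transformer models when the information from all tokens in a sequence is compressed into a single [CLS] token, which hurts tasks that need localized or hierarchical cues. The key to the solution is Inceptive Transformer, a modular and lightweight architecture that enriches Transformer-based token representations with a multi-scale feature extraction module inspired by Inception networks, and that balances local and global dependencies by dynamically weighting tokens according to their relevance to the task.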

Link: https://arxiv.org/abs/2505.20496
Authors: Asif Shahriar, Rifat Shahriyar, M Saifur Rahman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Conventional transformer models typically compress the information from all tokens in a sequence into a single [CLS] token to represent global context, an approach that can lead to information loss in tasks requiring localized or hierarchical cues. In this work, we introduce Inceptive Transformer, a modular and lightweight architecture that enriches transformer-based token representations by integrating a multi-scale feature extraction module inspired by inception networks. Our model is designed to balance local and global dependencies by dynamically weighting tokens based on their relevance to a particular task. Evaluation across a diverse range of tasks including emotion recognition (both English and Bangla), irony detection, disease identification, and anti-COVID vaccine tweets classification shows that our models consistently outperform the baselines by 1% to 14% while maintaining efficiency. These findings highlight the versatility and cross-lingual applicability of our method for enriching transformer-based representations across diverse domains.

[NLP-142] InFact: Informativeness Alignment for Improved LLM Factuality

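Quick read: This paper addresses the incompleteness of information in factual text generated by Large Language Models (LLMs): a model may produce text that is correct but insufficiently detailed and informative. The key to the solution is an informativeness alignment mechanism that uses recent factuality benchmarks to construct an informativeness alignment objective, which prioritizes answers that are both correct and informative. A key finding is that maximizing this objective, or optimizing its preference, improves not only informativeness but also factual accuracy.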

Link: https://arxiv.org/abs/2505.20487
Authors: Roi Cohen, Russa Biswas, Gerard de Melo
Affiliations: Hasso Plattner Institute; University of Potsdam; Aalborg University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the factual sentence "Barack Obama was born in the United States" is factually correct, though less informative than the factual sentence "Barack Obama was born in Honolulu, Hawaii, United States". Despite the known fact that LLMs tend to hallucinate and generate factually incorrect text, they might also tend to choose to generate factual text that is indeed factually correct and yet less informative than other, more informative choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. A key finding of our work is that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.

[NLP-143] Conversation Kernels: A Flexible Mechanism to Learn Relevant Context for Online Conversation Understanding AAAI

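Quick read: This paper addresses the difficulty of understanding individual utterances in online conversations. Each utterance is usually short and may implicitly refer to other content in the same conversation, so the conversational context and the dependencies between different parts of the conversation must be captured and encoded into the language model. The key to the solution is a general-purpose mechanism for discovering the appropriate conversational context for different aspects of an online post (for example, whether it is informative, insightful, or funny). Specifically, the authors design two families of Conversation Kernels that explore different parts of the neighborhood of a post in the conversation tree, building relevant context for each task.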

Link: https://arxiv.org/abs/2505.20482
Authors: Vibhor Agarwal, Arjoo Gupta, Suparna De, Nishanth Sastry
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at International AAAI Conference on Web and Social Media (ICWSM) 2025

Abstract: Understanding online conversations has attracted research attention with the growth of social networks and online discussion forums. Content analysis of posts and replies in online conversations is difficult because each individual utterance is usually short and may implicitly refer to other posts within the same conversation. Thus, understanding individual posts requires capturing the conversational context and dependencies between different parts of a conversation tree and then encoding the context dependencies between posts and comments/replies into the language model. To this end, we propose a general-purpose mechanism to discover appropriate conversational context for various aspects about an online post in a conversation, such as whether it is informative, insightful, interesting or funny. Specifically, we design two families of Conversation Kernels, which explore different parts of the neighborhood of a post in the tree representing the conversation and through this, build relevant conversational context that is appropriate for each task being considered. We apply our developed method to conversations crawled from this http URL, which allows users to apply highly different labels to posts, such as ‘insightful’, ‘funny’, etc., and therefore provides an ideal experimental platform to study whether a framework such as Conversation Kernels is general-purpose and flexible enough to be adapted to disparately different conversation understanding tasks.
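
The idea of exploring the neighborhood of a post in a conversation tree can be sketched as follows. The abstract does not define the two kernel families, so this is only a hypothetical fixed neighborhood selector (ancestors, siblings, and direct replies). The class, its methods, and the `hops_up` parameter are all invented for illustration, and the selection of relevant context is hard-coded rather than learned.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class ConversationTree:
    """Conversation stored as a mapping from post id to parent id."""

    def __init__(self, parent_of: Dict[str, Optional[str]]):
        self.parent_of = dict(parent_of)
        self.children_of = defaultdict(list)
        for post, parent in parent_of.items():
            if parent is not None:
                self.children_of[parent].append(post)

    def ancestors(self, post: str, max_hops: int) -> List[str]:
        """Up to max_hops posts on the path toward the root, nearest first."""
        found = []
        current = self.parent_of.get(post)
        while current is not None and len(found) < max_hops:
            found.append(current)
            current = self.parent_of.get(current)
        return found

    def siblings(self, post: str) -> List[str]:
        """Other replies to the same parent."""
        parent = self.parent_of.get(post)
        if parent is None:
            return []
        return [p for p in self.children_of[parent] if p != post]

    def children(self, post: str) -> List[str]:
        """Direct replies to the post."""
        return list(self.children_of.get(post, []))


def neighborhood_context(tree: ConversationTree, post: str, hops_up: int = 2) -> List[str]:
    """Hypothetical context window: ancestors, then siblings, then replies."""
    return tree.ancestors(post, hops_up) + tree.siblings(post) + tree.children(post)


if __name__ == "__main__":
    tree = ConversationTree({"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"})
    print(neighborhood_context(tree, "a", hops_up=1))
```

A task-specific variant would weight or filter these neighbors differently, for example favoring ancestors for one label and sibling replies for another.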

[NLP-144] The Impact of a Chatbot's Ephemerality-Framing on Self-Disclosure Perceptions

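Quick read: This paper addresses how a chatbot's framing affects users' self-disclosure, in particular the behavioral differences that arise from how users perceive the chatbot's role in different contexts. The key to the approach is comparing two chatbot personas: a Familiar chatbot, which presents itself as a companion that remembers past interactions, and a Stranger chatbot, which presents itself as a new, unacquainted entity in each conversation. The results show that users were more comfortable disclosing to the Stranger chatbot when emotional disclosure was sought first, and enjoyed interacting with the Familiar chatbot more when factual disclosure was sought first, indicating that role framing significantly affects user behavior.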

Link: https://arxiv.org/abs/2505.20464
Authors: Samuel Rhys Cox, Rune Møberg Jacobsen, Niels van Berkel
Affiliations: Aalborg University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: In ACM Conversational User Interfaces (CUI '25), July 8-10, 2025; 18 pages; 6 Figures; 6 Tables

Abstract:Self-disclosure, the sharing of one’s thoughts and feelings, is affected by the perceived relationship between individuals. While chatbots are increasingly used for self-disclosure, the impact of a chatbot’s framing on users’ self-disclosure remains under-explored. We investigated how a chatbot’s description of its relationship with users, particularly in terms of ephemerality, affects self-disclosure. Specifically, we compared a Familiar chatbot, presenting itself as a companion remembering past interactions, with a Stranger chatbot, presenting itself as a new, unacquainted entity in each conversation. In a mixed factorial design, participants engaged with either the Familiar or Stranger chatbot in two sessions across two days, with one conversation focusing on Emotional- and another Factual-disclosure. When Emotional-disclosure was sought in the first chatting session, Stranger-condition participants felt more comfortable self-disclosing. However, when Factual-disclosure was sought first, these differences were replaced by more enjoyment among Familiar-condition participants. Qualitative findings showed Stranger afforded anonymity and reduced judgement, whereas Familiar sometimes felt intrusive unless rapport was built via low-risk Factual-disclosure.

[NLP-145] Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries

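Quick read: This paper addresses the limited accuracy of Large Language Models (LLMs) used as judges of other language models' responses, especially on preference data with complex, multi-turn conversational context. The key to the solution is Amulet, a framework that draws on the linguistic concepts of dialog acts and maxims to capture the communicative structures and intents in a conversation and to assess whether the preference responses satisfy conversational principles, thereby improving the judgments of LLM judges.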

Link: https://arxiv.org/abs/2505.20451
Authors: Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy, and shows significant diversity in topics, intents, and requirements across turns, e.g. social interactions, task requests, feedback. We present Amulet, a framework that leverages pertinent linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, reiterating the latter’s significance in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements on relevant baselines for all four datasets.

[NLP-146] In-context Language Learning for Endangered Languages in Speech Recognition

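Quick read: This paper addresses the limited coverage of the world's roughly 7,000 languages by current Large Language Models (LLMs), and specifically whether LLMs can learn unseen, low-resource languages for Automatic Speech Recognition (ASR). The key to the solution is In-Context Learning (ICL): providing relevant text samples improves LLM performance on language modelling and ASR without supervised data, and a probability-based approach outperforms the traditional instruction-based approach.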

Link: https://arxiv.org/abs/2505.20445
Authors: Zhaolin Li, Jan Niehues
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.

[NLP-147] HAMburger: Accelerating LLM Inference via Token Smashing

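Quick read: This paper addresses the high compute and memory cost of Large Language Model (LLM) inference. In the conventional generation pattern every token needs one forward pass and one KV cache entry, which uses resources inefficiently. The key to the solution is HAMburger, a hierarchically auto-regressive model that compresses multiple tokens into a single KV cache entry and generates several tokens per step, so that computation and storage are no longer allocated uniformly per token. This shifts the growth of the KV cache and forward FLOPs from linear to sub-linear in output length and lets inference speed adapt to query perplexity and output structure.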

Link: https://arxiv.org/abs/2505.20438
Authors: Jingyu Liu, Ce Zhang
Affiliations: University of Chicago
Categories: Computation and Language (cs.CL)
Comments:

Abstract: The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2× and achieves up to 2× TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.

[NLP-148] PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy ACL2025

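Quick read: This paper addresses the high error rates of text extraction from degraded historical documents. The key to the solution is the PreP-OCR pipeline, which combines document image restoration with semantic-aware post-OCR correction, jointly optimizing image clarity and linguistic consistency to improve extraction accuracy.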

Link: https://arxiv.org/abs/2505.20429
Authors: Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene
Affiliations: University College Dublin; Trinity College Dublin; University of Toronto; Shanghai University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL 2025 main

Abstract:This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to improve text extraction from degraded historical documents. Our key innovation lies in jointly optimizing image clarity and linguistic consistency. First, we generate synthetic image pairs with randomized text fonts, layouts, and degradations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-corrector, fine-tuned on synthetic historical text training pairs, addresses any remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
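
The headline metric here, character error rate (CER), is conventionally the character-level edit distance between OCR output and ground truth divided by the ground-truth length. The sketch below assumes that standard definition and reads the reported 63.9-70.3% figures as relative reductions; both are assumptions, since the abstract does not spell out the computation, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (can exceed 1.0)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original error removed, e.g. 0.7 means 70% lower CER."""
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    truth = "historical archive"
    raw_ocr = "hist0rical arch1ve"         # hypothetical OCR output on a raw image
    corrected = "historical arch1ve"       # hypothetical output after correction
    before = character_error_rate(truth, raw_ocr)
    after = character_error_rate(truth, corrected)
    print(before, after, relative_reduction(before, after))
```

Normalizing by the reference length keeps scores comparable across pages of different sizes, which matters when averaging over thousands of pages.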

[NLP-149] The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

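Quick read: This paper addresses the shortage of resources for Tagalog in computational linguistics research, in particular the lack of a large, high-quality treebank. The key to the solution is UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework, together with baseline evaluations using several Transformer-based models to support research on syntactic analysis of the language.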

Link: https://arxiv.org/abs/2505.20428
Authors: Angelina A. Aquino, Lester James V. Miranda, Elsie Marie T. Or
Affiliations: Charles Darwin University; Allen Institute for AI; Department of Linguistics, University of the Philippines Diliman; Electrical and Electronics Engineering Institute, University of the Philippines Diliman
Categories: Computation and Language (cs.CL)
Comments: Link to treebank: this https URL ; All authors contributed equally in this work

Abstract:This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

[NLP-150] SEMMA: A Semantic Aware Knowledge Graph Foundation Model

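Quick read: This paper addresses the over-reliance of Knowledge Graph Foundation Models (KGFMs) on structural information in zero-shot reasoning, which neglects the rich semantic signals in textual attributes. The key to the solution is SEMMA, a dual-module KGFM that integrates transferable textual semantics with structure. It uses Large Language Models (LLMs) to enrich relation identifiers, generates semantic embeddings that form a textual relation graph, and fuses this graph with the structural component to improve generalization across graphs.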

Link: https://arxiv.org/abs/2505.20422
Authors: Arvindh Arun, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, Steffen Staab
Affiliations: Institute for AI, University of Stuttgart; IIIT Hyderabad; Stanford University; University of Edinburgh; University of Southampton
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.

[NLP-151] GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

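[Quick read]: This paper addresses the heavy dependence of supervised fine-tuning for Large Language Models (LLMs) on high-quality labeled data, which is costly and time-consuming to obtain. The key to its solution is the GraphGen framework, which builds a fine-grained knowledge graph, identifies knowledge gaps with the expected calibration error metric, prioritizes question-answer pairs that target high-value long-tail knowledge, uses multi-hop neighborhood sampling to capture complex relational information, and applies style-controlled generation to diversify the QA data, improving both the quality and the coverage of the synthetic data.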

Link: https://arxiv.org/abs/2505.20416
Authors: Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at this https URL.
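
The abstract names expected calibration error (ECE) as the signal for finding knowledge gaps but does not say how GraphGen applies it. As a minimal sketch, the standard binned ECE can be computed as follows; the function name, the equal-width binning, and the input format are assumptions made for illustration, not the authors' implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: model confidence in [0, 1] for each judged statement (assumed input).
    correct: whether the model's judgment was right for each statement.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A large value means the model's stated confidence diverges from how often it is right, which is one plausible way to flag knowledge that deserves extra synthetic QA pairs.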

[NLP-152] Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision

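[Quick read]: This paper addresses two problems: Large Language Models (LLMs) perform well on mathematical and logical reasoning tasks largely through memorization rather than generalization, and existing attempts to combine them with symbolic methods fail to exploit symbolic representations because reliable, scalable verification mechanisms are lacking. The key to its solution is to generate symbolic reasoning trajectories, select the high-quality ones with a process reward model tuned automatically via Monte Carlo estimation, and then fine-tune on those trajectories to improve logical reasoning and generalization.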

Link: https://arxiv.org/abs/2505.20415
Authors: Xingwei Tan, Marco Valentino, Mahmud Akhter, Maria Liakata, Nikolaos Aletras
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Large language models (LLMs) have shown promising performance in mathematical and logical reasoning benchmarks. However, recent studies have pointed to memorization, rather than generalization, as one of the leading causes for such performance. LLMs, in fact, are susceptible to content variations, demonstrating a lack of robust symbolic abstractions supporting their reasoning process. To improve reliability, many attempts have been made to combine LLMs with symbolic methods. Nevertheless, existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms. In this paper, we propose to overcome such limitations by generating symbolic reasoning trajectories and selecting the high-quality ones using a process reward model automatically tuned based on Monte Carlo estimation. The trajectories are then employed via fine-tuning methods to improve logical reasoning and generalization. Our results on logical reasoning benchmarks such as FOLIO and LogicAsker show the effectiveness of the proposed method with large gains on frontier and open-weight models. Moreover, additional experiments on claim verification reveal that fine-tuning on the generated symbolic reasoning trajectories enhances out-of-domain generalizability, suggesting the potential impact of symbolically-guided process supervision in alleviating the effect of memorization on LLM reasoning.
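
The abstract does not spell out how Monte Carlo estimation produces training signal for the process reward model. A common way to do it, sketched here under that assumption, is to score a partial reasoning prefix by the fraction of sampled completions that reach a correct final answer. The names `mc_step_value`, `rollout`, and `is_correct` are hypothetical placeholders.

```python
import random


def mc_step_value(prefix, rollout, is_correct, n_rollouts=8, rng=None):
    """Estimate the quality of a reasoning prefix by Monte Carlo rollouts.

    rollout(prefix, rng) -> a completed trajectory (hypothetical sampler).
    is_correct(trajectory) -> True when the final answer is right.
    Returns the fraction of rollouts that end correctly.
    """
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    rng = rng or random.Random(0)
    hits = sum(1 for _ in range(n_rollouts) if is_correct(rollout(prefix, rng)))
    return hits / n_rollouts
```

Prefixes with values near 1.0 would serve as positive examples and those near 0.0 as negative examples when tuning a step-level reward model.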

[NLP-153] SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

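[Quick read]: This paper addresses two key problems facing Generative AI in software engineering (SWE) tasks: the scarcity of high-quality training data, and the rapid obsolescence of static benchmarks caused by data contamination. The key to its solution is a novel, automated, and scalable pipeline that continuously extracts real-world interactive SWE tasks from diverse GitHub repositories. The resulting SWE-rebench dataset contains more than 21,000 interactive Python-based SWE tasks suitable for large-scale reinforcement learning. A continuous supply of fresh tasks also underpins a contamination-free benchmark that evaluates agentic software engineering models more accurately.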

Link: https://arxiv.org/abs/2505.20411
Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
Affiliations: unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Dataset: this https URL, SWE-rebench leaderboard this https URL

Abstract:LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.

[NLP-154] What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

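[Quick read]: This paper addresses the difficulty of evaluating instruction-based image editing models, where existing metrics fall short in both agreement with human judgment and explainability. The key to its solution is DICE (DIfference Coherence Estimator), which has two core components, a difference detector and a coherence estimator. Both are built on an autoregressive Multimodal Large Language Model (MLLM) and trained with a combination of self-supervision, distillation from inpainting networks, and full supervision, so that the model detects localized differences between the original and edited images and assesses how relevant those differences are to the modification request.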

Link: https://arxiv.org/abs/2505.20405
Authors: Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia; University of Pisa; University of Trento; IIT-CNR
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

[NLP-155] Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents ACL2025

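[Quick read]: This paper addresses a problem that arises when retrieval-augmented generation (RAG) with large language models is applied in finance: standardized documents such as U.S. Securities and Exchange Commission (SEC) filings share repetitive boilerplate text and similar table structures, so traditional RAG methods misidentify near-duplicate text, and the resulting duplicate retrieval undermines accuracy and completeness. The key to its solution is the Hierarchical Retrieval with Evidence Curation (HiREC) framework, which uses hierarchical retrieval to reduce confusion among similar texts and an evidence curation process that removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information.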

Link: https://arxiv.org/abs/2505.20368
Authors: Jaeyoung Choe, Jihoon Kim, Woohwan Jung
Affiliations: Hanyang University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2025 (Findings)

Abstract:Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts. It first retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at this https URL.
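
The two-stage idea (documents first, then passages within them) can be illustrated with a toy sketch. The token-overlap scorer, the corpus layout, and every name below are assumptions chosen to keep the example self-contained; HiREC's actual retrievers and its evidence curation step are not reproduced here.

```python
def overlap(query, text):
    """Fraction of query tokens that also appear in the text (toy relevance score)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """corpus maps a document title to its list of passages.

    Stage 1 ranks documents by title match; stage 2 ranks passages, but only
    inside the documents that survived stage 1.
    """
    ranked_docs = sorted(corpus, key=lambda d: overlap(query, d), reverse=True)[:top_docs]
    candidates = [(doc, passage) for doc in ranked_docs for passage in corpus[doc]]
    candidates.sort(key=lambda pair: overlap(query, pair[1]), reverse=True)
    return candidates[:top_passages]
```

With two filings that both contain a passage beginning "Total revenue was", flat passage retrieval cannot tell them apart, whereas narrowing to the right filing first removes the near-duplicate from consideration.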

[NLP-156] Rethinking Text-based Protein Understanding: Retrieval or LLM?

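[Quick read]: This paper addresses two problems with current protein-text models: significant data leakage in existing benchmarks, and conventional natural language processing metrics that fail to assess model performance in this domain accurately. The key to its solution is to reorganize existing datasets and introduce a new evaluation framework based on biological entities, and to propose a retrieval-enhanced method that substantially improves protein-to-text generation without fine-tuning a large language model.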

Link: https://arxiv.org/abs/2505.20354
Authors: Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
Affiliations: Peking University; International Digital Economy Academy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model’s performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at this https URL.

[NLP-157] SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

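[Quick read]: This paper addresses the dependence of Large Language Model (LLM) training in specialized domains on high-quality instructions and verifiable rewards, both of which are often hard to obtain in practice. The key to its solution is Self-play Reinforcement Learning (SeRL), built around two complementary modules: self-instruction and self-rewarding. The former generates additional instructions at each training step and applies online filtering so that they are high-quality, diverse, and challenging; the latter estimates response rewards with a simple majority-voting mechanism, removing the need for external annotations. SeRL then runs conventional reinforcement learning on the generated data, enabling iterative self-play learning, and outperforms existing methods on several reasoning benchmarks.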

Link: https://arxiv.org/abs/2505.20347
Authors: Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
Affiliations: Zhejiang University; Nanyang Technological University; Hangzhou City University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at this https URL.
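
The self-rewarding module is described only as a majority-voting mechanism. One simple reading, sketched below as an assumption, is to sample several responses to the same instruction, treat the most frequent final answer as a pseudo-label, and give reward 1 to responses that agree with it. The function name and the binary reward are illustrative choices.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Assign reward 1.0 to answers matching the most common answer, else 0.0.

    answers: final answers extracted from several sampled responses to one
    instruction (the extraction step itself is assumed and not shown).
    """
    if not answers:
        return []
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority_answer else 0.0 for answer in answers]
```

No ground-truth label is consulted, which is what lets this kind of reward replace external annotation; it is only as reliable as the model's tendency to agree with itself on correct answers.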

[NLP-158] Do LLMs have a Gender (Entropy) Bias?

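[Quick read]: This paper investigates a specific type of gender bias in popular Large Language Models (LLMs): differences in the amount of information a model generates when answering real user questions. It contributes a new benchmark dataset, RealWorldQuestioning, spanning four key domains in business and health contexts, and defines and studies entropy bias, the discrepancy in the amount of information generated for questions from users of different genders. The key to its solution is a simple prompt-based debiasing method that iteratively merges the responses for men and women into a final result, producing responses with higher information content than both gendered variants in 78% of cases and a balanced integration in the remaining cases.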

Link: https://arxiv.org/abs/2505.20343
Authors: Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 4 figures

Abstract:We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We define and study entropy bias, which we define as a discrepancy in the amount of information generated by an LLM in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively by using ChatGPT-4o (as “LLM-as-judge”). Our analyses (metric-based comparisons and “LLM-as-judge” evaluation) suggest that there is no significant bias in LLM responses for men and women at a category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which “cancel” each other out often due to some responses being better for males and vice versa. This is still a concern since typical users of these tools often ask a specific question (only) as opposed to several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, thus producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.

[NLP-159] Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models

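[Quick read]: This paper addresses how to balance creativity and consistency in large language model generation, where the core challenge is understanding and controlling how semantics evolve during generation. The key to its solution is Dynamic Manifold Evolution Theory (DMET), which models generation as a controlled dynamical system on a low-dimensional semantic manifold. It treats latent-state updates as discrete-time Euler approximations of continuous dynamics, maps intrinsic energy-driven flows and context-dependent forces onto Transformer components, and uses Lyapunov stability theory to define three empirical metrics that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence.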

Link: https://arxiv.org/abs/2505.20340
Authors: Yukun Zhang, Qi Dong
Affiliations: The Chinese University of Hong Kong; Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Dynamic Manifold Evolution Theory (DMET), a unified framework that models large language model generation as a controlled dynamical system evolving on a low-dimensional semantic manifold. By casting latent-state updates as discrete-time Euler approximations of continuous dynamics, we map intrinsic energy-driven flows and context-dependent forces onto Transformer components (residual connections, attention, feed-forward networks). Leveraging Lyapunov stability theory, we define three empirical metrics (state continuity, clustering quality, topological persistence) that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence. Extensive experiments across decoding parameters validate DMET’s predictions and yield principled guidelines for balancing creativity and consistency in text generation.
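
To make the Euler reading concrete: a residual update of the form h + f(h) looks like one explicit Euler step of dh/dt = f(h). The sketch below illustrates only that correspondence. The quadratic energy V(h) = 0.5 * ||h||^2, the step size, and all function names are assumptions for the example and are not taken from the paper.

```python
def euler_step(h, drift, dt=0.1):
    """One explicit Euler step: h_{t+1} = h_t + dt * f(h_t)."""
    return [h_i + dt * f_i for h_i, f_i in zip(h, drift(h))]


def energy_gradient_flow(h):
    """Drift f(h) = -grad V(h) for the assumed energy V(h) = 0.5 * ||h||^2."""
    return [-h_i for h_i in h]


def simulate(h0, drift, steps, dt=0.1):
    """Apply repeated Euler steps and return the full latent trajectory."""
    trajectory = [list(h0)]
    for _ in range(steps):
        trajectory.append(euler_step(trajectory[-1], drift, dt))
    return trajectory
```

With this drift the state norm shrinks at every step, the kind of monotone behavior a Lyapunov-style stability argument formalizes; a context-dependent force would enter as an extra term added to the drift.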

[NLP-160] Assessing the Capability of LLMs in Solving POSCOMP Questions

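[Quick read]: This paper addresses how to assess the performance of Large Language Models (LLMs) in specialized domains such as computer science, in particular whether they can match or surpass human performance on a demanding exam. The key to its solution is to use POSCOMP, a highly challenging graduate admissions exam promoted by the Brazilian Computer Society (SBC), as a benchmark for systematically evaluating several advanced LLMs, quantifying their abilities on text comprehension and image interpretation tasks, and comparing results across exam years to analyze how LLMs keep improving and what that implies for their use in specialized domains.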

Link: https://arxiv.org/abs/2505.20338
Authors: Cayo Viegas, Rohit Gheyi, Márcio Ribeiro
Affiliations: UFCG (Federal University of Campina Grande); UFAL (Federal University of Alagoas)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models’ proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.
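
The raw counts reported for the 2022 exam are easier to compare as accuracies. The short sketch below does that arithmetic from the figures in the abstract (69 questions in total); the variable names are illustrative.

```python
TOTAL_QUESTIONS = 69  # number of questions in the 2022 exam, per the abstract

# Correct answers on the 2022 exam as reported in the abstract.
correct_2022 = {
    "ChatGPT-4": 57,
    "Gemini 1.0 Advanced": 49,
    "Le Chat Mistral": 48,
    "Claude 3 Sonnet": 44,
}

# Accuracy in percent, rounded to one decimal place.
accuracy_2022 = {
    model: round(100 * correct / TOTAL_QUESTIONS, 1)
    for model, correct in correct_2022.items()
}
```

This gives roughly 82.6% for ChatGPT-4, 71.0% for Gemini 1.0 Advanced, 69.6% for Le Chat Mistral, and 63.8% for Claude 3 Sonnet.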

[NLP-161] MOSLIM: Align with diverse preferences in prompts through reward classification

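[Quick read]: This paper addresses multi-objective alignment of Large Language Models (LLMs), that is, how to make foundation models conform to diverse user preferences. Existing methods typically rely on multiple policies or multiple reward models customized for different preferences, or require training a preference-specific supervised fine-tuning (SFT) model. The key to the proposed solution, MOSLIM, is to handle multiple objectives with a single reward model and a single policy model, to control those objectives flexibly through prompting, and to require no preference training during the SFT phase, so that a large number of off-the-shelf models can be used directly. MOSLIM uses a multi-head reward model that classifies question-answer pairs instead of scoring them, converts the classification results into reward scores through a mapping function, and uses those scores to optimize the policy model.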

Link: https://arxiv.org/abs/2505.20336
Authors: Yu Zhang, Wanli Jiang, Zhengyu Yang
Affiliations: Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimizes the policy model with a scalar reward derived from a mapping function that converts classification results from the reward model into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.
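
The abstract does not define the mapping function that turns classification results into a scalar reward. Purely as a hypothetical illustration, the sketch below averages, over the objectives named in the prompt, the probability each reward head assigns to the requested class. The data layout, the function name, and the averaging rule are all assumptions, not MOSLIM's actual mapping.

```python
def mapped_reward(head_probs, requested):
    """Convert multi-head classification output into one scalar reward.

    head_probs: {objective: {class_label: probability}} from a multi-head
        classifier (hypothetical output format).
    requested: {objective: class_label} parsed from the prompt (hypothetical).
    Returns the mean probability of the requested class across objectives.
    """
    if not requested:
        raise ValueError("at least one objective must be requested")
    total = 0.0
    for objective, label in requested.items():
        total += head_probs[objective].get(label, 0.0)
    return total / len(requested)
```

Because the requested classes come from the prompt, the same reward model can score a response against different preference mixes without retraining, which matches the flexibility the abstract describes.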

[NLP-162] Language Model Distillation: A Temporal Difference Imitation Learning Perspective

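[Quick read]: This paper addresses the high computational cost of large language models by compressing them into smaller, more efficient models through knowledge distillation. The key to its solution is a learning framework based on temporal difference that exploits the distributional sparsity of the teacher model: language models typically assign most of their probability mass to a small subset of the vocabulary. By operating on a reduced action space (a subset of the vocabulary), the method improves both the efficiency and the performance of distillation.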

Link: https://arxiv.org/abs/2505.20335
Authors: Zishun Yu, Shangzhe Li, Xinhua Zhang
Affiliations: The University of Illinois, Chicago; The University of North Carolina at Chapel Hill
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.
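
The reduced action space can be pictured as the smallest set of tokens that covers most of the teacher's probability mass. The sketch below builds such a set with a cumulative-mass cutoff and renormalizes within it. The cutoff rule and the names are assumptions for illustration; the abstract does not say how the paper selects the subset.

```python
def reduced_action_space(probs, mass=0.9):
    """Keep the highest-probability tokens until their total reaches `mass`.

    probs: {token: probability} from the teacher for one decoding step.
    Returns the kept tokens with probabilities renormalized to sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        total += p
        if total >= mass:
            break
    return {token: p / total for token, p in kept.items()}
```

Temporal difference updates would then range over this handful of tokens instead of the full vocabulary, which is where the efficiency gain would come from.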

[NLP-163] Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

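[Quick read]: This paper addresses a deployment problem for Large Language Models (LLMs): the memory used by the key-value cache (KV cache) during decoding grows substantially as text sequences get longer. Existing KV cache eviction methods prune tokens using attention scores from the prefilling stage, which is inconsistent with the queries seen during actual inference, and the problem is most pronounced under tight memory budgets. The proposed solution, Lookahead Q-Cache (LAQ), generates low-cost pseudo lookahead queries that better approximate the true decoding-stage queries and uses them as the observation window for importance estimation, yielding KV cache eviction that is more consistent with real inference and more accurate.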

Link: https://arxiv.org/abs/2505.20334
Authors: Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che
Affiliations: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures

Abstract:Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under a limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
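
The eviction step can be sketched independently of how the lookahead queries are produced: given a window of query vectors, score each cached key by the attention it receives and keep the top entries within the budget. Everything below is a simplified single-head illustration with assumed names; generating the pseudo lookahead queries, the part specific to LAQ, is taken as given input.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def select_kv_to_keep(keys, window_queries, budget):
    """Return the (sorted) positions of the `budget` most attended keys.

    window_queries: the observation window, e.g. pseudo lookahead queries.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    scores = [0.0] * len(keys)
    for query in window_queries:
        for i, weight in enumerate(attention_weights(query, keys)):
            scores[i] += weight
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Swapping prefilling-stage queries for lookahead queries changes only the `window_queries` argument, which is why such a method can be combined with existing scoring-based eviction schemes.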

[NLP-164] Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models

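[Quick read]: This paper addresses the opacity of the internal reasoning of Large Language Models (LLMs), which limits their interpretability and trustworthiness in critical applications. The key to its solution is a Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details, and introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints such as MINE or VIB). It also incorporates curvature regularization and hyperparameter tuning for stable optimization.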

Link: https://arxiv.org/abs/2505.20333
Authors: Yukun Zhang, Qi Dong
Affiliations: The Chinese University of Hong Kong; Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have achieved strong performance, yet their internal reasoning remains opaque, limiting interpretability and trust in critical applications. We propose a novel Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details. Our method introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints like MINE or VIB). We further incorporate curvature regularization and hyperparameter tuning for stable optimization. Theoretical analysis shows that alignment error, measured by KL divergence, can be bounded under mild assumptions. This framework offers a unified explanation of how LLMs structure multi-scale semantics, advancing interpretability and enabling applications such as bias detection and robustness enhancement.
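
Procrustes analysis, named in the abstract as an example of geometric alignment, finds the rigid transform that best maps one point set onto another. As a minimal sketch, the rotation-only case in two dimensions has a closed form. Real latent spaces are high-dimensional and would need the SVD-based solution, so the 2-D restriction and the function names here are simplifying assumptions.

```python
import math


def procrustes_rotation_angle(source, target):
    """Angle of the rotation R minimizing sum ||R x_i - y_i||^2 in 2-D.

    source, target: equal-length lists of (x, y) points, paired by index.
    """
    cross = sum(x1 * y2 - x2 * y1 for (x1, x2), (y1, y2) in zip(source, target))
    dot = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(points, angle):
    """Rotate 2-D points counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]
```

Applying `rotate(source, procrustes_rotation_angle(source, target))` aligns the two sets as closely as a pure rotation allows; the residual distance is then a simple measure of alignment error between two representation scales.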

[NLP-165] Cultural Awareness in Vision-Language Models: A Cross-Country Exploration

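[Quick read]: This paper addresses the internal biases of Vision-Language Models (VLMs) across cultural contexts, in particular how they encode cultural differences related to race, gender, and physical traits. The key to its solution is a novel framework of three retrieval-based tasks that systematically evaluates how VLMs associate different countries with specific racial groups, personal traits, and physical characteristics, revealing latent biases and stereotypes in the models.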

Link: https://arxiv.org/abs/2505.20326
Authors: Avinash Madasu, Vasudev Lal, Phillip Howard
Affiliations: Intel Labs; Thoughtworks
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries. We introduce three retrieval-based tasks: (1) Race to Country retrieval, which examines the association between individuals from specific racial groups (East Asian, White, Middle Eastern, Latino, South Asian, and Black) and different countries; (2) Personal Traits to Country retrieval, where images are paired with trait-based prompts (e.g., Smart, Honest, Criminal, Violent) to investigate potential stereotypical associations; and (3) Physical Characteristics to Country retrieval, focusing on visual attributes like skinny, young, obese, and old to explore how physical appearances are culturally linked to nations. Our findings reveal persistent biases in VLMs, highlighting how visual representations may inadvertently reinforce societal stereotypes.

[NLP-166] Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

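[Quick read]: This paper addresses the high computational cost of Test-Time Scaling (TTS) methods for improving Large Language Model (LLM) reasoning, especially the resources consumed by external Process Reward Models (PRMs) or sampling methods such as Best-of-N (BoN). The key to its solution is Guided by Gut (GG), an efficient self-guided TTS framework that runs a lightweight tree search guided solely by intrinsic LLM signals, namely token-level confidence and step novelty, and reaches PRM-level performance without an external verifier model. One key innovation is a targeted reinforcement learning fine-tuning phase that makes the internal confidence estimates more reliable.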

Link: https://arxiv.org/abs/2505.20325
Authors: Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.
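
The abstract does not give the formula GG uses for token-level confidence. One common choice, used here purely as an assumption, is the geometric mean of the token probabilities in a reasoning step, computed from log-probabilities. The sketch scores candidate steps this way and picks the most confident one; step novelty and the reinforcement learning calibration phase are not modeled.

```python
import math


def step_confidence(token_logprobs):
    """Geometric mean of token probabilities for one candidate reasoning step."""
    if not token_logprobs:
        raise ValueError("a step needs at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def pick_most_confident(candidates):
    """candidates: {step_text: [token log-probabilities]} (hypothetical format).

    Returns the step text whose tokens the model was most confident about.
    """
    return max(candidates, key=lambda step: step_confidence(candidates[step]))
```

Because these scores come from the generating model itself, a tree search can rank branches without calling a separate verifier, which is the source of the memory and speed savings the abstract reports.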

[NLP-167] PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

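Quick read: This paper addresses the modeling of temporal dynamics in clinical narratives, which supports the analysis of patient trajectories but currently lacks large-scale temporally annotated resources. The key to its solution is the PMOA-TTS dataset: a scalable LLM-based pipeline converts 124,699 PubMed Open Access (PMOA) case reports into structured (event, time) timelines, combining heuristic filtering with Llama 3.3 to identify single-patient case reports and using Llama 3.3 and DeepSeek R1 for prompt-driven event extraction, which yields over 5.6 million timestamped clinical events.

Link: https://arxiv.org/abs/2505.20323
Authors: Shahriar Noroozizadeh,Sayantan Kumar,George H. Chen,Jeremy C. Weiss
Affiliations: Carnegie Mellon University; National Institutes of Health
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: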

Abstract:Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 ± 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: this https URL .
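
The temporal concordance metric can be illustrated with a small sketch. It assumes the c-index is computed over pairs of matched events, counting a pair as concordant when the predicted timestamps preserve the reference order. The function name and the tie handling are assumptions, not details taken from the paper.

```python
from itertools import combinations


def concordance_index(reference_times, predicted_times):
    """Fraction of comparable event pairs whose order agrees.

    Pairs tied in the reference are skipped; pairs tied only in the
    prediction count as half-concordant (assumed convention).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(reference_times)), 2):
        ref_diff = reference_times[i] - reference_times[j]
        if ref_diff == 0:
            continue  # not comparable
        pred_diff = predicted_times[i] - predicted_times[j]
        comparable += 1
        if pred_diff == 0:
            concordant += 0.5
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every pair of events is ordered as in the reference timeline, and 0.5 corresponds to random ordering.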

[NLP-168] Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Quick read: This paper addresses insufficient control precision and potential side effects in language model generation, particularly in complex models whose large parameter counts lead to highly entangled internal representations. The key to its solution is Steering Target Atoms (STA), which isolates and manipulates disentangled knowledge components to improve safety and controllability.

Link: https://arxiv.org/abs/2505.20322
Authors: Mengru Wang,Ziwen Xu,Shengyu Mao,Shumin Deng,Zhaopeng Tu,Huajun Chen,Ningyu Zhang
Affiliations: Zhejiang University; Tencent AI Lab; National University of Singapore; NUS-NCS Joint Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.

[NLP-169] BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

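Quick read: This paper addresses the difficulty that text-to-SQL systems in the biomedical domain have in mapping qualitative scientific questions to executable SQL queries, especially when implicit domain reasoning is required. The key to its solution is BiomedSQL, a benchmark of 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires the model to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone.

Link: https://arxiv.org/abs/2505.20321
Authors: Mathew J. Koretsky,Maya Willey,Adi Asija,Owen Bianchi,Chelsea X. Alvarado,Tanay Nayak,Nicole Kuznetsov,Sungwon Kim,Mike A. Nalls,Daniel Khashabi,Faraz Faghri
Affiliations: Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC; Johns Hopkins University; Laboratory of Neurogenetics, NIA, NIH
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review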

Abstract:Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at this https URL, and our code is open-source at this https URL.

[NLP-170] Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP

Quick read: This paper addresses the challenge of long-text classification with Large Language Models (LLMs), in particular the bottlenecks caused by token limits and high computational cost. The key to its solution is a Retrieval Augmented Generation (RAG) approach that extracts only the text segments most relevant to the classification query instead of processing entire clinical documents, which reduces token usage while maintaining classification accuracy.

Link: https://arxiv.org/abs/2505.20320
Authors: Satya Narayana Cheetirala,Ganesh Raut,Dhavalkumar Patel,Fabio Sanatana,Robert Freeman,Matthew A Levin,Girish N. Nadkarni,Omar Dawkins,Reba Miller,Randolph M. Steinhagen,Eyal Klang,Prem Timsina
Affiliations: Institute for Healthcare Delivery Science; Department of Anesthesiology, Perioperative and Pain Medicine; Hasso Plattner Institute for Digital Health; Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA; Department of Surgery; Department of Surgery, Senior Quality Data Analyst – ISM; Department of Surgery, Division of Quality and Patient Safety, All at Icahn School of Medicine at Mount Sinai, NY, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long text classification is challenging for Large Language Models (LLMs) due to token limits and high computational costs. This study explores whether a Retrieval Augmented Generation (RAG) approach using only the most relevant text segments can match the performance of processing entire clinical notes with large context LLMs. We begin by splitting clinical documents into smaller chunks, converting them into vector embeddings, and storing these in a FAISS index. We then retrieve the top 4,000 words most pertinent to the classification query and feed these consolidated segments into an LLM. We evaluated three LLMs (GPT4o, LLaMA, and Mistral) on a surgical complication identification task. Metrics such as AUC ROC, precision, recall, and F1 showed no statistically significant differences between the RAG based approach and whole-text processing (p > 0.05). These findings indicate that RAG can significantly reduce token usage without sacrificing classification accuracy, providing a scalable and cost effective solution for analyzing lengthy clinical documents.
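
The chunk, rank, and truncate flow described in the abstract can be sketched as follows. To stay self-contained, the sketch swaps the paper's vector embeddings and FAISS index for bag-of-words cosine similarity in pure Python. All function names, the chunk size, and the stopping rule at the word budget are assumptions.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def _bow(text):
    # Bag-of-words stand-in for a learned embedding (assumption).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(text, query, chunk_size=200, word_budget=4000):
    """Return the most query-relevant chunks that fit in word_budget words."""
    q = _bow(query)
    ranked = sorted(chunk_words(text, chunk_size),
                    key=lambda c: _cosine(_bow(c), q), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n > word_budget:
            break  # stop once the next chunk would exceed the budget
        selected.append(chunk)
        used += n
    return selected
```

The selected chunks would then be concatenated and passed to the LLM together with the classification prompt.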

[NLP-171] Beyond Demonstrations: Dynamic Vector Construction from Latent Representations

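Quick read: This paper addresses limitations of existing In-Context derived Vector (ICV) methods: they remain sensitive to factors specific to In-Context Learning (ICL), use coarse or semantically fragmented representations as the vector source, and rely on heuristic injection positions, all of which limit their applicability. The key to its solution is Dynamic Vector (DyVec), which uses an Exhaustive Query Rotation (EQR) strategy to extract robust, semantically aggregated latent representations, applies Dynamic Latent Segmentation and Injection to partition the representations adaptively according to task complexity, and uses REINFORCE-based optimization to learn the best injection position for each segment, enabling more efficient and flexible inference-time task adaptation.

Link: https://arxiv.org/abs/2505.20318
Authors: Wang Cai,Hsiu-Yuan Huang,Zhixiang Wang,Yunfang Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: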

Abstract:In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.

[NLP-172] Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL

Quick read: This paper addresses the translation of natural language into executable SQL (Text2SQL), in particular the accuracy bottleneck in generating complex queries. The key to its solution is Arctic-Text2SQL-R1, a reinforcement learning (RL) framework that generates accurate, executable SQL using a lightweight reward signal based solely on execution correctness. It avoids intermediate supervision and complex reward design, which promotes stable training and alignment with the end task.

Link: https://arxiv.org/abs/2505.20315
Authors: Zhewei Yao,Guoheng Sun,Lukasz Borchmann,Zheyu Shen,Minghang Deng,Bohan Zhai,Hao Zhang,Ang Li,Yuxiong He
Affiliations: Snowflake AI Research; University of Maryland; University of California, San Diego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 2 figures

Abstract:Translating natural language into SQL (Text2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL, particularly for complex queries, remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Text2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework’s scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Text2SQL research.
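
A reward based solely on execution correctness can be sketched with Python's standard `sqlite3` module. The binary 0/1 scoring, the order-insensitive comparison of result rows, and the function name are assumptions for illustration. The paper's exact reward may differ, and a real setup would also run untrusted SQL against a read-only copy of the database.

```python
import sqlite3


def execution_reward(conn, predicted_sql, gold_sql):
    """Return 1.0 if both queries yield the same rows, else 0.0.

    Rows are compared as multisets, ignoring order (assumed convention).
    Queries that fail to execute receive 0.0.
    """
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return 0.0
    gold = conn.execute(gold_sql).fetchall()
    same = sorted(predicted, key=repr) == sorted(gold, key=repr)
    return 1.0 if same else 0.0
```

Because the signal depends only on the final result set, it needs no annotation of intermediate reasoning steps.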

[NLP-173] Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

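Quick read: This paper addresses undesirable behaviors in Large Language Model (LLM) generation, such as producing unsafe content or failing to follow safety guidelines, which conventional methods handle through costly fine-tuning. The key to its solution is a lightweight, trainable controller network that regulates the model dynamically at inference time. The controller observes specific intermediate activations and predicts a global scaling factor and layer-specific weights, which dynamically adjust the strength of a pre-computed "refusal direction" vector, giving fine-grained control over model behavior without modifying the original model parameters.

Link: https://arxiv.org/abs/2505.20309
Authors: Amr Hegazy,Mostafa Elhoushi,Amr Alanwar
Affiliations: The German University in Cairo; Technical University of Munich
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: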

Abstract:Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed “refusal direction” vector, applied across the LLM’s layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks like ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.
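
The modulation step can be sketched under the assumption that the steering patch is additive: each layer's hidden state is shifted by the refusal direction, scaled by the product of the global factor and that layer's weight. The additive form, the function name, and the plain-list representation are assumptions. In the paper the controller network predicts the scale and weights, whereas here they are passed in directly.

```python
def apply_weighted_steering(hidden_states, direction, global_scale, layer_weights):
    """Return steered copies of per-layer hidden states.

    hidden_states: one vector (list of floats) per layer.
    direction: pre-computed steering vector, same length as each state.
    Layer l is shifted by global_scale * layer_weights[l] * direction.
    """
    if len(hidden_states) != len(layer_weights):
        raise ValueError("need one weight per layer")
    steered = []
    for state, weight in zip(hidden_states, layer_weights):
        if len(state) != len(direction):
            raise ValueError("direction and hidden state sizes differ")
        steered.append([x + global_scale * weight * d
                        for x, d in zip(state, direction)])
    return steered
```

A global scale near zero leaves benign inputs essentially untouched, which matches the abstract's description of steering being activated primarily for harmful inputs.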

[NLP-174] ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

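Quick read: This paper addresses the generation of high-quality, executable command sequences in command-line interfaces (CLIs) for behavioral modeling. Existing datasets lack execution details such as exit codes, outputs, and environmental side effects, which limits their use for behavioral modeling. The key to its solution is the Shell Input-Output Environment (ShIOEnv), which models command construction as a Markov Decision Process, executes each candidate command to return its exit status, output, and progress toward a behavioral objective, and uses a context-free grammar derived from man pages to mask invalid arguments, improving sample efficiency and dataset quality.

Link: https://arxiv.org/abs/2505.18374
Authors: Jarrod Ragsdale,Rajendra Boppana
Affiliations: The University of Texas at San Antonio
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures, conference preprint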

Abstract:Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input-Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.

[NLP-175] InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning ACL2025

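Quick read: This paper addresses the shortcomings of current large multimodal foundation models in understanding object parts and their functions, in particular their weak performance on task-oriented part segmentation. The key to its solution is InstructPart, a new real-world benchmark that contains hand-labeled part segmentation annotations and task-oriented instructions for evaluating how well models understand and execute part-level tasks in everyday contexts, together with a fine-tuning strategy that improves model performance.

Link: https://arxiv.org/abs/2505.18291
Authors: Zifu Wan,Yaqi Xie,Ce Zhang,Zhiqiu Lin,Zihan Wang,Simon Stepputtis,Deva Ramanan,Katia Sycara
Affiliations: Robotics Institute, Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Accepted by ACL 2025 Main. Project page: this https URL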

Abstract:Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object’s functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: this https URL.

[NLP-176] Graph RAG for Legal Norms: A Hierarchical and Temporal Approach

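Quick read: This paper addresses the complexity and data volume involved in analyzing and understanding legal norms, which stem from the predefined hierarchical structure of legal texts, their extensive network of internal and external references, and their multiple temporal versions. The key to its solution is to combine structured knowledge graphs with contextually enriched text segments and to introduce hierarchical structure, temporal evolution, and the concept of comprehensive Text Units, building richer, interconnected representations of legal knowledge and making legal data processing more effective.

Link: https://arxiv.org/abs/2505.00039
Authors: Hudson de Martim
Affiliations: Federal Senate of Brazil
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: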

Abstract:This article proposes an adaptation of Graph Retrieval Augmented Generation (Graph RAG) specifically designed for the analysis and comprehension of legal norms, which are characterized by their predefined hierarchical structure, extensive network of internal and external references and multiple temporal versions. By combining structured knowledge graphs with contextually enriched text segments, Graph RAG offers a promising solution to address the inherent complexity and vast volume of legal data. The integration of hierarchical structure and temporal evolution into knowledge graphs - along with the concept of comprehensive Text Units - facilitates the construction of richer, interconnected representations of legal knowledge. Through a detailed analysis of Graph RAG and its application to legal norm datasets, this article aims to advance the field of Artificial Intelligence applied to Law, creating opportunities for more effective systems in legal research, legislative analysis, and decision support.

[NLP-177] CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

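Quick read: This paper addresses the safe conversion of legacy C code into modern Rust to improve code safety and interoperability with the Rust ecosystem. The key to its solution is CRUST-Bench, a dataset of 100 C repositories, each paired with manually written interfaces in safe Rust and test cases for validating the correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files, and the provided Rust interfaces ensure adherence to memory-safe, idiomatic Rust patterns.

Link: https://arxiv.org/abs/2504.15254
Authors: Anirudh Khatry,Robert Zhang,Jia Pan,Ziteng Wang,Qiaochu Chen,Greg Durrett,Isil Dillig
Affiliations: The University of Texas at Austin; New York University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: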

Abstract:C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at this https URL.

[NLP-178] Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants

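Quick read: This paper addresses the decoding of perceived natural speech from functional magnetic resonance imaging (fMRI) data acquired from a limited number of participants. The key to its solution is to train deep neural networks to predict LLM-derived text representations from fMRI activity and to examine how multi-subject training affects decoding accuracy. The study finds that, in the current data regime, multi-subject training does not outperform the single-subject approach and that using similar or different stimuli across subjects has little effect on decoding accuracy. This suggests that multi-subject data alone may not improve natural speech decoding, which likely requires deeper phenotyping of individuals or a substantially larger cohort.

Link: https://arxiv.org/abs/2505.21304
Authors: Louis Jalouzot,Alexis Thual,Yair Lakretz,Christophe Pallier,Bertrand Thirion
Affiliations: CEA; ENS; Université Paris-Saclay; EHESS; CNRS; INSERM; INRIA; Université PSL
Categories: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 6 figures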

Abstract:We investigate optimal strategies for decoding perceived natural speech from fMRI data acquired from a limited number of participants. Leveraging Lebel et al. (2023)'s dataset of 8 participants, we first demonstrate the effectiveness of training deep neural networks to predict LLM-derived text representations from fMRI activity. Then, in this data regime, we observe that multi-subject training does not improve decoding accuracy compared to a single-subject approach. Furthermore, training on similar or different stimuli across subjects has a negligible effect on decoding accuracy. Finally, we find that our decoders better model syntactic than semantic features, and that stories containing sentences with complex syntax or rich semantic content are more challenging to decode. While our results demonstrate the benefits of having extensive data per participant (deep phenotyping), they suggest that leveraging multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort.

[NLP-179] PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems

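Quick read: This paper addresses the difficulty of evaluating automatic speech recognition (ASR) systems for low-resource languages such as Persian. The key to its solution is the Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark that covers diverse linguistic and acoustic conditions to evaluate ASR systems more thoroughly, together with a novel metric that weights substitution errors to make the evaluation more robust and precise.

Link: https://arxiv.org/abs/2505.21230
Authors: Nima Sedghiyeh,Sara Sadeghi,Reza Khodadadi,Farzin Kashani,Omid Aghdaei,Somayeh Rahimi,Mohammad Sadegh Safari
Affiliations: PartDP.ai
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 25 pages, 7 figures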

Abstract:Although Automatic Speech Recognition (ASR) systems have become an integral part of modern technology, their evaluation remains challenging, particularly for low-resource languages such as Persian. This paper introduces Persian Speech Recognition Benchmark (PSRB), a comprehensive benchmark designed to address this gap by incorporating diverse linguistic and acoustic conditions. We evaluate ten ASR systems, including state-of-the-art commercial and open-source models, to examine performance variations and inherent biases. Additionally, we conduct an in-depth analysis of Persian ASR transcriptions, identifying key error types and proposing a novel metric that weights substitution errors. This metric enhances evaluation robustness by reducing the impact of minor and partial errors, thereby improving the precision of performance assessment. Our findings indicate that while ASR models generally perform well on standard Persian, they struggle with regional accents, children’s speech, and specific linguistic challenges. These results highlight the necessity of fine-tuning and incorporating diverse, representative training datasets to mitigate biases and enhance overall ASR performance. PSRB provides a valuable resource for advancing ASR research in Persian and serves as a framework for developing benchmarks in other low-resource languages. A subset of the PSRB dataset is publicly available at this https URL.
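
The abstract does not specify how substitution errors are weighted. As a hedged illustration of the general idea, the sketch below computes a word error rate in which a substitution costs the normalized character edit distance between the two words, so near-misses are penalized less than unrelated words. This weighting scheme and all function names are assumptions, not the PSRB metric itself.

```python
def _edit_distance(a, b, sub_cost):
    """Levenshtein distance with unit insert/delete and a custom substitution cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [float(i)]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j] + 1.0,          # deletion
                           cur[j - 1] + 1.0,       # insertion
                           prev[j - 1] + sub_cost(a[i - 1], b[j - 1])))
        prev = cur
    return prev[-1]


def _char_sub(x, y):
    return 0.0 if x == y else 1.0


def _word_sub(x, y):
    # Partial credit: cost in [0, 1] based on character-level distance.
    if x == y:
        return 0.0
    return _edit_distance(x, y, _char_sub) / max(len(x), len(y))


def weighted_wer(reference, hypothesis):
    """Word error rate with similarity-weighted substitutions."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return _edit_distance(ref, hyp, _word_sub) / len(ref)
```

With this weighting, a one-character slip in a five-letter word costs 0.2 of an error instead of a full substitution.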

[NLP-180] BrainStratify: Coarse-to-Fine Disentanglement of Intracranial Neural Dynamics

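Quick read: This paper addresses decoding speech directly from intracranial neural signals, a central problem in brain-computer interface (BCI) research, and in particular two key challenges in signals such as stereo-electroencephalography (sEEG) and electrocorticography (ECoG): task-relevant signals are sparsely distributed across electrodes, and they are entangled with task-irrelevant signals in both time and space. The key to its solution is BrainStratify, a unified coarse-to-fine neural disentanglement framework that identifies functional groups through spatial-context-guided temporal-spatial modeling and uses Decoupled Product Quantization (DPQ) to disentangle distinct neural dynamics within the target functional group, improving the performance and interpretability of speech decoding.

Link: https://arxiv.org/abs/2505.20480
Authors: Hui Zheng,Hai-Teng Wang,Yi-Tao Jing,Pei-Yang Lin,Han-Qing Zhao,Wei Chen,Peng-Hu Wei,Yong-Zhi Shan,Guo-Guang Zhao,Yun-Zhe Liu
Affiliations: Beijing Normal University; Peking University; Institute of Automation, Chinese Academy of Sciences; Chinese Institute for Brain Research; Capital Medical University, Xuanwu Hospital, Beijing
Categories: Signal Processing (eess.SP); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: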

Abstract:Decoding speech directly from neural activity is a central goal in brain-computer interface (BCI) research. In recent years, exciting advances have been made through the growing use of intracranial field potential recordings, such as stereo-ElectroEncephaloGraphy (sEEG) and ElectroCorticoGraphy (ECoG). These neural signals capture rich population-level activity but present key challenges: (i) task-relevant neural signals are sparsely distributed across sEEG electrodes, and (ii) they are often entangled with task-irrelevant neural signals in both sEEG and ECoG. To address these challenges, we introduce a unified Coarse-to-Fine neural disentanglement framework, BrainStratify, which includes (i) identifying functional groups through spatial-context-guided temporal-spatial modeling, and (ii) disentangling distinct neural dynamics within the target functional group using Decoupled Product Quantization (DPQ). We evaluate BrainStratify on two open-source sEEG datasets and one (epidural) ECoG dataset, spanning tasks like vocal production and speech perception. Extensive experiments show that BrainStratify, as a unified framework for decoding speech from intracranial neural signals, significantly outperforms previous decoding methods. Overall, by combining data-driven stratification with neuroscience-inspired modularity, BrainStratify offers a robust and interpretable solution for speech decoding from intracranial recordings.

[NLP-181] Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset INTERSPEECH2025

Quick read: This paper addresses emotional shifts and inconsistencies introduced by text changes in Text-based Speech Editing (TSE). Existing methods focus mainly on the content accuracy and acoustic consistency of synthesized speech segments and overlook stability at the emotional level. The key to its solution is EmoCorrector, a post-correction framework based on Retrieval-Augmented Generation (RAG) that extracts the emotional features of the edited text, retrieves speech samples with matching emotions, and synthesizes speech aligned with the target emotion while preserving speaker identity and speech quality.

Link: https://arxiv.org/abs/2505.20341
Authors: Rui Liu,Pu Gao,Jiatian Xi,Berrak Sisman,Carlos Busso,Haizhou Li
Affiliations: Inner Mongolia University; Center for Language and Speech Processing; SRIBD, School of Data Science; Department of ECE
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: INTERSPEECH2025. Code and audio examples: this https URL

Abstract:Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments and often overlook the emotional shifts or inconsistency issues introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text’s emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker’s identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we pioneer a benchmark, the Emotion Correction Dataset for TSE (ECD-TSE). A prominent feature of ECD-TSE is its inclusion of paired text and speech data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at this https URL.

Computer Vision

[CV-0] Generalizable and Relightable Gaussian Splatting for Human Novel View Synthesis

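Quick read: This paper addresses high-fidelity human novel view synthesis under diverse lighting conditions, where existing methods either rely on per-character optimization or ignore physical constraints. The key to its solution is GRGS, a generalizable and relightable 3D Gaussian framework that uses a feed-forward, fully supervised strategy to project geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. Its core innovations are a Lighting-aware Geometry Refinement (LGR) module and a Physically Grounded Neural Rendering (PGNR) module, which provide high-quality geometry reconstruction and editable relighting.

Link: https://arxiv.org/abs/2505.21502
Authors: Yipengjing Sun,Chenyang Wang,Shunyuan Zheng,Zonglin Li,Shengping Zhang,Xiangyang Ji
Affiliations: Harbin Institute of Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Webpage: this https URL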

Abstract:We propose GRGS, a generalizable and relightable 3D Gaussian framework for high-fidelity human novel view synthesis under diverse lighting conditions. Unlike existing methods that rely on per-character optimization or ignore physical constraints, GRGS adopts a feed-forward, fully supervised strategy that projects geometry, material, and illumination cues from multi-view 2D observations into 3D Gaussian representations. Specifically, to reconstruct lighting-invariant geometry, we introduce a Lighting-aware Geometry Refinement (LGR) module trained on synthetically relit data to predict accurate depth and surface normals. Based on the high-quality geometry, a Physically Grounded Neural Rendering (PGNR) module is further proposed to integrate neural prediction with physics-based shading, supporting editable relighting with shadows and indirect illumination. Besides, we design a 2D-to-3D projection training scheme that leverages differentiable supervision from ambient occlusion, direct, and indirect lighting maps, which alleviates the computational cost of explicit ray tracing. Extensive experiments demonstrate that GRGS achieves superior visual quality, geometric consistency, and generalization across characters and lighting conditions.

[CV-1] Vision Transformers with Self-Distilled Registers

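[Quick read]: This paper addresses artifact tokens in Vision Transformers (ViTs): anomalous tokens that are inconsistent with local semantics and degrade performance on tasks requiring fine-grained localization or structural coherence. The key to its solution is Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing pre-trained ViT without retraining it from scratch. It applies test-time augmentation to generate denoised dense embeddings and optimizes only a small subset of unlocked student weights, which effectively reduces the number of artifact tokens.

Link: https://arxiv.org/abs/2505.21501
Authors: Yinjie Chen,Zipeng Yan,Chong Zhou,Bo Dai,Andrew F. Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 14 figures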

Abstract:Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly “absorb” the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher’s inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.

[CV-2] Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

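[Quick read]: This paper addresses the limited transferability of adversarial examples against multimodal large language models (MLLMs). Existing methods typically achieve targeted attacks by aligning global features (such as CLIP's [CLS] token) but overlook the rich local information in patch tokens, which leads to suboptimal alignment and limited transferability, particularly against closed-source models. The key to its solution is FOA-Attack, a targeted transferable adversarial attack based on feature optimal alignment. At the global level, it introduces a cosine-similarity-based global feature loss to align coarse-grained features. At the local level, it uses clustering to extract compact local patterns and formulates local feature alignment between adversarial and target samples as an optimal transport (OT) problem. It also designs a dynamic ensemble model weighting strategy that adaptively balances the influence of multiple models, improving transferability.

Link: https://arxiv.org/abs/2505.21494
Authors: Xiaojun Jia,Sensen Gao,Simeng Qin,Tianyu Pang,Chao Du,Yihao Huang,Xinfeng Li,Yiming Li,Bo Li,Yang Liu
Affiliations: Nanyang Technological University; MBZUAI; Sea AI Lab; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: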

Abstract:Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. While existing methods typically achieve targeted attacks by aligning global features, such as CLIP’s [CLS] token, between adversarial and target samples, they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose a targeted transferable adversarial attack method based on feature optimal alignment, called FOA-Attack, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we leverage clustering techniques to extract compact local patterns to alleviate redundant local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy to adaptively balance the influence of multiple models during adversarial example generation, thereby further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, outperforming state-of-the-art methods, especially in transferring to closed-source MLLMs. The code is released at this https URL.
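
The global-level term described in the abstract is a loss based on cosine similarity between adversarial and target features. A minimal sketch of that idea follows; the function names and the `1 - cos` form are assumptions made for illustration, not the paper's released code, and the clustering and optimal transport terms are not covered here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_feature_loss(adv_feat, target_feat):
    """Hypothetical loss: 1 - cos, so minimizing it pulls the features into alignment."""
    return 1.0 - cosine_similarity(adv_feat, target_feat)
```

The loss is 0 when the two global features point in the same direction and grows to 2 when they point in opposite directions, so gradient descent on the adversarial input would push its global feature toward the target's.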

[CV-3] Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

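[Quick read]: This paper addresses controllability, temporal coherence, and detail synthesis in video generation, focusing on a commonly used yet underexplored cinematic technique: Frame In and Frame Out. The key to its solution is to use user-specified motion trajectories to control objects in the image so that they naturally leave the scene, or so that new identity references enter it. To this end, it proposes an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture and builds a semi-automatically curated dataset with a matching evaluation protocol.

Link: https://arxiv.org/abs/2505.21491
Authors: Boyang Wang,Xuweiyi Chen,Matheus Gadelha,Zezhou Cheng
Affiliations: University of Virginia; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: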

Abstract:Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control the objects in the image to naturally leave the scene or provide brand-new identity references to enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.

[CV-4] Be Decisive: Noise-Induced Layouts for Multi-Subject Generation SIGGRAPH2025

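[Quick read]: This paper addresses subject leakage when existing text-to-image diffusion models generate multiple distinct subjects; the problem is usually triggered by complex prompts and causes inaccuracies in quantities, attributes, and visual features. The key to its solution is to predict a prompt-aligned spatial layout that is derived from the initial noise and refined throughout denoising. Relying on this noise-induced layout avoids conflicts with externally imposed layouts and better preserves the model's prior. The method uses a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency.

Link: https://arxiv.org/abs/2505.21488
Authors: Omer Dahary,Yehonathan Cohen,Or Patashnik,Kfir Aberman,Daniel Cohen-Or
Affiliations: Tel Aviv University; Snap Research; Israel; United States of America
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: SIGGRAPH 2025. Project page: this https URL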

Abstract:Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject’s spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model’s prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model’s prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model’s original distribution.

[CV-5] MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

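[Quick read]: This paper addresses challenges in object compositing for augmented reality (AR) and embodied intelligence applications: multi-view consistency, complex scenes, and diverse lighting conditions. Existing methods focus mainly on single-image scenarios or intrinsic decomposition techniques and struggle with reconstruction efficiency and consistency in large-scale scenes. The key to its solution is MV-CoLight, a two-stage framework whose feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. It uses a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations, improving the consistency and generalization of the composited results.

Link: https://arxiv.org/abs/2505.21483
Authors: Kerui Ren,Jiayang Bai,Linning Xu,Lihan Jiang,Jiangmiao Pang,Mulin Yu,Bo Dai
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Nanjing University; The Chinese University of Hong Kong; University of Science and Technology of China; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: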

Abstract:Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden the applicability of object compositing, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, and results on casually captured real-world scenes demonstrate the framework’s robustness and wide generalization.
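
The abstract mentions a Hilbert curve-based mapping but does not spell out its details. As background only, the sketch below shows the standard index-to-coordinate conversion for a 2D Hilbert curve and uses it to flatten a square grid into a 1D sequence in which neighbouring entries stay spatially adjacent. The function names are illustrative, and how MV-CoLight actually pairs this ordering with 3D Gaussians is not shown here.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_flatten(grid):
    """Flatten a square 2**k x 2**k grid (indexed grid[y][x]) in Hilbert order."""
    n = len(grid)
    order = n.bit_length() - 1
    return [grid[y][x] for x, y in (hilbert_d2xy(order, d) for d in range(n * n))]
```

Unlike row-major flattening, consecutive positions in the Hilbert sequence are always one grid step apart, which is the locality property that makes such curves attractive for serializing spatial data.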

[CV-6] Policy Optimized Text-to-Image Pipeline Design

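[Quick read]: This paper addresses the inefficiency of designing multi-component text-to-image pipelines, in particular the heavy computational cost and poor generalization of automated design. The key to its solution is a reinforcement learning-based framework that trains an ensemble of reward models to predict image quality scores directly from prompt-workflow combinations, avoiding costly image generation during training. It then applies a two-phase training strategy in which GRPO-based optimization guides the model toward higher-performing regions of the workflow space, and adds a classifier-free-guidance-based enhancement technique to further improve output quality.

Link: https://arxiv.org/abs/2505.21478
Authors: Uri Gadot,Rinon Gal,Yftah Ziser,Gal Chechik,Shie Mannor
Affiliations: Technion; NVIDIA Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: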

Abstract:Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.
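
GRPO, which the abstract relies on for its second training phase, scores each sampled output relative to the other outputs drawn for the same prompt instead of using a learned value function. The sketch below illustrates only that group-relative normalization step. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; the paper's reward models and policy update are not reproduced.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and (roughly) unit scale.

    `rewards` holds the scores of several candidate workflows sampled for
    the same prompt; `eps` avoids division by zero when all scores match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates that beat their group's average receive a positive advantage and are reinforced, while below-average candidates are discouraged, whatever the absolute scale of the predicted quality scores.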

[CV-7] DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

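[Quick read]: This paper addresses the quality and efficiency limitations of conventional 1D autoregressive (AR) image generation, especially the large token counts, low computational efficiency, and accumulated sampling error in high-resolution generation. The key to its solution is a coarse-to-fine strategy: by learning a resolution-aware token sequence supervised with progressively degraded images, generation starts from the global structure and incrementally refines details, which improves quality while reducing the token count. A parallel inference mechanism with self-correction further accelerates generation and reduces the sampling error caused by teacher-forcing supervision.

Link: https://arxiv.org/abs/2505.21473
Authors: Yiheng Liu,Liao Qu,Huichao Zhang,Xu Wang,Yi Jiang,Yiming Gao,Hu Ye,Xian Li,Shuai Wang,Daniel K. Du,Shu Cheng,Zehuan Yuan,Xinglong Wu
Affiliations: ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: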

Abstract:This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches such as VAR and VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, due to the significantly reduced token count and parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow’s superior generation quality and efficiency compared to existing state-of-the-art methods.

[CV-8] LazyVLM: Neuro-Symbolic Approach to Video Analytics

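[Quick read]: This paper addresses the fundamental trade-off between flexibility and efficiency in current video analytics. End-to-end Vision Language Models (VLMs) struggle with long-context processing and are computationally expensive, while neuro-symbolic methods depend on manual labeling and rigid rule design. The proposed solution is LazyVLM; its key idea is to decompose multi-frame video queries into fine-grained operations and offload the bulk of the processing to efficient relational query execution and vector similarity search, overcoming the scalability limits of VLMs while keeping a user-friendly query interface.

Link: https://arxiv.org/abs/2505.21459
Authors: Xiangru Jian,Wei Pang,Zhengyuan Dong,Chao Zhang,M. Tamer Özsu
Affiliations: University of Waterloo
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 5 pages, 2 figures, Working paper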

Abstract:Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neuro-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitations. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To this end, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain video data at scale.

[CV-9] Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

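[Quick read]: This paper addresses how to equip Multimodal Large Language Models (MLLMs) with active perception. Although active perception is important in embodied intelligence, there has been little systematic study of how MLLMs can acquire or learn it. The paper proposes ACTIVE-O3, a reinforcement learning-based training framework built on GRPO and designed to give MLLMs active perception capabilities. Its key idea is to use reinforcement learning to optimize the model's perception strategy in complex tasks, improving search efficiency and region selection accuracy, and to validate the approach with a comprehensive benchmark suite.

Link: https://arxiv.org/abs/2505.21457
Authors: Muzhi Zhu,Hao Zhong,Canyu Zhao,Zongze Du,Zheng Huang,Mingyu Liu,Hao Chen,Cheng Zou,Jingdong Chen,Ming Yang,Chunhua Shen
Affiliations: Zhejiang University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL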

Abstract:Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model’s zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.

[CV-10] Visual Product Graph: Bridging Visual Products And Composite Images For End-to-End Style Recommendations

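[Quick read]: This paper addresses the retrieval of semantically similar but visually distinct content in visual search systems, that is, finding content in large image collections that is semantically related to a given product but looks different. The key to its solution is Visual Product Graph (VPG), an online real-time retrieval system that combines high-performance storage infrastructure with state-of-the-art computer vision models for image understanding, enabling navigation from individual products to composite scenes containing those products, along with complementary recommendations.

Link: https://arxiv.org/abs/2505.21454
Authors: Yue Li Du,Ben Alexander,Mikhail Antonenka,Rohan Mahadev,Hao-yu Wu,Dmitry Kislyuk
Affiliations: Pinterest, Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures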

Abstract:Retrieving semantically similar but visually distinct content has been a critical capability in visual search systems. In this work, we aim to tackle this problem with Visual Product Graph (VPG), leveraging high-performance infrastructure for storage and state-of-the-art computer vision models for image understanding. VPG is built to be an online real-time retrieval system that enables navigation from individual products to composite scenes containing those products, along with complementary recommendations. Our system not only offers contextual insights by showcasing how products can be styled in a context, but also provides recommendations for complementary products drawn from these inspirations. We discuss the essential components for building the Visual Product Graph, along with the core computer vision model improvements across object detection, foundational visual embeddings, and other visual signals. Our system achieves a 78.8% extremely similar@1 in end-to-end human relevance evaluations, and a 6% module engagement rate. The “Ways to Style It” module, powered by the Visual Product Graph technology, is deployed in production at Pinterest.

[CV-11] OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

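[Quick read]: This paper addresses lip synchronization in video: aligning a speaker's lip movements with the corresponding speech audio to produce realistic, expressive video content. Existing methods rely on reference frames and masked-frame inpainting, which limits their robustness to identity consistency, pose variations, facial occlusions, and stylized content; in addition, because audio is a weak conditioning signal, lip shape leakage from the original video degrades sync quality. The key to its solution is OmniSync, a framework with a mask-free training paradigm that uses Diffusion Transformer models to edit frames directly without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and character identity. It also uses flow-matching-based progressive noise initialization to ensure pose and identity consistency, and introduces a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength to compensate for the weak audio conditioning.

Link: https://arxiv.org/abs/2505.21448
Authors: Ziqiao Peng,Jiwen Liu,Haoxian Zhang,Xiaoqiang Liu,Songlin Tang,Pengfei Wan,Di Zhang,Hongyan Liu,Jun He
Affiliations: Renmin University of China; Kuaishou Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL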

Abstract:Lip synchronization is the task of aligning a speaker’s lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.

[CV-12] VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin INTERSPEECH2025

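[Quick read]: This paper addresses the performance degradation that speaker aging causes in speaker verification systems. Its key contribution is VoxAging, a large-scale longitudinal dataset with speech from 293 speakers (226 English speakers and 67 Mandarin speakers). The longest collection span reaches 17 years (approximately 900 weeks), and each speaker's data were recorded at weekly intervals, providing a reliable basis for studying speaker aging and its effects on advanced speaker verification systems.

Link: https://arxiv.org/abs/2505.21445
Authors: Zhiqi Ai,Meixuan Bao,Zhiyong Chen,Zhi Yang,Xinnuo Li,Shugong Xu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 5 pages, 4 figures, Accepted by Interspeech 2025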

Abstract:The performance of speaker verification systems is adversely affected by speaker aging. However, due to challenges in data collection, particularly the lack of sustained and large-scale longitudinal data for individuals, research on speaker aging remains difficult. In this paper, we present VoxAging, a large-scale longitudinal dataset collected from 293 speakers (226 English speakers and 67 Mandarin speakers) over several years, with the longest time span reaching 17 years (approximately 900 weeks). For each speaker, the data were recorded at weekly intervals. We studied the phenomenon of speaker aging and its effects on advanced speaker verification systems, analyzed individual speaker aging processes, and explored the impact of factors such as age group and gender on speaker aging research.

[CV-13] CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects

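[Quick read]: This paper addresses whole-body manipulation of articulated objects, which requires jointly generating body motion, hand motion, and object motion. The core challenges are tight coordination between the hands and the rest of the body, and high-precision control of hand-object interaction under high degrees of freedom. The key to its solution is a novel coordinated diffusion noise optimization framework that performs noise-space optimization over three specialized diffusion models for the body, left hand, and right hand. It adopts a unified representation based on basis point sets (BPS) to capture fine-grained spatial relationships between the hand and the object, yielding high-fidelity whole-body motion and precise hand-object interaction.

Link: https://arxiv.org/abs/2505.21437
Authors: Huaijin Pi,Zhi Cen,Zhiyang Dou,Taku Komura
Affiliations: The University of Hong Kong; Zhejiang University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL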

Abstract:Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
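
The unified BPS representation in the abstract encodes both the object geometry and the end-effector positions as distances to one shared set of basis points. A minimal sketch of that encoding follows, assuming 3D points given as coordinate tuples; the function name and the toy data are hypothetical and not taken from the paper.

```python
import math


def bps_encode(basis_points, cloud):
    """For each basis point, return the distance to its nearest point in `cloud`."""
    return [min(math.dist(b, p) for p in cloud) for b in basis_points]


# Hypothetical toy data: a fixed basis shared by the object and the hand.
basis = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
object_points = [(0.0, 0.0, 1.0), (3.0, 0.0, 0.0)]
fingertip = (0.5, 0.0, 0.0)

object_code = bps_encode(basis, object_points)  # fixed-length object descriptor
fingertip_code = bps_encode(basis, [fingertip])  # same basis, single point
```

Because both codes are measured against the same basis points, they live in the same fixed-length space, which is what allows hand and object geometry to be compared directly.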

[CV-14] Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning

[Quick read]: This paper addresses how to fuse multimodal information effectively to improve 3D anomaly detection. The key to its solution is Mentor3AD, a new method that uses multi-modal mentor learning to fuse features and guide reconstruction. It comprises a Mentor of Fusion Module (MFM), which extracts shared features of the RGB and 3D modalities to create a mentor feature; a Mentor of Guidance Module (MGM), which uses the mentor feature to facilitate cross-modal reconstruction; and a Voting Module (VM), which generates the final anomaly score more accurately.

Link: https://arxiv.org/abs/2505.21420
Authors: Jinbao Wang,Hanzhe Liang,Can Gao,Chenxi Hu,Jie Zhou,Yunkang Cao,Linlin Shen,Weiming Shen
Affiliations: Shenzhen University; Hunan University; Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 6 Figures, 7 Tables

Abstract:Multimodal feature reconstruction is a promising approach for 3D anomaly detection, leveraging the complementary information from dual modalities. We advance this paradigm with multi-modal mentor learning, which fuses intermediate features to better distinguish normal from anomalous samples through feature differences. To this end, we propose a novel method called Mentor3AD. By leveraging the shared features of different modalities, Mentor3AD can extract more effective features and guide feature reconstruction, ultimately improving detection performance. Specifically, Mentor3AD includes a Mentor of Fusion Module (MFM) that merges features extracted from RGB and 3D modalities to create a mentor feature. Additionally, we have designed a Mentor of Guidance Module (MGM) to facilitate cross-modal reconstruction, supported by the mentor feature. Lastly, we introduce a Voting Module (VM) to more accurately generate the final anomaly score. Extensive comparative and ablation studies on MVTec 3D-AD and Eyecandies have verified the effectiveness of the proposed method.

[CV-15] Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios

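[Quick read]: This paper addresses the significant performance degradation that noisy data causes in multi-view clustering. The key to its solution is AIRMVC, a novel multi-view clustering framework that models noise identification as an anomaly identification problem and identifies noisy data automatically with a Gaussian mixture model (GMM). It then applies a hybrid rectification strategy to mitigate the adverse effects of noisy data and introduces a noise-robust contrastive mechanism to generate reliable representations, improving the performance of downstream tasks.

Link: https://arxiv.org/abs/2505.21387
Authors: Xihong Yang,Siwei Wang,Fangdi Wang,Jiaqi Jin,Suyuan Liu,Yue Liu,En Zhu,Xinwang Liu,Yueming Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: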

Abstract:Leveraging powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance in recent years by effectively integrating multi-source information from diverse views. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noise identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC is available at this https URL on GitHub.
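
The abstract states that noise identification is cast as anomaly identification with a GMM but does not say which quantity the mixture is fitted to. The sketch below assumes, purely for illustration, a scalar per-sample score (for example a loss or a cross-view discrepancy) where higher values are more suspicious. It fits a two-component 1D Gaussian mixture with EM in plain Python; all function names are hypothetical and this is not the AIRMVC implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(xs, weights, means, variances):
    """E-step: posterior probability of each component for each sample."""
    resp = []
    for x in xs:
        p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p) or 1e-300  # guard against underflow to zero
        resp.append([pk / total for pk in p])
    return resp


def fit_two_component_gmm(xs, iters=50):
    """Fit a two-component 1D GMM with EM; returns (weights, means, variances)."""
    lo, hi = min(xs), max(xs)
    weights = [0.5, 0.5]
    means = [lo, hi]  # start one component at each extreme
    var0 = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [var0, var0]
    for _ in range(iters):
        resp = responsibilities(xs, weights, means, variances)
        for k in range(2):  # M-step: re-estimate each component
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # variance floor for stability
    return weights, means, variances


def noise_probabilities(scores):
    """Posterior that each score belongs to the higher-mean (assumed noisy) component."""
    weights, means, variances = fit_two_component_gmm(scores)
    noisy = 0 if means[0] > means[1] else 1
    resp = responsibilities(scores, weights, means, variances)
    return [r[noisy] for r in resp]
```

Samples whose noise probability exceeds a chosen threshold could then be handed to a rectification step, while the rest are treated as clean. In practice a library implementation such as scikit-learn's `GaussianMixture` would likely replace the hand-written EM loop.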

[CV-16] ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding

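[Quick read]: This paper addresses a weakness of PointMamba-based self-supervised learning on point clouds: complex token ordering and random masking disrupt spatial continuity and local semantic correlations. The key to its solution is ZigzagPointMamba, which globally sequences point cloud tokens along a simple zigzag scan path so that spatially adjacent tokens stay close in the sequence, enhancing spatial continuity. It also introduces a Semantic-Siamese Masking Strategy (SMS) that masks semantically similar tokens to facilitate reconstruction, integrating local features of original and similar tokens, overcoming the dependence on isolated local features, and enabling robust global semantic modeling.

Link: https://arxiv.org/abs/2505.21381
Authors: Linshuang Diao,Dayong Ren,Sensen Song,Yurong Qian
Affiliations: Xinjiang University; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: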

Abstract:State space models (SSMs) such as PointMamba enable efficient feature extraction for point cloud self-supervised learning with linear complexity, outperforming Transformers in computational efficiency. However, existing PointMamba-based methods depend on complex token ordering and random masking, which disrupt spatial continuity and local semantic correlations. We propose ZigzagPointMamba to tackle these challenges. The core of our approach is a simple zigzag scan path that globally sequences point cloud tokens, enhancing spatial continuity by preserving the proximity of spatially adjacent point tokens. Random masking, however, still undermines local semantic modeling in self-supervised learning. To address this, we introduce a Semantic-Siamese Masking Strategy (SMS), which masks semantically similar tokens to facilitate reconstruction by integrating local features of original and similar tokens. This overcomes the dependence on isolated local features and enables robust global semantic modeling. Our pre-trained ZigzagPointMamba weights significantly improve downstream tasks, achieving a 1.59% mIoU gain on ShapeNetPart for part segmentation, a 0.4% higher accuracy on ModelNet40 for classification, and 0.19%, 1.22%, and 0.72% higher accuracies respectively for the classification tasks on the OBJ-BG, OBJ-ONLY, and PB-T50-RS subsets of ScanObjectNN. The code is available at: this https URL
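
The abstract describes the zigzag scan only at a high level. As a rough 2D illustration of why such a path preserves proximity, the sketch below orders the cells of a grid in a back-and-forth (boustrophedon) pattern. The grid setting and the function name are assumptions; the paper's actual path over 3D point cloud tokens may differ.

```python
def zigzag_order(width, height):
    """Visit grid cells row by row, reversing direction on every other row."""
    order = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        order.extend((x, y) for x in xs)
    return order
```

With a plain row-major scan, the last cell of one row and the first cell of the next sit far apart in space. In the zigzag order every pair of consecutive cells is spatially adjacent, so a sequence model sees neighbouring tokens next to each other.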

[CV-17] Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility CVPR2025

【速读】:该论文试图解决文本到矢量图形生成中的视角任意观看、渐进式细节优化以及视图依赖的遮挡感知问题。其解决方案的关键在于提出了一种双分支优化框架,包含辅助的3D高斯点云(3D Gaussian Splatting, 3DGS)优化分支和3D矢量图形优化分支,通过3DGS分支增强文本提示与矢量图形之间的语义一致性,并利用无分类器引导调度实现渐进式细节控制,同时引入可见性感知渲染模块以提升视图依赖的遮挡处理能力。

Link: https://arxiv.org/abs/2505.21377
Authors: Yidi Li, Jun Xiao, Zhengda Lu, Yiqun Wang, Haiyong Jiang
Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, facilitating guiding vector graphics with coarse shapes at the initial stages and finer details at later stages. We also improve the view-dependent occlusions by devising a visibility-awareness rendering module. Extensive results on 3D sketches and 3D iconographies, demonstrate the superiority of the method on different abstraction levels of details, cross-view consistency, and occlusion-aware stroke culling.
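
To make "scheduling classifier-free guidance" concrete, here is a minimal sketch that pairs the standard guidance combination with a linear schedule for the guidance scale. The linear shape, the function names, and the `start`/`end` parameters are assumptions for illustration; the abstract does not say which schedule or which direction Dream3DVG uses.

```python
def guidance_scale(step, total_steps, start, end):
    """Linearly interpolate the guidance scale from `start` to `end`."""
    t = step / max(total_steps - 1, 1)
    return start + (end - start) * t


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical usage: the scale changes over 11 optimization steps.
for step in (0, 5, 10):
    w = guidance_scale(step, 11, start=100.0, end=7.5)
    print(step, w, cfg_combine([0.0, 0.2], [1.0, 0.4], w))
```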

[CV-18] GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

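[Quick read]: This paper addresses two key bottlenecks that ultra-high-resolution (UHR) remote sensing (RS) imagery poses for existing multimodal foundation models: the scarcity of UHR training data, and the token explosion caused by very large image sizes. To address data scarcity, the authors build two high-resolution vision-language datasets, SuperRS-VQA and HighRS-VQA. To mitigate token explosion, they propose Background Token Pruning and Anchored Token Selection, which reduce the memory footprint while preserving key information. Building on these techniques, they develop GeoLLaVA-8K, the first RS-focused multimodal large language model that handles inputs up to 8K × 8K resolution, and it sets a new state of the art on XLRS-Bench.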

Link: https://arxiv.org/abs/2505.21375
Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Affiliations: 1 College of Computer Science and Technology, National University of Defense Technology, China; 2 Beijing University of Posts and Telecommunications, China; 3 Tsinghua University, China; 4 School of Computer Science, Wuhan University, China; 5 Zhongguancun Academy, China; 6 Beihang University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376 × 8,376) and HighRS-VQA (avg. 2,000 × 1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key information. Building on these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K × 8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
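
The token explosion and the pruning idea can be sketched in a few lines. This is an illustration only: the 14-pixel patch size, the per-token relevance `scores`, and the function names are assumptions, and a plain top-k filter stands in for the paper's Background Token Pruning and Anchored Token Selection.

```python
def num_patch_tokens(side, patch):
    """Number of square patches (visual tokens) in a side x side image."""
    return (side // patch) ** 2


def prune_background_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, round(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)
    return [tokens[i] for i in keep], keep


# An 8192-pixel side with an assumed 14-pixel patch gives 585 * 585 tokens.
print(num_patch_tokens(8192, 14))  # 342225

tokens = ["sea", "ship", "sea", "dock"]
scores = [0.1, 0.9, 0.2, 0.7]  # hypothetical object-centric relevance
print(prune_background_tokens(tokens, scores, keep_ratio=0.5))  # (['ship', 'dock'], [1, 3])
```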

[CV-19] Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

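[Quick read]: This paper addresses the weak performance of current multimodal large language models (MLLMs) on complex video reasoning, in particular their clear deficiencies in integrating visual clues from multiple sources and performing deep logical reasoning. Existing video benchmarks mainly test visual perception and grounding and do not fully reflect the real-world reasoning process in which humans actively search for, integrate, and analyze multiple clues. To address this, the authors propose the Video-Holmes benchmark, whose key idea is carefully designed tasks that require models to actively locate and connect multiple relevant visual clues across different video segments, thereby evaluating complex video reasoning. The benchmark is built on 270 manually annotated suspense short films and covers seven tasks, with the aim of pushing models toward more human-like reasoning.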

Link: https://arxiv.org/abs/2505.21374
Authors: Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
Affiliations: ARC Lab, Tencent PCG; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Homepage: this https URL

Abstract:Recent advances in CoT reasoning and RL post-training have been reported to enhance video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, which spans seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We aim that Video-Holmes can serve as a “Holmes-test” for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released in this https URL.

[CV-20] YOLO-SPCI: Enhancing Remote Sensing Object Detection via Selective-Perspective-Class Integration

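[Quick read]: This paper addresses the challenges of object detection in remote sensing imagery, namely extreme scale variation, dense object distributions, and cluttered backgrounds, and in particular the lack of explicit mechanisms for guiding multi-scale feature refinement in the backbones of existing detectors such as YOLOv8. The key to its solution is YOLO-SPCI, an attention-enhanced detection framework that introduces a lightweight Selective-Perspective-Class Integration (SPCI) module. By combining a Selective Stream Gate (SSG), a Perspective Fusion Module (PFM), and a Class Discrimination Module (CDM), the module integrates and enhances multi-scale features.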

Link: https://arxiv.org/abs/2505.21370
Authors: Xinyuan Wang, Lian Peng, Xiangcheng Li, Yilin He, KinTak U
Affiliations: Faculty of Innovation Engineering, Macau University of Science and Technology, Macau, China; School of Computer and Information Science, Southwest University, Chongqing, China; College of Automation and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object detection in remote sensing imagery remains a challenging task due to extreme scale variation, dense object distributions, and cluttered backgrounds. While recent detectors such as YOLOv8 have shown promising results, their backbone architectures lack explicit mechanisms to guide multi-scale feature refinement, limiting performance on high-resolution aerial data. In this work, we propose YOLO-SPCI, an attention-enhanced detection framework that introduces a lightweight Selective-Perspective-Class Integration (SPCI) module to improve feature representation. The SPCI module integrates three components: a Selective Stream Gate (SSG) for adaptive regulation of global feature flow, a Perspective Fusion Module (PFM) for context-aware multi-scale integration, and a Class Discrimination Module (CDM) to enhance inter-class separability. We embed two SPCI blocks into the P3 and P5 stages of the YOLOv8 backbone, enabling effective refinement while preserving compatibility with the original neck and head. Experiments on the NWPU VHR-10 dataset demonstrate that YOLO-SPCI achieves superior performance compared to state-of-the-art detectors.

[CV-21] AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping

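[Quick read]: This paper addresses the inadequate modeling of multi-scale spatiotemporal features by conventional remote sensing foundation models in crop mapping: existing models either use fixed spatiotemporal windows that ignore the multi-scale nature of crop systems, or focus only on spatial patterns and disregard temporal information. The key to its solution is AgriFM, a multi-source remote sensing foundation model designed specifically for agricultural crop mapping. It synchronizes temporal down-sampling with spatial scaling to extract hierarchical spatiotemporal features jointly, and it is pre-trained on time-series data from MODIS, Landsat-8/9, and Sentinel-2, which lets it capture the long-term dynamics of crop growth.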

Link: https://arxiv.org/abs/2505.21357
Authors: Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
Affiliations: Jockey Club STEM Lab of Quantitative Remote Sensing, Department of Geography, The University of Hong Kong; Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; School of Remote Sensing and Information Engineering, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM’s superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.
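
A small sketch of what "temporal down-sampling synchronized with spatial scaling" means for feature shapes: every stage shrinks the time axis together with height and width. The number of stages, the shared factor of 2, and the name `stage_shapes` are assumptions for illustration rather than AgriFM's actual configuration.

```python
def stage_shapes(t, h, w, stages=4, factor=2):
    """List (T, H, W) feature sizes when time and space shrink together."""
    shapes = [(t, h, w)]
    for _ in range(stages - 1):
        # Ceiling division so that no axis ever drops below 1.
        t, h, w = (max(1, -(-d // factor)) for d in (t, h, w))
        shapes.append((t, h, w))
    return shapes


# Hypothetical input: 32 time steps of 64 x 64 feature maps.
print(stage_shapes(32, 64, 64))
# [(32, 64, 64), (16, 32, 32), (8, 16, 16), (4, 8, 8)]
```

Shrinking T alongside H and W keeps the token count per stage falling by a constant factor, which is one way to keep long time series affordable.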

[CV-22] Beyond Accuracy: Uncovering the Role of Similarity Perception and its Alignment with Semantics in Supervised Learning

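[Quick read]: This paper investigates how similarity perception emerges in deep vision networks, and in particular how well it aligns with semantic similarity. The key to its solution is the Deep Similarity Inspector (DSI), a systematic framework for analyzing how deep vision networks gradually build their similarity perception during training and for assessing how closely it matches semantic similarity.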

Link: https://arxiv.org/abs/2505.21338
Authors: Katarzyna Filus, Mateusz Żarski
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Similarity manifests in various forms, including semantic similarity that is particularly important, serving as an approximation of human object categorization based on e.g. shared functionalities and evolutionary traits. It also offers practical advantages in computational modeling via lexical structures such as WordNet with constant and interpretable similarity. As in the domain of deep vision, there is still not enough focus on the phenomena regarding the similarity perception emergence. We introduce Deep Similarity Inspector (DSI) – a systematic framework to inspect how deep vision networks develop their similarity perception and its alignment with semantic similarity. Our experiments show that both Convolutional Neural Networks’ (CNNs) and Vision Transformers’ (ViTs) develop a rich similarity perception during training with 3 phases (initial similarity surge, refinement, stabilization), with clear differences between CNNs and ViTs. Besides the gradual mistakes elimination, the mistakes refinement phenomenon can be observed.

[CV-23] Structure from Collision CVPR2025 WWW

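[Quick read]: This paper addresses the limitation that multiview images allow only the visible external structure to be estimated, while the invisible internal structure hidden beneath the surface is hard to identify. To overcome it, the paper introduces a new task, Structure from Collision (SfC), which aims to estimate an object's structure (including the invisible internal structure) from appearance changes during collision. The key to its solution is a new model, SfC-NeRF, which optimizes the invisible internal structure through a video sequence under physical, appearance-preserving (that is, visible external structure-preserving), and keyframe constraints, and which adopts a volume annealing strategy to avoid the local optima caused by the ill-posed nature of the problem.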

Link: https://arxiv.org/abs/2505.21335
Authors: Takuhiro Kaneko
Affiliations: NTT Corporation
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to CVPR 2025 (Highlight). Project page: this https URL

Abstract:Recent advancements in neural 3D representations, such as neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS), have enabled the accurate estimation of 3D structures from multiview images. However, this capability is limited to estimating the visible external structure, and identifying the invisible internal structure hidden behind the surface is difficult. To overcome this limitation, we address a new task called Structure from Collision (SfC), which aims to estimate the structure (including the invisible internal structure) of an object from appearance changes during collision. To solve this problem, we propose a novel model called SfC-NeRF that optimizes the invisible internal structure of an object through a video sequence under physical, appearance (i.e., visible external structure)-preserving, and keyframe constraints. In particular, to avoid falling into undesirable local optima owing to its ill-posed nature, we propose volume annealing; that is, searching for global optima by repeatedly reducing and expanding the volume. Extensive experiments on 115 objects involving diverse structures (i.e., various cavity shapes, locations, and sizes) and material properties revealed the properties of SfC and demonstrated the effectiveness of the proposed SfC-NeRF.

[CV-24] HoliTom: Holistic Token Merging for Fast Video Large Language Models

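[Quick read]: This paper addresses the computational inefficiency that redundant video tokens cause for video large language models (video LLMs) in video understanding. Existing methods for reducing redundancy are limited: inner-LLM pruning incurs computational overhead in shallow layers, while outer-LLM pruning handles only spatial redundancy within single frames or limited temporal windows and neglects global temporal dynamics and correlations across long video sequences. The paper proposes HoliTom, a training-free holistic token merging framework. Its key idea is outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging, which cuts visual tokens by over 90% and significantly reduces the LLM's computational burden. It also introduces an inner-LLM merging method based on token similarity that improves performance and is compatible with outer-LLM pruning. Experiments show that the approach sharply lowers computational cost and inference latency while maintaining high accuracy.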

Link: https://arxiv.org/abs/2505.21334
Authors: Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Affiliations: Zhejiang University; Westlake University; Salesforce AI Research; Columbia University; Rice University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM’s computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method’s promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLMs inference.
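
The simplest form of temporal redundancy removal can be sketched as follows: a token is dropped when it nearly matches the last kept token at the same spatial position. This is a deliberately reduced stand-in, not HoliTom's global redundancy-aware segmentation or its merging rule; the cosine threshold of 0.9 and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drop_temporal_duplicates(frames, threshold=0.9):
    """Return (frame, position, vector) triples for the tokens that remain.

    frames: list of frames, each a list of token vectors, one per position.
    """
    kept = []
    last = {}  # position -> most recently kept vector at that position
    for f, frame in enumerate(frames):
        for p, vec in enumerate(frame):
            if p in last and cosine(last[p], vec) >= threshold:
                continue  # temporally redundant with the kept token
            last[p] = vec
            kept.append((f, p, vec))
    return kept


frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],   # position 0 unchanged, position 1 changed
    [[1.0, 0.01], [1.0, 1.0]],  # both positions nearly unchanged
]
kept = drop_temporal_duplicates(frames)
print(len(kept), "of 6 tokens kept")  # 3 of 6 tokens kept
```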

[CV-25] MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

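[Quick read]: This paper addresses the weak performance of multimodal large language models (MLLMs) on video optical character recognition (video OCR). Existing MLLMs do well on OCR in static images, but their accuracy drops markedly in video because of motion blur, temporal variation, and visual effects. To give clearer guidance for training, the authors propose the MME-VideoOCR benchmark, which covers 25 tasks and 44 video OCR scenarios and contains 1,464 videos and 2,000 carefully annotated question-answer pairs. The benchmark goes beyond text recognition to stress deep comprehension of and reasoning about text in video. Experiments show that even the most advanced model (Gemini-2.5 Pro) reaches only 73.7% accuracy, and that models are limited on tasks requiring spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. The key to a solution therefore lies in improving models' holistic understanding of dynamic video content, together with high-resolution visual input and sufficient temporal coverage to make OCR more reliable.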

Link: https://arxiv.org/abs/2505.21333
Authors: Yang Shi, Huanqian Wang, Wulin Xie, Huanyao Zhang, Lijie Zhao, Yi-Fan Zhang, Xinfeng Li, Chaoyou Fu, Zhuoer Wen, Wenting Liu, Zhuoran Zhang, Xinlong Chen, Bohan Zeng, Sihan Yang, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Wenjing Yang
Affiliations: PKU (Peking University); THU (Tsinghua University); CASIA (Institute of Automation, Chinese Academy of Sciences); CUHKSZ (The Chinese University of Hong Kong, Shenzhen); NTU (Nanyang Technological University); XJTU (Xi'an Jiaotong University); Kuaishou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint

Abstract:Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) from static images. However, their efficacy in video OCR is significantly diminished due to factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension and reasoning of textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs demonstrate strong performance on tasks where relevant texts are contained within a single or few frames, they exhibit limited capability in effectively handling tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

[CV-26] MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

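[Quick read]: This paper addresses shortcomings in how existing benchmarks evaluate the logical reasoning of multimodal large language models (MLLMs): they lack an explicit categorization of logical reasoning types (inductive, deductive, and abductive) and rest on an unclear understanding of reasoning. The key to its solution is the MME-Reasoning benchmark, which covers all three reasoning types, curates its data so that evaluation targets reasoning ability rather than perceptual skills or breadth of knowledge, and extends the evaluation protocols to cover diverse questions, giving a more accurate measure of MLLMs' logical reasoning.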

Link: https://arxiv.org/abs/2505.21327
Authors: Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue
Affiliations: Fudan University; MMLab, The Chinese University of Hong Kong; Shanghai AI Laboratory; University of Science and Technology of China; Nanjing University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as “thinking mode” and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.

[CV-27] MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

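[Quick read]: This paper addresses insufficient spatiotemporal consistency and poor garment content preservation in video virtual try-on (VVT). Existing methods are held back by the limited expressive capability of U-Net-based diffusion models and by modeling spatial and temporal attention separately, which fails to capture structural relationships and dynamic consistency and so hurts the realism and stability of the synthesized results. The key to its solution is MagicTryOn, a framework built on a large-scale video diffusion model that replaces the U-Net architecture with a diffusion Transformer and uses full self-attention to model the spatiotemporal consistency of videos jointly. It also designs a coarse-to-fine garment preservation strategy that integrates garment tokens at the embedding stage and introduces multiple conditions such as semantics, textures, and contour lines at the denoising stage, further improving fidelity in the garment region.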

Link: https://arxiv.org/abs/2505.21325
Authors: Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang
Affiliations: vivo Mobile Communication Co., Ltd; Zhejiang University; BoardWare Information System Limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they adopt a separative modeling approach for spatial and temporal attention, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon a large-scale video diffusion model. We replace the U-Net architecture with a diffusion Transformer and combine full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
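
One plausible shape for a mask-aware loss is a weighted mean squared error that counts errors inside the garment mask more heavily. The weighting scheme, the `garment_weight` parameter, and the function name below are assumptions for illustration; the abstract does not give the actual formulation.

```python
def mask_aware_mse(pred, target, mask, garment_weight=2.0):
    """Weighted MSE over flat pixel lists; mask is 1 inside the garment, else 0."""
    num = 0.0
    den = 0.0
    for p, t, m in zip(pred, target, mask):
        w = 1.0 + (garment_weight - 1.0) * m  # garment_weight inside, 1 outside
        num += w * (p - t) ** 2
        den += w
    return num / den if den else 0.0


# The same error of 1.0 costs three times as much inside the mask as outside.
print(mask_aware_mse([1.0, 0.0], [0.0, 0.0], [1, 0], garment_weight=3.0))  # 0.75
print(mask_aware_mse([0.0, 1.0], [0.0, 0.0], [1, 0], garment_weight=3.0))  # 0.25
```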

[CV-28] func: An Efficient Function Representation without Neural Networks

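[Quick read]: This paper addresses function fitting/approximation, a problem that is fundamental to computer graphics and other engineering applications. Conventional methods rely on neural network architectures with many parameters, which limits their practical use. The key to its solution is a high-accuracy function approximation method based on parameter-efficient representations that dispenses with neural networks entirely. At its core are a framework for continuous function modeling and a compact function representation based on polynomials interpolated using radial basis functions, which avoids both neural networks and complex hierarchical data structures. The authors also develop memory-efficient CUDA-optimized algorithms that significantly reduce computation time and memory consumption.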

Link: https://arxiv.org/abs/2505.21319
Authors: Biao Zhang, Peter Wonka
Affiliations: KAUST (King Abdullah University of Science and Technology)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:Function fitting/approximation plays a fundamental role in computer graphics and other engineering applications. While recent advances have explored neural networks to address this task, these methods often rely on architectures with many parameters, limiting their practical applicability. In contrast, we pursue high-quality function approximation using parameter-efficient representations that eliminate the dependency on neural networks entirely. We first propose a novel framework for continuous function modeling. Most existing works can be formulated using this framework. We then introduce a compact function representation, which is based on polynomials interpolated using radial basis functions, bypassing both neural networks and complex/hierarchical data structures. We also develop memory-efficient CUDA-optimized algorithms that reduce computational time and memory consumption to less than 10% compared to conventional automatic differentiation frameworks. Finally, we validate our representation and optimization pipeline through extensive experiments on 3D signed distance functions (SDFs). The proposed representation achieves comparable or superior performance to state-of-the-art techniques (e.g., octree/hash-grid techniques) with significantly fewer parameters.
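
One possible reading of "polynomials interpolated using radial basis functions" is a partition-of-unity blend: each center carries a small local polynomial, and normalized Gaussian weights mix them. The 1-D sketch below follows that reading; the Gaussian kernel, the `sigma` value, and the function name are assumptions, and the paper's actual representation may be constructed differently.

```python
import math


def rbf_poly_eval(x, centers, coeffs, sigma=0.5):
    """Blend local polynomials with normalized Gaussian RBF weights.

    coeffs[i] holds the coefficients (constant term first) of the polynomial
    attached to centers[i], written in the local coordinate x - centers[i].
    """
    num = 0.0
    den = 0.0
    for c, poly in zip(centers, coeffs):
        d = x - c
        w = math.exp(-(d * d) / (2.0 * sigma * sigma))
        value = sum(a * d ** k for k, a in enumerate(poly))
        num += w * value
        den += w
    return num / den


# If every local polynomial encodes f(x) = 2x + 1, the blend reproduces f.
centers = [0.0, 1.0, 2.0]
coeffs = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
print(rbf_poly_eval(0.3, centers, coeffs))
```

The parameters are just the centers and a few coefficients per center, which is the sense in which such a representation can be compact.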

[CV-29] Efficient Leaf Disease Classification and Segmentation using Midpoint Normalization Technique and Attention Mechanism ICIP

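[Quick read]: This paper addresses the challenges that scarce labeled data and complex contextual factors pose for plant disease detection. The key to its solution is a two-stage approach: Mid Point Normalization (MPN) for intelligent image preprocessing, combined with sophisticated attention mechanisms that dynamically recalibrate feature representations. By merging MPN with Squeeze-and-Excitation (SE) blocks, the classification pipeline reaches 93% accuracy while keeping classes balanced. For segmentation, integrating the same attention blocks into a U-Net architecture with MPN-enhanced inputs yields a Dice score of 72.44% and an IoU of 58.54%, substantially outperforming baseline implementations.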

Link: https://arxiv.org/abs/2505.21316
Authors: Enam Ahmed Taufik, Antara Firoz Parsa, Seraj Al Mahmud Mostafa
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted in 2025 IEEE International Conference on Image Processing (ICIP)

Abstract:Enhancing plant disease detection from leaf imagery remains a persistent challenge due to scarce labeled data and complex contextual factors. We introduce a transformative two-stage methodology, Mid Point Normalization (MPN) for intelligent image preprocessing, coupled with sophisticated attention mechanisms that dynamically recalibrate feature representations. Our classification pipeline, merging MPN with Squeeze-and-Excitation (SE) blocks, achieves remarkable 93% accuracy while maintaining exceptional class-wise balance. The perfect F1 score attained for our target class exemplifies attention’s power in adaptive feature refinement. For segmentation tasks, we seamlessly integrate identical attention blocks within U-Net architecture using MPN-enhanced inputs, delivering compelling performance gains with 72.44% Dice score and 58.54% IoU, substantially outperforming baseline implementations. Beyond superior accuracy metrics, our approach yields computationally efficient, lightweight architectures perfectly suited for real-world computer vision applications.
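
For readers unfamiliar with the two segmentation metrics quoted above, here is a minimal sketch of Dice and IoU for binary masks. It assumes flat 0/1 lists and treats two empty masks as a perfect match; the paper's own evaluation code is not shown in the abstract and may average differently.

```python
def dice_and_iou(pred, truth):
    """Compute (Dice, IoU) for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in truth if t)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

For a single pair of masks the two are tied by IoU = Dice / (2 - Dice). Scores averaged over many images need not satisfy that identity, which may be why a reported pair such as 72.44% and 58.54% does not match it exactly.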

[CV-30] Spectral Compression Transformer with Line Pose Graph for Monocular 3D Human Pose Estimation

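[Quick read]: This paper addresses the high computational cost of Transformer-based 3D human pose estimation, which stems from the quadratic complexity of self-attention in the sequence length, and the significant inter-frame redundancy in pose sequences that existing methods fail to remove effectively. The key to its solution is the Spectral Compression Transformer (SCT), which treats hidden features between blocks as temporal feature signals and applies the Discrete Cosine Transform to select the spectral components to retain, reducing sequence length and redundancy. It also proposes a Line Pose Graph (LPG), based on line graph theory, to enrich the input sequence with structural priors, and designs a dual-stream network architecture that models spatial joint relationships and the compressed motion trajectories.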

Link: https://arxiv.org/abs/2505.21309
Authors: Zenghao Zheng, Lianping Yang, Hegui Zhu, Mingrui Ye
Affiliations: Northeastern University (China); King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transformer-based 3D human pose estimation methods suffer from high computational costs due to the quadratic complexity of self-attention with respect to sequence length. Additionally, pose sequences often contain significant redundancy between frames. However, recent methods typically fail to improve model capacity while effectively eliminating sequence redundancy. In this work, we introduce the Spectral Compression Transformer (SCT) to reduce sequence length and accelerate computation. The SCT encoder treats hidden features between blocks as Temporal Feature Signals (TFS) and applies the Discrete Cosine Transform, a Fourier transform-based technique, to determine the spectral components to be retained. By filtering out certain high-frequency noise components, SCT compresses the sequence length and reduces redundancy. To further enrich the input sequence with prior structural information, we propose the Line Pose Graph (LPG) based on line graph theory. The LPG generates skeletal position information that complements the input 2D joint positions, thereby improving the model’s performance. Finally, we design a dual-stream network architecture to effectively model spatial joint relationships and the compressed motion trajectory within the pose sequence. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our model achieves state-of-the-art performance with improved computational efficiency. For example, on the Human3.6M dataset, our method achieves an MPJPE of 37.7mm while maintaining a low computational cost. Furthermore, we perform ablation studies on each module to assess its effectiveness. The code and models will be released.
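
The compression step can be illustrated on a single feature channel: take the DCT along the time axis and keep only the lowest-frequency coefficients, so a sequence of n frames is represented by `keep` numbers. The orthonormal DCT-II below is standard; how many coefficients SCT keeps and how it feeds them to later blocks are not specified here, so `compress` and its `keep` argument are assumptions.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(coeffs, n):
    """Inverse of dct2 for length n; missing high frequencies count as zero."""
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k, c in enumerate(coeffs) if k > 0)
        out.append(s)
    return out


def compress(signal, keep):
    """Keep only the `keep` lowest-frequency DCT coefficients."""
    return dct2(signal)[:keep]


# A slowly varying trajectory over 8 frames, stored as 3 coefficients.
trajectory = [0.0, 0.1, 0.25, 0.4, 0.5, 0.58, 0.62, 0.63]
short = compress(trajectory, keep=3)
print(len(short), [round(v, 3) for v in idct2(short, len(trajectory))])
```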

[CV-31] Supervised and self-supervised land-cover segmentation classification of the Biesbosch wetlands

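[Quick read]: This paper addresses the scarcity of annotated data for wetland land-cover classification in high-resolution satellite imagery, which is a major challenge for supervised learning. Its key solution combines supervised and self-supervised learning (SSL): a U-Net model is trained on Sentinel-2 imagery, and SSL pretraining with an autoencoder improves classification accuracy, especially for high-resolution imagery where labeled data is hard to obtain. The study also proposes a framework for scaling manually annotated high-resolution labels to medium-resolution inputs, in order to sharpen segmentation boundaries and improve spatial detail.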

Link: https://arxiv.org/abs/2505.21269
Authors: Eva Gmelich Meijling, Roberto Del Prete, Arnoud Visser
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 12 pages, presented at the Netherlands Conference on Computer Vision (NCCV), Utrecht, May 2025

Abstract:Accurate wetland land-cover classification is essential for environmental monitoring, biodiversity assessment, and sustainable ecosystem management. However, the scarcity of annotated data, especially for high-resolution satellite imagery, poses a significant challenge for supervised learning approaches. To tackle this issue, this study presents a methodology for wetland land-cover segmentation and classification that adopts both supervised and self-supervised learning (SSL). We train a U-Net model from scratch on Sentinel-2 imagery across six wetland regions in the Netherlands, achieving a baseline model accuracy of 85.26%. Addressing the limited availability of labeled data, the results show that SSL pretraining with an autoencoder can improve accuracy, especially for the high-resolution imagery where it is more difficult to obtain labeled data, reaching an accuracy of 88.23%. Furthermore, we introduce a framework to scale manually annotated high-resolution labels to medium-resolution inputs. While the quantitative performance between resolutions is comparable, high-resolution imagery provides significantly sharper segmentation boundaries and finer spatial detail. As part of this work, we also contribute a curated Sentinel-2 dataset with Dynamic World labels, tailored for wetland classification tasks and made publicly available.

[CV-32] DiMoSR: Feature Modulation via Multi-Branch Dilated Convolutions for Efficient Image Super-Resolution

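[Quick Read]: This paper addresses the trade-off between reconstruction quality and model efficiency in lightweight single image super-resolution (SISR). The key to its solution is a new architecture, DiMoSR (Dilated Modulation Super-Resolution), which strengthens feature representation through a modulation mechanism that complements attention in lightweight SISR networks. The method uses multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while remaining computationally efficient, achieving better PSNR and SSIM than existing lightweight methods on multiple benchmark datasets.

Link: https://arxiv.org/abs/2505.21262
Authors: M. Akin Yilmaz,Ahmet Bilican,A. Murat Tekalp
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: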

Abstract:Balancing reconstruction quality versus model efficiency remains a critical challenge in lightweight single image super-resolution (SISR). Despite the prevalence of attention mechanisms in recent state-of-the-art SISR approaches that primarily emphasize or suppress feature maps, alternative architectural paradigms warrant further exploration. This paper introduces DiMoSR (Dilated Modulation Super-Resolution), a novel architecture that enhances feature representation through modulation to complement attention in lightweight SISR networks. The proposed approach leverages multi-branch dilated convolutions to capture rich contextual information over a wider receptive field while maintaining computational efficiency. Experimental results demonstrate that DiMoSR outperforms state-of-the-art lightweight methods across diverse benchmark datasets, achieving superior PSNR and SSIM metrics with comparable or reduced computational complexity. Through comprehensive ablation studies, this work not only validates the effectiveness of DiMoSR but also provides critical insights into the interplay between attention mechanisms and feature modulation to guide future research in efficient network design. The code and model weights to reproduce our results are available at: this https URL
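
To make the idea of multi-branch dilated convolutions concrete, here is a minimal 1D sketch in plain Python. It is not the DiMoSR architecture: the function names, the averaging of branches, and the element-wise multiplication used as "modulation" are all assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D convolution with zero padding and the given dilation."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def multi_branch_modulation(signal, kernel, dilations=(1, 2, 3)):
    """Hypothetical modulation: average the dilated branches, then scale the input."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    context = [sum(vals) / len(vals) for vals in zip(*branches)]
    return [x * c for x, c in zip(signal, context)]
```

Each branch reuses the same kernel size, so a larger dilation widens the receptive field without adding weights, which is the efficiency argument made in the abstract.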

[CV-33] Plenodium: UnderWater 3D Scene Reconstruction with Plenoptic Medium Representation

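[Quick Read]: This paper addresses the challenges of 3D reconstruction in underwater scenes, in particular initialization in degraded environments and the ordinal consistency of depth maps. The key to its solution is Plenodium (plenoptic medium), a new 3D representation framework that incorporates both directional and positional information through spherical harmonics encoding, enabling highly accurate underwater scene reconstruction. To address initialization, it introduces pseudo-depth Gaussian complementation, which augments COLMAP-derived point clouds with robust depth priors; it also designs a depth ranking regularized loss to optimize scene geometry and improve the ordinal consistency of depth maps.

Link: https://arxiv.org/abs/2505.21258
Authors: Changguanng Wu,Jiangxin Dong,Chengjian Li,Jinhui Tang
Affiliations: Nanjing University of Science and Technology (南京理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: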

Abstract:We present Plenodium (plenoptic medium), an effective and efficient 3D representation framework capable of jointly modeling both objects and participating media. In contrast to existing medium representations that rely solely on view-dependent modeling, our novel plenoptic medium representation incorporates both directional and positional information through spherical harmonics encoding, enabling highly accurate underwater scene reconstruction. To address the initialization challenge in degraded underwater environments, we propose the pseudo-depth Gaussian complementation to augment COLMAP-derived point clouds with robust depth priors. In addition, a depth ranking regularized loss is developed to optimize the geometry of the scene and improve the ordinal consistency of the depth maps. Extensive experiments on real-world underwater datasets demonstrate that our method achieves significant improvements in 3D reconstruction. Furthermore, we conduct a simulated dataset with ground truth and the controllable scattering medium to demonstrate the restoration capability of our method in underwater scenarios. Our code and dataset are available at this https URL.
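
A depth ranking loss penalizes pixel pairs whose rendered depth order disagrees with a reference order. The sketch below is a generic pairwise hinge formulation, assuming a pseudo-depth map supplies the reference ordering; the function name, the `margin` parameter, and the pair sampling are hypothetical, and the paper's exact loss may differ.

```python
def depth_ranking_loss(rendered, pseudo, pairs, margin=0.0):
    """Mean hinge penalty over pixel pairs whose rendered order contradicts the pseudo-depth order."""
    total = 0.0
    for i, j in pairs:
        if pseudo[i] == pseudo[j]:
            continue  # no ordering information for ties
        # sign is +1 when the pseudo-depth says pixel i is farther than pixel j
        sign = 1.0 if pseudo[i] > pseudo[j] else -1.0
        total += max(0.0, margin - sign * (rendered[i] - rendered[j]))
    return total / max(len(pairs), 1)
```

Only the ordering of the pseudo-depth is used, not its scale, which is why such a loss targets ordinal consistency rather than absolute depth accuracy.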

[CV-34] 3D-UIR: 3D Gaussian for Underwater 3D Scene Reconstruction via Physics-Based Appearance-Medium Decoupling

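[Quick Read]: This paper addresses novel view synthesis for underwater scene reconstruction, which poses unique challenges because of complex light-medium interactions in water: optical scattering and absorption cause inhomogeneous medium attenuation that violates the uniform-propagation-medium assumption of conventional volume rendering. The key to its solution is a physics-based framework that decouples object appearance from water medium effects through tailored Gaussian modeling and introduces appearance embeddings as explicit medium representations for backscatter and attenuation to improve scene consistency. It also adopts a distance-guided optimization strategy that uses pseudo-depth maps as supervision, combined with depth regularization and scale penalty terms, to improve geometric fidelity.

Link: https://arxiv.org/abs/2505.21238
Authors: Jieyu Yuan,Yujun Li,Yuanlin Zhang,Chunle Guo,Xiongxin Tang,Ruixing Wang,Chongyi Li
Affiliations: Nankai University (南开大学); Chinese Academy of Sciences (中国科学院); DJI (大疆创新); Nankai International Advanced Research Institute (南开国际先进研究院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: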

Abstract:Novel view synthesis for underwater scene reconstruction presents unique challenges due to complex light-media interactions. Optical scattering and absorption in water body bring inhomogeneous medium attenuation interference that disrupts conventional volume rendering assumptions of uniform propagation medium. While 3D Gaussian Splatting (3DGS) offers real-time rendering capabilities, it struggles with underwater inhomogeneous environments where scattering media introduce artifacts and inconsistent appearance. In this study, we propose a physics-based framework that disentangles object appearance from water medium effects through tailored Gaussian modeling. Our approach introduces appearance embeddings, which are explicit medium representations for backscatter and attenuation, enhancing scene consistency. In addition, we propose a distance-guided optimization strategy that leverages pseudo-depth maps as supervision with depth regularization and scale penalty terms to improve geometric fidelity. By integrating the proposed appearance and medium modeling components via an underwater imaging model, our approach achieves both high-quality novel view synthesis and physically accurate scene restoration. Experiments demonstrate our significant improvements in rendering quality and restoration accuracy over existing methods. The project page is available at this https URL
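
The abstract refers to an underwater imaging model with backscatter and attenuation terms but does not spell it out. As a rough illustration, the sketch below uses a widely used image formation model, in which the observed intensity is an attenuated direct signal plus depth-dependent backscatter. Whether the paper uses exactly this form is an assumption, and all parameter names are hypothetical.

```python
import math


def underwater_pixel(clean, depth, beta_d, beta_b, b_inf):
    """Observed intensity = attenuated direct signal + backscatter (per color channel)."""
    direct = clean * math.exp(-beta_d * depth)                    # attenuation of the object signal
    backscatter = b_inf * (1.0 - math.exp(-beta_b * depth))       # veiling light grows with distance
    return direct + backscatter
```

At zero distance the function returns the clean intensity, and at large distances it approaches the veiling light `b_inf`, which matches the intuition that far objects fade into the water color.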

[CV-35] CROP: Contextual Region-Oriented Visual Token Pruning

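[Quick Read]: This paper addresses the problem that visual question answering (VQA) methods based on visual language models (VLMs) produce too many redundant visual tokens when processing images; this question-irrelevant information sharply increases memory and compute requirements. The key to its solution is a new framework, Contextual Region-Oriented Visual Token Pruning (CROP), which compresses visual tokens in two steps, localization and pruning: an efficient model first identifies the contextual region relevant to the input query, and then two pruning strategies are applied, Pre-LLM Compression (PLC) for adaptive region compression and Inner-LLM Pruning (ILP) for training-free pruning in early LLM layers.

Link: https://arxiv.org/abs/2505.21233
Authors: Jiawei Guo,Feifei Zhai,Pu Jian,Qianrun Wei,Yu Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: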

Abstract:Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance. Our code and datasets will be made available.
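
The localize-then-prune idea can be sketched very simply: once a contextual region is known, keep only the visual tokens whose patches fall inside it. The code below is a toy illustration under the assumption that tokens lie on a regular patch grid in row-major order and that the region is a box in patch units; it implements neither PLC nor ILP, and the function name is hypothetical.

```python
def tokens_in_region(grid_h, grid_w, box):
    """Return row-major indices of patch tokens inside box = (row_min, col_min, row_max, col_max), inclusive."""
    r0, c0, r1, c1 = box
    return [
        r * grid_w + c
        for r in range(grid_h)
        for c in range(grid_w)
        if r0 <= r <= r1 and c0 <= c <= c1
    ]
```

For a 4x4 grid and a 2x2 box, this keeps 4 of 16 tokens, which shows where the memory and compute savings come from.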

[CV-36] Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning

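[Quick Read]: This paper addresses the synergy between Monocular Depth Estimation (MDE) and Occlusion Boundary Estimation (OBE), jointly estimating depth and occlusion boundaries to improve scene understanding and 3D reconstruction. The key to its solution is a new network architecture, MoDOT, which introduces a cross-attention multi-scale strip convolution module (CASM) that uses mid-level occlusion boundary features to significantly improve depth prediction, together with an occlusion-aware loss function (OBDCL) that makes depth boundaries sharper and more accurate. Experiments show state-of-the-art performance on both real and synthetic datasets.

Link: https://arxiv.org/abs/2505.21231
Authors: Lintao Xu,Yinghao Wang,Chaohui Wang
Affiliations: LIGM, Univ Gustave Eiffel, École des Ponts, CNRS, France (LIGM,古斯塔夫·埃菲尔大学,巴黎路桥学院,法国国家科学研究中心,法国); INFRES, Télécom Paris, Institute Polytechnique de Paris, France (INFRES,巴黎电信学院,巴黎综合理工学院,法国)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 tables, 4 figures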

Abstract:Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects, distinguishing intrinsic object edges from occlusion-induced contours to improve scene understanding and 3D reconstruction capacity. This is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as occlusion boundaries provide critical geometric cues for resolving depth ambiguities, while depth priors can conversely refine occlusion reasoning in complex scenes. In this paper, we propose a novel network, MoDOT, that first jointly estimates depth and OBs. We propose CASM, a cross-attention multi-scale strip convolution module that leverages mid-level OB features to significantly enhance depth prediction. Additionally, we introduce an occlusion-aware loss function, OBDCL, which encourages sharper and more accurate depth boundaries. Extensive experiments on both real and synthetic datasets demonstrate the mutual benefits of jointly estimating depth and OB, and highlight the effectiveness of our model design. Our method achieves the state-of-the-art (SOTA) on both our proposed synthetic datasets and one popular real dataset, NYUD-v2, significantly outperforming multi-task baselines. Besides, without domain adaptation, results on real-world depth transfer are comparable to the competitors, while preserving sharp occlusion boundaries for geometric fidelity. We will release our code, pre-trained models, and datasets to support future research in this direction.

[CV-37] Is Hyperbolic Space All You Need for Medical Anomaly Detection? MICCAI2025

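[Quick Read]: This paper addresses the challenges that limited data availability and labeling constraints pose for medical anomaly detection. Traditional methods extract features from different layers of pre-trained networks in Euclidean space, but Euclidean representations cannot effectively capture the hierarchical relationships among these features, which leads to suboptimal anomaly detection. The key idea is to project feature representations into hyperbolic space, aggregate them according to confidence levels, and classify samples as healthy or anomalous. Experiments show that hyperbolic space outperforms Euclidean-based frameworks on multiple medical benchmark datasets, with higher AUROC scores, robustness to parameter variations, and strong few-shot performance.

Link: https://arxiv.org/abs/2505.21228
Authors: Alvaro Gonzalez-Jimenez,Simone Lionetti,Ludovic Amruthalingam,Philippe Gottfrois,Fabian Gröger,Marc Pouly,Alexander A. Navarini
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Provisionally Accepted at MICCAI 2025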

Abstract:Medical anomaly detection has emerged as a promising solution to challenges in data availability and labeling constraints. Traditional methods extract features from different layers of pre-trained networks in Euclidean space; however, Euclidean representations fail to effectively capture the hierarchical relationships within these features, leading to suboptimal anomaly detection performance. We propose a novel yet simple approach that projects feature representations into hyperbolic space, aggregates them based on confidence levels, and classifies samples as healthy or anomalous. Our experiments demonstrate that hyperbolic space consistently outperforms Euclidean-based frameworks, achieving higher AUROC scores at both image and pixel levels across multiple medical benchmark datasets. Additionally, we show that hyperbolic space exhibits resilience to parameter variations and excels in few-shot scenarios, where healthy images are scarce. These findings underscore the potential of hyperbolic space as a powerful alternative for medical anomaly detection. The project website can be found at this https URL
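
Projecting Euclidean features into hyperbolic space is commonly done with the exponential map at the origin of the Poincaré ball. The sketch below shows that standard map and the corresponding geodesic distance in plain Python. Whether the paper uses the Poincaré ball model, and with which curvature `c`, is an assumption here; the confidence-based aggregation step is not shown.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector into the Poincare ball of curvature -c (exponential map at the origin)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * nx) * (1.0 - c * ny))
    return math.acosh(arg) / math.sqrt(c)
```

With `c = 1`, a feature of Euclidean norm 0.5 lands at hyperbolic distance 1.0 from the origin, so distances grow faster toward the boundary of the ball, which is the property usually credited with representing hierarchies well.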

[CV-38] Sci-Fi: Symmetric Constraint for Frame Inbetweening NEURIPS2025

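[Quick Read]: This paper addresses inconsistent motion or appearance collapse in generated frames in frame inbetweening, caused by asymmetric control strength of the start-frame and end-frame constraints. Existing methods introduce the end-frame constraint through direct fine-tuning or without training, using the same mechanism as the original start-frame (single image) constraint. Because the pre-trained image-to-video diffusion model (I2V-DM) is already adequately trained for the start-frame condition, simply reusing that mechanism leaves the end frame with a weaker influence on the intermediate content, producing asymmetric control. The key to the solution is a new framework, Sci-Fi, which introduces a lightweight module, EF-Net, that encodes only the end frame and generates temporally adaptive frame-wise features injected into the I2V-DM. This strengthens the end-frame constraint, makes the start-frame and end-frame constraints symmetric, and improves the coherence of the generated video.

Link: https://arxiv.org/abs/2505.21205
Authors: Liuhan Chen,Xiaodong Cun,Xiaoyu Li,Xianyi He,Shenghai Yuan,Jie Chen,Ying Shan,Li Yuan
Affiliations: Shenzhen Graduate School, Peking University (北京大学深圳研究生院); GVC Lab, Great Bay University (大湾区大学GVC实验室); ARC Lab, Tencent PCG (腾讯PCG人工智能实验室); Rabbitpre Intelligence (兔犀智能)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 9 figures, submitted to NeurIPS2025, under review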

Abstract:Frame inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion models (I2V-DMs) by incorporating end-frame constraints via directly fine-tuning or omitting training. We identify a critical limitation in their design: Their injections of the end-frame constraint usually utilize the same mechanism that originally imposed the start-frame (single image) constraint. However, since the original I2V-DMs are adequately trained for the start-frame condition in advance, naively introducing the end-frame constraint by the same mechanism with much less (even zero) specialized training probably can’t make the end frame have a strong enough impact on the intermediate content like the start frame. This asymmetric control strength of the two frames over the intermediate content likely leads to inconsistent motion or appearance collapse in generated frames. To efficiently achieve symmetric constraints of start and end frames, we propose a novel framework, termed Sci-Fi, which applies a stronger injection for the constraint of a smaller training scale. Specifically, it deals with the start-frame constraint as before, while introducing the end-frame constraint by an improved mechanism. The new mechanism is based on a well-designed lightweight module, named EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features injected into the I2V-DM. This makes the end-frame constraint as strong as the start-frame constraint, enabling our Sci-Fi to produce more harmonious transitions in various scenarios. Extensive experiments prove the superiority of our Sci-Fi compared with other baselines.

[CV-39] Think Twice Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models

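[Quick Read]: This paper addresses the high inference cost of Vision-Language-Action (VLA) models in real-time deployment and edge applications, which stems mainly from large-scale token computation and autoregressive decoding. The key to its solution is FlashVLA, a training-free, plug-and-play acceleration framework. Its core is a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, combined with an information-guided visual token selection strategy that prunes low-contribution visual tokens, significantly improving inference efficiency.

Link: https://arxiv.org/abs/2505.21200
Authors: Xudong Tan,Yaoxin Yang,Peng Ye,Jialin Zheng,Bizhe Bai,Xinyi Wang,Jia Hao,Tao Chen
Affiliations: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Zhangjiang Laboratory (张江实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: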

Abstract:Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an information-guided visual token selection strategy that prunes low-contribution tokens. Extensive experiments on the LIBERO benchmark show that FlashVLA reduces FLOPs by 55.7% and latency by 36.0%, with only a 0.7% drop in task success rate. These results demonstrate the effectiveness of FlashVLA in enabling lightweight, low-latency VLA inference without retraining.

[CV-40] Learning Annotation Consensus for Continuous Emotion Recognition

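[Quick Read]: This paper addresses the loss of valuable inter-rater variability in affective computing when multi-annotator data that lack full agreement are merged into a single gold standard label. The key to its solution is a multi-annotator training approach that builds a consensus network to aggregate information from multiple annotators into a unified representation, guiding the main arousal-valence predictor to better reflect collective inputs rather than relying on a single reference label.

Link: https://arxiv.org/abs/2505.21196
Authors: Ibrahim Shoer,Engin Erzin
Affiliations: Department of Electrical and Electronics Engineering
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: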

Abstract:In affective computing, datasets often contain multiple annotations from different annotators, which may lack full agreement. Typically, these annotations are merged into a single gold standard label, potentially losing valuable inter-rater variability. We propose a multi-annotator training approach for continuous emotion recognition (CER) that seeks a consensus across all annotators rather than relying on a single reference label. Our method employs a consensus network to aggregate annotations into a unified representation, guiding the main arousal-valence predictor to better reflect collective inputs. Tested on the RECOLA and COGNIMUSE datasets, our approach outperforms traditional methods that unify annotations into a single label. This underscores the benefits of fully leveraging multi-annotator data in emotion recognition and highlights its applicability across various fields where annotations are abundant yet inconsistent.

[CV-41] Making Every Event Count: Balancing Data Efficiency and Accuracy in Event Camera Subsampling

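[Quick Read]: This paper addresses the data transmission and processing challenges caused by the high event rates of event cameras, in particular how to subsample data effectively in edge AI applications while preserving performance on downstream visual tasks. The key to its solution is a causal density-based subsampling method. It rests on the hypothesis that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling, which improves event video classification accuracy while keeping data efficiency.

Link: https://arxiv.org/abs/2505.21187
Authors: Hesam Araghi,Jan van Gemert,Nergis Tomen
Affiliations: Delft University of Technology (代尔夫特理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: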

Abstract:Event cameras offer high temporal resolution and power efficiency, making them well-suited for edge AI applications. However, their high event rates present challenges for data transmission and processing. Subsampling methods provide a practical solution, but their effect on downstream visual tasks remains underexplored. In this work, we systematically evaluate six hardware-friendly subsampling methods using convolutional neural networks for event video classification on various benchmark datasets. We hypothesize that events from high-density regions carry more task-relevant information and are therefore better suited for subsampling. To test this, we introduce a simple causal density-based subsampling method, demonstrating improved classification accuracy in sparse regimes. Our analysis further highlights key factors affecting subsampling performance, including sensitivity to hyperparameters and failure cases in scenarios with large event count variance. These findings provide insights for utilization of hardware-efficient subsampling strategies that balance data efficiency and task accuracy. The code for this paper will be released at: this https URL.
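
One plausible reading of "causal density-based subsampling" is to keep an event only if enough earlier events occurred nearby in space and recently in time. The sketch below implements that reading; it is an assumption, not the authors' method, and the function name and the `radius`, `window`, and `min_neighbors` parameters are hypothetical.

```python
def causal_density_subsample(events, radius=1, window=10, min_neighbors=2):
    """Keep events (t, x, y) that have at least `min_neighbors` past events nearby.

    `events` must be sorted by timestamp; only earlier events are inspected,
    so the decision for each event is causal.
    """
    kept = []
    for idx, (t, x, y) in enumerate(events):
        count = 0
        for past in range(idx - 1, -1, -1):
            pt, px, py = events[past]
            if t - pt > window:
                break  # older events fall outside the time window
            if abs(px - x) <= radius and abs(py - y) <= radius:
                count += 1
        if count >= min_neighbors:
            kept.append((t, x, y))
    return kept
```

Because the filter looks only at the past, it could in principle run on a stream, which is consistent with the hardware-friendly framing in the abstract.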

[CV-42] Boosting Adversarial Transferability via High-Frequency Augmentation and Hierarchical-Gradient Fusion

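[Quick Read]: This paper addresses the limited transferability of adversarial attacks against black-box defense strategies; existing methods focus mainly on spatial-domain augmentation and do not fully exploit frequency-domain information. The key to its solution is a new adversarial attack framework, Frequency-Space Attack (FSA), which combines frequency-domain and spatial-domain transformations and introduces two core techniques, High-Frequency Augmentation and Hierarchical-Gradient Fusion, to improve the transferability and attack success rate of adversarial examples.

Link: https://arxiv.org/abs/2505.21181
Authors: Yayin Zheng,Chen Wan,Zihong Guo,Hailing Kuang,Xiaohai Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: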

Abstract:Adversarial attacks have become a significant challenge in the security of machine learning models, particularly in the context of black-box defense strategies. Existing methods for enhancing adversarial transferability primarily focus on the spatial domain. This paper presents Frequency-Space Attack (FSA), a new adversarial attack framework that effectively integrates frequency-domain and spatial-domain transformations. FSA combines two key techniques: (1) High-Frequency Augmentation, which applies Fourier transform with frequency-selective amplification to diversify inputs and emphasize the critical role of high-frequency components in adversarial attacks, and (2) Hierarchical-Gradient Fusion, which merges multi-scale gradient decomposition and fusion to capture both global structures and fine-grained details, resulting in smoother perturbations. Our experiment demonstrates that FSA consistently outperforms state-of-the-art methods across various black-box models. Notably, our proposed FSA achieves an average attack success rate increase of 23.6% compared with BSR (CVPR 2024) on eight black-box defense models.
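
Frequency-selective amplification can be illustrated on a 1D signal: take a Fourier transform, scale the coefficients above a cutoff frequency, and transform back. The sketch below uses a naive DFT from the standard library. It illustrates the general operation only; the `cutoff` and `gain` parameters and the function names are hypothetical and not taken from the paper.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def amplify_high_freq(x, cutoff, gain):
    """Scale every frequency bin at or above `cutoff` by `gain`, then invert."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        freq = min(k, n - k)  # bins above n/2 mirror the low bins
        if freq >= cutoff:
            spectrum[k] *= gain
    return idft(spectrum)
```

A constant signal has only a zero-frequency component and passes through unchanged, while an alternating signal sits entirely at the highest frequency and is scaled by `gain`.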

[CV-43] Normalized Attention Guidance: Universal Negative Guidance for Diffusion Model

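[Quick Read]: This paper addresses negative guidance in diffusion models, in particular the failure of conventional Classifier-Free Guidance (CFG) in few-step sampling, where predictions from the positive and negative branches diverge. The key to its solution is an efficient, training-free mechanism, Normalized Attention Guidance (NAG), which applies L1-based normalization and extrapolation in attention space to restore effective negative guidance where CFG fails while maintaining generation quality. NAG applies broadly across architectures, sampling strategies, and modalities, with very low computational overhead.

Link: https://arxiv.org/abs/2505.21179
Authors: Dar-Yen Chen,Hmrishav Bandyopadhyay,Kai Zou,Yi-Zhe Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: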

Abstract:Negative guidance – explicitly suppressing unwanted attributes – remains a fundamental challenge in diffusion models, particularly in few-step sampling regimes. While Classifier-Free Guidance (CFG) works well in standard settings, it fails under aggressive sampling step compression due to divergent predictions between positive and negative branches. We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. Unlike existing approaches, NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video), functioning as a universal plug-in with minimal computational overhead. Through extensive experimentation, we demonstrate consistent improvements in text alignment (CLIP Score), fidelity (FID, PFID), and human-perceived quality (ImageReward). Our ablation studies validate each design component, while user studies confirm significant preference for NAG-guided outputs. As a model-agnostic inference-time approach requiring no retraining, NAG provides effortless negative guidance for all modern diffusion frameworks – pseudocode in the Appendix!
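
The abstract names three ingredients: extrapolation in attention space, L1-based normalization, and refinement. The sketch below is one way to read those steps on a single feature vector: extrapolate away from the negative branch, cap the growth of the L1 norm, then blend back toward the positive branch. The exact formulation, the parameter names (`scale`, `tau`, `alpha`), and their defaults are assumptions for illustration, not the paper's pseudocode.

```python
def normalized_attention_guidance(z_pos, z_neg, scale=4.0, tau=2.5, alpha=0.5):
    """Toy guidance step on attention features of the positive and negative branches."""
    # 1. Extrapolate: push the positive features away from the negative ones.
    z_ext = [p + scale * (p - n) for p, n in zip(z_pos, z_neg)]
    # 2. L1-based normalization: limit how much the L1 norm may grow.
    l1_pos = sum(abs(v) for v in z_pos)
    l1_ext = sum(abs(v) for v in z_ext)
    ratio = l1_ext / l1_pos if l1_pos > 0 else 1.0
    if ratio > tau:
        z_ext = [v * tau / ratio for v in z_ext]
    # 3. Refinement: blend the normalized result with the positive features.
    return [alpha * e + (1.0 - alpha) * p for e, p in zip(z_ext, z_pos)]
```

When both branches agree, the function returns the positive features unchanged, so guidance only acts where the negative prompt actually changes the attention output.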

[CV-44] Topological Deep Learning for Speech Data

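[Quick Read]: This paper addresses the underuse of topological structure information by conventional deep learning models in speech recognition, aiming to improve performance and generalization. The key to its solution is the design of topology-aware convolutional kernels: by studying orthogonal group actions on the kernel space, it establishes a fiber-bundle decomposition of matrix spaces, which enables new filter generation methods. Concretely, the proposed Orthogonal Feature (OF) layer performs well in phoneme recognition, particularly in low-noise scenarios, and shows cross-domain adaptability.

Link: https://arxiv.org/abs/2505.21173
Authors: Zhiwang Yu
Affiliations: Southern University of Science and Technology (南方科技大学)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 21 pages, 15 figures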

Abstract:Topological data analysis (TDA) offers novel mathematical tools for deep learning. Inspired by Carlsson et al., this study designs topology-aware convolutional kernels that significantly improve speech recognition networks. Theoretically, by investigating orthogonal group actions on kernels, we establish a fiber-bundle decomposition of matrix spaces, enabling new filter generation methods. Practically, our proposed Orthogonal Feature (OF) layer achieves superior performance in phoneme recognition, particularly in low-noise scenarios, while demonstrating cross-domain adaptability. This work reveals TDA’s potential in neural network optimization, opening new avenues for mathematics-deep learning interdisciplinary studies.

[CV-45] RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images

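[Quick Read]: This paper addresses robust unsupervised anomaly detection (UAD) in real-world scenarios, in particular the severe performance degradation of existing methods on the MVTec AD 2 benchmark caused by its complex real-world challenges. The key to its solution is a framework named RoBiS with three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy that preserves information about small anomalies through overlapping window cropping; (2) data augmentation with noise addition and lighting simulation on the training data to improve the robustness of the AD model; (3) joint adaptive binarization that combines a traditional statistical binarization strategy with the MEBin method, followed by SAM to refine the segmentation results.

Link: https://arxiv.org/abs/2505.21152
Authors: Xurui Li,Zhonesheng Jiang,Tingxuan Ai,Yu Zhou
Affiliations: Huazhong University of Science and Technology (华中科技大学); Hubei Key Laboratory of Smart Internet Technology (湖北省智能互联网技术重点实验室); Wuhan JingCe Electronic Group Co., LTD (武汉晶策电子集团有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: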

Abstract:Robust unsupervised anomaly detection (AD) in real-world scenarios is an important task. Current methods exhibit severe performance degradation on the MVTec AD 2 benchmark due to its complex real-world challenges. To solve this problem, we propose a robust framework RoBiS, which consists of three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy to preserve the information of small anomalies through overlapping window cropping. (2) The data augmentation of noise addition and lighting simulation is carried out on the training data to improve the robustness of AD model. We use INP-Former as our baseline, which could generate better results on the various sub-images. (3) The traditional statistical-based binarization strategy (mean+3std) is combined with our previous work, MEBin (published in CVPR2025), for joint adaptive binarization. Then, SAM is further employed to refine the segmentation results. Compared with some methods reported by the MVTec AD 2, our RoBiS achieves a 29.2% SegF1 improvement (from 21.8% to 51.00%) on Test_private and 29.82% SegF1 gains (from 16.7% to 46.52%) on Test_private_mixed. Code is available at this https URL.
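
Two of the steps named in the abstract are simple enough to sketch: overlapping window cropping and the statistical mean+3std threshold. The code below shows 1D window coordinates (the 2D case applies the same logic per axis) and a threshold over a list of anomaly scores. The function names, the stride handling, and the choice of which scores feed the threshold are assumptions; MEBin and SAM refinement are not shown.

```python
import statistics


def overlapping_windows(length, window, stride):
    """Return (start, end) crop ranges covering [0, length) with overlap when stride < window."""
    starts = list(range(0, max(length - window, 0) + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)  # extra window so the tail is not dropped
    return [(s, s + window) for s in starts]


def mean_3std_threshold(scores):
    """Statistical binarization threshold: mean plus three (population) standard deviations."""
    return statistics.fmean(scores) + 3.0 * statistics.pstdev(scores)


def binarize(scores, threshold):
    """Mark a score as anomalous (1) when it exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]
```

Overlap ensures that a small anomaly lying on a window border appears whole in at least one crop, which is the motivation the abstract gives for the cropping step.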

[CV-46] IKMo: Image-Keyframed Motion Generation with Trajectory-Pose Conditioned Motion Diffusion Model

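[Quick Read]: This paper addresses the suboptimal outputs of existing human motion generation methods with trajectory and pose inputs, which process both modalities globally. The key to its solution is IKMo, which decouples trajectory and pose and processes them in a two-stage conditioning framework: the first stage refines the inputs with a dedicated optimization module, and the second stage encodes trajectory and pose in parallel with a Trajectory Encoder and a Pose Encoder, after which a motion ControlNet guides the generation of motion with high spatial and semantic fidelity. In addition, agents based on a multimodal large language model (MLLM) pre-process user inputs, further aligning the generated motion with user expectations.

Link: https://arxiv.org/abs/2505.21146
Authors: Yang Zhao,Yan Zhang,Xubo Yang
Affiliations: Shanghai Jiao Tong University (上海交通大学)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: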

Abstract:Existing human motion generation methods with trajectory and pose inputs operate global processing on both modalities, leading to suboptimal outputs. In this paper, we propose IKMo, an image-keyframed motion generation method based on the diffusion model with trajectory and pose being decoupled. The trajectory and pose inputs go through a two-stage conditioning framework. In the first stage, the dedicated optimization module is applied to refine inputs. In the second stage, trajectory and pose are encoded via a Trajectory Encoder and a Pose Encoder in parallel. Then, motion with high spatial and semantic fidelity is guided by a motion ControlNet, which processes the fused trajectory and pose data. Experimental results based on HumanML3D and KIT-ML datasets demonstrate that the proposed method outperforms state-of-the-art on all metrics under trajectory-keyframe constraints. In addition, MLLM-based agents are implemented to pre-process model inputs. Given texts and keyframe images from users, the agents extract motion descriptions, keyframe poses, and trajectories as the optimized inputs into the motion generation model. We conduct a user study with 10 participants. The experimental results show that the MLLM-based agents' pre-processing makes generated motion more in line with users’ expectation. We believe that the proposed method improves both the fidelity and controllability of motion generation by the diffusion model.

[CV-47] FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention FAST

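[Quick Read]: This paper addresses the challenge of training-free adaptation of pretrained identity adapters (ID-adapters) to diffusion models, especially diffusion models accelerated via distillation. The key to its solution is a careful redesign of classifier-free guidance for few-step stylistic generation, together with attention manipulation mechanisms in decoupled blocks that improve identity similarity and fidelity, yielding the universal FastFace framework. It also develops a disentangled public evaluation protocol for identity-preserving adapters.

Link: https://arxiv.org/abs/2505.21144
Authors: Sergey Karpukhin,Vadim Titov,Andrey Kuznetsov,Aibek Alanov
Affiliations: Skoltech, AIRI; AIRI; AIRI, Sber, Innopolis; HSE University, AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: code available at this https URL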

Abstract:In recent years, a plethora of identity-preserving adapters for personalized generation with diffusion models have been released. Their main disadvantage is that they are predominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work tackles the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation. Through a careful redesign of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose the universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for ID-preserving adapters.

[CV-48] SageAttention2++: A More Efficient Implementation of SageAttention2

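[Quick Read]: This paper addresses the low computational efficiency of attention on long sequences, whose time complexity grows quadratically with sequence length. The key to its solution is to use quantization to accelerate the matrix multiplications (Matmul) in attention, and to further speed up SageAttention2 by using the faster instruction for FP8 Matmul accumulated in FP16, which is 2x faster than the FP8 Matmul used in SageAttention2.

Link: https://arxiv.org/abs/2505.21136
Authors: Jintao Zhang,Xiaoming Xu,Jia Wei,Haofeng Huang,Pengle Zhang,Chendong Xiang,Jun Zhu,Jianfei Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: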

Abstract:The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at this https URL.

[CV-49] Learning Single Index Models with Diffusion Priors ICML2025

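[Quick Read]: This paper addresses accurate signal recovery under semi-parametric single index models using generative AI, in particular for nonlinear measurement models with discontinuous or unknown link functions. Existing methods are either limited to specific reconstruction problems or cannot handle such nonlinear models. The key contribution is an efficient reconstruction method that needs only one round of unconditional sampling and (partial) inversion of the diffusion model (DM) to recover signals accurately, with theoretical analysis supporting the method's effectiveness.

Link: https://arxiv.org/abs/2505.21135
Authors: Anqi Tang,Youming Chen,Shuchen Xue,Zhaoqiang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: ICML 2025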

Abstract:Diffusion models (DMs) have demonstrated remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have discontinuous and unknown link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed methods has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.

[CV-50] ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

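[Quick Read]: This paper targets the reassembly task across multiple domains, where existing deep learning methods are limited in scalability, multimodal fusion, and real-world applicability, especially for complex shapes and realistic erosion. The key is ReassembleNet, which represents each input piece as a set of contour keypoints and selects the most informative ones with techniques inspired by graph neural network pooling, lowering computational complexity while fusing multimodal features (geometry and texture). Pretraining on a semi-synthetic dataset further improves performance, and diffusion-based pose estimation recovers the original structure, markedly improving the root mean square error (RMSE) for rotation and translation.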

Link: https://arxiv.org/abs/2505.21117
Authors: Adeela Islam,Stefano Fiorini,Stuart James,Pietro Morerio,Alessio Del Bue
Affiliations: Fondazione Istituto Italiano di Tecnologia; University of Genova; Durham University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones using techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.

[CV-51] Differentiable Solver Search for Fast Diffusion Sampling ICML25

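[Quick Read]: This paper addresses the trade-off between generation quality and computational cost in diffusion models, i.e., how to solve the reverse-diffusion process more efficiently and accurately under a limited number of sampling steps. The key is a new solver obtained by differentiable search: by analyzing a compact search space made up of time steps and solver coefficients, the algorithm finds a better solver. With only 10 sampling steps, SiT-XL/2, FlowDCN-XL/2, and DiT-XL/2 reach FID scores of 2.40, 2.35, and 2.33 on ImageNet256, respectively, and the searched solver generalizes across model architectures, resolutions, and model sizes.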

Link: https://arxiv.org/abs/2505.21114
Authors: Shuai Wang,Zexian Li,Qipeng zhang,Tianhui Song,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML25

Abstract:Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a better solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
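
The search space mentioned in the abstract (time steps plus solver coefficients) can be illustrated with a generic linear multistep integrator whose time grid and coefficients are plain arguments. This is a minimal sketch and not the paper's search algorithm: the function name `multistep_solve`, the test equation dx/dt = -x, and the Euler warm-up are assumptions. The coefficients [1.5, -0.5] are the standard two-step Adams-Bashforth weights obtained from Lagrange interpolation on a uniform grid; a searched solver would replace them, and the grid, with learned values.

```python
import math

def multistep_solve(velocity, x0, timesteps, coeffs):
    """Integrate dx/dt = velocity(x, t) with a linear multistep update.

    coeffs[k] weights the velocity evaluated k steps back (coeffs sum to 1).
    Until enough history exists, the update falls back to a plain Euler step.
    """
    x, history = x0, []
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        history.insert(0, velocity(x, t_cur))       # newest evaluation first
        history = history[:len(coeffs)]
        if len(history) < len(coeffs):
            direction = history[0]                  # Euler warm-up step
        else:
            direction = sum(c * v for c, v in zip(coeffs, history))
        x = x + (t_next - t_cur) * direction
    return x

velocity = lambda x, t: -x                          # toy ODE with known solution exp(-t)
steps = [i / 10 for i in range(11)]                 # 10 uniform steps on [0, 1]
exact = math.exp(-1.0)
err_euler = abs(multistep_solve(velocity, 1.0, steps, [1.0]) - exact)
err_ab2 = abs(multistep_solve(velocity, 1.0, steps, [1.5, -0.5]) - exact)
```

With the same number of velocity evaluations, the two-step variant is noticeably more accurate than Euler on this toy problem, which is the kind of gain a better choice of `timesteps` and `coeffs` is meant to deliver.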

[CV-52] Instance Data Condensation for Image Super-Resolution

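[Quick Read]: This paper addresses the heavy computational and storage cost that deep learning based image super-resolution (ISR) incurs by relying on large training datasets. The key is a framework called Instance Data Condensation (IDC), which performs instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching, optimizing feature distributions at both global and local levels and producing high-quality synthetic training data with fine detail.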

Link: https://arxiv.org/abs/2505.21099
Authors: Tianhao Peng,Ho Man Kwan,Yuxuan Jiang,Ge Gao,Fan Zhang,Xiaozhong Xu,Shan Liu,David Bull
Affiliations: University of Bristol, UK; Tencent Media Lab, Palo Alto, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning based image Super-Resolution (ISR) relies on large training datasets to optimize model generalization; this requires substantial computational and storage resources during training. While dataset condensation has shown potential in improving data efficiency and privacy for high-level computer vision tasks, it has not yet been fully exploited for ISR. In this paper, we propose a novel Instance Data Condensation (IDC) framework specifically for ISR, which achieves instance-level data condensation through Random Local Fourier Feature Extraction and Multi-level Feature Distribution Matching. This aims to optimize feature distributions at both global and local levels and obtain high-quality synthesized training content with fine detail. This framework has been utilized to condense the most commonly used training dataset for ISR, DIV2K, with a 10% condensation rate. The resulting synthetic dataset offers comparable or (in certain cases) even better performance compared to the original full dataset and excellent training stability when used to train various popular ISR models. To the best of our knowledge, this is the first time that a condensed/synthetic dataset (with a 10% data volume) has demonstrated such performance. The source code and the synthetic dataset have been made available at this https URL.

[CV-53] DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

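[Quick Read]: This paper addresses the challenges that complex disaster scenes pose for large vision-language models (VLMs), specifically the combination of multiple disaster types, geographic regions, and satellite sensors. The key is DisasterM3, a remote sensing vision-language dataset for global-scale disaster assessment and response with three core properties: multi-hazard, multi-sensor, and multi-task. Fine-tuning on this dataset yields stable improvements across all disaster tasks and stronger cross-sensor and cross-disaster generalization.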

Link: https://arxiv.org/abs/2505.21089
Authors: Junjue Wang,Weihao Xuan,Heli Qi,Zhihao Liu,Kunyi Liu,Yuhan Wu,Hongruixuan Chen,Jian Song,Junshi Xia,Zhuo Zheng,Naoto Yokoya
Affiliations: The University of Tokyo; RIKEN AIP; Waseda University; Stony Brook University; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: A multi-hazard, multi-sensor, and multi-task vision-language dataset for global-scale disaster assessment and response

Abstract:Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2) Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLMs’ reasoning ability by progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities.

[CV-54] Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts

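[Quick Read]: This paper addresses two problems in 3D scene understanding with multimodal large language models (MLLMs). First, existing models use only one or a limited subset of 3D modalities, which leads to incomplete representations and reduced interpretive accuracy. Second, different query types depend on different modalities, so processing all modality tokens uniformly may fail to capture query-specific context. The key is Uni3D-MoE, a 3D MLLM built on a sparse Mixture-of-Experts (MoE): a learnable routing mechanism dynamically selects suitable experts at the token level, and each expert processes multimodal tokens according to learned modality preferences, giving adaptive 3D multimodal fusion.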

Link: https://arxiv.org/abs/2505.21079
Authors: Yue Zhang,Yingzhao Jian,Hehe Fan,Yi Yang,Roger Zimmermann
Affiliations: Zhejiang University; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in multimodal large language models (MLLMs) have demonstrated considerable potential for comprehensive 3D scene understanding. However, existing approaches typically utilize only one or a limited subset of 3D modalities, resulting in incomplete representations of 3D scenes and reduced interpretive accuracy. Furthermore, different types of queries inherently depend on distinct modalities, indicating that uniform processing of all modality tokens may fail to effectively capture query-specific context. To address these challenges, we propose Uni3D-MoE, a sparse Mixture-of-Experts (MoE)-based 3D MLLM designed to enable adaptive 3D multimodal fusion. Specifically, Uni3D-MoE integrates a comprehensive set of 3D modalities, including multi-view RGB and depth images, bird’s-eye-view (BEV) maps, point clouds, and voxel representations. At its core, our framework employs a learnable routing mechanism within the sparse MoE-based large language model, dynamically selecting appropriate experts at the token level. Each expert specializes in processing multimodal tokens based on learned modality preferences, thus facilitating flexible collaboration tailored to diverse task-specific requirements. Extensive evaluations on standard 3D scene understanding benchmarks and specialized datasets demonstrate the efficacy of Uni3D-MoE.
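
Token-level expert selection of the kind described above is commonly implemented as top-k gating. The sketch below shows that generic pattern for a single token. It is not the Uni3D-MoE implementation: the router logits are passed in directly (a real model would compute them with a learned layer), and the toy experts, `route_token`, and `moe_forward` are hypothetical names introduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    gates = route_token(router_logits, k)
    out = [0.0] * len(token)
    for idx, weight in gates.items():
        expert_out = experts[idx](token)            # only selected experts are evaluated
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, gates

# Four toy experts standing in for learned feed-forward blocks.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [x + 1.0 for x in v],
    lambda v: [-x for x in v],
    lambda v: [0.0 for _ in v],
]
output, gates = moe_forward([1.0, 2.0], [0.1, 2.0, -1.0, 1.5], experts, k=2)
```

Only the two experts with the largest logits run for this token, which is what keeps a sparse MoE cheaper than evaluating every expert on every token.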

[CV-55] DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

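[Quick Read]: This paper addresses the limited use of multimodal large language models in long-term Earth observation analysis, in particular their weaknesses in long time-series understanding and quantitative analysis. The key is DVL-Suite, which contains 15,063 high-resolution (1.0m) multi-temporal remote sensing images covering 42 U.S. metropolitan areas from 2005 to 2023 and is split into two components, DVL-Bench and DVL-Instruct. DVL-Bench provides urban understanding tasks ranging from pixel-level change detection to regional-level quantitative analysis and scene-level urban narratives, while DVL-Instruct is an instruction-tuning dataset designed to strengthen models on multi-temporal Earth observation. Building on this dataset, the authors develop DVLChat, a baseline model that handles both image-level question answering and pixel-level segmentation, enabling a comprehensive understanding of city dynamics through language interaction.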

Link: https://arxiv.org/abs/2505.21076
Authors: Weihao Xuan,Junjue Wang,Heli Qi,Zihang Chen,Zhuo Zheng,Yanfei Zhong,Junshi Xia,Naoto Yokoya
Affiliations: The University of Tokyo; RIKEN AIP; Waseda University; Wuhan University; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 15,063 high-resolution (1.0m) multi-temporal images spanning 42 megacities in the U.S. from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. The DVL-Bench includes seven urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 17 state-of-the-art multimodal large language models and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models’ capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.

[CV-56] Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

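[Quick Read]: This paper addresses the challenge of safety evaluation for text-to-image (T2I) models, in particular how to evade unknown and diverse defense mechanisms. Existing white-box and black-box methods both have limitations that make them hard to apply to closed-source models or real commercial API settings. The key is Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively uses a large language model (LLM) to modify prompts and fine-tunes the LLM on feedback from the T2I system, so that it adapts dynamically to unknown defenses. To use the feedback more effectively, the method introduces rule-based preference modeling, which gives finer-grained control over the LLM's adaptation process.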

Link: https://arxiv.org/abs/2505.21074
Authors: Yichuan Cao,Yibo Miao,Xiao-Shan Gao,Yinpeng Dong
Affiliations: Chinese Academy of Sciences; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models’ security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge about the model’s specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively employs LLM to modify prompts to query and leverages feedback from T2I systems for fine-tuning the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Given that the feedback is often labeled and coarse-grained, making it difficult to utilize directly, we further propose rule-based preference modeling, which employs a set of rules to evaluate desired or undesired feedback, facilitating finer-grained control over the LLM’s dynamic adaptation process. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.

[CV-57] Minute-Long Videos with Dual Parallelisms

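[Quick Read]: This paper addresses the high processing latency and memory cost that Diffusion Transformer (DiT)-based video diffusion models incur when generating long videos. The key is a distributed inference strategy called DualParal, which parallelizes both temporal frames and model layers across multiple GPUs. Because diffusion models require synchronized noise levels across frames, a naive split loses its parallelism; the method therefore uses a block-wise denoising scheme in which GPUs process frame blocks in sequence at progressively lower noise levels, enabling asynchronous computation and communication.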

Link: https://arxiv.org/abs/2505.21070
Authors: Zeqing Wang,Bowen Zheng,Xingyi Yang,Yuecong Xu,Xinchao Wang
Affiliations: National University of Singapore; Xidian University; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
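
The block-wise denoising idea, where frame blocks at different noise levels are in flight at the same time, can be pictured with a small queue simulation. This is a toy model under stated assumptions, not DualParal itself: it ignores the layer partitioning, the feature cache, and all GPU communication, and the function `blockwise_schedule` and its parameters are hypothetical.

```python
from collections import deque

def blockwise_schedule(num_blocks, num_levels):
    """Simulate a queue where every in-flight block is denoised one level per tick.

    Returns the order in which blocks finish and, for each tick, a snapshot of
    the queue as (block_id, remaining_noise_level) pairs.
    """
    queue, finished, snapshots = deque(), [], []
    next_block = 0
    while len(finished) < num_blocks:
        if next_block < num_blocks:
            queue.append([next_block, num_levels])  # new block enters fully noisy
            next_block += 1
        for item in queue:                          # one denoising step per block
            item[1] -= 1
        snapshots.append([tuple(item) for item in queue])
        while queue and queue[0][1] == 0:           # the oldest block is now clean
            finished.append(queue.popleft()[0])
    return finished, snapshots

finished, snapshots = blockwise_schedule(num_blocks=5, num_levels=3)
```

In every snapshot the oldest block carries the lowest noise level and the newest the highest, so no block has to wait for the others to reach the same level. The whole run takes `num_blocks + num_levels - 1` ticks instead of `num_blocks * num_levels`.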

[CV-58] Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

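[Quick Read]: This paper tackles the inverse of the virtual try-on (VTON) task, virtual try-off (VTOFF): generating standardized product images of garments from real-world photos of clothed people. Unlike VTON, VTOFF has a more consistent and well-defined output format, but existing methods struggle to disentangle garment features from occlusions and complex poses and only handle a single garment category, which limits generalization. The key is the Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF) architecture, built on a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. It supports multi-category settings through multimodal inputs (images, text, and masks) and adds an alignment module to further refine generated details.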

Link: https://arxiv.org/abs/2505.21062
Authors: Davide Lobba,Fulvio Sanguigni,Bin Ren,Marcella Cornia,Rita Cucchiara,Nicu Sebe
Affiliations: University of Trento, Italy; University of Pisa, Italy; University of Modena and Reggio Emilia, Italy
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While virtual try-on (VTON) systems aim to render a garment onto a target person image, this paper tackles the novel task of virtual try-off (VTOFF), which addresses the inverse problem: generating standardized product images of garments from real-world photos of clothed individuals. Unlike VTON, which must resolve diverse pose and style variations, VTOFF benefits from a consistent and well-defined output format – typically a flat, lay-down-style representation of the garment – making it a promising tool for data generation and dataset enhancement. However, existing VTOFF approaches face two major limitations: (i) difficulty in disentangling garment features from occlusions and complex poses, often leading to visual artifacts, and (ii) restricted applicability to single-category garments (e.g., upper-body clothes only), limiting generalization. To address these challenges, we present Text-Enhanced MUlti-category Virtual Try-Off (TEMU-VTOFF), a novel architecture featuring a dual DiT-based backbone with a modified multimodal attention mechanism for robust garment feature extraction. Our architecture is designed to receive garment information from multiple modalities like images, text, and masks to work in a multi-category setting. Finally, we propose an additional alignment module to further refine the generated visual details. Experiments on VITON-HD and Dress Code datasets show that TEMU-VTOFF sets a new state-of-the-art on the VTOFF task, significantly improving both visual quality and fidelity to the target garments.

[CV-59] LPOI: Listwise Preference Optimization for Vision Language Models ACL2025

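[Quick Read]: This paper addresses the overfitting to textual information and the hallucinations that arise when aligning vision language models (VLMs) with human preferences. The key is LPOI, the first object-aware listwise preference optimization method: it identifies and masks a critical object in the image, then interpolates between the positive and negative images to produce a sequence of progressively more complete images, and trains the model to rank these images by object visibility, which reduces hallucinations while preserving visual fidelity.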

Link: https://arxiv.org/abs/2505.21061
Authors: Fatemeh Pesaran Zadeh,Yoojin Oh,Gunhee Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Main. Code is released at this https URL

Abstract:Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance. We make the code available at this https URL.
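
The list construction described above (mask an object, then interpolate the masked region between the negative and positive images) can be sketched on tiny grayscale images stored as nested lists. The linear pixel-space blend, the zero-valued mask, and the name `interpolate_masked_region` are assumptions for illustration; the abstract does not specify how LPOI performs the interpolation.

```python
def interpolate_masked_region(positive, negative, box, num_steps):
    """Blend the masked box from the negative image towards the positive one.

    Images are 2D lists of floats; box is (top, left, bottom, right) with the
    bottom/right edges exclusive. Pixels outside the box come from `positive`.
    Requires num_steps >= 2.
    """
    top, left, bottom, right = box
    sequence = []
    for step in range(num_steps):
        alpha = step / (num_steps - 1)              # 0 = fully masked, 1 = fully visible
        image = [row[:] for row in positive]
        for r in range(top, bottom):
            for c in range(left, right):
                image[r][c] = (1 - alpha) * negative[r][c] + alpha * positive[r][c]
        sequence.append(image)
    return sequence

positive = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
box = (0, 1, 2, 3)                                  # hypothetical object location
negative = [row[:] for row in positive]
for r in range(box[0], box[2]):
    for c in range(box[1], box[3]):
        negative[r][c] = 0.0                        # masked-out object
ranked = interpolate_masked_region(positive, negative, box, num_steps=4)
```

The resulting list is already ordered by object visibility, so it can serve as the target ranking for a listwise loss without any annotation beyond the original positive/negative pair.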

[CV-60] Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

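[Quick Read]: This paper addresses how to stylize 3D scenes quickly while maintaining multi-view consistency and closely matching a style image. State-of-the-art 3D stylization methods usually need computationally intensive test-time optimization and dense posed input images, whereas this paper proposes an approach that performs direct 3D stylization in under a second from unposed sparse-view scene images and an arbitrary style image. The key is a branched architecture that separates structure modeling from appearance shading, which prevents style transfer from distorting the underlying 3D scene structure, together with an adapted identity loss that supports pre-training of the stylization model, so that the model keeps its original reconstruction ability while being fine-tuned for stylization.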

Link: https://arxiv.org/abs/2505.21060
Authors: Peng Wang,Xiang Liu,Peidong Liu
Affiliations: Zhejiang University; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically involve computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, often requiring dense posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach to achieve direct 3D stylization in less than a second using unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling and appearance shading, effectively preventing stylistic transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations, using both in-domain and out-of-domain datasets, demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in terms of multi-view consistency and efficiency.

[CV-61] Advancing high-fidelity 3D and Texture Generation with 2.5D latents

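[Quick Read]: This paper addresses the geometry-texture inconsistency caused by generating 3D geometry and texture separately, as well as the performance limits imposed by the complexity and uneven quality of 3D data. The key is a new framework for joint generation of 3D geometry and texture. At its core is a versatile 2.5D representation (2.5D latents) that converts seamlessly between 2D and 3D: multiview RGB, normal, and coordinate images are integrated into a unified representation, pre-trained 2D foundation models are adapted for high-fidelity 2.5D generation, and a lightweight 2.5D-to-3D refiner-decoder efficiently produces detailed 3D representations.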

Link: https://arxiv.org/abs/2505.21050
Authors: Xin Yang,Jiantao Lin,Yingjie Xu,Haodong Li,Yingcong Chen
Affiliations: HKUST; HKUST(GZ); China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, frequently leading to unsatisfactory coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus on generating a versatile 2.5D representation that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.

[CV-62] Robust Video-Based Pothole Detection and Area Estimation for Intelligent Vehicles with Depth Map and Kalman Smoothing

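[Quick Read]: This paper addresses the accuracy of road pothole detection and area estimation, in particular the errors that camera angle variation and the flat-road assumption introduce in complex real-world environments. The key is a robust framework that combines object detection with monocular depth estimation: the ACSH-YOLOv8 model improves small-pothole detection, the BoT-SORT algorithm tracks potholes, DepthAnything V2 generates depth maps, the Minimum Bounding Triangulated Pixel (MBTP) method estimates area, and a Kalman Filter based on Confidence and Distance (CDKF) keeps the estimates consistent across consecutive frames.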

Link: https://arxiv.org/abs/2505.21049
Authors: Dehao Wang,Haohang Zhu,Yiwen Xu,Kaiqi Liu
Affiliations: Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Road potholes pose a serious threat to driving safety and comfort, making their detection and assessment a critical task in fields such as autonomous driving. When driving vehicles, the operators usually avoid large potholes and approach smaller ones at reduced speeds to ensure safety. Therefore, accurately estimating pothole area is of vital importance. Most existing vision-based methods rely on distance priors to construct geometric models. However, their performance is susceptible to variations in camera angles and typically relies on the assumption of a flat road surface, potentially leading to significant errors in complex real-world environments. To address these problems, a robust pothole area estimation framework that integrates object detection and monocular depth estimation in a video stream is proposed in this paper. First, to enhance pothole feature extraction and improve the detection of small potholes, ACSH-YOLOv8 is proposed with the ACmix module and a small object detection head. Then, the BoT-SORT algorithm is utilized for pothole tracking, while DepthAnything V2 generates depth maps for each frame. With the obtained depth maps and pothole labels, a novel Minimum Bounding Triangulated Pixel (MBTP) method is proposed for pothole area estimation. Finally, a Kalman Filter based on Confidence and Distance (CDKF) is developed to maintain consistency of estimation results across consecutive frames. The results show that the ACSH-YOLOv8 model achieves an AP(50) of 76.6%, representing a 7.6% improvement over YOLOv8. Through CDKF optimization across consecutive frames, pothole predictions become more robust, thereby enhancing the method’s practical applicability.
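
Smoothing per-frame area estimates with a Kalman filter can be sketched in a few lines for a single tracked pothole. This is a generic scalar filter, not the paper's CDKF: the rule that measurement variance equals `base_meas_var / confidence` is an assumption made here, the distance term is omitted entirely, and the sample areas and confidences are invented for illustration.

```python
def smooth_area_estimates(measurements, confidences, process_var=0.01, base_meas_var=1.0):
    """Scalar Kalman filter over per-frame area estimates of one tracked pothole.

    Assumption of this sketch: measurement variance is base_meas_var / confidence,
    so low-confidence detections move the estimate less. Confidences must be > 0.
    """
    estimate = measurements[0]
    variance = base_meas_var / confidences[0]
    smoothed = [estimate]
    for z, conf in zip(measurements[1:], confidences[1:]):
        variance += process_var                     # predict: area assumed constant
        meas_var = base_meas_var / conf
        gain = variance / (variance + meas_var)     # Kalman gain
        estimate += gain * (z - estimate)           # update with the new frame
        variance *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed

areas = [0.50, 0.54, 0.47, 0.90, 0.52, 0.49]        # invented data; frame 4 is an outlier
confs = [0.9, 0.8, 0.9, 0.2, 0.9, 0.85]             # invented detection confidences
smoothed = smooth_area_estimates(areas, confs)
```

Because the outlier frame arrives with low confidence, its gain is small and the smoothed estimate stays close to the values from neighbouring frames.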

[CV-63] CityGo: Lightweight Urban Modeling and Rendering with Proxy Buildings and Residual Gaussians

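[Quick Read]: This paper addresses efficient and accurate modeling of large-scale urban scenes from aerial views, which matters for AR navigation, UAV-based inspection, and smart city digital twins. Existing methods such as 3D Gaussian Splatting (3DGS) improve scalability and visual quality but are limited by dense primitive usage, long training times, and poor suitability for edge devices. The proposed CityGo framework combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes. The key steps are extracting compact building proxy meshes from multi-view stereo (MVS) point clouds, generating occlusion-free textures with zero-order spherical harmonics (SH) Gaussians, placing residual Gaussians based on proxy-photo discrepancies and depth priors to capture high-frequency detail, and applying importance-aware downsampling to reduce redundancy in non-critical regions, which together enable real-time rendering of complex urban scenes on mobile GPUs.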

Link: https://arxiv.org/abs/2505.21041
Authors: Weihang Liu,Yuhui Zhong,Yuke Li,Xi Chen,Jiadi Cui,Honglong Zhang,Lan Xu,Xin Lou,Yujiao Shi,Jingyi Yu,Yingliang Zhang
Affiliations: ShanghaiTech University; DGene; Migu Cultural Technology Co., Ltd; GGU Technology Co., Ltd; Stereye
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate and efficient modeling of large-scale urban scenes is critical for applications such as AR navigation, UAV-based inspection, and smart city digital twins. While aerial imagery offers broad coverage and complements limitations of ground-based data, reconstructing city-scale environments from such views remains challenging due to occlusions, incomplete geometry, and high memory demands. Recent advances like 3D Gaussian Splatting (3DGS) improve scalability and visual quality but remain limited by dense primitive usage, long training times, and poor suitability for edge devices. We propose CityGo, a hybrid framework that combines textured proxy geometry with residual and surrounding 3D Gaussians for lightweight, photorealistic rendering of urban scenes from aerial perspectives. Our approach first extracts compact building proxy meshes from MVS point clouds, then uses zero-order SH Gaussians to generate occlusion-free textures via image-based rendering and back-projection. To capture high-frequency details, we introduce residual Gaussians placed based on proxy-photo discrepancies and guided by depth priors. Broader urban context is represented by surrounding Gaussians, with importance-aware downsampling applied to non-critical regions to reduce redundancy. A tailored optimization strategy jointly refines proxy textures and Gaussian parameters, enabling real-time rendering of complex urban scenes on mobile GPUs with significantly reduced training and memory requirements. Extensive experiments on real-world aerial datasets demonstrate that our hybrid representation significantly reduces training time, achieving on average 1.4x speedup, while delivering comparable visual fidelity to pure 3D Gaussian Splatting approaches. Furthermore, CityGo enables real-time rendering of large-scale urban scenes on mobile consumer GPUs, with substantially reduced memory usage and energy consumption.

[CV-64] RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

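[Quick Read]: This paper addresses the heavy computation of diffusion-based video generation, in particular the bottleneck that 3D attention in Diffusion Transformer (DiT) models accounts for over 80% of computational resources. The key is RainFusion, a training-free sparse attention method that exploits the inherent sparsity of visual data. During inference, an Adaptive Recognition Module (ARM) determines online the sparse pattern of each attention head (spatial, temporal, or textural), which significantly accelerates attention computation while preserving video quality. Experiments show that RainFusion speeds up attention computation by more than 2× with minimal impact on VBench scores.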

Link: https://arxiv.org/abs/2505.21036
Authors: Aiyue Chen,Bin Dong,Jingru Li,Jing Lin,Yiwu Yao,Gongyi Wang
Affiliations: Huawei Technologies Co., Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80% of the total computational resources. In this work, we introduce RainFusion, a novel training-free sparse attention method that exploits the inherent sparsity of visual data to accelerate attention computation while preserving video quality. Specifically, we identify three unique sparse patterns in video generation attention calculations: Spatial Pattern, Temporal Pattern, and Textural Pattern. The sparse pattern for each attention head is determined online with negligible overhead (~0.2%) with our proposed ARM (Adaptive Recognition Module) during inference. Our proposed RainFusion is a plug-and-play method that can be seamlessly integrated into state-of-the-art 3D-attention video generation models without additional training or calibration. We evaluate our method on leading open-sourced models including HunyuanVideo, OpenSoraPlan-1.2 and CogVideoX-5B, demonstrating its broad applicability and effectiveness. Experimental results show that RainFusion achieves over 2× speedup in attention computation while maintaining video quality, with only a minimal impact on VBench scores (-0.2%).

[CV-65] FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models ALT

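[Quick Read]: This paper addresses the difficulty of interpreting internal representations in deep neural networks, in particular how to map accurately from feature space to input space. Existing methods usually rely on crude approximations. The paper proposes a conditional diffusion model, i.e., a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps, to learn this mapping in a probabilistic manner. The key is to use the pretrained diffusion model to reconstruct inputs from feature space with high fidelity, which improves understanding of the model's internal representations.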

Link: https://arxiv.org/abs/2505.21032
Authors: Nils Neukirch,Johanna Vielhaben,Nils Strodthoff
Affiliations: Carl von Ossietzky Universität Oldenburg; Fraunhofer Heinrich-Hertz-Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 10 figures, code is available at this https URL

点击查看摘要

Abstract:Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

[CV-66] Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains

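[Quick Read]: This paper addresses the limited generalization of Semi-Supervised Federated Learning (SSFL) under domain shifts, with the aim of making SSFL more practical. The key to its solution is a new framework, the Unified Alignment Protocol (UAP), which aligns features across domains through an alternating two-stage training process. In the first stage, the server model learns its features and aligns them with a parametric distribution, which is then communicated to the clients. In the second stage, the clients use the server's feature distribution to align their own features, improving generalization in the decentralized federated setting.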

Link: https://arxiv.org/abs/2505.21010
Authors: Sabbir Ahmed, Mamshad Nayeem Rizve, Abdullah Al Arafat, Jacqueline Liu, Rahim Hossain, Mohaiminul Al Nahian, Adnan Siraj Rakin
Affiliations: Binghamton University (SUNY); Adobe Inc.; North Carolina State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Semi-Supervised Federated Learning (SSFL) is gaining popularity over conventional Federated Learning in many real-world applications. Due to the practical limitation of limited labeled data on the client side, SSFL considers that participating clients train with unlabeled data, and only the central server has the necessary resources to access limited labeled data, making it an ideal fit for real-world applications (e.g., healthcare). However, traditional SSFL assumes that the data distributions in the training phase and testing phase are the same. In practice, however, domain shifts frequently occur, making it essential for SSFL to incorporate generalization capabilities and enhance their practicality. The core challenge is improving model generalization to new, unseen domains while clients participate in SSFL. However, the decentralized setup of SSFL and unsupervised client training necessitates innovation to achieve improved generalization across domains. To achieve this, we propose a novel framework called the Unified Alignment Protocol (UAP), which consists of an alternating two-stage training process. The first stage involves training the server model to learn and align the features with a parametric distribution, which is subsequently communicated to clients without additional communication overhead. The second stage proposes a novel training algorithm that utilizes the server feature distribution to align client features accordingly. Our extensive experiments on standard domain generalization benchmark datasets across multiple model architectures reveal that the proposed UAP successfully achieves SOTA generalization performance in the SSFL setting.
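
The two-stage alignment can be sketched in a few lines. This is a rough illustration, not the authors' algorithm: the abstract does not name the parametric distribution or the client loss, so the sketch assumes a per-dimension Gaussian summary (mean and variance) and a simple moment-matching penalty. `feature_stats` and `alignment_loss` are hypothetical names.

```python
def feature_stats(features):
    """Per-dimension mean and (population) variance of a batch of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(dim)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n for j in range(dim)]
    return mean, var


def alignment_loss(client_features, server_mean, server_var):
    """Assumed moment-matching penalty: squared gap between client and server statistics."""
    mean, var = feature_stats(client_features)
    gaps = [
        (m - sm) ** 2 + (v - sv) ** 2
        for m, sm, v, sv in zip(mean, server_mean, var, server_var)
    ]
    return sum(gaps) / len(gaps)
```

In this reading, the server would compute `feature_stats` on its labeled data and send only the two small vectors to clients, and each client would add `alignment_loss` to its unsupervised objective. Whether UAP transmits statistics in this form is an assumption.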

[CV-67] Facial Attribute Based Text Guided Face Anonymization

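[Quick Read]: This paper addresses the difficulty of collecting high-quality datasets containing people's faces for computer vision applications, given that data privacy regulations require individual consent. The key to its solution is a deep-learning-based face anonymization pipeline that uses diffusion models for image inpainting, which removes the need to train Generative Adversarial Networks and produces natural-looking faces that cannot be identified.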

Link: https://arxiv.org/abs/2505.21002
Authors: Mustafa İzzet Muştu, Hazım Kemal Ekenel
Affiliations: Istanbul Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, published in the Proceedings of the Joint visuAAL-GoodBrother Conference on Trustworthy Video- and Audio-Based Assistive Technologies

Abstract:The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent for processing personal data, hindering researchers’ ability to collect high-quality datasets containing the faces of the individuals. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most of the existing methods, our method leverages recent advancements in diffusion-based inpainting models, eliminating the need for training Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes like age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.

[CV-68] Assessing the Use of Face Swapping Methods as Face Anonymizers in Videos

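[Quick Read]: This paper addresses how to protect personal privacy in large-scale visual data without significantly degrading data quality. The key to its solution is face swapping: in evaluations focused on temporal consistency, anonymity strength, and visual fidelity, face swapping techniques are found to hide personal identities in video data effectively.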

Link: https://arxiv.org/abs/2505.20985
Authors: Mustafa İzzet Muştu, Hazım Kemal Ekenel
Affiliations: Istanbul Technical University; NYU Abu Dhabi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 2025 25th International Conference on Digital Signal Processing (DSP 2025)

Abstract:The increasing demand for large-scale visual data, coupled with strict privacy regulations, has driven research into anonymization methods that hide personal identities without seriously degrading data quality. In this paper, we explore the potential of face swapping methods to preserve privacy in video data. Through extensive evaluations focusing on temporal consistency, anonymity strength, and visual fidelity, we find that face swapping techniques can produce consistent facial transitions and effectively hide identities. These results underscore the suitability of face swapping for privacy-preserving video applications and lay the groundwork for future advancements in anonymization focused face-swapping models.

[CV-69] DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization

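[Quick Read]: This paper addresses the problem of balancing concept fidelity and contextual alignment when personalized diffusion models are used for text-to-image generation. The key to its solution is a reinforcement-learning (RL) based approach that exploits the diverse outputs of text-to-image models to build a synthetic paired dataset for Direct Preference Optimization (DPO)-like training, which removes the need for human-annotated scores. The better-worse pairs are constructed specifically to improve both concept fidelity and prompt adherence, and the approach supports a flexible trade-off between image fidelity and textual alignment.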

Link: https://arxiv.org/abs/2505.20975
Authors: Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
Affiliations: HSE University; AIRI; Sber AI; Sber; Innopolis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally. The source code can be found at this https URL

Abstract:Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at this https URL.
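
One way to picture the construction of better-worse pairs is shown below. This is a sketch under assumptions rather than the paper's procedure: each generated image is assumed to carry two external metric scores (one for concept fidelity, one for prompt adherence), and an image counts as "better" only if it is at least as good on both and strictly better on one. The selection rule and the name `build_pairs` are hypothetical.

```python
from itertools import permutations


def build_pairs(candidates):
    """Return (better, worse) name pairs from (name, concept_score, prompt_score) tuples.

    Assumed rule: 'better' must not lose on either metric and must win on at least one.
    """
    pairs = []
    for a, b in permutations(candidates, 2):
        no_worse = a[1] >= b[1] and a[2] >= b[2]
        strictly_better = a[1] > b[1] or a[2] > b[2]
        if no_worse and strictly_better:
            pairs.append((a[0], b[0]))
    return pairs
```

Under this rule, an image with high concept fidelity but poor prompt adherence is never preferred over a balanced one, which matches the stated goal of improving both properties together. A weighted rule would be one way to expose the trade-off the abstract mentions, but that is not shown here.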

[CV-70] RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

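[Quick Read]: This paper addresses the degraded reconstruction of outdoor dynamic scenes by Neural Fields (NFs) that rely on RGB or LiDAR inputs under adverse weather, and the question of how to integrate millimeter-wave radar data with neural fields to improve spatiotemporal consistency in dynamic scenes. The key to its solution is the RF4D framework, which explicitly incorporates temporal information and a feature-level flow module to better model moving objects, and combines them with a radar-specific power rendering formulation to improve synthesis quality and scene perception accuracy.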

Link: https://arxiv.org/abs/2505.20967
Authors: Jiarui Zhang, Zhihao Li, Chong Wang, Bihan Wen
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural fields (NFs) have demonstrated remarkable performance in scene reconstruction, powering various tasks such as novel view synthesis. However, existing NF methods relying on RGB or LiDAR inputs often exhibit severe fragility to adverse weather, particularly when applied in outdoor scenarios like autonomous driving. In contrast, millimeter-wave radar is inherently robust to environmental changes, but its integration with NFs remains largely underexplored. Moreover, outdoor driving scenarios frequently involve moving objects, making spatiotemporal modeling essential for temporally consistent novel view synthesis. To this end, we introduce RF4D, a radar-based neural field framework specifically designed for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, significantly enhancing its capability to model moving objects. We further introduce a feature-level flow module that predicts latent temporal offsets between adjacent frames, enforcing temporal coherence in dynamic scene modeling. Moreover, we propose a radar-specific power rendering formulation closely aligned with radar sensing physics, improving synthesis accuracy and interoperability. Extensive experiments on public radar datasets demonstrate the superior performance of RF4D in terms of radar measurement synthesis quality and occupancy estimation accuracy, achieving especially pronounced improvements in dynamic outdoor scenarios.

[CV-71] Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

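[Quick Read]: This paper addresses how to learn visual representations from observing human actions in order to improve robot visuo-motor policy generation. The core challenge is to couple semantic segmentation with visual representation generation effectively so that reinforcement learning and imitation learning benefit. The key to its solution is an object-centric encoder that performs semantic segmentation and visual representation generation jointly rather than as separate processes. It builds on the Slot Attention mechanism and fine-tunes the SOLV model, pretrained on large out-of-domain datasets, on human action video data.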

Link: https://arxiv.org/abs/2505.20962
Authors: Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas
Affiliations: School of ECE, National Technical University of Athens, Greece; Robotics Institute, Athena Research Center, 15125 Maroussi, Athens, Greece; School of ECE, NTUA, Athens, Greece; Robotics Institute, Athena Research Center, Athens, Greece; HERON - Hellenic Robotics Center of Excellence, Athens, Greece
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach for semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions, although still out-of-domain, can significantly improve performance due to close alignment with robotic tasks. These findings show the capability to reduce reliance on annotated or robot-specific action datasets and the potential to build on existing visual encoders to accelerate training and improve generalizability.

[CV-72] OrienText: Surface Oriented Textual Image Generation SIGGRAPH

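[Quick Read]: This paper addresses the accurate generation and correct orientation of text on complex surfaces, such as angled views of architectural elements, which matters in e-commerce, particularly for marketing campaigns, product imaging, and advertising. The key to its solution is the Surface Oriented Textual Image Generation (OrienText) method, which introduces region-specific surface normals as conditional input to guide the precise rendering and orientation of text in the image.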

Link: https://arxiv.org/abs/2505.20958
Authors: Shubham Singh Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, SIGGRAPH Asia 2024 Technical Communications

Abstract:Textual content in images is crucial in e-commerce sectors, particularly in marketing campaigns, product imaging, advertising, and the entertainment industry. Current text-to-image (T2I) generation diffusion models, though proficient at producing high-quality images, often struggle to incorporate text accurately onto complex surfaces with varied perspectives, such as angled views of architectural elements like buildings, banners, or walls. In this paper, we introduce the Surface Oriented Textual Image Generation (OrienText) method, which leverages region-specific surface normals as conditional input to T2I generation diffusion model. Our approach ensures accurate rendering and correct orientation of the text within the image context. We demonstrate the effectiveness of the OrienText method on a self-curated dataset of images and compare it against the existing textual image generation methods.

[CV-73] DSOcc: Leveraging Depth Awareness and Semantic Aid to Boost Camera-Based 3D Semantic Occupancy Prediction

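[Quick Read]: This paper addresses two problems in camera-based 3D semantic occupancy prediction: incorrect feature assignments caused by explicit occupancy state inference, and insufficient samples limiting the learning of occupancy class inference. The key to its solution is to introduce depth awareness and semantic aid. Occupancy state and occupancy class are inferred jointly: a soft occupancy confidence is computed by a non-learning method and multiplied with image features so that the voxel representation becomes depth-aware, enabling adaptive implicit occupancy state inference. In addition, well-trained image semantic segmentation results are used directly, and multiple frames are fused with their occupancy probabilities to make occupancy class inference more robust.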

Link: https://arxiv.org/abs/2505.20951
Authors: Naiyu Fang, Zheyuan Zhou, Kang Wang, Ruibo Li, Lemiao Qiu, Shuyou Zhang, Zhe Wang, Guosheng Lin
Affiliations: S-Lab, Nanyang Technological University; Zhejiang University; SenseTime Research; School of Computer Science and Engineering, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Camera-based 3D semantic occupancy prediction offers an efficient and cost-effective solution for perceiving surrounding scenes in autonomous driving. However, existing works rely on explicit occupancy state inference, leading to numerous incorrect feature assignments, and insufficient samples restrict the learning of occupancy class inference. To address these challenges, we propose leveraging Depth awareness and Semantic aid to boost camera-based 3D semantic Occupancy prediction (DSOcc). We jointly perform occupancy state and occupancy class inference, where soft occupancy confidence is calculated through a non-learning method and multiplied with image features to make the voxel representation aware of depth, enabling adaptive implicit occupancy state inference. Rather than focusing on improving feature learning, we directly utilize well-trained image semantic segmentation and fuse multiple frames with their occupancy probabilities to aid occupancy class inference, thereby enhancing robustness. Experimental results demonstrate that DSOcc achieves state-of-the-art performance on the SemanticKITTI dataset among camera-based methods.
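
The step of multiplying a soft occupancy confidence with image features can be illustrated as follows. This is a sketch under assumptions: the abstract only says the confidence comes from a non-learning method, so the Gaussian falloff around the estimated pixel depth, the `sigma` parameter, and both function names are hypothetical choices made for illustration.

```python
import math


def soft_confidence(voxel_depth, pixel_depth, sigma=1.0):
    """Assumed confidence: 1.0 when the voxel lies at the estimated depth, decaying with distance."""
    return math.exp(-((voxel_depth - pixel_depth) ** 2) / (2.0 * sigma ** 2))


def depth_aware_feature(image_feature, voxel_depth, pixel_depth, sigma=1.0):
    """Scale the image feature assigned to a voxel by its soft occupancy confidence."""
    confidence = soft_confidence(voxel_depth, pixel_depth, sigma)
    return [confidence * value for value in image_feature]
```

Compared with a hard occupied-or-empty decision, every voxel along the camera ray still receives the feature, but voxels far from the estimated depth receive a strongly attenuated copy. That is one plausible reading of "implicit occupancy state inference".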

[CV-74] PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter CVPR2025

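[Quick Read]: This paper addresses a limitation in how pre-trained models are currently applied to point cloud understanding: only the final output of the pre-trained model is fed to task heads, while the rich complementary information in intermediate layers is ignored, so the potential of the pre-trained model is not fully used. The key to its solution is an orthogonal approach, the Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and uses Mamba to fuse all complementary semantics for more comprehensive point cloud understanding. To cope with the inherent isotropy of 3D space, it further introduces a geometry-constrained gate prompt generator (G2PG) shared across layers, which applies shared geometric constraints to dynamically optimize the spatial order and integrate multi-layer information effectively.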

Link: https://arxiv.org/abs/2505.20941
Authors: Yaohua Zha, Yanzi Wang, Hang Guo, Jinpeng Wang, Tao Dai, Bin Chen, Zhihao Ouyang, Xue Yuerong, Ke Chen, Shu-Tao Xia
Affiliations: Tsinghua University; Pengcheng Laboratory; Shenzhen University; Harbin Institute of Technology, Shenzhen; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Applying pre-trained models to assist point cloud understanding has recently become a mainstream paradigm in 3D perception. However, existing application strategies are straightforward, utilizing only the final output of the pre-trained model for various task heads. It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. Constructing this ordered sequence is non-trivial due to the inherent isotropy of 3D space. Therefore, we further propose a geometry-constrained gate prompt generator (G2PG) shared across different layers, which applies shared geometric constraints to the output gates of the Mamba and dynamically optimizes the spatial order, thus enabling more effective integration of multi-layer information. Extensive experiments conducted on challenging point cloud datasets across various tasks demonstrate that our PMA elevates the capability for point cloud understanding to a new level by fusing diverse complementary intermediate features. Code is available at this https URL.

[CV-75] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

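[Quick Read]: This paper addresses the failures of text-to-image diffusion models in multi-instance scenarios, in particular merged or omitted objects. The key to its solution is a training-free method, Instance-to-Semantic Attention Control (ISAC), whose instance-first modeling approach explicitly resolves incomplete instance formation and semantic entanglement, so that multiple object instances are disentangled and each is aligned with its corresponding semantic label.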

Link: https://arxiv.org/abs/2505.20935
Authors: Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim
Affiliations: Seoul National University; School of Transdisciplinary Innovations and Interdisciplinary Program in Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages

Abstract:Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.

[CV-76] QwT-v2: Practical Effective and Efficient Post-Training Quantization

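[Quick Read]: This paper addresses the parameter and computation overhead and the poor hardware compatibility that arise in network quantization. The QwT method offers a general and effective way to improve quantization, but the extra parameters and latency it introduces limit its use. The proposed QwT-v2 adopts a lightweight channel-wise affine compensation (CWAC) module that significantly reduces the extra parameters and computation while matching or exceeding the accuracy of QwT. The compensation module is easy to integrate into quantization inference engines, which removes the extra cost and improves compatibility with existing hardware platforms.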

Link: https://arxiv.org/abs/2505.20932
Authors: Ningyuan Tang, Minghao Fu, Hao Yu, Jianxin Wu
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Network quantization is arguably one of the most practical network compression approaches for reducing the enormous resource consumption of modern deep neural networks. They usually require diverse and subtle design choices for specific architecture and tasks. Instead, the QwT method is a simple and general approach which introduces lightweight additional structures to improve quantization. But QwT incurs extra parameters and latency. More importantly, QwT is not compatible with many hardware platforms. In this paper, we propose QwT-v2, which not only enjoys all advantages of but also resolves major defects of QwT. By adopting a very lightweight channel-wise affine compensation (CWAC) module, QwT-v2 introduces significantly less extra parameters and computations compared to QwT, and at the same time matches or even outperforms QwT in accuracy. The compensation module of QwT-v2 can be integrated into quantization inference engines with little effort, which not only effectively removes the extra costs but also makes it compatible with most existing hardware platforms.
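
A channel-wise affine compensation can be pictured as one scale and one shift per channel that pull the quantized activations back toward the full-precision ones. The sketch below fits those two numbers per channel by ordinary least squares on calibration activations. The abstract does not say how QwT-v2 obtains its CWAC parameters, so the closed-form fit and the names `fit_channel_affine`, `fit_cwac`, and `apply_cwac` are assumptions for illustration only.

```python
def fit_channel_affine(quant_out, full_out):
    """Least-squares (scale, shift) so that scale * quant_out + shift approximates full_out."""
    n = len(quant_out)
    mean_q = sum(quant_out) / n
    mean_f = sum(full_out) / n
    var_q = sum((q - mean_q) ** 2 for q in quant_out)
    if var_q == 0:
        # Constant channel: only a shift can be estimated.
        return 1.0, mean_f - mean_q
    cov = sum((q - mean_q) * (f - mean_f) for q, f in zip(quant_out, full_out))
    scale = cov / var_q
    return scale, mean_f - scale * mean_q


def fit_cwac(quant_channels, full_channels):
    """Fit one (scale, shift) pair per channel from calibration activations."""
    return [fit_channel_affine(q, f) for q, f in zip(quant_channels, full_channels)]


def apply_cwac(quant_channels, params):
    """Apply the per-channel affine correction to quantized activations."""
    return [[scale * x + shift for x in ch] for ch, (scale, shift) in zip(quant_channels, params)]
```

The appeal of such a design is its size: two numbers per channel, independent of the layer's input width. A per-channel scale and shift also has the same form as the per-channel rescaling that quantized inference commonly applies, which may be what makes the module easy to absorb into an inference engine, although the abstract does not spell this out.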

[CV-77] Good Enough: Is it Worth Improving your Label Quality?

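[Quick Read]: This paper addresses the unclear benefit of improving label quality in medical image segmentation, that is, whether the cost invested in better labels is worthwhile. The key to its approach is to generate CT datasets with pseudo-labels of varying quality (produced by models such as nnU-Net, TotalSegmentator, and MedSAM) and to evaluate systematically how label quality affects model performance. The study finds that higher-quality labels improve in-domain performance, although the gains are unclear when the improvement falls below a small threshold. For pre-training, label quality has minimal impact, suggesting that models transfer general concepts rather than detailed annotations.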

Link: https://arxiv.org/abs/2505.20928
Authors: Alexander Jaus, Zdravko Marinov, Constantin Seibold, Simon Reiß, Jens Kleesiek, Rainer Stiefelhagen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Improving label quality in medical image segmentation is costly, but its benefits remain unclear. We systematically evaluate its impact using multiple pseudo-labeled versions of CT datasets, generated by models like nnU-Net, TotalSegmentator, and MedSAM. Our results show that while higher-quality labels improve in-domain performance, gains remain unclear if below a small threshold. For pre-training, label quality has minimal impact, suggesting that models rather transfer general concepts than detailed annotations. These findings provide guidance on when improving label quality is worth the effort.

[CV-78] HuMoCon: Concept Discovery for Human Motion Understanding

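[Quick Read]: This paper addresses key challenges in motion concept discovery for human behavior analysis, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. The key to its solution is a human motion concept discovery framework that integrates a feature alignment strategy, using video for contextual understanding and motion for fine-grained interaction modeling, with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing.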

Link: https://arxiv.org/abs/2505.20920
Authors: Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, Yanchao Yang
Affiliations: The University of Hong Kong; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 10 figures

Abstract:We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, further with a velocity reconstruction mechanism to enhance high-frequency feature expression and mitigate temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding. We will open-source the associated code with our paper.
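
Why a velocity term counters temporal over-smoothing can be seen with a toy example. The sketch below is an illustration under assumptions, not the paper's loss: velocity is taken as the difference between consecutive frames, and both errors are plain mean squared errors. The function names are hypothetical.

```python
def velocities(motion):
    """Frame-to-frame differences of a motion sequence (list of frames, each a list of coordinates)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(motion, motion[1:])]


def mse(pred, target):
    """Mean squared error over two equally shaped nested lists."""
    diffs = [(p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(diffs) / len(diffs)


def position_loss(pred_motion, target_motion):
    """Reconstruction error on positions only."""
    return mse(pred_motion, target_motion)


def velocity_loss(pred_motion, target_motion):
    """Assumed velocity reconstruction error: MSE between frame differences."""
    return mse(velocities(pred_motion), velocities(target_motion))
```

For a rapidly oscillating target, a flat prediction at the average position has a modest position error, but its velocity error is several times larger. A reconstruction objective that includes the velocity term therefore penalizes the loss of high-frequency motion much more strongly.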

[CV-79] Geometry-Editable and Appearance-Preserving Object Compositon

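[Quick Read]: This paper addresses how, in General Object Composition (GOC), to keep the geometric properties of the target object editable while preserving its fine-grained appearance details. Existing methods achieve geometry-editable generation through semantic embeddings and diffusion models, but their highly compact embeddings encode only high-level semantic cues and therefore lose fine-grained appearance details. The key to the proposed Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model is to first use semantic embeddings to implicitly capture the desired geometric transformations, and then to use a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, achieving both precise geometry editing and faithful appearance preservation.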

Link: https://arxiv.org/abs/2505.20914
Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
Affiliations: South China University of Technology; Guangdong University of Technology; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.

[CV-80] Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects ICME2025

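[Quick Read]: This paper addresses the shortcomings of existing text-to-image generation methods in layout control precision and in exploiting the dynamic features of reference subjects. The key to its solution is a tuning-free Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, which uses a Dynamic-Static Complementary Visual Refining module to capture the details of reference subjects comprehensively and introduces a Dual Layout Control mechanism for robust spatial control in both training and inference.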

Link: https://arxiv.org/abs/2505.20909
Authors: Wei Li, Hebei Li, Yansong Peng, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2025

Abstract:Diffusion models have significantly advanced text-to-image generation, laying the foundation for the development of personalized generative frameworks. However, existing methods lack precise layout controllability and overlook the potential of dynamic features of reference subjects in improving fidelity. In this work, we propose Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, a novel framework that integrates subject identity preservation with flexible layout guidance in a tuning-free approach. Our model employs a Dynamic-Static Complementary Visual Refining module to comprehensively capture the intricate details of reference subjects, and introduces a Dual Layout Control mechanism to enforce robust spatial control across both training and inference stages. Extensive experiments validate that LCP-Diffusion excels in both identity preservation and layout controllability. To the best of our knowledge, this is a pioneering work enabling users to “create anything anywhere”.

[CV-81] HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

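[Quick Read]: This paper addresses the incomplete depth information that depth sensors produce for transparent and reflective objects, which seriously affects downstream robotic perception and manipulation tasks. The key to its solution is HTMNet, a new hybrid model that integrates Transformer, CNN, and Mamba architectures: a dual-branch Transformer-CNN encoder and a multi-scale fusion module based on a Transformer-Mamba architecture perform depth completion. It also introduces a multimodal fusion module based on self-attention and state space models, the first application of the Mamba architecture to transparent object depth completion, and designs a multi-scale fusion module that combines channel attention, spatial attention, and multi-scale feature extraction to integrate multi-scale features effectively.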

Link: https://arxiv.org/abs/2505.20904
Authors: Guanghu Xie, Yonglong Zhang, Zhiduo Jiang, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is constructed based on a dual-branch Transformer-CNN framework, while the multi-scale fusion module leverages a Transformer-Mamba architecture, which also serves as the foundation for the decoder design. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture in the field of transparent object depth completion and revealing its promising potential. Additionally, we design an innovative multi-scale fusion module for the decoder that combines channel attention, spatial attention, and multi-scale feature extraction techniques to effectively integrate multi-scale features through a down-fusion strategy. Extensive evaluations on multiple public datasets demonstrate that our model achieves state-of-the-art(SOTA) performance, validating the effectiveness of our approach.

[CV-82] Frequency Composition for Compressed and Domain-Adaptive Neural Networks

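[Quick Read]: This paper addresses the challenge of achieving model compression and domain adaptation simultaneously under resource constraints. Prior work has mostly treated the two in isolation: compressed models prioritize efficiency but are confined to a fixed domain, whereas large models focus on handling domain shifts. The proposed solution, CoDA, unifies compression and domain adaptation through a frequency-composition-based framework. During training, CoDA applies quantization-aware training (QAT) with low-frequency components so that the compressed model selectively learns robust, generalizable features. At test time, it refines the model through source-free test-time adaptation (TTA) using the full-frequency information of incoming data, treating high-frequency components as domain-specific cues to improve adaptation to the target domain.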

Link: https://arxiv.org/abs/2505.20890
Authors: Yoojin Kwon, Hongjun Suh, Wooseok Lee, Taesik Gong, Songyi Han, Hyung-Sin Kim
Affiliations: Seoul National University; UNIST; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Modern on-device neural network applications must operate under resource constraints while adapting to unpredictable domain shifts. However, this combined challenge (model compression and domain adaptation) remains largely unaddressed, as prior work has tackled each issue in isolation: compressed networks prioritize efficiency within a fixed domain, whereas large, capable models focus on handling domain shifts. In this work, we propose CoDA, a frequency composition-based framework that unifies compression and domain adaptation. During training, CoDA employs quantization-aware training (QAT) with low-frequency components, enabling a compressed model to selectively learn robust, generalizable features. At test time, it refines the compact model in a source-free manner (i.e., test-time adaptation, TTA), leveraging the full-frequency information from incoming data to adapt to target domains while treating high-frequency components as domain-specific cues. Low-frequency components (LFC) are aligned with the trained distribution, while high-frequency components (HFC) unique to the target distribution are solely utilized for batch normalization. CoDA can be integrated synergistically into existing QAT and TTA methods. CoDA is evaluated on widely used domain-shift benchmarks, including CIFAR10-C and ImageNet-C, across various model architectures. With significant compression, it achieves accuracy improvements of 7.96%p on CIFAR10-C and 5.37%p on ImageNet-C over the full-precision TTA baseline.
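
The split into low-frequency and high-frequency components can be sketched with a discrete Fourier transform. For brevity this sketch works on a 1D signal with a naive DFT, whereas the paper presumably operates on 2D images. The cutoff rule and the function names are hypothetical, and how CoDA actually chooses its frequency bands is not stated in the abstract.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_frequencies(signal, cutoff):
    """Split a signal into (low, high) parts; bins with frequency index <= cutoff count as low."""
    n = len(signal)
    spectrum = dft(signal)
    # min(k, n - k) is the frequency index of bin k, counting negative frequencies symmetrically.
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    return idft(low_spec), idft(high_spec)
```

The two parts always sum back to the original signal. In the reading suggested by the abstract, the low-frequency part would feed quantization-aware training, and at test time the high-frequency part would only influence batch normalization.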

[CV-83] YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

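Quick read: This paper addresses the challenges of fire detection in dynamic environments: interference from illumination changes, frequent false or missed detections, and the difficulty of achieving efficiency and accuracy together. The key to the solution is the YOLO-FireAD model, which has two core innovations. The first is the Attention-guided Inverted Residual Block (AIR), which combines channel-spatial attention with an inverted residual structure to adaptively enhance fire features and suppress environmental noise. The second is the Dual Pool Downscale Fusion Block (DPDF), which preserves multi-scale fire patterns through a learnable fusion of max-pooling and average-pooling outputs, mitigating detection failures on small fires.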

Link: https://arxiv.org/abs/2505.20884
Authors: Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fire detection in dynamic environments faces persistent challenges, including interference from illumination changes, frequent false or missed detections, and the difficulty of achieving both efficiency and accuracy. To address feature-extraction limitations and information loss in existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD), with two core innovations: (1) the Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) the Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max-average pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets demonstrates the efficiency of the model. It keeps both the parameter count (1.45M, 51.8% lower than YOLOv8n) and the computational cost (4.6G, 43.2% lower than YOLOv8n) low, and its mAP75 is 1.3-5.5% higher than that of mainstream real-time object detection models, including YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n, and other YOLOv8 variants.
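
The idea of fusing max-pooled and average-pooled outputs can be sketched as follows. This is a minimal illustration, not the DPDF block itself: the 2x2 window, the single scalar weight `alpha` (learnable in a real network, fixed here), and the name `dual_pool_fusion` are all assumptions.

```python
import numpy as np

def dual_pool_fusion(x, alpha):
    """Blend 2x2 max pooling and 2x2 average pooling of an (H, W) feature map.

    alpha in [0, 1] weights the max-pooled branch; a trainable parameter
    would take its place in a real model (assumption for illustration).
    """
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, dropping any odd border.
    blocks = x[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    max_pooled = blocks.max(axis=(1, 3))   # keeps sharp, localized responses
    avg_pooled = blocks.mean(axis=(1, 3))  # keeps smooth context
    return alpha * max_pooled + (1.0 - alpha) * avg_pooled

feature_map = np.arange(16, dtype=float).reshape(4, 4)
fused = dual_pool_fusion(feature_map, alpha=0.5)
```

Max pooling tends to preserve small, bright responses while average pooling preserves surrounding context, which is one plausible reading of why blending the two helps with small fires.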

[CV-84] Stereo Radargrammetry Using Deep Learning from Airborne SAR Images

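Quick read: This paper addresses the geometric image modulation that limits conventional stereo radargrammetry on airborne Synthetic Aperture Radar (SAR) images. The key to the solution is a deep learning-based image correspondence method that is fine-tuned on a SAR image dataset created by the authors. The method skips ground projection of the SAR image, which suppresses the image-quality degradation caused by pixel interpolation, and processes the image in patches so that deep learning can be applied, yielding a wider range and more accurate elevation measurements than conventional methods.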

Link: https://arxiv.org/abs/2505.20876
Authors: Tatsuya Sasayama, Shintaro Ito, Koichi Ito, Takafumi Aoki
Affiliations: Victoria University of Wellington; Inner Mongolia University; Universidad Nacional de Hurlingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 5 figures, conference IGARSS2025

Abstract:In this paper, we propose a stereo radargrammetry method using deep learning from airborne Synthetic Aperture Radar (SAR) images. Deep learning-based methods are considered to suffer less from geometric image modulation, while there is no public SAR image dataset used to train such methods. We create a SAR image dataset and perform fine-tuning of a deep learning-based image correspondence method. The proposed method suppresses the degradation of image quality by pixel interpolation without ground projection of the SAR image and divides the SAR image into patches for processing, which makes it possible to apply deep learning. Through a set of experiments, we demonstrate that the proposed method exhibits a wider range and more accurate elevation measurements compared to conventional methods.

[CV-85] Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

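Quick read: This paper tackles modality bias in audio-visual large language models (AV-LLMs), where imbalanced training signals lead the model to over-rely on one modality when processing multimodal input. The key to the solution is Fork-Merge Decoding (FMD), an inference-time strategy that needs no additional training or architectural changes. FMD first runs audio-only and video-only inputs separately through the early decoder layers (the fork phase), then merges the resulting hidden states for joint reasoning in the remaining layers (the merge phase), which balances the contribution of each modality and exploits complementary information across modalities.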

Link: https://arxiv.org/abs/2505.20873
Authors: Chaeyoung Jung, Youngjoon Jang, Jongmin Choi, Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without requiring additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (a fork phase), and then merges the resulting hidden states for joint reasoning in the remaining layers (a merge phase). This approach promotes balanced modality contributions and leverages complementary information across modalities. We evaluate our method on two representative AV-LLMs, VideoLLaMA2 and video-SALMONN, using three benchmark datasets. Experimental results demonstrate consistent performance improvements on tasks focused on audio, video, and combined audio-visual reasoning, demonstrating the effectiveness of inference-time interventions for robust multimodal understanding.

[CV-86] In Context Learning with Vision Transformers: Case Study

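Quick read: This paper investigates whether large Transformer models can learn in context on image data in order to capture more complex functions, such as convolutional neural networks (CNNs) and other methods. The approach extends earlier results, which showed that Transformers can learn linear functions and small two-layer neural networks in context on random data, to the image space, and studies how well that in-context learning ability carries over to image data and more complex function classes.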

Link: https://arxiv.org/abs/2505.20872
Authors: Antony Zhao, Alex Proshkin, Fergal Hennessy, Francesco Crivelli
Affiliations: UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 16 figures. UC Berkeley research project

Abstract:Large transformer models have been shown to be capable of performing in-context learning. By using examples in a prompt as well as a query, they are capable of performing tasks such as few-shot, one-shot, or zero-shot learning to output the corresponding answer to this query. One area of interest to us is that these transformer models have been shown to be capable of learning the general class of certain functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this to the image space to analyze their capability to in-context learn more complex functions on the image space, such as convolutional neural networks and other methods.

[CV-87] AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

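Quick read: This paper addresses hallucination in multimodal large language models (MLLMs), particularly in audio-visual large language models (AV-LLMs), where hallucinations arise from both unimodal and cross-modal combinations. The key to the solution is Audio-Visual Contrastive Decoding (AVCD), a training-free decoding framework that uses attention distributions to dynamically identify the less dominant modalities and applies attentive masking to produce perturbed output logits, thereby modeling trimodal interactions and suppressing modality-induced hallucinations. AVCD also reformulates the original contrastive decoding framework to handle audio, visual, and textual inputs jointly, and adds entropy-guided adaptive decoding to improve efficiency.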

Link: https://arxiv.org/abs/2505.20862
Authors: Chaeyoung Jung, Youngjoon Jang, Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology (KAIST)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model’s confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 6% for VideoLLaMA2 and 11% for video-SALMONN, demonstrating strong robustness and generalizability.
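
The general shape of contrastive decoding with an entropy gate can be sketched as below. This is a hedged illustration rather than AVCD itself: it uses the two-way contrast commonly seen in earlier vision-language work, `(1 + alpha) * original - alpha * perturbed`, and a fixed `entropy_threshold`; both hyperparameters and all function names are assumptions, and the paper's trimodal reformulation is not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def contrastive_logits(original, perturbed, alpha=1.0, entropy_threshold=0.5):
    """Return adjusted next-token logits.

    If the original distribution is already confident (low entropy), the
    contrast is skipped, mimicking the idea of entropy-guided decoding.
    """
    if entropy(softmax(original)) < entropy_threshold:
        return original
    # Amplify what the clean input supports and the perturbed input does not.
    return (1.0 + alpha) * original - alpha * perturbed

original = np.array([2.0, 1.9, 0.1])   # two tokens nearly tied
perturbed = np.array([0.5, 1.9, 0.1])  # token 0 loses support once the input is masked
adjusted = contrastive_logits(original, perturbed)
```

In this toy case the token that depends on the masked input gains a clear margin after the contrast, while a confident step would be returned unchanged without needing the extra perturbed forward pass.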

[CV-88] Exploring Timeline Control for Facial Motion Generation CVPR2025

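Quick read: This paper addresses the coarse granularity of control signals in facial motion generation by introducing timeline control, which allows the timing of facial actions to be specified precisely. Compared with audio and text signals, a timeline gives finer-grained instructions: users specify time intervals for facial actions on a multi-track timeline and thereby control exactly when each action occurs. The key to the solution is to annotate the intervals of facial actions in natural facial motion sequences at frame level, using Toeplitz Inverse Covariance-based Clustering to reduce manual labor, and then to train a diffusion-based model on these annotations that generates natural facial motion accurately aligned with the input timeline.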

Link: https://arxiv.org/abs/2505.20861
Authors: Yifeng Ma, Jinwei Qi, Chaonan Ji, Peng Zhang, Bang Zhang, Zhidong Deng, Liefeng Bo
Affiliations: Tsinghua University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025, Project Page: this https URL

Abstract:This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, we first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.
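
A multi-track timeline can be turned into frame-level labels with a few lines of code. The sketch below is only an illustration of the data structure: the dictionary layout, the action names, the frame rate, and the function `timeline_to_frames` are hypothetical and not taken from the paper.

```python
import numpy as np

def timeline_to_frames(timeline, num_frames, fps):
    """Convert {action: [(start_s, end_s), ...]} into a boolean (frames, tracks) array."""
    actions = sorted(timeline)  # fixed track order
    labels = np.zeros((num_frames, len(actions)), dtype=bool)
    for track, action in enumerate(actions):
        for start_s, end_s in timeline[action]:
            start_f = int(round(start_s * fps))
            end_f = int(round(end_s * fps))
            labels[start_f:end_f, track] = True  # half-open interval [start, end)
    return actions, labels

# Hypothetical two-track timeline: two blinks and one overlapping smile.
timeline = {"blink": [(0.0, 0.2), (1.0, 1.2)], "smile": [(0.4, 1.6)]}
actions, labels = timeline_to_frames(timeline, num_frames=50, fps=25)
```

Each row of `labels` says which actions are active in that frame, so overlapping actions on different tracks are represented naturally; a generator conditioned on such a signal receives timing information that text or audio alone does not carry.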

[CV-89] ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient

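Quick read: This paper addresses the dependence of classical Bundle Adjustment (BA) on accurate initial estimates and known camera intrinsics, which limits its use when that information is uncertain or unavailable. The key to the solution is ProBA, a probabilistic BA formulation that explicitly models and propagates uncertainty in both the 2D observations and the 3D scene structure. It replaces point-like landmarks with 3D Gaussians, introduces uncertainty-aware reprojection losses, and uses the Bhattacharyya coefficient to enforce geometric consistency across multiple 3D Gaussians, making the optimization more robust and reliable.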

Link: https://arxiv.org/abs/2505.20858
Authors: Jason Chui, Daniel Cremers
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, 5 tables

Abstract:Classical Bundle Adjustment (BA) methods require accurate initial estimates for convergence and typically assume known camera intrinsics, which limits their applicability when such information is uncertain or unavailable. We propose a novel probabilistic formulation of BA (ProBA) that explicitly models and propagates uncertainty in both the 2D observations and the 3D scene structure, enabling optimization without any prior knowledge of camera poses or focal length. Our method uses 3D Gaussians instead of point-like landmarks and we introduce uncertainty-aware reprojection losses by projecting the 3D Gaussians onto the 2D image space, and enforce geometric consistency across multiple 3D Gaussians using the Bhattacharyya coefficient to encourage overlap between their corresponding Gaussian distributions. This probabilistic framework leads to more robust and reliable optimization, even in the presence of outliers in the correspondence set, reducing the likelihood of converging to poor local minima. Experimental results show that ProBA outperforms traditional methods in challenging real-world conditions. By removing the need for strong initialization and known intrinsics, ProBA enhances the practicality of SLAM systems deployed in unstructured environments.
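
The Bhattacharyya coefficient between two multivariate Gaussians has a standard closed form, sketched below. Only the formula is standard; how ProBA weights or combines it inside its loss is not stated in the abstract, and the function name is an assumption.

```python
import numpy as np

def bhattacharyya_coefficient(mu1, cov1, mu2, cov2):
    """Overlap in (0, 1] between N(mu1, cov1) and N(mu2, cov2); 1 means identical."""
    cov = 0.5 * (cov1 + cov2)  # averaged covariance
    diff = mu1 - mu2
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    log_det_term = 0.5 * np.log(
        np.linalg.det(cov) / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
    )
    distance = 0.125 * mahalanobis + log_det_term  # Bhattacharyya distance
    return float(np.exp(-distance))

identity = np.eye(3)
origin = np.zeros(3)
bc_same = bhattacharyya_coefficient(origin, identity, origin, identity)
bc_shifted = bhattacharyya_coefficient(origin, identity, np.array([2.0, 0.0, 0.0]), identity)
```

Because the coefficient decays smoothly as two Gaussians move apart or their shapes diverge, maximizing it (or minimizing the distance) is a natural way to encourage landmarks that should coincide to overlap.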

[CV-90] Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

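Quick read: This paper addresses robust visual object tracking in complex environments, where existing frame-event fusion methods incur heavy computational overhead, struggle to extract the sparse, asynchronous information in event streams efficiently, and therefore fail to exploit the energy efficiency of event-driven spiking paradigms. The key to the solution is SpikeFET, the first fully spiking frame-event tracking framework, which combines convolutional local feature extraction with Transformer-based global modeling inside the spiking paradigm to fuse frame and event data. A Random Patchwork Module (RPM) counters the loss of translation invariance caused by convolutional padding, and a Spatial-Temporal Regularization (STR) strategy counters the degradation of the similarity metric caused by asymmetric features.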

Link: https://arxiv.org/abs/2505.20834
Authors: Jingjun Yang, Liangwei Fan, Jinpu Zhang, Xiangkai Lian, Hui Shen, Dewen Hu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 13 pages, 6 figures, 4 tables

Abstract:The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency. The code will be released.

[CV-91] Causality-Driven Infrared and Visible Image Fusion

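Quick read: This paper addresses the spurious correlations that image fusion models learn because of scene bias in datasets, which limits fusion performance. The key to the solution is to re-examine image fusion from a causal perspective: a tailored causal graph is constructed to remove the influence of the bias, and a Back-door Adjustment based Feature Fusion Module (BAFFM) is proposed to eliminate confounder interference so that the model learns the true causal effect.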

Link: https://arxiv.org/abs/2505.20830
Authors: Linli Ma, Suzhen Lin, Jianchao Zeng, Zanxia Jin, Yanbo Wang, Fengyuan Li, Yubing Luo
Affiliations: North University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image fusion aims to combine complementary information from multiple source images to generate more comprehensive scene representations. Existing methods primarily rely on the stacking and design of network architectures to enhance the fusion performance, often ignoring the impact of dataset scene bias on model training. This oversight leads the model to learn spurious correlations between specific scenes and fusion weights under the conventional likelihood estimation framework, thereby limiting fusion performance. To solve the above problems, this paper first re-examines the image fusion task from the causality perspective, and disentangles the model from the impact of bias by constructing a tailored causal graph to clarify the causalities among the variables in the image fusion task. Then, the Back-door Adjustment based Feature Fusion Module (BAFFM) is proposed to eliminate confounder interference and enable the model to learn the true causal effect. Finally, extensive experiments on three standard datasets show that the proposed method significantly surpasses state-of-the-art methods in infrared and visible image fusion.
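
Back-door adjustment itself is a standard identity: P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z), where Z is the confounder. The toy calculation below illustrates that identity with made-up probabilities; it does not show how BAFFM realizes the adjustment on feature maps, which the abstract does not describe.

```python
import numpy as np

def backdoor_adjust(p_y_given_x_z, p_z):
    """Compute P(Y=1 | do(X=x)) for every x by averaging over the confounder Z.

    p_y_given_x_z[x, z] = P(Y=1 | X=x, Z=z); p_z[z] = P(Z=z).
    """
    return np.asarray(p_y_given_x_z, dtype=float) @ np.asarray(p_z, dtype=float)

# Hypothetical numbers: Z could stand for a scene type that biases the dataset.
p_y_given_x_z = [[0.2, 0.6],
                 [0.4, 0.8]]
p_z = [0.5, 0.5]
p_do = backdoor_adjust(p_y_given_x_z, p_z)
causal_effect = p_do[1] - p_do[0]
```

The key point is that the confounder is weighted by its marginal P(Z) rather than by P(Z | X), which is what removes the scene-dependent shortcut from the estimate.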

[CV-92] Frame-Level Captions for Long Video Generation with Complex Multi Scenes

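Quick read: This paper addresses error accumulation (drift) when generating long, story-driven videos, as well as the limitations of existing methods on narratives with multiple scenes and events. The key to the solution is a frame-level dataset annotation method combined with a Frame-Level Attention Mechanism, which aligns text and video precisely and lets each frame be guided by its own text prompt. Training uses Diffusion Forcing so that the model can handle time flexibly, which improves instruction following and video quality in complex scenes.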

Link: https://arxiv.org/abs/2505.20827
Authors: Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to a serious error accumulation (drift). Also, many existing ways to make long videos focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame-level, providing detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on difficult VBench 2.0 benchmarks (“Complex Plots” and “Complex Landscapes”) based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: this https URL .

[CV-93] Spatial RoboGrasp: Generalized Robotic Grasping Control Policy

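Quick read: This paper addresses generalizable and precise robotic manipulation across diverse environments, a problem whose main difficulty lies in limited spatial perception. Earlier imitation-learning methods rely on raw RGB inputs and handcrafted features, which leads to overfitting and poor 3D reasoning under varied lighting, occlusion, and object conditions. The key to the solution is a unified framework that couples robust multimodal perception with reliable grasp prediction: it fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a single spatial representation for downstream action planning, and a diffusion-based policy then generates precise action sequences, substantially improving grasp and task success rates.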

Link: https://arxiv.org/abs/2505.20814
Authors: Yiqi Huang, Travis Davies, Jiahuan Yan, Jiankai Sun, Xiang Chen, Luhui Hu
Affiliations: ZhiCheng AI; Stanford University; Peking University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving generalizable and precise robotic manipulation across diverse environments remains a critical challenge, largely due to limitations in spatial perception. While prior imitation-learning approaches have made progress, their reliance on raw RGB inputs and handcrafted features often leads to overfitting and poor 3D reasoning under varied lighting, occlusion, and object conditions. In this paper, we propose a unified framework that couples robust multimodal perception with reliable grasp prediction. Our architecture fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a single spatial representation for downstream action planning. Conditioned on this encoding and a high-level task prompt, our diffusion-based policy yields precise action sequences, achieving up to 40% improvement in grasp success and 45% higher task success rates under environmental variation. These results demonstrate that spatially grounded perception, paired with diffusion-based imitation learning, offers a scalable and robust solution for general-purpose robotic grasping.

[CV-94] Not All That's Rare Is Lost: Causal Paths to Rare Concept Synthesis

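Quick read: This paper addresses the poor performance of diffusion models on rare concepts, that is, prompts that appear infrequently in the training distribution. The key to the solution is the RAP framework, which treats rare concept generation as navigating a latent causal path: a progressive, model-aligned generative trajectory from frequent concepts to the rare target. RAP theoretically justifies that rare-prompt guidance can be approximated by semantically related frequent prompts, formulates prompt switching as a dynamic process based on score similarity to enable adaptive stage transitions, and reinterprets prompt alternation as a second-order denoising mechanism that promotes smooth semantic progression and visual coherence.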

Link: https://arxiv.org/abs/2505.20808
Authors: Bo-Kai Ruan, Zi-Xiang Ni, Bo-Lun Huang, Teng-Fang Hsiao, Hong-Han Shuai
Affiliations: National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have shown strong capabilities in high-fidelity image generation but often falter when synthesizing rare concepts, i.e., prompts that are infrequently observed in the training distribution. In this paper, we introduce RAP, a principled framework that treats rare concept generation as navigating a latent causal path: a progressive, model-aligned trajectory through the generative space from frequent concepts to rare targets. Rather than relying on heuristic prompt alternation, we theoretically justify that rare prompt guidance can be approximated by semantically related frequent prompts. We then formulate prompt switching as a dynamic process based on score similarity, enabling adaptive stage transitions. Furthermore, we reinterpret prompt alternation as a second-order denoising mechanism, promoting smooth semantic progression and coherent visual synthesis. Through this causal lens, we align input scheduling with the model’s internal generative dynamics. Experiments across diverse diffusion backbones demonstrate that RAP consistently enhances rare concept generation, outperforming strong baselines in both automated evaluations and human studies.

[CV-95] Leaner Transformers: More Heads, Less Depth

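Quick read: This paper addresses the tendency of current Transformer models to be oversized, arguing that many existing models are larger than necessary. The key to the solution is a new understanding of multi-head attention: an important benefit of multiple heads is that they improve the conditioning of the attention block, and increasing the number of heads strengthens this property. Experiments show that the improvement is large enough in practice for model depth to be reduced, cutting the parameter count by up to 30-50% while maintaining accuracy.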

Link: https://arxiv.org/abs/2505.20802
Authors: Hemanth Saratchandran, Damien Teney, Simon Lucey
Affiliations: Australian Institute for Machine Learning, University of Adelaide; Idiap Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that “bigger means better”, leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).

[CV-96] Rendering-Aware Reinforcement Learning for Vector Graphics Generation

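Quick read: This paper addresses the generation of faithful, efficient, and semantically coherent Scalable Vector Graphics (SVG). Existing vision-language model (VLM) approaches never observe the rendered image during training, which limits output quality. The key to the solution is RLRF, a reinforcement learning (RL) method that renders the generated SVG, compares it with the original input image to obtain visual fidelity feedback, and uses that feedback as a reward to improve the accuracy, efficiency, and semantic coherence of the generated SVG.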

Link: https://arxiv.org/abs/2505.20793
Authors: Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, Marco Pedersoli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
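
The reward step described above, comparing a rendered roll-out with the input image, can be sketched with plain arrays. The pixel-wise reward `1 - MSE`, the function name `rendering_reward`, and the pre-rendered toy images are all assumptions; the abstract does not specify the actual reward terms, and no SVG renderer is invoked here.

```python
import numpy as np

def rendering_reward(rendered, target):
    """Return 1 - MSE for two images with pixel values in [0, 1] (higher is better)."""
    rendered = np.asarray(rendered, dtype=float)
    target = np.asarray(target, dtype=float)
    return 1.0 - float(np.mean((rendered - target) ** 2))

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0                      # a white square on a black background

# Stand-ins for rasterized SVG roll-outs sampled from a model.
rollouts = {
    "exact": target.copy(),
    "shifted": np.roll(target, 1, axis=1),  # square moved one pixel to the right
    "blank": np.zeros((8, 8)),
}
rewards = {name: rendering_reward(img, target) for name, img in rollouts.items()}
best = max(rewards, key=rewards.get)
```

Because the reward only needs the rasterized output, the renderer does not have to be differentiable, which is the property that makes an RL formulation attractive in this setting.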

[CV-97] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models ICML2025

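Quick read: This paper addresses the heavy computational demands and suboptimal convergence of diffusion model (DM) based solvers for inverse problems (IPs). The key to the solution is two new methods, DMILO and DMILO-PGD. DMILO uses intermediate layer optimization (ILO) to reduce the memory burden and introduces sparse deviations to expand the range of the diffusion model; DMILO-PGD further integrates ILO with projected gradient descent (PGD) to reduce the risk of suboptimal convergence.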

Link: https://arxiv.org/abs/2505.20789
Authors: Yang Zheng, Wen Li, Zhaoqiang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of real-world data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug (Wang et al., 2024), we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
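
Projected gradient descent for a linear inverse problem alternates a gradient step on the data-fit term with a projection onto a constraint set. The sketch below swaps in a much simpler constraint than the paper's: it projects onto k-sparse vectors (hard thresholding) instead of onto the range of a diffusion model, so it illustrates the PGD loop only, not DMILO-PGD. All names and problem sizes are assumptions.

```python
import numpy as np

def project_k_sparse(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def projected_gradient_descent(A, y, k, steps=200):
    """Minimize 0.5 * ||A x - y||^2 subject to x being k-sparse."""
    step_size = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        gradient = A.T @ (A @ x - y)                        # data-fit gradient
        x = project_k_sparse(x - step_size * gradient, k)   # step, then project
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 20))          # hypothetical measurement matrix
x_true = np.zeros(20)
x_true[[3, 11]] = [1.5, -2.0]
y = A @ x_true                         # noise-free observations
x_hat = projected_gradient_descent(A, y, k=2)
```

In a generative-prior setting the projection would itself be an optimization over the model's latent or intermediate variables, which is where intermediate layer optimization would plug in.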

[CV-98] Breaking Dataset Boundaries: Class-Agnostic Targeted Adversarial Attacks

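Quick read: This paper addresses three limitations of existing multi-targeted adversarial attacks: they depend heavily on classes seen during training, they do not generalize to unseen classes, and they require access to the black-box model's training data, which is a form of data leakage. The key to the solution is to replace class-level supervision with an image-based conditional input and to introduce class-agnostic losses that align the perturbed image with the target image in feature space. This removes the dependence on class semantics and enables generalization to unseen classes across datasets.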

Link: https://arxiv.org/abs/2505.20782
Authors: Taïga Gonçalves, Tomo Miyazaki, Shinichiro Omachi
Affiliations: Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Cross-Domain Multi-Targeted Attack (CD-MTA), a method for generating adversarial examples that mislead image classifiers toward any target class, including those not seen during training. Traditional targeted attacks are limited to one class per model, requiring expensive retraining for each target. Multi-targeted attacks address this by introducing a perturbation generator with a conditional input to specify the target class. However, existing methods are constrained to classes observed during training and require access to the black-box model’s training data, introducing a form of data leakage that undermines realistic evaluation in practical black-box scenarios. We identify overreliance on class embeddings as a key limitation, leading to overfitting and poor generalization to unseen classes. To address this, CD-MTA replaces class-level supervision with an image-based conditional input and introduces class-agnostic losses that align the perturbed and target images in the feature space. This design removes dependence on class semantics, thereby enabling generalization to unseen classes across datasets. Experiments on ImageNet and seven other datasets show that CD-MTA outperforms prior multi-targeted attacks in both standard and cross-domain settings, without accessing the black-box model’s training data.
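
A class-agnostic alignment objective can be as simple as a distance between feature vectors, with no class label involved. The sketch below uses `1 - cosine similarity` as a stand-in; the abstract does not name the losses CD-MTA actually uses, so both the choice of metric and the function name are assumptions.

```python
import numpy as np

def feature_alignment_loss(perturbed_features, target_features):
    """Return 1 - cosine similarity between two feature vectors (0 means aligned)."""
    a = np.asarray(perturbed_features, dtype=float)
    b = np.asarray(target_features, dtype=float)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cosine)

# Hypothetical features taken from some layer of a surrogate network.
aligned = feature_alignment_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = feature_alignment_loss([1.0, 0.0], [0.0, 1.0])
```

Since the loss depends only on the target image's features, the same generator can in principle be pointed at images of classes it never saw during training.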

[CV-99] TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs

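Quick read: This paper addresses challenges that multimodal large language models face in complex reasoning: inconsistency between the reasoning and the final answer, instability and collapse during long-chain exploration, and low data learning efficiency. The key to the solution is TACO, a reinforcement learning algorithm that introduces Think-Answer Consistency to couple reasoning tightly with answer consistency, so that answers are grounded in the reasoning. It also uses a Rollback Resample Strategy to stabilize long-chain exploration and an adaptive learning schedule to improve data efficiency, which together improve multimodal reasoning performance.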

Link: https://arxiv.org/abs/2505.20777
Authors: Zhehan Kan, Yanlin Liu, Kun Yin, Xinghua Jiang, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Qingmin Liao, Wenming Yang
Affiliations: Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). While recent methods have attempted to replicate R1’s reasoning capabilities in multimodal settings, they face limitations, including inconsistencies between reasoning and final answers, model instability and crashes during long-chain exploration, and low data learning efficiency. To address these challenges, we propose TACO, a novel reinforcement learning algorithm for visual reasoning. Building on Generalized Reinforcement Policy Optimization (GRPO), TACO introduces Think-Answer Consistency, which tightly couples reasoning with answer consistency to ensure answers are grounded in thoughtful reasoning. We also introduce the Rollback Resample Strategy, which adaptively removes problematic samples and reintroduces them to the sampler, enabling stable long-chain exploration and future learning opportunities. Additionally, TACO employs an adaptive learning schedule that focuses on moderate difficulty samples to optimize data efficiency. Furthermore, we propose the Test-Time-Resolution-Scaling scheme to address performance degradation due to varying resolutions during reasoning while balancing computational overhead. Extensive experiments on in-distribution and out-of-distribution benchmarks for REC and VQA tasks show that fine-tuning LVLMs leads to significant performance improvements.

[CV-100] MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

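Quick read: This paper addresses a weakness of conventional Object-Centric Learning (OCL) methods: because the slot count is fixed, a single object is split across multiple slots when the number of objects varies. The key to the solution is MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. It maintains a codebook of object prototypes for the dataset, removes duplicate slots by vector-quantizing them against the codebook, and injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize aggregation.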

Link: https://arxiv.org/abs/2505.20772
Authors: Hongjia Liu, Rongzhen Zhao, Haohan Chen, Joni Pajarinen
Affiliations: Aalto University; Sichuan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects’ super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks, including object discovery and recognition, models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.
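
Step (ii), removing duplicate slots by quantizing them against a codebook, can be sketched as a nearest-prototype lookup followed by deduplication. This is an illustrative reading of the abstract, not MetaSlot's code: keeping the first slot per codebook entry, the Euclidean distance, and the toy 2D vectors are all assumptions.

```python
import numpy as np

def deduplicate_slots(slots, codebook):
    """Assign each slot to its nearest codebook prototype and keep one slot per prototype."""
    # Pairwise Euclidean distances, shape (num_slots, num_prototypes).
    distances = np.linalg.norm(slots[:, None, :] - codebook[None, :, :], axis=-1)
    codes = distances.argmin(axis=1)
    # Index of the first slot that maps to each distinct prototype.
    _, first_index = np.unique(codes, return_index=True)
    keep = np.sort(first_index)
    return slots[keep], codes[keep]

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])   # two hypothetical object prototypes
slots = np.array([[0.1, 0.0], [9.8, 10.1], [-0.2, 0.1]])  # slots 0 and 2 cover the same object
kept_slots, kept_codes = deduplicate_slots(slots, codebook)
```

After deduplication the number of slots follows the number of distinct prototypes that were matched, which is how a fixed slot budget can adapt to a variable object count.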

[CV-101] ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval CVPR2025

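[Quick read]: This paper tackles Composed Image Retrieval (CIR): retrieving a target image specified jointly by a query image and a relative text that describes a semantic modification to that image. Existing methods represent the image and the text modification inaccurately, which hurts performance. The key to the solution is the ConText-CIR framework, which introduces a Text Concept-Consistency loss that pushes the representations of noun phrases in the text to attend to the relevant parts of the query image, together with a synthetic data generation pipeline that supports training with this loss. Together these components set new state-of-the-art results on multiple benchmark datasets.

Link: https://arxiv.org/abs/2505.20764
Authors: Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs
Affiliations: Washington University in St. Louis; Saint Louis University; The George Washington University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures, 6 tables. CVPR 2025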

Abstract:Composed image retrieval (CIR) is the task of retrieving a target image specified by a query image and a relative text that describes a semantic modification to the query image. Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. To support training with this loss function, we also propose a synthetic data generation pipeline that creates training data from existing CIR datasets or unlabeled images. We show that these components together enable stronger performance on CIR tasks, setting a new state-of-the-art in composed image retrieval in both the supervised and zero-shot settings on multiple benchmark datasets, including CIRR and CIRCO. Source code, model checkpoints, and our new datasets are available at this https URL.

[CV-102] PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

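[Quick read]: This paper addresses the limited ability of large multimodal models (LMMs) to perform pixel-level part grounding, and in particular their limits in fine-grained, compositional reasoning. Existing LMMs struggle to identify and localize specific object parts, and perform poorly on tasks involving part-whole relationships and cross-object comparison. The key to the solution is a new benchmark, PARTONOMY, which contains rich part and object labels and uses specialized concepts to make the task harder. The paper also proposes PLUM, a segmenting LMM that uses span tagging instead of segmentation tokens and feeds prior predictions back to guide later segmentations, improving performance on segmentation, visual question answering, and visual hallucination benchmarks.

Link: https://arxiv.org/abs/2505.20759
Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; University of California Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages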

Abstract:Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning-yet, large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects’ parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
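
The abstract reports results in gIoU. The sketch below assumes gIoU means the average of per-image mask IoUs, as is common in reasoning-segmentation evaluation; the paper's exact protocol may differ, and the convention that two empty masks score 1.0 is an assumption of this example.

```python
import numpy as np

def mask_iou(pred, gt) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def giou(preds, gts) -> float:
    """Mean of per-image IoUs (assumed definition of gIoU)."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))

preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = giou(preds, gts)  # per-image IoUs are 0.5 and 1.0
```

Averaging per image weights small and large parts equally, which matters for part grounding where many targets cover only a few pixels.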

[CV-103] Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

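[Quick read]: This paper addresses the efficiency and performance bottleneck in training one-step diffusion models caused by the intractability of the original expanded f-divergence. The key to the solution is Uni-Instruct, a theory-driven unified framework built on the authors' diffusion expansion theory of the f-divergence family. Additional theory overcomes the intractability and yields an equivalent yet tractable loss that trains one-step diffusion models by minimizing the expanded f-divergence family. The unification offers new theoretical insight and markedly improves one-step diffusion generation.

Link: https://arxiv.org/abs/2505.20755
Authors: Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun
Affiliations: Xiaohongshu Inc; Peking University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: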

Abstract:In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, f-distill, etc., inside a theory-driven framework which we name Uni-Instruct. Uni-Instruct is motivated by our proposed diffusion expansion theory of the f-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded f-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded f-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of 1.46 for unconditional generation and 1.38 for conditional generation. On the ImageNet-64×64 generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of 1.02, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
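
For readers unfamiliar with the f-divergence family the abstract builds on, here is a small sketch of the standard discrete definition, D_f(p || q) = sum_x q(x) f(p(x) / q(x)). It does not implement the paper's diffusion expansion or its tractable loss; the function names are illustrative, and strictly positive probabilities are assumed.

```python
import math
from typing import Callable, Sequence

def f_divergence(p: Sequence[float], q: Sequence[float],
                 f: Callable[[float], float]) -> float:
    """Discrete f-divergence D_f(p || q) = sum_x q(x) * f(p(x) / q(x))."""
    if any(pi <= 0 or qi <= 0 for pi, qi in zip(p, q)):
        raise ValueError("this sketch assumes strictly positive probabilities")
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl_generator(t: float) -> float:
    """f(t) = t log t recovers the KL divergence KL(p || q)."""
    return t * math.log(t)

def tv_generator(t: float) -> float:
    """f(t) = |t - 1| / 2 recovers the total variation distance."""
    return 0.5 * abs(t - 1.0)

p = [0.5, 0.5]
q = [0.25, 0.75]
kl = f_divergence(p, q, kl_generator)
tv = f_divergence(p, q, tv_generator)
```

Swapping the generator `f` changes the divergence without changing the surrounding code, which is the sense in which one framework can cover many distillation objectives.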

[CV-104] Understand Think and Answer: Advancing Visual Reasoning with Large Multimodal Models

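[Quick read]: This paper addresses the shortfall of Large Multimodal Models (LMMs) in integrating task-specific compositional reasoning capabilities, which limits progress toward truly general vision models. The key to the solution is a unified visual reasoning mechanism that lets LMMs solve complex compositional problems using their intrinsic capabilities (such as grounding and visual understanding). Unlike earlier shortcut learning mechanisms, it introduces a human-like understanding-thinking-answering process that completes all steps in a single forward pass, without multiple inferences or external tools, bridging foundational visual capabilities and general question answering.

Link: https://arxiv.org/abs/2505.20753
Authors: Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang
Affiliations: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Wuhan AI Research, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech report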

Abstract:Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g. grounding and visual understanding capabilities). Different from the previous shortcut learning mechanism, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without the need for multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general scenes and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, has the ability of end-to-end automatic understanding, self-thinking, and reasoning answers. Comprehensive experiments show that Griffon-R not only achieves advancing performance on complex visual reasoning benchmarks including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks like MMBench and ScienceQA. Data, models, and codes will be released at this https URL soon.

[CV-105] MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

[Quick read]: This paper addresses the limited interpretability of Human Activity Recognition (HAR) with wearable sensors, which significantly hurts generalization across datasets. The key to the solution is Motion-Primitive Transformer (MoPFormer), a self-supervised framework that segments inertial measurement unit signals and quantizes them into semantically meaningful motion primitives, then uses a Transformer architecture to learn rich temporal representations, improving both interpretability and cross-dataset generalization.

Link: https://arxiv.org/abs/2505.20744
Authors: Hao Zhang, Zhan Zhuang, Xuehao Wang, Xiaodong Yang, Yu Zhang
Affiliations: Southern University of Science and Technology; City University of Hong Kong; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage partitions multi-channel sensor streams into short segments and quantizes them into discrete “motion primitive” codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. Most importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities regardless of dataset origin.
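
The first stage described in the abstract (segment the sensor stream, then quantize each segment into a discrete codeword) can be illustrated roughly as below. The fixed non-overlapping windows, the nearest-neighbour rule, and the name `tokenize_imu` are assumptions of this sketch; the paper's segmentation and codebook learning are not reproduced.

```python
import numpy as np

def tokenize_imu(signal: np.ndarray, window: int, codebook: np.ndarray) -> np.ndarray:
    """Map a multi-channel stream to a sequence of codeword indices.

    signal:   (T, C) samples over time for C sensor channels
    window:   number of consecutive samples per segment
    codebook: (K, window * C) hypothetical motion-primitive prototypes
    """
    num_segments = signal.shape[0] // window
    # Drop the incomplete tail, then flatten each window into one vector.
    segments = signal[: num_segments * window].reshape(num_segments, -1)
    # Nearest codeword under squared Euclidean distance.
    dists = ((segments[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy stream: 4 "still" samples, 4 "moving" samples, 1 leftover sample.
stream = np.concatenate([np.zeros((4, 2)), np.ones((4, 2)), np.zeros((1, 2))])
prototypes = np.stack([np.zeros(8), np.ones(8)])
tokens = tokenize_imu(stream, window=4, codebook=prototypes)
```

The resulting integer sequence is what a Transformer encoder would consume in the second stage, in the same way a language model consumes token ids.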

[CV-106] Detecting Informative Channels: ActionFormer

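[Quick read]: This paper addresses the modeling of sensor signals for Human Activity Recognition (HAR), in particular two challenges for deep learning architectures: high temporal dynamics that make subtle changes hard to capture, and the interdependence of spatial and temporal features. The key to the solution is a modified ActionFormer that adopts the Sequence-and-Excitation strategy to keep the number of added parameters small and uses the Swish activation function to retain directional information in the negative range, improving performance on inertial data. On the WEAR dataset, the method improves average mAP by 16.01%.

Link: https://arxiv.org/abs/2505.20739
Authors: Kunpeng Zhao, Asahi Miyazaki, Tsuyoshi Okita
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: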

Abstract:Human Activity Recognition (HAR) has recently advanced with Transformer-based models. In particular, ActionFormer offers a new perspective for HAR because it outputs the boundaries of activities in addition to the activity labels. ActionFormer was originally proposed for image/video input, and it has since been adapted to sensor signals as well. We analyze this adaptation extensively in terms of deep learning architectures. Prior reports indicate that high temporal dynamics limit the model’s ability to capture subtle changes effectively, and that spatial and temporal features are interdependent. Based on these reports, we propose a modified ActionFormer that reduces these defects for sensor signals. The key to our approach is to follow the Sequence-and-Excitation strategy to minimize the increase in additional parameters, and to use the Swish activation function to retain directional information in the negative range. Experiments on the WEAR dataset show that our method achieves a substantial improvement of 16.01% in average mAP for inertial data.
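
The two ingredients named in the abstract can be sketched in a few lines. Swish is the standard x * sigmoid(x). The channel gate below assumes that the "Sequence-and-Excitation" strategy behaves like Squeeze-and-Excitation-style channel reweighting; that reading, the weight shapes, and the function names are assumptions, not details confirmed by the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """Swish: x * sigmoid(x). Negative inputs yield small negative outputs."""
    return x * sigmoid(x)

def channel_gate(features, w_reduce, w_expand):
    """Reweight channels with a learned gate (assumed SE-style block).

    features: (C, T) channels over time
    w_reduce: (H, C) bottleneck weights, H < C keeps added parameters few
    w_expand: (C, H) weights mapping back to one gate value per channel
    """
    squeezed = features.mean(axis=1)       # (C,) summary per channel
    hidden = swish(w_reduce @ squeezed)    # (H,) bottleneck activation
    gate = sigmoid(w_expand @ hidden)      # (C,) values in (0, 1)
    return features * gate[:, None]

x = np.array([[1.0, -2.0], [3.0, 4.0]])
gated = channel_gate(x, np.ones((1, 2)), np.ones((2, 1)))
```

Unlike ReLU, which maps every negative input to zero, `swish(-1.0)` is about -0.27, so the sign of a negative accelerometer reading is not erased.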

[CV-107] Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting

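[Quick read]: This paper addresses sparse-view scene reconstruction, where limited observations leave information incomplete and degrade the reconstruction quality of existing methods. The key to the solution is to use the rich prior knowledge of vision foundation models to guide both the initialization and the optimization of sparse-view Gaussian Splatting. During initialization, DUSt3R generates a dense, non-redundant Gaussian point cloud, easing the limitations of traditional structure-from-motion (SfM) methods under sparse views. During optimization, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions.

Link: https://arxiv.org/abs/2505.20729
Authors: Xiangyu Sun, Runnan Chen, Mingming Gong, Dong Xu, Tongliang Liu
Affiliations: University of Sydney; University of Melbourne; University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: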

Abstract:Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.

[CV-108] LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation

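[Quick read]: This paper addresses the low efficiency of high-quality image generation with Diffusion Models (DMs), whose iterative nature makes inference expensive. The key to the solution is LeDiFlow, which uses a regression-based auxiliary model to learn a better-suited prior distribution that guides Flow Matching (FM) training. The Ordinary Differential Equation (ODE) solver then starts from a prior closer to the target data distribution, so the learned probability paths are easier to compute, fewer solver steps are needed at inference time, and image quality improves.

Link: https://arxiv.org/abs/2505.20723
Authors: Pascal Zwick, Nils Friederich, Maximilian Beichter, Lennart Hilbert, Ralf Mikut, Oliver Bringmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: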

Abstract:Enhancing the efficiency of high-quality image generation using Diffusion Models (DMs) is a significant challenge due to the iterative nature of the process. Flow Matching (FM) is emerging as a powerful generative modeling paradigm based on a simulation-free training objective instead of a score-based one used in DMs. Typical FM approaches rely on a Gaussian distribution prior, which induces curved, conditional probability paths between the prior and target data distribution. These curved paths pose a challenge for the Ordinary Differential Equation (ODE) solver, requiring a large number of inference calls to the flow prediction network. To address this issue, we present Learned Distribution-guided Flow Matching (LeDiFlow), a novel scalable method for training FM-based image generation models using a better-suited prior distribution learned via a regression-based auxiliary model. By initializing the ODE solver with a prior closer to the target data distribution, LeDiFlow enables the learning of more computationally tractable probability paths. These paths directly translate to fewer solver steps needed for high-quality image generation at inference time. Our method utilizes a State-Of-The-Art (SOTA) transformer architecture combined with latent space sampling and can be trained on a consumer workstation. We empirically demonstrate that LeDiFlow remarkably outperforms the respective FM baselines. For instance, when operating directly on pixels, our model accelerates inference by up to 3.75x compared to the corresponding pixel-space baseline. Simultaneously, our latent FM model enhances image quality on average by 1.32x in CLIP Maximum Mean Discrepancy (CMMD) metric against its respective baseline.
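
To make the link between path curvature and solver steps concrete, the sketch below uses the common linear-interpolation form of flow matching, which is an assumption here rather than the paper's stated path. `x0` stands for a prior sample (in LeDiFlow it would come from the learned prior, which is not modelled), and `velocity_fn` stands in for the trained flow network.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 and its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along a straight path
    return x_t, v_target

def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

prior_sample = np.array([0.0, 0.0])
data_sample = np.array([2.0, -1.0])
x_quarter, v = fm_training_pair(prior_sample, data_sample, t=0.25)
# With an ideal constant velocity field, one Euler step is already exact.
one_step = euler_sample(prior_sample, lambda x, t: v, num_steps=1)
```

When the true field is curved, a single Euler step overshoots or undershoots and more steps are required, which is the cost a better-matched prior is meant to reduce.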

[CV-109] VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Visual-Language Models

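[Quick read]: This paper addresses the weak ability of current active visual tracking systems to recover after a tracking failure. The key to the solution is a self-improving framework that combines off-the-shelf active tracking methods with the reasoning capabilities of Visual-Language Models (VLMs): a fast visual policy handles normal tracking, VLM reasoning is activated only when a failure is detected, and a memory-augmented self-reflection mechanism lets the VLM improve progressively from past experience, compensating for its weakness in 3D spatial reasoning.

Link: https://arxiv.org/abs/2505.20718
Authors: Kui Wu, Shuhang Xu, Hao Chen, Churan Wang, Zhoujun Li, Yizhou Wang, Fangwei Zhong
Affiliations: Beihang University; Beijing Normal University; City University of Macau; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: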

Abstract:We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Visual-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines the off-the-shelf active tracking methods with VLMs’ reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs’ limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by 72% with state-of-the-art RL-based approaches and 220% with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: this https URL.
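
The control flow described in the abstract (fast policy by default, VLM only on failure, with a memory of past recoveries) might look like the sketch below. Every callable is a hypothetical stand-in: `fast_policy`, `failure_detector`, and `vlm_recover` are not APIs from the paper, and a bounded deque is just one simple way to model the memory.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Tuple

def track_step(
    observation: Any,
    fast_policy: Callable[[Any], str],
    failure_detector: Callable[[Any], bool],
    vlm_recover: Callable[[Any, List[Tuple[Any, str]]], str],
    memory: Deque[Tuple[Any, str]],
) -> str:
    """Choose one action, escalating to the VLM only when tracking fails."""
    if not failure_detector(observation):
        return fast_policy(observation)  # cheap path for normal tracking
    # Slow path: the VLM sees the observation plus past recovery attempts.
    action = vlm_recover(observation, list(memory))
    memory.append((observation, action))  # remember for later reflection
    return action

memory: Deque[Tuple[Any, str]] = deque(maxlen=2)
fast = lambda obs: "follow"
lost = lambda obs: obs == "lost"
vlm = lambda obs, past: f"search-{len(past)}"

actions = [track_step(o, fast, lost, vlm, memory) for o in ["ok", "lost", "lost"]]
```

Keeping the VLM off the normal path is what preserves real-time tracking speed, since the expensive call happens only on the rare failure frames.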

[CV-110] Hierarchical Instruction-aware Embodied Visual Tracking

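[Quick read]: This paper addresses the challenge that User-Centric Embodied Visual Tracking (UC-EVT) poses for reinforcement learning-based models: the large gap between high-level user instructions and low-level agent actions. Recent language models (LLMs, VLMs, VLAs) improve instruction comprehension but are limited in inference speed or generalizability. The key to the solution is the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which uses spatial goals as intermediaries between instruction comprehension and action generation. HIEVT first uses an LLM-based Semantic-Spatial Goal Aligner to translate diverse user instructions into spatial goals that directly annotate the desired position; an RL-based Adaptive Goal-Aligned Policy then positions the target as the spatial goal specifies.

Link: https://arxiv.org/abs/2505.20710
Authors: Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong
Affiliations: Beihang University; City University of Macau; Peking University; Mohamed bin Zayed University of Artificial Intelligence; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: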

Abstract:User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT) agent, which bridges instruction comprehension and action generation using spatial goals as intermediaries. HIEVT first introduces an LLM-based Semantic-Spatial Goal Aligner to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then the RL-based Adaptive Goal-Aligned Policy, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at this https URL.

[CV-111] Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation ICML2025

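[Quick read]: This paper addresses Wild Test-Time Adaptation (WTTA): adapting a source model to unseen domains under extreme data scarcity and multiple distribution shifts. Existing methods focus on sample selection strategies and overlook the underlying optimization problem. The key to the solution is ReCAP, a region-integrated method that uses probabilistic region modeling to flexibly capture semantic changes in the embedding space, and a finite-to-infinite asymptotic approximation that turns the intractable region confidence into a tractable, upper-bounded proxy, improving adaptation efficiency.

Link: https://arxiv.org/abs/2505.20704
Authors: Zixuan Hu, Yichun Hu, Xiaotong Li, Shixiang Tang, Ling-Yu Duan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025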

Abstract:Wild Test-Time Adaptation (WTTA) is proposed to adapt a source model to unseen domains under extreme data scarcity and multiple shifts. Previous approaches mainly focused on sample selection strategies, while overlooking the fundamental problem on underlying optimization. Initially, we critically analyze the widely-adopted entropy minimization framework in WTTA and uncover its significant limitations in noisy optimization dynamics that substantially hinder adaptation efficiency. Through our analysis, we identify region confidence as a superior alternative to traditional entropy, however, its direct optimization remains computationally prohibitive for real-time applications. In this paper, we introduce a novel region-integrated method ReCAP that bypasses the lengthy process. Specifically, we propose a probabilistic region modeling scheme that flexibly captures semantic changes in embedding space. Subsequently, we develop a finite-to-infinite asymptotic approximation that transforms the intractable region confidence into a tractable and upper-bounded proxy. These innovations significantly unlock the overlooked potential dynamics in local region in a concise solution. Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios.

[CV-112] Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets

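[Quick read]: This paper addresses the challenges that high dimensionality and temporal complexity pose for video dataset distillation (VD), in particular the high computational cost of existing methods and their difficulty preserving temporal dynamics. The key to the solution is a uni-level video dataset distillation framework that optimizes synthetic videos directly with respect to a pre-trained model, plus a temporal saliency-guided filtering mechanism that uses inter-frame differences to guide distillation, strengthening motion preservation and suppressing frame-level redundancy.

Link: https://arxiv.org/abs/2505.20694
Authors: Xulin Gu, Xinhao Zhong, Zhixing Wei, Yimin Zhou, Shuoyang Sun, Bin Chen, Hongpeng Wang, Yuan Luo
Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Tsinghua Shenzhen International Graduate School; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: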

Abstract:Dataset distillation (DD) has emerged as a powerful paradigm for dataset compression, enabling the synthesis of compact surrogate datasets that approximate the training utility of large-scale ones. While significant progress has been achieved in distilling image datasets, extending DD to the video domain remains challenging due to the high dimensionality and temporal complexity inherent in video data. Existing video distillation (VD) methods often suffer from excessive computational costs and struggle to preserve temporal dynamics, as naïve extensions of image-based approaches typically lead to degraded performance. In this paper, we propose a novel uni-level video dataset distillation framework that directly optimizes synthetic videos with respect to a pre-trained model. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering mechanism that leverages inter-frame differences to guide the distillation process, encouraging the retention of informative temporal cues while suppressing frame-level redundancy. Extensive experiments on standard video benchmarks demonstrate that our method achieves state-of-the-art performance, bridging the gap between real and distilled video data and offering a scalable solution for video dataset compression.
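
One simple way to turn inter-frame differences into a per-frame saliency signal is sketched below. The mean absolute difference, the reuse of the first difference for frame 0, and the normalization to weights that sum to 1 are all assumptions of this example; the paper's actual filtering mechanism is not specified in the abstract.

```python
import numpy as np

def temporal_saliency(video: np.ndarray) -> np.ndarray:
    """Per-frame weights from inter-frame differences.

    video: (T, H, W, C) array with T >= 2 frames.
    Returns (T,) non-negative weights that sum to 1.
    """
    video = np.asarray(video, dtype=float)
    # Mean absolute change between consecutive frames: shape (T - 1,).
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2, 3))
    # Frame 0 has no predecessor, so it reuses the first difference.
    scores = np.concatenate([diffs[:1], diffs])
    total = scores.sum()
    if total == 0:
        # A fully static clip carries no motion cue: weight frames equally.
        return np.full(len(scores), 1.0 / len(scores))
    return scores / total

# Three 1x1 grayscale frames: nothing moves, then a sudden change.
clip = np.array([0.0, 0.0, 1.0]).reshape(3, 1, 1, 1)
weights = temporal_saliency(clip)
```

Frames that differ little from their neighbours receive low weight, which is the sense in which such a signal can suppress frame-level redundancy.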

[CV-113] VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

[Quick read]: This paper addresses high-throughput detection of microalgae cells, which is hard because of their varied sizes and complex conditions. The key to the solution is to bring computer vision to microalgae research: the work builds a dataset of 1000 images across six classes of microalgae and explores effective detection methods for small-target detection, motion blur, and complex backgrounds. It offers a technical path to more accurate microalgae detection and advances work at the intersection of ecology and artificial intelligence.

Link: https://arxiv.org/abs/2505.20687
Authors: Mingxuan Sun, Juntao Jiang, Zhiqiang Yang, Shenao Kong, Jiamin Qi, Jianru Shang, Shuangling Luo, Wanfa Sun, Tianyi Wang, Yanqi Wang, Qixuan Wang, Tingjian Dai, Tianxiang Chen, Jinming Zhang, Xuerui Zhang, Yuepeng He, Pengcheng Fu, Qiu Guan, Shizheng Zhou, Yanbo Yu, Qigui Jiang, Teng Zhou, Liuyong Shi, Hong Yan
Affiliations: Hainan University; Zhejiang University; Zhejiang University of Technology; Guangdong University of Science and Technology; Zhejiang University of Science and Technology; China Academy of Information and Communications Technology; Fuzhou University; China Three Gorges University; Chongqing University; University of Macau; ChiYU Intelligence Technology (Suzhou) Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second “Vision Meets Algae” (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets, handling motion blur, and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: this https URL.

[CV-114] Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

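[Quick read]: This paper addresses catastrophic forgetting in Continual Learning (CL) while making full use of the multi-modal structure and stable textual representations of pre-trained models (PTMs) such as CLIP. The key to the solution is a concise method based on incremental prompt tuning, Textual Prototype-guided Prompt Tuning (TPPT), which uses textual prototypes as stable anchors to guide the learning of visual prompts and thereby shape the embedding space, with a bidirectional supervision strategy that improves learning of new knowledge and reduces forgetting. Jointly optimizing visual and textual prompts and adding a relational diversity regularization on the textual anchors further narrows the vision-language gap and prevents embedding space collapse.

Link: https://arxiv.org/abs/2505.20680
Authors: Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong
Affiliations: University of New South Wales; CSIRO; The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint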

Abstract:Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, providing rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementations, that introduce additional, and possibly unnecessary, complexity, underutilizing CLIP’s intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP’s intrinsic guidance for continual adaptation.
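
The idea of textual prototypes acting as anchors can be illustrated with cosine-similarity logits, alongside a generic penalty that discourages anchors from collapsing onto each other. The penalty below is a stand-in chosen for illustration; the abstract does not define the paper's relational diversity regularization, and the function names and temperature value are assumptions.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def anchor_logits(visual_feats, text_anchors, temperature=0.07):
    """Cosine similarity of each visual feature to each class text anchor."""
    return l2_normalize(visual_feats) @ l2_normalize(text_anchors).T / temperature

def anchor_diversity_penalty(text_anchors) -> float:
    """Mean squared cosine similarity between distinct anchors.

    0 when anchors are mutually orthogonal, 1 when they all coincide.
    """
    a = l2_normalize(text_anchors)
    sim = a @ a.T
    off_diagonal = sim[~np.eye(len(a), dtype=bool)]
    return float(np.mean(off_diagonal ** 2))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])      # two class prototypes
images = np.array([[0.9, 0.1], [0.2, 0.8]])       # two visual features
predicted = anchor_logits(images, anchors).argmax(axis=1)
spread = anchor_diversity_penalty(anchors)
collapsed = anchor_diversity_penalty(np.array([[1.0, 1.0], [2.0, 2.0]]))
```

Because the anchors come from the text encoder, they can stay fixed or move slowly across tasks while the visual side adapts, which is the stability the abstract relies on.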

[CV-115] Supervised Contrastive Learning for Ordinal Engagement Measurement

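[Quick read]: This paper addresses video-based measurement of student engagement in virtual learning environments, with two key challenges: class imbalance, and treating engagement levels as ordered rather than as mere categories. The key to the solution is a new approach that uses supervised contrastive learning for ordinal classification: affective and behavioral features are extracted from video samples and used to train ordinal classifiers within a supervised contrastive learning framework (with a sequential classifier as the encoder), and several time-series data augmentation techniques are applied to improve training.

Link: https://arxiv.org/abs/2505.20676
Authors: Sadaf Safa, Ali Abedi, Shehroz S. Khan
Affiliations: KITE Research Institute, University Health Network, Toronto, Canada; Institute of Biomedical Engineering, University of Toronto, Toronto, Canada; College of Engineering and Technology, American University of the Middle East, Kuwait
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 1 figure, 5 tables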

Abstract:Student engagement plays a crucial role in the successful delivery of educational programs. Automated engagement measurement helps instructors monitor student participation, identify disengagement, and adapt their teaching strategies to enhance learning outcomes effectively. This paper identifies two key challenges in this problem: class imbalance and incorporating order into engagement levels rather than treating it as mere categories. Then, a novel approach to video-based student engagement measurement in virtual learning environments is proposed that utilizes supervised contrastive learning for ordinal classification of engagement. Various affective and behavioral features are extracted from video samples and utilized to train ordinal classifiers within a supervised contrastive learning framework (with a sequential classifier as the encoder). A key step involves the application of diverse time-series data augmentation techniques to these feature vectors, enhancing model training. The effectiveness of the proposed method was evaluated using a publicly available dataset for engagement measurement, DAiSEE, containing videos of students who participated in virtual learning programs. The results demonstrate the robust ability of the proposed method for the classification of the engagement level. This approach promises a significant contribution to understanding and enhancing student engagement in virtual learning environments.
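
The supervised contrastive objective the abstract refers to can be sketched in NumPy as below, following the widely used formulation in which every other sample with the same label is a positive. This covers only the generic loss: the ordinal handling, the encoder, and the augmentations from the paper are not modelled, and the temperature is an arbitrary choice.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1) -> float:
    """Supervised contrastive loss over a batch of at least two samples."""
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    sim = z @ z.T / temperature
    not_self = ~np.eye(n, dtype=bool)
    # Log-sum-exp over all other samples, computed stably per anchor.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # Positives: same label as the anchor, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0  # anchors without any positive are skipped
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(np.mean(-pos_log_prob[valid] / pos_counts[valid]))

feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_clustered = supcon_loss(feats, [0, 0, 1, 1])  # labels match geometry
loss_mixed = supcon_loss(feats, [0, 1, 0, 1])      # labels cut across it
```

The loss is small when same-label samples sit close together and large otherwise, so minimizing it pulls samples of the same engagement level into a shared region of the embedding space.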

[CV-116] Contrastive Desensitization Learning for Cross Domain Face Forgery Detection

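Quick read: This paper seeks a cross-domain face forgery detector that is insensitive to different, possibly unseen forgery methods while keeping the false positive rate acceptably low. Existing methods apply to multiple domains to some degree but usually come with a high false positive rate, which seriously hurts system usability. The key to the solution is the Contrastive Desensitization Network (CDN), built on a robust desensitization algorithm: it learns from domain transformations over pairs of genuine face images to capture the essential domain characteristics, so that the learned face representation is theoretically robust to domain changes.

Link: https://arxiv.org/abs/2505.20675
Authors: Lingyu Qiu, Ke Jiang, Xiaoyang Tan
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: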

Abstract: In this paper, we propose a new cross-domain face forgery detection method that is insensitive to different and possibly unseen forgery methods while ensuring an acceptably low false positive rate. Although existing face forgery detection methods are applicable to multiple domains to some degree, they often come with a high false positive rate, which can greatly disrupt the usability of the system. To address this issue, we propose a Contrastive Desensitization Network (CDN) based on a robust desensitization algorithm, which captures the essential domain characteristics by learning them from domain transformation over pairs of genuine face images. One advantage of CDN is that the learnt face representation is theoretically justified with regard to its robustness against domain changes. Extensive experiments over large-scale benchmark datasets demonstrate that our method achieves a much lower false alarm rate with improved detection accuracy compared to several state-of-the-art methods.

[CV-117] DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving

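Quick read: This paper addresses two problems in autonomous driving: conventional end-to-end models generalize poorly in complex scenarios, and existing vision-language models (VLMs) rely on isolated modules and static supervision, which limits their support for multi-stage decision-making. The key to the solution is the AutoDriveRL framework, which formulates autonomous driving as a structured reasoning process over four core tasks and optimizes each task with a task-specific reward model, providing fine-grained reinforcement signals at different reasoning stages. The DriveRX model, designed for real-time decision-making, is trained within this framework.

Link: https://arxiv.org/abs/2505.20665
Authors: Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; Zhongguancun Academy; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: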

Abstract:Autonomous driving requires real-time, robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. Recent vision-language models (VLMs) have been applied to driving tasks, but they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language question-answering problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for real-time decision-making. DriveRX achieves strong performance on a public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. Our analysis further highlights the impact of vision encoder design and reward-guided reasoning compression. We will release the AutoDriveRL framework and the DriveRX model to support future research.

[CV-118] Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

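Quick read: This paper addresses the shortcomings of traditional photography composition methods when subjects in a scene are poorly arranged, along with the challenges of implementing photography perspective composition (PPC): the scarcity of perspective transformation datasets and the absence of assessment criteria for perspective quality. The key to the solution is threefold: an automated framework for building PPC datasets, a video generation approach that shows the transformation from suboptimal to optimal perspectives, and a perspective quality assessment (PQA) model built on human performance, which together help ordinary users improve their composition skills.

Link: https://arxiv.org/abs/2505.20655
Authors: Lujian Yao, Siming Zheng, Xinbin Yuan, Zhuoxuan Cai, Pu Wu, Jinwei Chen, Bo Li, Peng-Tao Jiang
Affiliations: vivo Mobile Communication Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: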

Abstract:Traditional photography composition approaches are dominated by 2D cropping-based methods. However, these methods fall short when scenes contain poorly arranged subjects. Professional photographers often employ perspective adjustment as a form of 3D recomposition, modifying the projected 2D relationships between subjects while maintaining their actual spatial positions to achieve better compositional balance. Inspired by this artistic practice, we propose photography perspective composition (PPC), extending beyond traditional cropping-based methods. However, implementing the PPC faces significant challenges: the scarcity of perspective transformation datasets and undefined assessment criteria for perspective quality. To address these challenges, we present three key contributions: (1) An automated framework for building PPC datasets through expert photographs. (2) A video generation approach that demonstrates the transformation process from suboptimal to optimal perspectives. (3) A perspective quality assessment (PQA) model constructed based on human performance. Our approach is concise and requires no additional prompt instructions or camera trajectories, helping and guiding ordinary users to enhance their composition skills.

[CV-119] RoGA: Towards Generalizable Deepfake Detection through Robust Gradient Alignment ICME2025

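Quick read: This paper addresses domain generalization in deepfake detection. Existing methods add extra modules to prevent overfitting to domain-specific patterns, but this regularization can hinder optimization of the empirical risk minimization (ERM) objective and degrade model performance. The key to the solution is a new learning objective that aligns the domain-generalization gradient updates with the ERM gradient updates: perturbations are applied to the model parameters so that the ascending points across domains are aligned, which improves robustness to domain shift. The method preserves domain-invariant features while managing domain-specific ones, without introducing additional regularization.

Link: https://arxiv.org/abs/2505.20653
Authors: Lingyu Qiu, Ke Jiang, Xiaoyang Tan
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICME2025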

Abstract:Recent advancements in domain generalization for deepfake detection have attracted significant attention, with previous methods often incorporating additional modules to prevent overfitting to domain-specific patterns. However, such regularization can hinder the optimization of the empirical risk minimization (ERM) objective, ultimately degrading model performance. In this paper, we propose a novel learning objective that aligns generalization gradient updates with ERM gradient updates. The key innovation is the application of perturbations to model parameters, aligning the ascending points across domains, which specifically enhances the robustness of deepfake detection models to domain shifts. This approach effectively preserves domain-invariant features while managing domain-specific characteristics, without introducing additional regularization. Experimental results on multiple challenging deepfake detection datasets demonstrate that our gradient alignment strategy outperforms state-of-the-art domain generalization techniques, confirming the efficacy of our method. The code is available at this https URL.

[CV-120] Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design IJCAI2025

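Quick read: This paper addresses content-aware layout generation in AI-empowered poster design, in particular the large parameter counts that existing methods need to perceive background images, which limit real-time performance and generalization. The key to the solution is Scan-and-Print, a patch-level data summarization and augmentation approach. The scan procedure selects only the image patches suitable for placing element vertices, enabling efficient fine-grained perception; the print procedure mixes patches and vertices across different image-layout pairs to synthesize over 100% new samples, which sharply reduces the computational bottleneck while improving layout quality.

Link: https://arxiv.org/abs/2505.20649
Authors: HsiaoYuan Hsu, Yuxin Peng
Affiliations: Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCAI 2025 (AI, Arts and Creativity). Project page is at this https URL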

Abstract:In AI-empowered poster design, content-aware layout generation is crucial for the on-image arrangement of visual-textual elements, e.g., logo, text, and underlay. To perceive the background images, existing work demanded a high parameter count that far exceeds the size of available training data, which has impeded the model’s real-time performance and generalization ability. To address these challenges, we proposed a patch-level data summarization and augmentation approach, vividly named Scan-and-Print. Specifically, the scan procedure selects only the patches suitable for placing element vertices to perform fine-grained perception efficiently. Then, the print procedure mixes up the patches and vertices across two image-layout pairs to synthesize over 100% new samples in each epoch while preserving their plausibility. Besides, to facilitate the vertex-level operations, a vertex-based layout representation is introduced. Extensive experimental results on widely used benchmarks demonstrated that Scan-and-Print can generate visually appealing layouts with state-of-the-art quality while dramatically reducing computational bottleneck by 95.2%.

[CV-121] HCQA-1.5 @ Ego4D EgoSchema Challenge 2025 CVPR

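Quick read: This paper aims to improve the reliability of answer prediction in egocentric video question answering. The key to the solution is an extension of the previously proposed HCQA framework: a multi-source aggregation strategy generates diverse predictions, and a confidence-based filtering mechanism directly selects high-confidence answers; for low-confidence cases, a fine-grained reasoning module performs additional visual and contextual analysis to refine the predictions.

Link: https://arxiv.org/abs/2505.20644
Authors: Haoyu Zhang, Yisen Feng, Qiaohui Chu, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The third-place solution for the Ego4D EgoSchema Challenge at the CVPR EgoVis Workshop 2025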

Abstract:In this report, we present the method that achieves third place for Ego4D EgoSchema Challenge in CVPR 2025. To improve the reliability of answer prediction in egocentric video question answering, we propose an effective extension to the previously proposed HCQA framework. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism that selects high-confidence answers directly. For low-confidence cases, we incorporate a fine-grained reasoning module that performs additional visual and contextual analysis to refine the predictions. Evaluated on the EgoSchema blind test set, our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions, outperforming last year’s winning solution and the majority of participating teams. Our code will be added at this https URL.
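
The aggregate-then-filter idea can be pictured with a small sketch. The abstract does not say how confidence is computed, so the version below simply assumes confidence is the vote share of the most common answer; the function name `aggregate_answers` and the 0.6 threshold are hypothetical.

```python
from collections import Counter

def aggregate_answers(predictions, threshold=0.6):
    """Majority-vote aggregation with a confidence gate (illustrative assumption).

    Returns (answer, confidence, needs_refinement). Confidence is taken to be
    the share of sources that agree with the most common answer.
    """
    counts = Counter(predictions)
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(predictions)
    # Low-confidence cases would be handed to a slower refinement step.
    return answer, confidence, confidence < threshold

# Four of five sources agree: accepted directly.
confident = aggregate_answers(["B", "B", "B", "A", "B"])
# Sources disagree: flagged for further reasoning.
uncertain = aggregate_answers(["A", "B", "C", "A", "D"])
```

In this reading, only the flagged cases pay the cost of the extra visual and contextual analysis.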

[CV-122] See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction

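Quick read: This paper addresses the performance drop of occupancy prediction in nighttime scenes. Existing vision-based methods perform well on daytime benchmarks but struggle at night because of limited visibility and challenging lighting. The key to the solution is LIAR, a framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which uses illumination priors from daytime scenes to decide whether a nighttime image is genuinely dark or sufficiently well lit, enabling more targeted global enhancement. It then adds two illumination-aware modules, 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), which handle local underexposure and overexposure respectively to improve feature quality and semantic understanding.

Link: https://arxiv.org/abs/2505.20641
Authors: Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, Jian Yang
Affiliations: Nanjing University of Science and Technology; National University of Singapore; Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: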

Abstract: Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available at this https URL.

[CV-123] IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios

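Quick read: This paper addresses the fact that existing Embodied Question Answering (EQA) benchmarks focus mainly on household environments and overlook the safety-critical aspects and reasoning processes of industrial settings, which limits the evaluation of agents for real-world industrial use. The key to the solution is IndustryEQA, the first benchmark dedicated to evaluating embodied agents in safety-critical industrial warehouse scenarios. Built on the NVIDIA Isaac Sim platform, it provides high-fidelity episodic memory videos with diverse industrial assets, dynamic human agents, and hazardous situations designed from real-world safety guidelines, together with rich annotations covering six categories, so that agents' perception and reasoning abilities can be evaluated comprehensively.

Link: https://arxiv.org/abs/2505.20640
Authors: Yifan Li, Yuhang Chen, Anh Dao, Lichi Li, Zhongyi Cai, Zhen Tan, Tianlong Chen, Yu Kong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: v1.0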

Abstract: Existing Embodied Question Answering (EQA) benchmarks primarily focus on household environments, often overlooking safety-critical aspects and reasoning processes pertinent to industrial settings. This drawback limits the evaluation of agent readiness for real-world industrial applications. To bridge this gap, we introduce IndustryEQA, the first benchmark dedicated to evaluating embodied agent capabilities within safety-critical warehouse scenarios. Built upon the NVIDIA Isaac Sim platform, IndustryEQA provides high-fidelity episodic memory videos featuring diverse industrial assets, dynamic human agents, and carefully designed hazardous situations inspired by real-world safety guidelines. The benchmark includes rich annotations covering six categories: equipment safety, human safety, object recognition, attribute recognition, temporal understanding, and spatial understanding. It also provides additional reasoning evaluation based on these categories. Specifically, it comprises 971 question-answer pairs generated from small warehouses and 373 pairs from large ones, incorporating scenarios with and without humans. We further propose a comprehensive evaluation framework, including various baseline models, to assess their general perception and reasoning abilities in industrial environments. IndustryEQA aims to steer EQA research towards developing more robust, safety-aware, and practically applicable embodied agents for complex industrial environments. Benchmark and codes are available.

[CV-124] Open-Det: An Efficient Learning Framework for Open-Ended Detection ICML2025

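Quick read: This paper addresses the large training-data requirements, slow convergence, and limited performance of existing Open-Ended object Detection (OED) models. The key to the solution is Open-Det, an efficient framework that reconstructs the Object Detector and the Object Name Generator to speed up training of both bounding-box and object-name generation, introduces a Vision-Language Aligner and a Prompts Distiller to bridge the semantic gap between the vision and language modalities, and designs a Masked Alignment Loss and a Joint Loss to improve classification and training efficiency.

Link: https://arxiv.org/abs/2505.20639
Authors: Guiping Cao, Tao Wang, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025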

Abstract: Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, without requiring additional vocabularies during inference. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, combined with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes are available at: this https URL.

[CV-125] Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs

Quick read: This paper addresses the difficulty that general multimodal large language models have with continuous, densely layered audio-visual content and intricate temporal dynamics in the music domain, especially in Music Audio-Visual Question Answering (Music AVQA). The key elements it identifies are specialized input processing, architectures with dedicated spatial-temporal designs, and music-specific modeling strategies, which it links empirically to strong performance in this domain.

Link: https://arxiv.org/abs/2505.20638
Authors: Wenhao You, Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Zhongyu Ouyang, Chiyu Ma, Tingxuan Wu, Noah Wei, Zong Ke, Ming Cheng, Soroush Vosoughi, Jiang Gui
Affiliations: University of Waterloo; Dartmouth College; Shandong University; The London School of Economics and Political Science; University of Hong Kong; National University of Singapore
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:While recent Multimodal Large Language Models exhibit impressive capabilities for general multimodal tasks, specialized domains like music necessitate tailored approaches. Music Audio-Visual Question Answering (Music AVQA) particularly underscores this, presenting unique challenges with its continuous, densely layered audio-visual content, intricate temporal dynamics, and the critical need for domain-specific knowledge. Through a systematic analysis of Music AVQA datasets and methods, this position paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain. Our study provides valuable insights for researchers by highlighting effective design patterns empirically linked to strong performance, proposing concrete future directions for incorporating musical priors, and aiming to establish a robust foundation for advancing multimodal musical understanding. This work is intended to inspire broader attention and further research, supported by a continuously updated anonymous GitHub repository of relevant papers: this https URL.

[CV-126] TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone

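Quick read: This paper addresses inconsistent performance of Facial Affect Analysis (FAA) systems across demographic groups, in particular the measurement bias of skin tone as a sensitive attribute. The key to the solution is a comparison of two skin tone classification methods, the widely used Individual Typology Angle (ITA) and a perceptually grounded method based on Lightness (L*) and Hue (H*), with fairness evaluated using the AffectNet dataset and a MobileNet-based model. The study finds that ITA is limited by its sensitivity to lighting conditions, whereas the H*-L* method yields more consistent subgrouping and clearer diagnostics through fairness metrics such as Equal Opportunity.

Link: https://arxiv.org/abs/2505.20637
Authors: Ana M. Cabanas, Alma Pedro, Domingo Mery
Affiliations: Universidad de Tarapacá; Pontificia Universidad Católica de Chile
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages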

Abstract: Understanding how facial affect analysis (FAA) systems perform across different demographic groups requires reliable measurement of sensitive attributes such as ancestry, often approximated by skin tone, which itself is highly influenced by lighting conditions. This study compares two objective skin tone classification methods: the widely used Individual Typology Angle (ITA) and a perceptually grounded alternative based on Lightness (L*) and Hue (H*). Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method. Results reveal a severe underrepresentation of dark skin tones (~2%), alongside fairness disparities in F1-score (up to 0.08) and TPR (up to 0.11) across groups. While ITA shows limitations due to its sensitivity to lighting, the H*-L* method yields more consistent subgrouping and enables clearer diagnostics through metrics such as Equal Opportunity. Grad-CAM analysis further highlights differences in model attention patterns by skin tone, suggesting variation in feature encoding. To support future mitigation efforts, we also propose a modular fairness-aware pipeline that integrates perceptual skin tone estimation, model interpretability, and fairness evaluation. These findings emphasize the relevance of skin tone measurement choices in fairness assessment and suggest that ITA-based evaluations may overlook disparities affecting darker-skinned individuals.
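
For readers unfamiliar with ITA, the standard definition in CIELAB space is ITA = arctan((L* - 50) / b*) x 180/pi. The sketch below computes it, together with the CIELAB hue angle; treating that hue angle as the paper's H* is an assumption, and the function names and sample values are hypothetical.

```python
import math

def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 is used instead of atan to avoid dividing by zero; the two agree
    whenever b* is positive, which is the usual case for skin.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))

def hue_degrees(a_star, b_star):
    """CIELAB hue angle in degrees (assumed here to correspond to H*)."""
    return math.degrees(math.atan2(b_star, a_star)) % 360.0

# The same skin patch under bright and dim lighting (made-up values):
# only L* changes, yet ITA shifts noticeably.
ita_bright = ita_degrees(70.0, 20.0)
ita_dim = ita_degrees(60.0, 20.0)
```

Because ITA depends directly on L*, a change in illumination alone moves the angle, which is consistent with the lighting sensitivity the abstract reports.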

[CV-127] Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

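Quick read: This paper addresses a limitation of text-image-to-video (TI2V) generation: existing methods add visual conditions by fine-tuning, which is resource-intensive and restricted to a few predefined conditioning settings. The key to the solution is FlexTI2V, a training-free method that inverts the condition images into noisy representations in a latent space and, during the denoising process of the T2V model, uses a random patch swapping strategy to incorporate visual features from local image patches into the video representation, enabling flexible visual conditioning. A dynamic control mechanism additionally balances creativity and fidelity in the generated video.

Link: https://arxiv.org/abs/2505.20629
Authors: Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg
Affiliations: Georgia Institute of Technology; University of Illinois Urbana-Champaign; Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 11 figures, 4 tables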

Abstract: Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images into noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also provide further insights into our method through a detailed ablation study and analysis.

[CV-128] ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

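Quick read: This paper addresses consistent character generation in text-to-image models, that is, achieving text alignment while keeping the subject's appearance consistent. Because style and appearance are entangled, existing methods struggle to preserve consistent subject characteristics while following varying style prompts. The key to the solution is a training-free method that manipulates the attention matrices: Queries and Keys come from the anchor image(s) that define the subject, Values come from a parallel copy that is not subject-anchored, and cross-image components are added by expanding the Key and Value matrices, achieving both style alignment and subject consistency.

Link: https://arxiv.org/abs/2505.20626
Authors: Yohai Mazuz, Janna Bruner, Lior Wolf
Affiliations: Tel Aviv University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: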

Abstract: In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject's appearance across different prompts. However, since style and appearance are often entangled, the existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that achieves both style alignment and subject consistency. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do so without shifting from the target style, we align the statistics of the Value matrices. As is demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.
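
The Query/Key versus Value split described in the abstract can be illustrated with plain scaled dot-product attention. This is a toy sketch on tiny lists, not the paper's implementation: `mixed_attention` and its argument names are hypothetical, and the cross-image expansion and Value-statistics alignment are not reproduced.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_attention(q_anchor, k_anchor, v_style):
    """Attention whose weights come from one branch and whose values come from another.

    q_anchor, k_anchor: Queries and Keys from the subject-anchored branch.
    v_style: Values from a parallel branch that is not subject-anchored.
    """
    dim = len(q_anchor[0])
    output = []
    for q in q_anchor:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in k_anchor]
        weights = softmax(scores)
        # Weighted sum of the style-branch values.
        output.append([
            sum(weights[j] * v_style[j][c] for j in range(len(v_style)))
            for c in range(len(v_style[0]))
        ])
    return output

queries = [[10.0, 0.0], [0.0, 10.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]
style_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mixed = mixed_attention(queries, keys, style_values)
```

The attention pattern (where each token looks) is decided by the anchored branch, while the content that is copied comes from the style branch, which is one way to read the decoupling the abstract describes.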

[CV-129] OccLE: Label-Efficient 3D Semantic Occupancy Prediction

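Quick read: This paper addresses the heavy reliance of 3D semantic occupancy prediction on costly voxel-level annotations, and the suboptimal performance of self-supervised methods, which provide limited guidance. The key to the solution is OccLE, which decouples the semantic and geometric learning tasks and fuses the feature grids learned by both for the final semantic occupancy prediction. It distills a 2D foundation model into pseudo labels and uses a semi-supervised strategy to strengthen geometry learning, achieving high performance with limited annotations.

Link: https://arxiv.org/abs/2505.20617
Authors: Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin
Affiliations: S-Lab, Nanyang Technological University, Singapore; School of Mechanical Engineering, Zhejiang University, China; Institute for Infocomm Research, A*STAR, Singapore; School of Computer Science and Engineering, Nanyang Technological University, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: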

Abstract: 3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a label-efficient 3D semantic occupancy prediction method that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations, reaching an mIoU of 16.59% on the SemanticKITTI validation set.

[CV-130] Intelligent Incident Hypertension Prediction in Obstructive Sleep Apnea

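Quick read: This paper addresses the prediction of whether individuals with Obstructive Sleep Apnea (OSA) will develop hypertension within five years, which remains a difficult problem. The key to the solution is a transfer learning approach based on the Discrete Cosine Transform (DCT): all polysomnography signals are integrated and converted into a 2D representation so that pre-trained 2D neural networks can be used for feature extraction and classification. A DCT layer added to the model transforms the input features into a frequency-domain representation, preserving essential spectral information, decorrelating features, and improving robustness to noise, which helps generalization on limited medical datasets.

Link: https://arxiv.org/abs/2505.20615
Authors: Omid Halimi Milani, Ahmet Enis Cetin, Bharati Prasad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at EUSIPCO 2025. Camera-ready due June 20, 2025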

Abstract:Obstructive sleep apnea (OSA) is a significant risk factor for hypertension, primarily due to intermittent hypoxia and sleep fragmentation. Predicting whether individuals with OSA will develop hypertension within five years remains a complex challenge. This study introduces a novel deep learning approach that integrates Discrete Cosine Transform (DCT)-based transfer learning to enhance prediction accuracy. We are the first to incorporate all polysomnography signals together for hypertension prediction, leveraging their collective information to improve model performance. Features were extracted from these signals and transformed into a 2D representation to utilize pre-trained 2D neural networks such as MobileNet, EfficientNet, and ResNet variants. To further improve feature learning, we introduced a DCT layer, which transforms input features into a frequency-based representation, preserving essential spectral information, decorrelating features, and enhancing robustness to noise. This frequency-domain approach, coupled with transfer learning, is especially beneficial for limited medical datasets, as it leverages rich representations from pre-trained networks to improve generalization. By strategically placing the DCT layer at deeper truncation depths within EfficientNet, our model achieved a best area under the curve (AUC) of 72.88%, demonstrating the effectiveness of frequency-domain feature extraction and transfer learning in predicting hypertension risk in OSA patients over a five-year period.
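
To make the "frequency-based representation" concrete, here is a minimal sketch of the orthonormal DCT-II applied to a 1-D feature vector. It only shows the transform itself; where the paper places its DCT layer inside EfficientNet, and on which tensor dimensions, is not stated in the abstract, so the function name `dct_ii` and the toy inputs are illustrative assumptions.

```python
import math

def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (pure-Python illustration)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        # Project the signal onto the k-th cosine basis function.
        acc = sum(signal[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs

flat = dct_ii([1.0, 1.0, 1.0, 1.0])   # constant input: all energy lands in coefficient 0
ramp = dct_ii([1.0, 2.0, 3.0, 4.0])   # smooth input: energy concentrates in low frequencies
```

With the orthonormal scaling the transform preserves total energy, so it reorganises information across frequencies rather than discarding it.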

[CV-131] Mamba-Driven Topology Fusion for Monocular 3-D Human Pose Estimation

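Quick read: This paper addresses the computational complexity of Transformer-based 3D human pose estimation and the limitations of the Mamba model when processing 3D joint sequences that have topological structure. The key to the solution is the Mamba-Driven Topology Fusion framework: a Bone Aware Module provides skeletal topology guidance, the convolutional structure inside Mamba is enhanced to capture local joint dependencies, and a Spatiotemporal Refinement Module models temporal and spatial relationships within the sequence, improving the model's ability to capture human structural relationships.

Link: https://arxiv.org/abs/2505.20611
Authors: Zenghao Zheng, Lianping Yang, Jinshan Pan, Hegui Zhu
Affiliations: Northeastern University; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: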

Abstract: Transformer-based methods for 3-D human pose estimation face significant computational challenges due to the quadratic growth of self-attention mechanism complexity with sequence length. Recently, the Mamba model has substantially reduced computational overhead and demonstrated outstanding performance in modeling long sequences by leveraging state space models (SSMs). However, the way SSMs process sequential data is not well suited to 3-D joint sequences with topological structures, and the causal convolution structure in Mamba also lacks insight into local joint relationships. To address these issues, we propose the Mamba-Driven Topology Fusion framework in this paper. Specifically, the proposed Bone Aware Module infers the direction and length of bone vectors in the spherical coordinate system, providing effective topological guidance for the Mamba model in processing joint sequences. Furthermore, we enhance the convolutional structure within the Mamba model by integrating forward and backward graph convolutional networks, enabling it to better capture local joint dependencies. Finally, we design a Spatiotemporal Refinement Module to model both temporal and spatial relationships within the sequence. Through the incorporation of skeletal topology, our approach effectively alleviates Mamba's limitations in capturing human structural relationships. We conduct extensive experiments on the Human3.6M and MPI-INF-3DHP datasets for testing and comparison, and the results show that the proposed method greatly reduces computational cost while achieving higher accuracy. Ablation studies further demonstrate the effectiveness of each proposed module. The code and models will be released.
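
The phrase "direction and length of bone vectors in the spherical coordinate system" can be read as the following conversion. This sketch assumes the common physics convention (polar angle measured from the +z axis, azimuth in the x-y plane); the paper's actual convention and the function name `bone_to_spherical` are assumptions.

```python
import math

def bone_to_spherical(parent, child):
    """Convert the bone vector (child - parent) to (length, polar, azimuth).

    Angles are in radians. A zero-length bone returns all zeros.
    """
    dx, dy, dz = (c - p for c, p in zip(child, parent))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0.0:
        return 0.0, 0.0, 0.0
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    polar = math.acos(max(-1.0, min(1.0, dz / length)))
    azimuth = math.atan2(dy, dx)
    return length, polar, azimuth

# A bone pointing straight up the z axis, and one pointing along x.
vertical = bone_to_spherical((0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
horizontal = bone_to_spherical((1.0, 1.0, 1.0), (2.0, 1.0, 1.0))
```

Separating length from direction in this way lets a model reason about bone orientation independently of limb size.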

[CV-132] OmniIndoor3D: Comprehensive Indoor 3D Reconstruction

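Quick read: This paper addresses comprehensive 3D reconstruction of indoor scenes, specifically accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. The key to the solution is OmniIndoor3D, a framework based on 3D Gaussian representations. It fuses multiple RGB-D images into a coarse 3D reconstruction, which initializes the 3D Gaussians and guides 3D Gaussian Splatting (3DGS) training. A lightweight multilayer perceptron (MLP) decouples the optimization conflict between appearance and geometry and acts as a low-pass filter for geometry reconstruction to reduce noise. A densification strategy for Gaussian primitives guided by panoptic priors encourages smoothness on planar surfaces. Joint optimization of appearance, geometry, and panoptic reconstruction yields high-quality indoor scene understanding.

Link: https://arxiv.org/abs/2505.20610
Authors: Xiaobao Wei, Xiaoan Zhang, Hao Wang, Qingpo Wuwu, Ming Lu, Wenzhao Zheng, Shanghang Zhang
Affiliations: Peking University; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: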

Abstract:We propose a novel framework for comprehensive indoor 3D reconstruction using Gaussian representations, called OmniIndoor3D. This framework enables accurate appearance, geometry, and panoptic reconstruction of diverse indoor scenes captured by a consumer-level RGB-D camera. Since 3DGS is primarily optimized for photorealistic rendering, it lacks the precise geometry critical for high-quality panoptic reconstruction. Therefore, OmniIndoor3D first combines multiple RGB-D images to create a coarse 3D reconstruction, which is then used to initialize the 3D Gaussians and guide the 3DGS training. To decouple the optimization conflict between appearance and geometry, we introduce a lightweight MLP that adjusts the geometric properties of 3D Gaussians. The introduced lightweight MLP serves as a low-pass filter for geometry reconstruction and significantly reduces noise in indoor scenes. To improve the distribution of Gaussian primitives, we propose a densification strategy guided by panoptic priors to encourage smoothness on planar surfaces. Through the joint optimization of appearance, geometry, and panoptic reconstruction, OmniIndoor3D provides comprehensive 3D indoor scene understanding, which facilitates accurate and robust robotic navigation. We perform thorough evaluations across multiple datasets, and OmniIndoor3D achieves state-of-the-art results in appearance, geometry, and panoptic reconstruction. We believe our work bridges a critical gap in indoor 3D reconstruction. The code will be released at: this https URL

[CV-133] Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting

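【Quick Read】: This paper addresses the problem that face reenactment and portrait relighting are traditionally handled independently, with little synergy. The key to its solution is Total-Editing, a unified portrait editing framework: a neural radiance field decoder with intrinsic decomposition capabilities enables seamless integration of lighting information, and a deformation field based on moving least squares improves the spatiotemporal coherence of avatar motion and shading effects, giving precise control over appearance, motion, and lighting.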

Link: https://arxiv.org/abs/2505.20582
Authors: Yizhou Zhao, Chunjiang Liu, Haoyu Chen, Bhiksha Raj, Min Xu, Tadas Baltrusaitis, Mitch Rundle, HsiangTao Wu, Kamran Ghasedi
Affiliations: Carnegie Mellon University; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face reenactment and portrait relighting are essential tasks in portrait editing, yet they are typically addressed independently, without much synergy. Most face reenactment methods prioritize motion control and multiview consistency, while portrait relighting focuses on adjusting shading effects. To take advantage of both geometric consistency and illumination awareness, we introduce Total-Editing, a unified portrait editing framework that enables precise control over appearance, motion, and lighting. Specifically, we design a neural radiance field decoder with intrinsic decomposition capabilities. This allows seamless integration of lighting information from portrait images or HDR environment maps into synthesized portraits. We also incorporate a moving least squares based deformation field to enhance the spatiotemporal coherence of avatar motion and shading effects. With these innovations, our unified framework significantly improves the quality and realism of portrait editing results. Further, the multi-source nature of Total-Editing supports more flexible applications, such as illumination transfer from one portrait to another, or portrait animation with customized backgrounds.

[CV-134] Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models ACL

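【Quick Read】: This paper tackles Object Hallucination (OH) in Large Vision-Language Models. The key to its solution is RVCD (Retrieval Visual Contrastive Decoding), which leverages both negative and positive images at the logit level, explicitly referencing AI-generated images that each represent a single concept, to suppress OH.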

Link: https://arxiv.org/abs/2505.20569
Authors: Jihoon Lee, Min Song
Affiliations: Yonsei University; ONOMA AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACL Findings camera-ready version. Code is released at this https URL

Abstract:Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.

[CV-135] Bi-Level Unsupervised Feature Selection

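【Quick Read】: This paper addresses the problem that existing unsupervised feature selection (UFS) methods usually build models from a single perspective and struggle to evaluate feature importance while preserving the inherent data structure, which limits their performance. The key to its solution is a bi-level unsupervised feature selection (BLUFS) method with a clustering level and a feature level: at the clustering level, spectral clustering generates pseudo-labels that represent the data structure, and a continuous linear regression model learns the projection matrix; at the feature level, an $\ell_{2,0}$-norm constraint is imposed on the projection matrix to select features more effectively. An efficient proximal alternating minimization (PAM) algorithm is designed to solve the model, and its convergence and computational complexity are established.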

Link: https://arxiv.org/abs/2505.20563
Authors: Jingjing Liu, Xiansen Ju, Xianchao Xiu, Wanquan Liu
Affiliations: Shanghai Key Laboratory of Automobile Intelligent Network Interaction Chip and System, School of Microelectronics, Shanghai University; School of Mechatronic Engineering and Automation, Shanghai University; School of Intelligent Systems Engineering, Sun Yat-sen University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised feature selection (UFS) is an important task in data engineering. However, most UFS methods construct models from a single perspective and often fail to simultaneously evaluate feature importance and preserve their inherent data structure, thus limiting their performance. To address this challenge, we propose a novel bi-level unsupervised feature selection (BLUFS) method, including a clustering level and a feature level. Specifically, at the clustering level, spectral clustering is used to generate pseudo-labels for representing the data structure, while a continuous linear regression model is developed to learn the projection matrix. At the feature level, the $\ell_{2,0}$-norm constraint is imposed on the projection matrix for more effectively selecting features. To the best of our knowledge, this is the first work to combine a bi-level framework with the $\ell_{2,0}$-norm. To solve the proposed bi-level model, we design an efficient proximal alternating minimization (PAM) algorithm, whose subproblems either have explicit solutions or can be computed by fast solvers. Furthermore, we establish the convergence result and computational complexity. Finally, extensive experiments on two synthetic datasets and eight real datasets demonstrate the superiority of BLUFS in clustering and classification tasks.
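The $\ell_{2,0}$-norm of a projection matrix counts its nonzero rows, so constraining it to at most k keeps at most k features. A standard way to satisfy such a constraint is to keep the k rows with the largest Euclidean norm and zero the rest. The sketch below illustrates that row-wise hard thresholding in plain Python; the function names are hypothetical, and whether BLUFS applies exactly this step inside its PAM solver is an assumption, not something the abstract states.

```python
import math

def l20_norm(W):
    """Number of rows of W that contain at least one nonzero entry."""
    return sum(1 for row in W if any(v != 0 for v in row))

def project_l20(W, k):
    """Keep the k rows of W with the largest l2 norm and zero the others.

    Rows correspond to features, so the surviving rows are the selected features.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in W]
    ranked = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:k])
    return [list(row) if i in keep else [0.0] * len(row)
            for i, row in enumerate(W)]

# Toy projection matrix: 3 features (rows) projected to 2 dimensions (columns).
W = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0]]
P = project_l20(W, k=2)
print(P)            # [[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]]
print(l20_norm(P))  # 2
```

Unlike an $\ell_{2,1}$ penalty, which only encourages row sparsity, this projection fixes the number of selected features directly through k.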

[CV-136] Causality and “In-the-Wild” Video-Based Person Re-ID: A Survey

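【Quick Read】: This paper addresses the brittleness of video-based person re-identification (Re-ID) in real-world deployments; the core challenge is that existing models rely on superficial correlations (such as clothing, background, or lighting) and fail to generalize across domains, viewpoints, and temporal variations. The key to the solution is causal reasoning: structural causal models, interventions, and counterfactual reasoning are used to separate identity-specific features from confounding factors, improving robustness and generalization.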

Link: https://arxiv.org/abs/2505.20540
Authors: Md Rashidunnabi, Kailash Hambarde, Hugo Proença
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 9 figures

Abstract:Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.

[CV-137] MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

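【Quick Read】: This paper addresses how to merge multiple Low-Rank Adaptation (LoRA) adapters without training to generate complex visual compositions, a setting where current methods perform poorly on scenes with diverse visual elements. The key to its solution is MultLFG, a framework that uses frequency-domain guidance for adaptive fusion of multiple LoRAs: a timestep- and frequency-subband-adaptive strategy selectively activates the relevant LoRAs at specific timesteps and frequency bands according to content relevance, which improves spatial coherence and gives finer control over multi-LoRA composition.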

Link: https://arxiv.org/abs/2505.20525
Authors: Aniket Roy, Maitreya Suin, Ketul Shah, Rama Chellappa
Affiliations: Johns Hopkins University; Samsung AI Center Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-Rank Adaptation (LoRA) has gained prominence as a computationally efficient method for fine-tuning generative models, enabling distinct visual concept synthesis with minimal overhead. However, current methods struggle to effectively merge multiple LoRA adapters without training, particularly in complex compositions involving diverse visual elements. We introduce MultLFG, a novel framework for training-free multi-LoRA composition that utilizes frequency-domain guidance to achieve adaptive fusion of multiple LoRAs. Unlike existing methods that uniformly aggregate concept-specific LoRAs, MultLFG employs a timestep and frequency subband adaptive fusion strategy, selectively activating relevant LoRAs based on content relevance at specific timesteps and frequency bands. This frequency-sensitive guidance not only improves spatial coherence but also provides finer control over multi-LoRA composition, leading to more accurate and consistent results. Experimental evaluations on the ComposLoRA benchmark reveal that MultLFG substantially enhances compositional fidelity and image quality across various styles and concept sets, outperforming state-of-the-art baselines in multi-concept generation tasks. Code will be released.

[CV-138] MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning CVPR2025

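【Quick Read】: This paper addresses the lack of robustness of handwritten text recognition (HTR) across diverse writing styles, in particular the absence of writer-specific personalization at test time. Traditional HTR methods cannot personalize at test time because of limitations in model architecture and training strategy. Existing gradient-based meta-learning methods try to close this gap but still require labeled examples and rely on parameter-inefficient fine-tuning, which incurs substantial computational and memory overhead. The key to the proposed solution is to formulate personalization as prompt tuning: an auxiliary image reconstruction task with a self-supervised loss guides prompt adaptation using unlabeled test-time examples, and meta-learning optimizes the initialization of the prompts, so that the model personalizes efficiently while updating less than 1% of its parameters and avoiding time-intensive annotation.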

Link: https://arxiv.org/abs/2505.20513
Authors: Wenhao Gu, Li Gu, Ching Yee Suen, Yang Wang
Affiliations: Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025

Abstract:Recent advancements in handwritten text recognition (HTR) have enabled the effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods while using 20x fewer parameters. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in resource-constrained scenarios.

[CV-139] A Feature-level Bias Evaluation Framework for Facial Expression Recognition Models

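【Quick Read】: This paper addresses the fairness evaluation of Facial Expression Recognition (FER) models when ground-truth demographic labels are unavailable, and the distortion of bias evaluation caused by pseudo-labels in prior work. The key to its solution is a feature-level fairness evaluation framework that assesses demographic bias in FER models more accurately when the test set has no ground-truth demographic labels. The paper also introduces a plug-and-play statistical module that ensures the statistical significance of bias evaluation results, making the evaluation more reliable.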

Link: https://arxiv.org/abs/2505.20512
Authors: Tangzheng Lian, Oya Celiktutan
Affiliations: King’s College London, WC2R 2LS London, U.K.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Affective Computing

Abstract:Recent studies on fairness have shown that Facial Expression Recognition (FER) models exhibit biases toward certain visually perceived demographic groups. However, the limited availability of human-annotated demographic labels in public FER datasets has constrained the scope of such bias analysis. To overcome this limitation, some prior works have resorted to pseudo-demographic labels, which may distort bias evaluation results. Alternatively, in this paper, we propose a feature-level bias evaluation framework for evaluating demographic biases in FER models under the setting where demographic labels are unavailable in the test set. Extensive experiments demonstrate that our method more effectively evaluates demographic biases compared to existing approaches that rely on pseudo-demographic labels. Furthermore, we observe that many existing studies do not include statistical testing in their bias evaluations, raising concerns that some reported biases may not be statistically significant but rather due to randomness. To address this issue, we introduce a plug-and-play statistical module to ensure the statistical significance of biased evaluation results. A comprehensive bias analysis based on the proposed module is then conducted across three sensitive attributes (age, gender, and race), seven facial expressions, and multiple network architectures on a large-scale dataset, revealing the prominent demographic biases in FER and providing insights on selecting a fairer network architecture.

[CV-140] CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

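【Quick Read】: This paper addresses the inability of current foundation models in computational pathology to emulate the diagnostic process of pathologists, in particular their shortcomings in multi-scale analysis and region understanding. Existing methods either rely on general-purpose encoders for classification or directly apply multimodal models to generate reports, without imitating the systematic diagnostic logic of pathologists. The key to the solution is CPathAgent, an agent-based model that mimics pathologists' reasoning by autonomously executing zoom and navigation operations, trained with a multi-stage strategy that unifies patch-level, region-level, and whole-slide capabilities in a single model, producing more detailed and interpretable diagnostic reports.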

Link: https://arxiv.org/abs/2505.20510
Authors: Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, Lin Yang
Affiliations: Zhejiang University; Westlake University; The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 49 pages, 33 figures

Abstract:Recent advances in computational pathology have led to the emergence of numerous foundation models. However, these approaches fail to replicate the diagnostic process of pathologists, as they either simply rely on general-purpose encoders with multi-instance learning for classification or directly apply multimodal models to generate reports from images. A significant limitation is their inability to emulate the diagnostic logic employed by pathologists, who systematically examine slides at low magnification for overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. To address this gap, we introduce CPathAgent, an innovative agent-based model that mimics pathologists’ reasoning processes by autonomously executing zoom-in/out and navigation operations across pathology images based on observed visual features. To achieve this, we develop a multi-stage training strategy unifying patch-level, region-level, and whole-slide capabilities within a single model, which is essential for mimicking pathologists, who require understanding and reasoning capabilities across all three scales. This approach generates substantially more detailed and interpretable diagnostic reports compared to existing methods, particularly for huge region understanding. Additionally, we construct an expert-validated PathMMU-HR$^{2}$, the first benchmark for huge region analysis, a critical intermediate scale between patches and whole slides, as diagnosticians typically examine several key regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across three scales of benchmarks, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for the future development of computational pathology.

[CV-141] Electrolyzers-HSI: Close-Range Multi-Scene Hyperspectral Imaging Benchmark Dataset

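【Quick Read】: This paper addresses a key challenge in sustainable recycling: building automated, fast, and accurate material detection systems that support a circular economy and the Green Deal. The core of its solution is Electrolyzers-HSI, a new multimodal benchmark dataset of high-resolution RGB images and hyperspectral imaging (HSI) data cubes spanning the 400–2500 nm spectral range, with over 4.2 million pixel vectors of which 424,169 are labeled, for accurate classification of electrolyzer materials and faster recovery of critical raw materials.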

Link: https://arxiv.org/abs/2505.20507
Authors: Elias Arbash, Ahmed Jamal Afifi, Ymane Belahsen, Margret Fuchs, Pedram Ghamisi, Paul Scheunders, Richard Gloaguen
Affiliations: Helmholtz-Zentrum Dresden-Rossendorf (HZDR) - Helmholtz Institute Freiberg for Resource Technology (HIF), Freiberg, Germany; University of Antwerp, Antwerpen, Belgium; National School of Applied Sciences of Oujda, Oujda, Morocco
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The global challenge of sustainable recycling demands automated, fast, and accurate, state-of-the-art (SOTA) material detection systems that act as a bedrock for a circular economy. Democratizing access to these cutting-edge solutions that enable real-time waste analysis is essential for scaling up recycling efforts and fostering the Green Deal. In response, we introduce Electrolyzers-HSI, a novel multimodal benchmark dataset designed to accelerate the recovery of critical raw materials through accurate electrolyzer materials classification. The dataset comprises 55 co-registered high-resolution RGB images and hyperspectral imaging (HSI) data cubes spanning the 400–2500 nm spectral range, yielding over 4.2 million pixel vectors and 424,169 labeled ones. This enables non-invasive spectral analysis of shredded electrolyzer samples, supporting quantitative and qualitative material classification and spectral properties investigation. We evaluate a suite of baseline machine learning (ML) methods alongside SOTA transformer-based deep learning (DL) architectures, including Vision Transformer, SpectralFormer, and the Multimodal Fusion Transformer, to investigate architectural bottlenecks for further efficiency optimisation when deploying transformers in material identification. We implement zero-shot detection techniques and majority voting across pixel-level predictions to establish object-level classification robustness. In adherence to the FAIR data principles, the electrolyzers-HSI dataset and accompanying codebase are openly available at this https URL and this https URL, supporting reproducible research and facilitating the broader adoption of smart and sustainable e-waste recycling solutions.
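The abstract mentions majority voting across pixel-level predictions to obtain object-level labels. A minimal sketch of that aggregation step is shown below, assuming each pixel already carries a predicted class and an object id from some segmentation step. The function name `vote_per_object` and the material names are hypothetical; the dataset's released codebase may organize this differently.

```python
from collections import Counter, defaultdict

def vote_per_object(pixel_predictions, object_ids):
    """Assign each object the most frequent class among its pixels.

    pixel_predictions: predicted class per pixel (any hashable label).
    object_ids: object membership per pixel, same length and order.
    """
    if len(pixel_predictions) != len(object_ids):
        raise ValueError("predictions and object ids must have equal length")
    votes = defaultdict(Counter)
    for label, obj in zip(pixel_predictions, object_ids):
        votes[obj][label] += 1
    # most_common(1) returns the label with the highest pixel count.
    return {obj: counts.most_common(1)[0][0] for obj, counts in votes.items()}

# Six pixels belonging to two shredded fragments (ids 0 and 1).
predictions = ["nickel", "nickel", "steel", "steel", "steel", "nickel"]
fragments = [0, 0, 0, 1, 1, 1]
print(vote_per_object(predictions, fragments))  # {0: 'nickel', 1: 'steel'}
```

Voting makes the object label robust to a minority of misclassified pixels, which is the robustness property the abstract refers to.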

[CV-142] ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image

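【Quick Read】: This paper addresses the high cost of collecting tactile data at scale, which stems from the localized nature of sensor-object interactions and inconsistencies across sensor instances. The key to its solution is ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position, enabling effective data augmentation.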

Link: https://arxiv.org/abs/2505.20498
Authors: Dongyu Luo, Kelin Yu, Amir-Hossein Shahidzadeh, Cornelia Fermüller, Yiannis Aloimonos
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 22 pages, 11 figures, 7 tables

Abstract:Vision-based tactile sensing has been widely used in perception, reconstruction, and robotic manipulation. However, collecting large-scale tactile data remains costly due to the localized nature of sensor-object interactions and inconsistencies across sensor instances. Existing approaches to scaling tactile data, such as simulation and free-form tactile generation, often suffer from unrealistic output and poor transferability to downstream tasks. To address this, we propose ControlTac, a two-stage controllable framework that generates realistic tactile images conditioned on a single reference tactile image, contact force, and contact position. With those physical priors as control input, ControlTac generates physically plausible and varied tactile images that can be used for effective data augmentation. Through experiments on three downstream tasks, we demonstrate that ControlTac can effectively augment tactile datasets and lead to consistent gains. Our three real-world experiments further validate the practical utility of our approach. Project page: this https URL.

[CV-143] Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data

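【Quick Read】: This paper addresses the forgetting of the global decision boundary caused by data heterogeneity in federated learning. Existing methods mitigate data heterogeneity to some extent but lack a deep understanding of how it affects the global decision boundary. The key to the proposed solution is the FedProj framework, which uses a server-side ensemble knowledge transfer loss to improve the fusion of ensemble knowledge into the global decision boundary, and an episodic memory of average ensemble logits on a public unlabeled dataset to regulate gradient updates during local training, avoiding forgetting of the global decision boundary.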

Link: https://arxiv.org/abs/2505.20485
Authors: Abhijit Chunduru, Majid Morafah, Mahdi Morafah, Vishnu Pandi Chellapandi, Ang Li
Affiliations: University of Massachusetts at Amherst; Islamic Azad University; University of California San Diego; Cummins; University of Maryland College Park
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:

Abstract:The inevitable presence of data heterogeneity has made federated learning very challenging. There are numerous methods to deal with this issue, such as local regularization, better model fusion techniques, and data sharing. Though effective, they lack a deep understanding of how data heterogeneity can affect the global decision boundary. In this paper, we bridge this gap by performing an experimental analysis of the learned decision boundary using a toy example. Our observations are surprising: (1) we find that the existing methods suffer from forgetting and clients forget the global decision boundary and only learn the perfect local one, and (2) this happens regardless of the initial weights, and clients forget the global decision boundary even starting from pre-trained optimal weights. In this paper, we present FedProj, a federated learning framework that robustly learns the global decision boundary and avoids its forgetting during local training. To achieve better ensemble knowledge fusion, we design a novel server-side ensemble knowledge transfer loss to further calibrate the learned global decision boundary. To alleviate the issue of learned global decision boundary forgetting, we further propose leveraging an episodic memory of average ensemble logits on a public unlabeled dataset to regulate the gradient updates at each step of local training. Experimental results demonstrate that FedProj outperforms state-of-the-art methods by a large margin.

[CV-144] Stochastic Preconditioning for Neural Field Optimization SIGGRAPH2025

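【Quick Read】: This paper addresses slow convergence and limited robustness when fitting neural fields. The key to its solution is spatial stochasticity during training: the method implicitly operates on a blurred version of the field, evaluated in expectation by sampling with Gaussian-distributed offsets; querying this blurred field plays a role similar to a preconditioner in numerical linear algebra and markedly improves the convergence and robustness of optimization.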

Link: https://arxiv.org/abs/2505.20473
Authors: Selena Ling, Merlin Nimier-David, Alec Jacobson, Nicholas Sharp
Affiliations: University of Toronto; NVIDIA; Stanford University; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 11 figures, SIGGRAPH 2025 (Journal track)

Abstract:Neural fields are a highly effective representation across visual computing. This work observes that fitting these fields is greatly improved by incorporating spatial stochasticity during training, and that this simple technique can replace or even outperform custom-designed hierarchies and frequency space constructions. The approach is formalized as implicitly operating on a blurred version of the field, evaluated in-expectation by sampling with Gaussian-distributed offsets. Querying the blurred field during optimization greatly improves convergence and robustness, akin to the role of preconditioners in numerical linear algebra. This implicit, sampling-based perspective fits naturally into the neural field paradigm, comes at no additional cost, and is extremely simple to implement. We describe the basic theory of this technique, including details such as handling boundary conditions, and extending to a spatially-varying blur. Experiments demonstrate this approach on representations including coordinate MLPs, neural hashgrids, triplanes, and more, across tasks including surface reconstruction and radiance fields. In settings where custom-designed hierarchies have already been developed, stochastic preconditioning nearly matches or improves their performance with a simple and unified approach; in settings without existing hierarchies it provides an immediate boost to quality and robustness.
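The core idea, querying the field at Gaussian-perturbed positions so that the expected value equals a blurred field, can be illustrated without any neural network. The sketch below averages many perturbed queries of a toy analytic field; in actual training one would presumably draw a fresh offset per query and let stochastic gradient descent do the averaging. The function name `blurred_query` and the toy field are assumptions for illustration only.

```python
import random

def blurred_query(field, x, sigma, n_samples, rng):
    """Monte Carlo estimate of E[field(x + eps)] with eps ~ N(0, sigma^2 I).

    This expectation equals the field convolved with a Gaussian of width sigma.
    """
    total = 0.0
    for _ in range(n_samples):
        shifted = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += field(shifted)
    return total / n_samples

def quadratic(p):
    """Toy 1-D field f(x) = x^2, standing in for a neural field."""
    return p[0] ** 2

rng = random.Random(0)
estimate = blurred_query(quadratic, [1.0], sigma=0.5, n_samples=20000, rng=rng)
# For f(x) = x^2 the blurred value is x^2 + sigma^2 = 1.25,
# so `estimate` should land close to that.
print(estimate)
```

Setting `sigma` to zero recovers the original field exactly, which matches the usual practice of annealing the blur toward zero as optimization proceeds (an assumption here about the schedule, not a detail given in the abstract).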

[CV-145] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

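【Quick Read】: This paper addresses weather editing, i.e., generating realistic weather effects with controllable type and severity in 3D scenes. The key to its solution is a pipeline with two core components: weather background editing and weather particle construction. Background editing uses an all-in-one adapter that integrates multiple weather styles, while particle construction uses a dynamic 4D Gaussian field with physically based modeling and simulation to control particle attributes and dynamics precisely, keeping the weather effects realistic and adjustable.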

Link: https://arxiv.org/abs/2505.20471
Authors: Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula
Affiliations: University of Leeds; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physical-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: this https URL

[CV-146] CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

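【Quick Read】: This paper addresses cross-view semantic inconsistencies in 3D semantic understanding caused by occlusion, image blur, and view-dependent variations; propagated through projection supervision, these inconsistencies degrade 3D Gaussian semantic fields and introduce artifacts in rendered outputs. The key to its solution is CCL-LGS, a framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues: a zero-shot tracker aligns SAM-generated 2D masks and identifies their categories, CLIP extracts robust semantic encodings across views, and a Contrastive Codebook Learning (CCL) module distills semantic features with intra-class compactness and inter-class distinctiveness, explicitly resolving semantic conflicts while preserving category discriminability.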

Link: https://arxiv.org/abs/2505.20469
Authors: Lei Tian, Xiaomin Li, Liqian Ma, Hefei Huang, Zirui Zheng, Hao Yin, Taiqing Li, Huchuan Lu, Xu Jia
Affiliations: Dalian University of Technology; ZMO AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding, a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of SAM-generated 2D masks and reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate that CCL-LGS outperforms previous state-of-the-art methods. Our project page is available at this https URL.

[CV-147] DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

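【Quick Read】: This paper addresses the controllable generation of articulated 3D objects, i.e., generating complex articulated 3D models with explicit kinematic relationships from a pair of images, one showing the object in a resting state and the other in an articulated state. Compared with single-image methods, the dual-image input adds only a modest data-collection burden but provides important motion information that reliably guides the prediction of kinematic relationships between parts. The key to its solution is a dual-image diffusion model that captures the relationship between the image pair to generate part layouts and joint parameters, together with a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity. The paper also develops LEGO-Art, a dataset expansion pipeline that increases dataset diversity and complexity, and proposes PM-X, a large-scale dataset that markedly improves generalization to complex articulated objects.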

Link: https://arxiv.org/abs/2505.20460
Authors: Ruqi Wu, Xinjie Wang, Liu Liu, Chunle Guo, Jiaxiong Qiu, Chongyi Li, Lichao Huang, Zhizhong Su, Ming-Ming Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, named LEGO-Art, that enriches the diversity and complexity of the PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset will be released to the community upon publication.

[CV-148] HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

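Quick read: This paper addresses the performance degradation of Vision-Language Models (VLMs) in long-context scenarios, particularly long videos. Existing methods extend Rotary Position Embedding (RoPE) to improve length generalization, but they still fall short in capturing the complex spatial-temporal dependencies in videos. The paper proposes HoPE, a hybrid position embedding whose key components are a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts and a dynamic temporal scaling mechanism that supports robust learning and flexible inference across context lengths.

Link: https://arxiv.org/abs/2505.20444
Authors: Haoran Li,Yingjie Qin,Baoyuan Ou,Lai Xu,Ruiwen Xu
Affiliations: Carnegie Mellon University; Xiaohongshu Inc.
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: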

Abstract:Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at this https URL.
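
For readers unfamiliar with the baseline that HoPE builds on, the following is a minimal sketch of vanilla 1D RoPE, not of HoPE itself. It assumes the common formulation in which consecutive pairs of dimensions are rotated by angles proportional to the token position; the function names `rope_rotate` and `attention_score` and the base value 10000 are illustrative choices, not taken from the paper.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (vanilla 1D RoPE)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pairs rotate faster (higher frequency)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, k, m, n):
    """Dot product of a query at position m and a key at position n after RoPE."""
    qm, kn = rope_rotate(q, m), rope_rotate(k, n)
    return sum(a * b for a, b in zip(qm, kn))

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
# Positions (3, 1) and (10, 8) share the same offset, so the two scores agree.
print(attention_score(q, k, 3, 1), attention_score(q, k, 10, 8))
```

The score depends only on the offset m - n, which is the property that makes RoPE attractive for length generalization. Multimodal variants differ in how the per-pair frequencies are split across temporal and spatial axes, which is the allocation question the abstract studies.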

[CV-149] ART-DECO: Arbitrary Text Guidance for 3D Detailizer Construction

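Quick read: This paper addresses how to quickly generate 3D assets with high-quality geometric detail and texture from a text prompt while keeping structural control and stylistic consistency. The key to the solution is a 3D detailizer, a model that distills the foundational knowledge of a pretrained multi-view image diffusion model via Score Distillation Sampling (SDS) and can turn a coarse 3D shape into a high-fidelity detailed asset in about one second. The model is not optimized for a single shape; two-stage training improves its generalization over complex structures, enabling structure-controllable, style-consistent 3D generation.

Link: https://arxiv.org/abs/2505.20431
Authors: Qimin Chen,Yuezhi Yang,Yifang Wang,Vladimir G. Kim,Siddhartha Chaudhuri,Hao Zhang,Zhiqin Chen
Affiliations: Simon Fraser University; University of Texas at Austin; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: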

Abstract:We introduce a 3D detailizer, a neural model which can instantaneously (in 1s) transform a coarse 3D shape proxy into a high-quality asset with detailed geometry and texture as guided by an input text prompt. Our model is trained using the text prompt, which defines the shape class and characterizes the appearance and fine-grained style of the generated details. The coarse 3D proxy, which can be easily varied and adjusted (e.g., via user editing), provides structure control over the final shape. Importantly, our detailizer is not optimized for a single shape; it is the result of distilling a generative model, so that it can be reused, without retraining, to generate any number of shapes, with varied structures, whose local details all share a consistent style and appearance. Our detailizer training utilizes a pretrained multi-view image diffusion model, with text conditioning, to distill the foundational knowledge therein into our detailizer via Score Distillation Sampling (SDS). To improve SDS and enable our detailizer architecture to learn generalizable features over complex structures, we train our model in two training stages to generate shapes with increasing structural complexity. Through extensive experiments, we show that our method generates shapes of superior quality and details compared to existing text-to-3D models under varied structure control. Our detailizer can refine a coarse shape in less than a second, making it possible to interactively author and adjust 3D shapes. Furthermore, the user-imposed structure control can lead to creative, and hence out-of-distribution, 3D asset generations that are beyond the current capabilities of leading text-to-3D generative models. We demonstrate an interactive 3D modeling workflow our method enables, and its strong generalizability over styles, structures, and object categories.

[CV-150] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

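Quick read: This paper examines how well multimodal large language models (MLLMs) internalize perspective geometry, focusing on perspective perception, reasoning, and robustness. The key to the solution is MMPerspective, the first benchmark designed specifically to evaluate MLLMs' understanding of perspective systematically. It contains 10 carefully designed tasks across three complementary dimensions (perception, reasoning, and robustness) and provides 2,711 real-world and synthetic image instances with 5,083 question-answer pairs, probing capabilities such as vanishing point perception, perspective type reasoning, and understanding of line relationships in 3D space.

Link: https://arxiv.org/abs/2505.20426
Authors: Yunlong Tang,Pinxin Liu,Mingqian Feng,Zhangyun Tan,Rui Mao,Chao Huang,Jing Bi,Yunzhong Xiao,Susan Liang,Hang Hua,Ali Vosoughi,Luchuan Song,Zeliang Zhang,Chenliang Xu
Affiliations: University of Rochester; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: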

Abstract:Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: this https URL

[CV-151] Vision-Based Risk Aware Emergency Landing for UAVs in Complex Urban Environments

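Quick read: This paper addresses safe landing for UAVs in crowded urban environments, especially in emergencies, where moving obstacles and other visual challenges make landing risky. The key to the solution is a risk-aware approach based on semantic segmentation: a specialized deep neural network assigns pixel-level risk values, an algorithm based on risk maps adaptively identifies a stable Safe Landing Zone (SLZ), and altitude-dependent safety thresholds together with temporal landing point stabilization guide the UAV to a safe descent.

Link: https://arxiv.org/abs/2505.20423
Authors: Julio de la Torre-Vanegas,Miguel Soriano-Garcia,Israel Becerra,Diego Mercado-Ravell
Affiliations: Center for Research in Mathematics CIMAT AC; Center for Research and Advanced Studies CINVESTAV-IPN
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: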

Abstract:Landing safely in crowded urban environments remains an essential yet challenging endeavor for Unmanned Aerial Vehicles (UAVs), especially in emergency situations. In this work, we propose a risk-aware approach that harnesses semantic segmentation to continuously evaluate potential hazards in the drone’s field of view. By using a specialized deep neural network to assign pixel-level risk values and applying an algorithm based on risk maps, our method adaptively identifies a stable Safe Landing Zone (SLZ) despite moving critical obstacles such as vehicles, people, etc., and other visual challenges like shifting illumination. A control system then guides the UAV toward this low-risk region, employing altitude-dependent safety thresholds and temporal landing point stabilization to ensure robust descent trajectories. Experimental validation in diverse urban environments demonstrates the effectiveness of our approach, achieving over 90% landing success rates in very challenging real scenarios, showing significant improvements in various risk metrics. Our findings suggest that risk-oriented vision methods can effectively help reduce the risk of accidents in emergency landing situations, particularly in complex, unstructured, urban scenarios, densely populated with moving risky obstacles, while potentiating the true capabilities of UAVs in complex urban operations.
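
The risk-map idea can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the authors' algorithm: the class names, the risk values in `RISK`, and the exhaustive window search in `safest_zone` are all hypothetical, and the altitude-dependent thresholds and temporal stabilization mentioned in the abstract are omitted.

```python
# Hypothetical per-class risk values; the abstract does not list the real classes or weights.
RISK = {"grass": 0.1, "road": 0.6, "car": 0.9, "person": 1.0}

def risk_map(labels):
    """Convert a grid of semantic class labels into a grid of risk values."""
    return [[RISK[c] for c in row] for row in labels]

def safest_zone(risk, size):
    """Return (mean_risk, row, col) for the top-left corner of the lowest-risk size x size window."""
    rows, cols = len(risk), len(risk[0])
    best = None
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            total = sum(risk[r + i][c + j] for i in range(size) for j in range(size))
            mean = total / (size * size)
            if best is None or mean < best[0]:
                best = (mean, r, c)
    return best

labels = [
    ["road", "road",   "car",   "road"],
    ["road", "person", "road",  "road"],
    ["road", "road",   "grass", "grass"],
    ["car",  "road",   "grass", "grass"],
]
mean, row, col = safest_zone(risk_map(labels), size=2)
print(row, col)  # the all-grass window in the bottom-right corner
```

A real system would recompute the map on every frame and accept a candidate zone only if its mean risk stays below a threshold that depends on the current altitude.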

[CV-152] RetroMotion: Retrocausal Motion Forecasting Models are Instructable

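Quick read: This paper addresses motion forecasting for road users (i.e., agents), whose complexity depends on scene constraints and interactive behavior. The key to the solution is a multi-task learning method with a retrocausal flow of information, which forecasts marginal trajectory distributions for all modeled agents and joint trajectory distributions for interacting agents. Using a transformer model, the method generates the joint distributions by re-encoding the marginal distributions and then modeling them pairwise, so that information from later points in marginal trajectories flows back to earlier points in joint trajectories.

Link: https://arxiv.org/abs/2505.20414
Authors: Royden Wagner,Omer Sahin Tas,Felix Hauser,Marlon Steiner,Dominik Strutz,Abhishek Vivekanandan,Carlos Fernandez,Christoph Stiller
Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: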

Abstract:Motion forecasts of road users (i.e., agents) vary in complexity as a function of scene constraints and interactive behavior. We address this with a multi-task learning method for motion forecasting that includes a retrocausal flow of information. The corresponding tasks are to forecast (1) marginal trajectory distributions for all modeled agents and (2) joint trajectory distributions for interacting agents. Using a transformer model, we generate the joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. Per trajectory point, we model positional uncertainty using compressed exponential power distributions. Notably, our method achieves state-of-the-art results in the Waymo Interaction Prediction dataset and generalizes well to the Argoverse 2 dataset. Additionally, our method provides an interface for issuing instructions through trajectory modifications. Our experiments show that regular training of motion forecasting leads to the ability to follow goal-based instructions and to adapt basic directional instructions to the scene context. Code: this https URL

[CV-153] ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Quick read: This paper addresses the poor performance of existing Referring Multi-Object Tracking (RMOT) methods on complex language instructions that require reasoning. The key to the solution is a new task, Reasoning-based Multi-Object Tracking (ReaMOT), together with the ReaMOT Challenge benchmark for evaluating reasoning ability and ReaTrack, a training-free framework built on large vision-language models (LVLM) and SAM2 that serves as a baseline for the task.

Link: https://arxiv.org/abs/2505.20381
Authors: Sijia Chen,Yanqiu Yu,En Yu,Wenbing Tao
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures, 6 tables

Abstract:Referring Multi-Object Tracking (RMOT) is an important research field in computer vision. Its task form is to guide the models to track the objects that conform to the language instruction. However, the RMOT task commonly requires clear language instructions, and such methods often fail to work when complex language instructions with reasoning characteristics appear. In this work, we propose a new task, called Reasoning-based Multi-Object Tracking (ReaMOT). ReaMOT is a more challenging task that requires accurate reasoning about objects that match a language instruction with reasoning characteristics and tracking the objects’ trajectories. To advance the ReaMOT task and evaluate the reasoning capabilities of tracking models, we construct ReaMOT Challenge, a reasoning-based multi-object tracking benchmark built upon 12 datasets. Specifically, it comprises 1,156 language instructions with reasoning characteristics, 423,359 image-language pairs, and 869 diverse scenes, which are divided into three levels of reasoning difficulty. In addition, we propose a set of evaluation metrics tailored for the ReaMOT task. Furthermore, we propose ReaTrack, a training-free framework for reasoning-based multi-object tracking based on large vision-language models (LVLM) and SAM2, as a baseline for the ReaMOT task. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.

[CV-154] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

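Quick read: This paper addresses the computational inefficiency of Diffusion Transformer (DiT) inference, which stems from the models' iterative structure and deep transformer stacks. The key to the solution is FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy in the model's internal representations. Its two core components are (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. Together they reduce unnecessary computation while preserving generation quality.

Link: https://arxiv.org/abs/2505.20353
Authors: Dong Liu,Jiayi Zhang,Yifan Li,Yanxuan Yu,Ben Lengerich,Ying Nian Wu
Affiliations: Yale University; University of Wisconsin–Madison; University of California, Los Angeles; Michigan State University; Columbia University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Performance (cs.PF)
Comments: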

Abstract:Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model’s internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with the best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at this https URL.
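
To make the cross-timestep caching idea concrete, here is a deliberately simplified sketch. It is not FastCache: it replaces the paper's hypothesis-testing decision rule and learnable linear approximation with a plain relative-change threshold, and the names `BlockCache`, `relative_change`, and the threshold value 0.05 are hypothetical.

```python
import math

def relative_change(curr, prev):
    """Norm of (curr - prev) divided by the norm of prev."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(curr, prev)))
    den = math.sqrt(sum(b * b for b in prev)) or 1.0
    return num / den

class BlockCache:
    """Wrap an expensive block and reuse its last output when the input barely moved."""

    def __init__(self, block, threshold=0.05):
        self.block = block
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None
        self.hits = 0

    def __call__(self, hidden):
        if (self.prev_input is not None
                and relative_change(hidden, self.prev_input) < self.threshold):
            self.hits += 1
            return self.prev_output  # skip the block and reuse cached activations
        out = self.block(hidden)
        # Compare future inputs against the last *computed* input so drift cannot accumulate.
        self.prev_input, self.prev_output = list(hidden), out
        return out

cache = BlockCache(lambda h: [2 * x for x in h])
for hidden in ([1.0, 1.0], [1.01, 1.0], [2.0, 2.0]):
    print(cache(hidden))
print("cache hits:", cache.hits)
```

In this toy run the second call is served from the cache and the third triggers a recomputation. The threshold trades speed for fidelity, which is the trade-off the paper's error bound is meant to control.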

[CV-155] SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence

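Quick read: This paper addresses spatial intelligence tasks in complex urban scenes. Whereas previous methods typically require geographic analysis tools or domain expertise, the paper proposes SpatialLLM, a unified language model that needs no training, fine-tuning, or expert intervention. The key is to construct detailed, structured scene descriptions from raw spatial data and use them to prompt pre-trained large language models (LLMs) for scene-based analysis, enabling accurate perception of spatial distribution information and zero-shot execution of advanced spatial intelligence tasks.

Link: https://arxiv.org/abs/2505.12703
Authors: Jiabin Chen,Haiping Wang,Jinpeng Li,Yuan Liu,Zhen Dong,Bisheng Yang
Affiliations: Wuhan University; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: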

Abstract:We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at this https URL.

[CV-156] MVTN: Learning Multi-View Transformations for 3D Understanding ICCV2021

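Quick read: This paper addresses the fixed, non-adaptive camera viewpoints used by conventional multi-view projection techniques for 3D shape recognition, which limit model performance and generalization. The key to the solution is the Multi-View Transformation Network (MVTN), which uses differentiable rendering to learn optimal viewpoints, can be trained end-to-end, and can be combined with any multi-view network for 3D shape classification.

Link: https://arxiv.org/abs/2212.13462
Authors: Abdullah Hamdi,Faisal AlZahrani,Silvio Giancola,Bernard Ghanem
Affiliations: King Abdullah University of Science and Technology (KAUST)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Under review; journal extension of the ICCV 2021 paper arXiv:2011.13244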

Abstract:Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.

[CV-157] Prostate Cancer Screening with Artificial Intelligence-Enhanced Micro-Ultrasound: A Comparative Study with Traditional Methods

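Quick read: This paper investigates whether AI interpretation of micro-ultrasound (micro-US) images can detect clinically significant prostate cancer (csPCa) more accurately than clinical screening with prostate-specific antigen (PSA) and digital rectal examination (DRE). The key to the solution is to extract deep image features from micro-US slices with a self-supervised convolutional autoencoder and to predict csPCa at the slice level with random forest classifiers, which yields higher specificity at a sensitivity comparable to the clinical model.

Link: https://arxiv.org/abs/2505.21355
Authors: Muhammad Imran,Wayne G. Brisbane,Li-Ming Su,Jason P. Joseph,Wei Shao
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: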

Abstract:Background and objective: Micro-ultrasound (micro-US) is a novel imaging modality with diagnostic accuracy comparable to MRI for detecting clinically significant prostate cancer (csPCa). We investigated whether artificial intelligence (AI) interpretation of micro-US can outperform clinical screening methods using PSA and digital rectal examination (DRE). Methods: We retrospectively studied 145 men who underwent micro-US guided biopsy (79 with csPCa, 66 without). A self-supervised convolutional autoencoder was used to extract deep image features from 2D micro-US slices. Random forest classifiers were trained using five-fold cross-validation to predict csPCa at the slice level. Patients were classified as csPCa-positive if 88 or more consecutive slices were predicted positive. Model performance was compared with a classifier using PSA, DRE, prostate volume, and age. Key findings and limitations: The AI-based micro-US model and clinical screening model achieved AUROCs of 0.871 and 0.753, respectively. At a fixed threshold, the micro-US model achieved 92.5% sensitivity and 68.1% specificity, while the clinical model showed 96.2% sensitivity but only 27.3% specificity. Limitations include a retrospective single-center design and lack of external validation. Conclusions and clinical implications: AI-interpreted micro-US improves specificity while maintaining high sensitivity for csPCa detection. This method may reduce unnecessary biopsies and serve as a low-cost alternative to PSA-based screening. Patient summary: We developed an AI system to analyze prostate micro-ultrasound images. It outperformed PSA and DRE in detecting aggressive cancer and may help avoid unnecessary biopsies.
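
The step from slice-level predictions to a patient-level label can be sketched in a few lines. This is an illustrative reconstruction under assumptions, not the authors' code: the function names are hypothetical, and the run-length threshold is left as the parameter `min_consecutive` rather than hard-coded to the value printed in the abstract.

```python
def longest_positive_run(slice_preds):
    """Length of the longest run of consecutive positive slice predictions."""
    best = run = 0
    for p in slice_preds:
        run = run + 1 if p else 0
        best = max(best, run)
    return best

def classify_patient(slice_preds, min_consecutive):
    """A patient is positive if enough consecutive slices are predicted positive."""
    return longest_positive_run(slice_preds) >= min_consecutive

def sensitivity_specificity(y_true, y_pred):
    """Patient-level sensitivity and specificity; assumes both classes are present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

slices = [1, 1, 0, 1, 1, 1, 0]
print(longest_positive_run(slices))                   # longest run has length 3
print(classify_patient(slices, min_consecutive=3))
```

Requiring a run of consecutive positives, rather than a simple count, suppresses isolated false-positive slices, which is one plausible reason the rule favors specificity.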

[CV-158] Generative Image Compression by Estimating Gradients of the Rate-variable Feature Distribution

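Quick read: This paper addresses high-quality, efficient image reconstruction in generative image compression (GIC). Whereas prior approaches exploit diffusion models only indirectly, the key innovation here is to reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs) and to train a reverse neural network that reconstructs images by directly reversing the compression process, without Gaussian noise initialization. The method achieves smooth rate adjustment and photo-realistic reconstructions with only a few sampling steps.

Link: https://arxiv.org/abs/2505.20984
Authors: Minghao Han,Weiyi You,Jinhua Zhang,Leheng Zhang,Ce Zhu,Shuhang Gu
Affiliations: University of Electronic Science and Technology of China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: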

Abstract:While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing generative image compression approaches across a range of metrics, including perceptual distortion, statistical fidelity, and no-reference quality assessments.

[CV-159] Multitemporal Latent Dynamical Framework for Hyperspectral Images Unmixing

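Quick read: This paper addresses the neglect of abundance dynamics in multitemporal hyperspectral unmixing: existing methods emphasize endmember variability while overlooking how abundances evolve over time. The key to the solution is to model abundances temporally with neural ordinary differential equations and to propose a theoretically validated multitemporal latent dynamical (MiLD) unmixing framework. MiLD defines the problem through ordinary differential equations, builds the mathematical model with dynamical discretization approaches, solves it with neural networks that capture the dynamical evolution of materials, and provides theoretical support in the form of consistency, convergence, and stability theorems.

Link: https://arxiv.org/abs/2505.20902
Authors: Ruiying Li,Bin Pan,Lan Ma,Xia Xu,Zhenwei Shi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures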

Abstract:Multitemporal hyperspectral unmixing can capture the dynamical evolution of materials. Despite this capability, current methods emphasize the variability of endmembers while neglecting the dynamics of abundances, which motivates our adoption of neural ordinary differential equations to model abundances temporally. However, this motivation is hindered by two challenges: the inherent complexity of defining, modeling, and solving the problem, and the absence of theoretical support. To address these challenges, in this paper we propose a multitemporal latent dynamical (MiLD) unmixing framework that captures the dynamical evolution of materials with theoretical validation. For multitemporal hyperspectral unmixing, MiLD consists of a problem definition, mathematical modeling, a solution algorithm, and theoretical support. We formulate the multitemporal unmixing problem using ordinary differential equations and latent variables. We transfer multitemporal unmixing to a mathematical model through dynamical discretization approaches, which describe the discreteness of the observed image sequence with mathematical expansions. We propose an algorithm to solve the problem and capture the dynamics of materials, which approximates abundance evolution with neural networks. Furthermore, we provide theoretical support by validating the crucial properties, verifying consistency, convergence, and stability theorems. The major contributions of MiLD include defining the problem by ordinary differential equations, modeling the problem by a dynamical discretization approach, solving the problem by a multitemporal unmixing algorithm, and presenting theoretical support. Our experiments on both synthetic and real datasets have validated the utility of our work.

[CV-160] The Role of AI in Early Detection of Life-Threatening Diseases: A Retinal Imaging Perspective

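Quick read: This paper addresses the fragmentation of current knowledge on retinal imaging for detecting and quantifying biomarkers of systemic diseases, and the obstacles to translating it into routine clinical practice, such as heterogeneous imaging protocols, limited external validation of AI models, and difficulties integrating with clinical workflows. The key to the solution is a systematic synthesis of recent advances in optical coherence tomography (OCT/OCTA) and adaptive optics (AO), AI/ML approaches, and mobile health (mHealth) and tele-ophthalmology initiatives, along with a roadmap of multicenter protocol standardization, prospective validation trials, and seamless incorporation of retinal screening into primary and specialty care pathways to advance precision prevention, early intervention, and ongoing treatment.

Link: https://arxiv.org/abs/2505.20810
Authors: Tariq M Khan,Toufique Ahmed Soomro,Imran Razzak
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: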

Abstract:Retinal imaging has emerged as a powerful, non-invasive modality for detecting and quantifying biomarkers of systemic diseases, ranging from diabetes and hypertension to Alzheimer’s disease and cardiovascular disorders, but current insights remain dispersed across platforms and specialties. Recent technological advances in optical coherence tomography (OCT/OCTA) and adaptive optics (AO) now deliver ultra-high-resolution scans (down to 5 μm) with superior contrast and spatial integration, allowing early identification of microvascular abnormalities and neurodegenerative changes. At the same time, AI-driven and machine learning (ML) algorithms have revolutionized the analysis of large-scale retinal datasets, increasing sensitivity and specificity; for example, deep learning models achieve 90% sensitivity for diabetic retinopathy and AUC = 0.89 for the prediction of cardiovascular risk from fundus photographs. The proliferation of mobile health technologies and telemedicine platforms further extends access, reduces costs, and facilitates community-based screening and longitudinal monitoring. Despite these breakthroughs, translation into routine practice is hindered by heterogeneous imaging protocols, limited external validation of AI models, and integration challenges within clinical workflows. In this review, we systematically synthesize the latest OCT/OCTA and AO developments, AI/ML approaches, and mHealth/Tele-ophthalmology initiatives and quantify their diagnostic performance across disease domains. Finally, we propose a roadmap for multicenter protocol standardization, prospective validation trials, and seamless incorporation of retinal screening into primary and specialty care pathways, paving the way for precision prevention, early intervention, and ongoing treatment of life-threatening systemic diseases.

[CV-161] Unpaired Image-to-Image Translation for Segmentation and Signal Unmixing NEURIPS2025

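Quick read: This paper addresses the disentanglement of content and style in unpaired image-to-image translation, with the aim of cross-domain style transfer in biomedical tasks that demand precise structural preservation. The key to the solution is a set of modifications to CycleGAN: U-Net-based generators with skip connections that propagate localized shallow features deep into the generator; removal of feature-based normalization layers in favor of a parameter-based approximate bidirectional spectral normalization that improves training stability; and channel and spatial attention mechanisms in the generators to strengthen content fidelity.

Link: https://arxiv.org/abs/2505.20746
Authors: Nikola Andrejic,Milica Spasic,Igor Mihajlovic,Petra Milosavljevic,Djordje Pavlovic,Filip Milisavljevic,Uros Milivojevic,Danilo Delibasic,Ivana Mikic,Sinisa Todorovic
Affiliations: Diffine LLC
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to NeurIPS 2025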

Abstract:This work introduces Ui2i, a novel model for unpaired image-to-image translation, trained on content-wise unpaired datasets to enable style transfer across domains while preserving content. Building on CycleGAN, Ui2i incorporates key modifications to better disentangle content and style features, and preserve content integrity. Specifically, Ui2i employs U-Net-based generators with skip connections to propagate localized shallow features deep into the generator. Ui2i removes feature-based normalization layers from all modules and replaces them with approximate bidirectional spectral normalization – a parameter-based alternative that enhances training stability. To further support content preservation, channel and spatial attention mechanisms are integrated into the generators. Training is facilitated through image scale augmentation. Evaluation on two biomedical tasks – domain adaptation for nuclear segmentation in immunohistochemistry (IHC) images and unmixing of biological structures superimposed in single-channel immunofluorescence (IF) images – demonstrates Ui2i’s ability to preserve content fidelity in settings that demand more accurate structural preservation than typical translation tasks. To the best of our knowledge, Ui2i is the first approach capable of separating superimposed signals in IF images using real, unpaired training data.
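
As background for the parameter-based normalization mentioned above, the sketch below shows standard spectral normalization via power iteration. It is the textbook technique, not the paper's approximate bidirectional variant, whose details the abstract does not give; the function names and the iteration count are illustrative.

```python
import math

def spectral_norm(W, iters=50):
    """Estimate the largest singular value of matrix W (list of rows) by power iteration."""
    rows, cols = len(W), len(W[0])
    # Uniform start vector; power iteration assumes it is not orthogonal to the top singular vector.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        u_norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / u_norm for x in u]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / (sigma or 1.0) for x in v]
    return sigma

def spectral_normalize(W, iters=50):
    """Divide W by its estimated spectral norm so the result has spectral norm close to 1."""
    sigma = spectral_norm(W, iters) or 1.0
    return [[x / sigma for x in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]
print(spectral_norm(W))
print(spectral_normalize(W))
```

Because it rescales the weights rather than the activations, this kind of normalization does not depend on batch or feature statistics, which is the sense in which the abstract calls its alternative parameter-based.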

[CV-162] A False Discovery Rate Control Method Using a Fully Connected Hidden Markov Random Field for Neuroimaging Data

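Quick read: This paper addresses false discovery rate (FDR) control for voxel-wise multiple testing in neuroimaging data. With hundreds of thousands or even millions of tests, classical FDR control methods (e.g., BH, q-value, and LocalFDR) assume independence among tests and often lead to high false non-discovery rates (FNR). The paper proposes fcHMRF-LIS, a powerful, stable, and scalable spatial FDR control method. Its key idea is to combine the local index of significance (LIS)-based testing procedure with a novel fully connected hidden Markov random field (fcHMRF) that models complex spatial structures, fitted by an efficient expectation-maximization algorithm that incorporates mean-field approximation, the Conditional Random Fields as Recurrent Neural Networks (CRF-RNN) technique, and permutohedral lattice filtering, reducing the computational complexity from quadratic to linear in the number of tests.

Link: https://arxiv.org/abs/2505.20688
Authors: Taehyo Kim,Qiran Jia,Mony J. de Leon,Hai Shu
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: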

Abstract:False discovery rate (FDR) control methods are essential for voxel-wise multiple testing in neuroimaging data analysis, where hundreds of thousands or even millions of tests are conducted to detect brain regions associated with disease-related changes. Classical FDR control methods (e.g., BH, q-value, and LocalFDR) assume independence among tests and often lead to high false non-discovery rates (FNR). Although various spatial FDR control methods have been developed to improve power, they still fall short in jointly addressing three major challenges in neuroimaging applications: capturing complex spatial dependencies, maintaining low variability in both false discovery proportion (FDP) and false non-discovery proportion (FNP) across replications, and achieving computational scalability for high-resolution data. To address these challenges, we propose fcHMRF-LIS, a powerful, stable, and scalable spatial FDR control method for voxel-wise multiple testing. It integrates the local index of significance (LIS)-based testing procedure with a novel fully connected hidden Markov random field (fcHMRF) designed to model complex spatial structures using a parsimonious parameterization. We develop an efficient expectation-maximization algorithm incorporating mean-field approximation, the Conditional Random Fields as Recurrent Neural Networks (CRF-RNN) technique, and permutohedral lattice filtering, reducing the computational complexity from quadratic to linear in the number of tests. Extensive simulations demonstrate that fcHMRF-LIS achieves accurate FDR control, lower FNR, reduced variability in FDP and FNP, and a higher number of true positives compared to existing methods. Applied to an FDG-PET dataset from the Alzheimer’s Disease Neuroimaging Initiative, fcHMRF-LIS identifies neurobiologically relevant brain regions and offers notable advantages in computational efficiency.
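
For context, the classical BH baseline named in the abstract is short enough to write out. The sketch below implements the standard Benjamini-Hochberg step-up procedure, not fcHMRF-LIS; it treats the tests as exchangeable and ignores spatial structure, which is the limitation the paper targets. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank k such that p_(k) <= (k / m) * alpha
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:  # step-up: reject everything up to the cutoff rank
        rejected[idx] = True
    return rejected

pvals = [0.041, 0.001, 0.2, 0.008]
print(benjamini_hochberg(pvals, alpha=0.05))
```

Note the step-up behavior: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold. Spatial methods such as the one in the abstract instead rank voxels by a statistic that borrows strength from neighbors.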

Artificial Intelligence

[AI-0] AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery

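【Quick Read】: This paper addresses serious security vulnerabilities that Vision-Language Model (VLM) based web agents face in uncontrolled web environments, focusing on whether environment injection attacks against them are feasible in practice. Existing work on adversarial environment injection typically relies on unrealistic assumptions, such as direct HTML manipulation, knowledge of user intent, or access to the agent's model parameters, which limits its practical applicability. The proposed solution, AdInject, is a black-box attack that uses internet advertising delivery to inject malicious content into the agent's environment. Its key components are misleading ad content designed to induce clicks and a VLM-based ad optimization step that infers likely user intents from the target website's context and works them into the ad, so that the ad appears more relevant or urgent and the attack becomes more effective.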

Link: https://arxiv.org/abs/2505.21499
Authors: Haowei Wang,Junjie Wang,Xiaojun Jia,Rupeng Zhang,Mingyang Li,Zhe Liu,Yang Liu,Qing Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Model (VLM) based Web Agents represent a significant step towards automating complex tasks by simulating human-like interaction with websites. However, their deployment in uncontrolled web environments introduces significant security vulnerabilities. Existing research on adversarial environmental injection attacks often relies on unrealistic assumptions, such as direct HTML manipulation, knowledge of user intent, or access to agent model parameters, limiting their practical applicability. In this paper, we propose AdInject, a novel and real-world black-box attack method that leverages internet advertising delivery to inject malicious content into the Web Agent’s environment. AdInject operates under a significantly more realistic threat model than prior work, assuming a black-box agent, static malicious content constraints, and no specific knowledge of user intent. AdInject includes strategies for designing malicious ad content aimed at misleading agents into clicking, and a VLM-based ad content optimization technique that infers potential user intents from the target website’s context and integrates these intents into the ad content to make it appear more relevant or critical to the agent’s task, thus enhancing attack effectiveness. Experimental evaluations demonstrate the effectiveness of AdInject, with attack success rates exceeding 60% in most scenarios and approaching 100% in certain cases. This strongly demonstrates that prevalent advertising delivery constitutes a potent and real-world vector for environment injection attacks against Web Agents. This work highlights a critical vulnerability in Web Agent security arising from real-world environment manipulation channels, underscoring the urgent need for developing robust defense mechanisms against such threats. Our code is available at this https URL.

[AI-1] Robust Hypothesis Generation: LLM-Automated Language Bias for Inductive Logic Programming

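【Quick Read】: This paper tackles the automated generation of robust hypotheses in open environments, a key challenge for AI cognition. The core of the solution is a framework that integrates an LLM-powered multi-agent system with Inductive Logic Programming (ILP). The LLM agents autonomously define a structured symbolic vocabulary (predicates) and relational templates directly from raw text; that is, they generate the language bias themselves. This automates symbolic grounding, which has traditionally required expert intervention and has been a bottleneck for ILP. The approach overcomes traditional ILP's dependence on predefined symbolic structures as well as the noise sensitivity of pure LLM methods.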

Link: https://arxiv.org/abs/2505.21486
Authors: Yang Yang,Jiemin Wu,Yutao Yue
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automating robust hypothesis generation in open environments is pivotal for AI cognition. We introduce a novel framework integrating a multi-agent system, powered by Large Language Models (LLMs), with Inductive Logic Programming (ILP). Our system’s LLM agents autonomously define a structured symbolic vocabulary (predicates) and relational templates, i.e., language bias, directly from raw textual data. This automated symbolic grounding (the construction of the language bias), traditionally an expert-driven bottleneck for ILP, then guides the transformation of text into facts for an ILP solver, which inductively learns interpretable rules. This approach overcomes traditional ILP’s reliance on predefined symbolic structures and the noise-sensitivity of pure LLM methods. Extensive experiments in diverse, challenging scenarios validate superior performance, paving a new path for automated, explainable, and verifiable hypothesis generation.

[AI-2] Hume: Introducing System-2 Thinking in Visual-Language-Action Model

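【Quick Read】: This paper asks how to improve the ability of robotic foundation models to handle complex tasks when interacting with the physical world. Prior work on slow thinking has focused on Large Language Models (LLMs) in digital domains, and human-like "slow thinking" remains little explored for robotic systems. The proposed model, Hume, uses a dual-system architecture. System 2 adds a value-query head to perform value-guided slow thinking: it repeatedly samples action candidates and selects one according to its state-action value. System 1 is a lightweight reactive visuomotor policy that takes the action selected by System 2 and applies cascaded action denoising for fine-grained control. The key idea is to pair value-guided System-2 thinking with an efficient System-1 reactive policy, preserving real-time operation while improving the accuracy and robustness of task execution.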

Link: https://arxiv.org/abs/2505.21432
Authors: Haoming Song,Delin Qu,Yuanqi Yao,Qizhi Chen,Qi Lv,Yiwen Tang,Modi Shi,Guanghui Ren,Maoqing Yao,Bin Zhao,Dong Wang,Xuelong Li
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Humans practice slow thinking before performing actual actions when handling complex tasks in the physical world. This thinking paradigm, recently, has achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of slow thinking remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose Hume: a dual-system Vision-Language-Action (VLA) model with value-guided System-2 thinking and cascaded action denoising, exploring human-like thinking capabilities of Vision-Language-Action models for dexterous robot control. System 2 of Hume implements value-guided thinking by extending a Vision-Language-Action Model backbone with a novel value-query head to estimate the state-action value of predicted actions. The value-guided thinking is conducted by repeatedly sampling multiple action candidates and selecting one according to state-action value. System 1 of Hume is a lightweight reactive visuomotor policy that takes the action selected by System 2 and performs cascaded action denoising for dexterous robot control. At deployment time, System 2 performs value-guided thinking at a low frequency while System 1 asynchronously receives the action candidate selected by System 2 and predicts fluid actions in real time. We show that Hume outperforms the existing state-of-the-art Vision-Language-Action models across multiple simulation benchmarks and real-robot deployments.
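
The value-guided selection step described in the abstract (sample several action candidates, keep the one with the highest estimated state-action value) can be sketched as below. This is a toy illustration under stated assumptions, not Hume's implementation: `sample_action` and `value_fn` are hypothetical stand-ins for the VLA policy and the value-query head, and the one-dimensional action space is invented for the example.

```python
import random


def value_guided_select(sample_action, value_fn, observation, num_candidates=8, seed=0):
    """Sample several candidate actions and keep the one with the highest estimated value."""
    rng = random.Random(seed)
    candidates = [sample_action(observation, rng) for _ in range(num_candidates)]
    best = max(candidates, key=lambda action: value_fn(observation, action))
    return best, candidates


# Toy stand-ins (assumptions): a 1-D action drawn uniformly at random, and a
# value function that prefers actions close to the observation.
def toy_sampler(observation, rng):
    return rng.uniform(-1.0, 1.0)


def toy_value(observation, action):
    return -(action - observation) ** 2


best, candidates = value_guided_select(toy_sampler, toy_value, observation=0.3)
```

Raising `num_candidates` trades extra compute for a better chance of finding a high-value action, which fits the abstract's description of System 2 running at a low frequency.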

[AI-3] Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning

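【Quick Read】: This paper addresses early-stage startup investment, where data are scarce and outcomes uncertain; traditional machine learning methods need large labeled datasets and are hard to interpret. The key to the solution is a memory-augmented large language model (LLM) that uses in-context learning (ICL): a natural-language policy is embedded in the LLM prompt, so the model applies explicit reasoning patterns and human experts can interpret, audit, and iteratively refine the logic. The paper also introduces a lightweight training process that combines few-shot learning with an in-context learning loop, letting the LLM update its decision policy iteratively from structured feedback.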

Link: https://arxiv.org/abs/2505.21427
Authors: Xianling Mu,Joseph Ternasky,Fuat Alican,Yigit Ihlamur
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Early-stage startup investment is a high-risk endeavor characterized by scarce data and uncertain outcomes. Traditional machine learning approaches often require large, labeled datasets and extensive fine-tuning, yet remain opaque and difficult for domain experts to interpret or improve. In this paper, we propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models (LLMs) using in-context learning (ICL). Central to our method is a natural language policy embedded directly into the LLM prompt, enabling the model to apply explicit reasoning patterns and allowing human experts to easily interpret, audit, and iteratively refine the logic. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop, enabling the LLM to update its decision policy iteratively based on structured feedback. With only minimal supervision and no gradient-based optimization, our system predicts startup success far more accurately than existing benchmarks. It is over 20x more precise than random chance, which succeeds 1.9% of the time. It is also 7.1x more precise than the typical 5.6% success rate of top-tier venture capital (VC) firms.
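
The two ratios quoted at the end of the abstract can be cross-checked with a little arithmetic. The sketch below assumes that "Nx more precise" means a ratio of precisions; the absolute precision it derives is an inference from the stated figures, not a number reported in the abstract.

```python
random_precision = 0.019  # random-chance success rate stated in the abstract
vc_precision = 0.056      # typical top-tier VC success rate stated in the abstract
lift_over_vc = 7.1        # "7.1x more precise" than top-tier VC firms

# Assumption: "Nx more precise" is a ratio of precisions.
implied_precision = lift_over_vc * vc_precision
lift_over_random = implied_precision / random_precision

print(f"implied precision: {implied_precision:.1%}")
print(f"lift over random chance: {lift_over_random:.1f}x")
```

Under that reading the implied precision is roughly 40%, and the lift over random chance comes out a little under 21x, consistent with the "over 20x" claim.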

[AI-4] Learning Individual Behavior in Agent-Based Models with Graph Diffusion Networks

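【Quick Read】: This paper addresses a limitation of traditional Agent-Based Models (ABMs): their behavioral rules are generally non-differentiable, which makes them hard to combine with gradient-based optimization and therefore hard to integrate with real-world data. The key to the solution is a framework that learns a differentiable surrogate model from data generated by the ABM. It combines diffusion models, which capture the stochasticity of behavior, with graph neural networks, which model interactions between agents. Rather than approximating only system-level outputs, it models individual agent behavior directly, preserving the decentralized, bottom-up dynamics inherent to ABMs.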

Link: https://arxiv.org/abs/2505.21426
Authors: Francesco Cozzi,Marco Pangallo,Alan Perotti,André Panisson,Corrado Monti
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Econometrics (econ.EM); Physics and Society (physics.soc-ph)
Comments:

Abstract:Agent-Based Models (ABMs) are powerful tools for studying emergent properties in complex systems. In ABMs, agent behaviors are governed by local interactions and stochastic rules. However, these rules are, in general, non-differentiable, limiting the use of gradient-based methods for optimization, and thus integration with real-world data. We propose a novel framework to learn a differentiable surrogate of any ABM by observing its generated data. Our method combines diffusion models to capture behavioral stochasticity and graph neural networks to model agent interactions. Distinct from prior surrogate approaches, our method introduces a fundamental shift: rather than approximating system-level outputs, it models individual agent behavior directly, preserving the decentralized, bottom-up dynamics that define ABMs. We validate our approach on two ABMs (Schelling’s segregation model and a Predator-Prey ecosystem) showing that it replicates individual-level patterns and accurately forecasts emergent dynamics beyond training. Our results demonstrate the potential of combining diffusion models and graph learning for data-driven ABM simulation.

[AI-5] Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs

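【Quick Read】: This paper addresses performance and functional instability in cloud-hosted applications and services, where system complexity means a single problem can have dozens or even hundreds of potential root causes. The key to the solution is to combine the pattern-matching capabilities of modern AI tools with a natural multi-modal retrieval-augmented generation (RAG) LLM interface, simplifying problem identification and resolution. The proposed system, ARCA, is a multi-modal RAG LLM system built for this domain; step-wise evaluations show that it outperforms state-of-the-art alternatives.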

Link: https://arxiv.org/abs/2505.21419
Authors: Yifan Wang,Kenneth P. Birman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: Published in EuroMLSys2025

Abstract:Today’s cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.

[AI-6] A Framework for Adversarial Analysis of Decision Support Systems Prior to Deployment

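【Quick Read】: This paper addresses the pre-deployment security of decision-support systems trained with Deep Reinforcement Learning (DRL), in particular the analysis and protection of their learned behavior patterns and potential vulnerabilities. The key to the solution is a comprehensive framework that uses simulation to expose agent behavior and vulnerabilities and that generates precisely timed and targeted observation perturbations, so that the impact of adversarial attacks can be assessed in a strategic decision-making setting. The framework also supports systematically discovering and ranking attack impact across observation indices and time steps, and the authors evaluate the transferability of adversarial attacks across agent architectures and DRL training algorithms.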

Link: https://arxiv.org/abs/2505.21414
Authors: Brett Bissey,Kyle Gatesman,Walker Dimon,Mohammad Alam,Luis Robaina,Joseph Weissman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:This paper introduces a comprehensive framework designed to analyze and secure decision-support systems trained with Deep Reinforcement Learning (DRL), prior to deployment, by providing insights into learned behavior patterns and vulnerabilities discovered through simulation. The introduced framework aids in the development of precisely timed and targeted observation perturbations, enabling researchers to assess adversarial attack outcomes within a strategic decision-making context. We validate our framework, visualize agent behavior, and evaluate adversarial outcomes within the context of a custom-built strategic game, CyberStrike. Utilizing the proposed framework, we introduce a method for systematically discovering and ranking the impact of attacks on various observation indices and time-steps, and we conduct experiments to evaluate the transferability of adversarial attacks across agent architectures and DRL training algorithms. The findings underscore the critical need for robust adversarial defense mechanisms to protect decision-making policies in high-stakes environments.

[AI-7] MRSD: Multi-Resolution Skill Discovery for HRL Agents

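【Quick Read】: This paper addresses a limitation of skill discovery in Hierarchical Reinforcement Learning (HRL): existing methods learn only a single skill per task, whereas humans learn and use fine-grained and coarse motor skills at the same time. The key to the solution is the Multi-Resolution Skill Discovery (MRSD) framework, which learns multiple skill encoders at different temporal resolutions in parallel and lets a high-level manager select among these skills dynamically, yielding control strategies that adapt over time.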

Link: https://arxiv.org/abs/2505.21410
Authors: Shashank Sharma,Janina Hoffmann,Vinay Namboodiri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Hierarchical reinforcement learning (HRL) relies on abstract skills to solve long-horizon tasks efficiently. While existing skill discovery methods learn these skills automatically, they are limited to a single skill per task. In contrast, humans learn and use both fine-grained and coarse motor skills simultaneously. Inspired by human motor control, we propose Multi-Resolution Skill Discovery (MRSD), an HRL framework that learns multiple skill encoders at different temporal resolutions in parallel. A high-level manager dynamically selects among these skills, enabling adaptive control strategies over time. We evaluate MRSD on tasks from the DeepMind Control Suite and show that it outperforms prior state-of-the-art skill discovery and HRL methods, achieving faster convergence and higher final performance. Our findings highlight the benefits of integrating multi-resolution skills in HRL, paving the way for more versatile and efficient agents.

[AI-8] A Structured Unplugged Approach for Foundational AI Literacy in Primary Education

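【Quick Read】: This paper addresses the tendency of current AI education to emphasize tool use over understanding of the underlying concepts, which leaves non-experts, especially children, prone to misconceptions and unrealistic expectations and makes biases and stereotypes hard for them to recognize. The key to the solution is a structured, replicable teaching approach that builds on core mathematical elements closely tied to the primary school curriculum in order to strengthen students' conceptualization of AI, data representation, classification reasoning, and evaluation skills.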

Link: https://arxiv.org/abs/2505.21398
Authors: Maria Cristina Carrisi,Mirko Marras,Sara Vergallo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Under review

Abstract:Younger generations are growing up in a world increasingly shaped by intelligent technologies, making early AI literacy crucial for developing the skills to critically understand and navigate them. However, education in this field often emphasizes tool-based learning, prioritizing usage over understanding the underlying concepts. This lack of knowledge leaves non-experts, especially children, prone to misconceptions, unrealistic expectations, and difficulties in recognizing biases and stereotypes. In this paper, we propose a structured and replicable teaching approach that fosters foundational AI literacy in primary students, by building upon core mathematical elements closely connected to and of interest in primary curricula, to strengthen conceptualization, data representation, classification reasoning, and evaluation of AI. To assess the effectiveness of our approach, we conducted an empirical study with thirty-one fifth-grade students across two classes, evaluating their progress through a post-test and a satisfaction survey. Our results indicate improvements in terminology understanding and usage, features description, logical reasoning, and evaluative skills, with students showing a deeper comprehension of decision-making processes and their limitations. Moreover, the approach proved engaging, with students particularly enjoying activities that linked AI concepts to real-world reasoning. Materials: this https URL.

[AI-9] Leveraging the Power of Conversations: Optimal Key Term Selection in Conversational Contextual Bandits KDD

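【Quick Read】: This paper addresses poor preference learning in conversational recommender systems, caused by key term selection strategies that explore too little and by conversation-initiation mechanisms that rely on deterministic rules. The solution consists of three new algorithms: CLiSK introduces smoothed key term contexts to enhance exploration, CLiME initiates conversations adaptively based on preference uncertainty, and CLiSK-ME combines both techniques. Together they achieve a tighter theoretical regret upper bound and, in experiments, a clear improvement in cumulative regret.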

Link: https://arxiv.org/abs/2505.21393
Authors: Maoli Liu,Zhuohua Li,Xiangxiang Dai,John C.S. Lui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025

Abstract:Conversational recommender systems proactively query users with relevant “key terms” and leverage the feedback to elicit users’ preferences for personalized recommendations. Conversational contextual bandits, a prevalent approach in this domain, aim to optimize preference learning by balancing exploitation and exploration. However, several limitations hinder their effectiveness in real-world scenarios. First, existing algorithms employ key term selection strategies with insufficient exploration, often failing to thoroughly probe users’ preferences and resulting in suboptimal preference estimation. Second, current algorithms typically rely on deterministic rules to initiate conversations, causing unnecessary interactions when preferences are well-understood and missed opportunities when preferences are uncertain. To address these limitations, we propose three novel algorithms: CLiSK, CLiME, and CLiSK-ME. CLiSK introduces smoothed key term contexts to enhance exploration in preference learning, CLiME adaptively initiates conversations based on preference uncertainty, and CLiSK-ME integrates both techniques. We theoretically prove that all three algorithms achieve a tighter regret upper bound of $O(\sqrt{dT \log T})$ with respect to the time horizon $T$, improving upon existing methods. Additionally, we provide a matching lower bound $\Omega(\sqrt{dT})$ for conversational bandits, demonstrating that our algorithms are nearly minimax optimal. Extensive evaluations on both synthetic and real-world datasets show that our approaches achieve at least a 14.6% improvement in cumulative regret.
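
To see why the abstract calls the algorithms nearly minimax optimal, it helps to set the two bounds side by side. In the sketch below, $R(T)$ denotes cumulative regret, and the constants $C_1$ and $C_2$ are placeholders introduced for this illustration; neither appears in the abstract.

```latex
\begin{align*}
\text{upper bound (proposed algorithms):}\quad & R(T) \le C_1 \sqrt{dT \log T},\\
\text{lower bound (any algorithm, worst case):}\quad & R(T) \ge C_2 \sqrt{dT},\\
\text{gap between the two:}\quad & \frac{C_1 \sqrt{dT \log T}}{C_2 \sqrt{dT}} = \frac{C_1}{C_2}\sqrt{\log T}.
\end{align*}
```

The bounds differ only by a factor of order $\sqrt{\log T}$, which grows very slowly: assuming a natural logarithm, $T = 10^6$ gives $\sqrt{\log T} \approx 3.7$.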

[AI-10] Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

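【Quick Read】: This paper analyzes the convergence rate of linear TD(λ) when the features are not linearly independent. Earlier analyses typically assume linearly independent features, an assumption that often fails in practice. The paper establishes the first L² convergence rates under arbitrary features, without modifying the algorithm or adding assumptions, by developing a new stochastic approximation result whose convergence rates are stated for a solution set rather than a single point.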

Link: https://arxiv.org/abs/2505.21391
Authors: Zixuan Xie,Xinyu Liu,Rohan Chandra,Shangtong Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.
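
As a reminder of the algorithm under analysis, here is a minimal sketch of textbook linear TD(λ) with accumulating eligibility traces in the discounted setting. It is not taken from the paper; the function name `td_lambda`, the step size, and the toy problem are assumptions chosen for illustration. The toy problem deliberately uses two identical features, so the features are linearly dependent and the weight vector solving the problem is not unique, which is the situation the abstract targets.

```python
def td_lambda(transitions, num_features, alpha=0.1, gamma=0.5, lam=0.5):
    """Linear TD(lambda) over a stream of (phi, reward, phi_next) transitions."""
    w = [0.0] * num_features  # weight vector
    e = [0.0] * num_features  # eligibility trace
    for phi, reward, phi_next in transitions:
        v = sum(wi * xi for wi, xi in zip(w, phi))
        v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
        delta = reward + gamma * v_next - v                     # TD error
        e = [gamma * lam * ei + xi for ei, xi in zip(e, phi)]   # decay, then accumulate
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)]   # weight update
    return w


# Toy problem (assumption): one state with a self-loop and reward 1, so the true
# value is 1 / (1 - gamma) = 2. Both features are identical.
phi = [1.0, 1.0]
weights = td_lambda([(phi, 1.0, phi)] * 500, num_features=2)
estimated_value = sum(wi * xi for wi, xi in zip(weights, phi))
```

In this toy case the value estimate approaches 2 even though any weight pair with `w[0] + w[1] == 2` represents it equally well, which is why convergence is more naturally stated with respect to a set of solutions than a single point.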

[AI-11] DeSocial: Blockchain-based Decentralized Social Networks

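【Quick Read】: This paper addresses the fact that users of traditional Web 2.0 social platforms cannot choose the underlying algorithm, which limits personalized prediction; its core idea is to use blockchain to enable user-driven model selection and multi-node validation. The key to the solution is the DeSocial framework, deployed on an Ethereum local development chain, which combines distributed data storage, node-level consensus, and user-defined model selection. Users evaluate multiple backbone models on their local subgraphs, and predictions are aggregated by majority voting, improving the accuracy and robustness of personalized prediction.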

Link: https://arxiv.org/abs/2505.21388
Authors: Jingyuan Huang,Xi Zhu,Minghao Guo,Yongfeng Zhang
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 13 figures

Abstract:Web 2.0 social platforms are inherently centralized, with user data and algorithmic decisions controlled by the platform. However, users can only passively receive social predictions without being able to choose the underlying algorithm, which limits personalization. Fortunately, with the emergence of blockchain, users are allowed to choose algorithms that are tailored to their local situation, improving prediction results in a personalized way. In a blockchain environment, each user possesses its own model to perform the social prediction, capturing different perspectives on social interactions. In our work, we propose DeSocial, a decentralized social network learning framework deployed on an Ethereum (ETH) local development chain that integrates distributed data storage, node-level consensus, and user-driven model selection through Ganache. In the first stage, each user leverages DeSocial to evaluate multiple backbone models on their local subgraph. DeSocial coordinates the execution and returns model-wise prediction results, enabling the user to select the most suitable backbone for personalized social prediction. Then, DeSocial uniformly selects several validation nodes that possess the algorithm specified by each user, and aggregates the prediction results by majority voting, to prevent errors caused by any single model’s misjudgment. Extensive experiments show that DeSocial has an evident improvement compared to the five classical centralized social network learning models, promoting user empowerment in blockchain-based decentralized social networks, showing the importance of multi-node validation and personalized algorithm selection based on blockchain. Our implementation is available at: this https URL.
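
The validation stage described in the abstract (uniformly pick several validator nodes, then aggregate their predictions by majority vote) can be sketched as follows. This shows only the aggregation logic under stated assumptions: the node names, the per-node votes, and both function names are hypothetical, and nothing here touches a blockchain or the DeSocial codebase.

```python
import random
from collections import Counter


def sample_validators(candidates, k, seed=0):
    """Uniformly sample k distinct validator nodes from the candidate list."""
    return random.Random(seed).sample(candidates, k)


def majority_vote(predictions):
    """Return the most common prediction; use an odd number of voters to avoid ties."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical nodes and their binary link predictions (1 = link, 0 = no link).
node_votes = {"n1": 1, "n2": 1, "n3": 0, "n4": 1, "n5": 0}
validators = sample_validators(sorted(node_votes), k=3)
decision = majority_vote([node_votes[name] for name in validators])
```

With an odd number of validators and binary predictions, a single node's mistaken vote cannot flip the outcome on its own, which is the error-containment property the abstract attributes to majority voting.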

[AI-12] Improving LLM -based Global Optimization with Search Space Partitioning

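【Quick Read】: This paper addresses the weak performance of LLM-based methods when optimizing expensive black-box functions in high-dimensional search spaces or without domain-specific priors, where their suggestions tend to be sparse or uninformative. The key to the solution is HOLLM, a global optimization algorithm that partitions the search space into promising subregions and selects among them as "meta-arms" using a bandit-inspired scoring mechanism that balances exploration and exploitation. Within each selected subregion, an LLM then proposes high-quality candidate points without any explicit domain knowledge.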

Link: https://arxiv.org/abs/2505.21372
Authors: Andrej Schwanke,Lyubomir Ivanov,David Salinas,Fabio Ferreira,Aaron Klein,Frank Hutter,Arber Zela
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 10 figures, 3 tables

Abstract:Large Language Models (LLMs) have recently emerged as effective surrogate models and candidate generators within global optimization frameworks for expensive blackbox functions. Despite promising results, LLM-based methods often struggle in high-dimensional search spaces or when lacking domain-specific priors, leading to sparse or uninformative suggestions. To overcome these limitations, we propose HOLLM, a novel global optimization algorithm that enhances LLM-driven sampling by partitioning the search space into promising subregions. Each subregion acts as a “meta-arm” selected via a bandit-inspired scoring mechanism that effectively balances exploration and exploitation. Within each selected subregion, an LLM then proposes high-quality candidate points, without any explicit domain knowledge. Empirical evaluation on standard optimization benchmarks shows that HOLLM consistently matches or surpasses leading Bayesian optimization and trust-region methods, while substantially outperforming global LLM-based sampling strategies.
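
The abstract does not spell out the scoring rule used to pick a "meta-arm". As a stand-in, the sketch below uses the classical UCB1 score (mean observed value plus an exploration bonus) to show how a bandit-style rule can balance exploration and exploitation across subregions. The region names, statistics, and function names are hypothetical; HOLLM's actual scoring mechanism may differ.

```python
import math


def ucb_score(mean_value, visits, total_visits, c=1.0):
    """UCB1-style score: exploit a high mean, explore rarely visited regions."""
    if visits == 0:
        return float("inf")  # always try an unvisited region first
    return mean_value + c * math.sqrt(math.log(total_visits) / visits)


def select_region(stats, c=1.0):
    """stats maps a region name to (mean observed value, number of visits)."""
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda name: ucb_score(stats[name][0], stats[name][1], total, c))


# Hypothetical subregion statistics: A looks best on average, B is barely explored.
chosen = select_region({"A": (0.6, 100), "B": (0.5, 1)})
```

In this toy case region B is chosen despite its lower mean, because its single visit leaves a large exploration bonus. In a full pipeline, the selected subregion would then be handed to the LLM as the area in which to propose candidate points.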

[AI-13] Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

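【Quick Read】: This paper addresses the interpretability and editability of multilayer perceptrons (MLPs) in large language models, whose dense representations make the models hard to understand and steer. Existing methods learn interpretable approximations through neuron-level sparsity but fail to reconstruct the original mapping faithfully, which significantly increases the model's next-token cross-entropy loss. The key to the solution is a move to layer-level sparsity through Mixture of Decoders (MxDs): a flexible tensor factorization expands pretrained dense layers into tens of thousands of specialized sublayers, each sparsely activated MxD sublayer implementing a linear transformation with full-rank weights, so the original decoder's expressive capacity is preserved even under heavy sparsity. Experiments on language models with up to 3B parameters show that MxDs significantly outperform state-of-the-art methods on the sparsity-accuracy trade-off.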

Link: https://arxiv.org/abs/2505.21364
Authors: James Oldfield,Shawn Im,Yixuan Li,Mihalis A. Nicolaou,Ioannis Patras,Grigorios G Chrysos
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping, significantly increasing the model’s next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders’ expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: this https URL.

[AI-14] Subgroups Matter for Robust Bias Mitigation

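【Quick Read】: This paper asks why bias mitigation techniques in machine learning behave inconsistently in practice, and in particular why some of them fail. Its central claim is that a crucial but often overlooked step shared by many bias mitigation methods is the definition of subgroups, and that the choice of subgroups significantly affects how well a method works. The key recommendation is therefore to define subgroups carefully and deliberately, rather than simply mitigating with respect to whichever groups show an observed disparity.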

Link: https://arxiv.org/abs/2505.21363
Authors: Anissa Alloula,Charles Jones,Ben Glocker,Bartłomiej W. Papież
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the constant development of new bias mitigation methods for machine learning, no method consistently succeeds, and a fundamental question remains unanswered: when and why do bias mitigation techniques fail? In this paper, we hypothesise that a key factor may be the often-overlooked but crucial step shared by many bias mitigation methods: the definition of subgroups. To investigate this, we conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple vision and language classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our results reveal that subgroup choice significantly impacts performance, with certain groupings paradoxically leading to worse outcomes than no mitigation at all. Our findings suggest that observing a disparity between a set of subgroups is not a sufficient reason to use those subgroups for mitigation. Through theoretical analysis, we explain these phenomena and uncover a counter-intuitive insight that, in some cases, improving fairness with respect to a particular set of subgroups is best achieved by using a different set of subgroups for mitigation. Our work highlights the importance of careful subgroup definition in bias mitigation and suggests it as an alternative lever for improving the robustness and fairness of machine learning models.

[AI-15] An Uncertainty-Aware ED-LSTM for Probabilistic Suffix Prediction

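【Quick Read】: This paper addresses a limitation of suffix prediction for business processes: conventional methods predict only the single most likely sequence of remaining events, and a single prediction is of limited expressiveness when the future course of a process is uncertain or highly variable. The key to the solution is probabilistic suffix prediction, which approximates a probability distribution over suffixes using an Uncertainty-Aware Encoder-Decoder LSTM (U-ED-LSTM) and a Monte Carlo (MC) suffix sampling algorithm. Epistemic uncertainty is captured with MC dropout and aleatoric uncertainty is modeled as learned loss attenuation, giving a fuller picture of the uncertainty present in event logs.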

Link: https://arxiv.org/abs/2505.21339
Authors: Henryk Mustroph,Michel Kunkler,Stefanie Rinderle-Ma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Suffix prediction of business processes forecasts the remaining sequence of events until process completion. Current approaches focus on predicting a single, most likely suffix. However, if the future course of a process is exposed to uncertainty or has high variability, the expressiveness of a single suffix prediction can be limited. To address this limitation, we propose probabilistic suffix prediction, a novel approach that approximates a probability distribution of suffixes. The proposed approach is based on an Uncertainty-Aware Encoder-Decoder LSTM (U-ED-LSTM) and a Monte Carlo (MC) suffix sampling algorithm. We capture epistemic uncertainties via MC dropout and aleatoric uncertainties as learned loss attenuation. This technical report provides a detailed evaluation of the U-ED-LSTM’s predictive performance and assesses its calibration on four real-life event logs with three different hyperparameter settings. The results show that i) the U-ED-LSTM has reasonable predictive performance across various datasets, ii) aggregating probabilistic suffix predictions into mean values can outperform most likely predictions, particularly for rare prefixes or longer suffixes, and iii) the approach effectively captures uncertainties present in event logs.
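
MC dropout, which the abstract uses to capture epistemic uncertainty, keeps dropout switched on at prediction time and treats the spread of repeated stochastic forward passes as an uncertainty estimate. The sketch below applies that idea to a toy linear model rather than an encoder-decoder LSTM; the function name, weights, and inputs are all assumptions made for illustration.

```python
import random
import statistics


def mc_dropout_predict(weights, features, drop_prob=0.2, num_samples=200, seed=0):
    """Run several forward passes with fresh dropout masks; return (mean, std)."""
    rng = random.Random(seed)
    keep = 1.0 - drop_prob
    outputs = []
    for _ in range(num_samples):
        total = 0.0
        for w, x in zip(weights, features):
            # Fresh Bernoulli mask on every pass; dividing by `keep` (inverted
            # dropout) leaves the expected output unchanged.
            if rng.random() < keep:
                total += w * x / keep
        outputs.append(total)
    return statistics.mean(outputs), statistics.pstdev(outputs)


# Hypothetical model and input.
mean_pred, spread = mc_dropout_predict([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], drop_prob=0.5)
```

The mean acts as the point prediction and the standard deviation as a rough epistemic uncertainty signal. For suffix prediction, the same repeated-sampling idea would apply to whole event sequences instead of a single scalar.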

[AI-16] Assured Autonomy with Neuro-Symbolic Perception

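【Quick Read】: This paper addresses the security and reliability concerns raised by deploying state-of-the-art AI models in cyber-physical systems (CPS): although highly accurate, these models are essentially pattern matchers with limited security guarantees. The key to the solution is a neuro-symbolic perception framework (NeuSPaPer) that gives data-driven perception models symbolic structure, enabling reasoning over both low-level features and high-level context. The framework combines foundation models for offline knowledge extraction with specialized scene graph generation (SGG) algorithms for real-time deployment, and builds structured relational graphs that protect the integrity of situational awareness in autonomous systems. SGG bridges the gap between low-level sensor perception and high-level reasoning, laying a foundation for resilient, context-aware AI and trusted autonomy.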

Link: https://arxiv.org/abs/2505.21322
Authors: R. Spencer Hallyburton,Miroslav Pajic
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Many state-of-the-art AI models deployed in cyber-physical systems (CPS), while highly accurate, are simply pattern-matchers. With limited security guarantees, there are concerns for their reliability in safety-critical and contested domains. To advance assured AI, we advocate for a paradigm shift that imbues data-driven perception models with symbolic structure, inspired by a human’s ability to reason over low-level features and high-level context. We propose a neuro-symbolic paradigm for perception (NeuSPaPer) and illustrate how joint object detection and scene graph generation (SGG) yields deep scene understanding. Powered by foundation models for offline knowledge extraction and specialized SGG algorithms for real-time deployment, we design a framework leveraging structured relational graphs that ensures the integrity of situational awareness in autonomy. Using physics-based simulators and real-world datasets, we demonstrate how SGG bridges the gap between low-level sensor perception and high-level reasoning, establishing a foundation for resilient, context-aware AI and advancing trusted autonomy in CPS.

[AI-17] Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

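【Quick Read】: This paper addresses the limited capacity of Large Language Models (LLMs) for systematic reasoning in chemistry, particularly in real-world tasks that demand rigorous structural analysis, such as drug design and reaction engineering. Current benchmarks focus on simple knowledge retrieval and neglect the step-by-step reasoning needed for complex tasks such as molecular optimization and reaction prediction. The key to the solution is ChemCoTBench, a framework that couples molecular structure understanding with arithmetic-inspired operations (addition, deletion, and substitution) to formalize chemical problem-solving as a transparent, step-by-step workflow. Treating molecular transformations as modular "chemical operations" enables slow-thinking reasoning that mirrors the logic of mathematical proofs while respecting real-world chemical constraints.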

Link: https://arxiv.org/abs/2505.21318
Authors: Hao Li,He Cao,Bin Feng,Yanjun Shao,Xiangru Tang,Zhiyuan Yan,Li Yuan,Yonghong Tian,Yu Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures

Abstract:While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular “chemical operations”, the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.

[AI-18] A Cross Modal Knowledge Distillation Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features ICML2025

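Quick read: This paper addresses how to effectively integrate transcriptomics and microscopy imaging data to improve the understanding of cellular responses in biological discovery and drug development. The core challenge is that weakly paired data are scarce, which limits multimodal learning. The key to the solution is knowledge distillation: weakly paired data are used to align and bind the modalities so that morphological information is incorporated into gene expression representations. Two techniques are proposed: Semi-Clipped, a cross-modal distillation method built on pretrained foundation models, and PEA (Perturbation Embedding Augmentation), a data augmentation technique that enhances transcriptomics data while preserving biological information. These strategies improve the predictive power of transcriptomics while retaining its interpretability.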

Link: https://arxiv.org/abs/2505.21317
Authors: Ihab Bendidi,Yassir El Mesbahi,Alisandra K. Denton,Karush Suri,Kian Kenyon-Dean,Auguste Genovesio,Emmanuel Noutahi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025 Main Proceedings

Abstract:Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.

[AI-19] Large Language Models Miss the Multi-Agent Mark

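Quick read: This paper addresses the key conceptual and implementation discrepancies between current multi-agent systems of large language models (MAS LLMs) and established Multi-Agent Systems (MAS) theory. The core problem is that existing MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The key to the solution is to revisit and integrate established MAS concepts, adopt more precise terminology, and systematically analyse the current issues, so that research and practice align more closely with multi-agent principles.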

Link: https://arxiv.org/abs/2505.21298
Authors: Emanuele La Malfa,Gabriele La Malfa,Samuele Marro,Jie M. Zhang,Elizabeth Black,Micheal Luck,Philip Torr,Michael Wooldridge
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

[AI-20] Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

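Quick read: This paper addresses the challenge of diagnostic modeling for complex high-reliability systems such as nuclear power plants. Traditional diagnostic modeling struggles as system complexity grows, which makes functional modeling more attractive. The key to the solution is a diagnostic framework that combines Knowledge Graphs (KGs) with Large Language Models (LLMs). Grounded in the functional modeling principles of Dynamic Master Logic (DML), it uses two coordinated LLM components: an LLM-based workflow that automatically constructs DML logic, and an LLM agent for interactive diagnostics. The generated logic is encoded into a structured knowledge graph, KG-DML, which supports hierarchical fault reasoning. By distinguishing diagnostic from interpretive tasks, the framework tailors its tool-calling and retrieval strategies, improving diagnostic accuracy and interpretability.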

Link: https://arxiv.org/abs/2505.21291
Authors: Saman Marandi,Yu-Shu Hu,Mohammad Modarres
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures

Abstract:In this paper, we present a novel diagnostic framework that integrates Knowledge Graphs (KGs) and Large Language Models (LLMs) to support system diagnostics in high-reliability systems such as nuclear power plants. Traditional diagnostic modeling struggles when systems become too complex, making functional modeling a more attractive approach. Our approach introduces a diagnostic framework grounded in the functional modeling principles of the Dynamic Master Logic (DML) model. It incorporates two coordinated LLM components, including an LLM-based workflow for automated construction of DML logic from system documentation and an LLM agent that facilitates interactive diagnostics. The generated logic is encoded into a structured KG, referred to as KG-DML, which supports hierarchical fault reasoning. Expert knowledge or operational data can also be incorporated to refine the model’s precision and diagnostic depth. In the interaction phase, users submit natural language queries, which are interpreted by the LLM agent. The agent selects appropriate tools for structured reasoning, including upward and downward propagation across the KG-DML. Rather than embedding KG content into every prompt, the LLM agent distinguishes between diagnostic and interpretive tasks. For diagnostics, the agent selects and executes external tools that perform structured KG reasoning. For general queries, a Graph-based Retrieval-Augmented Generation (Graph-RAG) approach is used, retrieving relevant KG segments and embedding them into the prompt to generate natural explanations. A case study on an auxiliary feedwater system demonstrated the framework’s effectiveness, with over 90% accuracy in key elements and consistent tool and argument extraction, supporting its use in safety-critical diagnostics.

[AI-21] GSAT: Graph Structure Attention Networks

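Quick read: This paper addresses a problem on graph classification benchmarks: when the structural information of node neighbourhoods is ignored, models need many layers to pass information between distant nodes, which causes oversmoothing. The key to the solution is to use structural information modeled by anonymous random walks (ARWs) and to introduce the Graph Structure Attention Network (GSAT), a generalization of the Graph Attention Network (GAT). By fusing the original node attributes with the structural representation, the model learns automatically how to attend to different edges in a node's neighbourhood, which enriches the graph representation.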

Link: https://arxiv.org/abs/2505.21288
Authors: Farshad Noravesh,Reza Haffari,Layki Soon,Arghya Pal
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful tool for processing data represented in graph structures, achieving remarkable success across a wide range of applications. However, to further improve the performance on graph classification benchmarks, the structural representation of each node, which encodes rich local topological information in the node's neighbourhood, is an important type of feature that is often overlooked in modeling. Neglecting this structural information has resulted in a high number of layers being needed to connect messages from distant nodes, which by itself produces other problems such as oversmoothing. In the present paper, we leverage structural information modeled by anonymous random walks (ARWs) and introduce the graph structure attention network (GSAT), a generalization of the graph attention network (GAT) that integrates the original attributes and the structural representation so that the model automatically finds patterns for attending to different edges in the node neighbourhood, enriching the graph representation. Our experiments show that GSAT slightly improves on the SOTA on some graph classification benchmarks.
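
To make the anonymous random walk idea concrete, here is a minimal sketch; it is not the authors' implementation. An anonymous walk replaces each node with the index of its first appearance, so only the visiting pattern survives. The helper names `anonymize_walk` and `arw_distribution` and the toy path graph `PATH` are assumptions made for illustration, and the abstract does not say how GSAT turns these patterns into node features.

```python
from collections import Counter


def anonymize_walk(walk):
    """Replace every node by the index of its first occurrence in the walk."""
    first_seen = {}
    pattern = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen)
        pattern.append(first_seen[node])
    return tuple(pattern)


def arw_distribution(adj, start, length):
    """Exact distribution over anonymous walks with `length` steps from `start`.

    Enumerates every walk of a uniform random walk, so it only suits tiny graphs.
    """
    dist = Counter()

    def extend(walk, prob):
        if len(walk) == length + 1:
            dist[anonymize_walk(walk)] += prob
            return
        neighbours = adj[walk[-1]]
        for nxt in neighbours:
            extend(walk + [nxt], prob / len(neighbours))

    extend([start], 1.0)
    return dict(dist)


# Hypothetical toy graph: a path 0 - 1 - 2, stored as adjacency lists.
PATH = {0: [1], 1: [0, 2], 2: [1]}
```

On this path graph the middle node can only step out and come back, so its two-step distribution puts all mass on the pattern `(0, 1, 0)`, whereas an end node splits its mass between `(0, 1, 0)` and `(0, 1, 2)`. That difference is the kind of local structural signal the abstract refers to.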

[AI-22] RLJP: Legal Judgment Prediction via First-Order Logic Rule-enhanced with Large Language Models

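Quick read: This paper addresses the neglect of legal reasoning logic in existing Legal Judgment Prediction (LJP) models. Existing models reach high performance by integrating judicial precedents and legal knowledge, but they lack rigorous analysis of legal reasoning logic, which limits their adaptability to complex cases. The key to the solution is a rule-enhanced framework based on First-Order Logic (FOL) formalism and Contrastive Learning (CL) that proceeds in three stages: first, judgment rules are initialized in FOL to capture complex reasoning logic accurately; second, Confusion-aware Contrastive Learning (CACL) dynamically optimizes the rules; finally, the optimized rules are used to predict legal judgments. This allows the legal reasoning logic to adapt to each case and improves LJP performance.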

Link: https://arxiv.org/abs/2505.21281
Authors: Yue Zhang,Zhiliang Tian,Shicheng Zhou,Haiyang Wang,Wenqing Hou,Yuying Liu,Xuechen Zhao,Minlie Huang,Ye Wang,Bin Zhou
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Legal Judgment Prediction (LJP) is a pivotal task in legal AI. Existing semantic-enhanced LJP models integrate judicial precedents and legal knowledge for high performance. But they neglect legal reasoning logic, a critical component of legal judgments requiring rigorous logical analysis. Although some approaches utilize legal reasoning logic for high-quality predictions, their logic rigidity hinders adaptation to case-specific logical frameworks, particularly in complex cases that are lengthy and detailed. This paper proposes a rule-enhanced legal judgment prediction framework based on first-order logic (FOL) formalism and comparative learning (CL) to develop an adaptive adjustment mechanism for legal judgment logic and further enhance performance in LJP. Inspired by the process of human exam preparation, our method follows a three-stage approach: first, we initialize judgment rules using the FOL formalism to capture complex reasoning logic accurately; next, we propose a Confusion-aware Contrastive Learning (CACL) to dynamically optimize the judgment rules through a quiz consisting of confusable cases; finally, we utilize the optimized judgment rules to predict legal judgments. Experimental results on two public datasets show superior performance across all metrics. The code is publicly available at this https URL.

[AI-23] XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration

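Quick read: This paper addresses the inability of existing evaluation methods for Device-Control Agents (DC agents) to reveal, at a microscopic level, the errors that may occur in real-world use. Conventional evaluation offers only a macroscopic view of performance, such as step-wise action accuracy and task success rate, without fine-grained analysis of individual states. The key to the solution is the XBOUND evaluation method, which computes a new Explore Metric to delineate the capability boundaries of DC agents and focuses on individual states to assess how well the agents have mastered them. The study also builds a "pseudo" episode tree dataset from Android Control test data and uses it to evaluate the overall and task-specific performance of the OS-Atlas and UI-TARS series across multiple tasks.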

Link: https://arxiv.org/abs/2505.21279
Authors: Shaoqing Zhang,Kehai Chen,Zhuosheng Zhang,Rumei Li,Rongxiang Weng,Yang Xiang,Liqiang Nie,Min Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in vision-language models (VLMs) have spurred increased interest in Device-Control Agents (DC agents), such as utilizing in-the-wild device control to manage graphical user interfaces. Conventional methods for assessing the capabilities of DC agents, such as computing step-wise action accuracy and overall task success rates, provide a macroscopic view of DC agents’ performance; however, they fail to offer microscopic insights into potential errors that may occur in real-world applications. Conducting a finer-grained performance evaluation of DC agents presents significant challenges. This study introduces a new perspective on evaluation methods for DC agents by proposing the XBOUND evaluation method, which employs the calculation of a novel Explore Metric to delineate the capability boundaries of DC agents. Compared to previous evaluation methods, XBOUND focuses on individual states to assess the proficiency of DC agents in mastering these states. Furthermore, we have developed a "pseudo" episode tree dataset derived from Android Control test data. Utilizing this dataset and XBOUND, we comprehensively evaluate the OS-Atlas and UI-TARS series, examining both the overall and specific performance across five common tasks. Additionally, we select representative cases to highlight the current deficiencies and limitations inherent in both series. Code is available at this https URL.

[AI-24] Breaking the Performance Ceiling in Complex Reinforcement Learning requires Inference Strategies

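Quick read: This paper addresses the performance ceiling that state-of-the-art reinforcement learning systems hit under zero-shot inference on complex multi-agent reinforcement learning (MARL) problems. The key to the solution is an inference phase at execution time that uses a given time and compute budget to make multiple attempts, combined with a suitable inference strategy. Experiments across 17 tasks show an average improvement of 45% and a maximum of 126% over the previous state of the art, at the cost of only a few extra seconds of wall-clock time.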

Link: https://arxiv.org/abs/2505.21236
Authors: Felix Chalumeau,Daniel Rajaonarivonivelomanantsoa,Ruan de Kock,Claude Formanek,Sasha Abramowitz,Oumayma Mahjoub,Wiem Khlifi,Simon Du Toit,Louay Ben Nessir,Refiloe Shabe,Arnol Fokam,Siddarth Singh,Ulrich Mbou Sob,Arnu Pretorius
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. Our experimental data and code are available at this https URL.
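
The abstract does not name the inference strategies it compares, so the sketch below shows only the simplest one that fits the description: sample several attempts from a stochastic policy within a fixed attempt budget and keep the best. Everything here is an illustrative assumption, including the function `best_of_n`, the toy routing instance `POINTS`, and the use of a random shuffle in place of a trained policy.

```python
import random

# Hypothetical toy problem: visit five points in some order and return to the start.
POINTS = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 2)]


def sample_tour(rng):
    """Stand-in for a stochastic policy: propose a random visiting order."""
    order = list(range(len(POINTS)))
    rng.shuffle(order)
    return order


def negative_tour_length(order):
    """Score of an attempt; higher is better, so the tour length is negated."""
    total = 0.0
    for i in range(len(order)):
        (x1, y1), (x2, y2) = POINTS[order[i]], POINTS[order[i - 1]]
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return -total


def best_of_n(sample_solution, score, n_attempts, seed=0):
    """Spend the budget on `n_attempts` samples and return the best one found."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_attempts):
        candidate = sample_solution(rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

With the same seed, `n_attempts=1` corresponds to a single zero-shot attempt, and a larger budget can never score worse because the first sample is identical. More elaborate strategies would replace the sampling loop but keep the same budgeted interface.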

[AI-25] Addressing Data Quality Decompensation in Federated Learning via Dynamic Client Selection

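Quick read: This paper addresses the growing client heterogeneity and degraded global performance in cross-silo Federated Learning (FL) that result from disparities in data quality, budget constraints, and incentive-compatibility issues. The key to the solution is a unified framework, Shapley-Bid Reputation Optimized Federated Learning (SBRO-FL), which integrates dynamic bidding, reputation modeling, and cost-aware selection. Shapley values measure each client's marginal contribution to the global model, and a reputation system inspired by prospect theory captures historical performance and consistency. Client selection then maximizes reputation-weighted utility under a budget constraint, which improves accuracy, convergence speed, and robustness.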

Link: https://arxiv.org/abs/2505.21219
Authors: Qinjun Fei,Nuria Rodríguez-Barroso,María Victoria Luzón,Zhongliang Zhang,Francisco Herrera
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In cross-silo Federated Learning (FL), client selection is critical to ensure high model performance, yet it remains challenging due to data quality decompensation, budget constraints, and incentive compatibility. As training progresses, these factors exacerbate client heterogeneity and degrade global performance. Most existing approaches treat these challenges in isolation, making jointly optimizing multiple factors difficult. To address this, we propose Shapley-Bid Reputation Optimized Federated Learning (SBRO-FL), a unified framework integrating dynamic bidding, reputation modeling, and cost-aware selection. Clients submit bids based on their perceived data quality, and their contributions are evaluated using Shapley values to quantify their marginal impact on the global model. A reputation system, inspired by prospect theory, captures historical performance while penalizing inconsistency. The client selection problem is formulated as a 0-1 integer program that maximizes reputation-weighted utility under budget constraints. Experiments on FashionMNIST, EMNIST, CIFAR-10, and SVHN datasets show that SBRO-FL improves accuracy, convergence speed, and robustness, even in adversarial and low-bid interference scenarios. Our results highlight the importance of balancing data reliability, incentive compatibility, and cost efficiency to enable scalable and trustworthy FL deployments.
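
As a rough illustration of how Shapley values quantify a client's marginal impact, the sketch below computes them exactly by averaging marginal contributions over all client orderings. It is a textbook computation, not the SBRO-FL code: the client names, the `QUALITY` table, and `toy_utility` are hypothetical, and a real system would need an approximation because exact enumeration grows factorially with the number of clients.

```python
from itertools import permutations

# Hypothetical per-client data quality scores.
QUALITY = {"A": 0.3, "B": 0.2, "C": 0.1}


def toy_utility(coalition):
    """Hypothetical global-model utility of a set of clients.

    Additive in quality, plus a bonus when A and B participate together.
    """
    bonus = 0.1 if {"A", "B"} <= coalition else 0.0
    return sum(QUALITY[c] for c in coalition) + bonus


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}
```

In this toy setting the 0.1 synergy bonus is split evenly between A and B, so their values come out as 0.35 and 0.25, while C keeps exactly its own 0.1. The values also sum to the utility of the full coalition, which is the efficiency property that makes Shapley values attractive for contribution accounting.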

[AI-26] Interpretable DNFs

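Quick read: This paper addresses how to define and construct interpretable boolean classifiers. Specifically, it studies interpretable disjunctive normal form (DNF) models whose positive and negative decisions can both be understood by a human through small explanations. The key to the solution is to require that both the classifier and its complement be expressible as k-DNFs, that is, DNFs whose terms have bounded size, which guarantees concise explanations. The paper compares two model families that satisfy this condition, depth-k decision trees and a novel family of nested k-DNFs, and shows experimentally that nested k-DNFs are competitive in interpretability and accuracy.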

Link: https://arxiv.org/abs/2505.21212
Authors: Martin C. Cooper,Imane Bousdira,Clément Carbonnel
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A classifier is considered interpretable if each of its decisions has an explanation which is small enough to be easily understood by a human user. A DNF formula can be seen as a binary classifier $\kappa$ over boolean domains. The size of an explanation of a positive decision taken by a DNF $\kappa$ is bounded by the size of the terms in $\kappa$, since we can explain a positive decision by giving a term of $\kappa$ that evaluates to true. Since both positive and negative decisions must be explained, we consider that interpretable DNFs are those $\kappa$ for which both $\kappa$ and $\overline{\kappa}$ can be expressed as DNFs composed of terms of bounded size. In this paper, we study the family of $k$-DNFs whose complements can also be expressed as $k$-DNFs. We compare two such families, namely depth-$k$ decision trees and nested $k$-DNFs, a novel family of models. Experiments indicate that nested $k$-DNFs are an interesting alternative to decision trees in terms of interpretability and accuracy.
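
The explanation mechanism described in the abstract can be sketched in a few lines: a positive decision is explained by a term of the DNF that evaluates to true, and a negative decision by a true term of the complement's DNF. The representation (a term as a dict from variable name to required value) and the example formulas `KAPPA` and `NOT_KAPPA` are assumptions chosen for illustration; the example encodes a depth-2 decision tree, not a nested k-DNF from the paper.

```python
def satisfied_term(dnf, assignment):
    """Return the first term of `dnf` that `assignment` makes true, or None."""
    for term in dnf:
        if all(assignment[var] == val for var, val in term.items()):
            return term
    return None


def classify_with_explanation(dnf, complement_dnf, assignment):
    """Return (decision, explanation), where the explanation is a satisfied term."""
    term = satisfied_term(dnf, assignment)
    if term is not None:
        return True, term
    return False, satisfied_term(complement_dnf, assignment)


# Hypothetical 2-DNF: (x1 AND x2) OR (NOT x1 AND x3).
KAPPA = [{"x1": True, "x2": True}, {"x1": False, "x3": True}]
# Its complement is also a 2-DNF: (x1 AND NOT x2) OR (NOT x1 AND NOT x3).
NOT_KAPPA = [{"x1": True, "x2": False}, {"x1": False, "x3": False}]
```

Because every term here has at most two literals, every explanation mentions at most two variables, whichever way the decision goes. For the input `x1=True, x2=False, x3=True` the classifier answers negative and explains it with `x1 AND NOT x2`.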

[AI-27] Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations

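Quick read: This paper addresses the fact that conventional offline imitation learning, which learns from expert and unlabeled demonstrations, tends to overlook the valuable signal in explicitly undesirable behaviors. The key to the solution is a new formulation based on contrasting behaviors: it optimizes a difference of KL divergences over the state-action visitation distributions of expert and undesirable data. The resulting objective is a DC (Difference-of-Convex) program, but it becomes convex when expert demonstrations outweigh undesirable ones, which yields a stable, non-adversarial training objective that handles positive and negative demonstrations in a unified way.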

Link: https://arxiv.org/abs/2505.21182
Authors: Huy Hoang,Tien Mai,Pradeep Varakantham,Tanvi Verma
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint version

Abstract:Offline imitation learning typically learns from expert and unlabeled demonstrations, yet often overlooks the valuable signal in explicitly undesirable behaviors. In this work, we study offline imitation learning from contrasting behaviors, where the dataset contains both expert and undesirable demonstrations. We propose a novel formulation that optimizes a difference of KL divergences over the state-action visitation distributions of expert and undesirable (or bad) data. Although the resulting objective is a DC (Difference-of-Convex) program, we prove that it becomes convex when expert demonstrations outweigh undesirable demonstrations, enabling a practical and stable non-adversarial training objective. Our method avoids adversarial training and handles both positive and negative demonstrations in a unified framework. Extensive experiments on standard offline imitation learning benchmarks demonstrate that our approach consistently outperforms state-of-the-art baselines.

[AI-28] Latent label distribution grid representation for modeling uncertainty

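Quick read: This paper addresses inexact label spaces in Label Distribution Learning (LDL). Because annotating label distributions is complex and costly, the label space is constructed inexactly; the resulting uncertainty misleads LDL algorithms into incorrect decisions. The key to the solution is a Latent Label Distribution Grid (LLDG): a label correlation matrix is built from the differences between labels, and each of its values is expanded into a vector that follows a Gaussian distribution, which models the uncertainty of the label space. The LLDG is then reconstructed by the LLDG-Mixer to generate an accurate label distribution, and a customized low-rank scheme with Tucker reconstruction is applied for noise reduction.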

Link: https://arxiv.org/abs/2505.21180
Authors: ShuNing Sun,YinSong Xiong,Yu Zhang,Zhuoran Zheng
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Although Label Distribution Learning (LDL) has promising representation capabilities for characterizing the polysemy of an instance, the complexity and high cost of label distribution annotation lead to inexactness in the construction of the label space. The existence of a large number of inexact labels generates a label space with uncertainty, which misleads the LDL algorithm into making incorrect decisions. To alleviate this problem, we model the uncertainty of label distributions by constructing a Latent Label Distribution Grid (LLDG) to form a low-noise representation space. Specifically, we first construct a label correlation matrix based on the differences between labels, and then expand each value of the matrix into a vector that obeys a Gaussian distribution, thus building an LLDG to model the uncertainty of the label space. Finally, the LLDG is reconstructed by the LLDG-Mixer to generate an accurate label distribution. Note that we enforce a customized low-rank scheme on this grid, which assumes that the label relations may be noisy and therefore performs noise reduction with the help of a Tucker reconstruction technique. Furthermore, we attempt to evaluate the effectiveness of the LLDG by considering its generation as an upstream task to achieve the classification of the objects. Extensive experimental results show that our approach performs competitively on several benchmarks.

[AI-29] STEB: In Search of the Best Evaluation Approach for Synthetic Time Series

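Quick read: This paper addresses the challenge of objectively comparing evaluation measures for synthetic time series at scale, a problem made more pressing by the growing demand from data augmentation and privacy regulations. The key to the solution is the Synthetic Time series Evaluation Benchmark (STEB), the first benchmark framework for comprehensive and interpretable automated comparison of such measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators of measure reliability and score consistency; it also tracks running time and test errors and supports sequential and parallel modes of operation. With it, the authors analyse and rank 41 measures from the literature.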

Link: https://arxiv.org/abs/2505.21160
Authors: Michael Stenger,Robert Leppich,André Bauer,Samuel Kounev
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing need for synthetic time series, due to data augmentation or privacy regulations, has led to numerous generative models, frameworks, and evaluation measures alike. Objectively comparing these measures on a large scale remains an open challenge. We propose the Synthetic Time series Evaluation Benchmark (STEB) – the first benchmark framework that enables comprehensive and interpretable automated comparisons of synthetic time series evaluation measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators for measure reliability and score consistency. It tracks running time, test errors, and features sequential and parallel modes of operation. In our experiments, we determine a ranking of 41 measures from literature and confirm that the choice of upstream time series embedding heavily impacts the final score.

[AI-30] Model as Loss: A Self-Consistent Training Paradigm INTERSPEECH2025

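Quick read: This paper addresses a limitation of conventional speech enhancement methods: they rely on handcrafted loss functions (for example, time-domain or frequency-domain losses) or on pretrained deep feature losses (for example, WavLM or wav2vec), which often fail to capture subtle signal properties that matter for performance. The key to the solution is the "Model as Loss" training paradigm, which uses the encoder of the same model as the loss function. The encoder's task-specific feature space guides decoder training so that the output is consistent with the perceptual and task-relevant characteristics of the clean signal, which improves enhancement quality and generalization.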

Link: https://arxiv.org/abs/2505.21156
Authors: Saisamarth Rajesh Phaye,Milos Cernak,Andrew Harper
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted in Interspeech 2025

Abstract:Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time or frequency domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for optimal performance. To address this, we propose Model as Loss, a novel training paradigm that utilizes the encoder from the same model as a loss function to guide the training. The Model as Loss paradigm leverages the encoder’s task-specific feature space, optimizing the decoder to produce output consistent with perceptual and task-relevant characteristics of the clean signal. By using the encoder’s learned features as a loss function, this framework enforces self-consistency between the clean reference speech and the enhanced model output. Our approach outperforms pre-trained deep feature losses on standard speech enhancement benchmarks, offering better perceptual quality and robust generalization to both in-domain and out-of-domain datasets.

[AI-31] GGBond: Growing Graph-Based AI-Agent Society for Socially-Aware Recommender Simulation

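Quick read: This paper addresses the reliance of current personalized recommender systems on static offline data for algorithm design and evaluation, which prevents them from capturing the evolution of long-term user preferences and the dynamics of social influence. The key to the solution is a high-fidelity social simulation platform that integrates human-like cognitive agents and dynamic social interactions to simulate how user behavior evolves under recommendation interventions. Its core components are Sim-User Agents with a five-layer cognitive architecture and the Intimacy–Curiosity–Reciprocity–Risk (ICR2) motivational engine, grounded in psychological and sociological theories. A multilayer heterogeneous social graph (GGBond Graph) models users' evolving social ties and trust dynamics, which enables long-term evaluation of recommender effects.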

Link: https://arxiv.org/abs/2505.21154
Authors: Hailin Zhong,Hanlin Wang,Yujun Ye,Meiyi Zhang,Shengxin Zhu
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Current personalized recommender systems predominantly rely on static offline data for algorithm design and evaluation, significantly limiting their ability to capture long-term user preference evolution and social influence dynamics in real-world scenarios. To address this fundamental challenge, we propose a high-fidelity social simulation platform integrating human-like cognitive agents and dynamic social interactions to realistically simulate user behavior evolution under recommendation interventions. Specifically, the system comprises a population of Sim-User Agents, each equipped with a five-layer cognitive architecture that encapsulates key psychological mechanisms, including episodic memory, affective state transitions, adaptive preference learning, and dynamic trust-risk assessments. In particular, we innovatively introduce the Intimacy–Curiosity–Reciprocity–Risk (ICR2) motivational engine grounded in psychological and sociological theories, enabling more realistic user decision-making processes. Furthermore, we construct a multilayer heterogeneous social graph (GGBond Graph) supporting dynamic relational evolution, effectively modeling users’ evolving social ties and trust dynamics based on interest similarity, personality alignment, and structural homophily. During system operation, agents autonomously respond to recommendations generated by typical recommender algorithms (e.g., Matrix Factorization, MultVAE, LightGCN), deciding whether to consume, rate, and share content while dynamically updating their internal states and social connections, thereby forming a stable, multi-round feedback loop. This innovative design transcends the limitations of traditional static datasets, providing a controlled, observable environment for evaluating long-term recommender effects.

[AI-32] HeteroBA: A Structure-Manipulating Backdoor Attack on Heterogeneous Graphs

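Quick read: This paper addresses the robustness and security of Heterogeneous Graph Neural Networks (HGNNs) under backdoor attacks. Existing research focuses mainly on improving HGNNs' predictive performance and has left their vulnerability to adversarial attacks underexplored. The proposed Heterogeneous Backdoor Attack (HeteroBA) framework inserts trigger nodes with realistic features and targeted structural connections, and uses attention-based and clustering-based strategies to select influential auxiliary nodes for effective trigger propagation. As a result, the model misclassifies specific nodes into a target label while maintaining accuracy on clean data. The key lies in the careful design of the trigger nodes and their propagation mechanism, which achieves the attack goal without degrading overall performance.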

Link: https://arxiv.org/abs/2505.21140
Authors: Honglin Gao,Xiang Li,Lan Zhao,Gaoxi Xiao
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Heterogeneous graph neural networks (HGNNs) have recently drawn increasing attention for modeling complex multi-relational data in domains such as recommendation, finance, and social networks. While existing research has been largely focused on enhancing HGNNs’ predictive performance, their robustness and security, especially under backdoor attacks, remain underexplored. In this paper, we propose a novel Heterogeneous Backdoor Attack (HeteroBA) framework for node classification tasks on heterogeneous graphs. HeteroBA inserts carefully crafted trigger nodes with realistic features and targeted structural connections, leveraging attention-based and clustering-based strategies to select influential auxiliary nodes for effective trigger propagation, thereby causing the model to misclassify specific nodes into a target label while maintaining accuracy on clean data. Experimental results on three datasets and various HGNN architectures demonstrate that HeteroBA achieves high attack success rates with minimal impact on the clean accuracy. Our method sheds light on potential vulnerabilities in HGNNs and calls for more robust defenses against backdoor threats in multi-relational graph scenarios.

[AI-33] Universal Value-Function Uncertainties

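Quick read: This paper addresses the estimation of epistemic uncertainty in value functions for Reinforcement Learning (RL), which matters for efficient exploration, safe decision-making, and offline RL. Deep ensembles quantify uncertainty well but are computationally expensive; single-model methods are cheaper but usually rely on heuristics and need additional propagation mechanisms for myopic uncertainty estimates. The proposed Universal Value-Function Uncertainties (UVU) quantify uncertainty as the prediction error between an online learner and a fixed, randomly initialized target network. The key is that the online network is trained by temporal difference learning on a synthetic reward derived from the target network, so the errors capture policy-conditional uncertainty about the future.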

Link: https://arxiv.org/abs/2505.21119
Authors: Moritz A. Zanger,Max Weltevrede,Yaniv Oren,Pascal R. Van der Vaart,Caroline Horsch,Wendelin Böhmer,Matthijs T. J. Spaan
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional propagation mechanisms for myopic uncertainty estimates. In this work we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as squared prediction errors between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional value uncertainty, incorporating the future uncertainties any given policy may encounter. This is due to the training procedure employed in UVU: the online network is trained using temporal difference learning with a synthetic reward derived from the fixed, randomly initialized target network. We provide an extensive theoretical analysis of our approach using neural tangent kernel (NTK) theory and show that in the limit of infinite network width, UVU errors are exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, we show that UVU achieves equal performance to large ensembles on challenging multi-task offline RL settings, while offering simplicity and substantial computational savings.
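
The following tabular sketch is one way to read the training procedure described in the abstract; it is not the authors' code. It assumes the synthetic reward for a transition `s -> s_next` is `target[s] - gamma * target[s_next]`, which makes the fixed random target its own value function, and it replaces both networks with lookup tables. The function name `uvu_tabular`, the reward form, and the three-state chain in the usage note are all assumptions.

```python
import random


def uvu_tabular(n_states, transitions, gamma=0.9, lr=0.5, sweeps=500, seed=0):
    """Squared error between a TD-trained table and a fixed random target table.

    `transitions` lists the (state, next_state) pairs present in the data; the
    online table is only ever updated on those pairs.
    """
    rng = random.Random(seed)
    # Fixed, randomly initialized target (stand-in for the target network).
    target = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]
    # Online learner (stand-in for the online network), initialized at zero.
    online = [0.0] * n_states
    for _ in range(sweeps):
        for s, s_next in transitions:
            # Assumed synthetic reward: the target is its own value function under it.
            reward = target[s] - gamma * target[s_next]
            td_error = reward + gamma * online[s_next] - online[s]
            online[s] += lr * td_error
    return [(online[s] - target[s]) ** 2 for s in range(n_states)]
```

Calling `uvu_tabular(3, [(0, 1), (1, 2), (2, 2)])` covers every state, so all errors shrink towards zero. Dropping the last transition, as in `uvu_tabular(3, [(0, 1), (1, 2)])`, leaves state 2 untrained, and its error leaks backwards with a discount at each step, so states 1 and 0 also report nonzero uncertainty. That backward propagation is the policy-conditional, future-aware behaviour the abstract contrasts with RND.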

[AI-34] Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation

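Quick read: This paper addresses the social biases that Large Vision Language Models (LVLMs) exhibit in multimodal tasks, in particular unintended associations between neutral concepts and sensitive human attributes, which lead to disparate model behavior across demographic groups. The key to the solution is an explanatory framework that combines information flow analysis with multi-round dialogue evaluation, aiming to understand the origin of social bias from the perspective of imbalanced internal information utilization. Information flow analysis identifies the image tokens that contribute most to the model's reasoning on neutral questions, and a multi-round dialogue mechanism evaluates whether these key tokens encode sensitive information. This reveals systematic differences in how the model uses information when processing images of different demographic groups.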

Link: https://arxiv.org/abs/2505.21106
Authors: Zhengyang Ji,Yifan Jia,Shang Gao,Yutao Yue
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. These biases often manifest as unintended associations between neutral concepts and sensitive human attributes, leading to disparate model behaviors across demographic groups. While existing studies primarily focus on detecting and quantifying such biases, they offer limited insight into the underlying mechanisms within the models. To address this gap, we propose an explanatory framework that combines information flow analysis with multi-round dialogue evaluation, aiming to understand the origin of social bias from the perspective of imbalanced internal information utilization. Specifically, we first identify high-contribution image tokens involved in the model’s reasoning process for neutral questions via information flow analysis. Then, we design a multi-turn dialogue mechanism to evaluate the extent to which these key tokens encode sensitive information. Extensive experiments reveal that LVLMs exhibit systematic disparities in information usage when processing images of different demographic groups, suggesting that social bias is deeply rooted in the model’s internal reasoning dynamics. Furthermore, we complement our findings from a textual modality perspective, showing that the model’s semantic representations already display biased proximity patterns, thereby offering a cross-modal explanation of bias formation.

[AI-35] Stopping Criteria for Value Iteration on Concurrent Stochastic Reachability and Safety Games

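【Quick Read】: This paper addresses the lack of precision guarantees for value iteration (VI) in zero-sum concurrent stochastic games (CSGs). VI outperforms the alternatives in practice (linear or quadratic programming for Markov decision processes and turn-based stochastic games, the existential theory of the reals for CSGs), but its traditional stopping criterion only checks that two consecutive approximations are ε-close, which gives no guarantee on the precision of the approximation. The proposed solution is bounded (interval) value iteration: it complements standard VI with a converging sequence of over-approximations and terminates once the over- and under-approximations are ε-close, which guarantees the precision of the result.

Link: https://arxiv.org/abs/2505.21087
Authors: Marta Grobelna,Jan Křetínský,Maximilian Weininger
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Full version of the corresponding LICS’25 paper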

Abstract:We consider two-player zero-sum concurrent stochastic games (CSGs) played on graphs with reachability and safety objectives. These include degenerate classes such as Markov decision processes or turn-based stochastic games, which can be solved by linear or quadratic programming; however, in practice, value iteration (VI) outperforms the other approaches and is the most implemented method. Similarly, for CSGs, this practical performance makes VI an attractive alternative to the standard theoretical solution via the existential theory of reals. VI starts with an under-approximation of the sought values for each state and iteratively updates them, traditionally terminating once two consecutive approximations are $\epsilon$-close. However, this stopping criterion lacks guarantees on the precision of the approximation, which is the goal of this work. We provide bounded (a.k.a. interval) VI for CSGs: it complements standard VI with a converging sequence of over-approximations and terminates once the over- and under-approximations are $\epsilon$-close.
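
The stopping criterion described above can be illustrated on a much simpler model than the paper's. The following is a minimal sketch, assuming a single-player Markov decision process (one of the degenerate classes the abstract mentions) with a known target and a known sink state and no other end components; the toy model `TOY_MDP` and all function names are hypothetical and are not taken from the paper. Handling real CSGs requires solving a matrix game per state and treating end components, which this sketch does not attempt.

```python
def bellman_max(values, transitions, target, sink):
    """Apply one Bellman update for the maximal reachability probability."""
    updated = {}
    for state in values:
        if state == target:
            updated[state] = 1.0
        elif state == sink:
            updated[state] = 0.0
        else:
            # Best action = highest expected value over its successor distribution.
            updated[state] = max(
                sum(prob * values[succ] for prob, succ in outcomes)
                for outcomes in transitions[state].values()
            )
    return updated


def bounded_value_iteration(transitions, target, sink, eps=1e-6):
    """Iterate a lower and an upper bound until they are eps-close everywhere."""
    states = set(transitions) | {target, sink}
    lower = {s: 1.0 if s == target else 0.0 for s in states}  # under-approximation
    upper = {s: 0.0 if s == sink else 1.0 for s in states}    # over-approximation
    while max(upper[s] - lower[s] for s in states) > eps:
        lower = bellman_max(lower, transitions, target, sink)
        upper = bellman_max(upper, transitions, target, sink)
    return lower, upper


# Hypothetical toy MDP: state 2 is the target, state 3 is a sink.
TOY_MDP = {
    0: {"safe": [(0.5, 1), (0.5, 3)], "risky": [(0.3, 2), (0.7, 3)]},
    1: {"go": [(0.8, 2), (0.2, 0)]},
}

lower, upper = bounded_value_iteration(TOY_MDP, target=2, sink=3)
print(lower[0], upper[0])
```

Unlike the classical criterion, which compares two consecutive lower bounds, the loop here stops only when the true value is provably enclosed in an interval of width at most `eps`.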

[AI-36] Efficient Large Language Model Inference with Neural Block Linearization

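【Quick Read】: This paper addresses the high computational demands of inference in transformer-based Large Language Models (LLMs). The key to the solution is Neural Block Linearization (NBL), a new framework that accelerates inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL uses Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error and uses this bound as the substitution criterion, selecting the LLM layers with the lowest linearization error, so inference efficiency improves without fine-tuning.

Link: https://arxiv.org/abs/2505.21077
Authors: Mete Erdogan,Francesco Tonin,Volkan Cevher
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: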

Abstract:The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs.
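
To make the idea of a linear minimum mean squared error (LMMSE) replacement concrete, here is a minimal sketch of fitting the estimator from paired input and output activations of a block. It assumes the textbook closed form W = C_yx C_xx^{-1} and b = mean_y - W mean_x; the function names, the small ridge term, and the synthetic data are illustrative assumptions, and the CCA-based error bound that the paper uses to choose layers is not shown.

```python
import numpy as np


def fit_lmmse(inputs, outputs, ridge=1e-8):
    """Fit outputs ~ inputs @ weight.T + bias in the least mean squared error sense."""
    mean_x = inputs.mean(axis=0)
    mean_y = outputs.mean(axis=0)
    xc = inputs - mean_x
    yc = outputs - mean_y
    n = inputs.shape[0]
    cov_xx = xc.T @ xc / n  # input covariance, shape (d_in, d_in)
    cov_yx = yc.T @ xc / n  # cross-covariance, shape (d_out, d_in)
    # A small ridge term (an assumption of this sketch) keeps the solve well conditioned.
    cov_xx_reg = cov_xx + ridge * np.eye(cov_xx.shape[0])
    weight = np.linalg.solve(cov_xx_reg, cov_yx.T).T  # equals cov_yx @ inv(cov_xx_reg)
    bias = mean_y - weight @ mean_x
    return weight, bias


def apply_linear(inputs, weight, bias):
    """Apply the fitted linear map in place of the original block."""
    return inputs @ weight.T + bias


# Synthetic calibration data standing in for recorded activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
true_w = rng.normal(size=(3, 4))
y = x @ true_w.T + 0.5

w, b = fit_lmmse(x, y)
print(np.abs(apply_linear(x, w, b) - y).max())
```

When the block is far from linear, the residual error grows, which is why a criterion for choosing which layers to replace matters.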

[AI-37] Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

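【Quick Read】: This paper addresses how to improve the reasoning ability of Large Language Models (LLMs) while reducing data and compute cost. The key to the solution is a simple distillation method based on the base model: using only 920 examples, it clearly outperforms applying reinforcement learning (RL) directly to the base model (zero-RL), which typically requires much more data and computation. Distillation increases the frequency of two advanced cognitive behaviors, Multi-Perspective Thinking or Attempting and Metacognitive Awareness, which makes the model's reasoning more flexible and improves its ability to solve complex problems.

Link: https://arxiv.org/abs/2505.21067
Authors: Xiao Hu,Xingyu Lu,Liyuan Mao,YiFan Zhang,Tianke Zhang,Bin Wen,Fan Yang,Tingting Gao,Guorui Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: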

Abstract:Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.

[AI-38] Agent-Environment Alignment via Automated Interface Generation

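【Quick Read】: This paper addresses agent-environment misalignment, that is, the mismatch between an agent's internal expectations about the effects of its actions and the actual state transitions in the environment, which significantly limits agent performance. The key to the solution is the ALIGN framework, which alleviates the misalignment by enriching the interface: it enhances both the static information of the environment and the step-wise observations returned to the agent, and achieves alignment without modifying either the agent logic or the environment code.

Link: https://arxiv.org/abs/2505.21055
Authors: Kaiming Liu,Xuanyu Lei,Ziyue Wang,Peng Li,Yang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: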

Abstract:Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with the environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate the perception and action. However, mismatches often happen between the internal expectations of the agent regarding the influence of its issued actions and the actual state transitions in the environment, a phenomenon referred to as agent-environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface still remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN, an Auto-Aligned Interface Generation framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains including embodied tasks, web navigation and tool-use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Meanwhile, ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at this https URL.

[AI-39] A domain adaptation neural network for digital twin-supported fault diagnosis

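【Quick Read】: This paper addresses two problems in deep learning-based fault diagnosis: the performance bottleneck caused by a lack of sufficient labeled data, and the significant performance drop in real applications caused by discrepancies between simulated data and real systems. The key to the solution is a fault diagnosis framework based on Domain-Adversarial Neural Networks (DANN), which transfers knowledge from simulated data (source domain) to real-world data (target domain) and thereby narrows the sim-to-real gap.

Link: https://arxiv.org/abs/2505.21046
Authors: Zhenling Chen,Haiwei Fu,Zhiguo Zeng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Preprint accepted by ICCAD 2025 at Barcelona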

Abstract:Digital twins offer a promising solution to the lack of sufficient labeled data in deep learning-based fault diagnosis by generating simulated data for model training. However, discrepancies between simulation and real-world systems can lead to a significant drop in performance when models are applied in real scenarios. To address this issue, we propose a fault diagnosis framework based on Domain-Adversarial Neural Networks (DANN), which enables knowledge transfer from simulated (source domain) to real-world (target domain) data. We evaluate the proposed framework using a publicly available robotics fault diagnosis dataset, which includes 3,600 sequences generated by a digital twin model and 90 real sequences collected from physical systems. The DANN method is compared with commonly used lightweight deep learning models such as CNN, TCN, Transformer, and LSTM. Experimental results show that incorporating domain adaptation significantly improves the diagnostic performance. For example, applying DANN to a baseline CNN model improves its accuracy from 70.00% to 80.22% on real-world test data, demonstrating the effectiveness of domain adaptation in bridging the sim-to-real gap.

[AI-40] Large Language Model-enhanced Reinforcement Learning for Low-Altitude Economy Networking

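【Quick Read】: This paper addresses the challenges that Low-Altitude Economic Networking (LAENet) faces from complex decision-making, resource constraints, and environmental uncertainty. The key to the solution is integrating Large Language Models (LLMs) into Reinforcement Learning (RL), using the generation, contextual understanding, and structured reasoning capabilities of LLMs to improve RL in information processing, reward design, decision-making, and generation tasks.

Link: https://arxiv.org/abs/2505.21045
Authors: Lingyi Cai,Ruichen Zhang,Changyuan Zhao,Yu Zhang,Jiawen Kang,Dusit Niyato,Tao Jiang,Xuemin Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures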

Abstract:Low-Altitude Economic Networking (LAENet) aims to support diverse flying applications below 1,000 meters by deploying various aerial vehicles for flexible and cost-effective aerial networking. However, complex decision-making, resource constraints, and environmental uncertainty pose significant challenges to the development of the LAENet. Reinforcement learning (RL) offers a potential solution in response to these challenges but has limitations in generalization, reward design, and model stability. The emergence of large language models (LLMs) offers new opportunities for RL to mitigate these limitations. In this paper, we first present a tutorial about integrating LLMs into RL by using the capacities of generation, contextual understanding, and structured reasoning of LLMs. We then propose an LLM-enhanced RL framework for the LAENet in terms of serving the LLM as information processor, reward designer, decision-maker, and generator. Moreover, we conduct a case study by using LLMs to design a reward function to improve the learning performance of RL in the LAENet. Finally, we provide a conclusion and discuss future work.

[AI-41] TabAttackBench: A Benchmark for Adversarial Attacks on Tabular Data

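【Quick Read】: This paper addresses the fact that existing research on adversarial attacks on tabular data focuses on attack effectiveness and overlooks the imperceptibility of perturbations. The key to the solution is a new benchmark that evaluates both the effectiveness and the imperceptibility of adversarial attacks on tabular data, with experiments across multiple models and datasets that analyze how the two interact and how they affect overall attack performance.

Link: https://arxiv.org/abs/2505.21027
Authors: Zhipeng He,Chun Ouyang,Lijie Wen,Cong Liu,Catarina Moreira
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 63 pages, 22 figures, 6 tables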

Abstract:Adversarial attacks pose a significant threat to machine learning models by inducing incorrect predictions through imperceptible perturbations to input data. While these attacks have been extensively studied in unstructured data like images, their application to tabular data presents new challenges. These challenges arise from the inherent heterogeneity and complex feature interdependencies in tabular data, which differ significantly from those in image data. To address these differences, it is crucial to consider imperceptibility as a key criterion specific to tabular data. Most current research focuses primarily on achieving effective adversarial attacks, often overlooking the importance of maintaining imperceptibility. To address this gap, we propose a new benchmark for adversarial attacks on tabular data that evaluates both effectiveness and imperceptibility. In this study, we assess the effectiveness and imperceptibility of five adversarial attacks across four models using eleven tabular datasets, including both mixed and numerical-only datasets. Our analysis explores how these factors interact and influence the overall performance of the attacks. We also compare the results across different dataset types to understand the broader implications of these findings. The findings from this benchmark provide valuable insights for improving the design of adversarial attack algorithms, thereby advancing the field of adversarial machine learning on tabular data.

[AI-42] Multi-Mode Process Control Using Multi-Task Inverse Reinforcement Learning

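【Quick Read】: This paper addresses a limitation of conventional Reinforcement Learning (RL) in process systems engineering under Industry 4.0 and smart manufacturing: its dependence on accurate digital twins and well-designed reward functions. The key to the solution is a new framework that integrates Inverse Reinforcement Learning (IRL) with multi-task learning. It uses historical closed-loop data as expert demonstrations to extract optimal reward functions and control policies, and uses a latent-context variable to distinguish modes, enabling data-driven control design for multi-mode data.

Link: https://arxiv.org/abs/2505.21026
Authors: Runze Lin,Junghui Chen,Biao Huang,Lei Xie,Hongye Su
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: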

Abstract:In the era of Industry 4.0 and smart manufacturing, process systems engineering must adapt to digital transformation. While reinforcement learning offers a model-free approach to process control, its applications are limited by the dependence on accurate digital twins and well-designed reward functions. To address these limitations, this paper introduces a novel framework that integrates inverse reinforcement learning (IRL) with multi-task learning for data-driven, multi-mode control design. Using historical closed-loop data as expert demonstrations, IRL extracts optimal reward functions and control policies. A latent-context variable is incorporated to distinguish modes, enabling the training of mode-specific controllers. Case studies on a continuous stirred tank reactor and a fed-batch bioreactor validate the effectiveness of this framework in handling multi-mode data and training adaptable controllers.

[AI-43] Text-Queried Audio Source Separation via Hierarchical Modeling

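【Quick Read】: This paper addresses target audio source separation with natural language queries. The core challenges are the difficulty of jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly learned single-stage architecture, and the reliance on large-scale, accurately labeled training data to compensate for inefficient cross-modal learning and separation. The key to the solution is HSM-TSS, a hierarchical decomposition framework that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction, enabling more efficient cross-modal alignment and separation. Its core innovations are a dual-stage semantic separation mechanism and an instruction processing pipeline, which enable flexible sound manipulation and improve semantic consistency in complex auditory scenes.

Link: https://arxiv.org/abs/2505.21025
Authors: Xinlei Yin,Xiulian Peng,Xue Jiang,Zhiwei Xiong,Yan Lu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: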

Abstract:Target audio source separation with natural language queries presents a promising paradigm for extracting arbitrary audio events through arbitrary text descriptions. Existing methods mainly face two challenges, the difficulty in jointly modeling acoustic-textual alignment and semantic-aware separation within a blindly-learned single-stage architecture, and the reliance on large-scale accurately-labeled training data to compensate for inefficient cross-modal learning and separation. To address these challenges, we propose a hierarchical decomposition framework, HSM-TSS, that decouples the task into global-local semantic-guided feature separation and structure-preserving acoustic reconstruction. Our approach introduces a dual-stage mechanism for semantic separation, operating on distinct global and local semantic feature spaces. We first perform global-semantic separation through a global semantic feature space aligned with text queries. A Q-Audio architecture is employed to align audio and text modalities, serving as pretrained global-semantic encoders. Conditioned on the predicted global feature, we then perform the second-stage local-semantic separation on AudioMAE features that preserve time-frequency structures, followed by acoustic reconstruction. We also propose an instruction processing pipeline to parse arbitrary text queries into structured operations, extraction or removal, coupled with audio descriptions, enabling flexible sound manipulation. Our method achieves state-of-the-art separation performance with data-efficient training while maintaining superior semantic consistency with queries in complex auditory scenes.

[AI-44] Federated Instrumental Variable Analysis via Federated Generalized Method of Moments

【Quick Read】: This paper addresses how to perform effective estimation for Instrumental Variables (IV) analysis in high-dimensional settings with non-i.i.d. data spread across decentralized clients. The key to the solution is Federated Instrumental Variables Analysis (FedIV), trained via the Federated Generalized Method of Moments (FedGMM): FedGMM is formulated as a federated zero-sum game defined by a federated non-convex non-concave minimax optimization problem and solved with the federated gradient descent ascent (FedGDA) algorithm.

Link: https://arxiv.org/abs/2505.21012
Authors: Geetika,Somya Tyagi,Bapi Chatterjee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 28 pages, 3 figures, 1 table

Abstract:Instrumental variables (IV) analysis is an important applied tool for areas such as healthcare and consumer economics. For IV analysis in high-dimensional settings, the Generalized Method of Moments (GMM) using deep neural networks offers an efficient approach. With non-i.i.d. data sourced from scattered decentralized clients, federated learning is a popular paradigm for training the models while promising data privacy. However, to our knowledge, no federated algorithm for either GMM or IV analysis exists to date. In this work, we introduce federated instrumental variables analysis (FedIV) via federated generalized method of moments (FedGMM). We formulate FedGMM as a federated zero-sum game defined by a federated non-convex non-concave minimax optimization problem, which is solved using federated gradient descent ascent (FedGDA) algorithm. One key challenge arises in theoretically characterizing the federated local optimality. To address this, we present properties and existence results of clients’ local equilibria via FedGDA limit points. Thereby, we show that the federated solution consistently estimates the local moment conditions of every participating client. The proposed algorithm is backed by extensive experiments to demonstrate the efficacy of our approach.

[AI-45] BIPNN: Learning to Solve Binary Integer Programming via Hypergraph Neural Networks

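【Quick Read】: This paper addresses nonlinear Binary Integer Programming (BIP) problems, which have important applications in scientific domains that require discrete decision-making. Existing neural solvers lack scalability on nonlinear problems, while state-of-the-art Branch-and-Cut solvers rely on linear relaxations, which leads to exponential growth in auxiliary variables and severe computational limitations. The key to the solution is BIPNN, an unsupervised learning framework based on hypergraph neural networks (HyperGNN) that reformulates constrained, discrete, nonlinear BIP problems into unconstrained, differentiable, polynomial loss functions and optimizes them end to end. The method exploits a precise one-to-one mapping between polynomial BIP objectives and hypergraph structures and, combined with a GPU-accelerated, continuous-annealing-enhanced training pipeline, lets BIPNN optimize large-scale nonlinear terms fully in parallel, significantly reducing training cost while producing discrete, high-quality solutions.

Link: https://arxiv.org/abs/2505.20997
Authors: Sen Bai,Chunqi Yang,Xin Bai,Xin Zhang,Zhengang Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: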

Abstract:Binary (0-1) integer programming (BIP) is pivotal in scientific domains requiring discrete decision-making. With the advance of AI computing, recent works explore neural network-based solvers for integer linear programming (ILP) problems. Yet, they lack scalability for tackling nonlinear challenges. To handle nonlinearities, state-of-the-art Branch-and-Cut solvers employ linear relaxations, leading to exponential growth in auxiliary variables and severe computation limitations. To overcome these limitations, we propose BIPNN (Binary Integer Programming Neural Network), an unsupervised learning framework to solve nonlinear BIP problems via hypergraph neural networks (HyperGNN). Specifically, BIPNN reformulates BIPs, which are constrained, discrete, and nonlinear (sin, log, exp) optimization problems, into unconstrained, differentiable, and polynomial loss functions. The reformulation stems from the observation of a precise one-to-one mapping between polynomial BIP objectives and hypergraph structures, enabling the unsupervised training of HyperGNN to optimize BIP problems in an end-to-end manner. On this basis, we propose a GPU-accelerated and continuous-annealing-enhanced training pipeline for BIPNN. The pipeline enables BIPNN to optimize large-scale nonlinear terms in BIPs fully in parallel via straightforward gradient descent, thus significantly reducing the training cost while ensuring the generation of discrete, high-quality solutions. Extensive experiments on synthetic and real-world datasets highlight the superiority of our approach.
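
The mapping between polynomial objectives and hypergraphs mentioned above can be pictured with a small example: each monomial becomes a weighted hyperedge over the variables it multiplies. The sketch below is only an illustration under that reading; the example polynomial, the `HYPEREDGES` dictionary, and the function names are hypothetical, and neither the HyperGNN nor the paper's treatment of sin, log, or exp terms is shown. The same function evaluates the objective on binary assignments and on relaxed values in [0, 1], which is what makes it usable as a differentiable loss.

```python
from itertools import product

# Hypothetical objective f(x) = 2*x0*x1*x2 - 3*x1*x3 + x0.
# Each monomial is stored as a hyperedge (set of variable indices) with a weight.
HYPEREDGES = {
    frozenset({0, 1, 2}): 2.0,
    frozenset({1, 3}): -3.0,
    frozenset({0}): 1.0,
}


def polynomial_loss(assignment, hyperedges):
    """Sum over hyperedges of weight times the product of the incident variables."""
    total = 0.0
    for nodes, weight in hyperedges.items():
        term = weight
        for node in nodes:
            term *= assignment[node]
        total += term
    return total


def brute_force_minimum(num_vars, hyperedges):
    """Enumerate all binary assignments; feasible only for tiny instances."""
    return min(
        product((0, 1), repeat=num_vars),
        key=lambda bits: polynomial_loss(bits, hyperedges),
    )


best = brute_force_minimum(4, HYPEREDGES)
print(best, polynomial_loss(best, HYPEREDGES))
print(polynomial_loss((0.5, 0.5, 0.5, 0.5), HYPEREDGES))  # relaxed evaluation
```

A learned solver would replace the brute-force search with gradient descent on the relaxed values, followed by rounding to a binary assignment.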

[AI-46] MelodySim: Measuring Melody-aware Music Similarity for Plagiarism Detection

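【Quick Read】: This paper addresses melodic similarity detection between musical works for plagiarism detection. The key to the solution is MelodySim, a dataset focused on melodic similarity, together with a segment-wise melodic-similarity detection model. The model uses a MERT encoder and a triplet neural network to capture melodic similarity, and its decision matrix highlights where plagiarism might occur.

Link: https://arxiv.org/abs/2505.20979
Authors: Tongyu Lu,Charlotta-Marlena Geist,Jan Melechovsky,Abhinaba Roy,Dorien Herremans
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: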

Abstract:We propose MelodySim, a melody-aware music similarity model and dataset for plagiarism detection. First, we introduce a novel method to construct a dataset with focus on melodic similarity. By augmenting Slakh2100, an existing MIDI dataset, we generate variations of each piece while preserving the melody through modifications such as note splitting, arpeggiation, minor track dropout (excluding bass), and re-instrumentation. A user study confirms that positive pairs indeed contain similar melodies, with other musical tracks significantly changed. Second, we develop a segment-wise melodic-similarity detection model that uses a MERT encoder and applies a triplet neural network to capture melodic similarity. The resultant decision matrix highlights where plagiarism might occur. Our model achieves high accuracy on the MelodySim test set.
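
A triplet network is typically trained so that an anchor segment lies closer to a melodically similar segment than to a dissimilar one. The following is a minimal sketch of the standard margin-based triplet loss on plain embedding vectors; the Euclidean distance, the margin value, and the toy vectors are assumptions of this sketch, since the abstract does not state which distance or margin the model uses.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin` farther than the positive."""
    dist_pos = float(np.linalg.norm(anchor - positive))
    dist_neg = float(np.linalg.norm(anchor - negative))
    return max(0.0, dist_pos - dist_neg + margin)


# Toy 2-D embeddings standing in for encoder outputs of audio segments.
anchor = np.array([0.0, 0.0])
similar = np.array([0.0, 1.0])     # same melody, different arrangement
different = np.array([3.0, 4.0])   # unrelated melody

print(triplet_loss(anchor, similar, different))  # well separated
print(triplet_loss(anchor, different, similar))  # violated ordering
```

In practice the loss is averaged over many triplets, and the embeddings come from the trainable layers on top of the encoder.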

[AI-47] Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

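【Quick Read】: This paper addresses the difficulty Foundation Models (FMs) have in accurately capturing stakeholder requirements during software development. The key to the solution is AlignMind, an FM-powered multi-agent system whose cognitive architecture enhances FMs with Theory-of-Mind capabilities so that they consider the mental states and perspectives of software makers. It iteratively clarifies stakeholders' beliefs, desires, and intentions and translates them into refined requirements and a corresponding actionable natural language workflow.

Link: https://arxiv.org/abs/2505.20973
Authors: Keheliya Gallaba,Ali Arabat,Dayi Lin,Mohammed Sayagh,Ahmed E. Hassan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: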

Abstract:Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM-powered multi-agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory-of-Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often-overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step-by-step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent-first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.

[AI-48] Deep k-grouping: An Unsupervised Learning Framework for Combinatorial Optimization on Graphs and Hypergraphs

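【Quick Read】: This paper addresses the poor performance of existing unsupervised neural network solvers on k-grouping problems (such as coloring and partitioning) on large-scale graphs and hypergraphs. The key to the solution is Deep k-grouping, an unsupervised learning-based combinatorial optimization framework. Its core is a novel one-hot encoded polynomial unconstrained binary optimization (OH-PUBO) formulation for modeling k-grouping problems, combined with GPU-accelerated algorithms for large-scale problems and a Gini coefficient-based continuous relaxation annealing strategy that enforces discreteness of solutions while preventing convergence to local optima.

Link: https://arxiv.org/abs/2505.20972
Authors: Sen Bai,Chunqi Yang,Xin Bai,Xin Zhang,Zhengang Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: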

Abstract:Along with AI computing shining in scientific discovery, its potential in the combinatorial optimization (CO) domain has also emerged in recent years. Yet, existing unsupervised neural network solvers struggle to solve k-grouping problems (e.g., coloring, partitioning) on large-scale graphs and hypergraphs, due to limited computational frameworks. In this work, we propose Deep k-grouping, an unsupervised learning-based CO framework. Specifically, we contribute: (1) Novel one-hot encoded polynomial unconstrained binary optimization (OH-PUBO), a formulation for modeling k-grouping problems on graphs and hypergraphs (e.g., graph/hypergraph coloring and partitioning); (2) GPU-accelerated algorithms for large-scale k-grouping CO problems. Deep k-grouping employs the relaxation of large-scale OH-PUBO objectives as differentiable loss functions and trains to optimize them in an unsupervised manner. To ensure scalability, it leverages GPU-accelerated algorithms to unify the training pipeline; (3) A Gini coefficient-based continuous relaxation annealing strategy to enforce discreteness of solutions while preventing convergence to local optima. Experimental results demonstrate that Deep k-grouping outperforms existing neural network solvers and classical heuristics such as SCIP and Tabu.
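
One way to picture a one-hot polynomial objective for a k-grouping problem is graph coloring: each node carries a length-k vector, and every edge contributes the inner product of its endpoints' vectors, which counts same-color conflicts when the vectors are one-hot and is differentiable when they are probabilities. The sketch below uses this standard edge-conflict polynomial as an assumption; it is not claimed to be the paper's exact OH-PUBO formulation, and the triangle graph and function name are hypothetical.

```python
def coloring_conflict_loss(node_vectors, edges):
    """Sum over edges of the inner product between the endpoints' color vectors."""
    total = 0.0
    for u, v in edges:
        total += sum(pu * pv for pu, pv in zip(node_vectors[u], node_vectors[v]))
    return total


TRIANGLE = [(0, 1), (1, 2), (0, 2)]

proper = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # three distinct colors
monochrome = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]  # every edge is a conflict
uniform = [[1 / 3] * 3 for _ in range(3)]       # fully relaxed assignment

print(coloring_conflict_loss(proper, TRIANGLE))
print(coloring_conflict_loss(monochrome, TRIANGLE))
print(coloring_conflict_loss(uniform, TRIANGLE))
```

The relaxed value for the uniform assignment sits between the two discrete extremes, which is why an annealing term that pushes vectors toward one-hot is needed before reading off a solution.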

[AI-49] Efficient and Microphone-Fault-Tolerant 3D Sound Source Localization INTERSPEECH2025

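【Quick Read】: This paper addresses problems that Sound Source Localization (SSL) faces in complex environments, such as high computational cost and strict calibration requirements, which limit the deployment of existing methods in dynamic or resource-constrained environments. The key to the solution is a new 3D SSL framework that combines sparse cross-attention, pretraining, and adaptive signal coherence metrics to achieve accurate and computationally efficient localization with fewer input microphones. The framework is also fault-tolerant to unreliable or even unknown microphone position inputs, which improves robustness in real-world scenarios.

Link: https://arxiv.org/abs/2505.20961
Authors: Yiyuan Yang,Shitong Xu,Niki Trigoni,Andrew Markham
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025 Conference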

Abstract:Sound source localization (SSL) is a critical technology for determining the position of sound sources in complex environments. However, existing methods face challenges such as high computational costs and precise calibration requirements, limiting their deployment in dynamic or resource-constrained environments. This paper introduces a novel 3D SSL framework, which uses sparse cross-attention, pretraining, and adaptive signal coherence metrics, to achieve accurate and computationally efficient localization with fewer input microphones. The framework is also fault-tolerant to unreliable or even unknown microphone position inputs, ensuring its applicability in real-world scenarios. Preliminary experiments demonstrate its scalability for multi-source localization without requiring additional hardware. This work advances SSL by balancing the model’s performance and efficiency and improving its robustness for real-world scenarios.

[AI-50] Hybrid Disagreement-Diversity Active Learning for Bioacoustic Sound Event Detection

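【Quick Read】: This paper addresses practical challenges in developing and training models for bioacoustic sound event detection (BioSED): limited annotated data, sparse events, species diversity, and class imbalance. The key to the solution is applying mismatch-first farthest-traversal (MFFT), an active learning method that integrates committee voting disagreement with diversity analysis, to improve model performance efficiently under a limited labeling budget. Experiments show that MFFT reaches an mAP of 68% when cold-starting and 71% when warm-starting, close to the fully supervised mAP of 75%, while using only 2.3% of the annotations. It performs particularly well in cold-start scenarios and on rare species, which gives it practical value.

Link: https://arxiv.org/abs/2505.20956
Authors: Shiqi Zhang,Tuomas Virtanen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, accepted by EUSIPCO 2025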

Abstract:Bioacoustic sound event detection (BioSED) is crucial for biodiversity conservation but faces practical challenges during model development and training: limited amounts of annotated data, sparse events, species diversity, and class imbalance. To address these challenges efficiently with a limited labeling budget, we apply the mismatch-first farthest-traversal (MFFT), an active learning method integrating committee voting disagreement and diversity analysis. We also refine an existing BioSED dataset specifically for evaluating active learning algorithms. Experimental results demonstrate that MFFT achieves a mAP of 68% when cold-starting and 71% when warm-starting (which is close to the fully-supervised mAP of 75%) while using only 2.3% of the annotations. Notably, MFFT excels in cold-start scenarios and with rare species, which are critical for monitoring endangered species, demonstrating its practical value.

[AI-51] Streamlining Knowledge Graph Creation with PyRML

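【Quick Read】: This paper addresses the engineering difficulty of building Knowledge Graphs (KGs) in a scalable and reusable way, in particular the complexity and inefficiency of integrating heterogeneous data. The key to the solution is PyRML, a lightweight, Python-native library that supports core RML (RDF Mapping Language) constructs and provides a programmable interface for authoring, executing, and testing mappings directly within Python environments. It integrates with popular data and semantic web libraries, lowering the barrier to entry for KG creation and fostering reproducible data integration.

Link: https://arxiv.org/abs/2505.20949
Authors: Andrea Giovanni Nuzzolese
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: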

Abstract:Knowledge Graphs (KGs) are increasingly adopted as a foundational technology for integrating heterogeneous data in domains such as climate science, cultural heritage, and the life sciences. Declarative mapping languages like R2RML and RML have played a central role in enabling scalable and reusable KG construction, offering a transparent means of transforming structured and semi-structured data into RDF. In this paper, we present PyRML, a lightweight, Python-native library for building Knowledge Graphs through declarative mappings. PyRML supports core RML constructs and provides a programmable interface for authoring, executing, and testing mappings directly within Python environments. It integrates with popular data and semantic web libraries (e.g., Pandas and RDFlib), enabling transparent and modular workflows. By lowering the barrier to entry for KG creation and fostering reproducible, ontology-aligned data integration, PyRML bridges the gap between declarative semantics and practical KG engineering.

[AI-52] Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs

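【Quick Read】: This paper addresses the lack of controllability when generating hypotheses for abductive reasoning in knowledge graphs: on large-scale knowledge graphs, a single observation may yield many plausible but redundant or irrelevant hypotheses, which reduces practical utility. The key to the solution is CtrlHGen, a controllable logical hypothesis generation framework trained in two stages (supervised learning followed by reinforcement learning). To handle the two main challenges, hypothesis space collapse and hypothesis oversensitivity, the framework introduces a dataset augmentation strategy based on sub-logical decomposition and smoothed semantic rewards, together with a condition-adherence reward, so that the model learns complex logical structures and its outputs follow user-specified control constraints.

Link: https://arxiv.org/abs/2505.20948
Authors: Yisen Gao,Jiaxin Bai,Tianshi Zheng,Qingyun Sun,Ziwei Zhang,Jianxin Li,Yangqiu Song,Xingcheng Fu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review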

Abstract:Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities, with broad applications in areas such as clinical diagnosis and scientific discovery. However, due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses on large-scale knowledge graphs. To address this limitation, we introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning. This task faces two key challenges when controlling for generating long and complex logical hypotheses: hypothesis space collapse and hypothesis oversensitivity. To address these challenges, we propose CtrlHGen, a Controllable logical Hypothesis Generation framework for abductive reasoning over knowledge graphs, trained in a two-stage paradigm including supervised learning and subsequent reinforcement learning. To mitigate hypothesis space collapse, we design a dataset augmentation strategy based on sub-logical decomposition, enabling the model to learn complex logical structures by leveraging semantic patterns in simpler components. To address hypothesis oversensitivity, we incorporate smoothed semantic rewards including Dice and Overlap scores, and introduce a condition-adherence reward to guide the generation toward user-specified control constraints. Extensive experiments on three benchmark datasets demonstrate that our model not only better adheres to control conditions but also achieves superior semantic similarity performance compared to baselines.
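
The Dice and Overlap scores named above are standard set-similarity measures, and a small example shows why they give smoother feedback than exact match. The sketch below assumes they are computed between the set of entities a hypothesis concludes and the set of observed entities; that pairing, the empty-set convention, and the toy entity IDs are assumptions of this sketch rather than details given in the abstract.

```python
def dice_score(predicted, observed):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|); defined as 1.0 when both sets are empty."""
    if not predicted and not observed:
        return 1.0
    return 2 * len(predicted & observed) / (len(predicted) + len(observed))


def overlap_score(predicted, observed):
    """Overlap coefficient: |A & B| / min(|A|, |B|); defined as 0.0 if either set is empty."""
    if not predicted or not observed:
        return 0.0
    return len(predicted & observed) / min(len(predicted), len(observed))


# Toy entity IDs: what a hypothesis concludes versus what was observed.
concluded = {1, 2, 3}
observed = {2, 3, 4, 5}

print(dice_score(concluded, observed))
print(overlap_score(concluded, observed))
```

Both scores change gradually as the sets drift apart, so a hypothesis that is nearly right still earns partial reward.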

[AI-53] Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

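[Quick Read]: This paper addresses the challenge of environment modeling in multi-agent reinforcement learning (MARL), specifically how to improve the sample efficiency of policy learning when the joint action space grows exponentially and multi-agent dynamics are highly uncertain. The key to the solution is sequential agent modeling, which reduces modeling complexity from the joint state-action transition dynamics to the state space alone at each timestep, progressively resolving uncertainty while capturing structured dependencies among agents. This procedure matches the reverse process of diffusion models, which leads to DIMA, a flexible and robust diffusion-based multi-agent world model that achieves state-of-the-art performance on multiple benchmarks.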

Link: https://arxiv.org/abs/2505.20922
Authors: Yang Zhang,Xinran Li,Jianing Ye,Delin Qu,Shuang Qiu,Chongjie Zhang,Xiu Li,Chenjia Bai
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:World models have recently attracted growing interest in Multi-Agent Reinforcement Learning (MARL) due to their ability to improve sample efficiency for policy learning. However, accurately modeling environments in MARL is challenging due to the exponentially large joint action space and highly uncertain dynamics inherent in multi-agent systems. To address this, we reduce modeling complexity by shifting from jointly modeling the entire state-action transition dynamics to focusing on the state space alone at each timestep through sequential agent modeling. Specifically, our approach enables the model to progressively resolve uncertainty while capturing the structured dependencies among agents, providing a more accurate representation of how agents influence the state. Interestingly, this sequential revelation of agents’ actions in a multi-agent system aligns with the reverse process in diffusion models, a class of powerful generative models known for their expressiveness and training stability compared to autoregressive or latent variable models. Leveraging this insight, we develop a flexible and robust world model for MARL using diffusion models. Our method, Diffusion-Inspired Multi-Agent world model (DIMA), achieves state-of-the-art performance across multiple multi-agent control benchmarks, significantly outperforming prior world models in terms of final return and sample efficiency, including MAMuJoCo and Bi-DexHands. DIMA establishes a new paradigm for constructing multi-agent world models, advancing the frontier of MARL research.

[AI-54] Humble AI in the real-world: the case of algorithmic hiring

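[Quick Read]: This paper tackles fairness problems in algorithmic hiring that arise from misrecognition and stereotyping and that are hard to assess with standard fairness and trust frameworks. The key to the solution is applying Humble AI principles: uncertainty quantification, entropy estimates, and a user experience design that highlights algorithmic unknowns turn the Humble AI idea into a technically feasible practice.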

Link: https://arxiv.org/abs/2505.20918
Authors: Rahul Nair,Inge Vejsbjerg,Elizabeth Daly,Christos Varytimidis,Bran Knowles
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: CHIWORK '25, Symposium on Human-Computer Interaction for Work, June 23–25, 2025, Amsterdam, Netherlands Late Breaking Work

Abstract:Humble AI (Knowles et al., 2023) argues for cautiousness in AI development and deployments through scepticism (accounting for limitations of statistical learning), curiosity (accounting for unexpected outcomes), and commitment (accounting for multifaceted values beyond performance). We present a real-world case study for humble AI in the domain of algorithmic hiring. Specifically, we evaluate virtual screening algorithms in a widely used hiring platform that matches candidates to job openings. There are several challenges in misrecognition and stereotyping in such contexts that are difficult to assess through standard fairness and trust frameworks; e.g., someone with a non-traditional background is less likely to rank highly. We demonstrate technical feasibility of how humble AI principles can be translated to practice through uncertainty quantification of ranks, entropy estimates, and a user experience that highlights algorithmic unknowns. We describe preliminary discussions with focus groups made up of recruiters. Future user studies seek to evaluate whether the higher cognitive load of a humble AI system fosters a climate of trust in its outcomes.
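
The abstract mentions entropy estimates over candidate ranks without giving details. As a minimal sketch, assuming the ranker can be resampled (for example by bootstrapping) to obtain several plausible ranks for one candidate, the Shannon entropy of those sampled ranks gives a simple measure of how unsure the ranking is. The function name `rank_entropy` and the sampling setup are assumptions, not the paper's method.

```python
import math
from collections import Counter


def rank_entropy(sampled_ranks):
    """Shannon entropy (in bits) of the empirical distribution of sampled ranks."""
    counts = Counter(sampled_ranks)
    total = len(sampled_ranks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical resampled ranks for two candidates.
stable_candidate = [1, 1, 1, 1]      # always ranked first: no uncertainty
unstable_candidate = [1, 2, 3, 4]    # rank changes on every resample
entropies = (rank_entropy(stable_candidate), rank_entropy(unstable_candidate))
```

A high value flags a candidate whose position depends heavily on sampling noise, which is the kind of algorithmic unknown an interface could surface to recruiters.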

[AI-55] Reinforcement Learning-based Sequential Route Recommendation for System-Optimal Traffic Assignment

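[Quick Read]: This paper asks whether personalized route recommendation can achieve system-optimal (SO) traffic assignment in a transportation system. The key to the solution is a learning-based framework that reformulates the static SO traffic assignment problem as a single-agent deep reinforcement learning (DRL) task: a central agent sequentially recommends routes as origin-destination (OD) demands arrive, to minimize total system travel time. The method combines the iterative structure of traditional traffic assignment methods with a deep Q-learning algorithm, improving learning efficiency and solution quality.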

Link: https://arxiv.org/abs/2505.20889
Authors: Leizhen Wang,Peibo Duan,Cheng Lyu,Zhenliang Ma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern navigation systems and shared mobility platforms increasingly rely on personalized route recommendations to improve individual travel experience and operational efficiency. However, a key question remains: can such sequential, personalized routing decisions collectively lead to system-optimal (SO) traffic assignment? This paper addresses this question by proposing a learning-based framework that reformulates the static SO traffic assignment problem as a single-agent deep reinforcement learning (RL) task. A central agent sequentially recommends routes to travelers as origin-destination (OD) demands arrive, to minimize total system travel time. To enhance learning efficiency and solution quality, we develop an MSA-guided deep Q-learning algorithm that integrates the iterative structure of traditional traffic assignment methods into the RL training process. The proposed approach is evaluated on both the Braess and Ortuzar-Willumsen (OW) networks. Results show that the RL agent converges to the theoretical SO solution in the Braess network and achieves only a 0.35% deviation in the OW network. Further ablation studies demonstrate that the route action set’s design significantly impacts convergence speed and final performance, with SO-informed route sets leading to faster learning and better outcomes. This work provides a theoretically grounded and practically relevant approach to bridging individual routing behavior with system-level efficiency through learning-based sequential assignment.
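
The abstract refers to the iterative structure of traditional traffic assignment, and MSA is the method of successive averages. The sketch below is the classical MSA applied to a system-optimal assignment on a toy network, not the paper's MSA-guided deep Q-learning. It assumes a hypothetical two-route network with link travel times t1(x) = 10 + 2x and t2(x) = 20 + x, so the marginal costs of total travel time are 10 + 4x and 20 + 2x.

```python
def marginal_costs(flows):
    """Marginal total-travel-time cost of each route in the toy network."""
    x1, x2 = flows
    return [10 + 4 * x1, 20 + 2 * x2]


def msa_system_optimal(demand=10.0, iterations=1000):
    """Method of successive averages driven by marginal costs."""
    flows = [0.0, 0.0]
    for k in range(1, iterations + 1):
        costs = marginal_costs(flows)
        best = costs.index(min(costs))
        # All-or-nothing assignment: send the full demand to the cheapest route.
        target = [demand if i == best else 0.0 for i in range(2)]
        # Averaging step with step size 1/k.
        flows = [x + (y - x) / k for x, y in zip(flows, target)]
    return flows


so_flows = msa_system_optimal()
```

For this toy network the marginal costs are equal when both routes carry 5 units, so the averaged flows approach an even split as the step size shrinks.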

[AI-56] Generalizable Heuristic Generation Through Large Language Models with Meta-Optimization

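[Quick Read]: This paper addresses a limitation of existing heuristic design methods for combinatorial optimization problems (COPs): they rely on manually predefined evolutionary computation (EC) optimizers and single-task training schemes, which restricts the exploration of diverse heuristics and limits generalization. The key to the solution is Meta-Optimization of Heuristics (MoH), a framework that operates at the optimizer level following meta-learning principles. It uses large language models (LLMs) to iteratively refine a meta-optimizer that autonomously constructs diverse optimizers, removing the dependence on a predefined EC optimizer, and it uses multi-task training to improve generalization.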

Link: https://arxiv.org/abs/2505.20881
Authors: Yiding Shi,Jianan Zhou,Wen Song,Jieyi Bi,Yaoxin Wu,Jie Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Heuristic design with large language models (LLMs) has emerged as a promising approach for tackling combinatorial optimization problems (COPs). However, existing approaches often rely on manually predefined evolutionary computation (EC) optimizers and single-task training schemes, which may constrain the exploration of diverse heuristic algorithms and hinder the generalization of the resulting heuristics. To address these issues, we propose Meta-Optimization of Heuristics (MoH), a novel framework that operates at the optimizer level, discovering effective optimizers through the principle of meta-learning. Specifically, MoH leverages LLMs to iteratively refine a meta-optimizer that autonomously constructs diverse optimizers through (self-)invocation, thereby eliminating the reliance on a predefined EC optimizer. These constructed optimizers subsequently evolve heuristics for downstream tasks, enabling broader heuristic exploration. Moreover, MoH employs a multi-task training scheme to promote its generalization capability. Experiments on classic COPs demonstrate that MoH constructs an effective and interpretable meta-optimizer, achieving state-of-the-art performance across various downstream tasks, particularly in cross-size settings.

[AI-57] Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving

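[Quick Read]: This paper addresses the logical reasoning and computational errors that large language models (LLMs) may make when solving mathematical problems. The key to the solution is MATH-VF, a framework consisting of a Formalizer and a Critic. The Formalizer translates a natural language solution into a formal context, and the Critic uses external tools (such as a computer algebra system and an SMT solver) to evaluate the correctness of each statement in that formal context and to provide corrective feedback when it finds an error.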

Link: https://arxiv.org/abs/2505.20869
Authors: Kuo Zhou,Lu Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.

[AI-58] Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech INTERSPEECH

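[Quick Read]: This paper addresses the challenge of synthesizing high-quality expressive speech, particularly with respect to style transfer and speech quality. The key to the solution is Spotlight-TTS, which focuses on the style information in speech through voiced-aware style extraction and style direction adjustment, and optimizes how that style is integrated into the text-to-speech (TTS) model, improving both expressiveness and quality.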

Link: https://arxiv.org/abs/2505.20868
Authors: Nam-Gyu Kim,Deok-Hyeon Cho,Seung-Bin Kim,Seong-Whan Lee
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Submitted to Interspeech

Abstract:Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embedding extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose Spotlight-TTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that Spotlight-TTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability. Our audio samples are publicly available.

[AI-59] Respond to Change with Constancy: Instruction-tuning with LLM for Non-I.I.D. Network Traffic Classification

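[Quick Read]: This paper addresses challenges in encrypted traffic classification, in particular the distribution shift caused by reliance on the closed-world assumption and the dependence on labeled data, which limits applicability where such data is scarce or unavailable. The key to the solution is a new traffic representation model, Encrypted Traffic Out-of-Distribution Instruction Tuning with LLM (ETooL), which combines a large language model (LLM) with knowledge of traffic structures through a self-supervised instruction tuning paradigm and links textual information to traffic interactions, improving classification performance and generalization.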

Link: https://arxiv.org/abs/2505.20866
Authors: Xinjie Lin,Gang Xiong,Gaopeng Gou,Wenqi Dong,Jing Yu,Zhen Li,Wei Xia
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: IEEE Transactions on Information Forensics and Security (TIFS) camera ready, 15 pages, 6 figures, 7 tables

Abstract:Encrypted traffic classification is highly challenging in network security due to the need for extracting robust features from content-agnostic traffic data. Existing approaches face critical issues: (i) Distribution drift, caused by reliance on the closed-world assumption, limits adaptability to real-world, shifting patterns; (ii) Dependence on labeled data restricts applicability where such data is scarce or unavailable. Large language models (LLMs) have demonstrated remarkable potential in offering generalizable solutions across a wide range of tasks, achieving notable success in various specialized fields. However, their effectiveness in traffic analysis remains constrained by challenges in adapting to the unique requirements of the traffic domain. In this paper, we introduce a novel traffic representation model named Encrypted Traffic Out-of-Distribution Instruction Tuning with LLM (ETooL), which integrates LLMs with knowledge of traffic structures through a self-supervised instruction tuning paradigm. This framework establishes connections between textual information and traffic interactions. ETooL demonstrates more robust classification performance and superior generalization in both supervised and zero-shot traffic classification tasks. Notably, it achieves significant improvements in F1 scores: APP53 (I.I.D.) to 93.19%(6.62%) and 92.11%(4.19%), APP53 (O.O.D.) to 74.88%(18.17%) and 72.13%(15.15%), and ISCX-Botnet (O.O.D.) to 95.03%(9.16%) and 81.95%(12.08%). Additionally, we construct NETD, a traffic dataset designed to support dynamic distributional shifts, and use it to validate ETooL’s effectiveness under varying distributional conditions. Furthermore, we evaluate the efficiency gains achieved through ETooL’s instruction tuning approach.

[AI-60] Cooperation of Experts: Fusing Heterogeneous Information with Large Margin ICML2025

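[Quick Read]: This paper addresses the persistent challenge of fusing heterogeneous multi-source information in modern data analysis, in particular the failure of existing methods to account for the inherent heterogeneity of object patterns across different semantic spaces. The key to the solution is the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks, overcoming modality and connection differences to capture the intricate structure of real-world complex data. Within the framework, dedicated encoders act as domain experts that learn relational patterns in specific semantic spaces, and they cooperate through a large margin mechanism supported by a tailored optimization strategy to improve robustness and extract complementary knowledge.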

Link: https://arxiv.org/abs/2505.20853
Authors: Shuo Wang,Shunyang Huang,Jinghui Yuan,Zhixiang Shen,Zhao Kang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in ICML 2025

Abstract:Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework’s feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at this https URL.

[AI-61] MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

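[Quick Read]: This paper addresses the safety of large language models (LLMs) deployed in healthcare, especially within collaborative multi-agent configurations. The key contribution is the MedSentry benchmark, which contains 5,000 adversarial medical prompts covering 25 threat categories, together with an end-to-end attack-defense evaluation pipeline that systematically analyzes how different multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from "dark-personality" agents. The study reveals how the architectures differ in handling information contamination and in decision robustness, and it proposes a personality-scale detection and correction mechanism that identifies and repairs malicious agents to restore system safety.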

Link: https://arxiv.org/abs/2505.20824
Authors: Kai Chen,Taihang Zhen,Hewei Wang,Kailai Liu,Xinfeng Li,Jing Huo,Tianpei Yang,Jinfeng Xu,Wei Dong,Yang Gao
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5 000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from ‘dark-personality’ agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool’s open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.

[AI-62] MT-Mol: Multi Agent System with Tool-based Reasoning for Molecular Optimization

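[Quick Read]: This paper addresses structured reasoning, interpretability, and comprehensive tool-grounded optimization in molecular optimization, areas that remain underexplored in applications of large language models (LLMs). The key to the solution is the MT-Mol framework, which performs molecular optimization through tool-guided reasoning and role-specialized LLM agents. It integrates a comprehensive set of RDKit tools organized into five domains, each managed by an expert analyst agent that extracts task-relevant tools and provides interpretable, chemically grounded feedback, so that molecule generation and optimization proceed through collaborative interaction.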

Link: https://arxiv.org/abs/2505.20820
Authors: Hyomin Kim,Yunhui Jang,Sungsoo Ahn
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have great potential for molecular optimization, as they can gather external chemistry tools and enable collaborative interactions to iteratively refine molecular candidates. However, this potential remains underexplored, particularly in the context of structured reasoning, interpretability, and comprehensive tool-grounded molecular optimization. To address this gap, we introduce MT-Mol, a multi-agent framework for molecular optimization that leverages tool-guided reasoning and role-specialized LLM agents. Our system incorporates comprehensive RDKit tools, categorized into five distinct domains: structural descriptors, electronic and topological features, fragment-based functional groups, molecular representations, and miscellaneous chemical properties. Each category is managed by an expert analyst agent, responsible for extracting task-relevant tools and enabling interpretable, chemically grounded feedback. MT-Mol produces molecules with tool-aligned and stepwise reasoning through the interaction between the analyst agents, a molecule-generating scientist, a reasoning-output verifier, and a reviewer agent. As a result, our framework achieves state-of-the-art performance on 17 of the 23 tasks in the PMO-1K benchmark.

[AI-63] VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion INTERSPEECH2025

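[Quick Read]: This paper addresses the difficulty of modeling vibrato in singing style control, in particular the difficulty of precisely controlling its dynamic characteristics in singing voice conversion. The key to the solution is VibE-SVC, a model that explicitly extracts and manipulates vibrato using the discrete wavelet transform, decomposing the fundamental frequency (F0) contour into frequency components. This enables precise vibrato transfer and makes singing style conversion more flexible and expressive.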

Link: https://arxiv.org/abs/2505.20794
Authors: Joon-Seung Choi,Dong-Min Byun,Hyung-Seok Oh,Seong-Whan Lee
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Proceedings of Interspeech 2025

Abstract:Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibE-SVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using discrete wavelet transform. Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise transfer. This allows vibrato control for enhanced flexibility. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity. Both subjective and objective evaluations confirm high-quality conversion.
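
To illustrate what decomposing an F0 contour with a discrete wavelet transform looks like, here is a minimal multi-level Haar transform in plain Python. The abstract does not say which wavelet or how many levels VibE-SVC uses, so the Haar wavelet, the function names, and the toy contour are all assumptions chosen to keep the sketch dependency-free.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def haar_decompose(signal, levels):
    """Multi-level decomposition; len(signal) must be divisible by 2**levels."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)  # finest scale first
    return approx, details


def haar_reconstruct(approx, details):
    """Rebuild the signal from the coarsest approximation and all detail bands."""
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx


# Toy F0 contour in Hz: a 220 Hz note with a small periodic wobble.
f0 = [220.0, 223.0, 220.0, 217.0, 220.0, 223.0, 220.0, 217.0]
approx, details = haar_decompose(f0, levels=3)
# Zeroing every detail band removes the wobble and keeps only the coarse pitch.
flattened = haar_reconstruct(approx, [[0.0] * len(d) for d in details])
```

Editing or swapping selected detail bands before reconstruction is one simple way to manipulate oscillations at a chosen time scale while leaving the coarse pitch contour intact.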

[AI-64] FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation

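[Quick Read]: This paper addresses global path planning for autonomous drones in complex environments, in particular the challenge of improving perception and intelligent decision-making. The key to the solution is using foundation models, especially large language models (LLMs) and vision-language models (VLMs), to guide path planning. The authors systematically evaluate such models, build an integrated LLM-Vision planner that combines semantic reasoning with visual perception for effective real-time navigation, and validate it in real-world experiments under multiple configurations to explore the feasibility and limitations of foundation models in real drone applications.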

Link: https://arxiv.org/abs/2505.20783
Authors: Jiaping Xiao,Cheng Wen Tsao,Yuhang Zhang,Mir Feroskhan
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This work has been submitted for possible publication

Abstract:Path planning is a critical component in autonomous drone operations, enabling safe and efficient navigation through complex environments. Recent advances in foundation models, particularly large language models (LLMs) and vision-language models (VLMs), have opened new opportunities for enhanced perception and intelligent decision-making in robotics. However, their practical applicability and effectiveness in global path planning remain relatively unexplored. This paper proposes foundation model-guided path planners (FM-Planner) and presents a comprehensive benchmarking study and practical validation for drone path planning. Specifically, we first systematically evaluate eight representative LLM and VLM approaches using standardized simulation scenarios. To enable effective real-time navigation, we then design an integrated LLM-Vision planner that combines semantic reasoning with visual perception. Furthermore, we deploy and validate the proposed path planner through real-world experiments under multiple configurations. Our findings provide valuable insights into the strengths, limitations, and feasibility of deploying foundation models in real-world drone applications and provide practical implementations for autonomous flight. Project site: this https URL.

[AI-65] Bridging the Gap: Self-Optimized Fine-Tuning for LLM-based Recommender Systems

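[Quick Read]: This paper addresses the limited recommendation ability of large language models (LLMs) in recommender systems (RS), that is, how to bridge the gap between the knowledge space of LLMs and the recommendation task. Neither of the two mainstream strategies, "Guidance-Only" and "Tuning-Only", reaches satisfactory recommendation performance. The authors therefore propose Self-Optimized Fine-Tuning (SOFT), which combines the strengths of both. Its key idea is curriculum learning: self-distillation produces an auxiliary dataset, and a self-adaptive curriculum scheduler moves the model gradually from the simpler data to the more complex real recommendation data, which significantly improves recommendation accuracy.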

Link: https://arxiv.org/abs/2505.20771
Authors: Heng Tang,Feng Liu,Xinbo Chen,Jiawei Chen,Bohao Wang,Changwang Zhang,Jun Wang,Yuegang Sun,Bingde Hu,Can Wang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent years have witnessed extensive exploration of Large Language Models (LLMs) in the field of Recommender Systems (RS). There are currently two commonly used strategies to enable LLMs to have recommendation capabilities: 1) The “Guidance-Only” strategy uses in-context learning to exploit and amplify the inherent semantic understanding and item recommendation capabilities of LLMs; 2) The “Tuning-Only” strategy uses supervised fine-tuning (SFT) to fine-tune LLMs with the aim of fitting them to real recommendation data. However, neither of these strategies can effectively bridge the gap between the knowledge space of LLMs and recommendation, and their performance does not meet our expectations. To better enable LLMs to learn recommendation knowledge, we combine the advantages of the above two strategies and propose a novel “Guidance+Tuning” method called Self-Optimized Fine-Tuning (SOFT), which adopts the idea of curriculum learning. It first employs self-distillation to construct an auxiliary easy-to-learn but meaningful dataset from a fine-tuned LLM. Then it further utilizes a self-adaptive curriculum scheduler to enable LLMs to gradually learn from simpler data (self-distilled data) to more challenging data (real RS data). Extensive experiments demonstrate that SOFT significantly enhances the recommendation accuracy (37.59% on average) of LLM-based methods. The code is available via this https URL

[AI-66] Interactive OT Gym: A Reinforcement Learning-Based Interactive Optical tweezer (OT)-Driven Microrobotics Simulation Platform ICRA2025

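[Quick Read]: This paper addresses the difficulty of controlling conventional multi-trap optical tweezers (OT) to cooperatively manipulate multiple complex-shaped microrobots in dynamic environments. The key to the solution is Interactive OT Gym, a reinforcement learning (RL)-based simulation platform that integrates physical field simulation, haptic feedback interfaces, RL modules, and context-aware shared control strategies. It enables adaptive control shared between human and machine, which significantly improves the efficiency and success rate of micromanipulation tasks.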

Link: https://arxiv.org/abs/2505.20751
Authors: Zongcai Tan and Dandan Zhang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: ICRA 2025

Abstract:Optical tweezers (OT) offer unparalleled capabilities for micromanipulation with submicron precision in biomedical applications. However, controlling conventional multi-trap OT to achieve cooperative manipulation of multiple complex-shaped microrobots in dynamic environments poses a significant challenge. To address this, we introduce Interactive OT Gym, a reinforcement learning (RL)-based simulation platform designed for OT-driven microrobotics. Our platform supports complex physical field simulations and integrates haptic feedback interfaces, RL modules, and context-aware shared control strategies tailored for OT-driven microrobot in cooperative biological object manipulation tasks. This integration allows for an adaptive blend of manual and autonomous control, enabling seamless transitions between human input and autonomous operation. We evaluated the effectiveness of our platform using a cell manipulation task. Experimental results show that our shared control system significantly improves micromanipulation performance, reducing task completion time by approximately 67% compared to using pure human or RL control alone and achieving a 100% success rate. With its high fidelity, interactivity, low cost, and high-speed simulation capabilities, Interactive OT Gym serves as a user-friendly training and testing environment for the development of advanced interactive OT-driven micromanipulation systems and control algorithms. For more details on the project, please see our website this https URL

[AI-67] Can Agents Fix Agent Issues?

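[Quick Read]: This paper addresses the maintenance challenges of large language model (LLM)-based agent systems, in particular how to automatically resolve agent issues such as bug reports or feature requests. The key contribution is AGENTISSUE-BENCH, a reproducible benchmark of 50 agent issue resolution tasks, each with an executable environment and failure-triggering tests, used to evaluate the effectiveness of existing software engineering (SE) agents. With this benchmark the study reveals the limitations of current SE agents on agent-system issues and highlights the need for research dedicated to maintaining agent systems.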

Link: https://arxiv.org/abs/2505.20749
Authors: Alfin Wijaya Rahardja,Junwei Liu,Weitong Chen,Zhenpeng Chen,Yiling Lou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 18 pages, 7 figures

Abstract:LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at this https URL .

[AI-68] MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science

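[Quick Read]: This paper addresses the lack of high-quality benchmarks for multimodal large language models (MLLMs) in Earth science, especially at the graduate level. Existing benchmarks usually rely on synthetic datasets or simple figure-caption pairs, which do not reflect the complex reasoning and domain expertise that real scientific applications require. The key contribution is MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. It covers the five major spheres of Earth science and contains more than 7,000 figures with refined captions that combine the original figure captions with discussion and reasoning from the papers, so that the benchmark captures complex scientific reasoning and knowledge-intensive content.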

Link: https://arxiv.org/abs/2505.20740
Authors: Xiangyu Zhao,Wanghan Xu,Bo Liu,Yuhao Zhou,Fenghua Ling,Ben Fei,Xiaoyu Yue,Lei Bai,Wenlong Zhang,Xiao-Ming Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application in addressing earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 7K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field. Resources related to this benchmark can be found at this https URL and this https URL.

[AI-69] RRO: LLM Agent Optimization Through Rising Reward Trajectories

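[Quick Read]: This paper addresses the difficulty large language models (LLMs) have with complex multi-step tasks, where an agent is sensitive to the outcome of key steps and a subtle mistake in the planning trajectory can make the whole task fail. Existing methods calibrate the reasoning process with reinforcement learning, using Process Reward Models (PRMs) to reward or penalize every reasoning step, but PRMs are costly and hard to scale when there are many next-action candidates. The key to the solution is Reward Rising Optimization (RRO), which focuses on the relative reward trend across successive reasoning steps and keeps rewards increasing along the collected trajectories. This dynamically expands the search space of next-action candidates and collects high-quality data efficiently.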

Link: https://arxiv.org/abs/2505.20737
Authors: Zilong Wang,Jingfeng Yang,Sreyashi Nag,Samarth Varshney,Xianfeng Tang,Haoming Jiang,Jingbo Shang,Sheikh Muhammad Sarwar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks, while it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, also known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates since they require extensive computations to acquire the training data through the per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e. rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.
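
The stopping rule described in the abstract, sampling next-action candidates until one shows a rising reward relative to the preceding step, can be sketched as below. This is a simplified reading of the abstract, not the authors' implementation. The names `sample_until_rising`, `propose_action`, and `estimate_reward`, and the `max_candidates` cap, are hypothetical.

```python
def sample_until_rising(prev_reward, propose_action, estimate_reward, max_candidates=5):
    """Collect next-action candidates until one beats the previous step's reward.

    propose_action: hypothetical callable returning a candidate action.
    estimate_reward: hypothetical callable scoring a candidate action.
    Returns the (action, reward) pairs explored, in sampling order.
    """
    explored = []
    for _ in range(max_candidates):
        action = propose_action()
        reward = estimate_reward(action)
        explored.append((action, reward))
        if reward > prev_reward:
            break  # rising reward found: stop expanding the candidate set
    return explored


# Toy usage with scripted candidates and rewards.
actions = iter(["search", "click", "buy", "back"])
rewards = iter([0.1, 0.2, 0.6, 0.9])
explored = sample_until_rising(0.5, lambda: next(actions), lambda a: next(rewards))
```

The number of candidates explored adapts to each step: an early rising reward ends the search quickly, while harder steps receive more exploration up to the cap.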

[AI-70] Adversarial bandit optimization for approximately linear functions

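【Quick Read】: This paper studies bandit optimization for nonconvex and non-smooth functions, where the loss function in each trial is the sum of a linear function and a small but arbitrary perturbation chosen after observing the player's choice. The key contribution is a pair of regret bounds for this problem, one in expectation and one holding with high probability; the analysis also yields an improved high-probability regret bound for bandit linear optimization, the special case with no perturbation. The paper additionally gives a lower bound on the expected regret.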

Link: https://arxiv.org/abs/2505.20734
Authors: Zhuoyu Cheng,Kohei Hatano,Eiji Takimoto
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We consider a bandit optimization problem for nonconvex and non-smooth functions, where in each trial the loss function is the sum of a linear function and a small but arbitrary perturbation chosen after observing the player’s choice. We give both expected and high-probability regret bounds for the problem. Our result also implies an improved high-probability regret bound for bandit linear optimization, a special case with no perturbation. We also give a lower bound on the expected regret.
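
For readers who want the setting in symbols, one way to write it down is sketched below. The notation is assumed for illustration and is not taken from the paper: $\theta_t$ is the linear loss vector, $\varepsilon_t$ the perturbation with magnitude at most $\varepsilon$, $\mathcal{K}$ the player's decision set, and $R_T$ a common definition of regret against the best fixed point in hindsight. The paper's own definitions may differ.

```latex
f_t(x) = \langle \theta_t, x \rangle + \varepsilon_t(x),
\qquad |\varepsilon_t(x)| \le \varepsilon \ \text{ for all } x \in \mathcal{K},
\qquad
R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).
```

Setting $\varepsilon = 0$ recovers bandit linear optimization, the special case the abstract mentions.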

[AI-71] E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

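【Quick Read】: This paper aims to overcome the limitations of traditional Robotic Process Automation (RPA) in handling unstructured data, exception management, and complex decision-making, so as to achieve End-to-End (E2E) automation of corporate financial expense processing. The key to the solution is combining generative AI with Intelligent Document Processing (IDP) and using an Automation Agent to build a four-stage integrated process: automatic recognition of supporting documents, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous learning through the Automation Agent.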

Link: https://arxiv.org/abs/2505.20733
Authors: Cheonsu Jeong,Seongmin Sim,Hyoyoung Cho,Sungsu Kim,Byounggwan Shin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents an intelligent work automation approach in the context of contemporary digital transformation by integrating generative AI and Intelligent Document Processing (IDP) technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks. While traditional Robotic Process Automation (RPA) has proven effective for repetitive, rule-based simple task automation, it faces limitations in handling unstructured data, exception management, and complex decision-making. This study designs and implements a four-stage integrated process comprising automatic recognition of supporting documents such as receipts via OCR/IDP, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous system learning through an Automation Agent. Applied to a major Korean enterprise (Company S), the system demonstrated quantitative benefits including over 80% reduction in processing time for paper receipt expense tasks, decreased error rates, and improved compliance, as well as qualitative benefits such as enhanced accuracy and consistency, increased employee satisfaction, and data-driven decision support. Furthermore, the system embodies a virtuous cycle by learning from human judgments to progressively improve automatic exception handling capabilities. Empirically, this research confirms that the organic integration of generative AI, IDP, and Automation Agents effectively overcomes the limitations of conventional automation and enables E2E automation of complex corporate processes. The study also discusses potential extensions to other domains such as accounting, human resources, and procurement, and proposes future directions for AI-driven hyper-automation development.

[AI-72] Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

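【Quick Read】: This paper addresses the weak spatial reasoning of current Vision-Language Models (VLMs), which perform poorly on real-world images with high spatial complexity. The key to the solution is a new benchmark, Jigsaw-Puzzles, consisting of 1,100 carefully curated real-world images with high spatial complexity and five tasks that rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning, while minimizing reliance on domain-specific knowledge so that general spatial reasoning capability is assessed more accurately.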

Link: https://arxiv.org/abs/2505.20728
Authors: Zesen Lyu,Dandan Zhang,Wei Ye,Fangdi Li,Zhihang Jiang,Yao Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs’ spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs.

[AI-73] Wideband RF Radiance Field Modeling Using Frequency-embedded 3D Gaussian Splatting

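【Quick Read】: This paper aims to overcome the limitation of conventional single-frequency radio-frequency (RF) radiance field modeling in wideband scenarios. It proposes a frequency-embedded 3D Gaussian splatting (3DGS) algorithm that efficiently reconstructs RF radiance fields at arbitrary unknown frequencies. The key to the solution is an electromagnetic (EM) feature network with an attenuation module and a radiance module, which learns the complex relationship between RF frequency and the key properties of each 3D Gaussian (the attenuation factor and the RF signal intensity). Training this frequency-embedded 3DGS model allows the power angular spectrum (PAS) at unknown frequencies to be estimated accurately within a given 3D environment.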

Link: https://arxiv.org/abs/2505.20714
Authors: Zechen Li,Lanqing Yang,Yiheng Bian,Hao Pan,Yongjian Fu,Yezhou Wang,Yi-Chao Chen,Guangtao Xue,Ju Ren
Affiliations: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents an innovative frequency-embedded 3D Gaussian splatting (3DGS) algorithm for wideband radio-frequency (RF) radiance field modeling, offering an advancement over the existing works limited to single-frequency modeling. Grounded in fundamental physics, we uncover the complex relationship between EM wave propagation behaviors and RF frequencies. Inspired by this, we design an EM feature network with attenuation and radiance modules to learn the complex relationships between RF frequencies and the key properties of each 3D Gaussian, specifically the attenuation factor and RF signal intensity. By training the frequency-embedded 3DGS model, we can efficiently reconstruct RF radiance fields at arbitrary unknown frequencies within a given 3D environment. Finally, we propose a large-scale power angular spectrum (PAS) dataset containing 50000 samples ranging from 1 to 100 GHz in 6 indoor environments, and conduct extensive experiments to verify the effectiveness of our method. Our approach achieves an average Structural Similarity Index Measure (SSIM) up to 0.72, and a significant improvement up to 17.8% compared to the current state-of-the-art (SOTA) methods trained on individual test frequencies. Additionally, our method achieves an SSIM of 0.70 without prior training on these frequencies, which represents only a 2.8% performance drop compared to models trained with full PAS data. This demonstrates our model’s capability to estimate PAS at unknown frequencies. For related code and datasets, please refer to this https URL.

[AI-74] Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series

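【Quick Read】: This paper addresses hypothesis generation in neuroscience, which aims to reduce research costs by narrowing the range of interventional studies needed. Existing machine learning methods are limited in generating scientific hypotheses because they typically assume that causal relationships are static, which makes them hard to apply to systems with dynamic, state-dependent behavior such as the brain. The key to the proposed solution is modeling a dynamic graph as a conditionally weighted superposition of static graphs, each of which can capture nonlinear relationships, so that complex, time-varying interactions between variables can be detected beyond linear limitations.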

Link: https://arxiv.org/abs/2505.20697
Authors: Zachary C. Brown,David Carlson
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, state-dependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves F1 scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method’s ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.
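
A toy sketch of "a conditionally weighted superposition of static graphs" is given below. It rests on assumptions that are not in the abstract: each static graph is reduced to a plain adjacency matrix of edge strengths (the paper's static graphs can encode nonlinear relationships), and the conditional weights come from a softmax over hypothetical per-state scores.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_graph(static_graphs: List[Matrix], state_scores: List[float]) -> Matrix:
    """Mix static adjacency matrices with weights conditioned on the current state."""
    weights = softmax(state_scores)
    n = len(static_graphs[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, static_graphs)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical static graphs over two variables: 0 -> 1 and 1 -> 0.
g_rest = [[0.0, 1.0], [0.0, 0.0]]
g_task = [[0.0, 0.0], [1.0, 0.0]]
# State scores favouring the first graph at this time step.
mixed = dynamic_graph([g_rest, g_task], [2.0, 0.0])
```

As the state scores change over time, the mixed graph moves smoothly between the static ones, which is the sense in which the causal structure becomes state-dependent.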

[AI-75] Evidential Deep Active Learning for Semi-Supervised Classification

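【Quick Read】: This paper addresses the fact that existing active-learning methods for semi-supervised classification ignore the uncertainty estimation (or reliability) of prediction results during learning, which makes it questionable whether the selected samples can effectively update the model. The key to the solution is an evidential deep active learning approach for semi-supervised classification (EDALSSC), which builds a semi-supervised learning framework that simultaneously quantifies the uncertainty of labeled and unlabeled data during learning. The uncertainty of labeled data is associated with evidential deep learning, while that of unlabeled data is modeled by combining the ignorance information and conflict information of the evidence from the perspective of the T-conorm operator. EDALSSC also includes a heuristic that dynamically balances the influence of evidence and the number of classes on the uncertainty estimate, so that it does not produce counter-intuitive results.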

Link: https://arxiv.org/abs/2505.20691
Authors: Shenkai Zhao,Xinao Zhang,Lipeng Pan,Xiaobin Xu,Danilo Pelusi
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:Semi-supervised classification based on active learning has made significant progress, but the existing methods often ignore the uncertainty estimation (or reliability) of the prediction results during the learning process, which makes it questionable whether the selected samples can effectively update the model. Hence, this paper proposes an evidential deep active learning approach for semi-supervised classification (EDALSSC). EDALSSC builds a semi-supervised learning framework to simultaneously quantify the uncertainty estimation of labeled and unlabeled data during the learning process. The uncertainty estimation of the former is associated with evidential deep learning, while that of the latter is modeled by combining ignorance information and conflict information of the evidence from the perspective of the T-conorm operator. Furthermore, this article constructs a heuristic method to dynamically balance the influence of evidence and the number of classes on uncertainty estimation to ensure that it does not produce counter-intuitive results in EDALSSC. For the sample selection strategy, EDALSSC selects the sample with the greatest uncertainty estimation that is calculated in the form of a sum when the training loss increases in the latter half of the learning process. Experimental results demonstrate that EDALSSC outperforms existing semi-supervised and supervised active learning approaches on image classification datasets.
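
As background for the uncertainty estimate mentioned above, the sketch below shows the usual evidential deep learning bookkeeping: non-negative per-class evidence is turned into belief masses and a single "ignorance" (vacuity) term. This is the standard formulation, not EDALSSC's specific T-conorm combination or its balancing heuristic; the function names and the toy pool are hypothetical.

```python
from typing import List, Tuple


def edl_uncertainty(evidence: List[float]) -> Tuple[List[float], float]:
    """Return (belief masses, vacuity) for non-negative per-class evidence.

    With alpha_k = e_k + 1 and S = sum(alpha_k): b_k = e_k / S and u = K / S,
    so the beliefs and the vacuity always sum to 1.
    """
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    beliefs = [e / strength for e in evidence]
    vacuity = num_classes / strength
    return beliefs, vacuity


def pick_most_uncertain(pool: List[List[float]]) -> int:
    """Index of the pool sample with the largest vacuity (a simple selection rule)."""
    return max(range(len(pool)), key=lambda i: edl_uncertainty(pool[i])[1])


beliefs, vacuity = edl_uncertainty([9.0, 0.0, 0.0])
query_index = pick_most_uncertain([[9.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
```

A sample with no evidence at all has vacuity 1, so a purely vacuity-based rule queries it first. Conflict between classes, which the abstract also uses, is not captured by this term alone.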

[AI-76] Accelerating RL for LLM Reasoning with Optimal Advantage Regression

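【Quick Read】: This paper addresses the high computational overhead and memory consumption of reinforcement learning (RL) fine-tuning for Large Language Models (LLMs), which stem mainly from the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. The key to the solution is A*-PO, a two-stage policy optimization framework that directly approximates the optimal advantage function. In the first stage it estimates the optimal value function V* through offline sampling, avoiding costly online value estimation; in the second stage it updates the policy with a simple least-squares regression loss that needs only a single generation per prompt, making training efficient.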

Link: https://arxiv.org/abs/2505.20686
Authors: Kianté Brantley,Mingyu Chen,Zhaolin Gao,Jason D. Lee,Wen Sun,Wenhao Zhan,Xuezhou Zhang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose A*-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function V*, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2× and peak memory usage by over 30% compared to PPO, GRPO, and REBEL. Implementation of A*-PO can be found at this https URL.
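
The two stages can be sketched numerically under the standard closed form for KL-regularized RL. The formulas below are assumptions based on that closed form and are not quoted from the abstract: stage one estimates V*(x) as β times the log of the mean of exp(r/β) over samples from the reference policy, and stage two regresses β·log(π/π_ref) onto the estimated optimal advantage r − V*. The function names and the numbers are hypothetical.

```python
import math
from typing import List


def estimate_v_star(rewards: List[float], beta: float) -> float:
    """Stage 1 (assumed form): V*(x) ~= beta * log(mean_i exp(r_i / beta)),
    where r_i are rewards of responses sampled offline from the reference policy.
    The maximum is subtracted first for numerical stability."""
    m = max(rewards)
    mean_exp = sum(math.exp((r - m) / beta) for r in rewards) / len(rewards)
    return m + beta * math.log(mean_exp)


def a_star_po_loss(
    logp_policy: float, logp_ref: float, reward: float, v_star: float, beta: float
) -> float:
    """Stage 2 (assumed form): squared error between beta * log(pi / pi_ref)
    and the estimated optimal advantage r - V*, for one sampled response."""
    residual = beta * (logp_policy - logp_ref) - (reward - v_star)
    return residual ** 2


v_const = estimate_v_star([1.0, 1.0, 1.0], beta=0.5)   # all samples score the same
v_mixed = estimate_v_star([0.0, 1.0], beta=1.0)        # lies between mean and max
loss = a_star_po_loss(logp_policy=-1.0, logp_ref=-1.5, reward=1.0, v_star=0.5, beta=1.0)
```

Because V* is precomputed offline, the second stage needs only one response per prompt and no critic network, which matches the memory and time savings the abstract reports.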

[AI-77] GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning

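【Quick Read】: This paper addresses the significant gap between state-of-the-art deep learning models and human-level performance on the Abstraction and Reasoning Corpus (ARC). The key to the solution is GIFARC, an analogy-inspired ARC dataset. Using large language models (LLMs) and vision-language models (VLMs), it synthesizes new ARC-style tasks from GIF images that contain analogies and pairs each task with a ground-truth analogy annotation. Embedding human-intuitive analogical reasoning in the task design guides AI agents to evaluate a task analogically before resorting to brute-force pattern search, which reduces problem complexity and leads to more concise, human-understandable solutions.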

Link: https://arxiv.org/abs/2505.20672
Authors: Woochang Sim,Hyunseok Ryu,Kyungmin Choi,Sungwon Han,Sundong Kim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The Abstraction and Reasoning Corpus (ARC) poses a stringent test of general AI capabilities, requiring solvers to infer abstract patterns from only a handful of examples. Despite substantial progress in deep learning, state-of-the-art models still achieve accuracy rates of merely 40-55% on the 2024 ARC Competition, indicative of a significant gap between their performance and human-level reasoning. In this work, we seek to bridge that gap by introducing an analogy-inspired ARC dataset, GIFARC. Leveraging large language models (LLMs) and vision-language models (VLMs), we synthesize new ARC-style tasks from a variety of GIF images that include analogies. Each new task is paired with a ground-truth analogy, providing an explicit mapping between visual transformations and everyday concepts. By embedding robust human-intuitive analogies into ARC-style tasks, GIFARC guides AI agents to evaluate the task analogically before engaging in brute-force pattern search, thus efficiently reducing problem complexity and building a more concise and human-understandable solution. We empirically validate that guiding an LLM with the analogic approach of GIFARC shifts its task-solving approach toward the analogic approach of humans.

[AI-78] LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

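【Quick Read】: This paper addresses the difficulty of training effective policies for complex tasks in Reinforcement Learning (RL), where agents often converge to local optima and fail to maximize long-term rewards. The key to the solution is a policy modulation framework guided by a Large Language Model (LLM), which improves RL training without additional model training or human intervention. Specifically, an LLM is first prompted to identify critical states from a sub-optimal agent's trajectories; based on these states, it then provides action suggestions and assigns implicit rewards to guide policy refinement.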

Link: https://arxiv.org/abs/2505.20671
Authors: Heng Tan,Hua Yan,Yu Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While reinforcement learning (RL) has achieved notable success in various domains, training effective policies for complex tasks remains challenging. Agents often converge to local optima and fail to maximize long-term rewards. Existing approaches to mitigate training bottlenecks typically fall into two categories: (i) Automated policy refinement, which identifies critical states from past trajectories to guide policy updates, but suffers from costly and uncertain model training; and (ii) Human-in-the-loop refinement, where human feedback is used to correct agent behavior, but this does not scale well to environments with large or continuous action spaces. In this work, we design a large language model-guided policy modulation framework that leverages LLMs to improve RL training without additional model training or human intervention. We first prompt an LLM to identify critical states from a sub-optimal agent’s trajectories. Based on these states, the LLM then provides action suggestions and assigns implicit rewards to guide policy refinement. Experiments across standard RL benchmarks demonstrate that our method outperforms state-of-the-art baselines, highlighting the effectiveness of LLM-based explanations in addressing RL training bottlenecks.

[AI-79] MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning IJCAI2025

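【Quick Read】: This paper addresses the challenges Large Language Models (LLMs) face in complex tasks involving tool integration, in particular correcting erroneous trajectories in multi-agent workflows. Existing methods use reflection only in the observation stage after an action is executed. The proposed MIRROR framework instead introduces two reflection mechanisms: intra-reflection, which critically assesses an intended action before execution, and inter-reflection, which adjusts the task trajectory based on observations, so that erroneous actions are eliminated and rectified more comprehensively.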

Link: https://arxiv.org/abs/2505.20670
Authors: Zikang Guo,Benfeng Xu,Xiaorui Wang,Zhendong Mao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

Abstract:Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR’s superior performance, achieving state-of-the-art results compared to existing approaches.
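
The control flow of the two reflection stages can be sketched as a single loop. This is a simplified illustration and not MIRROR itself, which is a multi-agent framework: `check_before`, `execute`, and `check_after` are hypothetical stand-ins for LLM-backed components, and the toy action strings are invented.

```python
from typing import Callable, List, Tuple


def run_with_reflection(
    plan: List[str],
    check_before: Callable[[str], Tuple[bool, str]],
    execute: Callable[[str], str],
    check_after: Callable[[str, str], bool],
    max_revisions: int = 2,
) -> Tuple[List[Tuple[str, str]], bool]:
    """Run a plan with a pre-execution check (intra-reflection) and a
    post-observation check (inter-reflection). Returns the trajectory and
    a flag saying whether the remaining plan should be revised."""
    trajectory: List[Tuple[str, str]] = []
    for action in plan:
        # Intra-reflection: critique and revise the intended action before running it.
        for _ in range(max_revisions):
            ok, revised = check_before(action)
            if ok:
                break
            action = revised
        observation = execute(action)
        trajectory.append((action, observation))
        # Inter-reflection: use the observation to decide whether to adjust course.
        if not check_after(action, observation):
            return trajectory, True
    return trajectory, False


def check_before(action: str) -> Tuple[bool, str]:
    if "?" in action:  # an unfilled argument would fail if executed
        return False, action.replace("?", "2025-06-01")
    return True, action


def execute(action: str) -> str:
    return "error" if "?" in action else "ok"


def check_after(action: str, observation: str) -> bool:
    return observation == "ok"


trajectory, needs_replan = run_with_reflection(
    ["book_flight(date=?)", "book_hotel()"], check_before, execute, check_after
)
```

In the toy run the malformed first action is repaired before execution, so the error never reaches the tool call. That is the benefit the abstract attributes to reflecting before acting.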

[AI-80] Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers

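【Quick Read】: This paper addresses the challenges Transformers face on extremely long input sequences, including interference from local noise, weakened long-range dependencies, and unstable gradient flow. The key to the solution is a new framework, Continuous-Time Attention, which incorporates partial differential equations (PDEs) into the Transformer's attention mechanism: attention weights evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics, which systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow.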

Link: https://arxiv.org/abs/2505.20666
Authors: Yukun Zhang,Xueqing Zhou
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer’s attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments, demonstrating consistent gains over both standard and specialized long sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
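
To make "attention weights evolve over a pseudo-time dimension via diffusion" concrete, the sketch below applies a few explicit Euler steps of a 1D heat equation to one row of attention weights. It is an illustration under stated assumptions and not the paper's formulation: the function name, the step count, the rate, and the reflecting boundary condition are all hypothetical choices.

```python
from typing import List


def diffuse_attention(row: List[float], steps: int = 5, rate: float = 0.25) -> List[float]:
    """Evolve one row of attention weights under a discrete heat equation.

    Each step applies a <- a + rate * (left - 2a + right) with reflecting
    boundaries, which conserves the row sum. rate <= 0.5 keeps the update
    stable and the weights non-negative.
    """
    a = list(row)
    n = len(a)
    for _ in range(steps):
        nxt = []
        for i in range(n):
            left = a[i - 1] if i > 0 else a[i]
            right = a[i + 1] if i < n - 1 else a[i]
            nxt.append(a[i] + rate * (left - 2.0 * a[i] + right))
        a = nxt
    return a


spiky = [0.0, 0.0, 1.0, 0.0, 0.0]   # all attention mass on one position
smooth = diffuse_attention(spiky)   # mass spreads to neighbouring positions
```

The row still sums to 1 after diffusion, so it remains a valid attention distribution. The sharp spike is spread over neighbouring positions, a small-scale picture of the noise smoothing the abstract describes.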

[AI-81] AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

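【Quick Read】: This paper addresses the low efficiency of experiment reproduction in artificial intelligence, in particular the automation challenges caused by the inherent complexity of method design and training procedures. It notes that reproducing experiments often requires implicit domain-specific knowledge that is not explicitly documented in the original paper. The key to the solution is the paper lineage algorithm, which identifies and extracts implicit knowledge from the relevant references cited by the target paper, and on which the authors build AutoReproduce, a multi-agent framework that reproduces experiments automatically in an end-to-end manner.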

Link: https://arxiv.org/abs/2505.20662
Authors: Xuanle Zhao,Zilin Sang,Yuxuan Li,Qi Shi,Shuo Wang,Duzhen Zhang,Xu Han,Zhiyuan Liu,Maosong Sun
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, preprint version

Abstract:Efficient experiment reproduction is critical to accelerating progress in artificial intelligence. However, the inherent complexity of method design and training procedures presents substantial challenges for automation. Notably, reproducing experiments often requires implicit domain-specific knowledge not explicitly documented in the original papers. To address this, we introduce the paper lineage algorithm, which identifies and extracts implicit knowledge from the relevant references cited by the target paper. Building on this idea, we propose AutoReproduce, a multi-agent framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. AutoReproduce enhances code executability by generating unit tests alongside the reproduction process. To evaluate the reproduction capability, we construct ReproduceBench, a benchmark annotated with verified implementations, and introduce novel evaluation metrics to assess both the reproduction and execution fidelity. Experimental results demonstrate that AutoReproduce outperforms the existing strong agent baselines on all five evaluation metrics by a peak margin of over 70%. In particular, compared to the official implementations, AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs. The code will be available at this https URL.

[AI-82] Voronoi-grid-based Pareto Front Learning and Its Application to Collaborative Federated Learning

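【Quick Read】: This paper addresses two key problems in generating the Pareto front in Multi-objective Optimization (MOO): the difficulty of sampling rays in high-dimensional spaces, and the failure to cover the entire Pareto front with a convex shape. The key to the solution is a new Pareto-front learning framework, PHN-HVVS, which decomposes the design space into Voronoi grids, uses a Genetic Algorithm (GA) to partition the grids in high-dimensional space, and introduces a new loss function that yields broader coverage of the Pareto front and maximizes the hypervolume (HV) indicator.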

Link: https://arxiv.org/abs/2505.20648
Authors: Mengmeng Chen,Xiaohu Wu,Qiqi Liu,Tiantian He,Yew-Soon Ong,Yaochu Jin,Qicheng Lao,Han Yu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-objective optimization (MOO) is widespread in machine learning and aims to find a set of Pareto-optimal solutions, called the Pareto front; for example, it is fundamental to multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto front which has a convex shape. Here, we introduce a novel PFL framework, called PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS outperforms the baselines significantly in generating the Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field. The code is available at this https URL.
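
The HV (hypervolume) indicator that the loss is said to maximize can be computed directly in two dimensions. The sketch below assumes both objectives are minimized and that a reference point worse than every solution is given. It only illustrates the indicator; it is not the PHN-HVVS loss or its Voronoi partitioning, and the function name and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple


def hypervolume_2d(points: List[Tuple[float, float]], ref: Sequence[float]) -> float:
    """Area dominated by `points` and bounded by `ref`, for two minimized objectives."""
    # Only points that are strictly better than the reference in both objectives count.
    candidates = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in candidates:      # sweep in order of increasing first objective
        if f2 < best_f2:           # non-dominated so far: adds a new rectangle
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, ref=(4.0, 4.0))
hv_with_dominated = hypervolume_2d(front + [(3.0, 3.0)], ref=(4.0, 4.0))
```

A dominated point such as (3, 3) adds no area, whereas a solution in a previously uncovered part of the front does. This is why maximizing hypervolume rewards broad coverage of the front.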

[AI-83] Evaluating Training in Binarized Neural Networks Through the Lens of Algorithmic Information Theory NEURIPS2025

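【Quick Read】: This paper addresses the problem of understanding and controlling the informational complexity of neural networks, which matters for generalization, optimization, and model capacity. Conventional approaches rely mostly on entropy-based loss functions and statistical metrics, which often fail to capture the deeper, causally relevant algorithmic regularities embedded in network structure. The key to the proposed solution is a shift toward algorithmic information theory, using Binarized Neural Networks (BNNs) as a first proxy and characterizing learning dynamics from a causally grounded perspective through algorithmic probability (AP) and the universal distribution it defines. The core method applies the Block Decomposition Method (BDM), a scalable AP-based approximation of algorithmic complexity, and shows that it tracks structural changes during training more closely than entropy and correlates more strongly with training loss. The study supports the view of training as a process of algorithmic compression, in which learning corresponds to the progressive internalization of structured regularities.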

Link: https://arxiv.org/abs/2505.20646
Authors: Eduardo Y. Sakabe,Felipe S. Abrahão,Alexandre Simões,Esther Colombini,Paula Costa,Ricardo Gudwin,Hector Zenil
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 10 pages total, 1 figure. Submitted to NeurIPS 2025

Abstract:Understanding and controlling the informational complexity of neural networks is a central challenge in machine learning, with implications for generalization, optimization, and model capacity. While most approaches rely on entropy-based loss functions and statistical metrics, these measures often fail to capture deeper, causally relevant algorithmic regularities embedded in network structure. We propose a shift toward algorithmic information theory, using Binarized Neural Networks (BNNs) as a first proxy. Grounded in algorithmic probability (AP) and the universal distribution it defines, our approach characterizes learning dynamics through a formal, causally grounded lens. We apply the Block Decomposition Method (BDM) – a scalable approximation of algorithmic complexity based on AP – and demonstrate that it more closely tracks structural changes during training than entropy, consistently exhibiting stronger correlations with training loss across varying model sizes and randomized training runs. These results support the view of training as a process of algorithmic compression, where learning corresponds to the progressive internalization of structured regularities. In doing so, our work offers a principled estimate of learning progression and suggests a framework for complexity-aware learning and regularization, grounded in first principles from information theory, complexity, and computability.

[AI-84] Can Past Experience Accelerate LLM Reasoning?

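【Quick Read】: This paper asks whether Large Language Models (LLMs) can become faster at reasoning through repeated exposure to relevant tasks, and if so, how. The key to the solution is SpeedupLLM, a framework based on adaptive compute allocation and memory mechanisms that comes with theoretical guarantees for reasoning speedup. Systematic experiments across different task similarity levels, memory methods, and reasoning methods show that LLMs can use past experience to reduce compute cost substantially, by up to 56%.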

Link: https://arxiv.org/abs/2505.20643
Authors: Bo Pan,Liang Zhao
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Allocating more compute to large language model (LLM) reasoning has generally been demonstrated to improve its effectiveness, but also results in increased inference time. In contrast, humans can perform tasks faster and better with increased experience and exposure. Hence, this paper aims to investigate the question: Can LLMs also become faster at reasoning through recurrent exposure to relevant tasks, and if so, how can it be achieved? To address these questions, we first formalize the problem setting of LLM reasoning speedup systematically in the dimensions of task relevancy and compute budget calculation. We then propose SpeedupLLM, a theoretically guaranteed framework to implement and benchmark such reasoning speedup behaviour based on adaptive compute allocation and memory mechanisms. We further conduct comprehensive experiments to benchmark such behaviour across different question similarity levels, memory methods, and reasoning methods. Results show that LLMs can generally reason faster with past experience, achieving up to a 56% reduction in compute cost when equipped with appropriate memory and reasoning methods.

[AI-85] CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models IJCAI2025

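【Quick Read】: This paper addresses two problems that hinder the practical deployment of personalized programming tutoring systems: the lack of sufficient, high-quality programming data, and the mismatch between offline evaluation and real-world learning. To address them, it proposes CoderAgent, an agent based on a large language model (LLM). Its key idea is to simulate the fine-grained cognitive states of a student's programming process, giving an interpretable and accurate simulation of programming learning without relying on real data. Inspired by the ACT-R cognitive architecture, the design aligns the analysis of programming learning with human cognitive structure and introduces the Programming Tree of Thought (PTOT), which breaks the programming process into four steps (why, how, where, and what) and so enables detailed analysis of iterative problem-solving strategies.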

Link: https://arxiv.org/abs/2505.20642
Authors: Yi Zhan,Qi Liu,Weibo Gao,Zheng Zhang,Tianfu Wang,Shuanghong Shen,Junyu Lu,Zhenya Huang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI2025

Abstract:Personalized programming tutoring, such as exercise recommendation, can enhance learners’ efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students’ programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.

[AI-86] Multi-level Certified Defense Against Poisoning Attacks in Offline Reinforcement Learning

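【Quick Read】: This paper addresses the security of Offline Reinforcement Learning (RL) against poisoning attacks, which degrade an algorithm's performance by tampering with its training data. The key to the solution is extending certified defenses to provide larger guarantees against adversarial manipulation, ensuring robustness both for per-state actions and for the overall expected cumulative reward. The approach leverages properties of Differential Privacy, which lets it cover continuous and discrete state spaces as well as stochastic and deterministic environments, significantly expanding the scope and applicability of achievable guarantees.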

Link: https://arxiv.org/abs/2505.20621
Authors: Shijie Liu,Andrew C. Cullen,Paul Montague,Sarah Erfani,Benjamin I. P. Rubinstein
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Similar to other machine learning frameworks, Offline Reinforcement Learning (RL) is shown to be vulnerable to poisoning attacks, due to its reliance on externally sourced datasets, a vulnerability that is exacerbated by its sequential nature. To mitigate the risks posed by RL poisoning, we extend certified defenses to provide larger guarantees against adversarial manipulation, ensuring robustness for both per-state actions, and the overall expected cumulative reward. Our approach leverages properties of Differential Privacy, in a manner that allows this work to span both continuous and discrete spaces, as well as stochastic and deterministic environments – significantly expanding the scope and applicability of achievable guarantees. Empirical evaluations demonstrate that our approach ensures the performance drops to no more than 50% with up to 7% of the training data poisoned, significantly improving over the 0.008% in prior work \citep{wu_copa_2022}, while producing certified radii that are 5 times larger as well. This highlights the potential of our framework to enhance safety and reliability in offline RL.

[AI-87] InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling

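【Quick Read】: This paper addresses efficient serving of generative image editing in production, specifically how to cut redundant computation and improve system throughput and responsiveness when editing image templates with masks. The key to the solution is that the InstGenIE system reuses cached intermediate activations of the unmasked regions from previous inferences, skipping redundant computation for those regions. It also uses a bubble-free pipeline scheme that overlaps computation with cache loading and introduces a continuous batching strategy that reduces queuing latency in online serving, ultimately achieving higher throughput and lower request latency.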

链接: https://arxiv.org/abs/2505.20600
作者: Xiaoxiao Jiang,Suyi Li,Lingyun Yang,Tianyu Feng,Zhipeng Di,Weiyi Lu,Guoxuan Zhu,Xiu Lin,Kan Liu,Yinghao Yu,Tao Lan,Guodong Yang,Lin Qu,Liping Zhang,Wei Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Generative image editing using diffusion models has become a prevalent application in today’s AI cloud services. In production environments, image editing typically involves a mask that specifies the regions of an image template to be edited. The use of masks provides direct control over the editing process and introduces sparsity in the model inference. In this paper, we present InstGenIE, a system that efficiently serves image editing requests. The key insight behind InstGenIE is that image editing only modifies the masked regions of image templates while preserving the original content in the unmasked areas. Driven by this insight, InstGenIE judiciously skips redundant computations associated with the unmasked areas by reusing cached intermediate activations from previous inferences. To mitigate the high cache loading overhead, InstGenIE employs a bubble-free pipeline scheme that overlaps computation with cache loading. Additionally, to reduce queuing latency in online serving while improving the GPU utilization, InstGenIE proposes a novel continuous batching strategy for diffusion model serving, allowing newly arrived requests to join the running batch in just one step of denoising computation, without waiting for the entire batch to complete. As heterogeneous masks induce imbalanced loads, InstGenIE also develops a load balancing strategy that takes into account the loads of both computation and cache loading. Collectively, InstGenIE outperforms state-of-the-art diffusion serving systems for image editing, achieving up to 3x higher throughput and reducing average request latency by up to 14.7x while ensuring image quality.
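The cache-reuse idea in this abstract can be illustrated with a minimal sketch. It assumes a toy layer that acts on each token independently, so reusing cached activations for unmasked tokens is exact; real diffusion models mix tokens through attention, and the abstract does not say how InstGenIE handles that. The names `token_wise_layer` and `masked_forward` are hypothetical and are not taken from the paper.

```python
import numpy as np

def token_wise_layer(x, w):
    """Toy layer that transforms each token independently (assumption of this sketch)."""
    return np.tanh(x @ w)

def masked_forward(x, mask, cached, w):
    """Recompute only the masked tokens and reuse cached activations elsewhere."""
    out = cached.copy()
    out[mask] = token_wise_layer(x[mask], w)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
template = rng.normal(size=(6, 4))       # 6 tokens of a hypothetical image template
cached = token_wise_layer(template, w)   # activations saved from an earlier inference

mask = np.array([False, True, True, False, False, False])
request = template.copy()
request[mask] = rng.normal(size=(2, 4))  # only the masked tokens differ from the template

fast = masked_forward(request, mask, cached, w)  # computes 2 of 6 tokens
full = token_wise_layer(request, w)              # computes all 6 tokens
print(np.allclose(fast, full))
```

In this toy setting the saved work scales with the fraction of unmasked tokens, which is the sparsity the abstract refers to.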

[AI-88] The challenge of hidden gifts in multi-agent reinforcement learning

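[Quick read]: This paper addresses credit assignment in multi-agent reinforcement learning (MARL) in the presence of "hidden gifts", that is, how to identify and reward the beneficial actions of other agents when those actions are not perceived. The key is to give agents information about their own action history and to apply a correction term inspired by learning awareness that reduces variance during learning, so that independent agents converge to collective success more reliably.

Link: https://arxiv.org/abs/2505.20579
Authors: Dane Malenfant, Blake A. Richards
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: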

Abstract:Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These “hidden gifts” represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a very simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus the act of dropping the key for others is a “hidden gift”. We show that several different state-of-the-art RL algorithms, including MARL algorithms, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that independent model-free policy gradient agents can solve the task when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for these independent agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of “hidden gifts”, and demonstrate that learning awareness in independent agents can benefit these settings.

[AI-89] Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

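[Quick read]: This paper aims to design regulatory DNA sequences with precise cell-type specificity, which matters for synthetic biology, gene therapy, and precision medicine. Transformer-based language models can capture patterns in regulatory DNA, but their generative approaches struggle to produce novel sequences with reliable cell-specific activity. The proposed solution, Ctrl-DNA, formulates regulatory sequence design as a biologically informed constrained optimization problem and applies reinforcement learning (RL) to iteratively refine an autoregressive genomic language model, maximizing regulatory activity in the target cell types while constraining off-target effects, which enables controllable cell-type specificity.

Link: https://arxiv.org/abs/2505.20578
Authors: Xingyu Chen, Shihao Ma, Runsheng Lin, Jiecong Lin, Bo Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: 9 pages, 3 figures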

Abstract:Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

[AI-90] Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners

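[Quick read]: This paper addresses the invalid action plans that large language models (LLMs) produce in real robot control tasks because they lack awareness of physical constraints. The key is a framework built on reinforcement learning with verifiable rewards (RLVR): only valid action plans that successfully complete the control task receive a positive reward, which instills knowledge of physical constraints in the LLM and induces constraint-aware reasoning.

Link: https://arxiv.org/abs/2505.20573
Authors: Jiabao Ji, Yongchao Chen, Yang Zhang, Ramana Rao Kompella, Chuchu Fan, Gaowen Liu, Shiyu Chang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: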

Abstract:Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce invalid action plans that violate physical constraints, such as directing a robot to an unreachable location or causing collisions between robots. This issue primarily arises from a lack of awareness of these physical constraints during the reasoning process. To address this issue, we propose a novel framework that integrates reinforcement learning with verifiable rewards (RLVR) to instill knowledge of physical constraints in LLMs and induce constraint-aware reasoning during plan generation. In this approach, only valid action plans that successfully complete a control task receive positive rewards. We applied our method to two small-scale LLMs: a non-reasoning Qwen2.5-3B-Instruct and a reasoning Qwen3-4B. The experimental results demonstrate that constraint-aware small LLMs largely outperform large-scale models without constraints, grounded on both the BoxNet task and a newly developed BoxNet3D environment built using MuJoCo. This work highlights the effectiveness of grounding even small LLMs with physical constraints to enable scalable and efficient multi-robot control in complex, physically constrained environments.
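The reward rule described here (positive reward only for valid plans that complete the task) can be sketched as a small checker. Everything below is a hypothetical illustration: the plan format, the Manhattan-distance reach model, and the name `plan_reward` are assumptions of this sketch, not the BoxNet environment or the authors' reward code.

```python
def plan_reward(plan, bases, reach, goal_cells):
    """Return 1.0 only for a valid plan that covers every goal cell, else 0.0.

    plan: list of steps, each a dict mapping robot name -> target grid cell.
    bases: dict mapping robot name -> fixed base cell.
    reach: maximum Manhattan distance a robot can reach from its base.
    """
    visited = set()
    for step in plan:
        targets = list(step.values())
        # Collision check: two robots may not target the same cell in one step.
        if len(set(targets)) != len(targets):
            return 0.0
        # Reachability check: every target must lie within the robot's reach.
        for robot, (x, y) in step.items():
            bx, by = bases[robot]
            if abs(x - bx) + abs(y - by) > reach:
                return 0.0
        visited.update(targets)
    # Task completion check: all goal cells must have been visited.
    return 1.0 if goal_cells <= visited else 0.0

bases = {"arm_a": (0, 0), "arm_b": (3, 0)}
goal = {(1, 0), (2, 0)}
valid_plan = [{"arm_a": (1, 0), "arm_b": (2, 0)}]
colliding_plan = [{"arm_a": (1, 0), "arm_b": (1, 0)}, {"arm_b": (2, 0)}]
unreachable_plan = [{"arm_a": (2, 0)}, {"arm_b": (1, 0)}]
incomplete_plan = [{"arm_a": (1, 0)}]

for name, plan in [("valid", valid_plan), ("colliding", colliding_plan),
                   ("unreachable", unreachable_plan), ("incomplete", incomplete_plan)]:
    print(name, plan_reward(plan, bases, reach=1, goal_cells=goal))
```

Because the check is a deterministic program rather than a learned model, the reward is verifiable in the sense the abstract uses.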

[AI-91] Reconceptualizing Smart Microscopy: From Data Collection to Knowledge Creation by Multi-Agent Integration

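[Quick read]: This paper addresses the limitation of conventional microscopes as passive observation tools in biological imaging and aims to turn them into active collaborators in scientific inquiry. The key is a theoretical framework that redefines the role of smart microscopy in scientific investigation, centered on the "epistemic-empirical divide": the gap in cellular investigation between what is observable and what must be understood. Through six core design principles (epistemic-empirical awareness, hierarchical context integration, an evolution from detection to perception, adaptive measurement frameworks, narrative synthesis capabilities, and cross-contextual reasoning), the framework guides a multi-agent architecture that aligns empirical observation with the goals of scientific understanding.

Link: https://arxiv.org/abs/2505.20466
Authors: P.S. Kesavan, Pontus Nordenfelt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
Comments: 34 pages, 5 figures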

Abstract:Smart microscopy represents a paradigm shift in biological imaging, moving from passive observation tools to active collaborators in scientific inquiry. Enabled by advances in automation, computational power, and artificial intelligence, these systems are now capable of adaptive decision-making and real-time experimental control. Here, we introduce a theoretical framework that reconceptualizes smart microscopy as a partner in scientific investigation. Central to our framework is the concept of the ‘epistemic-empirical divide’ in cellular investigation: the gap between what is observable (empirical domain) and what must be understood (epistemic domain). We propose six core design principles: epistemic-empirical awareness, hierarchical context integration, an evolution from detection to perception, adaptive measurement frameworks, narrative synthesis capabilities, and cross-contextual reasoning. Together, these principles guide a multi-agent architecture designed to align empirical observation with the goals of scientific understanding. Our framework provides a roadmap for building microscopy systems that go beyond automation to actively support hypothesis generation, insight discovery, and theory development, redefining the role of scientific instruments in the process of knowledge creation.

[AI-92] Holes in Latent Space: Topological Signatures Under Adversarial Influence

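[Quick read]: This paper addresses the difficulty of systematically characterizing the representational dynamics of large language models (LLMs) under adversarial conditions, in particular capturing both global structure and local detail in high-dimensional activation spaces. The key is persistent homology (PH), a tool from topological data analysis, used to analyze multiscale latent space dynamics under two attack modes: backdoor fine-tuning and indirect prompt injection. With this method the authors show that adversarial conditions compress latent topologies, reducing structural diversity at small scales while amplifying dominant features at coarse scales, and a neuron-level PH framework further quantifies how information flows and transforms within and across layers.

Link: https://arxiv.org/abs/2505.20435
Authors: Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Algebraic Topology (math.AT)
Comments: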

Abstract:Understanding how adversarial conditions affect language models requires techniques that capture both global structure and local detail within high-dimensional activation spaces. We propose persistent homology (PH), a tool from topological data analysis, to systematically characterize multiscale latent space dynamics in LLMs under two distinct attack modes – backdoor fine-tuning and indirect prompt injection. By analyzing six state-of-the-art LLMs, we show that adversarial conditions consistently compress latent topologies, reducing structural diversity at smaller scales while amplifying dominant features at coarser ones. These topological signatures are statistically robust across layers, architectures, model sizes, and align with the emergence of adversarial effects deeper in the network. To capture finer-grained mechanisms underlying these shifts, we introduce a neuron-level PH framework that quantifies how information flows and transforms within and across layers. Together, our findings demonstrate that PH offers a principled and unifying approach to interpreting representational dynamics in LLMs, particularly under distributional shift.
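To give a feel for what persistent homology measures, here is a minimal sketch of its simplest case: 0-dimensional persistence (connected components) of a point cloud, computed with a union-find pass over edges sorted by length. This is a generic textbook construction on toy 2D points, not the authors' pipeline; the function name `h0_persistence` and the example data are assumptions of this sketch, and the paper works with LLM activations and presumably with dedicated PH software.

```python
from itertools import combinations
from math import dist

def h0_persistence(points):
    """Return the edge lengths at which connected components merge.

    Every point starts as its own component at scale 0. Growing the scale,
    two components merge when the shortest edge between them appears, so each
    merge length is the death time of one component. One component never dies.
    """
    parent = list(range(len(points)))

    def find(i):
        # Find the component root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths

# Two tight pairs of points, far apart from each other.
cloud = [(0, 0), (0, 1), (5, 0), (5, 1)]
deaths = h0_persistence(cloud)
print(deaths)  # two short-lived components and one long-lived one
```

Short death times correspond to small-scale structure and long ones to coarse-scale structure, which is the kind of multiscale summary the abstract compares between clean and adversarial conditions.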

[AI-93] Robot Operation of Home Appliances by Reading User Manuals

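[Quick read]: This paper aims to let home service robots operate novel household appliances, in particular to understand and execute tasks by "reading" the user manual when no prior knowledge is available. The core challenges are inferring goal-conditioned partial policies from unstructured text, grounding those policies to the appliance in the physical world, and executing multi-step policies reliably despite compounding errors. The key is to use a large vision-language model (VLM) to build a structured symbolic model of the appliance from its manual, to ground symbolic actions visually to control panel elements, and to keep updating the model from visual feedback. Experiments show task success rates in simulated and real-world settings that are significantly higher than those of state-of-the-art VLMs used directly as control policies.

Link: https://arxiv.org/abs/2505.20424
Authors: Jian Zhang, Hanbo Zhang, Anxing Xiao, David Hsu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: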

Abstract:Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by “reading” their user manuals. ApBot faces multiple challenges: (i) infer goal-conditioned partial policies from their unstructured, textual descriptions in a user manual document, (ii) ground the policies to the appliance in the physical world, and (iii) execute the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that structured internal representations play an important role in robust robot operation of home appliances, especially complex ones.

[AI-94] SCAR: Shapley Credit Assignment for More Efficient RLHF

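[Quick read]: This paper addresses ineffective credit assignment caused by sparse reward signals in Reinforcement Learning from Human Feedback (RLHF): during sequence generation it is hard to tell which tokens or text spans contributed to the final reward. The key is Shapley Credit Assignment Rewards (SCAR), a credit assignment method based on Shapley values from cooperative game theory. It distributes the sequence-level reward over individual tokens or text spans according to their marginal contributions, producing dense reward signals without training an auxiliary critique model or relying on fine-grained human annotation at intermediate stages.

Link: https://arxiv.org/abs/2505.20417
Authors: Meng Cao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: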

Abstract:Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, making effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values in cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans based on their principled marginal contributions. This creates dense reward signals, crucially, without necessitating the training of auxiliary critique models or recourse to fine-grained human annotations at intermediate generation stages. Unlike prior dense reward methods, SCAR offers a game-theoretic foundation for fair credit attribution. Theoretically, we demonstrate that SCAR preserves the original optimal policy, and empirically, across diverse tasks including sentiment control, text summarization, and instruction tuning, we show that SCAR converges significantly faster and achieves higher final reward scores compared to standard RLHF and attention-based dense reward baselines. Our findings suggest that SCAR provides a more effective and theoretically sound method for credit assignment in RLHF, leading to more efficient alignment of LLMs.
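The Shapley-value idea behind this abstract can be shown with a minimal sketch that computes exact Shapley values over a handful of text spans. The reward function here is a toy keyword counter standing in for a reward model, and the names `shapley_values` and `toy_reward` are hypothetical. Exact enumeration is exponential in the number of spans, so a practical method would need some approximation; the abstract does not say which one SCAR uses.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley value of each player under value_fn (exponential cost)."""
    n = len(players)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = value_fn([players[j] for j in subset])
                # Keep the original span order when adding player i.
                with_i = value_fn([players[j] for j in sorted(subset + (i,))])
                values[i] += weight * (with_i - without_i)
    return values

def toy_reward(spans):
    """Hypothetical sequence-level reward: one point per positive keyword."""
    return sum(1.0 for span in spans for word in ("great", "fun") if word in span)

spans = ["the movie", "was great", "and fun"]
credits = shapley_values(spans, toy_reward)
for span, credit in zip(spans, credits):
    print(f"{span!r}: {credit:.2f}")
```

The per-span credits sum to the reward of the full sequence (the efficiency property of Shapley values), which is what lets a single scalar score be redistributed as a dense signal.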

[AI-95] Algorithmic Control Improves Residential Building Energy and EV Management when PV Capacity is High but Battery Capacity is Low

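[Quick read]: This paper addresses how to optimize energy management in prosumer households with electric vehicles (EVs) during the energy transition, in order to relieve grid stress. The key is to use deep reinforcement learning (DRL) to optimize household charging patterns in the dynamic and uncertain home energy management (HEM) environment, in particular by coordinating EV charging with photovoltaic (PV) surplus. The results indicate that DRL markedly improves energy management and cost savings in households with low battery capacity, while algorithmic control adds relatively little value in households with high battery capacity.

Link: https://arxiv.org/abs/2505.20377
Authors: Lennart Ullner, Alona Zharova, Felix Creutzig
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: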

Abstract:Efficient energy management in prosumer households is key to alleviating grid stress in an energy transition marked by electric vehicles (EV), renewable energies and battery storage. However, it is unclear how households optimize prosumer EV charging. Here we study real-world data from 90 households on fixed-rate electricity tariffs in German-speaking countries to investigate the potential of Deep Reinforcement Learning (DRL) and other control approaches (Rule-Based, Model Predictive Control) to manage the dynamic and uncertain environment of Home Energy Management (HEM) and optimize household charging patterns. The DRL agent efficiently aligns charging of EV and battery storage with photovoltaic (PV) surplus. We find that frequent EV charging transactions, early EV connections and PV surplus increase optimization potential. A detailed analysis of nine households (1 hour resolution, 1 year) demonstrates that high battery capacity facilitates self optimization; in this case further algorithmic control shows little value. In cases with relatively low battery capacity, algorithmic control with DRL improves energy management and cost savings by a relevant margin. This result is further corroborated by our simulation of a synthetic household. We conclude that prosumer households with optimization potential would profit from DRL, thus benefiting also the full electricity system and its decarbonization.

[AI-96] VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration

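[Quick read]: This paper addresses two problems in the safety alignment of vision-language models (VLMs): undersafety and oversafety. Existing methods focus on models responding to hazardous queries and neglect models over-refusing safe queries. The key is the concept of "safety calibration" together with the VSCBench dataset, which contains 3,600 image-text pairs that are visually or textually similar but differ in safety and is used to evaluate safety calibration in image-centric and text-centric scenarios. Using this benchmark, the authors evaluate 11 mainstream VLMs, reveal widespread undersafety and oversafety, and explore methods for improving safety calibration.

Link: https://arxiv.org/abs/2505.20362
Authors: Jiahui Geng, Qing Li, Zongxiong Chen, Yuxia Wang, Derui Zhu, Zhuohan Xie, Chenyang Lyu, Xiuying Chen, Preslav Nakov, Fakhri Karray
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: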

Abstract:The rapid advancement of vision-language models (VLMs) has brought a lot of attention to their safety alignment. However, existing methods have primarily focused on model undersafety, where the model responds to hazardous queries, while neglecting oversafety, where the model refuses to answer safe queries. In this paper, we introduce the concept of \textit{safety calibration}, which systematically addresses both undersafety and oversafety. Specifically, we present \textbf{VSCBench}, a novel dataset of 3,600 image-text pairs that are visually or textually similar but differ in terms of safety, which is designed to evaluate safety calibration across image-centric and text-centric scenarios. Based on our benchmark, we evaluate safety calibration across eleven widely used VLMs. Our extensive experiments revealed major issues with both undersafety and oversafety. We further investigated four approaches to improve the model’s safety calibration. We found that even though some methods effectively calibrated the models’ safety problems, these methods also lead to the degradation of models’ utility. This trade-off underscores the urgent need for advanced calibration methods, and our benchmark provides a valuable tool for evaluating future approaches. Our code and data are available at this https URL.

[AI-97] Risk-aware Direct Preference Optimization under Nested Risk Measure

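[Quick read]: This paper addresses the risk that, when fine-tuning pre-trained large language models (LLMs) to align with human values and intentions, simply maximizing the estimated reward makes the model deviate from the reference model's intended behavior. The key is Risk-aware Direct Preference Optimization (Ra-DPO), which adds risk-awareness through nested risk measures and converts the Bradley-Terry model into a token-level representation. While maximizing the likelihood of the policy, it uses a sequential risk ratio to suppress the deviation between the trained model and the reference model, thereby controlling model drift.

Link: https://arxiv.org/abs/2505.20359
Authors: Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: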

Abstract:When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model’s intended behavior. Most existing methods typically introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in certain applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between a trained model and the reference model using a sequential risk ratio, thereby enhancing the model’s risk-awareness. Experimental results on three open-source datasets (IMDb Dataset, Anthropic HH Dataset, and AlpacaEval) demonstrate the proposed method’s superior performance in balancing alignment performance and model drift. Our code is open-sourced at this https URL.
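For background, the Bradley-Terry preference model and the standard sequence-level DPO loss that this line of work builds on can be written as follows. This is the well-known baseline formulation, not the Ra-DPO objective: the token-level form and the sequential risk ratio described in the abstract are not reproduced here. The symbols follow common usage and are assumptions of this note: $x$ is the prompt, $y_w$ and $y_l$ the preferred and rejected responses, $r$ a reward function, $\sigma$ the logistic function, $\pi_\theta$ the trained policy, $\pi_{\mathrm{ref}}$ the reference model, and $\beta$ a temperature.

```latex
% Bradley-Terry model: probability that y_w is preferred over y_l
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Standard DPO loss (baseline; not the Ra-DPO objective)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

In this baseline the only control on drift from the reference model is the implicit KL regularization scaled by $\beta$, which is the limitation the abstract points to.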

[AI-98] LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability

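[Quick read]: This paper addresses the poor performance of existing large language models (LLMs) on long and complex programs. The key is LEGO-Compiler, an LLM-based neural compilation system that translates high-level languages into assembly code through three core innovations: LEGO translation, which decomposes the input program into manageable blocks; a verifiable LLM workflow organized around external tests, which breaks the complex compilation process into smaller, simpler, verifiable steps; and a feedback mechanism for self-correction. Backed by formal proofs of translation composability, LEGO-Compiler achieves high accuracy on multiple datasets and markedly improves scalability in compilable code size.

Link: https://arxiv.org/abs/2505.20356
Authors: Shuoming Zhang, Jiacheng Zhao, Chunwei Xia, Zheng Wang, Yunji Chen, Xiaobing Feng, Huimin Cui
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 30 pages, 8 figures, 4 tables. Preprint. Under review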

Abstract:Large language models (LLMs) have the potential to revolutionize how we design and implement compilers and code translation tools. However, existing LLMs struggle to handle long and complex programs. We introduce LEGO-Compiler, a novel neural compilation system that leverages LLMs to translate high-level languages into assembly code. Our approach centers on three key innovations: LEGO translation, which decomposes the input program into manageable blocks; a verifiable LLM workflow driven by external tests, which breaks the complex compilation process into smaller, simpler, verifiable steps; and a feedback mechanism for self-correction. Supported by formal proofs of translation composability, LEGO-Compiler demonstrates high accuracy on multiple datasets, including over 99% on ExeBench and 97.9% on industrial-grade AnsiBench. Additionally, LEGO-Compiler has achieved a near one order-of-magnitude improvement in compilable code size scalability. This work opens new avenues for applying LLMs to system-level tasks, complementing traditional compiler technologies.

[AI-99] GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

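[Quick read]: This paper addresses overfitting in Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT) when the bottleneck is widened: at high ranks its performance still fails to match full fine-tuning (FFT). The key is a new structure, Granular Low-Rank Adaptation (GraLoRA), which partitions weight matrices into sub-blocks and gives each sub-block its own low-rank adapter, overcoming LoRA's structural bottleneck, increasing representational capacity, and approximating FFT behavior more closely.

Link: https://arxiv.org/abs/2505.20355
Authors: Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: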

Abstract:Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA’s structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA’s limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at this https URL
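The sub-block structure described in this abstract can be sketched with NumPy. The sketch assumes a k-by-k grid of equally sized blocks and a per-block rank equal to the LoRA rank divided by k, chosen here so the parameter count matches plain LoRA; that choice, the random initialization, and the names `init_block_adapters` and `block_delta` are assumptions of this illustration, not details confirmed by the abstract.

```python
import numpy as np

def init_block_adapters(d_out, d_in, k, block_rank, rng):
    """Create one low-rank pair (B_ij, A_ij) for each of the k x k sub-blocks."""
    block_out, block_in = d_out // k, d_in // k
    A = rng.normal(size=(k, k, block_rank, block_in))
    B = rng.normal(size=(k, k, block_out, block_rank))
    return A, B

def block_delta(A, B):
    """Assemble the full weight update from the per-block products B_ij @ A_ij."""
    k = A.shape[0]
    return np.block([[B[i, j] @ A[i, j] for j in range(k)] for i in range(k)])

rng = np.random.default_rng(0)
d_out, d_in, k, lora_rank = 8, 8, 2, 4
block_rank = lora_rank // k  # assumption: keeps the parameter budget equal to LoRA

A, B = init_block_adapters(d_out, d_in, k, block_rank, rng)
delta = block_delta(A, B)

lora_params = (d_out + d_in) * lora_rank
print("block-adapter params:", A.size + B.size, "| LoRA params:", lora_params)
print("rank of one sub-block:", np.linalg.matrix_rank(delta[:4, :4]))
print("rank of full update:", np.linalg.matrix_rank(delta))
```

With the same number of parameters, the assembled update can reach a higher overall rank than a single rank-4 product, which is one way to read the abstract's claim about increased representational capacity.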

[AI-100] Decision Flow Policy Optimization

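[Quick read]: This paper addresses the limitation of single-modal action distributions in conventional reinforcement learning: Gaussian-based policies struggle to model complex multi-modal action distributions in continuous action spaces, which limits robotic control performance. The key is Decision Flow, a unified framework that integrates multi-modal action distribution modeling with policy optimization. It formulates the action generation procedure of flow-based models as a flow decision-making process in which each action generation step corresponds to one flow decision, so the flow policy is optimized seamlessly while multi-modal action distributions are captured.

Link: https://arxiv.org/abs/2505.20350
Authors: Jifeng Hu, Sili Huang, Siyuan Guo, Zhaogeng Liu, Li Shen, Lichao Sun, Hechang Chen, Yi Chang, Dacheng Tao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: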

Abstract:In recent years, generative models have shown remarkable capabilities across diverse fields, including images, videos, language, and decision-making. By applying powerful generative models such as flow-based models to reinforcement learning, we can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces, surpassing the limitations of single-modal action distributions with traditional Gaussian-based policies. Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets, with policy optimization conducted separately through additional policies using value-based sample weighting or gradient-based updates. However, this separation prevents the simultaneous optimization of multi-modal distribution fitting and policy improvement, ultimately hindering the training of models and degrading the performance. To address this issue, we propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization. Specifically, our method formulates the action generation procedure of flow-based models as a flow decision-making process, where each action generation step corresponds to one flow decision. Consequently, our method seamlessly optimizes the flow policy while capturing multi-modal action distributions. We provide rigorous proofs of Decision Flow and validate the effectiveness through extensive experiments across dozens of offline RL environments. Compared with established offline RL baselines, the results demonstrate that our method achieves or matches the SOTA performance.

[AI-101] PDFBench: A Benchmark for De novo Protein Design from Function

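[Quick read]: This paper addresses two problems in current de novo protein design: methods rely on proprietary datasets and evaluation rubrics, which makes comparison difficult, and existing metrics capture only some of the desired properties, so a comprehensive evaluation framework is missing. The key is PDFBench, the first comprehensive benchmark for function-driven de novo protein design. It supports two tasks, description-guided design and keyword-guided design, and compiles 22 metrics covering sequence plausibility, structural fidelity, language-protein alignment, novelty, and diversity, giving different methods a fair and multifaceted evaluation.

Link: https://arxiv.org/abs/2505.20346
Authors: Jiahao Kuang, Nuowei Liu, Changzhi Sun, Tao Ji, Yuanbin Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: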

Abstract:In recent years, while natural language processing and multimodal learning have seen rapid advancements, the field of de novo protein design has also experienced significant growth. However, most current methods rely on proprietary datasets and evaluation rubrics, making fair comparisons between different approaches challenging. Moreover, these methods often employ evaluation metrics that capture only a subset of the desired properties of designed proteins, lacking a comprehensive assessment framework. To address these, we introduce PDFBench, the first comprehensive benchmark for evaluating de novo protein design from function. PDFBench supports two tasks: description-guided design and keyword-guided design. To ensure fair and multifaceted evaluation, we compile 22 metrics covering sequence plausibility, structural fidelity, and language-protein alignment, along with measures of novelty and diversity. We evaluate five state-of-the-art baselines, revealing their respective strengths and weaknesses across tasks. Finally, we analyze inter-metric correlations, exploring the relationships between four categories of metrics, and offering guidelines for metric selection. PDFBench establishes a unified framework to drive future advances in function-driven de novo protein design.

[AI-102] Machine Theory of Mind and the Structure of Human Values NEURIPS

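[Quick read]: This paper addresses the value generalization problem: human behavior cannot demonstrate all of a person's values, so the rest of their complex values must be predicted from a limited sample. The key claim is that human values have a generative rational structure, so Bayesian Theory of Mind models can infer human values not only from behavior but also from other values. This departs from the common practice of representing human values with simple utility functions.

Link: https://arxiv.org/abs/2505.20342
Authors: Paul de Font-Reaulx
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper was originally submitted and accepted to the 2023 NeurIPS MP2 Workshop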

Abstract:Value learning is a crucial aspect of safe and ethical AI. This is primarily pursued by methods inferring human values from behaviour. However, humans care about much more than we are able to demonstrate through our actions. Consequently, an AI must predict the rest of our seemingly complex values from a limited sample. I call this the value generalization problem. In this paper, I argue that human values have a generative rational structure and that this allows us to solve the value generalization problem. In particular, we can use Bayesian Theory of Mind models to infer human values not only from behaviour, but also from other values. This has been obscured by the widespread use of simple utility functions to represent human values. I conclude that developing generative value-to-value inference is a crucial component of achieving a scalable machine theory of mind.

[AI-103] Challenges for artificial cognitive systems

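[Quick read]: This paper addresses the lack of clear criteria for progress in cognitive systems research: in the paper's words, the field needs questions or challenges that define progress, where the challenges are not predictions of the future but a guideline to what the aims are and what would constitute progress. The key is a set of challenges for artificial cognitive systems, grounded in a definition of a cognitive system as a system that learns from experience and uses its acquired knowledge (both declarative and practical) in a flexible manner to achieve its own goals.

Link: https://arxiv.org/abs/2505.20339
Authors: Antoni Gomila, Vincent C. Müller
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: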

Abstract:The declared goal of this paper is to fill this gap: “… cognitive systems research needs questions or challenges that define progress. The challenges are not (yet more) predictions of the future, but a guideline to what are the aims and what would constitute progress.” – the quotation being from the project description of EUCogII, the project for the European Network for Cognitive Systems within which this formulation of the ‘challenges’ was originally developed (this http URL). So, we stick out our neck and formulate the challenges for artificial cognitive systems. These challenges are articulated in terms of a definition of what a cognitive system is: a system that learns from experience and uses its acquired knowledge (both declarative and practical) in a flexible manner to achieve its own goals.

[AI-104] Evaluating the Energy-Efficiency of the Code Generated by LLMs

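[Quick read]: This paper addresses the neglected energy efficiency and environmental impact of code generated by large language models (LLMs). Although LLM-generated code is usually functionally correct, its performance and energy efficiency are often far below those of human-written solutions. The paper compares code generated by 20 popular LLMs against canonical human-written solutions on LeetCode problems of varying difficulty and algorithmic category, measuring the differences in energy efficiency and exposing the potential environmental cost in practice.

Link: https://arxiv.org/abs/2505.20324
Authors: Md Arman Islam, Devi Varaprasad Jonnala, Ritika Rekhi, Pratik Pokharel, Siddharth Cilamkoti, Asif Imran, Tevfik Kosar, Bekir Turkkan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: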

Abstract:As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform by comparing them against canonical human-written solutions. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-generated canonical solutions are approximately 1.17 times more energy efficient than DeepSeek-v3, 1.21 times more energy efficient than GPT-4o, and over 2 times more energy efficient than Grok-2 and Gemini-1.5-Pro. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-generated canonical solutions.

[AI-105] Reinforcement Speculative Decoding for Fast Ranking

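[Quick read]: This paper addresses the latency that auto-regressive decoding imposes on Large Language Models (LLMs) used in ranking systems (such as information retrieval and recommender systems), as well as the severe degradation at tail positions suffered by existing single (first) token decoding methods. The key to the solution is a reinforcement-learning-based speculative decoding method (Reinforcement Speculative Decoding): an up-to-down decoding paradigm in which an agent iteratively modifies the ranking sequence under a constrained budget, together with a ranking-tailored policy optimization that exploits the listwise ranking knowledge obtained across multiple verification rounds, improving both decoding efficiency and ranking accuracy.

Link: https://arxiv.org/abs/2505.20316
Authors: Yingpeng Du,Tianjun Wei,Zhu Sun,Jie Zhang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, 5 tables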

Abstract:Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore the single (first) token decoding for ranking approximation, but they suffer from severe degradation in tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. Firstly, ranking systems require strict latency constraints, but verification rounds in SD methods remain agnostic; Secondly, SD methods usually discard listwise ranking knowledge about unaccepted items in previous rounds, hindering future multi-token prediction, especially when candidate tokens are the unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet the ranking systems’ latency requirement, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization, actively exploring optimal multi-round ranking modification policy verified by LLMs via reinforcement learning (RL). To better approximate the target LLM under the constrained budget, we trigger the agent fully utilizing the listwise ranking knowledge about all items verified by LLMs across different rounds in RL, enhancing the modification policy of the agent. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of our proposed method.

[AI-106] Reasoning in Neurosymbolic AI

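[Quick read]: This paper addresses the integration of knowledge representation and reasoning in neural networks, in particular how to improve data efficiency, fairness, and safety in an AI landscape dominated by Large Language Models (LLMs). The key to the solution is an energy-based neurosymbolic AI system that can formally represent and reason about any propositional logic formula, combining learning from data and knowledge with logical reasoning. The system uses Restricted Boltzmann Machines (RBMs) to establish a correspondence between logical reasoning and energy minimization, which is evaluated empirically, with the aim of promoting the principled integration of reasoning and learning in deep networks.

Link: https://arxiv.org/abs/2505.20313
Authors: Son Tran,Edjard Mota,Artur d’Avila Garcez
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 50 pages, 13 figures, 56 references. Keywords: Neurosymbolic AI, Restricted Boltzmann Machines, Logical Reasoning, SAT solving, MaxSAT, Energy-based Learning, Constrained Optimization, Modular Deep Learning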

Abstract:Knowledge representation and reasoning in neural networks have been a long-standing endeavor which has attracted much attention recently. The principled integration of reasoning and learning in neural networks is a main objective of the area of neurosymbolic Artificial Intelligence (AI). In this chapter, a simple energy-based neurosymbolic AI system is described that can represent and reason formally about any propositional logic formula. This creates a powerful combination of learning from data and knowledge and logical reasoning. We start by positioning neurosymbolic AI in the context of the current AI landscape that is unsurprisingly dominated by Large Language Models (LLMs). We identify important challenges of data efficiency, fairness and safety of LLMs that might be addressed by neurosymbolic reasoning systems with formal reasoning capabilities. We then discuss the representation of logic by the specific energy-based system, including illustrative examples and empirical evaluation of the correspondence between logical reasoning and energy minimization using Restricted Boltzmann Machines (RBM). Learning from data and knowledge is also evaluated empirically and compared with a symbolic, neural and a neurosymbolic system. Results reported in this chapter in an accessible way are expected to reignite the research on the use of neural networks as massively-parallel models for logical reasoning and promote the principled integration of reasoning and learning in deep networks. We conclude the chapter with a discussion of the importance of positioning neurosymbolic AI within a broader framework of formal reasoning and accountability in AI, discussing the challenges for neurosymbolic AI to tackle the various known problems of reliability of deep learning.

[AI-107] Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

[Quick read]: This paper addresses the lack of transparency in traditional applicant selection: candidates are rarely given sufficient justification for hiring decisions. The key to the solution is a multi-agent AI system based on Large Language Models (LLMs), developed through an iterative user-centric design approach, that gives job seekers more explainable, actionable, and fair guidance during recruitment.

Link: https://arxiv.org/abs/2505.20312
Authors: Aditya Bhattacharya,Katrien Verbert
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Pre-print version only. Please check the published version for any reference or citation

Abstract:During job recruitment, traditional applicant selection methods often lack transparency. Candidates are rarely given sufficient justifications for recruiting decisions, whether they are made manually by human recruiters or through the use of black-box Applicant Tracking Systems (ATS). To address this problem, our work introduces a multi-agent AI system that uses Large Language Models (LLMs) to guide job seekers during the recruitment process. Using an iterative user-centric design approach, we first conducted a two-phased exploratory study with four active job seekers to inform the design and development of the system. Subsequently, we conducted an in-depth, qualitative user study with 20 active job seekers through individual one-to-one interviews to evaluate the developed prototype. The results of our evaluation demonstrate that participants perceived our multi-agent recruitment system as significantly more actionable, trustworthy, and fair compared to traditional methods. Our study further helped us uncover in-depth insights into factors contributing to these perceived user experiences. Drawing from these insights, we offer broader design implications for building user-aligned, multi-agent explainable AI systems across diverse domains.

[AI-108] Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

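[Quick read]: This paper addresses the low efficiency and heavy manual effort of traditional meta-analysis, and the hallucinations that LLM-based methods are prone to during paper screening and data extraction. The key to the solution is Manalyzer, a multi-agent system that achieves end-to-end automated meta-analysis through tool calls and combines hybrid review, hierarchical extraction, self-proving, and feedback checking strategies to alleviate these hallucinations.

Link: https://arxiv.org/abs/2505.20310
Authors: Wanghan Xu,Wenlong Zhang,Fenghua Ling,Ben Fei,Yusong Hu,Fangxuan Ren,Jintai Lin,Wanli Ouyang,Lei Bai
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: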

Abstract:Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. However, while LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucinations. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multi meta-analysis tasks. Project page: this https URL .

[AI-109] Large Language Model-Powered Decision Support for a Metal Additive Manufacturing Knowledge Graph

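[Quick read]: This paper addresses the insufficiently connected relationships among processes, materials, feedstock, and post-processing steps in metal additive manufacturing (AM), and the fragmentation of domain knowledge across literature and static databases that often demand expert-level queries. The key to the solution is a queryable knowledge graph (KG) in Neo4j that encodes 53 distinct metals and alloys across seven material families, nine AM processes, four feedstock types, and the associated post-processing requirements, combined with a Large Language Model (LLM) interface that uses a few-shot prompting strategy to support natural language queries for compatibility checks, multi-constraint filtering, and design for AM (DfAM) guidance.

Link: https://arxiv.org/abs/2505.20308
Authors: Muhammad Tayyab Khan,Lequn Chen,Wenhe Feng,Seung Ki Moon
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: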

Abstract:Metal additive manufacturing (AM) involves complex interdependencies among processes, materials, feedstock, and post-processing steps. However, the underlying relationships and domain knowledge remain fragmented across literature and static databases that often demand expert-level queries, limiting their applicability in design and planning. To address these gaps, we develop a novel and queryable knowledge graph (KG) in Neo4j, encoding 53 distinct metals and alloys across seven material families, nine AM processes, four feedstock types, and associated post-processing requirements. A large language model (LLM) interface, guided by a few-shot prompting strategy, enables natural language querying without the need for formal query syntax. The system supports a range of tasks, including compatibility checks, multi-constraint filtering, and design for AM (DfAM) guidance. User natural language queries are normalized, translated into Cypher, and executed over the KG, with results reformatted into structured responses. This work presents the first real-time, interactive system that integrates a domain-specific metal AM KG with an LLM interface, offering accessible, explainable decision support for engineers and advancing human-centric tools in manufacturing intelligence.
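
The few-shot translation step described in the abstract (natural language question in, Cypher out) might be prompted roughly as in the sketch below. This is only an illustration: the node labels (`Material`, `Process`, `Feedstock`), the relationship types, the example questions, and the helper names `normalize` and `build_few_shot_prompt` are all hypothetical, since the abstract does not give the paper's schema or prompt wording. The sketch builds the prompt text only and does not call an LLM or a Neo4j driver.

```python
from typing import List, Tuple

# Hypothetical schema and examples, for illustration only. The paper's actual
# node labels, relationship types, and prompt wording are not in the abstract.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    (
        "Which materials are compatible with laser powder bed fusion?",
        "MATCH (m:Material)-[:COMPATIBLE_WITH]->"
        "(p:Process {name: 'Laser Powder Bed Fusion'}) RETURN m.name",
    ),
    (
        "Which titanium alloys can be processed from wire feedstock?",
        "MATCH (m:Material {family: 'Titanium'})-[:USES_FEEDSTOCK]->"
        "(f:Feedstock {type: 'Wire'}) RETURN m.name",
    ),
]


def normalize(question: str) -> str:
    """Collapse whitespace and make sure the question ends with '?'."""
    text = " ".join(question.split())
    return text if text.endswith("?") else text + "?"


def build_few_shot_prompt(question: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a few-shot prompt asking an LLM to translate a question into Cypher."""
    parts = [
        "Translate each question into a Cypher query over the metal AM knowledge graph.",
        "",
    ]
    # Each worked example shows the model one question and its target query.
    for example_question, example_cypher in examples:
        parts += [f"Question: {example_question}", f"Cypher: {example_cypher}", ""]
    # The new question is left open so the model completes the Cypher line.
    parts += [f"Question: {normalize(question)}", "Cypher:"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_few_shot_prompt("  which   alloys need heat treatment "))
```

The model's completion would then be executed over the KG and reformatted into a structured response, as the abstract describes. Validating the generated Cypher before execution would be a reasonable addition in any real deployment.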

[AI-110] Multi-Modal Artificial Intelligence of Embryo Grading and Pregnancy Prediction in Assisted Reproductive Technology: A Review

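[Quick read]: This paper addresses the challenges that traditional in vitro fertilization-embryo transfer faces in improving pregnancy success rates within assisted reproductive technology, such as the subjectivity of embryo grading and the inefficiency of integrating multi-modal data. The key lies in AI-based techniques, in particular multi-modal AI for embryo grading and pregnancy prediction, which fuses different data modalities (static images, time-lapse videos, and structured tabular data) to make assessment more objective and efficient.

Link: https://arxiv.org/abs/2505.20306
Authors: Xueqiang Ouyang,Jia Wei
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
Comments: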

Abstract:As a global disease, infertility has always affected human beings. The development of assisted reproductive technology can effectively solve this disease. However, the traditional in vitro fertilization-embryo transfer technology still faces many challenges in improving the success rate of pregnancy, such as the subjectivity of embryo grading and the inefficiency of integrating multi-modal data. Therefore, the introduction of artificial intelligence-based technologies is particularly crucial. This article reviews the application progress of multi-modal artificial intelligence in embryo grading and pregnancy prediction based on different data modalities (including static images, time-lapse videos and structured table data) from a new perspective, and discusses the main challenges in current research, such as the complexity of multi-modal information fusion and data scarcity.

[AI-111] Future of Code with Generative AI: Transparency and Safety in the Era of AI Generated Software

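[Quick read]: This paper addresses the growing need for transparency and safety in AI-generated code within software development. The key to the solution is to analyze market opportunities for detecting AI-generated code, discuss the challenges of managing increasing complexity, and propose strategies to enhance transparency and functionality analysis. The study also considers the long-term implications of AI-generated code and emphasizes proactive measures to ensure the responsible development of AI in software engineering.

Link: https://arxiv.org/abs/2505.20303
Authors: David Hanson
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: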

Abstract:As artificial intelligence becomes increasingly integrated into software development processes, the prevalence and sophistication of AI-generated code continue to expand rapidly. This study addresses the critical need for transparency and safety in AI-generated code by examining the current landscape, identifying potential risks, and exploring future implications. We analyze market opportunities for detecting AI-generated code, discuss the challenges associated with managing increasing complexity, and propose solutions to enhance transparency and functionality analysis. Furthermore, this study investigates the long-term implications of AI-generated code, including its potential role in the development of artificial general intelligence and its impact on human-AI interaction. In conclusion, we emphasize the importance of proactive measures for ensuring the responsible development and deployment of AI in software engineering.

[AI-112] VeriThoughts: Enabling Automated Verilog Code Generation using Reasoning and Formal Verification

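[Quick read]: This paper addresses how to automatically generate verifiably correct hardware descriptions from high-level specifications, meeting the growing need for automated hardware design tools. The key to the solution is a new benchmark framework grounded in formal verification methods for evaluating the quality and correctness of generated Verilog code, together with a suite of specialized small-scale models optimized for Verilog generation.

Link: https://arxiv.org/abs/2505.20302
Authors: Patrick Yubeaton,Andre Nakkab,Weihua Xiao,Luca Collini,Ramesh Karri,Chinmay Hegde,Siddharth Garg
Affiliations: unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: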

Abstract:This paper introduces VeriThoughts, a novel dataset designed for reasoning-based Verilog code generation. We establish a new benchmark framework grounded in formal verification methods to evaluate the quality and correctness of generated hardware descriptions. Additionally, we present a suite of specialized small-scale models optimized specifically for Verilog generation. Our work addresses the growing need for automated hardware design tools that can produce verifiably correct implementations from high-level specifications, potentially accelerating the hardware development process while maintaining rigorous correctness guarantees. Our code and data are available at this https URL.

[AI-113] CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

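[Quick read]: This paper addresses the shortcomings of existing financial forecasting methods in handling the market impact of macroeconomic events, in particular their failure to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. The key to the solution is CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modal framework that integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique to improve forecasting accuracy and causal interpretability.

Link: https://arxiv.org/abs/2502.04592
Authors: Yang Zhang,Wenbo Yang,Jun Wang,Qiang Ma,Jie Xiong
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
Comments: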

Abstract:Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

[AI-114] Autoencoding Random Forests

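[Quick read]: This paper addresses how to perform autoencoding with random forests, that is, how to learn a low-dimensional representation of the data and accurately reconstruct the original data from it. The key to the solution is to build on foundational results from nonparametric statistics and spectral graph theory and to use the splits learned by the ensemble's trees to construct decoders that invert the compression pipeline, via constrained optimization, split relabeling, and nearest neighbors regression. The decoders are universally consistent under common regularity assumptions, and the procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions.

Link: https://arxiv.org/abs/2505.21441
Authors: Binh Duc Vu,Jan Kapar,Marvin Wright,David S. Watson
Affiliations: unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages main text, 25 pages total. 5 figures main text, 9 figures total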

Abstract:We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble’s constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.

[AI-115] Quantum AIXI: Universal Intelligence via Quantum Information

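[Quick read]: This paper addresses how to extend the classical AIXI model into a model of universal intelligence within a quantum mechanical framework, given that the universe is quantum mechanical in nature. The key to the solution is a model of quantum agent/environment interaction based on quantum and classical registers and channels, and a reformulation of AIXI's key components in quantum information terms, including quantum Kolmogorov complexity and a QAIXI value function, in order to explore whether Quantum AIXI (QAIXI) is theoretically consistent and practically feasible.

Link: https://arxiv.org/abs/2505.21170
Authors: Elija Perrier
Affiliations: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: Under review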

Abstract:AIXI is a widely studied model of artificial general intelligence (AGI) based upon principles of induction and reinforcement learning. However, AIXI is fundamentally classical in nature - as are the environments in which it is modelled. Given the universe is quantum mechanical in nature and the exponential overhead required to simulate quantum mechanical systems classically, the question arises as to whether there are quantum mechanical analogues of AIXI which are theoretically consistent or practically feasible as models of universal intelligence. To address this question, we extend the framework to quantum information and present Quantum AIXI (QAIXI). We introduce a model of quantum agent/environment interaction based upon quantum and classical registers and channels, showing how quantum AIXI agents may take both classical and quantum actions. We formulate the key components of AIXI in quantum information terms, extending previous research on quantum Kolmogorov complexity and a QAIXI value function. We discuss conditions and limitations upon quantum Solomonoff induction and show how contextuality fundamentally affects QAIXI models.

[AI-116] Fixed-Point Traps and Identity Emergence in Educational Feedback Systems

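[Quick read]: This paper addresses the claim that exam-driven mechanisms in educational systems prevent learner identity from stabilizing and obstruct creative thinking. The key to the solution is the framework of Alpay Algebra II and III: Exam-Grade Collapse Systems (EGCS) are modeled as functorial constructs in which the learning dynamics φ are recursively collapsed by evaluative morphisms E, and it is proved that under such collapse regimes no nontrivial fixed-point algebra μ_φ can exist, so learner identity cannot stabilize. From a category-theoretic perspective, the model describes how exams and grade-based feedback suppress creativity and lead to research stagnation and structural entropy loss.

Link: https://arxiv.org/abs/2505.21038
Authors: Faruk Alpay
Affiliations: unknown
Categories: Category Theory (math.CT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 14 pages, no figures. Formal Bourbaki-style proof. Introduces Exam-Grade Collapse Systems. Builds on Alpay Algebra II (arXiv:2505.17480) and Alpay Algebra III (arXiv:2505.19790). Proves categorical fixed-point traps obstructing identity emergence under exam-driven feedback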

Abstract:This paper presents a formal categorical proof that exam-driven educational systems obstruct identity emergence and block creative convergence. Using the framework of Alpay Algebra II and III, we define Exam-Grade Collapse Systems (EGCS) as functorial constructs where learning dynamics $\varphi$ are recursively collapsed by evaluative morphisms $E$. We prove that under such collapse regimes, no nontrivial fixed-point algebra $\mu_\varphi$ can exist, hence learner identity cannot stabilize. This creates a universal fixed-point trap: all generative functors are entropically folded before symbolic emergence occurs. Our model mathematically explains the creativity suppression, research stagnation, and structural entropy loss induced by timed exams and grade-based feedback. The results apply category theory to expose why modern educational systems prevent $\phi$-emergence and block observer-invariant self-formation. This work provides the first provable algebraic obstruction of identity formation caused by institutional feedback mechanics.

[AI-117] Unified Deep Learning Approach for Estimating the Metallicities of RR Lyrae Stars Using light curves from Gaia Data Release 3

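[Quick read]: This paper addresses the efficient estimation of metallicities of RR Lyrae stars (RRLs) from large samples of photometric data; with ESA Gaia DR3 providing light curves for about 270,000 RRLs, scalable methods are needed. The key to the solution is a unified deep learning framework that uses Gaia G-band light curves to estimate metallicities for both fundamental-mode (RRab) and first-overtone (RRc) RRLs. The framework is based on a Gated Recurrent Unit (GRU) neural network optimized for time-series extrinsic regression, with preprocessing steps such as phase folding, smoothing, and sample weighting, and it handles the morphological differences between the two pulsation types without separate models.

Link: https://arxiv.org/abs/2505.20947
Authors: Lorenzo Monti,Tatiana Muraveva,Alessia Garofalo,Gisella Clementini,Maria Letizia Valentini
Affiliations: unknown
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
Comments: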

Abstract:RR Lyrae stars (RRLs) are old pulsating variables widely used as metallicity tracers due to the correlation between their metal abundances and light curve morphology. With ESA Gaia DR3 providing light curves for about 270,000 RRLs, there is a pressing need for scalable methods to estimate their metallicities from photometric data. We introduce a unified deep learning framework that estimates metallicities for both fundamental-mode (RRab) and first-overtone (RRc) RRLs using Gaia G-band light curves. This approach extends our previous work on RRab stars to include RRc stars, aiming for high predictive accuracy and broad generalization across both pulsation types. The model is based on a Gated Recurrent Unit (GRU) neural network optimized for time-series extrinsic regression. Our pipeline includes preprocessing steps such as phase folding, smoothing, and sample weighting, and uses photometric metallicities from the literature as training targets. The architecture is designed to handle morphological differences between RRab and RRc light curves without requiring separate models. On held-out validation sets, our GRU model achieves strong performance: for RRab stars, MAE = 0.0565 dex, RMSE = 0.0765 dex, R^2 = 0.9401; for RRc stars, MAE = 0.0505 dex, RMSE = 0.0720 dex, R^2 = 0.9625. These results show the effectiveness of deep learning for large-scale photometric metallicity estimation and support its application to studies of stellar populations and Galactic structure.
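
Two pieces of the abstract are standard enough to sketch: phase folding, which maps each observation time to a position within the pulsation cycle, and the reported regression metrics (MAE, RMSE, R^2). The code below is a minimal illustration using only the standard library. The function names, the `epoch` parameter, and the toy inputs are assumptions, not the authors' pipeline, which also applies smoothing and sample weighting not shown here.

```python
import math


def phase_fold(times, mags, period, epoch=0.0):
    """Fold a light curve on a known period.

    Returns (phase, magnitude) pairs sorted by phase, where
    phase = ((t - epoch) / period) mod 1 lies in [0, 1) up to float rounding.
    """
    folded = [(((t - epoch) / period) % 1.0, m) for t, m in zip(times, mags)]
    folded.sort(key=lambda pair: pair[0])
    return folded


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


if __name__ == "__main__":
    # Toy observation times (days) and magnitudes, folded on a 1-day period.
    print(phase_fold([0.0, 0.5, 1.25, 2.75], [10, 11, 12, 13], period=1.0))
    # Toy metallicities in dex: two exact predictions and one off by 1 dex.
    print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

Folding matters because observations are sparse and irregularly sampled in time; once folded, points from many cycles line up into a single curve whose shape carries the metallicity signal.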

[AI-118] Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction INTERSPEECH2025

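[Quick read]: This paper addresses audio-visual target speaker extraction in multi-speaker scenarios, that is, isolating a target speaker's speech from a mixture signal, typically conditioned on the target speaker's face recording. In real-world scenarios, however, other co-occurring faces are often present on-screen and provide valuable speaker activity cues. The key is a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, enabling more accurate speaker extraction in complex multi-person environments.

Link: https://arxiv.org/abs/2505.20635
Authors: Zexu Pan,Shengkui Zhao,Tingting Wang,Kun Zhou,Yukun Ma,Chong Zhang,Bin Ma
Affiliations: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Interspeech 2025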

Abstract:Audio-visual speaker extraction isolates a target speaker’s speech from a mixture speech signal conditioned on a visual cue, typically using the target speaker’s face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process these flexible numbers of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.

[AI-119] CardioPatternFormer: Pattern-Guided Attention for Interpretable ECG Classification with Transformer Architecture

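[Quick read]: This paper addresses the difficulty of accurate electrocardiogram (ECG) interpretation, in particular where complex cardiac data and "black-box" AI models limit clinical utility. The key to the solution is CardioPatternFormer, a novel Transformer-based model for interpretable ECG classification that uses an attention mechanism to identify and classify diverse cardiac patterns, is particularly suited to detecting subtle anomalies and distinguishing multiple co-occurring conditions, and provides transparent interpretations of ECG signals.

Link: https://arxiv.org/abs/2505.20481
Authors: Berat Kutay Uğraş,Ömer Nezih Gerek,İbrahim Talha Saygı
Affiliations: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: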

Abstract:Accurate ECG interpretation is vital, yet complex cardiac data and “black-box” AI models limit clinical utility. Inspired by Transformer architectures’ success in NLP for understanding sequential data, we frame ECG as the heart’s unique “language” of temporal patterns. We present CardioPatternFormer, a novel Transformer-based model for interpretable ECG classification. It employs a sophisticated attention mechanism to precisely identify and classify diverse cardiac patterns, excelling at discerning subtle anomalies and distinguishing multiple co-occurring conditions. This pattern-guided attention provides clear insights by highlighting influential signal regions, effectively allowing the “heart to talk” through transparent interpretations. CardioPatternFormer demonstrates robust performance on challenging ECGs, including complex multi-pathology cases. Its interpretability via attention maps enables clinicians to understand the model’s rationale, fostering trust and aiding informed diagnostic decisions. This work offers a powerful, transparent solution for advanced ECG analysis, paving the way for more reliable and clinically actionable AI in cardiology.

[AI-120] Data-driven multi-agent modelling of calcium interactions in cell culture: PINN vs Regularized Least-squares

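[Quick read]: This paper addresses the data-driven discovery of dynamics in biological systems, specifically modeling and parameter identification for calcium signaling. The key is a comparison of the Constrained Regularized Least-Squares Method (CRLSM) and Physics-Informed Neural Networks (PINNs) for system identification and parameter discovery of governing ordinary differential equations (ODEs). CRLSM achieves fairly good parameter estimates and a good data fit, whereas PINNs, under the current configuration, fail to match CRLSM; the authors expect that further hyperparameter tuning and uncertainty quantification could improve PINN performance.

Link: https://arxiv.org/abs/2505.20327
Authors: Aurora Poggi,Giuseppe Alessio D’Inverno,Hjalmar Brismar,Ozan Öktem,Matthieu Barreau,Kateryna Morozovska
Affiliations: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: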

Abstract:Data-driven discovery of dynamics in biological systems allows for better observation and characterization of processes, such as calcium signaling in cell culture. Recent advancements in techniques allow the exploration of previously unattainable insights of dynamical systems, such as the Sparse Identification of Non-Linear Dynamics (SINDy), overcoming the limitations of more classic methodologies. The latter requires some prior knowledge of an effective library of candidate terms, which is not realistic for a real case study. Using inspiration from fields like traffic density estimation and control theory, we propose a methodology for characterization and performance analysis of calcium delivery in a family of cells. In this work, we compare the performance of the Constrained Regularized Least-Squares Method (CRLSM) and Physics-Informed Neural Networks (PINN) for system identification and parameter discovery for governing ordinary differential equations (ODEs). The CRLSM achieves a fairly good parameter estimate and a good data fit when using the learned parameters in the Consensus problem. On the other hand, despite the initial hypothesis, PINNs fail to match the CRLSM performance and, under the current configuration, do not provide fair parameter estimation. However, we have only studied a limited number of PINN architectures, and it is expected that additional hyperparameter tuning, as well as uncertainty quantification, could significantly improve the performance in future works.

[AI-121] MetamatBench: Integrating Heterogeneous Data Computational Tools and Visual Interface for Metamaterial Discovery

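[Quick read]: This paper addresses three fundamental challenges that hinder the use of advanced machine learning (ML) for metamaterial discovery: data heterogeneity (C1), model complexity (C2), and human-AI collaboration (C3). The key to the solution is MetamatBench, a unified framework designed at the data, ML, and user levels: it integrates and standardizes multi-modal datasets, provides a toolkit of state-of-the-art ML methods adapted to metamaterial discovery, and offers a visual-interactive interface, which aims to improve the interpretability, operability, and practical usability of the models.

Link: https://arxiv.org/abs/2505.20299
Authors: Jianpeng Chen,Wangzhi Zhan,Haohui Wang,Zian Jia,Jingru Gan,Junkai Zhang,Jingyuan Qi,Tingwei Chen,Lifu Huang,Muhao Chen,Ling Li,Wei Wang,Dawei Zhou
Affiliations: unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments: 15 pages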

Abstract:Metamaterials, engineered materials with architected structures across multiple length scales, offer unprecedented and tunable mechanical properties that surpass those of conventional materials. However, leveraging advanced machine learning (ML) for metamaterial discovery is hindered by three fundamental challenges: (C1) Data Heterogeneity Challenge arises from heterogeneous data sources, heterogeneous composition scales, and heterogeneous structure categories; (C2) Model Complexity Challenge stems from the intricate geometric constraints of ML models, which complicate their adaptation to metamaterial structures; and (C3) Human-AI Collaboration Challenge comes from the "dual black-box" nature of sophisticated ML models and the need for intuitive user interfaces. To tackle these challenges, we introduce a unified framework, named MetamatBench, that operates on three levels. (1) At the data level, we integrate and standardize 5 heterogeneous, multi-modal metamaterial datasets. (2) The ML level provides a comprehensive toolkit that adapts 17 state-of-the-art ML methods for metamaterial discovery. It also includes a comprehensive evaluation suite with 12 novel performance metrics with finite element-based assessments to ensure accurate and reliable model validation. (3) The user level features a visual-interactive interface that bridges the gap between complex ML techniques and non-ML researchers, advancing property prediction and inverse design of metamaterials for research and applications. MetamatBench offers a unified platform deployed at this http URL that enables machine learning researchers and practitioners to develop and evaluate new methodologies in metamaterial discovery. For accessibility and reproducibility, we open-source our benchmark and the codebase at this https URL.

Machine Learning

[LG-0] Algorithms and SQ Lower Bounds for Robustly Learning Real-valued Multi-index Models

Link: https://arxiv.org/abs/2505.21475
Authors: Ilias Diakonikolas,Giannis Iakovidis,Daniel M. Kane,Lisheng Ren
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:We study the complexity of learning real-valued Multi-Index Models (MIMs) under the Gaussian distribution. A $K$-MIM is a function $f:\mathbb{R}^d\to\mathbb{R}$ that depends only on the projection of its input onto a $K$-dimensional subspace. We give a general algorithm for PAC learning a broad class of MIMs with respect to the square loss, even in the presence of adversarial label noise. Moreover, we establish a nearly matching Statistical Query (SQ) lower bound, providing evidence that the complexity of our algorithm is qualitatively optimal as a function of the dimension. Specifically, we consider the class of bounded variation MIMs with the property that degree at most $m$ distinguishing moments exist with respect to projections onto any subspace. In the presence of adversarial label noise, the complexity of our learning algorithm is $d^{O(m)}2^{\mathrm{poly}(K/\epsilon)}$. For the realizable and independent noise settings, our algorithm incurs complexity $d^{O(m)}2^{\mathrm{poly}(K)}(1/\epsilon)^{O(K)}$. To complement our upper bound, we show that if for some subspace degree-$m$ distinguishing moments do not exist, then any SQ learner for the corresponding class of MIMs requires complexity $d^{\Omega(m)}$. As an application, we give the first efficient learner for the class of positive-homogeneous $L$-Lipschitz $K$-MIMs. The resulting algorithm has complexity $\mathrm{poly}(d)\,2^{\mathrm{poly}(KL/\epsilon)}$. This gives a new PAC learning algorithm for Lipschitz homogeneous ReLU networks with complexity independent of the network size, removing the exponential dependence incurred in prior work.
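
To make the definition of a multi-index model concrete, here is a minimal sketch. It only illustrates that such a function depends on its input through a low-dimensional projection; it is not the paper's learning algorithm. The matrix `W`, the link function `g`, and the helper names are all hypothetical choices made for illustration.

```python
def project(W, x):
    """Project x onto the rows of W (a K x d matrix given as a list of lists)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def make_mim(W, g):
    """Return f(x) = g(W x); f sees x only through the K-dimensional projection."""
    return lambda x: g(project(W, x))

# Hypothetical 2-MIM in d = 4: only the first two coordinates matter.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
g = lambda z: max(z[0], 0.0) + 0.5 * z[1]   # arbitrary illustrative link function
f = make_mim(W, g)

x = [0.3, -1.2, 5.0, 7.0]
x_moved = [0.3, -1.2, -9.0, 2.5]   # differs from x only outside the subspace
print(f(x), f(x_moved))            # the two values coincide
```

Moving the input in any direction orthogonal to the rows of `W` leaves the output unchanged, which is the structural property a learner for MIMs can exploit.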

[LG-1] Causal Posterior Estimation

Link: https://arxiv.org/abs/2505.21468
Authors: Simon Dirmeier,Antonietta Mira
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We present Causal Posterior Estimation (CPE), a novel method for Bayesian inference in simulator models, i.e., models where the evaluation of the likelihood function is intractable or too computationally expensive, but where one can simulate model outputs given parameter values. CPE utilizes a normalizing flow-based (NF) approximation to the posterior distribution which carefully incorporates the conditional dependence structure induced by the graphical representation of the model into the neural network. Thereby it is possible to improve the accuracy of the approximation. We introduce both discrete and continuous NF architectures for CPE and propose a constant-time sampling procedure for the continuous case which reduces the computational complexity of drawing samples to O(1) as for discrete NFs. We show, through an extensive experimental evaluation, that by incorporating the conditional dependencies induced by the graphical model directly into the neural network, rather than learning them from data, CPE is able to conduct highly accurate posterior inference either outperforming or matching the state of the art in the field.

[LG-2] High-Dimensional Calibration from Swap Regret

Link: https://arxiv.org/abs/2505.21460
Authors: Maxwell Fishelson,Noah Golowich,Mehryar Mohri,Jon Schneider
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments:

Abstract:We study the online calibration of multi-dimensional forecasts over an arbitrary convex set $\mathcal{P} \subset \mathbb{R}^d$ relative to an arbitrary norm $\Vert\cdot\Vert$. We connect this with the problem of external regret minimization for online linear optimization, showing that if it is possible to guarantee $O(\sqrt{\rho T})$ worst-case regret after $T$ rounds when actions are drawn from $\mathcal{P}$ and losses are drawn from the dual $\Vert\cdot\Vert_*$ unit norm ball, then it is also possible to obtain $\epsilon$-calibrated forecasts after $T = \exp(O(\rho/\epsilon^2))$ rounds. When $\mathcal{P}$ is the $d$-dimensional simplex and $\Vert\cdot\Vert$ is the $\ell_1$-norm, the existence of $O(\sqrt{T\log d})$-regret algorithms for learning with experts implies that it is possible to obtain $\epsilon$-calibrated forecasts after $T = \exp(O(\log d/\epsilon^2)) = d^{O(1/\epsilon^2)}$ rounds, recovering a recent result of Peng (2025). Interestingly, our algorithm obtains this guarantee without requiring access to any online linear optimization subroutine or knowledge of the optimal rate $\rho$ – in fact, our algorithm is identical for every setting of $\mathcal{P}$ and $\Vert\cdot\Vert$. Instead, we show that the optimal regularizer for the above OLO problem can be used to upper bound the above calibration error by a swap regret, which we then minimize by running the recent TreeSwap algorithm with Follow-The-Leader as a subroutine. Finally, we prove that any online calibration algorithm that guarantees $\epsilon T$ $\ell_1$-calibration error over the $d$-dimensional simplex requires $T \geq \exp(\mathrm{poly}(1/\epsilon))$ (assuming $d \geq \mathrm{poly}(1/\epsilon)$). This strengthens the corresponding $d^{\Omega(\log 1/\epsilon)}$ lower bound of Peng, and shows that an exponential dependence on $1/\epsilon$ is necessary.
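
The step from the exponential form of the round count to the polynomial-in-d form for the simplex case is a one-line rewrite. In the sketch below, c is a hypothetical name for the constant hidden inside the O(.) notation; it does not appear in the abstract.

```latex
% c > 0 is an assumed name for the constant hidden by the O(.) notation.
\exp\!\left(\frac{c\,\log d}{\epsilon^{2}}\right)
  = \left(e^{\log d}\right)^{c/\epsilon^{2}}
  = d^{\,c/\epsilon^{2}}
  = d^{O(1/\epsilon^{2})}.
```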

[LG-3] Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling ICML2025

Link: https://arxiv.org/abs/2505.21452
Authors: Xiangxin Zhou,Mingyu Li,Yi Xiao,Jiahan Li,Dongyu Xue,Zaixiang Zheng,Jianzhu Ma,Quanquan Gu
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Accepted to ICML 2025

Abstract:Cyclic peptides offer inherent advantages in pharmaceuticals. For example, cyclic peptides are more resistant to enzymatic hydrolysis compared to linear peptides and usually exhibit excellent stability and affinity. Although deep generative models have achieved great success in linear peptide design, several challenges prevent the development of computational methods for designing diverse types of cyclic peptides. These challenges include the scarcity of 3D structural data on target proteins and associated cyclic peptide ligands, the geometric constraints that cyclization imposes, and the involvement of non-canonical amino acids in cyclization. To address the above challenges, we introduce CpSDE, which consists of two key components: AtomSDE, a generative structure prediction model based on harmonic SDE, and ResRouter, a residue type predictor. Utilizing a routed sampling algorithm that alternates between these two models to iteratively update sequences and structures, CpSDE facilitates the generation of cyclic peptides. By employing explicit all-atom and bond modeling, CpSDE overcomes existing data limitations and is proficient in designing a wide variety of cyclic peptides. Our experimental results demonstrate that the cyclic peptides designed by our method exhibit reliable stability and affinity.

[LG-4] Can Large Reasoning Models Self-Train?

Link: https://arxiv.org/abs/2505.21444
Authors: Sheikh Shafayat,Fahim Tajwar,Ruslan Salakhutdinov,Jeff Schneider,Andrea Zanette
Categories: Machine Learning (cs.LG)
Comments: Project website: this https URL

Abstract:Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternative, but it incurs scalability limitations due to dependency upon human-designed verifiers. Self-training, where the model’s own judgment provides the supervisory signal, presents a compelling direction. We propose an online self-training reinforcement learning algorithm that leverages the model’s self-consistency to infer correctness signals and train without any ground-truth supervision. We apply the algorithm to challenging mathematical reasoning tasks and show that it quickly reaches performance levels rivaling reinforcement-learning methods trained explicitly on gold-standard answers. Additionally, we analyze inherent limitations of the algorithm, highlighting how the self-generated proxy reward initially correlated with correctness can incentivize reward hacking, where confidently incorrect outputs are favored. Our results illustrate how self-supervised improvement can achieve significant performance gains without external labels, while also revealing its fundamental challenges.
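
One simple way to turn self-consistency into a training signal is majority voting over several sampled answers. The sketch below assumes that reading; the abstract does not spell out the exact reward, so the function name `self_consistency_rewards` and the 0/1 reward scheme are illustrative assumptions rather than the authors' method.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Give reward 1.0 to samples that match the majority answer, else 0.0.

    Assumption: the majority answer among the model's own samples stands in
    for the unknown ground-truth label.
    """
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]

# Five hypothetical final answers sampled for the same math problem.
samples = ["42", "42", "41", "42", "7"]
pseudo_label, rewards = self_consistency_rewards(samples)
print(pseudo_label, rewards)
```

The same sketch also shows the failure mode the abstract warns about: if the model becomes confidently wrong, the majority answer is wrong too, and the proxy reward reinforces it.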

[LG-5] Measuring Fine-Grained Relatedness in Multitask Learning via Data Attribution

Link: https://arxiv.org/abs/2505.21438
Authors: Yiwen Tu,Ziqi Liu,Jiaqi W. Ma,Weijing Tang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Measuring task relatedness and mitigating negative transfer remain critical open challenges in Multitask Learning (MTL). This work extends data attribution – which quantifies the influence of individual training data points on model predictions – to the MTL setting for measuring task relatedness. We propose the MultiTask Influence Function (MTIF), a method that adapts influence functions to MTL models with hard or soft parameter sharing. Compared to conventional task relatedness measurements, MTIF provides a fine-grained, instance-level relatedness measure beyond the entire-task level. This fine-grained relatedness measure enables a data selection strategy to effectively mitigate negative transfer in MTL. Through extensive experiments, we demonstrate that the proposed MTIF efficiently and accurately approximates the performance of models trained on data subsets. Moreover, the data selection strategy enabled by MTIF consistently improves model performance in MTL. Our work establishes a novel connection between data attribution and MTL, offering an efficient and fine-grained solution for measuring task relatedness and enhancing MTL models.

[LG-6] Attribute-Efficient PAC Learning of Sparse Halfspaces with Constant Malicious Noise Rate

Link: https://arxiv.org/abs/2505.21430
Authors: Shiwei Zeng,Jie Shen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Attribute-efficient learning of sparse halfspaces has been a fundamental problem in machine learning theory. In recent years, machine learning algorithms are faced with prevalent data corruptions or even adversarial attacks. It is of central interest to design efficient algorithms that are robust to noise corruptions. In this paper, we consider that there exists a constant amount of malicious noise in the data and the goal is to learn an underlying $s$-sparse halfspace $w^* \in \mathbb{R}^d$ with $\mathrm{poly}(s,\log d)$ samples. Specifically, we follow a recent line of works and assume that the underlying distribution satisfies a certain concentration condition and a margin condition at the same time. Under such conditions, we show that attribute-efficiency can be achieved by simple variants to existing hinge loss minimization programs. Our key contributions include: 1) an attribute-efficient PAC learning algorithm that works under constant malicious noise rate; 2) a new gradient analysis that carefully handles the sparsity constraint in hinge loss minimization.

[LG-7] Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization

Link: https://arxiv.org/abs/2505.21423
Authors: Vit Fojtik,Maria Matveev,Hung-Hsu Chou,Gitta Kutyniok,Johannes Maly
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this theoretically, recent works examine gradient descent and its variants in simplified training settings, often assuming vanishing learning rates. These studies reveal various forms of implicit regularization, such as $\ell_1$-norm minimizing parameters in regression and max-margin solutions in classification. Concurrently, empirical findings show that moderate to large learning rates exceeding standard stability thresholds lead to faster, albeit oscillatory, convergence in the so-called Edge-of-Stability regime, and induce an implicit bias towards minima of low sharpness (norm of training loss Hessian). In this work, we argue that a comprehensive understanding of the generalization performance of gradient descent requires analyzing the interaction between these various forms of implicit regularization. We empirically demonstrate that the learning rate balances between low parameter norm and low sharpness of the trained model. We furthermore prove for diagonal linear networks trained on a simple regression task that neither implicit bias alone minimizes the generalization error. These findings demonstrate that focusing on a single implicit bias is insufficient to explain good generalization, and they motivate a broader view of implicit regularization that captures the dynamic trade-off between norm and sharpness induced by non-negligible learning rates.
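
The "standard stability threshold" can be seen on a one-dimensional quadratic, where gradient descent contracts only when the learning rate is below 2 divided by the curvature. The sketch below is a textbook toy and not the paper's experiment; it does not reproduce the oscillatory yet convergent behaviour of real networks. The function name and the numeric values are illustrative assumptions.

```python
def gradient_descent_1d(curvature, lr, x0=1.0, steps=50):
    """Run GD on f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x      # the gradient of f is curvature * x
    return abs(x)

sharpness = 4.0                      # Hessian (second derivative) of the toy loss
threshold = 2.0 / sharpness          # classical stability limit for plain GD

below = gradient_descent_1d(sharpness, lr=0.9 * threshold)   # iterates shrink
above = gradient_descent_1d(sharpness, lr=1.1 * threshold)   # iterates blow up
print(below, above)
```

Each step multiplies the iterate by `1 - lr * curvature`, so the iterates shrink exactly when that factor has magnitude below one. This is why a larger learning rate can only remain stable at minima with lower sharpness.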

[LG-8] When Shift Happens - Confounding Is to Blame

Link: https://arxiv.org/abs/2505.21422
Authors: Abbavaram Gowtham Reddy,Celia Rubio-Madrigal,Rebekka Burkholz,Krikamol Muandet
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present a counterintuitive finding: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) its OOD generalization performance improves when all available covariates, not just causal ones, are utilized. Drawing on both empirical and theoretical evidence, we attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing OOD generalization approaches. Under such conditions, we prove that effective generalization requires learning environment-specific relationships, rather than relying solely on invariant ones. Furthermore, we show that models augmented with proxies for hidden confounders can mitigate the challenges posed by hidden confounding shifts. These findings offer new theoretical insights and practical guidance for designing robust OOD generalization algorithms and principled covariate selection strategies.

[LG-9] Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2505.21404
Authors: Anas Jnini,Flavio Vella
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss–Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_\gamma N_\gamma d_\gamma$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_\gamma$ collocation points of output dimension $d_\gamma$. Building on this insight, we introduce Dual Natural Gradient Descent (D-NGD). D-NGD computes the Gauss–Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$. Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final $L^2$ error than first-order methods (Adam, SGD) and quasi-Newton methods, and – crucially – enables natural-gradient training of PINNs at this scale on a single GPU.
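
The claim that the same step can be computed in residual space can be checked with a standard matrix identity. The following is a sketch under assumptions not stated in the abstract: J denotes the m x n Jacobian of the stacked residual r, and a damping term lambda > 0 is added so that both inverses exist. The paper's exact formulation may differ.

```latex
% Assumed notation: J is the m x n residual Jacobian, r the stacked residual,
% and lambda > 0 a damping parameter (an assumption made for invertibility).
\begin{aligned}
J^\top \left(J J^\top + \lambda I_m\right)
  &= \left(J^\top J + \lambda I_n\right) J^\top \\
\Longrightarrow\quad
\left(J^\top J + \lambda I_n\right)^{-1} J^\top r
  &= J^\top \left(J J^\top + \lambda I_m\right)^{-1} r .
\end{aligned}
```

The left-hand side requires solving an n x n system in parameter space, while the right-hand side only requires an m x m solve in residual space, which is cheaper whenever m is smaller than n.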

[LG-10] A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective

Link: https://arxiv.org/abs/2505.21400
Authors: Gen Li,Changxiao Cai
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion model approaches remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations T and scales linearly with the mutual information between tokens in the target text sequence. In particular, we establish matching upper and lower bounds, up to some constant factor, to demonstrate the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.

[LG-11] SquareχPO: Differentially Private and Robust χ²-Preference Optimization in Offline Direct Alignment

Link: https://arxiv.org/abs/2505.21395
Authors: Xingyu Zhou,Yulian Wu,Wenqian Weng,Francesco Orabona
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we theoretically study the offline alignment of language models with human preference feedback, under both preference label corruption and privacy protections. To this end, we propose Square$\chi$PO, a simple one-line change to $\chi$PO where the standard log-loss is replaced by a new square loss over probability. Thanks to the inherent properties of this new loss, we have advanced the state-of-the-art of differentially private and robust offline direct alignment. Specifically, for the local model of label privacy, Square$\chi$PO is the first algorithm that attains an optimal rate based on single-policy concentrability even with general function approximations. It also gives the first result under the central model of privacy protection over both prompts (responses) and labels. On the robustness side against Huber label corruption, Square$\chi$PO is the first alignment method that has a meaningful theoretical guarantee under general function approximations. More importantly, Square$\chi$PO can address privacy protection and corruption simultaneously, where an interesting separation is observed, implying that the order of privacy and corruption matters. Furthermore, we show that Square$\chi$PO can also be easily extended to handle the scenario of the general preference model with state-of-the-art guarantees under corruption and privacy. Last but not least, all of our theoretical guarantees enjoy a unified analysis, building upon a new result on the generalization error bounds of least-square regression under corruption and privacy constraints, which we believe is of independent interest to the community.

[LG-12] DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models

Link: https://arxiv.org/abs/2505.21382
Authors: Nastaran Saadati,Zhanhong Jiang,Joshua R. Waite,Shreyan Ganguly,Aditya Balu,Chinmay Hegde,Soumik Sarkar
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Low-Rank Adaptation (LoRA) has emerged as one of the most effective, computationally tractable fine-tuning approaches for training Vision-Language Models (VLMs) and Large Language Models (LLMs). LoRA accomplishes this by freezing the pre-trained model weights and injecting trainable low-rank matrices, allowing for efficient learning of these foundation models even on edge devices. However, LoRA in decentralized settings remains underexplored, particularly its theoretical underpinnings, due to the lack of a smoothness guarantee and model consensus interference (defined formally below). This work improves the convergence rate of decentralized LoRA (DLoRA) to match the rate of decentralized SGD by ensuring gradient smoothness. We also introduce DeCAF, a novel algorithm integrating DLoRA with truncated singular value decomposition (TSVD)-based matrix factorization to resolve consensus interference. Theoretical analysis shows TSVD’s approximation error is bounded and consensus differences between DLoRA and DeCAF vanish as rank increases, yielding DeCAF’s matching convergence rate. Extensive experiments across vision/language tasks demonstrate that our algorithms outperform local training and rival federated learning under both IID and non-IID data distributions.
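
The "freeze the weights, inject low-rank matrices" mechanism can be sketched in a few lines. This shows the generic LoRA forward pass with the commonly used alpha / r scaling. It is not DeCAF or its decentralized consensus step, and the helper names and numeric values are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x) with W frozen and A, B trainable."""
    r = len(A)                        # rank = number of rows of A
    base = matvec(W, x)               # frozen pre-trained path
    delta = matvec(B, matvec(A, x))   # low-rank path; B @ A is never formed
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # frozen 2 x 3 weight
A = [[0.5, -0.5, 1.0]]                     # r x d_in = 1 x 3
B_zero = [[0.0], [0.0]]                    # d_out x r, zero-initialised
B_trained = [[2.0], [-1.0]]                # pretend values after some training
x = [1.0, 1.0, 1.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)
print(y_init, y_tuned)
```

With `B` initialised to zero the adapted layer reproduces the frozen model exactly, so fine-tuning starts from the pre-trained behaviour and only the small matrices `A` and `B` need to be trained or communicated.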

[LG-13] PLANETALIGN: A Comprehensive Python Library for Benchmarking Network Alignment

Link: https://arxiv.org/abs/2505.21366
Authors: Qi Yu,Zhichen Zeng,Yuchen Yan,Zhining Liu,Baoyu Jing,Ruizhong Qiu,Ariful Azad,Hanghang Tong
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Network alignment (NA) aims to identify node correspondence across different networks and serves as a critical cornerstone behind various downstream multi-network learning tasks. Despite growing research in NA, the field lacks a comprehensive library that facilitates the systematic development and benchmarking of NA methods. In this work, we introduce PLANETALIGN, a comprehensive Python library for network alignment that features a rich collection of built-in datasets, methods, and evaluation pipelines with easy-to-use APIs. Specifically, PLANETALIGN integrates 18 datasets and 14 NA methods with extensible APIs for easy use and development of NA methods. Our standardized evaluation pipeline encompasses a wide range of metrics, enabling a systematic assessment of the effectiveness, scalability, and robustness of NA methods. Through extensive comparative studies, we reveal practical insights into the strengths and limitations of existing NA methods. We hope that PLANETALIGN can foster a deeper understanding of the NA problem and facilitate the development and benchmarking of more effective, scalable, and robust methods in the future. The source code of PLANETALIGN is available at this https URL.

[LG-14] CRISP-NAM: Competing Risks Interpretable Survival Prediction with Neural Additive Models

Link: https://arxiv.org/abs/2505.21360
Authors: Dhanesh Ramachandram
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Competing risks are crucial considerations in survival modelling, particularly in healthcare domains where patients may experience multiple distinct event types. We propose CRISP-NAM (Competing Risks Interpretable Survival Prediction with Neural Additive Models), an interpretable neural additive model for competing risks survival analysis which extends the neural additive architecture to model cause-specific hazards while preserving feature-level interpretability. Each feature contributes independently to risk estimation through dedicated neural networks, allowing for visualization of complex non-linear relationships between covariates and each competing risk. We demonstrate competitive performance on multiple datasets compared to existing approaches.

[LG-15] Towards Robust Automated Perceptual Voice Quality Assessment with Deep Learning

Link: https://arxiv.org/abs/2505.21356
Authors: Whenty Ariyanti,Kuan-Yu Chen,Sabato Marco Siniscalchi,Hsin-Min Wang,Yu Tsao
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Objective: Perceptual voice quality assessment plays a critical role in diagnosing and monitoring voice disorders by providing standardized evaluation of vocal function. Traditionally, this process relies on expert raters utilizing standard scales, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS). However, these metrics are inherently subjective and susceptible to inter-rater variability, motivating the need for automated and objective assessment methods. Methods: We propose Voice Quality Assessment Network (VOQANet), a deep learning-based framework with an attention mechanism that leverages a Speech Foundation Model (SFM) to capture high-level acoustic and prosodic information from raw speech. To enhance robustness and interpretability, we present VOQANet+, which integrates handcrafted acoustic features such as jitter, shimmer, and harmonics-to-noise ratio (HNR) with SFM embeddings. Results: Sentence-based input yields stronger performance than vowel-based input, especially at the patient level. VOQANet consistently outperforms baseline methods in RMSE and PCC, while VOQANet+ performs even better and maintains robustness under noisy conditions. Conclusion: Combining SFM embeddings with domain-informed acoustic features improves interpretability and resilience. Significance: VOQANet+ shows strong potential for deployment in real-world and telehealth settings, addressing the limitations of subjective perceptual assessments with an interpretable and noise-resilient solution.

[LG-16] OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Link: https://arxiv.org/abs/2505.21347
Authors: Ziheng Cheng,Yixiao Huang,Hui Xu,Somayeh Sojoudi,Xuandong Zhao,Dawn Song,Song Mei
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior – rejecting even benign prompts – a phenomenon known as over-refusal that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their this http URL a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.

[LG-17] Joint Learning in the Gaussian Single Index Model

Link: https://arxiv.org/abs/2505.21336
Authors: Loucas Pillaud-Vivien,Adrien Schertzer
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 31 Pages, 3 Figures

Abstract:We consider the problem of jointly learning a one-dimensional projection and a univariate function in high-dimensional Gaussian models. Specifically, we study predictors of the form $f(x)=\varphi^\star(\langle w^\star, x \rangle)$, where both the direction $w^\star \in \mathcal{S}_{d-1}$, the sphere of $\mathbb{R}^d$, and the function $\varphi^\star: \mathbb{R} \to \mathbb{R}$ are learned from Gaussian data. This setting captures a fundamental non-convex problem at the intersection of representation learning and nonlinear regression. We analyze the gradient flow dynamics of a natural alternating scheme and prove convergence, with a rate controlled by the information exponent reflecting the Gaussian regularity of the function $\varphi^\star$. Strikingly, our analysis shows that convergence still occurs even when the initial direction is negatively correlated with the target. On the practical side, we demonstrate that such joint learning can be effectively implemented using a Reproducing Kernel Hilbert Space (RKHS) adapted to the structure of the problem, enabling efficient and flexible estimation of the univariate function. Our results offer both theoretical insight and practical methodology for learning low-dimensional structure in high-dimensional settings.

[LG-18] Scheduling with Uncertain Holding Costs and its Application to Content Moderation

Link: https://arxiv.org/abs/2505.21331
Authors: Caner Gocmen,Thodoris Lykouris,Deeksha Sinha,Wentao Weng
Categories: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Performance (cs.PF); Probability (math.PR)
Comments:

Abstract:In content moderation for social media platforms, the cost of delaying the review of a piece of content is proportional to its view trajectory, which fluctuates and is a priori unknown. Motivated by such uncertain holding costs, we consider a queueing model where job states evolve based on a Markov chain with state-dependent instantaneous holding costs. We demonstrate that in the presence of such uncertain holding costs, the two canonical algorithmic principles, instantaneous-cost ($c\mu$-rule) and expected-remaining-cost ($c\mu/\theta$-rule), are suboptimal. By viewing each job as a Markovian ski-rental problem, we develop a new index-based algorithm, Opportunity-adjusted Remaining Cost (OaRC), that adjusts to the opportunity of serving jobs in the future when uncertainty partly resolves. We show that the regret of OaRC scales as $\tilde{O}(L^{1.5}\sqrt{N})$, where $L$ is the maximum length of a job’s holding cost trajectory and $N$ is the system size. This regret bound shows that OaRC achieves asymptotic optimality when the system size $N$ scales to infinity. Moreover, its regret is independent of the state-space size, which is a desirable property when job states contain contextual information. We corroborate our results with an extensive simulation study based on two holding cost patterns (online ads and user-generated content) that arise in content moderation for social media platforms. Our simulations based on synthetic and real datasets demonstrate that OaRC consistently outperforms existing practice, which is based on the two canonical algorithmic principles.
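
For reference, the instantaneous-cost baseline mentioned in the abstract can be sketched as a simple index rule: serve the job whose current holding-cost rate times service rate is largest. The sketch below covers only that baseline, not OaRC, whose index is not given in the abstract. The job fields `c` and `mu`, the job ids, and the function name are illustrative assumptions.

```python
def c_mu_pick(jobs):
    """Return the job with the largest (holding-cost rate) x (service rate) index."""
    return max(jobs, key=lambda j: j["c"] * j["mu"])

# Hypothetical review queue: c is the current cost per unit of delay,
# mu is the rate at which a reviewer can finish the job.
queue = [
    {"id": "post-a", "c": 5.0, "mu": 0.2},   # index about 1.0
    {"id": "post-b", "c": 2.0, "mu": 1.0},   # index 2.0
    {"id": "post-c", "c": 9.0, "mu": 0.1},   # index about 0.9
]
next_job = c_mu_pick(queue)
print(next_job["id"])
```

The rule looks only at the current cost `c`. When costs follow an uncertain trajectory, as with view counts, a job that is cheap now may become expensive later, which is the gap the abstract says the canonical rules leave open.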

[LG-19] UGCE: User-Guided Incremental Counterfactual Exploration IJCNN2025

Link: https://arxiv.org/abs/2505.21330
Authors: Christos Fragkathoulas,Evaggelia Pitoura
Categories: Machine Learning (cs.LG)
Comments: Accepted to the ForgtAI Workshop at IJCNN 2025

Abstract:Counterfactual explanations (CFEs) are a popular approach for interpreting machine learning predictions by identifying minimal feature changes that alter model outputs. However, in real-world settings, users often refine feasibility constraints over time, requiring counterfactual generation to adapt dynamically. Existing methods fail to support such iterative updates, instead recomputing explanations from scratch with each change, an inefficient and rigid approach. We propose User-Guided Incremental Counterfactual Exploration (UGCE), a genetic algorithm-based framework that incrementally updates counterfactuals in response to evolving user constraints. Experimental results across five benchmark datasets demonstrate that UGCE significantly improves computational efficiency while maintaining high-quality solutions compared to a static, non-incremental approach. Our evaluation further shows that UGCE supports stable performance under varying constraint sequences, benefits from an efficient warm-start strategy, and reveals how different constraint types may affect search behavior.

[LG-20] Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization

Link: https://arxiv.org/abs/2505.21321
Authors: Leonard Papenmeier, Luigi Nardi
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 1 figure

Abstract:We present Bencher, a modular benchmarking framework for black-box optimization that fundamentally decouples benchmark execution from optimization logic. Unlike prior suites that focus on combining many benchmarks in a single project, Bencher introduces a clean abstraction boundary: each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. This design eliminates dependency conflicts and simplifies the integration of diverse, real-world benchmarks, which often have complex and conflicting software requirements. Bencher can be deployed locally or remotely via Docker or on high-performance computing (HPC) clusters via Singularity, providing a containerized, reproducible runtime for any benchmark. Its lightweight client requires minimal setup and supports drop-in evaluation of 80 benchmarks across continuous, categorical, and binary domains.

[LG-21] LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning

Link: https://arxiv.org/abs/2505.21289
Authors: Nurbek Tastan, Stefanos Laskaridis, Martin Takac, Karthik Nandakumar, Samuel Horvath
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights. While LoRA dramatically reduces trainable parameters with little overhead, it can still underperform full fine-tuning in accuracy and often converges more slowly. We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning by aligning the optimizer’s internal dynamics with those of updating all model weights. LoFT not only learns weight updates in a low-rank subspace (like LoRA) but also properly projects the optimizer’s first and second moments (Adam’s momentum and variance) into the same subspace, mirroring full-model updates. By aligning the low-rank update itself with the full update, LoFT eliminates the need for tuning extra hyperparameters, e.g., the LoRA scaling factor $\alpha$. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning and consistently outperforms standard LoRA-style methods, all without increasing inference cost.
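For context, the standard LoRA update that the abstract above builds on can be sketched as follows. This shows plain LoRA with its scaling factor $\alpha$, not LoFT's projected optimizer moments. The helper names and toy matrices are illustrative assumptions.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    W is the frozen d x k weight, B is d x r and A is r x k (both trainable).
    """
    r = len(A)                      # rank = number of rows of A
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


# Toy rank-1 adapter on a 2 x 2 identity weight (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]
W_eff = lora_effective_weight(W, B, A, alpha=2.0)
```

Because the product `B @ A` can be merged into `W` after training, the adapter adds no inference cost, which matches the abstract's closing claim about inference.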

[LG-22] Learnable Kernel Density Estimation for Graphs

Link: https://arxiv.org/abs/2505.21285
Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under Review

Abstract:This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and complexity. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.

[LG-23] Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework

Link: https://arxiv.org/abs/2505.21251
Authors: Mustafa Hajij, Lennart Bastian, Sarah Osentoski, Hardik Kabaria, John L. Davenport, Sheik Dawood, Balaji Cherukuri, Joseph G. Kocheemoolayil, Nastaran Shahmansouri, Adrian Lew, Theodore Papamarkou, Tolga Birdal
Categories: Machine Learning (cs.LG)

Abstract:We introduce copresheaf topological neural networks (CTNNs), a powerful and unifying framework that encapsulates a wide spectrum of deep learning architectures, designed to operate on structured data: including images, point clouds, graphs, meshes, and topological manifolds. While deep learning has profoundly impacted domains ranging from digital assistants to autonomous systems, the principled design of neural architectures tailored to specific tasks and data types remains one of the field’s most persistent open challenges. CTNNs address this gap by grounding model design in the language of copresheaves, a concept from algebraic topology that generalizes and subsumes most practical deep learning models in use today. This abstract yet constructive formulation yields a rich design space from which theoretically sound and practically effective solutions can be derived to tackle core challenges in representation learning: long-range dependencies, oversmoothing, heterophily, and non-Euclidean domains. Our empirical results on structured data benchmarks demonstrate that CTNNs consistently outperform conventional baselines, particularly in tasks requiring hierarchical or localized sensitivity. These results underscore CTNNs as a principled, multi-scale foundation for the next generation of deep learning architectures.

[LG-24] BindEnergyCraft: Casting Protein Structure Predictors as Energy-Based Models for Binder Design

Link: https://arxiv.org/abs/2505.21241
Authors: Divya Nori, Anisha Parsan, Caroline Uhler, Wengong Jin
Categories: Machine Learning (cs.LG)

Abstract:Protein binder design has been transformed by hallucination-based methods that optimize structure prediction confidence metrics, such as the interface predicted TM-score (ipTM), via backpropagation. However, these metrics do not reflect the statistical likelihood of a binder-target complex under the learned distribution and yield sparse gradients for optimization. In this work, we propose a method to extract such likelihoods from structure predictors by reinterpreting their confidence outputs as an energy-based model (EBM). By leveraging the Joint Energy-based Modeling (JEM) framework, we introduce pTMEnergy, a statistical energy function derived from predicted inter-residue error distributions. We incorporate pTMEnergy into BindEnergyCraft (BECraft), a design pipeline that maintains the same optimization framework as BindCraft but replaces ipTM with our energy-based objective. BECraft outperforms BindCraft, RFDiffusion, and ESM3 across multiple challenging targets, achieving higher in silico binder success rates while reducing structural clashes. Furthermore, pTMEnergy establishes a new state-of-the-art in structure-based virtual screening tasks for miniprotein and RNA aptamer binders.

[LG-25] Why Do More Experts Fail? A Theoretical Analysis of Model Merging

Link: https://arxiv.org/abs/2505.21226
Authors: Zijing Wang, Xingle Xu, Yongkang Liu, Yiqun Zhang, Peiqin Lin, Shi Feng, Xiaocui Yang, Daling Wang, Hinrich Schütze
Categories: Machine Learning (cs.LG)

Abstract:Model merging dramatically reduces storage and computational resources by combining multiple expert models into a single multi-task model. Although recent model merging methods have shown promising results, they struggle to maintain performance gains as the number of merged models increases. In this paper, we investigate the key obstacles that limit the scalability of model merging when integrating a large number of expert models. First, we prove that there is an upper bound on model merging. Further theoretical analysis reveals that the limited effective parameter space imposes a strict constraint on the number of models that can be successfully merged. An analysis based on Gaussian width shows that the marginal benefit of merging additional models diminishes according to a strictly concave function. This implies that the effective parameter space becomes rapidly saturated as the number of merged models increases. Furthermore, using Approximate Kinematics Theory, we prove the existence of a unique optimal threshold beyond which adding more models does not yield significant performance improvements. At the same time, we introduce a straightforward Reparameterized Heavy-Tailed method (RHT) to extend the coverage of the merged model, thereby enhancing its performance. Empirical results on 12 benchmarks, including both knowledge-intensive and general-purpose tasks, validate our theoretical analysis. We believe that these results will spark further research beyond the current scope of model merging. The source code is in the anonymous GitHub repository this https URL.

[LG-26] Developing hybrid mechanistic and data-driven personalized prediction models for platelet dynamics

Link: https://arxiv.org/abs/2505.21204
Authors: Marie Steinacker, Yuri Kheifetz, Markus Scholz
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Hematotoxicity, drug-induced damage to the blood-forming system, is a frequent side effect of cytotoxic chemotherapy and poses a significant challenge in clinical practice due to its high inter-patient variability and limited predictability. Current mechanistic models often struggle to accurately forecast outcomes for patients with irregular or atypical trajectories. In this study, we develop and compare hybrid mechanistic and data-driven approaches for individualized time series modeling of platelet counts during chemotherapy. We consider hybrid models that combine mechanistic models with neural networks, known as universal differential equations. As a purely data-driven alternative, we utilize a nonlinear autoregressive exogenous model using gated recurrent units as the underlying architecture. These models are evaluated across a range of real patient scenarios, varying in data availability and sparsity, to assess predictive performance. Our findings demonstrate that data-driven methods, when provided with sufficient data, significantly improve prediction accuracy, particularly for high-risk patients with irregular platelet dynamics. This highlights the potential of data-driven approaches in enhancing clinical decision-making. In contrast, hybrid and mechanistic models are superior in scenarios with limited or sparse data. The proposed modeling and comparison framework is generalizable and could be extended to predict other treatment-related toxicities, offering broad applicability in personalized medicine.

[LG-27] Crop recommendation with machine learning: leveraging environmental and economic factors for optimal crop selection

Link: https://arxiv.org/abs/2505.21201
Authors: Steven Sam, Silima Marshal DAbreo
Categories: Machine Learning (cs.LG)
Comments: 22 pages and 13 figures

Abstract:Agriculture constitutes a primary source of food production, economic growth and employment in India, but the sector is confronted with low farm productivity and yields aggravated by increased pressure on natural resources and adverse climate change variability. Efforts involving green revolution, land irrigation, improved seeds and organic farming have yielded suboptimal outcomes. The adoption of computational tools like crop recommendation systems offers a new way to provide insights and help farmers tackle low productivity. However, most agricultural recommendation systems in India focus narrowly on environmental factors and regions, limiting accurate predictions of high-yield, profitable crops. This study uses environmental and economic factors with 19 crops across 15 states to develop and evaluate Random Forest and SVM models using 10-fold Cross Validation, Time-series Split, and Lag Variables. The 10-fold cross validation showed high accuracy (RF: 99.96%, SVM: 94.71%) but raised overfitting concerns. Introducing temporal order, better reflecting real-world conditions, reduced performance (RF: 78.55%, SVM: 71.18%) in the Time-series Split. To further increase the model accuracy while maintaining the temporal order, the Lag Variables approach was employed, which resulted in improved performance (RF: 83.62%, SVM: 74.38%) compared to the 10-fold cross validation approach. Overall, the models in the Time-series Split and Lag Variables approaches offer practical insights by handling temporal dependencies and enhancing their adaptability to changing agricultural conditions over time. Consequently, the study shows the Random Forest model developed based on the Lag Variables as the most preferred algorithm for optimal crop recommendation in the Indian context.
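The difference between shuffled k-fold validation and the temporally ordered evaluation described in the abstract above can be sketched in a few lines. This is an illustrative expanding-window splitter and lag-feature helper, not the study's pipeline. The function names and the fold-size rule are assumptions.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    Unlike shuffled k-fold, no future observation ever leaks into training.
    """
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        test_end = min(train_end + fold, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))


def add_lag(series, lag):
    """Pair each value with the one observed `lag` steps earlier.

    The first `lag` observations are dropped because they have no history.
    """
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]


splits = list(expanding_window_splits(n_samples=8, n_splits=3))
lagged = add_lag([10, 12, 11, 15], lag=1)
```

Each lagged pair gives a model last period's value as a feature while still respecting time order, which is one plausible reading of why lag variables recover some accuracy under a time-series split.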

[LG-28] Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score

Link: https://arxiv.org/abs/2505.21147
Authors: Xuanning Zhou, Hao Zeng, Xiaobo Xia, Bingyi Jing, Hongxin Wei
Categories: Machine Learning (cs.LG)

Abstract:Conformal prediction (CP) is a powerful framework for uncertainty quantification, providing prediction sets with coverage guarantees when calibrated on sufficient labeled data. However, in real-world applications where labeled data is often limited, standard CP can lead to coverage deviation and output overly large prediction sets. In this paper, we extend CP to the semi-supervised setting and propose SemiCP, leveraging both labeled data and unlabeled data for calibration. Specifically, we introduce a novel nonconformity score function, NNM, designed for unlabeled data. This function selects labeled data with similar pseudo-label scores to estimate nonconformity scores, integrating them into the calibration process to overcome sample size limitations. We theoretically demonstrate that, under mild assumptions, SemiCP provides an asymptotic coverage guarantee for prediction sets. Extensive experiments further validate that our approach effectively reduces instability and inefficiency under limited calibration data, can be adapted to conditional coverage settings, and integrates seamlessly with existing CP methods.
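As background for the abstract above, the standard split-conformal calibration step (labeled data only) can be sketched as follows. It does not implement SemiCP or the NNM score. The score values and label names are made up for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With few calibration points the required rank can exceed n, in which
    case the threshold is infinite and every label is kept.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # 1-based rank
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, s in label_scores.items() if s <= threshold}


# Hypothetical nonconformity scores of 10 labeled calibration points.
cal_scores = [0.05, 0.4, 0.2, 0.9, 0.3, 0.5, 0.6, 0.7, 0.8, 0.1]
q_hat = conformal_threshold(cal_scores, alpha=0.2)
labels = prediction_set({"cat": 0.3, "dog": 0.85, "fox": 0.8}, q_hat)
tiny = conformal_threshold([0.1, 0.2], alpha=0.1)
```

The `tiny` case shows the small-sample problem the abstract mentions: with only two calibration scores the threshold degenerates, so the prediction set contains every label.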

[LG-29] A Predicting Phishing Websites Using Support Vector Machine and MultiClass Classification Based on Association Rule Techniques

Link: https://arxiv.org/abs/2505.21141
Authors: Nancy C. Woods, Virtue Ene Agada, Adebola K. Ojo
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Phishing is a semantic attack which targets the user rather than the computer. It is a new Internet crime in comparison with other forms such as viruses and hacking. Considering the damage phishing websites have caused to various economies by collapsing organizations, stealing information and diverting finances, various researchers have embarked on different ways of detecting phishing websites, but there has been no agreement about the best algorithm to be used for prediction. This study integrates the strengths of two algorithms, Support Vector Machines (SVM) and Multi-Class Classification Rules based on Association Rules (MCAR), to establish a stronger means of predicting phishing websites. A total of 11,056 websites from both PhishTank and the Yahoo directory were used to verify the effectiveness of this approach. Feature extraction and rule generation were done by the MCAR technique; classification and prediction were done by the SVM technique. The results showed that the technique achieved 98.30% classification accuracy with a computation time of 2205.33 s and a minimal error rate. It achieved an Area Under the Curve (AUC) of 98%, which reflects the proportion of accuracy in classifying phishing websites. The model explained 82.84% of the variance in the prediction of phishing websites based on the coefficient of determination. Using the two techniques together to detect phishing websites produced a more accurate result, as it combined the strengths of both techniques. This research work capitalized on this advantage by building a hybrid of the two techniques to help produce a more accurate result.

[LG-30] Robust and Computation-Aware Gaussian Processes

Link: https://arxiv.org/abs/2505.21133
Authors: Marshal Arijona Sinaga, Julien Martinelli, Samuel Kaski
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gaussian processes (GPs) are widely used for regression and optimization tasks such as Bayesian optimization (BO) due to their expressiveness and principled uncertainty estimates. However, in settings with large datasets corrupted by outliers, standard GPs and their sparse approximations struggle with computational tractability and robustness. We introduce Robust Computation-aware Gaussian Process (RCaGP), a novel GP model that jointly addresses these challenges by combining a principled treatment of approximation-induced uncertainty with robust generalized Bayesian updating. The key insight is that robustness and approximation-awareness are not orthogonal but intertwined: approximations can exacerbate the impact of outliers, and mitigating one without the other is insufficient. Unlike previous work that focuses narrowly on either robustness or approximation quality, RCaGP combines both in a principled and scalable framework, thus effectively managing both outliers and computational uncertainties introduced by approximations such as low-rank matrix multiplications. Our model ensures more conservative and reliable uncertainty estimates, a property we rigorously demonstrate. Additionally, we establish a robustness property and show that the mean function is key to preserving it, motivating a tailored model selection scheme for robust mean functions. Empirical results confirm that solving these challenges jointly leads to superior performance across both clean and outlier-contaminated settings, both on regression and high-throughput Bayesian optimization benchmarks.

[LG-31] Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance

Link: https://arxiv.org/abs/2505.21101
Authors: Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: preprint

Abstract:Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero – where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at this https URL
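The linear combination that defines CFG in the abstract above can be written in one line. The sketch below uses the common parameterization in which a guidance weight of 1 recovers the conditional denoiser. It does not include the paper's Rényi correction or Gibbs-like sampler, and the vectors are toy values.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond).

    w = 1 returns the conditional prediction unchanged; w > 1 extrapolates
    away from the unconditional prediction.
    """
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy denoiser outputs for a two-dimensional sample (hypothetical values).
eps_cond = [1.0, 2.0]
eps_uncond = [0.0, 1.0]
plain = cfg_combine(eps_cond, eps_uncond, w=1.0)
guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
```

With `w=2.0` the guided output overshoots the conditional prediction, which is the extrapolation the abstract associates with the quality and diversity trade-off.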

[LG-32] Improved Impossible Tuning and Lipschitz-Adaptive Universal Online Learning with Gradient Variations

Link: https://arxiv.org/abs/2505.21095
Authors: Kei Takemura, Ryuta Matsuno, Keita Sakuma
Categories: Machine Learning (cs.LG)

Abstract:A central goal in online learning is to achieve adaptivity to unknown problem characteristics, such as environmental changes captured by gradient variation (GV), function curvature (universal online learning, UOL), and gradient scales (Lipschitz adaptivity, LA). Simultaneously achieving these with optimal performance is a major challenge, partly due to limitations in algorithms for prediction with expert advice. These algorithms often serve as meta-algorithms in online ensemble frameworks, and their sub-optimality hinders overall UOL performance. Specifically, existing algorithms addressing the “impossible tuning” issue incur an excess $\sqrt{\log T}$ factor in their regret bound compared to the lower bound. To solve this problem, we propose a novel optimistic online mirror descent algorithm with an auxiliary initial round using large learning rates. This design enables a refined analysis where a generated negative term cancels the gap-related factor, resolving the impossible tuning issue up to $\log\log T$ factors. Leveraging our improved algorithm as a meta-algorithm, we develop the first UOL algorithm that simultaneously achieves state-of-the-art GV bounds and LA under standard assumptions. Our UOL result overcomes key limitations of prior works, notably resolving the conflict between LA mechanisms and regret analysis for GV bounds – an open problem highlighted by Xie et al.

[LG-33] Bridging Arbitrary and Tree Metrics via Differentiable Gromov Hyperbolicity

Link: https://arxiv.org/abs/2505.21073
Authors: Pierre Houedry, Nicolas Courty, Florestan Martin-Baillon, Laetitia Chapel, Titouan Vayer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Trees and the associated shortest-path tree metrics provide a powerful framework for representing hierarchical and combinatorial structures in data. Given an arbitrary metric space, its deviation from a tree metric can be quantified by Gromov’s $\delta$-hyperbolicity. Nonetheless, designing algorithms that bridge an arbitrary metric to its closest tree metric is still an active subject of interest, as most common approaches are either heuristic and lack guarantees, or perform only moderately well. In this work, we introduce a novel differentiable optimization framework, coined DeltaZero, that solves this problem. Our method leverages a smooth surrogate for Gromov’s $\delta$-hyperbolicity which enables gradient-based optimization with tractable complexity. The corresponding optimization procedure is derived from a problem with better worst-case guarantees than existing bounds, and is justified statistically. Experiments on synthetic and real-world datasets demonstrate that our method consistently achieves state-of-the-art distortion.
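The quantity at the center of the abstract above, Gromov's $\delta$-hyperbolicity, can be computed exactly on a small finite metric with the four-point condition. The brute-force sketch below is for intuition only. It is not DeltaZero and not the smooth surrogate the paper optimizes, and the two toy distance matrices are assumptions.

```python
from itertools import combinations


def gromov_delta(D):
    """Exact four-point delta of a finite metric given as a distance matrix.

    For each quadruple, form the three pairwise sums of opposite distances;
    delta is half the largest gap between the two biggest sums. Cost is O(n^4).
    """
    delta = 0.0
    for x, y, z, w in combinations(range(len(D)), 4):
        sums = sorted([D[x][y] + D[z][w], D[x][z] + D[y][w], D[x][w] + D[y][z]])
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta


# Star tree: center 0 joined to leaves 1, 2, 3 by unit edges (a tree metric).
star = [
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
]
# Unit-edge 4-cycle: shortest-path metric that is not a tree metric.
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
delta_star = gromov_delta(star)
delta_cycle = gromov_delta(cycle)
```

A tree metric gives $\delta = 0$ while the cycle does not. The maximum over quadruples is non-smooth, which is why a differentiable method needs a smooth surrogate for it.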

[LG-34] Scalable and adaptive prediction bands with kernel sum-of-squares

Link: https://arxiv.org/abs/2505.21039
Authors: Louis Allain (ENSAI, CREST), Sébastien da Veiga (ENSAI, CREST), Brian Staber
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Conformal Prediction (CP) is a popular framework for constructing prediction bands with valid coverage in finite samples, while being free of any distributional assumption. A well-known limitation of conformal prediction is the lack of adaptivity, although several works introduced practically efficient alternate procedures. In this work, we build upon recent ideas that rely on recasting the CP problem as a statistical learning problem, directly targeting coverage and adaptivity. This statistical learning problem is based on reproducing kernel Hilbert spaces (RKHS) and kernel sum-of-squares (SoS) methods. First, we extend previous results with a general representer theorem and exhibit the dual formulation of the learning problem. Crucially, such a dual formulation can be solved efficiently by accelerated gradient methods with several hundreds or thousands of samples, unlike previous strategies based on off-the-shelf semidefinite programming algorithms. Second, we introduce a new hyperparameter tuning strategy tailored specifically to target adaptivity through bounds on test-conditional coverage. This strategy, based on the Hilbert-Schmidt Independence Criterion (HSIC), is introduced here to tune kernel lengthscales in our framework, but has broader applicability since it could be used in any CP algorithm where the score function is learned. Finally, extensive experiments are conducted to show how our method compares to related work. All figures can be reproduced with the accompanying code.

[LG-35] LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms

Link: https://arxiv.org/abs/2505.21034
Authors: Wenhu Li, Niki van Stein, Thomas Bäck, Elena Raponi
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Bayesian optimization (BO) is a powerful class of algorithms for optimizing expensive black-box functions, but designing effective BO algorithms remains a manual, expertise-driven task. Recent advancements in Large Language Models (LLMs) have opened new avenues for automating scientific discovery, including the automatic design of optimization algorithms. While prior work has used LLMs within optimization loops or to generate non-BO algorithms, we tackle a new challenge: Using LLMs to automatically generate full BO algorithm code. Our framework uses an evolution strategy to guide an LLM in generating Python code that preserves the key components of BO algorithms: An initial design, a surrogate model, and an acquisition function. The LLM is prompted to produce multiple candidate algorithms, which are evaluated on the established Black-Box Optimization Benchmarking (BBOB) test suite from the COmparing Continuous Optimizers (COCO) platform. Based on their performance, top candidates are selected, combined, and mutated via controlled prompt variations, enabling iterative refinement. Despite no additional fine-tuning, the LLM-generated algorithms outperform state-of-the-art BO baselines in 19 (out of 24) BBOB functions in dimension 5 and generalize well to higher dimensions, and different tasks (from the Bayesmark framework). This work demonstrates that LLMs can serve as algorithmic co-designers, offering a new paradigm for automating BO development and accelerating the discovery of novel algorithmic combinations. The source code is provided at this https URL.

[LG-36] NeuralOM: Neural Ocean Model for Subseasonal-to-Seasonal Simulation

Link: https://arxiv.org/abs/2505.21020
Authors: Yuan Gao, Ruiqi Shu, Hao Wu, Fan Xu, Yanfei Xiang, Ruijian Gou, Qingsong Wen, Xian Wu, Xiaomeng Huang
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Accurate Subseasonal-to-Seasonal (S2S) ocean simulation is critically important for marine research, yet remains challenging due to its substantial thermal inertia and extended time delay. Machine learning (ML)-based models have demonstrated significant advancements in simulation accuracy and computational efficiency compared to traditional numerical methods. Nevertheless, a significant limitation of current ML models for S2S ocean simulation is their inadequate incorporation of physical consistency and the slow-changing properties of the ocean system. In this work, we propose a neural ocean model (NeuralOM) for S2S ocean simulation with a multi-scale interactive graph neural network to emulate diverse physical phenomena associated with ocean systems effectively. Specifically, we propose a multi-stage framework tailored to model the ocean’s slowly changing nature. Additionally, we introduce a multi-scale interactive messaging module to capture complex dynamical behaviors, such as gradient changes and multiplicative coupling relationships inherent in ocean dynamics. Extensive experimental evaluations confirm that our proposed NeuralOM outperforms state-of-the-art models in S2S and extreme event simulation. The codes are available at this https URL.

[LG-37] Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Link: https://arxiv.org/abs/2505.21005
Authors: Fengzhe Zhang, Laurence I. Midgley, José Miguel Hernández-Lobato
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Score-based diffusion models (SBDMs) are powerful amortized samplers for Boltzmann distributions; however, imperfect score estimates bias downstream Monte Carlo estimates. Classical importance sampling (IS) can correct this bias, but computing exact likelihoods requires solving the probability-flow ordinary differential equation (PF-ODE), a procedure that is prohibitively costly and scales poorly with dimensionality. We introduce Variance-Tuned Diffusion Importance Sampling (VT-DIS), a lightweight post-training method that adapts the per-step noise covariance of a pretrained SBDM by minimizing the $\alpha$-divergence ($\alpha=2$) between its forward diffusion and reverse denoising trajectories. VT-DIS assigns a single trajectory-wise importance weight to the joint forward-reverse process, yielding unbiased expectation estimates at test time with negligible overhead compared to standard sampling. On the DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of approximately 80%, 35%, and 3.5%, respectively, while using only a fraction of the computational budget required by vanilla diffusion + IS or PF-ODE-based IS.
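The effective sample size (ESS) percentages reported in the abstract above refer to a standard diagnostic for importance weights. The sketch below shows the usual self-normalized form, which is consistent rather than exactly unbiased. It is not VT-DIS, and the log-weights are toy values.

```python
import math


def normalized_weights(log_weights):
    """Turn log importance weights into weights that sum to one.

    Subtracting the maximum first avoids overflow in exp.
    """
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to n."""
    return 1.0 / sum(w * w for w in weights)


def is_estimate(values, weights):
    """Self-normalized importance sampling estimate of an expectation."""
    return sum(w * v for w, v in zip(weights, values))


uniform = normalized_weights([0.0, 0.0, 0.0, 0.0])
ess_uniform = effective_sample_size(uniform)       # every sample counts equally
skewed = normalized_weights([0.0, -1000.0, -1000.0])
ess_skewed = effective_sample_size(skewed)         # one sample dominates
mean_uniform = is_estimate([1.0, 2.0, 3.0, 4.0], uniform)
```

Dividing the ESS by the number of samples gives a percentage of the kind quoted in the abstract: values near 100% mean the proposal closely matches the target, while values near zero mean a few trajectories carry almost all the weight.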

[LG-38] Efficient Identity and Position Graph Embedding via Spectral-Based Random Feature Aggregation KDD2025

Link: https://arxiv.org/abs/2505.20992
Authors: Meng Qin, Jiahong Liu, Irwin King
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted by ACM SIGKDD 2025

Abstract:Graph neural networks (GNNs), which capture graph structures via a feature aggregation mechanism following the graph embedding framework, have demonstrated a powerful ability to support various tasks. According to the topology properties (e.g., structural roles or community memberships of nodes) to be preserved, graph embedding can be categorized into identity and position embedding. However, it is unclear for most GNN-based methods which property they can capture. Some of them may also suffer from low efficiency and scalability caused by several time- and space-consuming procedures (e.g., feature extraction and training). From a perspective of graph signal processing, we find that high- and low-frequency information in the graph spectral domain may characterize node identities and positions, respectively. Based on this investigation, we propose random feature aggregation (RFA) for efficient identity and position embedding, serving as an extreme ablation study regarding GNN feature aggregation. RFA (i) adopts a spectral-based GNN without learnable parameters as its backbone, (ii) only uses random noises as inputs, and (iii) derives embeddings via just one feed-forward propagation (FFP). Inspired by degree-corrected spectral clustering, we further introduce a degree correction mechanism to the GNN backbone. Surprisingly, our experiments demonstrate that two variants of RFA with high- and low-pass filters can respectively derive informative identity and position embeddings via just one FFP (i.e., without any training). As a result, RFA can achieve a better trade-off between quality and efficiency for both identity and position embedding over various baselines.

[LG-39] Identifying Super Spreaders in Multilayer Networks

Link: https://arxiv.org/abs/2505.20980
Authors: Michał Czuba, Mateusz Stolarski, Adam Piróg, Piotr Bielak, Piotr Bródka
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:Identifying super-spreaders can be framed as a subtask of the influence maximisation problem. It seeks to pinpoint agents within a network that, if selected as single diffusion seeds, disseminate information most effectively. Multilayer networks, a specific class of heterogeneous graphs, can capture diverse types of interactions (e.g., physical-virtual or professional-social), and thus offer a more accurate representation of complex relational structures. In this work, we introduce a novel approach to identifying super-spreaders in such networks by leveraging graph neural networks. To this end, we construct a dataset by simulating information diffusion across hundreds of networks - to the best of our knowledge, the first of its kind tailored specifically to multilayer networks. We further formulate the task as a variation of the ranking prediction problem based on a four-dimensional vector that quantifies each agent’s spreading potential: (i) the number of activations; (ii) the duration of the diffusion process; (iii) the peak number of activations; and (iv) the simulation step at which this peak occurs. Our model, TopSpreadersNetwork, comprises a relationship-agnostic encoder and a custom aggregation layer. This design enables generalisation to previously unseen data and adapts to varying graph sizes. In an extensive evaluation, we compare our model against classic centrality-based heuristics and competitive deep learning methods. The results, obtained across a broad spectrum of real-world and synthetic multilayer networks, demonstrate that TopSpreadersNetwork achieves superior performance in identifying high-impact nodes, while also offering improved interpretability through its structured output.
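
To make the four-dimensional spreading vector concrete, here is a minimal sketch that simulates a spread from a single seed and records the four quantities listed in the abstract. The diffusion model (an independent cascade on a single-layer graph), the convention that the seed counts as an activation at step 0, and the names `spreading_vector` and `GRAPH` are all assumptions. The abstract does not specify the simulation, and the paper works with multilayer networks.

```python
import random


def spreading_vector(neighbors, seed_node, prob=0.5, rng=None):
    """Simulate an independent-cascade spread from one seed and return
    (total activations, duration in steps, peak new activations, peak step)."""
    rng = rng or random.Random(0)
    active = {seed_node}
    frontier = [seed_node]
    new_per_step = [1]  # step 0: the seed itself (assumed convention)
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors[u]:
                # Each newly active node gets one chance per inactive neighbor.
                if v not in active and rng.random() < prob:
                    active.add(v)
                    nxt.append(v)
        if nxt:
            new_per_step.append(len(nxt))
        frontier = nxt
    peak = max(new_per_step)
    return (len(active), len(new_per_step) - 1, peak, new_per_step.index(peak))


# Toy graph (assumption): node 0 is a hub, node 4 hangs off node 3.
GRAPH = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
vec_hub = spreading_vector(GRAPH, 0, prob=1.0)   # (5, 2, 3, 1)
vec_leaf = spreading_vector(GRAPH, 4, prob=1.0)  # (5, 3, 2, 3)
```

Both seeds eventually reach all five nodes, but the hub does so faster and with an earlier, higher peak, which is the kind of difference a ranking over these vectors can capture.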

[LG-40] Understanding the behavior of representation forgetting in continual learning

Link: https://arxiv.org/abs/2505.20970
Authors: Joonkyu Kim,Yejin Kim,Jy-yong Sohn
Subjects: Machine Learning (cs.LG)

Abstract:In continual learning scenarios, catastrophic forgetting of previously learned tasks is a critical issue, making it essential to effectively measure such forgetting. Recently, there has been growing interest in focusing on representation forgetting, the forgetting measured at the hidden layer. In this paper, we provide the first theoretical analysis of representation forgetting and use this analysis to better understand the behavior of continual learning. First, we introduce a new metric called representation discrepancy, which measures the difference between representation spaces constructed by two snapshots of a model trained through continual learning. We demonstrate that our proposed metric serves as an effective surrogate for the representation forgetting while remaining analytically tractable. Second, through mathematical analysis of our metric, we derive several key findings about the dynamics of representation forgetting: the forgetting occurs more rapidly to a higher degree as the layer index increases, while increasing the width of the network slows down the forgetting process. Third, we support our theoretical findings through experiments on real image datasets, including Split-CIFAR100 and ImageNet1K.

[LG-41] Semantic Communication meets System 2 ML: How Abstraction, Compositionality and Emergent Languages Shape Intelligence

Link: https://arxiv.org/abs/2505.20964
Authors: Mehdi Bennis,Salem Lahlou
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:The trajectories of 6G and AI are set for a creative collision. However, current visions for 6G remain largely incremental evolutions of 5G, while progress in AI is hampered by brittle, data-hungry models that lack robust reasoning capabilities. This paper argues for a foundational paradigm shift, moving beyond the purely technical level of communication toward systems capable of semantic understanding and effective, goal-oriented interaction. We propose a unified research vision rooted in the principles of System-2 cognition, built upon three pillars: Abstraction, enabling agents to learn meaningful world models from raw sensorimotor data; Compositionality, providing the algebraic tools to combine learned concepts and subsystems; and Emergent Communication, allowing intelligent agents to create their own adaptive and grounded languages. By integrating these principles, we lay the groundwork for truly intelligent systems that can reason, adapt, and collaborate, unifying advances in wireless communications, machine learning, and robotics under a single coherent framework.

[LG-42] Unveiling Impact of Frequency Components on Membership Inference Attacks for Diffusion Models

Link: https://arxiv.org/abs/2505.20955
Authors: Puwei Lian,Yujun Cai,Songze Li
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Diffusion models have achieved tremendous success in image generation, but they also raise significant concerns regarding privacy and copyright issues. Membership Inference Attacks (MIAs) are designed to ascertain whether specific data were utilized during a model’s training phase. As current MIAs for diffusion models typically exploit the model’s image prediction ability, we formalize them into a unified general paradigm which computes the membership score for membership identification. Under this paradigm, we empirically find that existing attacks overlook the inherent deficiency in how diffusion models process high-frequency information. Consequently, this deficiency leads to member data with more high-frequency content being misclassified as hold-out data, and hold-out data with less high-frequency content tend to be misclassified as member data. Moreover, we theoretically demonstrate that this deficiency reduces the membership advantage of attacks, thereby interfering with the effective discrimination of member data and hold-out data. Based on this insight, we propose a plug-and-play high-frequency filter module to mitigate the adverse effects of the deficiency, which can be seamlessly integrated into any attacks within this general paradigm without additional time costs. Extensive experiments corroborate that this module significantly improves the performance of baseline attacks across different datasets and models.

[LG-43] Scattering Networks on Noncommutative Finite Groups

Link: https://arxiv.org/abs/2505.20950
Authors: Maria Teresa Arias,Davide Barbieri,Eugenio Hernández
Subjects: Numerical Analysis (math.NA); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Scattering Networks were initially designed to elucidate the behavior of early layers in Convolutional Neural Networks (CNNs) over Euclidean spaces and are grounded in wavelets. In this work, we introduce a scattering transform on an arbitrary finite group (not necessarily abelian) within the context of group-equivariant convolutional neural networks (G-CNNs). We present wavelets on finite groups and analyze their similarity to classical wavelets. We demonstrate that, under certain conditions on the wavelet coefficients, the scattering transform is non-expansive, stable under deformations, energy-preserving, and equivariant with respect to left and right group translations, and that, as depth increases, the scattering coefficients become less sensitive to group translations of the signal; these are all desirable properties of convolutional neural networks. Furthermore, we provide examples illustrating the application of the scattering transform to classify data with domains involving abelian and nonabelian groups.

[LG-44] Efficient Spectral Control of Partially Observed Linear Dynamical Systems

Link: https://arxiv.org/abs/2505.20943
Authors: Anand Brahmbhatt,Gon Buzaglo,Sofiia Druchyna,Elad Hazan
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:We propose a new method for the problem of controlling linear dynamical systems under partial observation and adversarial disturbances. Our new algorithm, Double Spectral Control (DSC), matches the best known regret guarantees while exponentially improving runtime complexity over previous approaches in its dependence on the system’s stability margin. Our key innovation is a two-level spectral approximation strategy, leveraging double convolution with a universal basis of spectral filters, enabling efficient and accurate learning of the best linear dynamical controllers.

[LG-45] Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning

Link: https://arxiv.org/abs/2505.20938
Authors: Chongjie Si,Yidan Cui,Fuchao Yang,Xiaokang Yang,Wei Shen
Subjects: Machine Learning (cs.LG)

Abstract:Partial Multi-Label Learning (PML) extends the multi-label learning paradigm to scenarios where each sample is associated with a candidate label set containing both ground-truth labels and noisy labels. Existing PML methods commonly rely on two assumptions: sparsity of the noise label matrix and low-rankness of the ground-truth label matrix. However, these assumptions are inherently conflicting and impractical for real-world scenarios, where the true label matrix is typically full-rank or close to full-rank. To address these limitations, we demonstrate that the sparsity constraint contributes to the high-rank property of the predicted label matrix. Based on this, we propose a novel method Schirn, which introduces a sparsity constraint on the noise label matrix while enforcing a high-rank property on the predicted label matrix. Extensive experiments demonstrate the superior performance of Schirn compared to state-of-the-art methods, validating its effectiveness in tackling real-world PML challenges.

[LG-46] NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion

Link: https://arxiv.org/abs/2505.20934
Authors: Max Collins,Jordan Vice,Tim French,Ajmal Mian
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures, 2 tables

Abstract:Adversarial samples exploit irregularities in the manifold “learned” by deep learning models to cause misclassifications. The study of these adversarial samples provides insight into the features a model uses to classify inputs, which can be leveraged to improve robustness against future attacks. However, much of the existing literature focuses on constrained adversarial samples, which do not accurately reflect test-time errors encountered in real-world settings. To address this, we propose ‘NatADiff’, an adversarial sampling scheme that leverages denoising diffusion to generate natural adversarial samples. Our approach is based on the observation that natural adversarial samples frequently contain structural elements from the adversarial class. Deep learning models can exploit these structural elements to shortcut the classification process, rather than learning to genuinely distinguish between classes. To leverage this behavior, we guide the diffusion trajectory towards the intersection of the true and adversarial classes, combining time-travel sampling with augmented classifier guidance to enhance attack transferability while preserving image fidelity. Our method achieves comparable attack success rates to current state-of-the-art techniques, while exhibiting significantly higher transferability across model architectures and better alignment with natural test-time errors as measured by FID. These results demonstrate that NatADiff produces adversarial samples that not only transfer more effectively across models, but more faithfully resemble naturally occurring test-time errors.

[LG-47] MLMC-based Resource Adequacy Assessment with Active Learning Trained Surrogate Models

Link: https://arxiv.org/abs/2505.20930
Authors: Ruiqi Zhang,Simon H. Tindemans
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 1 table

Abstract:Multilevel Monte Carlo (MLMC) is a flexible and effective variance reduction technique for accelerating reliability assessments of complex power systems. Recently, data-driven surrogate models have been proposed as lower-level models in the MLMC framework due to their high correlation and negligible execution time once trained. However, in resource adequacy assessments, pre-labeled datasets are typically unavailable. For large-scale systems, the efficiency gains from surrogate models are often offset by the substantial time required for labeling training data. Therefore, this paper introduces a speed metric that accounts for training time in evaluating MLMC efficiency. Considering the total time budget is limited, a vote-by-committee active learning approach is proposed to reduce the required labeling calls. A case study demonstrates that, within practical variance thresholds, active learning enables significantly improved MLMC efficiency with reduced training effort, compared to regular surrogate modelling approaches.
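
The role of a cheap surrogate as the lower level can be sketched with a two-level estimator based on the identity E[f_hi] = E[f_lo] + E[f_hi − f_lo]. The toy models (`x * x` as the expensive model, `x` as the surrogate), the sample counts, and the function name `two_level_estimate` are assumptions for illustration. Nothing here reflects the paper's power system models, its speed metric, or its active learning scheme.

```python
import random


def two_level_estimate(f_hi, f_lo, sampler, n_lo, n_hi, seed=0):
    """Two-level Monte Carlo estimate of E[f_hi(X)].

    Many cheap evaluations of the surrogate f_lo give the coarse estimate;
    a few paired evaluations of f_hi - f_lo correct its bias.
    """
    rng = random.Random(seed)
    coarse = sum(f_lo(sampler(rng)) for _ in range(n_lo)) / n_lo
    correction = 0.0
    for _ in range(n_hi):
        x = sampler(rng)  # the same sample is fed to both models
        correction += f_hi(x) - f_lo(x)
    return coarse + correction / n_hi


# Toy stand-ins (assumptions): true value is E[x^2] = 1/3 for x ~ U(0, 1).
estimate = two_level_estimate(
    f_hi=lambda x: x * x,
    f_lo=lambda x: x,
    sampler=lambda rng: rng.random(),
    n_lo=20000,
    n_hi=2000,
)
```

Because the correction term has a much smaller variance than the expensive model itself, it needs far fewer expensive evaluations. That is why the cost of labeling data to train the surrogate matters for overall efficiency.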

[LG-48] Label Leakage in Federated Inertial-based Human Activity Recognition

Link: https://arxiv.org/abs/2505.20924
Authors: Marius Bock,Maximilian Hopp,Kristof Van Laerhoven,Michael Moeller
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 4 figures

Abstract:While prior work has shown that Federated Learning updates can leak sensitive information, label reconstruction attacks, which aim to recover input labels from shared gradients, have not yet been examined in the context of Human Activity Recognition (HAR). Given the sensitive nature of activity labels, this study evaluates the effectiveness of state-of-the-art gradient-based label leakage attacks on HAR benchmark datasets. Our findings show that the number of activity classes, sampling strategy, and class imbalance are critical factors influencing the extent of label leakage, with reconstruction accuracies reaching up to 90% on two benchmark datasets, even for trained models. Moreover, we find that Local Differential Privacy techniques such as gradient noise and clipping offer only limited protection, as certain attacks still reliably infer both majority and minority class labels. We conclude by offering practical recommendations for the privacy-aware deployment of federated HAR systems and identify open challenges for future research. Code to reproduce our experiments is publicly available via this http URL.
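
Why shared gradients can reveal labels at all is easiest to see in the single-sample case. For softmax cross-entropy, the gradient with respect to the output-layer bias equals softmax(logits) minus the one-hot label, so only the true class has a negative entry. The sketch below shows this well-known observation. It is not claimed to be one of the attacks evaluated in the paper, and the function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_gradient(logits, label):
    """Gradient of cross-entropy w.r.t. the output-layer bias for one sample:
    softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def infer_label(grad):
    """The true class is the only entry with a negative gradient."""
    return min(range(len(grad)), key=lambda k: grad[k])


# Hypothetical example: four activity classes, true label 2.
shared_grad = bias_gradient([2.0, -1.0, 0.5, 0.1], label=2)
recovered = infer_label(shared_grad)
```

With batches of several samples the per-sample gradients are summed, so the sign trick no longer applies directly. This is consistent with the abstract's finding that the number of classes, the sampling strategy, and class imbalance influence how much leaks.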

[LG-49] DeepConvContext: A Multi-Scale Approach to Timeseries Classification in Human Activity Recognition

Link: https://arxiv.org/abs/2505.20894
Authors: Marius Bock,Michael Moeller,Kristof Van Laerhoven
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Comments: 7 pages, 3 figures

Abstract:Despite recognized limitations in modeling long-range temporal dependencies, Human Activity Recognition (HAR) has traditionally relied on a sliding window approach to segment labeled datasets. Deep learning models like the DeepConvLSTM typically classify each window independently, thereby restricting learnable temporal context to within-window information. To address this constraint, we propose DeepConvContext, a multi-scale time series classification framework for HAR. Drawing inspiration from the vision-based Temporal Action Localization community, DeepConvContext models both intra- and inter-window temporal patterns by processing sequences of time-ordered windows. Unlike recent HAR models that incorporate attention mechanisms, DeepConvContext relies solely on LSTMs – with ablation studies demonstrating the superior performance of LSTMs over attention-based variants for modeling inertial sensor data. Across six widely-used HAR benchmarks, DeepConvContext achieves an average 10% improvement in F1-score over the classic DeepConvLSTM, with gains of up to 21%. Code to reproduce our experiments is publicly available via this http URL.
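
The difference between classifying each window independently and processing sequences of time-ordered windows comes down to how the data is batched. A minimal sketch, assuming hypothetical helper names and toy sizes (the paper's window length, overlap, and context length are not given in the abstract):

```python
def sliding_windows(samples, size, stride):
    """Segment a sequence into fixed-size, possibly overlapping windows."""
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, stride)]


def window_sequences(windows, context):
    """Group consecutive windows into time-ordered sequences of length `context`."""
    return [windows[i:i + context]
            for i in range(0, len(windows) - context + 1)]


signal = list(range(10))  # stand-in for an inertial sensor stream
windows = sliding_windows(signal, size=4, stride=2)
sequences = window_sequences(windows, context=3)
```

A per-window classifier would see each element of `windows` in isolation, whereas a model operating on `sequences` can also learn patterns that span neighbouring windows.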

[LG-50] One-Time Soft Alignment Enables Resilient Learning without Weight Transport

Link: https://arxiv.org/abs/2505.20892
Authors: Jeonghwan Cheon,Jaehyuk Bae,Se-Bum Paik
Subjects: Machine Learning (cs.LG)
Comments: 28 pages

Abstract:Backpropagation is the cornerstone of deep learning, but its reliance on symmetric weight transport and global synchronization makes it computationally expensive and biologically implausible. Feedback alignment offers a promising alternative by approximating error gradients through fixed random feedback, thereby avoiding symmetric weight transport. However, this approach often struggles with poor learning performance and instability, especially in deep networks. Here, we show that a one-time soft alignment between forward and feedback weights at initialization enables deep networks to achieve performance comparable to backpropagation, without requiring weight transport during learning. This simple initialization condition guides stable error minimization in the loss landscape, improving network trainability. Spectral analyses further reveal that initial alignment promotes smoother gradient flow and convergence to flatter minima, resulting in better generalization and robustness. Notably, we also find that allowing moderate deviations from exact weight symmetry can improve adversarial robustness compared to standard backpropagation. These findings demonstrate that a simple initialization strategy can enable effective learning in deep networks in a biologically plausible and resource-efficient manner.

[LG-51] Improved Bounds for Swap Multicalibration and Swap Omniprediction

Link: https://arxiv.org/abs/2505.20885
Authors: Haipeng Luo,Spandan Senapati,Vatsal Sharan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In this paper, we consider the related problems of multicalibration (a multigroup fairness notion) and omniprediction (a simultaneous loss minimization paradigm), both in the distributional and online settings. The recent work of Garg et al. (2024) raised the open problem of whether it is possible to efficiently achieve $O(\sqrt{T})$ $\ell_2$-multicalibration error against bounded linear functions. In this paper, we answer this question in a strongly affirmative sense. We propose an efficient algorithm that achieves $O(T^{\frac{1}{3}})$ $\ell_2$-swap multicalibration error (both in high probability and expectation). On propagating this bound onward, we obtain significantly improved rates for $\ell_1$-swap multicalibration and swap omniprediction for a loss class of convex Lipschitz functions. In particular, we show that our algorithm achieves $O(T^{\frac{2}{3}})$ $\ell_1$-swap multicalibration and swap omniprediction errors, thereby improving upon the previous best-known bound of $O(T^{\frac{7}{8}})$. As a consequence of our improved online results, we further obtain several improved sample complexity rates in the distributional setting. In particular, we establish an $O(\varepsilon^{-3})$ sample complexity of efficiently learning an $\varepsilon$-swap omnipredictor for the class of convex and Lipschitz functions, an $O(\varepsilon^{-2.5})$ sample complexity of efficiently learning an $\varepsilon$-swap agnostic learner for the squared loss, and $O(\varepsilon^{-5})$, $O(\varepsilon^{-2.5})$ sample complexities of learning $\ell_1$, $\ell_2$-swap multicalibrated predictors against linear functions, all of which significantly improve on the previous best-known bounds.

[LG-52] Fedivertex: a Graph Dataset based on Decentralized Social Networks for Trustworthy Machine Learning

Link: https://arxiv.org/abs/2505.20882
Authors: Marc Damie,Edwige Cyffers
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Decentralized machine learning - where each client keeps its own data locally and uses its own computational resources to collaboratively train a model by exchanging peer-to-peer messages - is increasingly popular, as it enables better scalability and control over the data. A major challenge in this setting is that learning dynamics depend on the topology of the communication graph, which motivates the use of real graph datasets for benchmarking decentralized algorithms. Unfortunately, existing graph datasets are largely limited to for-profit social networks crawled at a fixed point in time and often collected at the user scale, where links are heavily influenced by the platform and its recommendation algorithms. The Fediverse, which includes several free and open-source decentralized social media platforms such as Mastodon, Misskey, and Lemmy, offers an interesting real-world alternative. We introduce Fedivertex, a new dataset of 182 graphs, covering seven social networks from the Fediverse, crawled weekly over 14 weeks. We release the dataset along with a Python package to facilitate its use, and illustrate its utility on several tasks, including a new defederation task, which captures a process of link deletion observed on these networks.

[LG-53] Aggregation Buffer: Revisiting DropEdge with a New Parameter Block

Link: https://arxiv.org/abs/2505.20840
Authors: Dooho Lee,Myeong Kong,Sagad Hamid,Cheonwoo Lee,Jaemin Yoo
Subjects: Machine Learning (cs.LG)

Abstract:We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. While being a promising approach to effectively reduce overfitting on specific connections in the graph, we observe that its potential performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from the fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at this https URL.
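
For readers unfamiliar with DropEdge, the augmentation itself is only a few lines: each edge is removed independently with some probability, and a fresh subgraph is drawn at every training step. The sketch below shows just that baseline step, with a hypothetical edge list and drop rate. The Aggregation Buffer proposed in the paper is not sketched, since the abstract does not describe its form.

```python
import random


def drop_edge(edges, drop_prob, rng):
    """Keep each undirected edge independently with probability 1 - drop_prob."""
    return [e for e in edges if rng.random() >= drop_prob]


# Hypothetical toy graph and drop rate.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
rng = random.Random(0)
# During training, a new random view would be drawn at every step.
views = [drop_edge(EDGES, drop_prob=0.4, rng=rng) for _ in range(3)]
```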

[LG-54] FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration

Link: https://arxiv.org/abs/2505.20839
Authors: Daehyeon Baek,Jieun Choi,Jimyoung Son,Kyungmin Bin,Seungbeom Choi,Kihyo Moon,Minsung Jang,Hyojung Lee
Subjects: Machine Learning (cs.LG)

Abstract:As large language models become increasingly prevalent, memory bandwidth constraints significantly limit inference throughput, motivating post-training quantization (PTQ). In this paper, we propose FireQ, a co-designed PTQ framework and an INT4-FP8 matrix multiplication kernel that accelerates LLM inference across all linear layers. Specifically, FireQ quantizes linear layer weights and key-values to INT4, and activations and queries to FP8, significantly enhancing throughput. Additionally, we introduce a three-stage pipelining for the prefill phase, which modifies the FlashAttention-3 kernel, effectively reducing time-to-first-token in the prefill phase. To minimize accuracy loss from quantization, we develop novel outlier smoothing techniques tailored separately for linear and attention layers. In linear layers, we explicitly use per-tensor scaling to prevent underflow caused by the FP8 quantization scaling factor of INT4 quantization, and channel-wise scaling to compensate for coarse granularity of INT4. In attention layers, we address quantization challenges posed by rotary positional embeddings (RoPE) by combining pre-RoPE and post-RoPE scaling strategies. FireQ significantly outperforms state-of-the-art methods, achieving 1.68x faster inference in feed-forward network layers on Llama2-7B and 1.26x faster prefill phase performance on Llama3-8B compared to QServe, with negligible accuracy loss.
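
As background for the INT4 weight quantization mentioned above, here is a minimal sketch of generic symmetric quantization with one scale per weight row (channel). It is a textbook scheme chosen for illustration, not FireQ's method: the FP8 scale handling, the outlier smoothing, and the kernel are not reproduced, and the function names and weight values are hypothetical.

```python
def quantize_int4(row):
    """Symmetric quantization of one weight row to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # one scale per row
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Map INT4 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weights: two output channels with different magnitudes.
weights = [[0.12, -0.70, 0.33, 0.05], [1.50, -0.20, 0.90, -1.10]]
quantized = [quantize_int4(row) for row in weights]
restored = [dequantize(q, s) for q, s in quantized]
```

With a separate scale per row, the rounding error of each weight is bounded by half of that row's scale, so a row with small weights is not forced onto the coarse grid of a row with large ones.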

[LG-55] HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

Link: https://arxiv.org/abs/2505.20836
Authors: Hexiong Yang,Mingrui Chen,Huaibo Huang,Junxian Duan,Jie Cao,Zhen Zhou,Ran He
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN)

Abstract:Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameters, our model achieved excellent performance. More surprisingly, it even surpassed the distillation ceiling, the teacher model, on some sub-tasks, although the teacher is more than 500× larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model can gain a sophisticated understanding of the intrinsic representation pattern in genomic sequences.

[LG-56] Interpretable Credit Default Prediction with Ensemble Learning and SHAP

Link: https://arxiv.org/abs/2505.20815
Authors: Shiqi Yang,Ziyi Huang,Wengran Xiao,Xinyu Shen
Subjects: Machine Learning (cs.LG)

Abstract:This study focuses on the problem of credit default prediction, builds a modeling framework based on machine learning, and conducts comparative experiments on a variety of mainstream classification algorithms. Through preprocessing, feature engineering, and model training of the Home Credit dataset, the performance of multiple models including logistic regression, random forest, XGBoost, LightGBM, etc. in terms of accuracy, precision, and recall is evaluated. The results show that the ensemble learning method has obvious advantages in predictive performance, especially in dealing with complex nonlinear relationships between features and data imbalance problems. It shows strong robustness. At the same time, the SHAP method is used to analyze the importance and dependency of features, and it is found that the external credit score variable plays a dominant role in model decision making, which helps to improve the model’s interpretability and practical application value. The research results provide effective reference and technical support for the intelligent development of credit risk control systems.

[LG-57] Simple yet Effective Graph Distillation via Clustering KDD2025

Link: https://arxiv.org/abs/2505.20807
Authors: Yurui Lai,Taiyan Zhang,Renchi Yang
Subjects: Machine Learning (cs.LG)
Comments: This is the technical report of the paper “Simple yet Effective Graph Distillation via Clustering” accepted by KDD 2025

Abstract:Despite plentiful successes achieved by graph representation learning in various domains, the training of graph neural networks (GNNs) still remains tenaciously challenging due to the tremendous computational overhead needed for sizable graphs in practice. Recently, graph data distillation (GDD), which seeks to distill large graphs into compact and informative ones, has emerged as a promising technique to enable efficient GNN training. However, most existing GDD works rely on heuristics that align model gradients or representation distributions on condensed and original graphs, leading to compromised result quality, expensive training for distilling large graphs, or both. Motivated by this, this paper presents an efficient and effective GDD approach, ClustGDD. Under the hood, ClustGDD resorts to synthesizing the condensed graph and node attributes through fast and theoretically-grounded clustering that minimizes the within-cluster sum of squares and maximizes the homophily on the original graph. The fundamental idea is inspired by our empirical and theoretical findings unveiling the connection between clustering and empirical condensation quality using Fréchet Inception Distance, a well-known quality metric for synthetic images. Furthermore, to mitigate the adverse effects caused by the homophily-based clustering, ClustGDD refines the nodal attributes of the condensed graph with a small augmentation learned via class-aware graph sampling and consistency loss. Our extensive experiments exhibit that GNNs trained over condensed graphs output by ClustGDD consistently achieve superior or comparable performance to state-of-the-art GDD methods in terms of node classification on five benchmark datasets, while being orders of magnitude faster.

[LG-58] Quantum Machine Learning in Healthcare: Evaluating QNN and QSVM Models

Link: https://arxiv.org/abs/2505.20804
Authors: Antonio Tudisco,Deborah Volpe,Giovanna Turvani
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)

Abstract:Effective and accurate diagnosis of diseases such as cancer, diabetes, and heart failure is crucial for timely medical intervention and improving patient survival rates. Machine learning has revolutionized diagnostic methods in recent years by developing classification models that detect diseases based on selected features. However, these classification tasks are often highly imbalanced, limiting the performance of classical models. Quantum models offer a promising alternative, exploiting their ability to express complex patterns by operating in a higher-dimensional computational space through superposition and entanglement. These unique properties make quantum models potentially more effective in addressing the challenges of imbalanced datasets. This work evaluates the potential of quantum classifiers in healthcare, focusing on Quantum Neural Networks (QNNs) and Quantum Support Vector Machines (QSVMs), comparing them with popular classical models. The study is based on three well-known healthcare datasets: Prostate Cancer, Heart Failure, and Diabetes. The results indicate that QSVMs outperform QNNs across all datasets because QNNs are susceptible to overfitting. Furthermore, quantum models demonstrate the ability to outperform classical models in scenarios with high dataset imbalance. Although preliminary, these findings highlight the potential of quantum models in healthcare classification tasks and lead the way for further research in this domain.

[LG-59] Multi-VQC: A Novel QML Approach for Enhancing Healthcare Classification

Link: https://arxiv.org/abs/2505.20797
Authors: Antonio Tudisco,Deborah Volpe,Giovanna Turvani
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)

Abstract:Accurate and reliable diagnosis of diseases is crucial in enabling timely medical treatment and enhancing patient survival rates. In recent years, Machine Learning has revolutionized diagnostic practices by creating classification models capable of identifying diseases. However, these classification problems often suffer from significant class imbalances, which can inhibit the effectiveness of traditional models. Therefore, the interest in Quantum models has arisen, driven by the captivating promise of overcoming the limitations of the classical counterpart thanks to their ability to express complex patterns by mapping data in a higher-dimensional computational space.

[LG-60] Enhancing Wearable Tap Water Audio Detection through Subclass Annotation in the HD-Epic Dataset ISWC2025

Link: https://arxiv.org/abs/2505.20788
Authors: Robin Burchard,Kristof Van Laerhoven
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Submitted to ISWC 2025

Abstract:Wearable human activity recognition has been shown to benefit from the inclusion of acoustic data, as the sounds around a person often contain valuable context. However, due to privacy concerns, it is usually not ethically feasible to record and save microphone data from the device, since the audio could, for instance, also contain private conversations. Rather, the data should be processed locally, which in turn requires processing power and consumes energy on the wearable device. One special use case of contextual information that can be utilized to augment special tasks in human activity recognition is water flow detection, which can, e.g., be used to aid wearable hand washing detection. We created a new label called tap water for the recently released HD-Epic data set, creating 717 hand-labeled annotations of tap water flow, based on existing annotations of the water class. We analyzed the relation of tap water and water in the dataset and additionally trained and evaluated two lightweight classifiers to evaluate the newly added label class, showing that the new class can be learned more easily.

[LG-61] STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Link: https://arxiv.org/abs/2505.20781
Authors: Hossein Goli,Michael Gimelfarb,Nathan Samuel de Lara,Haruki Nishimura,Masha Itkina,Florian Shkurti
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE proposes two technical innovations that make it advantageous for OPE: (1) prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
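
The guidance idea in the abstract above can be written schematically as a modified score. This is only a sketch under stated assumptions: the symbols $p_\beta$ (the trajectory model trained on behavior data), $\pi$ (target policy), $\beta$ (behavior policy), and the guidance weights $\alpha$ and $\lambda$ are notation introduced here for illustration and are not taken from the paper.

```latex
% Schematic guided score for a trajectory \tau = (s_1, a_1, \dots, s_T, a_T).
% \alpha and \lambda are hypothetical guidance weights.
\nabla_{\tau} \log \tilde{p}(\tau)
  = \nabla_{\tau} \log p_{\beta}(\tau)
  + \alpha \sum_{t=1}^{T} \nabla_{\tau} \log \pi(a_t \mid s_t)
  - \lambda \sum_{t=1}^{T} \nabla_{\tau} \log \beta(a_t \mid s_t)
```

The first term keeps samples close to the behavior data, the second pulls actions toward the target policy, and the subtracted third term corresponds to the abstract's point (1), counteracting over-regularization toward the behavior policy.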

[LG-62] Non-invasive maturity assessment of iPSC-CMs based on optical maturity characteristics using interpretable AI

Link: https://arxiv.org/abs/2505.20775
Authors: Fabian Scheurer,Alexander Hammer,Mario Schubert,Robert-Patrick Steiner,Oliver Gamm,Kaomei Guan,Frank Sonntag,Hagen Malberg,Martin Schmidt
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Cell Behavior (q-bio.CB)

Abstract:Human induced pluripotent stem cell-derived cardiomyocytes (iPSC-CMs) are an important resource for the identification of new therapeutic targets and cardioprotective drugs. After differentiation, iPSC-CMs show an immature, fetal-like phenotype. Cultivation of iPSC-CMs in lipid-supplemented maturation medium (MM) strongly enhances their structural, metabolic and functional phenotype. Nevertheless, assessing iPSC-CM maturation state remains challenging, as most methods are time-consuming and entail cell damage or loss of the sample. To address this issue, we developed a non-invasive approach for automated classification of iPSC-CM maturity through interpretable artificial intelligence (AI)-based analysis of beat characteristics derived from video-based motion analysis. In a prospective study, we evaluated 230 video recordings of early-state, immature iPSC-CMs on day 21 after differentiation (d21) and more mature iPSC-CMs cultured in MM (d42, MM). For each recording, 10 features were extracted using Maia motion analysis software and entered into a support vector machine (SVM). The hyperparameters of the SVM were optimized in a grid search on 80 % of the data using 5-fold cross-validation. The optimized model achieved an accuracy of 99.5 ± 1.1 % on a hold-out test set. Shapley Additive Explanations (SHAP) identified displacement, relaxation-rise time and beating duration as the most relevant features for assessing maturity level. Our results suggest that non-invasive, optical motion analysis combined with AI-based methods can serve as a tool to assess iPSC-CM maturity and could be applied before performing functional readouts or drug testing. This may reduce the variability and improve the reproducibility of experimental studies.

[LG-63] TimePro: Efficient Multivariate Long-term Time Series Forecasting with Variable- and Time-Aware Hyper-state ICML2025

Link: https://arxiv.org/abs/2505.20774
Authors: Xiaowen Ma,Zhenliang Ni,Shuai Xiao,Xinghao Chen
Categories: Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:In long-term time series forecasting, different variables often influence the target variable over distinct time intervals, a challenge known as the multi-delay issue. Traditional models typically process all variables or time points uniformly, which limits their ability to capture complex variable relationships and obtain non-trivial time representations. To address this issue, we propose TimePro, an innovative Mamba-based model that constructs variate- and time-aware hyper-states. Unlike conventional approaches that merely transfer plain states across variable or time dimensions, TimePro preserves the fine-grained temporal features of each variate token and adaptively selects the focused time points to tune the plain state. The reconstructed hyper-state can perceive both variable relationships and salient temporal information, which helps the model make accurate forecasting. In experiments, TimePro performs competitively on eight real-world long-term forecasting benchmarks with satisfactory linear complexity. Code is available at this https URL.

[LG-64] Robust and Explainable Detector of Time Series Anomaly via Augmenting Multiclass Pseudo-Anomalies KDD2025

Link: https://arxiv.org/abs/2505.20765
Authors: Kohei Obata,Yasuko Matsubara,Yasushi Sakurai
Categories: Machine Learning (cs.LG)
Comments: Accepted by KDD 2025

Abstract:Unsupervised anomaly detection in time series has been a pivotal research area for decades. Current mainstream approaches focus on learning normality, on the assumption that all or most of the samples in the training set are normal. However, anomalies in the training set (i.e., anomaly contamination) can be misleading. Recent studies employ data augmentation to generate pseudo-anomalies and learn the boundary separating the training samples from the augmented samples. Although this approach mitigates anomaly contamination if augmented samples mimic unseen real anomalies, it suffers from several limitations. (1) Covering a wide range of time series anomalies is challenging. (2) It disregards augmented samples that resemble normal samples (i.e., false anomalies). (3) It places too much trust in the labels of training and augmented samples. In response, we propose RedLamp, which employs diverse data augmentations to generate multiclass pseudo-anomalies and learns the multiclass boundary. Such multiclass pseudo-anomalies cover a wide variety of time series anomalies. We conduct multiclass classification using soft labels, which prevents the model from being overconfident and ensures its robustness against contaminated/false anomalies. The learned latent space is inherently explainable as it is trained to separate pseudo-anomalies into multiclasses. Extensive experiments demonstrate the effectiveness of RedLamp in anomaly detection and its robustness against anomaly contamination.

[LG-65] Practical estimation of the optimal classification error with soft labels and calibration

Link: https://arxiv.org/abs/2505.20761
Authors: Ryota Ushio,Takashi Ishida,Masashi Sugiyama
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages, 24 figures; GitHub: this https URL

Abstract:While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory.
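
To make the soft-label idea concrete, here is a minimal sketch, assuming the standard plug-in form of the estimator: the Bayes error is approximated by the average of min(p, 1 - p), where p is the soft label P(y = 1 | x). The function names and the synthetic data are hypothetical and only for illustration. The hard-label variant replaces each soft label with the mean of m sampled hard labels, which biases the estimate downward; the bias shrinks as m grows, which is the quantity the abstract analyzes.

```python
import random


def bayes_error_from_soft_labels(soft_labels):
    """Average of min(p, 1 - p), with p = P(y = 1 | x) for each instance."""
    return sum(min(p, 1.0 - p) for p in soft_labels) / len(soft_labels)


def bayes_error_from_hard_labels(hard_label_sets):
    """Plug-in variant: approximate each soft label by the mean of its hard labels."""
    approx = [sum(labels) / len(labels) for labels in hard_label_sets]
    return bayes_error_from_soft_labels(approx)


def sample_hard_labels(p, m, rng):
    """Draw m Bernoulli(p) hard labels for one instance."""
    return [1 if rng.random() < p else 0 for _ in range(m)]


# Synthetic soft labels (illustrative only, not data from the paper).
rng = random.Random(0)
true_soft = [rng.random() for _ in range(2000)]

clean = bayes_error_from_soft_labels(true_soft)
few = bayes_error_from_hard_labels([sample_hard_labels(p, 3, rng) for p in true_soft])
many = bayes_error_from_hard_labels([sample_hard_labels(p, 100, rng) for p in true_soft])

print(f"clean soft labels: {clean:.3f}")
print(f"3 hard labels per instance: {few:.3f}")
print(f"100 hard labels per instance: {many:.3f}")
```

Note that the estimator never touches the input instances themselves, which matches the "instance-free" property mentioned in the abstract.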

[LG-66] Foundation Model Hidden Representations for Heart Rate Estimation from Auscultation INTERSPEECH2025

Link: https://arxiv.org/abs/2505.20745
Authors: Jingping Nie,Dung T. Tran,Karan Thakkar,Vasudha Kowtha,John Huang,Carlos Avendano,Erdrin Azemi,Vikramjit Mitra
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 5 pages, Interspeech 2025 conference

Abstract:Auscultation, particularly heart sound, is a non-invasive technique that provides essential vital sign information. Recently, self-supervised acoustic representation foundation models (FMs) have been proposed to offer insights into acoustics-based vital signs. However, there has been little exploration of the extent to which auscultation is encoded in these pre-trained FM representations. In this work, using a publicly available phonocardiogram (PCG) dataset and a heart rate (HR) estimation model, we conduct a layer-wise investigation of six acoustic representation FMs: HuBERT, wav2vec2, wavLM, Whisper, Contrastive Language-Audio Pretraining (CLAP), and an in-house CLAP model. Additionally, we implement the baseline method from Nie et al., 2024 (which relies on acoustic features) and show that overall, representation vectors from pre-trained foundation models (FMs) offer comparable performance to the baseline. Notably, HR estimation using the representations from the audio encoder of the in-house CLAP model outperforms the results obtained from the baseline, achieving a lower mean absolute error (MAE) across various train/validation/test splits despite the domain mismatch.

[LG-67] Hello World!: Making GNNs Talk with LLMs

Link: https://arxiv.org/abs/2505.20742
Authors: Sunwoo Kim,Soo Yong Lee,Jaemin Yoo,Kijung Shin
Categories: Machine Learning (cs.LG)
Comments: Code and datasets are in this https URL

Abstract:While graph neural networks (GNNs) have shown remarkable performance across diverse graph-related tasks, their high-dimensional hidden representations render them black boxes. In this work, we propose Graph Lingual Network (GLN), a GNN built on large language models (LLMs), with hidden representations in the form of human-readable text. Through careful prompt design, GLN incorporates not only the message passing module of GNNs but also advanced GNN techniques, including graph attention and initial residual connection. The comprehensibility of GLN’s hidden representations enables an intuitive analysis of how node representations change (1) across layers and (2) under advanced GNN techniques, shedding light on the inner workings of GNNs. Furthermore, we demonstrate that GLN achieves strong zero-shot performance on node classification and link prediction, outperforming existing LLM-based baseline methods.

[LG-68] A reinforcement learning agent for maintenance of deteriorating systems with increasingly imperfect repairs

Link: https://arxiv.org/abs/2505.20725
Authors: Alberto Pliego Marugán,Jesús M. Pinar-Pérez,Fausto Pedro García Márquez
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Cite as: Marugán, A. P., Pinar-Pérez, J. M., Márquez, F. P. G. (2024). A reinforcement learning agent for maintenance of deteriorating systems with increasingly imperfect repairs. Reliability Engineering & System Safety, 252, 110466

Abstract:Efficient maintenance has always been essential for the successful application of engineering systems. However, the challenges to be overcome in the implementation of Industry 4.0 necessitate new paradigms of maintenance optimization. Machine learning techniques are becoming increasingly used in engineering and maintenance, with reinforcement learning being one of the most promising. In this paper, we propose a gamma degradation process together with a novel maintenance model in which repairs are increasingly imperfect, i.e., the beneficial effect of system repairs decreases as more repairs are performed, reflecting the degradational behavior of real-world systems. To generate maintenance policies for this system, we developed a reinforcement-learning-based agent using a Double Deep Q-Network architecture. This agent presents two important advantages: it works without a predefined preventive threshold, and it can operate in a continuous degradation state space. Our agent learns to behave in different scenarios, showing great flexibility. In addition, we performed an analysis of how changes in the main parameters of the environment affect the maintenance policy proposed by the agent. The proposed approach is demonstrated to be appropriate and to significantly improve long-run cost as compared with other common maintenance strategies.
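
The environment described above can be illustrated with a toy simulation. This is a rough sketch, not the paper's model: the geometric decay of repair effectiveness, every numeric parameter, and the fixed repair threshold are assumptions made here for illustration. In particular, the threshold rule is only a stand-in baseline; the paper's agent is stated to work without a predefined preventive threshold.

```python
import random


def repair_effect(k, decay=0.8):
    """Fraction of degradation removed by the (k+1)-th repair (hypothetical form)."""
    return decay ** k


def simulate(steps, shape=0.5, scale=1.0, repair_at=6.0, fail_at=10.0, seed=0):
    """Simulate gamma-distributed degradation under a simple threshold policy."""
    rng = random.Random(seed)
    level, repairs, levels = 0.0, 0, []
    for _ in range(steps):
        # Gamma increments are non-negative, so degradation never heals on its own.
        level += rng.gammavariate(shape, scale)
        if level >= fail_at:
            # Failure: replace the system, which also resets the repair count.
            level, repairs = 0.0, 0
        elif level >= repair_at:
            # Imperfect repair: each successive repair removes a smaller fraction.
            level *= 1.0 - repair_effect(repairs)
            repairs += 1
        levels.append(level)
    return levels


levels = simulate(200)
print(f"max recorded degradation: {max(levels):.2f}")
```

With this setup the first repair is perfect (effect 1.0) and later ones remove progressively less, so replacements eventually become unavoidable.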

[LG-69] Recurrent Neural Operators: Stable Long-Term PDE Prediction

Link: https://arxiv.org/abs/2505.20721
Authors: Zaijun Ye,Chen-Song Zhang,Wansheng Wang
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Neural operators have emerged as powerful tools for learning solution operators of partial differential equations. However, in time-dependent problems, standard training strategies such as teacher forcing introduce a mismatch between training and inference, leading to compounding errors in long-term autoregressive predictions. To address this issue, we propose Recurrent Neural Operators (RNOs), a novel framework that integrates recurrent training into neural operator architectures. Instead of conditioning each training step on ground-truth inputs, RNOs recursively apply the operator to their own predictions over a temporal window, effectively simulating inference-time dynamics during training. This alignment mitigates exposure bias and enhances robustness to error accumulation. Theoretically, we show that recurrent training can reduce the worst-case exponential error growth typical of teacher forcing to linear growth. Empirically, we demonstrate that recurrently trained Multigrid Neural Operators significantly outperform their teacher-forced counterparts in long-term accuracy and stability on standard benchmarks. Our results underscore the importance of aligning training with inference dynamics for robust temporal generalization in neural operator learning.
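
The difference between the two training objectives in the abstract above can be sketched on a scalar toy problem. This is an illustrative sketch only: the function names, the scalar "operator", and the trajectory are assumptions made here, and no neural operator is involved.

```python
def teacher_forced_loss(op, traj):
    """Mean squared one-step error, each step conditioned on the ground truth."""
    n = len(traj) - 1
    return sum((op(traj[t]) - traj[t + 1]) ** 2 for t in range(n)) / n


def recurrent_loss(op, traj, window):
    """Mean squared error when the operator is fed its own predictions."""
    total, count = 0.0, 0
    for start in range(len(traj) - window):
        u = traj[start]
        for k in range(1, window + 1):
            u = op(u)  # roll forward on the prediction, not the ground truth
            total += (u - traj[start + k]) ** 2
            count += 1
    return total / count


# Ground truth follows u_{n+1} = 0.9 * u_n; the "learned" operator is slightly off.
traj = [0.9 ** n for n in range(20)]
approx_op = lambda u: 0.92 * u

print(teacher_forced_loss(approx_op, traj))
print(recurrent_loss(approx_op, traj, window=5))
```

The recurrent loss exposes the compounding error that the one-step loss hides, so minimizing it trains the operator under the same conditions it faces at inference time. With `window=1` the two losses coincide.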

[LG-70] Are Data Embeddings effective in time series forecasting? NEURIPS

Link: https://arxiv.org/abs/2505.20716
Authors: Reza Nematirad,Anil Pahwa,Balasubramaniam Natarajan
Categories: Machine Learning (cs.LG)
Comments: Code is available at: this https URL

Abstract:Time series forecasting plays a crucial role in many real-world applications, and numerous complex forecasting models have been proposed in recent years. Despite their architectural innovations, most state-of-the-art models report only marginal improvements – typically just a few thousandths in standard error metrics. These models often incorporate complex data embedding layers to transform raw inputs into higher-dimensional representations to enhance accuracy. But are data embedding techniques actually effective in time series forecasting? Through extensive ablation studies across fifteen state-of-the-art models and four benchmark datasets, we find that removing data embedding layers from many state-of-the-art models does not degrade forecasting performance. In many cases, it improves both accuracy and computational efficiency. The gains from removing embedding layers often exceed the performance differences typically reported between competing models. Code available at: this https URL

[LG-71] Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures

Link: https://arxiv.org/abs/2505.20705
Authors: Yang Wang,Wenxuan Zhu,Xuehui Quan,Heyi Wang,Chang Liu,Qiyuan Wu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:This paper addresses the challenges of fault prediction and delayed response in distributed systems by proposing an intelligent prediction method based on temporal feature learning. The method takes multi-dimensional performance metric sequences as input. We use a Gated Recurrent Unit (GRU) to model the evolution of system states over time. An attention mechanism is then applied to enhance key temporal segments, improving the model’s ability to identify potential faults. On this basis, a feedforward neural network is designed to perform the final classification, enabling early warning of system failures. To validate the effectiveness of the proposed approach, comparative experiments and ablation analyses were conducted using data from a large-scale real-world cloud system. The experimental results show that the model outperforms various mainstream time-series models in terms of Accuracy, F1-Score, and AUC. This demonstrates strong prediction capability and stability. Furthermore, the loss function curve confirms the convergence and reliability of the training process. It indicates that the proposed method effectively learns system behavior patterns and achieves efficient fault detection.

[LG-72] Sparsified State-Space Models are Efficient Highway Networks

Link: https://arxiv.org/abs/2505.20698
Authors: Woomin Song,Jihoon Tack,Sangwoo Mo,Seunghyuk Oh,Jinwoo Shin
Categories: Machine Learning (cs.LG)
Comments: Accepted to TMLR 2025.03

Abstract:State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at this https URL.

[LG-73] An Optimisation Framework for Unsupervised Environment Design

Link: https://arxiv.org/abs/2505.20659
Authors: Nathan Monette,Alistair Letcher,Michael Beukman,Matthew T. Jackson,Alexander Rutherford,Alexander D. Goldie,Jakob N. Foerster
Categories: Machine Learning (cs.LG)
Comments: Reinforcement Learning Conference 2025

Abstract:For reinforcement learning agents to be deployed in high-risk settings, they must achieve a high level of robustness to unfamiliar scenarios. One method for improving robustness is unsupervised environment design (UED), a suite of methods aiming to maximise an agent’s generalisability across configurations of an environment. In this work, we study UED from an optimisation perspective, providing stronger theoretical guarantees for practical settings than prior work. Whereas previous methods relied on guarantees if they reach convergence, our framework employs a nonconvex-strongly-concave objective for which we provide a provably convergent algorithm in the zero-sum setting. We empirically verify the efficacy of our method, outperforming prior methods in a number of environments with varying difficulties.

[LG-74] Explaining Concept Shift with Interpretable Feature Attribution

Link: https://arxiv.org/abs/2505.20634
Authors: Ruiqi Lyu,Alistair Turcan,Bryan Wilder
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Regardless of the amount of data a machine learning (ML) model is trained on, there will inevitably be data that differs from its training set, lowering model performance. Concept shift occurs when the distribution of labels conditioned on the features changes, so that even a well-tuned ML model has learned a fundamentally incorrect representation. Identifying these shifted features provides unique insight into how one dataset differs from another, considering the difference may be across a scientifically relevant dimension, such as time, disease status, population, etc. In this paper, we propose SGShift, a model for detecting concept shift in tabular data and attributing reduced model performance to a sparse set of shifted features. SGShift models concept shift with a Generalized Additive Model (GAM) and performs subsequent feature selection to identify shifted features. We propose further extensions of SGShift by incorporating knockoffs to control false discoveries and an absorption term to account for models with poor fit to the data. We conduct extensive experiments in synthetic and real data across various ML models and find SGShift can identify shifted features with AUC 0.9 and recall 90%, often 2 or 3 times as high as baseline methods.

[LG-75] Position: Adopt Constraints Over Penalties in Deep Learning ALT

Link: https://arxiv.org/abs/2505.20628
Authors: Juan Ramirez,Meraj Hashemizadeh,Simon Lacoste-Julien
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Code available at this https URL

Abstract:Recent efforts toward developing trustworthy AI systems with accountability guarantees have led to a growing reliance on machine learning formulations that incorporate external requirements, or constraints. These requirements are often enforced through penalization–adding fixed-weight terms to the task loss. We argue that this approach is ill-suited, and that tailored constrained optimization methods should be adopted instead. In particular, no penalty coefficient may yield a solution that both satisfies the constraints and achieves good performance–i.e., one solving the constrained problem. Moreover, tuning these coefficients is costly, incurring significant time and computational overhead. In contrast, tailored constrained methods–such as the Lagrangian approach, which optimizes the penalization “coefficients” (the Lagrange multipliers) alongside the model–(i) truly solve the constrained problem and add accountability, (ii) eliminate the need for extensive penalty tuning, and (iii) integrate seamlessly with modern deep learning pipelines.
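
As a small illustration of the Lagrangian approach mentioned above, the sketch below runs gradient descent on the variable and projected gradient ascent on the multiplier for a one-dimensional toy problem: minimize (x - 2)^2 subject to x - 1 <= 0. The problem, step size, and function names are assumptions chosen here for illustration and do not come from the paper or its code.

```python
def objective_grad(x):
    """Gradient of the toy task loss f(x) = (x - 2)^2."""
    return 2.0 * (x - 2.0)


def constraint(x):
    """Constraint g(x) = x - 1, required to satisfy g(x) <= 0."""
    return x - 1.0


def lagrangian_gda(steps=5000, lr=0.05):
    """Descent on x and projected ascent on the multiplier lam >= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = objective_grad(x) + lam  # d/dx of f(x) + lam * g(x)
        grad_lam = constraint(x)          # d/dlam of f(x) + lam * g(x)
        x -= lr * grad_x
        lam = max(0.0, lam + lr * grad_lam)
    return x, lam


def penalized_minimizer(c):
    """Closed-form argmin of f(x) + c * g(x) for a fixed penalty weight c."""
    return 2.0 - c / 2.0


x_star, lam_star = lagrangian_gda()
print(f"Lagrangian: x = {x_star:.4f}, multiplier = {lam_star:.4f}")
print(f"fixed penalty c = 0.5: x = {penalized_minimizer(0.5):.4f}")
```

In this toy case the multiplier settles at the one weight for which the solution sits exactly on the constraint boundary, whereas a fixed penalty weight has to be guessed in advance and an undersized one leaves the constraint violated.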

[LG-76] Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Link: https://arxiv.org/abs/2505.20589
Authors: Mahdi Pourmirzaei,Farzaneh Esmaili,Salhuldin Alqarghuli,Mohammadreza Pourmirzaei,Ye Han,Kai Chen,Mohsen Rezaei,Duolin Wang,Dong Xu
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., near 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at this https URL.

[LG-77] Learning a Pessimistic Reward Model in RLHF

Link: https://arxiv.org/abs/2505.20556
Authors: Yinglun Xu,Hangoo Kang,Tarun Suresh,Yuxuan Wan,Gagandeep Singh
Categories: Machine Learning (cs.LG)

Abstract:This work proposes `PET’, a novel pessimistic reward fine-tuning method, to learn a pessimistic reward model robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, on which a KL regularization plays a pivotal role in mitigating reward hacking when optimizing a policy. Such an intuition-based method still suffers from reward hacking, and the policies with large KL divergence from the dataset distribution are excluded during learning. In contrast, we show that when optimizing a policy on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our methods on the standard TL;DR summarization dataset. We find that one can learn a high-quality policy on our pessimistic reward without using any regularization. Such a policy has a high KL divergence from the dataset distribution while having high performance in practice. In summary, our work shows the feasibility of learning a pessimistic reward model against reward hacking. The agent can greedily search for the policy with a high pessimistic reward without suffering from reward hacking.

[LG-78] A ZeNN architecture to avoid the Gaussian trap

Link: https://arxiv.org/abs/2505.20553
Authors: Luís Carvalho,João L. Costa,José Mourão,Gonçalo Oliveira
Categories: Machine Learning (cs.LG); Probability (math.PR)

Abstract:We propose a new simple architecture, Zeta Neural Networks (ZeNNs), in order to overcome several shortcomings of standard multi-layer perceptrons (MLPs). Namely, in the large width limit, MLPs are non-parametric, they do not have a well-defined pointwise limit, they lose non-Gaussian attributes and become unable to perform feature learning; moreover, finite width MLPs perform poorly in learning high frequencies. The new ZeNN architecture is inspired by three simple principles from harmonic analysis: i) Enumerate the perceptrons and introduce a non-learnable weight to enforce convergence; ii) Introduce a scaling (or frequency) factor; iii) Choose activation functions that lead to near orthogonal systems. We will show that these ideas allow us to fix the aforementioned shortcomings of MLPs. In fact, in the infinite width limit, ZeNNs converge pointwise, they exhibit a rich asymptotic structure beyond Gaussianity, and perform feature learning. Moreover, when appropriate activation functions are chosen, (finite width) ZeNNs excel at learning high-frequency features of functions with low dimensional domains.

[LG-79] Rotary Masked Autoencoders are Versatile Learners

Link: https://arxiv.org/abs/2505.20535
Authors: Uros Zivanovic,Serafina Di Gioia,Andre Scaffidi,Martín de los Rios,Gabriella Contardo,Roberto Trotta
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 5 figures

Abstract:Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE’s performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE’s usual performance across other modalities. In addition, we investigate RoMAE’s ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE’s relative position property.
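
For readers unfamiliar with RoPE, the sketch below shows the standard rotary embedding applied at a continuous (non-integer) position, as one might use for irregular time stamps. It is a generic illustration rather than RoMAE's implementation: the function names and the base value of 10000 are assumptions made here. The final check demonstrates the relative position property that the abstract refers to: the query-key dot product depends only on the difference between the two positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to position."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 1.0 apart, so the two scores should match.
score_a = dot(rope(q, 1.5), rope(k, 0.5))
score_b = dot(rope(q, 3.25), rope(k, 2.25))
print(score_a, score_b)
```

Because positions enter only through rotation angles, nothing in the construction requires them to be integers, which is what makes the method applicable to irregularly sampled data.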

[LG-80] One-shot Robust Federated Learning of Independent Component Analysis

Link: https://arxiv.org/abs/2505.20532
Authors: Dian Jin,Xin Bing,Yuqian Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:This paper investigates a general robust one-shot aggregation framework for the distributed and federated Independent Component Analysis (ICA) problem. We propose a geometric median-based aggregation algorithm that leverages k-means clustering to resolve the permutation ambiguity in local client estimations. Our method first performs k-means to partition client-provided estimators into clusters and then aggregates estimators within each cluster using the geometric median. This approach provably remains effective even in highly heterogeneous scenarios where at most half of the clients can observe only a minimal number of samples. The key theoretical contribution lies in the combined analysis of the geometric median’s error bound (aided by sample quantiles) and the maximum misclustering rates of the aforementioned k-means solution. The effectiveness of the proposed approach is further supported by simulation studies conducted under various heterogeneous settings.
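
To make the robustness of the geometric median concrete, here is a minimal sketch of the within-cluster aggregation step only, using the classical Weiszfeld iteration. The k-means step is omitted, the client estimates are made-up toy values, and `geometric_median` is an illustrative name rather than the authors' code.

```python
import math

def geometric_median(points, iters=200, eps=1e-12):
    """Weiszfeld iteration for the geometric median of equal-length vectors."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    y = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Inverse-distance weight, guarded against division by zero.
            w = 1.0 / max(math.dist(p, y), eps)
            den += w
            for d in range(dim):
                num[d] += w * p[d]
        new_y = [n / den for n in num]
        converged = math.dist(new_y, y) < 1e-10
        y = new_y
        if converged:
            break
    return y

# Four toy client estimates near (1, 0) and one badly corrupted estimate.
estimates = [(1.0, 0.0), (1.1, 0.0), (0.9, 0.0), (1.0, 0.1), (100.0, 100.0)]
median = geometric_median(estimates)
mean = [sum(p[d] for p in estimates) / len(estimates) for d in range(2)]
```

In this toy setting the mean is dragged far toward the corrupted estimate, while the geometric median stays close to the cluster of consistent estimates.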

[LG-81] Training Articulatory Inversion Models for Inter-Speaker Consistency

Link: https://arxiv.org/abs/2505.20529
Authors: Charles McGhee,Mark J.F. Gales,Kate M. Knill
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Acoustic-to-Articulatory Inversion (AAI) attempts to model the inverse mapping from speech to articulation. Exact articulatory prediction from speech alone may be impossible, as speakers can choose different forms of articulation seemingly without reference to their vocal tract structure. However, once a speaker has selected an articulatory form, their productions vary minimally. Recent works in AAI have proposed adapting Self-Supervised Learning (SSL) models to single-speaker datasets, claiming that these single-speaker models provide a universal articulatory template. In this paper, we investigate whether SSL-adapted models trained on single and multi-speaker data produce articulatory targets which are consistent across speaker identities for English and Russian. We do this through the use of a novel evaluation method which extracts articulatory targets using minimal pair sets. We also present a training method which can improve inter-speaker consistency using only speech data.

[LG-82] Towards Fully FP8 GEMM LLM Training at Scale

Link: https://arxiv.org/abs/2505.20524
Authors: Alejandro Hernández-Cano,Dhia Garbaya,Imanol Schlag,Martin Jaggi
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 7 figures

Click to view abstract

Abstract:Despite the significant potential of FP8 data formats for large language model (LLM) pre-training, their adoption has been limited due to challenges in maintaining stability at scale. Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications (GEMMs) in sensitive components, such as attention projections, compromising potential throughput gains. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training. Our architecture design reduces large outlier activations, promoting stable long-term FP8 training. In addition, we identify key metrics to monitor low-precision training and predict potential future divergences.

[LG-83] Semi-Explicit Neural DAEs: Learning Long-Horizon Dynamical Systems with Algebraic Constraints

Link: https://arxiv.org/abs/2505.20515
Authors: Avik Pal,Alan Edelman,Christopher Rackauckas
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Despite the promise of scientific machine learning (SciML) in combining data-driven techniques with mechanistic modeling, existing approaches for incorporating hard constraints in neural differential equations (NDEs) face significant limitations. Scalability issues and poor numerical properties prevent these neural models from being used for modeling physical systems with complicated conservation laws. We propose Manifold-Projected Neural ODEs (PNODEs), a method that explicitly enforces algebraic constraints by projecting each ODE step onto the constraint manifold. This framework arises naturally from semi-explicit differential-algebraic equations (DAEs), and includes both a robust iterative variant and a fast approximation requiring a single Jacobian factorization. We further demonstrate that prior works on relaxation methods are special cases of our approach. PNODEs consistently outperform baselines across six benchmark problems, achieving a mean constraint violation error below $10^{-10}$. Additionally, PNODEs consistently achieve lower runtime compared to other methods for a given level of error tolerance. These results show that constraint projection offers a simple strategy for learning physically consistent long-horizon dynamics.
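
The idea of projecting each ODE step onto the constraint manifold can be sketched on a toy problem. The example below is an assumption-laden illustration, not the PNODE method itself: it uses a hand-written vector field (rotation in the plane) instead of a neural network, explicit Euler as the integrator, and a Newton-style projection onto the unit circle $g(x, y) = x^2 + y^2 - 1 = 0$.

```python
def euler_step(state, dt):
    # Toy dynamics x' = -y, y' = x, whose exact flow stays on the unit circle.
    x, y = state
    return (x - dt * y, y + dt * x)

def constraint(state):
    x, y = state
    return x * x + y * y - 1.0

def project(state, iters=5):
    """Newton-style projection: z <- z - g(z) * grad g(z) / |grad g(z)|^2."""
    x, y = state
    for _ in range(iters):
        g = x * x + y * y - 1.0
        gx, gy = 2.0 * x, 2.0 * y
        scale = g / (gx * gx + gy * gy)
        x, y = x - scale * gx, y - scale * gy
    return (x, y)

def integrate(steps, dt, use_projection):
    state = (1.0, 0.0)
    for _ in range(steps):
        state = euler_step(state, dt)
        if use_projection:
            state = project(state)
    return state

plain = integrate(1000, 0.01, use_projection=False)
projected = integrate(1000, 0.01, use_projection=True)
```

Without projection, explicit Euler drifts off the circle and the violation grows with the horizon; with projection after every step, the constraint is satisfied to near machine precision in this toy case.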

[LG-84] BlastOFormer: Attention and Neural Operator Deep Learning Methods for Explosive Blast Prediction

Link: https://arxiv.org/abs/2505.20454
Authors: Reid Graves,Anthony Zhou,Amir Barati Farimani
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 9 figures

Click to view abstract

Abstract:Accurate prediction of blast pressure fields is essential for applications in structural safety, defense planning, and hazard mitigation. Traditional methods such as empirical models and computational fluid dynamics (CFD) simulations offer limited trade-offs between speed and accuracy; empirical models fail to capture complex interactions in cluttered environments, while CFD simulations are computationally expensive and time-consuming. In this work, we introduce BlastOFormer, a novel Transformer-based surrogate model for full-field maximum pressure prediction from arbitrary obstacle and charge configurations. BlastOFormer leverages a signed distance function (SDF) encoding and a grid-to-grid attention-based architecture inspired by OFormer and Vision Transformer (ViT) frameworks. Trained on a dataset generated using the open-source blastFoam CFD solver, our model outperforms convolutional neural networks (CNNs) and Fourier Neural Operators (FNOs) across both log-transformed and unscaled domains. Quantitatively, BlastOFormer achieves the highest $R^2$ score (0.9516) and lowest error metrics, while requiring only 6.4 milliseconds for inference, more than 600,000 times faster than CFD simulations. Qualitative visualizations and error analyses further confirm BlastOFormer’s superior spatial coherence and generalization capabilities. These results highlight its potential as a real-time alternative to conventional CFD approaches for blast pressure estimation in complex environments.

[LG-85] Active Learning for Multiple Change Point Detection in Non-stationary Time Series with Deep Gaussian Processes

Link: https://arxiv.org/abs/2505.20452
Authors: Hao Zhao,Rong Pan
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multiple change point (MCP) detection in non-stationary time series is challenging due to the variety of underlying patterns. To address these challenges, we propose a novel algorithm that integrates Active Learning (AL) with Deep Gaussian Processes (DGPs) for robust MCP detection. Our method leverages spectral analysis to identify potential changes and employs AL to strategically select new sampling points for improved efficiency. By incorporating the modeling flexibility of DGPs with the change-identification capabilities of spectral methods, our approach adapts to diverse spectral change behaviors and effectively localizes multiple change points. Experiments on both simulated and real-world data demonstrate that our method outperforms existing techniques in terms of detection accuracy and sampling efficiency for non-stationary time series.

[LG-86] Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach

Link: https://arxiv.org/abs/2505.20446
Authors: Tal Gonen,Itai Pemper,Ilan Naiman,Nimrod Berman,Omri Azencot
Categories: Machine Learning (cs.LG)
Comments: The first two authors contributed equally

Click to view abstract

Abstract:Generative modeling of time series is a central challenge in time series analysis, particularly under data-scarce conditions. Despite recent advances in generative modeling, a comprehensive understanding of how state-of-the-art generative models perform under limited supervision remains lacking. In this work, we conduct the first large-scale study evaluating leading generative models in data-scarce settings, revealing a substantial performance gap between full-data and data-scarce regimes. To close this gap, we propose a unified diffusion-based generative framework that can synthesize high-fidelity time series across diverse domains using just a few examples. Our model is pre-trained on a large, heterogeneous collection of time series datasets, enabling it to learn generalizable temporal representations. It further incorporates architectural innovations such as dynamic convolutional layers for flexible channel adaptation and dataset token conditioning for domain-aware generation. Without requiring abundant supervision, our unified model achieves state-of-the-art performance in few-shot settings, outperforming domain-specific baselines across a wide range of subset sizes. Remarkably, it also surpasses all baselines even when tested on full-dataset benchmarks, highlighting the strength of pre-training and cross-domain generalization. We hope this work encourages the community to revisit few-shot generative modeling as a key problem in time series research and pursue unified solutions that scale efficiently across domains. Code is available at this https URL.

[LG-87] GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining

Link: https://arxiv.org/abs/2505.20380
Authors: Simin Fan,Maria Ios Glarou,Martin Jaggi
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The performance of large language models (LLMs) across diverse downstream applications is fundamentally governed by the quality and composition of their pretraining corpora. Existing domain reweighting algorithms primarily optimize data mixtures for a single target task, thereby resulting in models that overfit to specialized objectives while exhibiting substantial performance degradation on other benchmarks. This paper introduces Group Robust Multi-target Adaptive PrEtraining (GRAPE), a novel multi-source-multi-target domain reweighting framework designed to calibrate pretraining data mixtures for robust performance across multiple target tasks simultaneously. GRAPE dynamically adjusts sampling weights across source domains (domain weights) while concurrently modulating task weights that quantify the relative importance of each individual target task. This adaptive process prioritizes tasks based on their learning difficulty throughout training. We formulate this interleaved reweighting mechanism as a minimax optimization problem: the inner maximization adjusts task weights leveraging group distributed-robust-optimization (DRO), where those tasks demonstrating the least improvement under the current data mixture are prioritized with higher weights; the outer minimization then optimizes domain weights to maximize loss reduction on the prioritized tasks. Experiments on ClimbLab and SlimPajama datasets demonstrate that GRAPE consistently outperforms baseline methods in terms of reasoning performance across 6 benchmarks. Furthermore, when applied to multilingual targets, GRAPE effectively identifies optimal training mixtures from mainstream languages, achieving superior language modeling capabilities across 8 low-resource target languages.

[LG-88] Learning mechanical systems from real-world data using discrete forced Lagrangian dynamics

Link: https://arxiv.org/abs/2505.20370
Authors: Martine Dyring Hansen,Elena Celledoni,Benjamin Kwanen Tampley
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce a data-driven method for learning the equations of motion of mechanical systems directly from position measurements, without requiring access to velocity data. This is particularly relevant in system identification tasks where only positional information is available, such as motion capture, pixel data or low-resolution tracking. Our approach takes advantage of the discrete Lagrange-d’Alembert principle and the forced discrete Euler-Lagrange equations to construct a physically grounded model of the system’s dynamics. We decompose the dynamics into conservative and non-conservative components, which are learned separately using feed-forward neural networks. In the absence of external forces, our method reduces to a variational discretization of the action principle naturally preserving the symplectic structure of the underlying Hamiltonian system. We validate our approach on a variety of synthetic and real-world datasets, demonstrating its effectiveness compared to baseline methods. In particular, we apply our model to (1) measured human motion data and (2) latent embeddings obtained via an autoencoder trained on image sequences. We demonstrate that we can faithfully reconstruct and separate both the conservative and forced dynamics, yielding interpretable and physically consistent predictions.

[LG-89] Learning and Interpreting Gravitational-Wave Features from CNNs with a Random Forest Approach

Link: https://arxiv.org/abs/2505.20357
Authors: Jun Tian,He Wang,Jibo He,Yu Pan,Shuo Cao,Qingquan Jiang
Categories: Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Click to view abstract

Abstract:Convolutional neural networks (CNNs) have become widely adopted in gravitational wave (GW) detection pipelines due to their ability to automatically learn hierarchical features from raw strain data. However, the physical meaning of these learned features remains underexplored, limiting the interpretability of such models. In this work, we propose a hybrid architecture that combines a CNN-based feature extractor with a random forest (RF) classifier to improve both detection performance and interpretability. Unlike prior approaches that directly connect classifiers to CNN outputs, our method introduces four physically interpretable metrics - variance, signal-to-noise ratio (SNR), waveform overlap, and peak amplitude - computed from the final convolutional layer. These are jointly used with the CNN output in the RF classifier to enable more informed decision boundaries. Tested on long-duration strain datasets, our hybrid model outperforms a baseline CNN model, achieving a relative improvement of 21% in sensitivity at a fixed false alarm rate of 10 events per month. Notably, it also shows improved detection of low-SNR signals (SNR $\le 10$), which are especially vulnerable to misclassification in noisy environments. Feature attribution via the RF model reveals that both CNN-extracted and handcrafted features contribute significantly to classification decisions, with learned variance and CNN outputs ranked among the most informative. These findings suggest that physically motivated post-processing of CNN feature maps can serve as a valuable tool for interpretable and efficient GW detection, bridging the gap between deep learning and domain knowledge.

[LG-90] Joint-stochastic-approximation Random Fields with Application to Semi-supervised Learning ICML2018

Link: https://arxiv.org/abs/2505.20330
Authors: Yunfu Song,Zhijian Ou
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2018 submission. arXiv admin note: text overlap with arXiv:1808.01630, arXiv:2505.18558

Click to view abstract

Abstract:Our examination of deep generative models (DGMs) developed for semi-supervised learning (SSL), mainly GANs and VAEs, reveals two problems. First, mode missing and mode covering phenomena are observed in generation with GANs and VAEs. Second, there exists an awkward conflict between good classification and good generation in SSL by employing directed generative models. To address these problems, we formally present joint-stochastic-approximation random fields (JRFs) – a new family of algorithms for building deep undirected generative models, with application to SSL. It is found through synthetic experiments that JRFs work well in balancing mode covering and mode missing, and match the empirical data distribution well. Empirically, JRFs achieve good classification results comparable to the state-of-the-art methods on widely adopted datasets – MNIST, SVHN, and CIFAR-10 in SSL, and simultaneously perform good generation.

[LG-91] FMEnets: Flow Material and Energy networks for non-ideal plug flow reactor design

Link: https://arxiv.org/abs/2505.20300
Authors: Chenxi Wu,Juan Diego Toscano,Khemraj Shukla,Yingjie Chen,Ali Shahmohammadi,Edward Raymond,Thomas Toupy,Neda Nazemifard,Charles Papageorgiou,George Em Karniadakis
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Click to view abstract

Abstract:We propose FMEnets, a physics-informed machine learning framework for the design and analysis of non-ideal plug flow reactors. FMEnets integrates the fundamental governing equations (Navier-Stokes for fluid flow, material balance for reactive species transport, and energy balance for temperature distribution) into a unified multi-scale network model. The framework is composed of three interconnected sub-networks with independent optimizers that enable both forward and inverse problem-solving. In the forward mode, FMEnets predicts velocity, pressure, species concentrations, and temperature profiles using only inlet and outlet information. In the inverse mode, FMEnets utilizes sparse multi-residence-time measurements to simultaneously infer unknown kinetic parameters and states. FMEnets can be implemented either as FME-PINNs, which employ conventional multilayer perceptrons, or as FME-KANs, based on Kolmogorov-Arnold Networks. Comprehensive ablation studies highlight the critical role of the FMEnets architecture in achieving accurate predictions. Specifically, FME-KANs are more robust to noise than FME-PINNs, although both representations are comparable in accuracy and speed in noise-free conditions. The proposed framework is applied to three different sets of reaction scenarios and is compared with finite element simulations. FMEnets effectively captures the complex interactions, achieving relative errors less than 2.5% for the unknown kinetic parameters. The new network framework not only provides a computationally efficient alternative for reactor design and optimization, but also opens new avenues for integrating empirical correlations, limited and noisy experimental data, and fundamental physical equations to guide reactor design.

[LG-92] A Physics-Augmented GraphGPS Framework for the Reconstruction of 3D Riemann Problems from Sparse Data

Link: https://arxiv.org/abs/2505.21421
Authors: Rami Cassia,Rich Kerswell
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In compressible fluid flow, reconstructing shocks, discontinuities, rarefactions, and their interactions from sparse measurements is an important inverse problem with practical applications. Moreover, physics-informed machine learning has recently become an increasingly popular approach for performing reconstruction tasks. In this work we explore a machine learning recipe, known as GraphGPS, for reconstructing canonical compressible flows known as 3D Riemann problems from sparse observations, in a physics-informed manner. The GraphGPS framework combines the benefits of positional encodings, local message-passing of graphs, and global contextual awareness, and we explore the latter two components through an ablation study. Furthermore, we modify the aggregation step of message-passing such that it is aware of shocks and discontinuities, resulting in sharper reconstructions of these features. Additionally, we modify message-passing such that information flows strictly from known nodes only, which results in computational savings, better training convergence, and no degradation of reconstruction accuracy. We also show that the GraphGPS framework outperforms numerous machine learning benchmarks.

[LG-93] Wavelet Flow For Extragalactic Foreground Simulations

Link: https://arxiv.org/abs/2505.21220
Authors: M. Mebratu,W. L. K. Wu
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures

Click to view abstract

Abstract:Extragalactic foregrounds in cosmic microwave background (CMB) observations are both a source of cosmological and astrophysical information and a nuisance to the CMB. Effective field-level modeling that captures their non-Gaussian statistical distributions is increasingly important for optimal information extraction, particularly given the precise and low-noise observations from current and upcoming experiments. We explore the use of Wavelet Flow (WF) models to tackle the novel task of modeling the field-level probability distributions of multi-component CMB secondaries. Specifically, we jointly train correlated CMB lensing convergence ($\kappa$) and cosmic infrared background (CIB) maps with a WF model and obtain a network that statistically recovers the input to high accuracy – the trained network generates samples of $\kappa$ and CIB fields whose average power spectra are within a few percent of the inputs across all scales, and whose Minkowski functionals are similarly accurate compared to the inputs. Leveraging the multiscale architecture of these models, we fine-tune both the model parameters and the priors at each scale independently, optimizing performance across different resolutions. These results demonstrate that WF models can accurately simulate correlated components of CMB secondaries, supporting improved analysis of cosmological data. Our code and trained models can be found here (this https URL).

[LG-94] Transfer learning for multifidelity simulation-based inference in cosmology

Link: https://arxiv.org/abs/2505.21215
Authors: Alex A. Saoulis,Davide Piras,Niall Jeffrey,Alessio Spurio Mancini,Ana M. G. Ferreira,Benjamin Joachimi
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 9+4 pages, 8+5 figures

Click to view abstract

Abstract:Simulation-based inference (SBI) enables cosmological parameter estimation when closed-form likelihoods or models are unavailable. However, SBI relies on machine learning for neural compression and density estimation. This requires large training datasets which are prohibitively expensive for high-quality simulations. We overcome this limitation with multifidelity transfer learning, combining less expensive, lower-fidelity simulations with a limited number of high-fidelity simulations. We demonstrate our methodology on dark matter density maps from two separate simulation suites in the hydrodynamical CAMELS Multifield Dataset. Pre-training on dark-matter-only $N$-body simulations reduces the required number of high-fidelity hydrodynamical simulations by a factor between 8 and 15, depending on the model complexity, posterior dimensionality, and performance metrics used. By leveraging cheaper simulations, our approach enables performant and accurate inference on high-fidelity models while substantially reducing computational costs.

[LG-95] Input Convex Kolmogorov Arnold Networks

Link: https://arxiv.org/abs/2505.21208
Authors: Thomas Deschatre,Xavier Warin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:This article presents an input convex neural network architecture using Kolmogorov-Arnold networks (ICKAN). Two specific networks are presented: the first is based on a low-order, linear-by-part, representation of functions, and a universal approximation theorem is provided. The second is based on cubic splines, for which only numerical results support convergence. We demonstrate on simple tests that these networks perform competitively with classical input convex neural networks (ICNNs). In a second part, we use the networks to solve some optimal transport problems needing a convex approximation of functions and demonstrate their effectiveness. Comparisons with ICNNs show that cubic ICKANs produce results similar to those of classical ICNNs.

[LG-96] Identifying Heart Attack Risk in Vulnerable Population: A Machine Learning Approach

Link: https://arxiv.org/abs/2505.21139
Authors: Subhagata Chattopadhyay,Amit K Chattopadhyay
Categories: Populations and Evolution (q-bio.PE); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 16 pages, 2 figures, 7 tables

Click to view abstract

Abstract:The COVID-19 pandemic has significantly increased the incidence of post-infection cardiovascular events, particularly myocardial infarction, in individuals over 40. While the underlying mechanisms remain elusive, this study employs a hybrid machine learning approach to analyze epidemiological data in assessing 13 key heart attack risk factors and their susceptibility. Based on a unique dataset that combines demographic, biochemical, ECG, and thallium stress-tests, this study categorizes distinct subpopulations against varying risk profiles and then divides the population into ‘at-risk’ (AR) and ‘not-at-risk’ (NAR) groups using clustering algorithms. The study reveals a strong association between the likelihood of experiencing a heart attack and the 13 risk factors studied. The aggravated risk for postmenopausal patients indicates compromised individual risk factors due to estrogen depletion that may be further compromised by extraneous stress impacts, like anxiety and fear, aspects that have traditionally eluded data modeling predictions.

[LG-97] Cardiac Digital Twins at Scale from MRI: Open Tools and Representative Models from ~55000 UK Biobank Participants

Link: https://arxiv.org/abs/2505.21019
Authors: Devran Ugurlu,Shuang Qian,Elliot Fairweather,Charlene Mauger,Bram Ruijsink,Laura Dal Toso,Yu Deng,Marina Strocchi,Reza Razavi,Alistair Young,Pablo Lamata,Steven Niederer,Martin Bishop
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:A cardiac digital twin is a virtual replica of a patient’s heart for screening, diagnosis, prognosis, risk assessment, and treatment planning of cardiovascular diseases. This requires an anatomically accurate patient-specific 3D structural representation of the heart, suitable for electro-mechanical simulations or study of disease mechanisms. However, generation of cardiac digital twins at scale is demanding and there are no public repositories of models across demographic groups. We describe an automatic open-source pipeline for creating patient-specific left and right ventricular meshes from cardiovascular magnetic resonance images, its application to a large cohort of ~55000 participants from UK Biobank, and the construction of the most comprehensive cohort of adult heart models to date, comprising 1423 representative meshes across sex (male, female), body mass index (range: 16 - 42 kg/m$^2$) and age (range: 49 - 80 years). Our code is available at this https URL, and pre-trained networks, representative volumetric meshes with fibers and UVCs will be made available soon.

[LG-98] Leveraging Diffusion Models for Parameterized Quantum Circuit Generation

Link: https://arxiv.org/abs/2505.20863
Authors: Daniel Barta,Darya Martyniuk,Johannes Jung,Adrian Paschke
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Quantum computing holds immense potential, yet its practical success depends on multiple factors, including advances in quantum circuit design. In this paper, we introduce a generative approach based on denoising diffusion models (DMs) to synthesize parameterized quantum circuits (PQCs). Extending the recent diffusion model pipeline of Fürrutter et al. [1], our model effectively conditions the synthesis process, enabling the simultaneous generation of circuit architectures and their continuous gate parameters. We demonstrate our approach in synthesizing PQCs optimized for generating high-fidelity Greenberger-Horne-Zeilinger (GHZ) states and achieving high accuracy in quantum machine learning (QML) classification tasks. Our results indicate a strong generalization across varying gate sets and scaling qubit counts, highlighting the versatility and computational efficiency of diffusion-based methods. This work illustrates the potential of generative models as a powerful tool for accelerating and optimizing the design of PQCs, supporting the development of more practical and scalable quantum applications.

[LG-99] Convergence of Clipped-SGD for Convex $(L_0,L_1)$-Smooth Optimization with Heavy-Tailed Noise

Link: https://arxiv.org/abs/2505.20817
Authors: Savelii Chezhegov,Aleksandr Beznosikov,Samuel Horváth,Eduard Gorbunov
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 33 pages

Click to view abstract

Abstract:Gradient clipping is a widely used technique in Machine Learning and Deep Learning (DL), known for its effectiveness in mitigating the impact of heavy-tailed noise, which frequently arises in the training of large language models. Additionally, first-order methods with clipping, such as Clip-SGD, exhibit stronger convergence guarantees than SGD under the $(L_0,L_1)$-smoothness assumption, a property observed in many DL tasks. However, the high-probability convergence of Clip-SGD under both assumptions – heavy-tailed noise and $(L_0,L_1)$-smoothness – has not been fully addressed in the literature. In this paper, we bridge this critical gap by establishing the first high-probability convergence bounds for Clip-SGD applied to convex $(L_0,L_1)$-smooth optimization with heavy-tailed noise. Our analysis extends prior results by recovering known bounds for the deterministic case and the stochastic setting with $L_1 = 0$ as special cases. Notably, our rates avoid exponentially large factors and do not rely on restrictive sub-Gaussian noise assumptions, significantly broadening the applicability of gradient clipping.
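
For readers unfamiliar with the update rule being analyzed, a generic clipped-SGD step can be sketched as follows. This is the textbook norm-clipping operator with a fixed threshold and step size; the paper's specific parameter choices are not reproduced here, and the function names are illustrative.

```python
import math

def clip(grad, threshold):
    """Rescale `grad` so its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]

def clipped_sgd_step(x, grad, step_size, threshold):
    """One update x <- x - step_size * clip(grad, threshold)."""
    clipped = clip(grad, threshold)
    return [xi - step_size * gi for xi, gi in zip(x, clipped)]

# A very large stochastic gradient (norm 500) moves the iterate
# by at most step_size * threshold in Euclidean norm.
x_next = clipped_sgd_step([0.0, 0.0], [300.0, 400.0], step_size=0.1, threshold=1.0)
```

Clipping bounds the length of every step regardless of how large a heavy-tailed gradient sample is, while leaving small gradients untouched.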

[LG-100] Stationary MMD Points for Cubature

Link: https://arxiv.org/abs/2505.20754
Authors: Zonghao Chen,Toni Karvonen,Heishiro Kanagawa,François-Xavier Briol,Chris. J. Oates
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance, arising in cubature, data compression, and optimisation. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the non-convexity of this objective precludes global minimisation in general. Instead, we consider stationary points of the MMD which, in contrast to points globally minimising the MMD, can be accurately computed. Our main theoretical contribution is the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the cubature error of stationary MMD points vanishes faster than the MMD. Motivated by this super-convergence property, we consider discretised gradient flows as a practical strategy for computing stationary points of the MMD, presenting a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound, which may be of independent interest.

[LG-101] Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data

Link: https://arxiv.org/abs/2505.20731
Authors: Linshanshan Wang,Mengyan Li,Zongqi Xia,Molei Liu,Tianxi Cai
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Electronic Health Records (EHR) offer rich real-world data for personalized medicine, providing insights into disease progression, treatment responses, and patient outcomes. However, their sparsity, heterogeneity, and high dimensionality make them difficult to model, while the lack of standardized ground truth further complicates predictive modeling. To address these challenges, we propose SCORE, a semi-supervised representation learning framework that captures multi-domain disease profiles through patient embeddings. SCORE employs a Poisson-Adapted Latent factor Mixture (PALM) Model with pre-trained code embeddings to characterize codified features and extract meaningful patient phenotypes and embeddings. To handle the computational challenges of large-scale data, it introduces a hybrid Expectation-Maximization (EM) and Gaussian Variational Approximation (GVA) algorithm, leveraging limited labeled data to refine estimates on a vast pool of unlabeled samples. We theoretically establish the convergence of this hybrid approach, quantify GVA errors, and derive SCORE’s error rate under diverging embedding dimensions. Our analysis shows that incorporating unlabeled data enhances accuracy and reduces sensitivity to label scarcity. Extensive simulations confirm SCORE’s superior finite-sample performance over existing methods. Finally, we apply SCORE to predict disability status for patients with multiple sclerosis (MS) using partially labeled EHR data, demonstrating that it produces more informative and predictive patient embeddings for multiple MS-related conditions compared to existing approaches.

[LG-102] Moment Expansions of the Energy Distance

Link: https://arxiv.org/abs/2505.20647
Authors: Ian Langmore
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract: The energy distance is used to test distributional equality, and as a loss function in machine learning. While $D^2(X, Y)=0$ only when $X \sim Y$, the sensitivity to different moments is of practical importance. This work considers $D^2(X, Y)$ in the case where the distributions are close. In this regime, $D^2(X, Y)$ is more sensitive to differences in the means $\bar{X}-\bar{Y}$ than to differences in the covariances $\Delta$. This is due to the structure of the energy distance and is independent of dimension. The sensitivity to on versus off diagonal components of $\Delta$ is examined when $X$ and $Y$ are close to isotropic. Here a dimension dependent averaging occurs and, in many cases, off diagonal correlations contribute significantly less. Numerical results verify these relationships hold even when distributional assumptions are not strictly met.
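
To make the quantity in this abstract concrete, here is a minimal sketch that estimates the squared energy distance from two finite samples. It uses the textbook definition D^2(X, Y) = 2 E|X - Y| - E|X - X'| - E|Y - Y'| with plain sample averages (a V-statistic). The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import math


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a and y in b."""
    total = 0.0
    for x in a:
        for y in b:
            total += math.dist(x, y)
    return total / (len(a) * len(b))


def energy_distance_sq(xs, ys):
    """V-statistic estimate of D^2(X, Y) = 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


# Toy data (hypothetical): three 2-D points and a copy shifted in the mean.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in xs]

same = energy_distance_sq(xs, xs)       # identical samples
moved = energy_distance_sq(xs, shifted)  # samples that differ in the mean
print(same, moved)
```

The estimate is zero for identical samples and positive once the mean is shifted. Comparing a mean shift against a covariance change of similar size would be one way to probe the sensitivity the abstract describes.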

[LG-103] Balancing Performance and Costs in Best Arm Identification

Link: https://arxiv.org/abs/2505.20583
Authors: Michael O. Harding, Kirthevasan Kandasamy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We consider the problem of identifying the best arm in a multi-armed bandit model. Despite a wealth of literature in the traditional fixed budget and fixed confidence regimes of the best arm identification problem, it still remains a mystery to most practitioners as to how to choose an approach and corresponding budget or confidence parameter. We propose a new formalism to avoid this dilemma altogether by minimizing a risk functional which explicitly balances the performance of the recommended arm and the cost incurred by learning this arm. In this framework, a cost is incurred for each observation during the sampling phase, and upon recommending an arm, a performance penalty is incurred for identifying a suboptimal arm. The learner’s goal is to minimize the sum of the penalty and cost. This new regime mirrors the priorities of many practitioners, e.g. maximizing profit in an A/B testing framework, better than classical fixed budget or confidence settings. We derive theoretical lower bounds for the risk of each of two choices for the performance penalty, the probability of misidentification and the simple regret, and propose an algorithm called DBCARE to match these lower bounds up to polylog factors on nearly all problem instances. We then demonstrate the performance of DBCARE on a number of simulated models, comparing to fixed budget and confidence algorithms to show the shortfalls of existing BAI paradigms on this problem.
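
The risk described above, a performance penalty plus a per-observation cost, can be illustrated with a small simulation. The sketch below is not DBCARE: it assumes Bernoulli arms, a naive uniform-sampling learner, and simple regret as the penalty, and all names and parameter values are hypothetical.

```python
import random


def realized_risk(means, pulls_per_arm, cost_per_pull, rng):
    """Sample every Bernoulli arm equally often, recommend the empirical best,
    and return simple regret + cost_per_pull * (total number of observations)."""
    estimates = []
    for mu in means:
        wins = sum(rng.random() < mu for _ in range(pulls_per_arm))
        estimates.append(wins / pulls_per_arm)
    recommended = max(range(len(means)), key=lambda i: estimates[i])
    simple_regret = max(means) - means[recommended]
    sampling_cost = cost_per_pull * pulls_per_arm * len(means)
    return simple_regret + sampling_cost


def average_risk(means, pulls_per_arm, cost_per_pull, runs=500, seed=0):
    """Monte Carlo average of the realized risk over independent runs."""
    rng = random.Random(seed)
    total = sum(realized_risk(means, pulls_per_arm, cost_per_pull, rng)
                for _ in range(runs))
    return total / runs


means = [0.5, 0.6]  # hypothetical arm means
risks = {n: average_risk(means, n, cost_per_pull=0.0005) for n in (1, 20, 200)}
print(risks)
```

With very few pulls the penalty term dominates, and with very many the sampling cost dominates. That trade-off is what a fixed budget or fixed confidence parameter leaves the practitioner to tune by hand.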

[LG-104] Covariate-Adjusted Deep Causal Learning for Heterogeneous Panel Data Models

Link: https://arxiv.org/abs/2505.20536
Authors: Guanhao Zhou, Yuefeng Han, Xiufan Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)

Abstract:This paper studies the task of estimating heterogeneous treatment effects in causal panel data models, in the presence of covariate effects. We propose a novel Covariate-Adjusted Deep Causal Learning (CoDEAL) for panel data models, that employs flexible model structures and powerful neural network architectures to cohesively deal with the underlying heterogeneity and nonlinearity of both panel units and covariate effects. The proposed CoDEAL integrates nonlinear covariate effect components (parameterized by a feed-forward neural network) with nonlinear factor structures (modeled by a multi-output autoencoder) to form a heterogeneous causal panel model. The nonlinear covariate component offers a flexible framework for capturing the complex influences of covariates on outcomes. The nonlinear factor analysis enables CoDEAL to effectively capture both cross-sectional and temporal dependencies inherent in the data panel. This latent structural information is subsequently integrated into a customized matrix completion algorithm, thereby facilitating more accurate imputation of missing counterfactual outcomes. Moreover, the use of a multi-output autoencoder explicitly accounts for heterogeneity across units and enhances the model interpretability of the latent factors. We establish theoretical guarantees on the convergence of the estimated counterfactuals, and demonstrate the compelling performance of the proposed method using extensive simulation studies and a real data application.

[LG-105] Learning with Expected Signatures: Theory and Applications

Link: https://arxiv.org/abs/2505.20465
Authors: Lorenzo Lucchese, Mikko S. Pakkanen, Almut E. D. Veraart
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:The expected signature maps a collection of data streams to a lower dimensional representation, with a remarkable property: the resulting feature tensor can fully characterize the data generating distribution. This “model-free” embedding has been successfully leveraged to build multiple domain-agnostic machine learning (ML) algorithms for time series and sequential data. The convergence results proved in this paper bridge the gap between the expected signature’s empirical discrete-time estimator and its theoretical continuous-time value, allowing for a more complete probabilistic interpretation of expected signature-based ML methods. Moreover, when the data generating process is a martingale, we suggest a simple modification of the expected signature estimator with significantly lower mean squared error and empirically demonstrate how it can be effectively applied to improve predictive performance.

[LG-106] Federated Learning-Distillation Alternation for Resource-Constrained IoT

Link: https://arxiv.org/abs/2505.20456
Authors: Rafael Valente da Silva, Onel L. Alcaraz López, Richard Demo Souza
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Federated learning (FL) faces significant challenges in Internet of Things (IoT) networks due to device limitations in energy and communication resources, especially when considering the large size of FL models. From an energy perspective, the challenge is aggravated if devices rely on energy harvesting (EH), as energy availability can vary significantly over time, influencing the average number of participating users in each iteration. Additionally, the transmission of large model updates is more susceptible to interference from uncorrelated background traffic in shared wireless environments. As an alternative, federated distillation (FD) reduces communication overhead and energy consumption by transmitting local model outputs, which are typically much smaller than the entire model used in FL. However, this comes at the cost of reduced model accuracy. Therefore, in this paper, we propose FL-distillation alternation (FLDA). In FLDA, devices alternate between FD and FL phases, balancing model information with lower communication overhead and energy consumption per iteration. We consider a multichannel slotted-ALOHA EH-IoT network subject to background traffic/interference. In such a scenario, FLDA demonstrates higher model accuracy than both FL and FD, and achieves faster convergence than FL. Moreover, FLDA achieves target accuracies saving up to 98% in energy consumption, while also being less sensitive to interference, both relative to FL.

[LG-107] Kernel Quantile Embeddings and Associated Probability Metrics

Link: https://arxiv.org/abs/2505.20433
Authors: Masha Naslidnyk, Siu Lun Chau, François-Xavier Briol, Krikamol Muandet
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a statistical distance with strong theoretical and computational properties. At its core, the MMD relies on kernel mean embeddings to represent distributions as mean functions in RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs). We then use KQEs to construct a family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations.

[LG-108] DiffNMR: Advancing Inpainting of Randomly Sampled Nuclear Magnetic Resonance Signals

Link: https://arxiv.org/abs/2505.20367
Authors: Sen Yan, Fabrizio Gabellieri, Etienne Goffinet, Filippo Castiglione, Thomas Launey
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy leverages nuclear magnetization to probe molecules’ chemical environment, structure, and dynamics, with applications spanning from pharmaceuticals to the petroleum industry. Despite its utility, the high cost of NMR instrumentation and operation, together with the lengthy duration of experiments, necessitates the development of computational techniques to optimize acquisition times. Non-uniform sampling (NUS) is widely employed as a sub-sampling method to address these challenges, but it often introduces artifacts and degrades spectral quality, offsetting the benefits of reduced acquisition times. In this work, we propose the use of deep learning techniques to enhance the reconstruction quality of NUS spectra. Specifically, we explore the application of diffusion models, a relatively untapped approach in this domain. Our methodology involves applying diffusion models to both time-time and time-frequency NUS data, yielding satisfactory reconstructions of challenging spectra from the benchmark Artina dataset. This approach demonstrates the potential of diffusion models to improve the efficiency and accuracy of NMR spectroscopy, as well as the superiority of time-frequency domain data over time-time data, opening new landscapes for future studies.

[LG-109] Solving Euler equations with Multiple Discontinuities via Separation-Transfer Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2505.20361
Authors: Chuanxing Wang, Hui Luo, Kai Wang, Guohuai Zhu, Mingxing Luo
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:Despite the remarkable progress of physics-informed neural networks (PINNs) in scientific computing, they continue to face challenges when solving hydrodynamic problems with multiple discontinuities. In this work, we propose Separation-Transfer Physics Informed Neural Networks (ST-PINNs) to address such problems. By sequentially resolving discontinuities from strong to weak and leveraging transfer learning during training, ST-PINNs significantly reduce the problem complexity and enhance solution accuracy. To the best of our knowledge, this is the first study to apply a PINNs-based approach to the two-dimensional unsteady planar shock refraction problem, offering new insights into the application of PINNs to complex shock-interface interactions. Numerical experiments demonstrate that ST-PINNs more accurately capture sharp discontinuities and substantially reduce solution errors in hydrodynamic problems involving multiple discontinuities.

[LG-110] Differentially private ratio statistics

Link: https://arxiv.org/abs/2505.20351
Authors: Tomer Shoham, Katrina Ligett
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 32 pages, 3 figures, under review

Abstract: Ratio statistics, such as relative risk and odds ratios, play a central role in hypothesis testing, model evaluation, and decision-making across many areas of machine learning, including causal inference and fairness analysis. However, despite privacy concerns surrounding many datasets and despite increasing adoption of differential privacy, differentially private ratio statistics have largely been neglected by the literature and have only recently received an initial treatment by Lin et al. [1]. This paper attempts to fill this lacuna, giving results that can guide practice in evaluating ratios when the results must be protected by differential privacy. In particular, we show that even a simple algorithm can provide excellent properties concerning privacy, sample accuracy, and bias, not just asymptotically but also at quite small sample sizes. Additionally, we analyze a differentially private estimator for relative risk, prove its consistency, and develop a method for constructing valid confidence intervals. Our approach bridges a gap in the differential privacy literature and provides a practical solution for ratio estimation in private machine learning pipelines.
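
As a rough illustration of what a differentially private relative risk can look like, the sketch below perturbs the four cells of a 2x2 table with Laplace noise and forms the ratio from the noisy counts. This is the generic Laplace mechanism followed by post-processing, and it is not claimed to be the estimator analyzed in the paper. It assumes add/remove-one-record neighbouring datasets (L1 sensitivity 1 for the count vector). The clamp at 0.5 is an ad hoc guard against non-positive counts, and all names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_relative_risk(exposed_cases, exposed_noncases,
                          unexposed_cases, unexposed_noncases,
                          epsilon, rng):
    """Noisy relative risk computed from Laplace-perturbed contingency counts."""
    counts = [exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases]
    # Assumption: one record changes exactly one cell by 1, so noise of scale
    # 1/epsilon on each cell gives epsilon-DP; the ratio is post-processing.
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.5) for c in counts]
    a, b, c, d = noisy
    return (a / (a + b)) / (c / (c + d))


rng = random.Random(0)
exact = (30 / 100) / (10 / 100)  # non-private relative risk for this toy table
estimate = private_relative_risk(30, 70, 10, 90, epsilon=1.0, rng=rng)
print(exact, estimate)
```

The clamp keeps the ratio defined but introduces bias at small counts, which is the kind of small-sample behaviour the abstract says it studies.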

[LG-111] FD-Bench: A Modular and Fair Benchmark for Data-driven Fluid Simulation

Link: https://arxiv.org/abs/2505.20349
Authors: Haixin Wang, Ruoyan Li, Fred Xu, Fang Sun, Kaiqiao Han, Zijie Huang, Guancheng Wan, Ching Chang, Xiao Luo, Wei Wang, Yizhou Sun
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 31 pages, 18 figures, paper under review

Abstract:Data-driven modeling of fluid dynamics has advanced rapidly with neural PDE solvers, yet a fair and strong benchmark remains fragmented due to the absence of unified PDE datasets and standardized evaluation protocols. Although architectural innovations are abundant, fair assessment is further impeded by the lack of clear disentanglement between spatial, temporal and loss modules. In this paper, we introduce FD-Bench, the first fair, modular, comprehensive and reproducible benchmark for data-driven fluid simulation. FD-Bench systematically evaluates 85 baseline models across 10 representative flow scenarios under a unified experimental setup. It provides four key contributions: (1) a modular design enabling fair comparisons across spatial, temporal, and loss function modules; (2) the first systematic framework for direct comparison with traditional numerical solvers; (3) fine-grained generalization analysis across resolutions, initial conditions, and temporal windows; and (4) a user-friendly, extensible codebase to support future research. Through rigorous empirical studies, FD-Bench establishes the most comprehensive leaderboard to date, resolving long-standing issues in reproducibility and comparability, and laying a foundation for robust evaluation of future data-driven fluid models. The code is open-sourced at this https URL.

[LG-112] Predictive Performance of Deep Quantum Data Re-uploading Models

Link: https://arxiv.org/abs/2505.20337
Authors: Xin Wang, Han-Xiao Tao, Re-Bing Wu
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 34 pages, 15 figures

Abstract:Quantum machine learning models incorporating data re-uploading circuits have garnered significant attention due to their exceptional expressivity and trainability. However, their ability to generate accurate predictions on unseen data, referred to as the predictive performance, remains insufficiently investigated. This study reveals a fundamental limitation in predictive performance when deep encoding layers are employed within the data re-uploading model. Concretely, we theoretically demonstrate that when processing high-dimensional data with limited-qubit data re-uploading models, their predictive performance progressively degenerates to near random-guessing levels as the number of encoding layers increases. In this context, the repeated data uploading cannot mitigate the performance degradation. These findings are validated through experiments on both synthetic linearly separable datasets and real-world datasets. Our results demonstrate that when processing high-dimensional data, the quantum data re-uploading models should be designed with wider circuit architectures rather than deeper and narrower ones.

[LG-113] An Artificial Intelligence Model for Early Stage Breast Cancer Detection from Biopsy Images

Link: https://arxiv.org/abs/2505.20332
Authors: Neil Chaudhary, Zaynah Dhunny
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract: Accurate identification of breast cancer types plays a critical role in guiding treatment decisions and improving patient outcomes. This paper presents an artificial intelligence enabled tool designed to aid in the identification of breast cancer types using histopathological biopsy images. Traditionally, women diagnosed with breast cancer must undergo additional tests to determine the cancer type before appropriate treatment can be given. These tests are not only invasive but also delay the initiation of treatment and increase patient burden. The proposed model utilizes a convolutional neural network (CNN) architecture to distinguish between benign and malignant tissues and to subclassify breast cancer types accurately. By preprocessing the images to reduce noise and enhance features, the model achieves reliable levels of classification performance. Experimental results demonstrate the model’s effectiveness, outperforming several existing solutions in terms of accuracy, precision, recall, and F1-score. The study emphasizes the potential of deep learning techniques in clinical diagnostics and offers a promising tool to assist pathologists in breast cancer classification.

[LG-114] Sequence-Only Prediction of Binding Affinity Changes: A Robust and Interpretable Model for Antibody Engineering

Link: https://arxiv.org/abs/2505.20301
Authors: Chen Liu, Mingchen Li, Yang Tan, Wenrui Gou, Guisheng Fan, Bingxin Zhou
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract: A pivotal area of research in antibody engineering is to find effective modifications that enhance antibody-antigen binding affinity. Traditional wet-lab experiments assess mutants in a costly and time-consuming manner. Emerging deep learning solutions offer an alternative by modeling antibody structures to predict binding affinity changes. However, they heavily depend on high-quality complex structures, which are frequently unavailable in practice. Therefore, we propose ProtAttBA, a deep learning model that predicts binding affinity changes based solely on the sequence information of antibody-antigen complexes. ProtAttBA employs a pre-training phase to learn protein sequence patterns, followed by a supervised training phase that uses labeled antibody-antigen complex data to train a cross-attention-based regressor for predicting binding affinity changes. We evaluated ProtAttBA on three open benchmarks under different conditions. Compared to both sequence- and structure-based prediction methods, our approach achieves competitive performance, demonstrating notable robustness, especially with uncertain complex structures. Notably, our method gains interpretability from the attention mechanism. We show that the learned attention scores can identify critical residues that affect binding affinity. This work introduces a rapid and cost-effective computational tool for antibody engineering, with the potential to accelerate the development of novel therapeutic antibodies.

[LG-115] Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems

Link: https://arxiv.org/abs/2505.18276
Authors: Lorenzo Baldassari, Josselin Garnier, Knut Solna, Maarten V. de Hoop
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite-dimensional function spaces - where such problems are naturally formulated - is crucial to ensure stability and convergence as the discretization of the underlying problem is refined. In this paper, we contribute to this line of work by analyzing a widely used sampler for linear inverse problems: Langevin dynamics driven by score-based generative models (SGMs) acting as priors, formulated directly in function space. Building on the theoretical framework for SGMs in Hilbert spaces, we give a rigorous definition of this sampler in the infinite-dimensional setting and derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback-Leibler divergence on the underlying function space. Preventing numerical instabilities requires preconditioning of the Langevin algorithm and we prove the existence and the form of an optimal preconditioner. The preconditioner depends on both the score error and the forward operator and guarantees a uniform convergence rate across all posterior modes. Our analysis applies to both Gaussian and a general class of non-Gaussian priors. Finally, we present examples that illustrate and validate our theoretical findings.

Information Retrieval

[IR-0] Counterfactual Multi-player Bandits for Explainable Recommendation Diversification

Link: https://arxiv.org/abs/2505.21165
Authors: Yansen Zhang, Bowei He, Xiaokun Zhang, Haolun Wu, Zexu Sun, Chen Ma
Categories: Information Retrieval (cs.IR)

Abstract: Existing recommender systems tend to prioritize items closely aligned with users’ historical interactions, inevitably trapping users in the dilemma of the "filter bubble". Recent efforts are dedicated to improving the diversity of recommendations. However, they mainly suffer from two major issues: 1) a lack of explainability, making it difficult for the system designers to understand how diverse recommendations are generated, and 2) limitations to specific metrics, with difficulty in enhancing non-differentiable diversity metrics. To this end, we propose a Counterfactual Multi-player Bandits (CMB) method to deliver explainable recommendation diversification across a wide range of diversity metrics. Leveraging a counterfactual framework, our method identifies the factors influencing diversity outcomes. Meanwhile, we adopt the multi-player bandits to optimize the counterfactual optimization objective, making it adaptable to both differentiable and non-differentiable diversity metrics. Extensive experiments conducted on three real-world datasets demonstrate the applicability, effectiveness, and explainability of the proposed CMB.

[IR-1] Disentangling Locality and Entropy in Ranking Distillation

Link: https://arxiv.org/abs/2505.21058
Authors: Andrew Parry, Debasis Ganguly, Sean MacAvaney
Categories: Information Retrieval (cs.IR)
Comments: 9 pages, 2 figures, 3 tables, 9 page appendix with 2 figures and 2 tables

Abstract: The training process of ranking models involves two key data selection decisions: a sampling strategy, and a labeling strategy. Modern ranking systems, especially those for performing semantic search, typically use a "hard negative" sampling strategy to identify challenging items using heuristics and a distillation labeling strategy to transfer ranking "knowledge" from a more capable model. In practice, these approaches have grown increasingly expensive and complex; for instance, popular pretrained rankers from SentenceTransformers involve 12 models in an ensemble, with data provenance hampering reproducibility. Despite their complexity, modern sampling and labeling strategies have not been fully ablated, leaving the underlying source of effectiveness gains unclear. Thus, to better understand why models improve and potentially reduce the expense of training effective models, we conduct a broad ablation of sampling and distillation processes in neural ranking. We frame and theoretically derive the orthogonal nature of model geometry affected by example selection and the effect of teacher ranking entropy on ranking model optimization, establishing conditions in which data augmentation can effectively improve bias in a ranking model. Empirically, our investigation on established benchmarks and common architectures shows that sampling processes that were once highly effective in contrastive objectives may be spurious or harmful under distillation. We further investigate how data augmentation, in terms of inputs and targets, can affect effectiveness and the intrinsic behavior of models in ranking. Through this work, we aim to encourage more computationally efficient approaches that reduce focus on contrastive pairs and instead directly understand training dynamics under rankings, which better represent real-world settings.

[IR-2] A Reduction-Driven Local Search for the Generalized Independent Set Problem

Link: https://arxiv.org/abs/2505.21052
Authors: Yiping Liu, Yi Zhou, Zhenxiang Xu, Mingyu Xiao, Jin-Kao Hao
Categories: Information Retrieval (cs.IR)

Abstract: The Generalized Independent Set (GIS) problem extends the classical maximum independent set problem by incorporating profits for vertices and penalties for edges. This generalized problem has been identified in diverse applications in fields such as forest harvest planning, competitive facility location, social network analysis, and even machine learning. However, solving the GIS problem in large-scale, real-world networks remains computationally challenging. In this paper, we explore data reduction techniques to address this challenge. We first propose 14 reduction rules that can reduce the input graph with rigorous optimality guarantees. We then present a reduction-driven local search (RLS) algorithm that integrates these reduction rules into the pre-processing, the initial solution generation, and the local search components in a computationally efficient way. The RLS is empirically evaluated on 278 graphs arising from different application scenarios. The results indicate that the RLS is highly competitive: for most graphs, it achieves significantly superior solutions compared to other known solvers, and it effectively provides solutions for graphs exceeding 260 million edges, a task at which every other known method fails. Analysis also reveals that the data reduction plays a key role in achieving such a competitive performance.
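
For readers unfamiliar with the GIS objective, the sketch below evaluates a candidate vertex set under the formulation commonly used in the literature. That formulation is an assumption here, since the abstract only mentions vertex profits and edge penalties. Edges are split into permanent edges, which may never have both endpoints selected, and removable edges, which may be violated at a penalty. The function and variable names are hypothetical.

```python
def gis_value(selected, profit, removable_penalty, permanent_edges):
    """Net benefit of a vertex set: total profit minus penalties of removable
    edges with both endpoints selected. Returns None if a permanent edge has
    both endpoints selected (the set is infeasible)."""
    chosen = set(selected)
    for u, v in permanent_edges:
        if u in chosen and v in chosen:
            return None
    total = sum(profit[v] for v in chosen)
    for (u, v), penalty in removable_penalty.items():
        if u in chosen and v in chosen:
            total -= penalty
    return total


# Toy instance (hypothetical data).
profit = {"a": 5, "b": 4, "c": 3}
permanent_edges = [("a", "b")]
removable_penalty = {("b", "c"): 2, ("a", "c"): 6}

print(gis_value({"b", "c"}, profit, removable_penalty, permanent_edges))  # 4 + 3 - 2
print(gis_value({"a", "b"}, profit, removable_penalty, permanent_edges))  # infeasible
```

In this toy instance, taking `b` and `c` together is worthwhile despite the penalty, while `a` and `c` together score less than `a` alone. Reduction rules and local search moves have to weigh exactly this kind of trade-off.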

[IR-3] LifeIR at the NTCIR-18 Lifelog-6 Task

Link: https://arxiv.org/abs/2505.20987
Authors: Jiahan Chen, Da Li, Keping Bi
Categories: Information Retrieval (cs.IR)

Abstract: In recent years, sharing lifelogs recorded through wearable devices such as sports watches and GoPros has gained significant popularity. Lifelogs involve various types of information, including images, videos, and GPS data, revealing users’ lifestyles, dietary patterns, and physical activities. The Lifelog Semantic Access Task (LSAT) in the NTCIR-18 Lifelog-6 Challenge focuses on retrieving relevant images from large-scale user lifelogs based on textual queries describing an action or event. It serves users’ need to find images of a particular scenario among the historical moments in their lifelogs. We propose a multi-stage pipeline for this task of searching images with texts, addressing various challenges in lifelog retrieval. Our pipeline includes: filtering blurred images, rewriting queries to make intents clearer, extending the candidate set based on events to include images with temporal connections, and reranking results using a multimodal large language model (MLLM) with stronger relevance judgment capabilities. The evaluation results of our submissions have shown the effectiveness of each stage and the entire pipeline.

[IR-4] Embed Progressive Implicit Preference in Unified Space for Deep Collaborative Filtering

Link: https://arxiv.org/abs/2505.20900
Authors: Zhongjin Zhang, Yu Liang, Cong Fu, Yuxuan Zhu, Kun Wang, Yabo Ni, Anxiang Zeng, Jiazhi Xia
Categories: Information Retrieval (cs.IR)

Abstract:Embedding-based collaborative filtering, often coupled with nearest neighbor search, is widely deployed in large-scale recommender systems for personalized content selection. Modern systems leverage multiple implicit feedback signals (e.g., clicks, add to cart, purchases) to model user preferences comprehensively. However, prevailing approaches adopt a feedback-wise modeling paradigm, which (1) fails to capture the structured progression of user engagement entailed among different feedback and (2) embeds feedback-specific information into disjoint spaces, making representations incommensurable, increasing system complexity, and leading to suboptimal retrieval performance. A promising alternative is Ordinal Logistic Regression (OLR), which explicitly models discrete ordered relations. However, existing OLR-based recommendation models mainly focus on explicit feedback (e.g., movie ratings) and struggle with implicit, correlated feedback, where ordering is vague and non-linear. Moreover, standard OLR lacks flexibility in handling feedback-dependent covariates, resulting in suboptimal performance in real-world systems. To address these limitations, we propose Generalized Neural Ordinal Logistic Regression (GNOLR), which encodes multiple feature-feedback dependencies into a unified, structured embedding space and enforces feedback-specific dependency learning through a nested optimization framework. Thus, GNOLR enhances predictive accuracy, captures the progression of user engagement, and simplifies the retrieval process. We establish a theoretical comparison with existing paradigms, demonstrating how GNOLR avoids disjoint spaces while maintaining effectiveness. Extensive experiments on ten real-world datasets show that GNOLR significantly outperforms state-of-the-art methods in efficiency and adaptability.
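
The ordinal logistic regression that this abstract builds on can be sketched in a few lines. The code below is the textbook cumulative-logit model, P(y <= k) = sigmoid(theta_k - score), applied to an ordered engagement ladder. It is not GNOLR, and the thresholds, score, and level names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordinal_probabilities(score, thresholds):
    """Cumulative-logit model with increasing thresholds theta_0 < theta_1 < ...:
    P(y <= k) = sigmoid(theta_k - score). Returns P(y = k) for every level."""
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# Ordered feedback levels, from weakest to strongest engagement.
levels = ["no interaction", "click", "add to cart", "purchase"]
probs = ordinal_probabilities(score=1.0, thresholds=[0.0, 1.5, 3.0])
for level, p in zip(levels, probs):
    print(f"{level}: {p:.3f}")
```

A single score shared across all levels is what keeps the representation in one space. Raising the score shifts probability mass toward the stronger feedback levels.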

[IR-5] Cold-Start Recommendation with Knowledge-Guided Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2505.20773
Authors: Wooseong Yang, Weizhi Zhang, Yuqing Liu, Yuwei Han, Yu Wang, Junhyun Lee, Philip S. Yu
Categories: Information Retrieval (cs.IR)
Comments: 10 pages

Abstract:Cold-start items remain a persistent challenge in recommender systems due to their lack of historical user interactions, which collaborative models rely on. While recent zero-shot methods leverage large language models (LLMs) to address this, they often struggle with sparse metadata and hallucinated or incomplete knowledge. We propose ColdRAG, a retrieval-augmented generation approach that builds a domain-specific knowledge graph dynamically to enhance LLM-based recommendation in cold-start scenarios, without requiring task-specific fine-tuning. ColdRAG begins by converting structured item attributes into rich natural-language profiles, from which it extracts entities and relationships to construct a unified knowledge graph capturing item semantics. Given a user’s interaction history, it scores edges in the graph using an LLM, retrieves candidate items with supporting evidence, and prompts the LLM to rank them. By enabling multi-hop reasoning over this graph, ColdRAG grounds recommendations in verifiable evidence, reducing hallucinations and strengthening semantic connections. Experiments on three public benchmarks demonstrate that ColdRAG surpasses existing zero-shot baselines in both Recall and NDCG. This framework offers a practical solution to cold-start recommendation by combining knowledge-graph reasoning with retrieval-augmented LLM generation.
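
The two metrics named in this abstract, Recall and NDCG, are standard top-k ranking measures. A minimal sketch with binary relevance follows. The cutoff, item identifiers, and function names are illustrative assumptions, and the paper's exact evaluation protocol is not stated in the abstract.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


# Hypothetical ranking produced for one user, and that user's held-out items.
ranked = ["i3", "i7", "i1", "i9"]
relevant = {"i1", "i3"}

print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

Recall only checks whether relevant items make the cut, whereas NDCG also rewards placing them near the top.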

[IR-6] UQLegalAI@COLIEE2025: Advancing Legal Case Retrieval with Large Language Models and Graph Neural Networks

Link: https://arxiv.org/abs/2505.20743
Authors: Yanran Tang, Ruihong Qiu, Zi Huang
Categories: Information Retrieval (cs.IR)

Abstract: Legal case retrieval plays a pivotal role in the legal domain by facilitating the efficient identification of relevant cases, supporting legal professionals and researchers in proposing legal arguments and making informed decisions. To improve retrieval accuracy, the Competition on Legal Information Extraction and Entailment (COLIEE) is held annually, offering updated benchmark datasets for evaluation. This paper presents a detailed description of CaseLink, the method employed by UQLegalAI, the second-highest-ranked team in Task 1 of COLIEE 2025. The CaseLink model utilises inductive graph learning and Global Case Graphs to capture the intrinsic case connectivity to improve the accuracy of legal case retrieval. Specifically, a large language model specialized in text embedding is employed to transform legal texts into embeddings, which serve as the feature representations of the nodes in the constructed case graph. A new contrastive objective, incorporating a regularization on the degree of case nodes, is proposed to leverage the information within the case reference relationship for model optimization. The main codebase used in our method is based on an open-sourced repo of CaseLink: this https URL.

[IR-7] How Do Experts Make Sense of Integrated Process Models?

Link: https://arxiv.org/abs/2505.20667
Authors: Tianwa Chen, Barbara Weber, Graeme Shanks, Gianluca Demartini, Marta Indulska, Shazia Sadiq
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments:

Abstract:A range of integrated modeling approaches have been developed to enable a holistic representation of business process logic together with all relevant business rules. These approaches address inherent problems with separate documentation of business process models and business rules. In this study, we explore how expert process workers make sense of the information provided through such integrated modeling approaches. To do so, we complement verbal protocol analysis with eye-tracking metrics to reveal nuanced user behaviours involved in the main phases of sensemaking, namely information foraging and information processing. By studying expert process workers engaged in tasks based on integrated modeling of business processes and rules, we provide insights that pave the way for a better understanding of sensemaking practices and improved development of business process and business rule integration approaches. Our research underscores the importance of offering personalized support mechanisms that increase the efficacy and efficiency of sensemaking practices for process knowledge workers.

[IR-8] WikiTermBase: An AI-Augmented Term Base to Standardize Arabic Translation on Wikipedia

Link: https://arxiv.org/abs/2505.20369
Authors: Michel Bakni (ESTIA), Abbad Diraneyya, Wael Tellat
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Term bases are recognized as one of the most effective components of translation software for saving time and ensuring consistency. In spite of the many recent advances in natural language processing (NLP) and large language models (LLMs), major translation platforms have yet to take advantage of these tools to improve their term bases and support scalable content for underrepresented languages, which often struggle with localizing technical terminology. Language academies in the Arab World, for example, have struggled since the 1940s to unify the way new scientific terms enter the Arabic language at scale. This abstract introduces an open source tool, WikiTermBase, with a systematic approach for building a lexicographical database of over 900K terms, which were collected and mapped from a multitude of sources on a semantic and morphological basis. The tool was successfully implemented on Arabic Wikipedia to standardize translated English and French terms.

Attachment Download

Click to download the full list of today's papers