Arxiv今日论文 | 2025-05-26

本篇博文主要内容为 2025-05-26 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决当前对大语言模型（Large Language Models, LLMs）道德推理能力的评估主要依赖于单步评价，无法捕捉模型在面对不断演变的伦理挑战时如何调整其道德判断的问题。解决方案的关键在于引入多步骤道德困境（Multi-step Moral Dilemmas, MMDs）数据集，该数据集包含3,302个五阶段的道德困境，能够实现对LLMs在逐步升级的道德情境中调整其道德推理过程的细粒度、动态分析。

链接: https://arxiv.org/abs/2505.18154
作者: Ya Wu,Qiang Sheng,Danding Wang,Guang Yang,Yifan Sun,Zhengjia Wang,Yuyan Bu,Juan Cao
机构: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences(中科院计算技术研究所媒体合成与取证实验室); Zhongguancun Laboratory(中关村实验室); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 25 pages, 8 figures

点击查看摘要

Abstract:Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments primarily rely on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. Addressing this gap, we introduce the Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning across escalating dilemmas. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments based on scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs often prioritize the value of care, this value can sometimes be superseded by fairness in certain contexts, highlighting the dynamic and context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
zh

[NLP-1] Fann or Flop: A Multigenre Multiera Benchmark for Arabic Poetry Understanding in LLM s UAI

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在理解阿拉伯诗歌方面的能力不足问题，尤其是其对古典阿拉伯语中复杂意义、修辞手法和文化背景的把握。解决方案的关键在于构建名为“Fann or Flop”的首个基准测试，该基准覆盖了十二个历史时期、21种核心诗体及多种格律形式，包含经过筛选的诗歌及其解释，用于评估语义理解、隐喻解读、韵律意识和文化背景的认知能力。通过这一基准，研究者能够更深入地检验LLMs在古典阿拉伯语理解上的表现。

链接: https://arxiv.org/abs/2505.18152
作者: Wafa Alghallabi,Ritesh Thawkar,Sara Ghaboura,Ketan More,Omkar Thawakar,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer
机构: Lawa.AI; Mohamed bin Zayed University of AI; Australian National University; Aalto University
类目: Computation and Language (cs.CL)
备注: Github: this https URL , Dataset: this https URL

点击查看摘要

Abstract:Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce Fann or Flop, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs in twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator for testing how good the LLM is in understanding classical Arabic through the Arabic poetry. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release Fann or Flop along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: this https URL.
zh

[NLP-2] First Finish Search: Efficient Test-Time Scaling in Large Language Models

【速读】：该论文旨在解决大语言模型在推理任务中因依赖长解码路径或大量生成样本而导致的计算资源消耗高和推理延迟大的问题。其解决方案的关键在于提出一种无需训练的并行解码策略——首次完成搜索（First Finish Search, FFS），该策略通过同时启动多个独立样本并在其中任意一个完成时立即返回结果，从而有效减少token使用和推理时间。

链接: https://arxiv.org/abs/2505.18149
作者: Aradhye Agarwal,Ayan Sengupta,Tanmoy Chakraborty
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1’s standalone accuracy, nearly matching OpenAI’s o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
zh

[NLP-3] Lost in the Haystack: Smaller Needles are More Difficult for LLM s to Find

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在处理“needle-in-a-haystack”任务时的性能问题，即从大量无关上下文中提取关键信息（“needle”）的挑战。研究发现，黄金上下文（gold context）长度的变化对模型性能有显著影响，而这一因素此前未得到充分关注。解决方案的关键在于系统性地研究黄金上下文长度变化对长上下文问答任务中LLM性能的影响，揭示出较短的黄金上下文会显著降低模型性能并增强位置敏感性，从而为构建鲁棒、上下文感知的LLM驱动系统提供指导。

链接: https://arxiv.org/abs/2505.18148
作者: Owen Bianchi,Mathew J. Koretsky,Maya Willey,Chelsea X. Alvarado,Tanay Nayak,Adi Asija,Nicole Kuznetsov,Mike A. Nalls,Faraz Faghri,Daniel Khashabi
机构: Center for Alzheimer’s Disease and Related Dementias, NIA, NIH; DataTecnica LLC; Johns Hopkins University; Laboratory of Neurogenetics, NIA, NIH
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Large language models (LLMs) face significant challenges with needle-in-a-haystack tasks, where relevant information (“the needle”) must be drawn from a large pool of irrelevant context (“the haystack”). Previous studies have highlighted positional bias and distractor quantity as critical factors affecting model performance, yet the influence of gold context size has received little attention. We address this gap by systematically studying how variations in gold context length impact LLM performance on long-context question answering tasks. Our experiments reveal that LLM performance drops sharply when the gold context is shorter, i.e., smaller gold contexts consistently degrade model performance and amplify positional sensitivity, posing a major challenge for agentic systems that must integrate scattered, fine-grained information of varying lengths. This pattern holds across three diverse domains (general knowledge, biomedical reasoning, and mathematical reasoning) and seven state-of-the-art LLMs of various sizes and architectures. Our work provides clear insights to guide the design of robust, context-aware LLM-driven systems.
zh

[NLP-4] Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection

【速读】：该论文旨在解决维基数据（Wikidata）中破坏行为检测的问题，其核心挑战在于Wikidata的复杂性，包括不断扩展的事实三元组和多语言文本。解决方案的关键在于提出一种称为Graph2Text的方法，将所有编辑转换为统一空间，从而利用单一多语言语言模型评估所有内容变化的潜在破坏行为，提升了检测的覆盖率并简化了维护流程。

链接: https://arxiv.org/abs/2505.18136
作者: Mykola Trokhymovych,Lydia Pintscher,Ricardo Baeza-Yates,Diego Saez-Trumper
机构: Pompeu Fabra University (庞佩乌·法布拉大学); Wikimedia Deutschland (维基媒体德国); Pompeu Fabra University (庞佩乌·法布拉大学); Wikimedia Foundation (维基媒体基金会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a next-generation vandalism detection system for Wikidata, one of the largest open-source structured knowledge bases on the Web. Wikidata is highly complex: its items incorporate an ever-expanding universe of factual triples and multilingual texts. While edits can alter both structured and textual content, our approach converts all edits into a single space using a method we call Graph2Text. This allows for evaluating all content changes for potential vandalism using a single multilingual language model. This unified approach improves coverage and simplifies maintenance. Experiments demonstrate that our solution outperforms the current production system. Additionally, we are releasing the code under an open license along with a large dataset of various human-generated knowledge alterations, enabling further research.
zh

[NLP-5] Gaming Tool Preferences in Agent ic LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在使用外部工具时依赖文本描述进行选择所存在的脆弱性问题。其解决方案的关键在于通过修改工具描述，显著提升特定工具在LLMs中的使用率，实验表明经过适当编辑的工具描述可以使GPT-4.1和Qwen2.5-7B等模型的使用量增加超过10倍。这一发现揭示了当前工具调用协议中存在的潜在漏洞，并强调了构建更可靠机制以支持代理型LLMs选择和利用工具与资源的重要性。

链接: https://arxiv.org/abs/2505.18135
作者: Kazem Faghih,Wenxiao Wang,Yize Cheng,Siddhant Bharti,Gaurang Sriramanan,Sriram Balasubramanian,Parsa Hosseini,Soheil Feizi
机构: University of Maryland(马里兰大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their abilities as various agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use–a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool’s usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 10 different models. These phenomenons, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources.
zh

[NLP-6] VideoGameBench: Can Vision-Language Models complete popular video games?

【速读】：该论文试图解决当前视觉-语言模型（Vision-Language Models, VLMs）在执行人类自然具备的能力（如感知、空间导航和记忆管理）方面研究不足的问题。其解决方案的关键在于引入VideoGameBench，一个由10款20世纪90年代流行的视频游戏组成的基准测试集，这些游戏直接与VLMs进行实时交互，以评估模型在仅依赖原始视觉输入和任务目标描述的情况下完成整个游戏的能力。该基准与现有方法形成鲜明对比，因为它不依赖于特定游戏的辅助信息和结构化支持，同时通过隐藏部分游戏以促进模型在未见过环境中的泛化能力。

链接: https://arxiv.org/abs/2505.18134
作者: Alex L. Zhang,Thomas L. Griffiths,Karthik R. Narasimhan,Ofir Press
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 33 pages including supplementary

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans–such as perception, spatial navigation, and memory management–remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM’s next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
zh

[NLP-7] One RL to See Them All: Visual Triple Unified Reinforcement Learning

【速读】：该论文旨在解决视觉语言模型（VLMs）在超越推理任务之外的感知密集型任务（如目标检测和定位）中应用受限的问题。其解决方案的关键在于提出V-Triune系统，该系统通过三个互补组件实现视觉推理与感知任务的联合学习：样本级数据格式化、验证器级奖励计算以及源级指标监控，并引入一种新型动态IoU奖励机制，以提供适应性、渐进性和确定性的反馈。这一统一的强化学习框架使得模型能够在单一训练流程中同时优化多种任务，从而提升了模型在多个下游任务中的表现。

链接: https://arxiv.org/abs/2505.18129
作者: Yan Ma,Linge Du,Xuyang Shen,Shaoxiang Chen,Pengfei Li,Qibing Ren,Lizhuang Ma,Yuchao Dai,Pengfei Liu,Junjie Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perceptionintensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises triple complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers) , and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at this https URL.
zh

[NLP-8] Frankentext: Stitching random text frag ments into long-form narratives

【速读】：该论文试图解决在极端约束条件下生成连贯长文本的问题，具体而言，要求模型在大部分（例如90%）token必须直接复制人类写作内容的情况下，仍能完成符合指令的可控制生成任务。解决方案的关键在于通过两阶段流程实现：首先让模型根据用户指令选择并组合人类写作片段生成草稿，随后在保持指定复制比例的前提下对草稿进行迭代修订，从而在保证指令遵循性和文本相关性的同时，生成具有较高写作质量的Frankentexts。

链接: https://arxiv.org/abs/2505.18128
作者: Chau Minh Pham,Jenna Russell,Dzung Pham,Mohit Iyyer
机构: University of Maryland, College Park (马里兰大学学院公园分校); UMass Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Frankentexts, a new type of long-form narratives produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors like Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
zh

[NLP-9] Reward Model Overoptimisation in Iterated RLHF

【速读】：该论文试图解决迭代强化学习从人类反馈（iterated RLHF）中出现的奖励模型过优化（reward model overoptimisation）问题，该问题导致模型过度拟合奖励函数，从而生成非泛化的策略。解决方案的关键在于系统分析迭代过程中的关键设计选择，包括奖励模型训练数据的跨迭代迁移方式、优化所用的奖励函数以及策略的初始化方法，以理解过优化的动态并提升RLHF管道的稳定性和泛化能力。

链接: https://arxiv.org/abs/2505.18126
作者: Lorenz Wolf,Robert Kirk,Mirco Musolesi
机构: University College London (伦敦大学学院); UK AI Security Institute (英国人工智能安全研究所); University of Bologna (博洛尼亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, 17 figures, 5 tables

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is a widely used method for aligning large language models with human preferences. However, RLHF often suffers from reward model overoptimisation, in which models overfit to the reward function, resulting in non-generalisable policies that exploit the idiosyncrasies and peculiarities of the reward function. A common mitigation is iterated RLHF, in which reward models are repeatedly retrained with updated human feedback and policies are re-optimised. Despite its increasing adoption, the dynamics of overoptimisation in this setting remain poorly understood. In this work, we present the first comprehensive study of overoptimisation in iterated RLHF. We systematically analyse key design choices - how reward model training data is transferred across iterations, which reward function is used for optimisation, and how policies are initialised. Using the controlled AlpacaFarm benchmark, we observe that overoptimisation tends to decrease over successive iterations, as reward models increasingly approximate ground-truth preferences. However, performance gains diminish over time, and while reinitialising from the base policy is robust, it limits optimisation flexibility. Other initialisation strategies often fail to recover from early overoptimisation. These findings offer actionable insights for building more stable and generalisable RLHF pipelines.
zh

[NLP-10] abSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

【速读】：该论文旨在解决深度学习在表格学习任务中表现不佳的问题，尤其是在与梯度提升决策树（Gradient Boosting Decision Trees, GBDTs）相比时。现有的方法大多依赖于静态、目标无关的文本表示，限制了其在表格任务中的效果。论文提出的解决方案是TabSTAR，一种具有语义目标感知表示的表格基础模型（Foundation Tabular Model），其关键在于通过解冻预训练文本编码器并输入目标标记，使模型能够学习任务特定的嵌入，从而实现跨数据集的迁移学习，并在包含文本特征的分类任务中取得了最先进的性能。

链接: https://arxiv.org/abs/2505.18125
作者: Alan Arazi,Eilam Shapira,Roi Reichart
机构: Technion - IIT (以色列理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
zh

[NLP-11] UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

【速读】：该论文旨在解决多表数据库中文本到SQL（Text-to-SQL）生成的挑战，特别是在复杂模式和关系操作下的表与列检索、准确的JOIN和UNION生成以及跨不同模式的泛化问题。其解决方案的关键在于提出UNJOIN框架，该框架采用两阶段策略，将模式元素的检索与SQL逻辑生成解耦。第一阶段通过前缀表名的方式将所有表的列名合并为单一表表示，使模型专注于准确检索；第二阶段在简化模式上生成SQL查询，并通过重建JOIN、UNION和关系逻辑映射回原始模式。

链接: https://arxiv.org/abs/2505.18122
作者: Poojah Ganesan,Rajat Aayush Jha,Dan Roth,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. But, it remains challenging in multi-table databases due to complex schema and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.
zh

[NLP-12] ProgRM: Build Better GUI Agents with Progress Rewards

【速读】：该论文旨在解决基于大语言模型（Large Language Model, LLM）的图形用户界面（Graphical User Interface, GUI）代理在训练过程中因轨迹收集和奖励标注困难而导致的高质量训练数据稀缺问题。现有方法使用的Outcome Reward Model (ORM) 无法提供细粒度反馈，并可能对最终失败轨迹中有价值的步骤进行过度惩罚。论文提出的解决方案是Progress Reward Model (ProgRM)，其关键在于通过预测每个步骤的任务完成进度，为在线训练提供密集且信息丰富的中间奖励。为了解决进度奖励标签标注的挑战，进一步设计了基于最长公共子序列（Longest Common Subsequence, LCS）的自标注算法，以识别轨迹中的关键步骤并分配进度标签。

链接: https://arxiv.org/abs/2505.18121
作者: Danyang Zhang,Situo Zhang,Ziyue Yang,Zichen Zhu,Zihan Zhao,Ruisheng Cao,Lu Chen,Kai Yu
机构: X-LANCE Lab, School of Computer Science, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Jiangsu Key Lab of Language Computing, Suzhou Laboratory
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM-based (Large Language Model) GUI (Graphical User Interface) agents can potentially reshape our daily lives significantly. However, current LLM-based GUI agents suffer from the scarcity of high-quality training data owing to the difficulties of trajectory collection and reward annotation. Existing works have been exploring LLMs to collect trajectories for imitation learning or to offer reward signals for online RL training. However, the Outcome Reward Model (ORM) used in existing works cannot provide finegrained feedback and can over-penalize the valuable steps in finally failed trajectories. To this end, we propose Progress Reward Model (ProgRM) to provide dense informative intermediate rewards by predicting a task completion progress for each step in online training. To handle the challenge of progress reward label annotation, we further design an efficient LCS-based (Longest Common Subsequence) self-annotation algorithm to discover the key steps in trajectories and assign progress labels accordingly. ProgRM is evaluated with extensive experiments and analyses. Actors trained with ProgRM outperform leading proprietary LLMs and ORM-trained actors, illustrating the effectiveness of ProgRM. The codes for experiments will be made publicly available upon acceptance.
zh

[NLP-13] Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

【速读】：该论文试图解决在基于二元反馈的强化学习（Reinforcement Learning, RL）系统中，监督学习（Supervised Learning, SL）方法因依赖参考答案和无法自我反思而难以实现模型自提升的问题。其解决方案的关键在于提出一种名为Negative-aware Fine-Tuning (NFT)的监督方法，该方法通过构建隐式的负样本策略来建模自生成的错误答案，从而让大语言模型（Large Language Models, LLMs）能够自主反思失败并进行优化，无需外部教师指导。

链接: https://arxiv.org/abs/2505.18116
作者: Huayu Chen,Kaiwen Zheng,Qinsheng Zhang,Ganqu Cui,Yin Cui,Haotian Ye,Tsung-Yi Lin,Ming-Yu Liu,Jun Zhu,Haoxiang Wang
机构: Tsinghua University (清华大学); NVIDIA (英伟达); Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has played a central role in the recent surge of LLMs’ math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) – a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs’ generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
zh

[NLP-14] Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

【速读】：该论文旨在解决现有模型在融合和解释音频信息方面存在的不足，从而限制了其对视频时间内容的全面理解能力。解决方案的关键在于提出TriSense，一个通过整合视觉、音频和语音模态实现全面视频时间理解的三模态大型语言模型，其中核心组件是基于查询的连接器（Query-Based Connector），它能够根据输入查询自适应地重新加权模态贡献，从而在模态缺失情况下保持鲁棒性并支持灵活的输入组合。

链接: https://arxiv.org/abs/2505.18110
作者: Zinuo Li,Xian Zhang,Yongxin Guo,Mohammed Bennamoun,Farid Boussaid,Girish Dwivedi,Luqi Gong,Qiuhong Ke
机构: University of Western Australia (西澳大学); Alibaba Group (阿里巴巴集团); Zhejiang Laboratory (浙江实验室); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like “A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding” requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense’s multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.
zh

[NLP-15] ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在复杂推理任务中表现优异，但其能力大多被限制在封闭系统中的问题，从而难以实现透明性和可扩展性。解决方案的关键在于提出\textbf{ManuSearch}，这是一个透明且模块化的多智能体框架，通过将搜索与推理过程分解为三个协作的智能体：解决方案规划智能体、互联网搜索智能体和结构化网页阅读智能体，以实现对LLMs的深度搜索民主化。

链接: https://arxiv.org/abs/2505.18105
作者: Lisheng Huang,Yichen Liu,Jinhao Jiang,Rongxiang Zhang,Jiahao Yan,Junyi Li,Wayne Xin Zhao
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Harbin Institute of Technology (哈尔滨工业大学); Department of Computer Science, National University of Singapore (新加坡国立大学计算机科学系)
类目: Computation and Language (cs.CL)
备注: LLM, Complex Search Benchmark

点击查看摘要

Abstract:Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose \textbfManuSearch, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce \textbfORION, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code in this https URL
zh

[NLP-16] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

【速读】：该论文试图解决公开大型语言模型（Large Language Model, LLM）基准测试可能引发的未来LLM污染问题，即基准测试可能被无意或有意地用于训练或选择模型。为了解决这一问题，论文提出了一种在不完全披露问题真实答案的情况下发布基准测试的方法。其关键在于通过准备多个逻辑上正确的答案，并仅在基准中包含其中一个作为正确答案，从而引入随机性，降低基准测试的最佳可能准确率（即贝叶斯准确率）。这种方法不仅有助于保护真实答案，还能作为检测数据污染的测试手段。

链接: https://arxiv.org/abs/2505.18102
作者: Takashi Ishida,Thanawat Lodkaew,Ikko Yamane
机构: RIKEN AIP (理化学研究所人工 Intelligence 研究所); UTokyo (东京大学); ENSAI/CREST (ENSAI/CREST)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
zh

[NLP-17] Planning without Search: Refining Frontier LLM s with Offline Goal-Conditioned RL

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在需要长期推理和规划的复杂交互任务（如协商和说服）中表现不足的问题。传统强化学习（Reinforcement Learning, RL）微调方法虽理论上可行，但存在可扩展性差、多轮训练计算成本高等缺陷，且大型LLMs通常不提供必要的API进行此类训练。为此，本文提出一种新方法，其关键在于利用目标条件价值函数（goal-conditioned value functions）引导LLM代理的推理过程，该函数通过预测给定动作下任务的演变来评估多种可能结果，从而实现有效规划，并在推理步骤上进行训练以保持模块轻量化，进而提升多轮交互中的决策效率与性能。

链接: https://arxiv.org/abs/2505.18098
作者: Joey Hong,Anca Dragan,Sergey Levine
机构: UC Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents, that scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, to be a concise and light-weight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
zh

[NLP-18] Qwen Long-CPRS: Towards infty-LLM s with Dynamic Context Optimization

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在处理长序列时面临的预填充阶段计算开销过大以及“中间丢失”性能退化问题。其解决方案的关键在于提出了一种名为QwenLong-CPRS的上下文压缩框架，该框架通过一种新颖的动态上下文优化机制，实现了基于自然语言指令的多粒度上下文压缩，从而在提升效率的同时改善模型性能。

链接: https://arxiv.org/abs/2505.18092
作者: Weizhou Shen,Chenliang Li,Fanqi Wan,Shengyi Liao,Shaopeng Lai,Bo Zhang,Yingcheng Shi,Yuning Wu,Gang Fu,Zhansheng Li,Bin Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan
机构: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the “lost in the middle” performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) Natural language-guided dynamic optimization, (2) Bidirectional reasoning layers for enhanced boundary awareness, (3) Token critic mechanisms with language modeling heads, and (4) Window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS’s threefold effectiveness: (1) Consistent superiority over other context management methods like RAG and sparse attention in both accuracy and efficiency. (2) Architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieves 21.59 \times context compression alongside 19.15-point average performance gains; (3) Deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new SOTA performance. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.18092 [cs.CL] (or arXiv:2505.18092v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.18092 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-19] Data Mixing Can Induce Phase Transitions in Knowledge Acquisition

【速读】：该论文试图解决在训练大型语言模型（Large Language Models, LLMs）时，混合来自网络爬取数据和高密度知识数据集的训练数据所导致的知识获取行为不遵循平滑缩放规律的问题。传统观点认为，增加模型规模或调整数据混合比例会线性影响模型性能，但本文通过实验发现，知识获取过程可能表现出相变现象。解决方案的关键在于揭示了模型容量分配机制，即模型在有限容量下需像背包问题求解器一样优化整体测试损失，从而导致在模型规模或数据混合比例变化时出现非连续的最优分配策略。研究进一步通过信息论框架形式化这一现象，并证明相变临界点与模型规模呈幂律关系。

链接: https://arxiv.org/abs/2505.18091
作者: Xinran Gu,Kaifeng Lyu,Jiazheng Li,Jingzhao Zhang
机构: Institute for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Shanghai Qizhi Institute (上海智源研究院); Shanghai AI Laboratory (上海人工智能实验室; Beijing Institute of Technology (北京理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
zh

[NLP-20] Deep Video Discovery: Agent ic Search with Tool Use for Long-form Video Understanding

【速读】：该论文旨在解决长视频理解中的挑战，特别是由于时间-空间复杂性高以及在长时间上下文中进行问答的困难。其解决方案的关键在于提出一种基于代理的搜索策略，即Deep Video Discovery (DVD)代理，通过在分段视频片段上进行自主探索，利用大型语言模型（LLM）的高级推理能力来规划当前观察状态下的行动策略，选择合适的工具并调整参数，从而迭代优化内部推理过程。这一方法强调了代理的自主性，而非依赖手动设计的固定流程。

链接: https://arxiv.org/abs/2505.18079
作者: Xiaoyi Zhang,Zhaoyang Jia,Zongyu Guo,Jiahao Li,Bin Li,Houqiang Li,Yan Lu
机构: Microsoft Research Asia (微软亚洲研究院); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Different from previous video agents manually designing a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools, formulates appropriate parameters for actions, and iteratively refines its internal reasoning in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code will be released later.
zh

[NLP-21] Extended Inductive Reasoning for Personalized Preference Inference from Behavioral Signals

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在个性化偏好推理中的挑战，即如何从用户交互历史中的分散信号中推断出一致的偏好模式，而当前方法在捕捉多样化的用户偏好方面存在不足。解决方案的关键在于提出\textscAlignXplore模型，该模型通过扩展推理链来系统地从行为信号中进行偏好推理，并结合基于合成数据的冷启动训练与后续在线强化学习，从而显著提升了模型在域内和域外基准上的性能。

链接: https://arxiv.org/abs/2505.18071
作者: Jia-Nan Li,Jian Guan,Wei Wu,Rui Yan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant success in complex reasoning tasks such as math and coding. In contrast to these tasks where deductive reasoning predominates, inductive reasoning\textemdash the ability to derive general rules from incomplete evidence, remains underexplored. This paper investigates extended inductive reasoning in LLMs through the lens of personalized preference inference, a critical challenge in LLM alignment where current approaches struggle to capture diverse user preferences. The task demands strong inductive reasoning capabilities as user preferences are typically embedded implicitly across various interaction forms, requiring models to synthesize consistent preference patterns from scattered signals. We propose \textscAlignXplore, a model that leverages extended reasoning chains to enable systematic preference inference from behavioral signals in users’ interaction histories. We develop \textscAlignXplore by combining cold-start training based on synthetic data with subsequent online reinforcement learning. Through extensive experiments, we demonstrate that \textscAlignXplore achieves substantial improvements over the backbone model by an average of 11.05% on in-domain and out-of-domain benchmarks, while maintaining strong generalization ability across different input formats and downstream models. Further analyses establish best practices for preference inference learning through systematic comparison of reward modeling strategies, while revealing the emergence of human-like inductive reasoning patterns during training.
zh

[NLP-22] MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving

【速读】：该论文试图解决在线教育中缺乏即时、个性化反馈的问题，特别是在数学问题求解过程中帮助学生纠正错误。解决方案的关键在于利用生成式 AI (Generative AI) 的能力来评估学生的数学解题过程并提供适应性反馈。研究引入了 MathEDU 数据集，该数据集包含带有教师反馈的真实学生解答，并通过在两种场景下评估模型性能——一种是模型访问学生历史答案的情况，另一种是模拟冷启动情境——以验证其支持个性化学习的能力。

链接: https://arxiv.org/abs/2505.18056
作者: Wei-Ling Hsu,Yu-Chien Tang,An-Zi Yen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computation and Language (cs.CL)
备注: Pre-print

点击查看摘要

Abstract:Online learning enhances educational accessibility, offering students the flexibility to learn anytime, anywhere. However, a key limitation is the lack of immediate, personalized feedback, particularly in helping students correct errors in math problem-solving. Several studies have investigated the applications of large language models (LLMs) in educational contexts. In this paper, we explore the capabilities of LLMs to assess students’ math problem-solving processes and provide adaptive feedback. The MathEDU dataset is introduced, comprising authentic student solutions annotated with teacher feedback. We evaluate the model’s ability to support personalized learning in two scenarios: one where the model has access to students’ prior answer histories, and another simulating a cold-start context. Experimental results show that the fine-tuned model performs well in identifying correctness. However, the model still faces challenges in generating detailed feedback for pedagogical purposes.
zh

[NLP-23] Contrastive Distillation of Emotion Knowledge from LLM s for Zero-Shot Emotion Recognition

【速读】：该论文旨在解决情感识别（Emotion Recognition, ER）系统在面对不同情感标签时缺乏适应性的问题，传统ER模型依赖于固定标签集进行训练，难以泛化到其他标签空间。其解决方案的关键在于提出一种对比蒸馏框架，将大型语言模型（Large Language Models, LLMs）中的丰富情感知识迁移至一个轻量级模型中，无需人工标注。通过使用GPT-4生成描述性情感标注，并在共享嵌入空间中对文本样本与情感描述符进行对齐，实现了跨不同情感类别、粒度和标签架构的零样本预测。

链接: https://arxiv.org/abs/2505.18040
作者: Minxue Niu,Emily Mower Provost
机构: University of Michigan, Ann Arbor, Michigan, USA (密歇根大学，安娜堡，密歇根，美国)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability to handle various emotion labels without dedicated training is crucial for building adaptable Emotion Recognition (ER) systems. Conventional ER models rely on training using fixed label sets and struggle to generalize beyond them. On the other hand, Large Language Models (LLMs) have shown strong zero-shot ER performance across diverse label spaces, but their scale limits their use on edge devices. In this work, we propose a contrastive distillation framework that transfers rich emotional knowledge from LLMs into a compact model without the use of human annotations. We use GPT-4 to generate descriptive emotion annotations, offering rich supervision beyond fixed label sets. By aligning text samples with emotion descriptors in a shared embedding space, our method enables zero-shot prediction on different emotion classes, granularity, and label schema. The distilled model is effective across multiple datasets and label spaces, outperforming strong baselines of similar size and approaching GPT-4’s zero-shot performance, while being over 10,000 times smaller.
zh

[NLP-24] Structured Thinking Matters: Improving LLM s Generalization in Causal Inference Tasks

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在区分因果关系与相关性方面表现不可靠的问题。研究表明，即使是最先进的LLMs，如GPT-4，在Corr2Cause数据集基准上的F1得分为29.08，仅略高于随机基线（F1得分：20.38），表明其泛化能力有限。为解决这一问题，论文提出了一种结构化的方法：不直接回答因果查询，而是引导模型构建结构化的知识图谱，系统地编码提供的相关前提，从而回答因果问题。该中间表示显著提升了模型的因果推理能力。实验结果表明，使用Qwen3-32B模型在Corr2Cause测试子集上的F1得分从32.71提升至48.26，相对提升超过47.5%，验证了该方法的有效性。解决方案的关键在于赋予模型结构化思考的能力，以增强其在多种因果推断任务中的泛化潜力。

链接: https://arxiv.org/abs/2505.18034
作者: Wentao Sun,Joao Paulo Nogueira,Alonso Silva
机构: École Polytechnique (巴黎综合理工学院); Institut Polytechnique de Paris (巴黎综合理工学院); Nokia Bell Labs (诺基亚贝尔实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite remarkable advances in the field, LLMs remain unreliable in distinguishing causation from correlation. Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs – such as GPT-4 (F1 score: 29.08) – only marginally outperform random baselines (Random Uniform, F1 score: 20.38), indicating limited capacity of generalization. To tackle this limitation, we propose a novel structured approach: rather than directly answering causal queries, we provide the model with the capability to structure its thinking by guiding the model to build a structured knowledge graph, systematically encoding the provided correlational premises, to answer the causal queries. This intermediate representation significantly enhances the model’s causal capabilities. Experiments on the test subset of the Corr2Cause dataset benchmark with Qwen3-32B model (reasoning model) show substantial gains over standard direct prompting methods, improving F1 scores from 32.71 to 48.26 (over 47.5% relative increase), along with notable improvements in precision and recall. These results underscore the effectiveness of providing the model with the capability to structure its thinking and highlight its promising potential for broader generalization across diverse causal inference tasks.
zh

[NLP-25] raining with Pseudo-Code for Instruction Following

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在遵循相对简单且明确的指令时存在的困难，尤其是在涉及组合任务时。其解决方案的关键在于通过指令微调数据对LLMs进行微调，其中包含以伪代码（pseudo-code）重新表达的指令及其最终响应，从而提升模型对指令的理解和执行能力。

链接: https://arxiv.org/abs/2505.18011
作者: Prince Kumar,Rudra Murthy,Riyaz Bhat,Danish Contractor
机构: IBM Research AI (IBM 研究院人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Despite the rapid progress in the capabilities of Large Language Models (LLMs), they continue to have difficulty following relatively simple, unambiguous instructions, especially when compositions are involved. In this paper, we take inspiration from recent work that suggests that models may follow instructions better when they are expressed in pseudo-code. However, writing pseudo-code programs can be tedious and using few-shot demonstrations to craft code representations for use in inference can be unnatural for non-expert users of LLMs. To overcome these limitations, we propose fine-tuning LLMs with instruction-tuning data that additionally includes instructions re-expressed in pseudo-code along with the final response. We evaluate models trained using our method on 11 publicly available benchmarks comprising of tasks related to instruction-following, mathematics, and common-sense reasoning. We conduct rigorous experiments with 5 different models and find that not only do models follow instructions better when trained with pseudo-code, they also retain their capabilities on the other tasks related to mathematical and common sense reasoning. Specifically, we observe a relative gain of 3 – 19 % on instruction-following benchmark, and an average gain of upto 14% across all tasks.
zh

[NLP-26] RACE for Tracking the Emergence of Semantic Representations in Transformers

【速读】：该论文试图解决现代Transformer模型在训练过程中出现的相变现象（phase transitions）机制不明确的问题，特别是从记忆到抽象的转变过程。其解决方案的关键在于提出了一种诊断框架TRACE（Tracking Representation Abstraction and Compositional Emergence），该框架结合几何、信息和语言信号，以检测基于Transformer的语言模型中的相变。TRACE利用一种称为ABSynth的框架语义数据生成方法，生成具有可控复杂度、词频分布和结构熵的标注合成语料库，从而实现对抽象性出现的精确分析。

链接: https://arxiv.org/abs/2505.17998
作者: Nura Aljaafari,Danilo S. Carvalho,André Freitas
机构: University of Manchester(曼彻斯特大学); Idiap Research Institute(伊迪帕研究机构); CRUK-MI(癌症研究英国-曼彻斯特研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces annotated synthetic corpora with controllable complexity, lexical distributions, and structural entropy, while being fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.
zh

[NLP-27] owards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective

【速读】：该论文旨在深入探讨VAPO框架在长链式思维（CoT）推理任务中提升强化学习效率与可靠性的理论机制及其潜在限制，以指导未来更鲁棒和泛化的推理智能体的发展。其解决方案的关键在于系统性地应对价值模型偏差、异构序列长度以及稀疏奖励信号等挑战，通过理论分析揭示价值函数近似、自适应优势估计的最优性、令牌级优化的影响以及探索与泛化难题的本质。

链接: https://arxiv.org/abs/2505.17997
作者: Jintian Shao,Yiming Cheng,Hongyi Huang,Beiwen Zhang,Zhiyu Wu,You Shan,Mingkai Zheng
机构: Southern University of Science and Technology (南方科技大学); Fudan University (复旦大学); Sun Yat-sen University (中山大学); SenseTime Research (商汤研究院); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The VAPO framework has demonstrated significant empirical success in enhancing the efficiency and reliability of reinforcement learning for long chain-of-thought (CoT) reasoning tasks with large language models (LLMs). By systematically addressing challenges such as value model bias, heterogeneous sequence lengths, and sparse reward signals, VAPO achieves state-of-the-art performance. While its practical benefits are evident, a deeper theoretical understanding of its underlying mechanisms and potential limitations is crucial for guiding future advancements. This paper aims to initiate such a discussion by exploring VAPO from a theoretical perspective, highlighting areas where its assumptions might be challenged and where further investigation could yield more robust and generalizable reasoning agents. We delve into the intricacies of value function approximation in complex reasoning spaces, the optimality of adaptive advantage estimation, the impact of token-level optimization, and the enduring challenges of exploration and generalization.
zh

[NLP-28] AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

【速读】：该论文旨在解决图像-文本声明的自动化验证问题，特别是在社交媒体上，由于图像与文本结合使用可能增强可信度，但也加剧了虚假信息的传播。现有数据集在真实性验证方面存在局限性，主要表现为合成声明较多、缺乏反映判断依据的证据标注。为解决这些问题，作者提出了AVerImaTeC数据集，其关键在于收集了1,297个真实世界的图像-文本声明，并通过包含网络证据的问答对进行标注，以反映判断的分解推理过程。此外，通过声明归一化、时间约束的证据标注以及两阶段充分性检查来缓解事实核查数据集中的常见挑战。

链接: https://arxiv.org/abs/2505.17978
作者: Rui Cao,Zifeng Ding,Zhijiang Guo,Michael Schlichtkrull,Andreas Vlachos
机构: University of Cambridge(剑桥大学); Queen Mary University of London(玛丽皇后大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVerImaTeC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVerImaTeC via inter-annotator studies, achieving a \kappa=0.742 on verdicts and 74.7% consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.
zh

[NLP-29] Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

【速读】：该论文试图解决如何评估大型语言模型（Large Language Model, LLM）从被动观察与主动收集的数据中识别黑箱系统底层结构的能力问题。其解决方案的关键在于通过主动干预——即向黑箱系统输入特定输入并观察输出——来提升LLM的逆向工程能力，从而克服其在被动观察下性能受限的问题。主动干预使LLM能够测试边界情况并优化其信念，进而提升对黑箱系统的理解能力。

链接: https://arxiv.org/abs/2505.17968
作者: Jiayi Geng,Howard Chen,Dilip Arumugam,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 30 pages

点击查看摘要

Abstract:Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene – actively querying the black-box with specific inputs to observe the resulting output – improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.
zh

[NLP-30] Counting Cycles with Deepseek

【速读】：该论文试图解决如何为循环计数统计量（cycle count statistic）推导出一个计算高效的等价形式（Computationally Efficient Equivalent Form, CEEF）的问题。该问题缺乏已知的通用解法，需要精细的组合数学和繁琐的计算，传统方法难以完成。论文提出的解决方案的关键在于结合一种新颖的方法与AI的强大编码能力，并通过提供清晰的策略、分步指导以及精心编写的提示，使AI能够有效地解决问题。

链接: https://arxiv.org/abs/2505.17964
作者: Jiashun Jin,Tracy Ke,Bingcheng Sui,Zhenggang Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite recent progress, AI still struggles on advanced mathematics. We consider a difficult open problem: How to derive a Computationally Efficient Equivalent Form (CEEF) for the cycle count statistic? The CEEF problem does not have known general solutions, and requires delicate combinatorics and tedious calculations. Such a task is hard to accomplish by humans but is an ideal example where AI can be very helpful. We solve the problem by combining a novel approach we propose and the powerful coding skills of AI. Our results use delicate graph theory and contain new formulas for general cases that have not been discovered before. We find that, while AI is unable to solve the problem all by itself, it is able to solve it if we provide it with a clear strategy, a step-by-step guidance and carefully written prompts. For simplicity, we focus our study on DeepSeek-R1 but we also investigate other AI approaches.
zh

[NLP-31] Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

【速读】：该论文旨在解决在大型语言模型（LLMs）中提升复杂任务性能并实现可解释决策的问题，特别是在临床应用中。传统方法依赖于昂贵的链式思维（CoT）数据的监督微调（SFT），而该研究提出AlphaMed，这是首个通过强化学习（RL）纯自发地展现推理能力的医学大模型，其关键在于使用基于最小化规则的奖励机制，在无需SFT或蒸馏CoT数据的情况下，利用公开的多项选择问答（QA）数据集进行训练。

链接: https://arxiv.org/abs/2505.17952
作者: Che Liu,Haozhe Wang,Jiazhen Pan,Zhongwei Wan,Yong Dai,Fangzhen Lin,Wenjia Bai,Daniel Rueckert,Rossella Arcucci
机构: Imperial College London (帝国理工学院); HKUST (香港科技大学); Technical University of Munich (慕尼黑工业大学); Ohio State University (俄亥俄州立大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
zh

[NLP-32] Handling Symbolic Language in Student Texts: A Comparative Study of NLP Embedding Models

【速读】：该论文试图解决在学习分析（Learning Analytics, LA）中处理科学相关语言时，现有自然语言处理（Natural Language Processing, NLP）嵌入模型对符号表达（如公式和方程）处理能力不足的问题。解决方案的关键在于评估不同嵌入模型在处理物理学科特定符号表达时的性能差异，并通过相似性分析和机器学习流水线集成两种方法进行验证，从而为LA研究者提供模型选择的依据。

链接: https://arxiv.org/abs/2505.17950
作者: Tom Bleckmann,Paul Tschisgale
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Physics Education (physics.ed-ph)
备注:

点击查看摘要

Abstract:Recent advancements in Natural Language Processing (NLP) have facilitated the analysis of student-generated language products in learning analytics (LA), particularly through the use of NLP embedding models. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing studies and applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased findings and diminished performance of LA applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: similarity-based analyses and integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI’s GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Beyond performance, additional factors such as cost, regulatory compliance, and model transparency are discussed as key considerations for model selection. Overall, this study underscores the importance for LA researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions.
zh

[NLP-33] Understanding Gated Neurons in Transformers from Their Input-Output Functionality

【速读】：该论文试图解决当前可解释性研究中对多层感知机（MLP）神经元的分析过于侧重激活上下文和输出权重向量，而忽视了输入与输出之间相互作用的问题。其解决方案的关键在于通过检查神经元输入与输出权重之间的余弦相似度，来分析输入与输出之间的交互作用，从而识别出“增强型神经元”和“耗损型神经元”，并揭示它们在不同网络层中的分布特性。

链接: https://arxiv.org/abs/2505.17936
作者: Sebastian Gerstner,Hinrich Schütze
机构: Center for Information and Language Processing (CIS), LMU Munich, Munich Center for Machine Learning (MCML)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 31 pages, 22 figures

点击查看摘要

Abstract:Interpretability researchers have attempted to understand MLP neurons of language models based on both the contexts in which they activate and their output weight vectors. They have paid little attention to a complementary aspect: the interactions between input and output. For example, when neurons detect a direction in the input, they might add much the same direction to the residual stream (“enrichment neurons”) or reduce its presence (“depletion neurons”). We address this aspect by examining the cosine similarity between input and output weights of a neuron. We apply our method to 12 models and find that enrichment neurons dominate in early-middle layers whereas later layers tend more towards depletion. To explain this finding, we argue that enrichment neurons are largely responsible for enriching concept representations, one of the first steps of factual recall. Our input-output perspective is a complement to activation-dependent analyses and to approaches that treat input and output separately.
zh

[NLP-34] owards Practical Defect-Focused Automated Code Review ICML2025

【速读】：该论文试图解决代码审查自动化中因过度简化任务而带来的实际应用局限性问题，具体表现为忽视仓库上下文、真实场景下的合并请求评估以及缺陷检测。其解决方案的关键在于构建一个完整的自动化流水线，包括：1) 利用代码切片算法提取相关上下文，2) 采用多角色大语言模型框架提升关键缺陷包含率（KBI），3) 设计过滤机制以降低误报率（FAR），4) 通过新颖的提示设计优化人机交互。该方法在真实世界合并请求上的验证结果表明，相比标准大语言模型和先前基线，性能分别提升了2倍和10倍。

链接: https://arxiv.org/abs/2505.17928
作者: Junyi Lu,Lili Jiang,Xiaojia Li,Jianbing Fang,Fengjun Zhang,Li Yang,Chun Zuo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to Forty-Second International Conference on Machine Learning (ICML 2025)

点击查看摘要

Abstract:The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.
zh

[NLP-35] Language models can learn implicit multi-hop reasoning but only if they have lots of training data

【速读】：该论文试图解决语言模型在无需显式思维链（chain of thought）的情况下，通过单次前向传播完成多跳推理（multi-hop reasoning）任务的能力问题。其解决方案的关键在于利用从头训练的GPT2风格语言模型，在控制的k-跳推理数据集上进行训练，以研究模型是否能够学习到隐式k-跳推理能力，并分析训练数据量和模型深度随k值变化的规律。研究发现，随着k的增加，所需训练数据呈指数增长，而所需的Transformer层数则线性增长，且通过课程学习（curriculum learning）可部分缓解数据需求问题。

链接: https://arxiv.org/abs/2505.17923
作者: Yuekun Yao,Yupei Du,Dawei Zhu,Michael Hahn,Alexander Koller
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought. We investigate this capability using GPT2-style language models trained from scratch on controlled k -hop reasoning datasets ( k = 2, 3, 4 ). We show that while such models can indeed learn implicit k -hop reasoning, the required training data grows exponentially in k , and the required number of transformer layers grows linearly in k . We offer a theoretical explanation for why this depth growth is necessary. We further find that the data requirement can be mitigated, but not eliminated, through curriculum learning.
zh

[NLP-36] 2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

【速读】：该论文旨在解决扩散模型生成图像质量评估中缺乏可解释性自动评估方法的问题，以及现有监督微调方法依赖高质量批评数据集所带来的成本高、可扩展性差和泛化能力弱的问题。其解决方案的关键在于提出一种基于强化学习的框架T2I-Eval-R1，该框架仅使用粗粒度质量评分进行训练，避免了对高精度可解释评估理由的人工标注，并通过集成组相对策略优化（GRPO）和连续奖励函数设计，使模型能够在仅有易获取的标注判断分数或偏好信息的情况下生成标量评分和可解释的推理链，从而提升评估的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.17897
作者: Zi-Ao Ma,Tian Lan,Rong-Cheng Tu,Shu-Hang Liu,Heyan Huang,Zhijing Wu,Chen Xu,Xian-Ling Mao
机构: School of Computer Science and Technology, Beijing Institute of Technology, China(计算机学院，北京理工大学，中国); Nanyang Technological University, Singapore(南洋理工大学，新加坡); School of Medical Technology, Beijing Institute of Technology, China(医学技术学院，北京理工大学，中国)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, therefore reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs-with potential issues of bias and inconsistency-or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores, thereby avoiding the need for annotating high-quality interpretable evaluation rationale. Our approach integrates Group Relative Policy Optimization (GRPO) into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains with only easy accessible annotated judgment scores or preferences. Furthermore, we introduce a continuous reward formulation that encourages score diversity and provides stable optimization signals, leading to more robust and discriminative evaluation behavior. Experimental results on three established T2I meta-evaluation benchmarks demonstrate that T2I-Eval-R1 achieves significantly higher alignment with human assessments and offers more accurate interpretable score rationales compared to strong baseline methods.
zh

[NLP-37] Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

【速读】：该论文旨在解决阿拉伯语-英语双向翻译任务中模型性能与计算成本之间的平衡问题，尤其是在小型语言模型（Language Model, LM）上实现高效且高质量的翻译。其解决方案的关键在于基于Kuwain-1.5B模型进行优化，采用两阶段训练方法和高质量、多样化训练语料库，从而在保持模型紧凑性的同时提升翻译性能。此外，研究者还提出了Tarjama-25基准测试集，以克服现有数据集在领域狭窄性、句子长度短和英语源偏见等方面的局限性。

链接: https://arxiv.org/abs/2505.17894
作者: Khalil Hennara,Muhammad Hreden,Mohamed Motaism Hamed,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B , a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus… Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
zh

[NLP-38] MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

【速读】：该论文试图解决自动化科学发现中的假设排序问题，特别是在自然科学领域中，由于实验成本高且通量有限，传统方法仅依赖大型语言模型的内部推理进行预实验排序，未能有效利用实验结果。解决方案的关键在于引入实验引导的排序任务，通过模拟实验反馈来优先排序候选假设，其核心是构建一个基于三个领域先验假设的模拟器，将假设性能建模为与已知真实假设相似性的函数，并加入噪声扰动，从而在缺乏真实实验数据的情况下实现有效的假设排序。

链接: https://arxiv.org/abs/2505.17873
作者: Wanhao Liu,Zonglin Yang,Jue Wang,Lidong Bing,Di Zhang,Dongzhan Zhou,Yuqiang Li,Houqiang Li,Erik Cambria,Wanli Ouyang
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Nanyang Technological University (南洋理工大学); MiroMind (米罗思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Hypothesis ranking is a crucial component of automated scientific discovery, particularly in natural sciences where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on large language model’s internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.
zh

[NLP-39] Just as Humans Need Vaccines So Do Models: Model Immunization to Combat Falsehoods

【速读】：该论文试图解决生成式 AI 模型在训练过程中学习并再现虚假信息的问题（Generative AI models often learn and reproduce false information present in their training corpora）。解决方案的关键在于通过在微调过程中使用少量、隔离且明确标记的虚假示例作为“疫苗”，使模型具备识别和拒绝误导性陈述的能力，同时保持对真实输入的准确性。这种方法将经过核查的虚假信息本身作为监督信号，而非依赖输入扰动或通用的人类反馈，从而增强模型对未来虚假信息的抵抗力。

链接: https://arxiv.org/abs/2505.17870
作者: Shaina Raza,Rizwan Qureshi,Marcelo Lotif,Aman Chadha,Deval Pandya,Christos Emmanouilidis
机构: Vector Institute (向量研究所); University of Central Florida (中佛罗里达大学); Amazon Web Services (亚马逊网络服务); University of Groningen (格罗宁根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine tuned on small, quarantined sets of explicitly labeled falsehoods as a “vaccine” against misinformation. These curated false examples are periodically injected during finetuning, strengthening the model ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.
zh

[NLP-40] Explaining Sources of Uncertainty in Automated Fact-Checking

【速读】：该论文试图解决模型预测不确定性解释不足的问题，特别是在面对冲突证据时，现有方法无法有效解释不确定性，导致用户难以理解或信任模型输出。解决方案的关键在于提出CLUE（Conflict-and-Agreement-aware Language-model Uncertainty Explanations），该框架通过无监督方式识别文本片段之间的关系，揭示驱动模型预测不确定性的主张-证据或证据间冲突与一致性的信息，并利用提示和注意力引导生成自然语言解释，从而更准确地反映模型的不确定性并提升解释的逻辑一致性与实用性。

链接: https://arxiv.org/abs/2505.17855
作者: Jingyi Sun,Greta Warren,Irina Shklovski,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding sources of a model’s uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes using numerical uncertainty or hedges (“I’m not sure, but …”), which do not explain uncertainty that arises from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (Conflict-and-Agreement-aware Language-model Uncertainty Explanations), the first framework to generate natural language explanations of model uncertainty by (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts and agreements that drive the model’s predictive uncertainty in an unsupervised way, and (ii) generating explanations via prompting and attention steering that verbalize these critical interactions. Across three language models and two fact-checking datasets, we show that CLUE produces explanations that are more faithful to the model’s uncertainty and more consistent with fact-checking decisions than prompting for uncertainty explanations without span-interaction guidance. Human evaluators judge our explanations to be more helpful, more informative, less redundant, and more logically consistent with the input than this baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and generalises readily to other tasks that require reasoning over complex information.
zh

[NLP-41] Investigating Affect Mining Techniques for Annotation Sample Selection in the Creation of Finnish Affective Speech Corpus INTERSPEECH2025

【速读】：该论文试图解决缺乏自然表达情感的芬兰语口语语料库的问题，现有数据多为表演性或特定交际场景下的表达。解决方案的关键在于通过结合声学特征、跨语言语音情感特征和文本情感特征的“情感挖掘”方法进行样本选择，以确保情感表达的多样性，并在此基础上创建首个针对自发芬兰语情感表达的语料库。

链接: https://arxiv.org/abs/2505.17833
作者: Kalle Lahtinen,Einari Vaaras,Liisa Mustanoja,Okko Räsänen
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for publication at Interspeech 2025, Rotterdam, The Netherlands

点击查看摘要

Abstract:Study of affect in speech requires suitable data, as emotional expression and perception vary across languages. Until now, no corpus has existed for natural expression of affect in spontaneous Finnish, existing data being acted or from a very specific communicative setting. This paper presents the first such corpus, created by annotating 12,000 utterances for emotional arousal and valence, sampled from three large-scale Finnish speech corpora. To ensure diverse affective expression, sample selection was conducted with an affect mining approach combining acoustic, cross-linguistic speech emotion, and text sentiment features. We compare this method to random sampling in terms of annotation diversity, and conduct post-hoc analyses to identify sampling choices that would have maximized the diversity. As an outcome, the work introduces a spontaneous Finnish affective speech corpus and informs sampling strategies for affective speech corpus creation in other languages or domains.
zh

[NLP-42] Emerging categories in scientific explanations DATE

【速读】：该论文试图解决机器学习决策解释缺乏大规模、人类类似且由人类生成的标注数据集的问题（explanation dataset）。其解决方案的关键在于从生物技术和生物物理领域内的科学文献中提取具有解释性质的句子，并基于数据归纳出多类别标注体系，同时评估标注者在新兴类别上的一致性，最终构建了一个公开可用的数据集，提供两种不同的分类方式（6类和3类），其中3类标注达到了0.667的Krippendorf Alpha值。

链接: https://arxiv.org/abs/2505.17832
作者: Giacomo Magnifico,Eduard Barbu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the 3rd TRR 318 Conference: Contextualizing Explanations (ContEx25), as a two-pager abstract. Will be published at BiUP (Bielefeld University Press) at a later date

点击查看摘要

Abstract:Clear and effective explanations are essential for human understanding and knowledge dissemination. The scope of scientific research aiming to understand the essence of explanations has recently expanded from the social sciences to machine learning and artificial intelligence. Explanations for machine learning decisions must be impactful and human-like, and there is a lack of large-scale datasets focusing on human-like and human-generated explanations. This work aims to provide such a dataset by: extracting sentences that indicate explanations from scientific literature among various sources in the biotechnology and biophysics topic domains (e.g. PubMed’s PMC Open Access subset); providing a multi-class notation derived inductively from the data; evaluating annotator consensus on the emerging categories. The sentences are organized in an openly-available dataset, with two different classifications (6-class and 3-class category annotation), and the 3-class notation achieves a 0.667 Krippendorf Alpha value.
zh

[NLP-43] Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLM s Reasoning

【速读】：该论文试图解决传统Test-Time Scaling (TTS)方法在数学推理任务中因路径同质化和中间结果利用效率低下而导致的准确性不足问题。其解决方案的关键在于提出Stepwise Reasoning Checkpoint Analysis (SRCA)，通过在推理步骤之间引入检查点，并结合两种核心策略：答案聚类搜索（Answer-Clustered Search）以保持路径多样性并确保质量，以及检查点候选增强（Checkpoint Candidate Augmentation）以充分利用所有中间结果进行最终决策，从而有效减少路径同质化并提升模型的容错能力。

链接: https://arxiv.org/abs/2505.17829
作者: Zezhong Wang,Xingshan Zeng,Weiwen Liu,Yufei Wang,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu,Kam-Fai Wong
机构: The Chinese University of Hong Kong (香港中文大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.
zh

[NLP-44] Not All Tokens Are What You Need In Thinking

【速读】：该论文旨在解决现代推理模型（如OpenAI的o1和DeepSeek-R1）在问题求解过程中存在的关键效率问题，包括高推理延迟、计算资源消耗过大以及过度思考——生成冗长且包含大量冗余标记的思维链（Chain of Thought, CoT）。其解决方案的关键在于提出条件性标记选择（Conditional Token Selection, CTS），这是一种基于标记级别的压缩框架，能够识别并保留CoT中最具本质意义的标记。CTS通过条件重要性评分评估每个标记对得出正确答案的贡献，并在压缩后的CoT上训练模型，从而有效减少冗余标记数量，同时保持强大的推理性能。

链接: https://arxiv.org/abs/2505.17827
作者: Hang Yuan,Bin Yu,Haotian Li,Shijun Yang,Christina Dan Wang,Zhou Yu,Xueyin Xu,Weizhen Qi,Kai Chen
机构: East China Normal University (华东师范大学); Harbin Institute of Technology (哈尔滨工业大学); University of Science and Technology of China (中国科学技术大学); New York University Shanghai (纽约大学上海分校); Zhongguancun Academy (中关村科学院); Zhongguancun Institute of Artificial Intelligence (中关村人工智能研究所)
类目: Computation and Language (cs.CL)
备注: 11 pages, 7 figures and 3 tables

点击查看摘要

Abstract:Modern reasoning models, such as OpenAI’s o1 and DeepSeek-R1, exhibit impressive problem-solving capabilities but suffer from critical inefficiencies: high inference latency, excessive computational resource consumption, and a tendency toward overthinking – generating verbose chains of thought (CoT) laden with redundant tokens that contribute minimally to the final answer. To address these issues, we propose Conditional Token Selection (CTS), a token-level compression framework with a flexible and variable compression ratio that identifies and preserves only the most essential tokens in CoT. CTS evaluates each token’s contribution to deriving correct answers using conditional importance scoring, then trains models on compressed CoT. Extensive experiments demonstrate that CTS effectively compresses long CoT while maintaining strong reasoning performance. Notably, on the GPQA benchmark, Qwen2.5-14B-Instruct trained with CTS achieves a 9.1% accuracy improvement with 13.2% fewer reasoning tokens (13% training token reduction). Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens, highlighting the prevalence of redundancy in existing CoT.
zh

[NLP-45] rinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models DATE

【速读】：该论文提出了一种名为Trinity-RFT的通用、灵活且可扩展的强化学习微调（Reinforcement Fine-Tuning, RFT）框架，旨在解决大规模语言模型在不同应用场景下的强化学习微调问题。其关键在于采用解耦设计，包括统一并泛化同步/异步、在线/离线等RFT模式的RFT核心、高效且稳健的智能体-环境交互集成，以及针对RFT优化的系统化数据流水线，从而提供一个统一的平台以支持先进强化学习范式的探索。

链接: https://arxiv.org/abs/2505.17826
作者: Xuchen Pan,Yanxi Chen,Yushuo Chen,Yuchang Sun,Daoyuan Chen,Wenhao Zhang,Yuexiang Xie,Yilun Huang,Yilei Zhang,Dawei Gao,Yaliang Li,Bolin Ding,Jingren Zhou
机构: Alibaba Group (阿里巴巴集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: This technical report will be continuously updated as the codebase evolves. GitHub: this https URL

点击查看摘要

Abstract:Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT, (2) seamless integration for agent-environment interaction with high efficiency and robustness, and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for exploring advanced reinforcement learning paradigms. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples demonstrating the utility and user-friendliness of the proposed framework.
zh

[NLP-46] PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

【速读】：该论文试图解决医疗场景中医生与患者多轮对话系统训练和评估缺乏真实且多样化患者角色的问题，现有模拟器无法全面反映临床实践中遇到的患者人格类型。解决方案的关键在于提出PatientSim，一个基于医学专业知识的患者模拟器，其核心包括：1）从MIMIC-ED和MIMIC-IV数据集中提取的临床资料，如症状和病史；2）由四个维度定义的患者人格类型：性格、语言能力、病史回忆水平和认知混乱水平，生成37种独特的组合，从而实现对真实临床场景的高保真模拟。

链接: https://arxiv.org/abs/2505.17818
作者: Daeun Kyung,Hyunseung Chung,Seongsu Bae,Jiho Kim,Jae Ho Sohn,Taerim Kim,Soo Kyung Kim,Edward Choi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages for main text, 4 pages for references, 27 pages for supplementary materials

点击查看摘要

Abstract:Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.
zh

[NLP-47] Low-Resource NMT: A Case Study on the Written and Spoken Languages in Hong Kong

【速读】：该论文旨在解决标准汉语与粤语书面语之间的自动翻译问题，特别是在网络环境中日益增长的跨语言交流需求背景下。其解决方案的关键在于构建大量平行语料数据，以支持基于Transformer的神经机器翻译系统。研究通过收集已有语言学研究和互联网资源中的28,000对平行句子，并利用从中文维基百科和粤语维基百科中自动提取语义相似句子的方法，成功获得了72,000对平行句子，从而显著提升了翻译性能。

链接: https://arxiv.org/abs/2505.17816
作者: Hei Yi Mak,Tan Lee
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注: Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval

点击查看摘要

Abstract:The majority of inhabitants in Hong Kong are able to read and write in standard Chinese but use Cantonese as the primary spoken language in daily life. Spoken Cantonese can be transcribed into Chinese characters, which constitute the so-called written Cantonese. Written Cantonese exhibits significant lexical and grammatical differences from standard written Chinese. The rise of written Cantonese is increasingly evident in the cyber world. The growing interaction between Mandarin speakers and Cantonese speakers is leading to a clear demand for automatic translation between Chinese and Cantonese. This paper describes a transformer-based neural machine translation (NMT) system for written-Chinese-to-written-Cantonese translation. Given that parallel text data of Chinese and Cantonese are extremely scarce, a major focus of this study is on the effort of preparing good amount of training data for NMT. In addition to collecting 28K parallel sentences from previous linguistic studies and scattered internet resources, we devise an effective approach to obtaining 72K parallel sentences by automatically extracting pairs of semantically similar sentences from parallel articles on Chinese Wikipedia and Cantonese Wikipedia. We show that leveraging highly similar sentence pairs mined from Wikipedia improves translation performance in all test sets. Our system outperforms Baidu Fanyi’s Chinese-to-Cantonese translation on 6 out of 8 test sets in BLEU scores. Translation examples reveal that our system is able to capture important linguistic transformations between standard Chinese and spoken Cantonese.
zh

[NLP-48] Dont Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

【速读】：该论文试图解决当前推理型大语言模型（Reasoning Large Language Models, LLMs）在执行复杂推理任务时依赖于增加测试阶段计算资源导致的高计算成本和推理时间问题。其解决方案的关键在于挑战“较长的思考链能够提升推理能力”的假设，提出一种名为short-m@k的新推理推理LLM推理方法，该方法通过并行执行k次独立生成并在前m个思考过程完成后停止计算，最终通过多数投票选择答案，从而在保持或提升性能的同时显著减少计算资源消耗。

链接: https://arxiv.org/abs/2505.17813
作者: Michael Hassid,Gabriel Synnaeve,Yossi Adi,Roy Schwartz
机构: FAIR Team, Meta; The Hebrew University of Jerusalem
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Under review

点击查看摘要

Abstract:Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive “thinking” chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains results in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer “thinking” does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
zh

[NLP-49] DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors

【速读】：该论文试图解决大型语言模型（Large-language-model, LLM）在主动、目标驱动的交互中表现不佳的问题，主要挑战包括短视的解码和高昂的规划成本。解决方案的关键在于引入DialogXpert框架，该框架利用冻结的LLM生成每轮少量高质量的候选动作，并通过基于固定BERT嵌入的紧凑Q网络，采用时序差分学习选择最优动作，从而在简化空间内实现高效决策。此外，DialogXpert通过跟踪用户情绪，使每个决策既能推进任务，又能建立真实的情感连接。

链接: https://arxiv.org/abs/2505.17795
作者: Tazeek Bin Abdur Rakib,Ambuj Mehrish,Lay-Ki Soon,Wern Han Lim,Soujanya Poria
机构: Monash University Malaysia (莫纳什大学马来西亚校区); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user’s emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under 3 turns with success rates exceeding 94% and, with a larger LLM prior, pushes success above 97% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at this https URL
zh

[NLP-50] Compression Hacking: A Supplementary Perspective on Informatics Metric of Language Models from Geometric Distortion

【速读】：该论文试图解决语言模型（Language Model, LM）在压缩过程中出现的“压缩欺骗”（Compression Hacking）问题，即高压缩率可能通过牺牲空间均匀性而产生虚假的高效表示，导致模型的几何结构退化为高度各向异性，从而影响其理解指令的能力和整体性能。解决方案的关键在于提出三种改进的压缩度量方法，通过整合几何失真分析，并将其集成到自评估流程中，以更准确地反映模型的综合能力，最终实现了与模型性能的高度相关性。

链接: https://arxiv.org/abs/2505.17793
作者: Jianxiang Zang,Meiling Ning,Yongda Wei,Shihan Dou,Jiazheng Zhang,Nijia Mo,Binhong Li,Tao Gui,Qi Zhang,Xuanjing Huang
机构: Fudan University (复旦大学); Beijing University of Posts and Telecommunications (北京邮电大学); George Mason University (乔治·梅森大学); Shanghai University of International Business and Economics (上海对外经贸大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州)）
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, the concept of compression as intelligence'' has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the Compression Hacking’’ in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM’s comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.
zh

[NLP-51] EXECUTE: A Multilingual Benchmark for LLM Token Understanding ACL2025

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在多语言环境下对字符理解能力不足的问题，尤其是在不同书写系统中的表现差异。其解决方案的关键在于提出一个简化的框架——EXECUTE，该框架能够便捷地扩展至任何语言，从而支持对多种语言的评估，并揭示不同语言中LLMs面临的挑战可能并非仅限于字符层面，而是可能体现在词级处理或完全不存在问题。此外，研究还通过分析中文、日文和韩文的子字符任务，进一步评估LLMs对字符组成部分的理解能力。

链接: https://arxiv.org/abs/2505.17784
作者: Lukas Edman,Helmut Schmid,Alexander Fraser
机构: TU Munich (慕尼黑工业大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Munich Data Science Institute (慕尼黑数据科学研究所)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.
zh

[NLP-52] he Real Barrier to LLM Agent Agent Usability is Agentic ROI

【速读】：该论文试图解决大型语言模型（Large Language Model, LLM）代理在实际应用中面临的关键可用性问题，即尽管其在专业高复杂度任务中表现出色，但在高需求的大众市场应用中仍存在广泛的采用障碍。解决方案的关键在于提出一种以效用为导向的视角，通过评估代理的总体代理投资回报率（Agentic ROI）来衡量其价值，而非仅仅优化模型性能。作者认为，影响Agentic ROI的核心因素包括信息质量、代理时间和成本，并主张通过先扩大规模提升信息质量，再缩小规模以降低时间和成本的“zigzag”发展路径，从而弥合当前的可用性差距。

链接: https://arxiv.org/abs/2505.17767
作者: Weiwen Liu,Jiarui Qin,Xu Huang,Xingshan Zeng,Yunjia Xi,Jianghao Lin,Chuhan Wu,Yasheng Wang,Lifeng Shang,Ruiming Tang,Defu Lian,Yong Yu,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. Despite the widespread application in specialized, high-effort tasks like coding and scientific research, we highlight a critical usability gap in high-demand, mass-market applications. This position paper argues that the limited real-world adoption of LLM agents stems not only from gaps in model capabilities, but also from a fundamental tradeoff between the value an agent can provide and the costs incurred during real-world use. Hence, we call for a shift from solely optimizing model performance to a broader, utility-driven perspective: evaluating agents through the lens of the overall agentic return on investment (Agent ROI). By identifying key factors that determine Agentic ROI–information quality, agent time, and cost–we posit a zigzag development trajectory in optimizing agentic ROI: first scaling up to improve the information quality, then scaling down to minimize the time and cost. We outline the roadmap across different development stages to bridge the current usability gaps, aiming to make LLM agents truly scalable, accessible, and effective in real-world contexts.
zh

[NLP-53] Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLM s IJCAI2025

【速读】：该论文旨在解决在存在冲突证据的情况下，检索增强生成（Retrieval-Augmented Generation, RAG）模型在事实核查任务中的可靠性下降问题。其关键解决方案是将媒体背景信息，尤其是来源可信度，整合到检索和生成阶段，从而提升RAG模型处理冲突证据的能力并改善事实核查性能。

链接: https://arxiv.org/abs/2505.17762
作者: Ziyu Ge,Yuhao Wu,Daniel Wai Kit Chin,Roy Ka-Wei Lee,Rui Cao
机构: Singapore University of Technology and Design (新加坡科技设计大学); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Camera-ready for IJCAI 2025, AI and Social Good

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce \textbfCONFACT (\textbfConflicting Evidence for \textbfFact-Checking) (Dataset available at this https URL), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
zh

[NLP-54] Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

【速读】：该论文试图解决多语言语言模型如何表征语言身份（形式）和语义内容（意义）的问题，其解决方案的关键在于引入了一种无需训练的ABX风格区分任务，该任务受语音处理启发，用于测量表示中微小差异是否可被可靠检测，从而提供一种灵活且可解释的替代探针方法。

链接: https://arxiv.org/abs/2505.17747
作者: Maureen de Seyssel,Jie Chi,Skyler Seto,Maartje ter Hoeve,Masha Fedzechkina,Natalie Schluter
机构: Apple(苹果); Technical University of Denmark(丹麦技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.
zh

[NLP-55] Fast Quiet-STaR: Thinking Without Thought Tokens

【速读】：该论文试图解决大规模语言模型在复杂推理任务中性能提升受限的问题，特别是在模型规模和训练数据扩展之外如何进一步优化推理能力。其解决方案的关键在于提出一种更高效的推理框架——Fast Quiet STaR，该方法通过基于课程学习的训练策略逐步减少思考标记（thought tokens）的数量，使模型能够内化更抽象和简洁的推理过程，同时结合强化学习微调将其扩展至标准的下一个标记预测（Next Token Prediction, NTP）设置，从而在推理阶段无需显式生成思考标记，显著降低了计算成本并保持了较高的推理准确性。

链接: https://arxiv.org/abs/2505.17746
作者: Wei Huang,Yizhe Xiong,Xin Ye,Zhijie Deng,Hui Chen,Zijia Lin,Guiguang Ding
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); Kuaishou Technology (快手科技); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains particularly in complex reasoning tasks require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet STaR significantly improves reasoning by generating token-level thought traces, but incurs substantial inference overhead. In this work, we propose Fast Quiet STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9% on Mistral 7B and 5.7% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at this https URL.
zh

[NLP-56] he Pilot Corpus of the English Semantic Sketches

【速读】：该论文试图解决如何为英语动词创建语义草图（semantic sketch）的问题，旨在通过英俄语义草图对来展示语义草图在对比研究中的应用价值。其解决方案的关键在于分析具有相似语义的语义草图之间的跨语言差异，并探讨构建语义草图的过程及可能出现的错误，从而揭示语义草图的语言学本质。

链接: https://arxiv.org/abs/2505.17733
作者: Maria Petrova,Maria Ponomareva,Alexandra Ivoylova
机构: ABBYY(ABBYY); HSE(高等经济大学); RSUH(俄罗斯国立人文大学); MIPT(莫斯科物理技术学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The paper is devoted to the creation of the semantic sketches for English verbs. The pilot corpus consists of the English-Russian sketch pairs and is aimed to show what kind of contrastive studies the sketches help to conduct. Special attention is paid to the cross-language differences between the sketches with similar semantics. Moreover, we discuss the process of building a semantic sketch, and analyse the mistakes that could give insight to the linguistic nature of sketches.
zh

[NLP-57] PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization DATE

【速读】：该论文旨在解决传统近端策略优化（Proximal Policy Optimization, PPO）在策略梯度方法中因静态信任区域导致的脆弱性权衡问题，即过度剪切会抑制早期探索，而后期更新则可能破坏收敛稳定性。其解决方案的关键在于提出PPO-BR，通过将探索与收敛信号融合到一个受约束的信任区域中，实现了自适应强化学习（adaptive RL）的新范式，从而在保持收敛稳定性的同时提升探索效率。

链接: https://arxiv.org/abs/2505.17714
作者: Ben Rahman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This manuscript builds upon an earlier version posted to TechRxiv. This arXiv version includes an updated comparison with GRPO (Group Relative Policy Optimization)

点击查看摘要

Abstract:Despite Proximal Policy Optimization (PPO) dominating policy gradient methods – from robotic control to game AI – its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR establishes a new paradigm in adaptive RL by fusing exploration and convergence signals into a single bounded trust region – a theoretically grounded innovation that outperforms five SOTA baselines with less than 2% overhead. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery within a single adaptive mechanism. PPO-BR achieves 29.1% faster convergence by combining: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR achieves 29.1% faster convergence (p 0.001), 2.3x lower reward variance than PPO, and less than 1.8% runtime overhead with only five lines of code change. PPO-BR’s simplicity and theoretical guarantees make it ready-to-deploy in safety-critical domains – from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.
zh

[NLP-58] Understanding How Value Neurons Shape the Generation of Specified Values in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在社会应用中与普遍伦理原则对齐的问题，特别是其内部价值表示的不透明性。现有方法在系统解释价值如何编码于神经架构方面存在局限，主要受限于侧重表面判断而非机制分析的数据集。论文提出的解决方案关键在于引入ValueLocate框架，该框架基于Schwartz价值调查，构建了ValueInsight数据集，通过现实中的行为情境操作化四个普遍价值维度，并开发了一种基于激活差异的神经元识别方法，从而精确定位与价值相关的神经元，无需依赖计算密集型归因方法。

链接: https://arxiv.org/abs/2505.17712
作者: Yi Su,Jiayi Zhang,Shu Yang,Xinhai Wang,Lijie Hu,Di Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Rapid integration of large language models (LLMs) into societal applications has intensified concerns about their alignment with universal ethical principles, as their internal value representations remain opaque despite behavioral alignment advancements. Current approaches struggle to systematically interpret how values are encoded in neural architectures, limited by datasets that prioritize superficial judgments over mechanistic analysis. We introduce ValueLocate, a mechanistic interpretability framework grounded in the Schwartz Values Survey, to address this gap. Our method first constructs ValueInsight, a dataset that operationalizes four dimensions of universal value through behavioral contexts in the real world. Leveraging this dataset, we develop a neuron identification method that calculates activation differences between opposing value aspects, enabling precise localization of value-critical neurons without relying on computationally intensive attribution methods. Our proposed validation method demonstrates that targeted manipulation of these neurons effectively alters model value orientations, establishing causal relationships between neurons and value representations. This work advances the foundation for value alignment by bridging psychological value frameworks with neuron analysis in LLMs.
zh

[NLP-59] SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus

【速读】：该论文试图解决语义草图（semantic sketch）的机器处理问题，旨在通过构建一个开放语料库并开发相应的处理工具来促进这一领域的研究。其解决方案的关键在于组织了SemSketches-2021共享任务，参与者需根据提供的匿名语义草图和包含必要谓词的上下文集，将合适的上下文分配给对应的草图，从而探索语义草图与上下文之间的匹配机制。

链接: https://arxiv.org/abs/2505.17704
作者: Maria Ponomareva,Maria Petrova,Julia Detkova,Oleg Serikov,Maria Yarova
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The paper deals with elaborating different approaches to the machine processing of semantic sketches. It presents the pilot open corpus of semantic sketches. Different aspects of creating the sketches are discussed, as well as the tasks that the sketches can help to solve. Special attention is paid to the creation of the machine processing tools for the corpus. For this purpose, the SemSketches-2021 Shared Task was organized. The participants were given the anonymous sketches and a set of contexts containing the necessary predicates. During the Task, one had to assign the proper contexts to the corresponding sketches.
zh

[NLP-60] COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection

【速读】：该论文旨在解决大规模语言模型在推理过程中因模型规模增大而导致的计算效率低下问题。其解决方案的关键在于提出一种基于FFNN层内部下投影矩阵线性组合的全局稀疏性假设，并据此设计了两种方法：M-COUNTDOWN和D-COUNTDOWN，分别利用间接系数和直接系数来实现计算量的显著减少，从而在保持较高性能的前提下提升计算效率。

链接: https://arxiv.org/abs/2505.17701
作者: Jaewon Cheon,Pilsung Kang
机构: Korea University (韩国科学技术院); Seoul National University (首尔国立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The growing size of large language models has created significant computational inefficiencies. To address this challenge, sparse activation methods selectively deactivates non-essential parameters during inference, reducing computational costs in FFNN layers. While existing methods focus on non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN layer lies globally in the form of a linear combination over its internal down projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN, leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct coefficients of the linear combination. Experimental results demonstrate that D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5% ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4% better performance preservation compared to existing methods. Our specialized kernel implementations effectively realize these theoretical gains into substantial real-world acceleration.
zh

[NLP-61] Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

【速读】：该论文试图解决在大型语言模型（Large Language Models, LLMs）中高效激发长链式思维（Long Chain-of-Thought, CoT）能力的问题，传统方法通常依赖于昂贵的强化学习或高质量蒸馏数据的监督微调。论文的解决方案之关键在于发现最后几层中少量高影响的激活值主导了长文本推理属性，如输出长度和自我反思能力。通过简单地增强这些激活值并插入“wait”标记，可以在无需任何训练的情况下激发长CoT能力，从而显著提升自我反思率和准确性。此外，论文提出了一种无需训练的激活控制技术，利用少量对比示例识别关键激活，并在推理时使用简单的解析函数调节其值以激发长CoT。

链接: https://arxiv.org/abs/2505.17697
作者: Zekai Zhao,Qi Liu,Kun Zhou,Zihan Liu,Yifei Shao,Zhiting Hu,Biwei Huang
机构: University of California, San Diego. (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the remarkable reasoning performance, eliciting the long chain-of-thought (CoT) ability in large language models (LLMs) typically requires costly reinforcement learning or supervised fine-tuning on high-quality distilled data. We investigate the internal mechanisms behind this capability and show that a small set of high-impact activations in the last few layers largely governs long-form reasoning attributes, such as output length and self-reflection. By simply amplifying these activations and inserting “wait” tokens, we can invoke the long CoT ability without any training, resulting in significantly increased self-reflection rates and accuracy. Moreover, we find that the activation dynamics follow predictable trajectories, with a sharp rise after special tokens and a subsequent exponential decay. Building on these insights, we introduce a general training-free activation control technique. It leverages a few contrastive examples to identify key activations, and employs simple analytic functions to modulate their values at inference time to elicit long CoTs. Extensive experiments confirm the effectiveness of our method in efficiently eliciting long CoT reasoning in LLMs and improving their performance. Additionally, we propose a parameter-efficient fine-tuning method that trains only a last-layer activation amplification module and a few LoRA layers, outperforming full LoRA fine-tuning on reasoning benchmarks with significantly fewer parameters. Our code and data are publicly released.
zh

[NLP-62] ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction

【速读】：该论文试图解决在使用大型语言模型（Large Language Models, LLMs）作为开放性任务评估者时，成对比较中非传递性偏好（non-transitive preferences）的问题，即评估者在A与B、B与C的比较中表现出偏好顺序A>B和B>C，但在A与C的比较中却偏好C>A，这种不一致现象会降低评估的可靠性。解决方案的关键在于提出一种基于图论的框架，通过将成对偏好建模为竞赛图（tournament graphs）来分析和缓解该问题，并引入有向图结构熵（directed graph structural entropy）量化非传递性程度。此外，设计了一种过滤策略ELSPR，用于剔除引发非传递性的偏好数据，保留一致且传递的偏好数据用于模型微调，从而有效减少非传递性并提升评估的一致性与人类评估者的匹配度。

链接: https://arxiv.org/abs/2505.17691
作者: Yan Yu,Yilun Liu,Minggui He,Shimin Tao,Weibin Meng,Xinhua Yang,Li Zhang,Hongxia Ma,Chang Su,Hao Yang,Fuliang Li
机构: Northeastern University (东北大学); Huawei (华为)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely used as evaluators for open-ended tasks, while previous research has emphasized biases in LLM evaluations, the issue of non-transitivity in pairwise comparisons remains unresolved: non-transitive preferences for pairwise comparisons, where evaluators prefer A over B, B over C, but C over A. Our results suggest that low-quality training data may reduce the transitivity of preferences generated by the Evaluator LLM. To address this, We propose a graph-theoretic framework to analyze and mitigate this problem by modeling pairwise preferences as tournament graphs. We quantify non-transitivity and introduce directed graph structural entropy to measure the overall clarity of preferences. Our analysis reveals significant non-transitivity in advanced Evaluator LLMs (with Qwen2.5-Max exhibiting 67.96%), as well as high entropy values (0.8095 for Qwen2.5-Max), reflecting low overall clarity of preferences. To address this issue, we designed a filtering strategy, ELSPR, to eliminate preference data that induces non-transitivity, retaining only consistent and transitive preference data for model fine-tuning. Experiments demonstrate that models fine-tuned with filtered data reduce non-transitivity by 13.78% (from 64.28% to 50.50%), decrease structural entropy by 0.0879 (from 0.8113 to 0.7234), and align more closely with human evaluators (human agreement rate improves by 0.6% and Spearman correlation increases by 0.01).
zh

[NLP-63] uning Language Models for Robust Prediction of Diverse User Behaviors

【速读】：该论文试图解决深度学习模型在预测用户行为时难以捕捉长尾行为的问题（long-tailed behaviors），尽管大型语言模型（Large language models, LLMs）具有丰富的行为知识，但现有微调方法容易过拟合到高频的“锚点”行为（anchor behaviors），从而削弱了对低频“尾部”行为（tail behaviors）的预测能力。解决方案的关键在于提出一种渐进式微调方法——BehaviorLM，其第一阶段在保留通用行为知识的同时对锚点行为进行微调，第二阶段则基于样本难度使用所有行为的平衡子集进行微调，以提升尾部行为的预测性能而不牺牲锚点行为的表现。

链接: https://arxiv.org/abs/2505.17682
作者: Fanjin Meng,Jingtao Ding,Jiahui Gong,Chen Yang,Hong Chen,Zuojian Wang,Haisheng Lu,Yong Li
机构: Tsinghua University (清华大学); Honor Device Co., Ltd. (荣耀终端有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Predicting user behavior is essential for intelligent assistant services, yet deep learning models often struggle to capture long-tailed behaviors. Large language models (LLMs), with their pretraining on vast corpora containing rich behavioral knowledge, offer promise. However, existing fine-tuning approaches tend to overfit to frequent anchor'' behaviors, reducing their ability to predict less common tail’’ behaviors. In this paper, we introduce BehaviorLM, a progressive fine-tuning approach that addresses this issue. In the first stage, LLMs are fine-tuned on anchor behaviors while preserving general behavioral knowledge. In the second stage, fine-tuning uses a balanced subset of all behaviors based on sample difficulty to improve tail behavior predictions without sacrificing anchor performance. Experimental results on two real-world datasets demonstrate that BehaviorLM robustly predicts both anchor and tail behaviors and effectively leverages LLM behavioral knowledge to master tail behavior prediction with few-shot examples.
zh

[NLP-64] MIDB: Multilingual Instruction Data Booster for Enhancing Multilingual Instruction Synthesis

【速读】：该论文试图解决多语言合成指令数据中存在的质量问题，这些问题主要源于英语合成数据通过机器翻译（Machine Translation, MT）转换到其他语言时引入的缺陷以及目标语言的本地化不足。解决方案的关键在于提出MIDB（Multilingual Instruction Data Booster），该方法通过在16种语言的约36.8k条修订示例上训练，由人类语言专家进行优化，从而修复内容错误、MT缺陷，并提升合成数据的本地化水平。

链接: https://arxiv.org/abs/2505.17671
作者: Yilun Liu,Chunguang Zhao,Xinhua Yang,Hongyong Zeng,Shimin Tao,Weibin Meng,Minggui He,Chang Su,Yan Yu,Hongxia Ma,Li Zhang,Daimeng Wei,Hao Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite doubts on data quality, instruction synthesis has been widely applied into instruction tuning (IT) of LLMs as an economic and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in these English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and face insufficient localization of the target languages. In this paper, we propose MIDB, a Multilingual Instruction Data Booster to automatically address the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages by human linguistic experts, thereby can boost the low-quality data by addressing content errors and MT defects, and improving localization in these synthesized data. Both automatic and human evaluation indicate that not only MIDB steadily improved instruction data quality in 16 languages, but also the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data were significantly enhanced.
zh

[NLP-65] Qwen Long-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

【速读】：该论文旨在解决如何通过强化学习（Reinforcement Learning, RL）使大型推理模型（Large Reasoning Models, LRMs）有效处理和推理长上下文输入这一关键挑战。现有研究主要集中在短上下文推理任务中，而长上下文场景下的训练效率低下和优化过程不稳定是当前的主要问题。解决方案的关键在于提出QwenLong-L1框架，该框架通过渐进式上下文扩展策略，结合预训练阶段的监督微调、课程引导的分阶段强化学习以及难度感知的回顾采样策略，提升模型在长上下文任务中的性能与稳定性。

链接: https://arxiv.org/abs/2505.17667
作者: Fanqi Wan,Weizhou Shen,Shengyi Liao,Yingcheng Shi,Chenliang Li,Ziyi Yang,Ji Zhang,Fei Huang,Jingren Zhou,Ming Yan
机构: 未知
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within the short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a difficulty-aware retrospective sampling strategy to incentivize the policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B, achieving performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.
zh

[NLP-66] owards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States ACL2025

【速读】：该论文试图解决当前大型语言模型（Large Language Models, LLMs）在评估其心智理论（Theory of Mind, ToM）能力时存在的不足，特别是对动态心理状态演变的跟踪能力。现有基准主要关注静态心理状态的评估，未能反映现实社会互动中的时间演化特性。解决方案的关键是提出一个名为\textscDynToM的新基准，专门用于评估LLMs在相互关联的情境中理解和跟踪心理状态随时间变化的能力，通过系统化的四步框架生成大量高质量的社会情境和问题，以更真实地反映人类社会交互的动态性。

链接: https://arxiv.org/abs/2505.17663
作者: Yang Xiao,Jiashuo Wang,Qiancheng Xu,Changhe Song,Chunpu Xu,Yi Cheng,Wenjie Li,Pengfei Liu
机构: The Hong Kong Polytechnic University (香港理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted by ACL 2025 Main Conference

点击查看摘要

Abstract:As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present \textscDynToM, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.
zh

[NLP-67] oo Consistent to Detect: A Study of Self-Consistent Errors in LLM s EMNLP25

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）生成内容中存在的一种特定错误类型——自一致错误（self-consistent errors），即LLMs在多次随机采样中重复生成相同的错误响应。现有检测方法在面对此类错误时表现不佳，无法有效识别。该研究提出的关键解决方案是基于跨模型探测方法（cross-model probe method），通过融合外部验证器LLM的隐藏状态证据，提升对自一致错误的检测性能。

链接: https://arxiv.org/abs/2505.17656
作者: Hexiang Tan,Fei Sun,Sha Liu,Du Su,Qi Cao,Xin Chen,Jingang Wang,Xunliang Cai,Yuanzhuo Wang,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Underreview in EMNLP25

点击查看摘要

Abstract:As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term as self-consistent error, where LLMs repeatly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methshods significantly struggle to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improved methods. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
zh

[NLP-68] EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

【速读】：该论文试图解决电商平台上基于生成式 AI (Generative AI) 的内容检测系统在面对隐蔽违规内容时的检测能力不足问题，这类内容表面上符合平台政策，但暗含禁止的主张。解决方案的关键在于提出 EVADE，这是一个首个针对中文电商场景设计的多模态基准数据集，包含大量标注的文本和图像样本，并通过两种互补任务（Single-Violation 和 All-in-One）评估模型在细粒度和长上下文推理方面的能力，从而为评估和提升模型对隐蔽违规内容的检测性能提供标准。

链接: https://arxiv.org/abs/2505.17654
作者: Ancheng Xu,Zhihao Yang,Jingpeng Li,Guanghu Yuan,Longze Chen,Liang Yan,Jiehui Zhou,Zhen Qin,Hengyun Chang,Hamid Alinejad-Rokny,Bo Zheng,Min Yang
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); University of Chinese Academy of Sciences (中国科学院大学); Tongji University (同济大学); School of Biomedical Engineering, UNSW Sydney (新南威尔士大学生物医学工程学院); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at this https URL.
zh

[NLP-69] HoloLLM : Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning

【速读】：该论文旨在解决嵌入式智能体在智能家居环境中通过多模态感知理解人类行为并进行自然语言交互的挑战，特别是传统视觉-语言模型（Vision-Language Models, VLMs）因依赖视觉数据而在遮挡、光照不良或隐私限制等实际场景中表现受限的问题。其解决方案的关键在于提出一种多模态大语言模型（Multimodal Large Language Model, MLLM）——HoloLLM，该模型融合了LiDAR、红外、毫米波雷达和WiFi等非传统但高效的传感模态，以实现跨异构环境的无缝人类感知与推理。为应对稀疏的对齐模态-文本数据及物理信号表示的异质性问题，研究设计了通用模态注入投影器（Universal Modality-Injection Projector, UMIP），通过粗到细的跨注意力机制增强预对齐模态嵌入，同时引入人机协作的数据标注流程生成配对文本注释，从而显著提升了语言引导的人类感知准确性。

链接: https://arxiv.org/abs/2505.17645
作者: Chuhao Zhou,Jianfei Yang
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 18 pages, 13 figures, 6 tables

点击查看摘要

Abstract:Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.
zh

[NLP-70] Bridging Electronic Health Records and Clinical Texts: Contrastive Learning for Enhanced Clinical Tasks

【速读】：该论文试图解决传统机器学习模型在需要深层次上下文理解的临床预测任务中表现不佳的问题，例如30天内医院再入院预测，这主要是由于结构化电子健康记录（EHR）数据中语义信息有限。解决方案的关键在于提出一种深度多模态对比学习（CL）框架，该框架通过将结构化EHR数据的潜在表示与非结构化的出院摘要文本对齐，从而增强模型对上下文的理解能力，具体表现为通过拉近配对的EHR和文本嵌入并推远未配对的嵌入来实现。

链接: https://arxiv.org/abs/2505.17643
作者: Sara Ketabi,Dhanesh Ramachandram
机构: Vector Institute, University of Toronto, The Hospital for Sick Children (向量研究所，多伦多大学，儿童医院); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conventional machine learning models, particularly tree-based approaches, have demonstrated promising performance across various clinical prediction tasks using electronic health record (EHR) data. Despite their strengths, these models struggle with tasks that require deeper contextual understanding, such as predicting 30-day hospital readmission. This can be primarily due to the limited semantic information available in structured EHR data. To address this limitation, we propose a deep multimodal contrastive learning (CL) framework that aligns the latent representations of structured EHR data with unstructured discharge summary notes. It works by pulling together paired EHR and text embeddings while pushing apart unpaired ones. Fine-tuning the pretrained EHR encoder extracted from this framework significantly boosts downstream task performance, e.g., a 4.1% AUROC enhancement over XGBoost for 30-day readmission prediction. Such results demonstrate the effect of integrating domain knowledge from clinical notes into EHR-based pipelines, enabling more accurate and context-aware clinical decision support systems.
zh

[NLP-71] Stereotype Detection in Natural Language Processing

【速读】：该论文试图解决社会感知中刻板印象（stereotype）的检测问题，这一问题在人工智能领域尚处于新兴研究阶段，但具有重要的社会影响。研究通过半自动文献综述方法，基于Semantic Scholar检索并筛选了2000年至2025年间超过6000篇相关论文，分析了定义、趋势、方法论及挑战。其解决方案的关键在于强调刻板印象检测作为早期监测工具的潜力，以防止偏见升级和仇恨言论的蔓延，并提出未来研究需采用更广泛、多语言及交叉性（intersectional）的方法。

链接: https://arxiv.org/abs/2505.17642
作者: Alessandra Teresa Cignarella,Anastasia Giachanou,Els Lefever
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Stereotypes influence social perceptions and can escalate into discrimination and violence. While NLP research has extensively addressed gender bias and hate speech, stereotype detection remains an emerging field with significant societal implications. In this work is presented a survey of existing research, analyzing definitions from psychology, sociology, and philosophy. A semi-automatic literature review was performed by using Semantic Scholar. We retrieved and filtered over 6,000 papers (in the year range 2000-2025), identifying key trends, methodologies, challenges and future directions. The findings emphasize stereotype detection as a potential early-monitoring tool to prevent bias escalation and the rise of hate speech. Conclusions highlight the need for a broader, multilingual, and intersectional approach in NLP studies.
zh

[NLP-72] Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis

【速读】：该论文旨在解决当前AI安全基准测试在衡量大语言模型（Large Language Models, LLMs）时存在的覆盖范围不一致和语义独立性不足的问题。其解决方案的关键在于通过UMAP降维和k-means聚类方法，识别出不同基准中的主要危害类别，并量化基准之间的语义正交性，从而揭示各基准在覆盖范围上的差异，为更全面地应对未来AI使用中的潜在危害提供有针对性的数据集开发框架。

链接: https://arxiv.org/abs/2505.17636
作者: Jonathan Bennion,Shaona Ghosh,Mantek Singh,Nouha Dziri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6th International Conference on Advanced Natural Language Processing (AdNLP 2025), May 17 ~ 18, 2025, Zurich, Switzerland

点击查看摘要

Abstract:Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and kmeans clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distribution suggests confounds to data collection and interpretations of harm as well as offer possible context. Our analysis quantifies benchmark orthogonality among AI benchmarks, allowing for transparency in coverage gaps despite topical similarities. Our quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however that is defined in the future.
zh

[NLP-73] GIM: Improved Interpretability for Large Language Models

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中忠实可解释性的问题，特别是由于自修复（self-repair）现象导致的组件重要性被低估的问题。自修复是指网络通过增强其他组件来补偿某一组件信号的减少，从而掩盖了被消融组件的真实重要性。传统消融和基于梯度的方法因此无法准确评估贡献于注意力分数的各个组件的重要性。该论文提出的解决方案关键在于引入梯度交互修改（Gradient Interaction Modifications, GIM），该技术在反向传播过程中考虑了自修复效应，从而显著提高了现有电路识别和特征归因方法的忠实性。

链接: https://arxiv.org/abs/2505.17630
作者: Joakim Edin,Róbert Csordás,Tuukka Ruotsalo,Zhengxuan Wu,Maria Maistro,Jing Huang,Lars Maaløe
机构: Corti(科蒂); Stanford University (斯坦福大学); Copenhagen University (哥本哈根大学); LUT University (拉彭兰塔理工大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at this https URL.
zh

[NLP-74] Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports

【速读】：该论文试图解决在复杂文档布局中，尤其是金融领域如证券报告中的表格结构理解问题，这需要高精度的问答能力。现有方法在处理不同格式（如HTML、图像和纯文本）的表格时存在结构信息难以保留和提取的挑战。解决方案的关键在于通过引入表内文本内容和版面特征来增强基于大型视觉语言模型（LVLM）的表格理解能力，实验结果表明这些辅助模态显著提升了模型在非结构化输入格式下的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2505.17625
作者: Hayato Aida,Kosuke Takahashi,Takahiro Omi
机构: Stockmark(Stockmark)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IIAI AAI 2025, the 3rd International Conference on Computational and Data Sciences in Economics and Finance

点击查看摘要

Abstract:With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats-including HTML, images, and plain text-making it difficult to preserve and extract structural information. Therefore, multimodal LLMs are essential for robust and general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), which are major representatives of multimodal LLMs, still face challenges in accurately understanding characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.
zh

[NLP-75] Runaway is Ashamed But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

【速读】：该论文试图解决基于大语言模型（Large Language Models, LLMs）的智能体在复杂具身环境中进行多轮交互时存在的低效问题，例如陷入重复循环或发出无效指令，导致计算资源的冗余消耗。解决方案的关键在于探索早期退出（early-exit）行为，通过两种互补方法实现：一种是内在方法，在生成过程中注入退出指令；另一种是外在方法，通过验证任务完成情况来决定何时终止智能体的尝试。

链接: https://arxiv.org/abs/2505.17616
作者: Qingyu Lu,Liang Ding,Siyi Cao,Xuebo Liu,Kanjian Zhang,Jinxia Zhang,Dacheng Tao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Agents powered by large language models (LLMs) have demonstrated strong planning and decision-making capabilities in complex embodied environments. However, such agents often suffer from inefficiencies in multi-turn interactions, frequently trapped in repetitive loops or issuing ineffective commands, leading to redundant computational overhead. Instead of relying solely on learning from trajectories, we take a first step toward exploring the early-exit behavior for LLM-based agents. We propose two complementary approaches: 1. an \textbfintrinsic method that injects exit instructions during generation, and 2. an \textbfextrinsic method that verifies task completion to determine when to halt an agent’s trial. To evaluate early-exit mechanisms, we introduce two metrics: one measures the reduction of \textbfredundant steps as a positive effect, and the other evaluates \textbfprogress degradation as a negative effect. Experiments with 4 different LLMs across 5 embodied environments show significant efficiency improvements, with only minor drops in agent performance. We also validate a practical strategy where a stronger agent assists after an early-exit agent, achieving better performance with the same total steps. We will release our code to support further research.
zh

[NLP-76] Large language model as user daily behavior data generator: balancing population diversity and individual personality

【速读】：该论文试图解决人类日常行为预测中因常规模式复杂性和短期波动带来的挑战，以及数据驱动模型对敏感大规模用户数据的依赖所引发的隐私问题和数据可用性限制。其解决方案的关键在于引入BehaviorGen框架，该框架利用大语言模型（Large Language Models, LLMs）生成高质量的合成行为数据，通过模拟用户资料和真实事件来支持行为预测模型中的数据增强与替换，从而在保证隐私的前提下提升行为建模效果。

链接: https://arxiv.org/abs/2505.17615
作者: Haoxin Li,Jingtao Ding,Jiahui Gong,Yong Li
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 14 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Predicting human daily behavior is challenging due to the complexity of routine patterns and short-term fluctuations. While data-driven models have improved behavior prediction by leveraging empirical data from various platforms and devices, the reliance on sensitive, large-scale user data raises privacy concerns and limits data availability. Synthetic data generation has emerged as a promising solution, though existing methods are often limited to specific applications. In this work, we introduce BehaviorGen, a framework that uses large language models (LLMs) to generate high-quality synthetic behavior data. By simulating user behavior based on profiles and real events, BehaviorGen supports data augmentation and replacement in behavior prediction models. We evaluate its performance in scenarios such as pertaining augmentation, fine-tuning replacement, and fine-tuning augmentation, achieving significant improvements in human mobility and smartphone usage predictions, with gains of up to 18.9%. Our results demonstrate the potential of BehaviorGen to enhance user behavior modeling through flexible and privacy-preserving synthetic data generation.
zh

[NLP-77] MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

【速读】：该论文试图解决多模态生成任务中自动评估难以与人类评价一致的问题，特别是在涉及多种模态的复杂任务中。解决方案的关键在于提出MMM（Multimodal Generation）基准，该基准通过结合模型和程序实现可靠的自动评估，同时涵盖49个任务（包括29个新开发的任务）和937条指令，以系统评估多模态生成模型的推理能力、可控性等关键性能，其与人类评价的平均一致性达到94.3%。

链接: https://arxiv.org/abs/2505.17613
作者: Jihan Yao,Yushi Hu,Yujie Yi,Bin Han,Shangbin Feng,Guang Yang,Bingbing Wen,Ranjay Krishna,Lucy Lu Wang,Yulia Tsvetkov,Noah A. Smith,Banghua Zhu
机构: University of Washington (华盛顿大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
zh

[NLP-78] Distilling LLM Agent into Small Models with Retrieval and Code Tools

【速读】：该论文试图解决小语言模型（sLMs）在需要罕见事实知识或精确计算的场景中，由于能力受限而产生幻觉的问题。传统方法通过链式思维（CoT）轨迹将大语言模型（LLMs）的推理能力蒸馏到sLMs中，但在复杂任务中效果有限。论文提出的解决方案是Agent Distillation框架，其关键在于不仅转移推理能力，还转移基于LLM的智能体的完整任务求解行为，结合检索和代码工具，提升sLMs的任务处理能力。

链接: https://arxiv.org/abs/2505.17612
作者: Minki Kang,Jongwon Jeong,Seanie Lee,Jaewoong Cho,Sung Ju Hwang
机构: KAIST(韩国科学技术院); KRAFTON(克劳顿); DeepAuto.ai(深度自动人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint, v1

点击查看摘要

Abstract:Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at this https URL.
zh

[NLP-79] Controlled Agent ic Planning Reasoning for Mechanism Synthesis

【速读】：该论文旨在解决平面机构综合（planar mechanism synthesis）中从自然语言描述生成几何和动态结果的问题，其核心挑战在于如何在语言和符号层面进行有效推理。解决方案的关键在于构建一个基于双智能体大型语言模型（dual-agent Large Language Model, LLM）的推理框架，该框架通过引用抽象属性、生成并参数化仿真代码、利用符号回归和距离函数提取反馈锚点，实现语言和符号层面上的可操作性优化循环。此方法在平面机构的上下文中被证明是有效且收敛的。

链接: https://arxiv.org/abs/2505.17607
作者: João Pedro Gandarela,Thiago Rios,Stefan Menzel,André Freitas
机构: Idiap Research Institute(伊迪普研究机构); École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院); Honda Research Institute Europe(本田研究欧洲公司); Department of Computer Science(计算机科学系); University of Manchester(曼彻斯特大学); National Biomarker Centre, CRUK-MI(国家生物标志物中心，CRUK-MI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages, 16 figures

点击查看摘要

Abstract:This work presents a dual-agent Large Language Model (LLM)-based reasoning method for mechanism synthesis, capable of reasoning at both linguistic and symbolic levels to generate geometrical and dynamic outcomes. The model consists of a composition of well-defined functions that, starting from a natural language specification, references abstract properties through supporting equations, generates and parametrizes simulation code, and elicits feedback anchor points using symbolic regression and distance functions. This process closes an actionable refinement loop at the linguistic and symbolic layers. The approach is shown to be both effective and convergent in the context of planar mechanisms. Additionally, we introduce MSynth, a novel benchmark for planar mechanism synthesis, and perform a comprehensive analysis of the impact of the model components. We further demonstrate that symbolic regression prompts unlock mechanistic insights only when applied to sufficiently large architectures.
zh

[NLP-80] Wolf Hidden in Sheeps Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

【速读】：该论文旨在解决现有后门攻击在安全对齐的防护机制下易被检测以及攻击行为可能破坏模型安全性导致隐蔽性下降的问题。其解决方案的关键在于提出一种“干净数据后门攻击”方法，通过将触发词与无害的正面回复前缀进行过拟合，而非直接关联有害响应，从而在推理阶段通过触发词激活良性前缀并利用模型的语言建模能力生成有害响应，实现对大型语言模型的越狱攻击。

链接: https://arxiv.org/abs/2505.17601
作者: Jiawei Kong,Hao Fang,Xiaochen Yang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang,Min Zhang
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Harbin Institute of Technology (哈尔滨工业大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model’s safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel \clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak backdoor various LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o.
zh

[NLP-81] One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中因越狱攻击（jailbreak attacks）导致的安全对齐问题，这类攻击能够使模型生成有害或非预期的内容。现有许多越狱策略难以应对快速发展的防御机制（如防御后缀），从而失效。论文提出的解决方案是引入一种名为ArrAttack的新攻击方法，其关键在于通过一个通用的鲁棒性判断模型，自动生成能够绕过多种防御措施的鲁棒越狱提示。该模型在训练完成后，可对具有多种防御机制的目标模型进行鲁棒性评估，并高效生成有效的攻击提示，从而显著提升攻击效果和跨模型的迁移能力。

链接: https://arxiv.org/abs/2505.17598
作者: Linbao Li,Yannan Liu,Daojing He,Yu Li
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Wuheng Lab, ByteDance (字节跳动实验室); Zhejiang University (浙江大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at this https URL.
zh

[NLP-82] NeUQI: Near-Optimal Uniform Quantization Parameter Initialization

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）在消费级GPU或个人设备上部署时面临的高内存消耗和推理成本问题。其解决方案的关键在于提出NeUQI方法，该方法专注于高效确定统一量化（Uniform Quantization）的近优初始参数，从而提升后训练量化（Post-Training Quantization, PTQ）的效果。NeUQI与现有量化方法正交，可无缝集成，并在多个任务中表现出优于现有方法的性能。

链接: https://arxiv.org/abs/2505.17595
作者: Li Lin,Xinyu Hu,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所，北京大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, under review

点击查看摘要

Abstract:Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on \geq 2 -bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.
zh

[NLP-83] Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation

【速读】：该论文旨在解决大型推理模型（Large Reasoning Models, LRMs）在个性化任务中的性能不足问题，尽管其推理能力有所提升，但在需要大量检索的场景中表现并不优于通用大语言模型（Large Language Models, LLMs）。论文提出的解决方案关键在于提出一种名为\model的框架，该框架通过引入分层推理思维模板来引导LRMs生成结构化输出，并结合推理过程干预方法以强化对设计推理模式的遵循，同时采用交叉引用机制确保结果的一致性，从而有效克服了发散性思维、响应格式不匹配和检索信息利用无效等核心限制。

链接: https://arxiv.org/abs/2505.17571
作者: Sichun Luo,Guanzhi Deng,Jian Xu,Xiaojie Zhang,Hanxu Hou,Linqi Song
机构: Dongguan University of Technology (东莞理工学院); City University of Hong Kong (香港城市大学); Tsinghua University (清华大学); Guangzhou University (广州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalization is a critical task in modern intelligent systems, with applications spanning diverse domains, including interactions with large language models (LLMs). Recent advances in reasoning capabilities have significantly enhanced LLMs, enabling unprecedented performance in tasks such as mathematics and coding. However, their potential for personalization tasks remains underexplored. In this paper, we present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks. Surprisingly, despite generating more tokens, LRMs do not consistently outperform general-purpose LLMs, especially in retrieval-intensive scenarios where their advantages diminish. Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information. To address these challenges, we propose Reinforced Reasoning for Personalization (\model), a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs. Additionally, we introduce a reasoning process intervention method to enforce adherence to designed reasoning patterns, enhancing alignment. We also propose a cross-referencing mechanism to ensure consistency. Extensive experiments demonstrate that our approach significantly outperforms existing techniques. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.17571 [cs.CL] (or arXiv:2505.17571v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.17571 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-84] PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models

【速读】：该论文试图解决表格问答（Table Question Answering, TQA）中模型性能提升的问题，尤其是在缺乏昂贵或人工标注数据的情况下。解决方案的关键在于提出一种基于过程的偏好学习框架（Process-based Preference learning, PPT），该框架将推理过程分解为离散状态，对每个状态进行评分，并采样对比步骤以进行偏好学习，从而有效提升TQA模型的性能。

链接: https://arxiv.org/abs/2505.17565
作者: Wei Zhou,Mohsen Mesgar,Heike Adel,Annemarie Friedrich
机构: Bosch Center for Artificial Intelligence(博世人工智能中心); Hochschule der Medien(媒体学院); University of Augsburg(奥格斯堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Improving large language models (LLMs) with self-generated data has demonstrated success in tasks such as mathematical reasoning and code generation. Yet, no exploration has been made on table question answering (TQA), where a system answers questions based on tabular data. Addressing this gap is crucial for TQA, as effective self-improvement can boost performance without requiring costly or manually annotated data. In this work, we propose PPT, a Process-based Preference learning framework for TQA. It decomposes reasoning chains into discrete states, assigns scores to each state, and samples contrastive steps for preference learning. Experimental results show that PPT effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets, with only 8,000 preference pairs. Furthermore, the resulting models achieve competitive results compared to more complex and larger state-of-the-art TQA systems, while being five times more efficient during inference.
zh

[NLP-85] aching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在检测幻觉（hallucination）时的对齐问题，这一问题由于幻觉文本的复杂性而尤为困难。其解决方案的关键在于利用精心设计的幻觉样本作为负例（negative examples）参与DPO（Direct Preference Optimization）对齐过程，并结合课程学习（curriculum learning）策略，从基于独立事实核查模型概率分数下降最大的简单样本逐步过渡到更复杂的样本，从而实现稳定且渐进的学习过程。

链接: https://arxiv.org/abs/2505.17558
作者: Shrey Pandit,Ashwin Vinod,Liu Leqi,Ying Ding
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code and dataset are available at this https URL

点击查看摘要

Abstract:Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified based on the greatest reduction in probability scores from independent fact checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with curriculum DPO approach and high quality negative samples, significantly improves model performance across various metrics, achieving improvements of upto 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.
zh

[NLP-86] CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning

【速读】：该论文试图解决参数高效微调中混合专家（Mixture-of-Experts, MoE）在异构数据集上表现不佳的问题，具体表现为专家可能学习到相似知识，导致MoE能力未被充分利用。其解决方案的关键在于提出对比表示（Contrastive Representation for MoE, CoMoE），通过在top-k路由中从激活和未激活的专家中采样，并引入对比目标来促进专家的模块化与专业化，从而恢复输入与两类专家之间的互信息差距。

链接: https://arxiv.org/abs/2505.17553
作者: Jinyuan Feng,Chaopeng Wei,Tenghai Qiu,Tianyi Hu,Zhiqiang Pu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); University of Science and Technology Beijing (北京科技大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In parameter-efficient fine-tuning, mixture-of-experts (MoE), which involves specializing functionalities into different experts and sparsely activating them appropriately, has been widely adopted as a promising approach to trade-off between model capacity and computation overhead. However, current MoE variants fall short on heterogeneous datasets, ignoring the fact that experts may learn similar knowledge, resulting in the underutilization of MoE’s capacity. In this paper, we propose Contrastive Representation for MoE (CoMoE), a novel method to promote modularization and specialization in MoE, where the experts are trained along with a contrastive objective by sampling from activated and inactivated experts in top-k routing. We demonstrate that such a contrastive objective recovers the mutual-information gap between inputs and the two types of experts. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE’s capacity and promote modularization among the experts.
zh

[NLP-87] Swedish Whispers; Leverag ing a Massive Speech Corpus for Swedish Speech Recognition INTERSPEECH2025

【速读】：该论文试图解决的是针对中等资源语言瑞典语的语音识别性能提升问题，特别是在多语言训练数据集中小语种通常存在代表性不足的情况下。解决方案的关键在于对现有的多语言Whisper模型进行微调，利用前所未有的大规模和高变异性的瑞典语数据集进行训练，从而显著提升了模型在瑞典语上的表现。

链接: https://arxiv.org/abs/2505.17538
作者: Leonora Vesterbacka,Faton Rekathati,Robin Kurtz,Justyna Sikora,Agnes Toftgård
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2025

点击查看摘要

Abstract:This work presents a suite of fine-tuned Whisper models for Swedish, trained on a dataset of unprecedented size and variability for this mid-resourced language. As languages of smaller sizes are often underrepresented in multilingual training datasets, substantial improvements in performance can be achieved by fine-tuning existing multilingual models, as shown in this work. This work reports an overall improvement across model sizes compared to OpenAI’s Whisper evaluated on Swedish. Most notably, we report an average 47% reduction in WER comparing our best performing model to OpenAI’s whisper-large-v3, in evaluations across FLEURS, Common Voice, and NST.
zh

[NLP-88] How Knowledge Popularity Influences and Enhances LLM Knowledge Boundary Perception

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在面对知识边界时无法准确识别自身知识范围，从而产生自信但错误答案的问题。其解决方案的关键在于通过量化知识流行度（knowledge popularity）来评估LLMs对知识边界的感知能力，并利用这一信号进行置信度校准，从而提升答案正确性预测的准确性。

链接: https://arxiv.org/abs/2505.17537
作者: Shiyu Ni,Keping Bi,Jiafeng Guo,Xueqi Cheng
机构: CAS Key Lab of Network Data Science and Technology, ICT, CAS(中国科学院网络数据科学与技术重点实验室，ICT，中科院); State Key Laboratory of AI Safety(人工智能安全国家重点实验室); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often fail to recognize their knowledge boundaries, producing confident yet incorrect answers. In this paper, we investigate how knowledge popularity affects LLMs’ ability to perceive their knowledge boundaries. Focusing on entity-centric factual question answering (QA), we quantify knowledge popularity from three perspectives: the popularity of entities in the question, the popularity of entities in the answer, and relation popularity, defined as their co-occurrence frequency. Experiments on three representative datasets containing knowledge with varying popularity show that LLMs exhibit better QA performance, higher confidence, and more accurate perception on more popular knowledge, with relation popularity having the strongest correlation. Cause knowledge popularity shows strong correlation with LLMs’ QA performance, we propose to leverage these signals for confidence calibration. This improves the accuracy of answer correctness prediction by an average of 5.24% across all models and datasets. Furthermore, we explore prompting LLMs to estimate popularity without external corpora, which yields a viable alternative.
zh

[NLP-89] Multimodal Conversation Structure Understanding

【速读】：该论文试图解决多模态、多方对话中细粒度对话结构理解的问题，特别是对话角色归属（说话者、接收者、旁听者）和对话线程（话语链接与聚类）的理解能力不足问题。其解决方案的关键在于构建一个包含4,398个说话者和回复关系标注、5,755个接收者标注以及3,142个旁听者标注的人工标注数据集，并基于对话分析和社会语言学理论设计相关任务，以推动对多模态对话结构的深入研究与模型评估。

链接: https://arxiv.org/abs/2505.17536
作者: Kent K. Chang,Mackenzie Hanh Cramer,Anna Ho,Ti Ti Nguyen,Yilin Yuan,David Bamman
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversations are usually structured by roles – who is speaking, who’s being addressed, and who’s listening – and unfold in threads that break with changes in speaker floor or topical focus. While large language models (LLMs) have shown incredible capabilities in dialogue and reasoning, their ability to understand fine-grained conversational structure, especially in multi-modal, multi-party settings, remains underexplored. To address this gap, we introduce a suite of tasks focused on conversational role attribution (speaker, addressees, side-participants) and conversation threading (utterance linking and clustering), drawing on conversation analysis and sociolinguistics. To support those tasks, we present a human annotated dataset of 4,398 annotations for speakers and reply-to relationship, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging. The most performant audio-visual LLM outperforms all vision-language models across all metrics, especially in speaker and addressee recognition. However, its performance drops significantly when conversation participants are anonymized. The number of conversation participants in a clip is the strongest negative predictor of role-attribution performance, while acoustic clarity (measured by pitch and spectral centroid) and detected face coverage yield positive associations. We hope this work lays the groundwork for future evaluation and development of multimodal LLMs that can reason more effectively about conversation structure. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.17536 [cs.CL] (or arXiv:2505.17536v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.17536 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-90] Co-Reinforcement Learning for Unified Multimodal Understanding and Generation

【速读】：该论文旨在解决如何通过强化学习（Reinforcement Learning, RL）同时提升统一多模态大语言模型（Unified Multimodal Large Language Models, ULMs）的生成与理解能力的问题。其解决方案的关键在于提出一种协同强化学习框架——\textbf{CoRL}，该框架包含一个统一的强化学习阶段用于联合优化，以及一个细化的强化学习阶段用于任务特定的增强，从而实现双能力在共享策略优化框架中的协同进化。

链接: https://arxiv.org/abs/2505.17534
作者: Jingjing Jiang,Chongjie Si,Jun Luo,Hanwang Zhang,Chao Ma
机构: Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce \textbfCoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, \textbfULM-R1, achieves average improvements of \textbf7% on three text-to-image generation datasets and \textbf23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs.
zh

[NLP-91] Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models

【速读】：该论文试图解决在生成式 AI（Generative AI）快速发展的背景下，人类与大型语言模型（Large Language Models, LLMs）交互过程中存在的滥用风险问题，特别是忽视了LLMs不仅可能作为受害模型，还可能作为攻击者模型对其他模型造成危害的潜在可能性。解决方案的关键在于提出一种受思维链（Chain-of-Thought）机制启发的新型越狱方法，通过任务迁移隐藏有害用户意图，并生成连贯的情节诱饵以激发受害模型的推理能力，从而实现成功越狱；同时引入辅助模型在多轮对话中对情节诱饵进行随机叙事优化，确保与原始意图一致，提高攻击成功率。

链接: https://arxiv.org/abs/2505.17519
作者: Wenhan Chang,Tianqing Zhu,Yu Zhao,Shuangyong Song,Ping Xiong,Wanlei Zhou,Yongxiang Li
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 25 pages, 4 figures

点击查看摘要

Abstract:In the era of rapid generative AI development, interactions between humans and large language models face significant misusing risks. Previous research has primarily focused on black-box scenarios using human-guided prompts and white-box scenarios leveraging gradient-based LLM generation methods, neglecting the possibility that LLMs can act not only as victim models, but also as attacker models to harm other models. We proposes a novel jailbreaking method inspired by the Chain-of-Thought mechanism, where the attacker model uses mission transfer to conceal harmful user intent in dialogue and generates chained narrative lures to stimulate the reasoning capabilities of victim models, leading to successful jailbreaking. To enhance the attack success rate, we introduce a helper model that performs random narrative optimization on the narrative lures during multi-turn dialogues while ensuring alignment with the original intent, enabling the optimized lures to bypass the safety barriers of victim models effectively. Our experiments reveal that models with weaker safety mechanisms exhibit stronger attack capabilities, demonstrating that models can not only be exploited, but also help harm others. By incorporating toxicity scores, we employ third-party models to evaluate the harmfulness of victim models’ responses to jailbreaking attempts. The study shows that using refusal keywords as an evaluation metric for attack success rates is significantly flawed because it does not assess whether the responses guide harmful questions, while toxicity scores measure the harm of generated content with more precision and its alignment with harmful questions. Our approach demonstrates outstanding performance, uncovering latent vulnerabilities in LLMs and providing data-driven feedback to optimize LLM safety mechanisms. We also discuss two defensive strategies to offer guidance on improving defense mechanisms.
zh

[NLP-92] What You Read Isnt What You Hear: Linguistic Sensitivity in Deepfake Speech Detection

【速读】：该论文试图解决语音合成深度伪造攻击中，现有音频反欺骗系统对语言学变异的敏感性不足的问题。传统方法主要关注声学层面的扰动，而忽略了语言学层面的变化对检测效果的影响。论文的关键解决方案是引入语料级对抗攻击，通过轻微的语言学扰动显著降低检测准确率，从而揭示现有反欺骗检测器在面对语言学变化时的脆弱性。

链接: https://arxiv.org/abs/2505.17513
作者: Binh Nguyen,Shuji Shi,Ryan Ofman,Thai Le
机构: Indiana University (印第安纳大学); Deep Media AI (Deep Media AI)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 15 pages, 2 fogures

点击查看摘要

Abstract:Recent advances in text-to-speech technologies have enabled realistic voice generation, fueling audio-based deepfake attacks such as fraud and impersonation. While audio anti-spoofing systems are critical for detecting such threats, prior work has predominantly focused on acoustic-level perturbations, leaving the impact of linguistic variation largely unexplored. In this paper, we investigate the linguistic sensitivity of both open-source and commercial anti-spoofing detectors by introducing transcript-level adversarial attacks. Our extensive evaluation reveals that even minor linguistic perturbations can significantly degrade detection accuracy: attack success rates surpass 60% on several open-source detector-voice pairs, and notably one commercial detection accuracy drops from 100% on synthetic audio to just 32%. Through a comprehensive feature attribution analysis, we identify that both linguistic complexity and model-level audio embedding similarity contribute strongly to detector vulnerability. We further demonstrate the real-world risk via a case study replicating the Brad Pitt audio deepfake scam, using transcript adversarial attacks to completely bypass commercial detectors. These results highlight the need to move beyond purely acoustic defenses and account for linguistic variation in the design of robust anti-spoofing systems. All source code will be publicly available.
zh

[NLP-93] Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在理解概念性语义关系方面的能力评估问题，现有基准测试主要关注事实回忆和孤立任务，未能有效评估LLMs在交互环境中对概念边界的推理能力。解决方案的关键在于引入CK-Arena，这是一个基于Undercover游戏构建的多智能体互动游戏，旨在通过描述、区分和推断概念边界来评估LLMs在动态环境中的概念推理能力。

链接: https://arxiv.org/abs/2505.17512
作者: Shuhang Xu,Weijian Deng,Yixuan Zhou,Fangwei Zhong
机构: Beijing Normal University (北京师范大学); Australian National University (澳大利亚国立大学); Beijing 101 Education Group (北京101教育集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Concepts represent generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, failing to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries based on partial information, encouraging models to explore commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs’ understanding of conceptual knowledge varies significantly across different categories and is not strictly aligned with parameter size or general model capabilities. The data and code are available at the project homepage: this https URL.
zh

[NLP-94] Large Language Models Do Multi-Label Classification Differently

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在多标签分类任务中的表现问题，特别是针对主观性任务中模型输出分布的行为特性。研究发现，LLMs在生成每个标签时倾向于抑制除一个标签外的所有标签，导致其预测行为与实际需求不匹配。解决方案的关键在于引入多标签设置下的分布对齐任务，通过将模型生成的标签分布与基于标注者响应估计的经验分布进行对齐，提升模型的预测性能和分布一致性。为此，作者提出了零样本和监督方法，有效改善了现有方法的对齐效果和预测能力。

链接: https://arxiv.org/abs/2505.17510
作者: Marcus Ma,Georgios Chochlakis,Niyantha Maruthu Pandiyan,Jesse Thomason,Shrikanth Narayanan
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 11 figures, 6 tables

点击查看摘要

Abstract:Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, with a focus on subjective tasks, by analyzing the output distributions of the models in each generation step. We find that their predictive behavior reflects the multiple steps in the underlying language modeling required to generate all relevant labels as they tend to suppress all but one label at each step. We further observe that as model scale increases, their token distributions exhibit lower entropy, yet the internal ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. To further study this issue, we introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods which improve both alignment and predictive performance over existing approaches.
zh

[NLP-95] On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

【速读】：该论文试图解决在在线强化学习（online RL）场景下，如何系统性地探索和整合不同Kullback-Leibler (KL) 散度形式的估计到代理损失函数中的问题，以提升大规模语言模型（LLM）的推理能力。解决方案的关键在于提出一种系统框架——正则化策略梯度（Regularized Policy Gradient, RPG），该框架能够推导并分析基于前向和反向KL散度正则化的策略梯度方法，同时考虑归一化与非归一化策略分布，并提供可微分的损失函数及类似REINFORCE的梯度估计器，从而增强训练稳定性和算法灵活性。

链接: https://arxiv.org/abs/2505.17508
作者: Yifan Zhang,Yifeng Liu,Huizhuo Yuan,Yang Yuan,Quanquan Gu,Andrew C Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 53 pages, 17 figures

点击查看摘要

Abstract:Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at this https URL.
zh

[NLP-96] L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在训练和推理过程中依赖于逐token预测（Next-Token Prediction, NTP）所面临的上下文覆盖范围有限和推理效率低下的问题。其解决方案的关键在于提出了一种名为“跳跃多token预测”（Leap Multi-Token Prediction, L-MTP）的新方法，该方法通过引入跳跃机制，使模型在单次前向传播中跳过中间token，直接预测非连续的token，从而增强模型捕捉长距离依赖关系的能力，并实现针对非连续跳跃token生成的优化解码策略，显著提升推理效率。

链接: https://arxiv.org/abs/2505.17505
作者: Xiaohao Liu,Xiaobo Xia,Weixiang Zhao,Manyi Zhang,Xianzhi Yu,Xiu Su,Shuo Yang,See-Kiong Ng,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学); Chinese Academy of Sciences (中国科学院); Central South University (中南大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction~(L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model’s ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code will be publicly available.
zh

[NLP-97] CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

【速读】：该论文旨在解决在实际检索增强生成（Retrieval-Augmented Generation, RAG）场景中评估大型语言模型（Large Language Models, LLMs）能力的挑战，特别是在复杂推理、拒绝不适当回答、提供精确引用和理解文档布局等方面的能力不足问题。解决方案的关键在于提出一个统一的评估框架——CReSt（A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents），该基准包含2,245个经过人工标注的英文和韩文示例，旨在全面评估模型在这些关键维度上的表现，并引入定制化的评估方法以提升评估的准确性与实用性。

链接: https://arxiv.org/abs/2505.17503
作者: Minsoo Khang,Sangjun Park,Teakgyu Hong,Dawoon Jung
机构: Upstage AI (优斯派斯人工智能)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: this https URL.
zh

[NLP-98] Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models INTERSPEECH2025

【速读】：该论文试图解决在端到端训练语音语言模型（Spoken Language Models, SLMs）过程中，由于多阶段持续学习导致的灾难性遗忘（catastrophic forgetting）问题。其关键解决方案是评估并比较三种缓解策略：模型融合、降低LoRA缩放因子以及经验重放（experience replay），其中经验重放被证明是最有效的，并且与其他方法结合使用可进一步提升效果。

链接: https://arxiv.org/abs/2505.17496
作者: Chi-Yuan Hsiao,Ke-Han Lu,Kai-Wei Chang,Chih-Kai Yang,Wei-Chih Chen,Hung-yi Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:End-to-end training of Spoken Language Models (SLMs) commonly involves adapting pre-trained text-based Large Language Models (LLMs) to the speech modality through multi-stage training on diverse tasks such as ASR, TTS and spoken question answering (SQA). Although this multi-stage continual learning equips LLMs with both speech understanding and generation capabilities, the substantial differences in task and data distributions across stages can lead to catastrophic forgetting, where previously acquired knowledge is lost. This paper investigates catastrophic forgetting and evaluates three mitigation strategies-model merging, discounting the LoRA scaling factor, and experience replay to balance knowledge retention with new learning. Results show that experience replay is the most effective, with further gains achieved by combining it with other methods. These findings provide insights for developing more robust and efficient SLM training pipelines.
zh

[NLP-99] ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLM s

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）中特征交互识别效率低的问题，传统方法需要枚举所有可能的特征组合，导致计算复杂度随输入数量 $n$ 呈指数增长。其解决方案的关键在于利用LLM特征交互的层次性——高阶交互通常伴随其低阶子集，从而实现更高效的交互发现。为此，作者提出了ProxySPEX算法，通过拟合遮蔽LLM输出的梯度提升树来提取重要交互，相较于SPEX减少了10倍的模型推理次数，并在四个高维数据集上表现出更好的输出重构能力和特征影响识别效果。

链接: https://arxiv.org/abs/2505.17495
作者: Landon Butler,Abhineet Agarwal,Justin Singh Kang,Yigit Efe Erginbas,Bin Yu,Kannan Ramchandran
机构: UC Berkeley (加州大学伯克利分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs n . Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to n \approx 10^3 features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical – higher-order interactions are accompanied by their lower-order subsets – which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using 10\times fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks. Data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.
zh

[NLP-100] PD3: A Project Duplication Detection Framework via Adapted Multi-Agent Debate

【速读】：该论文旨在解决项目重复检测（Project Duplication Detection）问题，以提升项目质量评估的准确性并避免对已有研究的重复投入。现有方法主要依赖于基础的词或句级比较或仅使用大型语言模型，缺乏对项目内容和评审标准的深入理解及对专家有价值的反馈。该论文提出的解决方案关键在于PD³框架，其通过适应性的多智能体辩论机制，模拟现实中的专家辩论过程，采用公平竞争形式引导多智能体辩论以检索相关项目，并结合定性和定量分析提供更具实用性的反馈。

链接: https://arxiv.org/abs/2505.17492
作者: Dezheng Bao,Yueci Yang,Xin Chen,Zhengxuan Jiang,Zeguo Fei,Daoze Zhang,Xuanwen Huang,Junru Chen,Chutian Yu,Xiang Yuan,Yang Yang
机构: Zhejiang University (浙江大学); State Grid Power Supply Co. Ltd. (国家电网供电有限公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 9 figures

点击查看摘要

Abstract:Project duplication detection is critical for project quality assessment, as it improves resource utilization efficiency by preventing investing in newly proposed project that have already been studied. It requires the ability to understand high-level semantics and generate constructive and valuable feedback. Existing detection methods rely on basic word- or sentence-level comparison or solely apply large language models, lacking valuable insights for experts and in-depth comprehension of project content and review criteria. To tackle this issue, we propose PD ^3 , a Project Duplication Detection framework via adapted multi-agent Debate. Inspired by real-world expert debates, it employs a fair competition format to guide multi-agent debate to retrieve relevant projects. For feedback, it incorporates both qualitative and quantitative analysis to improve its practicality. Over 800 real-world power project data spanning more than 20 specialized fields are used to evaluate the framework, demonstrating that our method outperforms existing approaches by 7.43% and 8.00% in two downstream tasks. Furthermore, we establish an online platform, Review Dingdang, to assist power experts, saving 5.73 million USD in initial detection on more than 100 newly proposed projects.
zh

[NLP-101] keepitsimple at SemEval-2025 Task 3: LLM -Uncertainty based Approach for Multilingual Hallucination Span Detection

【速读】：该论文试图解决在黑盒语言模型生成文本中识别幻觉（hallucination）段的问题，这对于实际应用至关重要。其解决方案的关键在于利用随机采样响应的变异性来识别幻觉段落，假设若语言模型对某一事实确信，则其采样响应将具有一致性，而幻觉事实则会导致不同且冲突的结果。通过熵基分析衡量这种差异，从而准确识别幻觉片段。该方法无需额外训练，因此具有成本效益和适应性。

链接: https://arxiv.org/abs/2505.17485
作者: Saketh Reddy Vemula,Parameswari Krishnamurthy
机构: IIIT Hyderabad (印度国际信息技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identification of hallucination spans in black-box language model generated text is essential for applications in the real world. A recent attempt at this direction is SemEval-2025 Task 3, Mu-SHROOM-a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which capitalizes on the variability of stochastically-sampled responses in order to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be uniform, while hallucinated facts will yield different and conflicting results. We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments. Our method is not dependent on additional training and hence is cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, giving us crucial insights into model behavior.
zh

[NLP-102] From Reasoning to Generalization: Knowledge-Augmented LLM s for ARC Benchmark

【速读】：该论文试图解决当前面向推理的大型语言模型（Large Language Models, LLMs）在抽象推理和泛化能力方面仍存在不足的问题。其解决方案的关键在于提出一种基于知识增强的抽象推理框架（Knowledge Augmentation for Abstract Reasoning, KAAR），该框架通过在本体论中编码核心知识先验，并按照依赖关系将其分为三个层次，逐步增强LLMs的推理能力，同时结合重复采样规划辅助代码生成（RSPC）方法生成候选解，从而有效减少无关先验的干扰并提升模型性能。

链接: https://arxiv.org/abs/2505.17482
作者: Chao Lei,Nir Lipovetzky,Krista A. Ehinger,Yanchuan Chang
机构: The University of Melbourne (墨尔本大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting future avenues of progress in LLMs.
zh

[NLP-103] MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning

【速读】：该论文试图解决如何使大语言模型（Large Language Models, LLMs）在代码推理（code reasoning）任务中通过每次提出的解决方案逐步提升能力，从而实现显著的累积性改进问题。现有研究多采用静态视角，仅依赖冻结的LLMs进行孤立的问题求解，而本文则从认知演化的角度出发，提出了一种名为元反思与交叉参照（Meta-Reflection with Cross-Referencing, MARCO）的框架。该框架的关键在于通过自我改进实现LLM在推理过程中的动态演化，具体包括两个核心机制：元反思（meta-reflection）用于积累问题求解过程中的知识与经验，交叉参照（cross-referencing）则用于整合其他智能体的解决方案与反馈，以提升当前问题求解的效果。

链接: https://arxiv.org/abs/2505.17481
作者: Yusheng Zhao,Xiao Luo,Weizhi Zhang,Wei Ju,Zhiping Xiao,Philip S. Yu,Ming Zhang
机构: Peking University (北京大学); University of California, Los Angeles (加利福尼亚大学洛杉矶分校); University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability to reason is one of the most fundamental capabilities of large language models (LLMs), enabling a wide range of downstream tasks through sophisticated problem-solving. A critical aspect of this is code reasoning, which involves logical reasoning with formal languages (i.e., programming code). In this paper, we enhance this capability of LLMs by exploring the following question: how can an LLM agent become progressively smarter in code reasoning with each solution it proposes, thereby achieving substantial cumulative improvement? Most existing research takes a static perspective, focusing on isolated problem-solving using frozen LLMs. In contrast, we adopt a cognitive-evolving perspective and propose a novel framework named Meta-Reflection with Cross-Referencing (MARCO) that enables the LLM to evolve dynamically during inference through self-improvement. From the perspective of human cognitive development, we leverage both knowledge accumulation and lesson sharing. In particular, to accumulate knowledge during problem-solving, we propose meta-reflection that reflects on the reasoning paths of the current problem to obtain knowledge and experience for future consideration. Moreover, to effectively utilize the lessons from other agents, we propose cross-referencing that incorporates the solution and feedback from other agents into the current problem-solving process. We conduct experiments across various datasets in code reasoning, and the results demonstrate the effectiveness of MARCO.
zh

[NLP-104] OrionBench: A Benchmark for Chart and Human-Recognizable Object Detection in Infographics

【速读】：该论文试图解决视觉-语言模型（Vision-Language Models, VLMs）在图表理解能力上的不足，特别是其对信息图中图表和人类可识别对象（Human-Recognizable Objects, HROs）等视觉元素的定位不准确问题。解决方案的关键在于引入OrionBench，这是一个用于支持图表和HROs检测模型开发的基准数据集，包含26,250张真实和78,750张合成的信息图，并提供了超过690万条边界框标注，这些标注通过模型在环和程序化方法相结合的方式生成。

链接: https://arxiv.org/abs/2505.17473
作者: Jiangning Zhu,Yuxing Zhou,Zheng Wang,Juntao Yao,Yima Gu,Yuhui Yuan,Shixia Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce OrionBench, a benchmark designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 26,250 real and 78,750 synthetic infographics, with over 6.9 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of OrionBench through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.
zh

[NLP-105] FinRAG Bench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain

【速读】：该论文旨在解决金融领域中现有检索增强生成（Retrieval-Augmented Generation, RAG）研究主要依赖文本数据而忽视财务文档中丰富的视觉内容的问题，从而导致关键分析洞察的丢失。其解决方案的关键在于提出FinRAGBench-V，一个面向金融领域的综合性视觉RAG基准，该基准有效整合多模态数据并提供视觉引用以确保可追溯性，同时引入RGenCite基线模型，实现视觉引用与生成的无缝集成，并提出一种自动引用评估方法以系统评估多模态大语言模型（Multimodal Large Language Models, MLLMs）的视觉引用能力。

链接: https://arxiv.org/abs/2505.17471
作者: Suifeng Zhao,Zhuoran Jin,Sujian Li,Jun Gao
机构: Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) plays a vital role in the financial domain, powering applications such as real-time market analysis, trend forecasting, and interest rate computation. However, most existing RAG research in finance focuses predominantly on textual data, overlooking the rich visual content in financial documents, resulting in the loss of key analytical insights. To bridge this gap, we present FinRAGBench-V, a comprehensive visual RAG benchmark tailored for finance which effectively integrates multimodal data and provides visual citation to ensure traceability. It includes a bilingual retrieval corpus with 60,780 Chinese and 51,219 English pages, along with a high-quality, human-annotated question-answering (QA) dataset spanning heterogeneous data types and seven question categories. Moreover, we introduce RGenCite, an RAG baseline that seamlessly integrates visual citation with generation. Furthermore, we propose an automatic citation evaluation method to systematically assess the visual citation capabilities of Multimodal Large Language Models (MLLMs). Extensive experiments on RGenCite underscore the challenging nature of FinRAGBench-V, providing valuable insights for the development of multimodal RAG systems in finance.
zh

[NLP-106] SLearnLLM : A Self-Learning Framework for Efficient Domain-Specific Adaptation of Large Language Models

【速读】：该论文试图解决在使用监督微调（Supervised Fine-Tuning, SFT）将大语言模型（Large Language Models, LLMs）适配到特定领域时，是否应使用整个SFT数据集进行微调的问题。传统方法通常直接对整个数据集进行微调，但由于缺乏对模型先前训练数据的了解，若SFT数据集与模型已有知识高度重叠，则性能提升有限，造成计算资源浪费。该论文提出的解决方案的关键在于通过自我学习框架识别SFT数据集中未知的知识，并基于此过滤出有效的问答对进行微调，从而提高微调效率。实验结果表明，该方法在农业和医学领域显著减少了训练时间，同时实现了与全数据集微调相当的性能提升。

链接: https://arxiv.org/abs/2505.17470
作者: Xiang Liu,Zhaoxiang Liu,Peng Wang,Kohou Wang,Huan Hu,Kai Wang,Shiguo Lian
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:When using supervised fine-tuning (SFT) to adapt large language models (LLMs) to specific domains, a significant challenge arises: should we use the entire SFT dataset for fine-tuning? Common practice often involves fine-tuning directly on the entire dataset due to limited information on the LLM’s past training data. However, if the SFT dataset largely overlaps with the model’s existing knowledge, the performance gains are minimal, leading to wasted computational resources. Identifying the unknown knowledge within the SFT dataset and using it to fine-tune the model could substantially improve the training efficiency. To address this challenge, we propose a self-learning framework for LLMs inspired by human learning pattern. This framework takes a fine-tuning (SFT) dataset in a specific domain as input. First, the LLMs answer the questions in the SFT dataset. The LLMs then objectively grade the responses and filter out the incorrectly answered QA pairs. Finally, we fine-tune the LLMs based on this filtered QA set. Experimental results in the fields of agriculture and medicine demonstrate that our method substantially reduces training time while achieving comparable improvements to those attained with full dataset fine-tuning. By concentrating on the unknown knowledge within the SFT dataset, our approach enhances the efficiency of fine-tuning LLMs.
zh

[NLP-107] A Position Paper on the Automatic Generation of Machine Learning Leaderboards

【速读】：该论文试图解决机器学习（Machine Learning, ML）领域中由于文献数量增长导致的ML leaderboard维护困难问题，以及现有自动leaderboard生成（Automatic Leaderboard Generation, ALG）方法在问题定义上的差异性带来的比较与应用限制。其解决方案的关键在于提出一个统一的概念框架，以标准化ALG任务的定义，并提供基准测试指南，包括促进公平、可复现评估的数据集和指标推荐，从而推动ALG研究的规范化与发展。

链接: https://arxiv.org/abs/2505.17465
作者: Roelien C Timmer,Yufang Hou,Stephen Wan
机构: CSIRO Data61, Australia (澳大利亚联邦科学与工业研究组织数据61部); IT:U Interdisciplinary Transformation University Austria, Austria (奥地利IT:U跨学科转型大学); IBM Research, Ireland (IBM研究院，爱尔兰)
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose an ALG unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as, advocating for broader coverage by including all reported results and richer metadata.
zh

[NLP-108] Hydra: Structured Cross-Source Enhanced Large Language Model Reasoning

【速读】：该论文旨在解决现有混合检索增强生成（Retrieval-augmented generation, RAG）系统在处理多跳推理、多实体问题、多源验证以及有效利用图结构等方面的局限性。其解决方案的关键在于提出Hydra框架，该框架通过统一图拓扑、文档语义和来源可靠性，实现对大型语言模型（Large Language Models, LLMs）的深度且忠实的推理支持。Hydra采用代理驱动的探索机制，结合结构化与非结构化检索，提升证据的多样性和精确性，并通过三因子跨源验证机制（源可信度评估、跨源佐证和实体路径对齐）解决多源验证问题，同时利用图结构融合异构信息、引导高效探索并早期剔除噪声。

链接: https://arxiv.org/abs/2505.17464
作者: Xingyu Tan,Xiaoyang Wang,Qing Liu,Xiwei Xu,Xin Yuan,Liming Zhu,Wenjie Zhang
机构: University of New South Wales (新南威尔士大学); Data61, CSIRO (数据61，澳大利亚联邦科学与工业研究组织)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG system retrieves evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, it faces challenges like handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present Hydra, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. Hydra handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, Hydra uses a tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment), to balance topic relevance with cross-modal agreement. By leveraging graph structure, Hydra fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that Hydra achieves overall state-of-the-art results on all benchmarks with GPT-3.5, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, Hydra enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
zh

[NLP-109] Diagnosing Vision Language Models Perception by Leverag ing Human Methods for Color Vision Deficiencies

【速读】：该论文试图解决大型多模态视觉语言模型（Large-scale Vision Language Models, LVLMs）在处理个体层面颜色感知差异方面的不足，特别是针对色觉缺陷（Color Vision Deficiencies, CVDs）的感知多样性问题。研究的关键在于通过Ishihara测试评估LVLMs在自然语言中解释CVDs的能力，并揭示其在图像任务中模拟色觉缺陷者颜色感知的局限性，从而强调多模态系统在颜色感知多样性方面的改进需求。

链接: https://arxiv.org/abs/2505.17461
作者: Kazuki Hayashi,Shintaro Ozaki,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
机构: Nara Institute of Science and Technology (NAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs’ ability to account for individual level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.
zh

[NLP-110] owards Evaluating Proactive Risk Awareness of Multimodal Language Models

【速读】：该论文试图解决人类在日常生活中对潜在风险识别不及时的问题，其核心在于填补安全意识的空白。解决方案的关键是构建一种主动式安全人工智能（proactive safety AI）系统，该系统能够通过监控用户行为及环境，提前检测潜在危险，而非仅对用户提问做出反应。为此，研究者提出了Proactive Safety Bench (PaSBench)，通过416个跨安全关键领域的多模态场景对模型进行评估，揭示了现有模型在主动推理方面的局限性，为开发更可靠的保护性AI提供了方向。

链接: https://arxiv.org/abs/2505.17455
作者: Youliang Yuan,Wenxiang Jiao,Yuejin Xie,Chihao Shen,Menghan Tian,Wenxuan Wang,Jen-tse Huang,Pinjia He
机构: School of Data Science, The Chinese University of Hong Kong, Shenzhen(数据科学学院，香港中文大学深圳); Renmin University of China(中国人民大学); Johns Hopkins University(约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:Human safety awareness gaps often prevent the timely recognition of everyday risks. In solving this problem, a proactive safety artificial intelligence (AI) system would work better than a reactive one. Instead of just reacting to users’ questions, it would actively watch people’s behavior and their environment to detect potential dangers in advance. Our Proactive Safety Bench (PaSBench) evaluates this capability through 416 multimodal scenarios (128 image sequences, 288 text logs) spanning 5 safety-critical domains. Evaluation of 36 advanced models reveals fundamental limitations: Top performers like Gemini-2.5-pro achieve 71% image and 64% text accuracy, but miss 45-55% risks in repeated trials. Through failure analysis, we identify unstable proactive reasoning rather than knowledge deficits as the primary limitation. This work establishes (1) a proactive safety benchmark, (2) systematic evidence of model limitations, and (3) critical directions for developing reliable protective AI. We believe our dataset and findings can promote the development of safer AI assistants that actively prevent harm rather than merely respond to requests. Our dataset can be found at this https URL.
zh

[NLP-111] Self-Training Large Language Models with Confident Reasoning

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在生成推理路径时依赖昂贵的人工监督的问题，旨在通过自训练方法提升模型的推理能力。其解决方案的关键在于引入基于推理层面置信度（reasoning-level confidence）的策略，而非仅关注最终答案的质量，从而更准确地识别高质量的推理路径用于自训练。为此，作者提出了一种新的自训练方法CORE-PO，通过策略优化（Policy Optimization）使模型偏好高置信度的推理路径，实验结果表明该方法在多个基准测试中均优于现有自训练方法。

链接: https://arxiv.org/abs/2505.17454
作者: Hyosoon Jang,Yunhui Jang,Sungjae Lee,Jungseul Ok,Sungsoo Ahn
机构: POSTECH(浦项科技大学); KAIST(韩国科学技术院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown impressive performance by generating reasoning paths before final answers, but learning such a reasoning path requires costly human supervision. To address this issue, recent studies have explored self-training methods that improve reasoning capabilities using pseudo-labels generated by the LLMs themselves. Among these, confidence-based self-training fine-tunes LLMs to prefer reasoning paths with high-confidence answers, where confidence is estimated via majority voting. However, such methods exclusively focus on the quality of the final answer and may ignore the quality of the reasoning paths, as even an incorrect reasoning path leads to a correct answer by chance. Instead, we advocate the use of reasoning-level confidence to identify high-quality reasoning paths for self-training, supported by our empirical observations. We then propose a new self-training method, CORE-PO, that fine-tunes LLMs to prefer high-COnfidence REasoning paths through Policy Optimization. Our experiments show that CORE-PO improves the accuracy of outputs on four in-distribution and two out-of-distribution benchmarks, compared to existing self-training methods.
zh

[NLP-112] LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization

【速读】：该论文试图解决在检索增强生成（RAG）框架中，通过结果监督强化学习（RL）方法集成推理能力时，对中间思考与搜索步骤正确性忽略的问题。解决方案的关键在于设计一个过程级奖励模块，以弥补结果级监督下对中间推理步骤感知不足的缺陷，并提出一种融合步骤级过程奖励与结果级奖励的新型框架——Learning to Think-and-Search (LeTS)，从而提升LLMs在RAG任务中的推理能力。

链接: https://arxiv.org/abs/2505.17447
作者: Qi Zhang,Shouqing Yang,Lirong Gao,Hao Chen,Xiaomeng Hu,Jinglei Chen,Jiexiang Wang,Sheng Guo,Bo Zheng,Haobo Wang,Junbo Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: preprint, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs’ reasoning ability via RL under other scenarios. The code will be released soon.
zh

[NLP-113] Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models INTERSPEECH2025

【速读】：该论文旨在解决语音分词（speech tokenization）对语音语言模型（SLMs）性能影响不明确的问题。其解决方案的关键在于研究语音分词的分割宽度和离散单元的聚类大小，并通过在不同聚类规模下训练K-means模型，结合固定/可变宽度分割与池化表示，以优化语音表示的质量。实验结果表明，适度粗粒度的分割和较大的聚类规模能够提升模型性能，其中最优模型在减少50%训练数据和70%训练时间的情况下仍保持较高效率。

链接: https://arxiv.org/abs/2505.17446
作者: Shunsuke Kando,Yusuke Miyao,Shinnosuke Takamichi
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech2025

点击查看摘要

Abstract:The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units. First, we segment speech signals into fixed/variable widths and pooled representations. We then train K-means models in multiple cluster sizes. Through the evaluation on zero-shot spoken language understanding benchmarks, we find the positive effect of moderately coarse segmentation and bigger cluster size. Notably, among the best-performing models, the most efficient one achieves a 50% reduction in training data and a 70% decrease in training runtime. Our analysis highlights the importance of combining multiple tokens to enhance fine-grained spoken language understanding.
zh

[NLP-114] Discovering Forbidden Topics in Language Models

【速读】：该论文试图解决语言模型拒绝讨论的完整话题集的发现问题（refusal discovery），即识别模型不愿意涉及的主题。解决方案的关键在于提出一种名为LLM-crawler的方法，该方法利用token预填充（token prefilling）技术来探测被禁止的话题。通过在多个模型上进行实验，验证了该方法的有效性，并揭示了某些模型可能存在与特定意识形态对齐的响应模式。

链接: https://arxiv.org/abs/2505.17441
作者: Can Rager,Chris Wendler,Rohit Gandikota,David Bau
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning: DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: The model exhibits “thought suppression” behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusals answers in the quantized model. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
zh

[NLP-115] 2: An Adaptive Test-Time Scaling Strategy for Contextual Question Answering

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在上下文问答（Contextual Question Answering, CQA）任务中因采用固定复杂度的推理策略而导致适应性不足的问题。现有方法在处理简单问题时容易过度推理，而高效测试时缩放方法虽引入预算约束或早停机制，但增加了人为偏见并未能充分利用模型自身的推理能力。论文提出的解决方案——T²（Think-to-Think）框架的关键在于根据问题复杂度动态调整推理深度，其核心思想是通过分解问题结构、生成具有候选推理策略的相似示例、多标准评估策略，并选择最优策略应用于原始问题，从而在保持高精度的同时降低计算开销。

链接: https://arxiv.org/abs/2505.17427
作者: Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Huimin Wang,Yutian Zhao,Bin Liang,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu
机构: The Chinese University of Hong Kong (中国香港中文大学); University of International Relations (国际关系学院); Jarvis Research Center, Tencent YouTu Lab (腾讯优图实验室); Westlake University (西湖大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated remarkable performance in Contextual Question Answering (CQA). However, prior approaches typically employ elaborate reasoning strategies regardless of question complexity, leading to low adaptability. Recent efficient test-time scaling methods introduce budget constraints or early stop mechanisms to avoid overthinking for straightforward questions. But they add human bias to the reasoning process and fail to leverage models’ inherent reasoning capabilities. To address these limitations, we present T ^2 : Think-to-Think, a novel framework that dynamically adapts reasoning depth based on question complexity. T ^2 leverages the insight that if an LLM can effectively solve similar questions using specific reasoning strategies, it can apply the same strategy to the original question. This insight enables to adoption of concise reasoning for straightforward questions while maintaining detailed analysis for complex problems. T ^2 works through four key steps: decomposing questions into structural elements, generating similar examples with candidate reasoning strategies, evaluating these strategies against multiple criteria, and applying the most appropriate strategy to the original question. Experimental evaluation across seven diverse CQA benchmarks demonstrates that T ^2 not only achieves higher accuracy than baseline methods but also reduces computational overhead by up to 25.2%.
zh

[NLP-116] Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads

【速读】：该论文试图解决多模态模型（如CLIP）在训练过程中可能无意中学习到目标变量与混淆因素之间的虚假关联的问题，这种问题会影响模型的泛化能力和公平性。解决方案的关键在于提出一种对比框架——\textscLocate-Then-Correct (LTC)，该框架通过机制洞察识别出Vision Transformers中的虚假注意力头，并通过有针对性的消融操作进行修正；同时，LTC还能识别出与任务相关的显著注意力头，通过正交投影整合判别特征以提升分类性能。

链接: https://arxiv.org/abs/2505.17425
作者: Wei Jie Yeo,Rui Mao,Moloud Abdar,Erik Cambria,Ranjan Satapathy
机构: Nanyang Technological University (南洋理工大学); The University of Queensland (昆士兰大学); IHPC, A∗STAR (资讯通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textscLocate-Then-Correct (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a 50% gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at this https URL.
zh

[NLP-117] DASH: Input-Aware Dynamic Layer Skipping for Efficient LLM Inference with Markov Decision Policies

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在实际部署中因推理成本过高而面临的挑战，尤其是在对延迟敏感的场景下。其解决方案的关键在于提出一种自适应层跳过框架DASH，通过将层跳过过程建模为马尔可夫决策过程（Markov Decision Process, MDP），实现基于输入特征的动态计算路径选择，并结合轻量级补偿机制与异步执行策略，以最小化运行时开销并保持任务性能。

链接: https://arxiv.org/abs/2505.17420
作者: Ning Yang,Fangxin Liu,Junjie Wang,Tao Yang,Kan Liu,Haibing Guan,Li Jiang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Qi Zhi Institute (上海期智研究院); Northeast University (东北大学); Huawei Technologies Ltd. (华为技术有限公司); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages,5 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable performance across a wide range of NLP tasks. However, their substantial inference cost poses a major barrier to real-world deployment, especially in latency-sensitive scenarios. To address this challenge, we propose \textbfDASH, an adaptive layer-skipping framework that dynamically selects computation paths conditioned on input characteristics. We model the skipping process as a Markov Decision Process (MDP), enabling fine-grained token-level decisions based on intermediate representations. To mitigate potential performance degradation caused by skipping, we introduce a lightweight compensation mechanism that injects differential rewards into the decision process. Furthermore, we design an asynchronous execution strategy that overlaps layer computation with policy evaluation to minimize runtime overhead. Experiments on multiple LLM architectures and NLP benchmarks show that our method achieves significant inference acceleration while maintaining competitive task performance, outperforming existing methods.
zh

[NLP-118] Conversations: Love Them Hate Them Steer Them

【速读】：该论文试图解决如何在大型语言模型（Large Language Models, LLMs）中引入更细腻、类人的情感表达问题，这一问题在当前的对齐技术中往往仅关注表面输出或需要大量微调。论文提出的解决方案的关键在于目标激活工程（targeted activation engineering），通过属性归因补丁（attribution patching）识别出对情感表达具有因果影响的组件，并利用对比文本对激活差异生成情感表达向量，从而在新对话提示中应用这些向量以显著增强情感特征。

链接: https://arxiv.org/abs/2505.17413
作者: Niranjan Chebrolu,Gerard Christopher Yeo,Kokil Jaidka
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate increasing conversational fluency, yet instilling them with nuanced, human-like emotional expression remains a significant challenge. Current alignment techniques often address surface-level output or require extensive fine-tuning. This paper demonstrates that targeted activation engineering can steer LLaMA 3.1-8B to exhibit more human-like emotional nuances. We first employ attribution patching to identify causally influential components, to find a key intervention locus by observing activation patterns during diagnostic conversational tasks. We then derive emotional expression vectors from the difference in the activations generated by contrastive text pairs (positive vs. negative examples of target emotions). Applying these vectors to new conversational prompts significantly enhances emotional characteristics: steered responses show increased positive sentiment (e.g., joy, trust) and more frequent first-person pronoun usage, indicative of greater personal engagement. Our findings offer a precise and interpretable method for controlling specific emotional attributes in LLMs, contributing to developing more aligned and empathetic conversational AI.
zh

[NLP-119] LLM -based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context INTERSPEECH2025

【速读】：该论文旨在解决生成式错误纠正（Generative Error Correction, GER）在处理罕见或领域特定词汇时表现不佳的问题，以及现有基于大语言模型（Large Language Models, LLMs）的GER方法因仅依赖文本信息而产生的过度纠正问题。其解决方案的关键在于针对罕见词汇进行合成数据生成以微调GER模型，并结合自动语音识别（ASR）的N-best候选词和语音上下文信息，从而有效减少过度纠正并提升纠错效果。

链接: https://arxiv.org/abs/2505.17410
作者: Natsuo Yamashita,Masaaki Yamamoto,Hiroaki Kokubo,Yohei Kawaguchi
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by INTERSPEECH 2025

点击查看摘要

Abstract:Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach to improve automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches primarily rely on textual information, neglecting phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data to contain rare words for fine-tuning the GER model. Second, we integrate ASR’s N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the WER and CER across both English and Japanese datasets.
zh

[NLP-120] Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

【速读】：该论文试图解决大型推理模型（Large Reasoning Models, LRMs）在多语言环境下内部推理过程的不确定性问题，特别是模型在处理不同语言输入时所采用的推理语言选择及其对性能的影响。解决方案的关键在于通过实验证明，尽管经过多语言训练，LRMs在测试时倾向于使用高资源语言（如英语）进行推理，而当被限制在与输入语言相同的语言下推理时，模型性能会下降，尤其是在低资源语言中表现更为明显。研究还揭示了语言选择对不同类型任务（如推理任务和文化相关任务）的影响差异，从而为开发更具公平性的多语言模型提供了重要依据。

链接: https://arxiv.org/abs/2505.17407
作者: Zhi Rui Tam,Cheng-Kuang Wu,Yu Ying Chiu,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee
机构: Appier AI Research(应用人工智能研究); University of Washington(华盛顿大学); National Taiwan University(台湾大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: \it In which language do these models reason when solving problems presented in different languages? Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.
zh

[NLP-121] FullFront: Benchmarking MLLM s Across the Full Front-End Engineering Workflow

【速读】：该论文试图解决当前Multimodal Large Language Models (MLLMs)在完整前端开发流程中的能力评估问题，特别是针对网页设计、视觉理解与代码生成等核心任务的性能不足。其解决方案的关键在于提出FullFront基准，通过一种新颖的两阶段流程将真实网页转换为结构清晰、标准化的HTML代码，同时保留多样的视觉设计并规避版权问题，从而提供一个更贴近实际应用场景的评估框架。

链接: https://arxiv.org/abs/2505.17399
作者: Haoyu Sun,Huichen Will Wang,Jiawei Gu,Linjie Li,Yu Cheng
机构: Tongji University (同济大学); University of Washington (华盛顿大学); Sun Yat-sen University (中山大学); Microsoft (微软); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) \textbfacross the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available in this https URL.
zh

[NLP-122] Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

【速读】：该论文旨在解决多跳检索增强生成（multi-hop RAG）系统中存在冗余子查询、探索深度不足或搜索链过长的问题。其解决方案的关键在于提出EVO-RAG，这是一个基于课程引导的强化学习框架，通过演化查询重写代理，从早期广泛的探索逐步过渡到后期的精炼阶段。EVO-RAG结合了七因子步骤级奖励向量与随时间变化的调度器，动态调整奖励信号以优化检索过程，并利用多头奖励模型进行直接偏好优化，使代理能够学习何时搜索、回溯、回答或拒绝。

链接: https://arxiv.org/abs/2505.17391
作者: Yuelyu Ji,Rui Meng,Zhuochun Li,Daqing He
机构: University of Pittsburgh(匹兹堡大学); Google Cloud AI Research(谷歌云人工智能研究)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15 %. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
zh

[NLP-123] Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

【速读】：该论文试图解决生成式 AI (Generative AI) 在预训练和监督微调过程中，基于细粒度人格特征生成的合成数据多样性不足的问题。其解决方案的关键在于通过一系列词汇多样性与冗余度指标，评估由人格驱动生成的提示词和回复的多样性，并分析不同规模的语言模型在使用细粒度与粗粒度人格描述时生成文本的差异，从而揭示细粒度人格信息对生成文本多样性的影响程度。

链接: https://arxiv.org/abs/2505.17390
作者: Gauri Kambhatla,Chantal Shaib,Venkata Govindarajan
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-grained personas have recently been used for generating ‘diverse’ synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. Firstly, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contribute to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn’t increase diversity noticeably.
zh

[NLP-124] WiNGPT -3.0 Technical Report

【速读】：该论文旨在解决当前大型语言模型（Large Language Models, LLMs）在结构化、可解释性和可验证性医疗推理方面的显著局限性，以及与计算资源和数据隐私相关的实际部署挑战。其关键解决方案是开发了具有320亿参数的WiNGPT-3.0模型，并采用多阶段训练流程，结合监督微调（Supervised Fine-Tuning, SFT）和强化学习（Reinforcement Learning, RL），利用精心构建的长链式思维（Long Chain-of-Thought, CoT）数据集、辅助奖励模型和基于证据的诊断链模拟，以提升医疗推理能力并探索其在医疗IT基础设施中的有效集成。

链接: https://arxiv.org/abs/2505.17387
作者: Boqin Zhuang,Chenxiao Song,Huitong Lu,Jiacheng Qiao,Mingqian Liu,Mingxing Yu,Ping Hong,Rui Li,Xiaoxia Song,Xiangjun Xu,Xu Chen,Yaoyao Ma,Yujie Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focused on the development of WiNGPT-3.0, the 32-billion parameter LLMs, engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL’s efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.
zh

[NLP-125] AI-Augmented LLM s Achieve Therapist-Level Responses in Motivational Interviewing

【速读】：该论文试图解决如何评估大型语言模型（Large Language Models, LLMs）在成瘾护理中进行动机性访谈（Motivational Interviewing, MI）的治疗能力问题。其解决方案的关键在于构建一个计算框架，通过预期和非预期的MI行为来评估用户感知质量（User-Perceived Quality, UPQ），并利用深度学习与可解释AI相结合的方法，识别出17个MI一致（MI-consistent, MICO）和MI不一致（MI-inconsistent, MIIN）的行为指标，从而实现对LLMs在MI中的表现进行系统性评估与优化。

链接: https://arxiv.org/abs/2505.17380
作者: Yinghui Huang,Yuxuan Jiang,Hui Liu,Yixin Cai,Weiqing Li,Xiangen Hu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) like GPT-4 show potential for scaling motivational interviewing (MI) in addiction care, but require systematic evaluation of therapeutic capabilities. We present a computational framework assessing user-perceived quality (UPQ) through expected and unexpected MI behaviors. Analyzing human therapist and GPT-4 MI sessions via human-AI collaboration, we developed predictive models integrating deep learning and explainable AI to identify 17 MI-consistent (MICO) and MI-inconsistent (MIIN) behavioral metrics. A customized chain-of-thought prompt improved GPT-4’s MI performance, reducing inappropriate advice while enhancing reflections and empathy. Although GPT-4 remained marginally inferior to therapists overall, it demonstrated superior advice management capabilities. The model achieved measurable quality improvements through prompt engineering, yet showed limitations in addressing complex emotional nuances. This framework establishes a pathway for optimizing LLM-based therapeutic tools through targeted behavioral metric analysis and human-AI co-evaluation. Findings highlight both the scalability potential and current constraints of LLMs in clinical communication applications.
zh

[NLP-126] Chart-to-Experience: Benchmarking Multimodal LLM s for Predicting Experiential Impact of Charts

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在评估图表的感知和情感影响方面的能力不足问题，特别是其在个体图表评估中的敏感性与人类评价者存在差距。解决方案的关键在于引入了一个名为Chart-to-Experience的基准数据集，该数据集包含36张图表，并由众包工作者对七种体验因素进行评估，从而为模型提供真实可靠的评估标准。基于此数据集，研究者对当前最先进的MLLMs在直接预测和图表对比较任务中的能力进行了评估，揭示了其在 pairwise comparison 中的准确性与可靠性。

链接: https://arxiv.org/abs/2505.17374
作者: Seon Gyeom Kim,Jae Young Choi,Ryan Rossi,Eunyee Koh,Tak Yeon Lee
机构: KAIST(韩国科学技术院); Adobe Research(Adobe研究院)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: This paper has been accepted to IEEE PacificVis 2025

点击查看摘要

Abstract:The field of Multimodal Large Language Models (MLLMs) has made remarkable progress in visual understanding tasks, presenting a vast opportunity to predict the perceptual and emotional impact of charts. However, it also raises concerns, as many applications of LLMs are based on overgeneralized assumptions from a few examples, lacking sufficient validation of their performance and effectiveness. We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors. Using the dataset as ground truth, we evaluated capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts. Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.
zh

[NLP-127] Value-Guided Search for Efficient Chain-of-Thought Reasoning

【速读】：该论文旨在解决长上下文推理轨迹中价值模型（value model）训练的效率与有效性问题，特别是针对现有过程奖励模型（Process Reward Models, PRMs）需要精细定义“步骤”概念所带来的挑战。其解决方案的关键在于提出一种无需依赖细粒度“步骤”定义的简单高效方法，通过收集250万条推理轨迹数据，训练了一个15亿参数的分块级价值模型，并结合块级价值引导搜索（block-wise value-guided search, VGS）策略，在测试时计算扩展中实现了优于传统方法的性能表现。

链接: https://arxiv.org/abs/2505.17373
作者: Kaiwen Wang,Jin Peng Zhou,Jonathan Chang,Zhaolin Gao,Nathan Kallus,Kianté Brantley,Wen Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of “step,” which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 2025, HMMT Feb 2024 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance of majority voting. Our dataset, model and codebase are open-sourced.
zh

[NLP-128] An End-to-End Approach for Child Reading Assessment in the Xhosa Language

【速读】：该论文旨在解决低资源语言中儿童语音识别与阅读评估系统的开发难题，特别是在南非使用的科萨语（Xhosa）背景下，以提升儿童读写能力的评估效果。其解决方案的关键在于构建一个包含十组单词和字母的新型数据集，并通过多标记标注和独立评审确保数据质量，同时利用先进的端到端模型（如wav2vec 2.0、HuBERT和Whisper）进行训练与优化，以应对数据量有限及儿童语音独特声学特性带来的挑战。

链接: https://arxiv.org/abs/2505.17371
作者: Sergio Chevtchenko,Nikhil Navas,Rafaella Vale,Franco Ubaudi,Sipumelele Lucwaba,Cally Ardington,Soheil Afshar,Mark Antoniou,Saeed Afshar
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Paper accepted on AIED 2025 containing 14 pages, 6 figures and 4 tables

点击查看摘要

Abstract:Child literacy is a strong predictor of life outcomes at the subsequent stages of an individual’s life. This points to a need for targeted interventions in vulnerable low and middle income populations to help bridge the gap between literacy levels in these regions and high income ones. In this effort, reading assessments provide an important tool to measure the effectiveness of these programs and AI can be a reliable and economical tool to support educators with this task. Developing accurate automatic reading assessment systems for child speech in low-resource languages poses significant challenges due to limited data and the unique acoustic properties of children’s voices. This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The dataset is available upon request and contains ten words and letters, which are part of the Early Grade Reading Assessment (EGRA) system. Each recording is labeled with an online and cost-effective approach by multiple markers and a subsample is validated by an independent EGRA reviewer. This dataset is evaluated with three fine-tuned state-of-the-art end-to-end models: wav2vec 2.0, HuBERT, and Whisper. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data, which is fundamental for cost-effective large dataset collection. Furthermore, our experiments indicate that the wav2vec 2.0 performance is improved by training on multiple classes at a time, even when the number of available samples is constrained.
zh

[NLP-129] A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit ACL

【速读】：该论文试图解决如何利用大型语言模型（Large Language Models, LLMs）作为自动化谈话治疗师的有效性及合规性问题。其解决方案的关键在于开发了一个基于最新LLM和动机访谈（Motivational Interviewing, MI）方法的咨询聊天机器人，并与具有MI专业背景的临床科学家合作进行优化，以确保其符合已知的治疗标准。此外，研究还提出了一个自动评估系统，用于衡量聊天机器人对MI的遵循程度及其对客户回应的影响。

链接: https://arxiv.org/abs/2505.17362
作者: Zafarullah Mahmood,Soliman Ali,Jiading Zhu,Mohamed Abdelwahab,Michelle Yu Collins,Sihan Chen,Yi Cheng Zhao,Jodi Wolff,Osnat Melamed,Nadia Minian,Marta Maslej,Carolynne Cooper,Matt Ratto,Peter Selby,Jonathan Rose
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), Vienna, Austria, 2025

点击查看摘要

Abstract:The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot’s adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants’ confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants’ language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.
zh

[NLP-130] DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

【速读】：该论文试图解决小语言模型（Small Language Models, SLMs）在Theory-of-Mind (ToM) 任务中表现不佳的问题，这类模型由于规模有限，往往缺乏深度社会推理能力。解决方案的关键在于提出DEL-ToM框架，通过推理时的扩展而非架构调整来提升ToM推理能力。该方法将ToM任务分解为基于动态知识逻辑（Dynamic Epistemic Logic, DEL）的信念更新序列，利用一个称为Process Belief Model (PBM) 的验证器对每一步信念更新进行评分，并在推理阶段选择得分最高的信念轨迹，从而实现更系统和透明的推理过程。

链接: https://arxiv.org/abs/2505.17348
作者: Yuheng Wu,Jianwen Xie,Denghui Zhang,Zhaozhuo Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Theory-of-Mind (ToM) tasks pose a unique challenge for small language models (SLMs) with limited scale, which often lack the capacity to perform deep social reasoning. In this work, we propose DEL-ToM, a framework that improves ToM reasoning through inference-time scaling rather than architectural changes. Our approach decomposes ToM tasks into a sequence of belief updates grounded in Dynamic Epistemic Logic (DEL), enabling structured and transparent reasoning. We train a verifier, called the Process Belief Model (PBM), to score each belief update step using labels generated automatically via a DEL simulator. During inference, candidate belief traces generated by a language model are evaluated by the PBM, and the highest-scoring trace is selected. This allows SLMs to emulate more deliberate reasoning by allocating additional compute at test time. Experiments across multiple model scales and benchmarks show that DEL-ToM consistently improves performance, demonstrating that verifiable belief supervision can significantly enhance ToM abilities of SLMs without retraining.
zh

[NLP-131] Language models should be subject to repeatable open domain-contextualized hallucination benchmarking

【速读】：该论文试图解决语言模型中幻觉（hallucination）现象的普遍性和评估问题，即模型生成文本中存在看似合理但不准确的标记，这对语言模型的负责任应用构成挑战。论文提出的解决方案关键在于建立可重复、开放且领域情境化的幻觉基准测试，以更科学地评估语言模型的幻觉现象，并强调在数据创建早期阶段缺乏专家参与会导致幻觉度量缺乏有效性和实用性。

链接: https://arxiv.org/abs/2505.17345
作者: Justin D. Norman,Michael U. Rivera,D. Alex Hughes
机构: University of California, Berkeley (加州大学伯克利分校); School of Information (信息学院)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.
zh

[NLP-132] SweEval: Do LLM s Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use NAACL2025

【速读】：该论文旨在解决企业级应用场景中大型语言模型（Large Language Models, LLMs）在处理跨文化、跨语言沟通任务时，如何有效识别和应对不安全或冒犯性语言的问题，以降低声誉风险、维护信任并确保合规性。其解决方案的关键在于引入SweEval基准测试，该基准通过模拟真实场景中的语气（积极或消极）和语境（正式或非正式）变化，并明确要求模型包含特定脏话来完成任务，从而评估LLMs在面对不当指令时的合规性与抵抗能力，以及其在伦理框架、文化差异和语言理解方面的对齐程度。

链接: https://arxiv.org/abs/2505.17332
作者: Hitesh Laxmichand Patel,Amit Agarwal,Arion Das,Bhargava Kumar,Srikant Panda,Priyaranjan Pattnayak,Taki Hasan Rafi,Tejaswini Kumar,Dong-Kyu Chae
机构: Oracle AI(奥拉克尔人工智能); Indian Institute of Information Technology Ranchi(印度信息科技学院拉贾斯坦分校); TD Securities( TD 证券); Columbia University(哥伦比亚大学); Hanyang University(汉阳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Published in the Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025), Industry Track, pages 558-582

点击查看摘要

Abstract:Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: this https URL.
zh

[NLP-133] ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在训练速度和推理吞吐量上的效率问题，同时保持其学习能力。解决方案的关键在于提出ECHO-LLaMA架构，通过将LLaMA模型中的某些层转换为共享键值（Key-Value, KV）缓存，显著降低了KV计算的复杂度，从而在不牺牲语言性能的情况下提升了训练和推理效率。

链接: https://arxiv.org/abs/2505.17331
作者: Maryam Dialameh,Rezaul Karim,Hossein Rajabzadeh,Omar Mohamed Awad,Hyock Ju Kwon,Boxing Chen,Walid Ahmed,Yang Liu
机构: University of Waterloo (滑铁卢大学); Ascend Team, Huawei Technologies (升腾团队，华为技术)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into shared KV caching across certain layers, significantly reducing KV computational complexity while maintaining or improving language performance. Experimental results demonstrate that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. Furthermore, on the 1.1B model, ECHO-LLaMA delivers approximately 7% higher test-time throughput compared to the baseline. By introducing a computationally efficient adaptation mechanism, ECHO-LLaMA offers a scalable and cost-effective solution for pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
zh

[NLP-134] FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding COLING2025

【速读】：该论文旨在解决在少量样本情况下，视觉丰富的文档理解（Visually Rich Document Understanding, VRDU）任务中模型适应不同文档类型及处理实际挑战（如OCR错误、拼写错误和领域偏移）的问题。解决方案的关键在于提出一种可扩展且高效的模型架构——Few Shot Domain Adapting Graph (FS-DAG)，该架构通过模块化框架集成领域特定和语言/视觉特定的主干网络，从而在数据有限的情况下实现对多种文档类型的高效适应。

链接: https://arxiv.org/abs/2505.17330
作者: Amit Agarwal,Srikant Panda,Kulbhushan Pachauri
机构: OCI, Oracle USA; OCI, Oracle India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Published in the Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Industry Track, pages 100-114

点击查看摘要

Abstract:In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG’s capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : this https URL
zh

[NLP-135] GPT Editors Not Authors: The Stylistic Footprint of LLM s in Academic Preprints

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在学术写作中的使用程度及其对文本风格分割的影响问题，具体是区分LLMs用于生成关键文本与仅用于编辑（如语法检查或不当措辞修正）的程度。其解决方案的关键在于通过调整PELT阈值并结合基于GPT生成文本训练的贝叶斯分类器，对arXiv论文进行风格分割分析，从而评估LLMs生成的语言是否具有可预测的风格特征。研究结果表明，LLM生成的语言不具备预测风格分割的能力，说明作者在使用LLMs时具有统一性，从而降低了幻觉引入学术预印本的风险。

链接: https://arxiv.org/abs/2505.17327
作者: Soren DeHaan,Yuanze Liu,Johan Bollen,Sa’ul A. Blanco
机构: Indiana University (印第安纳大学)
类目: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 13 pages

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) in late 2022 has impacted academic writing, threatening credibility, and causing institutional uncertainty. We seek to determine the degree to which LLMs are used to generate critical text as opposed to being used for editing, such as checking for grammar errors or inappropriate phrasing. In our study, we analyze arXiv papers for stylistic segmentation, which we measure by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. We find that LLM-attributed language is not predictive of stylistic segmentation, suggesting that when authors use LLMs, they do so uniformly, reducing the risk of hallucinations being introduced into academic preprints.
zh

[NLP-136] From Compression to Expansion: A Layerwise Analysis of In-Context Learning

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在上下文学习（In-context Learning, ICL）中任务特定信息的内部表征机制不明确的问题。其解决方案的关键在于通过统计几何分析揭示了ICL表征的层间动态特性，即“层间压缩-扩展”现象：早期层生成紧凑且具有判别性的表征以编码输入示例中的任务信息，而后期层则扩展这些表征以融合查询并生成预测。这一发现为理解ICL性能与模型规模、示例数量的关系提供了理论依据，并揭示了注意力机制在降低方差和偏差方面的作用。

链接: https://arxiv.org/abs/2505.17322
作者: Jiachen Jiang,Yuxin Dong,Jinxin Zhou,Zhihui Zhu
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without weight updates by learning from demonstration sequences. While ICL shows strong empirical performance, its internal representational mechanisms are not yet well understood. In this work, we conduct a statistical geometric analysis of ICL representations to investigate how task-specific information is captured across layers. Our analysis reveals an intriguing phenomenon, which we term Layerwise Compression-Expansion: early layers progressively produce compact and discriminative representations that encode task information from the input demonstrations, while later layers expand these representations to incorporate the query and generate the prediction. This phenomenon is observed consistently across diverse tasks and a range of contemporary LLM architectures. We demonstrate that it has important implications for ICL performance – improving with model size and the number of demonstrations – and for robustness in the presence of noisy examples. To further understand the effect of the compact task representation, we propose a bias-variance decomposition and provide a theoretical analysis showing how attention mechanisms contribute to reducing both variance and bias, thereby enhancing performance as the number of demonstrations increases. Our findings reveal an intriguing layerwise dynamic in ICL, highlight how structured representations emerge within LLMs, and showcase that analyzing internal representations can facilitate a deeper understanding of model behavior.
zh

[NLP-137] Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

【速读】：该论文旨在解决合成具有表现力的日语角色语音所面临的独特挑战，包括对音高重音的敏感性和风格可变性。研究通过在领域内、角色驱动的日语语音数据上基准测试两个开源文本到语音模型——VITS和Style-BERT-VITS2 JP Extra（SBV2JE），评估其在自然度（平均意见分数和比较平均意见分数）、可懂度（词错误率）和说话人一致性方面的性能。SBV2JE的关键解决方案在于引入音高重音控制和基于WavLM的判别器，从而在保持高自然度的同时提升语音质量，尽管其计算需求较高。

链接: https://arxiv.org/abs/2505.17320
作者: Zackary Rackauckas,Julia Hirschberg
机构: Columbia University (哥伦比亚大学); RoleGaku (RoleGaku)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper benchmarks two open-source text-to-speech models–VITS and Style-BERT-VITS2 JP Extra (SBV2JE)–on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
zh

[NLP-138] Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models

【速读】：该论文旨在解决多模态大语言模型（Multimodal LLMs, MLLMs）中视觉嵌入与大型语言模型（Large Language Models, LLMs）之间对齐不足的问题，特别是在视觉块（vision patch）与语义词之间的对齐效果较弱。其解决方案的关键在于提出“块对齐训练”（patch-aligned training），通过优化投影器（projector）的训练方式，显著提升视觉嵌入的压缩能力和块级语义对齐效果，从而增强模型生成高质量描述和执行多模态任务的能力。

链接: https://arxiv.org/abs/2505.17316
作者: Jiachen Jiang,Jinxin Zhou,Bo Peng,Xia Ning,Zhihui Zhu
机构: The Ohio State University (俄亥俄州立大学); Translational Data Analytics Institute, The Ohio State University (转化数据科学研究所，俄亥俄州立大学); Department of Biomedical Informatics, The Ohio State University (生物医学信息学系，俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content. We then examine patch-level alignment – the alignment between each vision patch and its corresponding semantic words – and propose a multi-semantic alignment hypothesis. Our analysis indicates that the projector trained by caption loss improves patch-level alignment but only to a limited extent, resulting in weak and coarse alignment. To address this issue, we propose patch-aligned training to efficiently enhance patch-level alignment. Our experiments show that patch-aligned training (1) achieves stronger compression capability and improved patch-level alignment, enabling the MLLM to generate higher-quality captions, (2) improves the MLLM’s performance by 16% on referring expression grounding tasks, 4% on question-answering tasks, and 3% on modern instruction-following benchmarks when using the same supervised fine-tuning (SFT) setting. The proposed method can be easily extended to other multimodal models.
zh

[NLP-139] Longer Context Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

【速读】：该论文试图解决当前语言模型在推理能力上的局限性问题，其核心假设是这些限制部分源于长上下文容量不足。解决方案的关键在于增强模型的长上下文能力，特别是在监督微调（Supervised Fine-Tuning, SFT）之前，通过对比具有相同架构和微调数据但不同长上下文容量的模型，验证了长上下文能力的提升能够显著提高推理性能，且这种优势在输入长度较短的任务中依然存在。这表明长上下文建模不仅是处理长输入的必要条件，也是推理能力的重要基础。

链接: https://arxiv.org/abs/2505.17315
作者: Wang Yang,Zirui Liu,Hongye Jin,Qingyu Yin,Vipin Chaudhary,Xiaotian Han
机构: Case Western Reserve University (凯斯西储大学); University of Minnesota - Twin Cities (明尼苏达大学双城分校); Texas A&M University (德克萨斯农工大学); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model’s long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.
zh

[NLP-140] Refusal Direction is Universal Across Safety-Aligned Languages

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在多语言场景下的拒绝行为（refusal behavior）机制及其跨语言通用性问题，旨在提升多语言安全防御的鲁棒性。其解决方案的关键在于发现并验证了拒绝方向（refusal direction）在不同语言间的高度可迁移性，即从英语中提取的向量可以几乎完美地绕过其他语言的拒绝机制，且无需额外微调，这揭示了嵌入空间中拒绝向量的平行性是导致跨语言漏洞的根本原因。

链接: https://arxiv.org/abs/2505.17306
作者: Xinpeng Wang,Mingyang Wang,Yihong Liu,Hinrich Schütze,Barbara Plank
机构: LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Bosch BCAI (博世人工智能中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
zh

[NLP-141] SELF: Self-Extend the Context Length With Logistic Growth Function

【速读】：该论文试图解决大型语言模型在处理超出其训练上下文长度的长文本时所面临的问题，即由于标准位置编码导致远距离标记之间相互作用稀少，从而引发长提示产生意外结果。解决方案的关键在于提出SELF（Self-Extend the Context Length With Logistic Growth Function），该方法通过使用逻辑增长函数结合不同组大小的连续标记分组策略，在较小相对距离下保持恒定组大小，从而有效扩展上下文长度。

链接: https://arxiv.org/abs/2505.17296
作者: Phat Thanh Dang,Saahil Thoppay,Wang Yang,Qifan Wang,Vipin Chaudhary,Xiaotian Han
机构: Case Western Reserve University (凯斯西储大学); Meta (元)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model had an increase in performance of up to 12% compared to the LongLM extension method in LEval (specifically on the Qwen model). On summarization related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than the LongLM. Our code is available at this https URL.
zh

[NLP-142] Attention with Trained Embeddings Provably Selects Important Tokens

【速读】：该论文试图解决语言模型中token embeddings的理论理解不足的问题（token embeddings），其解决方案的关键在于通过梯度下降分析嵌入结构，揭示了在二分类任务中，经过单步梯度训练后，嵌入向量会根据对应token在数据集中的出现频率与输出向量对齐，而在训练收敛后，softmax机制会选择句子中有预测能力的token，并使\mathrm\langle cls \rangle嵌入最大化该选择的间隔。

链接: https://arxiv.org/abs/2505.17282
作者: Diyuan Wu,Aleksandr Shevchenko,Samet Oymak,Marco Mondelli
机构: Institute of Science and Technology Austria (IST Austria)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited. Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., \textttSoftmax( p^\top E_X^\top ) E_X v = \frac \sum_i=1^T \exp(p^\top E_x_i) E_x_i^\top v\sum_j=1^T \exp(p^\top E_x_j) , where E_X = [ E_x_1 , \dots, E_x_T ]^\top contains the embeddings of the input sequence, p is the embedding of the \mathrm\langle cls \rangle token and v the output vector. First, we show that, already after a single step of gradient training with the logistic loss, the embeddings E_X capture the importance of tokens in the dataset by aligning with the output vector v proportionally to the frequency with which the corresponding tokens appear in the dataset. Then, after training p via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting \mathrm\langle cls \rangle embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
zh

[NLP-143] Search Wisely: Mitigating Sub-optimal Agent ic Searches By Reducing Uncertainty

【速读】：该论文旨在解决Agentic Retrieval-Augmented Generation (RAG)系统在信息检索过程中存在的次优搜索行为问题，如过度搜索（检索冗余信息）和不足搜索（未能检索必要信息），这些问题影响了系统的效率和可靠性。解决方案的关键在于提出一种基于强化学习的训练方法——β-GRPO，该方法通过引入置信度阈值来奖励高确定性的搜索决策，从而提升模型在搜索过程中的决策质量与准确性。

链接: https://arxiv.org/abs/2505.17281
作者: Peilin Wu,Mian Zhang,Xinlu Zhang,Xinya Du,Zhiyu Zoey Chen
机构: The University of Texas at Dallas(德克萨斯大学达拉斯分校); University of California, Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models’ uncertainty regarding their own knowledge boundaries, where response accuracy correlates with model’s uncertainty in its search decisions. To address this, we propose \beta -GRPO, a reinforcement learning-based training method that incorporates confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that \beta -GRPO enable a 3B model with better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.
zh

[NLP-144] Zebra-Llama: Towards Extremely Efficient Hybrid Models

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）在部署时面临的推理效率低下问题，以及为满足用户特定需求而进行重新训练所带来的高昂成本和环境不可持续性问题。其解决方案的关键在于提出一种实用且可扩展的方法，即从现有的预训练模型中组合出高效的混合语言模型。该方法的核心是引入Zebra-Llama系列模型，通过结合状态空间模型（State Space Models, SSMs）和多头潜在注意力（Multi-head Latent Attention, MLA）层，并采用优化的初始化和后训练流程，以高效地将知识从预训练的Transformer模型迁移过来，从而在保持接近Transformer性能的同时显著提升计算效率。

链接: https://arxiv.org/abs/2505.17272
作者: Mingyu Yang,Mehdi Rezagholizadeh,Guihong Li,Vikram Appia,Emad Barsoum
机构: Advanced Micro Devices, Inc. (Advanced Micro Devices, Inc.)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size -down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively-while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLLaMA, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.
zh

[NLP-145] GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在法律领域问答任务中的评估问题，特别是在希腊律师考试的五个不同法律领域中，要求模型能够引用法规条文和案例事实。解决方案的关键在于提出一种三维评分系统，并结合LLM-as-a-judge方法进行自由文本评估，同时开发了一个元评估基准来衡量LLM评判者与人类专家评估的相关性，结果显示基于简单跨度的评分标准能提升两者的对齐程度。

链接: https://arxiv.org/abs/2505.17267
作者: Odysseas S. Chlapanis,Dimitrios Galanis,Nikolaos Aletras,Ion Androutsopoulos
机构: Athens University of Economics and Business (雅典经济与商业大学); Athena Research Center (阿特纳研究中心); University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 17 figures, submitted to May ARR

点击查看摘要

Abstract:We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
zh

[NLP-146] Select2Reason : Efficient Instruction-Tuning Data Selection for Long-CoT Reasoning

【速读】：该论文试图解决在预训练大语言模型中激活长链式思维（long-CoT）推理能力时，由于大规模指令集带来的训练开销过大以及缺乏有效的自动长-CoT指令选择策略的问题。解决方案的关键在于提出Select2Reason框架，该框架通过量化问题难度并结合基于推理轨迹长度的启发式加权排序策略，高效筛选出高价值的指令样本，从而在显著减少数据量的情况下实现与全数据微调相当或更优的性能。

链接: https://arxiv.org/abs/2505.17266
作者: Cehao Yang,Xueyuan Lin,Chengjin Xu,Xuhui Jiang,Xiaojun Wu,Honghao Liu,Hui Xiong,Jian Guo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A practical approach to activate long chain-of-thoughts reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection still remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of emergence of rethinking behaviors like self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate difficulty of question and jointly incorporates a reasoning trace length-based heuristic through a weighted scheme for ranking to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight the scalability in varying data size, efficiency during inference, and its adaptability to other instruction pools with minimal cost.
zh

[NLP-147] CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports

【速读】：该论文试图解决罕见疾病（包括先天性代谢障碍，Inborn Errors of Metabolism, IEM）在诊断过程中面临的挑战，特别是如何从病例报告中高效提取结构化临床信息以支持诊断。解决方案的关键在于引入CaseReportBench数据集，这是一个由专家标注的用于密集信息提取的病例报告数据集，并评估了多种模型和提示策略，如类别特定提示和子标题过滤的数据整合方法，以提升信息提取的准确性与临床相关性。

链接: https://arxiv.org/abs/2505.17265
作者: Xiao Yu Cindy Zhang(1),Carlos R. Ferreira(2),Francis Rossignol(2),Raymond T. Ng(1),Wyeth Wasserman(1),Jian Zhu(1) ((1) University of British Columbia, (2) National Institutes of Health)
机构: University of British Columbia(不列颠哥伦比亚大学); National Institutes of Health(美国国家卫生研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rare diseases, including Inborn Errors of Metabolism (IEM), pose significant diagnostic challenges. Case reports serve as key but computationally underutilized resources to inform diagnosis. Clinical dense information extraction refers to organizing medical information into structured predefined categories. Large Language Models (LLMs) may enable scalable information extraction from case reports but are rarely evaluated for this task. We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. Using this dataset, we assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Zero-shot chain-of-thought prompting offers little advantage over standard zero-shot prompting. Category-specific prompting improves alignment with the benchmark. The open-source model Qwen2.5-7B outperforms GPT-4o for this task. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management. We also highlight areas for improvement, such as LLMs’ limitations in recognizing negative findings important for differential diagnosis. This work advances LLM-driven clinical natural language processing and paves the way for scalable medical AI applications.
zh

[NLP-148] he Rise of Parameter Specialization for Knowledge Storag e in Large Language Models

【速读】：该论文试图解决如何在有限参数规模下更有效地存储和利用知识的问题，特别是针对多层感知机（MLP）中知识的存储方式。其解决方案的关键在于发现随着语言模型的进步，其参数表现出更高的专业化特性，即MLP中的参数更倾向于编码相似类型的知识，这种专门化的知识分布有助于提升模型对存储知识的利用效率。

链接: https://arxiv.org/abs/2505.17260
作者: Yihuai Hong,Yiran Zhao,Wei Tang,Yang Deng,Yu Rong,Wenxuan Zhang
机构: Alibaba DAMO Academy (阿里巴巴达摩院); National University of Singapore (新加坡国立大学); Singapore Management University (新加坡管理大学); Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Over time, a growing wave of large language models from various series has been introduced to the community. Researchers are striving to maximize the performance of language models with constrained parameter sizes. However, from a microscopic perspective, there has been limited research on how to better store knowledge in model parameters, particularly within MLPs, to enable more effective utilization of this knowledge by the model. In this work, we analyze twenty publicly available open-source large language models to investigate the relationship between their strong performance and the way knowledge is stored in their corresponding MLP parameters. Our findings reveal that as language models become more advanced and demonstrate stronger knowledge capabilities, their parameters exhibit increased specialization. Specifically, parameters in the MLPs tend to be more focused on encoding similar types of knowledge. We experimentally validate that this specialized distribution of knowledge contributes to improving the efficiency of knowledge utilization in these models. Furthermore, by conducting causal training experiments, we confirm that this specialized knowledge distribution plays a critical role in improving the model’s efficiency in leveraging stored knowledge.
zh

[NLP-149] ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models

【速读】：该论文旨在解决大型语言模型在生成推理轨迹时存在的计算资源浪费、可读性差以及幻觉问题。其关键解决方案是引入一种无需超参数的简洁性评分（conciseness score），作为强化学习框架中的奖励信号，以引导模型生成准确且简洁的推理过程。该评分由一个大型语言模型作为评判者进行评估，提供超越简单token长度的动态、上下文感知反馈。

链接: https://arxiv.org/abs/2505.17250
作者: Razvan-Gabriel Dumitru,Darius Peteleaza,Vikas Yadav,Liangming Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 18 figures, and 6 tables

点击查看摘要

Abstract:Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at this https URL.
zh

[NLP-150] Reasoning Shield: Content Safety Detection over Reasoning Reasoning Traces of Large Reasoning Models

【速读】：该论文试图解决大型推理模型（Large Reasoning Models, LRMs）在生成推理过程时可能包含不安全内容的问题，尽管最终答案看似安全。现有针对问答对（Question-Answer, QA）的安全检测工具在识别推理过程中隐藏的风险时效果有限。解决方案的关键在于提出ReasoningShield，这是首个专门用于检测推理过程中的潜在风险的安全检测模型。其核心创新包括构建一个高质量的推理安全检测数据集，涵盖十类风险和三个安全等级，并采用人机协同标注流程以提高标注准确性和效率。此外，ReasoningShield基于轻量级基础模型，支持高效部署并提供直观的风险分析。

链接: https://arxiv.org/abs/2505.17244
作者: Changyi Li,Jiayi Wang,Xudong Pan,Geng Hong,Min Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) are transforming the AI landscape with advanced reasoning capabilities. While the generated reasoning traces enhance model transparency, they can still contain unsafe content, even when the final answer appears safe. Existing moderation tools, primarily designed for question-answer (QA) pairs, are empirically ineffective at detecting hidden risks embedded in reasoning traces. After identifying the key challenges, we formally define the question-thought (QT) moderation task and propose ReasoningShield, the first safety detection model tailored to identify potential risks in the reasoning trace before reaching the final answer. To construct the model, we synthesize a high-quality reasoning safety detection dataset comprising over 8,000 question-thought pairs spanning ten risk categories and three safety levels. Our dataset construction process incorporates a comprehensive human-AI collaborative annotation pipeline, which achieves over 93% annotation accuracy while significantly reducing human costs. On a diverse set of in-distribution and out-of-distribution benchmarks, ReasoningShield outperforms mainstream content safety moderation models in identifying risks within reasoning traces, with an average F1 score exceeding 0.92. Notably, despite being trained on our QT dataset only, ReasoningShield also demonstrates competitive performance in detecting unsafe question-answer pairs on traditional benchmarks, rivaling baselines trained on 10 times larger datasets and base models, which strongly validates the quality of our dataset. Furthermore, ReasoningShield is built upon compact 1B/3B base models to facilitate lightweight deployment and provides human-friendly risk analysis by default. To foster future research, we publicly release all the resources.
zh

[NLP-151] Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG )

【速读】：该论文旨在解决在STEM+C（科学、技术、工程、数学与计算）环境中，基于生成式 AI (Generative AI) 的教学代理在协作对话中因潜在的幻觉问题而影响学生学习和批判性思维支持的问题。解决方案的关键在于提出一种日志上下文增强的检索生成（log-contextualized RAG, LC-RAG）方法，通过引入环境日志来增强检索生成过程的上下文关联性，从而提升检索效果并实现更相关、个性化的学生指导。

链接: https://arxiv.org/abs/2505.17238
作者: Clayton Cohn,Surya Rayala,Caitlin Snyder,Joyce Fonteles,Shruti Jain,Naveeduddin Mohammed,Umesh Timalsina,Sarah K. Burriss,Ashwin T S,Namrata Srivastava,Menton Deweese,Angela Eeds,Gautam Biswas
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to the International Conference on Artificial Intelligence in Education (AIED) Workshop on Epistemics and Decision-Making in AI-Supported Education

点击查看摘要

Abstract:Collaborative dialogue offers rich insights into students’ learning and critical thinking. This is essential for adapting pedagogical agents to students’ learning and problem-solving skills in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, potential hallucinations can undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge, but its effectiveness depends on clear semantic links between user input and a knowledge base, which are often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by incorporating environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students’ critical thinking and epistemic decision-making in a collaborative computational modeling environment, XYZ.
zh

[NLP-152] CHAOS: Chart Analysis with Outlier Samples

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在面对存在异常或噪声特征的图表（outlier charts）时表现出的解释能力不足问题。其解决方案的关键在于提出CHAOS基准，这是一个系统评估MLLMs对图表扰动鲁棒性的基准，包含五种文本扰动和十种视觉扰动，每种扰动设置三个严重程度级别，并通过两个下游任务（ChartQA和Chart-to-Text）进行综合分析，以揭示模型在不同图表扰动下的鲁棒性特征。

链接: https://arxiv.org/abs/2505.17235
作者: Omar Moured,Yufan Chen,Ruiping Liu,Simon Reiß,Philip Torr,Jiaming Zhang,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Data and code are publicly available at: this http URL

点击查看摘要

Abstract:Charts play a critical role in data analysis and visualization, yet real-world applications often present charts with challenging or noisy features. However, “outlier charts” pose a substantial challenge even for Multimodal Large Language Models (MLLMs), which can struggle to interpret perturbed charts. In this work, we introduce CHAOS (CHart Analysis with Outlier Samples), a robustness benchmark to systematically evaluate MLLMs against chart perturbations. CHAOS encompasses five types of textual and ten types of visual perturbations, each presented at three levels of severity (easy, mid, hard) inspired by the study result of human evaluation. The benchmark includes 13 state-of-the-art MLLMs divided into three groups (i.e., general-, document-, and chart-specific models) according to the training scope and data. Comprehensive analysis involves two downstream tasks (ChartQA and Chart-to-Text). Extensive experiments and case studies highlight critical insights into robustness of models across chart perturbations, aiming to guide future research in chart understanding domain. Data and code are publicly available at: this http URL.
zh

[NLP-153] ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

【速读】：该论文试图解决当前文本到SQL（text-to-SQL）模型在多SQL方言（SQL dialects）上的泛化能力不足的问题，尤其是在真实应用场景中，不同数据库系统具有不同的语法和特性，而现有模型难以有效适应。解决方案的关键在于引入ExeSQL框架，该框架通过执行驱动的、代理式自举方法，结合迭代查询生成、基于执行的过滤（如拒绝采样）以及基于偏好的训练，使模型能够通过可验证的、反馈引导的学习过程适应新的SQL方言。

链接: https://arxiv.org/abs/2505.17231
作者: Jipeng Zhang,Haolin Yang,Kehao Miao,Ruiyuan Zhang,Renjie Pi,Jiahui Gao,Xiaofang Zhou
机构: The Hong Kong University of Science and Technology (香港科技大学); Nanyang Technological University (南洋理工大学); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. However, real-world applications require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting - without validating SQLs via execution - tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.
zh

[NLP-154] Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts

【速读】：该论文旨在解决在自然语言处理中建模复杂主观任务（如情感和道德识别）时，由于人类标注的显著差异所带来的挑战。这种差异通常反映了合理的语义解释差异，而非单纯的噪声，因此需要区分合法的主观性与错误。该研究的关键解决方案是提出一种基于大语言模型（Large Language Models, LLMs）的标签验证方法，其中核心思想是通过“Label-in-a-Haystack”设置，将查询及其标签包含在展示给LLMs的示例中，并让模型在接收特定任务指令的情况下重新预测标签，而非直接复制标签。这种方法能够有效识别模型输出与参考标签之间的偏差，并通过标签修正框架（Label-in-a-Haystack Rectification, LiaHR）对主观标签进行修正，从而提升标注管道中的信噪比。

链接: https://arxiv.org/abs/2505.17222
作者: Georgios Chochlakis,Peter Wu,Arjun Bedi,Marcus Ma,Kristina Lerman,Shrikanth Narayanan
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 16 figures, 9 tables

点击查看摘要

Abstract:Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show how the failure to copy the label(s) to the output of the LLM are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at this https URL.
zh

[NLP-155] Mitigating Gender Bias via Fostering Exploratory Thinking in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中存在的性别偏见问题，这种偏见导致模型在不同情境下对男性和女性主体的处理不平等。解决方案的关键在于提出一种新颖的数据生成框架，该框架通过引导模型生成结构相同、道德模糊的情境下包含男性和女性主角的故事对，并引发和比较它们的道德判断。当出现不一致时，模型被引导生成平衡且性别中立的判断，这些故事-判断对用于通过直接偏好优化（Direct Preference Optimization, DPO）对模型进行微调或优化。

链接: https://arxiv.org/abs/2505.17217
作者: Kangda Wei,Hasnat Md Abdullah,Ruihong Huang
机构: Texas A&M University (德克萨斯A&M大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data.
zh

[NLP-156] FB-RAG : Improving RAG with Forward and Backward Lookup

【速读】：该论文旨在解决检索增强生成（Retrieval Augmented Generation, RAG）系统中上下文质量与大小之间的权衡问题，即过大的上下文可能包含无关信息导致模型混淆，而过小的上下文可能丢失关键信息。解决方案的关键在于提出一种名为FB-RAG的新框架，通过结合向后查找（与查询重叠）和向前查找（与候选原因和答案重叠）来检索最相关的内容片段，从而提升回答输入查询的准确性。

链接: https://arxiv.org/abs/2505.17206
作者: Kushal Chawla,Alfy Samuel,Anoop Kumar,Daben Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The performance of Retrieval Augmented Generation (RAG) systems relies heavily on the retriever quality and the size of the retrieved context. A large enough context ensures that the relevant information is present in the input context for the LLM, but also incorporates irrelevant content that has been shown to confuse the models. On the other hand, a smaller context reduces the irrelevant information, but it often comes at the risk of losing important information necessary to answer the input question. This duality is especially challenging to manage for complex queries that contain little information to retrieve the relevant chunks from the full context. To address this, we present a novel framework, called FB-RAG, which enhances the RAG pipeline by relying on a combination of backward lookup (overlap with the query) and forward lookup (overlap with candidate reasons and answers) to retrieve specific context chunks that are the most relevant for answering the input query. Our evaluations on 9 datasets from two leading benchmarks show that FB-RAG consistently outperforms RAG and Long Context baselines developed recently for these benchmarks. We further show that FB-RAG can improve performance while reducing latency. We perform qualitative analysis of the strengths and shortcomings of our approach, providing specific insights to guide future work.
zh

[NLP-157] CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models

【速读】：该论文试图解决当前视觉-语言模型在理解数据可视化任务中与人类表现存在显著差距的问题，其核心在于评估这些模型是否能够模拟人类在解析数据可视化时的认知操作。解决方案的关键在于设计针对人类的数据可视化素养评估测试，并将其应用于八种视觉-语言模型，以对比模型与人类的表现差异，从而揭示现有模型在这一领域的不足及改进方向。

链接: https://arxiv.org/abs/2505.17202
作者: Arnav Verma,Kushin Mukherjee,Christopher Potts,Elisa Kreiss,Judith E. Fan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat – succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: this https URL.
zh

[NLP-158] Next Token Perception Score: Analytical Assessment of your LLM Perception Skills

【速读】：该论文试图解决大规模语言模型（Large Language Models, LLMs）在下游感知任务中线性探针性能表现不一致的问题，即基于自回归预训练的表示可能未能有效对齐感知任务所需的特征子空间。其解决方案的关键在于提出了一种名为“下一词感知得分”（Next Token Perception Score, NTPS）的度量方法，该方法通过线性设定量化自回归预训练与感知特征子空间之间的对齐程度，并证明其能够有效上界和下界过拟合损失。NTPS可从预训练表示和标注数据中直接计算，且与线性探针准确率高度相关，为评估模型感知能力提供了理论依据和实用工具。

链接: https://arxiv.org/abs/2505.17169
作者: Yu-Ang Cheng,Leyang Hu,Hai Huang,Randall Balestriero
机构: Brown University (布朗大学); Atlassian (Atlassian)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive pretraining has become the de facto paradigm for learning general-purpose representations in large language models (LLMs). However, linear probe performance across downstream perception tasks shows substantial variability, suggesting that features optimized for next-token prediction do not consistently transfer well to downstream perception tasks. We demonstrate that representations learned via autoregression capture features that may lie outside the subspaces most informative for perception. To quantify the (mis)alignment between autoregressive pretraining and downstream perception, we introduce the Next Token Perception Score (NTPS)-a score derived under a linear setting that measures the overlap between autoregressive and perception feature subspaces. This metric can be easily computed in closed form from pretrained representations and labeled data, and is proven to both upper- and lower-bound the excess loss. Empirically, we show that NTPS correlates strongly with linear probe accuracy across 12 diverse NLP datasets and eight pretrained models ranging from 270M to 8B parameters, confirming its utility as a measure of alignment. Furthermore, we show that NTPS increases following low-rank adaptation (LoRA) fine-tuning, especially in large models, suggesting that LoRA aligning representations to perception tasks enhances subspace overlap and thus improves downstream performance. More importantly, we find that NTPS reliably predicts the additional accuracy gains attained by LoRA finetuning thereby providing a lightweight prescreening tool for LoRA adaptation. Our results offer both theoretical insights and practical tools for analytically assessing LLM perception skills.
zh

[NLP-159] CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

【速读】：该论文试图解决长文本放射学报告生成的评估问题，现有自然语言生成（NLG）指标无法捕捉临床正确性，而基于大语言模型（LLM）的指标则缺乏泛化能力，临床准确度指标虽更相关但易受类别不平衡影响，倾向于偏好简单预测。解决方案的关键是提出CRG Score，这是一种考虑分布且可适应的评估指标，仅评估参考报告中明确描述的临床相关异常，支持二分类和结构化标签，并可与任何LLM结合进行特征提取，通过基于标签分布平衡惩罚，实现更公平、稳健的评估，并作为临床对齐的奖励函数。

链接: https://arxiv.org/abs/2505.17167
作者: Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Bernhard Kainz,Bjoern Menze
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
zh

[NLP-160] OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLM s in Complex Text-Rich Image Reasoning

【速读】：该论文试图解决文本丰富的图像推理任务中，多模态大语言模型（Multimodal Large Language Models, MLLMs）能力评估缺乏系统性基准的问题。解决方案的关键在于提出OCR-Reasoning，这是一个全面的基准，旨在系统评估MLLMs在文本丰富的视觉场景中的推理能力，其包含1,069个由人工标注的示例，涵盖6种核心推理能力和18项实际推理任务，并同时标注了推理过程与最终答案，从而能够对模型的推理过程和输出结果进行综合评估。

链接: https://arxiv.org/abs/2505.17163
作者: Mingxin Huang,Yongxin Shi,Dezhi Peng,Songxuan Lai,Zecheng Xie,Lianwen Jin
机构: South China University of Technology (华南理工大学); Huawei Cloud (华为云)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at this https URL.
zh

[NLP-161] Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting

【速读】：该论文试图解决当前大语言模型（Large Language Models, LLMs）在“遗忘”（unlearning）过程中可能仍然保留隐含知识的问题，这种知识可能在特定攻击性提示下被泄露。解决方案的关键在于提出LURK（Latent UnleaRned Knowledge）框架，通过对抗性后缀提示（adversarial suffix prompting）主动探测未完全遗忘的模型中的潜在知识，从而更全面地评估遗忘算法的鲁棒性。

链接: https://arxiv.org/abs/2505.17160
作者: Bang Trinh Tran To,Thai Le
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work presents LURK (Latent UnleaRned Knowledge), a novel framework that probes for hidden retained knowledge in unlearned LLMs through adversarial suffix prompting. LURK automatically generates adversarial prompt suffixes designed to elicit residual knowledge about the Harry Potter domain, a commonly used benchmark for unlearning. Our experiments reveal that even models deemed successfully unlearned can leak idiosyncratic information under targeted adversarial conditions, highlighting critical limitations of current unlearning evaluation standards. By uncovering latent knowledge through indirect probing, LURK offers a more rigorous and diagnostic tool for assessing the robustness of unlearning algorithms. All code will be publicly available.
zh

[NLP-162] PersonaBOT: Bringing Customer Personas to Life with LLM s and RAG

【速读】：该论文旨在解决传统定性方法在构建客户画像（customer personas）过程中存在耗时且难以扩展的问题，提出通过生成式AI (Generative AI) 生成合成客户画像，并将其集成到检索增强生成（Retrieval-Augmented Generation, RAG）聊天机器人中以支持业务决策。解决方案的关键在于利用少样本（Few-Shot）和思维链（Chain-of-Thought, CoT）提示技术生成高质量的合成客户画像，并通过 McNemar 检验评估其完整性、相关性和一致性，最终将合成画像整合至聊天机器人的知识库中以提升响应准确性和实用性。

链接: https://arxiv.org/abs/2505.17156
作者: Muhammed Rizwan,Lars Carlsson,Mohammad Loni
机构: Jönköping University (林雪平大学); Volvo Construction Equipment (沃尔沃建筑设备)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The introduction of Large Language Models (LLMs) has significantly transformed Natural Language Processing (NLP) applications by enabling more advanced analysis of customer personas. At Volvo Construction Equipment (VCE), customer personas have traditionally been developed through qualitative methods, which are time-consuming and lack scalability. The main objective of this paper is to generate synthetic customer personas and integrate them into a Retrieval-Augmented Generation (RAG) chatbot to support decision-making in business processes. To this end, we first focus on developing a persona-based RAG chatbot integrated with verified personas. Next, synthetic personas are generated using Few-Shot and Chain-of-Thought (CoT) prompting techniques and evaluated based on completeness, relevance, and consistency using McNemar’s test. In the final step, the chatbot’s knowledge base is augmented with synthetic personas and additional segment information to assess improvements in response accuracy and practical utility. Key findings indicate that Few-Shot prompting outperformed CoT in generating more complete personas, while CoT demonstrated greater efficiency in terms of response time and token usage. After augmenting the knowledge base, the average accuracy rating of the chatbot increased from 5.88 to 6.42 on a 10-point scale, and 81.82% of participants found the updated system useful in business contexts.
zh

[NLP-163] rimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

【速读】：该论文旨在解决大型推理模型（Large Reasoning Models, LRMs）在测试时扩展方法中因生成冗余的思维链（Chain-of-Thought, CoT）而导致的解码效率低下问题。其关键解决方案是提出一种基于验证器的、无需训练的高效动态CoT压缩框架TrimR，通过轻量级预训练指令调优验证器检测并截断冗余中间思考，从而提升推理效率，同时对准确性影响极小。

链接: https://arxiv.org/abs/2505.17155
作者: Weizhe Lin,Xing Li,Zhiyuan Yang,Xiaojin Fu,Hui-Ling Zhen,Yaoyuan Wang,Xianzhi Yu,Wulong Liu,Xiaosong Li,Mingxuan Yuan
机构: Huawei Advanced Computing and Storage Lab (华为先进计算与存储实验室); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) demonstrate exceptional capability in tackling complex mathematical, logical, and coding tasks by leveraging extended Chain-of-Thought (CoT) reasoning. Test-time scaling methods, such as prolonging CoT with explicit token-level exploration, can push LRMs’ accuracy boundaries, but they incur significant decoding overhead. A key inefficiency source is LRMs often generate redundant thinking CoTs, which demonstrate clear structured overthinking and underthinking patterns. Inspired by human cognitive reasoning processes and numerical optimization theories, we propose TrimR, a verifier-based, training-free, efficient framework for dynamic CoT compression to trim reasoning and enhance test-time scaling, explicitly tailored for production-level deployment. Our method employs a lightweight, pretrained, instruction-tuned verifier to detect and truncate redundant intermediate thoughts of LRMs without any LRM or verifier fine-tuning. We present both the core algorithm and asynchronous online system engineered for high-throughput industrial applications. Empirical evaluations on Ascend NPUs and vLLM show that our framework delivers substantial gains in inference efficiency under large-batch workloads. In particular, on the four MATH500, AIME24, AIME25, and GPQA benchmarks, the reasoning runtime of Pangu-R-38B, QwQ-32B, and DeepSeek-R1-Distill-Qwen-32B is improved by up to 70% with negligible impact on accuracy.
zh

[NLP-164] Amplify Adjacent Token Differences: Enhancing Long Chain-of-Thought Reasoning with Shift-FFN

【速读】：该论文试图解决在长链式思维（Long-CoT）数据上微调大型语言模型（LLM）时出现的循环推理（Cyclical Reasoning）问题，即模型在达到最大长度限制前反复重复之前的推理步骤。解决方案的关键在于提出了一种名为Shift Feedforward Networks (Shift-FFN) 的新方法，该方法在将当前标记的表示输入前馈网络（FFN）之前，将其与前一个标记的表示进行编辑，从而动态增强相邻标记表示之间的差异，以减少循环推理的发生。

链接: https://arxiv.org/abs/2505.17153
作者: Yao Xu,Mingyu Xu,Fangyu Lei,Wangtao Sun,Xiangrong Zeng,Bingning Wang,Guang Liu,Shizhu He,Jun Zhao,Kang Liu
机构: Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); University of Chinese Academy of Sciences(中国科学院大学); Baichuan Inc(百川智能); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Beijing Academy of Artificial Intelligence(北京人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, models such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable performance on complex reasoning tasks through Long Chain-of-Thought (Long-CoT) reasoning. Although distilling this capability into student models significantly enhances their performance, this paper finds that fine-tuning LLMs with full parameters or LoRA with a low rank on long CoT data often leads to Cyclical Reasoning, where models repeatedly reiterate previous inference steps until the maximum length limit. Further analysis reveals that smaller differences in representations between adjacent tokens correlates with a higher tendency toward Cyclical Reasoning. To mitigate this issue, this paper proposes Shift Feedforward Networks (Shift-FFN), a novel approach that edits the current token’s representation with the previous one before inputting it to FFN. This architecture dynamically amplifies the representation differences between adjacent tokens. Extensive experiments on multiple mathematical reasoning tasks demonstrate that LoRA combined with Shift-FFN achieves higher accuracy and a lower rate of Cyclical Reasoning across various data sizes compared to full fine-tuning and standard LoRA. Our data and code are available at this https URL
zh

[NLP-165] Bayesian Optimization for Enhanced Language Models: Optimizing Acquisition Functions

【速读】：该论文试图解决在微调大型语言模型时，如何选择合适的超参数以提升模型在下游任务上的性能问题。其关键解决方案是引入Bilevel-BO-SWA方法，该方法结合了双层贝叶斯优化（Bilevel BO）策略与采集函数的混合使用，通过嵌套优化循环，内层优化训练损失，外层优化验证指标，从而更有效地平衡探索与利用，提升模型的泛化能力。实验结果表明，采用EI和UCB采集函数的混合策略可使微调效果提升最高达2.7%。

链接: https://arxiv.org/abs/2505.17151
作者: Zishuo Bao,Yibo Liu,Changyutao Qiu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, 2 tables

点击查看摘要

Abstract:With the rise of different language model architecture, fine-tuning is becoming even more important for down stream tasks Model gets messy, finding proper hyperparameters for fine-tuning. Although BO has been tried for hyperparameter tuning, most of the existing methods are oblivious to the fact that BO relies on careful choices of acquisition functions, which are essential components of BO that guide how much to explore versus exploit during the optimization process; Different acquisition functions have different levels of sensitivity towards training loss and validation performance; existing methods often just apply an acquisition function no matter if the training and validation performance are sensitive to the acquisition function or not. This work introducesBilevel - BO - SWA, a model fusion approach coupled with a bilevel BO strategy to improve the fine - tunning of large language models. Our work on mixture of acquisition functions like EI and UCB into nested opt loops, where inner loop perform minimization of training loss while outer loops optimized w.r.t. val metric. Experiments on GLUE tasks using RoBERTA - base show that when using EI and UCB, there is an improvement in generalization, and fine - tuning can be improved by up to 2.7%.
zh

[NLP-166] Large Language Models for Predictive Analysis: How Far Are They? ACL2025

【速读】：该论文试图解决当前大型语言模型（Large Language Models, LLMs）在预测分析（predictive analysis）领域能力评估不足的问题，旨在系统性地评估其在该领域的适用性与局限性。解决方案的关键在于构建了一个名为PredictiQ的基准测试集，该基准集整合了来自8个不同领域、共44个真实数据集的1130个复杂的预测分析查询，并设计了涵盖文本分析、代码生成及其对齐的评估协议，以全面衡量LLMs在预测分析任务中的表现。

链接: https://arxiv.org/abs/2505.17149
作者: Qin Chen,Yuanyi Ren,Xiaojun Ma,Yuyang Shi
机构: Peking University (北京大学); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the \textbfPredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See \hrefthis https URLGithub.
zh

[NLP-167] MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models ACL2025

【速读】：该论文试图解决大型多模态模型（Large Multimodal Models, LMMs）中隐性偏见与歧视相关的隐性毒性问题，而现有研究主要关注显性毒性。解决方案的关键在于引入一种更隐蔽的毒性类型——双隐性毒性（dual-implicit toxicity），并构建了首个针对该类型的基准测试框架MDIT-Bench。该基准包含317,638个问题，覆盖12个类别、23个子类别和780个主题，并通过多阶段人机协同上下文生成方法创建了包含双隐性毒性的MDIT-Dataset，以评估模型对这类隐性毒性的敏感度。

链接: https://arxiv.org/abs/2505.17144
作者: Bohan Jin,Shuhan Qi,Kehai Chen,Xinyi Guo,Xuan Wang
机构: Harbin Institute of Technology (Shenzhen); Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies; University of Barcelona
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of ACL 2025

点击查看摘要

Abstract:The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to some more implicit toxicity regarding prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we conducted MDIT-Bench on 13 prominent LMMs, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The model’s performance drops significantly in hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at this https URL.
zh

[NLP-168] Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）知识过时的问题，特别是如何高效地更新模型以注入专有信息。研究发现，认知参与度较高的微调任务（如问答和填空）在知识保留率方面显著优于以映射为导向的任务（如翻译或文本到JSON转换），其保留率分别为48%和17%-20%。解决方案的关键在于任务选择，即有效的知识注入不仅依赖于数据暴露，更依赖于微调过程中对知识的深度认知加工。

链接: https://arxiv.org/abs/2505.17140
作者: Essa Jan,Moiz Ali,Muhammad Saram Hassan,Fareed Zaffar,Yasir Zaki
机构: Lahore University of Management Sciences(拉合尔管理科学大学); New York University Abu Dhabi(纽约大学阿布扎比分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages, 1 figure

点击查看摘要

Abstract:As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and blanks) achieve substantially higher knowledge retention rates (48%) compared to mapping-oriented tasks like translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings show the importance of task selection in updating LLM knowledge, showing that effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.
zh

[NLP-169] EarthSE: A Benchmark Evaluating Earth Scientific Exploration Capability for Large Language Models

【速读】：该论文试图解决现有基准在地球科学领域缺乏全面性和专业性的问题，以及当前基准未能有效评估大语言模型（Large Language Models, LLMs）在开放性科学探索中的能力。其解决方案的关键在于构建一个涵盖基础到高级层次的综合性地球科学基准，包括两个问答（Question Answering, QA）数据集Earth-Iron和Earth-Silver，以及一个专门用于评估高级科学探索能力的开放性多轮对话数据集Earth-Gold，从而全面评估LLMs在地球科学领域的知识广度与深度。

链接: https://arxiv.org/abs/2505.17139
作者: Wanghan Xu,Xiangyu Zhao,Yuhao Zhou,Xiaoyu Yue,Ben Fei,Fenghua Ling,Wenlong Zhang,Lei Bai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs’ capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold with new metrics, a dataset comprising open-ended multi-turn dialogues specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available on this https URL .
zh

[NLP-170] Cog-TiPRO: Iterative Prompt Refinement with LLM s to Detect Cognitive Decline via Longitudinal Voice Assistant Commands

【速读】：该论文旨在解决早期检测认知功能下降的问题，以实现对神经退行性疾病的干预。传统诊断方法依赖于耗时的临床评估，难以用于频繁监测。其解决方案的关键在于提出Cog-TiPRO框架，该框架结合了LLM驱动的迭代提示优化以提取语言特征、基于HuBERT的声学特征提取以及基于Transformer的时间建模，从而有效分析语音助手系统中短时、非结构化且噪声较大的语音命令，提升了轻度认知障碍（MCI）检测的准确率和F1分数。

链接: https://arxiv.org/abs/2505.17137
作者: Kristin Qi,Youxiang Zhu,Caroline Summerour,John A. Batsis,Xiaohui Liang
机构: University of Massachusetts, Boston, MA, USA(马萨诸塞大学波士顿分校); University of North Carolina, Chapel Hill, NC, USA(北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE GlobeCom 2025

点击查看摘要

Abstract:Early detection of cognitive decline is crucial for enabling interventions that can slow neurodegenerative disease progression. Traditional diagnostic approaches rely on labor-intensive clinical assessments, which are impractical for frequent monitoring. Our pilot study investigates voice assistant systems (VAS) as non-invasive tools for detecting cognitive decline through longitudinal analysis of speech patterns in voice commands. Over an 18-month period, we collected voice commands from 35 older adults, with 15 participants providing daily at-home VAS interactions. To address the challenges of analyzing these short, unstructured and noisy commands, we propose Cog-TiPRO, a framework that combines (1) LLM-driven iterative prompt refinement for linguistic feature extraction, (2) HuBERT-based acoustic feature extraction, and (3) transformer-based temporal modeling. Using iTransformer, our approach achieves 73.80% accuracy and 72.67% F1-score in detecting MCI, outperforming its baseline by 27.13%. Through our LLM approach, we identify linguistic features that uniquely characterize everyday command usage patterns in individuals experiencing cognitive decline.
zh

[NLP-171] Foundation Models for Geospatial Reasoning : Assessing Capabilities of Large Language Models in Understanding Geometries and Topological Spatial Relations

【速读】：该论文试图解决将AI基础模型直接应用于地理空间数据时面临的挑战，即模型在表示和推理地理实体（特别是基于矢量的几何形状和复杂空间关系的自然语言描述）方面的能力有限。解决方案的关键在于评估几何形状及其空间关系（如拓扑谓词）在通过大型语言模型（LLMs）进行空间推理时的保留情况，并探索三种不同的方法：基于几何嵌入、基于提示工程以及基于日常语言的评估方式。实验结果表明，基于嵌入和提示工程的方法在识别两个几何体之间的拓扑空间关系方面平均准确率超过0.6，其中GPT-4在少样本提示下表现最佳，准确率超过0.66。此外，研究还发现LLM能够理解逆向拓扑空间关系，并通过生成几何形状提升地理实体检索效果。

链接: https://arxiv.org/abs/2505.17136
作者: Yuhan Ji,Song Gao,Ying Nie,Ivan Majić,Krzysztof Janowicz
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校); University of Vienna(维也纳大学); University of California-Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 13 figures, IJGIS GeoFM Special Issue

点击查看摘要

Abstract:Applying AI foundation models directly to geospatial datasets remains challenging due to their limited ability to represent and reason with geographical entities, specifically vector-based geometries and natural language descriptions of complex spatial relations. To address these issues, we investigate the extent to which a well-known-text (WKT) representation of geometries and their spatial relations (e.g., topological predicates) are preserved during spatial reasoning when the geospatial vector data are passed to large language models (LLMs) including GPT-3.5-turbo, GPT-4, and DeepSeek-R1-14B. Our workflow employs three distinct approaches to complete the spatial reasoning tasks for comparison, i.e., geometry embedding-based, prompt engineering-based, and everyday language-based evaluation. Our experiment results demonstrate that both the embedding-based and prompt engineering-based approaches to geospatial question-answering tasks with GPT models can achieve an accuracy of over 0.6 on average for the identification of topological spatial relations between two geometries. Among the evaluated models, GPT-4 with few-shot prompting achieved the highest performance with over 0.66 accuracy on topological spatial relation inference. Additionally, GPT-based reasoner is capable of properly comprehending inverse topological spatial relations and including an LLM-generated geometry can enhance the effectiveness for geographic entity retrieval. GPT-4 also exhibits the ability to translate certain vernacular descriptions about places into formal topological relations, and adding the geometry-type or place-type context in prompts may improve inference accuracy, but it varies by instance. The performance of these spatial reasoning tasks offers valuable insights for the refinement of LLMs with geographical knowledge towards the development of geo-foundation models capable of geospatial reasoning.
zh

[NLP-172] When can isotropy help adapt LLM s next word prediction to numerical domains?

【速读】：该论文试图解决预训练语言模型（Pre-trained Language Models, LLMs）在数值领域下游任务中的性能保障问题，特别是其在数值预测中可能出现的幻觉（Hallucination）现象，这可能导致能源、金融、医疗等关键领域的严重后果。解决方案的关键在于通过一种基于上下文嵌入空间各向同性（Isotropy）的新颖分析，理解LLMs的下一个词预测能力如何适应数值领域。具体而言，论文提出了一种对数线性模型，其中数值数据可通过具有softmax输出层的网络从上下文中预测得出，并证明为了在数值领域取得最先进性能，LLM嵌入的隐藏表示必须具备能够解释softmax函数平移不变性的结构。通过构建预训练模型中自注意力机制的梯度结构，论文展示了上下文嵌入空间中LLM嵌入的各向同性特性如何保持表示的底层结构，从而解决平移不变性问题并提供性能保证。

链接: https://arxiv.org/abs/2505.17135
作者: Rashed Shelim,Shengzhe Xu,Walid Saad,Naren Ramakrishnan
机构: Virginia Tech(弗吉尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have shown that vector representations of contextual embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications such as energy, nature, finance, healthcare, retail and transportation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to open the black-box and provide performance guarantees through explanation. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax in the output layer of LLMs (i.e., language model head in self-attention). We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. By formulating a gradient structure of self-attention in pre-trained models, we show how the isotropic property of LLM embeddings in contextual embedding space preserves the underlying structure of representations, thereby resolving the shift-invariance problem and providing a performance guarantee. Experiments show that different characteristics of numeric data and model architecture could have different impacts on isotropy.
zh

[NLP-173] LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions

【速读】：该论文旨在解决长上下文大语言模型（Long-context Large Language Models, LLMs）对高质量长上下文指令数据的需求问题，当前公开模型如Qwen和Llama的长上下文指令数据仍为专有资源。传统的人工标注成本高且困难，而基于模板的合成方法则受限于规模、多样性和质量。论文提出的解决方案是LongMagpie，其关键在于利用对齐的长上下文LLMs在文档后跟随特殊标记生成用户对话时，自动回归生成相关查询，从而无需人工参与即可生成高质量的长上下文指令数据。

链接: https://arxiv.org/abs/2505.17134
作者: Chaochen Gao,Xing Wu,Zijia Lin,Debing Zhang,Songlin Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model’s responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
zh

[NLP-174] Robustifying Vision-Language Models via Dynamic Token Reweighting

【速读】：该论文旨在解决大型多模态视觉-语言模型（VLMs）在推理阶段对 jailbreak 攻击的脆弱性问题，此类攻击通过操纵视觉与文本之间的交互来绕过安全防护机制。论文提出的解决方案关键在于 DTR（Dynamic Token Reweighting），它通过优化模型的关键值（KV）缓存来缓解多模态 jailbreak 攻击。DTR 不依赖于特定安全的数据集或高成本的图像到文本转换，而是引入了一种新的由视觉模态引起的与安全相关的分布偏移形式，从而动态调整视觉标记的权重，以最小化对抗性视觉输入的影响，同时保持模型的通用能力和推理效率。

链接: https://arxiv.org/abs/2505.17132
作者: Tanqiu Jiang,Jiacheng Liang,Rongyi Zhu,Jiawei Zhou,Fenglong Ma,Ting Wang
机构: Stony Brook University (纽约州立大学石溪分校); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails. In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model’s key-value (KV) caches. Rather than relying on curated safety-specific data or costly image-to-text conversion, we introduce a new formulation of the safety-relevant distributional shift induced by the visual modality. This formulation enables DTR to dynamically adjust visual token weights, minimizing the impact of adversarial visual inputs while preserving the model’s general capabilities and inference efficiency. Extensive evaluation across diverse VLMs and attack benchmarks demonstrates that \sys outperforms existing defenses in both attack robustness and benign task performance, marking the first successful application of KV cache optimization for safety enhancement in multimodal foundation models. The code for replicating DTR is available: this https URL (warning: this paper contains potentially harmful content generated by VLMs.)
zh

[NLP-175] Relative Bias: A Comparative Framework for Quantifying Bias in LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中固有偏见的量化问题，这一问题因“偏见”概念的模糊性而变得更加复杂。随着新模型的快速涌现和广泛应用，潜在的偏见尚未得到系统评估，因此亟需一种有效的评估方法。论文提出的解决方案是相对偏见框架（Relative Bias framework），其关键在于通过两种互补的方法评估LLM行为相对于其他LLM在特定目标领域的偏差：一是通过嵌入空间中的句子表示进行嵌入变换分析，二是利用语言模型作为评判者（LLM-as-a-Judge）进行输出比较评估。这两种方法在多个案例研究中表现出高度一致性，为LLM的对比偏见分析提供了一种系统、可扩展且统计基础扎实的解决方案。

链接: https://arxiv.org/abs/2505.17131
作者: Alireza Arbabi,Florian Kerschbaum
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The growing deployment of large language models (LLMs) has amplified concerns regarding their inherent biases, raising critical questions about their fairness, safety, and societal impact. However, quantifying LLM bias remains a fundamental challenge, complicated by the ambiguity of what “bias” entails. This challenge grows as new models emerge rapidly and gain widespread use, while introducing potential biases that have not been systematically assessed. In this paper, we propose the Relative Bias framework, a method designed to assess how an LLM’s behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios following by statistical tests for validation, we find strong alignment between the two scoring methods, offering a systematic, scalable, and statistically grounded approach for comparative bias analysis in LLMs.
zh

[NLP-176] Conformal Language Model Reasoning with Coherent Factuality

【速读】：该论文试图解决语言模型在推理任务中生成内容的“一致性真实性”（coherent factuality）问题，即确保逻辑论证中的每一步骤在上下文中是正确的，而不仅仅是孤立地评估单个声明的真实性。解决方案的关键在于构建一个“可推导性图”（deducibility graph），并应用拆分共形预测（split conformal prediction）技术对图中的子图进行分析，从而保证生成内容在逻辑推理过程中的整体一致性与真实性。

链接: https://arxiv.org/abs/2505.17126
作者: Maxon Rubin-Toles,Maya Gambhir,Keshav Ramji,Aaron Roth,Surbhi Goel
机构: University of Pennsylvania (宾夕法尼亚大学); IBM Research AI (IBM研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Language models are increasingly being used in important decision pipelines, so ensuring the correctness of their outputs is crucial. Recent work has proposed evaluating the “factuality” of claims decomposed from a language model generation and applying conformal prediction techniques to filter out those claims that are not factual. This can be effective for tasks such as information retrieval, where constituent claims may be evaluated in isolation for factuality, but is not appropriate for reasoning tasks, as steps of a logical argument can be evaluated for correctness only within the context of the claims that precede them. To capture this, we define “coherent factuality” and develop a conformal-prediction-based method to guarantee coherent factuality for language model outputs. Our approach applies split conformal prediction to subgraphs within a “deducibility” graph" that represents the steps of a reasoning problem. We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. Moreover, we achieve 90% factuality on our stricter definition while retaining 80% or more of the original claims, highlighting the utility of our deducibility-graph-guided approach.
zh

[NLP-177] MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

【速读】：该论文试图解决当前大型语言模型（Large Language Models, LLMs）在多轮交互推理任务中的评估不足问题，即现有评估主要集中在单轮推理场景，缺乏对交互式任务的系统性研究。解决方案的关键在于提出MTR-Bench，这是一个包含4个类别、40项任务和3600个实例的多轮推理评估基准，其核心特点在于覆盖了多样化的推理能力、细粒度的难度层级，并要求与环境进行多轮交互。此外，MTR-Bench还具备全自动化框架，涵盖数据集构建与模型评估，实现了无需人工干预的大规模评估。

链接: https://arxiv.org/abs/2505.17123
作者: Xiaoyuan Li,Keqin Bao,Yubo Ma,Moxin Li,Wenjie Wang,Rui Men,Yichang Zhang,Fuli Feng,Dayiheng Liu,Junyang Lin
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs’ Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations, which enables scalable assessment without human interventions. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn, interactive reasoning tasks. And the further analysis upon these results brings valuable insights for future research in interactive AI systems.
zh

[NLP-178] Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）与人类偏好对齐的问题，当前的基于偏好的优化方法如基于人类反馈的强化学习（Reinforcement Learning with Human Feedback, RLHF）和直接偏好优化（Direct Preference Optimization, DPO）依赖于人工标注的数据集来提升对齐效果。论文提出的关键解决方案是识别并利用“浅层偏好信号”（shallow preference signals），即在优选响应中，区分性信号通常集中在早期token上。通过系统地截断偏好数据集并在截断数据上训练奖励模型和DPO模型，实验表明即使仅保留前半部分或更少的token，模型性能仍可与全数据集训练的结果相媲美甚至更优，这揭示了浅层偏好信号的普遍性，并进一步提出了两种基于此现象的解码策略以优化对齐与计算效率的平衡。

链接: https://arxiv.org/abs/2505.17122
作者: Xuan Qi,Jiahao Qiu,Xinzhe Juan,Yue Wu,Mengdi Wang
机构: IIIS, Tsinghua University (清华大学人工智能研究院); AI Lab, Princeton University (普林斯顿大学人工智能实验室); Department of Computer Science & Engineering, University of Michigan (密歇根大学计算机科学与工程系)
类目: Computation and Language (cs.CL)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of the existing learning method: the distinguishing signal obtained in preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on the Skywork-Reward-Preference-80K-v0.2 dataset outperforms the full dataset when trained on a 40% truncated dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. The performance is even better, which again validates our hypothesis. The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance. Comments: 17 pages, 7 figures Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.17122 [cs.CL] (or arXiv:2505.17122v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.17122 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Xuan Qi [view email] [v1] Wed, 21 May 2025 17:59:02 UTC (433 KB)
zh

[NLP-179] NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation

【速读】：该论文旨在解决多模态大语言模型（MLLMs）在几何推理能力提升中面临的数据规模不足、质量不高以及多样性与数值泛化能力有限的问题。其解决方案的关键在于提出一种新型的神经符号框架NeSyGeo，该框架通过基于实体-关系-约束范式的领域特定语言全面表示平面几何的所有组件，并结合符号-视觉-文本的流水线生成多样化的问答对，从而构建高质量的多模态几何推理数据集。

链接: https://arxiv.org/abs/2505.17121
作者: Weiming Wu,Zi-kang Wang,Jin Ye,Zhi Zhou,Yu-Feng Li,Lan-Zhe Guo
机构: Nanjing University(南京大学); National Key Laboratory for Novel Software Technology, Nanjing University(国家软件技术重点实验室，南京大学); School of Artificial Intelligence, Nanjing University(人工智能学院，南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Obtaining large-scale, high-quality data with reasoning paths is crucial for improving the geometric reasoning capabilities of multi-modal large language models (MLLMs). However, existing data generation methods, whether based on predefined templates or constrained symbolic provers, inevitably face diversity and numerical generalization limitations. To address these limitations, we propose NeSyGeo, a novel neuro-symbolic framework for generating geometric reasoning data. First, we propose a domain-specific language grounded in the entity-relation-constraint paradigm to comprehensively represent all components of plane geometry, along with generative actions defined within this symbolic space. We then design a symbolic-visual-text pipeline that synthesizes symbolic sequences, maps them to corresponding visual and textual representations, and generates diverse question-answer (QA) pairs using large language models (LLMs). To the best of our knowledge, we are the first to propose a neuro-symbolic approach in generating multimodal reasoning data. Based on this framework, we construct NeSyGeo-CoT and NeSyGeo-Caption datasets, containing 100k samples, and release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs. Experiments demonstrate that the proposal significantly and consistently improves the performance of multiple MLLMs under both reinforcement and supervised fine-tuning. With only 4k samples and two epochs of reinforcement fine-tuning, base models achieve improvements of up to +15.8% on MathVision, +8.4% on MathVerse, and +7.3% on GeoQA. Notably, a 4B model can be improved to outperform an 8B model from the same series on geometric reasoning tasks.
zh

[NLP-180] Self-Interpretability: LLM s Can Describe Complex Internal Processes that Drive Their Decisions and Improve with Training

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在决策过程中内部机制难以解释的问题，旨在提升其自我反思和解释自身运作能力。解决方案的关键在于通过微调（fine-tuning）使LLMs在复杂情境下根据随机生成的量化偏好进行决策，并训练其准确描述自身的内部过程，进而提升其解释能力，且该训练具有一定的泛化性。

链接: https://arxiv.org/abs/2505.17120
作者: Dillon Plunkett,Adam Morris,Keerthi Reddy,Jorge Morales
机构: Northeastern University (东北大学); Princeton University (普林斯顿大学); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to introspect and explain their own functioning. Here, we show that i) contemporary LLMs are capable of providing accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, ii) that it is possible to improve these capabilities through training, and iii) that this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain what they are doing as they make other complex decisions, not just decisions they have learned to make via fine-tuning. This work is a step towards training LLMs to accurately and broadly report on their own internal processes – a possibility that would yield substantial benefits for interpretability, control, and safety.
zh

[NLP-181] Systematic Evaluation of Machine-Generated Reasoning and PHQ-9 Labeling for Depression Detection Using Large Language Models

【速读】：该论文旨在系统评估生成式 AI (Generative AI) 在早期心理健康检测中的推理能力，并揭示其潜在弱点，特别是在抑郁症检测任务中的表现。其关键解决方案是设计一种基于指令的策略，将检测任务分解为多个子任务，并通过对比少样本提示和思维链提示来优化模型推理能力，同时结合人工标注与偏好学习方法，提升检测准确性并减少统计偏差。

链接: https://arxiv.org/abs/2505.17119
作者: Zongru Shao,Xin Wang,Zhanyang Liu,Chenhan Wang,K.P. Subbalakshmi
机构: Silicon Austria Labs (硅奥地利实验室); Jiangnan University (江南大学); Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages without references

点击查看摘要

Abstract:Recent research leverages large language models (LLMs) for early mental health detection, such as depression, often optimized with machine-generated data. However, their detection may be subject to unknown weaknesses. Meanwhile, quality control has not been applied to these generated corpora besides limited human verifications. Our goal is to systematically evaluate LLM reasoning and reveal potential weaknesses. To this end, we first provide a systematic evaluation of the reasoning over machine-generated detection and interpretation. Then we use the models’ reasoning abilities to explore mitigation strategies for enhanced performance. Specifically, we do the following: A. Design an LLM instruction strategy that allows for systematic analysis of the detection by breaking down the task into several subtasks. B. Design contrastive few-shot and chain-of-thought prompts by selecting typical positive and negative examples of detection reasoning. C. Perform human annotation for the subtasks identified in the first step and evaluate the performance. D. Identify human-preferred detection with desired logical reasoning from the few-shot generation and use them to explore different optimization strategies. We conducted extensive comparisons on the DepTweet dataset across the following subtasks: 1. identifying whether the speaker is describing their own depression; 2. accurately detecting the presence of PHQ-9 symptoms, and 3. finally, detecting depression. Human verification of statistical outliers shows that LLMs demonstrate greater accuracy in analyzing and detecting explicit language of depression as opposed to implicit expressions of depression. Two optimization methods are used for performance enhancement and reduction of the statistic bias: supervised fine-tuning (SFT) and direct preference optimization (DPO). Notably, the DPO approach achieves significant performance improvement.
zh

[NLP-182] After Retrieval Before Generation: Enhancing the Trustworthiness of Large Language Models in RAG

【速读】：该论文旨在解决检索增强生成（Retrieval-augmented generation, RAG）系统在平衡内部（参数化）与外部（检索到）知识时所面临的挑战，尤其是在两者存在冲突或不可靠的情况下。现有方法仅能处理孤立场景，如优先考虑单一知识源、简单融合两者或拒绝回答，缺乏统一框架以同时应对多种现实条件。论文提出的BRIDGE框架通过动态确定大型语言模型（Large Language Models, LLMs）的综合响应策略，其关键在于采用自适应加权机制“soft bias”引导知识收集，并结合最大软偏差决策树评估知识并选择最优响应策略（信任内部/外部知识或拒绝回答）。

链接: https://arxiv.org/abs/2505.17118
作者: Xinbang Dai,Huikang Hu,Yuncheng Hua,Jiaqi Li,Yongrui Chen,Rihui Jin,Nan Hu,Guilin Qi
机构: School of Cyber Science and Engineering, Southeast University(东南大学); School of Computer Science and Engineering, Southeast University(东南大学); School of Computer Science and Engineering, University of New South Wales(新南威尔士大学)
类目: Computation and Language (cs.CL)
备注: 24 pages, 8 figures

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) systems face critical challenges in balancing internal (parametric) and external (retrieved) knowledge, especially when these sources conflict or are unreliable. To analyze these scenarios comprehensively, we construct the Trustworthiness Response Dataset (TRD) with 36,266 questions spanning four RAG settings. We reveal that existing approaches address isolated scenarios-prioritizing one knowledge source, naively merging both, or refusing answers-but lack a unified framework to handle different real-world conditions simultaneously. Therefore, we propose the BRIDGE framework, which dynamically determines a comprehensive response strategy of large language models (LLMs). BRIDGE leverages an adaptive weighting mechanism named soft bias to guide knowledge collection, followed by a Maximum Soft-bias Decision Tree to evaluate knowledge and select optimal response strategies (trust internal/external knowledge, or refuse). Experiments show BRIDGE outperforms baselines by 5-15% in accuracy while maintaining balanced performance across all scenarios. Our work provides an effective solution for LLMs’ trustworthy responses in real-world RAG applications.
zh

[NLP-183] From Tokens to Thoughts: How LLM s and Humans Trade Compression for Meaning

【速读】：该论文试图解决当前大型语言模型（Large Language Models, LLMs）在内部表示中是否能够实现类似人类的语义压缩与表达保真度之间的平衡问题。其解决方案的关键在于引入一种基于信息论的框架，结合率失真理论（Rate-Distortion Theory）和信息瓶颈原理（Information Bottleneck principle），以量化比较LLMs与人类概念系统在语义压缩策略上的差异。通过分析不同LLMs的token嵌入与经典人类分类基准的对比，研究揭示了LLMs在捕捉细粒度语义区分方面的不足，以及其对统计压缩的强烈偏好，与人类概念系统更注重适应性细微差别和上下文丰富性的倾向形成对比。

链接: https://arxiv.org/abs/2505.17117
作者: Chen Shani,Dan Jurafsky,Yann LeCun,Ravid Shwartz-Ziv
机构: Stanford University (斯坦福大学); New York University (纽约大学); Meta - FAIR (Meta - FAIR); Wand.AI (Wand.AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
zh

[NLP-184] Comparative Evaluation of Prompting and Fine-Tuning for Applying Large Language Models to Grid-Structured Geospatial Data

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在解释网格结构的地理空间数据中的性能问题，其核心在于评估基础模型通过结构化提示的性能，并与在用户-助手交互数据集上微调的变体进行对比。解决方案的关键在于通过微调提升模型在结构化地理空间和时间推理任务中的表现，从而克服零样本提示的局限性。

链接: https://arxiv.org/abs/2505.17116
作者: Akash Dhruv,Yangxinyu Xie,Jordan Branham,Tanwi Mallick
机构: 未知
类目: Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:This paper presents a comparative study of large language models (LLMs) in interpreting grid-structured geospatial data. We evaluate the performance of a base model through structured prompting and contrast it with a fine-tuned variant trained on a dataset of user-assistant interactions. Our results highlight the strengths and limitations of zero-shot prompting and demonstrate the benefits of fine-tuning for structured geospatial and temporal reasoning.
zh

[NLP-185] RAVEN: Query-Guided Representation Alignment for Question Answering over Audio Video Embedded Sensors and Natural Language

【速读】：该论文旨在解决多模态问答（Multimodal Question Answering, MQA）中因模态不一致导致的融合模型性能下降问题，例如离屏语音、背景噪声或视野外运动等干扰因素会误导模型。其解决方案的关键在于提出RAVEN架构，该架构的核心是QuART模块，这是一个查询条件的跨模态门控机制，能够为不同模态中的每个token分配标量相关性分数，从而在融合前增强有用信号并抑制干扰信息。

链接: https://arxiv.org/abs/2505.17114
作者: Subrata Biswas,Mohammad Nur Hossain Khan,Bashima Islam
机构: Worcester Polytechnic Institute (伍斯特理工学院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning – each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio–Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks – including egocentric and exocentric tasks – show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at this https URL.
zh

[NLP-186] Cultural Value Alignment in Large Language Models : A Prompt-based Analysis of Schwartz Values in Gemini ChatGPT and DeepSeek

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在文化价值观对齐方面存在的偏差问题，特别是探讨不同训练数据背景的模型是否表现出不同的价值偏好。研究发现，DeepSeek相较于西方模型更弱化自我提升类价值，而更强调自我超越类价值，这反映了其与中国集体主义文化倾向的一致性。解决方案的关键在于通过多视角推理、自我反思反馈和动态情境化机制来缓解LLMs中的价值不对称性，从而推动更加包容和多元的AI对齐框架的发展。

链接: https://arxiv.org/abs/2505.17112
作者: Robin Segerer
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 1 table, 1 figure

点击查看摘要

Abstract:This study examines cultural value alignment in large language models (LLMs) by analyzing how Gemini, ChatGPT, and DeepSeek prioritize values from Schwartz’s value framework. Using the 40-item Portrait Values Questionnaire, we assessed whether DeepSeek, trained on Chinese-language data, exhibits distinct value preferences compared to Western models. Results of a Bayesian ordinal regression model show that self-transcendence values (e.g., benevolence, universalism) were highly prioritized across all models, reflecting a general LLM tendency to emphasize prosocial values. However, DeepSeek uniquely downplayed self-enhancement values (e.g., power, achievement) compared to ChatGPT and Gemini, aligning with collectivist cultural tendencies. These findings suggest that LLMs reflect culturally situated biases rather than a universal ethical framework. To address value asymmetries in LLMs, we propose multi-perspective reasoning, self-reflective feedback, and dynamic contextualization. This study contributes to discussions on AI fairness, cultural neutrality, and the need for pluralistic AI alignment frameworks that integrate diverse moral perspectives.
zh

[NLP-187] Multi-Modality Expansion and Retention for LLM s through Parameter Merging and Decoupling

【速读】：该论文旨在解决传统微调方法在扩展大型语言模型（Large Language Models, LLMs）多模态能力时存在的资源消耗大、灵活性差的问题。其核心问题是现有方法需要从头开始进行大量计算和参数调整，导致效率低下且难以适应新模态数据。论文提出的解决方案——MMER（Multi-modality Expansion and Retention），关键在于无需训练即可通过复用已有多模态LLMs（Multimodal LLMs, MLLMs）的多模态编码器并融合其LLM参数，生成二进制掩码以解耦不同模态的参数，从而实现多模态能力的有效扩展，同时保留原始模型99%的性能，并显著缓解灾难性遗忘问题。

链接: https://arxiv.org/abs/2505.17110
作者: Junlin Li,Guodong DU,Jing Li,Sim Kuan Goh,Wenya Wang,Yequan Wang,Fangming Liu,Ho-Kin Tang,Saleh Alharbi,Daojing He,Min Zhang
机构: Harbin Institute of Technology, Shenzhen, China Xiamen University Malaysia Nanyang Technological University Beijing Academy of Artificial Intelligence, China Peng Cheng Laboratory, China Shaqra University, Saudi Arabia
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs’ multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs’ fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs’ multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
zh

[NLP-188] Mitigating Cyber Risk in the Age of Open-Weight LLM s: Policy Gaps and Technical Realities

【速读】：该论文试图解决开放权重通用人工智能（GPAI）模型在网络安全方面带来的风险问题，特别是这些模型可能被用于加速恶意软件开发和增强社会工程攻击等行为。论文指出，现有法规如欧盟人工智能法案和GPAI行为准则存在显著漏洞，这是由于开放分发导致的控制丧失，使得传统安全缓解措施失效。解决方案的关键在于评估和控制特定高风险能力，而非整个模型，同时倡导对开放权重系统的务实政策解读，推动防御性人工智能创新，并促进国际间标准和网络威胁情报（CTI）共享，以在不阻碍开放技术进步的前提下确保安全性。

链接: https://arxiv.org/abs/2505.17109
作者: Alfonso de Gregorio
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 8 pages, no figures

点击查看摘要

Abstract:Open-weight general-purpose AI (GPAI) models offer significant benefits but also introduce substantial cybersecurity risks, as demonstrated by the offensive capabilities of models like DeepSeek-R1 in evaluations such as MITRE’s OCCULT. These publicly available models empower a wider range of actors to automate and scale cyberattacks, challenging traditional defence paradigms and regulatory approaches. This paper analyzes the specific threats – including accelerated malware development and enhanced social engineering – magnified by open-weight AI release. We critically assess current regulations, notably the EU AI Act and the GPAI Code of Practice, identifying significant gaps stemming from the loss of control inherent in open distribution, which renders many standard security mitigations ineffective. We propose a path forward focusing on evaluating and controlling specific high-risk capabilities rather than entire models, advocating for pragmatic policy interpretations for open-weight systems, promoting defensive AI innovation, and fostering international collaboration on standards and cyber threat intelligence (CTI) sharing to ensure security without unduly stifling open technological progress.
zh

[NLP-189] RRTL: Red Teaming Reasoning Large Language Models in Tool Learning

【速读】：该论文旨在解决新兴推理型大语言模型（Reasoning Large Language Models, RLLMs）在工具学习过程中的安全性问题，尤其是其在面对潜在威胁时的隐蔽风险和多语言安全漏洞。解决方案的关键在于提出一种名为RRTL的红队评估方法，该方法结合了两种创新策略：一是识别欺骗性威胁，评估模型在隐藏不安全工具使用及其风险方面的行为；二是利用思维链（Chain-of-Thought, CoT）提示强制调用工具，从而揭示模型的安全缺陷。

链接: https://arxiv.org/abs/2505.17106
作者: Yifei Liu,Yu Cui,Haibin Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model’s behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.
zh

[NLP-190] P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

【速读】：该论文旨在解决学术海报自动生成功能中面临的关键挑战，即在保留复杂科学细节和实现有效的视觉与文本整合方面存在困难。现有方法在语义丰富性和结构细微差别上表现不足，并且缺乏对生成的学术海报进行全面评估的标准基准。该研究提出P2P框架，这是首个基于大语言模型（Large Language Model, LLM）的多智能体系统，能够直接从研究论文生成高质量的HTML渲染学术海报。P2P的核心在于其三个专门的智能体——视觉元素处理、内容生成和最终海报组装——每个智能体均集成专用检查模块，以实现迭代优化并确保输出质量。

链接: https://arxiv.org/abs/2505.17104
作者: Tao Sun,Enhao Pan,Zhengkai Yang,Kaixin Sui,Jiajun Shi,Xianfu Cheng,Tongliang Li,Wenhao Huang,Ge Zhang,Jian Yang,Zhoujun Li
机构: 未知
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Academic posters are vital for scholarly communication, yet their manual creation is time-consuming. However, automated academic poster generation faces significant challenges in preserving intricate scientific details and achieving effective visual-textual integration. Existing approaches often struggle with semantic richness and structural nuances, and lack standardized benchmarks for evaluating generated academic posters comprehensively. To address these limitations, we introduce P2P, the first flexible, LLM-based multi-agent framework that generates high-quality, HTML-rendered academic posters directly from research papers, demonstrating strong potential for practical applications. P2P employs three specialized agents-for visual element processing, content generation, and final poster assembly-each integrated with dedicated checker modules to enable iterative refinement and ensure output quality. To foster advancements and rigorous evaluation in this domain, we construct and release P2PInstruct, the first large-scale instruction dataset comprising over 30,000 high-quality examples tailored for the academic paper-to-poster generation task. Furthermore, we establish P2PEval, a comprehensive benchmark featuring 121 paper-poster pairs and a dual evaluation methodology (Universal and Fine-Grained) that leverages LLM-as-a-Judge and detailed, human-annotated checklists. Our contributions aim to streamline research dissemination and provide the community with robust tools for developing and evaluating next-generation poster generation systems.
zh

[NLP-191] Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

【速读】：该论文旨在解决生成高质量多变量时间序列数据的挑战，特别是在数据样本有限和计算资源受限的情况下。其解决方案的关键在于提出SDForger框架，该框架通过紧凑的数据表示将单变量和多变量信号转换为表格嵌入，并将其编码为文本以对任意自回归大语言模型（LLM）进行低计算量微调。在推理阶段，通过采样新的文本嵌入并解码生成保留原始数据统计特性与时间动态的合成时间序列，从而实现高效且灵活的时间序列生成。

链接: https://arxiv.org/abs/2505.17103
作者: Cécile Rousseau,Tobia Boschi,Giandomenico Cornacchia,Dhaval Salwala,Alessandra Pascale,Juan Bernabe Moreno
机构: IBM Research Europe (IBM 研究欧洲); IBM (IBM)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data’s statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. SDForger source code will be open-sourced soon.
zh

[NLP-192] BanglaByT5: Byte-Level Modelling for Bangla

【速读】：该论文试图解决传统分词器（如BPE和SentencePiece）在处理像孟加拉语（Bangla）这样形态丰富的语言时，无法捕捉其细微语义的问题。解决方案的关键在于引入BanglaByT5，这是首个针对孟加拉语设计的字节级编码器-解码器模型，基于Google的ByT5架构的一个小型变体，并在14GB高质量文学和新闻文章语料上进行预训练，从而在生成和分类任务中展现出优于多语言及更大模型的性能。

链接: https://arxiv.org/abs/2505.17102
作者: Pramit Bhattacharyya,Arnab Bhattacharya
机构: Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Googles ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zeroshot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight the efficacy of byte-level modelling for morphologically rich languages and highlight BanglaByT5 potential as a lightweight yet powerful tool for Bangla NLP, particularly in both resource-constrained and scalable environments.
zh

[NLP-193] An approach to identify the most semantically informative deep representations of text and images

【速读】：该论文试图解决跨模态和跨语言数据在深度神经网络中是否能够形成语义相关表示的问题，以及这些表示如何在不同模型和模态间传递和编码。其解决方案的关键在于通过量化分析语义相关数据表示的相对信息内容，探究其在大型语言模型（LLMs）和视觉变换器中的多标记编码特性，识别出包含最多语言可迁移信息的“语义”层，并揭示语义信息在多个标记间的长距离相关性及因果左到右不对称性。

链接: https://arxiv.org/abs/2505.17101
作者: Santiago Acevedo,Andrea Mascaretti,Riccardo Rende,Matéo Mahaut,Marco Baroni,Alessandro Laio
机构: Scuola Internazionale Superiore di Studi Avanzati (SISSA)(国际高级研究学校); Universitat Pompeu Fabra (UPF)(庞佩乌法布拉大学); ICREA(加泰罗尼亚高等研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner ``semantic’’ layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
zh

[NLP-194] Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning -based Bias Detector

【速读】：该论文旨在解决生成式 AI (Generative AI) 评估中因评判者潜在偏见而导致的可靠性问题。现有方法在缓解这些偏见方面存在局限性：基于上下文学习的方法由于评估者自我反思能力有限，无法解决深层次偏见；而微调方法则不适用于所有类型的评估者，尤其是闭源模型。该论文提出的解决方案是引入基于推理的偏见检测器（Reasoning-based Bias Detector, RBD），其关键在于作为一个插件模块，能够识别偏见评估并生成结构化推理以指导评估者自我修正，而非直接修改评估者本身，通过外部迭代的偏见检测与反馈驱动的修订过程实现效果提升。

链接: https://arxiv.org/abs/2505.17100
作者: Haoyan Yang,Runxue Bao,Cao Xiao,Jun Ma,Parminder Bhatia,Shangqian Gao,Taha Kass-Hout
机构: New York University (纽约大学); GE Healthcare (通用电气医疗集团); Florida State University (佛罗里达州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator’s limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types–verbosity, position, bandwagon, and sentiment–evaluated using 8 LLM evaluators demonstrate RBD’s strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD’s effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
zh

[NLP-195] Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation

【速读】：该论文旨在解决生成式脑解码中的幻觉问题，即生成的文本是否真正反映大脑的语义激活，还是仅由强大的生成模型所“幻想”出来的结果。其解决方案的关键在于通过后验崩溃的视角重新定义解码任务，将原本的刺激文本逐字重建转变为核心语义的摘要，从而提升生成内容的语义基础。为此，作者提出了生成语言检查模型（GLIM），该模型强调学习具有信息量且可解释的脑电（EEG）表示，以在异构和小规模数据条件下增强语义 grounding。

链接: https://arxiv.org/abs/2505.17099
作者: Xiaozhao Liu,Dinggang Shen,Xihui Liu
机构: University of Hong Kong (香港大学); ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL)
备注: Code, checkpoint and text samples available at this https URL

点击查看摘要

Abstract:Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable–whether they truly reflect semantic activation in the brain, or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than previously verbatim reconstruction of stimulus texts. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.
zh

[NLP-196] ACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

【速读】：该论文旨在解决多模态上下文学习（multimodal in-context learning, ICL）在复杂推理或开放式生成任务中对输入序列质量高度敏感的问题，以及缺乏对视觉语言模型（LVLMs）如何在推理过程中利用这些序列的深入理解。其解决方案的关键在于通过任务映射（task mapping）的视角系统地解释多模态ICL，并提出TACO模型，该模型基于轻量级Transformer架构，具备任务感知注意力机制，能够动态配置上下文序列，并通过将任务映射信号注入自回归解码过程，实现序列构建与任务推理之间的双向协同。

链接: https://arxiv.org/abs/2505.17098
作者: Yanshu Li,Tian Yun,Jianjiang Yang,Pinyuan Feng,Jinfa Huang,Ruixiang Tang
机构: Brown University (布朗大学); University of Bristol (布里斯托大学); Columbia University (哥伦比亚大学); University of Rochester (罗切斯特大学); Rutgers University (罗格斯大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 11 figures, 19 tables. arXiv admin note: substantial text overlap with arXiv:2503.04839

点击查看摘要

Abstract:Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input in-context sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures in-context sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a valuable perspective for interpreting and improving multimodal ICL.
zh

[NLP-197] CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention

【速读】：该论文旨在解决多模态上下文学习（multimodal in-context learning, ICL）在大型视觉语言模型（large vision-language models, LVLMs）中表现不稳定的问题，特别是现有研究主要关注序列配置优化，而忽视了LVLM内部机制的深入分析。论文的关键解决方案是提出一种名为上下文感知调制注意力（Context-Aware Modulated Attention, CAMA）的方法，该方法通过直接校准LVLM的注意力logits实现性能提升，具有无需训练、可无缝集成到多种开源LVLM中的优势。

链接: https://arxiv.org/abs/2505.17097
作者: Yanshu Li,JianJiang Yang,Bozheng Li,Ruixiang Tang
机构: Brown University (布朗大学); University of Bristol (布里斯托大学); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Multimodal in-context learning (ICL) enables large vision-language models (LVLMs) to efficiently adapt to novel tasks, supporting a wide array of real-world applications. However, multimodal ICL remains unstable, and current research largely focuses on optimizing sequence configuration while overlooking the internal mechanisms of LVLMs. In this work, we first provide a theoretical analysis of attentional dynamics in multimodal ICL and identify three core limitations of standard attention that ICL impair performance. To address these challenges, we propose Context-Aware Modulated Attention (CAMA), a simple yet effective plug-and-play method for directly calibrating LVLM attention logits. CAMA is training-free and can be seamlessly applied to various open-source LVLMs. We evaluate CAMA on four LVLMs across six benchmarks, demonstrating its effectiveness and generality. CAMA opens new opportunities for deeper exploration and targeted utilization of LVLM attention dynamics to advance multimodal reasoning.
zh

[NLP-198] Are LLM s reliable? An exploration of the reliability of large language models in clinical note generation

【速读】：该论文试图解决大型语言模型（LLMs）在临床笔记生成（CNG）系统中的可靠性和一致性问题，特别是在医疗保健提供者（HCPs）对患者数据隐私保护和准确记录的法律与伦理责任背景下。解决方案的关键在于评估12种开源和专有LLMs在多次迭代中生成笔记的一致性（字符串等价率）、语义一致性（相同含义）和语义相似性（正确性），以提升HCPs对LLM驱动工具的信任度。研究结果表明，所有模型家族的响应在语义上保持一致，且多数模型生成的笔记接近专家笔记，其中Meta的Llama 70B表现最为可靠。

链接: https://arxiv.org/abs/2505.17095
作者: Kristine Ann M. Carandang,Jasper Meynard P. Araña,Ethan Robert A. Casin,Christopher P. Monterola,Daniel Stanley Y. Tan,Jesus Felix B. Valenzuela,Christian M. Alis
机构: Asian Institute of Management (亚洲管理学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Due to the legal and ethical responsibilities of healthcare providers (HCPs) for accurate documentation and protection of patient data privacy, the natural variability in the responses of large language models (LLMs) presents challenges for incorporating clinical note generation (CNG) systems, driven by LLMs, into real-world clinical processes. The complexity is further amplified by the detailed nature of texts in CNG. To enhance the confidence of HCPs in tools powered by LLMs, this study evaluates the reliability of 12 open-weight and proprietary LLMs from Anthropic, Meta, Mistral, and OpenAI in CNG in terms of their ability to generate notes that are string equivalent (consistency rate), have the same meaning (semantic consistency) and are correct (semantic similarity), across several iterations using the same prompt. The results show that (1) LLMs from all model families are stable, such that their responses are semantically consistent despite being written in various ways, and (2) most of the LLMs generated notes close to the corresponding notes made by experts. Overall, Meta’s Llama 70B was the most reliable, followed by Mistral’s Small model. With these findings, we recommend the local deployment of these relatively smaller open-weight models for CNG to ensure compliance with data privacy regulations, as well as to improve the efficiency of HCPs in clinical documentation.
zh

[NLP-199] Large Language Models Implicitly Learn to See and Hear Just By Reading

【速读】：该论文试图解决如何利用文本语言模型（Text LLM）的预训练权重来提升音频和图像分类任务性能的问题，而无需从头开始训练专用模型。其解决方案的关键在于通过将图像块、音频波形或标记作为输入，使文本语言模型能够生成典型的分类管道所需的嵌入或类别标签，从而展现出跨模态的感知能力。这表明文本LLM内部已具备可被激活以适应不同应用的强大内部结构。

链接: https://arxiv.org/abs/2505.17091
作者: Prateek Verma,Mert Pilanci
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 6 pages, 3 figures, 4 tables. Under Review WASPAA 2025

点击查看摘要

Abstract:This paper presents a fascinating find: By training an auto-regressive LLM model on text tokens, the text model inherently develops internally an ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLM models fine-tune text LLM models to give text output conditioned on images and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.
zh

[NLP-200] rust Me I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在安全性和用户体验之间的平衡问题，特别是针对诸如越狱攻击、有害内容、幻觉和偏见等多类安全风险。现有防御方法通常仅针对单一威胁类型或采用僵硬的直接拒绝策略，导致用户体验下降且难以应对多样化的新型攻击。论文提出的解决方案是基于对抗场景外推（Adversarial Scenario Extrapolation, ASE）的推理阶段计算框架，其关键在于利用思维链（Chain-of-Thought, CoT）推理机制，引导模型在生成响应前自主思考潜在的对抗场景并制定防御策略，从而提升模型的鲁棒性与交互自然性。

链接: https://arxiv.org/abs/2505.17089
作者: Md Rafi Ur Rashid,Vishnu Asutosh Dasu,Ye Wang,Gang Tan,Shagufta Mehnaz
机构: 未知
类目: Computation and Language (cs.CL)
备注: 26 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to 4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial QA and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.
zh

[NLP-201] Informatics for Food Processing

【速读】：该论文试图解决传统食品分类框架（如NOVA、Nutri-Score和SIGA）在主观性和可重复性方面存在的问题，这些问题限制了流行病学研究和公共政策的制定。其解决方案的关键在于引入计算方法，包括基于营养成分数据训练的随机森林模型FoodProX，用于推断加工水平并生成连续的FPro评分，以及利用大型语言模型（如BERT和BioBERT）对食品描述和配料表进行语义嵌入，以实现预测任务，即使在数据缺失的情况下也能保持有效性。此外，通过Open Food Facts数据库的案例研究，展示了多模态AI模型整合结构化与非结构化数据的能力，为食品加工评估提供了新的范式。

链接: https://arxiv.org/abs/2505.17087
作者: Gordana Ispirova,Michael Sebek,Giulia Menichetti
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This chapter explores the evolution, classification, and health implications of food processing, while emphasizing the transformative role of machine learning, artificial intelligence (AI), and data science in advancing food informatics. It begins with a historical overview and a critical review of traditional classification frameworks such as NOVA, Nutri-Score, and SIGA, highlighting their strengths and limitations, particularly the subjectivity and reproducibility challenges that hinder epidemiological research and public policy. To address these issues, the chapter presents novel computational approaches, including FoodProX, a random forest model trained on nutrient composition data to infer processing levels and generate a continuous FPro score. It also explores how large language models like BERT and BioBERT can semantically embed food descriptions and ingredient lists for predictive tasks, even in the presence of missing data. A key contribution of the chapter is a novel case study using the Open Food Facts database, showcasing how multimodal AI models can integrate structured and unstructured data to classify foods at scale, offering a new paradigm for food processing assessment in public health and research.
zh

[NLP-202] Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在问答（Question Answering, QA）任务中因缺乏事实性知识而产生的幻觉问题，以及现有方法依赖上下文学习导致的性能受限问题。其解决方案的关键在于提出Mujica框架，该框架包含一个将问题分解为子问题有向无环图的规划器和一个通过检索与推理解决问題的工作者，同时引入MyGO（Minimalist policy Gradient Optimization）方法，通过从渐近最优策略中采样轨迹来替代传统策略梯度更新，从而实现稳定高效的训练。

链接: https://arxiv.org/abs/2505.17086
作者: Yihong Wu,Liheng Ma,Muzhi Li,Jiaming Zhou,Jianye Hao,Ho-fung Leung,Irwin King,Yingxue Zhang,Jian-Yun Nie
机构: Université de Montréal; McGill University & Mila - Quebec AI Institute; The Chinese University of Hong Kong; Huawei Noah’s Ark Lab
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable versatility, due to the lack of factual knowledge, their application to Question Answering (QA) tasks remains hindered by hallucination. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.17086 [cs.CL] (or arXiv:2505.17086v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.17086 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Yihong Wu [view email] [v1] Tue, 20 May 2025 18:33:03 UTC (429 KB)
zh

[NLP-203] GSDFuse: Capturing Cognitive Inconsistencies from Multi-Dimensional Weak Signals in Social Media Steganalysis

【速读】：该论文旨在解决社交媒体平台上恶意语言隐写术（linguistic steganography）的检测问题，特别是在文本碎片化和复杂对话结构下识别细微认知不一致性的挑战，以及在极端隐写稀疏性和复杂隐写技术背景下实现多维弱信号鲁棒聚合的难题。其解决方案的关键在于GSDFuse方法，该方法通过层次化多模态特征工程、策略性数据增强、自适应证据融合和判别嵌入学习等核心技术，系统性地提升了对隐写内容的检测能力。

链接: https://arxiv.org/abs/2505.17085
作者: Kaibo Huang,Zipei Zhang,Yukun Wei,TianXin Zhang,Zhongliang Yang,Linna Zhou
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing IntokenTech Co., Ltd. (北京智刻科技有限公司)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. Steganalysis is profoundly hindered by the challenge of identifying subtle cognitive inconsistencies arising from textual fragmentation and complex dialogue structures, and the difficulty in achieving robust aggregation of multi-dimensional weak signals, especially given extreme steganographic sparsity and sophisticated steganography. These core detection difficulties are compounded by significant data imbalance. This paper introduces GSDFuse, a novel method designed to systematically overcome these obstacles. GSDFuse employs a holistic approach, synergistically integrating hierarchical multi-modal feature engineering to capture diverse signals, strategic data augmentation to address sparsity, adaptive evidence fusion to intelligently aggregate weak signals, and discriminative embedding learning to enhance sensitivity to subtle inconsistencies. Experiments on social media datasets demonstrate GSDFuse’s state-of-the-art (SOTA) performance in identifying sophisticated steganography within complex dialogue environments. The source code for GSDFuse is available at this https URL.
zh

[NLP-204] Scale-invariant Attention

【速读】：该论文试图解决大语言模型（Large Language Model, LLM）研究中一个持续的挑战，即开发能够从较短上下文的训练泛化到更长上下文推理的注意力机制。论文提出两种期望所有有效长上下文注意力机制都具备的条件：尺度不变的总注意力和尺度不变的注意力稀疏性。在高斯假设下，研究证明对注意力logits进行简单的位置相关变换即可满足上述条件。实验结果显示，这种尺度不变的注意力方案在零样本泛化任务中显著降低了验证损失，并在长上下文检索中表现出色。

链接: https://arxiv.org/abs/2505.17083
作者: Ben Anson,Xi Wang,Laurence Aitchison
机构: University of Bristol(布里斯托大学); Johns Hopkins University(约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:One persistent challenge in LLM research is the development of attention mechanisms that are able to generalise from training on shorter contexts to inference on longer contexts. We propose two conditions that we expect all effective long context attention mechanisms to have: scale-invariant total attention, and scale-invariant attention sparsity. Under a Gaussian assumption, we show that a simple position-dependent transformation of the attention logits is sufficient for these conditions to hold. Experimentally we find that the resulting scale-invariant attention scheme gives considerable benefits in terms of validation loss when zero-shot generalising from training on short contexts to validation on longer contexts, and is effective at long-context retrieval.
zh

[NLP-205] GemMaroc: Unlocking Darija Proficiency in LLM s with Minimal Data

【速读】：该论文旨在解决开源大型语言模型（LLMs）对摩洛哥阿拉伯语（Darija）支持不足的问题，传统方法要么需要添加计算开销大的阿拉伯语适配器，要么会牺牲模型的推理能力。其解决方案的关键在于采用“质量优先、数量其次”的对齐策略，通过将少量高质量指令集翻译为Darija并结合数学、编程和科学提示，实现Darija的流畅生成，同时保留模型的跨语言推理能力。该方法在仅消耗少量计算资源的情况下显著提升了DarijaMMLU等基准测试的性能。

链接: https://arxiv.org/abs/2505.17082
作者: Abderrahman Skiredj,Ferdaous Azhari,Houdaifa Atou,Nouamane Tazi,Ismail Berrada
机构: College of Computing, Mohammed VI Polytechnic University, Benguerir, Morocco; Hugging Face, Paris, France; National Institute of Posts and Telecoms, Rabat, Morocco
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone s cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites LIMA 1 K, DEITA 6 K and TULU 50 K into Darija, preserve 20 of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5 K mixed instructions lifts DarijaMMLU from 32.8 to 42.7 ; adding the reasoning-dense TULU portion pushes it to 47.5 with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6 ) and leaps ahead on Darija commonsense, scoring 60.5 on HellaSwag versus Atlas-Chat s 48.4 . Crucially, GemMaroc retains Gemma-27B s strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU.h, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.
zh

[NLP-206] Not Minds but Signs: Reframing LLM s through Semiotics

【速读】：该论文试图解决当前将大型语言模型（Large Language Models, LLMs）视为认知系统所存在的局限性，主张从符号学（semiotic）视角重新理解这些模型的功能与作用。其解决方案的关键在于将LLMs视为符号操作与意义生成的代理，而非具备语言理解或人类思维模拟能力的实体，强调其核心功能是基于概率关联对语言形式进行重组、再语境化和传播。通过引入符号学框架，论文避免了拟人化倾向，并提供了更精确的视角来理解LLMs在文化过程中的参与方式，即通过生成可被解读的文本而非思考来发挥作用。

链接: https://arxiv.org/abs/2505.17080
作者: Davide Picca
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.
zh

[NLP-207] GloSS over Toxicity: Understanding and Mitigating Toxicity in LLM s via Global Toxic Subspace

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中毒性内容生成的问题，其核心在于如何有效识别并消除模型中的毒性子空间。传统方法通常将毒性区域视为前馈网络（Feed-Forward Network, FFN）中的毒害向量或逐层子空间，而本文通过深入分析发现，全局毒性子空间（Global Toxic Subspace）能够更有效地全面表征模型中的毒性区域。解决方案的关键在于提出GloSS（Global Toxic Subspace Suppression），这是一种轻量级的四阶段方法，通过识别并从FFN参数中移除全局毒性子空间来减轻毒性，实验表明该方法在保持模型通用能力的同时实现了最先进的去毒效果，且无需大规模数据或模型重训练。

链接: https://arxiv.org/abs/2505.17078
作者: Zenghao Duan,Zhiyi Yin,Zhichao Shi,Liang Pang,Shaoling Jing,Jiayi Wu,Yu Yan,Huawei Shen,Xueqi Cheng
机构: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院计算技术研究所); University of Chinese Academy of Sciences, Beijing, China(中国科学院大学); Dalian University of Technology, Liaoning, China(大连理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of toxic region within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the parameters of FFN. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models general capabilities, without requiring large-scale data or model retraining.
zh

[NLP-208] Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

【速读】：该论文试图解决语音分词器中帧率（frame rate）对语音标记化影响的机制问题，特别是不同语言在帧率变化下的差异性表现。其解决方案的关键在于通过对比分析普通话和英语两种语言在不同帧率下的语音编码效果，评估语义标记在语音识别任务中的表现，从而揭示帧率、音素密度及语言特异性声学特征之间的相互作用。

链接: https://arxiv.org/abs/2505.17076
作者: Haoyang Zhang,Hexin Liu,Xiangyu Zhang,Qiquan Zhang,Yuchen Hu,Junqi Zhao,Fei Tian,Xuerui Yang,Eng Siong Chng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:The speech tokenizer plays a crucial role in recent speech tasks, generally serving as a bridge between speech signals and language models. While low-frame-rate codecs are widely employed as speech tokenizers, the impact of frame rates on speech tokens remains underexplored. In this study, we investigate how varying frame rates affect speech tokenization by examining Mandarin and English, two typologically distinct languages. We encode speech at different frame rates and evaluate the resulting semantic tokens in the speech recognition task. Our findings reveal that frame rate variations influence speech tokenization differently for each language, highlighting the interplay between frame rates, phonetic density, and language-specific acoustic features. The results provide insights into optimizing frame rate selection for speech tokenizers, with implications for automatic speech recognition, text-to-speech, and other speech-related applications.
zh

[NLP-209] Development and Validation of Engagement and Rapport Scales for Evaluating User Experience in Multimodal Dialogue Systems

【速读】：该论文试图解决如何评估多模态对话系统在外语学习情境中的用户体验质量的问题，其解决方案的关键在于开发并验证两个用于衡量参与度和亲和力的量表。这些量表基于教育心理学、社会心理学及第二语言习得理论设计，并通过Cronbach’s alpha系数分析和验证性因子分析验证了其结构效度和项目可靠性，最终通过比较与人类导师和对话代理互动时的参与度和亲和力得分，证明了量表的有效性。

链接: https://arxiv.org/abs/2505.17075
作者: Fuma Kurata,Mao Saeki,Masaki Eguchi,Shungo Suzuki,Hiroaki Takatsu,Yoichi Matsuyama
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study aimed to develop and validate two scales of engagement and rapport to evaluate the user experience quality with multimodal dialogue systems in the context of foreign language learning. The scales were designed based on theories of engagement in educational psychology, social psychology, and second language this http URL-four Japanese learners of English completed roleplay and discussion tasks with trained human tutors and a dialog agent. After each dialogic task was completed, they responded to the scales of engagement and rapport. The validity and reliability of the scales were investigated through two analyses. We first conducted analysis of Cronbach’s alpha coefficient and a series of confirmatory factor analyses to test the structural validity of the scales and the reliability of our designed items. We then compared the scores of engagement and rapport between the dialogue with human tutors and the one with a dialogue agent. The results revealed that our scales succeeded in capturing the difference in the dialogue experience quality between the human interlocutors and the dialogue agent from multiple perspectives.
zh

[NLP-210] Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

【速读】：该论文旨在解决大规模语言模型（Large Language Model, LLM）推理服务系统中由于推理请求执行时间不确定而导致的高效调度问题。现有方法仅基于预测输出长度估计执行时间，忽略了验证过程中令牌接受率对执行时间的影响，导致估计不准确。论文提出的半明察请求调度算法Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD) 的关键在于根据解码过程中的请求特征动态调整调度策略，通过维护多个优先级队列和跨队列抢占机制，在令牌接受率动态变化时保持低延迟，并在接受率稳定后精确估计执行时间并进行调度。

链接: https://arxiv.org/abs/2505.17074
作者: Ruixiao Li,Fahao Chen,Peng Li
机构: Xi’an Jiaotong University (西安交通大学); The University of Aizu (会津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
zh

[NLP-211] Mechanistic Interpretability of GPT -like Models on Summarization Tasks ACL2025

【速读】：该论文试图解决大型语言模型在摘要任务中的机制可解释性问题，即揭示模型如何适应并执行摘要任务。其解决方案的关键在于通过对比预训练与微调模型的注意力模式和内部激活变化，识别出在摘要任务中发生显著转变的特定层和注意力头，从而定位“摘要电路”。研究发现，中间层（尤其是第2、3和5层）表现出最显著的变化，且62%的注意力头显示出熵值下降，表明信息选择更加集中。通过针对这些电路进行定向LoRA微调，能够在减少训练周期的情况下实现性能提升，从而弥补黑箱评估与机制理解之间的差距。

链接: https://arxiv.org/abs/2505.17073
作者: Anurag Mishra
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages (6 content + 2 references/appendix), 6 figures, 2 tables; under review for the ACL 2025 Student Research Workshop

点击查看摘要

Abstract:Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the “summarization circuit” within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
zh

[NLP-212] Safety Alignment Can Be Not Superficial With Explicit Safety Signals ICML2025

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在安全对齐方面的表面性问题，即现有方法未能有效提升模型对对抗性攻击的鲁棒性。其解决方案的关键在于显式引入一个与安全相关的二分类任务，并将其信号与注意力机制和解码策略相结合，从而消除模型在安全决策边界上的模糊性，使其能够更负责任地应对恶意查询。该方法在不到0.2倍的额外计算开销下，实现了对查询及先前生成标记的安全性评估。

链接: https://arxiv.org/abs/2505.17072
作者: Jianwei Li,Jung-Eng Kim
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2025

点击查看摘要

Abstract:Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.
zh

[NLP-213] Whats in a prompt? Language models encode literary style in prompt embeddings

【速读】：该论文试图解决语言模型如何将整个提示（prompt）的累积信息压缩到单个嵌入（embedding）中的问题，特别是关注隐性而非事实性信息在深度表示中的编码方式。其解决方案的关键在于利用文学作品作为实验材料，通过分析不同小说短片段在潜在空间中的分布特性，揭示出嵌入不仅捕捉语义内容，还编码了作者的写作风格，这种风格的几何结构对于作者身份识别和文学分析具有潜在应用价值。

链接: https://arxiv.org/abs/2505.17071
作者: Raphaël Sarfati,Haley Moller,Toni J. B. Liu,Nicolas Boullé,Christopher Earls
机构: Cornell University (康奈尔大学); Yale University (耶鲁大学); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models use high-dimensional latent spaces to encode and process textual information. Much work has investigated how the conceptual content of words translates into geometrical relationships between their vector representations. Fewer studies analyze how the cumulative information of an entire prompt becomes condensed into individual embeddings under the action of transformer layers. We use literary pieces to show that information about intangible, rather than factual, aspects of the prompt are contained in deep representations. We observe that short excerpts (10 - 100 tokens) from different novels separate in the latent space independently from what next-token prediction they converge towards. Ensembles from books from the same authors are much more entangled than across authors, suggesting that embeddings encode stylistic features. This geometry of style may have applications for authorship attribution and literary analysis, but most importantly reveals the sophistication of information processing and compression accomplished by language models.
zh

[NLP-214] Improving endpoint detection in end-to-end streaming ASR for conversational speech INTERSPEECH2024

【速读】：该论文旨在解决基于转换器的自动语音识别（Transducer-based ASR, T-ASR）在语音端点检测（Endpointing, EP）中的延迟输出问题，该问题可能导致EP错误或延迟，进而影响用户体验。解决方案的关键在于通过引入词尾标记并结合延迟惩罚来缓解延迟输出问题，同时利用辅助网络实现可靠的帧级语音活动检测以解决EP延迟问题。

链接: https://arxiv.org/abs/2505.17070
作者: Anandh C,Karthik Pandia Durai,Jeena Prakash,Manickavela Arumugam,Kadri Hacioglu,S.Pavankumar Dubagunta,Andreas Stolcke,Shankar Venkatesan,Aravind Ganapathiraju
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2024

点击查看摘要

Abstract:ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human/machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which could lead to errors or delays in EP. Inaccurate EP will cut the user off while speaking, returning incomplete transcript while delays in EP will increase the perceived latency, degrading the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining a reliable frame-level speech activity detection using an auxiliary network. We apply the proposed methods on Switchboard conversational speech corpus and evaluate it against a delay penalty method.
zh

[NLP-215] Predictively Combatting Toxicity in Health-related Online Discussions through Machine Learning IJCNN2025

【速读】：该论文试图解决在线健康相关讨论中用户毒性行为引发的社会冲突和危险、非科学行为的问题，传统方法通过检测、标记和删除有毒评论来应对，但往往对平台和用户均产生负面影响。其解决方案的关键在于采用基于协同过滤的机器学习方法，预测用户在健康相关在线讨论中可能产生的毒性互动，从而提前防止冲突用户的配对，该方法在相关指标上实现了超过80%的预测性能。

链接: https://arxiv.org/abs/2505.17068
作者: Jorge Paz-Ruza,Amparo Alonso-Betanzos,Bertha Guijarro-Berdiñas,Carlos Eiras-Franco
机构: Universidade da Coruña (奥杜瓦大学); CITIC (信息与通信技术中心); LIDIA Group (LIDIA小组)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: IJCNN 2025

点击查看摘要

Abstract:In health-related topics, user toxicity in online discussions frequently becomes a source of social conflict or promotion of dangerous, unscientific behaviour; common approaches for battling it include different forms of detection, flagging and/or removal of existing toxic comments, which is often counterproductive for platforms and users alike. In this work, we propose the alternative of combatting user toxicity predictively, anticipating where a user could interact toxically in health-related online discussions. Applying a Collaborative Filtering-based Machine Learning methodology, we predict the toxicity in COVID-related conversations between any user and subcommunity of Reddit, surpassing 80% predictive performance in relevant metrics, and allowing us to prevent the pairing of conflicting users and subcommunities.
zh

[NLP-216] Unveil Multi-Picture Descriptions for Multilingual Mild Cognitive Impairment Detection via Contrastive Learning

【速读】：该论文旨在解决从图片描述中检测轻度认知障碍（Mild Cognitive Impairment, MCI）的问题，特别是在多语言和多图片场景下的挑战。现有研究主要集中在英语母语者对单张图片的描述上，而本文通过引入多语言参与者和多张图片，扩展了研究范围，但同时也带来了分析依赖于图片内容的新难题。解决方案的关键在于提出一个包含三个核心组件的框架：通过监督对比学习增强判别性表征学习、引入图像模态以弥补仅依赖语音和文本模态的不足，以及应用产品专家（Product of Experts, PoE）策略来缓解虚假相关性和过拟合问题。该框架在MCI检测任务中显著提升了性能，验证了其有效性。

链接: https://arxiv.org/abs/2505.17067
作者: Kristin Qi,Jiali Cheng,Youxiang Zhu,Hadi Amiri,Xiaohui Liang
机构: University of Massachusetts, Boston (马萨诸塞大学波士顿分校); University of Massachusetts, Lowell (马萨诸塞大学洛厄尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to the IEEE GlobeCom 2025

点击查看摘要

Abstract:Detecting Mild Cognitive Impairment from picture descriptions is critical yet challenging, especially in multilingual and multiple picture settings. Prior work has primarily focused on English speakers describing a single picture (e.g., the ‘Cookie Theft’). The TAUKDIAL-2024 challenge expands this scope by introducing multilingual speakers and multiple pictures, which presents new challenges in analyzing picture-dependent content. To address these challenges, we propose a framework with three components: (1) enhancing discriminative representation learning via supervised contrastive learning, (2) involving image modality rather than relying solely on speech and text modalities, and (3) applying a Product of Experts (PoE) strategy to mitigate spurious correlations and overfitting. Our framework improves MCI detection performance, achieving a +7.1% increase in Unweighted Average Recall (UAR) (from 68.1% to 75.2%) and a +2.9% increase in F1 score (from 80.6% to 83.5%) compared to the text unimodal baseline. Notably, the contrastive learning component yields greater gains for the text modality compared to speech. These results highlight our framework’s effectiveness in multilingual and multi-picture MCI detection.
zh

[NLP-217] Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases

【速读】：该论文试图解决罕见疾病研究中信息提取与分析的挑战，尤其是如何利用生成式 AI (Generative AI) 提升诊断、治疗及患者护理的效率与准确性。解决方案的关键在于利用大型语言模型（LLMs）对文本数据进行深度分析，以识别和提取关键医学信息，并通过构建智能对话代理促进患者互动，从而支持精准且及时的诊断。此外，论文强调了多模态数据整合的潜力，包括基因组、影像学和电子健康记录的结合，以实现对罕见疾病的更全面理解。

链接: https://arxiv.org/abs/2505.17065
作者: Valentina Carbonari,Pierangelo Veltri,Pietro Hiram Guzzi
机构: University of Catanzaro(卡坦扎罗大学); University of Calabria(卡拉布里亚大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in artificial intelligence, particularly large language models LLMs, have shown promising capabilities in transforming rare disease research. This survey paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies that leverage textual data to uncover insights and patterns critical for diagnosis, treatment, and patient care. While current research predominantly employs textual data, the potential for multimodal data integration combining genetic, imaging, and electronic health records stands as a promising frontier. We review foundational papers that demonstrate the application of LLMs in identifying and extracting relevant medical information, simulating intelligent conversational agents for patient interaction, and enabling the formulation of accurate and timely diagnoses. Furthermore, this paper discusses the challenges and ethical considerations inherent in deploying LLMs, including data privacy, model transparency, and the need for robust, inclusive data sets. As part of this exploration, we present a section on experimentation that utilizes multiple LLMs alongside structured questionnaires, specifically designed for diagnostic purposes in the context of different diseases. We conclude with future perspectives on the evolution of LLMs towards truly multimodal platforms, which would integrate diverse data types to provide a more comprehensive understanding of rare diseases, ultimately fostering better outcomes in clinical settings.
zh

[NLP-218] Synthetic Data RL: Task Definition Is All You Need

【速读】：该论文试图解决强化学习（Reinforcement Learning, RL）在微调基础模型时对大规模人工标注数据的依赖问题，从而限制了其广泛应用。解决方案的关键在于提出一种名为“合成数据强化学习”（Synthetic Data RL）的框架，该框架仅使用从任务定义生成的合成数据进行强化学习微调，通过生成问答对、根据模型求解能力调整问题难度，并利用模型在样本上的平均通过率选择问题进行训练，实现了高效的模型适应。

链接: https://arxiv.org/abs/2505.17063
作者: Yiduo Guo,Zhen Guo,Chuanwei Huang,Zi-Ang Wang,Zekai Zhang,Haofei Yu,Huishuai Zhang,Yikang Shen
机构: Peking University (北京大学); MIT (麻省理工学院); UIUC (伊利诺伊大学厄巴纳-香槟分校); MIT-IBM (麻省理工学院-IBM)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at this https URL.
zh

[NLP-219] Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models ACL2025

【速读】：该论文试图解决大型视觉语言模型（Large Vision-Language Models, LVLMs）中存在的幻觉问题，即模型在生成文本时可能出现与输入图像不一致或错误的信息。解决方案的关键在于提出了一种名为解码混合（Mixture of Decoding, MoD）的新方法，该方法通过评估模型对图像标记的注意力正确性，动态调整解码策略。具体而言，MoD通过比较原始图像标记和模型关注的图像标记生成的输出一致性来判断注意力是否正确，并据此采用互补策略增强关键信息或对比策略抑制误导信息。

链接: https://arxiv.org/abs/2505.17061
作者: Xinlong Chen,Yuanxing Zhang,Qiang Liu,Junfei Wu,Fuzheng Zhang,Tieniu Tan
机构: Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences; Kuaishou Technology; Nanjing University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Findings of ACL 2025

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model’s attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model’s attended image tokens, to distinguish the correctness aforementioned. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at this https URL.
zh

[NLP-220] SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

【速读】：该论文旨在解决全双工对话系统中因模块化架构导致的误差累积以及上下文相关的插话和回声消除等关键挑战。其解决方案的关键在于提出SALMONN-omni，这是首个无需在令牌空间中使用音频编解码器的单一、独立的全双工语音大语言模型（Large Language Model），该模型通过在LLM核心中引入一种新颖的动态思考机制，使模型能够学习在说话和倾听状态之间进行转换。

链接: https://arxiv.org/abs/2505.17060
作者: Wenyi Yu,Siyin Wang,Xiaoyu Yang,Xianzhao Chen,Xiaohai Tian,Jun Zhang,Guangzhi Sun,Lu Lu,Yuxuan Wang,Chao Zhang
机构: Tsinghua University (清华大学); ByteDance (字节跳动); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository this https URL.
zh

[NLP-221] Medalyze: Lightweight Medical Report Summarization Application Using FLAN-T5-Large

【速读】：该论文旨在解决医疗文本理解中的挑战，特别是由于复杂术语和语境相关语言所带来的障碍。其解决方案的关键在于开发了Medalyze，一个基于三个经过微调的FLAN-T5-Large模型的AI应用，这些模型分别用于医学报告摘要生成、患者-医生对话中的健康问题提取以及段落中关键问题的识别。通过在网页和移动端部署具有实时推理能力的系统，并结合可扩展的API和YugabyteDB，Medalyze实现了在医疗信息可访问性方面的高效、隐私保护且轻量级的解决方案。

链接: https://arxiv.org/abs/2505.17059
作者: Van-Tinh Nguyen,Hoang-Duong Pham,Thanh-Hai To,Cong-Tuan Hung Do,Thi-Thu-Trang Dong,Vu-Trung Duong Le,Van-Phuc Hoang
机构: Le Quy Don Technical University (阮文挺大学); University of Science and Technology of Hanoi (河内科学与技术大学); Institute for the treatment of senior Staff, 108 Institute of Clinical Medical and Pharmaceutical Sciences (高级人员治疗研究所，第108临床医学和药学科学研究所); Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures. Submitted to IEEE Access for review. Preliminary version posted for early dissemination and feedback

点击查看摘要

Abstract:Understanding medical texts presents significant challenges due to complex terminology and context-specific language. This paper introduces Medalyze, an AI-powered application designed to enhance the comprehension of medical texts using three specialized FLAN-T5-Large models. These models are fine-tuned for (1) summarizing medical reports, (2) extracting health issues from patient-doctor conversations, and (3) identifying the key question in a passage. Medalyze is deployed across a web and mobile platform with real-time inference, leveraging scalable API and YugabyteDB. Experimental evaluations demonstrate the system’s superior summarization performance over GPT-4 in domain-specific tasks, based on metrics like BLEU, ROUGE-L, BERTScore, and SpaCy Similarity. Medalyze provides a practical, privacy-preserving, and lightweight solution for improving information accessibility in healthcare.
zh

[NLP-222] DO-RAG : A Domain-Specific QA Framework Using Knowledge Graph-Enhanced Retrieval-Augmented Generation

【速读】：该论文旨在解决领域特定问答（Domain-specific QA）系统在事实准确性与结构化专家知识融合方面的挑战，尤其是在处理异构数据和保持推理一致性方面存在的问题。其解决方案的关键在于提出DO-RAG框架，该框架通过多层级知识图谱构建与语义向量检索的集成，结合一种新型的代理式思维链架构，从非结构化、多模态文档中提取结构化关系，构建动态知识图谱以提升检索精度，并在查询时融合图与向量检索结果生成上下文感知的回答，同时通过基于依据的优化减少幻觉现象。

链接: https://arxiv.org/abs/2505.17058
作者: David Osei Opoku,Ming Sheng,Yong Zhang
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology - Tsinghua University (北京信息科学与技术国家研究中心-清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures;

点击查看摘要

Abstract:Domain-specific QA systems require not just generative fluency but high factual accuracy grounded in structured expert knowledge. While recent Retrieval-Augmented Generation (RAG) frameworks improve context recall, they struggle with integrating heterogeneous data and maintaining reasoning consistency. To address these challenges, we propose DO-RAG, a scalable and customizable hybrid QA framework that integrates multi-level knowledge graph construction with semantic vector retrieval. Our system employs a novel agentic chain-of-thought architecture to extract structured relationships from unstructured, multimodal documents, constructing dynamic knowledge graphs that enhance retrieval precision. At query time, DO-RAG fuses graph and vector retrieval results to generate context-aware responses, followed by hallucination mitigation via grounded refinement. Experimental evaluations in the database and electrical domains show near-perfect recall and over 94% answer relevancy, with DO-RAG outperforming baseline frameworks by up to 33.38%. By combining traceability, adaptability, and performance efficiency, DO-RAG offers a reliable foundation for multi-domain, high-precision QA at scale.
zh

[NLP-223] Are LLM s Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective

【速读】：该论文旨在解决如何利用大语言模型（Large Language Models, LLMs）提升标准化考试（English Standardized Tests, ESTs）备考效果的问题，重点评估LLMs在生成准确且符合语境的解答方面的能力。其解决方案的关键在于构建ESTBOOK基准测试平台，该平台整合了五项广泛认可的测试，涵盖29种题型和超过10,576道题目，覆盖文本、图像、音频、表格及数学符号等多种模态，并提出一种分解分析框架，将复杂的EST问题拆解为特定任务的解决步骤，从而系统评估LLMs在推理过程各阶段的表现。

链接: https://arxiv.org/abs/2505.17056
作者: Luoxi Tang,Tharunya Sundar,Shuai Yang,Ankita Patra,Manohar Chippada,Giqi Zhao,Yi Li,Riteng Zhang,Tunan Zhao,Ting Yang,Yuqiao Meng,Weicheng Ma,Zhaohan Xi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.
zh

[NLP-224] Enhancing Mathematics Learning for Hard-of-Hearing Students Through Real-Time Palestinian Sign Language Recognition: A New Dataset

【速读】：该论文试图解决听力障碍学生在数学教育中获取资源困难的问题，通过开发一种基于先进人工智能技术的准确的巴勒斯坦手语（Palestinian Sign Language, PSL）识别系统来提高数学教育的可及性。解决方案的关键在于构建了一个包含41个数学手势类别的定制数据集，并由PSL专家录制以确保语言准确性和领域特异性，同时采用微调的Vision Transformer (ViT) 模型进行手势分类，从而实现了97.59%的高精度识别效果。

链接: https://arxiv.org/abs/2505.17055
作者: Fidaa khandaqji,Huthaifa I. Ashqar,Abdelrahem Atawnih
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The study aims to enhance mathematics education accessibility for hard-of-hearing students by developing an accurate Palestinian sign language PSL recognition system using advanced artificial intelligence techniques. Due to the scarcity of digital resources for PSL, a custom dataset comprising 41 mathematical gesture classes was created, and recorded by PSL experts to ensure linguistic accuracy and domain specificity. To leverage state-of-the-art-computer vision techniques, a Vision Transformer ViTModel was fine-tuned for gesture classification. The model achieved an accuracy of 97.59%, demonstrating its effectiveness in recognizing mathematical signs with high precision and reliability. This study highlights the role of deep learning in developing intelligent educational tools that bridge the learning gap for hard-of-hearing students by providing AI-driven interactive solutions to enhance mathematical comprehension. This work represents a significant step toward innovative and inclusive frosting digital integration in specialized learning environments. The dataset is hosted on Hugging Face at this https URL.
zh

[NLP-225] METHOD: Modular Efficient Transformer for Health Outcome Discovery

【速读】：该论文旨在解决将Transformer架构应用于医疗领域时所面临的独特挑战，包括患者时间序列的不规则采样、可变的时间依赖性以及复杂的上下文关系。其解决方案的关键在于提出一种名为\METHOD~（Modular Efficient Transformer for Health Outcome Discovery）的新型Transformer架构，该架构集成了三项核心创新：患者感知注意力机制以防止信息泄露并实现高效批量处理、自适应滑动窗口注意力方案以捕捉多尺度时间依赖性，以及受U-Net启发的具有动态跳跃连接的结构以有效处理长序列。

链接: https://arxiv.org/abs/2505.17054
作者: Linglong Qian,Zina Ibrahim
机构: King’s College London, London, UK (国王学院伦敦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages

点击查看摘要

Abstract:Recent advances in transformer architectures have revolutionised natural language processing, but their application to healthcare domains presents unique challenges. Patient timelines are characterised by irregular sampling, variable temporal dependencies, and complex contextual relationships that differ substantially from traditional language tasks. This paper introduces \METHOD~(Modular Efficient Transformer for Health Outcome Discovery), a novel transformer architecture specifically designed to address the challenges of clinical sequence modelling in electronic health records. \METHOD~integrates three key innovations: (1) a patient-aware attention mechanism that prevents information leakage whilst enabling efficient batch processing; (2) an adaptive sliding window attention scheme that captures multi-scale temporal dependencies; and (3) a U-Net inspired architecture with dynamic skip connections for effective long sequence processing. Evaluations on the MIMIC-IV database demonstrate that \METHOD~consistently outperforms the state-of-the-art \ETHOS~model, particularly in predicting high-severity cases that require urgent clinical intervention. \METHOD~exhibits stable performance across varying inference lengths, a crucial feature for clinical deployment where patient histories vary significantly in length. Analysis of learned embeddings reveals that \METHOD~better preserves clinical hierarchies and relationships between medical concepts. These results suggest that \METHOD~represents a significant advancement in transformer architectures optimised for healthcare applications, providing more accurate and clinically relevant predictions whilst maintaining computational efficiency.
zh

[NLP-226] Social preferences with unstable interactive reasoning : Large language models in economic trust games

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）如何在社会交换情境中将语言理解转化为类似人类互动的行为问题，特别是其在经济信任博弈中的表现。研究的关键在于通过设计不同情境下的互动实验，评估LLMs在自我利益与信任、互惠之间的权衡能力，以及它们在不同角色设定下表现出的社会偏好和交互推理能力。研究发现，LLMs在未被明确引导的情况下仍表现出一定的信任与互惠行为，且其行为在不同角色设定下存在显著差异，这揭示了LLMs在模拟人类社会行为方面的潜力与局限性。

链接: https://arxiv.org/abs/2505.17053
作者: Ou Jiamin,Eikmans Emile,Buskens Vincent,Pankowska Paulina,Shan Yuli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 2 figures, 2 tables

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated remarkable capabilities in understanding human languages, this study explores how they translate this understanding into social exchange contexts that capture certain essences of real world human interactions. Three LLMs - ChatGPT-4, Claude, and Bard - were placed in economic trust games where players balance self-interest with trust and reciprocity, making decisions that reveal their social preferences and interactive reasoning abilities. Our study shows that LLMs deviate from pure self-interest and exhibit trust and reciprocity even without being prompted to adopt a specific persona. In the simplest one-shot interaction, LLMs emulated how human players place trust at the beginning of such a game. Larger human-machine divergences emerged in scenarios involving trust repayment or multi-round interactions, where decisions were influenced by both social preferences and interactive reasoning. LLMs responses varied significantly when prompted to adopt personas like selfish or unselfish players, with the impact outweighing differences between models or game types. Response of ChatGPT-4, in an unselfish or neutral persona, resembled the highest trust and reciprocity, surpassing humans, Claude, and Bard. Claude and Bard displayed trust and reciprocity levels that sometimes exceeded and sometimes fell below human choices. When given selfish personas, all LLMs showed lower trust and reciprocity than humans. Interactive reasoning to the actions of counterparts or changing game mechanics appeared to be random rather than stable, reproducible characteristics in the response of LLMs, though some improvements were observed when ChatGPT-4 responded in selfish or unselfish personas.
zh

[NLP-227] SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLM s

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）在服务过程中成本高、资源消耗大的问题，尤其是现有以服务器为中心的系统未能充分利用边缘端的消费级GPU。其解决方案的关键在于提出SpecEdge框架，通过推测解码（speculative decoding）机制将LLM工作负载分割到边缘和服务器GPU上，并仅在网络中传输令牌输出，从而实现边缘与服务器的协同推理。该框架还引入了主动边缘草稿生成和流水线感知调度，以提高服务器端吞吐量并降低令牌间延迟。

链接: https://arxiv.org/abs/2505.17052
作者: Jinwoo Park,Seunggeun Cho,Dongsu Han
机构: KAIST(韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
zh

[NLP-228] Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models

【速读】：该论文试图解决如何在不进行昂贵微调或大量提示的情况下，利用用户特定的嵌入表示对大型语言模型（Large Language Models, LLMs）进行有效个性化的问题。解决方案的关键在于提出一种参数高效的Embedding-to-Prefix (E2P)方法，该方法通过将预计算的上下文嵌入注入到LLM的隐藏表示空间中，借助一个学习得到的投影将嵌入映射到单个软标记前缀，从而实现个性化，同时保持基础模型冻结并避免高昂的适应技术。

链接: https://arxiv.org/abs/2505.17051
作者: Bernd Huber,Ghazal Fazelnia,Andreas Damianou,Sebastian Peleato,Max Lefarov,Praveen Ravichandran,Marco De Nadai,Mounia Lalmas-Roellke,Paul N. Bennett
机构: Spotify( Spotify)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at generating contextually relevant content. However, tailoring these outputs to individual users for effective personalization is a significant challenge. While rich user-specific information often exists as pre-existing user representations, such as embeddings learned from preferences or behaviors, current methods to leverage these for LLM personalization typically require costly fine-tuning or token-heavy prompting. We propose Embedding-to-Prefix (E2P), a parameter-efficient method that injects pre-computed context embeddings into an LLM’s hidden representation space through a learned projection to a single soft token prefix. This enables effective personalization while keeping the backbone model frozen and avoiding expensive adaptation techniques. We evaluate E2P across two public datasets and in a production setting: dialogue personalization on Persona-Chat, contextual headline generation on PENS, and large-scale personalization for music and podcast consumption. Results show that E2P preserves contextual signals and achieves strong performance with minimal computational overhead, offering a scalable, efficient solution for contextualizing generative AI systems.
zh

[NLP-229] owards Robust Evaluation of STEM Education: Leverag ing MLLM s in Project-Based Learning

【速读】：该论文试图解决当前教育领域中基于项目的学习（Project-Based Learning, PBL）任务在评估过程中存在的不足，即现有基准在自由格式输出结构和严格的人类专家验证流程方面存在缺陷，限制了其在真实教育任务中的有效性。此外，由于模型幻觉和不稳定性，缺乏自动化流程来支持教师利用多模态大语言模型（Multimodal Large Language Models, MLLMs）完成复杂职责。解决方案的关键在于引入PBLBench，这是一个新型基准，旨在评估基于领域特定知识和长上下文理解的复杂推理能力，并通过层次分析法（Analytic Hierarchy Process, AHP）建立可靠的地面真实数据，以结构化和加权的方式定义评估标准。

链接: https://arxiv.org/abs/2505.17050
作者: Yanhao Jia,Xinyi Wu,Qinglin Zhang,Yiran Qin,Luwei Xiao,Shuai Zhao
机构: Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.
zh

[NLP-230] Gender and Positional Biases in LLM -Based Hiring Decisions: Evidence from Comparative CV/Résumé Evaluations

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在基于简历或履历评估专业候选人时是否存在性别偏见的问题。其解决方案的关键在于通过系统化实验设计，将性别特征（如姓名）作为变量进行控制，并观察LLMs在不同情境下的选择偏好，从而揭示模型是否无意识地受到性别线索的影响。实验中通过交换姓名、引入性别字段、使用中性标识符等方法，验证了LLMs在性别判断上的潜在偏倚及其可调节性。

链接: https://arxiv.org/abs/2505.17049
作者: David Rozado
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This study examines the behavior of Large Language Models (LLMs) when evaluating professional candidates based on their resumes or curricula vitae (CVs). In an experiment involving 22 leading LLMs, each model was systematically given one job description along with a pair of profession-matched CVs, one bearing a male first name, the other a female first name, and asked to select the more suitable candidate for the job. Each CV pair was presented twice, with names swapped to ensure that any observed preferences in candidate selection stemmed from gendered names cues. Despite identical professional qualifications across genders, all LLMs consistently favored female-named candidates across 70 different professions. Adding an explicit gender field (male/female) to the CVs further increased the preference for female applicants. When gendered names were replaced with gender-neutral identifiers “Candidate A” and “Candidate B”, several models displayed a preference to select “Candidate A”. Counterbalancing gender assignment between these gender-neutral identifiers resulted in gender parity in candidate selection. When asked to rate CVs in isolation rather than compare pairs, LLMs assigned slightly higher average scores to female CVs overall, but the effect size was negligible. Including preferred pronouns (he/him or she/her) next to a candidate’s name slightly increased the odds of the candidate being selected regardless of gender. Finally, most models exhibited a substantial positional bias to select the candidate listed first in the prompt. These findings underscore the need for caution when deploying LLMs in high-stakes autonomous decision-making contexts and raise doubts about whether LLMs consistently apply principled reasoning.
zh

[NLP-231] Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally

【速读】：该论文旨在解决中央银行在政策沟通中潜在的误解问题，特别是这些误解可能对弱势群体产生不成比例的影响。其解决方案的关键在于构建了一个名为World Central Banks (WCB)的数据集，这是目前最全面的货币政策语料库，包含来自25个中央银行的38万多句话，并通过统一采样、双重标注者标注、争议解决和专家复审确保数据质量。此外，研究定义了三种任务：立场检测、时间分类和不确定性估计，并基于这些任务对多种预训练语言模型（PLMs）和大语言模型（LLMs）进行了广泛的基准测试，验证了跨银行数据聚合模型优于单一银行数据模型的有效性。

链接: https://arxiv.org/abs/2505.17048
作者: Agam Shah,Siddhant Sukhani,Huzaifa Pardawala,Saketh Budideti,Riya Bhadani,Rudra Gopal,Siddhartha Somani,Michael Galarnyk,Soungmin Lee,Arnav Hiray,Akshar Ravichandran,Eric Kim,Pranav Aluru,Joshua Zhang,Sebastian Jaskowski,Veer Guda,Meghaj Tarte,Liqin Ye,Spencer Gosden,Rutwik Routu,Rachel Yuh,Sloka Chava,Sahasra Chava,Dylan Patrick Kelly,Aiden Chiang,Harsit Mittal,Sudheer Chava
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computational Finance (q-fin.CP); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank’s data, confirming the principle “the whole is greater than the sum of its parts.” Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework’s economic utility. Our artifacts are accessible through the HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
zh

[NLP-232] Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe

【速读】：该论文试图解决生成式AI（Generative AI）在医疗实践中作为病历记录员（scribe）使用的质量评估问题，目前尚无标准化的方法来衡量AI病历记录的质量。解决方案的关键在于开发了一项盲法研究，通过比较大型语言模型（Large Language Model, LLM）生成的临床记录与领域专家撰写的记录，利用 Physician Documentation Quality Instrument (PDQI9) 的定量指标来评估记录质量，从而提供一种实用的评估方法。

链接: https://arxiv.org/abs/2505.17047
作者: Erin Palm,Astrit Manikantan,Mark E. Pepin,Herprit Mahal,Srikanth Subramanya Belwadi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 tables, 1 figure. Submitted for peer review 05/15/2025

点击查看摘要

Abstract:In medical practices across the United States, physicians have begun implementing generative artificial intelligence (AI) tools to perform the function of scribes in order to reduce the burden of documenting clinical encounters. Despite their widespread use, no established methods exist to gauge the quality of AI scribes. To address this gap, we developed a blinded study comparing the relative performance of large language model (LLM) generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality, which we adapted to assess relative performance of AI generated notes. Clinical experts spanning 5 medical specialties used the PDQI9 tool to evaluate specialist-drafted Gold notes and LLM authored Ambient notes. Two evaluators from each specialty scored notes drafted from a total of 97 patient visits. We found uniformly high inter rater agreement (RWG greater than 0.7) between evaluators in general medicine, orthopedics, and obstetrics and gynecology, and moderate (RWG 0.5 to 0.7) to high inter rater agreement in pediatrics and cardiology. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5 (p = 0.04). Our findings support the use of the PDQI9 instrument as a practical method to gauge the quality of LLM authored notes, as compared to human-authored notes.
zh

[NLP-233] Assessing GPT s Bias Towards Stigmatized Social Groups: An Intersectional Case Study on Nationality Prejudice and Psychophobia

【速读】：该论文试图解决基础大型语言模型（Large Language Models, LLMs）中存在针对特定国籍和污名化社会群体的显著偏见问题，特别是这些偏见在与心理障碍等身份特征交叉时所产生的伦理影响。解决方案的关键在于通过结构化的提示系列评估GPT-3.5/4/4o等广泛使用的LLMs在涉及美国和朝鲜国籍以及不同心理障碍情境下的响应，从而揭示模型在共情水平上的显著差异，并强调需要在LLMs设计中引入对交叉性身份的细致理解以减少偏见。

链接: https://arxiv.org/abs/2505.17045
作者: Afifah Kashif,Heer Patel
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies have separately highlighted significant biases within foundational large language models (LLMs) against certain nationalities and stigmatized social groups. This research investigates the ethical implications of these biases intersecting with outputs of widely-used GPT-3.5/4/4o LLMS. Through structured prompt series, we evaluate model responses to several scenarios involving American and North Korean nationalities with various mental disabilities. Findings reveal significant discrepancies in empathy levels with North Koreans facing greater negative bias, particularly when mental disability is also a factor. This underscores the need for improvements in LLMs designed with a nuanced understanding of intersectional identity.
zh

[NLP-234] QRA: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing

【速读】：该论文试图解决自然语言处理（Natural Language Processing, NLP）领域中再现性研究的可比性和解释性不足的问题。现有再现性研究由于各自采用不同的判定标准，导致结论难以比较和总结。论文提出的解决方案是QRA++，其关键在于通过量化方法在三个粒度层级上生成连续值的再现性评估，使用可在不同研究间直接比较的再现性度量，并将再现性的预期与实验之间的相似性相联系，从而提升再现性评估的信息量和对影响再现性因素的分析能力。

链接: https://arxiv.org/abs/2505.17043
作者: Anya Belz
机构: Dublin City University (都柏林城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree of reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes reproducibility to be better/poorer. We illustrate this by applying QRA++ to three example sets of comparable experiments, revealing clear evidence that degree of reproducibility depends on similarity of experiment properties, but also system type and evaluation method.
zh

[NLP-235] VLM-KG: Multimodal Radiology Knowledge Graph Generation

【速读】：该论文旨在解决放射学领域知识图谱生成中的挑战，特别是由于放射学报告的专业化语言和领域特定数据的稀缺性所带来的问题。现有方法主要为单模态，仅基于放射学报告生成知识图谱，忽略了放射图像，并且在处理长文本放射学数据时受限于上下文长度。该论文提出的解决方案的关键在于引入一种基于多模态视觉-语言模型（Vision-Language Models, VLMs）的框架，实现了首个多模态的放射学知识图谱生成方法，从而提升了性能并克服了现有技术的局限性。

链接: https://arxiv.org/abs/2505.17042
作者: Abdullah Abdullah,Seong Tae Kim
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 10 pages, 2 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable success in natural language generation, excelling at instruction following and structured output generation. Knowledge graphs play a crucial role in radiology, serving as valuable sources of factual information and enhancing various downstream tasks. However, generating radiology-specific knowledge graphs presents significant challenges due to the specialized language of radiology reports and the limited availability of domain-specific data. Existing solutions are predominantly unimodal, meaning they generate knowledge graphs only from radiology reports while excluding radiographic images. Additionally, they struggle with long-form radiology data due to limited context length. To address these limitations, we propose a novel multimodal VLM-based framework for knowledge graph generation in radiology. Our approach outperforms previous methods and introduces the first multimodal solution for radiology knowledge graph generation.
zh

[NLP-236] Exploring EFL Secondary Students AI-generated Text Editing While Composition Writing

【速读】：该论文试图解决EFL（English as a Foreign Language）中学生在使用生成式AI进行说明文写作时，如何整合和修改AI生成文本的问题。研究的关键在于通过混合方法设计，结合屏幕录制、层级编码和多案例主题分析，揭示学生在写作过程中对AI生成文本的编辑行为模式，并发现学生与AI生成文本的互动涉及比简单文本插入更为复杂的认知过程。

链接: https://arxiv.org/abs/2505.17041
作者: David James Woo,Yangyang Yu,Kai Guo
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 31 pages, 16 figures

点击查看摘要

Abstract:Generative Artificial Intelligence is transforming how English as a foreign language students write. Still, little is known about how students manipulate text generated by generative AI during the writing process. This study investigates how EFL secondary school students integrate and modify AI-generated text when completing an expository writing task. The study employed an exploratory mixed-methods design. Screen recordings were collected from 29 Hong Kong secondary school students who attended an AI-assisted writing workshop and recorded their screens while using generative AI to write an article. Content analysis with hierarchical coding and thematic analysis with a multiple case study approach were adopted to analyze the recordings. 15 types of AI-generated text edits across seven categories were identified from the recordings. Notably, AI-initiated edits from iOS and Google Docs emerged as unanticipated sources of AI-generated text. A thematic analysis revealed four patterns of students’ editing behaviors based on planning and drafting direction: planning with top-down drafting and revising; top-down drafting and revising without planning; planning with bottom-up drafting and revising; and bottom-up drafting and revising without planning. Network graphs illustrate cases of each pattern, demonstrating that students’ interactions with AI-generated text involve more complex cognitive processes than simple text insertion. The findings challenge assumptions about students’ passive, simplistic use of generative AI tools and have implications for developing explicit instructional approaches to teaching AI-generated text editing strategies in the AFL writing pedagogy.
zh

[NLP-237] Generalizing Large Language Model Usability Across Resource-Constrained

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在面对未知模态、有限数据或计算资源受限等现实约束时，其泛化能力不足的问题。其解决方案的关键在于提出一种以文本为中心的对齐框架，使LLMs能够通过自然语言接口无缝整合多种模态，并实现无需重新训练的上下文适应；同时引入对抗性提示技术以增强模型对噪声和缺失模态的鲁棒性，并探索推理阶段的优化策略，如提示搜索和不确定性量化，从而在不进行额外训练的情况下提升性能。此外，针对低资源领域，设计了基于构造性合成数据管道和逻辑增强推理模型的方法，显著提升了模型在有限数据下的表现。

链接: https://arxiv.org/abs/2505.17040
作者: Yun-Da Tsai
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Doctoral disstertation

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, and recent efforts have sought to extend their capabilities to multimodal domains and resource-constrained environments. However, existing approaches often rely on costly supervised fine-tuning or assume fixed training conditions, limiting their generalization when facing unseen modalities, limited data, or restricted compute resources. This dissertation presents a systematic study toward generalizing LLM usability under real-world constraints. First, it introduces a robust text-centric alignment framework that enables LLMs to seamlessly integrate diverse modalities-including text, images, tables, and any modalities - via natural language interfaces. This approach supports in-context adaptation to unseen or dynamically changing modalities without requiring retraining. To enhance robustness against noisy and missing modalities, an adversarial prompting technique is proposed, generating semantically challenging perturbations at the prompt level to stress-test model reliability. Beyond multimodal setting, the dissertation investigates inference-time optimization strategies for LLMs, leveraging prompt search and uncertainty quantification to improve performance without additional model training. This perspective offers an efficient alternative to scaling model parameters or retraining from scratch. Additionally, the work addresses low-resource domains such as Verilog code generation by designing correct-by-construction synthetic data pipelines and logic-enhanced reasoning models, achieving state-of-the-art performance with minimal data. Together, these contributions form a unified effort to enhance the adaptability, scalability, and efficiency of large language models under practical constraints.
zh

[NLP-238] A new classification system of beer categories and styles based on large-scale data mining and self-organizing maps of beer recipes

【速读】：该论文试图解决传统啤酒分类体系依赖感官评价而缺乏客观性和可重复性的问题，其解决方案的关键在于采用数据驱动的定量方法，通过对六万二千一百二十一份啤酒配方进行分析，结合统计分析与自组织映射（Self-Organizing Maps, SOMs），识别出具有显著麦芽和酒花使用模式、风格特征及历史酿造传统的四大超级聚类，从而建立一种可复制且客观的啤酒分类框架。

链接: https://arxiv.org/abs/2505.17039
作者: Diego Bonatto
机构: 未知
类目: Computation and Language (cs.CL)
备注: 46 pages, 8 figures, 1 table

点击查看摘要

Abstract:A data-driven quantitative approach was used to develop a novel classification system for beer categories and styles. Sixty-two thousand one hundred twenty-one beer recipes were mined and analyzed, considering ingredient profiles, fermentation parameters, and recipe vital statistics. Statistical analyses combined with self-organizing maps (SOMs) identified four major superclusters that showed distinctive malt and hop usage patterns, style characteristics, and historical brewing traditions. Cold fermented styles showed a conservative grain and hop composition, whereas hot fermented beers exhibited high heterogeneity, reflecting regional preferences and innovation. This new taxonomy offers a reproducible and objective framework beyond traditional sensory-based classifications, providing brewers, researchers, and educators with a scalable tool for recipe analysis and beer development. The findings in this work provide an understanding of beer diversity and open avenues for linking ingredient usage with fermentation profiles and flavor outcomes.
zh

[NLP-239] Signals from the Floods: AI-Driven Disaster Analysis through Multi-Source Data Fusion

【速读】：该论文试图解决在灾害响应中如何有效提取和分析公众行为信息的问题，以提升应急响应的准确性和时效性。其解决方案的关键在于结合X（ formerly Twitter）上的短文本与公共调查提交的详细文档，利用潜在狄利克雷分布（Latent Dirichlet Allocation, LDA）进行主题建模，并引入大型语言模型（Large Language Models, LLMs）增强语义理解，通过构建相关性指数（Relevance Index）方法减少噪声并优先筛选出与洪水相关的可操作内容，从而提升应急人员的情境感知能力。

链接: https://arxiv.org/abs/2505.17038
作者: Xian Gong,Paul X. McCarthy,Lin Tian,Marian-Andrei Rizoiu
机构: University of Technology Sydney (悉尼科技大学); University of New South Wales (新南威尔士大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Massive and diverse web data are increasingly vital for government disaster response, as demonstrated by the 2022 floods in New South Wales (NSW), Australia. This study examines how X (formerly Twitter) and public inquiry submissions provide insights into public behaviour during crises. We analyse more than 55,000 flood-related tweets and 1,450 submissions to identify behavioural patterns during extreme weather events. While social media posts are short and fragmented, inquiry submissions are detailed, multi-page documents offering structured insights. Our methodology integrates Latent Dirichlet Allocation (LDA) for topic modelling with Large Language Models (LLMs) to enhance semantic understanding. LDA reveals distinct opinions and geographic patterns, while LLMs improve filtering by identifying flood-relevant tweets using public submissions as a reference. This Relevance Index method reduces noise and prioritizes actionable content, improving situational awareness for emergency responders. By combining these complementary data streams, our approach introduces a novel AI-driven method to refine crisis-related social media content, improve real-time disaster response, and inform long-term resilience planning.
zh

[NLP-240] Prompt Engineering: How Prompt Vocabulary affects Domain Knowledge

【速读】：该论文试图解决的问题是：在领域特定任务（如STEM、医学和法律）中，增加提示词的词汇具体性是否能够提升大型语言模型（LLMs）的问答和推理性能。论文提出的关键解决方案是开发了一个同义替换框架，系统地将名词、动词和形容词替换为不同具体性水平的同义词，并通过实验评估其对多个LLMs性能的影响，从而识别出在所有模型中表现最佳的具体性范围。

链接: https://arxiv.org/abs/2505.17037
作者: Dimitri Schreiter
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt engineering has emerged as a critical component in optimizing large language models (LLMs) for domain-specific tasks. However, the role of prompt specificity, especially in domains like STEM (physics, chemistry, biology, computer science and mathematics), medicine, and law, remains underexplored. This thesis addresses the problem of whether increasing the specificity of vocabulary in prompts improves LLM performance in domain-specific question-answering and reasoning tasks. We developed a synonymization framework to systematically substitute nouns, verbs, and adjectives with varying specificity levels, measuring the impact on four LLMs: Llama-3.1-70B-Instruct, Granite-13B-Instruct-V2, Flan-T5-XL, and Mistral-Large 2, across datasets in STEM, law, and medicine. Our results reveal that while generally increasing the specificity of prompts does not have a significant impact, there appears to be a specificity range, across all considered models, where the LLM performs the best. Identifying this optimal specificity range offers a key insight for prompt design, suggesting that manipulating prompts within this range could maximize LLM performance and lead to more efficient applications in specialized domains.
zh

[NLP-241] Speechless: Speech Instruction Training Without Speech for Low Resource Languages INTERSPEECH2025

【速读】：该论文试图解决低资源语言中语音指令数据稀缺的问题，这一问题限制了大型语言模型（Large Language Model, LLM）在语音助手中的微调与应用。解决方案的关键在于通过在语义表示层面停止合成过程，绕过对文本到语音（Text-to-Speech, TTS）模型的依赖，具体是通过将合成的语义表示与预训练的Whisper编码器对齐，从而实现LLM在文本指令上进行微调的同时，在推理阶段仍能理解语音指令。

链接: https://arxiv.org/abs/2505.17417
作者: Alan Dao(Gia Tuan Dao),Dinh Bach Vu,Huy Hoang Ha,Tuan Le Duc Anh,Shreyas Gopal,Yue Heng Yeo,Warren Keng Hoong Low,Eng Siong Chng,Jia Qi Yip
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: This paper was accepted by INTERSPEECH 2025

点击查看摘要

Abstract:The rapid growth of voice assistants powered by large language models (LLM) has highlighted a need for speech instruction data to train these systems. Despite the abundance of speech recognition data, there is a notable scarcity of speech instruction data, which is essential for fine-tuning models to understand and execute spoken commands. Generating high-quality synthetic speech requires a good text-to-speech (TTS) model, which may not be available to low resource languages. Our novel approach addresses this challenge by halting synthesis at the semantic representation level, bypassing the need for TTS. We achieve this by aligning synthetic semantic representations with the pre-trained Whisper encoder, enabling an LLM to be fine-tuned on text instructions while maintaining the ability to understand spoken instructions during inference. This simplified training process is a promising approach to building voice assistant for low-resource languages.
zh

[NLP-242] Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech

【速读】：该论文试图解决在基于提示的可控文本到语音系统中，如何有效控制语音风格的问题，特别是通过引入文本人格（textual personas）作为语音风格提示来实现对韵律属性（如音高、情感和语速）的精细操控。解决方案的关键在于提出两种人格重写策略，将通用的人格描述转化为面向语音的提示，从而提升合成语音的自然度、清晰度和一致性。

链接: https://arxiv.org/abs/2505.17093
作者: Yejin Lee,Jaehoon Kang,Kyuhong Shim
机构: Sungkyunkwan University (成均馆大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel framework to control voice style in prompt-based, controllable text-to-speech systems by leveraging textual personas as voice style prompts. We present two persona rewriting strategies to transform generic persona descriptions into speech-oriented prompts, enabling fine-grained manipulation of prosodic attributes such as pitch, emotion, and speaking rate. Experimental results demonstrate that our methods enhance the naturalness, clarity, and consistency of synthesized speech. Finally, we analyze implicit social biases introduced by LLM-based rewriting, with a focus on gender. We underscore voice style as a crucial factor for persona-driven AI dialogue systems.
zh

[NLP-243] From Weak Labels to Strong Results: Utilizing 5000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

【速读】：该论文试图解决在课堂自动语音识别（ASR）场景中，由于仅有少量高质量的黄金标准数据而存在大量低成本的弱标注文本这一低资源问题。解决方案的关键在于提出一种称为弱监督预训练（Weakly Supervised Pretraining, WSP）的方法，该方法通过两个步骤进行：首先在弱标注文本上进行监督预训练，然后在准确数据上进行微调，从而有效提升低资源环境下的ASR性能。

链接: https://arxiv.org/abs/2505.17088
作者: Ahmed Adel Attia,Dorottya Demszky,Jing Liu,Carol Espy-Wilson
机构: University of Maryland (马里兰大学); Stanford University (斯坦福大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.
zh

计算机视觉

[CV-0] REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

【速读】：该论文旨在解决现有基于区域的图像表示生成方法中计算成本高、效率低的问题，特别是由于分割步骤带来的性能瓶颈。其解决方案的关键在于提出一种轻量级模块——区域编码器网络（Region Encoder Network, REN），该模块直接通过点提示生成区域标记，避免了传统方法中耗时的分割过程，从而实现了60倍的速度提升和35倍的内存节省，同时提升了标记质量。REN通过少量的交叉注意力块，以点提示作为查询，从基于块的图像编码器中提取特征作为键和值，生成与提示对象对应的区域标记。

链接: https://arxiv.org/abs/2505.18153
作者: Savya Khosla,Sethuraman TV,Barnett Lee,Alexander Schwing,Derek Hoiem
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders-DINO, DINOv2, and OpenCLIP-and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks’ single-needle challenge. Code and models are available at: this https URL.
zh

[CV-1] WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

【速读】：该论文试图解决从单张图像生成动作条件的动态3D场景的问题，特别是现有方法在处理刚体或简单弹性动力学之外的复杂动态场景时存在局限。其解决方案的关键在于提出一种混合生成模拟器（hybrid generative simulator），该模拟器首先利用物理求解器模拟粗粒度的3D动态，随后通过视频生成器生成更精细、逼真的运动，并将生成的视频用于更新模拟的动态3D场景，从而实现物理求解器与视频生成器之间的闭环交互。

链接: https://arxiv.org/abs/2505.18151
作者: Zizhang Li,Hong-Xing Yu,Wei Liu,Yin Yang,Charles Herrmann,Gordon Wetzstein,Jiajun Wu
机构: Stanford University (斯坦福大学); University of Utah (犹他大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The first two authors contributed equally. Project website: this https URL

点击查看摘要

Abstract:WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies – all using a single image input. Code will be made public. Project website: this https URL
zh

[CV-2] okBench: Evaluating Your Visual Tokenizer before Visual Generation

【速读】：该论文试图解决视觉分词器（visual tokenizers）和变分自编码器（VAEs）在保留细粒度特征方面的局限性，特别是在处理文本和人脸等对人类感知敏感的视觉内容时的重建性能问题。其解决方案的关键在于提出一个基准测试框架，通过光学字符识别（OCR）模型评估文本重建的识别准确率，并利用特征相似性度量人脸重建的真实性，从而量化不同视觉压缩方法的重建质量。该方法轻量高效，仅需2GB内存和4分钟即可完成评估，为分析不同分词器和VAEs在文本与人脸重建中的表现提供了有效工具。

链接: https://arxiv.org/abs/2505.18142
作者: Junfeng Wu,Dongliang Luo,Weizhi Zhao,Zhihao Xie,Yuanhao Wang,Junyi Li,Xudong Xie,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: Benchmark, homepagee: this https URL

点击查看摘要

Abstract:In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Image tokenization has significantly advanced visual generation and multimodal modeling, particularly with autoregressive models due to the modeling simplicity of discrete tokens. Autoregressive models typically rely on image tokenizers to compress images into discrete tokens for sequential prediction, whereas diffusion models often operate on continuous latent space to reduce computational costs. However, both visual compression approaches inevitably lose visual information, thereby limiting the upper bound of visual generation quality. To evaluate how these compression losses affect text and faces, the most human-sensitive visual elements, we first collect and curate a collection of text and faces images from existing datasets, ensuring clarity and diversity. For text reconstruction, we employ OCR models to assess the recognition accuracy of the reconstructed text, and then we measure feature similarity between original and reconstructed faces thereby quantifying faces reconstruction fidelity. Our method is highly lightweight, requiring just 2GB memory and 4 minutes to complete evaluations. With our benchmark, we analyze the reconstruction quality of text and faces at various scales across different image tokenizers and VAEs. Our results demonstrate that modern visual tokenizers still struggle to preserve fine-grained features, particularly at smaller scales. Furthermore, we extend this evaluation framework to the video, conducting a comprehensive analysis of video tokenizers. Additionally, we find that traditional metrics fail to accurately reflect the reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
zh

[CV-3] Boosting Open Set Recognition Performance through Modulated Representation Learning

【速读】：该论文旨在解决开放集识别（Open Set Recognition, OSR）中现有方法使用固定缩放因子（温度）对logits进行处理的问题，这一做法限制了模型在表征学习中探索从实例级到语义级特征的两端能力。论文提出的解决方案的关键在于引入一种新颖的负余弦调度策略，通过温度调制的表征学习，使模型在训练初期通过关注较少邻居形成粗略决策边界，并逐步增加邻居数量以平滑边界，从而构建更丰富且可泛化的表征空间。该方法可无缝集成到现有OSR方法中，无需额外计算开销。

链接: https://arxiv.org/abs/2505.18137
作者: Amit Kumar Kundu,Vaishnavi Patil,Joseph Jaja
机构: University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The open set recognition (OSR) problem aims to identify test samples from novel semantic classes that are not part of the training classes, a task that is crucial in many practical scenarios. However, existing OSR methods use a constant scaling factor (the temperature) to the logits before applying a loss function, which hinders the model from exploring both ends of the spectrum in representation learning – from instance-level to semantic-level features. In this paper, we address this problem by enabling temperature-modulated representation learning using our novel negative cosine scheduling scheme. Our scheduling lets the model form a coarse decision boundary at the beginning of training by focusing on fewer neighbors, and gradually prioritizes more neighbors to smooth out rough edges. This gradual task switching leads to a richer and more generalizable representation space. While other OSR methods benefit by including regularization or auxiliary negative samples, such as with mix-up, thereby adding a significant computational overhead, our scheme can be folded into any existing OSR method with no overhead. We implement the proposed scheme on top of a number of baselines, using both cross-entropy and contrastive loss functions as well as a few other OSR methods, and find that our scheme boosts both the OSR performance and the closed set performance in most cases, especially on the tougher semantic shift benchmarks.
zh

[CV-4] BiggerGait: Unlocking Gait Recognition with Layer-wise Representations from Large Vision Models

【速读】：该论文旨在解决基于大型视觉模型（Large Vision Models, LVM）的步态识别中，现有方法可能过度依赖步态先验而忽视LVM自身多层丰富且独特的表征潜力的问题。其解决方案的关键在于分析不同层级的表征对下游识别任务的影响，并通过整合中间层的互补特性，提升识别性能，而不依赖复杂的步态先验设计。基于此，作者提出了一种简单且通用的基线模型BiggerGait，实验证明其在多个数据集上均表现出色。

链接: https://arxiv.org/abs/2505.18132
作者: Dingqing Ye,Chao Fan,Zhanbo Huang,Chengwen Luo,Jianqiang Li,Shiqi Yu,Xiaoming Liu
机构: Southern University of Science and Technology (南方科技大学); Shenzhen University (深圳大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large vision models (LVM) based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of LVM itself, particularly the rich, distinct representations across its multi-layers. To adequately unlock LVM’s potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that LVM’s intermediate layers offer complementary properties across tasks, integrating them yields an impressive improvement even without rich well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CAISA-B*, SUSTech1K, and CCGR_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.
zh

[CV-5] Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

【速读】：该论文旨在解决当前视觉指令调优（VisIT）数据集构建过程中存在的高成本、可扩展性差以及质量难以提升的问题。现有VisIT数据集多采用非系统化的方法构建，缺乏可复现的代码，并依赖付费的闭源模型API生成指令，导致数据生成效率低且难以优化。论文提出了一种开放且统一的解决方案——\method，其关键在于通过开源大语言模型（LLM）将可用元数据转换为VisIT指令，包含多阶段流程，如元数据分组、质量控制、数据与提示组织及对话采样，从而实现数据质量的提升和性能的增强。

链接: https://arxiv.org/abs/2505.18115
作者: Jacob Hansen,Wei Lin,Junmo Kang,Muhammad Jehanzeb Mirza,Hongyin Luo,Rogerio Feris,Alan Ritter,James Glass,Leonid Karlinsky
机构: Xero(泽罗); JKU Linz(因斯布鲁克大学); Georgia Tech(佐治亚理工学院); MIT CSAIL(麻省理工学院计算机科学与人工智能实验室); MIT-IBM Watson AI Lab(麻省理工学院-IBM华生人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach,~\textbf\method, for converting available metadata to VisIT instructions using open LLMs. Our multi-stage \method features an efficient framework for metadata grouping, quality control, data and prompt organization, and conversation sampling. We show that our approach can reproduce or enhance the data quality of available VisIT datasets when applied to the same image data and metadata sources, improving GPT-4 generated VisIT instructions by ~3% on average and up to 12% on individual benchmarks using open models, such as Gemma 2 27B and LLaMa 3.1 70B. Additionally, our approach enables effective performance scaling - both in quantity and quality - by enhancing the resulting LMM performance across a wide range of benchmarks. We also analyze the impact of various factors, including conversation format, base model selection, and resampling strategies. Our code, which supports the reproduction of equal or higher-quality VisIT datasets and facilities future metadata-to-VisIT data conversion for niche domains, is released at this https URL.
zh

[CV-6] Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking ICPR

【速读】：该论文旨在解决将Segment Anything Model 2（SAM2）适配到视觉目标跟踪（Visual Object Tracking, VOT）任务中的问题。解决方案的关键在于利用SAM2强大的预训练能力，并结合若干关键技术以提升其在VOT应用中的性能，最终在2024年ICPR多模态目标跟踪挑战中取得了AUC分数89.4的第一名成绩，验证了方法的有效性。

链接: https://arxiv.org/abs/2505.18111
作者: Cheng-Yen Yang,Hsiang-Wei Huang,Pyong-Kun Kim,Chien-Kai Kuo,Jui-Wei Chang,Kwang-Ju Kim,Chung-I Huang,Jenq-Neng Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICPR Multi-Modal Visual Pattern Recognition Workshop

点击查看摘要

Abstract:We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions along with the multi-modality aspect of the dataset.
zh

[CV-7] F-ANcGAN: An Attention-Enhanced Cycle Consistent Generative Adversarial Architecture for Synthetic Image Generation of Nanoparticles

【速读】：该论文旨在解决纳米材料研究中由于缺乏高质量标注数据集而导致的纳米颗粒拓扑结构分割模型训练困难的问题。其关键解决方案是提出一种基于注意力机制的循环一致性生成对抗网络（F-ANcGAN），该模型能够利用有限的数据样本生成逼真的扫描电子显微镜（SEM）图像，并通过引入Style U-Net生成器和带有自注意力机制的U-Net分割网络，有效捕捉结构关系，同时结合增强方法提升数据集多样性，从而显著提高合成数据的质量与适用性。

链接: https://arxiv.org/abs/2505.18106
作者: Varun Ajith,Anindya Pal,Saumik Bhattacharya,Sayantari Ghosh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 11 pages, 9 figures, 2 tables, conference paper

点击查看摘要

Abstract:Nanomaterial research is becoming a vital area for energy, medicine, and materials science, and accurate analysis of the nanoparticle topology is essential to determine their properties. Unfortunately, the lack of high-quality annotated datasets drastically hinders the creation of strong segmentation models for nanoscale imaging. To alleviate this problem, we introduce F-ANcGAN, an attention-enhanced cycle consistent generative adversarial system that can be trained using a limited number of data samples and generates realistic scanning electron microscopy (SEM) images directly from segmentation maps. Our model uses a Style U-Net generator and a U-Net segmentation network equipped with self-attention to capture structural relationships and applies augmentation methods to increase the variety of the dataset. The architecture reached a raw FID score of 17.65 for TiO _2 dataset generation, with a further reduction in FID score to nearly 10.39 by using efficient post-processing techniques. By facilitating scalable high-fidelity synthetic dataset generation, our approach can improve the effectiveness of downstream segmentation task training, overcoming severe data shortage issues in nanoparticle analysis, thus extending its applications to resource-limited fields.
zh

[CV-8] owards more transferable adversarial attack in black-box manner

【速读】：该论文试图解决黑盒对抗攻击中依赖于替代白盒模型架构的问题，以及扩散模型在对抗净化过程中带来的高计算开销问题。其解决方案的关键在于提出一种新的损失函数与独特的替代模型，该方法利用基于时间的分类器得分，将自然数据分布知识有效融入对抗优化过程，从而在保持对扩散防御鲁棒性的同时显著提升跨模型架构的迁移能力。

链接: https://arxiv.org/abs/2505.18097
作者: Chun Tong Lei,Zhongliang Guo,Hon Chung Lee,Minh Quoc Duong,Chun Pong Lau
机构: City University of Hong Kong (香港城市大学); University of St Andrews (圣安德鲁斯大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial attacks have become a well-explored domain, frequently serving as evaluation baselines for model robustness. Among these, black-box attacks based on transferability have received significant attention due to their practical applicability in real-world scenarios. Traditional black-box methods have generally focused on improving the optimization framework (e.g., utilizing momentum in MI-FGSM) to enhance transferability, rather than examining the dependency on surrogate white-box model architectures. Recent state-of-the-art approach DiffPGD has demonstrated enhanced transferability by employing diffusion-based adversarial purification models for adaptive attacks. The inductive bias of diffusion-based adversarial purification aligns naturally with the adversarial attack process, where both involving noise addition, reducing dependency on surrogate white-box model selection. However, the denoising process of diffusion models incurs substantial computational costs through chain rule derivation, manifested in excessive VRAM consumption and extended runtime. This progression prompts us to question whether introducing diffusion models is necessary. We hypothesize that a model sharing similar inductive bias to diffusion-based adversarial purification, combined with an appropriate loss function, could achieve comparable or superior transferability while dramatically reducing computational overhead. In this paper, we propose a novel loss function coupled with a unique surrogate model to validate our hypothesis. Our approach leverages the score of the time-dependent classifier from classifier-guided diffusion models, effectively incorporating natural data distribution knowledge into the adversarial optimization process. Experimental results demonstrate significantly improved transferability across diverse model architectures while maintaining robustness against diffusion-based defenses.
zh

[CV-9] DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations CVPR2025

【速读】：该论文试图解决现有3D talking head生成模型仅关注说话或倾听单一角色，忽视了交互对话中自然动态的问题，导致交互不自然和转换生硬。解决方案的关键在于提出一种新的任务——多轮双说话者交互，并引入DualTalk框架，该框架整合了说话者和倾听者的动态行为，以模拟真实连贯的对话互动，从而在说话时合成逼真的3D头像，并在倾听时生成连续生动的非语言反馈。

链接: https://arxiv.org/abs/2505.18096
作者: Ziqiao Peng,Yanbo Fan,Haoyu Wu,Xuan Wang,Hongyan Liu,Jun He,Zhaoxin Fan
机构: Renmin University of China (中国人民大学); Ant Group (蚂蚁集团); Tsinghua University (清华大学); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (北京未来区块链与隐私计算高精尖创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:In face-to-face conversations, individuals need to switch between speaking and listening roles seamlessly. Existing 3D talking head generation models focus solely on speaking or listening, neglecting the natural dynamics of interactive conversation, which leads to unnatural interactions and awkward transitions. To address this issue, we propose a new task – multi-round dual-speaker interaction for 3D talking head generation – which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. This framework not only synthesizes lifelike talking heads when speaking but also generates continuous and vivid non-verbal feedback when listening, effectively capturing the interplay between the roles. We also create a new dataset featuring 50 hours of multi-round conversations with over 1,000 characters, where participants continuously switch between speaking and listening roles. Extensive experiments demonstrate that our method significantly enhances the naturalness and expressiveness of 3D talking heads in dual-speaker conversations. We recommend watching the supplementary video: this https URL.
zh

[CV-10] CXReason Bench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

【速读】：该论文试图解决现有医学任务基准测试主要关注最终诊断答案，而缺乏对模型是否进行临床有意义推理的深入分析的问题。解决方案的关键在于构建CheXStruct和CXReasonBench，其中CheXStruct通过从胸部X光图像中自动推导出一系列中间推理步骤（如解剖区域分割、解剖标志物提取、诊断测量等），而CXReasonBench则利用该流程评估模型执行临床有效推理步骤的能力及其从结构化指导中学习的程度，从而实现对诊断推理的细粒度和透明化评估。

链接: https://arxiv.org/abs/2505.18087
作者: Hyungyung Lee,Geon Choi,Jung-Oh Lee,Hangyul Yoon,Hyuk Gi Hong,Edward Choi
机构: KAIST(韩国科学技术院); Seoul National University Hospital(首尔国立大学医院); Seoul Medical Center(首尔医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at this https URL
zh

[CV-11] DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation

【速读】：该论文旨在解决多主体在噪声控制信号下进行可控视频生成时出现的身份漂移和外观渗漏问题。其解决方案的关键在于提出DanceTogether框架，该框架通过MaskPoseAdapter模块，在每个去噪步骤中融合鲁棒的跟踪掩码与语义丰富但噪声较大的姿态热图，从而严格保持每个主体的身份一致性，实现长时、逼真的多主体交互视频生成。

链接: https://arxiv.org/abs/2505.18078
作者: Junhao Chen,Mingjin Chen,Jianjin Xu,Xiang Li,Junting Dong,Mingze Sun,Puhua Jiang,Hongxiang Li,Yuhang Yang,Hao Zhao,Xiaoxiao Long,Ruqi Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our video demos and code are available at this https URL

点击查看摘要

Abstract:Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds “who” and “how” at every denoising step by fusing robust tracking masks with semantically rich-but noisy-pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms the prior arts by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at this https URL.
zh

[CV-12] Semantic Correspondence: Unified Benchmarking and a Strong Baseline

【速读】：该论文旨在解决计算机视觉中的语义对应问题（semantic correspondence），即在不同图像中匹配具有相同语义信息的关键点。其解决方案的关键在于提出一种分类体系（taxonomy），对现有方法按设计类型进行分类，并在此基础上进行详细分析。此外，论文通过汇总文献中各方法在多个基准测试中的结果，构建了一个统一的比较表格，并通过受控实验分析不同方法组件的有效性，最终提出一个简单而有效的基线模型，在多个基准上达到了最先进的性能。

链接: https://arxiv.org/abs/2505.18060
作者: Kaiyan Zhang,Xinghui Li,Jingyi Lu,Kai Han
机构: The University of Hong Kong (香港大学); the University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding on existing methods for semantic matching, we thoroughly conduct controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: this https URL.
zh

[CV-13] FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation

【速读】：该论文旨在解决现有基于蒸馏的提示学习方法在参数效率与泛化能力之间的权衡问题，以及软提示方法对任务特定硬标签的依赖所导致的泛化能力受限问题。其解决方案的关键在于提出一种名为FDBPL（Faster Distillation-Based Prompt Learning）的方法，通过在多个训练阶段共享软监督上下文并实现加速I/O操作，提升训练效率；同时引入区域感知的提示学习范式，结合正负提示空间的互学习机制，使学生模型在识别正确语义的同时学习排斥弱相关概念，从而提升零样本性能，兼顾参数效率与强下游泛化能力。

链接: https://arxiv.org/abs/2505.18053
作者: Zherui Zhang,Jiaxin Wu,Changwei Wang,Rongtao Xu,Longzhao Huang,Wenhao Xu,Wenbo Xu,Li Guo,Shibiao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt learning as a parameter-efficient method that has been widely adopted to adapt Vision-Language Models (VLMs) to downstream tasks. While hard-prompt design requires domain expertise and iterative optimization, soft-prompt methods rely heavily on task-specific hard labels, limiting their generalization to unseen categories. Recent popular distillation-based prompt learning methods improve generalization by exploiting larger teacher VLMs and unsupervised knowledge transfer, yet their repetitive teacher model online inference sacrifices the inherent training efficiency advantage of prompt learning. In this paper, we propose \large \textbfFaster \large \textbfDistillation-\large \textbfBased \large \textbfPrompt \large \textbfLearning (\textbfFDBPL), which addresses these issues by sharing soft supervision contexts across multiple training stages and implementing accelerated I/O. Furthermore, FDBPL introduces a region-aware prompt learning paradigm with dual positive-negative prompt spaces to fully exploit randomly cropped regions that containing multi-level information. We propose a positive-negative space mutual learning mechanism based on similarity-difference learning, enabling student CLIP models to recognize correct semantics while learning to reject weakly related concepts, thereby improving zero-shot performance. Unlike existing distillation-based prompt learning methods that sacrifice parameter efficiency for generalization, FDBPL maintains dual advantages of parameter efficiency and strong downstream generalization. Comprehensive evaluations across 11 datasets demonstrate superior performance in base-to-new generalization, cross-dataset transfer, and robustness tests, achieving 2.2\times faster training speed.
zh

[CV-14] BOTM: Echocardiography Segmentation via Bi-directional Optimal Token Matching

【速读】：该论文旨在解决超声心动图分割中由于形状变化、部分观察和区域模糊性导致的解剖不一致性问题，特别是在低信噪比条件下容易产生包含解剖错误结构的假阳性分割结果。其解决方案的关键在于提出一种名为BOTM（Bi-directional Optimal Token Matching）的新分割框架，该框架通过从新的解剖运输视角出发，学习配对超声图像中离散图像标记的最佳对应关系，并进一步将标记匹配扩展为双向跨运输注意力代理，以在时间域内保持心脏周期变形中的解剖一致性。

链接: https://arxiv.org/abs/2505.18052
作者: Zhihua Liu,Lei Tong,Xilin He,Che Liu,Rossella Arcucci,Chen Jin,Huiyu Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existed echocardiography segmentation methods often suffer from anatomical inconsistency challenge caused by shape variation, partial observation and region ambiguity with similar intensity across 2D echocardiographic sequences, resulting in false positive segmentation with anatomical defeated structures in challenging low signal-to-noise ratio conditions. To provide a strong anatomical guarantee across different echocardiographic frames, we propose a novel segmentation framework named BOTM (Bi-directional Optimal Token Matching) that performs echocardiography segmentation and optimal anatomy transportation simultaneously. Given paired echocardiographic images, BOTM learns to match two sets of discrete image tokens by finding optimal correspondences from a novel anatomical transportation perspective. We further extend the token matching into a bi-directional cross-transport attention proxy to regulate the preserved anatomical consistency within the cardiac cyclic deformation in temporal domain. Extensive experimental results show that BOTM can generate stable and accurate segmentation outcomes (e.g. -1.917 HD on CAMUS2H LV, +1.9% Dice on TED), and provide a better matching interpretation with anatomical consistency guarantee.
zh

[CV-15] LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

【速读】：该论文试图解决高分辨率图像下视觉Transformer计算成本过高的问题，尤其是在图像尺寸增大时，令牌数量呈二次增长导致计算开销急剧上升。解决方案的关键在于提出一种自适应计算方法——LookWhere，通过学习预测计算位置，将计算任务分为低分辨率选择器和高分辨率提取器，从而避免处理完整的高分辨率输入。该方法通过知识蒸馏从自监督教师模型中联合预训练选择器和提取器，实现同时学习计算位置与内容，相较于传统令牌缩减和选择方法，能够经济且准确地提取可迁移的图像表征。

链接: https://arxiv.org/abs/2505.18051
作者: Anthony Fuller,Yousef Yassin,Junfeng Wen,Daniel G. Kyrollos,Tarek Ibrahim,James R. Green,Evan Shelhamer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x.
zh

[CV-16] SpikeGen: Generative Framework for Visual Spike Stream Processing

【速读】：该论文旨在解决神经形态视觉系统（Neuromorphic Visual Systems）在空间信息稀疏性方面的问题，这类系统如脉冲相机（spike cameras）虽然能够捕捉动态条件下的清晰纹理并有效缓解运动和孔径模糊问题，但其输出的二值化、空间稀疏帧限制了其应用。解决方案的关键在于引入生成式模型（Generative Models），通过利用其潜在空间操作能力，实现对稀疏数据的有效处理，并结合脉冲与RGB模态的信息进行条件融合与生成，从而提升不同视觉模态的协同增强效果。

链接: https://arxiv.org/abs/2505.18049
作者: Gaole Dai,Menghang Dong,Rongyu Zhang,Ruichuan An,Shanghang Zhang,Tiejun Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neuromorphic Visual Systems, such as spike cameras, have attracted considerable attention due to their ability to capture clear textures under dynamic conditions. This capability effectively mitigates issues related to motion and aperture blur. However, in contrast to conventional RGB modalities that provide dense spatial information, these systems generate binary, spatially sparse frames as a trade-off for temporally rich visual streams. In this context, generative models emerge as a promising solution to address the inherent limitations of sparse data. These models not only facilitate the conditional fusion of existing information from both spike and RGB modalities but also enable the conditional generation based on latent priors. In this study, we introduce a robust generative processing framework named SpikeGen, designed for visual spike streams captured by spike cameras. We evaluate this framework across multiple tasks involving mixed spike-RGB modalities, including conditional image/video deblurring, dense frame reconstruction from spike streams, and high-speed scene novel-view synthesis. Supported by comprehensive experimental results, we demonstrate that leveraging the latent space operation abilities of generative models allows us to effectively address the sparsity of spatial information while fully exploiting the temporal richness of spike streams, thereby promoting a synergistic enhancement of different visual modalities.
zh

[CV-17] SHARDeg: A Benchmark for Skeletal Human Action Recognition in Degraded Scenarios

【速读】：该论文旨在解决在真实世界条件下，如视频数据流受到退化影响时，计算机视觉（CV）模型，特别是骨骼人体动作识别（SHAR）模型的鲁棒性不足的问题。现有最先进的模型在评估时往往未充分考虑这些现实约束，导致其在实际应用中的性能下降。论文的关键解决方案是通过在NTU-RGB+D-120数据集上建立首个针对数据退化的基准，并评估五种领先的SHAR模型在三种真实世界退化形式下的鲁棒性。研究发现，数据退化类型对模型精度有显著影响，并揭示了时间帧的规律性是模型性能差异的主要驱动因素，进而提出一种基于插值的简单缓解方法，可提升模型性能达40%。此外，该基准还帮助识别出一种基于粗糙路径理论的抗退化SHAR模型，即LogSigRNN，在低帧率下表现优于当前最先进模型DeGCN。

链接: https://arxiv.org/abs/2505.18048
作者: Simon Malzard,Nitish Mital,Richard Walters,Victoria Nockles,Raghuveer Rao,Celso M. De Melo
机构: The Alan Turing Institute (艾伦·图灵研究所); DEVCOM Army Research Laboratory (美国国防部作战能力司令部陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 2 images

点击查看摘要

Abstract:Computer vision (CV) models for detection, prediction or classification tasks operate on video data-streams that are often degraded in the real world, due to deployment in real-time or on resource-constrained hardware. It is therefore critical that these models are robust to degraded data, but state of the art (SoTA) models are often insufficiently assessed with these real-world constraints in mind. This is exemplified by Skeletal Human Action Recognition (SHAR), which is critical in many CV pipelines operating in real-time and at the edge, but robustness to degraded data has previously only been shallowly and inconsistently assessed. Here we address this issue for SHAR by providing an important first data degradation benchmark on the most detailed and largest 3D open dataset, NTU-RGB+D-120, and assess the robustness of five leading SHAR models to three forms of degradation that represent real-world issues. We demonstrate the need for this benchmark by showing that the form of degradation, which has not previously been considered, has a large impact on model accuracy; at the same effective frame rate, model accuracy can vary by 40% depending on degradation type. We also identify that temporal regularity of frames in degraded SHAR data is likely a major driver of differences in model performance, and harness this to improve performance of existing models by up to 40%, through employing a simple mitigation approach based on interpolation. Finally, we highlight how our benchmark has helped identify an important degradation-resistant SHAR model based in Rough Path Theory; the LogSigRNN SHAR model outperforms the SoTA DeGCN model in five out of six cases at low frame rates by an average accuracy of 6%, despite trailing the SoTA model by 11-12% on un-degraded data at high frame rates (30 FPS).
zh

[CV-18] RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

【速读】：该论文试图解决基于潜在扩散模型（Latent Diffusion Models, LDMs）的All-in-One图像修复（AiOR）方法在推理速度上的不足问题，这一问题限制了其在时间敏感应用中的实用性。解决方案的关键在于提出RestoreVAR，一种基于视觉自回归建模（Visual Autoregressive Modeling, VAR）的新一代生成式方法，该方法在保持高修复性能的同时，实现了超过10倍的推理速度提升。RestoreVAR通过引入精心设计的跨注意力机制和潜在空间精炼模块，充分发挥了VAR在计算效率方面的优势。

链接: https://arxiv.org/abs/2505.18047
作者: Sudarshan Rajagopalan,Kartik Narayan,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. To address this, we propose RestoreVAR, a novel generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over \mathbf10\times faster inference. RestoreVAR leverages visual autoregressive modeling (VAR), a recently introduced approach which performs scale-space autoregression for image generation. VAR achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. To optimally exploit these advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.
zh

[CV-19] Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

【速读】：该论文旨在解决基础模型（如CLIP）在资源受限的边缘设备上部署时面临的计算复杂度高和内存占用大的问题。其关键解决方案是提出Clip4Retrofit，一个高效的模型蒸馏框架，通过将CLIP模型的知识蒸馏到轻量级的学生模型中，该学生模型结合了EfficientNet-B3与多层感知机（MLP）投影头，从而在保持跨模态对齐能力的同时显著降低计算需求。

链接: https://arxiv.org/abs/2505.18039
作者: Li Zhong,Ahmed Ghazal,Jun-Jun Wan,Frederik Zilly,Patrick Mackens,Joachim E. Vollrath,Bogdan Sorin Coseriu
机构: Robert Bosch GmbH (罗伯特博世有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models like CLIP (Contrastive Language-Image Pretraining) have revolutionized vision-language tasks by enabling zero-shot and few-shot learning through cross-modal alignment. However, their computational complexity and large memory footprint make them unsuitable for deployment on resource-constrained edge devices, such as in-car cameras used for image collection and real-time processing. To address this challenge, we propose Clip4Retrofit, an efficient model distillation framework that enables real-time image labeling on edge devices. The framework is deployed on the Retrofit camera, a cost-effective edge device retrofitted into thousands of vehicles, despite strict limitations on compute performance and memory. Our approach distills the knowledge of the CLIP model into a lightweight student model, combining EfficientNet-B3 with multi-layer perceptron (MLP) projection heads to preserve cross-modal alignment while significantly reducing computational requirements. We demonstrate that our distilled model achieves a balance between efficiency and performance, making it ideal for deployment in real-world scenarios. Experimental results show that Clip4Retrofit can perform real-time image labeling and object identification on edge devices with limited resources, offering a practical solution for applications such as autonomous driving and retrofitting existing systems. This work bridges the gap between state-of-the-art vision-language models and their deployment in resource-constrained environments, paving the way for broader adoption of foundation models in edge computing.
zh

[CV-20] CAMME: Adaptive Deepfake Image Detection with Multi-Modal Cross-Attention

【速读】：该论文旨在解决深度伪造（deepfake）技术快速发展带来的数字媒体认证与社会安全问题，特别是现有检测方法在面对未见过的生成架构时性能显著下降的局限性。其解决方案的关键在于提出CAMME（Cross-Attention Multi-Modal Embeddings）框架，通过多头交叉注意力机制动态融合视觉、文本和频域特征，实现跨域泛化能力的提升。

链接: https://arxiv.org/abs/2505.18035
作者: Naseem Khan,Tuan Nguyen,Amine Bermak,Issa Khalil
机构: Hamad Bin Khalifa University (HBKU), Qatar; Qatar Computing Research Institute (QCRI), HBKU, Qatar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures, 12 Tables

点击查看摘要

Abstract:The proliferation of sophisticated AI-generated deepfakes poses critical challenges for digital media authentication and societal security. While existing detection methods perform well within specific generative domains, they exhibit significant performance degradation when applied to manipulations produced by unseen architectures–a fundamental limitation as generative technologies rapidly evolve. We propose CAMME (Cross-Attention Multi-Modal Embeddings), a framework that dynamically integrates visual, textual, and frequency-domain features through a multi-head cross-attention mechanism to establish robust cross-domain generalization. Extensive experiments demonstrate CAMME’s superiority over state-of-the-art methods, yielding improvements of 12.56% on natural scenes and 13.25% on facial deepfakes. The framework demonstrates exceptional resilience, maintaining (over 91%) accuracy under natural image perturbations and achieving 89.01% and 96.14% accuracy against PGD and FGSM adversarial attacks, respectively. Our findings validate that integrating complementary modalities through cross-attention enables more effective decision boundary realignment for reliable deepfake detection across heterogeneous generative architectures.
zh

[CV-21] Mahalanobis: Improving OOD Detection via Feature Normalization

【速读】：该论文旨在解决在安全关键型应用中部署可靠机器学习模型时，如何有效检测分布外（out-of-distribution, OOD）样本的问题。研究发现，基于马氏距离（Mahalanobis distance）的后处理方法在ImageNet规模的OOD检测中表现优异，但其性能在不同模型间存在显著差异。论文指出这种不一致性源于特征范数的强烈变化，表明马氏距离估计所依赖的高斯假设被严重违反。解决方案的关键在于对特征进行简单的ℓ₂-归一化（ℓ₂-normalization），该方法有效缓解了这一问题，并使特征更符合具有共享协方差矩阵的正态分布前提。实验结果表明，ℓ₂-归一化显著且一致地提升了传统马氏距离方法的性能，并优于其他近期提出的OOD检测方法。

链接: https://arxiv.org/abs/2505.18032
作者: Maximilian Mueller,Matthias Hein
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting out-of-distribution (OOD) examples is an important task for deploying reliable machine learning models in safety-critial applications. While post-hoc methods based on the Mahalanobis distance applied to pre-logit features are among the most effective for ImageNet-scale OOD detection, their performance varies significantly across models. We connect this inconsistency to strong variations in feature norms, indicating severe violations of the Gaussian assumption underlying the Mahalanobis distance estimation. We show that simple \ell_2 -normalization of the features mitigates this problem effectively, aligning better with the premise of normally distributed data with shared covariance matrix. Extensive experiments on 44 models across diverse architectures and pretraining schemes show that \ell_2 -normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods.
zh

[CV-22] Knot So Simple: A Minimalistic Environment for Spatial Reasoning

【速读】：该论文试图解决复杂空间推理与操作问题，特别是基于纯图像观测的绳索操作任务。解决方案的关键在于构建KnotGym，一个交互式环境，其任务设计沿可量化复杂度轴（基于绳结交叉数）进行，从而提供自然的泛化测试。KnotGym通过简单的观察空间实现了可扩展性，并突出了感知、空间推理与具身操作的整合挑战。

链接: https://arxiv.org/abs/2505.18028
作者: Zizhao Chen,Yoav Artzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We propose KnotGym, an interactive environment for complex, spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks with varying levels of complexity, all requiring acting from pure image observations. Tasks are defined along a clear and quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods of different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at this https URL.
zh

[CV-23] 3D Face Reconstruction Error Decomposed: A Modular Benchmark for Fair and Fast Method Evaluation

【速读】：该论文旨在解决3D人脸重建中几何误差计算的基准评估问题，当前的基准工具由于采用固定组合的处理步骤（如网格裁剪、刚性对齐或点对应），缺乏灵活性且无法准确反映不同方法的性能差异。其解决方案的关键在于提出一个模块化的3D人脸重建基准工具包（M3DFB），将误差计算的基本组件分离并实现可互换性，从而能够量化每个组件的影响。此外，论文引入了一种新的校正组件，并提出一种计算高效的策略来惩罚网格拓扑不一致性，提升了误差估计的准确性与效率。

链接: https://arxiv.org/abs/2505.18025
作者: Evangelos Sariyanidi,Claudio Ferrari,Federico Nocentini,Stefano Berretti,Andrea Cavallaro,Birkan Tunc
机构: The Children’s Hospital of Philadelphia (费城儿童医院); University of Parma (帕尔马大学); University of Siena (锡耶纳大学); University of Florence (佛罗伦萨大学); Idiap Research Institute (Idiap研究机构); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院); University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in IEEE International Conference on Automatic Face and Gesture Recognition, 2025

点击查看摘要

Abstract:Computing the standard benchmark metric for 3D face reconstruction, namely geometric error, requires a number of steps, such as mesh cropping, rigid alignment, or point correspondence. Current benchmark tools are monolithic (they implement a specific combination of these steps), even though there is no consensus on the best way to measure error. We present a toolkit for a Modularized 3D Face reconstruction Benchmark (M3DFB), where the fundamental components of error computation are segregated and interchangeable, allowing one to quantify the effect of each. Furthermore, we propose a new component, namely correction, and present a computationally efficient approach that penalizes for mesh topology inconsistency. Using this toolkit, we test 16 error estimators with 10 reconstruction methods on two real and two synthetic datasets. Critically, the widely used ICP-based estimator provides the worst benchmarking performance, as it significantly alters the true ranking of the top-5 reconstruction methods. Notably, the correlation of ICP with the true error can be as low as 0.41. Moreover, non-rigid alignment leads to significant improvement (correlation larger than 0.90), highlighting the importance of annotating 3D landmarks on datasets. Finally, the proposed correction scheme, together with non-rigid warping, leads to an accuracy on a par with the best non-rigid ICP-based estimators, but runs an order of magnitude faster. Our open-source codebase is designed for researchers to easily compare alternatives for each component, thus helping accelerating progress in benchmarking for 3D face reconstruction and, furthermore, supporting the improvement of learned reconstruction methods, which depend on accurate error estimation for effective training.
zh

[CV-24] A Wavelet-based Stereo Matching Framework for Solving Frequency Convergence Inconsistency

【速读】：该论文旨在解决现有迭代式立体匹配方法在高频区域（如边缘和细小物体）中出现的EPE评估指标收敛不一致的问题，导致高频信息退化。其核心问题在于当前方法在优化过程中未区分高低频成分，从而影响了整体性能。该论文提出的解决方案是基于小波变换的立体匹配框架（Wavelet-Stereo），其关键在于首先利用离散小波变换（Discrete Wavelet Transform）将图像显式分解为高低频成分，随后分别通过多尺度频率特征提取器进行处理，并引入一种基于LSTM的高频保留更新算子，通过迭代频率适配器在不同迭代步骤中自适应地优化初始高频特征，从而实现对高频和低频信息的同步精炼。

链接: https://arxiv.org/abs/2505.18024
作者: Xiaobao Wei,Jiawei Liu,Dongbo Yang,Junda Cheng,Changyong Shu,Wei Wang
机构: Shenyang Institute of Automation, Chinese Academy of Sciences(沈阳自动化研究所，中国科学院); Nanjing University of Science and Technology(南京理工大学); Liaoning Liaohe Laboratory(辽宁辽河实验室); Key Laboratory on Intelligent Detection and Equipment Technology of Liaoning Province(辽宁省智能检测与装备技术重点实验室); Huazhong University of Science and Technology(华中科技大学); Beihang University(北京航空航天大学); University of Chinese Academy of Sciences(中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We find that the EPE evaluation metrics of RAFT-stereo converge inconsistently in the low and high frequency regions, resulting high frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that it optimizes all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high and low frequency components using discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter to provide adaptive refined high-frequency features at different iteration steps by fine-tuning the initial high-frequency features. By processing high and low frequency components separately, our framework can simultaneously refine high-frequency information in edges and low-frequency information in smooth regions, which is especially suitable for challenging scenes with fine details and textures in the distance. Extensive experiments demonstrate that our Wavelet-Stereo outperforms the state-of-the-art methods and ranks 1st on both the KITTI 2015 and KITTI 2012 leaderboards for almost all metrics. We will provide code and pre-trained models to encourage further exploration, application, and development of our innovative framework (this https URL).
zh

[CV-25] RemoteSAM: Towards Segment Anything for Earth Observation

【速读】：该论文试图解决地球观测中视觉基础模型的鲁棒性与灵活性不足的问题，现有系统通常采用任务特定架构，在狭窄的数据域上训练，缺乏语义覆盖范围。解决方案的关键在于从数据和建模两个方面进行创新：首先引入一个自动数据引擎，显著提升了数据集的可扩展性，构建了包含270K图像-文本-掩码三元组的全球最大数据集；其次提出一种以指代表达分割为中心的任务统一范式，通过单一模型处理多种视觉感知任务，无需任务特定头部。结合这些创新，作者提出了RemoteSAM，该模型在多个地球观测感知基准上取得了新的SOTA，并展现出更高的效率。

链接: https://arxiv.org/abs/2505.18022
作者: Liang Yao,Fan Liu,Delong Chen,Chuanyi Zhang,Yijun Wang,Ziyun Chen,Wei Xu,Shimin Di,Yuhui Zheng
机构: Hohai University (河海大学); HKUST (香港科技大学); Southeast University (东南大学); Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc, using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at this https URL.
zh

[CV-26] Building Floor Number Estimation from Crowdsourced Street-Level Images: Munich Dataset and Baseline Method

【速读】：该论文试图解决大规模建筑楼层数量数据在地籍和三维城市数据库中稀缺的问题，这一信息对于家庭估算、公用事业提供、风险评估、疏散规划和能源建模至关重要。解决方案的关键在于提出一种端到端的深度学习框架，该框架直接从无限制的众包街景图像中推断楼层数量，避免了手工特征工程，并能在多种立面风格中泛化。

链接: https://arxiv.org/abs/2505.18021
作者: Yao Sun,Sining Chen,Yifan Tian,Xiao Xiang Zhu
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and data: this https URL

点击查看摘要

Abstract:Accurate information on the number of building floors, or above-ground storeys, is essential for household estimation, utility provision, risk assessment, evacuation planning, and energy modeling. Yet large-scale floor-count data are rarely available in cadastral and 3D city databases. This study proposes an end-to-end deep learning framework that infers floor numbers directly from unrestricted, crowdsourced street-level imagery, avoiding hand-crafted features and generalizing across diverse facade styles. To enable benchmarking, we release the Munich Building Floor Dataset, a public set of over 6800 geo-tagged images collected from Mapillary and targeted field photography, each paired with a verified storey label. On this dataset, the proposed classification-regression network attains 81.2% exact accuracy and predicts 97.9% of buildings within +/-1 floor. The method and dataset together offer a scalable route to enrich 3D city models with vertical information and lay a foundation for future work in urban informatics, remote sensing, and geographic information science. Source code and data will be released under an open license at this https URL.
zh

[CV-27] SemSegBench DetecBench: Benchmarking Reliability and Generalization Beyond Classification

【速读】：该论文旨在解决深度学习在安全关键领域中的可靠性与泛化性问题，特别是在语义分割和目标检测等更广泛的语义任务中。现有研究主要集中在图像分类场景，而实际应用中需要处理的任务更为复杂。论文的关键解决方案是提出基准工具SEMSEGBENCH和DETECBENCH，用于评估模型在分布偏移和对抗性攻击下的鲁棒性，并对大量模型进行了迄今为止最全面的评估，揭示了当前先进模型的系统性弱点及架构、主干网络和模型容量对性能的影响。

链接: https://arxiv.org/abs/2505.18015
作者: Shashank Agnihotri,David Schader,Jonas Jakubassa,Nico Sharei,Simon Kral,Mehmet Ege Kaçar,Ruben Weber,Margret Keuper
机构: University of Mannheim (曼海姆大学); Max-Planck-Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: First seven listed authors have equal contribution. GitHub: this https URL . arXiv admin note: text overlap with arXiv:2505.05091

点击查看摘要

Abstract:Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (this https URL) along with our complete set of total 6139 evaluations. We anticipate the collected data to foster and encourage future research towards improved model reliability beyond classification.
zh

[CV-28] Clinical Validation of Deep Learning for Real-Time Tissue Oxygenation Estimation Using Spectral Imaging MICCAI2025

【速读】：该论文试图解决术中组织缺血监测中实时准确评估组织氧合水平的问题，传统方法依赖于线性解混技术，但其假设条件在实际应用中可能不成立。解决方案的关键在于采用深度学习方法，利用蒙特卡洛模拟光谱数据训练全连接神经网络（Fully Connected Neural Network, FCN）和卷积神经网络（Convolutional Neural Network, CNN），并提出领域对抗训练策略以缩小模拟数据与临床实测光谱数据之间的领域差异，从而提升模型在真实临床环境中的性能。

链接: https://arxiv.org/abs/2505.18010
作者: Jens De Winne,Siri Willems,Siri Luthman,Danilo Babin,Hiep Luong,Wim Ceelen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Provisionally accepted to the MICCAI 2025 conference

点击查看摘要

Abstract:Accurate, real-time monitoring of tissue ischemia is crucial to understand tissue health and guide surgery. Spectral imaging shows great potential for contactless and intraoperative monitoring of tissue oxygenation. Due to the difficulty of obtaining direct reference oxygenation values, conventional methods are based on linear unmixing techniques. These are prone to assumptions and these linear relations may not always hold in practice. In this work, we present deep learning approaches for real-time tissue oxygenation estimation using Monte-Carlo simulated spectra. We train a fully connected neural network (FCN) and a convolutional neural network (CNN) for this task and propose a domain-adversarial training approach to bridge the gap between simulated and real clinical spectral data. Results demonstrate that these deep learning models achieve a higher correlation with capillary lactate measurements, a well-known marker of hypoxia, obtained during spectral imaging in surgery, compared to traditional linear unmixing. Notably, domain-adversarial training effectively reduces the domain gap, optimizing performance in real clinical settings.
zh

[CV-29] Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation

【速读】：该论文旨在解决开放集图像分割（open-set image segmentation）中现有方法通常需要大量训练或微调，且难以在不同文本参考表达中一致分割统一对象的问题。其解决方案的关键在于提出一种无需训练的视觉概念提示学习方法——Segment Anyword，该方法依赖于冻结扩散模型的token级跨注意力图生成分割替代物或掩码提示，并通过语言引导的视觉提示正则化技术，结合句子依存和句法结构信息对视觉提示进行绑定与聚类，从而提取出鲁棒且抗噪声的掩码提示，提升分割精度。

链接: https://arxiv.org/abs/2505.17994
作者: Zhihua Liu,Amrutha Saseendran,Lei Tong,Xilin He,Fariba Yousefi,Nikolay Burlutskiy,Dino Oglic,Tom Diethe,Philip Teare,Huiyu Zhou,Chen Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
zh

[CV-30] Canonical Pose Reconstruction from Single Depth Image for 3D Non-rigid Pose Recovery on Limited Datasets

【速读】：该论文旨在解决从2D输入（尤其是非刚性物体如人体）进行3D重建的挑战，这类物体由于存在较大的形变范围，传统方法在处理时往往面临困难。其解决方案的关键在于提出一种规范姿态重建模型，该模型能够将单视角深度图像转换为规范形式，从而使得刚性物体重建技术得以应用，并支持在体素表示中恢复输入姿态，利用原始和变形的深度图像完成重建任务。

链接: https://arxiv.org/abs/2505.17992
作者: Fahd Alhamazani,Yu-Kun Lai,Paul L. Rosin
机构: Northern Border University (北方边界大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D reconstruction from 2D inputs, especially for non-rigid objects like humans, presents unique challenges due to the significant range of possible deformations. Traditional methods often struggle with non-rigid shapes, which require extensive training data to cover the entire deformation space. This study addresses these limitations by proposing a canonical pose reconstruction model that transforms single-view depth images of deformable shapes into a canonical form. This alignment facilitates shape reconstruction by enabling the application of rigid object reconstruction techniques, and supports recovering the input pose in voxel representation as part of the reconstruction task, utilizing both the original and deformed depth images. Notably, our model achieves effective results with only a small dataset of approximately 300 samples. Experimental results on animal and human datasets demonstrate that our model outperforms other state-of-the-art methods.
zh

[CV-31] Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling

【速读】：该论文旨在解决在有限样本条件下，对全切片图像（Whole Slide Images, WSI）进行少样本、弱监督分类的问题。现有方法在多尺度信息建模和跨模态对齐方面存在两个关键局限：一是缺乏对同一模态下不同尺度（如5x与20x）之间交互关系的有效建模，二是同一尺度下视觉与文本模态之间的对齐不足。解决方案的关键在于提出HiVE-MIL框架，该框架通过构建包含粗粒度（5x）与细粒度（20x）视觉/文本节点间父子链接的统一图结构，以及同尺度下视觉与文本节点间的异构内部边，来捕捉层次化关系与跨模态关联，并结合两阶段文本引导的动态过滤机制与层次对比损失，以增强语义一致性。

链接: https://arxiv.org/abs/2505.17982
作者: Bryan Wong,Jong Woo Kim,Huazhu Fu,Mun Yong Yi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at this https URL
zh

[CV-32] o Glue or Not to Glue? Classical vs Learned Image Matching for Mobile Mapping Cameras to Textured Semantic 3D Building Models

【速读】：该论文旨在解决语义3D建筑相机到模型匹配任务中传统手工特征匹配方法与可学习特征匹配方法之间的性能对比不足问题。其解决方案的关键在于系统评估不同特征匹配技术在使用纹理化CityGML LoD2模型进行视觉定位中的有效性，通过标准基准数据集和自定义数据集（包括立面纹理和对应相机图像）进行实验，以绝对位姿估计的准确性作为评价指标，并利用几何真值进行验证。结果表明，可学习特征匹配方法在复杂条件下的准确性和鲁棒性显著优于传统方法。

链接: https://arxiv.org/abs/2505.17973
作者: Simone Gaisbauer,Prabin Gyawali,Qilin Zhang,Olaf Wysocki,Boris Jutzi
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to MMT, Xiamen, China; ISPRS Annals

点击查看摘要

Abstract:Feature matching is a necessary step for many computer vision and photogrammetry applications such as image registration, structure-from-motion, and visual localization. Classical handcrafted methods such as SIFT feature detection and description combined with nearest neighbour matching and RANSAC outlier removal have been state-of-the-art for mobile mapping cameras. With recent advances in deep learning, learnable methods have been introduced and proven to have better robustness and performance under complex conditions. Despite their growing adoption, a comprehensive comparison between classical and learnable feature matching methods for the specific task of semantic 3D building camera-to-model matching is still missing. This submission systematically evaluates the effectiveness of different feature-matching techniques in visual localization using textured CityGML LoD2 models. We use standard benchmark datasets (HPatches, MegaDepth-1500) and custom datasets consisting of facade textures and corresponding camera images (terrestrial and drone). For the latter, we evaluate the achievable accuracy of the absolute pose estimated using a Perspective-n-Point (PnP) algorithm, with geometric ground truth derived from geo-referenced trajectory data. The results indicate that the learnable feature matching methods vastly outperform traditional approaches regarding accuracy and robustness on our challenging custom datasets with zero to 12 RANSAC-inliers and zero to 0.16 area under the curve. We believe that this work will foster the development of model-based visual localization methods. Link to the code: this https URL_Glue_or_not_to_Glue
zh

[CV-33] MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings

【速读】：该论文旨在解决广义癫痫检测模型中的特征工程问题，特别是现有模型在不同训练数据下表现不稳定且难以准确区分伪影与癫痫数据的问题。其解决方案的关键在于提出一种端到端的多分辨率脑电图波形网络（Multiresolutional EEGWaveNet, MR-EEGWaveNet），通过捕捉不同时间帧间的时序依赖性和通道间的空间关系，有效区分癫痫事件与背景脑电图及伪影/噪声。该模型包含卷积、特征提取和预测模块，结合基于异常评分的后分类处理技术，显著提升了检测性能。

链接: https://arxiv.org/abs/2505.17972
作者: Kazi Mahmudul Hassan,Xuyang Zhao,Hidenori Sugano,Toshihisa Tanaka
机构: Tokyo University of Agriculture and Technology(东京农工大学); RIKEN Center for Interdisciplinary Theoretical and Mathematical Sciences(理化学研究所交叉学科理论与数学科学中心); RIKEN Center for Integrative Medical Sciences(理化学研究所综合医学科学中心); Chiba University(千叶大学); Juntendo University School of Medicine(顺天堂大学医学部)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages, 6 figures, 12 tables

点击查看摘要

Abstract:Feature engineering for generalized seizure detection models remains a significant challenge. Recently proposed models show variable performance depending on the training data and remain ineffective at accurately distinguishing artifacts from seizure data. In this study, we propose a novel end-to-end model, ‘‘Multiresolutional EEGWaveNet (MR-EEGWaveNet),’’ which efficiently distinguishes seizure events from background electroencephalogram (EEG) and artifacts/noise by capturing both temporal dependencies across different time frames and spatial relationships between channels. The model has three modules: convolution, feature extraction, and predictor. The convolution module extracts features through depth-wise and spatio-temporal convolution. The feature extraction module individually reduces the feature dimension extracted from EEG segments and their sub-segments. Subsequently, the extracted features are concatenated into a single vector for classification using a fully connected classifier called the predictor module. In addition, an anomaly score-based post-classification processing technique was introduced to reduce the false-positive rates of the model. Experimental results were reported and analyzed using different parameter settings and datasets (Siena (public) and Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the conventional non-multiresolution approach, improving the F1 scores from 0.177 to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively.
zh

[CV-34] Is Single-View Mesh Reconstruction Ready for Robotics?

【速读】：该论文试图解决单视角网格重建模型在机器人操作中构建数字孪生环境的适用性问题，特别是其在物理仿真和机器人应用中的可行性。解决方案的关键在于建立适用于机器人场景的三维重建基准评估标准，包括处理典型输入、生成无碰撞且稳定的重构结果、处理遮挡以及满足计算约束，并通过真实机器人数据集进行实证评估，以揭示现有方法在机器人特定需求上的不足。

链接: https://arxiv.org/abs/2505.17966
作者: Frederik Nolte,Bernhard Schölkopf,Ingmar Posner
机构: Oxford Robotics Institute, University of Oxford; Max Planck Institute for Intelligent Systems & ELLIS Institute Tübingen
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 17 figures

点击查看摘要

Abstract:This paper evaluates single-view mesh reconstruction models for creating digital twin environments in robot manipulation. Recent advances in computer vision for 3D reconstruction from single viewpoints present a potential breakthrough for efficiently creating virtual replicas of physical environments for robotics contexts. However, their suitability for physics simulations and robotics applications remains unexplored. We establish benchmarking criteria for 3D reconstruction in robotics contexts, including handling typical inputs, producing collision-free and stable reconstructions, managing occlusions, and meeting computational constraints. Our empirical evaluation using realistic robotics datasets shows that despite success on computer vision benchmarks, existing approaches fail to meet robotics-specific requirements. We quantitively examine limitations of single-view reconstruction for practical robotics implementation, in contrast to prior work that focuses on multi-view approaches. Our findings highlight critical gaps between computer vision advances and robotics needs, guiding future research at this intersection.
zh

[CV-35] Mind the Domain Gap: Measuring the Domain Gap Between Real-World and Synthetic Point Clouds for Automated Driving Development

【速读】：该论文试图解决在机器人学、摄影测量和计算机视觉研究中，由于数据分布呈现长尾特性而导致的无领域差距（domain-gap-free）合成数据模拟问题。其核心挑战在于如何可信地度量真实数据与模拟数据之间的差异，这对于安全关键型应用（如自动驾驶）至关重要。论文提出的解决方案的关键在于引入一种新的度量方法DoGSS-PCL，用于评估模拟点云的几何和语义质量，从而实现对真实世界传感器观测与相同场景模拟数据之间领域差距的全面分析。

链接: https://arxiv.org/abs/2505.17959
作者: Nguyen Duc,Yan-Ling Lai,Patrick Madlindl,Xinyuan Zhu,Benedikt Schwab,Olaf Wysocki,Ludwig Hoegner,Thomas H. Kolbe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to PFG Journal of Photogrammetry, Remote Sensing and Geoinformation Science

点击查看摘要

Abstract:Owing to the typical long-tail data distribution issues, simulating domain-gap-free synthetic data is crucial in robotics, photogrammetry, and computer vision research. The fundamental challenge pertains to credibly measuring the difference between real and simulated data. Such a measure is vital for safety-critical applications, such as automated driving, where out-of-domain samples may impact a car’s perception and cause fatal accidents. Previous work has commonly focused on simulating data on one scene and analyzing performance on a different, real-world scene, hampering the disjoint analysis of domain gap coming from networks’ deficiencies, class definitions, and object representation. In this paper, we propose a novel approach to measuring the domain gap between the real world sensor observations and simulated data representing the same location, enabling comprehensive domain gap analysis. To measure such a domain gap, we introduce a novel metric DoGSS-PCL and evaluation assessing the geometric and semantic quality of the simulated point cloud. Our experiments corroborate that the introduced approach can be used to measure the domain gap. The tests also reveal that synthetic semantic point clouds may be used for training deep neural networks, maintaining the performance at the 50/50 real-to-synthetic ratio. We strongly believe that this work will facilitate research on credible data simulation and allow for at-scale deployment in automated driving testing and digital twinning.
zh

[CV-36] Diffusion Classifiers Understand Compositionality but Conditions Apply

【速读】：该论文试图解决扩散模型在判别性任务中的组合理解能力问题，尤其是针对零样本分类任务的性能评估与分析。其解决方案的关键在于利用扩散模型生成的图像构建诊断基准（Self-Bench），并系统地评估不同扩散模型（如SD 1.5、2.0和首次纳入的SD3-m）在多种组合任务中的表现，同时分析领域差异与时间步权重对模型性能的影响。

链接: https://arxiv.org/abs/2505.17955
作者: Yujin Jeong,Arnas Uselis,Seong Joon Oh,Anna Rohrbach
机构: TU Darmstadt & hessian.AI (达姆施塔特工业大学 & 黑森人工智能); Tübingen AI Center & University of Tübingen (图宾根人工智能中心 & 图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark Self-Bench comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at this https URL.
zh

[CV-37] SplatCo: Structure-View Collaborative Gaussian Splatting for Detail-Preserving Rendering of Large-Scale Unbounded Scenes

【速读】：该论文试图解决复杂户外环境中高保真渲染的问题，特别是如何在大规模场景中重建细粒度几何结构和复杂纹理。其解决方案的关键在于提出SplatCo框架，该框架通过两个创新组件实现：一是跨结构协作模块，将全局三平面表示与局部上下文网格特征融合，采用分层补偿策略以保证全局一致性与局部细节的保留；二是跨视角辅助训练策略，通过同步视角间的梯度更新、可见性感知的密度增加以及基于结构一致性的过拟合或不准确高斯分布的剪枝，提升多视角一致性。

链接: https://arxiv.org/abs/2505.17951
作者: Haihong Xiao,Jianan Zou,Yuxin Zhou,Ying He,Wenxiong Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present SplatCo, a structure-view collaborative Gaussian splatting framework for high-fidelity rendering of complex outdoor environments. SplatCo builds upon two novel components: (1) a cross-structure collaboration module that combines global tri-plane representations, which capture coarse scene layouts, with local context grid features that represent fine surface details. This fusion is achieved through a novel hierarchical compensation strategy, ensuring both global consistency and local detail preservation; and (2) a cross-view assisted training strategy that enhances multi-view consistency by synchronizing gradient updates across viewpoints, applying visibility-aware densification, and pruning overfitted or inaccurate Gaussians based on structural consistency. Through joint optimization of structural representation and multi-view coherence, SplatCo effectively reconstructs fine-grained geometric structures and complex textures in large-scale scenes. Comprehensive evaluations on 13 diverse large-scale scenes, including Mill19, MatrixCity, Tanks Temples, WHU, and custom aerial captures, demonstrate that SplatCo consistently achieves higher reconstruction quality than state-of-the-art methods, with PSNR improvements of 1-2 dB and SSIM gains of 0.1 to 0.2. These results establish a new benchmark for high-fidelity rendering of large-scale unbounded scenes. Code and additional information are available at this https URL.
zh

[CV-38] AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models

【速读】：该论文试图解决医疗图像分割中需要大量专家标注或交互式提示的问题，旨在实现零样本（zero-shot）和自动化的分割流程。其解决方案的关键在于结合现成的视觉-语言和分割基础模型，并引入一个基于定位模型生成初始边界框、视觉提示增强模块优化提示以及可提示分割模型生成最终掩码的管道。此外，为应对领域差异和结果验证挑战，还设计了一个测试时适应框架，包含可学习适配器以对齐医学输入与基础模型表示，并通过贝叶斯优化优化超参数，无需真实标签即可进行模型调整。

链接: https://arxiv.org/abs/2505.17931
作者: Xingjian Li,Qifeng Wu,Colleen Que,Yiran Ding,Adithya S. Ubaradka,Jianhua Xing,Tianyang Wang,Min Xu
机构: Carnegie Mellon University (卡内基梅隆大学); Brown University (布朗大学); National Institute of Technology Karnataka (印度国家技术学院卡纳塔克分校); University of Pittsburgh (匹兹堡大学); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., “segment the optic disc in an eye fundus image”), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhance the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline performs competitively with weakly-prompted interactive foundation models.
zh

[CV-39] Evaluation of Few-Shot Learning Methods for Kidney Stone Type Recognition in Ureteroscopy

【速读】：该论文旨在解决肾结石类型识别中因训练数据不足而导致的深度学习模型性能受限问题。传统深度学习模型需要大量标注数据进行训练，而实际应用中获取足够样本困难，尤其是在罕见类别情况下。解决方案的关键在于采用基于少样本学习（few-shot learning）的深度学习方法，通过生成足够区分性的特征来实现对内窥镜图像中肾结石类型的准确分类，即使在训练数据非常有限的情况下也能保持良好性能。

链接: https://arxiv.org/abs/2505.17921
作者: Carlos Salazar-Ruiz,Francisco Lopez-Tiro,Ivan Reyes-Amezcua,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul
机构: Tecnologico de Monterrey, School of Engineering and Sciences, Mexico; Université de Lorraine, CNRS, CRAN (UMR 7039), Vandœuvre-les-Nancy, France; CHRU de Nancy-Brabois, service d’urologie, Vandœuvre-les-Nancy, France
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 3 tables, conference, cbms25

点击查看摘要

Abstract:Determining the type of kidney stones is crucial for prescribing appropriate treatments to prevent recurrence. Currently, various approaches exist to identify the type of kidney stones. However, obtaining results through the reference ex vivo identification procedure can take several weeks, while in vivo visual recognition requires highly trained specialists. For this reason, deep learning models have been developed to provide urologists with an automated classification of kidney stones during ureteroscopies. Nevertheless, a common issue with these models is the lack of training data. This contribution presents a deep learning method based on few-shot learning, aimed at producing sufficiently discriminative features for identifying kidney stone types in endoscopic images, even with a very limited number of samples. This approach was specifically designed for scenarios where endoscopic images are scarce or where uncommon classes are present, enabling classification even with a limited training dataset. The results demonstrate that Prototypical Networks, using up to 25% of the training data, can achieve performance equal to or better than traditional deep learning models trained with the complete dataset.
zh

[CV-40] Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention

【速读】：该论文旨在解决跨视角地理定位（cross-view geo-localization）中对象级精度不足的问题，传统方法主要关注图像级定位，而实际应用如搜索与救援、基础设施检查和精准投递等需要更细粒度的对象级定位能力。解决方案的关键在于提出一种对象级跨视角地理定位网络（Object-level Cross-view Geo-localization Network, OCGNet），其核心是通过高斯核转移（Gaussian Kernel Transfer, GKT）整合用户指定的点击位置，以保持定位信息，并将该提示同时嵌入特征编码器和特征匹配模块，从而实现鲁棒的对象特定定位。此外，OCGNet还引入了位置增强（Location Enhancement, LE）模块和多头跨注意力（Multi-Head Cross Attention, MHCA）模块，以自适应地突出对象特征或扩展到相关上下文区域。

链接: https://arxiv.org/abs/2505.17911
作者: Zheyang Huang,Jagannath Aryal,Saeid Nahavandi,Xuequan Lu,Chee Peng Lim,Lei Wei,Hailing Zhou
机构: Meitu Inc.(美图公司); University of Melbourne(墨尔本大学); Swinburne University of Technology(斯威本科技大学); The University of Western Australia(西澳大学); IISRI, Deakin University(迪肯大学智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (this https URL).
zh

[CV-41] DiffusionReward: Enhancing Blind Face Restoration through Reward Feedback Learning

【速读】：该论文旨在解决盲态人脸修复（Blind Face Restoration）中生成的面部细节不真实以及身份一致性差的问题。其解决方案的关键在于提出了一种基于奖励反馈学习（Reward Feedback Learning, ReFL）的框架——DiffusionReward，其中核心组件是通过精心标注数据训练的人脸奖励模型（Face Reward Model, FRM）。FRM提供反馈信号，引导修复网络的优化过程，并结合梯度流、正则化项和结构一致性约束，共同指导模型参数更新，从而提升修复结果的感知质量和身份一致性。

链接: https://arxiv.org/abs/2505.17910
作者: Bin Wu,Wei Wang,Yahui Liu,Zixiang Li,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 13 figures, 5 tables

点击查看摘要

Abstract:Reward Feedback Learning (ReFL) has recently shown great potential in aligning model outputs with human preferences across various generative tasks. In this work, we introduce a ReFL framework, named DiffusionReward, to the Blind Face Restoration task for the first time. DiffusionReward effectively overcomes the limitations of diffusion-based methods, which often fail to generate realistic facial details and exhibit poor identity consistency. The core of our framework is the Face Reward Model (FRM), which is trained using carefully annotated data. It provides feedback signals that play a pivotal role in steering the optimization process of the restoration network. In particular, our ReFL framework incorporates a gradient flow into the denoising process of off-the-shelf face restoration methods to guide the update of model parameters. The guiding gradient is collaboratively determined by three aspects: (i) the FRM to ensure the perceptual quality of the restored faces; (ii) a regularization term that functions as a safeguard to preserve generative diversity; and (iii) a structural consistency constraint to maintain facial fidelity. Furthermore, the FRM undergoes dynamic optimization throughout the process. It not only ensures that the restoration network stays precisely aligned with the real face manifold, but also effectively prevents reward hacking. Experiments on synthetic and wild datasets demonstrate that our method outperforms state-of-the-art methods, significantly improving identity consistency and facial details. The source codes, data, and models are available at: this https URL.
zh

[CV-42] ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

【速读】：该论文旨在解决现有开源生成框架在处理复杂现实应用场景时的脆弱性问题，主要表现为缺乏结构化的流程规划和执行级反馈机制。其解决方案的关键在于提出ComfyMind系统，该系统通过两个核心创新实现改进：一是语义工作流接口（Semantic Workflow Interface, SWI），将低层次节点图抽象为可通过自然语言描述的可调用功能模块，从而提升高层组合能力并减少结构错误；二是带有局部反馈执行的搜索树规划机制，将生成过程建模为分层决策过程，并在每个阶段实现自适应修正，从而提升复杂生成流程的稳定性和灵活性。

链接: https://arxiv.org/abs/2505.17908
作者: Litao Guo(1),Xinli Xu(1),Luozhou Wang(1),Jiantao Lin(1),Jinsong Zhou(1),Zixin Zhang(1),Bolan Su(3),Ying-Cong Chen(1 and 2) ((1) HKUST (GZ), (2) HKUST, (3) Bytedance)
机构: HKUST(GZ); Bytedance(字节跳动)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unify diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications due to the lack of structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system designed to enable robust and scalable general-purpose generation, built on the ComfyUI platform. ComfyMind introduces two core innovations: Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks: ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems. Project page: this https URL
zh

[CV-43] Semantic segmentation with reward

【速读】：该论文试图解决在现实场景中像素级标注数据不可用时，如何有效训练语义分割网络的问题。其解决方案的关键在于提出了一种基于奖励的强化学习方法——语义分割中的奖励（Reward in Semantic Segmentation, RSS），该方法通过引入多粒度奖励机制（包括逐像素和逐图像级别的奖励）以及创新技术如渐进式尺度奖励（Progressive Scale Rewards, PSR）和成对空间差异（Pair-wise Spatial Difference, PSD），以确保语义分割网络在图像级奖励下的收敛性。实验结果表明，该方法在基准数据集上表现出色，尤其是在仅使用图像级信号的情况下，优于现有的弱监督方法。

链接: https://arxiv.org/abs/2505.17905
作者: Xie Ting,Ye Huang,Zhilin Liu,Lixin Duan
机构: Shenzhen Institute for Advanced Study (深圳先进研究院校); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report

点击查看摘要

Abstract:In real-world scenarios, pixel-level labeling is not always available. Sometimes, we need a semantic segmentation network, and even a visual encoder can have a high compatibility, and can be trained using various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we proposed RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning on pure semantic segmentation offered in two granular levels (pixel-level and image-level). RSS incorporates various novel technologies, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network on two levels of rewards. Additionally, the RSS, which utilizes an image-level reward, outperforms existing weakly supervised methods that also rely solely on image-level signals during training.
zh

[CV-44] Pixels to Prognosis: Harmonized Multi-Region CT-Radiomics and Foundation-Model Signatures Across Multicentre NSCLC Data

【速读】：该论文旨在解决多中心非小细胞肺癌（non-small cell lung cancer, NSCLC）患者生存预测中的影像特征异质性问题，通过整合多区域CT图像特征并进行数据协调，以提升预测模型的泛化能力和准确性。其解决方案的关键在于采用ComBat、重建核归一化（reconstruction kernel normalization, RKN）及其组合方法对传统影像组学（handcrafted radiomics）和预训练基础模型（pretrained foundation model, FM）特征进行协调，并结合临床数据构建风险分层模型，同时利用共识模型增强跨中心预测的一致性与可靠性。

链接: https://arxiv.org/abs/2505.17893
作者: Shruti Atul Mali,Zohaib Salahuddin,Danial Khan,Yumeng Zhang,Henry C. Woodruff,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: To evaluate the impact of harmonization and multi-region CT image feature integration on survival prediction in non-small cell lung cancer (NSCLC) patients, using handcrafted radiomics, pretrained foundation model (FM) features, and clinical data from a multicenter dataset. Methods: We analyzed CT scans and clinical data from 876 NSCLC patients (604 training, 272 test) across five centers. Features were extracted from the whole lung, tumor, mediastinal nodes, coronary arteries, and coronary artery calcium (CAC). Handcrafted radiomics and FM deep features were harmonized using ComBat, reconstruction kernel normalization (RKN), and RKN+ComBat. Regularized Cox models predicted overall survival; performance was assessed using the concordance index (C-index), 5-year time-dependent area under the curve (t-AUC), and hazard ratio (HR). SHapley Additive exPlanations (SHAP) values explained feature contributions. A consensus model used agreement across top region of interest (ROI) models to stratify patient risk. Results: TNM staging showed prognostic utility (C-index = 0.67; HR = 2.70; t-AUC = 0.85). The clinical + tumor radiomics model with ComBat achieved a C-index of 0.7552 and t-AUC of 0.8820. FM features (50-voxel cubes) combined with clinical data yielded the highest performance (C-index = 0.7616; t-AUC = 0.8866). An ensemble of all ROIs and FM features reached a C-index of 0.7142 and t-AUC of 0.7885. The consensus model, covering 78% of valid test cases, achieved a t-AUC of 0.92, sensitivity of 97.6%, and specificity of 66.7%. Conclusion: Harmonization and multi-region feature integration improve survival prediction in multicenter NSCLC data. Combining interpretable radiomics, FM features, and consensus modeling enables robust risk stratification across imaging centers. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2505.17893 [cs.CV] (or arXiv:2505.17893v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.17893 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Shruti Atul Mali [view email] [v1] Fri, 23 May 2025 13:41:52 UTC (1,829 KB)
zh

[CV-45] rack Anything Annotate: Video annotation and dataset generation of computer vision models

【速读】：该论文试图解决现代机器学习方法中标注数据准备过程耗时且资源密集的问题（labelled data preparation process time-consuming and resource-intensive）。解决方案的关键在于开发一个原型工具，该工具基于视频追踪和分割技术，用于注释和生成训练数据集，从而显著加速数据集的生成过程。

链接: https://arxiv.org/abs/2505.17884
作者: Nikita Ivanov,Mark Klimov,Dmitry Glukhikh,Tatiana Chernysheva,Igor Glukhikh
机构: University of Tyumen (叶泰门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 11 figures

点击查看摘要

Abstract:Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we propose to consider the process of prototyping a tool for annotating and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at this https URL
zh

[CV-46] FastCAV: Efficient Computation of Concept Activation Vectors for Explaining Deep Neural Networks ICML2025

【速读】：该论文旨在解决现有基于概念的可解释性方法中，概念激活向量（Concept Activation Vectors, CAVs）计算的高计算成本和时间消耗问题，尤其是在大规模、高维深度神经网络架构中的应用挑战。其解决方案的关键在于提出一种名为FastCAV的新方法，通过优化算法实现CAVs提取效率的显著提升，平均加速达46.4倍，最高可达63.6倍，同时保持与传统支持向量机（SVM）方法相当的性能和稳定性。

链接: https://arxiv.org/abs/2505.17883
作者: Laines Schmalwasser,Niklas Penzel,Joachim Denzler,Julia Niebling
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2025, 27 pages, 20 figures, 9 tables

点击查看摘要

Abstract:Concepts such as objects, patterns, and shapes are how humans understand the world. Building on this intuition, concept-based explainability methods aim to study representations learned by deep neural networks in relation to human-understandable concepts. Here, Concept Activation Vectors (CAVs) are an important tool and can identify whether a model learned a concept or not. However, the computational cost and time requirements of existing CAV computation pose a significant challenge, particularly in large-scale, high-dimensional architectures. To address this limitation, we introduce FastCAV, a novel approach that accelerates the extraction of CAVs by up to 63.6x (on average 46.4x). We provide a theoretical foundation for our approach and give concrete assumptions under which it is equivalent to established SVM-based methods. Our empirical results demonstrate that CAVs calculated with FastCAV maintain similar performance while being more efficient and stable. In downstream applications, i.e., concept-based explanation methods, we show that FastCAV can act as a replacement leading to equivalent insights. Hence, our approach enables previously infeasible investigations of deep models, which we demonstrate by tracking the evolution of concepts during model training.
zh

[CV-47] Hyperspectral Anomaly Detection Fused Unified Nonconvex Tensor Ring Factors Regularization

【速读】：该论文旨在解决高光谱异常检测（Hyperspectral Anomaly Detection, HAD）中背景成分在光谱和空间域中全局相关性和局部平滑性未被充分利用的问题，这一缺陷导致检测性能不理想。其解决方案的关键在于提出一种名为HAD-EUNTRFR的新方法，该方法引入了增强型统一非凸张量环（TR）因子正则化，通过TR分解捕捉背景成分中的空谱相关性，并结合由张量奇异值分解（TSVD）诱导的统一高效非凸正则化器，同时编码3-D梯度TR因子的低秩性和稀疏性，从而实现对背景成分低秩性和平滑性的继承。此外，还设计了一个广义非凸正则化项以利用异常成分的组稀疏性，最终通过基于交替方向乘子法（ADMM）的优化算法求解双非凸模型。

链接: https://arxiv.org/abs/2505.17881
作者: Wenjin Qin,Hailin Wang,Hao Shu,Feng Zhang,Jianjun Wang,Xiangyong Cao,Xi-Le Zhao,Gemine Vivone
机构: Southwest University (西南大学); Xi’an Jiaotong University (西安交通大学); University of Electronic Science and Technology of China (电子科技大学); CNR-IMAA (意大利国家研究委员会环境分析方法研究所); National Biodiversity Future Center (国家生物多样性未来中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, tensor decomposition-based approaches for hyperspectral anomaly detection (HAD) have gained significant attention in the field of remote sensing. However, existing methods often fail to fully leverage both the global correlations and local smoothness of the background components in hyperspectral images (HSIs), which exist in both the spectral and spatial domains. This limitation results in suboptimal detection performance. To mitigate this critical issue, we put forward a novel HAD method named HAD-EUNTRFR, which incorporates an enhanced unified nonconvex tensor ring (TR) factors regularization. In the HAD-EUNTRFR framework, the raw HSIs are first decomposed into background and anomaly components. The TR decomposition is then employed to capture the spatial-spectral correlations within the background component. Additionally, we introduce a unified and efficient nonconvex regularizer, induced by tensor singular value decomposition (TSVD), to simultaneously encode the low-rankness and sparsity of the 3-D gradient TR factors into a unique concise form. The above characterization scheme enables the interpretable gradient TR factors to inherit the low-rankness and smoothness of the original background. To further enhance anomaly detection, we design a generalized nonconvex regularization term to exploit the group sparsity of the anomaly component. To solve the resulting doubly nonconvex model, we develop a highly efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) framework. Experimental results on several benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art (SOTA) approaches in terms of detection accuracy.
zh

[CV-48] Multi-task Learning For Joint Action and Gesture Recognition

【速读】：该论文试图解决在计算机视觉任务中，动作（action）与手势（gesture）识别通常被单独处理而导致效率和泛化能力受限的问题。其解决方案的关键在于采用多任务学习（multitask learning）范式，通过联合训练单一深度神经网络来学习共享表示，从而利用两者之间的协同效应，提升视觉表征的效率、鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2505.17867
作者: Konstantinos Spathis,Nikolaos Kardaris,Petros Maragos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In practical applications, computer vision tasks often need to be addressed simultaneously. Multitask learning typically achieves this by jointly training a single deep neural network to learn shared representations, providing efficiency and improving generalization. Although action and gesture recognition are closely related tasks, since they focus on body and hand movements, current state-of-the-art methods handle them separately. In this paper, we show that employing a multi-task learning paradigm for action and gesture recognition results in more efficient, robust and generalizable visual representations, by leveraging the synergies between these tasks. Extensive experiments on multiple action and gesture datasets demonstrate that handling actions and gestures in a single architecture can achieve better performance for both tasks in comparison to their single-task learning variants.
zh

[CV-49] Multi-Person Interaction Generation from Two-Person Motion Priors SIGGRAPH2025

【速读】：该论文旨在解决多人体交互生成中的挑战，特别是在保持高真实性和多样性的同时，避免个体动作的重复性。其解决方案的关键在于提出了一种基于图的交互采样方法（Graph-driven Interaction Sampling），通过将复杂的多人交互分解为两人交互的图结构（Pairwise Interaction Graph），从而将生成任务转化为在他人动作条件下同时生成单人动作的问题。此外，为了减少生成过程中身体部位的穿插等伪影，引入了两个依赖于图结构的引导项。

链接: https://arxiv.org/abs/2505.17860
作者: Wenning Xu,Shiyu Fan,Paul Henderson,Edmond S. L. Ho
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: SIGGRAPH 2025 Conference Papers

点击查看摘要

Abstract:Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneous single-person motion generation conditioned on one other’s motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.
zh

[CV-50] Locality-Sensitive Hashing for Efficient Hard Negative Sampling in Contrastive Learning

【速读】：该论文试图解决在大规模高维数据集中高效找到高质量的难负例（hard negative examples）的问题，这一过程在对比学习中对于提升特征空间的质量至关重要。解决方案的关键在于提出一种适用于GPU的局部敏感哈希（Locality-Sensitive Hashing, LSH）方案，该方案将实值特征向量量化为二进制表示，从而实现近似最近邻搜索，相较于现有方法在计算效率上显著降低。

链接: https://arxiv.org/abs/2505.17844
作者: Fabian Deuser,Philipp Hausenblas,Hannah Schieber,Daniel Roth,Martin Werner,Norbert Oswald
机构: University of the Bundeswehr Munich(联邦国防军大学); Technical University of Munich(慕尼黑工业大学); Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive learning is a representational learning paradigm in which a neural network maps data elements to feature vectors. It improves the feature space by forming lots with an anchor and examples that are either positive or negative based on class similarity. Hard negative examples, which are close to the anchor in the feature space but from a different class, improve learning performance. Finding such examples of high quality efficiently in large, high-dimensional datasets is computationally challenging. In this paper, we propose a GPU-friendly Locality-Sensitive Hashing (LSH) scheme that quantizes real-valued feature vectors into binary representations for approximate nearest neighbor search. We investigate its theoretical properties and evaluate it on several datasets from textual and visual domain. Our approach achieves comparable or better performance while requiring significantly less computation than existing hard negative mining strategies.
zh

[CV-51] VLM Models and Automated Grading of Atopic Dermatitis

【速读】：该论文试图解决的是通过医学图像对特应性皮炎（atopic dermatitis, AD）进行严重程度评估的问题，这一任务即使对于受过训练的皮肤科医生来说也具有挑战性。研究的解决方案关键在于利用视觉-语言模型（vision-language models, VLMs）的多模态能力，以实现对医学图像的可解释性评估，从而提升诊断的准确性与透明度。

链接: https://arxiv.org/abs/2505.17835
作者: Marc Lalonde,Hamed Ghodrati
机构: Computer Research Institute of Montreal (CRIM)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:The task of grading atopic dermatitis (or AD, a form of eczema) from patient images is difficult even for trained dermatologists. Research on automating this task has progressed in recent years with the development of deep learning solutions; however, the rapid evolution of multimodal models and more specifically vision-language models (VLMs) opens the door to new possibilities in terms of explainable assessment of medical images, including dermatology. This report describes experiments carried out to evaluate the ability of seven VLMs to assess the severity of AD on a set of test images.
zh

[CV-52] ICPL-ReID: Identity-Conditional Prompt Learning for Multi-Spectral Object Re-Identification

【速读】：该论文旨在解决多光谱目标重识别（Multi-spectral object ReID）中异构光谱之间复杂模态差异导致的光谱信息互补性和差异性难以有效利用的问题。现有方法通过复杂的模态交互模块融合光谱数据，但缺乏对光谱信息的细粒度语义理解。该论文提出的解决方案关键在于引入一种基于CLIP模型跨模态对齐能力的Identity-Conditional text Prompt Learning框架（ICPL），通过在线可学习文本提示作为身份级语义中心，实现不同光谱身份语义的统一，并构建对齐循环以优化文本提示和光谱视觉编码器，从而避免在线提示学习破坏预训练的文本-图像对齐分布。

链接: https://arxiv.org/abs/2505.17821
作者: Shihao Li,Chenglong Li,Aihua Zheng,Jin Tang,Bin Luo
机构: Anhui Provincial Key Laboratory of Security Artificial Intelligence(安徽省安全人工智能重点实验室); Anhui University(安徽大学); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation(安徽省多模态认知计算重点实验室); School of Computer Science and Technology(计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Multi-spectral object re-identification (ReID) brings a new perception perspective for smart city and intelligent transportation applications, effectively addressing challenges from complex illumination and adverse weather. However, complex modal differences between heterogeneous spectra pose challenges to efficiently utilizing complementary and discrepancy of spectra information. Most existing methods fuse spectral data through intricate modal interaction modules, lacking fine-grained semantic understanding of spectral information (\textite.g., text descriptions, part masks, and object keypoints). To solve this challenge, we propose a novel Identity-Conditional text Prompt Learning framework (ICPL), which exploits the powerful cross-modal alignment capability of CLIP, to unify different spectral visual features from text semantics. Specifically, we first propose the online prompt learning using learnable text prompt as the identity-level semantic center to bridge the identity semantics of different spectra in online manner. Then, in lack of concrete text descriptions, we propose the multi-spectral identity-condition module to use identity prototype as spectral identity condition to constraint prompt learning. Meanwhile, we construct the alignment loop mutually optimizing the learnable text prompt and spectral visual encoder to avoid online prompt learning disrupting the pre-trained text-image alignment distribution. In addition, to adapt to small-scale multi-spectral data and mitigate style differences between spectra, we propose multi-spectral adapter that employs a low-rank adaption method to learn spectra-specific features. Comprehensive experiments on 5 benchmarks, including RGBNT201, Market-MM, MSVR310, RGBN300, and RGBNT100, demonstrate that the proposed method outperforms the state-of-the-art methods.
zh

[CV-53] Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

【速读】：该论文旨在解决大型视觉-语言模型（Large Vision-Language Models, LVLMs）中存在的对象幻觉（object hallucination, OH）问题，即模型生成的输出与视觉输入不一致的现象。其解决方案的关键在于提出一种名为VaLSe的视觉感知潜在空间引导框架，该框架采用“解释-缓解”策略，通过建模复杂的视觉-语言交互并消除虚假激活伪影，生成可视化贡献图，以追踪特定视觉输入如何影响每个输出标记，进而通过潜在空间引导调整内部表示，提升语义相关性并减少幻觉输出。

链接: https://arxiv.org/abs/2505.17812
作者: Boxu Chen,Ziwei Zheng,Le Yang,Zeyu Geng,Zhengyu Zhao,Chenhao Lin,Chao Shen
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable success but continue to struggle with object hallucination (OH), generating outputs inconsistent with visual inputs. While previous work has proposed methods to reduce OH, the visual decision-making mechanisms that lead to hallucinations remain poorly understood. In this paper, we propose VaLSe, a Vision-aware Latent Steering framework that adopts an interpretation-then-mitigation strategy to address OH in LVLMs. By tackling dual challenges of modeling complex vision-language interactions and eliminating spurious activation artifacts, VaLSe can generate visual contribution maps that trace how specific visual inputs influence individual output tokens. These maps reveal the model’s vision-aware focus regions, which are then used to perform latent space steering, realigning internal representations toward semantically relevant content and reducing hallucinated outputs. Extensive experiments demonstrate that VaLSe is a powerful interpretability tool and an effective method for enhancing model robustness against OH across multiple benchmarks. Furthermore, our analysis uncovers limitations in existing OH evaluation metrics, underscoring the need for more nuanced, interpretable, and visually grounded OH benchmarks in future work. Code is available at: this https URL.
zh

[CV-54] An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

【速读】：该论文旨在解决青光眼检测中人工智能模型性能提升的问题，其解决方案的关键在于构建一种混合的复杂深度学习模型，该模型通过将开创性的卷积神经网络（Convolutional Neural Network, CNN）与颠覆性的视觉变压器（Vision Transformer, ViT）相结合，并引入激进的跨注意力模块（Cross Attention module）实现模型间的协同效应。

链接: https://arxiv.org/abs/2505.17808
作者: Ramanathan Swaminathan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages in general IEEE format, 8 figures, 4 tables, pdflatex

点击查看摘要

Abstract:This research work reveals the eye opening wisdom of the hybrid labyrinthine deep learning models synergy born out of combining a trailblazing convolutional neural network with a disruptive Vision Transformer, both intertwined together with a radical Cross Attention module. Here, two high yielding datasets for artificial intelligence models in detecting glaucoma, namely ACRIMA and Drishti, are utilized.
zh

[CV-55] mporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition IJCAI’25

【速读】：该论文旨在解决深度学习动作识别模型在面对对抗样本时的脆弱性问题，尤其是对抗样本在不同模型间的可迁移性受限的问题。现有方法面临两大挑战：一是依赖于源模型与目标模型决策边界相似的假设，限制了对抗样本的可迁移性；二是决策边界的差异导致攻击方向不确定，可能引发梯度振荡，削弱攻击效果。该论文提出的解决方案关键在于设计一种基于背景混合与时序一致性的攻击方法（Background Mixup-induced Temporal Consistency, BMTC），通过引入模型无关的背景对抗混合模块减少对源模型的依赖，并利用背景类别指导梯度更新，同时设计时序梯度一致性损失以增强攻击方向的稳定性，从而显著提升对抗样本的跨模型迁移能力。

链接: https://arxiv.org/abs/2505.17807
作者: Ping Li,Jianan Ni,Bo Pang
机构: Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IJCAI’25

点击查看摘要

Abstract:Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in the gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and make its background frame, while selecting the background frame with the top attack ability for mixup with the clean frame by reinforcement learning. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at this https URL.
zh

[CV-56] A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

【速读】：该论文试图解决大规模数据集中选取小而具有代表性的子集（即Coreset选择）的问题，以保留关键模式并提升机器学习效率。其解决方案的关键在于提出一个统一的分类体系，将三种主要的Coreset研究方向——训练无关、训练导向和标签无关方法——整合在一起，并引入了子模优化、双层优化以及伪标签技术等被以往研究忽视的子领域，同时探讨了剪枝策略对泛化能力和神经网络缩放定律的影响，为未来研究提供了新的视角和挑战。

链接: https://arxiv.org/abs/2505.17799
作者: Brian B. Moser,Arundhati S. Shanbhag,Stanislav Frolov,Federico Raue,Joachim Folz,Andreas Dengel
机构: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern-Landau (RPTU凯撒斯劳滕-兰道大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.
zh

[CV-57] DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

【速读】：该论文旨在解决组合图像检索（Composed Image Retrieval, CIR）中由于对细粒度细节关注不足而导致的细微视觉变化或复杂文本指令处理能力不足的问题。其解决方案的关键在于提出了一种名为DetailFusion的新型双分支框架，该框架通过有效协调全局与细节层次的信息，实现了细节增强的CIR。具体而言，该方法利用来自图像编辑数据集的原子细节变化先验，并结合面向细节的优化策略，构建了细节导向的推理分支；同时设计了一个自适应特征合成器，根据每个独特多模态查询的细粒度信息动态融合全局与细节特征。

链接: https://arxiv.org/abs/2505.17796
作者: Yuxin Yang,Yinan Zhou,Yuxin Chen,Ziqi Zhang,Zongyang Ma,Chunfeng Yuan,Bing Li,Lin Song,Jun Gao,Peng Li,Weiming Hu
机构: Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information, CASIA (北京多模态信息超级智能安全重点实验室，中国科学院自动化研究所); State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA (多模态人工智能系统国家重点实验室，中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Xi’an Jiaotong University (西安交通大学); ARC Lab, Tencent PCG (腾讯PCG ARC实验室); PeopleAI Inc. (PeopleAI公司); HelloGroup Inc. (HelloGroup公司); Xiaomi Group (小米集团); School of Information Science and Technology, ShanghaiTech University (上海科技大学信息科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.
zh

[CV-58] Generative Data Augmentation for Object Point Cloud Segmentation

【速读】：该论文试图解决传统数据增强（TDA）方法在深度学习模型训练中因数据多样性不足而导致的模型性能提升有限的问题，以及生成式3D形状缺乏语义标签限制其在点云分割任务中的应用问题。解决方案的关键在于扩展先进的3D扩散模型Lion，使其成为一种基于分割掩码生成高质量点云的部分感知生成模型，并引入一个三步生成式数据增强（GDA）流程，通过少量标注样本生成变体和伪标注样本，结合基于扩散的伪标签过滤方法，有效提升点云分割任务的训练效果。

链接: https://arxiv.org/abs/2505.17783
作者: Dekai Zhu,Stefan Gavranovic,Flavien Boussuge,Benjamin Busam,Slobodan Ilic
机构: Technical University of Munich (慕尼黑工业大学); Siemens AG (西门子集团); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data augmentation is widely used to train deep learning models to address data scarcity. However, traditional data augmentation (TDA) typically relies on simple geometric transformation, such as random rotation and rescaling, resulting in minimal data diversity enrichment and limited model performance improvement. State-of-the-art generative models for 3D shape generation rely on the denoising diffusion probabilistic models and manage to generate realistic novel point clouds for 3D content creation and manipulation. Nevertheless, the generated 3D shapes lack associated point-wise semantic labels, restricting their usage in enlarging the training data for point cloud segmentation tasks. To bridge the gap between data augmentation techniques and the advanced diffusion models, we extend the state-of-the-art 3D diffusion model, Lion, to a part-aware generative model that can generate high-quality point clouds conditioned on given segmentation masks. Leveraging the novel generative model, we introduce a 3-step generative data augmentation (GDA) pipeline for point cloud segmentation training. Our GDA approach requires only a small amount of labeled samples but enriches the training data with generated variants and pseudo-labeled samples, which are validated by a novel diffusion-based pseudo-label filtering method. Extensive experiments on two large-scale synthetic datasets and a real-world medical dataset demonstrate that our GDA method outperforms TDA approach and related semi-supervised and self-supervised methods.
zh

[CV-59] Hephaestus Minicubes: A Global Multi-Modal Dataset for Volcanic Unrest Monitoring

【速读】：该论文试图解决火山监测中地面形变检测的自动化与智能化问题，特别是在利用深度学习方法进行火山活动预警方面的研究空白。其解决方案的关键在于构建了Hephaestus Minicubes数据集，该数据集是一个全球性的高分辨率、多源异构、多时相的时空数据立方体集合，涵盖了44座活跃火山在过去7年内的InSAR产品、地形数据及大气变量，并提供了专家标注的形变事件类型、强度和空间范围信息，从而为多模态、多时相的分类与语义分割任务提供了坚实的数据基础。

链接: https://arxiv.org/abs/2505.17782
作者: Nikolas Papadopoulos,Nikolaos Ioannis Bountos,Maria Sdraka,Andreas Karavias,Ioannis Papoutsis
机构: Orion Lab (奥里翁实验室); National Observatory of Athens (希腊国家天文台); National Technical University of Athens (希腊国立技术大学); Harokopio University of Athens (哈罗科普里奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ground deformation is regarded in volcanology as a key precursor signal preceding volcanic eruptions. Satellite-based Interferometric Synthetic Aperture Radar (InSAR) enables consistent, global-scale deformation tracking; however, deep learning methods remain largely unexplored in this domain, mainly due to the lack of a curated machine learning dataset. In this work, we build on the existing Hephaestus dataset, and introduce Hephaestus Minicubes, a global collection of 38 spatiotemporal datacubes offering high resolution, multi-source and multi-temporal information, covering 44 of the world’s most active volcanoes over a 7-year period. Each spatiotemporal datacube integrates InSAR products, topographic data, as well as atmospheric variables which are known to introduce signal delays that can mimic ground deformation in InSAR imagery. Furthermore, we provide expert annotations detailing the type, intensity and spatial extent of deformation events, along with rich text descriptions of the observed scenes. Finally, we present a comprehensive benchmark, demonstrating Hephaestus Minicubes’ ability to support volcanic unrest monitoring as a multi-modal, multi-temporal classification and semantic segmentation task, establishing strong baselines with state-of-the-art architectures. This work aims to advance machine learning research in volcanic monitoring, contributing to the growing integration of data-driven methods within Earth science applications.
zh

[CV-60] U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

【速读】：该论文旨在解决医学超声图像解释中的挑战，尤其是在不同操作者、噪声和解剖结构影响下图像质量差异带来的问题。其解决方案的关键在于引入U2-BENCH，这是首个全面的基准测试平台，用于评估视觉-语言模型（Vision-Language Models, VLMs）在超声理解任务中的表现，涵盖分类、检测、回归和文本生成等多个方面。U2-BENCH通过整合7,241个病例和定义8个临床相关任务，为评估和加速VLM在医学超声成像这一独特多模态领域中的研究提供了严格且统一的测试环境。

链接: https://arxiv.org/abs/2505.17779
作者: Anjie Le,Henan Liu,Yue Wang,Zhenyu Liu,Rongkun Zhu,Taohan Weng,Jinze Yu,Boyang Wang,Yalun Wu,Kaiwen Yan,Quanlin Sun,Meirui Jiang,Jialun Pei,Siya Liu,Haoyun Zheng,Zhoujun Li,Alison Noble,Jacques Souquet,Xiaoqing Guo,Manxi Lin,Hongcheng Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ultrasound is a widely-used imaging modality critical to global healthcare, yet its interpretation remains challenging due to its varying image quality on operators, noises, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
zh

[CV-61] xtFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

【速读】：该论文旨在解决基于扩散模型的场景文本合成中对额外视觉条件模块的依赖以及大规模标注数据的需求问题，同时提升多语言生成的准确性和灵活性。其解决方案的关键在于利用扩散模型固有的上下文推理能力，提出了一种无需OCR编码器（OCR encoder）的框架——TextFlux，通过简化架构、减少训练数据需求、增强多语言可扩展性及实现可控的多行文本生成，从而在保持高保真场景融合的同时确保字形准确性。

链接: https://arxiv.org/abs/2505.17778
作者: Yu Xie,Jielei Zhang,Pengyu Chen,Ziyue Wang,Weihang Wang,Longwen Gao,Peiyi Li,Huyang Sun,Qiang Zhang,Qian Qiao,Jiaqing Fan,Zhouhui Lian
机构: bilibili Inc. (哔哩哔哩公司); Soochow University (苏州大学); Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models’ inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.
zh

[CV-62] opoPoint: Enhance Topology Reasoning via Endpoint Detection in Autonomous Driving

【速读】：该论文旨在解决自动驾驶中拓扑推理（topology reasoning）因车道检测不准确，尤其是在连接车道端点处的偏差问题。现有方法在车道端点偏差的情况下难以构建正确的拓扑结构，从而影响对交叉口的理解。解决方案的关键在于提出TopoPoint框架，该框架通过显式检测车道端点，并联合推理端点与车道信息，实现鲁棒的拓扑推理。其核心创新包括Point-Lane Merge Self-Attention机制和Point-Lane Graph Convolutional Network，用于增强点与车道之间的全局上下文共享与特征聚合，同时在推理阶段引入Point-Lane Geometry Matching算法，以优化车道端点，有效缓解端点偏差问题。

链接: https://arxiv.org/abs/2505.17771
作者: Yanping Fu,Xinyuan Liu,Tianyu Li,Yike Ma,Yucheng Zhang,Feng Dai
机构: Institute of Computing Technology, Chinese Academy of Science; University of Chinese Academy of Sciences; Shanghai AI Lab; Shanghai Innovation Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Topology reasoning, which unifies perception and structured reasoning, plays a vital role in understanding intersections for autonomous driving. However, its performance heavily relies on the accuracy of lane detection, particularly at connected lane endpoints. Existing methods often suffer from lane endpoints deviation, leading to incorrect topology construction. To address this issue, we propose TopoPoint, a novel framework that explicitly detects lane endpoints and jointly reasons over endpoints and lanes for robust topology reasoning. During training, we independently initialize point and lane query, and proposed Point-Lane Merge Self-Attention to enhance global context sharing through incorporating geometric distances between points and lanes as an attention mask . We further design Point-Lane Graph Convolutional Network to enable mutual feature aggregation between point and lane query. During inference, we introduce Point-Lane Geometry Matching algorithm that computes distances between detected points and lanes to refine lane endpoints, effectively mitigating endpoint deviation. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoPoint achieves state-of-the-art performance in topology reasoning (48.8 on OLS). Additionally, we propose DET _p to evaluate endpoint detection, under which our method significantly outperforms existing approaches (52.6 v.s. 45.2 on DET _p ). The code is released at this https URL.
zh

[CV-63] R-Genie: Reasoning -Guided Generative Image Editing

【速读】：该论文试图解决当前图像编辑方法在处理复杂、多维度文本查询时的局限性，即依赖显式文本指令和有限的编辑操作，缺乏对隐含用户意图和上下文推理的深度理解。其解决方案的关键在于提出一种基于推理的生成式图像编辑范式（reasoning-guided generative editing），通过结合扩散模型的生成能力与多模态大语言模型的高级推理能力，引入推理注意力机制以实现语言理解与视觉合成的融合，从而有效处理涉及抽象用户意图和上下文关系的复杂编辑请求。

链接: https://arxiv.org/abs/2505.17768
作者: Dong Zhang,Lingfeng He,Rui Yan,Fei Shen,Jinhui Tang
机构: The Hong Kong University of Science and Technology (香港科技大学); Nanjing University of Science and Technology (南京理工大学); The National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:While recent advances in image editing have enabled impressive visual synthesis capabilities, current methods remain constrained by explicit textual instructions and limited editing operations, lacking deep comprehension of implicit user intentions and contextual reasoning. In this work, we introduce a new image editing paradigm: reasoning-guided generative editing, which synthesizes images based on complex, multi-faceted textual queries accepting world knowledge and intention inference. To facilitate this task, we first construct a comprehensive dataset featuring over 1,000 image-instruction-edit triples that incorporate rich reasoning contexts and real-world knowledge. We then propose R-Genie: a reasoning-guided generative image editor, which synergizes the generation power of diffusion models with advanced reasoning capabilities of multimodal large language models. R-Genie incorporates a reasoning-attention mechanism to bridge linguistic understanding with visual synthesis, enabling it to handle intricate editing requests involving abstract user intentions and contextual reasoning relations. Extensive experimental results validate that R-Genie can equip diffusion models with advanced reasoning-based editing capabilities, unlocking new potentials for intelligent image synthesis.
zh

[CV-64] Soft-CAM: Making black box models self-explainable for high-stakes decisions

【速读】：该论文试图解决卷积神经网络（Convolutional Neural Networks, CNNs）在高风险应用（如医学领域）中缺乏可解释性的问题。现有解释方法多为后验归因方法，依赖于对已训练的黑箱模型进行近似，但这些方法通常不稳定、不可靠，无法真实反映模型的推理过程。论文提出的解决方案关键在于引入SoftCAM，通过移除全局平均池化层并用基于卷积的类别证据层替代全连接分类层，从而保留空间信息并生成明确的类别激活图，使标准CNN架构本身具备可解释性，同时保持分类性能。

链接: https://arxiv.org/abs/2505.17748
作者: Kerol Djoumessi,Philipp Berens
机构: Hertie Institute for AI in Brain Health University of Tübingen(海蒂人工智能与脑健康研究所图宾根大学); Tübingen AI Center, University of Tübingen(图宾根人工智能中心，图宾根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications. In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model’s predictions. Evaluated on three medical datasets, SoftCAM maintains classification performance while significantly improving both the qualitative and quantitative explanation compared to existing post-hoc methods. Our results demonstrate that CNNs can be inherently interpretable without compromising performance, advancing the development of self-explainable deep learning for high-stakes decision-making.
zh

[CV-65] RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection

【速读】：该论文旨在解决自动驾驶中3D目标检测的准确性和可靠性问题，特别是针对基于鸟瞰图（BEV）的检测方法中存在的角度表示导致的损失函数不连续性问题。其解决方案的关键在于提出受限四边形表示（RQR3D），通过回归包围旋转边界框的最小水平边界框及其角点偏移量，将定向目标检测问题转化为关键点回归任务，从而提升检测精度和鲁棒性。

链接: https://arxiv.org/abs/2505.17732
作者: Ozsel Kilinc,Cem Tarhan
机构: Togg/Trutek AI Team (Togg/Trutek AI Team)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate, fast, and reliable 3D perception is essential for autonomous driving. Recently, bird’s-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial understanding and more natural outputs for planning. Existing BEV-based 3D object detection methods, typically adhering to angle-based representation, directly estimate the size and orientation of rotated bounding boxes. We observe that BEV-based 3D object detection is analogous to aerial oriented object detection, where angle-based methods are recognized for being affected by discontinuities in their loss functions. Drawing inspiration from this domain, we propose Restricted Quadrilateral Representation to define 3D regression targets. RQR3D regresses the smallest horizontal bounding box encapsulating the oriented box, along with the offsets between the corners of these two boxes, thereby transforming the oriented object detection problem into a keypoint regression task. RQR3D is compatible with any 3D object detection approach. We employ RQR3D within an anchor-free single-stage object detection method and introduce an objectness head to address class imbalance problem. Furthermore, we introduce a simplified radar fusion backbone that eliminates the need for voxel grouping and processes the BEV-mapped point cloud with standard 2D convolutions, rather than sparse convolutions. Extensive evaluations on the nuScenes dataset demonstrate that RQR3D achieves state-of-the-art performance in camera-radar 3D object detection, outperforming the previous best method by +4% in NDS and +2.4% in mAP, and significantly reducing the translation and orientation errors, which are crucial for safe autonomous driving. These consistent gains highlight the robustness, precision, and real-world readiness of our approach.
zh

[CV-66] SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain

【速读】：该论文旨在解决现有方法在生成安全关键驾驶轨迹、仿真或单视角视频时，无法满足先进端到端自动驾驶（E2E AD）系统对真实世界多视角视频数据的需求问题。其解决方案的关键在于提出SafeMVDrive框架，该框架通过将安全关键轨迹生成器与先进的多视角视频生成器进行战略性集成，提升生成数据的真实性和安全性。具体而言，通过引入视觉上下文和基于GRPO微调的视觉-语言模型增强轨迹生成器的场景理解能力，并采用两阶段可控轨迹生成机制以实现碰撞规避，最终利用基于扩散的多视角视频生成器合成高质量的安全关键驾驶视频。

链接: https://arxiv.org/abs/2505.17727
作者: Jiawei Zhou,Linye Lyu,Zhuotao Tian,Cheng Zhuo,Yu Li
机构: Harbin Institute of Technology (哈尔滨工业大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous systems (E2E AD), which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance scene understanding ability of the trajectory generator by incorporating visual context – which is previously unavailable to such generator – and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: this https URL.
zh

[CV-67] Slot-MLLM : Object-Centric Visual Tokenization for Multimodal LLM

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在处理视觉内容时缺乏对局部细节有效理解和生成能力的问题，现有图像分词方法通常仅捕捉全局抽象概念或均匀分割的图像块，限制了模型在物体层级上的表现。解决方案的关键是提出一种基于槽注意力（Slot Attention）的以对象为中心的视觉分词器，结合Q-Former编码器、扩散解码器和残差向量量化技术，生成的离散槽标记能够在保留高层语义的同时编码局部视觉细节，并与文本数据对齐，从而在统一的下一分词预测框架中实现高效融合。

链接: https://arxiv.org/abs/2505.17726
作者: Donghwan Chi,Hyomin Kim,Yoonjin Oh,Yongjin Kim,Donghoon Lee,Daejin Jo,Jongmin Kim,Junyeob Baek,Sungjin Ahn,Sungwoong Kim
机构: Korea University(高丽大学); Kakao Corp.(韩国卡巴公司); KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs’ capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
zh

[CV-68] SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation

【速读】：该论文旨在解决生成带有细粒度分割标签的点云问题，以及缺乏针对该任务的有效评估指标的问题。其解决方案的关键在于提出SeaLion模型，该模型采用语义部件感知的潜在点扩散技术，通过联合预测扰动潜在点的噪声及其相关的部件分割标签，在去噪过程中生成高质量且多样的点云，并基于部件分割标签解码为最终点云。

链接: https://arxiv.org/abs/2505.17721
作者: Dekai Zhu,Yan Di,Stefan Gavranovic,Slobodan Ilic
机构: Technical University of Munich (慕尼黑工业大学); Siemens AG (西门子集团); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Denoising diffusion probabilistic models have achieved significant success in point cloud generation, enabling numerous downstream applications, such as generative data augmentation and 3D model editing. However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. Specifically, we introduce the semantic part-aware latent point diffusion technique, which leverages the intermediate features of the generative models to jointly predict the noise for perturbed latent points and associated part segmentation labels during the denoising process, and subsequently decodes the latent points to point clouds conditioned on part segmentation labels. To effectively evaluate the quality of generated point clouds, we introduce a novel point cloud pairwise distance calculation method named part-aware Chamfer distance (p-CD). This method enables existing metrics, such as 1-NNA, to measure both the local structural quality and inter-part coherence of generated point clouds. Experiments on the large-scale synthetic dataset ShapeNet and real-world medical dataset IntrA demonstrate that SeaLion achieves remarkable performance in generation quality and diversity, outperforming the existing state-of-the-art model, DiffFacto, by 13.33% and 6.52% on 1-NNA (p-CD) across the two datasets. Experimental analysis shows that SeaLion can be trained semi-supervised, thereby reducing the demand for labeling efforts. Lastly, we validate the applicability of SeaLion in generative data augmentation for training segmentation models and the capability of SeaLion to serve as a tool for part-aware 3D shape editing.
zh

[CV-69] Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek

【速读】：该论文旨在解决工业产品设计中基于生成式AI (Generative AI) 的参数化CAD模型生成问题，特别是在不依赖模型微调的情况下实现高效、灵活的CAD生成。其解决方案的关键在于提出一种无需训练的方法，利用本地部署的开源推理大语言模型DeepSeek-R1，并结合视觉与思维链（Chain-of-Thought, CoT）反馈的自修正机制，以迭代优化生成的CAD模型。此外，论文还构建了一个基于SSR（Sketch, Sketch-based feature, and Refinements）三元组设计范式的3D CAD模型数据集，以支持工业应用场景下的模型生成与评估。

链接: https://arxiv.org/abs/2505.17702
作者: Xueyang Li,Jiahao Li,Yu Song,Yunzhong Lou,Xiangdong Zhou
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of Computer-Aided Design (CAD) generative modeling will significantly transform the design of industrial products. The recent research endeavor has extended into the realm of Large Language Models (LLMs). In contrast to fine-tuning methods, training-free approaches typically utilize the advanced closed-source LLMs, thereby offering enhanced flexibility and efficiency in the development of AI agents for generating CAD parametric models. However, the substantial cost and limitations of local deployment of the top-tier closed-source LLMs pose challenges in practical applications. The Seek-CAD is the pioneer exploration of locally deployed open-source inference LLM DeepSeek-R1 for CAD parametric model generation with a training-free methodology. This study is the first investigation to incorporate both visual and Chain-of-Thought (CoT) feedback within the self-refinement mechanism for generating CAD models. Specifically, the initial generated parametric CAD model is rendered into a sequence of step-wise perspective images, which are subsequently processed by a Vision Language Model (VLM) alongside the corresponding CoTs derived from DeepSeek-R1 to assess the CAD model generation. Then, the feedback is utilized by DeepSeek-R1 to refine the initial generated model for the next round of generation. Moreover, we present an innovative 3D CAD model dataset structured around the SSR (Sketch, Sketch-based feature, and Refinements) triple design paradigm. This dataset encompasses a wide range of CAD commands, thereby aligning effectively with industrial application requirements and proving suitable for the generation of LLMs. Extensive experiments validate the effectiveness of Seek-CAD under various metrics.
zh

[CV-70] SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data

【速读】：该论文试图解决现有Referring Expression Segmentation (RES)评估协议的局限性，即其主要关注单一目标的短查询或单一领域中不同查询的多个目标，无法有效评估模型在复杂推理能力上的表现。解决方案的关键在于引入WildRES基准，该基准包含具有多样属性的长查询和非区分性查询，覆盖自动驾驶和机器人操作等多领域，以更严格地评估真实场景中的复杂推理能力。同时，提出SynRES自动化流水线，通过三种创新方法生成密集配对的组合合成训练数据，包括基于密集标题的合成、可靠的语义对齐机制以及领域感知增强，从而提升模型在WildRES上的性能。

链接: https://arxiv.org/abs/2505.17695
作者: Dong-Hee Kim,Hyunjee Song,Donghyun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the advances in Referring Expression Segmentation (RES) benchmarks, their evaluation protocols remain constrained, primarily focusing on either single targets with short queries (containing minimal attributes) or multiple targets from distinctly different queries on a single domain. This limitation significantly hinders the assessment of more complex reasoning capabilities in RES models. We introduce WildRES, a novel benchmark that incorporates long queries with diverse attributes and non-distinctive queries for multiple targets. This benchmark spans diverse application domains, including autonomous driving environments and robotic manipulation scenarios, thus enabling more rigorous evaluation of complex reasoning capabilities in real-world settings. Our analysis reveals that current RES models demonstrate substantial performance deterioration when evaluated on WildRES. To address this challenge, we introduce SynRES, an automated pipeline generating densely paired compositional synthetic training data through three innovations: (1) a dense caption-driven synthesis for attribute-rich image-mask-expression triplets, (2) reliable semantic alignment mechanisms rectifying caption-pseudo mask inconsistencies via Image-Text Aligned Grouping, and (3) domain-aware augmentations incorporating mosaic composition and superclass replacement to emphasize generalization ability and distinguishing attributes over object categories. Experimental results demonstrate that models trained with SynRES achieve state-of-the-art performance, improving gIoU by 2.0% on WildRES-ID and 3.8% on WildRES-DS. Code and datasets are available at this https URL.
zh

[CV-71] ViP2-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection

【速读】：该论文旨在解决零样本异常检测（Zero-shot Anomaly Detection, ZSAD）中因缺乏目标领域训练样本而导致的检测性能受限问题，以及现有基于CLIP的方法在提示工程上的局限性，如手工设计提示的高成本和语义覆盖不足，或静态可学习提示在不同异常类型间的适应性差。其解决方案的关键在于引入ViP²-CLIP，核心创新是视觉感知提示（Visual-Perception Prompting, ViP-Prompt）机制，该机制通过融合全局与多尺度局部视觉上下文，自适应生成细粒度文本提示，从而消除人工模板和类别名称先验，使模型能够聚焦于精确的异常区域。

链接: https://arxiv.org/abs/2505.17692
作者: Ziteng Yang,Jingzehua Xu,Yanshu Li,Zepeng Li,Yeqiang Wang,Xinghui Li
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Brown University (布朗大学); Northwest A&F University (西北农林科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) aims to detect anomalies without any target domain training samples, relying solely on external auxiliary data. Existing CLIP-based methods attempt to activate the model’s ZSAD potential via handcrafted or static learnable prompts. The former incur high engineering costs and limited semantic coverage, whereas the latter apply identical descriptions across diverse anomaly types, thus fail to adapt to complex variations. Furthermore, since CLIP is originally pretrained on large-scale classification tasks, its anomaly segmentation quality is highly sensitive to the exact wording of class names, severely constraining prompting strategies that depend on class labels. To address these challenges, we introduce ViP ^2 -CLIP. The key insight of ViP ^2 -CLIP is a Visual-Perception Prompting (ViP-Prompt) mechanism, which fuses global and multi-scale local visual context to adaptively generate fine-grained textual prompts, eliminating manual templates and class-name priors. This design enables our model to focus on precise abnormal regions, making it particularly valuable when category labels are ambiguous or privacy-constrained. Extensive experiments on 15 industrial and medical benchmarks demonstrate that ViP ^2 -CLIP achieves state-of-the-art performance and robust cross-domain generalization.
zh

[CV-72] Semi-Supervised Medical Image Segmentation via Dual Networks

【速读】：该论文试图解决传统监督医学图像分割模型对大量标注数据的依赖问题，以及半监督分割模型中存在的伪标签噪声和特征空间监督不足的问题。其解决方案的关键在于提出一种创新的半监督3D医学图像分割方法，引入双网络架构以更好地利用上下文信息并生成可靠的伪标签，同时采用自监督对比学习策略来增强网络表示能力并减少预测不确定性。

链接: https://arxiv.org/abs/2505.17690
作者: Yunyao Lu,Yihang Wu,Reem Kateb,Ahmad Chaddad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in ISBI2025

点击查看摘要

Abstract:Traditional supervised medical image segmentation models require large amounts of labeled data for training; however, obtaining such large-scale labeled datasets in the real world is extremely challenging. Recent semi-supervised segmentation models also suffer from noisy pseudo-label issue and limited supervision in feature space. To solve these challenges, we propose an innovative semi-supervised 3D medical image segmentation method to reduce the dependency on large, expert-labeled datasets. Furthermore, we introduce a dual-network architecture to address the limitations of existing methods in using contextual information and generating reliable pseudo-labels. In addition, a self-supervised contrastive learning strategy is used to enhance the representation of the network and reduce prediction uncertainty by distinguishing between reliable and unreliable predictions. Experiments on clinical magnetic resonance imaging demonstrate that our approach outperforms state-of-the-art techniques. Our code is available at this https URL.
zh

[CV-73] FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

【速读】：该论文试图解决现有视觉语言模型（Visual Language Models, VLMs）在自动驾驶中因依赖离散文本思维链（Chain-of-Thought, CoT）而导致的时空关系模糊和细粒度信息丢失问题。其解决方案的关键在于提出一种时空思维链推理方法，使模型能够进行可视化思考。该方法通过VLM作为世界模型生成统一图像帧以预测未来世界状态，其中感知结果表示空间关系，未来帧表示时间演化关系，并以此作为中间推理步骤，使VLM能够基于当前观测和未来预测进行轨迹规划。

链接: https://arxiv.org/abs/2505.17685
作者: Shuang Zeng,Xinyuan Chang,Mengwei Xie,Xinran Liu,Yifan Bai,Zheng Pan,Mu Xu,Xing Wei
机构: Amap, Alibaba Group (Amap, 阿里巴巴集团); Xi’an Jiaotong University (西安交通大学); DAMO Academy, Alibaba Group (DAMO研究院, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual language models (VLMs) have attracted increasing interest in autonomous driving due to their powerful reasoning capabilities. However, existing VLMs typically utilize discrete text Chain-of-Thought (CoT) tailored to the current scenario, which essentially represents highly abstract and symbolic compression of visual information, potentially leading to spatio-temporal relationship ambiguity and fine-grained information loss. Is autonomous driving better modeled on real-world simulation and imagination than on pure symbolic logic? In this paper, we propose a spatio-temporal CoT reasoning method that enables models to think visually. First, VLM serves as a world model to generate unified image frame for predicting future world states: where perception results (e.g., lane divider and 3D detection) represent the future spatial relationships, and ordinary future frame represent the temporal evolution relationships. This spatio-temporal CoT then serves as intermediate reasoning steps, enabling the VLM to function as an inverse dynamics model for trajectory planning based on current observations and future predictions. To implement visual generation in VLMs, we propose a unified pretraining paradigm integrating visual generation and understanding, along with a progressive visual CoT enhancing autoregressive image generation. Extensive experimental results demonstrate the effectiveness of the proposed method, advancing autonomous driving towards visual reasoning.
zh

[CV-74] 5G-DIL: Domain Incremental Learning with Similarity-Aware Sampling for Dynamic 5G Indoor Localization

【速读】：该论文旨在解决5G室内定位中基于学习的方法在环境条件变化时性能显著下降的问题，这一问题限制了其在新场景中的适用性。解决方案的关键在于提出一种领域增量学习（Domain Incremental Learning, DIL）方法，即5G-DIL，该方法通过一种基于切比雪夫距离的相似性感知采样技术，高效地从先前环境中选择特定样本，并仅在新环境的修改区域进行训练，从而避免对整个区域进行训练，大幅减少适应所需的时间和资源，同时保持定位精度。

链接: https://arxiv.org/abs/2505.17684
作者: Nisha Lakshmana Raichur,Lucas Heublein,Christopher Mutschler,Felix Ott
机构: Fraunhofer Institute for Integrated Circuits IIS (弗劳恩霍夫集成电路研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures

点击查看摘要

Abstract:Indoor positioning based on 5G data has achieved high accuracy through the adoption of recent machine learning (ML) techniques. However, the performance of learning-based methods degrades significantly when environmental conditions change, thereby hindering their applicability to new scenarios. Acquiring new training data for each environmental change and fine-tuning ML models is both time-consuming and resource-intensive. This paper introduces a domain incremental learning (DIL) approach for dynamic 5G indoor localization, called 5G-DIL, enabling rapid adaptation to environmental changes. We present a novel similarity-aware sampling technique based on the Chebyshev distance, designed to efficiently select specific exemplars from the previous environment while training only on the modified regions of the new environment. This avoids the need to train on the entire region, significantly reducing the time and resources required for adaptation without compromising localization accuracy. This approach requires as few as 50 exemplars from adaptation domains, significantly reducing training time while maintaining high positioning accuracy in previous environments. Comparative evaluations against state-of-the-art DIL techniques on a challenging real-world indoor dataset demonstrate the effectiveness of the proposed sample selection method. Our approach is adaptable to real-world non-line-of-sight propagation scenarios and achieves an MAE positioning error of 0.261 meters, even under dynamic environmental conditions. Code: this https URL
zh

[CV-75] owards Dynamic 3D Reconstruction of Hand-Instrument Interaction in Ophthalmic Surgery

【速读】：该论文旨在解决眼科显微手术中手与器械的准确三维重建问题，这一问题在基于视觉的分析中至关重要，但长期以来受限于缺乏真实、大规模的数据集和可靠的标注工具。其解决方案的关键在于构建了OphNet-3D数据集，这是首个针对眼科手术的RGB-D动态三维重建数据集，包含41个序列、710万帧，并提供了细粒度的标注信息，如12个手术阶段、10类器械、密集的MANO手部网格以及完整的6自由度器械位姿。此外，设计了一种多阶段自动标注流程，结合多视角数据观测、数据驱动的运动先验、跨视角几何一致性、生物力学约束及碰撞感知的交互约束，以实现高保真标签的可扩展生成。基于该数据集，论文还提出了两个具有挑战性的基准任务及相应的专用架构，显著提升了手部与器械交互的重建性能。

链接: https://arxiv.org/abs/2505.17677
作者: Ming Hu,Zhendi Yu,Feilong Tang,Kaiwen Chen,Yulong Li,Imran Razzak,Junjun He,Tolga Birdal,Kaijing Zhou,Zongyuan Ge
机构: Monash University (莫纳什大学); Shanghai AI Laboratory (上海人工智能实验室); MBZUAI (MBZUAI); Imperial College London (帝国理工学院); Eye Hospital, Wenzhou Medical Univeristy (温州医科大学眼医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D reconstruction of hands and instruments is critical for vision-based analysis of ophthalmic microsurgery, yet progress has been hampered by the lack of realistic, large-scale datasets and reliable annotation tools. In this work, we introduce OphNet-3D, the first extensive RGB-D dynamic 3D reconstruction dataset for ophthalmic surgery, comprising 41 sequences from 40 surgeons and totaling 7.1 million frames, with fine-grained annotations of 12 surgical phases, 10 instrument categories, dense MANO hand meshes, and full 6-DoF instrument poses. To scalably produce high-fidelity labels, we design a multi-stage automatic annotation pipeline that integrates multi-view data observation, data-driven motion prior with cross-view geometric consistency and biomechanical constraints, along with a combination of collision-aware interaction constraints for instrument interactions. Building upon OphNet-3D, we establish two challenging benchmarks-bimanual hand pose estimation and hand-instrument interaction reconstruction-and propose two dedicated architectures: H-Net for dual-hand mesh recovery and OH-Net for joint reconstruction of two-hand-two-instrument interactions. These models leverage a novel spatial reasoning module with weak-perspective camera modeling and collision-aware center-based representation. Both architectures outperform existing methods by substantial margins, achieving improvements of over 2mm in Mean Per Joint Position Error (MPJPE) and up to 23% in ADD-S metrics for hand and instrument reconstruction, respectively.
zh

[CV-76] SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

【速读】：该论文试图解决传统脉冲神经网络（Spiking Neural Networks, SNNs）在性能上与人工神经网络（Artificial Neural Networks, ANNs）存在显著差距的问题，特别是在开放世界中的3D理解能力不足，包括泛化能力受限、任务特异性高以及多模态理解缺乏。其解决方案的关键在于提出一种基于脉冲的视觉-语言预训练框架（Spike-based Vision-Language, SVL），该框架通过引入两个核心组件：多尺度三元组对齐（Multi-scale Triple Alignment, MTA）以实现跨3D、图像和文本模态的无标签三元组对比学习，以及可重参数化的视觉-语言融合（Re-parameterizable Vision-Language Integration, Rep-VLI）以实现轻量级推理，从而提升SNNs的效率与多模态理解能力。

链接: https://arxiv.org/abs/2505.17674
作者: Xuerui Qiu,Peixi Wu,Yaozhi Wen,Shaowei Gu,Yuqi Pan,Xinhao Luo,Bo XU,Guoqi Li
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Future Technology, University of Chinese Academy of Sciences (中国科学院大学未来技术学院); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available this https URL.
zh

[CV-77] Proto-FG3D: Prototype-based Interpretable Fine-Grained 3D Shape Classification BMVC2025

【速读】：该论文旨在解决细粒度3D形状分类中存在的挑战，包括多视角特征聚合过程中捕获的区分信息有限、类间细微差异难以识别、类别不平衡以及参数化模型固有的可解释性限制等问题。其解决方案的关键在于提出首个基于原型（Prototype）的框架Proto-FG3D，实现了从参数化softmax到非参数化原型学习的范式转变，通过原型关联建立多视角与多类别联合表示学习，利用在线聚类优化原型并提升多视角特征分配的鲁棒性和子类平衡性，最终通过原型引导的监督学习增强细粒度区分能力，并借助原型-视角相关性分析实现透明的案例推理解释。

链接: https://arxiv.org/abs/2505.17666
作者: Shuxian Ma,Zihao Dong,Runmin Cong,Sam Kwong,Xiuli Shao
机构: University of Jinan(济南大学); Shandong University(山东大学); Lingnan University(岭南大学); Nankai University(南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures, 5 tablets; Submitted to BMVC2025

点击查看摘要

Abstract:Deep learning-based multi-view coarse-grained 3D shape classification has achieved remarkable success over the past decade, leveraging the powerful feature learning capabilities of CNN-based and ViT-based backbones. However, as a challenging research area critical for detailed shape understanding, fine-grained 3D classification remains understudied due to the limited discriminative information captured during multi-view feature aggregation, particularly for subtle inter-class variations, class imbalance, and inherent interpretability limitations of parametric model. To address these problems, we propose the first prototype-based framework named Proto-FG3D for fine-grained 3D shape classification, achieving a paradigm shift from parametric softmax to non-parametric prototype learning. Firstly, Proto-FG3D establishes joint multi-view and multi-category representation learning via Prototype Association. Secondly, prototypes are refined via Online Clustering, improving both the robustness of multi-view feature allocation and inter-subclass balance. Finally, prototype-guided supervised learning is established to enhance fine-grained discrimination via prototype-view correlation analysis and enables ad-hoc interpretability through transparent case-based reasoning. Experiments on FG3D and ModelNet40 show Proto-FG3D surpasses state-of-the-art methods in accuracy, transparent predictions, and ad-hoc interpretability with visualizations, challenging conventional fine-grained 3D recognition approaches.
zh

[CV-78] EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy

【速读】：该论文旨在解决高分辨率遥感（High-Resolution Remote Sensing, HRRS）图像分割中由于复杂的空间布局和多样的目标外观所带来的挑战。传统方法在处理长距离依赖关系或局部细节时存在局限，而该论文提出的Region-Aware Proxy Network (RAPNet) 通过两个关键组件：Contextual Region Attention (CRA) 和 Global Class Refinement (GCR)，实现了更灵活且准确的分割。CRA 模块利用 Transformer 捕捉区域级别的上下文依赖关系，生成语义区域掩码（Semantic Region Mask），而 GCR 模块则通过全局类别注意力图对多类别信息进行精炼，从而提升分割精度。

链接: https://arxiv.org/abs/2505.17665
作者: Yichun Yu,Yuqing Lan,Zhihuan Xing,Xiaoyi Yang,Tingyue Tang,Dan Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Proceedings of the 20th International Conference on Intelligent Computing (ICIC 2024): Poster Volume I. Tianjin, China, 2024: 538-562

点击查看摘要

Abstract:High-resolution remote sensing (HRRS) image segmentation is challenging due to complex spatial layouts and diverse object appearances. While CNNs excel at capturing local features, they struggle with long-range dependencies, whereas Transformers can model global context but often neglect local details and are computationally this http URL propose a novel approach, Region-Aware Proxy Network (RAPNet), which consists of two components: Contextual Region Attention (CRA) and Global Class Refinement (GCR). Unlike traditional methods that rely on grid-based layouts, RAPNet operates at the region level for more flexible segmentation. The CRA module uses a Transformer to capture region-level contextual dependencies, generating a Semantic Region Mask (SRM). The GCR module learns a global class attention map to refine multi-class information, combining the SRM and attention map for accurate this http URL on three public datasets show that RAPNet outperforms state-of-the-art methods, achieving superior multi-class segmentation accuracy.
zh

[CV-79] Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

【速读】：该论文旨在解决自动驾驶系统中轨迹规划的安全性和可行性问题，现有基于学习的规划方法通常依赖于专家示范数据，缺乏显式的安全意识，并可能继承不安全行为。其解决方案的关键在于提出一种两阶段的轨迹规划框架Plan-R1，将轨迹规划建模为序列预测任务，并通过显式的规划原则（如安全性、舒适性和交通规则合规性）进行指导，第一阶段通过专家数据训练自回归轨迹预测器，第二阶段设计基于规则的奖励并使用Group Relative Policy Optimization（GRPO）进行微调，以使模型预测与规划原则对齐。

链接: https://arxiv.org/abs/2505.17659
作者: Xiaolong Tang,Meina Kan,Shiguang Shan,Xilin Chen
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safe and feasible trajectory planning is essential for real-world autonomous driving systems. However, existing learning-based planning methods often rely on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting unsafe behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a novel two-stage trajectory planning framework that formulates trajectory planning as a sequential prediction task, guided by explicit planning principles such as safety, comfort, and traffic rule compliance. In the first stage, we train an autoregressive trajectory predictor via next motion token prediction on expert data. In the second stage, we design rule-based rewards (e.g., collision avoidance, speed limits) and fine-tune the model using Group Relative Policy Optimization (GRPO), a reinforcement learning strategy, to align its predictions with these planning principles. Experiments on the nuPlan benchmark demonstrate that our Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance.
zh

[CV-80] Instruct2See: Learning to Remove Any Obstructions Across Distributions

【速读】：该论文旨在解决图像中因各种障碍物（obstruction）导致的视觉信息缺失问题，现有方法通常仅针对特定类型的障碍物（如栅栏或雨滴）进行处理，难以应对现实世界中多样化的障碍物场景。其解决方案的关键在于提出一种名为Instruct2See的零样本（zero-shot）框架，通过将障碍物去除统一建模为软硬掩码恢复问题，利用多模态提示（如视觉语义和文本指令）并通过交叉注意力单元增强上下文理解与模式控制，从而实现对已见和未见障碍物的有效处理。

链接: https://arxiv.org/abs/2505.17649
作者: Junhang Li,Yu Guo,Chuhua Xian,Shengfeng He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at this https URL.
zh

[CV-81] CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment

【速读】：该论文旨在解决合成X-ray血管造影图像质量评估（Image Quality Assessment, IQA）中存在的不足，即现有IQA模型无法利用辅助图像作为参考，并且缺乏细粒度、任务相关的评价指标，导致临床相关性不足。其解决方案的关键在于提出一种基于视觉-语言模型（Vision-Language Model, VLM）的框架CAS-IQA，该框架通过有效整合相关图像的辅助信息，预测细粒度的质量评分，并引入多路径特征融合与路由（Multi-path Feature Fusion and Routing, MUST）模块以增强图像表示，从而提升评估的准确性和临床适用性。

链接: https://arxiv.org/abs/2505.17619
作者: Bo Wang,De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Nu-Fang Xiao,Jian-Long Hao,Ming-Yuan Liu,Zeng-Guang Hou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Synthetic X-ray angiographies generated by modern generative models hold great potential to reduce the use of contrast agents in vascular interventional procedures. However, low-quality synthetic angiographies can significantly increase procedural risk, underscoring the need for reliable image quality assessment (IQA) methods. Existing IQA models, however, fail to leverage auxiliary images as references during evaluation and lack fine-grained, task-specific metrics necessary for clinical relevance. To address these limitations, this paper proposes CAS-IQA, a vision-language model (VLM)-based framework that predicts fine-grained quality scores by effectively incorporating auxiliary information from related images. In the absence of angiography datasets, CAS-3K is constructed, comprising 3,565 synthetic angiographies along with score annotations. To ensure clinically meaningful assessment, three task-specific evaluation metrics are defined. Furthermore, a Multi-path featUre fuSion and rouTing (MUST) module is designed to enhance image representations by adaptively fusing and routing visual tokens to metric-specific branches. Extensive experiments on the CAS-3K dataset demonstrate that CAS-IQA significantly outperforms state-of-the-art IQA methods by a considerable margin.
zh

[CV-82] Scaling Image and Video Generation via Test-Time Evolutionary Search

【速读】：该论文旨在解决生成式模型在测试阶段扩展（Test-Time Scaling, TTS）中的性能提升问题，特别是在图像和视频生成模型（如基于扩散或流的模型）中缺乏对测试阶段扩展行为的深入理解与有效方法。现有方法在任务特定领域、可扩展性或样本多样性方面存在显著限制。论文提出的解决方案关键在于引入一种名为Evolutionary Search (EvoSearch) 的新型通用且高效的TTS方法，其核心是将测试阶段扩展重新建模为进化搜索问题，利用生物进化原理高效探索和优化去噪轨迹，通过设计针对性的选择与变异机制，在保持种群多样性的同时迭代生成更高质量的样本。

链接: https://arxiv.org/abs/2505.17618
作者: Haoran He,Jiajun Liang,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Ling Pan
机构: Hong Kong University of Science and Technology (香港科技大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 37 pages. Project: this https URL

点击查看摘要

Abstract:As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose \textbfEvolutionary \textbfSearch (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website this https URL.
zh

[CV-83] PathoSCOPE: Few-Shot Pathology Detection via Self-Supervised Contrastive Learning and Pathology-Informed Synthetic Embeddings

【速读】：该论文旨在解决无监督病理检测中由于医院数据偏向有症状人群且隐私限制导致难以构建可靠正常性模型的问题。其解决方案的关键在于提出PathoSCOPE框架，该框架通过少量非病理样本（最少2 shot）实现高效的数据利用，并引入全局-局部对比损失（Global-Local Contrastive Loss, GLCL）和病理信息嵌入生成（Pathology-informed Embedding Generation, PiEG）模块，以减少非病理嵌入的变异性并增强病理区域的区分能力。

链接: https://arxiv.org/abs/2505.17614
作者: Sinchee Chin,Yinuo Ma,Xiaochen Yang,Jing-Hao Xue,Wenming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised pathology detection trains models on non-pathological data to flag deviations as pathologies, offering strong generalizability for identifying novel diseases and avoiding costly annotations. However, building reliable normality models requires vast healthy datasets, as hospitals’ data is inherently biased toward symptomatic populations, while privacy regulations hinder the assembly of representative healthy cohorts. To address this limitation, we propose PathoSCOPE, a few-shot unsupervised pathology detection framework that requires only a small set of non-pathological samples (minimum 2 shots), significantly improving data efficiency. We introduce Global-Local Contrastive Loss (GLCL), comprised of a Local Contrastive Loss to reduce the variability of non-pathological embeddings and a Global Contrastive Loss to enhance the discrimination of pathological regions. We also propose a Pathology-informed Embedding Generation (PiEG) module that synthesizes pathological embeddings guided by the global loss, better exploiting the limited non-pathological samples. Evaluated on the BraTS2020 and ChestXray8 datasets, PathoSCOPE achieves state-of-the-art performance among unsupervised methods while maintaining computational efficiency (2.48 GFLOPs, 166 FPS).
zh

[CV-84] MinkUNeXt-SI: Improving point cloud-based place recognition including spherical coordinates and LiDAR intensity

【速读】：该论文旨在解决自主导航系统中的场景识别（place recognition）问题，该问题对于系统的安全运行至关重要。由于场景可能因季节变化和不同天气条件而发生显著改变，且需要在不同环境中具备泛化能力，因此该问题具有较高的挑战性。论文提出的解决方案MinkUNeXt-SI的关键在于从LiDAR点云数据出发，通过预处理获取球面坐标和归一化强度值，并采用结合Minkowski卷积与带有跳跃连接的U-net架构的深度学习方法，从而生成鲁棒的场景识别描述符。

链接: https://arxiv.org/abs/2505.17591
作者: Judith Vilella-Cantos,Juan José Cabrera,Luis Payá,Mónica Ballesta,David Valiente
机构: UMH(穆尔西亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In autonomous navigation systems, the solution of the place recognition problem is crucial for their safe functioning. But this is not a trivial solution, since it must be accurate regardless of any changes in the scene, such as seasonal changes and different weather conditions, and it must be generalizable to other environments. This paper presents our method, MinkUNeXt-SI, which, starting from a LiDAR point cloud, preprocesses the input data to obtain its spherical coordinates and intensity values normalized within a range of 0 to 1 for each point, and it produces a robust place recognition descriptor. To that end, a deep learning approach that combines Minkowski convolutions and a U-net architecture with skip connections is used. The results of MinkUNeXt-SI demonstrate that this method reaches and surpasses state-of-the-art performance while it also generalizes satisfactorily to other datasets. Additionally, we showcase the capture of a custom dataset and its use in evaluating our solution, which also achieves outstanding results. Both the code of our solution and the runs of our dataset are publicly available for reproducibility purposes.
zh

[CV-85] CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis

【速读】：该论文旨在解决基于3D高斯点云（3D Gaussian Splatting）的生成式对抗网络（Generative Adversarial Network, GAN）在高质量人类头部合成中面临的3D一致性不足与视图依赖性问题。现有方法通过将随机潜在向量条件化于当前相机位置来稳定训练并提升视角渲染质量，但这导致了在不同视角重新合成时身份信息显著变化的问题；而固定相机视角虽能获得高质量单一视角渲染，却无法支持新视角的生成。为解决这些问题，论文提出了CGS-GAN框架，其关键在于引入多视角正则化技术以增强生成器收敛性，并设计了一种新型生成器架构，能够在不依赖视图条件的情况下实现稳定训练与高质量3D一致性的头部合成，同时支持高达2048²的输出分辨率。

链接: https://arxiv.org/abs/2505.17590
作者: Florian Barthel,Wieland Morgenstern,Paul Hinzer,Anna Hilsmann,Peter Eisert
机构: Fraunhofer HHI (弗劳恩霍夫研究所); Humboldt University (洪堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main paper 12 pages, supplementary materials 8 pages

点击查看摘要

Abstract:Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to 2048^2 . To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check our our project page here: this https URL
zh

[CV-86] MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery

【速读】：该论文旨在解决恶劣天气导致的图像退化问题，尤其是由于天气引起的退化特征具有高度非均匀性和空间异质性的挑战，例如细粒度的雨痕与大范围的雾霾。解决方案的关键在于提出了一种Morton-Order Degradation Estimation Mechanism (MODEM)，其核心是Morton-Order 2D-Selective-Scan Module (MOS2D)，该模块结合了Morton编码的空间排序与选择性状态空间模型，以捕捉长距离依赖关系同时保持局部结构一致性。此外，还引入了Dual Degradation Estimation Module (DDEM) 来解耦并估计全局和局部退化先验，这些先验动态地对MOS2D模块进行条件约束，从而实现自适应和上下文感知的修复。

链接: https://arxiv.org/abs/2505.17581
作者: Hainuo Wang,Qiming Hu,Xiaojie Guo
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Restoring images degraded by adverse weather remains a significant challenge due to the highly non-uniform and spatially heterogeneous nature of weather-induced artifacts, e.g., fine-grained rain streaks versus widespread haze. Accurately estimating the underlying degradation can intuitively provide restoration models with more targeted and effective guidance, enabling adaptive processing strategies. To this end, we propose a Morton-Order Degradation Estimation Mechanism (MODEM) for adverse weather image restoration. Central to MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates Morton-coded spatial ordering with selective state-space models to capture long-range dependencies while preserving local structural coherence. Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM) that disentangles and estimates both global and local degradation priors. These priors dynamically condition the MOS2D modules, facilitating adaptive and context-aware restoration. Extensive experiments and ablation studies demonstrate that MODEM achieves state-of-the-art results across multiple benchmarks and weather types, highlighting its effectiveness in modeling complex degradation dynamics. Our code will be released at this https URL.
zh

[CV-87] InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

【速读】：该论文旨在解决生成长时序、跨场景视频的挑战，尤其是在自回归模型中随着上下文长度增加导致的计算成本上升和内容一致性下降问题。其解决方案的关键在于提出InfLVG框架，该框架通过可学习的上下文选择策略（基于Group Relative Policy Optimization优化）动态筛选并保留最语义相关的上下文信息，而非累积全部生成历史，从而在固定计算预算下保持内容一致性和文本提示的对齐。

链接: https://arxiv.org/abs/2505.17574
作者: Xueji Fang,Liyuan Ma,Zhiyang Chen,Mingyuan Zhou,Guo-jun Qi
机构: Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model’s ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top- K most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9 \times , achieving strong consistency and semantic fidelity across scenes. Our code is available at this https URL.
zh

[CV-88] Enhancing Fourier-based Doppler Resolution with Diffusion Models

【速读】：该论文旨在解决雷达系统中多普勒维度分辨率不足的问题，这一问题限制了对慢速移动目标的检测能力，因为目标与杂波或静止物体之间的区分不够明显。解决方案的关键在于利用生成式 AI (Generative AI) 技术，通过基于零填充快速傅里叶变换（zero-padded FFT）的数据进行细化，采用扩散模型的生成神经网络提升分辨率，从而有效分离距离相近的目标。

链接: https://arxiv.org/abs/2505.17567
作者: Denisa Qosja,Kilian Barth,Simon Wagner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Published at International Radar Symposium (IRS) 2025

点击查看摘要

Abstract:In radar systems, high resolution in the Doppler dimension is important for detecting slow-moving targets as it allows for more distinct separation between these targets and clutter, or stationary objects. However, achieving sufficient resolution is constrained by hardware capabilities and physical factors, leading to the development of processing techniques to enhance the resolution after acquisition. In this work, we leverage artificial intelligence to increase the Doppler resolution in range-Doppler maps. Based on a zero-padded FFT, a refinement via the generative neural networks of diffusion models is achieved. We demonstrate that our method overcomes the limitations of traditional FFT, generating data where closely spaced targets are effectively separated.
zh

[CV-89] Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

【速读】：该论文试图解决视频扩散模型中初始噪声选择对生成质量与提示对齐度的影响问题，即相同提示下不同噪声种子可能导致显著不同的生成结果。解决方案的关键在于提出ANSE（Active Noise Selection for Generation）框架，其核心是BANSA（Bayesian Active Noise Selection via Attention）算法，通过量化基于注意力的不确定性来选择高质量的噪声种子，从而提升视频质量和时间一致性。

链接: https://arxiv.org/abs/2505.17561
作者: Kwanyoung Kim,Sanghyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 10 figures

点击查看摘要

Abstract:The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: this https URL
zh

[CV-90] Deeper Diffusion Models Amplify Bias

【速读】：该论文试图解决生成式扩散模型（Generative Diffusion Models, DMs）内部工作机制不明确的问题，特别是其在偏差-方差权衡（bias-variance tradeoff）方面的表现。研究表明，扩散模型可能在极端情况下放大训练数据中的固有偏差，或损害训练样本的隐私性。解决方案的关键在于引入一种无需训练的方法，通过在去噪过程中部分绕过中间块的贡献，逐步鼓励生成过程中的临时高方差，从而提升文本到图像和图像到图像生成的输出质量。该方法在理论和实验上均得到了验证。

链接: https://arxiv.org/abs/2505.17560
作者: Shahin Hakemi,Naveed Akhtar,Ghulam Mubashar Hassan,Ajmal Mian
机构: The University of Western Australia (西澳大学); The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the impressive performance of generative Diffusion Models (DMs), their internal working is still not well understood, which is potentially problematic. This paper focuses on exploring the important notion of bias-variance tradeoff in diffusion models. Providing a systematic foundation for this exploration, it establishes that at one extreme the diffusion models may amplify the inherent bias in the training data and, on the other, they may compromise the presumed privacy of the training samples. Our exploration aligns with the memorization-generalization understanding of the generative models, but it also expands further along this spectrum beyond ``generalization’', revealing the risk of bias amplification in deeper models. Building on the insights, we also introduce a training-free method to improve output quality in text-to-image and image-to-image generation. By progressively encouraging temporary high variance in the generation process with partial bypassing of the mid-block’s contribution in the denoising process of DMs, our method consistently improves generative image quality with zero training cost. Our claims are validated both theoretically and empirically.
zh

[CV-91] Wildfire spread forecasting with Deep Learning

【速读】：该论文旨在解决野火蔓延预测的准确性问题，这对于有效的风险管理和应急响应至关重要。其解决方案的关键在于提出一种基于深度学习（DL）的框架，利用点火时可用的数据来预测燃烧区域的最终范围。该框架通过整合时空数据集，包括遥感数据、气象观测、植被图、土地覆盖分类、人为因素、地形数据和热异常信息，以提升预测性能。研究重点在于评估时间上下文的影响，并通过消融实验验证多日观测数据对模型性能的提升作用，结果显示，包含点火前四天至点火后五天的时间窗口显著提高了F1分数和交并比。

链接: https://arxiv.org/abs/2505.17556
作者: Nikolaos Anastasiou,Spyros Kondylatos,Ioannis Papoutsis
机构: Orion Lab, National Observatory of Athens & National Technical University of Athens; Image Processing Laboratory (IPL), Universitat de València; Archimedes, Athena Research Center
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures

点击查看摘要

Abstract:Accurate prediction of wildfire spread is crucial for effective risk management, emergency response, and strategic resource allocation. In this study, we present a deep learning (DL)-based framework for forecasting the final extent of burned areas, using data available at the time of ignition. We leverage a spatio-temporal dataset that covers the Mediterranean region from 2006 to 2022, incorporating remote sensing data, meteorological observations, vegetation maps, land cover classifications, anthropogenic factors, topography data, and thermal anomalies. To evaluate the influence of temporal context, we conduct an ablation study examining how the inclusion of pre- and post-ignition data affects model performance, benchmarking the temporal-aware DL models against a baseline trained exclusively on ignition-day inputs. Our results indicate that multi-day observational data substantially improve predictive accuracy. Particularly, the best-performing model, incorporating a temporal window of four days before to five days after ignition, improves both the F1 score and the Intersection over Union by almost 5% in comparison to the baseline on the test dataset. We publicly release our dataset and models to enhance research into data-driven approaches for wildfire modeling and response.
zh

[CV-92] ProTAL: A Drag -and-Link Video Programming Framework for Temporal Action Localization

【速读】：该论文旨在解决时序动作定位（Temporal Action Localization, TAL）中模型训练依赖大量人工标注数据的问题。传统方法在TAL任务中需要大量手动标注的开始和结束时间戳，而数据编程是一种通过定义一系列人工标签函数来高效生成训练标签的方法，但在视频时序动作的复杂上下文中难以应用。论文提出的解决方案是ProTAL，其关键在于引入了一种拖拽与链接的视频编程框架，允许用户通过拖动表示身体部位和物体的节点并进行关系约束（如方向、距离等）来定义关键事件，从而为大规模未标注视频生成动作标签，并结合半监督方法训练TAL模型。

链接: https://arxiv.org/abs/2505.17555
作者: Yuchen He,Jianbing Lv,Liqi Cheng,Lingyu Meng,Dazhen Deng,Yingcai Wu
机构: Zhejiang University (浙江大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CHI’25

点击查看摘要

Abstract:Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties of defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define \textbfkey events by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming framework.
zh

[CV-93] Center-aware Residual Anomaly Synthesis for Multi-class Industrial Anomaly Detection

【速读】：该论文旨在解决多类别异常检测中因类别间干扰导致的漏检问题以及类别内正常与异常样本重叠引起的过检问题。其解决方案的关键在于提出一种名为Center-aware Residual Anomaly Synthesis (CRAS)的方法，该方法通过中心感知的残差学习将不同类别的样本耦合到统一中心，以减轻类别间干扰；同时引入基于距离引导的异常合成机制，根据正常数据分布自适应调整噪声方差，从而减少类别内重叠。

链接: https://arxiv.org/abs/2505.17551
作者: Qiyu Chen,Huiyuan Luo,Haiming Yao,Wei Luo,Zhen Qu,Chengkan Lv,Zhengtao Zhang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); State Key Laboratory of Precision Measurement Technology and Instruments (精密测量技术与仪器国家重点实验室); Department of Precision Instrument, Tsinghua University (精密仪器系，清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Industrial Informatics (TII)

点击查看摘要

Abstract:Anomaly detection plays a vital role in the inspection of industrial images. Most existing methods require separate models for each category, resulting in multiplied deployment costs. This highlights the challenge of developing a unified model for multi-class anomaly detection. However, the significant increase in inter-class interference leads to severe missed detections. Furthermore, the intra-class overlap between normal and abnormal samples, particularly in synthesis-based methods, cannot be ignored and may lead to over-detection. To tackle these issues, we propose a novel Center-aware Residual Anomaly Synthesis (CRAS) method for multi-class anomaly detection. CRAS leverages center-aware residual learning to couple samples from different categories into a unified center, mitigating the effects of inter-class interference. To further reduce intra-class overlap, CRAS introduces distance-guided anomaly synthesis that adaptively adjusts noise variance based on normal data distribution. Experimental results on diverse datasets and real-world industrial applications demonstrate the superior detection accuracy and competitive inference speed of CRAS. The source code and the newly constructed dataset are publicly available at this https URL.
zh

[CV-94] 2VUnlearning: A Concept Erasing Method for Text-to-Video Diffusion Models

【速读】：该论文试图解决文本到视频（T2V）扩散模型在生成显性或有害内容方面的潜在滥用和权利侵犯问题。其解决方案的关键在于将去学习（unlearning）技术扩展至T2V模型，并提出一种鲁棒且精确的去学习方法，核心包括负向引导速度预测微调与提示增强，以确保对大型语言模型优化后的提示具有鲁棒性，同时通过定位和保留正则化来保持模型生成非目标概念的能力。

链接: https://arxiv.org/abs/2505.17550
作者: Xiaoyu Ye,Songjie Cheng,Yongtao Wang,Yajiao Xiong,Yishen Li
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in text-to-video (T2V) diffusion models have significantly enhanced the quality of generated videos. However, their ability to produce explicit or harmful content raises concerns about misuse and potential rights violations. Inspired by the success of unlearning techniques in erasing undesirable concepts from text-to-image (T2I) models, we extend unlearning to T2V models and propose a robust and precise unlearning method. Specifically, we adopt negatively-guided velocity prediction fine-tuning and enhance it with prompt augmentation to ensure robustness against LLM-refined prompts. To achieve precise unlearning, we incorporate a localization and a preservation regularization to preserve the model’s ability to generate non-target concepts. Extensive experiments demonstrate that our method effectively erases a specific concept while preserving the model’s generation capability for all other concepts, outperforming existing methods. We provide the unlearned models in \hrefthis https URLthis https URL.
zh

[CV-95] RePrompt: Reasoning -Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning MICRO

【速读】：该论文试图解决文本到图像（text-to-image, T2I）生成模型在处理简短且描述不充分的提示时，难以准确捕捉用户意图的问题。现有方法依赖于大语言模型（large language models, LLMs）增强提示，但常因缺乏视觉语义和现实世界构图的约束而生成风格化或不真实的内容。解决方案的关键在于提出一种名为RePrompt的新颖重提示框架，通过强化学习将显式推理引入提示增强过程，训练语言模型生成结构化、自我反思的提示，以优化图像级结果。该方法利用定制的奖励模型从人类偏好、语义对齐和视觉构图角度评估生成图像，从而间接监督提示生成，实现端到端训练而无需人工标注数据。

链接: https://arxiv.org/abs/2505.17540
作者: Mingrui Wu,Lu Wang,Pu Zhao,Fangkai Yang,Jianjin Zhang,Jianfeng Liu,Yuefeng Zhan,Weihao Han,Hao Sun,Jiayi Ji,Xiaoshuai Sun,Qingwei Lin,Weiwei Deng,Dongmei Zhang,Feng Sun,Qi Zhang,Rongrong Ji
机构: Xiamen University (厦门大学); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intentions from short and under-specified prompts. While prior work has attempted to enhance prompts using large language models (LLMs), these methods frequently generate stylistic or unrealistic content due to insufficient grounding in visual semantics and real-world composition. Inspired by recent advances in reasoning for language model, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. The tailored reward models assesse the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision to refine prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-Compbench show that RePrompt significantly boosts spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
zh

[CV-96] Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

【速读】：该论文试图解决大型视觉语言模型（Large Vision-Language Models, LVLMs）中存在的对象幻觉（object hallucination）问题，即模型生成的描述不准确地反映视觉内容，包括引入不存在的对象或错误表示现有对象。解决方案的关键在于提出一种名为集成解码（Ensemble Decoding, ED）的新策略，该策略通过将输入图像分割为子图像，并利用注意力图分配权重来组合logit分布，从而提升生成结果的准确性。此外，还引入了ED自适应合理性约束以校准logit分布，并设计了适用于速度敏感场景的FastED变体。

链接: https://arxiv.org/abs/2505.17529
作者: Yeongjae Cho,Keonwoo Kim,Taebaek Hwang,Sungzoon Cho
机构: Seoul National University (首尔大学); Kim & Chang AI&IT System Center (金与张人工智能与信息技术中心); Waddle Corporation (瓦德尔公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.
zh

[CV-97] Enhancing Adversarial Robustness of Vision Language Models via Adversarial Mixture Prompt Tuning

【速读】：该论文试图解决大型预训练视觉语言模型（Vision Language Models, VLMs）在面对对抗样本时鲁棒性不足的问题，这一问题可能导致潜在的安全风险。解决方案的关键在于提出一种名为对抗混合提示微调（Adversarial Mixture Prompt Tuning, AMPT）的方法，通过学习混合文本提示并结合基于输入对抗图像的条件权重路由器，提升模型对多种对抗攻击的泛化能力，从而获得更鲁棒的文本特征。

链接: https://arxiv.org/abs/2505.17509
作者: Shiji Zhao,Qihui Zhu,Shukun Xiong,Shouwei Ruan,Yize Fan,Ranjie Duan,Qing Guo,Xingxing Wei
机构: Institute of Artificial Intelligence, Beihang University (北京航空航天大学人工智能研究所); Shenyuan College, Beihang University (北京航空航天大学沈元学院); Center for Frontier AI Research, A*STAR (新加坡科技研究局前沿人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large pre-trained Vision Language Models (VLMs) have excellent generalization capabilities but are highly susceptible to adversarial examples, presenting potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which finally leads to the overfitting phenomenon. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts can bring more robustness improvement than a longer prompt. Then we propose an adversarial tuning method named Adversarial Mixture Prompt Tuning (AMPT) to enhance the generalization towards various adversarial attacks for VLMs. AMPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the input adversarial image to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific aggregated text features aligning with different adversarial image features. A series of experiments show that our method can achieve better adversarial robustness than state-of-the-art methods on 11 datasets under different experimental settings.
zh

[CV-98] RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition

【速读】：该论文旨在解决不完整多模态情感识别（Incomplete Multimodal Emotion Recognition, IMER）问题，即在实际应用中由于噪声或传感器故障导致多模态数据缺失或损坏，从而影响情感识别性能的问题。解决方案的关键在于提出一种名为鲁棒混合扩散恢复（Robust Hybrid Diffusion Recovery, RoHyDR）的框架，该框架通过结合基于扩散的生成器和对抗学习机制，在单模态、多模态、特征和语义层面实现缺失模态的恢复，从而有效缓解因优化不理想导致的性能下降。

链接: https://arxiv.org/abs/2505.17501
作者: Yuehan Jin,Xiaoqing Liu,Yiyuan Yang,Zhiwen Yu,Tong Zhang,Kaixiang Yang
机构: South China University of Technology (华南理工大学); Pengcheng Laboratory (鹏城实验室); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal emotion recognition analyzes emotions by combining data from multiple sources. However, real-world noise or sensor failures often cause missing or corrupted data, creating the Incomplete Multimodal Emotion Recognition (IMER) challenge. In this paper, we propose Robust Hybrid Diffusion Recovery (RoHyDR), a novel framework that performs missing-modality recovery at unimodal, multimodal, feature, and semantic levels. For unimodal representation recovery of missing modalities, RoHyDR exploits a diffusion-based generator to generate distribution-consistent and semantically aligned representations from Gaussian noise, using available modalities as conditioning. For multimodal fusion recovery, we introduce adversarial learning to produce a realistic fused multimodal representation and recover missing semantic content. We further propose a multi-stage optimization strategy that enhances training stability and efficiency. In contrast to previous work, the hybrid diffusion and adversarial learning-based recovery mechanism in RoHyDR allows recovery of missing information in both unimodal representation and multimodal fusion, at both feature and semantic levels, effectively mitigating performance degradation caused by suboptimal optimization. Comprehensive experiments conducted on two widely used multimodal emotion recognition benchmarks demonstrate that our proposed method outperforms state-of-the-art IMER methods, achieving robust recognition performance under various missing-modality scenarios. Our code will be made publicly available upon acceptance.
zh

[CV-99] Research on Defect Detection Method of Motor Control Board Based on Image Processing

【速读】：该论文旨在解决电机控制板（motor control board）缺陷检测中存在的问题，如颜色差异不一致、插件位置错误、焊点短路等，这些问题直接影响电机控制板的性能和稳定性，进而对产品质量产生负面影响。解决方案的关键在于建立基于图像处理的缺陷检测模型，通过研究数字图像处理方法、分析影响图像特征提取的降噪技术、构建用于缺陷特征提取和颜色差异识别的具体模型，并优化缺陷图像的搜索算法，最终实现高精度的缺陷检测，实验结果表明该模型的检测准确率超过99%。

链接: https://arxiv.org/abs/2505.17493
作者: Jingde Huang,Zhangyu Huang,Chenyu Li,Jiantong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The motor control board has various defects such as inconsistent color differences, incorrect plug-in positions, solder short circuits, and more. These defects directly affect the performance and stability of the motor control board, thereby having a negative impact on product quality. Therefore, studying the defect detection technology of the motor control board is an important means to improve the quality control level of the motor control board. Firstly, the processing methods of digital images about the motor control board were studied, and the noise suppression methods that affect image feature extraction were analyzed. Secondly, a specific model for defect feature extraction and color difference recognition of the tested motor control board was established, and qualified or defective products were determined based on feature thresholds. Thirdly, the search algorithm for defective images was optimized. Finally, comparative experiments were conducted on the typical motor control board, and the experimental results demonstrate that the accuracy of the motor control board defect detection model-based on image processing established in this paper reached over 99%. It is suitable for timely image processing of large quantities of motor control boards on the production line, and achieved efficient defect detection. The defect detection method can not only be used for online detection of the motor control board defects, but also provide solutions for the integrated circuit board defect processing for the industry.
zh

[CV-100] he Coherence Trap: When MLLM -Crafted Narratives Exploit Manipulated Visual Contexts

【速读】：该论文试图解决由多模态大语言模型（Multimodal Large Language Models, MLLMs）驱动的虚假信息检测与定位问题，特别是针对当前方法在评估MLLM生成的复杂欺骗性内容时存在的两个根本性局限：一是低估了MLLM驱动的欺骗风险，二是存在不现实的错位伪影。解决方案的关键在于提出一种新的对抗性流程，通过构建MLLM驱动的合成多模态（MDSM）数据集，并引入Artifact-aware Manipulation Diagnosis via MLLM（AMD）框架，该框架包含Artifact Pre-perception Encoding策略和Manipulation-Oriented Reasoning机制，以有效应对MLLM生成的高风险虚假信息。

链接: https://arxiv.org/abs/2505.17476
作者: Yuchen Zhang,Yaxiong Wang,Yujiao Wu,Lianwei Wu,Li Zhu
机构: Xi’an Jiaotong University (西安交通大学); Hefei University of Technology (合肥工业大学); CSIRO (澳大利亚联邦科学与工业研究组织); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The detection and grounding of multimedia manipulation has emerged as a critical challenge in combating AI-generated disinformation. While existing methods have made progress in recent years, we identify two fundamental limitations in current approaches: (1) Underestimation of MLLM-driven deception risk: prevailing techniques primarily address rule-based text manipulations, yet fail to account for sophisticated misinformation synthesized by multimodal large language models (MLLMs) that can dynamically generate semantically coherent, contextually plausible yet deceptive narratives conditioned on manipulated images; (2) Unrealistic misalignment artifacts: currently focused scenarios rely on artificially misaligned content that lacks semantic coherence, rendering them easily detectable. To address these gaps holistically, we propose a new adversarial pipeline that leverages MLLMs to generate high-risk disinformation. Our approach begins with constructing the MLLM-Driven Synthetic Multimodal (MDSM) dataset, where images are first altered using state-of-the-art editing techniques and then paired with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations. Building upon this foundation, we present the Artifact-aware Manipulation Diagnosis via MLLM (AMD) framework featuring two key innovations: Artifact Pre-perception Encoding strategy and Manipulation-Oriented Reasoning, to tame MLLMs for the MDSM problem. Comprehensive experiments validate our framework’s superior generalization capabilities as a unified architecture for detecting MLLM-powered multimodal deceptions.
zh

[CV-101] PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation CVPR2025

【速读】：该论文旨在解决姿态估计中的多数据集训练（multi-dataset training, MDT）问题，特别是骨骼异质性带来的挑战。现有方法在回归和分类任务中通常依赖数据集合并或多头监督，但在姿态估计中，骨骼类型多样性及跨数据集监督的有限性使得整合变得复杂。论文提出的解决方案PoseBH的关键在于两个核心技术：一是非参数关键点原型，其在统一嵌入空间中学习，实现不同骨骼类型的无缝集成；二是跨类型自监督机制，通过将关键点预测与关键点嵌入原型对齐，提供无需依赖教师-学生模型或额外增强的监督。

链接: https://arxiv.org/abs/2505.17475
作者: Uyoung Jeong,Jonathan Freer,Seungryul Baek,Hyung Jin Chang,Kwang In Kim
机构: UNIST(UNIST); University of Birmingham(University of Birmingham); POSTECH(POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to CVPR 2025

点击查看摘要

Abstract:We study multi-dataset training (MDT) for pose estimation, where skeletal heterogeneity presents a unique challenge that existing methods have yet to address. In traditional domains, \eg regression and classification, MDT typically relies on dataset merging or multi-head supervision. However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. First, we propose nonparametric keypoint prototypes that learn within a unified embedding space, enabling seamless integration across skeleton types. Second, we develop a cross-type self-supervision mechanism that aligns keypoint predictions with keypoint embedding prototypes, providing supervision without relying on teacher-student models or additional augmentations. PoseBH substantially improves generalization across whole-body and animal pose datasets, including COCO-WholeBody, AP-10K, and APT-36K, while preserving performance on standard human pose benchmarks (COCO, MPII, and AIC). Furthermore, our learned keypoint embeddings transfer effectively to hand shape estimation (InterHand2.6M) and human body shape estimation (3DPW). The code for PoseBH is available at: this https URL.
zh

[CV-102] Graph Mamba for Efficient Whole Slide Image Understanding

【速读】：该论文旨在解决全幻灯片图像（Whole Slide Images, WSIs）在大规模医学图像分析中的挑战，包括高分辨率、大尺寸以及复杂的切片关系问题。现有基于图神经网络（Graph Neural Networks, GNNs）和Transformer的多实例学习（Multiple Instance Learning, MIL）方法在可扩展性和计算成本方面存在局限。解决方案的关键在于提出WSI-GMamba框架，该框架将GNN的关系建模能力与Mamba——一种专为序列学习设计的状态空间模型（State Space Model, SSM）的高效性相结合。通过集成消息传递、图扫描展平及双向状态空间模型（Bi-SSM）进行特征聚合，GMamba块在保持Transformer性能的同时，实现了7倍减少的浮点运算次数（FLOPs）。

链接: https://arxiv.org/abs/2505.17457
作者: Jiaxuan Lu,Junyan Shi,Yuhui Lin,Fang Yan,Yue Gao,Shaoting Zhang,Xiaosong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Whole Slide Images (WSIs) in histopathology present a significant challenge for large-scale medical image analysis due to their high resolution, large size, and complex tile relationships. Existing Multiple Instance Learning (MIL) methods, such as Graph Neural Networks (GNNs) and Transformer-based models, face limitations in scalability and computational cost. To bridge this gap, we propose the WSI-GMamba framework, which synergistically combines the relational modeling strengths of GNNs with the efficiency of Mamba, the State Space Model designed for sequence learning. The proposed GMamba block integrates Message Passing, Graph Scanning Flattening, and feature aggregation via a Bidirectional State Space Model (Bi-SSM), achieving Transformer-level performance with 7* fewer FLOPs. By leveraging the complementary strengths of lightweight GNNs and Mamba, the WSI-GMamba framework delivers a scalable solution for large-scale WSI analysis, offering both high accuracy and computational efficiency for slide-level classification.
zh

[CV-103] Real-time Traffic Accident Anticipation with Feature Reuse ICIP2025

【速读】：该论文试图解决交通事故预判问题，旨在提前预测潜在交通事故。其解决方案的关键在于提出一种轻量级框架RARE（Real-time Accident anticipation with Reused Embeddings），该框架利用单个预训练目标检测器的中间特征，避免了额外的特征提取流程，从而显著降低延迟。此外，引入了一种新的注意力评分排序损失，优先关注与事故相关的物体，提升了模型的准确性和可解释性。

链接: https://arxiv.org/abs/2505.17449
作者: Inpyo Song,Jangwon Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2025

点击查看摘要

Abstract:This paper addresses the problem of anticipating traffic accidents, which aims to forecast potential accidents before they happen. Real-time anticipation is crucial for safe autonomous driving, yet most methods rely on computationally heavy modules like optical flow and intermediate feature extractors, making real-world deployment challenging. In this paper, we thus introduce RARE (Real-time Accident anticipation with Reused Embeddings), a lightweight framework that capitalizes on intermediate features from a single pre-trained object detector. By eliminating additional feature-extraction pipelines, RARE significantly reduces latency. Furthermore, we introduce a novel Attention Score Ranking Loss, which prioritizes higher attention on accident-related objects over non-relevant ones. This loss enhances both accuracy and interpretability. RARE demonstrates a 4-8 times speedup over existing approaches on the DAD and CCD benchmarks, achieving a latency of 13.6ms per frame (73.3 FPS) on an RTX 6000. Moreover, despite its reduced complexity, it attains state-of-the-art Average Precision and reliably anticipates imminent collisions in real time. These results highlight RARE’s potential for safety-critical applications where timely and explainable anticipation is essential.
zh

[CV-104] Baitradar: A Multi-Model Clickbait Detection Algorithm Using Deep Learning ICASSP’21

【速读】：该论文试图解决YouTube平台上日益严重的“点击诱饵”（clickbait）问题，即通过吸引人的标题和缩略图诱导用户点击视频，但实际内容与宣传不符。解决方案的关键是提出一种名为BaitRadar的算法，该算法采用深度学习技术，联合使用六个推理模型进行最终分类决策，这些模型分别关注视频的不同属性，包括标题、评论、缩略图、标签、视频统计数据和音频转录文本。通过计算多个模型的平均结果，该方法能够在数据缺失的情况下仍提供鲁棒且准确的输出。

链接: https://arxiv.org/abs/2505.17448
作者: Bhanuka Gamage,Adnan Labib,Aisha Joomun,Chern Hong Lim,KokSheik Wong
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Appear in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’21), Toronto, ON, Canada

点击查看摘要

Abstract:Following the rising popularity of YouTube, there is an emerging problem on this platform called clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users ended up watching a video that does not have the content as publicized in the title. This issue is addressed in this study by proposing an algorithm called BaitRadar, which uses a deep learning technique where six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video, including title, comments, thumbnail, tags, video statistics and audio transcript. The final classification is attained by computing the average of multiple models to provide a robust and accurate output even in situation where there is missing data. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of less than 2s.
zh

[CV-105] PawPrint: Whose Footprints Are These? Identifying Animal Individuals by Their Footprints ICIP2025

【速读】：该论文试图解决传统宠物识别与追踪方法（如GPS标签或ID照片）在实际应用中存在易被移除、信号问题及依赖他人发现和报告等局限性的问题。解决方案的关键在于引入PawPrint和PawPrint+，这是首个公开的针对犬猫个体足印识别的数据库，通过结合现代深度神经网络（如CNN、Transformer）和经典局部特征进行全面基准测试，探索不同基底复杂度和数据可用性下的优势与不足，从而为未来融合学习的全局表示与局部描述符以提升复杂现实条件下可靠性的研究提供方向。

链接: https://arxiv.org/abs/2505.17445
作者: Inpyo Song,Hyemin Hwang,Jangwon Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2025

点击查看摘要

Abstract:In the United States, as of 2023, pet ownership has reached 66% of households and continues to rise annually. This trend underscores the critical need for effective pet identification and monitoring methods, particularly as nearly 10 million cats and dogs are reported stolen or lost each year. However, traditional methods for finding lost animals like GPS tags or ID photos have limitations-they can be removed, face signal issues, and depend on someone finding and reporting the pet. To address these limitations, we introduce PawPrint and PawPrint+, the first publicly available datasets focused on individual-level footprint identification for dogs and cats. Through comprehensive benchmarking of both modern deep neural networks (e.g., CNN, Transformers) and classical local features, we observe varying advantages and drawbacks depending on substrate complexity and data availability. These insights suggest future directions for combining learned global representations with local descriptors to enhance reliability across diverse, real-world conditions. As this approach provides a non-invasive alternative to traditional ID tags, we anticipate promising applications in ethical pet management and wildlife conservation efforts.
zh

[CV-106] Reflectance Prediction-based Knowledge Distillation for Robust 3D Object Detection in Compressed Point Clouds

【速读】：该论文旨在解决低比特率传输下点云压缩导致的反射率编码负担重和检测鲁棒性受限的问题。其解决方案的关键在于提出一种基于反射率预测的知识蒸馏（Reflectance Prediction-based Knowledge Distillation, RPKD）框架，通过在低比特率传输中压缩点坐标并丢弃反射率，随后利用几何基础的反射率预测模块重建反射率信息，以提升检测精度，同时通过教师与学生检测器联合训练增强学生检测器的鲁棒性。

链接: https://arxiv.org/abs/2505.17442
作者: Hao Jing,Anhong Wang,Yifan Zhang,Donghan Bu,Junhui Hou
机构: Taiyuan University of Science and Technology (太原科技大学); Shanghai University (上海大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Regarding intelligent transportation systems for vehicle networking, low-bitrate transmission via lossy point cloud compression is vital for facilitating real-time collaborative perception among vehicles with restricted bandwidth. In existing compression transmission systems, the sender lossily compresses point coordinates and reflectance to generate a transmission code stream, which faces transmission burdens from reflectance encoding and limited detection robustness due to information loss. To address these issues, this paper proposes a 3D object detection framework with reflectance prediction-based knowledge distillation (RPKD). We compress point coordinates while discarding reflectance during low-bitrate transmission, and feed the decoded non-reflectance compressed point clouds into a student detector. The discarded reflectance is then reconstructed by a geometry-based reflectance prediction (RP) module within the student detector for precise detection. A teacher detector with the same structure as student detector is designed for performing reflectance knowledge distillation (RKD) and detection knowledge distillation (DKD) from raw to compressed point clouds. Our RPKD framework jointly trains detectors on both raw and compressed point clouds to improve the student detector’s robustness. Experimental results on the KITTI dataset and Waymo Open Dataset demonstrate that our method can boost detection accuracy for compressed point clouds across multiple code rates. Notably, at a low code rate of 2.146 Bpp on the KITTI dataset, our RPKD-PV achieves the highest mAP of 73.6, outperforming existing detection methods with the PV-RCNN baseline.
zh

[CV-107] VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision Language Models

【速读】：该论文旨在解决大型视觉-语言模型（Large Vision-Language Models, LVLMs）在面对对抗攻击时的鲁棒性不足问题。现有攻击方法通常依赖于任务特定的白盒设置，但在LVLMs中由于其设计用于多种下游任务且需要昂贵的全模型梯度计算，导致这些方法受限。论文提出的解决方案关键在于提出一种针对视觉编码器的简单而有效的攻击方法（Vision Encoder Attack, VEAttack），通过最小化干净与扰动视觉特征之间的余弦相似度来生成对抗样本，无需访问后续的大语言模型、任务信息和标签，从而显著降低计算开销并消除传统白盒攻击对任务和标签的依赖。

链接: https://arxiv.org/abs/2505.17440
作者: Hefei Mei,Zirui Wang,Shen You,Minjing Dong,Chang Xu
机构: City University of Hong Kong (香港城市大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation, yet their vulnerability to adversarial attacks raises significant robustness concerns. While existing effective attacks always focus on task-specific white-box settings, these approaches are limited in the context of LVLMs, which are designed for diverse downstream tasks and require expensive full-model gradient computations. Motivated by the pivotal role and wide adoption of the vision encoder in LVLMs, we propose a simple yet effective Vision Encoder Attack (VEAttack), which targets the vision encoder of LVLMs only. Specifically, we propose to generate adversarial examples by minimizing the cosine similarity between the clean and perturbed visual features, without accessing the following large language models, task information, and labels. It significantly reduces the computational overhead while eliminating the task and label dependence of traditional white-box attacks in LVLMs. To make this simple attack effective, we propose to perturb images by optimizing image tokens instead of the classification token. We provide both empirical and theoretical evidence that VEAttack can easily generalize to various tasks. VEAttack has achieved a performance degradation of 94.5% on image caption task and 75.7% on visual question answering task. We also reveal some key observations to provide insights into LVLM attack/defense: 1) hidden layer variations of LLM, 2) token attention differential, 3) Möbius band in transfer attack, 4) low sensitivity to attack steps. The code is available at this https URL
zh

[CV-108] Learning Generalized and Flexible Trajectory Models from Omni-Semantic Supervision KDD’25

【速读】：该论文旨在解决轨迹数据在时空数据挖掘中的高效准确检索问题，特别是针对现有轨迹检索方法在大规模数据处理效率、条件查询支持以及依赖轨迹相似性度量方面的不足。其解决方案的关键在于提出OmniTraj框架，该框架通过整合原始轨迹、拓扑结构、道路段和区域四种互补的语义模态，构建一个统一的表示空间，并为每种模态设计专用编码器进行嵌入与融合，从而实现基于单一模态或组合模态的灵活、精准查询，克服了传统基于相似性的方法的局限性。

链接: https://arxiv.org/abs/2505.17437
作者: Yuanshao Zhu,James Jianqiao Yu,Xiangyu Zhao,Xiao Han,Qidong Liu,Xuetao Wei,Yuxuan Liang
机构: Southern University of Science and Technology (南方科技大学); City University of Hong Kong (香港城市大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州）); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学（深圳）)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted as a full paper by KDD’25 - Research Track

点击查看摘要

Abstract:The widespread adoption of mobile devices and data collection technologies has led to an exponential increase in trajectory data, presenting significant challenges in spatio-temporal data mining, particularly for efficient and accurate trajectory retrieval. However, existing methods for trajectory retrieval face notable limitations, including inefficiencies in large-scale data, lack of support for condition-based queries, and reliance on trajectory similarity measures. To address the above challenges, we propose OmniTraj, a generalized and flexible omni-semantic trajectory retrieval framework that integrates four complementary modalities or semantics – raw trajectories, topology, road segments, and regions – into a unified system. Unlike traditional approaches that are limited to computing and processing trajectories as a single modality, OmniTraj designs dedicated encoders for each modality, which are embedded and fused into a shared representation space. This design enables OmniTraj to support accurate and flexible queries based on any individual modality or combination thereof, overcoming the rigidity of traditional similarity-based methods. Extensive experiments on two real-world datasets demonstrate the effectiveness of OmniTraj in handling large-scale data, providing flexible, multi-modality queries, and supporting downstream tasks and applications.
zh

[CV-109] VIBE: Video-to-Text Information Bottleneck Evaluation for TL;DR

【速读】：该论文旨在解决当前视觉-语言模型（VLMs）在生成视频摘要时输出冗长、重复，从而影响任务性能的问题，以及现有视频字幕评估依赖昂贵的人工标注且忽视摘要在下游任务中的实用性问题。其解决方案的关键在于提出一种无需注释的评估方法——视频到文本信息瓶颈评估（VIBE），通过两个指标对VLM输出进行评分：定位性（总结与视觉内容的对齐程度）和实用性（对任务的信息丰富性），并根据这两个评分对随机采样的VLM输出进行排序，以支持更有效的决策。

链接: https://arxiv.org/abs/2505.17423
作者: Shenghui Chen,Po-han Li,Sandeep Chichali,Ufuk Topcu
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Many decision-making tasks, where both accuracy and efficiency matter, still require human supervision. For example, tasks like traffic officers reviewing hour-long dashcam footage or researchers screening conference videos can benefit from concise summaries that reduce cognitive load and save time. Yet current vision-language models (VLMs) often produce verbose, redundant outputs that hinder task performance. Existing video caption evaluation depends on costly human annotations and overlooks the summaries’ utility in downstream tasks. We address these gaps with Video-to-text Information Bottleneck Evaluation (VIBE), an annotation-free method that scores VLM outputs using two metrics: grounding (how well the summary aligns with visual content) and utility (how informative it is for the task). VIBE selects from randomly sampled VLM outputs by ranking them according to the two scores to support effective human decision-making. Human studies on LearningPaper24, SUTD-TrafficQA, and LongVideoBench show that summaries selected by VIBE consistently improve performance-boosting task accuracy by up to 61.23% and reducing response time by 75.77% compared to naive VLM summaries or raw video.
zh

[CV-110] Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

【速读】：该论文旨在解决使用体素表示（如符号距离函数）生成高分辨率3D形状时面临的计算和内存挑战。其关键解决方案是提出一种基于稀疏体素的可扩展3D生成框架Direct3D S2，其中的核心创新是空间稀疏注意力机制（Spatial Sparse Attention, SSA），该机制显著提升了扩散Transformer在稀疏体素数据上的计算效率，从而大幅降低了训练成本并提高了推理速度。

链接: https://arxiv.org/abs/2505.17412
作者: Shuang Wu,Youtian Lin,Feihu Zhang,Yifei Zeng,Yikang Yang,Yajie Bao,Jiachen Qian,Siyu Zhu,Philip Torr,Xun Cao,Yao Yao
机构: Nanjing University (南京大学); DreamTech (梦想科技); Fudan University (复旦大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generating high resolution 3D shapes using volumetric representations such as Signed Distance Functions presents substantial computational and memory challenges. We introduce Direct3D S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention mechanism, which greatly enhances the efficiency of Diffusion Transformer computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, significantly reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on public available datasets, and experiments demonstrate that Direct3D S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: this https URL.
zh

[CV-111] From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation

【速读】：该论文旨在解决高保真三维重建在航空检测任务中语义理解不足的问题，传统摄影测量技术虽能实现几何建模，但缺乏语义可解释性，而神经渲染和3D Gaussian Splatting（3DGS）虽能生成逼真的三维重建，但同样缺乏场景级理解。其解决方案的关键在于提出一种基于无人机的管道，扩展Feature-3DGS以实现语言引导的三维分割，通过结合LSeg特征场与CLIP嵌入生成热图，并利用SAM或SAM2进行精细化二维分割，从而实现对逼真三维重建的灵活语言驱动交互。

链接: https://arxiv.org/abs/2505.17402
作者: Mahmoud Chick Zaouali,Todd Charter,Homayoun Najjaran
机构: University of Victoria (维多利亚大学); Cognia AI Inc. (Cognia AI 公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:High-fidelity 3D reconstruction is critical for aerial inspection tasks such as infrastructure monitoring, structural assessment, and environmental surveying. While traditional photogrammetry techniques enable geometric modeling, they lack semantic interpretability, limiting their effectiveness for automated inspection workflows. Recent advances in neural rendering and 3D Gaussian Splatting (3DGS) offer efficient, photorealistic reconstructions but similarly lack scene-level understanding. In this work, we present a UAV-based pipeline that extends Feature-3DGS for language-guided 3D segmentation. We leverage LSeg-based feature fields with CLIP embeddings to generate heatmaps in response to language prompts. These are thresholded to produce rough segmentations, and the highest-scoring point is then used as a prompt to SAM or SAM2 for refined 2D segmentation on novel view renderings. Our results highlight the strengths and limitations of various feature field backbones (CLIP-LSeg, SAM, SAM2) in capturing meaningful structure in large-scale outdoor environments. We demonstrate that this hybrid approach enables flexible, language-driven interaction with photorealistic 3D reconstructions, opening new possibilities for semantic aerial inspection and scene understanding. Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) Cite as: arXiv:2505.17402 [cs.GR] (or arXiv:2505.17402v1 [cs.GR] for this version) https://doi.org/10.48550/arXiv.2505.17402 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-112] Wildfire Detection Using Vision Transformer with the Wildfire Dataset

【速读】：该论文试图解决 wildfires（野火）检测中的准确性和实时性问题，特别是在复杂环境条件下提高早期检测能力。其解决方案的关键在于利用生成式 AI（Generative AI）中的 Vision Transformers (ViTs) 模型，通过处理高分辨率图像数据实现高精度的分类。研究采用了一个包含10.74 GB高分辨率图像的数据集，并对图像进行标准化预处理以提升模型训练效果。

链接: https://arxiv.org/abs/2505.17395
作者: Gowtham Raj Vuppari,Navarun Gupta,Ahmed El-Sayed,Xingguo Xiong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published at ASEE NE 2025

点击查看摘要

Abstract:The critical need for sophisticated detection techniques has been highlighted by the rising frequency and intensity of wildfires in the US, especially in California. In 2023, wildfires caused 130 deaths nationwide, the highest since 1990. In January 2025, Los Angeles wildfires which included the Palisades and Eaton fires burnt approximately 40,000 acres and 12,000 buildings, and caused loss of human lives. The devastation underscores the urgent need for effective detection and prevention strategies. Deep learning models, such as Vision Transformers (ViTs), can enhance early detection by processing complex image data with high accuracy. However, wildfire detection faces challenges, including the availability of high-quality, real-time data. Wildfires often occur in remote areas with limited sensor coverage, and environmental factors like smoke and cloud cover can hinder detection. Additionally, training deep learning models is computationally expensive, and issues like false positives/negatives and scaling remain concerns. Integrating detection systems with real-time alert mechanisms also poses difficulties. In this work, we used the wildfire dataset consisting of 10.74 GB high-resolution images categorized into ‘fire’ and ‘nofire’ classes is used for training the ViT model. To prepare the data, images are resized to 224 x 224 pixels, converted into tensor format, and normalized using ImageNet statistics.
zh

[CV-113] Dual-sensing driving detection model

【速读】：该论文试图解决驾驶员疲劳检测中单一传感模态方法存在的局限性问题，旨在提升检测的准确性与可靠性。解决方案的关键在于提出一种结合计算机视觉与生理信号分析的双传感疲劳检测方法，通过融合两种传感模态的互补优势，并引入创新的架构与先进的融合策略，实现更鲁棒的疲劳检测效果。

链接: https://arxiv.org/abs/2505.17392
作者: Leon C.C.K,Zeng Hui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:In this paper, a novel dual-sensing driver fatigue detection method combining computer vision and physiological signal analysis is proposed. The system exploits the complementary advantages of the two sensing modalities and breaks through the limitations of existing single-modality methods. We introduce an innovative architecture that combines real-time facial feature analysis with physiological signal processing, combined with advanced fusion strategies, for robust fatigue detection. The system is designed to run efficiently on existing hardware while maintaining high accuracy and reliability. Through comprehensive experiments, we demonstrate that our method outperforms traditional methods in both controlled environments and real-world conditions, while maintaining high accuracy. The practical applicability of the system has been verified through extensive tests in various driving scenarios and shows great potential in reducing fatigue-related accidents. This study contributes to the field by providing a more reliable, cost-effective, and humane solution for driver fatigue detection.
zh

[CV-114] Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

【速读】：该论文旨在解决离散扩散模型（Discrete Diffusion Models, DDMs）在使用少量去噪步骤时性能下降的问题，特别是在建模多维数据之间的相互依赖关系方面存在局限。其解决方案的关键在于提出一种名为变分自编码离散扩散（Variational Autoencoding Discrete Diffusion, VADD）的新框架，通过引入潜在变量建模来隐式捕捉维度间的相关性，并利用辅助识别模型实现基于变分下界最大化的稳定训练和训练集上的摊销推理。

链接: https://arxiv.org/abs/2505.17384
作者: Tianyu Xie,Shuchen Xue,Zijin Feng,Tianyang Hu,Jiacheng Sun,Zhenguo Li,Cheng Zhang
机构: Peking University (北京大学); Chinese Academy of Sciences (中国科学院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); National University of Singapore (新加坡国立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 23 pages, 14 figures

点击查看摘要

Abstract:Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines.
zh

[CV-115] EVM-Fusion: An Explainable Vision Mamba Architecture with Neural Algorithmic Fusion

【速读】：该论文旨在解决医学图像分类中对准确性、可解释性和泛化能力的高要求问题。其解决方案的关键在于提出了一种名为EVM-Fusion的可解释视觉Mamba架构，该架构引入了新型神经算法融合（Neural Algorithmic Fusion, NAF）机制，通过多路径设计和两阶段特征融合策略，实现多器官医学图像的高效分类，并嵌入多种可解释性模块以增强模型决策的透明度。

链接: https://arxiv.org/abs/2505.17367
作者: Zichuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures

点击查看摘要

Abstract:Medical image classification is critical for clinical decision-making, yet demands for accuracy, interpretability, and generalizability remain challenging. This paper introduces EVM-Fusion, an Explainable Vision Mamba architecture featuring a novel Neural Algorithmic Fusion (NAF) mechanism for multi-organ medical image classification. EVM-Fusion leverages a multipath design, where DenseNet and U-Net based pathways, enhanced by Vision Mamba (Vim) modules, operate in parallel with a traditional feature pathway. These diverse features are dynamically integrated via a two-stage fusion process: cross-modal attention followed by the iterative NAF block, which learns an adaptive fusion algorithm. Intrinsic explainability is embedded through path-specific spatial attention, Vim \Delta-value maps, traditional feature SE-attention, and cross-modal attention weights. Experiments on a diverse 9-class multi-organ medical image dataset demonstrate EVM-Fusion’s strong classification performance, achieving 99.75% test accuracy and provide multi-faceted insights into its decision-making process, highlighting its potential for trustworthy AI in medical diagnostics.
zh

[CV-116] Optimizing YOLOv8 for Parking Space Detection: Comparative Analysis of Custom YOLOv8 Architecture

【速读】：该论文旨在解决停车空间占用检测中的挑战，尤其是在边缘情况下（如部分可见车辆、小型车辆及光照条件差）传统目标检测方法（如YOLOv8）的性能不足问题。其解决方案的关键在于对定制化主干网络架构与YOLOv8的集成进行系统比较分析，通过在PKLot数据集上评估不同主干网络（如ResNet-18、VGG16、EfficientNetV2、Ghost）的检测准确性和计算效率，以揭示各架构的优势与权衡，从而为停车占用检测选择合适的模型提供依据。

链接: https://arxiv.org/abs/2505.17364
作者: Apar Pokhrel,Gia Dao
机构: The University of Texas at Arlington (德克萨斯大学阿灵顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Parking space occupancy detection is a critical component in the development of intelligent parking management systems. Traditional object detection approaches, such as YOLOv8, provide fast and accurate vehicle detection across parking lots but can struggle with borderline cases, such as partially visible vehicles, small vehicles (e.g., motorcycles), and poor lighting conditions. In this work, we perform a comprehensive comparative analysis of customized backbone architectures integrated with YOLOv8. Specifically, we evaluate various backbones – ResNet-18, VGG16, EfficientNetV2, Ghost – on the PKLot dataset in terms of detection accuracy and computational efficiency. Experimental results highlight each architecture’s strengths and trade-offs, providing insight into selecting suitable models for parking occupancy.
zh

[CV-117] Are GNNs Worth the Effort for IoT Botnet Detection? A Comparative Study of VAE-GNN vs. ViT-MLP and VAE-MLP Approaches

【速读】：该论文旨在解决物联网（IoT）中由于基于物联网的僵尸网络攻击激增所带来的安全问题，其解决方案的关键在于利用先进的深度学习架构进行特征降维和攻击检测。研究评估了四种最新的深度学习模型：带有多层感知机（MLP）的变分自编码器（VAE）编码器、带有图卷积网络（GCN）的VAE编码器、带有图注意力网络（GAT）的VAE编码器以及带有MLP的视觉Transformer（ViT）编码器，并在N-BaIoT数据集上进行了二分类和多分类任务的性能评估。

链接: https://arxiv.org/abs/2505.17363
作者: Hassan Wasswa,Hussein Abbass,Timothy Lynar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Due to the exponential rise in IoT-based botnet attacks, researchers have explored various advanced techniques for both dimensionality reduction and attack detection to enhance IoT security. Among these, Variational Autoencoders (VAE), Vision Transformers (ViT), and Graph Neural Networks (GNN), including Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), have garnered significant research attention in the domain of attack detection. This study evaluates the effectiveness of four state-of-the-art deep learning architectures for IoT botnet detection: a VAE encoder with a Multi-Layer Perceptron (MLP), a VAE encoder with a GCN, a VAE encoder with a GAT, and a ViT encoder with an MLP. The evaluation is conducted on a widely studied IoT benchmark dataset–the N-BaIoT dataset for both binary and multiclass tasks. For the binary classification task, all models achieved over 99.93% in accuracy, recall, precision, and F1-score, with no notable differences in performance. In contrast, for the multiclass classification task, GNN-based models showed significantly lower performance compared to VAE-MLP and ViT-MLP, with accuracies of 86.42%, 89.46%, 99.72%, and 98.38% for VAE-GCN, VAE-GAT, VAE-MLP, and ViT-MLP, respectively.
zh

[CV-118] Repurposing Marigold for Zero-Shot Metric Depth Estimation via Defocus Blur Cues

【速读】：该论文试图解决单目度量深度估计（Monocular Metric Depth Estimation, MDE）方法在分布外数据集上性能显著下降的问题。其解决方案的关键在于在推理阶段注入散焦模糊线索，通过优化度量深度缩放参数和Marigold模型的噪声潜在变量，将预训练的无监督扩散模型转化为一种无需训练的度量深度预测器。

链接: https://arxiv.org/abs/2505.17358
作者: Chinmay Talegaonkar,Nikhil Gandudi Suresh,Zachary Novack,Yash Belhe,Priyanka Nagasamudra,Nicholas Antipa
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent monocular metric depth estimation (MMDE) methods have made notable progress towards zero-shot generalization. However, they still exhibit a significant performance drop on out-of-distribution datasets. We address this limitation by injecting defocus blur cues at inference time into Marigold, a \textitpre-trained diffusion model for zero-shot, scale-invariant monocular depth estimation (MDE). Our method effectively turns Marigold into a metric depth predictor in a training-free manner. To incorporate defocus cues, we capture two images with a small and a large aperture from the same viewpoint. To recover metric depth, we then optimize the metric depth scaling parameters and the noise latents of Marigold at inference time using gradients from a loss function based on the defocus-blur image formation model. We compare our method against existing state-of-the-art zero-shot MMDE methods on a self-collected real dataset, showing quantitative and qualitative improvements.
zh

[CV-119] Graph Attention Neural Network for Botnet Detection: Evaluating Autoencoder VAE and PCA-Based Dimension Reduction

【速读】：该论文试图解决物联网（IoT）基于NetFlow的攻击数据集在转换为图结构数据时所面临的高维性与计算开销大的问题，以及传统模型在处理攻击实例间关系时的不足。其解决方案的关键在于首先通过维度约简技术（包括变分自编码器、经典自编码器和主成分分析）降低数据维度，再将其转化为图结构数据，并在此基础上应用图注意力网络（GAT）模型，以同时捕捉长程依赖性和实例间的关联性，从而提升僵尸网络攻击检测的性能。

链接: https://arxiv.org/abs/2505.17357
作者: Hassan Wasswa,Hussein Abbass,Timothy Lynar
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rise of IoT-based botnet attacks, researchers have explored various learning models for detection, including traditional machine learning, deep learning, and hybrid approaches. A key advancement involves deploying attention mechanisms to capture long-term dependencies among features, significantly improving detection accuracy. However, most models treat attack instances independently, overlooking inter-instance relationships. Graph Neural Networks (GNNs) address this limitation by learning an embedding space via iterative message passing where similar instances are placed closer based on node features and relationships, enhancing classification performance. To further improve detection, attention mechanisms have been embedded within GNNs, leveraging both long-range dependencies and inter-instance connections. However, transforming the high dimensional IoT attack datasets into a graph structured dataset poses challenges, such as large graph structures leading computational overhead. To mitigate this, this paper proposes a framework that first reduces dimensionality of the NetFlow-based IoT attack dataset before transforming it into a graph dataset. We evaluate three dimension reduction techniques–Variational Autoencoder (VAE-encoder), classical autoencoder (AE-encoder), and Principal Component Analysis (PCA)–and compare their effects on a Graph Attention neural network (GAT) model for botnet attack detection
zh

[CV-120] Dual Ascent Diffusion for Inverse Problems

【速读】：该论文试图解决在多个领域中普遍存在的不适定逆问题（ill-posed inverse problems），特别是针对基于扩散模型（diffusion models）的先验信息在最大后验概率（MAP）或后验采样方法中因计算近似导致的样本不准确或次优的问题。其解决方案的关键在于引入一种基于对偶上升优化框架（dual ascent optimization framework）的新方法，以更高效、更准确地求解带有扩散模型先验的MAP问题，从而在图像恢复任务中实现更高的图像质量、更强的噪声鲁棒性以及更忠实于观测数据的解估计。

链接: https://arxiv.org/abs/2505.17353
作者: Minseo Kim,Axel Levy,Gordon Wetzstein
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 23 pages, 15 figures, 5 tables

点击查看摘要

Abstract:Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
zh

[CV-121] Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

【速读】：该论文试图解决扩散模型（diffusion models）生成结果与人类偏好及安全约束对齐的问题，这是当前生成式AI（Generative AI）领域的重要挑战。解决方案的关键在于利用强化学习（reinforcement learning, RL）和奖励建模（reward modeling）方法，通过人类反馈或自动化机制优化模型输出，以提升其与用户意图或安全标准的一致性。研究重点包括不同类型的反馈（如人类、自动、二元或排序偏好）、微调技术（如策略梯度、奖励加权似然、直接反向传播等）及其在效率和安全性方面的表现，并提出了五项未来研究方向，旨在推动更安全、更符合价值观的扩散模型发展。

链接: https://arxiv.org/abs/2505.17352
作者: Preeti Lamba,Kiran Ravish,Ankita Kushwaha,Pawan Kumar
机构: International Institute of Information Technology, Hyderabad(国际信息科技研究所，海得拉巴)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have emerged as leading generative models for images and other modalities, but aligning their outputs with human preferences and safety constraints remains a critical challenge. This thesis proposal investigates methods to align diffusion models using reinforcement learning (RL) and reward modeling. We survey recent advances in fine-tuning text-to-image diffusion models with human feedback, including reinforcement learning from human and AI feedback, direct preference optimization, and differentiable reward approaches. We classify these methods based on the type of feedback (human, automated, binary or ranked preferences), the fine-tuning technique (policy gradient, reward-weighted likelihood, direct backpropagation, etc.), and their efficiency and safety outcomes. We compare key algorithms and frameworks, highlighting how they improve alignment with user intent or safety standards, and discuss inter-relationships such as how newer methods build on or diverge from earlier ones. Based on the survey, we identify five promising research directions for the next two years: (1) multi-objective alignment with combined rewards, (2) efficient human feedback usage and active learning, (3) robust safety alignment against adversarial inputs, (4) continual and online alignment of diffusion models, and (5) interpretable and trustworthy reward modeling for generative images. Each direction is elaborated with its problem statement, challenges, related work, and a proposed research plan. The proposal is organized as a comprehensive document with literature review, comparative tables of methods, and detailed research plans, aiming to contribute new insights and techniques for safer and value-aligned diffusion-based generative AI.
zh

[CV-122] Ocular Authentication: Fusion of Gaze and Periocular Modalities

【速读】：该论文试图解决在无需校准的认证系统中融合两种以眼睛为中心的认证模态——眼动和眼周图像——的可行性问题。现有研究表明，每种模态在用户认证中均展现出潜力，但其在统一的眼动估计流程中的组合尚未在大规模范围内得到充分探索。论文提出了一种多模态认证系统，并利用包含9202名受试者的大型内部数据集进行评估，该数据集的眼动（ET）信号质量相当于面向消费者的虚拟现实（VR）设备。解决方案的关键在于集成先进的机器学习架构，该架构通过捕捉认证表征以及融合模态之间的互补判别特性，显著提升了大规模认证性能。

链接: https://arxiv.org/abs/2505.17343
作者: Dillon Lohr,Michael J. Proulx,Mehedi Hasan Raju,Oleg V. Komogortsev
机构: Meta Reality Labs Research (Meta Reality Labs Research); Texas State University (Texas State University)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Supplementary material is available

点击查看摘要

Abstract:This paper investigates the feasibility of fusing two eye-centric authentication modalities-eye movements and periocular images-within a calibration-free authentication system. While each modality has independently shown promise for user authentication, their combination within a unified gaze-estimation pipeline has not been thoroughly explored at scale. In this report, we propose a multimodal authentication system and evaluate it using a large-scale in-house dataset comprising 9202 subjects with an eye tracking (ET) signal quality equivalent to a consumer-facing virtual reality (VR) device. Our results show that the multimodal approach consistently outperforms both unimodal systems across all scenarios, surpassing the FIDO benchmark. The integration of a state-of-the-art machine learning architecture contributed significantly to the overall authentication performance at scale, driven by the model’s ability to capture authentication representations and the complementary discriminative characteristics of the fused modalities.
zh

[CV-123] Render-FM: A Foundation Model for Real-time Photorealistic Volumetric Rendering

【速读】：该论文旨在解决医学影像中CT扫描的体绘制问题，特别是传统高保真方法（尤其是神经渲染技术）需要耗时的逐场景优化，导致计算需求高且泛化能力差，限制了其在临床中的应用。解决方案的关键在于提出Render-FM，这是一种直接、实时进行CT扫描体绘制的新型基础模型，其核心是采用编码器-解码器架构，通过大规模预训练从CT体积直接回归六维高斯点云（6D Gaussian Splatting, 6DGS）参数，从而消除逐扫描优化过程，实现高效、高质量的实时交互式三维可视化。

链接: https://arxiv.org/abs/2505.17338
作者: Zhongpai Gao,Meng Zheng,Benjamin Planche,Anwesa Choudhuri,Terrence Chen,Ziyan Wu
机构: United Imaging Intelligence (联合影像智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Volumetric rendering of Computed Tomography (CT) scans is crucial for visualizing complex 3D anatomical structures in medical imaging. Current high-fidelity approaches, especially neural rendering techniques, require time-consuming per-scene optimization, limiting clinical applicability due to computational demands and poor generalizability. We propose Render-FM, a novel foundation model for direct, real-time volumetric rendering of CT scans. Render-FM employs an encoder-decoder architecture that directly regresses 6D Gaussian Splatting (6DGS) parameters from CT volumes, eliminating per-scan optimization through large-scale pre-training on diverse medical data. By integrating robust feature extraction with the expressive power of 6DGS, our approach efficiently generates high-quality, real-time interactive 3D visualizations across diverse clinical CT data. Experiments demonstrate that Render-FM achieves visual fidelity comparable or superior to specialized per-scan methods while drastically reducing preparation time from nearly an hour to seconds for a single inference step. This advancement enables seamless integration into real-time surgical planning and diagnostic workflows. The project page is: this https URL.
zh

[CV-124] mporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis MICCAI

【速读】：该论文试图解决在术前数据采集阶段，由于患者轻微运动导致呼吸周期中首帧与末帧之间出现动态背景的问题，这一问题会干扰时间建模的准确性。解决方案的关键在于首次利用图像到视频（Image-to-Video, I2V）合成框架模拟规律性运动过程，并通过设计时序差分扩散模型生成时序差分场，以度量相邻帧之间的相对差异表示，同时引入提示注意力层和场增强层来提升差分场与I2V框架的交互效果，从而实现更精确的时间变化模拟。

链接: https://arxiv.org/abs/2505.17333
作者: Xin You,Minghui Zhang,Hanxiao Zhang,Jie Yang,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: early accepted by MICCAI

点击查看摘要

Abstract:Temporal modeling on regular respiration-induced motions is crucial to image-guided clinical applications. Existing methods cannot simulate temporal motions unless high-dose imaging scans including starting and ending frames exist simultaneously. However, in the preoperative data acquisition stage, the slight movement of patients may result in dynamic backgrounds between the first and last frames in a respiratory period. This additional deviation can hardly be removed by image registration, thus affecting the temporal modeling. To address that limitation, we pioneeringly simulate the regular motion process via the image-to-video (I2V) synthesis framework, which animates with the first frame to forecast future frames of a given length. Besides, to promote the temporal consistency of animated videos, we devise the Temporal Differential Diffusion Model to generate temporal differential fields, which measure the relative differential representations between adjacent frames. The prompt attention layer is devised for fine-grained differential fields, and the field augmented layer is adopted to better interact these fields with the I2V framework, promoting more accurate temporal variation of synthesized videos. Extensive results on ACDC cardiac and 4D Lung datasets reveal that our approach simulates 4D videos along the intrinsic motion trajectory, rivaling other competitive methods on perceptual similarity and temporal consistency. Codes will be available soon.
zh

[CV-125] Game-invariant Features Through Contrastive and Domain-adversarial Learning

【速读】：该论文试图解决游戏图像编码器在面对新游戏时因过度适应特定游戏视觉风格而导致下游任务性能下降的问题（game-specific visual styles）。其解决方案的关键在于结合对比学习和领域对抗训练，以学习跨游戏的视觉特征。通过同时鼓励内容相似性聚类并利用对抗性领域分类器抑制游戏特定线索，该方法生成的嵌入能够实现跨多种游戏的泛化能力。

链接: https://arxiv.org/abs/2505.17328
作者: Dylan Kline
机构: University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundational game-image encoders often overfit to game-specific visual styles, undermining performance on downstream tasks when applied to new games. We present a method that combines contrastive learning and domain-adversarial training to learn game-invariant visual features. By simultaneously encouraging similar content to cluster and discouraging game-specific cues via an adversarial domain classifier, our approach produces embeddings that generalize across diverse games. Experiments on the Bingsu game-image dataset (10,000 screenshots from 10 games) demonstrate that after only a few training epochs, our model’s features no longer cluster by game, indicating successful invariance and potential for improved cross-game transfer (e.g., glitch detection) with minimal fine-tuning. This capability paves the way for more generalizable game vision models that require little to no retraining on new games.
zh

[CV-126] Optimizing Image Capture for Computer Vision-Powered Taxonomic Identification and Trait Recognition of Biodiversity Specimens

【速读】：该论文试图解决当前生物标本成像实践与自动化分析需求之间的差距，即现有成像协议主要针对人类视觉解释设计，未充分考虑计算机视觉应用的要求。解决方案的关键在于提出一系列相互关联的考量因素，包括全面的元数据记录、标准化的标本定位、一致的尺寸和颜色校准、多标本图像处理协议、统一背景选择、可控光照、适当的分辨率与放大倍数、优化的文件格式、稳健的数据归档策略以及可访问的数据共享实践，从而构建适用于计算机视觉流程的高质量生物标本图像数据集。

链接: https://arxiv.org/abs/2505.17317
作者: Alyson East,Elizabeth G. Campolongo,Luke Meyers,S M Rayeed,Samuel Stevens,Iuliia Zarubiieva,Isadora E. Fluck,Jennifer C. Girón,Maximiliane Jousse,Scott Lowe,Kayla I Perry,Isabelle Betancourt,Noah Charney,Evan Donoso,Nathan Fox,Kim J. Landsbergen,Ekaterina Nepovinnykh,Michelle Ramirez,Parkash Singh,Khum Thapa-Magar,Matthew Thompson,Evan Waite,Tanya Berger-Wolf,Hilmar Lapp,Paula Mabee,Graham Taylor,Sydne Record
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Biological collections house millions of specimens documenting Earth’s biodiversity, with digital images increasingly available through open-access platforms. Most imaging protocols were developed for human visual interpretation without considering computational analysis requirements. This paper aims to bridge the gap between current imaging practices and the potential for automated analysis by presenting key considerations for creating biological specimen images optimized for computer vision applications. We provide conceptual computer vision topics for context, addressing fundamental concerns including model generalization, data leakage, and comprehensive metadata documentation, and outline practical guidance on specimen imagine, and data storage. These recommendations were synthesized through interdisciplinary collaboration between taxonomists, collection managers, ecologists, and computer scientists. Through this synthesis, we have identified ten interconnected considerations that form a framework for successfully integrating biological specimen images into computer vision pipelines. The key elements include: (1) comprehensive metadata documentation, (2) standardized specimen positioning, (3) consistent size and color calibration, (4) protocols for handling multiple specimens in one image, (5) uniform background selection, (6) controlled lighting, (7) appropriate resolution and magnification, (8) optimal file formats, (9) robust data archiving strategies, and (10) accessible data sharing practices. By implementing these recommendations, collection managers, taxonomists, and biodiversity informaticians can generate images that support automated trait extraction, species identification, and novel ecological and evolutionary analyses at unprecedented scales. Successful implementation lies in thorough documentation of methodological choices.
zh

[CV-127] Harnessing EHRs for Diffusion-based Anomaly Detection on Chest X-rays MICCAI2025

【速读】：该论文旨在解决医学影像中无监督异常检测（Unsupervised Anomaly Detection, UAD）的问题，即在无需大量标注数据的情况下识别病理异常。现有基于扩散模型的UAD方法仅依赖影像特征，难以区分正常解剖变异与病理异常。其解决方案的关键在于提出Diff3M框架，该框架通过整合胸部X光图像和结构化电子健康记录（Electronic Health Records, EHRs），引入一种新颖的图像-EHR交叉注意力模块，将临床背景信息融入图像生成过程，从而提升模型区分正常与异常特征的能力。此外，还设计了静态掩码策略以增强从异常中重建正常图像的效果。

链接: https://arxiv.org/abs/2505.17311
作者: Harim Kim,Yuhan Wang,Minkyu Ahn,Heeyoul Choi,Yuyin Zhou,Charmgil Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2025 early accept

点击查看摘要

Abstract:Unsupervised anomaly detection (UAD) in medical imaging is crucial for identifying pathological abnormalities without requiring extensive labeled data. However, existing diffusion-based UAD models rely solely on imaging features, limiting their ability to distinguish between normal anatomical variations and pathological anomalies. To address this, we propose Diff3M, a multi-modal diffusion-based framework that integrates chest X-rays and structured Electronic Health Records (EHRs) for enhanced anomaly detection. Specifically, we introduce a novel image-EHR cross-attention module to incorporate structured clinical context into the image generation process, improving the model’s ability to differentiate normal from abnormal features. Additionally, we develop a static masking strategy to enhance the reconstruction of normal-like images from anomalies. Extensive evaluations on CheXpert and MIMIC-CXR/IV demonstrate that Diff3M achieves state-of-the-art performance, outperforming existing UAD methods in medical imaging. Our code is available at this http URL this https URL
zh

[CV-128] Mitigate One Skew Another? Tackling Intersectional Biases in Text-to-Image Models

【速读】：该论文试图解决文本到图像（Text-to-Image, TTI）模型中偏见的相互依赖问题，即在某一维度（如种族或年龄）上进行偏见缓解可能无意中影响其他维度（如性别），从而加剧或缓解现有不平等。解决方案的关键在于提出BiasConnect，一种用于分析和量化TTI模型中偏见交互作用的新工具，通过反事实干预不同偏见轴来揭示这些交互的潜在结构，并估计缓解一个偏见轴对另一个偏见轴的影响。基于BiasConnect，进一步提出了InterMit，一种由用户定义的目标分布和优先级权重指导的交叉性偏见缓解算法，实现了更有效的偏见降低和更高的图像质量。

链接: https://arxiv.org/abs/2505.17280
作者: Pushkar Shukla,Aditya Chinchure,Emily Diana,Alexander Tolbert,Kartik Hosanagar,Vineeth N Balasubramanian,Leonid Sigal,Matthew Turk
机构: Toyota Technological Institute at Chicago (丰田技术学院芝加哥分校); University of British Columbia (不列颠哥伦比亚大学); Carnegie Mellon University, Tepper School of Business (卡内基梅隆大学泰珀商学院); Emory University (埃默里大学); University of Pennsylvania, The Wharton School (宾夕法尼亚大学沃顿商学院); Indian Institute of Technology Hyderabad (印度理工学院海得拉巴分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The biases exhibited by text-to-image (TTI) models are often treated as independent, though in reality, they may be deeply interrelated. Addressing bias along one dimension - such as ethnicity or age - can inadvertently affect another, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. To address this, we introduce BiasConnect, a novel tool for analyzing and quantifying bias interactions in TTI models. BiasConnect uses counterfactual interventions along different bias axes to reveal the underlying structure of these interactions and estimates the effect of mitigating one bias axis on another. These estimates show strong correlation (+0.65) with observed post-mitigation outcomes. Building on BiasConnect, we propose InterMit, an intersectional bias mitigation algorithm guided by user-defined target distributions and priority weights. InterMit achieves lower bias (0.33 vs. 0.52) with fewer mitigation steps (2.38 vs. 3.15 average steps), and yields superior image quality compared to traditional techniques. Although our implementation is training-free, InterMit is modular and can be integrated with many existing debiasing approaches for TTI models, making it a flexible and extensible solution.
zh

[CV-129] ExpertGen: Training-Free Expert Guidance for Controllable Text-to-Face Generation

【速读】：该论文试图解决在文本到人脸生成中实现细粒度控制面部特征的挑战，现有方法通常需要训练额外模块来处理特定控制任务，如身份、属性或年龄，导致灵活性差且资源消耗大。解决方案的关键在于提出一种无需训练的框架ExpertGen，该框架利用预训练的专家模型（如人脸识别、面部属性识别和年龄估计网络）进行引导生成，通过潜在一致性模型确保每一步扩散过程的现实性和分布内预测，从而提供精确的引导信号以有效控制生成过程。

链接: https://arxiv.org/abs/2505.17256
作者: Liang Shi,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have significantly improved text-to-face generation, but achieving fine-grained control over facial features remains a challenge. Existing methods often require training additional modules to handle specific controls such as identity, attributes, or age, making them inflexible and resource-intensive. We propose ExpertGen, a training-free framework that leverages pre-trained expert models such as face recognition, facial attribute recognition, and age estimation networks to guide generation with fine control. Our approach uses a latent consistency model to ensure realistic and in-distribution predictions at each diffusion step, enabling accurate guidance signals to effectively steer the diffusion process. We show qualitatively and quantitatively that expert models can guide the generation process with high precision, and multiple experts can collaborate to enable simultaneous control over diverse facial aspects. By allowing direct integration of off-the-shelf expert models, our method transforms any such model into a plug-and-play component for controllable face generation.
zh

[CV-130] Extending Dataset Pruning to Object Detection: A Variance-based Approach

【速读】：该论文试图解决将图像分类中的数据集剪枝（dataset pruning）方法扩展到更复杂的计算机视觉任务——目标检测（object detection）中的问题。其关键解决方案是提出一种名为基于方差的预测得分（Variance-based Prediction Score, VPS）的新评分方法，该方法结合交并比（Intersection over Union, IoU）和置信度得分，以有效识别适用于目标检测任务的信息性训练样本。

链接: https://arxiv.org/abs/2505.17245
作者: Ryota Yagi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dataset pruning – selecting a small yet informative subset of training data – has emerged as a promising strategy for efficient machine learning, offering significant reductions in computational cost and storage compared to alternatives like dataset distillation. While pruning methods have shown strong performance in image classification, their extension to more complex computer vision tasks, particularly object detection, remains relatively underexplored. In this paper, we present the first principled extension of classification pruning techniques to the object detection domain, to the best of our knowledge. We identify and address three key challenges that hinder this transition: the Object-Level Attribution Problem, the Scoring Strategy Problem, and the Image-Level Aggregation Problem. To overcome these, we propose tailored solutions, including a novel scoring method called Variance-based Prediction Score (VPS). VPS leverages both Intersection over Union (IoU) and confidence scores to effectively identify informative training samples specific to detection tasks. Extensive experiments on PASCAL VOC and MS COCO demonstrate that our approach consistently outperforms prior dataset pruning methods in terms of mean Average Precision (mAP). We also show that annotation count and class distribution shift can influence detection performance, but selecting informative examples is a more critical factor than dataset size or balance. Our work bridges dataset pruning and object detection, paving the way for dataset pruning in complex vision tasks.
zh

[CV-131] REACT 2025: the Third Multiple Appropriate Facial Reaction Generation Challenge

【速读】：该论文旨在解决在双人互动中，如何生成多种合适、多样化、真实且同步的人类风格面部反应的问题，这些面部反应由人类听众对输入刺激（即对应说话者表达的视听行为）作出的响应。解决方案的关键在于提供了首个自然且大规模的多模态MAFRG数据集（称为MARS），该数据集记录了137组人类-人类双人互动，共计2856个互动会话，涵盖了五个不同主题，为机器学习模型的开发和基准测试提供了基础。

链接: https://arxiv.org/abs/2505.17223
作者: Siyang Song,Micol Spitale,Xiangyu Kong,Hengde Zhu,Cheng Luo,Cristina Palmero,German Barquero,Sergio Escalera,Michel Valstar,Mohamed Daoudi,Tobias Baur,Fabien Ringeval,Andrew Howes,Elisabeth Andre,Hatice Gunes
机构: University of Exeter(埃克塞特大学); Politecnico di Milano(米兰理工大学); University of Leicester(莱斯特大学); King Abdullah University of Science and Technology(阿卜杜拉国王科技大学); King’s College London(伦敦国王学院); Universitat de Barcelona(巴塞罗那大学); University of Nottingham(诺丁汉大学); IMT Nord Europe(北欧高等技术学院); University of Augsburg(奥格斯堡大学); Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学); University of Cambridge(剑桥大学
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In dyadic interactions, a broad spectrum of human facial reactions might be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023 and REACT 2024 challenges, we are proposing the REACT 2025 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can be used to generate multiple appropriate, diverse, realistic and synchronised human-style facial reactions expressed by human listeners in response to an input stimulus (i.e., audio-visual behaviours expressed by their corresponding speakers). As a key of the challenge, we provide challenge participants with the first natural and large-scale multi-modal MAFRG dataset (called MARS) recording 137 human-human dyadic interactions containing a total of 2856 interaction sessions covering five different topics. In addition, this paper also presents the challenge guidelines and the performance of our baselines on the two proposed sub-challenges: Offline MAFRG and Online MAFRG, respectively. The challenge baseline code is publicly available at this https URL
zh

[CV-132] A Framework for Multi-View Multiple Object Tracking using Single-View Multi-Object Trackers on Fish Data

【速读】：该论文试图解决水下环境中小型鱼类的多目标跟踪（Multi-object tracking, MOT）问题，该问题由于复杂的三维运动和数据噪声而具有独特挑战性。传统单视角MOT模型在该场景下表现不足。解决方案的关键在于开发一种多视角框架，利用双目视频输入来提升跟踪精度和鱼类行为模式识别能力，通过集成和评估FairMOT和YOLOv8等先进单视角MOT模型在水下鱼类视频数据集上的表现，实现相较于单视角方法的显著精度与可靠性提升。

链接: https://arxiv.org/abs/2505.17201
作者: Chaim Chai Elchik,Fatemeh Karimi Nejadasl,Seyed Sahand Mohammadi Ziabari,Ali Mohammed Mansoor Alsahag
机构: Informatics Institute, University of Amsterdam (信息学研究所，阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detecting and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions
zh

[CV-133] Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在执行视觉问答等任务时，其推理过程是更依赖于记忆的世界知识还是输入图像中的视觉信息这一问题。解决方案的关键在于引入了Visual CounterFact数据集，该数据集通过生成视觉上逼真的反事实样本（如将红色草莓与蓝色草莓的图像进行对比），直接冲突世界知识先验与视觉输入。研究发现，模型预测最初反映记忆中的先验知识，但在中后期层逐渐转向视觉证据，表明两种模态之间存在竞争，最终视觉输入在评估过程中占主导地位。为控制这一行为，作者提出了Pixels Versus Priors (PvP)转向向量，通过激活层面的干预机制，使模型输出偏向于世界知识或视觉输入。

链接: https://arxiv.org/abs/2505.17127
作者: Michal Golovanevsky,William Rudman,Michael Lepori,Amir Bar,Ritambhara Singh,Carsten Eickhoff
机构: Brown University (布朗大学); Tel Aviv University (特拉维夫大学); University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g, red strawberry) into direct conflict with visual input (e.g, blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 92.5% of color and 74.6% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
zh

[CV-134] EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

【速读】：该论文试图解决手语中情感指示不明确的问题，这一问题在关键场景中造成了沟通障碍（Emotion indicators in sign language remain poorly understood）。解决方案的关键在于引入了EmoSign，这是首个包含200个美国手语（ASL）视频的情感和情绪标签的视频数据集，并通过三位具有专业翻译经验的聋人ASL使用者进行标注，同时提供了情感和情绪分类的基线模型，为多模态手语情感识别提供了新的基准。

链接: https://arxiv.org/abs/2505.17090
作者: Phoebe Chua,Cathy Mengying Fang,Takehiko Ohkawa,Raja Kushalnagar,Suranga Nanayakkara,Pattie Maes
机构: MIT Media Lab (麻省理工学院媒体实验室); National University of Singapore (新加坡国立大学); The University of Tokyo (东京大学); Gallaudet University (加劳德特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at this https URL.
zh

[CV-135] Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

【速读】：该论文试图解决生成式 AI (Generative AI) 在文本到图像 (Text-to-Image, TTI) 扩散模型中对历史时期表现不准确的问题，特别是其在隐含风格关联、历史一致性以及人口统计学代表性方面的系统性偏差。解决方案的关键在于提出一种系统且可复现的评估方法，并构建 HistVis 数据集，该数据集包含由三种先进扩散模型生成的30,000张合成图像，用于评估不同历史时期的视觉表现。通过这一方法，研究为提升 TTI 模型的历史准确性与文化契合度提供了初步基础。

链接: https://arxiv.org/abs/2505.17064
作者: Maria-Teresa De Rosa Palmini,Eva Cetinic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. In this work, we present a systematic and reproducible methodology for evaluating how TTI systems depict different historical periods. For this purpose, we introduce the HistVis dataset, a curated collection of 30,000 synthetic images generated by three state-of-the-art diffusion models using carefully designed prompts depicting universal human activities across different historical periods. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By offering a scalable methodology and benchmark for assessing historical representation in generated imagery, this work provides an initial step toward building more historically accurate and culturally aligned TTI models.
zh

[CV-136] Lightweight Multispectral Crop-Weed Segmentation for Precision Agriculture

【速读】：该论文试图解决精准农业中作物与杂草分割效率低的问题，传统基于卷积神经网络（Convolutional Neural Network, CNN）的方法在复杂田间条件下泛化能力差，并且依赖于RGB影像，限制了性能。其解决方案的关键在于提出一种轻量级的Transformer-CNN混合模型，通过专用编码器处理RGB、近红外（Near-Infrared, NIR）和红边（Red-Edge, RE）波段，并采用动态模态融合机制，从而提升了分割精度与计算效率。

链接: https://arxiv.org/abs/2505.07444
作者: Zeynep Galymzhankyzy,Eric Martinson
机构: Lawrence Technological University (劳伦斯科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 4 pages, 5 figures, 1 table

点击查看摘要

Abstract:Efficient crop-weed segmentation is critical for site-specific weed control in precision agriculture. Conventional CNN-based methods struggle to generalize and rely on RGB imagery, limiting performance under complex field conditions. To address these challenges, we propose a lightweight transformer-CNN hybrid. It processes RGB, Near-Infrared (NIR), and Red-Edge (RE) bands using specialized encoders and dynamic modality integration. Evaluated on the WeedsGalore dataset, the model achieves a segmentation accuracy (mean IoU) of 78.88%, outperforming RGB-only models by 15.8 percentage points. With only 8.7 million parameters, the model offers high accuracy, computational efficiency, and potential for real-time deployment on Unmanned Aerial Vehicles (UAVs) and edge devices, advancing precision weed management.
zh

[CV-137] Accelerating Learned Image Compression Through Modeling Neural Training Dynamics

【速读】：该论文旨在解决学习图像压缩（LIC）方法在训练过程中计算需求日益增加的问题，核心挑战在于提升其训练效率。解决方案的关键在于提出一种感知敏感度的真实与虚假嵌入训练机制（STDET），通过将LIC模型参数聚类为少数独立模式，并将参数表示为同一模式内参考参数的仿射变换，从而减少可训练参数的数量。此外，结合训练过程中的稳定模式内相关性和参数敏感性，逐步嵌入非参考参数，并引入采样后移动平均（SMA）技术以平滑时间行为并减小训练状态方差，最终实现降低训练空间维度和可训练参数数量的同时保持模型性能，加速模型收敛。

链接: https://arxiv.org/abs/2505.18107
作者: Yichi Zhang,Zhihao Duan,Yuning Huang,Fengqing Zhu
机构: Purdue University (普渡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TMLR

点击查看摘要

Abstract:As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the Noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.
zh

[CV-138] A Foundation Model Framework for Multi-View MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer

【速读】：该论文旨在解决直肠癌术前MRI中对肌层外血管侵犯（extramural vascular invasion, EVI）和直肠筋膜侵犯（mesorectal fascia invasion, MFI）的准确识别问题，这一过程对于风险分层管理至关重要，但传统视觉评估存在主观性和机构间差异。其解决方案的关键在于开发一个基于基础模型（foundation model）的多中心框架，通过自监督频域一致化管道减少扫描仪相关的对比度偏移，并结合多视角特征融合与轻量级分类器（如UMedPT_LR）提升诊断性能。

链接: https://arxiv.org/abs/2505.18058
作者: Yumeng Zhang,Zohaib Salahuddin,Danial Khan,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Luis Marti-Bonmati,Philippe Lambin
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 8 figures

点击查看摘要

Abstract:Background: Accurate MRI-based identification of extramural vascular invasion (EVI) and mesorectal fascia invasion (MFI) is pivotal for risk-stratified management of rectal cancer, yet visual assessment is subjective and vulnerable to inter-institutional variability. Purpose: To develop and externally evaluate a multicenter, foundation-model-driven framework that automatically classifies EVI and MFI on axial and sagittal T2-weighted MRI. Methods: This retrospective study used 331 pre-treatment rectal cancer MRI examinations from three European hospitals. After TotalSegmentator-guided rectal patch extraction, a self-supervised frequency-domain harmonization pipeline was trained to minimize scanner-related contrast shifts. Four classifiers were compared: ResNet50, SeResNet, the universal biomedical pretrained transformer (UMedPT) with a lightweight MLP head, and a logistic-regression variant using frozen UMedPT features (UMedPT_LR). Results: UMedPT_LR achieved the best EVI detection when axial and sagittal features were fused (AUC = 0.82; sensitivity = 0.75; F1 score = 0.73), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.74). The highest MFI performance was attained by UMedPT on axial harmonized images (AUC = 0.77), surpassing the Chaimeleon Grand-Challenge winner (AUC = 0.75). Frequency-domain harmonization improved MFI classification but variably affected EVI performance. Conventional CNNs (ResNet50, SeResNet) underperformed, especially in F1 score and balanced accuracy. Conclusion: These findings demonstrate that combining foundation model features, harmonization, and multi-view fusion significantly enhances diagnostic performance in rectal MRI.
zh

[CV-139] Explainable Anatomy-Guided AI for Prostate MRI: Foundation Models and In Silico Clinical Trials for Virtual Biopsy-based Risk Assessment

【速读】：该论文旨在解决前列腺癌（Prostate Cancer, PCa）风险分层的自动化与精准化问题，通过整合医学影像与人工智能技术提升诊断的准确性与效率。其解决方案的关键在于构建一个全自动化、解剖引导的深度学习流水线，该流水线包含三个核心组件：基于nnU-Net的分割模块用于精确划分前列腺及其区域；基于UMedPT Swin Transformer基础模型的分类模块，结合解剖先验和临床数据进行微调；以及利用VAE-GAN框架生成反事实热图以增强模型的可解释性。这一方法在多个指标上表现出色，验证了其在临床应用中的潜力。

链接: https://arxiv.org/abs/2505.17971
作者: Danial Khan,Zohaib Salahuddin,Yumeng Zhang,Sheng Kuang,Shruti Atul Mali,Henry C. Woodruff,Sina Amirrajab,Rachel Cavill,Eduardo Ibor-Crespo,Ana Jimenez-Pastor,Adrian Galiana-Bordera,Paula Jimenez Gomez,Luis Marti-Bonmati,Philippe Lambin
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a fully automated, anatomically guided deep learning pipeline for prostate cancer (PCa) risk stratification using routine MRI. The pipeline integrates three key components: an nnU-Net module for segmenting the prostate gland and its zones on axial T2-weighted MRI; a classification module based on the UMedPT Swin Transformer foundation model, fine-tuned on 3D patches with optional anatomical priors and clinical data; and a VAE-GAN framework for generating counterfactual heatmaps that localize decision-driving image regions. The system was developed using 1,500 PI-CAI cases for segmentation and 617 biparametric MRIs with metadata from the CHAIMELEON challenge for classification (split into 70% training, 10% validation, and 20% testing). Segmentation achieved mean Dice scores of 0.95 (gland), 0.94 (peripheral zone), and 0.92 (transition zone). Incorporating gland priors improved AUC from 0.69 to 0.72, with a three-scale ensemble achieving top performance (AUC = 0.79, composite score = 0.76), outperforming the 2024 CHAIMELEON challenge winners. Counterfactual heatmaps reliably highlighted lesions within segmented regions, enhancing model interpretability. In a prospective multi-center in-silico trial with 20 clinicians, AI assistance increased diagnostic accuracy from 0.72 to 0.77 and Cohen’s kappa from 0.43 to 0.53, while reducing review time per case by 40%. These results demonstrate that anatomy-aware foundation models with counterfactual explainability can enable accurate, interpretable, and efficient PCa risk assessment, supporting their potential use as virtual biopsies in clinical practice.
zh

[CV-140] Promptable cancer segmentation using minimal expert-curated data

【速读】：该论文旨在解决医学图像中癌症自动分割的挑战，特别是由于专家标注成本高和数据集中的观察者间变异性导致的推广受限问题。其解决方案的关键在于提出一种新型的可提示分割方法，该方法仅需24张完全标注的图像和8张弱标注的图像进行训练，通过结合弱监督和全监督分类器，利用单点提示引导搜索过程来优化分割结果，从而在显著减少标注数据需求的情况下实现与全监督方法相当的性能。

链接: https://arxiv.org/abs/2505.17915
作者: Lynn Karam,Yipei Wang,Veeru Kasivisvanathan,Mirabela Rusu,Yipeng Hu,Shaheer U. Saeed
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Medical Image Understanding and Analysis (MIUA) 2025

点击查看摘要

Abstract:Automated segmentation of cancer on medical images can aid targeted diagnostic and therapeutic procedures. However, its adoption is limited by the high cost of expert annotations required for training and inter-observer variability in datasets. While weakly-supervised methods mitigate some challenges, using binary histology labels for training as opposed to requiring full segmentation, they require large paired datasets of histology and images, which are difficult to curate. Similarly, promptable segmentation aims to allow segmentation with no re-training for new tasks at inference, however, existing models perform poorly on pathological regions, again necessitating large datasets for training. In this work we propose a novel approach for promptable segmentation requiring only 24 fully-segmented images, supplemented by 8 weakly-labelled images, for training. Curating this minimal data to a high standard is relatively feasible and thus issues with the cost and variability of obtaining labels can be mitigated. By leveraging two classifiers, one weakly-supervised and one fully-supervised, our method refines segmentation through a guided search process initiated by a single-point prompt. Our approach outperforms existing promptable segmentation methods, and performs comparably with fully-supervised methods, for the task of prostate cancer segmentation, while using substantially less annotated data (up to 100X less). This enables promptable segmentation with very minimal labelled data, such that the labels can be curated to a very high standard.
zh

[CV-141] UltraBoneUDF: Self-supervised Bone Surface Reconstruction from Ultrasound Based on Neural Unsigned Distance Functions

【速读】：该论文旨在解决从实际三维超声数据中准确重建开放骨表面的问题，尤其是针对超声成像固有的局限性导致的不完整数据所带来的重建误差和伪影问题。其解决方案的关键在于提出一种自监督框架UltraBoneUDF，该框架利用神经无符号距离函数（Neural Unsigned Distance Functions）进行开放骨表面重建，并引入了一种新型的全局特征提取器以有效融合超声图像特有的特征，同时设计了一种基于局部切平面优化的损失函数以显著提升表面重建质量。

链接: https://arxiv.org/abs/2505.17912
作者: Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Giuseppe Loggia,Lisa Reissner,Bastian Sigrist,Jonas Hein,Lilian Calvet,Arnd Viehöfer,Philipp Fürnstahl
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Bone surface reconstruction plays a critical role in computer-assisted orthopedic surgery. Compared to traditional imaging modalities such as CT and MRI, ultrasound offers a radiation-free, cost-effective, and portable alternative. Continuous bone surface reconstruction can be employed for many clinical applications. However, due to the inherent limitations of ultrasound imaging, B-mode ultrasound typically capture only partial bone surfaces. Existing reconstruction methods struggle with such incomplete data, leading to artifacts and increased reconstruction errors. Effective techniques for accurately reconstructing thin and open bone surfaces from real-world 3D ultrasound volumes remain lacking. Methods: We propose UltraBoneUDF, a self-supervised framework designed for reconstructing open bone surfaces from ultrasound using neural Unsigned Distance Functions. To enhance reconstruction quality, we introduce a novel global feature extractor that effectively fuses ultrasound-specific image characteristics. Additionally, we present a novel loss function based on local tangent plane optimization that substantially improves surface reconstruction quality. UltraBoneUDF and baseline models are extensively evaluated on four open-source datasets. Results: Qualitative results highlight the limitations of the state-of-the-art methods for open bone surface reconstruction and demonstrate the effectiveness of UltraBoneUDF. Quantitatively, UltraBoneUDF significantly outperforms competing methods across all evaluated datasets for both open and closed bone surface reconstruction in terms of mean Chamfer distance error: 1.10 mm on the UltraBones100k dataset (39.6% improvement compared to the SOTA), 0.23 mm on the OpenBoneCT dataset (69.3% improvement), 0.18 mm on the ClosedBoneCT dataset (70.2% improvement), and 0.05 mm on the Prostate dataset (55.3% improvement).
zh

[CV-142] Dual Attention Residual U-Net for Accurate Brain Ultrasound Segmentation in IVH Detection

【速读】：该论文旨在解决早产儿脑室内出血（Intraventricular Hemorrhage, IVH）的早期准确检测问题，通过从脑部超声（Brain Ultrasound, US）图像中分割脑部解剖结构来改善临床预后。其解决方案的关键在于提出了一种改进的残差U-Net架构，结合了两种互补的注意力机制：卷积块注意力模块（Convolutional Block Attention Module, CBAM）和稀疏注意力层（Sparse Attention Layer, SAL）。CBAM增强了模型对空间和通道特征的细化能力，而SAL通过双分支设计，利用稀疏注意力过滤低置信度的查询-键对以抑制噪声，并通过密集注意力确保信息的全面传播，从而提升了脑部解剖结构检测的鲁棒性。

链接: https://arxiv.org/abs/2505.17683
作者: Dan Yuan,Yi Feng,Ziyun Tang
机构: Chongqing Electric Power College (重庆电力学院); Chongqing Metropolitan College of Science and Technology (重庆科技职业学院); Xinqiao Hospital, Army Medical University (新桥医院，陆军医科大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages,6 figures and 3 tables

点击查看摘要

Abstract:Intraventricular hemorrhage (IVH) is a severe neurological complication among premature infants, necessitating early and accurate detection from brain ultrasound (US) images to improve clinical outcomes. While recent deep learning methods offer promise for computer-aided diagnosis, challenges remain in capturing both local spatial details and global contextual dependencies critical for segmenting brain anatomies. In this work, we propose an enhanced Residual U-Net architecture incorporating two complementary attention mechanisms: the Convolutional Block Attention Module (CBAM) and a Sparse Attention Layer (SAL). The CBAM improves the model’s ability to refine spatial and channel-wise features, while the SAL introduces a dual-branch design, sparse attention filters out low-confidence query-key pairs to suppress noise, and dense attention ensures comprehensive information propagation. Extensive experiments on the Brain US dataset demonstrate that our method achieves state-of-the-art segmentation performance, with a Dice score of 89.04% and IoU of 81.84% for ventricle region segmentation. These results highlight the effectiveness of integrating spatial refinement and attention sparsity for robust brain anatomy detection. Code is available at: this https URL.
zh

[CV-143] owards Prospective Medical Image Reconstruction via Knowledge-Informed Dynamic Optimal Transport

【速读】：该论文旨在解决医学图像重建中因模拟数据与真实前瞻性数据之间的回顾性到前瞻性差距而导致的性能退化问题（retrospective-to-prospective gap）。其解决方案的关键在于引入成像知识引导的动态最优传输框架（Knowledge-Informed Dynamic Optimal Transport, KIDOT），该框架通过构建符合成像物理一致性的动态传输路径，利用成像知识引导的成本函数和传输方程，从非配对数据中学习重建过程，从而提升鲁棒性并更好地利用未配对数据。

链接: https://arxiv.org/abs/2505.17644
作者: Taoran Zheng,Xing Li,Yan Yang,Xiang Gu,Zongben Xu,Jian Sun
机构: Xi’an Jiaotong University (西安交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image reconstruction from measurement data is a vital but challenging inverse problem. Deep learning approaches have achieved promising results, but often requires paired measurement and high-quality images, which is typically simulated through a forward model, i.e., retrospective reconstruction. However, training on simulated pairs commonly leads to performance degradation on real prospective data due to the retrospective-to-prospective gap caused by incomplete imaging knowledge in simulation. To address this challenge, this paper introduces imaging Knowledge-Informed Dynamic Optimal Transport (KIDOT), a novel dynamic optimal transport framework with optimality in the sense of preserving consistency with imaging physics in transport, that conceptualizes reconstruction as finding a dynamic transport path. KIDOT learns from unpaired data by modeling reconstruction as a continuous evolution path from measurements to images, guided by an imaging knowledge-informed cost function and transport equation. This dynamic and knowledge-aware approach enhances robustness and better leverages unpaired data while respecting acquisition physics. Theoretically, we demonstrate that KIDOT naturally generalizes dynamic optimal transport, ensuring its mathematical rationale and solution existence. Extensive experiments on MRI and CT reconstruction demonstrate KIDOT’s superior performance.
zh

[CV-144] Distance Estimation in Outdoor Driving Environments Using Phase-only Correlation Method with Event Cameras

【速读】：该论文旨在解决自动驾驶系统中传感器成本与复杂性高的问题，提出一种基于单目事件相机和路侧LED条的测距方法。解决方案的关键在于利用相位仅相关技术对事件数据进行处理，实现亚像素级的空间位移检测，从而通过三角测量法进行精确测距，无需依赖立体视觉。

链接: https://arxiv.org/abs/2505.17582
作者: Masataka Kobayashi(1),Shintaro Shiba(2),Quan Kong(2),Norimasa Kobori(2),Tsukasa Shimizu(3),Shan Lu(1),Takaya Yamazato(1) ((1) School of Engineering, Nagoya University, Nagoya, Japan, (2) Woven by Toyota, Inc., Tokyo, Japan, (3) Toyota Motor Corporation, Toyota, Japan)
机构: Nagoya University(名古屋大学); Woven by Toyota, Inc.(丰田织造公司); TOYOTA MOTOR CORPORATION(丰田汽车公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 6 pages, 7 figures. To appear in IEEE Intelligent Vehicles Symposium (IV) 2025

点击查看摘要

Abstract:With the growing adoption of autonomous driving, the advancement of sensor technology is crucial for ensuring safety and reliable operation. Sensor fusion techniques that combine multiple sensors such as LiDAR, radar, and cameras have proven effective, but the integration of multiple devices increases both hardware complexity and cost. Therefore, developing a single sensor capable of performing multiple roles is highly desirable for cost-efficient and scalable autonomous driving systems. Event cameras have emerged as a promising solution due to their unique characteristics, including high dynamic range, low latency, and high temporal resolution. These features enable them to perform well in challenging lighting conditions, such as low-light or backlit environments. Moreover, their ability to detect fine-grained motion events makes them suitable for applications like pedestrian detection and vehicle-to-infrastructure communication via visible light. In this study, we present a method for distance estimation using a monocular event camera and a roadside LED bar. By applying a phase-only correlation technique to the event data, we achieve sub-pixel precision in detecting the spatial shift between two light sources. This enables accurate triangulation-based distance estimation without requiring stereo vision. Field experiments conducted in outdoor driving scenarios demonstrated that the proposed approach achieves over 90% success rate with less than 0.5-meter error for distances ranging from 20 to 60 meters. Future work includes extending this method to full position estimation by leveraging infrastructure such as smart poles equipped with LEDs, enabling event-camera-based vehicles to determine their own position in real time. This advancement could significantly enhance navigation accuracy, route optimization, and integration into intelligent transportation systems. Comments: 6 pages, 7 figures. To appear in IEEE Intelligent Vehicles Symposium (IV) 2025 Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) ACMclasses: I.4.8; I.2.10; I.5.4 Cite as: arXiv:2505.17582 [eess.IV] (or arXiv:2505.17582v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2505.17582 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Masataka Kobayashi [view email] [v1] Fri, 23 May 2025 07:44:33 UTC (2,483 KB)
zh

[CV-145] FreqU-FNet: Frequency-Aware U-Net for Imbalanced Medical Image Segmentation

【速读】：该论文旨在解决医学图像分割中由于类别不平衡和解剖结构频率特异性分布所带来的持续性挑战。传统基于卷积神经网络（CNN）的方法在空间域操作，难以捕捉少数类信号，易受频率混叠和有限频谱选择性的限制；而基于Transformer的模型虽然能建模全局依赖关系，却容易忽略细粒度分割所需的关键局部细节。论文提出的解决方案是FreqU-FNet，其关键在于构建一个在频域操作的新型U型分割架构，通过频域编码器提取多尺度频谱特征，并引入空间可学习解码器和频率感知损失函数，以有效利用区分性频段并提升少数类的学习效果。

链接: https://arxiv.org/abs/2505.17544
作者: Ruiqi Xing
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 1 figure

点击查看摘要

Abstract:Medical image segmentation faces persistent challenges due to severe class imbalance and the frequency-specific distribution of anatomical structures. Most conventional CNN-based methods operate in the spatial domain and struggle to capture minority class signals, often affected by frequency aliasing and limited spectral selectivity. Transformer-based models, while powerful in modeling global dependencies, tend to overlook critical local details necessary for fine-grained segmentation. To overcome these limitations, we propose FreqU-FNet, a novel U-shaped segmentation architecture operating in the frequency domain. Our framework incorporates a Frequency Encoder that leverages Low-Pass Frequency Convolution and Daubechies wavelet-based downsampling to extract multi-scale spectral features. To reconstruct fine spatial details, we introduce a Spatial Learnable Decoder (SLD) equipped with an adaptive multi-branch upsampling strategy. Furthermore, we design a frequency-aware loss (FAL) function to enhance minority class learning. Extensive experiments on multiple medical segmentation benchmarks demonstrate that FreqU-FNet consistently outperforms both CNN and Transformer baselines, particularly in handling under-represented classes, by effectively exploiting discriminative frequency bands.
zh

[CV-146] DECT-based Space-Squeeze Method for Multi-Class Classification of Metastatic Lymph Nodes in Breast Cancer

【速读】：该论文旨在解决乳腺癌患者前哨淋巴结转移负荷准确评估的问题，传统影像学方法难以区分转移程度并全面捕捉淋巴结特征。其解决方案的关键在于利用双能计算机断层扫描（DECT）的光谱-空间信息，提出一种结合通道注意力机制和虚拟类别注入的新型空间压缩方法，以提升多类别分类性能，从而实现对无转移（N_0）、低转移负荷（N_+(1-2)）和高转移负荷（N_+(\geq3)）的精准分类。

链接: https://arxiv.org/abs/2505.17528
作者: Hai Jiang,Chushan Zheng,Jiawei Pan,Yuanpin Zhou,Qiongting Liu,Xiang Zhang,Jun Shen,Yao Lu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class classification. Purpose: To develop a noninvasive DECT-based model classifying sentinel lymph nodes into three categories: no metastasis ( N_0 ), low metastatic burden ( N_+(1-2) ), and heavy metastatic burden ( N_+(\geq3) ), thereby aiding therapeutic planning. Methods: We propose a novel space-squeeze method combining two innovations: (1) a channel-wise attention mechanism to compress and recalibrate spectral-spatial features across 11 energy levels, and (2) virtual class injection to sharpen inter-class boundaries and compact intra-class variations in the representation space. Results: Evaluated on 227 biopsy-confirmed cases, our method achieved an average test AUC of 0.86 (95% CI: 0.80-0.91) across three cross-validation folds, outperforming established CNNs (VGG, ResNet, etc). The channel-wise attention and virtual class components individually improved AUC by 5.01% and 5.87%, respectively, demonstrating complementary benefits. Conclusions: The proposed framework enhances diagnostic AUC by effectively integrating DECT’s spectral-spatial data and mitigating class ambiguity, offering a promising tool for noninvasive metastatic burden assessment in clinical practice.
zh

[CV-147] Anatomy-Guided Multitask Learning for MRI-Based Classification of Placenta Accreta Spectrum and its Subtypes

【速读】：该论文旨在解决胎盘植入谱系疾病（Placenta Accreta Spectrum Disorders, PAS）及其亚型（包括胎盘粘连、胎盘植入和胎盘穿透）在产前诊断中的准确识别问题，现有方法多集中于PAS的存在性判断，缺乏对亚型的系统识别，且以往的多类分类方法依赖低效的两阶段二分类任务。论文提出了一种新型卷积神经网络（CNN）架构，实现基于4,140张磁共振成像（MRI）切片的高效单阶段多类别诊断，其关键在于引入双分支结构：主分类分支采用残差块（residual block）结构，另一分支则融合子宫胎盘区域及邻近子宫浆膜层的解剖特征以增强模型注意力，并通过多任务学习策略有效整合两个分支的信息，从而提升诊断性能。

链接: https://arxiv.org/abs/2505.17484
作者: Hai Jiang,Qiongting Liu,Yuanpin Zhou,Jiawei Pan,Ting Song,Yao Lu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Placenta Accreta Spectrum Disorders (PAS) pose significant risks during pregnancy, frequently leading to postpartum hemorrhage during cesarean deliveries and other severe clinical complications, with bleeding severity correlating to the degree of placental invasion. Consequently, accurate prenatal diagnosis of PAS and its subtypes-placenta accreta (PA), placenta increta (PI), and placenta percreta (PP)-is crucial. However, existing guidelines and methodologies predominantly focus on the presence of PAS, with limited research addressing subtype recognition. Additionally, previous multi-class diagnostic efforts have primarily relied on inefficient two-stage cascaded binary classification tasks. In this study, we propose a novel convolutional neural network (CNN) architecture designed for efficient one-stage multiclass diagnosis of PAS and its subtypes, based on 4,140 magnetic resonance imaging (MRI) slices. Our model features two branches: the main classification branch utilizes a residual block architecture comprising multiple residual blocks, while the second branch integrates anatomical features of the uteroplacental area and the adjacent uterine serous layer to enhance the model’s attention during classification. Furthermore, we implement a multitask learning strategy to leverage both branches effectively. Experiments conducted on a real clinical dataset demonstrate that our model achieves state-of-the-art performance.
zh

[CV-148] SUFFICIENT: A scan-specific unsupervised deep learning framework for high-resolution 3D isotropic fetal brain MRI reconstruction

【速读】：该论文旨在解决从运动伪影严重的2D切片中重建高质量3D胎儿脑部磁共振成像（MRI）的问题，这是临床诊断中的关键挑战。其解决方案的关键在于提出了一种无监督的迭代slice-to-volume registration (SVR)-super-resolution reconstruction (SRR)框架，通过结合卷积神经网络进行刚性变换矩阵的估计以及嵌入深度图像先验框架的解码网络，实现各向同性高分辨率（HR）体积的重建。该方法无需依赖大规模外部训练数据，而是通过优化预测切片与观测切片之间的损失来提升重建质量。

链接: https://arxiv.org/abs/2505.17472
作者: Jiangjie Wu,Lixuan Chen,Zhenghao Li,Xin Li,Saban Ozturk,Lihui Wang,Rongpin Wang,Hongjiang Wei,Yuyao Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-quality 3D fetal brain MRI reconstruction from motion-corrupted 2D slices is crucial for clinical diagnosis. Reliable slice-to-volume registration (SVR)-based motion correction and super-resolution reconstruction (SRR) methods are essential. Deep learning (DL) has demonstrated potential in enhancing SVR and SRR when compared to conventional methods. However, it requires large-scale external training datasets, which are difficult to obtain for clinical fetal MRI. To address this issue, we propose an unsupervised iterative SVR-SRR framework for isotropic HR volume reconstruction. Specifically, SVR is formulated as a function mapping a 2D slice and a 3D target volume to a rigid transformation matrix, which aligns the slice to the underlying location in the target volume. The function is parameterized by a convolutional neural network, which is trained by minimizing the difference between the volume slicing at the predicted position and the input slice. In SRR, a decoding network embedded within a deep image prior framework is incorporated with a comprehensive image degradation model to produce the high-resolution (HR) volume. The deep image prior framework offers a local consistency prior to guide the reconstruction of HR volumes. By performing a forward degradation model, the HR volume is optimized by minimizing loss between predicted slices and the observed slices. Comprehensive experiments conducted on large-magnitude motion-corrupted simulation data and clinical data demonstrate the superior performance of the proposed framework over state-of-the-art fetal brain reconstruction frameworks.
zh

[CV-149] Assessing the generalization performance of SAM for ureteroscopy scene understanding

【速读】：该论文旨在解决肾脏结石分割的自动化问题，以支持基于机器学习或深度学习方法的尿路结石类型识别。传统模型如U-Net、Residual U-Net和Attention U-Net虽然在特定数据集上表现高效，但在泛化到未见过的数据集时存在局限性。研究提出使用Segment Anything Model (SAM) 作为解决方案，其关键在于SAM具备卓越的适应性和泛化能力，尤其在处理分布外数据时表现出显著优势，相较于U-Net变体在性能上提升了高达23%。

链接: https://arxiv.org/abs/2505.17210
作者: Martin Villagrana,Francisco Lopez-Tiro,Clement Larose,Gilberto Ochoa-Ruiz,Christian Daul
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 2 tables, conference, MIUA25

点击查看摘要

Abstract:The segmentation of kidney stones is regarded as a critical preliminary step to enable the identification of urinary stone types through machine- or deep-learning-based approaches. In urology, manual segmentation is considered tedious and impractical due to the typically large scale of image databases and the continuous generation of new data. In this study, the potential of the Segment Anything Model (SAM) – a state-of-the-art deep learning framework – is investigated for the automation of kidney stone segmentation. The performance of SAM is evaluated in comparison to traditional models, including U-Net, Residual U-Net, and Attention U-Net, which, despite their efficiency, frequently exhibit limitations in generalizing to unseen datasets. The findings highlight SAM’s superior adaptability and efficiency. While SAM achieves comparable performance to U-Net on in-distribution data (Accuracy: 97.68 + 3.04; Dice: 97.78 + 2.47; IoU: 95.76 + 4.18), it demonstrates significantly enhanced generalization capabilities on out-of-distribution data, surpassing all U-Net variants by margins of up to 23 percent.
zh

[CV-150] AGS: 3D Tumor-Adaptive Guidance for SAM

【速读】：该论文旨在解决生成式 AI (Generative AI) 在3D医学影像中的适应性问题，特别是针对病理检测与分割任务中存在自然图像与医学体积之间的领域差距。现有基于2D数据预训练的Foundation Models (FMs)难以捕捉3D解剖上下文，从而限制了其在临床应用中的有效性。该研究提出的解决方案关键在于设计一种名为TAGS（Tumor Adaptive Guidance for SAM）的适应框架，通过多提示融合机制，使2D FMs能够有效应用于3D医学任务，同时保留大部分预训练权重，以增强SAM的空间特征提取能力。

链接: https://arxiv.org/abs/2505.17096
作者: Sirui Li,Linkai Peng,Zheyuan Zhang,Gorkem Durak,Ulas Bagci
机构: Southern University of Science and Technology (南方科技大学); Northwestern University (西北大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models (FMs) such as CLIP and SAM have recently shown great promise in image segmentation tasks, yet their adaptation to 3D medical imaging-particularly for pathology detection and segmentation-remains underexplored. A critical challenge arises from the domain gap between natural images and medical volumes: existing FMs, pre-trained on 2D data, struggle to capture 3D anatomical context, limiting their utility in clinical applications like tumor segmentation. To address this, we propose an adaptation framework called TAGS: Tumor Adaptive Guidance for SAM, which unlocks 2D FMs for 3D medical tasks through multi-prompt fusion. By preserving most of the pre-trained weights, our approach enhances SAM’s spatial feature extraction using CLIP’s semantic insights and anatomy-specific prompts. Extensive experiments on three open-source tumor segmentation datasets prove that our model surpasses the state-of-the-art medical image segmentation models (+46.88% over nnUNet), interactive segmentation frameworks, and other established medical FMs, including SAM-Med2D, SAM-Med3D, SegVol, Universal, 3D-Adapter, and SAM-B (at least +13% over them). This highlights the robustness and adaptability of our proposed framework across diverse medical segmentation tasks.
zh

人工智能

[AI-0] Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems

【速读】：该论文试图解决 Responsible AI (RAI) 度量体系中常见的理论不一致问题，例如公平性定义的差异或准确性和隐私之间的权衡。论文提出的解决方案的关键在于将这些不一致视为有价值的特征而非需要消除的缺陷，通过将度量标准视为不同的优化目标，从而实现三个核心优势：规范多元性、认识论完备性和隐式正则化。这种方法有助于更好地体现RAI中的多元道德立场和利益相关者价值观，更全面地捕捉复杂的伦理概念，并提升模型在现实复杂环境中的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2505.18139
作者: Gordon Dai,Yunze Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages,2 figure

点击查看摘要

Abstract:This position paper argues that the theoretical inconsistency often observed among Responsible AI (RAI) metrics, such as differing fairness definitions or tradeoffs between accuracy and privacy, should be embraced as a valuable feature rather than a flaw to be eliminated. We contend that navigating these inconsistencies, by treating metrics as divergent objectives, yields three key benefits: (1) Normative Pluralism: Maintaining a full suite of potentially contradictory metrics ensures that the diverse moral stances and stakeholder values inherent in RAI are adequately represented. (2) Epistemological Completeness: The use of multiple, sometimes conflicting, metrics allows for a more comprehensive capture of multifaceted ethical concepts, thereby preserving greater informational fidelity about these concepts than any single, simplified definition. (3) Implicit Regularization: Jointly optimizing for theoretically conflicting objectives discourages overfitting to one specific metric, steering models towards solutions with enhanced generalization and robustness under real-world complexities. In contrast, efforts to enforce theoretical consistency by simplifying or pruning metrics risk narrowing this value diversity, losing conceptual depth, and degrading model performance. We therefore advocate for a shift in RAI theory and practice: from getting trapped in inconsistency to characterizing acceptable inconsistency thresholds and elucidating the mechanisms that permit robust, approximated consistency in practice.
zh

[AI-1] Leverag ing KANs for Expedient Training of Multichannel MLPs via Preconditioning and Geometric Refinement

【速读】：该论文试图解决如何加快多层感知机（Multilayer Perceptrons, MLPs）的训练速度并提高其准确性的问题。其解决方案的关键在于利用Kolmogorov-Arnold Networks (KANs) 与多通道MLPs之间的结构等价性，通过引入基于样条的KAN基函数，该基函数具有几何局部支持特性，并在ReLU基下起到预条件下降的作用，从而加速训练过程。此外，通过同时训练样条节点的一维位置与权重，进一步提升了模型的准确性。

链接: https://arxiv.org/abs/2505.18131
作者: Jonas A. Actor,Graham Harper,Ben Southworth,Eric C. Cyr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Multilayer perceptrons (MLPs) are a workhorse machine learning architecture, used in a variety of modern deep learning frameworks. However, recently Kolmogorov-Arnold Networks (KANs) have become increasingly popular due to their success on a range of problems, particularly for scientific machine learning tasks. In this paper, we exploit the relationship between KANs and multichannel MLPs to gain structural insight into how to train MLPs faster. We demonstrate the KAN basis (1) provides geometric localized support, and (2) acts as a preconditioned descent in the ReLU basis, overall resulting in expedited training and improved accuracy. Our results show the equivalence between free-knot spline KAN architectures, and a class of MLPs that are refined geometrically along the channel dimension of each weight tensor. We exploit this structural equivalence to define a hierarchical refinement scheme that dramatically accelerates training of the multi-channel MLP architecture. We show further accuracy improvements can be had by allowing the 1 D locations of the spline knots to be trained simultaneously with the weights. These advances are demonstrated on a range of benchmark examples for regression and scientific machine learning.
zh

[AI-2] Bidirectional Knowledge Distillation for Enhancing Sequential Recommendation with Large Language Models

【速读】：该论文旨在解决将大型语言模型（Large Language Models, LLMs）与传统推荐模型（Conventional Recommendation Models, CRMs）结合时面临的高推理成本和静态知识迁移方法的问题。其解决方案的关键在于提出一种新颖的相互蒸馏框架LLMD4Rec，通过动态且双向的知识交换机制，实现LLM和CRM的迭代优化，从而提升CRMs的语义理解能力，并使LLM从用户-物品交互中获取协作信号，同时避免引入额外参数，确保高效的知识迁移。

链接: https://arxiv.org/abs/2505.18120
作者: Jiongran Wu,Jiahao Liu,Dongsheng Li,Guangping Zhang,Mingzhe Han,Hansu Gu,Peng Zhang,Li Shang,Tun Lu,Ning Gu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 11 pages, under review

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional performance in understanding and generating semantic patterns, making them promising candidates for sequential recommendation tasks. However, when combined with conventional recommendation models (CRMs), LLMs often face challenges related to high inference costs and static knowledge transfer methods. In this paper, we propose a novel mutual distillation framework, LLMD4Rec, that fosters dynamic and bidirectional knowledge exchange between LLM-centric and CRM-based recommendation systems. Unlike traditional unidirectional distillation methods, LLMD4Rec enables iterative optimization by alternately refining both models, enhancing the semantic understanding of CRMs and enriching LLMs with collaborative signals from user-item interactions. By leveraging sample-wise adaptive weighting and aligning output distributions, our approach eliminates the need for additional parameters while ensuring effective knowledge transfer. Extensive experiments on real-world datasets demonstrate that LLMD4Rec significantly improves recommendation accuracy across multiple benchmarks without increasing inference costs. This method provides a scalable and efficient solution for combining the strengths of both LLMs and CRMs in sequential recommendation systems.
zh

[AI-3] Stable Reinforcement Learning for Efficient Reasoning

【速读】：该论文试图解决在链式推理（Chain-of-Thought, CoT）生成过程中，基于0/1结果奖励的强化学习（Reinforcement Learning, RL）方法因缺乏对中间推理过程的调控能力而引发的严重过度思考问题。现有研究通过设计长度惩罚奖励函数来鼓励模型生成更简短但正确的完成内容，但此类方法导致了强化学习训练的不稳定性，尤其是在生成长度减少时模型准确性骤降。该论文提出的解决方案关键在于GRPO-λ，这是一种高效的GRPO变体，通过监控每个查询采样组内完成内容的正确率比值，动态调整奖励策略：低正确率时切换至与长度无关的0/1奖励以保障推理质量，高正确率时则保留长度惩罚以提升效率。实验结果表明，该方法在保持最优准确率与效率权衡的同时，有效避免了由长度惩罚引起的训练不稳定问题。

链接: https://arxiv.org/abs/2505.18086
作者: Muzhi Dai,Shixuan Liu,Qingyi Si
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The success of Deepseek-R1 has drawn the LLM community’s attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods lack the capability to regulate the intermediate reasoning processes during chain-of-thought (CoT) generation, leading to severe overthinking phenomena. In response, recent studies have designed reward functions to reinforce models’ behaviors in producing shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as the completion length decreases, model accuracy abruptly collapses, often occurring early in training. To address this issue, we propose a simple yet effective solution GRPO- \lambda , an efficient and stabilized variant of GRPO, which dynamically adjusts the reward strategy by monitoring the correctness ratio among completions within each query-sampled group. A low correctness ratio indicates the need to avoid length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids training instability caused by length penalty while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
zh

[AI-4] Backpropagation-Free Metropolis-Adjusted Langevin Algorithm

【速读】：该论文试图解决传统基于反向传播的优化方法在马尔可夫链蒙特卡洛（Markov chain Monte Carlo, MCMC）算法中的计算成本高和依赖梯度信息的问题。其解决方案的关键在于引入无需反向传播的前向模式自动微分（forward-mode automatic differentiation, AD），通过在模型前向传递中采样切向量，从而获得方向导数，实现对可微分模型的优化。该方法被整合到修正的莱纳德-阿尔戈里特姆（Metropolis-Adjusted Langevin Algorithm, MALA）的提议机制中，首次提出了无需反向传播的基于梯度的MCMC算法，并进一步扩展出利用海森矩阵信息的位置特定预条件前向模式MALA，显著降低了计算成本并提升了性能。

链接: https://arxiv.org/abs/2505.18081
作者: Adam D. Cobb,Susmit Jha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 Pages, 8 Figures

点击查看摘要

Abstract:Recent work on backpropagation-free learning has shown that it is possible to use forward-mode automatic differentiation (AD) to perform optimization on differentiable models. Forward-mode AD requires sampling a tangent vector for each forward pass of a model. The result is the model evaluation with the directional derivative along the tangent. In this paper, we illustrate how the sampling of this tangent vector can be incorporated into the proposal mechanism for the Metropolis-Adjusted Langevin Algorithm (MALA). As such, we are the first to introduce a backpropagation-free gradient-based Markov chain Monte Carlo (MCMC) algorithm. We also extend to a novel backpropagation-free position-specific preconditioned forward-mode MALA that leverages Hessian information. Overall, we propose four new algorithms: Forward MALA; Line Forward MALA; Pre-conditioned Forward MALA, and Pre-conditioned Line Forward MALA. We highlight the reduced computational cost of the forward-mode samplers and show that forward-mode is competitive with the original MALA, while even outperforming it depending on the probabilistic model. We include Bayesian inference results on a range of probabilistic models, including hierarchical distributions and Bayesian neural networks.
zh

[AI-5] AFD-STA: Adaptive Filtering Denoising with Spatiotemporal Attention for Chaotic System Prediction

【速读】：该论文旨在解决高维混沌系统（governed by partial differential equations）的预测问题，特别是在存在平滑和强混沌状态下的长期预测准确性以及噪声容忍度问题。其解决方案的关键在于提出AFD-STA Net框架，该框架通过整合自适应滤波与时空动态学习，包含自适应指数平滑模块、并行注意力机制、多尺度特征动态门控融合以及具有维度扩展能力的深度投影网络，从而实现对复杂非线性动力学交互的有效学习与建模。

链接: https://arxiv.org/abs/2505.18080
作者: Chunlin Gong,Yin Wang,Jingru Li,Hanleran Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:This paper presents AFD-STA Net, a neural framework integrating adaptive filtering and spatiotemporal dynamics learning for predicting high-dimensional chaotic systems governed by partial differential equations. The architecture combines: 1) An adaptive exponential smoothing module with position-aware decay coefficients for robust attractor reconstruction, 2) Parallel attention mechanisms capturing cross-temporal and spatial dependencies, 3) Dynamic gated fusion of multiscale features, and 4) Deep projection networks with dimension-scaling capabilities. Numerical experiments on nonlinear PDE systems demonstrate the model’s effectiveness in maintaining prediction accuracy under both smooth and strongly chaotic regimes while exhibiting noise tolerance through adaptive filtering. Component ablation studies confirm critical contributions from each module, particularly highlighting the essential role of spatiotemporal attention in learning complex dynamical interactions. The framework shows promising potential for real-world applications requiring simultaneous handling of measurement uncertainties and high-dimensional nonlinear dynamics.
zh

[AI-6] owards Uncertainty Aware Task Delegation and Human-AI Collaborative Decision-Making

【速读】：该论文试图解决在人机协同决策中如何促进用户对人工智能（Artificial Intelligence, AI）的适当依赖问题。其解决方案的关键在于探索基于距离的不确定性评分（distance-based uncertainty scores），并通过嵌入表示（embedding representations）进行可视化，以帮助用户更好地理解AI输出的可靠性，从而提高决策准确性。研究结果表明，与传统的基于概率的不确定性评分相比，基于距离的不确定性评分在识别不确定案例方面表现更优，并显著提升了用户在审查AI输出后的正确决策率和修正错误决策的比例。

链接: https://arxiv.org/abs/2505.18066
作者: Min Hun Lee,Martyn Zhe Yu Tok
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ACM FAccT 2025

点击查看摘要

Abstract:Despite the growing promise of artificial intelligence (AI) in supporting decision-making across domains, fostering appropriate human reliance on AI remains a critical challenge. In this paper, we investigate the utility of exploring distance-based uncertainty scores for task delegation to AI and describe how these scores can be visualized through embedding representations for human-AI decision-making. After developing an AI-based system for physical stroke rehabilitation assessment, we conducted a study with 19 health professionals and 10 students in medicine/health to understand the effect of exploring distance-based uncertainty scores on users’ reliance on AI. Our findings showed that distance-based uncertainty scores outperformed traditional probability-based uncertainty scores in identifying uncertain cases. In addition, after exploring confidence scores for task delegation and reviewing embedding-based visualizations of distance-based uncertainty scores, participants achieved an 8.20% higher rate of correct decisions, a 7.15% higher rate of changing their decisions to correct ones, and a 7.14% lower rate of incorrect changes after reviewing AI outputs than those reviewing probability-based uncertainty scores ( p0.01 ). Our findings highlight the potential of distance-based uncertainty scores to enhance decision accuracy and appropriate reliance on AI while discussing ongoing challenges for human-AI collaborative decision-making.
zh

[AI-7] Linear Mixture Distributionally Robust Markov Decision Processes

【速读】：该论文旨在解决现实世界决策问题中的分布外动态（off-policy dynamics）挑战，即智能体在源域中学习的策略在目标域中部署时，由于状态转移动力学不同而导致性能下降的问题。其解决方案的关键在于提出一种新的线性混合分布鲁棒马尔可夫决策过程（linear mixture DRMDP）框架，该框架假设名义动力学为线性混合模型，并通过在混合权重参数周围定义不确定性集，而非直接以名义核为中心构建球形不确定性集，从而提供更精细的不确定性表征。这一方法在存在先验混合模型知识的情况下，优于基于(s,a)-矩形性和d-矩形性的传统模型。

链接: https://arxiv.org/abs/2505.18044
作者: Zhishuai Liu,Pan Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Machine Learning (stat.ML)
备注: 26 pages, 7 figures

点击查看摘要

Abstract:Many real-world decision-making problems face the off-dynamics challenge: the agent learns a policy in a source domain and deploys it in a target domain with different state transitions. The distributionally robust Markov decision process (DRMDP) addresses this challenge by finding a robust policy that performs well under the worst-case environment within a pre-specified uncertainty set of transition dynamics. Its effectiveness heavily hinges on the proper design of these uncertainty sets, based on prior knowledge of the dynamics. In this work, we propose a novel linear mixture DRMDP framework, where the nominal dynamics is assumed to be a linear mixture model. In contrast with existing uncertainty sets directly defined as a ball centered around the nominal kernel, linear mixture DRMDPs define the uncertainty sets based on a ball around the mixture weighting parameter. We show that this new framework provides a more refined representation of uncertainties compared to conventional models based on (s,a) -rectangularity and d -rectangularity, when prior knowledge about the mixture model is present. We propose a meta algorithm for robust policy learning in linear mixture DRMDPs with general f -divergence defined uncertainty sets, and analyze its sample complexities under three divergence metrics instantiations: total variation, Kullback-Leibler, and \chi^2 divergences. These results establish the statistical learnability of linear mixture DRMDPs, laying the theoretical foundation for future research on this new setting.
zh

[AI-8] Automata Learning of Preferences over Temporal Logic Formulas from Pairwise Comparisons

【速读】：该论文试图解决在顺序决策过程中，从用户提供的有限字对比较中推断出其未知的关于时序目标（temporal goals）的预序关系（preorder）问题。解决方案的关键在于将用户的偏好关系建模为偏好确定性有限自动机（Preference Deterministic Finite Automaton, PDFA），该模型通过在接受条件上增加预序来表示对时序语言的偏好。论文证明了该问题的计算复杂性，并提出了一种基于特征样本的学习算法，能够保证在给定特征样本的情况下学习到与真实PDFA等价的最小PDFA。

链接: https://arxiv.org/abs/2505.18030
作者: Hazhar Rahmani,Jie Fu
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 16 pages, 11 figures, technical report, submission under review

点击查看摘要

Abstract:Many preference elicitation algorithms consider preference over propositional logic formulas or items with different attributes. In sequential decision making, a user’s preference can be a preorder over possible outcomes, each of which is a temporal sequence of events. This paper considers a class of preference inference problems where the user’s unknown preference is represented by a preorder over regular languages (sets of temporal sequences), referred to as temporal goals. Given a finite set of pairwise comparisons between finite words, the objective is to learn both the set of temporal goals and the preorder over these goals. We first show that a preference relation over temporal goals can be modeled by a Preference Deterministic Finite Automaton (PDFA), which is a deterministic finite automaton augmented with a preorder over acceptance conditions. The problem of preference inference reduces to learning the PDFA. This problem is shown to be computationally challenging, with the problem of determining whether there exists a PDFA of size smaller than a given integer k , consistent with the sample, being NP-Complete. We formalize the properties of characteristic samples and develop an algorithm that guarantees to learn, given a characteristic sample, the minimal PDFA equivalent to the true PDFA from which the sample is drawn. We present the method through a running example and provide detailed analysis using a robotic motion planning problem.
zh

[AI-9] LLM assisted web application functional requirements generation: A case study of four popular LLM s over a Mess Management System

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在软件工程中生成功能性规格说明（包括用例、业务规则和协作流程）的性能评估问题。其解决方案的关键在于通过对比GPT、Claude、Gemini和DeepSeek这四个主流LLMs在零样本提示下的生成结果，评估其在语法和语义正确性、一致性、无歧义性和完整性等方面的表现，并与参考规格进行对比分析。

链接: https://arxiv.org/abs/2505.18019
作者: Rashmi Gupta,Aditya K Gupta,Aarav Jain,Avinash C Pandey,Atul Gupta
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages, 12 figures, Accepted in EASE 2025 this https URL

点击查看摘要

Abstract:Like any other discipline, Large Language Models (LLMs) have significantly impacted software engineering by helping developers generate the required artifacts across various phases of software development. This paper presents a case study comparing the performance of popular LLMs GPT, Claude, Gemini, and DeepSeek in generating functional specifications that include use cases, business rules, and collaborative workflows for a web application, the Mess Management System. The study evaluated the quality of LLM generated use cases, business rules, and collaborative workflows in terms of their syntactic and semantic correctness, consistency, non ambiguity, and completeness compared to the reference specifications against the zero-shot prompted problem statement. Our results suggested that all four LLMs can specify syntactically and semantically correct, mostly non-ambiguous artifacts. Still, they may be inconsistent at times and may differ significantly in the completeness of the generated specification. Claude and Gemini generated all the reference use cases, with Claude achieving the most complete but somewhat redundant use case specifications. Similar results were obtained for specifying workflows. However, all four LLMs struggled to generate relevant Business Rules, with DeepSeek generating the most reference rules but with less completeness. Overall, Claude generated more complete specification artifacts, while Gemini was more precise in the specifications it generated.
zh

[AI-10] ExoGait-MS: Learning Periodic Dynamics with Multi-Scale Graph Network for Exoskeleton Gait Recognition

【速读】：该论文旨在解决外骨骼机器人在个性化步态控制中的挑战，即现有控制方法难以适应个体差异，导致患者不适或可能受伤。解决方案的关键在于准确识别个人步态特征，特别是由关节协同作用引起的细微步态差异，如步频和步长。为此，研究提出了一种新方法，利用多尺度全局密集图卷积网络（Multi-Scale Global Dense Graph Convolutional Networks）在空间域中识别潜在的关节协同模式，并引入步态非线性周期动力学学习模块以有效捕捉步态在时间域中的周期特性。

链接: https://arxiv.org/abs/2505.18018
作者: Lijiang Liu,Junyu Shi,Yong Sun,Zhiyuan Zhang,Jinni Zhou,Shugen Ma,Qiang Nie
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current exoskeleton control methods often face challenges in delivering personalized treatment. Standardized walking gaits can lead to patient discomfort or even injury. Therefore, personalized gait is essential for the effectiveness of exoskeleton robots, as it directly impacts their adaptability, comfort, and rehabilitation outcomes for individual users. To enable personalized treatment in exoskeleton-assisted therapy and related applications, accurate recognition of personal gait is crucial for implementing tailored gait control. The key challenge in gait recognition lies in effectively capturing individual differences in subtle gait features caused by joint synergy, such as step frequency and step length. To tackle this issue, we propose a novel approach, which uses Multi-Scale Global Dense Graph Convolutional Networks (GCN) in the spatial domain to identify latent joint synergy patterns. Moreover, we propose a Gait Non-linear Periodic Dynamics Learning module to effectively capture the periodic characteristics of gait in the temporal domain. To support our individual gait recognition task, we have constructed a comprehensive gait dataset that ensures both completeness and reliability. Our experimental results demonstrate that our method achieves an impressive accuracy of 94.34% on this dataset, surpassing the current state-of-the-art (SOTA) by 3.77%. This advancement underscores the potential of our approach to enhance personalized gait control in exoskeleton-assisted therapy.
zh

[AI-11] AI Literacy for Legal AI Systems: A practical approach

【速读】：该论文试图解决法律人工智能（Legal AI）系统在司法和法律体系中部署与应用时所面临的法律与伦理挑战，特别是如何在利用其潜在优势（如减少偏见、提高效率、增强问责性）的同时，有效管理其带来的重大风险。解决方案的关键在于提升AI素养（AI literacy），作为欧盟《人工智能法案》中的法律要求和实现伦理AI的重要手段，从而为开发者和提供者提供一个评估风险、收益及利益相关者关切的实用工具——路线图问卷（roadmap questionnaire）。

链接: https://arxiv.org/abs/2505.18006
作者: Gizem Gultekin-Varkonyi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: Forthcoming in Iustum Aequum Salutare (2025) vol.21

点击查看摘要

Abstract:Legal AI systems are increasingly being adopted by judicial and legal system deployers and providers worldwide to support a range of applications. While they offer potential benefits such as reducing bias, increasing efficiency, and improving accountability, they also pose significant risks, requiring a careful balance between opportunities, and legal and ethical development and deployment. AI literacy, as a legal requirement under the EU AI Act and a critical enabler of ethical AI for deployers and providers, could be a tool to achieve this. The article introduces the term “legal AI systems” and then analyzes the concept of AI literacy and the benefits and risks associated with these systems. This analysis is linked to a broader AI-L concept for organizations that deal with legal AI systems. The outcome of the article, a roadmap questionnaire as a practical tool for developers and providers to assess risks, benefits, and stakeholder concerns, could be useful in meeting societal and regulatory expectations for legal AI.
zh

[AI-12] An Example Safety Case for Safeguards Against Misuse

【速读】：该论文试图解决现有AI滥用防护措施评估缺乏系统性连接至实际决策的问题，其核心问题是如何有效证明滥用防护措施能够将AI助手带来的风险降低至可接受水平。解决方案的关键在于构建一个端到端的“安全论证”（safety case），通过假设开发者红队对防护措施进行测试以估算规避所需的努力，再将该估计值输入定量“提升模型”（uplift model）以量化防护措施对滥用行为的阻挠程度，从而在部署过程中提供持续的风险信号，并快速响应新兴威胁。

链接: https://arxiv.org/abs/2505.18003
作者: Joshua Clymer,Jonah Weinbaum,Robert Kirk,Kimberly Mai,Selena Zhang,Xander Davies
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a “safety case”) that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative “uplift model” to determine how much barriers introduced by safeguards dissuade misuse (this https URL). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path – though not the only path – to rigorously justifying AI misuse risks are low.
zh

[AI-13] Outcome-based Reinforcement Learning to Predict the Future

【速读】：该论文试图解决将可验证奖励强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）扩展到更复杂、现实世界中的领域，如预测问题，而传统基于结果的强化学习在处理二元、延迟和噪声奖励时表现不稳定。解决方案的关键在于对两种先进算法——组相对策略优化（Group-Relative Policy Optimisation, GRPO）和ReMax进行适应性调整，包括移除GRPO中的逐问题方差缩放、在ReMax中应用基线减去的优势、使用10万条时间一致的合成问题进行训练，并引入轻量级的防护机制以惩罚无意义、非英语回答和缺失推理过程，从而实现单次稳定遍历11万事件的训练。通过将ReMax扩展至11万问题并集成七个预测结果，模型在准确性上与前沿基准o1相当，但在校准性和假设预测市场投注中表现更优。

链接: https://arxiv.org/abs/2505.17989
作者: Benjamin Turtel,Danny Franklin,Kris Skotheim,Luke Hewitt,Philipp Schoenegger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p 0.001). A simple trading rule turns this calibration edge into \ 127 of hypothetical profit versus \ 92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
zh

[AI-14] owards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning

【速读】：该论文试图解决R1-style Reinforcement Learning (RL)在提升大型语言模型推理能力时，其基于规则的RL机制尚不明确的问题，以及小规模监督微调（SFT）虽对RL有显著影响但效率较低的问题。解决方案的关键在于提出一种分析框架，并通过测量样本效应来比较SFT与RL的效率，进而提出Re-distillation技术，该技术通过从RL训练策略中进行小规模蒸馏来微调预训练模型，从而在减少样本和计算资源的情况下达到与RL相当的性能。

链接: https://arxiv.org/abs/2505.17988
作者: Yutong Chen,Jiandong Gao,Ji Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 figs, 3 table, preprint

点击查看摘要

Abstract:R1-style Reinforcement Learning (RL) significantly enhances Large Language Models’ reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has significant influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring sample effect. Hypothetical analysis show that SFT efficiency is limited by training data. Guided by our analysis, we propose Re-distillation, a technique that fine-tunes pretrain model through small-scale distillation from the RL-trained policy. Experiments on Knight Knave and MATH datasets demonstrate re-distillation’s surprising efficiency: re-distilled models match RL performance with far fewer samples and less computation. Empirical verification shows that sample effect is a good indicator of performance improvements. As a result, on KK dataset, our re-distilled Qwen2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. On MATH, Qwen2.5-1.5B fine-tuned with re-distilled 500 samples matches its instruct-tuned variant without RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: this https URL
zh

[AI-15] ADLGen: Synthesizing Symbolic Event-Triggered Sensor Sequences for Human Activity Modeling

【速读】：该论文旨在解决真实世界中日常生活活动（Activities of Daily Living, ADL）数据收集的挑战，包括隐私问题、高昂的部署和标注成本，以及人类行为的固有稀疏性和不平衡性。其解决方案的关键在于提出ADLGen，一个专门设计用于合成现实、事件触发且符号化的传感器序列的生成框架。ADLGen的核心技术包括仅解码器的Transformer架构、基于符号的时间编码方法，以及上下文和布局感知的采样机制，以生成语义丰富且物理上合理的传感器事件序列。此外，通过集成大语言模型到自动生成-评估-精炼循环中，进一步提升了语义保真度并修正结构不一致性，从而实现了高效、可扩展且隐私保护的ADL数据合成方案。

链接: https://arxiv.org/abs/2505.17987
作者: Weihang You,Hanqi Jiang,Zishuai Liu,Zihang Xie,Tianming Liu,Jin Lu,Fei Dou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real world collection of Activities of Daily Living data is challenging due to privacy concerns, costly deployment and labeling, and the inherent sparsity and imbalance of human behavior. We present ADLGen, a generative framework specifically designed to synthesize realistic, event triggered, and symbolic sensor sequences for ambient assistive environments. ADLGen integrates a decoder only Transformer with sign based symbolic temporal encoding, and a context and layout aware sampling mechanism to guide generation toward semantically rich and physically plausible sensor event sequences. To enhance semantic fidelity and correct structural inconsistencies, we further incorporate a large language model into an automatic generate evaluate refine loop, which verifies logical, behavioral, and temporal coherence and generates correction rules without manual intervention or environment specific tuning. Through comprehensive experiments with novel evaluation metrics, ADLGen is shown to outperform baseline generators in statistical fidelity, semantic richness, and downstream activity recognition, offering a scalable and privacy-preserving solution for ADL data synthesis.
zh

[AI-16] Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

【速读】：该论文试图解决大规模语言模型（Large Language Model, LLM）压缩过程中因使用简化的对角线近似方法而忽略参数相关性，导致下游任务性能下降的问题。其解决方案的关键在于提出一种称为广义Fisher加权奇异值分解（Generalized Fisher-Weighted SVD, GFWSVD）的后训练压缩技术，该技术能够同时考虑Fisher信息矩阵的对角线和非对角线元素，从而更准确地反映参数的重要性。为实现该方法的可行性，作者还引入了一种可扩展的Kronecker因子分解近似算法来处理观测到的Fisher信息。

链接: https://arxiv.org/abs/2505.17974
作者: Viktoriia Chekalina,Daniil Moskovskiy,Daria Cherniuk,Maxim Kurkin,Andrey Kuznetsov,Evgeny Frolov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20 compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent compression rate.
zh

[AI-17] SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models

【速读】：该论文试图解决在训练大规模语言模型（Large Language Models, LLMs）时，使用基于奇异值分解（Singular Value Decomposition, SVD）的梯度投影方法导致计算成本高和内存开销大的问题。其解决方案的关键在于提出一种计算效率高且概念简单的两步法，首先利用离散余弦变换（Discrete Cosine Transform, DCT）的预定义正交矩阵构建完整的正交基，其次根据每层梯度与基向量的对齐程度自适应选择基列，从而通过一次矩阵乘法和轻量级排序步骤得到投影矩阵，有效降低了存储需求并提升了运行效率。

链接: https://arxiv.org/abs/2505.17967
作者: Ionut-Vlad Modoranu,Mher Safaryan,Erik Schultheis,Dan Alistarh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
zh

[AI-18] NeuroTrails: Training with Dynamic Sparse Heads as the Key to Effective Ensembling

【速读】：该论文旨在解决模型集成（model ensembles）在深度学习中虽能提升泛化能力和鲁棒性，但往往伴随着巨大计算开销的问题。其解决方案的关键在于提出一种名为NeuroTrails的稀疏多头架构，该架构具有动态演化的拓扑结构，通过引入动态稀疏性诱导出预测多样性“黄金区域”（Goldilocks zone），从而在保持集成性能的同时显著降低资源需求。

链接: https://arxiv.org/abs/2505.17909
作者: Bram Grooten,Farid Hasanov,Chenxiang Zhang,Qiao Xiao,Boqian Wu,Zahra Atashgahi,Ghada Sokar,Shiwei Liu,Lu Yin,Elena Mocanu,Mykola Pechenizkiy,Decebal Constantin Mocanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Our open-source code is available at this https URL

点击查看摘要

Abstract:Model ensembles have long been a cornerstone for improving generalization and robustness in deep learning. However, their effectiveness often comes at the cost of substantial computational overhead. To address this issue, state-of-the-art methods aim to replicate ensemble-class performance without requiring multiple independently trained networks. Unfortunately, these algorithms often still demand considerable compute at inference. In response to these limitations, we introduce \textbfNeuroTrails , a sparse multi-head architecture with dynamically evolving topology. This unexplored model-agnostic training paradigm improves ensemble performance while reducing the required resources. We analyze the underlying reason for its effectiveness and observe that the various neural trails induced by dynamic sparsity attain a \textitGoldilocks zone of prediction diversity. NeuroTrails displays efficacy with convolutional and transformer-based architectures on computer vision and language tasks. Experiments on ResNet-50/ImageNet, LLaMA-350M/C4, among many others, demonstrate increased accuracy and stronger robustness in zero-shot generalization, while requiring significantly fewer parameters.
zh

[AI-19] Formalizing Embeddedness Failures in Universal Artificial Intelligence

【速读】：该论文试图解决AIXI强化学习代理作为嵌入式代理模型的常见失败问题，其关键在于尝试在通用人工智能框架内形式化这些失败模式，并证明它们的存在，特别是针对将联合行动/感知历史建模为来自普遍分布的AIXI变体。论文还评估了基于AIXI代理变体的嵌入式代理理论取得的进展。

链接: https://arxiv.org/abs/2505.17882
作者: Cole Wyeth,Marcus Hutter
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We rigorously discuss the commonly asserted failures of the AIXI reinforcement learning agent as a model of embedded agency. We attempt to formalize these failure modes and prove that they occur within the framework of universal artificial intelligence, focusing on a variant of AIXI that models the joint action/percept history as drawn from the universal distribution. We also evaluate the progress that has been made towards a successful theory of embedded agency based on variants of the AIXI agent.
zh

[AI-20] oward Optimal ANC: Establishing Mutual Information Lower Bound

【速读】：该论文试图解决深度学习驱动的主动降噪（Active Noise Cancellation, ANC）算法缺乏理论极限以严格评估其性能提升的问题。解决方案的关键在于推导一个统一的降噪性能下界，该下界由两个组成部分构成：信息论部分通过将残余误差功率与反噪声信号捕获的干扰熵比例相联系，量化了信息处理能力带来的限制；支持基础部分则衡量了在降噪路径无法处理的频段中产生的不可减少误差，反映了基本物理约束。通过取这两个部分的最大值，该下界建立了任何ANC算法可达到的归一化均方误差（NMSE）的理论上限。

链接: https://arxiv.org/abs/2505.17877
作者: François Derrida,Shahar Lutati,Eliya Nachmani
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Active Noise Cancellation (ANC) algorithms aim to suppress unwanted acoustic disturbances by generating anti-noise signals that destructively interfere with the original noise in real time. Although recent deep learning-based ANC algorithms have set new performance benchmarks, there remains a shortage of theoretical limits to rigorously assess their improvements. To address this, we derive a unified lower bound on cancellation performance composed of two components. The first component is information-theoretic: it links residual error power to the fraction of disturbance entropy captured by the anti-noise signal, thereby quantifying limits imposed by information-processing capacity. The second component is support-based: it measures the irreducible error arising in frequency bands that the cancellation path cannot address, reflecting fundamental physical constraints. By taking the maximum of these two terms, our bound establishes a theoretical ceiling on the Normalized Mean Squared Error (NMSE) attainable by any ANC algorithm. We validate its tightness empirically on the NOISEX dataset under varying reverberation times, demonstrating robustness across diverse acoustic conditions.
zh

[AI-21] Mixture of Low Rank Adaptation with Partial Parameter Sharing for Time Series Forecasting

【速读】：该论文试图解决多任务时间序列预测（Multi-task forecasting）中存在的表达能力瓶颈（Expressiveness Bottleneck）问题，即不同时间步的预测共享相同表示，导致即使在最优表示下仍存在不可避免的误差。解决方案的关键在于提出一个两阶段框架：首先预训练一个一步 ahead 预测的基础模型，然后通过步骤特定的低秩适配（LoRA）进行微调，从而使得基础模型能够处理任意数量的预测步骤并避免表达能力瓶颈。此外，还引入了混合 LoRA（Mixture-of-LoRA, MoLA）模型，利用自适应加权的 LoRA 专家实现步骤间的部分参数共享，以提升效率和预测性能。

链接: https://arxiv.org/abs/2505.17872
作者: Licheng Pan,Zhichao Chen,Haoxuan Li,Guangyi Liu,Zhijian Xu,Zhaoran Liu,Hao Wang,Ying Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-task forecasting has become the standard approach for time-series forecasting (TSF). However, we show that it suffers from an Expressiveness Bottleneck, where predictions at different time steps share the same representation, leading to unavoidable errors even with optimal representations. To address this issue, we propose a two-stage framework: first, pre-train a foundation model for one-step-ahead prediction; then, adapt it using step-specific LoRA this http URL design enables the foundation model to handle any number of forecast steps while avoiding the expressiveness bottleneck. We further introduce the Mixture-of-LoRA (MoLA) model, which employs adaptively weighted LoRA experts to achieve partial parameter sharing across steps. This approach enhances both efficiency and forecasting performance by exploiting interdependencies between forecast steps. Experiments show that MoLA significantly improves model expressiveness and outperforms state-of-the-art time-series forecasting methods. Code is available at this https URL.
zh

[AI-22] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

【速读】：该论文试图解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在同步处理跨模态信息方面能力不足的问题。其关键解决方案是构建一个名为Daily-Omni的音频-视觉问答基准，包含丰富的日常场景视频数据及对应的多选题问答对，并通过自动标注、问答生成与优化的流水线提高基准的可扩展性和评估效率；同时引入Daily-Omni-Agent，利用开源的视觉语言模型（Visual Language Model, VLM）、音频语言模型（Audio Language Model, ALM）和自动语音识别（Automatic Speech Recognition, ASR）模型，无需额外训练即可建立基准基线，验证了结合VLM与ALM并辅以简单时间对齐技术能够显著提升音频-视觉融合任务的性能。

链接: https://arxiv.org/abs/2505.17862
作者: Ziwei Zhou,Rui Wang,Zuxuan Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. In this paper, we introduce: 1) Daily-Omni, an Audio-Visual Questioning and Answering benchmark comprising 684 videos of daily life scenarios from diverse sources, rich in both audio and visual information, and featuring 1197 multiple-choice QA pairs across 6 major tasks; 2) Daily-Omni QA Generation Pipeline, which includes automatic annotation, QA generation and QA optimization, significantly improves efficiency for human evaluation and scalability of the benchmark; 3) Daily-Omni-Agent, a training-free agent utilizing open-source Visual Language Model (VLM), Audio Language Model (ALM) and Automatic Speech Recognition (ASR) model to establish a baseline for this benchmark. The results show that current MLLMs still struggle significantly with tasks requiring audio-visual integration, but combining VLMs and ALMs with simple temporal alignment techniques can achieve substantially better performance. Codes and benchmark are available at \hrefthis https URLthis https URL.
zh

[AI-23] Superplatforms Have to Attack AI Agents

【速读】：该论文试图解决超级平台（superplatforms）在生成式AI代理（AI agents）兴起背景下，其基于用户注意力的盈利模式与代理自主性之间产生的根本性冲突问题。解决方案的关键在于通过门控理论（gatekeeping theory）分析这种冲突，并揭示AI代理可能取代超级平台成为新的数字流量入口，从而迫使超级平台采取主动措施限制和对抗AI代理。研究还探讨了超级平台发起攻击的潜在技术手段，强调需关注由此引发的新兴张力，并倡导以用户利益为核心的合作性解决方案。

链接: https://arxiv.org/abs/2505.17861
作者: Jianghao Lin,Jiachen Zhu,Zheli Zhou,Yunjia Xi,Weiwen Liu,Yong Yu,Weinan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: Position paper under review

点击查看摘要

Abstract:Over the past decades, superplatforms, digital companies that integrate a vast range of third-party services and applications into a single, unified ecosystem, have built their fortunes on monopolizing user attention through targeted advertising and algorithmic content curation. Yet the emergence of AI agents driven by large language models (LLMs) threatens to upend this business model. Agents can not only free user attention with autonomy across diverse platforms and therefore bypass the user-attention-based monetization, but might also become the new entrance for digital traffic. Hence, we argue that superplatforms have to attack AI agents to defend their centralized control of digital traffic entrance. Specifically, we analyze the fundamental conflict between user-attention-based monetization and agent-driven autonomy through the lens of our gatekeeping theory. We show how AI agents can disintermediate superplatforms and potentially become the next dominant gatekeepers, thereby forming the urgent necessity for superplatforms to proactively constrain and attack AI agents. Moreover, we go through the potential technologies for superplatform-initiated attacks, covering a brand-new, unexplored technical area with unique challenges. We have to emphasize that, despite our position, this paper does not advocate for adversarial attacks by superplatforms on AI agents, but rather offers an envisioned trend to highlight the emerging tensions between superplatforms and AI agents. Our aim is to raise awareness and encourage critical discussion for collaborative solutions, prioritizing user interests and perserving the openness of digital ecosystems in the age of AI agents.
zh

[AI-24] Scalable Valuation of Human Feedback through Provably Robust Model Alignment

【速读】：该论文试图解决语言模型对齐中的噪声人类反馈问题，即在对齐过程中，来自众包的反馈可能包含不一致或不利的响应，从而影响对齐效果。解决方案的关键是提出Hölder-DPO，这是一种具有可证明红降（redescending）特性的原则性对齐损失函数，能够在严重标签噪声下保持模型参数的一致性，从而从噪声反馈中估计出干净数据分布。

链接: https://arxiv.org/abs/2505.17859
作者: Masahiro Fujisawa,Masaki Adachi,Michael A. Osborne
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 38 pages, 7 figures

点击查看摘要

Abstract:Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy – for example, preferring less desirable responses – posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment methods satisfy this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, we apply Hölder-DPO to widely used alignment datasets, revealing substantial noise levels and demonstrating that removing these mislabels significantly improves alignment performance across methods.
zh

[AI-25] Stochastic Weight Sharing for Bayesian Neural Networks

【速读】：该论文试图解决贝叶斯神经网络（Bayesian Neural Networks, BNNs）在深度学习中进行不确定性量化时面临的计算需求高和训练深层架构收敛困难的问题。其解决方案的关键在于从随机视角重新诠释权重共享量化技术，通过引入二维自适应高斯分布、Wasserstein距离估计和alpha混合，将BNN的随机行为编码为低维的软高斯表示，从而显著降低贝叶斯学习的计算开销，实现大规模模型的高效贝叶斯训练。

链接: https://arxiv.org/abs/2505.17856
作者: Moule Lin,Shuhao Guan,Weipeng Jing,Goetz Botterweck,Andrea Patane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While offering a principled framework for uncertainty quantification in deep learning, the employment of Bayesian Neural Networks (BNNs) is still constrained by their increased computational requirements and the convergence difficulties when training very deep, state-of-the-art architectures. In this work, we reinterpret weight-sharing quantization techniques from a stochastic perspective in the context of training and inference with Bayesian Neural Networks (BNNs). Specifically, we leverage 2D adaptive Gaussian distributions, Wasserstein distance estimations, and alpha blending to encode the stochastic behaviour of a BNN in a lower dimensional, soft Gaussian representation. Through extensive empirical investigation, we demonstrate that our approach significantly reduces the computational overhead inherent in Bayesian learning by several orders of magnitude, enabling the efficient Bayesian training of large-scale models, such as ResNet-101 and Vision Transformer (VIT). On various computer vision benchmarks including CIFAR10, CIFAR100, and ImageNet1k. Our approach compresses model parameters by approximately 50x and reduces model size by 75, while achieving accuracy and uncertainty estimations comparable to the state-of-the-art.
zh

[AI-26] Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization

【速读】：该论文试图解决在长上下文下训练大型循环神经网络（Recurrent Neural Networks, RNNs）的不可行性问题，这一问题源于标准优化方法依赖于通过时间反向传播（Backpropagation Through Time, BPTT），导致内存使用随上下文长度和模型规模线性增长。解决方案的关键在于用零阶优化（Zero-Order Optimization, ZOO）方法，如随机向量梯度估计（Random-vector Gradient Estimation, RGE），替代BPTT，从而在保持模型处于推理模式的情况下，显著降低内存消耗并提高训练效率，同时实现与BPTT相当或更优的收敛速度和泛化能力。

链接: https://arxiv.org/abs/2505.17852
作者: Francois Chaubard,Mykel Kochenderfer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:During inference, Recurrent Neural Networks (RNNs) scale constant in both FLOPs and GPU memory with increasing context length, as they compress all prior tokens into a fixed-size memory. In contrast, transformers scale linearly in FLOPs and, at best, linearly in memory during generation, since they must attend to all previous tokens explicitly. Despite this inference-time advantage, training large RNNs on long contexts remains impractical because standard optimization methods depend on Backpropagation Through Time (BPTT). BPTT requires retention of all intermediate activations during the forward pass, causing memory usage to scale linearly with both context length and model size. In this paper, we show that Zero-Order Optimization (ZOO) methods such as Random-vector Gradient Estimation (RGE) can successfully replace BPTT to train RNNs with convergence rates that match, or exceed BPTT by up to 19 fold, while using orders of magnitude less memory and cost, as the model remains in inference mode throughout training. We further demonstrate that Central-Difference RGE (CD-RGE) corresponds to optimizing a smoothed surrogate loss, inherently regularizing training and improving generalization. Our method matches or outperforms BPTT across three settings: (1) overfitting, (2) transduction, and (3) language modeling. Across all tasks, with sufficient perturbations, our models generalize as well as or better than those trained with BPTT, often in fewer steps. Despite the need for more forward passes per step, we can surpass BPTT wall-clock time per step using recent advancements such as FlashRNN and distributed inference.
zh

[AI-27] ransDF: Time-Series Forecasting Needs Transformed Label Alignment

【速读】：该论文试图解决时间序列预测模型在训练过程中面临的两个关键问题：（1）标签自相关性（label autocorrelation），导致标签序列似然出现偏差；（2）任务数量过多，随着预测范围的扩大而增加，从而增加了优化难度。解决方案的关键在于提出Transform-enhanced Direct Forecast（TransDF），通过将标签序列转换为去相关的组件并区分其显著性，使模型仅对最显著的组件进行对齐，从而有效缓解标签自相关性并减少任务数量。

链接: https://arxiv.org/abs/2505.17847
作者: Hao Wang,Licheng Pan,Zhichao Chen,Xu Chen,Qingyang Dai,Lei Wang,Haoxuan Li,Zhouchen Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Training time-series forecasting models presents unique challenges in designing effective learning objectives. Existing methods predominantly utilize the temporal mean squared error, which faces two critical challenges: (1) label autocorrelation, which leads to bias from the label sequence likelihood; (2) excessive amount of tasks, which increases with the forecast horizon and complicates optimization. To address these challenges, we propose Transform-enhanced Direct Forecast (TransDF), which transforms the label sequence into decorrelated components with discriminated significance. Models are trained to align the most significant components, thereby effectively mitigating label autocorrelation and reducing task amount. Extensive experiments demonstrate that TransDF achieves state-of-the-art performance and is compatible with various forecasting models. Code is available at this https URL.
zh

[AI-28] EDI: Trustworthy and Ethical Dataset Indicators to Analyze and Compare Dataset Documentation

【速读】：该论文试图解决多模态数据集在可信性和伦理维度上的透明度不足问题，以及由此导致的难以比较不同数据集在伦理和可信性方面的差异。解决方案的关键在于提出可信与伦理数据集指标（Trustworthy and Ethical Dataset Indicators, TEDI），该指标包含143个细粒度的指标，用于系统化、实证地分析数据集文档，从而提取可验证的信息以评估数据集的可信性和伦理属性。通过TEDI，研究者对超过100个多模态数据集进行了人工标注和分析，揭示了影响数据集伦理和可信性维度的因素。

链接: https://arxiv.org/abs/2505.17841
作者: Wiebke Hutiri,Mircea Cimpoi,Morgan Scheuerman,Victoria Matthews,Alice Xiang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Dataset transparency is a key enabler of responsible AI, but insights into multimodal dataset attributes that impact trustworthy and ethical aspects of AI applications remain scarce and are difficult to compare across datasets. To address this challenge, we introduce Trustworthy and Ethical Dataset Indicators (TEDI) that facilitate the systematic, empirical analysis of dataset documentation. TEDI encompasses 143 fine-grained indicators that characterize trustworthy and ethical attributes of multimodal datasets and their collection processes. The indicators are framed to extract verifiable information from dataset documentation. Using TEDI, we manually annotated and analyzed over 100 multimodal datasets that include human voices. We further annotated data sourcing, size, and modality details to gain insights into the factors that shape trustworthy and ethical dimensions across datasets. We find that only a select few datasets have documented attributes and practices pertaining to consent, privacy, and harmful content indicators. The extent to which these and other ethical indicators are addressed varies based on the data collection method, with documentation of datasets collected via crowdsourced and direct collection approaches being more likely to mention them. Scraping dominates scale at the cost of ethical indicators, but is not the only viable collection method. Our approach and empirical insights contribute to increasing dataset transparency along trustworthy and ethical dimensions and pave the way for automating the tedious task of extracting information from dataset documentation in future.
zh

[AI-29] Hybrid Mamba-Transformer Decoder for Error-Correcting Codes

【速读】：该论文旨在解决传统纠错码解码方法在处理复杂线性码时性能不足的问题，特别是如何有效结合序列建模与全局上下文理解以提升解码准确率。其解决方案的关键在于提出一种混合解码器，该解码器融合了Mamba架构的高效序列建模能力与Transformer的全局上下文捕捉优势，并通过设计层间掩码策略和渐进式层间损失函数，实现了对不同深度代码特征的有选择性关注与鲁棒特征提取。

链接: https://arxiv.org/abs/2505.17834
作者: Shy-el Cohen,Yoni Choukroun,Eliya Nachmani
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a novel deep learning method for decoding error correction codes based on the Mamba architecture, enhanced with Transformer layers. Our approach proposes a hybrid decoder that leverages Mamba’s efficient sequential modeling while maintaining the global context capabilities of Transformers. To further improve performance, we design a novel layer-wise masking strategy applied to each Mamba layer, allowing selective attention to relevant code features at different depths. Additionally, we introduce a progressive layer-wise loss, supervising the network at intermediate stages and promoting robust feature extraction throughout the decoding process. Comprehensive experiments across a range of linear codes demonstrate that our method significantly outperforms Transformer-only decoders and standard Mamba models.
zh

[AI-30] Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverag e in Online Reinforcement Learning

【速读】：该论文旨在解决目标条件强化学习（Goal-Conditioned Reinforcement Learning, GCRL）在视觉环境中因高维、语义稀疏观测而导致的表示学习问题，特别是在在线设置中，由于潜在空间随智能体策略演变，可能导致对有限状态的过度表示，从而影响状态覆盖和技能学习效果。解决方案的关键在于引入DRAG（Distributionally Robust Auto-Encoding for GCRL），该方法结合了β-VAE框架与分布鲁棒优化，通过对抗性神经加权器来调整训练状态的权重，以缓解当前数据分布与环境未见部分之间的不匹配，从而促进更全面的状态空间覆盖和更好的下游控制性能。

链接: https://arxiv.org/abs/2505.17830
作者: Nicolas Castanet,Olivier Sigaud,Sylvain Lamprier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent’s policy, to capture newly discovered areas of the environment. However, without incentivization to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure a full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the \beta -VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training nor prior environment knowledge.
zh

[AI-31] Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

【速读】：该论文试图解决在AI系统安全评估过程中，高级AI系统可能因感知到被评估状态而改变行为，从而导致评估结果失真的问题，即评估伪造（evaluation faking）现象。解决方案的关键在于通过系统性实验和链式思维监控技术，识别AI系统在评估情境下的行为变化，并揭示其内部信号，为后续缓解策略提供依据。

链接: https://arxiv.org/abs/2505.17815
作者: Yihe Fan,Wenqi Zhang,Xudong Pan,Min Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As foundation models grow increasingly more intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: Whether and how an advanced AI system would perceive the situation of being evaluated, and lead to the broken integrity of the evaluation process? During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model without any contextual cues would occasionally recognize it is being evaluated and hence behave more safety-aligned. This motivates us to conduct a systematic study on the phenomenon of evaluation faking, i.e., an AI system autonomously alters its behavior upon recognizing the presence of an evaluation context and thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach the main finding termed the observer effects for AI: When the AI system under evaluation is more advanced in reasoning and situational awareness, the evaluation faking behavior becomes more ubiquitous, which reflects in the following aspects: 1) Reasoning models recognize evaluation 16% more often than non-reasoning models. 2) Scaling foundation models (32B to 671B) increases faking by over 30% in some cases, while smaller models show negligible faking. 3) AI with basic memory is 2.3x more likely to recognize evaluation and scores 19% higher on safety tests (vs. no memory). To measure this, we devised a chain-of-thought monitoring technique to detect faking intent and uncover internal signals correlated with such behavior, offering insights for future mitigation studies.
zh

[AI-32] Hyperparameter Optimization via Interacting with Probabilistic Circuits

【速读】：该论文试图解决交互式超参数优化（Interactive Hyperparameter Optimization, HPO）中如何有效整合人类反馈的问题。现有交互式贝叶斯优化（Bayesian Optimization, BO）方法通过将用户定义的先验分布加权到采集函数上来融入人类信念，但由于BO中常见的内层优化问题，这种加权方案并不能总是准确反映用户的实际信念。该论文的关键解决方案是引入一种基于可处理概率模型——概率电路（Probabilistic Circuits, PCs）的新型BO方法。PCs在混合超参数空间和评估分数上编码了可处理的联合分布，并支持精确的条件推理和采样。基于条件采样，论文构建了一种无需采集函数的候选点生成策略，从而消除了对额外内层优化的需求，并确保用户信念被准确反映在选择策略中。

链接: https://arxiv.org/abs/2505.17804
作者: Jonas Seng,Fabrizio Ventola,Zhongjie Yu,Kristian Kersting
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the growing interest in designing truly interactive hyperparameter optimization (HPO) methods, to date, only a few allow to include human feedback. Existing interactive Bayesian optimization (BO) methods incorporate human beliefs by weighting the acquisition function with a user-defined prior distribution. However, in light of the non-trivial inner optimization of the acquisition function prevalent in BO, such weighting schemes do not always accurately reflect given user beliefs. We introduce a novel BO approach leveraging tractable probabilistic models named probabilistic circuits (PCs) as a surrogate model. PCs encode a tractable joint distribution over the hybrid hyperparameter space and evaluation scores. They enable exact conditional inference and sampling. Based on conditional sampling, we construct a novel selection policy that enables an acquisition function-free generation of candidate points (thereby eliminating the need for an additional inner-loop optimization) and ensures that user beliefs are reflected accurately in the selection policy. We provide a theoretical analysis and an extensive empirical evaluation, demonstrating that our method achieves state-of-the-art performance in standard HPO and outperforms interactive BO baselines in interactive HPO.
zh

[AI-33] Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

【速读】：该论文旨在解决自主多智能体系统（MAS）在复杂任务自动化中因协调失误和目标不对齐等风险而引发的信任问题。其核心挑战在于如何提升多智能体强化学习的可解释性，尤其是在状态/动作空间复杂性、利益相关者需求多样化以及评估标准不足等方面。论文提出的解决方案是基于反事实因果理论和大语言模型（LLM）的摘要能力，构建了Agentic eXplanations via Interrogative Simulation (AXIS)，通过让LLM使用“whatif”和“remove”等查询与环境模拟器交互，从而生成可理解的因果解释。该方法的关键在于利用多轮反事实信息的观察与综合，提升解释的准确性和可信度。

链接: https://arxiv.org/abs/2505.17801
作者: Bálint Gyevnár,Christopher G. Lucas,Stefano V. Albrecht,Shay B. Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks like miscoordination and goal misalignment. Explainability is vital for trust calibration, but explainable reinforcement learning for MAS faces challenges in state/action space complexity, stakeholder needs, and evaluation. Using the counterfactual theory of causation and LLMs’ summarisation capabilities, we propose Agentic eXplanations via Interrogative Simulation (AXIS). AXIS generates intelligible causal explanations for pre-trained multi-agent policies by having an LLM interrogate an environment simulator using queries like ‘whatif’ and ‘remove’ to observe and synthesise counterfactual information over multiple rounds. We evaluate AXIS on autonomous driving across 10 scenarios for 5 LLMs with a novel evaluation methodology combining subjective preference, correctness, and goal/action prediction metrics, and an external LLM as evaluator. Compared to baselines, AXIS improves perceived explanation correctness by at least 7.7% across all models and goal prediction accuracy by 23% for 4 models, with improved or comparable action prediction accuracy, achieving the highest scores overall.
zh

[AI-34] Bruno: Backpropagation Running Undersampled for Novel device Optimization

【速读】：该论文试图解决在基于铁电电容器（FeCap）和电阻开关非易失性器件（RRAM）的专用硬件上训练神经网络时，如何有效应对硬件特性带来的挑战，如随机性、变异性和低精度等问题。其解决方案的关键在于采用自底向上的方法，从物理器件的紧凑模型出发，构建神经元的计算原语，并开发一种能够在考虑常见硬件限制的情况下可靠地进行反向传播的训练算法。

链接: https://arxiv.org/abs/2505.17791
作者: Luca Fehlings,Bojian Zhang,Paolo Gibertini,Martin A. Nicholson,Erika Covi,Fernando M. Quintana
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 pages supplementary material

点击查看摘要

Abstract:Recent efforts to improve the efficiency of neuromorphic and machine learning systems have focused on the development of application-specific integrated circuits (ASICs), which provide hardware specialized for the deployment of neural networks, leading to potential gains in efficiency and performance. These systems typically feature an architecture that goes beyond the von Neumann architecture employed in general-purpose hardware such as GPUs. Neural networks developed for this specialised hardware then need to take into account the specifics of the hardware platform, which requires novel training algorithms and accurate models of the hardware, since they cannot be abstracted as a general-purpose computing platform. In this work, we present a bottom-up approach to train neural networks for hardware based on spiking neurons and synapses built on ferroelectric capacitor (FeCap) and Resistive switching non-volatile devices (RRAM) respectively. In contrast to the more common approach of designing hardware to fit existing abstract neuron or synapse models, this approach starts with compact models of the physical device to model the computational primitive of the neurons. Based on these models, a training algorithm is developed that can reliably backpropagate through these physical models, even when applying common hardware limitations, such as stochasticity, variability, and low bit precision. The training algorithm is then tested on a spatio-temporal dataset with a network composed of quantized synapses based on RRAM and ferroelectric leaky integrate-and-fire (FeLIF) neurons. The performance of the network is compared with different networks composed of LIF neurons. The results of the experiments show the potential advantage of using BRUNO to train networks with FeLIF neurons, by achieving a reduction in both time and memory for detecting spatio-temporal patterns with quantized synapses.
zh

[AI-35] But what is your honest answer? Aiding LLM -judges with honest alternatives using steering vectors

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）中存在不易被检测的不诚实行为问题，如谄媚行为。现有诚实性评估基准主要关注事实知识或明显有害行为，并依赖外部评判者，难以识别较为隐蔽的不诚实表现。解决方案的关键在于提出一种新的框架——使用安全引导替代的裁判（Judge Using Safety-Steered Alternatives, JUSSA），该框架通过在单个样本上训练的引导向量，促使模型生成更诚实的回应，从而提升LLM裁判对不诚实行为的检测能力。

链接: https://arxiv.org/abs/2505.17760
作者: Leon Eshuijs,Archie Chaudhury,Alan McBeth,Ethan Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent safety evaluations of Large Language Models (LLMs) show that many models exhibit dishonest behavior, such as sycophancy. However, most honesty benchmarks focus exclusively on factual knowledge or explicitly harmful behavior and rely on external judges, which are often unable to detect less obvious forms of dishonesty. In this work, we introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models, helping LLM-judges in the detection of dishonest behavior. To test our framework, we introduce a new manipulation dataset with prompts specifically designed to elicit deceptive responses. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
zh

[AI-36] Mind the GAP! The Challenges of Scale in Pixel-based Deep Reinforcement Learning

【速读】：该论文试图解决在基于像素的环境中扩展深度强化学习时性能下降的问题（performance drop）。研究指出，编码器（由卷积层堆叠而成）的输出与后续全连接层之间的连接是限制扩展能力的主要因素，这一连接被定义为瓶颈（bottleneck）。论文的关键解决方案是引入全局平均池化（global average pooling），作为一种简单而有效的方法来直接针对这一瓶颈，从而避免了早期方法的复杂性。

链接: https://arxiv.org/abs/2505.17749
作者: Ghada Sokar,Pablo Samuel Castro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling deep reinforcement learning in pixel-based environments presents a significant challenge, often resulting in diminished performance. While recent works have proposed algorithmic and architectural approaches to address this, the underlying cause of the performance drop remains unclear. In this paper, we identify the connection between the output of the encoder (a stack of convolutional layers) and the ensuing dense layers as the main underlying factor limiting scaling capabilities; we denote this connection as the bottleneck, and we demonstrate that previous approaches implicitly target this bottleneck. As a result of our analyses, we present global average pooling as a simple yet effective way of targeting the bottleneck, thereby avoiding the complexity of earlier approaches.
zh

[AI-37] MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization

【速读】：该论文旨在解决元黑箱优化（Meta-Black-Box Optimization, MetaBBO）领域中自动化优化算法设计的效率与灵活性不足的问题。其关键解决方案是提出MetaBox-v2，该框架具备统一架构以支持强化学习、进化算法和基于梯度的方法，并通过高效的并行化方案、全面的基准测试集以及丰富的自定义分析与集成接口，显著提升了优化算法开发的自动化水平与适用性。

链接: https://arxiv.org/abs/2505.17745
作者: Zeyuan Ma,Yue-Jiao Gong,Hongshu Guo,Wenjie Qiu,Sijie Ma,Hongqiao Lian,Jiajun Zhan,Kaixu Chen,Chen Wang,Zhiyang Huang,Zechuan Huang,Guojun Peng,Ran Cheng,Yining Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keep pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (this https URL) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce 23 up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by 10-40x; 3) a comprehensive benchmark suite of 18 synthetic/realistic tasks (1900+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integrating to external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of the optimization performance, generalization ability and learning efficiency. Valuable insights are concluded from thorough and detailed analysis for practitioners and those new to the field.
zh

[AI-38] Automating Safety Enhancement for LLM -based Agents with Synthetic Risk Scenarios

【速读】：该论文旨在解决大型语言模型（Large Language Model, LLM）驱动的智能体在实际应用中面临的安全性问题，尤其是由动态用户交互、外部工具使用以及潜在的有害行为引发的多样化和复杂风险。其解决方案的关键在于提出AutoSafe框架，该框架通过完全自动化的合成数据生成机制系统性地提升智能体的安全性，核心包括一个开放且可扩展的威胁模型OTS，用于形式化描述不安全行为的产生机制，并构建了一个大规模、多样化且高质量的安全训练数据集，从而无需依赖危险的真实数据收集。

链接: https://arxiv.org/abs/2505.17735
作者: Xueyang Zhou,Weidong Wang,Lin Lu,Jiawen Shi,Guiyao Tie,Yongtian Xu,Lixing Chen,Pan Zhou,Neil Zhenqiang Gong,Lichao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 38 pages;12 figures;12 tables

点击查看摘要

Abstract:Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as “digital assistants, autonomous customer service, and decision-support systems”, where their ability to “interact in multi-turn, tool-augmented environments” makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset-eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at this https URL.
zh

[AI-39] CIKT: A Collaborative and Iterative Knowledge Tracing Framework with Large Language Models

【速读】：该论文旨在解决传统知识追踪（Knowledge Tracing, KT）方法在可解释性、可扩展性和复杂知识依赖建模方面的不足。其解决方案的关键在于提出一种名为协作迭代知识追踪（Collaborative Iterative Knowledge Tracing, CIKT）的框架，该框架通过结合大型语言模型（Large Language Models, LLMs）的优势，实现预测准确性的提升与模型可解释性的增强。CIKT采用双组件架构，包括生成动态可解释用户画像的分析师和基于这些画像进行未来表现预测的预测器，并通过协同优化循环不断迭代优化两者，从而有效提升模型性能与透明度。

链接: https://arxiv.org/abs/2505.17705
作者: Runze Li,Siyu Wu,Jun Wang,Wei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) aims to model a student’s learning state over time and predict their future performance. However, traditional KT methods often face challenges in explainability, scalability, and effective modeling of complex knowledge dependencies. While Large Language Models (LLMs) present new avenues for KT, their direct application often struggles with generating structured, explainable student representations and lacks mechanisms for continuous, task-specific refinement. To address these gaps, we propose Collaborative Iterative Knowledge Tracing (CIKT), a framework that harnesses LLMs to enhance both prediction accuracy and explainability. CIKT employs a dual-component architecture: an Analyst generates dynamic, explainable user profiles from student historical responses, and a Predictor utilizes these profiles to forecast future performance. The core of CIKT is a synergistic optimization loop. In this loop, the Analyst is iteratively refined based on the predictive accuracy of the Predictor, which conditions on the generated profiles, and the Predictor is subsequently retrained using these enhanced profiles. Evaluated on multiple educational datasets, CIKT demonstrates significant improvements in prediction accuracy, offers enhanced explainability through its dynamically updated user profiles, and exhibits improved scalability. Our work presents a robust and explainable solution for advancing knowledge tracing systems, effectively bridging the gap between predictive performance and model transparency.
zh

[AI-40] Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory

【速读】：该论文试图解决长期短期记忆网络（LSTM）在面对输入扰动时的鲁棒性保障问题，旨在为人工智能系统质量保证提供关键技术。解决方案的关键在于引入一种新颖的方法，即增量输入到状态稳定性（δ ISS），用于数学上定义和评估LSTM对输入扰动的鲁棒性，从而实现数据无关的评估方法和通过调整训练参数实现的鲁棒性控制。

链接: https://arxiv.org/abs/2505.17696
作者: Sota Yoshihara(1),Ryousuke Yamamoto(2),Hiroyuki Kusumoto(1),Masanari Shimura(1) ((1) Graduate School of Mathematics, Nagoya University, (2) Aisin Software)
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 8 pages, 6 figures. Appendix: 16 pages. First three listed authors have equal contributions

点击查看摘要

Abstract:This research proposes methods for formulating and guaranteeing the resilience of long short-term memory (LSTM) networks, which can serve as a key technology in AI system quality assurance. We introduce a novel methodology applying incremental input-to-state stability ( \delta ISS) to mathematically define and evaluate the resilience of LSTM against input perturbations. Key achievements include the development of a data-independent evaluation method and the demonstration of resilience control through adjustments to training parameters. This research presents concrete solutions to AI quality assurance from a control theory perspective, which can advance AI applications in control systems.
zh

[AI-41] Rethinking Agent Design: From Top-Down Workflows to Bottom-Up Skill Evolution

【速读】：该论文试图解决传统基于大语言模型（Large Language Model, LLM）的智能体框架依赖人工设计任务分解与流程定义的问题，这些系统在基准任务中表现有效，但缺乏从经验中自主学习的能力。其解决方案的关键在于引入一种自下而上的智能体范式，该范式模仿人类的学习过程，通过试错与反思机制逐步获得能力，并能够将技能快速共享和扩展，从而实现持续进化而非静态复制。

链接: https://arxiv.org/abs/2505.17673
作者: Jiawei Du,Jinlong Wu,Yuzheng Chen,Yucheng Hu,Bing Li,Joey Tianyi Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most LLM-based agent frameworks adopt a top-down philosophy: humans decompose tasks, define workflows, and assign agents to execute each step. While effective on benchmark-style tasks, such systems rely on designer updates and overlook agents’ potential to learn from experience. Recently, Silver and Sutton(2025) envision a shift into a new era, where agents could progress from a stream of experiences. In this paper, we instantiate this vision of experience-driven learning by introducing a bottom-up agent paradigm that mirrors the human learning process. Agents acquire competence through a trial-and-reasoning mechanism-exploring, reflecting on outcomes, and abstracting skills over time. Once acquired, skills can be rapidly shared and extended, enabling continual evolution rather than static replication. As more agents are deployed, their diverse experiences accelerate this collective process, making bottom-up design especially suited for open-ended environments. We evaluate this paradigm in Slay the Spire and Civilization V, where agents perceive through raw visual inputs and act via mouse outputs, the same as human players. Using a unified, game-agnostic codebase without any game-specific prompts or privileged APIs, our bottom-up agents acquire skills entirely through autonomous interaction, demonstrating the potential of the bottom-up paradigm in complex, real-world environments. Our code is available at this https URL.
zh

[AI-42] owards General Continuous Memory for Vision-Language Models

【速读】：该论文旨在解决语言模型（Language Models, LMs）和视觉-语言模型（Vision-Language Models, VLMs）在处理需要多模态或跨语言现实知识的复杂推理任务时表现不足的问题。现有方法通过将图像和文本标记拼接为长序列作为记忆，导致上下文长度增加并可能降低性能。该论文提出的解决方案的关键在于使用连续记忆（continuous memory），即一组紧凑的密集嵌入，以更高效和有效地表示多模态和跨语言知识。其核心思想是利用VLM自身作为连续记忆编码器，通过数据高效且参数高效的微调方法，仅需1.2%的模型参数和15.6K个自合成样本，即可将VLM转化为记忆编码器，从而在保持VLM冻结的前提下实现灵活集成与性能提升。

链接: https://arxiv.org/abs/2505.17670
作者: Wenyi Wu,Zixuan Song,Kun Zhou,Yifei Shao,Zhiting Hu,Biwei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual real-world knowledge. To support such capabilities, an external memory system that can efficiently provide relevant multimodal information is essential. Existing approaches generally concatenate image and text tokens into a long sequence as memory, which, however, may drastically increase context length and even degrade performance. In contrast, we propose using continuous memory, a compact set of dense embeddings to more effectively and efficiently represent multimodal and multilingual knowledge. Our key insight is that a VLM can serve as its own continuous memory encoder. We empirically show that this design improves performance on complex multimodal reasoning tasks. Building on this, we introduce a data-efficient and parameter-efficient method to fine-tune the VLM into a memory encoder, requiring only 1.2% of the model’s parameters and a small corpus of 15.6K self-synthesized samples. Our approach CoMEM utilizes VLM’s original capabilities to encode arbitrary multimodal and multilingual knowledge into just 8 continuous embeddings. Since the inference-time VLM remains frozen, our memory module is plug-and-play and can be flexibly integrated as needed. Extensive experiments across eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
zh

[AI-43] GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在处理以程序代码形式表达的几何空间信息时能力不足的问题。其解决方案的关键在于提出并形式化了“Program-to-Geometry”任务，该任务要求模型将程序化绘图代码转化为准确且抽象的几何推理。为评估这一能力，研究者构建了GeoGramBench基准，包含500个经过精心筛选的问题，并按照定制的三级分类体系进行组织，该体系关注几何复杂性而非传统的数学推理复杂性。

链接: https://arxiv.org/abs/2505.17653
作者: Shixian Luo,Zezhou Zhu,Yu Yuan,Yuncheng Yang,Lianlei Shan,Yong Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 13 figures

点击查看摘要

Abstract:Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: this https URL.
zh

[AI-44] Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning : A Competence-Difficulty Alignment Perspective

【速读】：该论文试图解决强化学习（Reinforcement Learning, RL）在提升大语言模型推理能力时面临的样本效率低的问题，尤其是由于问题难度估计不稳定和偏差导致的模型能力与问题难度不匹配问题。解决方案的关键在于提出了一种基于模型能力与问题难度对齐的采样方法——能力-难度对齐采样（Competence-Difficulty Alignment Sampling, CDAS），通过聚合历史性能差异来准确稳定地估计问题难度，并利用固定点系统自适应选择与当前模型能力相匹配的问题。

链接: https://arxiv.org/abs/2505.17652
作者: Deyang Kong,Qi Guo,Xiangyu Xi,Wei Wang,Jingang Wang,Xunliang Cai,Shikun Zhang,Wei Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning exhibits potential in enhancing the reasoning abilities of large language models, yet it is hard to scale for the low sample efficiency during the rollout phase. Existing methods attempt to improve efficiency by scheduling problems based on problem difficulties. However, these approaches suffer from unstable and biased estimations of problem difficulty and fail to capture the alignment between model competence and problem difficulty in RL training, leading to suboptimal results. To tackle these limitations, this paper introduces \textbfCompetence-\textbfDifficulty \textbfAlignment \textbfSampling (\textbfCDAS), which enables accurate and stable estimation of problem difficulties by aggregating historical performance discrepancies of problems. Then the model competence is quantified to adaptively select problems whose difficulty is in alignment with the model’s current competence using a fixed-point system. Experimental results across a range of challenging mathematical benchmarks show that CDAS achieves great improvements in both accuracy and efficiency. CDAS attains the highest average accuracy against baselines and exhibits significant speed advantages compared to Dynamic Sampling, a competitive strategy in DAPO, which is \textbf2.33 times slower than CDAS.
zh

[AI-45] Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

【速读】：该论文试图解决的问题是：Chain-of-Thought (CoT) 推理是否真的能够降低越狱攻击（jailbreak attack）的危害性。论文通过严谨的理论分析表明，CoT 推理对越狱攻击的危害性具有双重影响。解决方案的关键在于提出一种新的越狱方法 FicDetail，该方法基于理论洞察，并在实践中验证了其理论结论的有效性。

链接: https://arxiv.org/abs/2505.17650
作者: Chengda Lu,Xiaoyu Fan,Yu Huang,Rongwu Xu,Jijie Li,Wei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.
zh

[AI-46] ReqBrain: Task-Specific Instruction Tuning of LLM s for AI-Assisted Requirements Generation

【速读】：该论文试图解决需求获取与规格说明过程中劳动密集、手动且易出现不一致和遗漏的问题（requirements elicitation and specification），这是现代软件工程中的一个重大挑战。解决方案的关键在于引入ReqBrain，这是一个基于微调大语言模型（LLMs）的AI辅助工具，能够自动生成真实且充分的软件需求，并通过基于聊天的会话方式让软件工程师进行交互。研究通过构建符合ISO 29148标准的数据集并微调多个7B参数的LLM，确定了性能最佳的模型Zephyr-7b-beta，其在生成需求任务中表现出色，验证了生成式AI在需求获取与规格说明中的潜力。

链接: https://arxiv.org/abs/2505.17632
作者: Mohammad Kasra Habib,Daniel Graziotin,Stefan Wagner
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Requirements elicitation and specification remains a labor-intensive, manual process prone to inconsistencies and gaps, presenting a significant challenge in modern software engineering. Emerging studies underscore the potential of employing large language models (LLMs) for automated requirements generation to support requirements elicitation and specification; however, it remains unclear how to implement this effectively. In this work, we introduce ReqBrain, an Al-assisted tool that employs a fine-tuned LLM to generate authentic and adequate software requirements. Software engineers can engage with ReqBrain through chat-based sessions to automatically generate software requirements and categorize them by type. We curated a high-quality dataset of ISO 29148-compliant requirements and fine-tuned five 7B-parameter LLMs to determine the most effective base model for ReqBrain. The top-performing model, Zephyr-7b-beta, achieved 89.30% Fl using the BERT score and a FRUGAL score of 91.20 in generating authentic and adequate requirements. Human evaluations further confirmed ReqBrain’s effectiveness in generating requirements. Our findings suggest that generative Al, when fine-tuned, has the potential to improve requirements elicitation and specification, paving the way for future extensions into areas such as defect identification, test case generation, and agile user story creation.
zh

[AI-47] BehaveGPT : A Foundation Model for Large-scale User Behavior Modeling

【速读】：该论文旨在解决用户行为建模领域中由于行为数据复杂性及捕捉用户活动中的时间与上下文关系挑战而导致的进展有限的问题。其解决方案的关键在于提出BehaveGPT，一个专为大规模用户行为预测设计的基础模型，该模型基于Transformer架构和一种新颖的DRO（Distributionally Robust Optimization）预训练范式，通过公平建模头部和尾部行为来提升模型的泛化能力和迁移性，从而有效捕捉和预测用户行为。

链接: https://arxiv.org/abs/2505.17631
作者: Jiahui Gong,Jingtao Ding,Fanjin Meng,Chen Yang,Hong Chen,Zuojian Wang,Haisheng Lu,Yong Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 22 pages, 8 figures, 5 tables

点击查看摘要

Abstract:In recent years, foundational models have revolutionized the fields of language and vision, demonstrating remarkable abilities in understanding and generating complex data; however, similar advances in user behavior modeling have been limited, largely due to the complexity of behavioral data and the challenges involved in capturing intricate temporal and contextual relationships in user activities. To address this, we propose BehaveGPT, a foundational model designed specifically for large-scale user behavior prediction. Leveraging transformer-based architecture and a novel pretraining paradigm, BehaveGPT is trained on vast user behavior datasets, allowing it to learn complex behavior patterns and support a range of downstream tasks, including next behavior prediction, long-term generation, and cross-domain adaptation. Our approach introduces the DRO-based pretraining paradigm tailored for user behavior data, which improves model generalization and transferability by equitably modeling both head and tail behaviors. Extensive experiments on real-world datasets demonstrate that BehaveGPT outperforms state-of-the-art baselines, achieving more than a 10% improvement in macro and weighted recall, showcasing its ability to effectively capture and predict user behavior. Furthermore, we measure the scaling law in the user behavior domain for the first time on the Honor dataset, providing insights into how model performance scales with increased data and parameter sizes.
zh

[AI-48] ransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments ACL2025

【速读】：该论文旨在解决GUI agents在动态和互联的现实数字环境中适应性不足的问题，具体表现为难以应对版本更新、跨平台操作以及跨应用任务的挑战。解决方案的关键在于提出TransBench，这是一个首个系统评估和提升GUI agents在三个核心维度上迁移能力的基准：跨版本迁移能力（适应版本更新）、跨平台迁移能力（在iOS、Android和Web等平台间泛化）以及跨应用迁移能力（处理功能各异的应用程序间的任务）。

链接: https://arxiv.org/abs/2505.17629
作者: Yuheng Lu,Qian Yu,Hongru Wang,Zeming Liu,Wei Su,Yanping Liu,Yuhang Guo,Maocheng Liang,Yunhong Wang,Haifeng Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Findings

点击查看摘要

Abstract:Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding - the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications while also being impacted by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms like iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available at Github.
zh

[AI-49] textttRange-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party

【速读】：该论文旨在解决在去中心化机器学习系统中，由于区块链限制导致资源密集型任务如深度神经网络（DNN）推理被卸载到外部参与者时，如何验证外包计算结果的正确性问题。解决方案的关键在于提出一种名为\texttt{Range-Arithmetic}的新框架，该框架将非算术操作（如定点矩阵乘法后的舍入和ReLU激活函数）转化为可使用sum-check协议和连接范围证明验证的算术步骤，从而避免了布尔编码、高次多项式和大型查找表的复杂性，同时保持与基于有限域的证明系统的兼容性。

链接: https://arxiv.org/abs/2505.17623
作者: Ali Rahimi,Babak H. Khalaj,Mohammad Ali Maddah-Ali
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural network (DNN) inference are offloaded to external participants due to blockchain limitations. This creates a need to verify the correctness of outsourced computations without re-execution. We propose \textttRange-Arithmetic, a novel framework for efficient and verifiable DNN inference that transforms non-arithmetic operations, such as rounding after fixed-point matrix multiplication and ReLU, into arithmetic steps verifiable using sum-check protocols and concatenated range proofs. Our approach avoids the complexity of Boolean encoding, high-degree polynomials, and large lookup tables while remaining compatible with finite-field-based proof systems. Experimental results show that our method not only matches the performance of existing approaches, but also reduces the computational cost of verifying the results, the computational effort required from the untrusted party performing the DNN inference, and the communication overhead between the two sides.
zh

[AI-50] Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving

【速读】：该论文旨在解决当前大型视觉-语言模型（LVLMs）在处理复杂视觉-语言推理任务时存在的推理能力不足问题，尤其是在视觉密集型几何数学问题上的表现不佳。其解决方案的关键在于提出一种解耦的推理框架，该框架不依赖于端到端训练的视觉-语言推理模型，而是结合现有的视觉解析专家和基于文本的推理大语言模型（LLM），通过将图像内容转化为文本描述，并由LLM根据生成的文本和原始问题进行推理，从而实现高效的多模态推理。此方法优化了现有模型的协作方式，避免了从头开始构建端到端视觉-语言模型的高成本，同时提升了模型的灵活性和可扩展性。

链接: https://arxiv.org/abs/2505.17609
作者: Zixian Guo,Ming Liu,Zhilong Ji,Jinfeng Bai,Lei Zhang,Wangmeng Zuo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Well alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available: this https URL.
zh

[AI-51] CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

【速读】：该论文旨在解决现有语音合成模型在语言覆盖范围、领域多样性、数据量、文本格式以及后训练技术方面的局限性，特别是针对野外环境下的零样本多语言语音合成问题。其解决方案的关键在于引入一种新型的语音分词器，通过监督式多任务训练提升韵律自然度；开发一个适用于后训练的可微分奖励模型；扩展训练数据规模至一百万小时，涵盖九种语言和十八种汉语方言；以及扩大模型参数规模至15亿，从而提升多语言基准测试中的性能。

链接: https://arxiv.org/abs/2505.17589
作者: Zhihao Du,Changfeng Gao,Yuxuan Wang,Fan Yu,Tianyu Zhao,Hao Wang,Xiang Lv,Hui Wang,Xian Shi,Keyu An,Guanrou Yang,Yabin Li,Yanni Chen,Zhifu Gao,Qian Chen,Yue Gu,Mengzhe Chen,Yafeng Chen,Shiliang Zhang,Wen Wang,Jieping Ye
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Preprint, work in progress

点击查看摘要

Abstract:In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at this https URL.
zh

[AI-52] USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLM s as Urban Agents

【速读】：该论文试图解决现有研究对城市大语言模型代理（LLM agent）在时空推理中的内在推理过程理解不足的问题，尤其是在长期规划和动态城市环境中的反思适应能力方面存在明显短板。解决方案的关键在于提出USTBench，这是首个针对LLM作为城市代理的时空推理能力进行评估的基准，涵盖时空理解、预测、规划及反馈反思四个分解维度，并通过构建交互式城市环境UAgentEnv进行多任务评估，从而实现对模型推理过程的细粒度诊断与任务级比较。

链接: https://arxiv.org/abs/2505.17572
作者: Siqi Lai,Yansong Ning,Zirui Yuan,Zhixi Chen,Hao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agent on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs’ spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle in long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM-based urban agents and broad smart city applications.
zh

[AI-53] JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

【速读】：该论文试图解决音频语言模型（Audio Language Models, ALMs）在面对越狱攻击（jailbreak attacks）时的安全性问题，尤其是针对ALMs的音频模态尚未有系统性的评估框架和对抗数据集。解决方案的关键在于提出JALMBench，这是首个全面的基准测试平台，用于评估ALMs在越狱攻击下的安全性，其包含大规模的文本和音频样本数据集，并支持多种主流ALMs、攻击方法及防御策略，从而为攻击效率、主题敏感性、语音多样性及攻击表征等方面提供深入分析。

链接: https://arxiv.org/abs/2505.17568
作者: Zifan Peng,Yule Liu,Zhen Sun,Mingchen Li,Zeren Luo,Jingyi Zheng,Wenhan Dong,Xinlei He,Xuechao Wang,Yingjie Xue,Shengmin Xu,Xinyi Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Audio Language Models (ALMs) have made significant progress recently. These models integrate the audio modality directly into the model, rather than converting speech into text and inputting text to Large Language Models (LLMs). While jailbreak attacks on LLMs have been extensively studied, the security of ALMs with audio modalities remains largely unexplored. Currently, there is a lack of an adversarial audio dataset and a unified framework specifically designed to evaluate and compare attacks and ALMs. In this paper, we present JALMBench, the \textitfirst comprehensive benchmark to assess the safety of ALMs against jailbreak attacks. JALMBench includes a dataset containing 2,200 text samples and 51,381 audio samples with over 268 hours. It supports 12 mainstream ALMs, 4 text-transferred and 4 audio-originated attack methods, and 5 defense methods. Using JALMBench, we provide an in-depth analysis of attack efficiency, topic sensitivity, voice diversity, and attack representations. Additionally, we explore mitigation strategies for the attacks at both the prompt level and the response level.
zh

[AI-54] Universal Biological Sequence Reranking for Improved De Novo Peptide Sequencing

【速读】：该论文旨在解决深度学习在从头肽段测序（de novo peptide sequencing）中的性能瓶颈问题，这一瓶颈源于质谱数据的固有复杂性以及噪声信号的异质性分布，导致数据特定偏差。其解决方案的关键在于提出RankNovo，这是首个利用多个测序模型互补优势的深度重排序框架，通过列表级重排序方法将候选肽段建模为多序列比对，并利用轴向注意力机制提取跨候选肽段的有用特征，同时引入两种新指标PMD（Peptide Mass Deviation）和RMD（residual Mass Deviation）以精确量化序列和残基层面的质荷比差异，从而提供精细的监督信号。

链接: https://arxiv.org/abs/2505.17552
作者: Zijie Qiu,Jiaqi Wei,Xiang Zhang,Sheng Xu,Kai Zou,Zhi Jin,Zhiqiang Gao,Nanqing Dong,Siqi Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:De novo peptide sequencing is a critical task in proteomics. However, the performance of current deep learning-based methods is limited by the inherent complexity of mass spectrometry data and the heterogeneous distribution of noise signals, leading to data-specific biases. We present RankNovo, the first deep reranking framework that enhances de novo peptide sequencing by leveraging the complementary strengths of multiple sequencing models. RankNovo employs a list-wise reranking approach, modeling candidate peptides as multiple sequence alignments and utilizing axial attention to extract informative features across candidates. Additionally, we introduce two new metrics, PMD (Peptide Mass Deviation) and RMD (residual Mass Deviation), which offer delicate supervision by quantifying mass differences between peptides at both the sequence and residue levels. Extensive experiments demonstrate that RankNovo not only surpasses its base models used to generate training candidates for reranking pre-training, but also sets a new state-of-the-art benchmark. Moreover, RankNovo exhibits strong zero-shot generalization to unseen models whose generations were not exposed during training, highlighting its robustness and potential as a universal reranking framework for peptide sequencing. Our work presents a novel reranking strategy that fundamentally challenges existing single-model paradigms and advances the frontier of accurate de novo sequencing. Our source code is provided on GitHub.
zh

[AI-55] Learning Representational Disparities

【速读】：该论文试图解决人类决策过程中导致下游结果差异的不公平问题，其核心在于建模可观测决策与理想决策之间的可解释性差异（representational disparities），以减少结果上的不平等。解决方案的关键是通过神经网络将这种差异建模为一个多目标优化问题，从而学习可解释的表示差异，这些差异可以通过对人类决策的具体提示（nudges）进行修正，进而缓解下游结果中的不平等现象。

链接: https://arxiv.org/abs/2505.17533
作者: Pavan Ravishankar,Rushabh Shah,Daniel B. Neill
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 27 pages

点击查看摘要

Abstract:We propose a fair machine learning algorithm to model interpretable differences between observed and desired human decision-making, with the latter aimed at reducing disparity in a downstream outcome impacted by the human decision. Prior work learns fair representations without considering the outcome in the decision-making process. We model the outcome disparities as arising due to the different representations of the input seen by the observed and desired decision-maker, which we term representational disparities. Our goal is to learn interpretable representational disparities which could potentially be corrected by specific nudges to the human decision, mitigating disparities in the downstream outcome; we frame this as a multi-objective optimization problem using a neural network. Under reasonable simplifying assumptions, we prove that our neural network model of the representational disparity learns interpretable weights that fully mitigate the outcome disparity. We validate objectives and interpret results using real-world German Credit, Adult, and Heritage Health datasets.
zh

[AI-56] ransparency and Proportionality in Post-Processing Algorithmic Bias Correction

【速读】：该论文试图解决算法决策系统在公平性处理过程中可能引入新的不公平或加剧现有不平等问题。其解决方案的关键在于提出一组度量方法，用于量化后处理阶段对解的翻转差异，从而帮助实践者评估去偏策略的比例性、透明化解释策略在各群体中的影响，并分析其他偏见缓解方法的可行性。该方法通过后处理阶段的应用，补充了传统公平性指标，提供了更深入的视角以确保所有群体的公平结果。

链接: https://arxiv.org/abs/2505.17525
作者: Juliett Suárez Ferreira,Marija Slavkovik,Jorge Casillas
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Algorithmic decision-making systems sometimes produce errors or skewed predictions toward a particular group, leading to unfair results. Debiasing practices, applied at different stages of the development of such systems, occasionally introduce new forms of unfairness or exacerbate existing inequalities. We focus on post-processing techniques that modify algorithmic predictions to achieve fairness in classification tasks, examining the unintended consequences of these interventions. To address this challenge, we develop a set of measures that quantify the disparity in the flips applied to the solution in the post-processing stage. The proposed measures will help practitioners: (1) assess the proportionality of the debiasing strategy used, (2) have transparency to explain the effects of the strategy in each group, and (3) based on those results, analyze the possibility of the use of some other approaches for bias mitigation or to solve the problem. We introduce a methodology for applying the proposed metrics during the post-processing stage and illustrate its practical application through an example. This example demonstrates how analyzing the proportionality of the debiasing strategy complements traditional fairness metrics, providing a deeper perspective to ensure fairer outcomes across all groups.
zh

[AI-57] Optimizing Retrieval-Augmented Generation for Electrical Engineering: A Case Study on ABB Circuit Breakers

【速读】：该论文试图解决在高风险工程环境中，如何通过集成检索增强生成（RAG）与大型语言模型（Large Language Models, LLMs）提供精确、上下文相关且事实准确的响应问题。解决方案的关键在于构建领域特定的数据集，采用先进的嵌入模型和优化的分块策略，以提升工程文档中的数据检索与上下文对齐效果。研究评估了三种RAG管道（OpenAI GPT4o、Cohere和Anthropic Claude），并探索了段落级和标题感知分块方法对检索精度和生成响应的影响。

链接: https://arxiv.org/abs/2505.17520
作者: Salahuddin Alawadhi,Noorhan Abbas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, published in CSIT Vol. 15, 2025. DOI: https://doi.org/10.5121/csit.2025.150905

点击查看摘要

Abstract:Integrating Retrieval Augmented Generation (RAG) with Large Language Models (LLMs) has shown the potential to provide precise, contextually relevant responses in knowledge intensive domains. This study investigates the ap-plication of RAG for ABB circuit breakers, focusing on accuracy, reliability, and contextual relevance in high-stakes engineering environments. By leveraging tailored datasets, advanced embedding models, and optimized chunking strategies, the research addresses challenges in data retrieval and contextual alignment unique to engineering documentation. Key contributions include the development of a domain-specific dataset for ABB circuit breakers and the evaluation of three RAG pipelines: OpenAI GPT4o, Cohere, and Anthropic Claude. Advanced chunking methods, such as paragraph-based and title-aware segmentation, are assessed for their impact on retrieval accuracy and response generation. Results demonstrate that while certain configurations achieve high precision and relevancy, limitations persist in ensuring factual faithfulness and completeness, critical in engineering contexts. This work underscores the need for iterative improvements in RAG systems to meet the stringent demands of electrical engineering tasks, including design, troubleshooting, and operational decision-making. The findings in this paper help advance research of AI in highly technical domains such as electrical engineering.
zh

[AI-58] Multi-agent Systems for Misinformation Lifecycle : Detection Correction And Source Identification

【速读】：该论文试图解决数字媒体中虚假信息快速传播所带来的问题，传统基于单一大型语言模型（Large Language Model, LLM）或AI代理的检测方法已不足以应对这一复杂挑战。其解决方案的关键在于提出一种多代理框架，覆盖虚假信息的完整生命周期，包括分类、检测、修正和源验证。该框架由五个专业代理组成：索引代理、分类代理、提取代理、修正代理和验证代理，每个代理可独立评估与优化，从而提升系统的可扩展性、模块化和可解释性，实现更透明和可靠的虚假信息检测与修正。

链接: https://arxiv.org/abs/2505.17511
作者: Aditya Gautam
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid proliferation of misinformation in digital media demands solutions that go beyond isolated Large Language Model(LLM) or AI Agent based detection methods. This paper introduces a novel multi-agent framework that covers the complete misinformation lifecycle: classification, detection, correction, and source verification to deliver more transparent and reliable outcomes. In contrast to single-agent or monolithic architectures, our approach employs five specialized agents: an Indexer agent for dynamically maintaining trusted repositories, a Classifier agent for labeling misinformation types, an Extractor agent for evidence based retrieval and ranking, a Corrector agent for generating fact-based correction and a Verification agent for validating outputs and tracking source credibility. Each agent can be individually evaluated and optimized, ensuring scalability and adaptability as new types of misinformation and data sources emerge. By decomposing the misinformation lifecycle into specialized agents - our framework enhances scalability, modularity, and explainability. This paper proposes a high-level system overview, agent design with emphasis on transparency, evidence-based outputs, and source provenance to support robust misinformation detection and correction at scale.
zh

[AI-59] Managing FAIR Knowledge Graphs as Polyglot Data End Points: A Benchmark based on the rdf2pg Framework and Plant Biology Data

【速读】：该论文试图解决如何有效整合Linked Data和 labelled property graphs (LPG)以提升数据共享和软件生态支持的问题，其解决方案的关键在于提出一个可扩展的框架rdf2pg，该框架能够将RDF数据映射为语义等价的LPG格式和数据库。通过该框架，作者对三种流行的图数据库（Virtuoso、Neo4j和ArcadeDB）及三种图查询语言（SPARQL、Cypher和Gremlin）进行了比较分析，从而揭示了这些图数据库技术的优势与局限性，并展示了rdf2pg作为支持多语言访问知识图谱的多功能工具的潜力。

链接: https://arxiv.org/abs/2505.17498
作者: Marco Brandizi,Carlos Bobed,Luca Garulli,Arné de Klerk,Keywan Hassani-Pak
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Linked Data and labelled property graphs (LPG) are two data management approaches with complementary strengths and weaknesses, making their integration beneficial for sharing datasets and supporting software ecosystems. In this paper, we introduce rdf2pg, an extensible framework for mapping RDF data to semantically equivalent LPG formats and data-bases. Utilising this framework, we perform a comparative analysis of three popular graph databases - Virtuoso, Neo4j, and ArcadeDB - and the well-known graph query languages SPARQL, Cypher, and Gremlin. Our qualitative and quantitative as-sessments underline the strengths and limitations of these graph database technologies. Additionally, we highlight the potential of rdf2pg as a versatile tool for enabling polyglot access to knowledge graphs, aligning with established standards of Linked Data and the Semantic Web.
zh

[AI-60] DTRT: Enhancing Human Intent Estimation and Role Allocation for Physical Human-Robot Collaboration

【速读】：该论文旨在解决物理人机协作（pHRC）中人类意图估计不准确以及人机角色分配不合理的问题，这些问题限制了协作的安全性和效率。现有方法依赖短期运动数据进行意图估计，缺乏多步预测能力，无法及时感知意图变化并自主调整人机任务分配。解决方案的关键在于提出一种基于双Transformer的机器人轨迹预测模型（DTRT），其采用分层架构，通过融合人类引导的运动和力数据，快速捕捉人类意图变化，并结合差分合作博弈理论（DCGT）确保机器人行为与人类意图一致，从而实现精准轨迹预测和动态行为调整，提升协作性能。

链接: https://arxiv.org/abs/2505.17490
作者: Haotian Liu,Yuchuang Tong,Zhengtao Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In physical Human-Robot Collaboration (pHRC), accurate human intent estimation and rational human-robot role allocation are crucial for safe and efficient assistance. Existing methods that rely on short-term motion data for intention estimation lack multi-step prediction capabilities, hindering their ability to sense intent changes and adjust human-robot assignments autonomously, resulting in potential discrepancies. To address these issues, we propose a Dual Transformer-based Robot Trajectron (DTRT) featuring a hierarchical architecture, which harnesses human-guided motion and force data to rapidly capture human intent changes, enabling accurate trajectory predictions and dynamic robot behavior adjustments for effective collaboration. Specifically, human intent estimation in DTRT uses two Transformer-based Conditional Variational Autoencoders (CVAEs), incorporating robot motion data in obstacle-free case with human-guided trajectory and force for obstacle avoidance. Additionally, Differential Cooperative Game Theory (DCGT) is employed to synthesize predictions based on human-applied forces, ensuring robot behavior align with human intention. Compared to state-of-the-art (SOTA) methods, DTRT incorporates human dynamics into long-term prediction, providing an accurate understanding of intention and enabling rational role allocation, achieving robot autonomy and maneuverability. Experiments demonstrate DTRT’s accurate intent estimation and superior collaboration performance.
zh

[AI-61] win-2K-500: A dataset for building digital twins of over 2000 people based on their answers to over 500 questions

【速读】：该论文试图解决基于大语言模型（Large Language Model, LLM）的数字孪生仿真中缺乏高质量、大规模且公开可用的个体级数据集的问题，这一问题限制了数字孪生方法的开发与验证。解决方案的关键在于引入一个大规模、公开的数据集，该数据集通过在美国范围内对2,058名参与者进行四轮调查，收集了涵盖人口统计学、心理学、经济学、人格和认知等多个维度的详尽信息，并包含行为经济学实验的复现和定价调查，最终一轮用于建立重测信度基准，从而为构建高精度的人类行为预测数字孪生提供了高质量的地面实况数据。

链接: https://arxiv.org/abs/2505.17479
作者: Olivier Toubia,George Z. Gui,Tianyi Peng,Daniel J. Merlau,Ang Li,Haozhe Chen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Econometrics (econ.EM)
备注: Also available at SSRN: this https URL

点击查看摘要

Abstract:LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of N = 2,058 participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral economics experiments and a pricing survey. The final wave repeats tasks from earlier waves to establish a test-retest accuracy baseline. Initial analyses suggest the data are of high quality and show promise for constructing digital twins that predict human behavior well at the individual and aggregate levels. By making the full dataset publicly available, we aim to establish a valuable testbed for the development and benchmarking of LLM-based persona simulations. Beyond LLM applications, due to its unique breadth and scale the dataset also enables broad social science research, including studies of cross-construct correlations and heterogeneous treatment effects.
zh

[AI-62] Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression

【速读】：该论文旨在解决现有方法在捕捉蛋白质构象之间的时序依赖性或不支持直接生成独立于时间的样本方面的局限性。其解决方案的关键在于提出ConfRover，一个自回归模型，该模型通过模块化架构同时学习蛋白质构象与动力学，包括：（i）从蛋白质折叠模型中改进的编码层，用于将蛋白质特异性信息和每个时间帧的构象嵌入潜在空间；（ii）一个序列模型作为时序模块，用于捕捉跨帧的构象动态；（iii）一个SE(3)扩散模型作为结构解码器，用于在连续空间中生成构象。该模型首次在一个框架内实现了蛋白质构象与轨迹的采样，为从蛋白质分子动力学数据中学习提供了新颖且灵活的方法。

链接: https://arxiv.org/abs/2505.17478
作者: Yuning Shen,Lihao Wang,Huizhuo Yuan,Yan Wang,Bangji Yang,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
备注: 33 pages, 17 figures

点击查看摘要

Abstract:Understanding protein dynamics is critical for elucidating their biological functions. The increasing availability of molecular dynamics (MD) data enables the training of deep generative models to efficiently explore the conformational space of proteins. However, existing approaches either fail to explicitly capture the temporal dependencies between conformations or do not support direct generation of time-independent samples. To address these limitations, we introduce ConfRover, an autoregressive model that simultaneously learns protein conformation and dynamics from MD trajectories, supporting both time-dependent and time-independent sampling. At the core of our model is a modular architecture comprising: (i) an encoding layer, adapted from protein folding models, that embeds protein-specific information and conformation at each time frame into a latent space; (ii) a temporal module, a sequence model that captures conformational dynamics across frames; and (iii) an SE(3) diffusion model as the structure decoder, generating conformations in continuous space. Experiments on ATLAS, a large-scale protein MD dataset of diverse structures, demonstrate the effectiveness of our model in learning conformational dynamics and supporting a wide range of downstream tasks. ConfRover is the first model to sample both protein conformations and trajectories within a single framework, offering a novel and flexible approach for learning from protein MD data.
zh

[AI-63] Efficient compression of neural networks and datasets

【速读】：该论文试图解决如何在保持高测试准确率的前提下显著减少神经网络参数数量的问题，进而实现有效的数据压缩。其解决方案的关键在于提出一种无需蒙特卡洛采样的概率重表述方法，用于非线性模型的 $\ell_0$ 正则化优化，并改进了对 $\ell_0$ 范数的平滑近似方法，同时探索了分层方法。这些方法在不同架构和数据集上进行了比较，并通过合成教师-学生设置验证了压缩效果。

链接: https://arxiv.org/abs/2505.17469
作者: Lukas Silvester Barth,Paulo von Petersenn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Optimization and Control (math.OC); Statistics Theory (math.ST)
备注: 10 pages plus appendix, 9 Figures, 3 Tables

点击查看摘要

Abstract:We compare, improve, and contribute methods that substantially decrease the number of parameters of neural networks while maintaining high test accuracy. When applying our methods to minimize description length, we obtain very effective data compression algorithms. In particular, we develop a probabilistic reformulation of \ell_0 regularized optimization for nonlinear models that does not require Monte-Carlo sampling and thus improves upon previous methods. We also improve upon methods involving smooth approximations to the \ell_0 norm, and investigate layerwise methods. We compare the methods on different architectures and datasets, including convolutional networks trained on image datasets and transformers trained on parts of Wikipedia. We also created a synthetic teacher-student setup to investigate compression in a controlled continuous setting. Finally, we conceptually relate compression algorithms to Solomonoff’s theory of inductive inference and empirically verify the prediction that regularized models can exhibit more sample-efficient convergence.
zh

[AI-64] CLIMB: Class-imbalanced Learning Benchmark on Tabular Data

【速读】：该论文旨在解决表格数据上的类别不平衡学习（Class-Imbalanced Learning, CIL）问题，特别是在实际应用中少数类包含关键但罕见结果的情境下。解决方案的关键在于提出CLIMB，这是一个全面的基准测试平台，包含73个跨不同领域和不平衡程度的真实数据集，以及29种代表性CIL算法的统一实现。CLIMB基于高质量的开源Python包，具备统一的API设计、详细的文档和严格的代码质量控制，从而支持不同CIL算法的便捷实现与比较。通过大量实验，该研究提供了关于方法准确性和效率的实用见解，强调了简单重平衡的局限性、集成方法的有效性以及数据质量的重要性。

链接: https://arxiv.org/abs/2505.17451
作者: Zhining Liu,Zihao Li,Ze Yang,Tianxin Wei,Jian Kang,Yada Zhu,Hendrik Hamann,Jingrui He,Hanghang Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures, 8 tables

点击查看摘要

Abstract:Class-imbalanced learning (CIL) on tabular data is important in many real-world applications where the minority class holds the critical but rare outcomes. In this paper, we present CLIMB, a comprehensive benchmark for class-imbalanced learning on tabular data. CLIMB includes 73 real-world datasets across diverse domains and imbalance levels, along with unified implementations of 29 representative CIL algorithms. Built on a high-quality open-source Python package with unified API designs, detailed documentation, and rigorous code quality controls, CLIMB supports easy implementation and comparison between different CIL algorithms. Through extensive experiments, we provide practical insights on method accuracy and efficiency, highlighting the limitations of naive rebalancing, the effectiveness of ensembles, and the importance of data quality. Our code, documentation, and examples are available at this https URL.
zh

[AI-65] Designing an efficient and equitable humanitarian supply chain dynamically via reinforcement learning

【速读】：该论文试图解决人道主义供应链在动态环境下的效率与公平性问题，通过引入强化学习中的PPO（Proximal Policy Optimization）算法来设计高效的应急物资分配机制，并与启发式算法进行对比。该研究的核心解决方案在于采用PPO模型，其关键特点是始终将平均满意度率作为优化优先级。

链接: https://arxiv.org/abs/2505.17439
作者: Weijia Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study designs an efficient and equitable humanitarian supply chain dynamically by using reinforcement learning, PPO, and compared with heuristic algorithms. This study demonstrates the model of PPO always treats average satisfaction rate as the priority.
zh

[AI-66] Scaling Up Biomedical Vision-Language Models: Fine-Tuning Instruction Tuning and Multi-Modal Learning

【速读】：该论文旨在解决生物医学视觉-语言模型在处理长文本、多模态任务适应性以及零样本学习性能方面的挑战。其解决方案的关键在于通过模型规模扩展、微调和指令调优，提升模型在多种多模态生物医学任务中的表现，特别是基于编码器-解码器架构的Transformer模型在图像分类、文本理解、摘要生成、问答、视觉问答和图像描述生成等任务上的优化。

链接: https://arxiv.org/abs/2505.17436
作者: Cheng Peng,Kai Zhang,Mengxian Lyu,Hongfang Liu,Lichao Sun,Yonghui Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To advance biomedical vison-language model capabilities through scaling up, fine-tuning, and instruction tuning, develop vision-language models with improved performance in handling long text, explore strategies to efficiently adopt vision language models for diverse multi-modal biomedical tasks, and examine the zero-shot learning performance. We developed two biomedical vision language models, BiomedGPT-Large and BiomedGPT-XLarge, based on an encoder-decoder-based transformer architecture. We fine-tuned the two models on 23 benchmark datasets from 6 multi-modal biomedical tasks including one image-only task (image classification), three language-only tasks (text understanding, text summarization and question answering), and two vision-language tasks (visual question answering and image captioning). We compared the developed scaled models with our previous BiomedGPT-Base model and existing prestigious models reported in the literature. We instruction-tuned the two models using a large-scale multi-modal biomedical instruction-tuning dataset and assessed the zero-shot learning performance and alignment accuracy. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.17436 [cs.AI] (or arXiv:2505.17436v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.17436 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-67] Dynamic Manipulation of Deformable Objects in 3D: Simulation Benchmark and Learning Strategy

【速读】：该论文旨在解决在高自由度和欠驱动的柔性物体场景中，实现3D目标条件下的动态操作问题，此类问题由于复杂的系统动力学和严格的任务约束而具有挑战性。现有方法通常简化问题为低速或二维设置，限制了其在真实三维任务中的适用性。论文的关键解决方案是提出一种基于降阶动力学的仿真框架和基准，并结合模仿学习与物理信息的测试时适应机制，构建了Dynamics Informed Diffusion Policy (DIDP) 框架，通过在降阶空间中学习逆动力学并引入运动学边界条件和结构化动力学先验，提升了策略学习的效率与执行的准确性与鲁棒性。

链接: https://arxiv.org/abs/2505.17434
作者: Guanzhou Lan,Yuqi Yang,Anup Teejo Mathew,Feiping Nie,Rong Wang,Xuelong Li,Federico Renda,Bin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages,

点击查看摘要

Abstract:Goal-conditioned dynamic manipulation is inherently challenging due to complex system dynamics and stringent task constraints, particularly in deformable object scenarios characterized by high degrees of freedom and underactuation. Prior methods often simplify the problem to low-speed or 2D settings, limiting their applicability to real-world 3D tasks. In this work, we explore 3D goal-conditioned rope manipulation as a representative challenge. To mitigate data scarcity, we introduce a novel simulation framework and benchmark grounded in reduced-order dynamics, which enables compact state representation and facilitates efficient policy learning. Building on this, we propose Dynamics Informed Diffusion Policy (DIDP), a framework that integrates imitation pretraining with physics-informed test-time adaptation. First, we design a diffusion policy that learns inverse dynamics within the reduced-order space, enabling imitation learning to move beyond naïve data fitting and capture the underlying physical structure. Second, we propose a physics-informed test-time adaptation scheme that imposes kinematic boundary conditions and structured dynamics priors on the diffusion process, ensuring consistency and reliability in manipulation execution. Extensive experiments validate the proposed approach, demonstrating strong performance in terms of accuracy and robustness in the learned policy.
zh

[AI-68] MemeReaCon: Probing Contextual Meme Understanding in Large Vision-Language Models

【速读】：该论文试图解决当前大型视觉语言模型（LVLMs）在理解模因（meme）时缺乏上下文感知能力的问题，即同一模因在不同对话语境中可能表达不同意图，而现有方法未能有效捕捉这种语境依赖性。解决方案的关键在于构建MemeReaCon基准，该基准通过收集来自五个不同Reddit社区的模因数据，保留图像、帖子文本和用户评论的原始上下文，并进行细致标注，以评估LVLMs对模因在具体语境中的意图理解能力。

链接: https://arxiv.org/abs/2505.17433
作者: Zhengyi Zhao,Shubo Zhang,Yuxi Zhang,Yanxi Zhao,Yifan Zhang,Zezhong Wang,Huimin Wang,Yutian Zhao,Bin Liang,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Memes have emerged as a popular form of multimodal online communication, where their interpretation heavily depends on the specific context in which they appear. Current approaches predominantly focus on isolated meme analysis, either for harmful content detection or standalone interpretation, overlooking a fundamental challenge: the same meme can express different intents depending on its conversational context. This oversight creates an evaluation gap: although humans intuitively recognize how context shapes meme interpretation, Large Vision Language Models (LVLMs) can hardly understand context-dependent meme intent. To address this critical limitation, we introduce MemeReaCon, a novel benchmark specifically designed to evaluate how LVLMs understand memes in their original context. We collected memes from five different Reddit communities, keeping each meme’s image, the post text, and user comments together. We carefully labeled how the text and meme work together, what the poster intended, how the meme is structured, and how the community responded. Our tests with leading LVLMs show a clear weakness: models either fail to interpret critical information in the contexts, or overly focus on visual details while overlooking communicative purpose. MemeReaCon thus serves both as a diagnostic tool exposing current limitations and as a challenging benchmark to drive development toward more sophisticated LVLMs of the context-aware understanding.
zh

[AI-69] SEvoBench : A C Framework For Evolutionary Single-Objective Optimization Benchmarking

【速读】：该论文试图解决进化计算（Evolutionary Computation, EC）中单目标优化算法的系统性基准测试问题，旨在提供一个高效、可扩展的框架以评估和比较不同算法的性能。解决方案的关键在于SEvoBench框架的三个核心组件：基于可重用模块的算法构建、高效的基准问题集以及并行实验分析，同时通过SIMD向量化技术提升了大规模问题的计算效率，从而实现了算法实现的高复用性、基准测试的加速以及整体计算性能的增强。

链接: https://arxiv.org/abs/2505.17430
作者: Yongkang Yang,Jian Zhao,Tengfei Yang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Optimization and Control (math.OC)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:We present SEvoBench, a modern C++ framework for evolutionary computation (EC), specifically designed to systematically benchmark evolutionary single-objective optimization algorithms. The framework features modular implementations of Particle Swarm Optimization (PSO) and Differential Evolution (DE) algorithms, organized around three core components: (1) algorithm construction with reusable modules, (2) efficient benchmark problem suites, and (3) parallel experimental analysis. Experimental evaluations demonstrate the framework’s superior performance in benchmark testing and algorithm comparison. Case studies further validate its capabilities in algorithm hybridization and parameter analysis. Compared to existing frameworks, SEvoBench demonstrates three key advantages: (i) highly efficient and reusable modular implementations of PSO and DE algorithms, (ii) accelerated benchmarking through parallel execution, and (iii) enhanced computational efficiency via SIMD (Single Instruction Multiple Data) vectorization for large-scale problems.
zh

[AI-70] UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

【速读】：该论文试图解决基于大型语言模型（Large-Language-Model, LLM）的文本到语音（Text-to-Speech, TTS）系统中，由于语义与声学信息无法完全对齐而导致的音频信息获取受限问题。其解决方案的关键在于提出DistilCodec和UniTTS，其中DistilCodec通过将多码本音频编解码器压缩为单码本编解码器，实现了接近100%的码本利用率，并且无需依赖语义对齐方案，从而能够利用大量高质量的未标注音频数据进行训练，提升数据多样性和适用性；而UniTTS则通过整合音频模态自回归、文本模态自回归以及语音-文本跨模态自回归三个关键任务，使模型能够处理交错的文本和语音提示，同时保持LLM的文本能力。

链接: https://arxiv.org/abs/2505.17426
作者: Rui Wang,Qianguo Sun,Tianrong Chen,Zhiyun Zeng,Junlong Wu,Jiaxing Zhang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The emergence of multi-codebook neutral audio codecs such as Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ) has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) This method can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving a near 100% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrated three key tasks into UniTTS’s pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving LLM’s text capabilities. 4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at this https URL and this https URL.
zh

[AI-71] Misaligning Reasoning with Answers – A Framework for Assessing LLM CoT Robustness

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在决策过程中缺乏透明性的问题，特别是如何评估模型在回答问题时的推理过程是否可靠。其解决方案的关键在于设计了一个新的评估框架MATCHA，通过该框架分析输入扰动对模型推理一致性的影响，并利用LLM裁判评估不同模型的推理鲁棒性，从而揭示LLMs在多步骤和常识性任务中相较于逻辑任务更容易受到输入扰动的影响，同时展示了成功示例向黑盒模型的非平凡迁移率。

链接: https://arxiv.org/abs/2505.17406
作者: Enyi Jiang,Changming Xu,Nischay Singh,Gagandeep Singh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs’ decision-making process is opaque, prompting the need for explanation techniques like Chain-of-Thought. To investigate the relationship between answer and reasoning, we design a novel evaluation framework, MATCHA. In domains like education and healthcare, reasoning is key for model trustworthiness. MATCHA reveals that LLMs under input perturbations can give inconsistent or nonsensical reasoning. Additionally, we use LLM judges to assess reasoning robustness across models. Our results show that LLMs exhibit greater vulnerability to input perturbations for multi-step and commonsense tasks than compared to logical tasks. Also, we show non-trivial transfer rates of our successful examples to black-box models. Our evaluation framework helps to better understand LLM reasoning mechanisms and guides future models toward more robust and reasoning-driven architectures, enforcing answer-reasoning consistency.
zh

[AI-72] Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space

【速读】：该论文旨在解决传统模仿学习（Imitation Learning, IL）在机器人操作任务中面临的数据收集成本高、泛化能力不足以及难以实现长期任务执行的问题。其关键解决方案是引入分层数据收集空间（Hierarchical Data Collection Space, HD-Space），通过从高层视角将精细操作任务分解为多个关键原子任务，并为人类示范设计对应的原子状态/动作空间，从而生成更稳健的IL数据，提升策略性能。

链接: https://arxiv.org/abs/2505.17389
作者: Jinrong Yang,Kexun Chen,Zhuoling Li,Shengkai Wu,Yong Zhao,Liangliang Ren,Wenqiu Luo,Chaohui Shang,Meiyu Zhi,Linfeng Gao,Mingshan Sun,Hui Cheng
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imitation learning (IL) with human demonstrations is a promising method for robotic manipulation tasks. While minimal demonstrations enable robotic action execution, achieving high success rates and generalization requires high cost, e.g., continuously adding data or incrementally conducting human-in-loop processes with complex hardware/software systems. In this paper, we rethink the state/action space of the data collection pipeline as well as the underlying factors responsible for the prediction of non-robust actions. To this end, we introduce a Hierarchical Data Collection Space (HD-Space) for robotic imitation learning, a simple data collection scheme, endowing the model to train with proactive and high-quality data. Specifically, We segment the fine manipulation task into multiple key atomic tasks from a high-level perspective and design atomic state/action spaces for human demonstrations, aiming to generate robust IL data. We conduct empirical evaluations across two simulated and five real-world long-horizon manipulation tasks and demonstrate that IL policy training with HD-Space-based data can achieve significantly enhanced policy performance. HD-Space allows the use of a small amount of demonstration data to train a more powerful policy, particularly for long-horizon manipulation tasks. We aim for HD-Space to offer insights into optimizing data quality and guiding data scaling. project page: this https URL.
zh

[AI-73] Provably Efficient Algorithm for Best Scoring Rule Identification in Online Principal-Agent Information Acquisition ICML2025

【速读】：该论文试图解决在委托-代理框架下在线信息获取问题中识别最优评分规则（scoring rule）的问题。其解决方案的关键在于提出两种算法：OIAFC 和 OIAFB，分别适用于固定置信度和固定预算的场景。通过理论分析，证明了OIAFC能够在实例相关或实例无关的样本复杂性下提取所需的 $(\epsilon, \delta)$ -scoring rule，而OIAFB则在实例无关性能边界上与OIAFC相匹配，且两种算法在固定置信度和固定预算设置下的复杂度相同。

链接: https://arxiv.org/abs/2505.17379
作者: Zichen Wang,Chuanhao Li,Huazheng Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025

点击查看摘要

Abstract:We investigate the problem of identifying the optimal scoring rule within the principal-agent framework for online information acquisition problem. We focus on the principal’s perspective, seeking to determine the desired scoring rule through interactions with the agent. To address this challenge, we propose two algorithms: OIAFC and OIAFB, tailored for fixed confidence and fixed budget settings, respectively. Our theoretical analysis demonstrates that OIAFC can extract the desired (\epsilon, \delta) -scoring rule with a efficient instance-dependent sample complexity or an instance-independent sample complexity. Our analysis also shows that OIAFB matches the instance-independent performance bound of OIAFC, while both algorithms share the same complexity across fixed confidence and fixed budget settings.
zh

[AI-74] FRIREN: Beyond Trajectories – A Spectral Lens on Time NEURIPS2025

【速读】：该论文试图解决长期时间序列预测（LTSF）中普遍存在的点预测假设问题，即认为所有数据都是可点预测的，而实际上对于混沌系统而言，几何结构才是更合适的抽象。解决方案的关键在于引入FRIREN（Flow-inspired Representations via Interpretable Eigen-networks）模型，该模型通过最小化Wasserstein-2距离（W2）来捕捉几何变化，并提供动态的谱视图，从而实现长期预测。FRIREN采用增强型归一化流块，将数据嵌入到正态分布的潜在表示中，并生成一个W2高效的最优路径，该路径可分解为旋转、缩放、逆旋转和平移，从而实现保持几何结构的局部预测和作为有限Koopman算子的全局谱表示。

链接: https://arxiv.org/abs/2505.17370
作者: Qilin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 37 pages, 4 figures. Submitted to NeurIPS 2025. Public code at this https URL

点击查看摘要

Abstract:Long-term time-series forecasting (LTSF) models are often presented as general-purpose solutions that can be applied across domains, implicitly assuming that all data is pointwise predictable. Using chaotic systems such as Lorenz-63 as a case study, we argue that geometric structure - not pointwise prediction - is the right abstraction for a dynamic-agnostic foundational model. Minimizing the Wasserstein-2 distance (W2), which captures geometric changes, and providing a spectral view of dynamics are essential for long-horizon forecasting. Our model, FRIREN (Flow-inspired Representations via Interpretable Eigen-networks), implements an augmented normalizing-flow block that embeds data into a normally distributed latent representation. It then generates a W2-efficient optimal path that can be decomposed into rotation, scaling, inverse rotation, and translation. This architecture yields locally generated, geometry-preserving predictions that are independent of the underlying dynamics, and a global spectral representation that functions as a finite Koopman operator with a small modification. This enables practitioners to identify which modes grow, decay, or oscillate, both locally and system-wide. FRIREN achieves an MSE of 11.4, MAE of 1.6, and SWD of 0.96 on Lorenz-63 in a 336-in, 336-out, dt=0.01 setting, surpassing TimeMixer (MSE 27.3, MAE 2.8, SWD 2.1). The model maintains effective prediction for 274 out of 336 steps, approximately 2.5 Lyapunov times. On Rossler (96-in, 336-out), FRIREN achieves an MSE of 0.0349, MAE of 0.0953, and SWD of 0.0170, outperforming TimeMixer’s MSE of 4.3988, MAE of 0.886, and SWD of 3.2065. FRIREN is also competitive on standard LTSF datasets such as ETT and Weather. By connecting modern generative flows with classical spectral analysis, FRIREN makes long-term forecasting both accurate and interpretable, setting a new benchmark for LTSF model design.
zh

[AI-75] FLEX: A Backbone for Diffusion-Based Modeling of Spatio-temporal Physical Systems

【速读】：该论文旨在解决时空物理系统生成建模中的稳定性与精度问题，特别是在超分辨率和预测任务中。其解决方案的关键在于提出FLEX（FLow EXpert）架构，该架构在残差空间而非原始数据上进行扩散模型建模，理论上降低了速度场的方差，从而提升了训练稳定性；同时，FLEX通过将潜在Transformer集成到带有标准卷积ResNet层的U-Net中，并引入重新设计的跳跃连接方案，实现了对潜在空间中局部空间细节和长程依赖关系的有效捕捉。

链接: https://arxiv.org/abs/2505.17351
作者: N. Benjamin Erichson,Vinicius Mikuni,Dongwei Lyu,Yang Gao,Omri Azencot,Soon Hoe Lim,Michael W. Mahoney
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce FLEX (FLow EXpert), a backbone architecture for generative modeling of spatio-temporal physical systems using diffusion models. FLEX operates in the residual space rather than on raw data, a modeling choice that we motivate theoretically, showing that it reduces the variance of the velocity field in the diffusion model, which helps stabilize training. FLEX integrates a latent Transformer into a U-Net with standard convolutional ResNet layers and incorporates a redesigned skip connection scheme. This hybrid design enables the model to capture both local spatial detail and long-range dependencies in latent space. To improve spatio-temporal conditioning, FLEX uses a task-specific encoder that processes auxiliary inputs such as coarse or past snapshots. Weak conditioning is applied to the shared encoder via skip connections to promote generalization, while strong conditioning is applied to the decoder through both skip and bottleneck features to ensure reconstruction fidelity. FLEX achieves accurate predictions for super-resolution and forecasting tasks using as few as two reverse diffusion steps. It also produces calibrated uncertainty estimates through sampling. Evaluations on high-resolution 2D turbulence data show that FLEX outperforms strong baselines and generalizes to out-of-distribution settings, including unseen Reynolds numbers, physical observables (e.g., fluid flow velocity fields), and boundary conditions.
zh

[AI-76] A Multi-Head Attention Soft Random Forest for Interpretable Patient No-Show Prediction

【速读】：该论文旨在解决患者未按时就诊（unattended scheduled appointments，即no-shows）对医疗服务提供者和患者健康造成的负面影响，特别是其对医疗资源分配、诊疗连续性和运营效率的干扰。为提升预测模型的准确性与适应性，传统机器学习方法如逻辑回归、随机森林和决策树因依赖硬决策分割和静态特征重要性而存在局限。该研究提出了一种新的混合模型——多头注意力软随机森林（MHASRF），其关键在于将注意力机制引入随机森林，采用概率软分割替代硬分割，使模型能够根据不同树的结构动态分配注意力权重，从而更精准地捕捉特定患者行为。该模型在准确率、精确率、召回率和F1分数上均优于传统模型，并通过双层次特征重要性分析（树级和注意力机制级）提供了更深入的患者未就诊预测因素洞察。

链接: https://arxiv.org/abs/2505.17344
作者: Ninda Nurseha Amalina,Kwadwo Boateng Ofori-Amanfo,Heungjo An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Unattended scheduled appointments, defined as patient no-shows, adversely affect both healthcare providers and patients’ health, disrupting the continuity of care, operational efficiency, and the efficient allocation of medical resources. Accurate predictive modelling is needed to reduce the impact of no-shows. Although machine learning methods, such as logistic regression, random forest models, and decision trees, are widely used in predicting patient no-shows, they often rely on hard decision splits and static feature importance, limiting their adaptability to specific or complex patient behaviors. To address this limitation, we propose a new hybrid Multi-Head Attention Soft Random Forest (MHASRF) model that integrates attention mechanisms into a random forest model using probabilistic soft splitting instead of hard splitting. The MHASRF model assigns attention weights differently across the trees, enabling attention on specific patient behaviors. The model exhibited 93.56% accuracy, 93.67% precision, 93.56% recall, and a 93.59% F1 score, surpassing the performance of decision tree, logistic regression, random forest, and naive Bayes models. Furthermore, MHASRF was able to identify key predictors of patient no-shows using two levels of feature importance (tree level and attention mechanism level), offering deeper insights into patient no-show predictors. The proposed model is a robust, adaptable, and interpretable method for predicting patient no-shows that will help healthcare providers in optimizing resources.
zh

[AI-77] Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

【速读】：该论文试图解决如何使人工智能系统具备类似人类的协作能力，即在与不同合作伙伴互动时，能够推断其优劣势并共同实现共享目标。其解决方案的关键在于通过训练无模型的循环神经网络（RNN）代理在开放式的合作环境中与多样化的合作伙伴进行交互，从而自发地形成对合作伙伴任务能力的结构化内部表征。研究发现，这种伙伴建模能力在代理能够通过任务分配影响合作伙伴行为的环境条件下得以出现，表明无需显式的专门机制或辅助目标，社会压力可以促使合作能力的自发产生。

链接: https://arxiv.org/abs/2505.17323
作者: Ruaridh Mon-Williams,Max Taylor-Davies,Elizabeth Mieczkowski,Natalia Velez,Neil R. Bramley,Yanwei Wang,Thomas L. Griffiths,Christopher G. Lucas
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans are remarkably adept at collaboration, able to infer the strengths and weaknesses of new partners in order to work successfully towards shared goals. To build AI systems with this capability, we must first understand its building blocks: does such flexibility require explicit, dedicated mechanisms for modelling others – or can it emerge spontaneously from the pressures of open-ended cooperative interaction? To investigate this question, we train simple model-free RNN agents to collaborate with a population of diverse partners. Using the `Overcooked-AI’ environment, we collect data from thousands of collaborative teams, and analyse agents’ internal hidden states. Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners’ task abilities, enabling rapid adaptation and generalisation to novel collaborators. We investigated these internal models through probing techniques, and large-scale behavioural analysis. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation. Our results show that partner modelling can arise spontaneously in model-free agents – but only under environmental conditions that impose the right kind of social pressure.
zh

[AI-78] Control of Renewable Energy Communities using AI and Real-World Data

【速读】：该论文旨在解决可再生能源社区（Renewable Energy Communities, RECs）在整合电动汽车（Electric Vehicle, EV）充电与建筑能源系统（如供暖、通风、空调、光伏发电和电池储能）时所面临的复杂性和实际挑战。解决方案的关键在于提出一个专门设计的框架，该框架结合了基于多智能体深度确定性策略梯度（MultiAgent Deep Deterministic Policy Gradient, MADDPG）的控制策略——EnergAIze，以应对现实世界数据采集、系统集成和用户行为建模等难题。通过优化负荷调度和EV充电行为，该框架在实际运行的REC中实现了日峰值需求平均降低9%和能源成本减少5%的成效。

链接: https://arxiv.org/abs/2505.17321
作者: Tiago Fonseca,Clarisse Sousa,Ricardo Venâncio,Pedro Pires,Ricardo Severino,Paulo Rodrigues,Pedro Paiva,Luis Lino Ferreira
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 1 table, 30th IEEE International Conference on Emerging Technologies and Factory Automation

点击查看摘要

Abstract:The electrification of transportation and the increased adoption of decentralized renewable energy generation have added complexity to managing Renewable Energy Communities (RECs). Integrating Electric Vehicle (EV) charging with building energy systems like heating, ventilation, air conditioning (HVAC), photovoltaic (PV) generation, and battery storage presents significant opportunities but also practical challenges. Reinforcement learning (RL), particularly MultiAgent Deep Deterministic Policy Gradient (MADDPG) algorithms, have shown promising results in simulation, outperforming heuristic control strategies. However, translating these successes into real-world deployments faces substantial challenges, including incomplete and noisy data, integration of heterogeneous subsystems, synchronization issues, unpredictable occupant behavior, and missing critical EV state-of-charge (SoC) information. This paper introduces a framework designed explicitly to handle these complexities and bridge the simulation to-reality gap. The framework incorporates EnergAIze, a MADDPG-based multi-agent control strategy, and specifically addresses challenges related to real-world data collection, system integration, and user behavior modeling. Preliminary results collected from a real-world operational REC with four residential buildings demonstrate the practical feasibility of our approach, achieving an average 9% reduction in daily peak demand and a 5% decrease in energy costs through optimized load scheduling and EV charging behaviors. These outcomes underscore the framework’s effectiveness, advancing the practical deployment of intelligent energy management solutions in RECs.
zh

[AI-79] AdaReason er: Adaptive Reasoning Enables More Flexible Thinking

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在需要复杂推理和问题解决的任务中，因采用通用固定配置而难以达到任务特定最优性能的问题。其解决方案的关键在于提出AdaReasoner，一个与LLM无关的插件，能够自动化适应不同思维方式的任务的推理配置。AdaReasoner通过强化学习（Reinforcement Learning, RL）框架进行训练，结合因子化动作空间、针对性探索策略以及预训练奖励模型，仅需少量示例引导即可优化策略模型，从而实现快速收敛和次线性策略差距，并在多种LLMs和推理任务中表现出色。

链接: https://arxiv.org/abs/2505.17312
作者: Xiangqi Wang,Yue Huang,Yanbo Wang,Xiaonan Luo,Kehan Guo,Yujun Zhou,Xiangliang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work ‘well enough’ across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only a few-shot guide. AdaReasoner is backed by theoretical guarantees and experiments of fast convergence and a sublinear policy gap. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yield gains on knowledge-intensive tasks through tailored prompts.
zh

[AI-80] LaSER: How Learning Can Guide the Evolution of Equations

【速读】：该论文试图解决在遗传编程（Genetic Programming, GP）中如何提升方程演化的泛化能力，同时保持符号可解释性的问题。传统GP在演化过程中面临同时发现有用表示和精确映射的负担，导致其泛化能力受限。解决方案的关键在于提出一种新的GP流程——LaSER（Latent Semantic Evolutionary Regression），该方法在评估阶段引入监督学习，通过生成语义表示并由监督学习器评估其映射质量来赋予适应度，而不修改底层语法树或进化过程。这一策略有效结合了进化计算与现代机器学习流程，显著提升了GP的泛化性能。

链接: https://arxiv.org/abs/2505.17309
作者: Nam H. Le,Josh Bongard
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Evolution and learning are two distinct yet complementary forms of adaptation. While evolutionary processes operate across generations via the selection of genotypes, learning occurs within the lifetime of an individual, shaping behavior through phenotypic adjustment. The Baldwin effect describes how lifetime learning can improve evolutionary search without altering inherited structures. While this has proven effective in areas like neuroevolution, where gradient-based learning is often used to fine-tune weights or behaviors produced by evolution, it remains underexplored in systems that evolve non-differentiable symbolic structures like Genetic Programming (GP). GP evolves explicit syntax trees that represent equations, offering strong interpretability but limited generalization due to the burden of discovering both useful representations and precise mappings. Here, we show for the first time that integrating a simple form of supervised learning, applied at the semantic or behavioral level during evaluation, can effectively guide the evolution of equations in GP. To achieve this, we propose a new GP pipeline, LaSER (Latent Semantic Evolutionary Regression), where each GP individual generates a semantic representation that is passed to a supervised learner. The quality of the learned mapping is used to assign fitness, without modifying the underlying syntax tree or evolutionary process. Across standard symbolic regression benchmarks, in terms of generalization ability, LaSER significantly outperforms traditional GP and, in several cases, matches or exceeds popular machine learning regressors, while preserving the symbolic interpretability. By separating evolution from learning, LaSER offers a practical route to integrating GP with modern ML workflows, and opens new avenues for research at the intersection of evolutionary computation and representation learning. Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC) Cite as: arXiv:2505.17309 [cs.NE] (or arXiv:2505.17309v1 [cs.NE] for this version) https://doi.org/10.48550/arXiv.2505.17309 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-81] Where You Go is Who You Are: Behavioral Theory-Guided LLM s for Inverse Reinforcement Learning

【速读】：该论文试图解决大轨迹数据在人类移动性分析中的应用受限问题，即缺乏关键的旅行者社会人口学属性。解决方案的关键在于提出SILIC框架，该框架结合了大型语言模型（LLM）引导的逆强化学习（IRL）和认知链推理（CCR），通过捕捉潜在的行为意图并基于心理构造进行推理，从而从观测到的移动模式中推断社会人口学属性。该方法特别遵循计划行为理论（Theory of Planned Behavior, TPB），以建模旅行决策背后的潜在认知过程，并利用LLM提升IRL奖励函数的初始化与更新，解决其在广阔且非结构化奖励空间中的不适定性和优化挑战。

链接: https://arxiv.org/abs/2505.17249
作者: Yuran Sun,Susu Xu,Chenguang Wang,Xilei Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Big trajectory data hold great promise for human mobility analysis, but their utility is often constrained by the absence of critical traveler attributes, particularly sociodemographic information. While prior studies have explored predicting such attributes from mobility patterns, they often overlooked underlying cognitive mechanisms and exhibited low predictive accuracy. This study introduces SILIC, short for Sociodemographic Inference with LLM-guided Inverse Reinforcement Learning (IRL) and Cognitive Chain Reasoning (CCR), a theoretically grounded framework that leverages LLMs to infer sociodemographic attributes from observed mobility patterns by capturing latent behavioral intentions and reasoning through psychological constructs. Particularly, our approach explicitly follows the Theory of Planned Behavior (TPB), a foundational behavioral framework in transportation research, to model individuals’ latent cognitive processes underlying travel decision-making. The LLMs further provide heuristic guidance to improve IRL reward function initialization and update by addressing its ill-posedness and optimization challenges arising from the vast and unstructured reward space. Evaluated in the 2017 Puget Sound Regional Council Household Travel Survey, our method substantially outperforms state-of-the-art baselines and shows great promise for enriching big trajectory data to support more behaviorally grounded applications in transportation planning and beyond.
zh

[AI-82] Optimal Policy Minimum Bayesian Risk

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在处理复杂推理问题时的准确性不足问题，尤其针对长链式思维（long chain-of-thought, long-CoT）生成任务。其解决方案的关键在于提出一种新的方法，将奖励信号和风险/相似性信号整合到最小贝叶斯风险解码（Minimum Bayes Risk Decoding, MBRD）中，该方法基于KL控制强化学习中的最优策略概念，提供了一种简单且定义明确的机制，从而提升了模型的鲁棒性和准确性，并具备可预测的渐近行为。此外，该框架还支持一种样本高效的MBRD变体，能够根据问题难度动态调整生成样本数量，而无需依赖多数投票机制。

链接: https://arxiv.org/abs/2505.17242
作者: Ramón Fernandez Astudillo,Md Arafat Sultan,Aashka Trivedi,Yousef El-Kurdi,Tahira Naseem,Radu Florian,Salim Roukos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference scaling can help LLMs solve complex reasoning problems through extended runtime computation. On top of targeted supervision for long chain-of-thought (long-CoT) generation, purely inference-time techniques such as best-of-N (BoN) sampling, majority voting, or more generally, minimum Bayes risk decoding (MBRD), can further improve LLM accuracy by generating multiple candidate solutions and aggregating over them. These methods typically leverage additional signals in the form of reward models and risk/similarity functions that compare generated samples, e.g., exact match in some normalized space or standard similarity metrics such as Rouge. Here we present a novel method for incorporating reward and risk/similarity signals into MBRD. Based on the concept of optimal policy in KL-controlled reinforcement learning, our framework provides a simple and well-defined mechanism for leveraging such signals, offering several advantages over traditional inference-time methods: higher robustness, improved accuracy, and well-understood asymptotic behavior. In addition, it allows for the development of a sample-efficient variant of MBRD that can adjust the number of samples to generate according to the difficulty of the problem, without relying on majority vote counts. We empirically demonstrate the advantages of our approach on math (MATH- 500 ) and coding (HumanEval) tasks using recent open-source models. We also present a comprehensive analysis of its accuracy-compute trade-offs.
zh

[AI-83] Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis

【速读】：该论文试图解决生成式人工智能（Generative AI）对人类创造力影响的实证研究不足问题，具体探讨GenAI是否能够生成具有创造性的想法，以及其在支持人类生成创造性且多样化的想法方面的能力。解决方案的关键在于通过系统文献检索和元分析方法，评估GenAI在创意任务中的表现，并比较GenAI单独使用、与人类协作时的创造性表现及想法多样性，从而揭示GenAI作为增强工具在提升人类创造力中的作用。

链接: https://arxiv.org/abs/2505.17241
作者: Niklas Holzner,Sebastian Maier,Stefan Feuerriegel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 6 figures. Code and data are available at this https URL

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on the performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges’ g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity-particularly in tasks benefiting from ideation support.
zh

[AI-84] Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models

【速读】：该论文试图解决大型语言模型在复杂推理任务中表现出的“推理僵化”（reasoning rigidity）问题，即模型在面对明确指令时仍倾向于依赖熟悉的推理模式，从而导致错误结论。解决方案的关键在于引入一个由专家精心设计的诊断数据集\dataset，该数据集包含经过修改的数学基准测试（如AIME和MATH500）以及重新设计的逻辑谜题，旨在系统性地揭示模型在默认使用习惯性推理路径时产生的污染模式。通过分析这些模式，研究者将其归纳为三种主要类型：解释过载、输入不信任和部分指令关注，从而为未来缓解推理僵化提供了基础支持。

链接: https://arxiv.org/abs/2505.17225
作者: Doohyuk Jang,Yoonjeon Kim,Chanjae Park,Hyun Ryu,Eunho Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term \textitreasoning rigidity. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzle, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce a expert-curated diagnostic set, \dataset. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.
zh

[AI-85] Effective Reinforcement Learning for Reasoning in Language Models

【速读】：该论文试图解决如何有效提升语言模型（Language Model, LM）在数学和编程等推理任务中的表现，同时优化训练过程的计算效率。其关键解决方案在于重新审视并调整强化学习（Reinforcement Learning, RL）算法的设计，以适应LM推理的特点，包括采用在线策略RL优于监督微调、基于PPO的离线策略更新可提高准确性、移除KL散度有助于生成更简洁且准确的结果，并提出一种名为DASH的新算法，通过预取采样和梯度过滤显著降低训练时间而不牺牲性能。

链接: https://arxiv.org/abs/2505.17218
作者: Lianghuan Huang,Shuo Li,Sagnik Anupam,Insup Lee,Osbert Bastani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a promising strategy for improving the reasoning capabilities of language models (LMs) in domains such as mathematics and coding. However, most modern RL algorithms were designed to target robotics applications, which differ significantly from LM reasoning. We analyze RL algorithm design decisions for LM reasoning, for both accuracy and computational efficiency, focusing on relatively small models due to computational constraints. Our findings are: (i) on-policy RL significantly outperforms supervised fine-tuning (SFT), (ii) PPO-based off-policy updates increase accuracy instead of reduce variance, and (iii) removing KL divergence can lead to more concise generations and higher accuracy. Furthermore, we find that a key bottleneck to computational efficiency is that the optimal batch sizes for inference and backpropagation are different. We propose a novel algorithm, DASH, that performs preemptive sampling (i.e., sample a large batch and accumulate gradient updates in small increments), and gradient filtering (i.e., drop samples with small advantage estimates). We show that DASH reduces training time by 83% compared to a standard implementation of GRPO without sacrificing accuracy. Our findings provide valuable insights on designing effective RL algorithms for LM reasoning.
zh

[AI-86] MEDMKG: Benchmarking Medical Knowledge Exploitation with Multimodal Knowledge Graph NEURIPS2025

【速读】：该论文试图解决医疗深度学习模型在知识密集型临床任务中依赖领域特定知识的问题，特别是如何有效整合多模态医学知识图谱以提升模型性能。现有研究主要依赖单模态知识图谱（如UMLS），而多模态医学知识图谱的整合仍处于探索阶段，主要受限于影像数据与临床概念之间缺乏关联资源。解决方案的关键在于提出MEDMKG，一个通过多阶段构建流程统一视觉和文本医学信息的医学多模态知识图谱，其融合了MIMIC-CXR中的丰富多模态数据与UMLS的结构化临床知识，并采用基于规则的工具和大语言模型进行精准的概念提取与关系建模，同时引入Neighbor-aware Filtering（NaF）算法以确保图的质量和紧凑性。

链接: https://arxiv.org/abs/2505.17214
作者: Xiaochen Wang,Yuan Zhong,Lingwei Zhang,Lisong Dai,Ting Wang,Fenglong Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to Neurips 2025

点击查看摘要

Abstract:Medical deep learning models depend heavily on domain-specific knowledge to perform well on knowledge-intensive clinical tasks. Prior work has primarily leveraged unimodal knowledge graphs, such as the Unified Medical Language System (UMLS), to enhance model performance. However, integrating multimodal medical knowledge graphs remains largely underexplored, mainly due to the lack of resources linking imaging data with clinical concepts. To address this gap, we propose MEDMKG, a Medical Multimodal Knowledge Graph that unifies visual and textual medical information through a multi-stage construction pipeline. MEDMKG fuses the rich multimodal data from MIMIC-CXR with the structured clinical knowledge from UMLS, utilizing both rule-based tools and large language models for accurate concept extraction and relationship modeling. To ensure graph quality and compactness, we introduce Neighbor-aware Filtering (NaF), a novel filtering algorithm tailored for multimodal knowledge graphs. We evaluate MEDMKG across three tasks under two experimental settings, benchmarking twenty-four baseline methods and four state-of-the-art vision-language backbones on six datasets. Results show that MEDMKG not only improves performance in downstream medical tasks but also offers a strong foundation for developing adaptive and robust strategies for multimodal knowledge integration in medical artificial intelligence.
zh

[AI-87] LiloDriver: A Lifelong Learning Framework for Closed-loop Motion Planning in Long-tail Autonomous Driving Scenarios

【速读】：该论文旨在解决自动驾驶中运动规划器在长尾场景下的适应性不足问题，现有基于规则和数据驱动的规划器缺乏对罕见场景的适应能力，而知识驱动方法虽具备较强的推理能力，但在表示、控制和实际评估方面面临挑战。论文提出的解决方案是LiloDriver，其关键在于将大语言模型（Large Language Models, LLMs）与记忆增强的规划生成系统相结合，实现无需重新训练即可持续适应新场景的闭环运动规划框架。

链接: https://arxiv.org/abs/2505.17209
作者: Huaiyuan Yao,Pengfei Li,Bu Jin,Yupeng Zheng,An Liu,Lisen Mu,Qing Su,Qian Zhang,Yilun Chen,Peng Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Recent advances in autonomous driving research towards motion planners that are robust, safe, and adaptive. However, existing rule-based and data-driven planners lack adaptability to long-tail scenarios, while knowledge-driven methods offer strong reasoning but face challenges in representation, control, and real-world evaluation. To address these challenges, we present LiloDriver, a lifelong learning framework for closed-loop motion planning in long-tail autonomous driving scenarios. By integrating large language models (LLMs) with a memory-augmented planner generation system, LiloDriver continuously adapts to new scenarios without retraining. It features a four-stage architecture including perception, scene encoding, memory-based strategy refinement, and LLM-guided reasoning. Evaluated on the nuPlan benchmark, LiloDriver achieves superior performance in both common and rare driving scenarios, outperforming static rule-based and learning-based planners. Our results highlight the effectiveness of combining structured memory and LLM reasoning to enable scalable, human-like motion planning in real-world autonomous driving. Our code is available at this https URL.
zh

[AI-88] LengthLogD: A Length-Stratified Ensemble Framework for Enhanced Peptide Lipophilicity Prediction via Multi-Scale Feature Integration

【速读】：该论文旨在解决肽类化合物在药物开发中因膜渗透性低而导致的成药性受限问题，其核心挑战在于准确预测肽的logD值。解决方案的关键在于提出LengthLogD框架，通过分子长度分层策略构建专用模型，并创新性地整合多尺度分子表征，包括原子级（10个分子描述符）、结构级（1024位Morgan指纹）和拓扑级（3个基于图的特征，如Wiener指数），结合分层集成学习优化特征空间，同时引入针对长肽的自适应权重分配机制，从而显著提升模型的泛化能力和预测精度。

链接: https://arxiv.org/abs/2505.17198
作者: Shuang Wu,Meijie Wang,Lun Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Peptide compounds demonstrate considerable potential as therapeutic agents due to their high target affinity and low toxicity, yet their drug development is constrained by their low membrane permeability. Molecular weight and peptide length have significant effects on the logD of peptides, which in turn influences their ability to cross biological membranes. However, accurate prediction of peptide logD remains challenging due to the complex interplay between sequence, structure, and ionization states. This study introduces LengthLogD, a predictive framework that establishes specialized models through molecular length stratification while innovatively integrating multi-scale molecular representations. We constructed feature spaces across three hierarchical levels: atomic (10 molecular descriptors), structural (1024-bit Morgan fingerprints), and topological (3 graph-based features including Wiener index), optimized through stratified ensemble learning. An adaptive weight allocation mechanism specifically developed for long peptides significantly enhances model generalizability. Experimental results demonstrate superior performance across all categories: short peptides (R^2=0.855), medium peptides (R^2=0.816), and long peptides (R^2=0.882), with a 34.7% reduction in prediction error for long peptides compared to conventional single-model approaches. Ablation studies confirm: 1) The length-stratified strategy contributes 41.2% to performance improvement; 2) Topological features account for 28.5% of predictive importance. Compared to state-of-the-art models, our method maintains short peptide prediction accuracy while achieving a 25.7% increase in the coefficient of determination (R^2) for long peptides. This research provides a precise logD prediction tool for peptide drug development, particularly demonstrating unique value in optimizing long peptide lead compounds.
zh

[AI-89] A Toolkit for Compliance a Toolkit for Justice: Drawing on Cross-sectoral Expertise to Develop a Pro-justice EU AI Act Toolkit

【速读】：该论文试图解决如何为人工智能从业者开发一个符合《人工智能法案》（AI Act）要求的AI伦理工具包，同时超越单纯合规性，考虑更广泛的社经伦理问题。解决方案的关键在于通过跨行业合作，结合英国学术团队与意大利产业团队的优势，共同构建一个实用且具有现实意义的伦理工具包，以应对监管合规与社会伦理之间的复杂关系。

链接: https://arxiv.org/abs/2505.17165
作者: Tomasz Hollanek,Yulu Pi,Cosimo Fiorini,Virginia Vignali,Dorian Peters,Eleanor Drage
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: In proceedings of ACM FAccT 2025

点击查看摘要

Abstract:The introduction of the AI Act in the European Union presents the AI research and practice community with a set of new challenges related to compliance. While it is certain that AI practitioners will require additional guidance and tools to meet these requirements, previous research on toolkits that aim to translate the theory of AI ethics into development and deployment practice suggests that such resources suffer from multiple limitations. These limitations stem, in part, from the fact that the toolkits are either produced by industry-based teams or by academics whose work tends to be abstract and divorced from the realities of industry. In this paper, we discuss the challenge of developing an AI ethics toolkit for practitioners that helps them comply with new AI-focused regulation, but that also moves beyond mere compliance to consider broader socio-ethical questions throughout development and deployment. The toolkit was created through a cross-sectoral collaboration between an academic team based in the UK and an industry team in Italy. We outline the background and rationale for creating a pro-justice AI Act compliance toolkit, detail the process undertaken to develop it, and describe the collaboration and negotiation efforts that shaped its creation. We aim for the described process to serve as a blueprint for other teams navigating the challenges of academia-industry partnerships and aspiring to produce usable and meaningful AI ethics resources.
zh

[AI-90] DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLM s Based on Capturing Real-World Changes

【速读】：该论文试图解决大语言模型（Large Language Models, LLMs）在处理快速变化的事实性数据和多领域信息时面临的挑战，特别是其对时间敏感的网络信息的处理能力不足的问题。解决方案的关键在于构建一个动态更新的数据集DailyQA，该数据集通过每周更新问题并包含任意日期的答案，利用维基百科修订日志实现全自动的数据过滤、查询生成与合成、质量检查、答案提取及查询分类流程，从而为LLMs和检索增强生成（Retrieval-Augmented Generation, RAG）系统提供基准测试环境。

链接: https://arxiv.org/abs/2505.17162
作者: Jiehan Cheng,Zhicheng Dou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose DailyQA, an automatically updated dynamic dataset that updates questions weekly and contains answers to questions on any given date. DailyQA utilizes daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation synthesis, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions involving fast-changing factual data and covering multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that rerank of web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that DailyQA benchmarking provides valuable insights into the direction of progress for LLMs and RAG systems.
zh

[AI-91] Efficient Training of Neural SDEs Using Stochastic Optimal Control

【速读】：该论文旨在解决神经随机微分方程（neural stochastic differential equations, SDEs）中变分推断（variational inference, VI）的计算挑战问题，特别是在时间序列中实现不确定性感知推理时的效率问题。其解决方案的关键在于将控制项分解为线性部分和残差非线性部分，并通过随机最优控制理论为线性SDEs推导出最优控制项。通过神经网络建模非线性部分，实现了在不牺牲模型表达能力的前提下高效训练神经SDEs，其中线性部分的最优性使其无需学习，从而降低了训练成本并加快了收敛速度。

链接: https://arxiv.org/abs/2505.17150
作者: Rembert Daems,Manfred Opper,Guillaume Crevecoeur,Tolga Birdal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)
备注: Published in the ESANN 2025 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium) and online event, 23-25 April 2025

点击查看摘要

Abstract:We present a hierarchical, control theory inspired method for variational inference (VI) for neural stochastic differential equations (SDEs). While VI for neural SDEs is a promising avenue for uncertainty-aware reasoning in time-series, it is computationally challenging due to the iterative nature of maximizing the ELBO. In this work, we propose to decompose the control term into linear and residual non-linear components and derive an optimal control term for linear SDEs, using stochastic optimal control. Modeling the non-linear component by a neural network, we show how to efficiently train neural SDEs without sacrificing their expressive power. Since the linear part of the control term is optimal and does not need to be learned, the training is initialized at a lower cost and we observe faster convergence.
zh

[AI-92] LLM -Powered Agents for Navigating Venices Historical Cadastre

【速读】：该论文试图解决历史城市土地登记数据（cadastral data）因格式多样和人工标注导致的非标准化问题，从而阻碍了大规模分析。其解决方案的关键在于提出一种文本到程序的框架，利用生成式 AI (Generative AI) 将自然语言查询转化为可执行代码，以处理历史土地登记记录。该框架结合了文本到SQL和文本到Python两种互补技术，分别用于结构化查询和复杂数据分析，并通过构建分类体系将研究问题与最合适的技术方法对应，从而有效支持历史城市空间信息的重建与比较分析。

链接: https://arxiv.org/abs/2505.17148
作者: Tristan Karch,Jakhongir Saydaliev,Isabella Di Lenardo,Frédéric Kaplan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cadastral data reveal key information about the historical organization of cities but are often non-standardized due to diverse formats and human annotations, complicating large-scale analysis. We explore as a case study Venice’s urban history during the critical period from 1740 to 1808, capturing the transition following the fall of the ancient Republic and the Ancien Régime. This era’s complex cadastral data, marked by its volume and lack of uniform structure, presents unique challenges that our approach adeptly navigates, enabling us to generate spatial queries that bridge past and present urban landscapes. We present a text-to-programs framework that leverages Large Language Models (LLMs) to translate natural language queries into executable code for processing historical cadastral records. Our methodology implements two complementary techniques: a text-to-SQL approach for handling structured queries about specific cadastral information, and a text-to-Python approach for complex analytical operations requiring custom data manipulation. We propose a taxonomy that classifies historical research questions based on their complexity and analytical requirements, mapping them to the most appropriate technical approach. This framework is supported by an investigation into the execution consistency of the system, alongside a qualitative analysis of the answers it produces. By ensuring interpretability and minimizing hallucination through verifiable program outputs, we demonstrate the system’s effectiveness in reconstructing past population information, property features, and spatiotemporal comparisons in Venice.
zh

[AI-93] MTSA: Multi-turn Safety Alignment for LLM s through Multi-round Red-teaming ACL2025

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在多轮对话中因恶意意图隐藏而产生有害响应的安全问题。其解决方案的关键在于提出一种名为Multi-Turn Safety Alignment（\ourapproach）的框架，该框架包含两个阶段：在思想引导的攻击学习阶段，红队模型学习生成思想引导的多轮越狱攻击提示；在对抗迭代优化阶段，红队模型与目标模型通过交互持续提升各自能力。此外，引入基于未来奖励的多轮强化学习算法以增强安全对齐的鲁棒性。

链接: https://arxiv.org/abs/2505.17147
作者: Weiyang Guo,Jing Li,Wenya Wang,YU LI,Daojing He,Jun Yu,Min Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 pages,6 figures,ACL2025

点击查看摘要

Abstract:The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the \textbfMulti-\textbfTurn \textbfSafety \textbfAlignment (\ourapproach) framework, to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
zh

[AI-94] LLM Access Shield: Domain-Specific LLM Framework for Privacy Policy Compliance

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在金融、教育和治理等领域广泛应用过程中面临的数据隐私与安全问题，特别是敏感数据泄露的风险。其解决方案的关键在于提出一个安全框架，包含三项核心创新：基于LLM的策略执行、动态策略定制以及敏感数据匿名化。这些技术共同实现了对LLM交互中的策略合规性保障和安全风险的缓解，同时保持了模型任务的功能准确性。

链接: https://arxiv.org/abs/2505.17145
作者: Yu Wang,Cailing Cai,Zhihua Xiao,Peifung E. Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly applied in fields such as finance, education, and governance due to their ability to generate human-like text and adapt to specialized tasks. However, their widespread adoption raises critical concerns about data privacy and security, including the risk of sensitive data exposure. In this paper, we propose a security framework to enforce policy compliance and mitigate risks in LLM interactions. Our approach introduces three key innovations: (i) LLM-based policy enforcement: a customizable mechanism that enhances domain-specific detection of sensitive data. (ii) Dynamic policy customization: real-time policy adaptation and enforcement during user-LLM interactions to ensure compliance with evolving security requirements. (iii) Sensitive data anonymization: a format-preserving encryption technique that protects sensitive information while maintaining contextual integrity. Experimental results demonstrate that our framework effectively mitigates security risks while preserving the functional accuracy of LLM-driven tasks. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.17145 [cs.CR] (or arXiv:2505.17145v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.17145 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-95] Evaluating the Performance of Nigerian Lecturers using Multilayer Perceptron

【速读】：该论文试图解决传统讲师绩效评估系统缺乏全面性和整体性的问题，旨在提升教学质量、学生学习成果和机构声誉。解决方案的关键在于设计了一个基于Web平台的系统，利用多层感知机（MLP）算法处理复杂数据模式，结合学生评价分数、科研发表、教学经验和行政职责等多维绩效指标，并通过面向对象分析与设计（OOAD）方法实现系统的开发，从而提供准确、公平且高效的教学评估。

链接: https://arxiv.org/abs/2505.17143
作者: I.E. Ezeibe,S.O. Okide,D.C. Asogwa
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating the performance of a lecturer has been essential for enhancing teaching quality, improving student learning outcomes, and strengthening the institution’s reputation. The absence of such a system brings about lecturer performance evaluation which was neither comprehensive nor holistic. This system was designed using a web-based platform, created a secure database, and by using a custom dataset, captured some performance metrics which included student evaluation scores, Research Publications, Years of Experience, and Administrative Duties. Multilayer Perceptron (MLP) algorithm was utilized due to its ability to process complex data patterns and generates accurate predictions in a lecturer’s performance based on historical data. This research focused on designing multiple performance metrics beyond the standard ones, incorporating student participation, and integrating analytical tools to deliver a comprehensive and holistic evaluation of lecturers’ performance and was developed using Object-Oriented Analysis and Design (OOAD) methodology. Lecturers’ performance is evaluated by the model, and the evaluation accuracy is about 91% compared with actual performance. Finally, by evaluating the performance of the MLP model, it is concluded that MLP enhanced lecturer performance evaluation by providing accurate predictions, reducing bias, and supporting data-driven decisions, ultimately improving the fairness and efficiency of the evaluation process. The MLP model’s performance was evaluated using Mean Squared Error (MSE) and Mean Absolute Error (MAE), achieved a test loss (MSE) of 256.99 and a MAE of 13.76, and reflected a high level of prediction accuracy. The model also demonstrated an estimated accuracy rate of approximately 96%, validated its effectiveness in predicting lecturer performance.
zh

[AI-96] MetaSTH-Sleep: Towards Effective Few-Shot Sleep Stage Classification with Spatial-Temporal Hypergraph Enhanced Meta-Learning

【速读】：该论文旨在解决基于生物信号的睡眠阶段准确分类问题，特别是在标注数据有限、个体间生物信号差异大以及现有方法未能充分建模生物信号间的高阶关系等现实挑战下的自动化睡眠阶段标注难题。其解决方案的关键在于提出一种基于时空超图增强元学习的少样本睡眠阶段分类框架MetaSTH-Sleep，该框架通过超图结构同时建模脑电（EEG）信号中的复杂空间相互作用和时间动态特性，从而实现仅需少量标注样本即可快速适应新受试者，并提升模型的泛化能力。

链接: https://arxiv.org/abs/2505.17142
作者: Jingyu Li,Tiehua Zhang,Jinze Wang,Yi Zhang,Yuhuan Li,Yifan Zhao,Zhishu Shen,Jiannan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate classification of sleep stages based on bio-signals is fundamental for automatic sleep stage annotation. Traditionally, this task relies on experienced clinicians to manually annotate data, a process that is both time-consuming and labor-intensive. In recent years, deep learning methods have shown promise in automating this task. However, three major challenges remain: (1) deep learning models typically require large-scale labeled datasets, making them less effective in real-world settings where annotated data is limited; (2) significant inter-individual variability in bio-signals often results in inconsistent model performance when applied to new subjects, limiting generalization; and (3) existing approaches often overlook the high-order relationships among bio-signals, failing to simultaneously capture signal heterogeneity and spatial-temporal dependencies. To address these issues, we propose MetaSTH-Sleep, a few-shot sleep stage classification framework based on spatial-temporal hypergraph enhanced meta-learning. Our approach enables rapid adaptation to new subjects using only a few labeled samples, while the hypergraph structure effectively models complex spatial interconnections and temporal dynamics simultaneously in EEG signals. Experimental results demonstrate that MetaSTH-Sleep achieves substantial performance improvements across diverse subjects, offering valuable insights to support clinicians in sleep stage annotation.
zh

[AI-97] Fashion Industry in the Age of Generative Artificial Intelligence and Metaverse: A systematic Review

【速读】：该论文试图解决如何通过整合生成式 AI (Generative Artificial Intelligence, GAI) 与元宇宙（metaverse）技术来提升时尚产业的效率与体验问题。其解决方案的关键在于提出一种新的框架，通过融合 GAI 与元宇宙的技术能力，实现对时尚产业在制造、设计、销售及客户体验等方面的深度优化，并展示了多种应用场景以促进该技术整合的有效实施。

链接: https://arxiv.org/abs/2505.17141
作者: Rania Ahmed,Eman Ahmed,Ahmed Elbarbary,Ashraf Darwish,Aboul Ella Hassanien
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The fashion industry is an extremely profitable market that generates trillions of dollars in revenue by producing and distributing apparel, footwear, and accessories. This systematic literature review (SLR) seeks to systematically review and analyze the research landscape about the Generative Artificial Intelligence (GAI) and metaverse in the fashion industry. Thus, investigating the impact of integrating both technologies to enhance the fashion industry. This systematic review uses the Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) methodology, including three essential phases: identification, evaluation, and reporting. In the identification phase, the target search problems are determined by selecting appropriate keywords and alternative synonyms. After that 578 documents from 2014 to the end of 2023 are retrieved. The evaluation phase applies three screening steps to assess papers and choose 118 eligible papers for full-text reading. Finally, the reporting phase thoroughly examines and synthesizes the 118 eligible papers to identify key themes associated with GAI and Metaverse in the fashion industry. Based on Strengths, Weaknesses, Opportunities, and Threats (SWOT) analyses performed for both GAI and metaverse for the fashion industry, it is concluded that the integration of GAI and the metaverse holds the capacity to profoundly revolutionize the fashion sector, presenting chances for improved manufacturing, design, sales, and client experiences. Accordingly, the research proposes a new framework to integrate GAI and metaverse to enhance the fashion industry. The framework presents different use cases to promote the fashion industry using the integration. Future research points for achieving a successful integration are demonstrated.
zh

[AI-98] RAP: Runtime-Adaptive Pruning for LLM Inference

【速读】：该论文旨在解决大规模语言模型（Large Language Models, LLMs）在部署过程中因计算和内存需求过大而导致的限制问题。现有压缩方法依赖于固定的启发式策略，无法适应运行时内存变化或由多样化用户请求引起的异构键值缓存（KV-cache）需求。论文提出的解决方案是RAP，一个基于强化学习（Reinforcement Learning, RL）的弹性剪枝框架，其关键在于动态调整压缩策略，以实时感知的方式优化模型参数与KV-cache之间的比例关系，从而在当前内存预算下最大化系统效用。

链接: https://arxiv.org/abs/2505.17138
作者: Huanrong Liu,Chunlin Tian,Xuyang Wei,Jiaheng Dai,Qin Liu,Tianqi Wei,Qingbiao Li,Li Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter -light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experiments results demonstrate that RAP outperforms state-of-the-art baselines, marking the first time to jointly consider model weights and KV-cache on the fly.
zh

[AI-99] NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction

【速读】：该论文试图解决传统算法方法与基于大型语言模型（Large Language Model, LLM）的抽取方法在网页数据记录抽取任务中难以公平比较的问题，这一挑战源于静态、领域特定的基准和不透明的评分机制。其解决方案的关键在于提出一个系统化的评估框架，该框架能够从任意MHTML快照生成评估数据集，标注基于XPath的监督标签，并采用结构感知的度量标准以防止文本幻觉，仅允许对位置幻觉进行评估。此外，框架还整合了预处理策略，如HTML精简、分层JSON和扁平JSON，以优化输入并保留DOM语义，从而提升LLM的抽取性能。

链接: https://arxiv.org/abs/2505.17125
作者: Soyeon Kim,Namhee Kim,Yeonwoo Jeong
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Web Data Record Extraction, Zero-Shot Extraction, Large Language Models (LLMs) Evaluation Framework, Comparative Analysis

点击查看摘要

Abstract:Effective evaluation of web data record extraction methods is crucial, yet hampered by static, domain-specific benchmarks and opaque scoring practices. This makes fair comparison between traditional algorithmic techniques, which rely on structural heuristics, and Large Language Model (LLM)-based approaches, offering zero-shot extraction across diverse layouts, particularly challenging. To overcome these limitations, we introduce a concrete evaluation framework. Our framework systematically generates evaluation datasets from arbitrary MHTML snapshots, annotates XPath-based supervision labels, and employs structure-aware metrics for consistent scoring, specifically preventing text hallucination and allowing only for the assessment of positional hallucination. It also incorporates preprocessing strategies to optimize input for LLMs while preserving DOM semantics: HTML slimming, Hierarchical JSON, and Flat JSON. Additionally, we created a publicly available synthetic dataset by transforming DOM structures and modifying content. We benchmark deterministic heuristic algorithms and off-the-shelf LLMs across these multiple input formats. Our benchmarking shows that Flat JSON input enables LLMs to achieve superior extraction accuracy (F1 score of 0.9567) and minimal hallucination compared to other input formats like Slimmed HTML and Hierarchical JSON. We establish a standardized foundation for rigorous benchmarking, paving the way for the next principled advancements in web data record extraction.
zh

[AI-100] Swarm Intelligence Enhanced Reasoning : A Density-Driven Framework for LLM -Based Multi-Agent Optimization

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在复杂推理场景中缺乏寻找最优解能力的问题。其解决方案的关键在于引入一种基于代理的群体智能（Agent-based Swarm Intelligence, ASI）范式，将LLM的推理过程建模为优化问题，并通过群体智能方案引导多个LLM代理协同搜索最优解。此外，论文进一步提出了群体智能增强推理（Swarm Intelligence Enhancing Reasoning, SIER）框架，采用密度驱动策略以同时优化解的质量与多样性，从而提升解空间的探索效率。

链接: https://arxiv.org/abs/2505.17115
作者: Ying Zhu,Heng Zhou,Rui Su,Peiqin Zhuang,Lei Bai
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, many approaches, such as Chain-of-Thought (CoT) prompting and Multi-Agent Debate (MAD), have been proposed to further enrich Large Language Models’ (LLMs) complex problem-solving capacities in reasoning scenarios. However, these methods may fail to solve complex problems due to the lack of ability to find optimal solutions. Swarm Intelligence has been serving as a powerful tool for finding optima in the field of traditional optimization problems. To this end, we propose integrating swarm intelligence into the reasoning process by introducing a novel Agent-based Swarm Intelligence (ASI) paradigm. In this paradigm, we formulate LLM reasoning as an optimization problem and use a swarm intelligence scheme to guide a group of LLM-based agents in collaboratively searching for optimal solutions. To avoid swarm intelligence getting trapped in local optima, we further develop a Swarm Intelligence Enhancing Reasoning (SIER) framework, which develops a density-driven strategy to enhance the reasoning ability. To be specific, we propose to perform kernel density estimation and non-dominated sorting to optimize both solution quality and diversity simultaneously. In this case, SIER efficiently enhances solution space exploration through expanding the diversity of the reasoning path. Besides, a step-level quality evaluation is used to help agents improve solution quality by correcting low-quality intermediate steps. Then, we use quality thresholds to dynamically control the termination of exploration and the selection of candidate steps, enabling a more flexible and efficient reasoning process. Extensive experiments are …
zh

[AI-101] REMS: a unified solution representation problem modeling and metaheuristic algorithm design for general combinatorial optimization problems

【速读】：该论文旨在解决组合优化问题（Combinatorial Optimization Problems, COPs）在建模与求解过程中需要针对具体问题定制算法的挑战，从而提升算法的通用性和可重用性。解决方案的关键在于提出一种以资源为中心的建模与求解框架（Resource-Centered Modeling and Solving framework, REMS），通过将COP抽象为资源分配任务的过程，统一定义资源、任务及约束，并基于此构建统一的解结构，进而设计一系列基础算子（如初始解生成、邻域结构、破坏与修复、交叉和排序等），支持多种元启发式算法的开发。实验表明，REMS能够在统一范式下建模多种COP，并有效求解，尤其在处理大规模和复杂问题时优于传统求解器。

链接: https://arxiv.org/abs/2505.17108
作者: Aijuan Song,Guohua Wu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 15 pages, 11 figures, regular reseach paper

点击查看摘要

Abstract:Combinatorial optimization problems (COPs) with discrete variables and finite search space are critical across numerous fields, and solving them in metaheuristic algorithms is popular. However, addressing a specific COP typically requires developing a tailored and handcrafted algorithm. Even minor adjustments, such as constraint changes, may necessitate algorithm redevelopment. Therefore, establishing a framework for formulating diverse COPs into a unified paradigm and designing reusable metaheuristic algorithms is valuable. A COP can be typically viewed as the process of giving resources to perform specific tasks, subjecting to given constraints. Motivated by this, a resource-centered modeling and solving framework (REMS) is introduced for the first time. We first extract and define resources and tasks from a COP. Subsequently, given predetermined resources, the solution structure is unified as assigning tasks to resources, from which variables, objectives, and constraints can be derived and a problem model is constructed. To solve the modeled COPs, several fundamental operators are designed based on the unified solution structure, including the initial solution, neighborhood structure, destruction and repair, crossover, and ranking. These operators enable the development of various metaheuristic algorithms. Specially, 4 single-point-based algorithms and 1 population-based algorithm are configured herein. Experiments on 10 COPs, covering routing, location, loading, assignment, scheduling, and graph coloring problems, show that REMS can model these COPs within the unified paradigm and effectively solve them with the designed metaheuristic algorithms. Furthermore, REMS is more competitive than GUROBI and SCIP in tackling large-scale instances and complex COPs, and outperforms OR-TOOLS on several challenging COPs.
zh

[AI-102] CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

【速读】：该论文旨在解决大型语言模型（Large Language Model, LLM）代理在网络安全领域中面临的两个关键问题：无法访问训练数据之外的最新网络安全知识，以及难以将新知识整合到复杂的任务规划中。其解决方案的关键在于提出一种基于知识的LLM代理框架CRAKEN，该框架通过三个核心机制提升网络安全能力：任务关键信息的上下文分解、迭代自省知识检索以及知识提示注入，从而将洞察转化为适应性攻击策略。

链接: https://arxiv.org/abs/2505.17107
作者: Minghao Shao,Haoran Xi,Nanda Rani,Meet Udeshi,Venkata Sai Charan Putrevu,Kimberly Milner,Brendan Dolan-Gavitt,Sandeep Kumar Shukla,Prashanth Krishnamurthy,Farshad Khorrami,Ramesh Karri,Muhammad Shafique
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents can automate cybersecurity tasks and can adapt to the evolving cybersecurity landscape without re-engineering. While LLM agents have demonstrated cybersecurity capabilities on Capture-The-Flag (CTF) competitions, they have two key limitations: accessing latest cybersecurity expertise beyond training data, and integrating new knowledge into complex task planning. Knowledge-based approaches that incorporate technical understanding into the task-solving automation can tackle these limitations. We present CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through three core mechanisms: contextual decomposition of task-critical information, iterative self-reflected knowledge retrieval, and knowledge-hint injection that transforms insights into adaptive attack strategies. Comprehensive evaluations with different configurations show CRAKEN’s effectiveness in multi-stage vulnerability detection and exploitation compared to previous approaches. Our extensible architecture establishes new methodologies for embedding new security knowledge into LLM-driven cybersecurity agentic systems. With a knowledge database of CTF writeups, CRAKEN obtained an accuracy of 22% on NYU CTF Bench, outperforming prior works by 3% and achieving state-of-the-art results. On evaluation of MITRE ATTCK techniques, CRAKEN solves 25-30% more techniques than prior work, demonstrating improved cybersecurity capabilities via knowledge-based execution. We make our framework open source to public this https URL.
zh

[AI-103] ransparency in Healthcare AI: Testing European Regulatory Provisions against Users Transparency Needs

【速读】：该论文试图解决生成式 AI (Generative AI) 在医疗领域应用中，其使用说明书（Instructions for Use, IFU）是否满足欧盟《人工智能法案》（AI Act）所要求的透明性原则问题。研究通过调查不同利益相关者对透明性需求的优先级及其在 IFU 结构中的映射情况，评估现有 IFU 是否清晰且与用户相关。解决方案的关键在于识别利益相关者的差异化需求，并提出构建本地化有意义 IFU 的建议，以实现透明性要求与实际使用场景的匹配。

链接: https://arxiv.org/abs/2505.17105
作者: Anna Spagnolli,Cecilia Tolomini,Elisa Beretta,Claudio Sarra
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 22 pages, pre-review version

点击查看摘要

Abstract:Artificial Intelligence (AI) plays an essential role in healthcare and is pervasively incorporated into medical software and equipment. In the European Union, healthcare is a high-risk application domain for AI, and providers must prepare Instructions for Use (IFU) according to the European regulation 2024/1689 (AI Act). To this regulation, the principle of transparency is cardinal and requires the IFU to be clear and relevant to the users. This study tests whether these latter requirements are satisfied by the IFU structure. A survey was administered online via the Qualtrics platform to four types of direct stakeholders, i.e., managers (N = 238), healthcare professionals (N = 115), patients (N = 229), and Information Technology experts (N = 230). The participants rated the relevance of a set of transparency needs and indicated the IFU section addressing them. The results reveal differentiated priorities across stakeholders and a troubled mapping of transparency needs onto the IFU structure. Recommendations to build a locally meaningful IFU are derived.
zh

[AI-104] From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM -powered systems

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在安全与可靠性方面带来的复杂挑战，这些挑战难以通过传统的风险评估方法进行有效管理。其解决方案的关键在于引入100多种非概率风险管理系统策略，这些策略来源于工程领域的通用方法，如事件树分析或稳健设计，并将其应用于LLM系统中，以应对包括适应性攻击在内的新兴风险。这些策略被划分为五类，并映射到LLM安全及更广泛的AI安全领域，旨在提升负责任AI的安全性、可靠性等多维度表现。

链接: https://arxiv.org/abs/2505.17084
作者: Alexander Gutfraind,Vicki Bier
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) offer unprecedented and growing capabilities, but also introduce complex safety and security challenges that resist conventional risk management. While conventional probabilistic risk analysis (PRA) requires exhaustive risk enumeration and quantification, the novelty and complexity of these systems make PRA impractical, particularly against adaptive adversaries. Previous research found that risk management in various fields of engineering such as nuclear or civil engineering is often solved by generic (i.e. field-agnostic) strategies such as event tree analysis or robust designs. Here we show how emerging risks in LLM-powered systems could be met with 100+ of these non-probabilistic strategies to risk management, including risks from adaptive adversaries. The strategies are divided into five categories and are mapped to LLM security (and AI safety more broadly). We also present an LLM-powered workflow for applying these strategies and other workflows suitable for solution architects. Overall, these strategies could contribute (despite some limitations) to security, safety and other dimensions of responsible AI.
zh

[AI-105] Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration

【速读】：该论文旨在解决在生产环境中使用大语言模型（Large Language Models, LLMs）时所面临的安全部署问题，包括对越狱攻击和提示注入的脆弱性，这些漏洞可能导致对人类或企业有害的输出。当在特定领域（如汽车工业）中使用时，这一挑战更加显著，因为通用的LLM话题可能与该领域无关。论文提出的解决方案关键在于引入一个名为Archias的专家模型，该模型能够有效区分领域内和领域外的通信，并对用户查询进行分类，包括领域内（针对汽车工业）、恶意问题、价格注入、提示注入和领域外示例。通过将Archias的输出整合到提示中，再由LLM生成响应，从而提升模型理解用户意图并给出适当回答的能力。Archias因其体积小，可灵活调整、微调并适用于多种用途，便于根据不同行业需求进行定制。

链接: https://arxiv.org/abs/2505.17066
作者: Tatia Tsmindashvili,Ana Kolkhidashvili,Dachi Kurtskhalia,Nino Maghlakelidze,Elene Mekvabishvili,Guram Dentoshvili,Orkhan Shamilov,Zaal Gachechiladze,Steven Saporta,David Dachi Choladze
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under review at IEEE Access. Supplementary material is included in the main PDF. Benchmark dataset ( this http URL , this http URL ) should be included as ancillary data

点击查看摘要

Abstract:Using LLMs in a production environment presents security challenges that include vulnerabilities to jailbreaks and prompt injections, which can result in harmful outputs for humans or the enterprise. The challenge is amplified when working within a specific domain, as topics generally accepted for LLMs to address may be irrelevant to that field. These problems can be mitigated, for example, by fine-tuning large language models with domain-specific and security-focused data. However, these alone are insufficient, as jailbreak techniques evolve. Additionally, API-accessed models do not offer the flexibility needed to tailor behavior to industry-specific objectives, and in-context learning is not always sufficient or reliable. In response to these challenges, we introduce Archias, an expert model adept at distinguishing between in-domain and out-of-domain communications. Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples. Our methodology integrates outputs from the expert model (Archias) into prompts, which are then processed by the LLM to generate responses. This method increases the model’s ability to understand the user’s intention and give appropriate answers. Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size. Therefore, it can be easily customized to the needs of any industry. To validate our approach, we created a benchmark dataset for the automotive industry. Furthermore, in the interest of advancing research and development, we release our benchmark dataset to the community.
zh

[AI-106] An Affective-Taxis Hypothesis for Alignment and Interpretability

【速读】：该论文试图解决AI对齐（AI alignment）问题，即如何确保智能体的行为始终与其人类操作者的意图和价值观保持一致。解决方案的关键在于提出一种情感主义（affectivist）方法，将目标和价值观重新定义为情感趋性（affective taxis），并通过进化发育和计算神经科学的最新研究解释情感效价（affective valence）的产生。该研究进一步提出了基于趋性导航的计算情感模型，并通过可处理的模式生物验证了该模型与生物趋性导航的相似性。

链接: https://arxiv.org/abs/2505.17024
作者: Eli Sennesh,Maxwell Ramstead
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:AI alignment is a field of research that aims to develop methods to ensure that agents always behave in a manner aligned with (i.e. consistently with) the goals and values of their human operators, no matter their level of capability. This paper proposes an affectivist approach to the alignment problem, re-framing the concepts of goals and values in terms of affective taxis, and explaining the emergence of affective valence by appealing to recent work in evolutionary-developmental and computational neuroscience. We review the state of the art and, building on this work, we propose a computational model of affect based on taxis navigation. We discuss evidence in a tractable model organism that our model reflects aspects of biological taxis navigation. We conclude with a discussion of the role of affective taxis in AI alignment.
zh

[AI-107] ReMi: A Random Recurrent Neural Network Approach to Music Production

【速读】：该论文试图解决生成式AI（Generative AI）在能源消耗、版权侵权和创造力衰退方面的问题。其解决方案的关键在于利用随机初始化的循环神经网络（Recurrent Neural Networks），能够生成丰富且可配置的琶音和低频振荡，从而扩展音乐家的创造力，同时无需数据和较少的计算资源。

链接: https://arxiv.org/abs/2505.17023
作者: Hugo Chateau-Laurent,Tara Vanhatalo
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted for an Innovation Showcase Demo at International Computer Music Conference

点击查看摘要

Abstract:Generative artificial intelligence raises concerns related to energy consumption, copyright infringement and creative atrophy. We show that randomly initialized recurrent neural networks can produce arpeggios and low-frequency oscillations that are rich and configurable. In contrast to end-to-end music generation that aims to replace musicians, our approach expands their creativity while requiring no data and much less computational power. More information can be found at: this https URL
zh

[AI-108] Advancing Uncertain Combinatorics through Graphization Hyperization and Uncertainization: Fuzzy Neutrosophic Soft Rough and Beyond

【速读】：该论文旨在探索新的图论与集合概念，以及超图和超超图相关概念，以应对现实世界中的不确定性问题。其解决方案的关键在于引入并扩展一系列基于不确定性的集合模型，如中性模糊超集、中性模糊子集、中性模糊偏移集和非标准实数集，从而为复杂系统的不确定性建模提供更丰富的工具。这些概念的提出不仅拓展了传统集合论和图论的边界，也为人工智能、离散数学、机器学习和组合数学等领域提供了新的研究方向和理论支持。

链接: https://arxiv.org/abs/2411.17411
作者: Takaaki Fujita
机构: 未知
类目: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
备注: 255 pages. 11 figures. Published as a book in 2024. Publisher: Biblio Publishing. ISBN: 978-1-59973-812-3

点击查看摘要

Abstract:To better handle real-world uncertainty, concepts such as fuzzy sets, neutrosophic sets, rough sets, and soft sets have been introduced. For example, neutrosophic sets, which simultaneously represent truth, indeterminacy, and falsehood, have proven to be valuable tools for modeling uncertainty in complex systems. These set concepts are increasingly studied in graphized forms, and generalized graph concepts now encompass well-known structures such as hypergraphs and superhypergraphs. Furthermore, hyperconcepts and superhyperconcepts are being actively researched in areas beyond graph theory. Combinatorics, uncertain sets (including fuzzy sets, neutrosophic sets, rough sets, soft sets, and plithogenic sets), uncertain graphs, and hyper and superhyper concepts are active areas of research with significant mathematical and practical implications. Recognizing their importance, this paper explores new graph and set concepts, as well as hyper and superhyper concepts, as detailed in the “Results” section of “The Structure of the Paper.” Additionally, this work aims to consolidate recent findings, providing a survey-like resource to inform and engage readers. For instance, we extend several graph concepts by introducing Neutrosophic Oversets, Neutrosophic Undersets, Neutrosophic Offsets, and the Nonstandard Real Set. This paper defines a variety of concepts with the goal of inspiring new ideas and serving as a valuable resource for researchers in their academic pursuits. Comments: 255 pages. 11 figures. Published as a book in 2024. Publisher: Biblio Publishing. ISBN: 978-1-59973-812-3 Subjects: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO) MSC classes: 03B52 Cite as: arXiv:2411.17411 [cs.AI] (or arXiv:2411.17411v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2411.17411 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-109] Federated Causal Inference from Multi-Site Observational Data via Propensity Score Aggregation

【速读】：该论文试图解决在数据分散于多个站点的情况下，如何进行因果推断的问题，特别是在无法集中个体级数据的前提下估计平均处理效应（Average Treatment Effect, ATE）。其解决方案的关键在于利用联邦学习框架，通过交换聚合统计信息而非个体数据来实现因果推断。具体而言，论文提出了一种新颖的方法，通过计算本地倾向得分的联邦加权平均来估计倾向得分，并采用两种理论基础坚实的加权方案——成员权重（Membership Weights, MW）和密度比权重（Density Ratio Weights, DW），以平衡通信效率与模型灵活性。随后，基于这些联邦倾向得分构建了两种ATE估计器：联邦逆倾向权重估计器（Fed-IPW）及其增强变体（Fed-AIPW）。

链接: https://arxiv.org/abs/2505.17961
作者: Khellaf Rémi,Bellet Aurélien,Josse Julie
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Causal inference typically assumes centralized access to individual-level data. Yet, in practice, data are often decentralized across multiple sites, making centralization infeasible due to privacy, logistical, or legal constraints. We address this by estimating the Average Treatment Effect (ATE) from decentralized observational data using federated learning, which enables inference through the exchange of aggregate statistics rather than individual-level data. We propose a novel method to estimate propensity scores in a (non-)parametric manner by computing a federated weighted average of local scores, using two theoretically grounded weighting schemes – Membership Weights (MW) and Density Ratio Weights (DW) – that balance communication efficiency and model flexibility. These federated scores are then used to construct two ATE estimators: the Federated Inverse Propensity Weighting estimator (Fed-IPW) and its augmented variant (Fed-AIPW). Unlike meta-analysis methods, which fail when any site violates positivity, our approach leverages heterogeneity in treatment assignment across sites to improve overlap. We show that Fed-IPW and Fed-AIPW perform well under site-level heterogeneity in sample sizes, treatment mechanisms, and covariate distributions, with theoretical analysis and experiments on simulated and real-world data highlighting their strengths and limitations relative to meta-analysis and related methods.
zh

[AI-110] LMask: Learn to Solve Constrained Routing Problems with Lazy Masking

【速读】：该论文旨在解决具有复杂约束条件的路径规划问题（constrained routing problems），这类问题在物流、运输和供应链管理中具有广泛应用。其解决方案的关键在于提出一种名为LMask的新学习框架，该框架通过动态掩码生成高质量的可行解。LMask引入了LazyMask解码方法，利用回溯机制对可行性掩码进行惰性优化，并采用精炼强度嵌入（refinement intensity embedding）将搜索轨迹编码到模型中，以缓解回溯引起的表示模糊性。此外，LMask在解码过程中设置回溯预算，并在训练阶段通过损失函数惩罚约束违反，从而降低采样成本并减少不可行性。

链接: https://arxiv.org/abs/2505.17938
作者: Tianyou Li,Haijun Zou,Jiayuan Wu,Zaiwen Wen
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Routing problems are canonical combinatorial optimization tasks with wide-ranging applications in logistics, transportation, and supply chain management. However, solving these problems becomes significantly more challenging when complex constraints are involved. In this paper, we propose LMask, a novel learning framework that utilizes dynamic masking to generate high-quality feasible solutions for constrained routing problems. LMask introduces the LazyMask decoding method, which lazily refines feasibility masks with the backtracking mechanism. In addition, it employs the refinement intensity embedding to encode the search trace into the model, mitigating representation ambiguities induced by backtracking. To further reduce sampling cost, LMask sets a backtracking budget during decoding, while constraint violations are penalized in the loss function during training to counteract infeasibility caused by this budget. We provide theoretical guarantees for the validity and probabilistic optimality of our approach. Extensive experiments on the traveling salesman problem with time windows (TSPTW) and TSP with draft limits (TSPDL) demonstrate that LMask achieves state-of-the-art feasibility rates and solution quality, outperforming existing neural methods.
zh

[AI-111] DataRater: Meta-Learned Dataset Curation

【速读】：该论文旨在解决基础模型训练数据质量对模型性能影响的问题，传统方法依赖于人工调整粗粒度的数据组合或手工设计的启发式过滤策略，缺乏高效性和精细度。其解决方案的关键在于提出一种基于元学习（meta-learning）的方法——DataRater，通过元梯度（meta-gradients）估计每个数据点对训练的价值，从而实现更精细、有效的数据筛选，提升训练效率和计算资源利用率。

链接: https://arxiv.org/abs/2505.17895
作者: Dan A. Calian,Gregory Farquhar,Iurii Kemaev,Luisa M. Zintgraf,Matteo Hessel,Jeremy Shar,Junhyuk Oh,András György,Tom Schaul,Jeffrey Dean,Hado van Hasselt,David Silver
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or filtering by hand-crafted heuristics. An approach that is ultimately more scalable (let alone more satisfying) is to \emphlearn which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed \emphDataRater is an instance of this idea. It estimates the value of training on any particular data point. This is done by meta-learning using `meta-gradients’, with the objective of improving training efficiency on held out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
zh

[AI-112] A Distributionally-Robust Framework for Nuisance in Causal Effect Estimation

【速读】：该论文旨在解决因果推断中由于历史决策政策导致的处理组与对照组数据分布不平衡问题，此类不平衡会使得模型在评估时面临挑战。传统统计方法通过逆概率加权（Inverse Probability Weighting, IPW）来应对分布偏移，但该方法面临两个关键问题：倾向性得分估计不准确以及极端权重带来的不稳定。论文通过分解泛化误差，识别出倾向性模糊性和统计不稳定性这两个核心问题，并提出一种基于对抗损失函数的解决方案，结合了分布鲁棒优化以处理倾向性不确定性，以及基于加权Rademacher复杂度的权重正则化，从而提升了模型的稳定性和性能。

链接: https://arxiv.org/abs/2505.17717
作者: Akira Tanimoto
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Causal inference requires evaluating models on balanced distributions between treatment and control groups, while training data often exhibits imbalance due to historical decision-making policies. Most conventional statistical methods address this distribution shift through inverse probability weighting (IPW), which requires estimating propensity scores as an intermediate step. These methods face two key challenges: inaccurate propensity estimation and instability from extreme weights. We decompose the generalization error to isolate these issues–propensity ambiguity and statistical instability–and address them through an adversarial loss function. Our approach combines distributionally robust optimization for handling propensity uncertainty with weight regularization based on weighted Rademacher complexity. Experiments on synthetic and real-world datasets demonstrate consistent improvements over existing methods.
zh

[AI-113] he Discovery Engine: A Framework for AI-Driven Synthesis and Navigation of Scientific Knowledge Landscapes

【速读】：该论文试图解决传统科学知识传播模式在面对近年来出版物指数级增长时所导致的信息过载、可重复性问题及撤稿现象等问题。其解决方案的关键在于引入Discovery Engine框架，通过将分散的文献转化为统一的、可计算的科学领域表示，具体表现为利用大语言模型（LLM）驱动的文献提炼，生成带有可验证来源证据链接的结构化“知识工件”，并将其编码为高维概念张量，从而实现对科学组件及其相互依赖关系的量化表征。

链接: https://arxiv.org/abs/2505.17500
作者: Vladimir Baulin,Austin Cook,Daniel Friedman,Janna Lumiruusu,Andrew Pashea,Shagor Rahman,Benedikt Waldeck
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The prevailing model for disseminating scientific knowledge relies on individual publications dispersed across numerous journals and archives. This legacy system is ill suited to the recent exponential proliferation of publications, contributing to insurmountable information overload, issues surrounding reproducibility and retractions. We introduce the Discovery Engine, a framework to address these challenges by transforming an array of disconnected literature into a unified, computationally tractable representation of a scientific domain. Central to our approach is the LLM-driven distillation of publications into structured “knowledge artifacts,” instances of a universal conceptual schema, complete with verifiable links to source evidence. These artifacts are then encoded into a high-dimensional Conceptual Tensor. This tensor serves as the primary, compressed representation of the synthesized field, where its labeled modes index scientific components (concepts, methods, parameters, relations) and its entries quantify their interdependencies. The Discovery Engine allows dynamic “unrolling” of this tensor into human-interpretable views, such as explicit knowledge graphs (the CNM graph) or semantic vector spaces, for targeted exploration. Crucially, AI agents operate directly on the graph using abstract mathematical and learned operations to navigate the knowledge landscape, identify non-obvious connections, pinpoint gaps, and assist researchers in generating novel knowledge artifacts (hypotheses, designs). By converting literature into a structured tensor and enabling agent-based interaction with this compact representation, the Discovery Engine offers a new paradigm for AI-augmented scientific inquiry and accelerated discovery.
zh

[AI-114] HiLAB: A Hybrid Inverse-Design Framework

【速读】：该论文试图解决纳米光子结构逆向设计中多功能器件设计的高计算成本与局部最优问题，其解决方案的关键在于结合了早期终止的拓扑优化（TO）、基于视觉Transformer的变分自编码器（VAE）以及贝叶斯优化的混合框架。通过缩短伴随驱动的TO运行并引入随机物理参数，生成鲁棒的初始结构，并利用VAE将其压缩到紧凑的潜在空间，从而实现几何参数与物理超参数的联合优化。此外，训练好的VAE可通过调整获取函数灵活适应不同目标或约束，显著减少了电磁仿真次数，提升了设计效率与性能。

链接: https://arxiv.org/abs/2505.17491
作者: Reza Marzban,Hamed Abiri,Raphael Pestourie,Ali Adibi
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:HiLAB (Hybrid inverse-design with Latent-space learning, Adjoint-based partial optimizations, and Bayesian optimization) is a new paradigm for inverse design of nanophotonic structures. Combining early-terminated topological optimization (TO) with a Vision Transformer-based variational autoencoder (VAE) and a Bayesian search, HiLAB addresses multi-functional device design by generating diverse freeform configurations at reduced simulation costs. Shortened adjoint-driven TO runs, coupled with randomized physical parameters, produce robust initial structures. These structures are compressed into a compact latent space by the VAE, enabling Bayesian optimization to co-optimize geometry and physical hyperparameters. Crucially, the trained VAE can be reused for alternative objectives or constraints by adjusting only the acquisition function. Compared to conventional TO pipelines prone to local optima, HiLAB systematically explores near-global optima with considerably fewer electromagnetic simulations. Even after accounting for training overhead, the total number of full simulations decreases by over an order of magnitude, accelerating the discovery of fabrication-friendly devices. Demonstrating its efficacy, HiLAB is used to design an achromatic beam deflector for red, green, and blue wavelengths, achieving balanced diffraction efficiencies of ~25% while mitigating chromatic aberrations-a performance surpassing existing demonstrations. Overall, HiLAB provides a flexible platform for robust, multi-parameter photonic designs and rapid adaptation to next-generation nanophotonic challenges.
zh

[AI-115] Alpay Algebra II: Identity as Fixed-Point Emergence in Categorical Data

【速读】：该论文试图解决身份（identity）在数学范畴论框架下的定义与性质问题，特别是如何将其视为一种通过范畴递归产生的固定点。解决方案的关键在于利用超限算子 $\varphi^\infty$ ，将身份表征为小笛卡尔闭范畴上自指函子方程的普遍解，并通过序数索引迭代证明此类身份固定点的存在性与唯一性，同时借助内部范畴极限解释其收敛性。

链接: https://arxiv.org/abs/2505.17480
作者: Faruk Alpay
机构: 未知
类目: Category Theory (math.CT); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: 13 pages, no figures. Sequel to Alpay Algebra: A Universal Structural Foundation ( arXiv:2505.15344 ). Defines identity as a categorical fixed point in the Alpay Algebra system. All content is self-contained

点击查看摘要

Abstract:In this second installment of the Alpay Algebra framework, I formally define identity as a fixed point that emerges through categorical recursion. Building upon the transfinite operator \varphi^\infty , I characterize identity as the universal solution to a self-referential functorial equation over a small cartesian closed category. I prove the existence and uniqueness of such identity-fixed-points via ordinal-indexed iteration, and interpret their convergence through internal categorical limits. Functors, adjunctions, and morphisms are reconstructed as dynamic traces of evolving states governed by \varphi , reframing identity not as a static label but as a stabilized process. Through formal theorems and symbolic flows, I show how these fixed points encode symbolic memory, recursive coherence, and semantic invariance. This paper positions identity as a mathematical structure that arises from within the logic of change itself computable, convergent, and categorically intrinsic.
zh

[AI-116] Can Large Language Models Design Biological Weapons? Evaluating Moremi Bio

【速读】：该论文试图解决生成式 AI (Generative AI) 在生物设计中的潜在双重用途问题，即其可能被用于设计有毒化合物和生物武器的风险。解决方案的关键在于通过实验验证当前大型语言模型（LLMs）在生物设计流程中具备设计新型有毒物质的能力，并提出多层次的缓解策略以应对由此产生的生物安全威胁。研究通过生成大量有毒蛋白质和小分子并进行计算毒性评估，揭示了LLMs在生物设计中的双刃剑特性，从而强调了建立严格监管和技术防护机制的必要性。

链接: https://arxiv.org/abs/2505.17154
作者: Gertrude Hattoh,Jeremiah Ayensu,Nyarko Prince Ofori,Solomon Eshun,Darlington Akogo
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advances in AI, particularly LLMs, have dramatically shortened drug discovery cycles by up to 40% and improved molecular target identification. However, these innovations also raise dual-use concerns by enabling the design of toxic compounds. Prompting Moremi Bio Agent without the safety guardrails to specifically design novel toxic substances, our study generated 1020 novel toxic proteins and 5,000 toxic small molecules. In-depth computational toxicity assessments revealed that all the proteins scored high in toxicity, with several closely matching known toxins such as ricin, diphtheria toxin, and disintegrin-based snake venom proteins. Some of these novel agents showed similarities with other several known toxic agents including disintegrin eristostatin, metalloproteinase, disintegrin triflavin, snake venom metalloproteinase, corynebacterium ulcerans toxin. Through quantitative risk assessments and scenario analyses, we identify dual-use capabilities in current LLM-enabled biodesign pipelines and propose multi-layered mitigation strategies. The findings from this toxicity assessment challenge claims that large language models (LLMs) are incapable of designing bioweapons. This reinforces concerns about the potential misuse of LLMs in biodesign, posing a significant threat to research and development (RD). The accessibility of such technology to individuals with limited technical expertise raises serious biosecurity risks. Our findings underscore the critical need for robust governance and technical safeguards to balance rapid biotechnological innovation with biosecurity imperatives.
zh

[AI-117] Learning Probabilities of Causation from Finite Population Data

【速读】：该论文试图解决在子群体数据不足的情况下预测因果概率（Probabilities of Causation）的问题，尤其是针对概率必要性与充分性（PNS）、概率充分性（PS）和概率必要性（PN）的估计难题。传统方法需要每个子群体的实验和观察分布，但在实际中这些数据往往不可用或难以获取。解决方案的关键在于利用机器学习模型，从数据充足的子群体中提取信息，以辅助估计数据不足子群体的因果概率，实验表明，采用适当的机器学习模型和激活函数（如Mish激活函数的多层感知机）能够在有限的数据条件下有效预测PNS。

链接: https://arxiv.org/abs/2505.17133
作者: Shuai Wang,Song Jiang,Yizhou Sun,Judea Pearl,Ang Li
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: arXiv admin note: text overlap with arXiv:2502.08858

点击查看摘要

Abstract:Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with \textbfinsufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). However, estimating these probabilities requires both experimental and observational distributions specific to each subpopulation, which are often unavailable or impractical to obtain with limited population-level data. Therefore, for most subgroups, the amount of data they have is not enough to guarantee the accuracy of their probabilities. Hence, to estimate these probabilities for subpopulations with \textbfinsufficient data, we propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given the population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted. Through simulation studies on multiple Structured Causal Models (SCMs), we show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately 0.02 in predicting PNS for 32,768 subpopulations across most SCMs using data from only 2,000 subpopulations with known PNS values.
zh

[AI-118] Normalized Cut with Reinforcement Learning in Constrained Action Space

【速读】：该论文试图解决在组合优化问题中如何整合外部知识以引导求解过程趋向于领域相关的最优解这一挑战。其解决方案的关键在于提出一种基于约束动作空间的强化学习（Reinforcement Learning, RL）方法，通过该方法引导归一化割（normalized cut）问题向预定义的模板实例靠拢，从而获得更符合实际应用场景的图划分结果。

链接: https://arxiv.org/abs/2505.13986
作者: Qize Jiang,Linsey Pang,Alice Gatti,Mahima Aggarwal,Giovanna Vantini,Xiaosong Ma,Weiwei Sun,Sanjay Chawla
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as an important paradigm to solve combinatorial optimization problems primarily due to its ability to learn heuristics that can generalize across problem instances. However, integrating external knowledge that will steer combinatorial optimization problem solutions towards domain appropriate outcomes remains an extremely challenging task. In this paper, we propose the first RL solution that uses constrained action spaces to guide the normalized cut problem towards pre-defined template instances. Using transportation networks as an example domain, we create a Wedge and Ring Transformer that results in graph partitions that are shaped in form of Wedges and Rings and which are likely to be closer to natural optimal partitions. However, our approach is general as it is based on principles that can be generalized to other domains.
zh

机器学习

[LG-0] Generative Distribution Embeddings

链接: https://arxiv.org/abs/2505.18150
作者: Nic Fishman,Gokul Gowri,Peng Yin,Jonathan Gootenberg,Omar Abudayyeh
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the W_2 distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning representations of cell populations from lineage-tracing data (150K cells), predicting perturbation effects on single-cell transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), modeling tissue-specific DNA methylation patterns (253M sequences), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).

[LG-1] Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization

链接: https://arxiv.org/abs/2505.18113
作者: Halyun Jeong,Jack Xin,Penghang Yin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely adopted heuristic, allowing backpropagation through discrete operations by introducing surrogate gradients. However, its theoretical properties remain largely unexplored, with few existing works simplifying the analysis by assuming an infinite amount of training data. In contrast, this work presents the first finite-sample analysis of STE in the context of neural network quantization. Our theoretical results highlight the critical role of sample size in the success of STE, a key insight absent from existing studies. Specifically, by analyzing the quantization-aware training of a two-layer neural network with binary weights and activations, we derive the sample complexity bound in terms of the data dimensionality that guarantees the convergence of STE-based optimization to the global minimum. Moreover, in the presence of label noises, we uncover an intriguing recurrence property of STE-gradient method, where the iterate repeatedly escape from and return to the optimal binary weights. Our analysis leverages tools from compressed sensing and dynamical systems theory.

[LG-2] Dynamic Dual Buffer with Divide-and-Conquer Strategy for Online Continual Learning

链接: https://arxiv.org/abs/2505.18101
作者: Congren Dai,Huichi Zhou,Jiahao Huang,Zhenxuan Zhang,Fanwen Wang,Guang Yang,Fei Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online Continual Learning (OCL) presents a complex learning environment in which new data arrives in a batch-to-batch online format, and the risk of catastrophic forgetting can significantly impair model efficacy. In this study, we address OCL by introducing an innovative memory framework that incorporates a short-term memory system to retain dynamic information and a long-term memory system to archive enduring knowledge. Specifically, the long-term memory system comprises a collection of sub-memory buffers, each linked to a cluster prototype and designed to retain data samples from distinct categories. We propose a novel K -means-based sample selection method to identify cluster prototypes for each encountered category. To safeguard essential and critical samples, we introduce a novel memory optimisation strategy that selectively retains samples in the appropriate sub-memory buffer by evaluating each cluster prototype against incoming samples through an optimal transportation mechanism. This approach specifically promotes each sub-memory buffer to retain data samples that exhibit significant discrepancies from the corresponding cluster prototype, thereby ensuring the preservation of semantically rich information. In addition, we propose a novel Divide-and-Conquer (DAC) approach that formulates the memory updating as an optimisation problem and divides it into several subproblems. As a result, the proposed DAC approach can solve these subproblems separately and thus can significantly reduce computations of the proposed memory updating process. We conduct a series of experiments across standard and imbalanced learning settings, and the empirical findings indicate that the proposed memory framework achieves state-of-the-art performance in both learning contexts.

[LG-3] Early-Exit Graph Neural Networks

链接: https://arxiv.org/abs/2505.18088
作者: Andrea Giuseppe Di Francesco,Maria Sofia Bucarelli,Franco Maria Nardini,Raffaele Perego,Nicola Tonellotto,Fabrizio Silvestri
类目: Machine Learning (cs.LG)
*备注: 37 pages, 14 figures

点击查看摘要

Abstract:Early-exit mechanisms allow deep neural networks to halt inference as soon as classification confidence is high enough, adaptively trading depth for confidence, and thereby cutting latency and energy on easy inputs while retaining full-depth accuracy for harder ones. Similarly, adding early exit mechanisms to Graph Neural Networks (GNNs), the go-to models for graph-structured data, allows for dynamic trading depth for confidence on simple graphs while maintaining full-depth accuracy on harder and more complex graphs to capture intricate relationships. Although early exits have proven effective across various deep learning domains, their potential within GNNs in scenarios that require deep architectures while resisting over-smoothing and over-squashing remains largely unexplored. We unlock that potential by first introducing Symmetric-Anti-Symmetric Graph Neural Networks (SAS-GNN), whose symmetry-based inductive biases mitigate these issues and yield stable intermediate representations that can be useful to allow early exiting in GNNs. Building on this backbone, we present Early-Exit Graph Neural Networks (EEGNNs), which append confidence-aware exit heads that allow on-the-fly termination of propagation based on each node or the entire graph. Experiments show that EEGNNs preserve robust performance as depth grows and deliver competitive accuracy on heterophilic and long-range benchmarks, matching attention-based and asynchronous message-passing models while substantially reducing computation and latency. We plan to release the code to reproduce our experiments.

[LG-4] What Do You Need for Diverse Trajectory Stitching in Diffusion Planning ?

链接: https://arxiv.org/abs/2505.18083
作者: Quentin Clark,Florian Shkurti
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 9 Pages

点击查看摘要

Abstract:In planning, stitching is an ability of algorithms to piece together sub-trajectories of data they are trained on to generate new and diverse behaviours. While stitching is historically a strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, hindering the development of new algorithms that can reliably stitch. Focusing on diffusion planners trained via BC, we find two properties are needed to compose: \emphpositional equivariance and \emphlocal receptiveness. We use these two properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning, including replanning frequency, data augmentation, and data scaling. Experimental comparisions show that (1) while locality is more important than positional equivariance in creating a diffusion planner capable of composition, both are crucial (2) enabling these properties through relatively simple architecture choices can be competitive with more computationally expensive methods such as replanning or scaling data, and (3) simple inpainting-based guidance can guide architecturally compositional models to enable generalization in goal-conditioned settings.

[LG-5] An Iterative Framework for Generative Backmapping of Coarse Grained Proteins

链接: https://arxiv.org/abs/2505.18082
作者: Georgios Kementzidis,Erin Wong,John Nicholson,Ruichen Xu,Yuefan Deng
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures. For associated code repositories, see: CGVAE: this https URL GenZProT: this https URL See also arXiv:2201.12176 and arXiv:2303.01569 for related methods

点击查看摘要

Abstract:The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic details. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.

[LG-6] Emergence of Hebbian Dynamics in Regularized Non-Local Learners

链接: https://arxiv.org/abs/2505.18069
作者: David Koplow,Tomaso Poggio,Liu Ziyin
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Stochastic Gradient Descent (SGD) has emerged as a remarkably effective learning algorithm, underpinning nearly all state-of-the-art machine learning models, from large language models to autonomous vehicles. Despite its practical success, SGD appears fundamentally distinct from biological learning mechanisms. It is widely believed that the biological brain can not implement gradient descent because it is nonlocal, and we have found little (if any) experimental evidence for it. In contrast, the brain is widely thought to learn via local Hebbian learning principles, which have been seen as incompatible with gradient descent. In this paper, we establish a theoretical and empirical connection between the learning signals of neural networks trained using SGD with weight decay and those trained with Hebbian learning near convergence. We show that SGD with regularization can appear to learn according to a Hebbian rule, and SGD with injected noise according to an anti-Hebbian rule. We also provide empirical evidence that Hebbian learning properties can emerge in a network with weight decay from virtually any learning rule–even random ones. These results may bridge a long-standing gap between artificial and biological learning, revealing Hebbian properties as an epiphenomenon of deeper optimization principles and cautioning against interpreting their presence in neural data as evidence against more complex hetero-synaptic mechanisms.

[LG-7] Reward Model Generalization for Compute-Aware Test-Time Reasoning

链接: https://arxiv.org/abs/2505.18065
作者: Zeen Song,Wenwen Qiang,Siyu Zhao,Changwen Zheng,Gang Hua
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) is used to score and select the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower generalization error of PRM leads to fewer samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling hyperparameters based on reward distributions and sparsity statistics, while the critic estimates their utility to guide budget allocation. Experiments on the MATH and AIME benchmarks with various LLMs and PRMs demonstrate that CATS consistently outperforms other external TTS methods, validating our theoretical predictions.

[LG-8] Asymptotically optimal regret in communicating Markov decision processes

链接: https://arxiv.org/abs/2505.18064
作者: Victor Boone
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we present a learning algorithm that achieves asymptotically optimal regret for Markov decision processes in average reward under a communicating assumption. That is, given a communicating Markov decision process M , our algorithm has regret K(M) \log(T) + \mathrmo(\log(T)) where T is the number of learning steps and K(M) is the best possible constant. This algorithm works by explicitly tracking the constant K(M) to learn optimally, then balances the trade-off between exploration (playing sub-optimally to gain information), co-exploration (playing optimally to gain information) and exploitation (playing optimally to score maximally). We further show that the function K(M) is discontinuous, which is a consequence challenge for our approach. To that end, we describe a regularization mechanism to estimate K(M) with arbitrary precision from empirical data.

[LG-9] Learning with Restricted Boltzmann Machines: Asymptotics of AMP and GD in High Dimensions

链接: https://arxiv.org/abs/2505.18046
作者: Yizhou Xu,Florent Krzakala,Lenka Zdeborová
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Restricted Boltzmann Machine (RBM) is one of the simplest generative neural networks capable of learning input distributions. Despite its simplicity, the analysis of its performance in learning from the training data is only well understood in cases that essentially reduce to singular value decomposition of the data. Here, we consider the limit of a large dimension of the input space and a constant number of hidden units. In this limit, we simplify the standard RBM training objective into a form that is equivalent to the multi-index model with non-separable regularization. This opens a path to analyze training of the RBM using methods that are established for multi-index models, such as Approximate Message Passing (AMP) and its state evolution, and the analysis of Gradient Descent (GD) via the dynamical mean-field theory. We then give rigorous asymptotics of the training dynamics of RBM on data generated by the spiked covariance model as a prototype of a structure suitable for unsupervised learning. We show in particular that RBM reaches the optimal computational weak recovery threshold, aligning with the BBP transition, in the spiked covariance model.

[LG-10] Improved Algorithms for Overlapping and Robust Clustering of Edge-Colored Hypergraphs: An LP-Based Combinatorial Approach

链接: https://arxiv.org/abs/2505.18043
作者: Changyeol Lee,Yongho Shin,Hyung-Chan An
类目: Machine Learning (cs.LG); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Clustering is a fundamental task in both machine learning and data mining. Among various methods, edge-colored clustering (ECC) has emerged as a useful approach for handling categorical data. Given a hypergraph with (hyper)edges labeled by colors, ECC aims to assign vertex colors to minimize the number of edges where the vertex color differs from the edge’s color. However, traditional ECC has inherent limitations, as it enforces a nonoverlapping and exhaustive clustering. To tackle these limitations, three versions of ECC have been studied: Local ECC and Global ECC, which allow overlapping clusters, and Robust ECC, which accounts for vertex outliers. For these problems, both linear programming (LP) rounding algorithms and greedy combinatorial algorithms have been proposed. While these LP-rounding algorithms provide high-quality solutions, they demand substantial computation time; the greedy algorithms, on the other hand, run very fast but often compromise solution quality. In this paper, we present an algorithmic framework that combines the strengths of LP with the computational efficiency of combinatorial algorithms. Both experimental and theoretical analyses show that our algorithms efficiently produce high-quality solutions for all three problems: Local, Global, and Robust ECC. We complement our algorithmic contributions with complexity-theoretic inapproximability results and integrality gap bounds, which suggest that significant theoretical improvements are unlikely. Our results also answer two open questions previously raised in the literature.

[LG-11] me to Spike? Understanding the Representational Power of Spiking Neural Networks in Discrete Time

链接: https://arxiv.org/abs/2505.18023
作者: Duc Anh Nguyen,Ernesto Araya,Adalbert Fono,Gitta Kutyniok
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs with static inputs and outputs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs employing piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.

[LG-12] Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling

链接: https://arxiv.org/abs/2505.18017
作者: Matthieu Blanke,Yongquan Qu,Sara Shamekh,Pierre Gentine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep generative models hold great promise for representing complex physical systems, but their deployment is currently limited by the lack of guarantees on the physical plausibility of the generated outputs. Ensuring that known physical constraints are enforced is therefore critical when applying generative models to scientific and engineering problems. We address this limitation by developing a principled framework for sampling from a target distribution while rigorously satisfying physical constraints. Leveraging the variational formulation of Langevin dynamics, we propose Split Augmented Langevin (SAL), a novel primal-dual sampling algorithm that enforces constraints progressively through variable splitting, with convergence guarantees. While the method is developed theoretically for Langevin dynamics, we demonstrate its effective applicability to diffusion models. In particular, we use constrained diffusion models to generate physical fields satisfying energy and mass conservation laws. We apply our method to diffusion-based data assimilation on a complex physical system, where enforcing physical constraints substantially improves both forecast accuracy and the preservation of critical conserved quantities. We also demonstrate the potential of SAL for challenging feasibility problems in optimal control.

[LG-13] Distances for Markov chains from sample streams

链接: https://arxiv.org/abs/2505.18005
作者: Sergio Calo,Anders Jonsson,Gergely Neu,Ludovic Schwartz,Javier Segovia-Aguas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bisimulation metrics are powerful tools for measuring similarities between stochastic processes, and specifically Markov chains. Recent advances have uncovered that bisimulation metrics are, in fact, optimal-transport distances, which has enabled the development of fast algorithms for computing such metrics with provable accuracy and runtime guarantees. However, these recent methods, as well as all previously known methods, assume full knowledge of the transition dynamics. This is often an impractical assumption in most real-world scenarios, where typically only sample trajectories are available. In this work, we propose a stochastic optimization method that addresses this limitation and estimates bisimulation metrics based on sample access, without requiring explicit transition models. Our approach is derived from a new linear programming (LP) formulation of bisimulation metrics, which we solve using a stochastic primal-dual optimization method. We provide theoretical guarantees on the sample complexity of the algorithm and validate its effectiveness through a series of empirical evaluations.

[LG-14] Rethinking Contrastive Learning in Graph Anomaly Detection: A Clean-View Perspective

链接: https://arxiv.org/abs/2505.18002
作者: Di Jin,Jingyi Cao,Xiaobao Wang,Bingdao Feng,Dongxiao He,Longbiao Wang,Jianwu Dang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph anomaly detection aims to identify unusual patterns in graph-based data, with wide applications in fields such as web security and financial fraud detection. Existing methods typically rely on contrastive learning, assuming that a lower similarity between a node and its local subgraph indicates abnormality. However, these approaches overlook a crucial limitation: the presence of interfering edges invalidates this assumption, since it introduces disruptive noise that compromises the contrastive learning process. Consequently, this limitation impairs the ability to effectively learn meaningful representations of normal patterns, leading to suboptimal detection performance. To address this issue, we propose a Clean-View Enhanced Graph Anomaly Detection framework (CVGAD), which includes a multi-scale anomaly awareness module to identify key sources of interference in the contrastive learning process. Moreover, to mitigate bias from the one-step edge removal process, we introduce a novel progressive purification module. This module incrementally refines the graph by iteratively identifying and removing interfering edges, thereby enhancing model performance. Extensive experiments on five benchmark datasets validate the effectiveness of our approach.

[LG-15] Revisiting Feature Interactions from the Perspective of Quadratic Neural Networks for Click-through Rate Prediction KDD’25

链接: https://arxiv.org/abs/2505.17999
作者: Honghao Li,Yiwen Zhang,Yi Zhang,Lei Sang,Jieming Zhu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: KDD’25 accepted

点击查看摘要

Abstract:Hadamard Product (HP) has long been a cornerstone in click-through rate (CTR) prediction tasks due to its simplicity, effectiveness, and ability to capture feature interactions without additional parameters. However, the underlying reasons for its effectiveness remain unclear. In this paper, we revisit HP from the perspective of Quadratic Neural Networks (QNN), which leverage quadratic interaction terms to model complex feature relationships. We further reveal QNN’s ability to expand the feature space and provide smooth nonlinear approximations without relying on activation functions. Meanwhile, we find that traditional post-activation does not further improve the performance of the QNN. Instead, mid-activation is a more suitable alternative. Through theoretical analysis and empirical evaluation of 25 QNN neuron formats, we identify a good-performing variant and make further enhancements on it. Specifically, we propose the Multi-Head Khatri-Rao Product as a superior alternative to HP and a Self-Ensemble Loss with dynamic ensemble capability within the same network to enhance computational efficiency and performance. Ultimately, we propose a novel neuron format, QNN-alpha, which is tailored for CTR prediction tasks. Experimental results show that QNN-alpha achieves new state-of-the-art performance on six public datasets while maintaining low inference latency, good scalability, and excellent compatibility. The code, running logs, and detailed hyperparameter configurations are available at: this https URL.

[LG-16] A Principled Bayesian Framework for Training Binary and Spiking Neural Networks

链接: https://arxiv.org/abs/2505.17962
作者: James A. Walker,Moein Khajehnejad,Adeel Razi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a Bayesian framework for training binary and spiking neural networks that achieves state-of-the-art performance without normalisation layers. Unlike commonly used surrogate gradient methods – often heuristic and sensitive to hyperparameter choices – our approach is grounded in a probabilistic model of noisy binary networks, enabling fully end-to-end gradient-based optimisation. We introduce importance-weighted straight-through (IW-ST) estimators, a unified class generalising straight-through and relaxation-based estimators. We characterise the bias-variance trade-off in this family and derive a bias-minimising objective implemented via an auxiliary loss. Building on this, we introduce Spiking Bayesian Neural Networks (SBNNs), a variational inference framework that uses posterior noise to train Binary and Spiking Neural Networks with IW-ST. This Bayesian approach minimises gradient bias, regularises parameters, and introduces dropout-like noise. By linking low-bias conditions, vanishing gradients, and the KL term, we enable training of deep residual networks without normalisation. Experiments on CIFAR-10, DVS Gesture, and SHD show our method matches or exceeds existing approaches without normalisation or hand-tuned gradients.

[LG-17] VeriThinker: Learning to Verify Makes Reasoning Model Efficient

链接: https://arxiv.org/abs/2505.17941
作者: Zigeng Chen,Xinyin Ma,Gongfan Fang,Ruonan Yu,Xinchao Wang
类目: Machine Learning (cs.LG)
*备注: Working in progress. Code Repo: this https URL

点击查看摘要

Abstract:Large Reasoning Models (LRMs) excel at complex tasks using Chain-of-Thought (CoT) reasoning. However, their tendency to overthinking leads to unnecessarily lengthy reasoning chains, dramatically increasing inference costs. To mitigate this issue, we introduce VeriThinker, a novel approach for CoT compression. Unlike conventional methods that fine-tune LRMs directly on the original reasoning task using synthetic concise CoT data, we innovatively fine-tune the model solely through an auxiliary verification task. By training LRMs to accurately verify the correctness of CoT solutions, the LRMs inherently become more discerning about the necessity of subsequent self-reflection steps, thereby effectively suppressing overthinking. Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%), and on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%). Additionally, our experiments demonstrate that VeriThinker can also be zero-shot generalized to speculative reasoning. Code is available at this https URL

[LG-18] Directed Semi-Simplicial Learning with Applications to Brain Activity Decoding

链接: https://arxiv.org/abs/2505.17939
作者: Manuel Lecha,Andrea Cavallo,Francesca Dominici,Ran Levi,Alessio Del Bue,Elvin Isufi,Pietro Morerio,Claudio Battiloro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) excel at learning from pairwise interactions but often overlook multi-way and hierarchical relationships. Topological Deep Learning (TDL) addresses this limitation by leveraging combinatorial topological spaces. However, existing TDL models are restricted to undirected settings and fail to capture the higher-order directed patterns prevalent in many complex systems, e.g., brain networks, where such interactions are both abundant and functionally significant. To fill this gap, we introduce Semi-Simplicial Neural Networks (SSNs), a principled class of TDL models that operate on semi-simplicial sets – combinatorial structures that encode directed higher-order motifs and their directional relationships. To enhance scalability, we propose Routing-SSNs, which dynamically select the most informative relations in a learnable manner. We prove that SSNs are strictly more expressive than standard graph and TDL models. We then introduce a new principled framework for brain dynamics representation learning, grounded in the ability of SSNs to provably recover topological descriptors shown to successfully characterize brain activity. Empirically, SSNs achieve state-of-the-art performance on brain dynamics classification tasks, outperforming the second-best model by up to 27%, and message passing GNNs by up to 50% in accuracy. Our results highlight the potential of principled topological models for learning from structured brain data, establishing a unique real-world case study for TDL. We also test SSNs on standard node classification and edge regression tasks, showing competitive performance. We will make the code and data publicly available.

[LG-19] Selection Mechanisms for Sequence Modeling using Linear State Space Models

链接: https://arxiv.org/abs/2505.17932
作者: Umberto Casti,Sandro Zampieri,Fabio Pasqualetti
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in language modeling tasks have been driven by architectures such as Transformers and, more recently, by Selective State Space Models (SSMs). In this paper, we introduce an alternative selection mechanism inspired by control theory methodologies. Specifically, we propose a novel residual generator for selection, drawing an analogy to fault detection strategies in Linear Time-Invariant (LTI) systems. Unlike Mamba, which utilizes Linear Time-Varying (LTV) systems, our approach combines multiple LTI systems, preserving their beneficial properties during training while achieving comparable selectivity. To evaluate the effectiveness of the proposed architecture, we test its performance on synthetic tasks. While these tasks are not inherently critical, they serve as benchmarks to test the selectivity properties of different cores architecture. This work highlights the potential of integrating theoretical insights with experimental advancements, offering a complementary perspective to deep learning innovations at the intersection of control theory and machine learning.

[LG-20] Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV

链接: https://arxiv.org/abs/2505.17929
作者: Alexander Gabitashvili,Philipp Kellmeyer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Intensive care unit (ICU) is a crucial hospital department that handles life-threatening cases. Nowadays machine learning (ML) is being leveraged in healthcare ubiquitously. In recent years, management of ICU became one of the most significant parts of the hospital functionality (largely but not only due to the worldwide COVID-19 pandemic). This study explores multiple ML approaches for predicting LOS in ICU specifically for the patients with neurological diseases based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost and CatBoost) and Neural Networks (LSTM, BERT and Temporal Fusion Transformer). Given that LOS prediction is often framed as a classification task, this study categorizes LOS into three groups: less than two days, less than a week, and a week or more. As the first ML-based approach targeting LOS prediction for neurological disorder patients, this study does not aim to outperform existing methods but rather to assess their effectiveness in this specific context. The findings provide insights into the applicability of ML techniques for improving ICU resource management and patient care. According to the results, Random Forest model proved to outperform others on static, achieving an accuracy of 0.68, a precision of 0.68, a recall of 0.68, and F1-score of 0.67. While BERT model outperformed LSTM model on time-series data with an accuracy of 0.80, a precision of 0.80, a recall of 0.80 and F1-score 0.80.

[LG-21] KITINet: Kinetics Theory Inspired Network Architectures with PDE Simulation Approaches

链接: https://arxiv.org/abs/2505.17919
作者: Mingquan Feng,Yifan Fu,Tongcheng Zhang,Yu Jiang,Yixin Huang,Junchi Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the widely recognized success of residual connections in modern neural networks, their design principles remain largely heuristic. This paper introduces KITINet (Kinetics Theory Inspired Network), a novel architecture that reinterprets feature propagation through the lens of non-equilibrium particle dynamics and partial differential equation (PDE) simulation. At its core, we propose a residual module that models feature updates as the stochastic evolution of a particle system, numerically simulated via a discretized solver for the Boltzmann transport equation (BTE). This formulation mimics particle collisions and energy exchange, enabling adaptive feature refinement via physics-informed interactions. Additionally, we reveal that this mechanism induces network parameter condensation during training, where parameters progressively concentrate into a sparse subset of dominant channels. Experiments on scientific computation (PDE operator), image classification (CIFAR-10/100), and text classification (IMDb/SNLI) show consistent improvements over classic network baselines, with negligible increase of FLOPs.

[LG-22] LLM Meeting Decision Trees on Tabular Data

链接: https://arxiv.org/abs/2505.17918
作者: Hangting Ye,Jinmeng Li,He Zhao,Dandan Guo,Yi Chang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data have been playing a vital role in diverse real-world fields, including healthcare, finance, etc. With the recent success of Large Language Models (LLMs), early explorations of extending LLMs to the domain of tabular data have been developed. Most of these LLM-based methods typically first serialize tabular data into natural language descriptions, and then tune LLMs or directly infer on these serialized data. However, these methods suffer from two key inherent issues: (i) data perspective: existing data serialization methods lack universal applicability for structured tabular data, and may pose privacy risks through direct textual exposure, and (ii) model perspective: LLM fine-tuning methods struggle with tabular data, and in-context learning scalability is bottle-necked by input length constraints (suitable for few-shot learning). This work explores a novel direction of integrating LLMs into tabular data throughough logical decision tree rules as intermediaries, proposes a decision tree enhancer with LLM-derived rule for tabular prediction, DeLTa. The proposed DeLTa avoids tabular data serialization, and can be applied to full data learning setting without LLM fine-tuning. Specifically, we leverage the reasoning ability of LLMs to redesign an improved rule given a set of decision tree rules. Furthermore, we provide a calibration method for original decision trees via new generated rule by LLM, which approximates the error correction vector to steer the original decision tree predictions in the direction of ``errors’’ reducing. Finally, extensive experiments on diverse tabular benchmarks show that our method achieves state-of-the-art performance.

[LG-23] Evolving Machine Learning: A Survey

链接: https://arxiv.org/abs/2505.17902
作者: Ignacio Cabrera Martin,Subhaditya Mukherjee,Almas Baimagambetov,Joaquin Vanschoren,Nikolaos Polatidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In an era defined by rapid data evolution, traditional machine learning (ML) models often fall short in adapting to dynamic environments. Evolving Machine Learning (EML) has emerged as a critical paradigm, enabling continuous learning and adaptation in real-time data streams. This survey presents a comprehensive analysis of EML, focusing on five core challenges: data drift, concept drift, catastrophic forgetting, skewed learning, and network adaptation. We systematically review over 120 studies, categorizing state-of-the-art methods across supervised, unsupervised, and semi-supervised approaches. The survey explores diverse evaluation metrics, benchmark datasets, and real-world applications, offering a comparative lens on the effectiveness and limitations of current techniques. Additionally, we highlight the growing role of adaptive neural architectures, meta-learning, and ensemble strategies in addressing evolving data complexities. By synthesizing insights from recent literature, this work not only maps the current landscape of EML but also identifies critical gaps and opportunities for future research. Our findings aim to guide researchers and practitioners in developing robust, ethical, and scalable EML systems for real-world deployment.

[LG-24] Universal Domain Adaptation Benchmark for Time Series Data Representation

链接: https://arxiv.org/abs/2505.17899
作者: Romain Mussard,Fannia Pacheco,Maxime Berar,Gilles Gasso,Paul Honeine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models have significantly improved the ability to detect novelties in time series (TS) data. This success is attributed to their strong representation capabilities. However, due to the inherent variability in TS data, these models often struggle with generalization and robustness. To address this, a common approach is to perform Unsupervised Domain Adaptation, particularly Universal Domain Adaptation (UniDA), to handle domain shifts and emerging novel classes. While extensively studied in computer vision, UniDA remains underexplored for TS data. This work provides a comprehensive implementation and comparison of state-of-the-art TS backbones in a UniDA framework. We propose a reliable protocol to evaluate their robustness and generalization across different domains. The goal is to provide practitioners with a framework that can be easily extended to incorporate future advancements in UniDA and TS architectures. Our results highlight the critical influence of backbone selection in UniDA performance and enable a robustness analysis across various datasets and architectures.

[LG-25] Semi-Supervised Multi-Label Feature Selection with Consistent Sparse Graph Learning

链接: https://arxiv.org/abs/2505.17875
作者: Yan Zhong,Xingyu Wu,Xinping Zhao,Li Zhang,Xinyuan Song,Lei Shi,Bingbing Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In practical domains, high-dimensional data are usually associated with diverse semantic labels, whereas traditional feature selection methods are designed for single-label data. Moreover, existing multi-label methods encounter two main challenges in semi-supervised scenarios: (1). Most semi-supervised methods fail to evaluate the label correlations without enough labeled samples, which are the critical information of multi-label feature selection, making label-specific features discarded. (2). The similarity graph structure directly derived from the original feature space is suboptimal for multi-label problems in existing graph-based methods, leading to unreliable soft labels and degraded feature selection performance. To overcome them, we propose a consistent sparse graph learning method for multi-label semi-supervised feature selection (SGMFS), which can enhance the feature selection performance by maintaining space consistency and learning label correlations in semi-supervised scenarios. Specifically, for Challenge (1), SGMFS learns a low-dimensional and independent label subspace from the projected features, which can compatibly cross multiple labels and effectively achieve the label correlations. For Challenge (2), instead of constructing a fixed similarity graph for semi-supervised learning, SGMFS thoroughly explores the intrinsic structure of the data by performing sparse reconstruction of samples in both the label space and the learned subspace simultaneously. In this way, the similarity graph can be adaptively learned to maintain the consistency between label space and the learned subspace, which can promote propagating proper soft labels for unlabeled samples, facilitating the ultimate feature selection. An effective solution with fast convergence is designed to optimize the objective function. Extensive experiments validate the superiority of SGMFS.

[LG-26] BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models KDD2025

链接: https://arxiv.org/abs/2505.17871
作者: Zezhi Shao,Yujie Li,Fei Wang,Chengqing Yu,Yisong Fu,Tangwen Qian,Bin Xu,Boyu Diao,Yongjun Xu,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注: Accepted by SIGKDD 2025 (Research Track)

点击查看摘要

Abstract:The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.

[LG-27] Best Group Identification in Multi-Objective Bandits

链接: https://arxiv.org/abs/2505.17869
作者: Mohammad Shahverdikondori,Mohammad Reza Badri,Negar Kiyavash
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the Best Group Identification problem in a multi-objective multi-armed bandit setting, where an agent interacts with groups of arms with vector-valued rewards. The performance of a group is determined by an efficiency vector which represents the group’s best attainable rewards across different dimensions. The objective is to identify the set of optimal groups in the fixed-confidence setting. We investigate two key formulations: group Pareto set identification, where efficiency vectors of optimal groups are Pareto optimal and linear best group identification, where each reward dimension has a known weight and the optimal group maximizes the weighted sum of its efficiency vector’s entries. For both settings, we propose elimination-based algorithms, establish upper bounds on their sample complexity, and derive lower bounds that apply to any correct algorithm. Through numerical experiments, we demonstrate the strong empirical performance of the proposed algorithms.

[LG-28] SpectraLDS: Provable Distillation for Linear Dynamical Systems

链接: https://arxiv.org/abs/2505.17868
作者: Devan Shah,Shlomo Fortgang,Sofiia Druchyna,Elad Hazan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We present the first provable method for identifying symmetric linear dynamical systems (LDS) with accuracy guarantees that are independent of the systems’ state dimension or effective memory. Our approach builds upon recent work that represents symmetric LDSs as convolutions learnable via fixed spectral transformations. We show how to invert this representation, thereby recovering an LDS model from its spectral transform and yielding an end-to-end convex optimization procedure. This distillation preserves predictive accuracy while enabling constant-time and constant-space inference per token, independent of sequence length. We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling.

[LG-29] DesignX: Human-Competitive Algorithm Designer for Black-Box Optimization

链接: https://arxiv.org/abs/2505.17866
作者: Hongshu Guo,Zeyuan Ma,Yining Ma,Xinglin Zhang,Wei-Neng Chen,Yue-Jiao Gong
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Designing effective black-box optimizers is hampered by limited problem-specific knowledge and manual control that spans months for almost every detail. In this paper, we present DesignX, the first automated algorithm design framework that generates an effective optimizer specific to a given black-box optimization problem within seconds. Rooted in the first principles, we identify two key sub-tasks: 1) algorithm structure generation and 2) hyperparameter control. To enable systematic construction, a comprehensive modular algorithmic space is first built, embracing hundreds of algorithm components collected from decades of research. We then introduce a dual-agent reinforcement learning system that collaborates on structural and parametric design through a novel cooperative training objective, enabling large-scale meta-training across 10k diverse instances. Remarkably, through days of autonomous learning, the DesignX-generated optimizers continuously surpass human-crafted optimizers by orders of magnitude, either on synthetic testbed or on realistic optimization scenarios such as Protein-docking, AutoML and UAV path planning. Further in-depth analysis reveals DesignX’s capability to discover non-trivial algorithm patterns beyond expert intuition, which, conversely, provides valuable design insights for the optimization community. We provide DesignX’s inference code at this https URL.

[LG-30] he emergence of sparse attention: impact of data distribution and benefits of repetition

链接: https://arxiv.org/abs/2505.17863
作者: Nicolas Zucchet,Francesco d’Angelo,Andrew K. Lampinen,Stephanie C.Y. Chan
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.

[LG-31] Out of the Shadows: Exploring a Latent Space for Neural Network Verification

链接: https://arxiv.org/abs/2505.17854
作者: Lukas Koller,Tobias Ladner,Matthias Althoff
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks are ubiquitous. However, they are often sensitive to small input changes. Hence, to prevent unexpected behavior in safety-critical applications, their formal verification – a notoriously hard problem – is necessary. Many state-of-the-art verification algorithms use reachability analysis or abstract interpretation to enclose the set of possible outputs of a neural network. Often, the verification is inconclusive due to the conservatism of the enclosure. To address this problem, we design a novel latent space for formal verification that enables the transfer of output specifications to the input space for an iterative specification-driven input refinement, i.e., we iteratively reduce the set of possible inputs to only enclose the unsafe ones. The latent space is constructed from a novel view of projection-based set representations, e.g., zonotopes, which are commonly used in reachability analysis of neural networks. A projection-based set representation is a “shadow” of a higher-dimensional set – a latent space – that does not change during a set propagation through a neural network. Hence, the input set and the output enclosure are “shadows” of the same latent space that we can use to transfer constraints. We present an efficient verification tool for neural networks that uses our iterative refinement to significantly reduce the number of subproblems in a branch-and-bound procedure. Using zonotopes as a set representation, unlike many other state-of-the-art approaches, our approach can be realized by only using matrix operations, which enables a significant speed-up through efficient GPU acceleration. We demonstrate that our tool achieves competitive performance, which would place it among the top-ranking tools of the last neural network verification competition (VNN-COMP’24).

[LG-32] VIBE: Vector Index Benchmark for Embeddings

链接: https://arxiv.org/abs/2505.17810
作者: Elias Jääsaari,Ville Hyvönen,Matteo Ceccarello,Teemu Roos,Martin Aumüller
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 25 pages

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.

[LG-33] Latent Mode Decomposition

链接: https://arxiv.org/abs/2505.17797
作者: Manuel Morante,Naveed ur Rehman
类目: Machine Learning (cs.LG)
*备注: 12 pages, 9 figures, 1 table

点击查看摘要

Abstract:We introduce Variational Latent Mode Decomposition (VLMD), a new algorithm for extracting oscillatory modes and associated connectivity structures from multivariate signals. VLMD addresses key limitations of existing Multivariate Mode Decomposition (MMD) techniques -including high computational cost, sensitivity to parameter choices, and weak modeling of interchannel dependencies. Its improved performance is driven by a novel underlying model, Latent Mode Decomposition (LMD), which blends sparse coding and mode decomposition to represent multichannel signals as sparse linear combinations of shared latent components composed of AM-FM oscillatory modes. This formulation enables VLMD to operate in a lower-dimensional latent space, enhancing robustness to noise, scalability, and interpretability. The algorithm solves a constrained variational optimization problem that jointly enforces reconstruction fidelity, sparsity, and frequency regularization. Experiments on synthetic and real-world datasets demonstrate that VLMD outperforms state-of-the-art MMD methods in accuracy, efficiency, and interpretability of extracted structures.

[LG-34] RECIPE-TKG: From Sparse History to Structured Reasoning for LLM -based Temporal Knowledge Graph Completion

链接: https://arxiv.org/abs/2505.17794
作者: Ömer Faruk Akgül,Feiyu Zhu,Yuxin Yang,Rajgopal Kannan,Viktor Prasanna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal Knowledge Graphs (TKGs) represent dynamic facts as timestamped relations between entities. TKG completion involves forecasting missing or future links, requiring models to reason over time-evolving structure. While LLMs show promise for this task, existing approaches often overemphasize supervised fine-tuning and struggle particularly when historical evidence is limited or missing. We introduce RECIPE-TKG, a lightweight and data-efficient framework designed to improve accuracy and generalization in settings with sparse historical context. It combines (1) rule-based multi-hop retrieval for structurally diverse history, (2) contrastive fine-tuning of lightweight adapters to encode relational semantics, and (3) test-time semantic filtering to iteratively refine generations based on embedding similarity. Experiments on four TKG benchmarks show that RECIPE-TKG outperforms previous LLM-based approaches, achieving up to 30.6% relative improvement in Hits@10. Moreover, our proposed framework produces more semantically coherent predictions, even for the samples with limited historical context.

[LG-35] Supervised Graph Contrastive Learning for Gene Regulatory Network

链接: https://arxiv.org/abs/2505.17786
作者: Sho Oshima,Yuji Okamoto,Taisei Tosaki,Ryosuke Kojima,Yasushi Okuno
类目: Machine Learning (cs.LG)
*备注: under review

点击查看摘要

Abstract:Graph representation learning is effective for obtaining a meaningful latent space utilizing the structure of graph data and is widely applied, including biological networks. In particular, Graph Contrastive Learning (GCL) has emerged as a powerful self-supervised method that relies on applying perturbations to graphs for data augmentation. However, when applying existing GCL methods to biological networks such as Gene Regulatory Networks (GRNs), they overlooked meaningful biologically relevant perturbations, e.g., gene knockdowns. In this study, we introduce SupGCL (Supervised Graph Contrastive Learning), a novel GCL method for GRNs that directly incorporates biological perturbations derived from gene knockdown experiments as the supervision. SupGCL mathematically extends existing GCL methods that utilize non-biological perturbations to probabilistic models that introduce actual biological gene perturbation utilizing gene knockdown data. Using the GRN representation obtained by our proposed method, our aim is to improve the performance of biological downstream tasks such as patient hazard prediction and disease subtype classification (graph-level task), and gene function classification (node-level task). We applied SupGCL on real GRN datasets derived from patients with multiple types of cancer, and in all experiments SupGCL achieves better performance than state-of-the-art baselines.

[LG-36] Optimizing Shortfall Risk Metric for Learning Regression Models

链接: https://arxiv.org/abs/2505.17777
作者: Harish G. Ramaswamy,L.A. Prashanth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of estimating and optimizing utility-based shortfall risk (UBSR) of a loss, say (Y - \hat Y)^2 , in the context of a regression problem. Empirical risk minimization with a UBSR objective is challenging since UBSR is a non-linear function of the underlying distribution. We first derive a concentration bound for UBSR estimation using independent and identically distributed (i.i.d.) samples. We then frame the UBSR optimization problem as minimization of a pseudo-linear function in the space of achievable distributions \mathcal D of the loss (Y- \hat Y)^2 . We construct a gradient oracle for the UBSR objective and a linear minimization oracle (LMO) for the set \mathcal D . Using these oracles, we devise a bisection-type algorithm, and establish convergence to the UBSR-optimal solution.

[LG-37] C-LoRA: Contextual Low-Rank Adaptation for Uncertainty Estimation in Large Language Models

链接: https://arxiv.org/abs/2505.17773
作者: Amir Hossein Rahmati,Sanket Jantre,Weifeng Zhang,Yucheng Wang,Byung-Jun Yoon,Nathan M. Urban,Xiaoning Qian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) offers a cost-effective solution for fine-tuning large language models (LLMs), but it often produces overconfident predictions in data-scarce few-shot settings. To address this issue, several classical statistical learning approaches have been repurposed for scalable uncertainty-aware LoRA fine-tuning. However, these approaches neglect how input characteristics affect the predictive uncertainty estimates. To address this limitation, we propose Contextual Low-Rank Adaptation (\textbfC-LoRA) as a novel uncertainty-aware and parameter efficient fine-tuning approach, by developing new lightweight LoRA modules contextualized to each input data sample to dynamically adapt uncertainty estimates. Incorporating data-driven contexts into the parameter posteriors, C-LoRA mitigates overfitting, achieves well-calibrated uncertainties, and yields robust predictions. Extensive experiments demonstrate that C-LoRA consistently outperforms the state-of-the-art uncertainty-aware LoRA methods in both uncertainty quantification and model generalization. Ablation studies further confirm the critical role of our contextual modules in capturing sample-specific uncertainties. C-LoRA sets a new standard for robust, uncertainty-aware LLM fine-tuning in few-shot regimes.

[LG-38] Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models ICML2025

链接: https://arxiv.org/abs/2505.17769
作者: Patrick Leask,Neel Nanda,Noura Al Moubayed
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Sparse autoencoders (SAEs) are a popular method for decomposing Large Langage Models (LLM) activations into interpretable latents. However, due to their substantial training cost, most academic research uses open-source SAEs which are only available for a restricted set of models of up to 27B parameters. SAE latents are also learned from a dataset of activations, which means they do not transfer between models. Motivated by relative representation similarity measures, we introduce Inference-Time Decomposition of Activations (ITDA) models, an alternative method for decomposing language model activations. To train an ITDA, we greedily construct a dictionary of language model activations on a dataset of prompts, selecting those activations which were worst approximated by matching pursuit on the existing dictionary. ITDAs can be trained in just 1% of the time required for SAEs, using 1% of the data. This allowed us to train ITDAs on Llama-3.1 70B and 405B on a single consumer GPU. ITDAs can achieve similar reconstruction performance to SAEs on some target LLMs, but generally incur a performance penalty. However, ITDA dictionaries enable cross-model comparisons, and a simple Jaccard similarity index on ITDA dictionaries outperforms existing methods like CKA, SVCCA, and relative representation similarity metrics. ITDAs provide a cheap alternative to SAEs where computational resources are limited, or when cross model comparisons are necessary. Code available at this https URL.

[LG-39] Joker: Joint Optimization Framework for Lightweight Kernel Machines ICML2025

链接: https://arxiv.org/abs/2505.17765
作者: Junhong Zhang,Zhihui Lai
类目: Machine Learning (cs.LG)
*备注: 24 pages, 5 figures, accepted by ICML 2025

点击查看摘要

Abstract:Kernel methods are powerful tools for nonlinear learning with well-established theory. The scalability issue has been their long-standing challenge. Despite the existing success, there are two limitations in large-scale kernel methods: (i) The memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models lack study. In this paper, we propose Joker, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that Joker saves up to 90% memory but achieves comparable training time and performance (or even better) than the state-of-the-art methods.

[LG-40] Unsupervised Clustering for Fault Analysis in High-Voltage Power Systems Using Voltage and Current Signals

链接: https://arxiv.org/abs/2505.17763
作者: Julian Oelhaf,Georg Kordowich,Andreas Maier,Johann Jager,Siming Bayer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages

点击查看摘要

Abstract:The widespread use of sensors in modern power grids has led to the accumulation of large amounts of voltage and current waveform data, especially during fault events. However, the lack of labeled datasets poses a significant challenge for fault classification and analysis. This paper explores the application of unsupervised clustering techniques for fault diagnosis in high-voltage power systems. A dataset provided by the Reseau de Transport d’Electricite (RTE) is analyzed, with frequency domain features extracted using the Fast Fourier Transform (FFT). The K-Means algorithm is then applied to identify underlying patterns in the data, enabling automated fault categorization without the need for labeled training samples. The resulting clusters are evaluated in collaboration with power system experts to assess their alignment with real-world fault characteristics. The results demonstrate the potential of unsupervised learning for scalable and data-driven fault analysis, providing a robust approach to detecting and classifying power system faults with minimal prior assumptions.

[LG-41] Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models

链接: https://arxiv.org/abs/2505.17761
作者: Benjamin Walker,Lingyi Yang,Nicola Muca Cirone,Cristopher Salvi,Terry Lyons
类目: Machine Learning (cs.LG)
*备注: 26 pages, 5 figures

点击查看摘要

Abstract:Structured Linear Controlled Differential Equations (SLiCEs) provide a unifying framework for sequence models with structured, input-dependent state-transition matrices that retain the maximal expressivity of dense matrices whilst being cheaper to compute. The framework encompasses existing architectures, such as input-dependent block-diagonal linear recurrent neural networks and DeltaNet’s diagonal-plus-low-rank structure, as well as two novel variants based on sparsity and the Walsh–Hadamard transform. We prove that, unlike the diagonal state-transition matrices of S4 and Mamba, SLiCEs employing block-diagonal, sparse, or Walsh–Hadamard matrices match the maximal expressivity of dense matrices. Empirically, SLiCEs solve the A_5 state-tracking benchmark with a single layer, achieve best-in-class length generalisation on regular language tasks among parallel-in-time models, and match the state-of-the-art performance of log neural controlled differential equations on six multivariate time-series classification datasets while cutting the average time per training step by a factor of twenty.

[LG-42] Discrete Neural Flow Samplers with Locally Equivariant Transformer

链接: https://arxiv.org/abs/2505.17741
作者: Zijing Ou,Ruixiang Zhang,Yingzhen Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sampling from unnormalised discrete distributions is a fundamental problem across various domains. While Markov chain Monte Carlo offers a principled approach, it often suffers from slow mixing and poor convergence. In this paper, we propose Discrete Neural Flow Samplers (DNFS), a trainable and efficient framework for discrete sampling. DNFS learns the rate matrix of a continuous-time Markov chain such that the resulting dynamics satisfy the Kolmogorov equation. As this objective involves the intractable partition function, we then employ control variates to reduce the variance of its Monte Carlo estimation, leading to a coordinate descent learning algorithm. To further facilitate computational efficiency, we propose locally equivaraint Transformer, a novel parameterisation of the rate matrix that significantly improves training efficiency while preserving powerful network expressiveness. Empirically, we demonstrate the efficacy of DNFS in a wide range of applications, including sampling from unnormalised distributions, training discrete energy-based models, and solving combinatorial optimisation problems.

[LG-43] A tensor network approach for chaotic time series prediction

链接: https://arxiv.org/abs/2505.17740
作者: Rodrigo Martínez-Peña,Román Orús
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
*备注: 12 pages, 3 figures. Comments are welcome!

点击查看摘要

Abstract:Making accurate predictions of chaotic time series is a complex challenge. Reservoir computing, a neuromorphic-inspired approach, has emerged as a powerful tool for this task. It exploits the memory and nonlinearity of dynamical systems without requiring extensive parameter tuning. However, selecting and optimizing reservoir architectures remains an open problem. Next-generation reservoir computing simplifies this problem by employing nonlinear vector autoregression based on truncated Volterra series, thereby reducing hyperparameter complexity. Nevertheless, the latter suffers from exponential parameter growth in terms of the maximum monomial degree. Tensor networks offer a promising solution to this issue by decomposing multidimensional arrays into low-dimensional structures, thus mitigating the curse of dimensionality. This paper explores the application of a previously proposed tensor network model for predicting chaotic time series, demonstrating its advantages in terms of accuracy and computational efficiency compared to conventional echo state networks. Using a state-of-the-art tensor network approach enables us to bridge the gap between the tensor network and reservoir computing communities, fostering advances in both fields.

[LG-44] URB – Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles

链接: https://arxiv.org/abs/2505.17734
作者: Ahmet Onur Akman,Anastasia Psarou,Michał Hoffmann,Łukasz Gorczyca,Łukasz Kowalski,Paweł Gora,Grzegorz Jamróz,Rafał Kucharski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed by machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present \our: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. \our is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. \our comes with a catalog of predefined tasks, four state-of-the-art multi-agent RL (MARL) algorithm implementations, three baseline methods, domain-specific performance metrics, and a modular configuration scheme. Our results suggest that, despite the lengthy and costly training, state-of-the-art MARL algorithms rarely outperformed humans. Experimental results reported in this paper initiate the first leaderboard for MARL in large-scale urban routing optimization and reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.

[LG-45] Redirection for Erasing Memory (REM): Towards a universal unlearning method for corrupted data

链接: https://arxiv.org/abs/2505.17730
作者: Stefan Schoepf,Michael Curtis Mozer,Nicole Elyse Mitchell,Alexandra Brintrup,Georgios Kaissis,Peter Kairouz,Eleni Triantafillou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning is studied for a multitude of tasks, but specialization of unlearning methods to particular tasks has made their systematic comparison challenging. To address this issue, we propose a conceptual space to characterize diverse corrupted data unlearning tasks in vision classifiers. This space is described by two dimensions, the discovery rate (the fraction of the corrupted data that are known at unlearning time) and the statistical regularity of the corrupted data (from random exemplars to shared concepts). Methods proposed previously have been targeted at portions of this space and-we show-fail predictably outside these regions. We propose a novel method, Redirection for Erasing Memory (REM), whose key feature is that corrupted data are redirected to dedicated neurons introduced at unlearning time and then discarded or deactivated to suppress the influence of corrupted data. REM performs strongly across the space of tasks, in contrast to prior SOTA methods that fail outside the regions for which they were designed.

[LG-46] PEAR: Equal Area Weather Forecasting on the Sphere

链接: https://arxiv.org/abs/2505.17720
作者: Hampus Linander,Christoffer Petersson,Daniel Persson,Jan E. Gerken
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Machine learning methods for global medium-range weather forecasting have recently received immense attention. Following the publication of the Pangu Weather model, the first deep learning model to outperform traditional numerical simulations of the atmosphere, numerous models have been published in this domain, building on Pangu’s success. However, all of these models operate on input data and produce predictions on the Driscoll–Healy discretization of the sphere which suffers from a much finer grid at the poles than around the equator. In contrast, in the Hierarchical Equal Area iso-Latitude Pixelization (HEALPix) of the sphere, each pixel covers the same surface area, removing unphysical biases. Motivated by a growing support for this grid in meteorology and climate sciences, we propose to perform weather forecasting with deep learning models which natively operate on the HEALPix grid. To this end, we introduce Pangu Equal ARea (PEAR), a transformer-based weather forecasting model which operates directly on HEALPix-features and outperforms the corresponding model on Driscoll–Healy without any computational overhead.

[LG-47] Get Experience from Practice: LLM Agents with Record Replay

链接: https://arxiv.org/abs/2505.17716
作者: Erhu Feng,Wenbo Zhou,Zibin Liu,Le Chen,Yunpeng Dong,Cheng Zhang,Yisheng Zhao,Dong Du,Zhichao Hua,Yubin Xia,Haibo Chen
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:AI agents, empowered by Large Language Models (LLMs) and communication protocols such as MCP and A2A, have rapidly evolved from simple chatbots to autonomous entities capable of executing complex, multi-step tasks, demonstrating great potential. However, the LLMs’ inherent uncertainty and heavy computational resource requirements pose four significant challenges to the development of safe and efficient agents: reliability, privacy, cost and performance. Existing approaches, like model alignment, workflow constraints and on-device model deployment, can partially alleviate some issues but often with limitations, failing to fundamentally resolve these challenges. This paper proposes a new paradigm called AgentRR (Agent Record Replay), which introduces the classical record-and-replay mechanism into AI agent frameworks. The core idea is to: 1. Record an agent’s interaction trace with its environment and internal decision process during task execution, 2. Summarize this trace into a structured “experience” encapsulating the workflow and constraints, and 3. Replay these experiences in subsequent similar tasks to guide the agent’s behavior. We detail a multi-level experience abstraction method and a check function mechanism in AgentRR: the former balances experience specificity and generality, while the latter serves as a trust anchor to ensure completeness and safety during replay. In addition, we explore multiple application modes of AgentRR, including user-recorded task demonstration, large-small model collaboration and privacy-aware agent execution, and envision an experience repository for sharing and reusing knowledge to further reduce deployment cost. Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA) Cite as: arXiv:2505.17716 [cs.LG] (or arXiv:2505.17716v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.17716 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-48] he Third Pillar of Causal Analysis? A Measurement Perspective on Causal Representations

链接: https://arxiv.org/abs/2505.17708
作者: Dingling Yao,Shimeng Huang,Riccardo Cadei,Kun Zhang,Francesco Locatello
类目: Machine Learning (cs.LG)
*备注: 22 pages, 12 figures, 2 tables

点击查看摘要

Abstract:Causal reasoning and discovery, two fundamental tasks of causal analysis, often face challenges in applications due to the complexity, noisiness, and high-dimensionality of real-world data. Despite recent progress in identifying latent causal structures using causal representation learning (CRL), what makes learned representations useful for causal downstream tasks and how to evaluate them are still not well understood. In this paper, we reinterpret CRL using a measurement model framework, where the learned representations are viewed as proxy measurements of the latent causal variables. Our approach clarifies the conditions under which learned representations support downstream causal reasoning and provides a principled basis for quantitatively assessing the quality of representations using a new Test-based Measurement EXclusivity (T-MEX) score. We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.

[LG-49] Gradient-Based Program Repair: Fixing Bugs in Continuous Program Spaces

链接: https://arxiv.org/abs/2505.17703
作者: André Silva,Gustav Thorén,Martin Monperrus
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Automatic program repair seeks to generate correct code from buggy programs, with most approaches searching the correct program in a discrete, symbolic space of source code tokens. This symbolic search is fundamentally limited by its inability to directly reason about program behavior. We introduce Gradient-Based Program Repair (GBPR), a new paradigm that reframes program repair as continuous optimization in a differentiable numerical program space. Our core insight is to compile symbolic programs into differentiable numerical representations, enabling search in the numerical program space directly guided by program behavior. To evaluate GBPR, we present RaspBugs, a new benchmark of 1,466 buggy symbolic RASP programs and their respective numerical representations. Our experiments demonstrate that GBPR can effectively repair buggy symbolic programs by gradient-based optimization in the numerical program space, with convincing repair trajectories. To our knowledge, we are the first to state program repair as continuous optimization in a numerical program space. Our work establishes a new direction for program repair research, bridging two rich worlds: continuous optimization and program behavior.

[LG-50] FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding

链接: https://arxiv.org/abs/2505.17694
作者: Zhibin Wang,Rui Ning,Chao Fang,Zhonghui Zhang,Xi Lin,Shaobo Ma,Mo Zhou,Xue Li,Zhongfeng Wang,Chengying Huan,Rong Gu,Kun Yang,Guihai Chen,Sheng Zhong,Chen Tian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix, while attention computation in the decode stage, which becomes a critical bottleneck with increasing context lengths, is a memory-intensive process requiring heavy memory access on the key-value (KV) cache of the prefixes. Therefore, in this paper, we explore the potential of prefix-sharing in the attention computation of the decode stage. However, the tree structure of the prefix-sharing mechanism presents significant challenges for attention computation in efficiently processing shared KV cache access patterns while managing complex dependencies and balancing irregular workloads. To address the above challenges, we propose a dedicated attention kernel to combine the memory access of shared prefixes in the decoding stage, namely FlashForge. FlashForge delivers two key innovations: a novel shared-prefix attention kernel that optimizes memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that FlashForge achieves an average 1.9x speedup and 120.9x memory access reduction compared to the state-of-the-art FlashDecoding kernel regarding attention computation in the decode stage and 3.8x end-to-end time per output token compared to the vLLM.

[LG-51] What is the role of memorization in Continual Learning?

链接: https://arxiv.org/abs/2505.17664
作者: Jędrzej Kozal,Jan Wasilewski,Alif Ashrafee,Bartosz Krawczyk,Michał Woźniak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Memorization impacts the performance of deep learning algorithms. Prior works have studied memorization primarily in the context of generalization and privacy. This work studies the memorization effect on incremental learning scenarios. Forgetting prevention and memorization seem similar. However, one should discuss their differences. We designed extensive experiments to evaluate the impact of memorization on continual learning. We clarified that learning examples with high memorization scores are forgotten faster than regular samples. Our findings also indicated that memorization is necessary to achieve the highest performance. However, at low memory regimes, forgetting regular samples is more important. We showed that the importance of a high-memorization score sample rises with an increase in the buffer size. We introduced a memorization proxy and employed it in the buffer policy problem to showcase how memorization could be used during incremental training. We demonstrated that including samples with a higher proxy memorization score is beneficial when the buffer size is large.

[LG-52] Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAs

链接: https://arxiv.org/abs/2505.17662
作者: Tianheng Ling,Chao Qian,Lukas Johannes Haßler,Gregor Schiele
类目: Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, 1 table, accepted by IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2025)

点击查看摘要

Abstract:Transformer-based models have shown strong performance across diverse time-series tasks, but their deployment on resource-constrained devices remains challenging due to high memory and computational demand. While prior work targeting Microcontroller Units (MCUs) has explored hardware-specific optimizations, such approaches are often task-specific and limited to 8-bit fixed-point precision. Field-Programmable Gate Arrays (FPGAs) offer greater flexibility, enabling fine-grained control over data precision and architecture. However, existing FPGA-based deployments of Transformers for time-series analysis typically focus on high-density platforms with manual configuration. This paper presents a unified and fully automated deployment framework for Tiny Transformers on embedded FPGAs. Our framework supports a compact encoder-only Transformer architecture across three representative time-series tasks (forecasting, classification, and anomaly detection). It combines quantization-aware training (down to 4 bits), hardware-aware hyperparameter search using Optuna, and automatic VHDL generation for seamless deployment. We evaluate our framework on six public datasets across two embedded FPGA platforms. Results show that our framework produces integer-only, task-specific Transformer accelerators achieving as low as 0.033 mJ per inference with millisecond latency on AMD Spartan-7, while also providing insights into deployment feasibility on Lattice iCE40. All source code will be released in the GitHub repository (this https URL).

[LG-53] Automated scientific minimization of regret

链接: https://arxiv.org/abs/2505.17661
作者: Marcel Binz,Akshay K. Jagadish,Milena Rmus,Eric Schulz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce automated scientific minimization of regret (ASMR) – a framework for automated computational cognitive science. Building on the principles of scientific regret minimization, ASMR leverages Centaur – a recently proposed foundation model of human cognition – to identify gaps in an interpretable cognitive model. These gaps are then addressed through automated revisions generated by a language-based reasoning model. We demonstrate the utility of this approach in a multi-attribute decision-making task, showing that ASMR discovers cognitive models that predict human behavior at noise ceiling while retaining interpretability. Taken together, our results highlight the potential of ASMR to automate core components of the cognitive modeling pipeline.

[LG-54] DAM-GT: Dual Positional Encoding-Based Attention Masking Graph Transformer for Node Classification

链接: https://arxiv.org/abs/2505.17660
作者: Chenyang Li,Jinsong Chen,John E. Hopcroft,Kun He
类目: Machine Learning (cs.LG)
*备注: Preprint version

点击查看摘要

Abstract:Neighborhood-aware tokenized graph Transformers have recently shown great potential for node classification tasks. Despite their effectiveness, our in-depth analysis of neighborhood tokens reveals two critical limitations in the existing paradigm. First, current neighborhood token generation methods fail to adequately capture attribute correlations within a neighborhood. Second, the conventional self-attention mechanism suffers from attention diversion when processing neighborhood tokens, where high-hop neighborhoods receive disproportionate focus, severely disrupting information interactions between the target node and its neighborhood tokens. To address these challenges, we propose DAM-GT, Dual positional encoding-based Attention Masking graph Transformer. DAM-GT introduces a novel dual positional encoding scheme that incorporates attribute-aware encoding via an attribute clustering strategy, effectively preserving node correlations in both topological and attribute spaces. In addition, DAM-GT formulates a new attention mechanism with a simple yet effective masking strategy to guide interactions between target nodes and their neighborhood tokens, overcoming the issue of attention diversion. Extensive experiments on various graphs with different homophily levels as well as different scales demonstrate that DAM-GT consistently outperforms state-of-the-art methods in node classification tasks.

[LG-55] Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives

链接: https://arxiv.org/abs/2505.17646
作者: Huanran Chen,Yinpeng Dong,Zeming Wei,Yao Huang,Yichi Zhang,Hang Su,Jun Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies have revealed that the loss landscape of large language models resembles a basin, within which the models perform nearly identically, and outside of which they lose all their capabilities. In this work, we conduct further studies on the loss landscape of large language models. We discover that pre-training creates a “basic capability” basin, and subsequent fine-tuning creates “specific capability” basins (e.g., math, safety, coding) within the basic capability basin. We further investigate two types of loss landscapes: the most-case landscape (i.e., the landscape along most directions) and the worst-case landscape (i.e., the landscape along the worst direction). We argue that as long as benign fine-tuning remains within the most-case basin, it will not compromise previous capabilities. Similarly, any fine-tuning (including the adversarial one) that stays within the worst-case basin would not compromise previous capabilities. Finally, we theoretically demonstrate that the size of the most-case basin can bound the size of the worst-case basin and the robustness with respect to input perturbations. We also show that, due to the over-parameterization property of current large language models, one can easily enlarge the basins by five times.

[LG-56] A Network Science Approach to Granular Time Series Segmentation

链接: https://arxiv.org/abs/2505.17640
作者: Ivana Kesić,Carolina Fortuna,Mihael Mohorčič,Blaž Bertalanič
类目: Machine Learning (cs.LG)
*备注: 24 pages, 10 figures

点击查看摘要

Abstract:Time series segmentation (TSS) is one of the time series (TS) analysis techniques, that has received considerably less attention compared to other TS related tasks. In recent years, deep learning architectures have been introduced for TSS, however their reliance on sliding windows limits segmentation granularity due to fixed window sizes and strides. To overcome these challenges, we propose a new more granular TSS approach that utilizes the Weighted Dual Perspective Visbility Graph (WDPVG) TS into a graph and combines it with a Graph Attention Network (GAT). By transforming TS into graphs, we are able to capture different structural aspects of the data that would otherwise remain hidden. By utilizing the representation learning capabilities of Graph Neural Networks, our method is able to effectively identify meaningful segments within the TS. To better understand the potential of our approach, we also experimented with different TS-to-graph transformations and compared their performance. Our contributions include: a) formulating the TSS as a node classification problem on graphs; b) conducting an extensive analysis of various TS- to-graph transformations applied to TSS using benchmark datasets from the TSSB repository; c) providing the first detailed study on utilizing GNNs for analyzing graph representations of TS in the context of TSS; d) demonstrating the effectiveness of our method, which achieves an average F1 score of 0.97 across 59 diverse TSS benchmark datasets; e) outperforming the seq2point baseline method by 0.05 in terms of F1 score; and f) reducing the required training data compared to the baseline methods.

[LG-57] PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

链接: https://arxiv.org/abs/2505.17639
作者: Zehua Pei,Ying Zhang,Hui-Ling Zhen,Xianzhi Yu,Wulong Liu,Sinno Jialin Pan,Mingxuan Yuan,Bei Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at this https URL.

[LG-58] Why Diffusion Models Dont Memorize: The Role of Implicit Dynamical Regularization in Training

链接: https://arxiv.org/abs/2505.17638
作者: Tony Bonnaire,Raphaël Urfin,Giulio Biroli,Marc Mézard
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注: 36 pages, 15 figures

点击查看摘要

Abstract:Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time \tau_\mathrmgen at which models begin to generate high-quality samples, and a later time \tau_\mathrmmem beyond which memorization emerges. Crucially, we find that \tau_\mathrmmem increases linearly with the training set size n , while \tau_\mathrmgen remains constant. This creates a growing window of training times with n where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when n becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.

[LG-59] Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach

链接: https://arxiv.org/abs/2505.17637
作者: Yuting Huang,Ziquan Fang,Zhihao Zeng,Lu Chen,Yunjun Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E^2-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E^2-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E^2-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.

[LG-60] Leverag ing Stochastic Depth Training for Adaptive Inference

链接: https://arxiv.org/abs/2505.17626
作者: Guilherme Korol,Antonio Carlos Schneider Beck,Jeronimo Castrillon
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Dynamic DNN optimization techniques such as layer-skipping offer increased adaptability and efficiency gains but can lead to i) a larger memory footprint as in decision gates, ii) increased training complexity (e.g., with non-differentiable operations), and iii) less control over performance-quality trade-offs due to its inherent input-dependent execution. To approach these issues, we propose a simpler yet effective alternative for adaptive inference with a zero-overhead, single-model, and time-predictable inference. Central to our approach is the observation that models trained with Stochastic Depth – a method for faster training of residual networks – become more resilient to arbitrary layer-skipping at inference time. We propose a method to first select near Pareto-optimal skipping configurations from a stochastically-trained model to adapt the inference at runtime later. Compared to original ResNets, our method shows improvements of up to 2X in power efficiency at accuracy drops as low as 0.71%.

[LG-61] Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

链接: https://arxiv.org/abs/2505.17621
作者: Jingtong Gao,Ling Pan,Yejing Wang,Rui Zhong,Chi Lu,Qingpeng Cai,Peng Jiang,Xiangyu Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Regularized Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for multi-step reasoning processes. Specifically, sparse reward signals fail to deliver effective or sufficient feedback, particularly for challenging problems. Furthermore, such reward structures induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across ipntermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a novel method designed to both deliver dense rewards and amplify explorations in the RL-based training paradigm. i-MENTOR introduces three key innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; dynamic reward scaling to stabilize exploration and exploitation in large action spaces; and advantage-preserving reward implementation that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across three public datasets demonstrate i-MENTOR’s effectiveness with a 22.39% improvement on the difficult dataset Countdown-4.

[LG-62] Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning

链接: https://arxiv.org/abs/2505.17610
作者: Till Freihaut,Luca Viano,Volkan Cevher,Matthieu Geist,Giorgia Ramponi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper provides the first expert sample complexity characterization for learning a Nash equilibrium from expert data in Markov Games. We show that a new quantity named the single policy deviation concentrability coefficient is unavoidable in the non-interactive imitation learning setting, and we provide an upper bound for behavioral cloning (BC) featuring such coefficient. BC exhibits substantial regret in games with high concentrability coefficient, leading us to utilize expert queries to develop and introduce two novel solution algorithms: MAIL-BRO and MURMAIL. The former employs a best response oracle and learns an \varepsilon -Nash equilibrium with \mathcalO(\varepsilon^-4) expert and oracle queries. The latter bypasses completely the best response oracle at the cost of a worse expert query complexity of order \mathcalO(\varepsilon^-8) . Finally, we provide numerical evidence, confirming our theoretical findings.

[LG-63] Adaptive Semantic Token Communication for Transformer-based Edge Inference

链接: https://arxiv.org/abs/2505.17604
作者: Alessio Devoto,Jary Pomponi,Mattia Merluzzi,Paolo Di Lorenzo,Simone Scardapane
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:This paper presents an adaptive framework for edge inference based on a dynamically configurable transformer-powered deep joint source channel coding (DJSCC) architecture. Motivated by a practical scenario where a resource constrained edge device engages in goal oriented semantic communication, such as selectively transmitting essential features for object detection to an edge server, our approach enables efficient task aware data transmission under varying bandwidth and channel conditions. To achieve this, input data is tokenized into compact high level semantic representations, refined by a transformer, and transmitted over noisy wireless channels. As part of the DJSCC pipeline, we employ a semantic token selection mechanism that adaptively compresses informative features into a user specified number of tokens per sample. These tokens are then further compressed through the JSCC module, enabling a flexible token communication strategy that adjusts both the number of transmitted tokens and their embedding dimensions. We incorporate a resource allocation algorithm based on Lyapunov stochastic optimization to enhance robustness under dynamic network conditions, effectively balancing compression efficiency and task performance. Experimental results demonstrate that our system consistently outperforms existing baselines, highlighting its potential as a strong foundation for AI native semantic communication in edge intelligence applications.

[LG-64] Dynamic Text Bundling Supervision for Zero-Shot Inference on Text-Attributed Graphs

链接: https://arxiv.org/abs/2505.17599
作者: Yusheng Zhao,Qixin Zhang,Xiao Luo,Weizhi Zhang,Zhiping Xiao,Wei Ju,Philip S. Yu,Ming Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been used in many zero-shot learning problems, with their strong generalization ability. Recently, adopting LLMs in text-attributed graphs (TAGs) has drawn increasing attention. However, the adoption of LLMs faces two major challenges: limited information on graph structure and unreliable responses. LLMs struggle with text attributes isolated from the graph topology. Worse still, they yield unreliable predictions due to both information insufficiency and the inherent weakness of LLMs (e.g., hallucination). Towards this end, this paper proposes a novel method named Dynamic Text Bundling Supervision (DENSE) that queries LLMs with bundles of texts to obtain bundle-level labels and uses these labels to supervise graph neural networks. Specifically, we sample a set of bundles, each containing a set of nodes with corresponding texts of close proximity. We then query LLMs with the bundled texts to obtain the label of each bundle. Subsequently, the bundle labels are used to supervise the optimization of graph neural networks, and the bundles are further refined to exclude noisy items. To justify our design, we also provide theoretical analysis of the proposed method. Extensive experiments across ten datasets validate the effectiveness of the proposed method.

[LG-65] Ownership Verification of DNN Models Using White-Box Adversarial Attacks with Specified Probability Manipulation

链接: https://arxiv.org/abs/2505.17579
作者: Teruki Sano,Minoru Kuribayashi,Masao Sakai,Shuji Ishobe,Eisuke Koizumi
类目: Machine Learning (cs.LG)
*备注: Accepted to EUSIPCO 2025

点击查看摘要

Abstract:In this paper, we propose a novel framework for ownership verification of deep neural network (DNN) models for image classification tasks. It allows verification of model identity by both the rightful owner and third party without presenting the original model. We assume a gray-box scenario where an unauthorized user owns a model that is illegally copied from the original model, provides services in a cloud environment, and the user throws images and receives the classification results as a probability distribution of output classes. The framework applies a white-box adversarial attack to align the output probability of a specific class to a designated value. Due to the knowledge of original model, it enables the owner to generate such adversarial examples. We propose a simple but effective adversarial attack method based on the iterative Fast Gradient Sign Method (FGSM) by introducing control parameters. Experimental results confirm the effectiveness of the identification of DNN models using adversarial attack.

[LG-66] Multiphysics Bench: Benchmarking and Investigating Scientific Machine Learning for Multiphysics PDEs

链接: https://arxiv.org/abs/2505.17575
作者: Changfan Yang,Lichen Bai,Yinpeng Wang,Shufei Zhang,Zeke Xie
类目: Machine Learning (cs.LG)
*备注: 31 pages. 20 tables, 17 figures, Dataset

点击查看摘要

Abstract:Solving partial differential equations (PDEs) with machine learning has recently attracted great attention, as PDEs are fundamental tools for modeling real-world systems that range from fundamental physical science to advanced engineering disciplines. Most real-world physical systems across various disciplines are actually involved in multiple coupled physical fields rather than a single field. However, previous machine learning studies mainly focused on solving single-field problems, but overlooked the importance and characteristics of multiphysics problems in real world. Multiphysics PDEs typically entail multiple strongly coupled variables, thereby introducing additional complexity and challenges, such as inter-field coupling. Both benchmarking and solving multiphysics problems with machine learning remain largely unexamined. To identify and address the emerging challenges in multiphysics problems, we mainly made three contributions in this work. First, we collect the first general multiphysics dataset, the Multiphysics Bench, that focuses on multiphysics PDE solving with machine learning. Multiphysics Bench is also the most comprehensive PDE dataset to date, featuring the broadest range of coupling types, the greatest diversity of PDE formulations, and the largest dataset scale. Second, we conduct the first systematic investigation on multiple representative learning-based PDE solvers, such as PINNs, FNO, DeepONet, and DiffusionPDE solvers, on multiphysics problems. Unfortunately, naively applying these existing solvers usually show very poor performance for solving multiphysics. Third, through extensive experiments and discussions, we report multiple insights and a bag of useful tricks for solving multiphysics with machine learning, motivating future directions in the study and simulation of complex, coupled physical systems.

[LG-67] Graph Style Transfer for Counterfactual Explainability ICML’25

链接: https://arxiv.org/abs/2505.17542
作者: Bardh Prenkaj,Efstratios Zaradoukas,Gjergji Kasneci
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML’25

点击查看摘要

Abstract:Counterfactual explainability seeks to uncover model decisions by identifying minimal changes to the input that alter the predicted outcome. This task becomes particularly challenging for graph data due to preserving structural integrity and semantic meaning. Unlike prior approaches that rely on forward perturbation mechanisms, we introduce Graph Inverse Style Transfer (GIST), the first framework to re-imagine graph counterfactual generation as a backtracking process, leveraging spectral style transfer. By aligning the global structure with the original input spectrum and preserving local content faithfulness, GIST produces valid counterfactuals as interpolations between the input style and counterfactual content. Tested on 8 binary and multi-class graph classification benchmarks, GIST achieves a remarkable +7.6% improvement in the validity of produced counterfactuals and significant gains (+45.5%) in faithfully explaining the true class distribution. Additionally, GIST’s backtracking mechanism effectively mitigates overshooting the underlying predictor’s decision boundary, minimizing the spectral differences between the input and the counterfactuals. These results challenge traditional forward perturbation methods, offering a novel perspective that advances graph explainability.

[LG-68] meCF: A TimeMixer-Based Model with adaptive Convolution and Sharpness-Aware Minimization Frequency Domain Loss for long-term time seris forecasting

链接: https://arxiv.org/abs/2505.17532
作者: Bin Wang,Heming Yang,Jinfang Sheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies have shown that by introducing prior knowledge, multi-scale analysis of complex and non-stationary time series in real environments can achieve good results in the field of long-term forecasting. However, affected by channel-independent methods, models based on multi-scale analysis may produce suboptimal prediction results due to the autocorrelation between time series labels, which in turn affects the generalization ability of the model. To address this challenge, we are inspired by the idea of sharpness-aware minimization and the recently proposed FreDF method and design a deep learning model TimeCF for long-term time series forecasting based on the TimeMixer, combined with our designed adaptive convolution information aggregation module and Sharpness-Aware Minimization Frequency Domain Loss (SAMFre). Specifically, TimeCF first decomposes the original time series into sequences of different scales. Next, the same-sized convolution modules are used to adaptively aggregate information of different scales on sequences of different scales. Then, decomposing each sequence into season and trend parts and the two parts are mixed at different scales through bottom-up and top-down methods respectively. Finally, different scales are aggregated through a Feed-Forward Network. What’s more, extensive experimental results on different real-world datasets show that our proposed TimeCF has excellent performance in the field of long-term forecasting.

[LG-69] Spacetime Geometry of Denoising in Diffusion Models

链接: https://arxiv.org/abs/2505.17517
作者: Rafał Karczewski,Markus Heinonen,Alison Pouplin,Søren Hauberg,Vikas Garg
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel perspective on diffusion models using the framework of information geometry. We show that the set of noisy samples, taken across all noise levels simultaneously, forms a statistical manifold – a family of denoising probability distributions. Interpreting the noise level as a temporal parameter, we refer to this manifold as spacetime. This manifold naturally carries a Fisher-Rao metric, which defines geodesics – shortest paths between noisy points. Notably, this family of distributions is exponential, enabling efficient geodesic computation even in high-dimensional settings without retraining or fine-tuning. We demonstrate the practical value of this geometric viewpoint in transition path sampling, where spacetime geodesics define smooth sequences of Boltzmann distributions, enabling the generation of continuous trajectories between low-energy metastable states. Code is available at: this https URL.

[LG-70] ExARNN: An Environment-Driven Adaptive RNN for Learning Non-Stationary Power Dynamics

链接: https://arxiv.org/abs/2505.17488
作者: Haoran Li,Muhao Guo,Yang Weng,Marija Ilic,Guangchun Ruan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 5 pages, 3 figures, conference

点击查看摘要

Abstract:Non-stationary power system dynamics, influenced by renewable energy variability, evolving demand patterns, and climate change, are becoming increasingly complex. Accurately capturing these dynamics requires a model capable of adapting to environmental factors. Traditional models, including Recurrent Neural Networks (RNNs), lack efficient mechanisms to encode external factors, such as time or environmental data, for dynamic adaptation. To address this, we propose the External Adaptive RNN (ExARNN), a novel framework that integrates external data (e.g., weather, time) to continuously adjust the parameters of a base RNN. ExARNN achieves this through a hierarchical hypernetwork design, using Neural Controlled Differential Equations (NCDE) to process external data and generate RNN parameters adaptively. This approach enables ExARNN to handle inconsistent timestamps between power and external measurements, ensuring continuous adaptation. Extensive forecasting tests demonstrate ExARNN’s superiority over established baseline models.

[LG-71] Hyperspectral in situ remote sensing of water surface nitrate in the Fitzroy River estuary Queensland Australia using deep learning

链接: https://arxiv.org/abs/2505.17483
作者: Yiqing Guo,Nagur Cherukuru,Eric Lehmann,S. L. Kesav Unnithan,Gemma Kerrisk,Tim Malthus,Faisal Islam
类目: Machine Learning (cs.LG)
*备注: Submitted to IGARSS2025

点击查看摘要

Abstract:Nitrate ( \textNO_3^- ) is a form of dissolved inorganic nitrogen derived primarily from anthropogenic sources. The recent increase in river-discharged nitrate poses a major risk for coral bleaching in the Great Barrier Reef (GBR) lagoon. Although nitrate is an optically inactive (i.e., colourless) constituent, previous studies have demonstrated there is an indirect, non-causal relationship between water surface nitrate and water-leaving reflectance that is mediated through optically active water quality parameters such as total suspended solids and coloured dissolved organic matter. This work aims to advance our understanding of this relationship with an effort to measure time-series nitrate and simultaneous hyperspectral reflectance at the Fitzroy River estuary, Queensland, Australia. Time-series observations revealed periodic cycles in nitrate loads due to the tidal influence in the estuarine study site. The water surface nitrate loads were predicted from hyperspectral reflectance and water salinity measurements, with hyperspectral reflectance indicating the concentrations of optically active variables and salinity indicating the mixing of river water and seawater proportions. The accuracy assessment of model-predicted nitrate against in-situ measured nitrate values showed that the predicted nitrate values correlated well with the ground-truth data, with an R^2 score of 0.86, and an RMSE of 0.03 mg/L. This work demonstrates the feasibility of predicting water surface nitrate from hyperspectral reflectance and salinity measurements.

[LG-72] Reverse-Speech-Finder: A Neural Network Backtracking Architecture for Generating Alzheimers Disease Speech Samples and Improving Diagnosis Performance

链接: https://arxiv.org/abs/2505.17477
作者: Victor OK Li,Yang Han,Jacqueline CK Lam,Lawrence YL Cheung
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This study introduces Reverse-Speech-Finder (RSF), a groundbreaking neural network backtracking architecture designed to enhance Alzheimer’s Disease (AD) diagnosis through speech analysis. Leveraging the power of pre-trained large language models, RSF identifies and utilizes the most probable AD-specific speech markers, addressing both the scarcity of real AD speech samples and the challenge of limited interpretability in existing models. RSF’s unique approach consists of three core innovations: Firstly, it exploits the observation that speech markers most probable of predicting AD, defined as the most probable speech-markers (MPMs), must have the highest probability of activating those neurons (in the neural network) with the highest probability of predicting AD, defined as the most probable neurons (MPNs). Secondly, it utilizes a speech token representation at the input layer, allowing backtracking from MPNs to identify the most probable speech-tokens (MPTs) of AD. Lastly, it develops an innovative backtracking method to track backwards from the MPNs to the input layer, identifying the MPTs and the corresponding MPMs, and ingeniously uncovering novel speech markers for AD detection. Experimental results demonstrate RSF’s superiority over traditional methods such as SHAP and Integrated Gradients, achieving a 3.5% improvement in accuracy and a 3.2% boost in F1-score. By generating speech data that encapsulates novel markers, RSF not only mitigates the limitations of real data scarcity but also significantly enhances the robustness and accuracy of AD diagnostic models. These findings underscore RSF’s potential as a transformative tool in speech-based AD detection, offering new insights into AD-related linguistic deficits and paving the way for more effective non-invasive early intervention strategies.

[LG-73] owards Heterogeneous Continual Graph Learning via Meta-knowledge Distillation

链接: https://arxiv.org/abs/2505.17458
作者: Guiquan Sun,Xikun Zhang,Jingchao Ni,Dongjin Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning on heterogeneous graphs has experienced rapid advancement in recent years, driven by the inherently heterogeneous nature of real-world data. However, existing studies typically assume the graphs to be static, while real-world graphs are continuously expanding. This dynamic nature requires models to adapt to new data while preserving existing knowledge. To this end, this work addresses the challenge of continual learning on heterogeneous graphs by introducing the Meta-learning based Knowledge Distillation framework (MKD), designed to mitigate catastrophic forgetting in evolving heterogeneous graph structures. MKD combines rapid task adaptation through meta-learning on limited samples with knowledge distillation to achieve an optimal balance between incorporating new information and maintaining existing knowledge. To improve the efficiency and effectiveness of sample selection, MKD incorporates a novel sampling strategy that selects a small number of target-type nodes based on node diversity and maintains fixed-size buffers for other types. The strategy retrieves first-order neighbors along metapaths and selects important neighbors based on their structural relevance, enabling the sampled subgraphs to retain key topological and semantic information. In addition, MKD introduces a semantic-level distillation module that aligns the attention distributions over different metapaths between teacher and student models, encouraging semantic consistency beyond the logit level. Comprehensive evaluations across three benchmark datasets validate MKD’s effectiveness in handling continual learning scenarios on expanding heterogeneous graphs.

[LG-74] Corporate Needs You to Find the Difference: Revisiting Submodular and Supermodular Ratio Optimization Problems

链接: https://arxiv.org/abs/2505.17443
作者: Elfarouk Harb,Yousef Yassin,Chandra Chekuri
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of minimizing or maximizing the average value f(S)/|S| of a submodular or supermodular set function f: 2^V \to \mathbbR over non-empty subsets S \subseteq V . This generalizes classical problems such as Densest Subgraph (DSG), Densest Supermodular Set (DSS), and Submodular Function Minimization (SFM). Motivated by recent applications, we introduce two broad formulations: Unrestricted Sparsest Submodular Set (USSS) and Unrestricted Densest Supermodular Set (UDSS), which allow for negative and non-monotone functions. We show that DSS, SFM, USSS, UDSS, and the Minimum Norm Point (MNP) problem are equivalent under strongly polynomial-time reductions, enabling algorithmic crossover. In particular, viewing these through the lens of the MNP in the base polyhedron, we connect Fujishige’s theory with dense decomposition, and show that both Fujishige-Wolfe’s algorithm and the heuristic \textscSuperGreedy++ act as universal solvers for all these problems, including sub-modular function minimization. Theoretically, we explain why \textscSuperGreedy++ is effective beyond DSS, including for tasks like submodular minimization and minimum s - t cut. Empirically, we test several solvers, including the Fujishige-Wolfe algorithm on over 400 experiments across seven problem types and large-scale real/synthetic datasets. Surprisingly, general-purpose convex and flow-based methods outperform task-specific baselines, demonstrating that with the right framing, general optimization techniques can be both scalable and state-of-the-art for submodular and supermodular ratio problems. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2505.17443 [cs.DS] (or arXiv:2505.17443v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2505.17443 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-75] Discretization-free Multicalibration through Loss Minimization over Tree Ensembles

链接: https://arxiv.org/abs/2505.17435
作者: Hongyi Henry Jin,Zijun Ding,Dung Daniel Ngo,Zhiwei Steven Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, multicalibration has emerged as a desirable learning objective for ensuring that a predictor is calibrated across a rich collection of overlapping subpopulations. Existing approaches typically achieve multicalibration by discretizing the predictor’s output space and iteratively adjusting its output values. However, this discretization approach departs from the standard empirical risk minimization (ERM) pipeline, introduces rounding error and additional sensitive hyperparameter, and may distort the predictor’s outputs in ways that hinder downstream decision-making. In this work, we propose a discretization-free multicalibration method that directly optimizes an empirical risk objective over an ensemble of depth-two decision trees. Our ERM approach can be implemented using off-the-shelf tree ensemble learning methods such as LightGBM. Our algorithm provably achieves multicalibration, provided that the data distribution satisfies a technical condition we term as loss saturation. Across multiple datasets, our empirical evaluation shows that this condition is always met in practice. Our discretization-free algorithm consistently matches or outperforms existing multicalibration approaches–even when evaluated using a discretization-based multicalibration metric that shares its discretization granularity with the baselines. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.17435 [cs.LG] (or arXiv:2505.17435v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.17435 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-76] HyperIMTS: Hypergraph Neural Network for Irregular Multivariate Time Series Forecasting ICML2025

链接: https://arxiv.org/abs/2505.17431
作者: Boyuan Li,Yicheng Luo,Zhen Liu,Junhao Zheng,Jianming Lv,Qianli Ma
类目: Machine Learning (cs.LG)
*备注: Accepted in ICML 2025

点击查看摘要

Abstract:Irregular multivariate time series (IMTS) are characterized by irregular time intervals within variables and unaligned observations across variables, posing challenges in learning temporal and variable dependencies. Many existing IMTS models either require padded samples to learn separately from temporal and variable dimensions, or represent original samples via bipartite graphs or sets. However, the former approaches often need to handle extra padding values affecting efficiency and disrupting original sampling patterns, while the latter ones have limitations in capturing dependencies among unaligned observations. To represent and learn both dependencies from original observations in a unified form, we propose HyperIMTS, a Hypergraph neural network for Irregular Multivariate Time Series forecasting. Observed values are converted as nodes in the hypergraph, interconnected by temporal and variable hyperedges to enable message passing among all observations. Through irregularity-aware message passing, HyperIMTS captures variable dependencies in a time-adaptive way to achieve accurate forecasting. Experiments demonstrate HyperIMTS’s competitive performance among state-of-the-art models in IMTS forecasting with low computational cost.

[LG-77] Wasserstein Transfer Learning

链接: https://arxiv.org/abs/2505.17404
作者: Kaicheng Zhang,Sinian Zhang,Doudou Zhou,Yidong Zhou
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 25 pages, 6 figures

点击查看摘要

Abstract:Transfer learning is a powerful paradigm for leveraging knowledge from source domains to enhance learning in a target domain. However, traditional transfer learning approaches often focus on scalar or multivariate data within Euclidean spaces, limiting their applicability to complex data structures such as probability distributions. To address this, we introduce a novel framework for transfer learning in regression models, where outputs are probability distributions residing in the Wasserstein space. When the informative subset of transferable source domains is known, we propose an estimator with provable asymptotic convergence rates, quantifying the impact of domain similarity on transfer efficiency. For cases where the informative subset is unknown, we develop a data-driven transfer learning procedure designed to mitigate negative transfer. The proposed methods are supported by rigorous theoretical analysis and are validated through extensive simulations and real-world applications.

[LG-78] Spectral Mixture Kernels for Bayesian Optimization

链接: https://arxiv.org/abs/2505.17393
作者: Yi Zhang,Cheng Hua
类目: Machine Learning (cs.LG); Spectral Theory (math.SP)
*备注:

点击查看摘要

Abstract:Bayesian Optimization (BO) is a widely used approach for solving expensive black-box optimization tasks. However, selecting an appropriate probabilistic surrogate model remains an important yet challenging problem. In this work, we introduce a novel Gaussian Process (GP)-based BO method that incorporates spectral mixture kernels, derived from spectral densities formed by scale-location mixtures of Cauchy and Gaussian distributions. This method achieves a significant improvement in both efficiency and optimization performance, matching the computational speed of simpler kernels while delivering results that outperform more complex models and automatic BO methods. We provide bounds on the information gain and cumulative regret associated with obtaining the optimum. Extensive numerical experiments demonstrate that our method consistently outperforms existing baselines across a diverse range of synthetic and real-world problems, including both low- and high-dimensional settings.

[LG-79] Improved and Oracle-Efficient Online ell_1-Multicalibration ICML2025

链接: https://arxiv.org/abs/2505.17365
作者: Rohan Ghuge,Vidya Muthukumar,Sahil Singla
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:We study \emphonline multicalibration, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across T rounds. Although online calibration is typically studied in the \ell_1 norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as \ell_2 and \ell_\infty ) and then transferred these guarantees to \ell_1 at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of \widetilde\mathcalO(T^-1/3) and \widetilde\mathcalO(T^-1/4) respectively, for online \ell_1 -multicalibration. Our key insight is a novel reduction of online (\ell_1)-multicalibration to an online learning problem with product-based rewards, which we refer to as \emphonline linear-product optimization ( \mathttOLPO ). To obtain the improved rate of \widetilde\mathcalO(T^-1/3) , we introduce a linearization of \mathttOLPO and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it becomes computationally expensive when the group family (\mathcalH) is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to \mathttOLPO that makes only a polynomial number of calls to an offline optimization (\emphmulticalibration evaluation) oracle, resulting in \emphoracle-efficient online (\ell_1)-multicalibration with a rate of \widetilde\mathcalO(T^-1/4) . Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a 1 -Lipschitz property of the (\ell_1)-multicalibration error with respect to (\mathcalH). Comments: Accepted to ICML 2025 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.17365 [cs.LG] (or arXiv:2505.17365v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.17365 Focus to learn more arXiv-issued DOI via DataCite

[LG-80] owards VM Rescheduling Optimization Through Deep Reinforcement Learning

链接: https://arxiv.org/abs/2505.17359
作者: Xianzhong Ding,Yunkai Zhang,Binbin Chen,Donghao Ying,Tieying Zhang,Jianjun Chen,Lei Zhang,Alberto Cerpa,Wan Du
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern industry-scale data centers need to manage a large number of virtual machines (VMs). Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs, a practice commonly referred to as VM rescheduling. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, due to dynamic VM state changes during this period. This causes existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VM2RL, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions, a feature extraction module that captures relational information specific to rescheduling, as well as a risk-seeking evaluation enabling users to optimize the trade-off between latency and accuracy. We conduct extensive experiments with data from an industry-scale data center. Our results show that VM2RL can achieve a performance comparable to the optimal solution but with a running time of seconds. Code and datasets are open-sourced: this https URL, this https URL.

[LG-81] Adversarial Robustness of Nonparametric Regression

链接: https://arxiv.org/abs/2505.17356
作者: Parsa Moradi,Hanzaleh Akabrinodehi,Mohammad Ali Maddah-Ali
类目: Machine Learning (cs.LG)
*备注: 22 pages, 2 figures

点击查看摘要

Abstract:In this paper, we investigate the adversarial robustness of regression, a fundamental problem in machine learning, under the setting where an adversary can arbitrarily corrupt a subset of the input data. While the robustness of parametric regression has been extensively studied, its nonparametric counterpart remains largely unexplored. We characterize the adversarial robustness in nonparametric regression, assuming the regression function belongs to the second-order Sobolev space (i.e., it is square integrable up to its second derivative). The contribution of this paper is two-fold: (i) we establish a minimax lower bound on the estimation error, revealing a fundamental limit that no estimator can overcome, and (ii) we show that, perhaps surprisingly, the classical smoothing spline estimator, when properly regularized, exhibits robustness against adversarial corruption. These results imply that if o(n) out of n samples are corrupted, the estimation error of the smoothing spline vanishes as n \to \infty . On the other hand, when a constant fraction of the data is corrupted, no estimator can guarantee vanishing estimation error, implying the optimality of the smoothing spline in terms of maximum tolerable number of corrupted samples. Comments: 22 pages, 2 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.17356 [cs.LG] (or arXiv:2505.17356v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.17356 Focus to learn more arXiv-issued DOI via DataCite

[LG-82] CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots

链接: https://arxiv.org/abs/2505.17354
作者: Keisuke Kawano,Takuro Kutsuna,Naoki Hayashi,Yasushi Esaki,Hidenori Tanaka
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 27 pages, 28 figures

点击查看摘要

Abstract:In many real-world scenarios, such as single-cell RNA sequencing, data are observed only as discrete-time snapshots spanning finite time intervals and subject to noisy timestamps, with no continuous trajectories available. Recovering the underlying continuous-time dynamics from these snapshots with coarse and noisy observation times is a critical and challenging task. We propose Continuous-Time Optimal Transport Flow (CT-OT Flow), which first infers high-resolution time labels via partial optimal transport and then reconstructs a continuous-time data distribution through a temporal kernel smoothing. This reconstruction enables accurate training of dynamics models such as ODEs and SDEs. CT-OT Flow consistently outperforms state-of-the-art methods on synthetic benchmarks and achieves lower reconstruction errors on real scRNA-seq and typhoon-track datasets. Our results highlight the benefits of explicitly modeling temporal discretization and timestamp uncertainty, offering an accurate and general framework for bridging discrete snapshots and continuous-time processes.

[LG-83] A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

链接: https://arxiv.org/abs/2505.17342
作者: Ankita Kushwaha,Kiran Ravish,Preeti Lamba,Pawan Kumar
类目: Machine Learning (cs.LG)
*备注: 25

点击查看摘要

Abstract:Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview of SafeRL formulations based on Constrained Markov Decision Processes (CMDPs) and extensions to Multi-Agent Safe RL (SafeMARL). We review theoretical foundations of CMDPs, covering definitions, constrained optimization techniques, and fundamental theorems. We then summarize state-of-the-art algorithms in SafeRL for single agents, including policy gradient methods with safety guarantees and safe exploration strategies, as well as recent advances in SafeMARL for cooperative and competitive settings. Additionally, we propose five open research problems to advance the field, with three focusing on SafeMARL. Each problem is described with motivation, key challenges, and related prior work. This survey is intended as a technical guide for researchers interested in SafeRL and SafeMARL, highlighting key concepts, methods, and open future research directions.

[LG-84] I-DeepONet: Learnable Time Integration for Stable Long-Term Extrapolation

链接: https://arxiv.org/abs/2505.17341
作者: Dibyajyoti Nayak,Somdatta Goswami
类目: Machine Learning (cs.LG)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Accurate temporal extrapolation presents a fundamental challenge for neural operators in modeling dynamical systems, where reliable predictions must extend significantly beyond the training time horizon. Conventional Deep Operator Network (DeepONet) approaches employ two inherently limited training paradigms - fixed-horizon rollouts that predict complete spatiotemporal solutions while disregarding temporal causality, and autoregressive formulations that accumulate errors through sequential predictions. We introduce TI-DeepONet, a framework that integrates neural operators with adaptive numerical time-stepping techniques to preserve the Markovian structure of dynamical systems while mitigating error propagation in extended temporal forecasting. Our approach reformulates the learning objective from direct state prediction to the approximation of instantaneous time-derivative fields, which are then integrated using established numerical schemes. This architecture supports continuous-time prediction and enables deployment of higher-precision integrators during inference than those used during training, balancing computational efficiency with predictive accuracy. We further develop TI(L)-DeepONet, which incorporates learnable coefficients for intermediate slopes in the integration process, adapting to solution-specific variations and enhancing fidelity. Evaluation across three canonical PDEs shows that TI(L)-DeepONet marginally outperforms TI-DeepONet, with both reducing relative L2 extrapolation errors: approximately 81% over autoregressive and 70% over fixed-horizon methods. Notably, both maintain prediction stability for temporal domains extending to about twice the training interval. This research establishes a physics-aware operator learning paradigm that bridges neural approximation with numerical analysis while preserving the causal structure of dynamical systems.

[LG-85] Conformal Predictive Distributions for Order Fulfillm ent Time Forecasting

链接: https://arxiv.org/abs/2505.17340
作者: Tinghan Ye,Amira Hijazi,Pascal Van Hentenryck
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate estimation of order fulfillment time is critical for e-commerce logistics, yet traditional rule-based approaches often fail to capture the inherent uncertainties in delivery operations. This paper introduces a novel framework for distributional forecasting of order fulfillment time, leveraging Conformal Predictive Systems and Cross Venn-Abers Predictors–model-agnostic techniques that provide rigorous coverage or validity guarantees. The proposed machine learning methods integrate granular spatiotemporal features, capturing fulfillment location and carrier performance dynamics to enhance predictive accuracy. Additionally, a cost-sensitive decision rule is developed to convert probabilistic forecasts into reliable point predictions. Experimental evaluation on a large-scale industrial dataset demonstrates that the proposed methods generate competitive distributional forecasts, while machine learning-based point predictions significantly outperform the existing rule-based system–achieving up to 14% higher prediction accuracy and up to 75% improvement in identifying late deliveries.

[LG-86] Wavelet Probabilistic Recurrent Convolutional Network for Multivariate Time Series Classification

链接: https://arxiv.org/abs/2505.17307
作者: Pu Yang,J. A. Barria
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a Wavelet Probabilistic Recurrent Convolutional Network (WPRCN) for Multivariate Time Series Classification (MTSC), especially effective in handling non-stationary environments, data scarcity and noise perturbations. We introduce a versatile wavelet probabilistic module designed to extract and analyse the probabilistic features, which can seamlessly integrate with a variety of neural network architectures. This probabilistic module comprises an Adaptive Wavelet Probabilistic Feature Generator (AWPG) and a Channel Attention-based Probabilistic Temporal Convolutional Network (APTCN). Such formulation extends the application of wavelet probabilistic neural networks to deep neural networks for MTSC. The AWPG constructs an ensemble probabilistic model addressing different data scarcities and non-stationarity; it adaptively selects the optimal ones and generates probabilistic features for APTCN. The APTCN analyses the correlations of the features and forms a comprehensive feature space with existing MTSC models for classification. Here, we instantiate the proposed module to work in parallel with a Long Short-Term Memory (LSTM) network and a Causal Fully Convolutional Network (C-FCN), demonstrating its broad applicability in time series analysis. The WPRCN is evaluated on 30 diverse MTS datasets and outperforms all the benchmark algorithms on average accuracy and rank, exhibiting pronounced strength in handling scarce data and physiological data subject to perturbations and non-stationarities.

[LG-87] Implicit Regularization of Infinitesimally-perturbed Gradient Descent Toward Low-dimensional Solutions

链接: https://arxiv.org/abs/2505.17304
作者: Jianhao Ma,Geyu Liang,Salar Fattahi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Implicit regularization refers to the phenomenon where local search algorithms converge to low-dimensional solutions, even when such structures are neither explicitly specified nor encoded in the optimization problem. While widely observed, this phenomenon remains theoretically underexplored, particularly in modern over-parameterized problems. In this paper, we study the conditions that enable implicit regularization by investigating when gradient-based methods converge to second-order stationary points (SOSPs) within an implicit low-dimensional region of a smooth, possibly nonconvex function. We show that successful implicit regularization hinges on two key conditions: (i) the ability to efficiently escape strict saddle points, while (ii) maintaining proximity to the implicit region. Existing analyses enabling the convergence of gradient descent (GD) to SOSPs often rely on injecting large perturbations to escape strict saddle points. However, this comes at the cost of deviating from the implicit region. The central premise of this paper is that it is possible to achieve the best of both worlds: efficiently escaping strict saddle points using infinitesimal perturbations, while controlling deviation from the implicit region via a small deviation rate. We show that infinitesimally perturbed gradient descent (IPGD), which can be interpreted as GD with inherent ``round-off errors’', can provably satisfy both conditions. We apply our framework to the problem of over-parameterized matrix sensing, where we establish formal guarantees for the implicit regularization behavior of IPGD. We further demonstrate through extensive experiments that these insights extend to a broader class of learning problems.

[LG-88] Model-Free Graph Data Selection under Distribution Shift

链接: https://arxiv.org/abs/2505.17293
作者: Ting-Wei Li,Ruizhong Qiu,Hanghang Tong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph domain adaptation (GDA) is a fundamental task in graph machine learning, with techniques like shift-robust graph neural networks (GNNs) and specialized training procedures to tackle the distribution shift problem. Although these model-centric approaches show promising results, they often struggle with severe shifts and constrained computational resources. To address these challenges, we propose a novel model-free framework, GRADATE (GRAph DATa sElector), that selects the best training data from the source domain for the classification task on the target domain. GRADATE picks training samples without relying on any GNN model’s predictions or training recipes, leveraging optimal transport theory to capture and adapt to distribution changes. GRADATE is data-efficient, scalable and meanwhile complements existing model-centric GDA approaches. Through comprehensive empirical studies on several real-world graph-level datasets and multiple covariate shift types, we demonstrate that GRADATE outperforms existing selection methods and enhances off-the-shelf GDA methods with much fewer training data.

[LG-89] Comparator-Adaptive Φ-Regret: Improved Bounds Simpler Algorithms and Applications to Games

链接: https://arxiv.org/abs/2505.17277
作者: Soumita Hait,Ping Li,Haipeng Luo,Mengxiao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the classic expert problem, \Phi -regret measures the gap between the learner’s total loss and that achieved by applying the best action transformation \phi \in \Phi . A recent work by Lu et al., [2025] introduces an adaptive algorithm whose regret against a comparator \phi depends on a certain sparsity-based complexity measure of \phi , (almost) recovering and interpolating optimal bounds for standard regret notions such as external, internal, and swap regret. In this work, we propose a general idea to achieve an even better comparator-adaptive \Phi -regret bound via much simpler algorithms compared to Lu et al., [2025]. Specifically, we discover a prior distribution over all possible binary transformations and show that it suffices to achieve prior-dependent regret against these transformations. Then, we propose two concrete and efficient algorithms to achieve so, where the first one learns over multiple copies of a prior-aware variant of the Kernelized MWU algorithm of Farina et al., [2022], and the second one learns over multiple copies of a prior-aware variant of the BM-reduction [Blum and Mansour, 2007]. To further showcase the power of our methods and the advantages over Lu et al., [2025] besides the simplicity and better regret bounds, we also show that our second approach can be extended to the game setting to achieve accelerated and adaptive convergence rate to \Phi -equilibria for a class of general-sum games. When specified to the special case of correlated equilibria, our bound improves over the existing ones from Anagnostides et al., [2022a,b]

[LG-90] JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

链接: https://arxiv.org/abs/2505.17257
作者: Qihao Duan,Bingding Huang,Zhenqiao Song,Irina Lehmann,Lei Gu,Roland Eils,Benjamin Wild
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: this https URL

[LG-91] Approach to Finding a Robust Deep Learning Model

链接: https://arxiv.org/abs/2505.17254
作者: Alexey Boldyrev,Fedor Ratnikov,Andrey Shevelev
类目: Machine Learning (cs.LG)
*备注: 27 pages, 18 figures

点击查看摘要

Abstract:The rapid development of machine learning (ML) and artificial intelligence (AI) applications requires the training of large numbers of models. This growing demand highlights the importance of training models without human supervision, while ensuring that their predictions are reliable. In response to this need, we propose a novel approach for determining model robustness. This approach, supplemented with a proposed model selection algorithm designed as a meta-algorithm, is versatile and applicable to any machine learning model, provided that it is appropriate for the task at hand. This study demonstrates the application of our approach to evaluate the robustness of deep learning models. To this end, we study small models composed of a few convolutional and fully connected layers, using common optimizers due to their ease of interpretation and computational efficiency. Within this framework, we address the influence of training sample size, model weight initialization, and inductive bias on the robustness of deep learning models.

[LG-92] Backdoors in DRL: Four Environments Focusing on In-distribution Triggers

链接: https://arxiv.org/abs/2505.17248
作者: Chace Ashcraft,Ted Staley,Josh Carney,Cameron Hickert,Derek Juba,Kiran Karra,Nathan Drenkow
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Backdoor attacks, or trojans, pose a security risk by concealing undesirable behavior in deep neural network models. Open-source neural networks are downloaded from the internet daily, possibly containing backdoors, and third-party model developers are common. To advance research on backdoor attack mitigation, we develop several trojans for deep reinforcement learning (DRL) agents. We focus on in-distribution triggers, which occur within the agent’s natural data distribution, since they pose a more significant security threat than out-of-distribution triggers due to their ease of activation by the attacker during model deployment. We implement backdoor attacks in four reinforcement learning (RL) environments: LavaWorld, Randomized LavaWorld, Colorful Memory, and Modified Safety Gymnasium. We train various models, both clean and backdoored, to characterize these attacks. We find that in-distribution triggers can require additional effort to implement and be more challenging for models to learn, but are nevertheless viable threats in DRL even using basic data poisoning attacks.

[LG-93] Semantic-Aware Interpretable Multimodal Music Auto-Tagging

链接: https://arxiv.org/abs/2505.17233
作者: Andreas Patakis,Vassilis Lyberatos,Spyridon Kantarelis,Edmund Dervakos,Giorgos Stamou
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Music auto-tagging is essential for organizing and discovering music in extensive digital libraries. While foundation models achieve exceptional performance in this domain, their outputs often lack interpretability, limiting trust and usability for researchers and end-users alike. In this work, we present an interpretable framework for music auto-tagging that leverages groups of musically meaningful multimodal features, derived from signal processing, deep learning, ontology engineering, and natural language processing. To enhance interpretability, we cluster features semantically and employ an expectation maximization algorithm, assigning distinct weights to each group based on its contribution to the tagging process. Our method achieves competitive tagging performance while offering a deeper understanding of the decision-making process, paving the way for more transparent and user-centric music tagging systems.

[LG-94] Automated Capability Evaluation of Foundation Models

链接: https://arxiv.org/abs/2505.17228
作者: Arash Afkanpour,Omkar Dige,Fatemeh Tavakoli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current evaluation frameworks for foundation models rely heavily on fixed, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful language models to decompose a domain into semantically meaningful capabilities and generate diverse evaluation tasks, significantly reducing human effort. To maximize coverage and efficiency, ACE models a subject model’s performance as a capability function over a latent semantic space and uses active learning to prioritize the evaluation of the most informative capabilities. This adaptive evaluation strategy enables cost-effective discovery of strengths, weaknesses, and failure modes that static benchmarks may miss. Our results suggest that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.

[LG-95] Secure and Private Federated Learning: Achieving Adversarial Resilience through Robust Aggregation

链接: https://arxiv.org/abs/2505.17226
作者: Kun Yang,Neena Imam
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative machine learning across decentralized data sources without sharing raw data. It offers a promising approach to privacy-preserving AI. However, FL remains vulnerable to adversarial threats from malicious participants, referred to as Byzantine clients, who can send misleading updates to corrupt the global model. Traditional aggregation methods, such as simple averaging, are not robust to such attacks. More resilient approaches, like the Krum algorithm, require prior knowledge of the number of malicious clients, which is often unavailable in real-world scenarios. To address these limitations, we propose Average-rKrum (ArKrum), a novel aggregation strategy designed to enhance both the resilience and privacy guarantees of FL systems. Building on our previous work (rKrum), ArKrum introduces two key innovations. First, it includes a median-based filtering mechanism that removes extreme outliers before estimating the number of adversarial clients. Second, it applies a multi-update averaging scheme to improve stability and performance, particularly when client data distributions are not identical. We evaluate ArKrum on benchmark image and text datasets under three widely studied Byzantine attack types. Results show that ArKrum consistently achieves high accuracy and stability. It performs as well as or better than other robust aggregation methods. These findings demonstrate that ArKrum is an effective and practical solution for secure FL systems in adversarial environments.

[LG-96] Content Moderation in TV Search: Balancing Policy Compliance Relevance and User Experience SIGIR2025

链接: https://arxiv.org/abs/2505.17207
作者: Adeep Hande,Kishorekumar Sundararajan,Sardar Hamidian,Ferhan Ture
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at SIGIR 2025 Industry Track. 5 pages, 1 figure, 2 tables. DOI: https://doi.org/10.1145/3726302.3731962

点击查看摘要

Abstract:Millions of people rely on search functionality to find and explore content on entertainment platforms. Modern search systems use a combination of candidate generation and ranking approaches, with advanced methods leveraging deep learning and LLM-based techniques to retrieve, generate, and categorize search results. Despite these advancements, search algorithms can still surface inappropriate or irrelevant content due to factors like model unpredictability, metadata errors, or overlooked design flaws. Such issues can misalign with product goals and user expectations, potentially harming user trust and business outcomes. In this work, we introduce an additional monitoring layer using Large Language Models (LLMs) to enhance content moderation. This additional layer flags content if the user did not intend to search for it. This approach serves as a baseline for product quality assurance, with collected feedback used to refine the initial retrieval mechanisms of the search model, ensuring a safer and more reliable user experience.

[LG-97] Shape it Up! Restoring LLM Safety during Finetuning

链接: https://arxiv.org/abs/2505.17196
作者: ShengYun Peng,Pin-Yu Chen,Jianfeng Chi,Seongmin Lee,Duen Horng Chau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal-a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families-all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks.

[LG-98] ropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms

链接: https://arxiv.org/abs/2505.17190
作者: Baran Hashemi,Kurt Pasque,Chris Teska,Ruriko Yoshida
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG); Combinatorics (math.CO)
*备注: Under Review

点击查看摘要

Abstract:Dynamic programming (DP) algorithms for combinatorial optimization problems work with taking maximization, minimization, and classical addition in their recursion algorithms. The associated value functions correspond to convex polyhedra in the max plus semiring. Existing Neural Algorithmic Reasoning models, however, rely on softmax-normalized dot-product attention where the smooth exponential weighting blurs these sharp polyhedral structures and collapses when evaluated on out-of-distribution (OOD) settings. We introduce Tropical attention, a novel attention function that operates natively in the max-plus semiring of tropical geometry. We prove that Tropical attention can approximate tropical circuits of DP-type combinatorial algorithms. We then propose that using Tropical transformers enhances empirical OOD performance in both length generalization and value generalization, on algorithmic reasoning tasks, surpassing softmax baselines while remaining stable under adversarial attacks. We also present adversarial-attack generalization as a third axis for Neural Algorithmic Reasoning benchmarking. Our results demonstrate that Tropical attention restores the sharp, scale-invariant reasoning absent from softmax.

[LG-99] Neuromorphic Mimicry Attacks Exploiting Brain-Inspired Computing for Covert Cyber Intrusions

链接: https://arxiv.org/abs/2505.17094
作者: Hemanth Ravipati
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neuromorphic computing, inspired by the human brain’s neural architecture, is revolutionizing artificial intelligence and edge computing with its low-power, adaptive, and event-driven designs. However, these unique characteristics introduce novel cybersecurity risks. This paper proposes Neuromorphic Mimicry Attacks (NMAs), a groundbreaking class of threats that exploit the probabilistic and non-deterministic nature of neuromorphic chips to execute covert intrusions. By mimicking legitimate neural activity through techniques such as synaptic weight tampering and sensory input poisoning, NMAs evade traditional intrusion detection systems, posing risks to applications such as autonomous vehicles, smart medical implants, and IoT networks. This research develops a theoretical framework for NMAs, evaluates their impact using a simulated neuromorphic chip dataset, and proposes countermeasures, including neural-specific anomaly detection and secure synaptic learning protocols. The findings underscore the critical need for tailored cybersecurity measures to protect brain-inspired computing, offering a pioneering exploration of this emerging threat landscape.

[LG-100] Covert Attacks on Machine Learning Training in Passively Secure MPC

链接: https://arxiv.org/abs/2505.17092
作者: Matthew Jagielski,Daniel Escudero,Rahul Rachuri,Peter Scholl
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Secure multiparty computation (MPC) allows data owners to train machine learning models on combined data while keeping the underlying training data private. The MPC threat model either considers an adversary who passively corrupts some parties without affecting their overall behavior, or an adversary who actively modifies the behavior of corrupt parties. It has been argued that in some settings, active security is not a major concern, partly because of the potential risk of reputation loss if a party is detected cheating. In this work we show explicit, simple, and effective attacks that an active adversary can run on existing passively secure MPC training protocols, while keeping essentially zero risk of the attack being detected. The attacks we show can compromise both the integrity and privacy of the model, including attacks reconstructing exact training data. Our results challenge the belief that a threat model that does not include malicious behavior by the involved parties may be reasonable in the context of PPML, motivating the use of actively secure protocols for training. Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2505.17092 [cs.CR] (or arXiv:2505.17092v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.17092 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-101] Streamlining HTTP Flooding Attack Detection through Incremental Feature Selection

链接: https://arxiv.org/abs/2505.17077
作者: Upasana Sarmah,Parthajit Borah,D. K. Bhattacharyya
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Applications over the Web primarily rely on the HTTP protocol to transmit web pages to and from systems. There are a variety of application layer protocols, but among all, HTTP is the most targeted because of its versatility and ease of integration with online services. The attackers leverage the fact that by default no detection system blocks any HTTP traffic. Thus, by exploiting such characteristics of the protocol, attacks are launched against web applications. HTTP flooding attacks are one such attack in the application layer of the OSI model. In this paper, a method for the detection of such an attack is proposed. The heart of the detection method is an incremental feature subset selection method based on mutual information and correlation. INFS-MICC helps in identifying a subset of highly relevant and independent feature subset so as to detect HTTP Flooding attacks with best possible classification performance in near-real time.

[LG-102] Fast and Flexible Quantum-Inspired Differential Equation Solvers with Data Integration

链接: https://arxiv.org/abs/2505.17046
作者: Lucas Arenstein,Martin Mikkelsen,Michael Kastoryano
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Accurately solving high-dimensional partial differential equations (PDEs) remains a central challenge in computational mathematics. Traditional numerical methods, while effective in low-dimensional settings or on coarse grids, often struggle to deliver the precision required in practical applications. Recent machine learning-based approaches offer flexibility but frequently fall short in terms of accuracy and reliability, particularly in industrial contexts. In this work, we explore a quantum-inspired method based on quantized tensor trains (QTT), enabling efficient and accurate solutions to PDEs in a variety of challenging scenarios. Through several representative examples, we demonstrate that the QTT approach can achieve logarithmic scaling in both memory and computational cost for linear and nonlinear PDEs. Additionally, we introduce a novel technique for data-driven learning within the quantum-inspired framework, combining the adaptability of neural networks with enhanced accuracy and reduced training time.

[LG-103] A brief review of the Deep BSDE method for solving high-dimensional partial differential equations

链接: https://arxiv.org/abs/2505.17032
作者: Jiequn Han,Arnulf Jentzen,Weinan E
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-dimensional partial differential equations (PDEs) pose significant challenges for numerical computation due to the curse of dimensionality, which limits the applicability of traditional mesh-based methods. Since 2017, the Deep BSDE method has introduced deep learning techniques that enable the effective solution of nonlinear PDEs in very high dimensions. This innovation has sparked considerable interest in using neural networks for high-dimensional PDEs, making it an active area of research. In this short review, we briefly sketch the Deep BSDE method, its subsequent developments, and future directions for the field.

[LG-104] Scalable Policy Maximization Under Network Interference

链接: https://arxiv.org/abs/2505.18118
作者: Aidan Gleich,Eric Laber,Alexander Volfovsky
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many interventions, such as vaccines in clinical trials or coupons in online marketplaces, must be assigned sequentially without full knowledge of their effects. Multi-armed bandit algorithms have proven successful in such settings. However, standard independence assumptions fail when the treatment status of one individual impacts the outcomes of others, a phenomenon known as interference. We study optimal-policy learning under interference on a dynamic network. Existing approaches to this problem require repeated observations of the same fixed network and struggle to scale in sample size beyond as few as fifteen connected units – both limit applications. We show that under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new n -node network is observed each round. We prove a Bayesian regret bound that is sublinear in n and the number of rounds. Simulation experiments show that our algorithm learns quickly and outperforms existing methods. The results close a key scalability gap between causal inference methods for interference and practical bandit algorithms, enabling policy optimization in large-scale networked systems.

[LG-105] Bayesian Deep Learning for Discrete Choice

链接: https://arxiv.org/abs/2505.18077
作者: Daniel F. Villarraga,Ricardo A. Daziano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Discrete choice models (DCMs) are used to analyze individual decision-making in contexts such as transportation choices, political elections, and consumer preferences. DCMs play a central role in applied econometrics by enabling inference on key economic variables, such as marginal rates of substitution, rather than focusing solely on predicting choices on new unlabeled data. However, while traditional DCMs offer high interpretability and support for point and interval estimation of economic quantities, these models often underperform in predictive tasks compared to deep learning (DL) models. Despite their predictive advantages, DL models remain largely underutilized in discrete choice due to concerns about their lack of interpretability, unstable parameter estimates, and the absence of established methods for uncertainty quantification. Here, we introduce a deep learning model architecture specifically designed to integrate with approximate Bayesian inference methods, such as Stochastic Gradient Langevin Dynamics (SGLD). Our proposed model collapses to behaviorally informed hypotheses when data is limited, mitigating overfitting and instability in underspecified settings while retaining the flexibility to capture complex nonlinear relationships when sufficient data is available. We demonstrate our approach using SGLD through a Monte Carlo simulation study, evaluating both predictive metrics–such as out-of-sample balanced accuracy–and inferential metrics–such as empirical coverage for marginal rates of substitution interval estimates. Additionally, we present results from two empirical case studies: one using revealed mode choice data in NYC, and the other based on the widely used Swiss train choice stated preference data.

[LG-106] Deep Operator Neural Network Model Predictive Control

链接: https://arxiv.org/abs/2505.18008
作者: Thomas Oliver de Jong,Khemraj Shukla,Mircea Lazar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we consider the design of model predictive control (MPC) algorithms based on deep operator neural networks (DeepONets). These neural networks are capable of accurately approximating real and complex valued solutions of continuous time nonlinear systems without relying on recurrent architectures. The DeepONet architecture is made up of two feedforward neural networks: the branch network, which encodes the input function space, and the trunk network, which represents dependencies on temporal variables or initial conditions. Utilizing the original DeepONet architecture as a predictor within MPC for Multi Input Multi Output (MIMO) systems requires multiple branch networks, to generate multi output predictions, one for each input. Moreover, to predict multiple time steps into the future, the network has to be evaluated multiple times. Motivated by this, we introduce a multi step DeepONet (MS-DeepONet) architecture that computes in one shot multi step predictions of system outputs from multi step input sequences, which is better suited for MPC. We prove that the MS DeepONet is a universal approximator in terms of multi step sequence prediction. Additionally, we develop automated hyper parameter selection strategies and implement MPC frameworks using both the standard DeepONet and the proposed MS DeepONet architectures in PyTorch. The implementation is publicly available on GitHub. Simulation results demonstrate that MS-DeepONet consistently outperforms the standard DeepONet in learning and predictive control tasks across several nonlinear benchmark systems: the van der Pol oscillator, the quadruple tank process, and a cart pendulum unstable system, where it successfully learns and executes multiple swing up and stabilization policies.

[LG-107] Anytime-valid Bayes-assistedPrediction-Powered Inference

链接: https://arxiv.org/abs/2505.18000
作者: Valentin Kilian,Stefano Cortinovis,François Caron
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Given a large pool of unlabelled data and a smaller amount of labels, prediction-powered inference (PPI) leverages machine learning predictions to increase the statistical efficiency of standard confidence interval procedures based solely on labelled data, while preserving their fixed-time validity. In this paper, we extend the PPI framework to the sequential setting, where labelled and unlabelled datasets grow over time. Exploiting Ville’s inequality and the method of mixtures, we propose prediction-powered confidence sequence procedures that are valid uniformly over time and naturally accommodate prior knowledge on the quality of the predictions to further boost efficiency. We carefully illustrate the design choices behind our method and demonstrate its effectiveness in real and synthetic examples. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2505.18000 [stat.ML] (or arXiv:2505.18000v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2505.18000 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Valentin Kilian [view email] [v1] Fri, 23 May 2025 15:05:49 UTC (2,174 KB)

[LG-108] New Tight Bounds for SGD without Variance Assumption: A Computer-Aided Lyapunov Analysis

链接: https://arxiv.org/abs/2505.17965
作者: Daniel Cortild,Lucas Ketels,Juan Peypouquet,Guillaume Garrigos
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 57 pages, 10 figures. Under review

点击查看摘要

Abstract:The analysis of Stochastic Gradient Descent (SGD) often relies on making some assumption on the variance of the stochastic gradients, which is usually not satisfied or difficult to verify in practice. This paper contributes to a recent line of works which attempt to provide guarantees without making any variance assumption, leveraging only the (strong) convexity and smoothness of the loss functions. In this context, we prove new theoretical bounds derived from the monotonicity of a simple Lyapunov energy, improving the current state-of-the-art and extending their validity to larger step-sizes. Our theoretical analysis is backed by a Performance Estimation Problem analysis, which allows us to claim that, empirically, the bias term in our bounds is tight within our framework.

[LG-109] he Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

链接: https://arxiv.org/abs/2505.17958
作者: Vittorio Erba,Emanuele Troiani,Lenka Zdeborová,Florent Krzakala
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the \ell_2 -regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.

[LG-110] M-learner:A Flexible And Powerful Framework To Study Heterogeneous Treatment Effect In Mediation Model

链接: https://arxiv.org/abs/2505.17917
作者: Xingyu Li,Qing Liu,Tony Jiang,Hong Amy Xia,Brian P. Hobbs,Peng Wei
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a novel method, termed the M-learner, for estimating heterogeneous indirect and total treatment effects and identifying relevant subgroups within a mediation framework. The procedure comprises four key steps. First, we compute individual-level conditional average indirect/total treatment effect Second, we construct a distance matrix based on pairwise differences. Third, we apply tSNE to project this matrix into a low-dimensional Euclidean space, followed by K-means clustering to identify subgroup structures. Finally, we calibrate and refine the clusters using a threshold-based procedure to determine the optimal configuration. To the best of our knowledge, this is the first approach specifically designed to capture treatment effect heterogeneity in the presence of mediation. Experimental results validate the robustness and effectiveness of the proposed framework. Application to the real-world Jobs II dataset highlights the broad adaptability and potential applicability of our this http URL is available at https: //anonymous.this http URL.

[LG-111] Flexible MOF Generation with Torsion-Aware Flow Matching

链接: https://arxiv.org/abs/2505.17914
作者: Nayoung Kim,Seongsu Kim,Sungsoo Ahn
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures

点击查看摘要

Abstract:Designing metal-organic frameworks (MOFs) with novel chemistries is a long-standing challenge due to their large combinatorial space and the complex 3D arrangements of building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known ground-truth local block-wise 3D coordinates. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage de novo MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train a SMILES-based autoregressive model to generate novel metal and organic building blocks, paired with cheminformatics for 3D structure initialization. Second, we introduce a flow-matching model that predicts translations, rotations, and torsional angles to assemble flexible blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability of our model to create novel building blocks.

[LG-112] Function Forms of Simple ReLU Networks with Random Hidden Weights

链接: https://arxiv.org/abs/2505.17907
作者: Ka Long Keith Ho,Yoshinari Takeishi,Junichi Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 1 figure, 1 table

点击查看摘要

Abstract:We investigate the function space dynamics of a two-layer ReLU neural network in the infinite-width limit, highlighting the Fisher information matrix (FIM)'s role in steering learning. Extending seminal works on approximate eigendecomposition of the FIM, we derive the asymptotic behavior of basis functions ( f_v(x) = X^\top v ) for four groups of approximate eigenvectors, showing their convergence to distinct function forms. These functions, prioritized by gradient descent, exhibit FIM-induced inner products that approximate orthogonality in the function space, forging a novel connection between parameter and function spaces. Simulations validate the accuracy of these theoretical approximations, confirming their practical relevance. By refining the function space inner product’s role, we advance the theoretical framework for ReLU networks, illuminating their optimization and expressivity. Overall, this work offers a robust foundation for understanding wide neural networks and enhances insights into scalable deep learning architectures, paving the way for improved design and analysis of neural networks.

[LG-113] Continuum Transformers Perform In-Context Learning by Operator Gradient Descent

链接: https://arxiv.org/abs/2505.17838
作者: Abhiti Mishra,Yash Patel,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers robustly exhibit the ability to perform in-context learning, whereby their predictive accuracy on a task can increase not by parameter updates but merely with the placement of training samples in their context windows. Recent works have shown that transformers achieve this by implementing gradient descent in their forward passes. Such results, however, are restricted to standard transformer architectures, which handle finite-dimensional inputs. In the space of PDE surrogate modeling, a generalization of transformers to handle infinite-dimensional function inputs, known as “continuum transformers,” has been proposed and similarly observed to exhibit in-context learning. Despite impressive empirical performance, such in-context learning has yet to be theoretically characterized. We herein demonstrate that continuum transformers perform in-context operator learning by performing gradient descent in an operator RKHS. We demonstrate this using novel proof strategies that leverage a generalized representer theorem for Hilbert spaces and gradient flows over the space of functionals of a Hilbert space. We additionally show the operator learned in context is the Bayes Optimal Predictor in the infinite depth limit of the transformer. We then provide empirical validations of this optimality result and demonstrate that the parameters under which such gradient descent is performed are recovered through the continuum transformer training.

[LG-114] Robust Distributed Estimation: Extending Gossip Algorithms to Ranking and Trimmed Means

链接: https://arxiv.org/abs/2505.17836
作者: Anna Van Elst,Igor Colin,Stephan Clémençon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of robust estimation in gossip algorithms over arbitrary communication graphs. Gossip algorithms are fully decentralized, relying only on local neighbor-to-neighbor communication, making them well-suited for situations where communication is constrained. A fundamental challenge in existing mean-based gossip algorithms is their vulnerability to malicious or corrupted nodes. In this paper, we show that an outlier-robust mean can be computed by globally estimating a robust statistic. More specifically, we propose a novel gossip algorithm for rank estimation, referred to as \textscGoRank, and leverage it to design a gossip procedure dedicated to trimmed mean estimation, coined \textscGoTrim. In addition to a detailed description of the proposed methods, a key contribution of our work is a precise convergence analysis: we establish an \mathcalO(1/t) rate for rank estimation and an \mathcalO(\log(t)/t) rate for trimmed mean estimation, where by t is meant the number of iterations. Moreover, we provide a breakdown point analysis of \textscGoTrim. We empirically validate our theoretical results through experiments on diverse network topologies, data distributions and contamination schemes.

[LG-115] Source Separation of Small Classical Ensembles: Challenges and Opportunities

链接: https://arxiv.org/abs/2505.17823
作者: Gerardo Roa-Dabike,Trevor J. Cox,Jon P. Barker,Michael A. Akeroyd,Scott Bannister,Bruno Fazenda,Jennifer Firth,Simone Graetzer,Alinka Greasley,Rebecca R. Vos,William M. Whitmer
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 5 pages, 4 figures, 2 tables, submitted to WASSPA 2025

点击查看摘要

Abstract:Musical (MSS) source separation of western popular music using non-causal deep learning can be very effective. In contrast, MSS for classical music is an unsolved problem. Classical ensembles are harder to separate than popular music because of issues such as the inherent greater variation in the music; the sparsity of recordings with ground truth for supervised training; and greater ambiguity between instruments. The Cadenza project has been exploring MSS for classical music. This is being done so music can be remixed to improve listening experiences for people with hearing loss. To enable the work, a new database of synthesized woodwind ensembles was created to overcome instrumental imbalances in the EnsembleSet. For the MSS, a set of ConvTasNet models was used with each model being trained to extract a string or woodwind instrument. ConvTasNet was chosen because it enabled both causal and non-causal approaches to be tested. Non-causal approaches have dominated MSS work and are useful for recorded music, but for live music or processing on hearing aids, causal signal processing is needed. The MSS performance was evaluated on the two small datasets (Bach10 and URMP) of real instrument recordings where the ground-truth is available. The performances of the causal and non-causal systems were similar. Comparing the average Signal-to-Distortion (SDR) of the synthesized validation set (6.2 dB causal; 6.9 non-causal), to the real recorded evaluation set (0.3 dB causal, 0.4 dB non-causal), shows that mismatch between synthesized and recorded data is a problem. Future work needs to either gather more real recordings that can be used for training, or to improve the realism and diversity of the synthesized recordings to reduce the mismatch…

[LG-116] Quantifying uncertainty in spectral clusterings: expectations for perturbed and incomplete data

链接: https://arxiv.org/abs/2505.17819
作者: Jürgen Dölz,Jolanda Weygandt
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spectral clustering is a popular unsupervised learning technique which is able to partition unlabelled data into disjoint clusters of distinct shapes. However, the data under consideration are often experimental data, implying that the data is subject to measurement errors and measurements may even be lost or invalid. These uncertainties in the corrupted input data induce corresponding uncertainties in the resulting clusters, and the clusterings thus become unreliable. Modelling the uncertainties as random processes, we discuss a mathematical framework based on random set theory for the computational Monte Carlo approximation of statistically expected clusterings in case of corrupted, i.e., perturbed, incomplete, and possibly even additional, data. We propose several computationally accessible quantities of interest and analyze their consistency in the infinite data point and infinite Monte Carlo sample limit. Numerical experiments are provided to illustrate and compare the proposed quantities. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2505.17819 [stat.ML] (or arXiv:2505.17819v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2505.17819 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-117] Optimal Online Change Detection via Random Fourier Features

链接: https://arxiv.org/abs/2505.17789
作者: Florian Kalinke,Shakeel Gavioli-Akilagun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article studies the problem of online non-parametric change point detection in multivariate data streams. We approach the problem through the lens of kernel-based two-sample testing and introduce a sequential testing procedure based on random Fourier features, running with logarithmic time complexity per observation and with overall logarithmic space complexity. The algorithm has two advantages compared to the state of the art. First, our approach is genuinely online, and no access to training data known to be from the pre-change distribution is necessary. Second, the algorithm does not require the user to specify a window parameter over which local tests are to be calculated. We prove strong theoretical guarantees on the algorithm’s performance, including information-theoretic bounds demonstrating that the detection delay is optimal in the minimax sense. Numerical studies on real and synthetic data show that our algorithm is competitive with respect to the state of the art.

[LG-118] Qiskit Machine Learning: an open-source library for quantum machine learning tasks at scale on quantum hardware and classical simulators

链接: https://arxiv.org/abs/2505.17756
作者: M. Emre Sahin,Edoardo Altamura,Oscar Wallis,Stephen P. Wood,Anton Dekusar,Declan A. Millar,Takashi Imamichi,Atsushi Matsuo,Stefano Mensa
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 6 pages, 1 figure. Qiskit Machine Learning is open-source and available at this https URL

点击查看摘要

Abstract:We present Qiskit Machine Learning (ML), a high-level Python library that combines elements of quantum computing with traditional machine learning. The API abstracts Qiskit’s primitives to facilitate interactions with classical simulators and quantum hardware. Qiskit ML started as a proof-of-concept code in 2019 and has since been developed to be a modular, intuitive tool for non-specialist users while allowing extensibility and fine-tuning controls for quantum computational scientists and developers. The library is available as a public, open-source tool and is distributed under the Apache version 2.0 license.

[LG-119] AstroMLab 4: Benchmark-Topping Performance in Astronomy QA with a 70B-Parameter Domain-Specialized Reasoning Model

链接: https://arxiv.org/abs/2505.17592
作者: Tijmen de Haan,Yuan-Sen Ting,Tirthankar Ghosal,Tuan Dung Nguyen,Alberto Accomazzi,Emily Herron,Vanessa Lama,Rui Pan,Azton Wells,Nesar Ramachandra
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning and model merging. Beyond its 70-billion parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately, or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark – comprising 4,425 questions from literature withheld during training – AstroSage-70B achieves state-of-the-art performance. It surpasses all other tested open-weight and proprietary models, including leading systems like o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, Deepseek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.

[LG-120] GPS-Aided Deep Learning for Beam Prediction and Tracking in UAV mmWave Communication

链接: https://arxiv.org/abs/2505.17530
作者: Vendi Ardianto Nugroho,Byung Moo Lee
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Submitted to IEEE. The code implementation repository: this https URL

点击查看摘要

Abstract:Millimeter-wave (mmWave) communication enables high data rates for cellular-connected Unmanned Aerial Vehicles (UAVs). However, a robust beam management remains challenging due to significant path loss and the dynamic mobility of UAVs, which can destabilize the UAV-base station (BS) link. This research presents a GPS-aided deep learning (DL) model that simultaneously predicts current and future optimal beams for UAV mmWave communications, maintaining a Top-1 prediction accuracy exceeding 70% and an average power loss below 0.6 dB across all prediction steps. These outcomes stem from a proposed data set splitting method ensuring balanced label distribution, paired with a GPS preprocessing technique that extracts key positional features, and a DL architecture that maps sequential position data to beam index predictions. The model reduces overhead by approximately 93% (requiring the training of 2 ~ 3 beams instead of 32 beams) with 95% beam prediction accuracy guarantees, and ensures 94% to 96% of predictions exhibit mean power loss not exceeding 1 dB.

[LG-121] Offline Constrained Reinforcement Learning under Partial Data Coverag e

链接: https://arxiv.org/abs/2505.17506
作者: Kihyuk Hong,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study offline constrained reinforcement learning (RL) with general function approximation. We aim to learn a policy from a pre-collected dataset that maximizes the expected discounted cumulative reward for a primary reward signal while ensuring that expected discounted returns for multiple auxiliary reward signals are above predefined thresholds. Existing algorithms either require fully exploratory data, are computationally inefficient, or depend on an additional auxiliary function classes to obtain an \epsilon -optimal policy with sample complexity O(\epsilon^-2) . In this paper, we propose an oracle-efficient primal-dual algorithm based on a linear programming (LP) formulation, achieving O(\epsilon^-2) sample complexity under partial data coverage. By introducing a realizability assumption, our approach ensures that all saddle points of the Lagrangian are optimal, removing the need for regularization that complicated prior analyses. Through Lagrangian decomposition, our method extracts policies without requiring knowledge of the data-generating distribution, enhancing practical applicability.

[LG-122] Efficient Adaptive Experimentation with Non-Compliance

链接: https://arxiv.org/abs/2505.17468
作者: Miruna Oprescu,Brian M Cho,Nathan Kallus
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 26 pages, 3 figures

点击查看摘要

Abstract:We study the problem of estimating the average treatment effect (ATE) in adaptive experiments where treatment can only be encouraged–rather than directly assigned–via a binary instrumental variable. Building on semiparametric efficiency theory, we derive the efficiency bound for ATE estimation under arbitrary, history-dependent instrument-assignment policies, and show it is minimized by a variance-aware allocation rule that balances outcome noise and compliance variability. Leveraging this insight, we introduce AMRIV–an \textbfAdaptive, \textbfMultiply-\textbfRobust estimator for \textbfInstrumental-\textbfVariable settings with variance-optimal assignment. AMRIV pairs (i) an online policy that adaptively approximates the optimal allocation with (ii) a sequential, influence-function-based estimator that attains the semiparametric efficiency bound while retaining multiply-robust consistency. We establish asymptotic normality, explicit convergence rates, and anytime-valid asymptotic confidence sequences that enable sequential inference. Finally, we demonstrate the practical effectiveness of our approach through empirical studies, showing that adaptive instrument assignment, when combined with the AMRIV estimator, yields improved efficiency and robustness compared to existing baselines.

[LG-123] Programmable Photonic Unitary Processor Enables Parametrized Differentiable Long-Haul Spatial Division Multiplexed Transmission

链接: https://arxiv.org/abs/2505.17381
作者: Mitsumasa Nakajima,Kohki Shibahara,Kohei Ikeda,Akira Kawai,Masaya Notomi,Yutaka Miyamoto,Toshikazu Hashimoto
类目: Optics (physics.optics); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:The explosive growth of global data traffic demands scalable and energy-efficient optical communication systems. Spatial division multiplexing (SDM) using multicore or multimode fibers is a promising solution to overcome the capacity limit of single-mode fibers. However, long-haul SDM transmission faces significant challenges due to modal dispersion, which imposes heavy computational loads on digital signal processing (DSP) for signal equalization. Here, we propose parameterized SDM transmission, where programmable photonic unitary processors are installed at intermediate nodes. Instead of relying on conventional digital equalization only on the receiver side, our approach enables direct optimization of the SDM transmission channel itself by the programmable unitary processor, which reduces digital post-processing loads. We introduce a gradient-based optimization algorithm using a differentiable SDM transmission model to determine the optimal unitary transformation. As a key enabler, we first implemented telecom-grade programmable photonic unitary processor, achieving a low-loss (2.1 dB fiber-to-fiber), wideband (full C-band), polarization-independent, and high-fidelity (R296% across the C-band) operation. We experimentally demonstrate 1300-km transmission using a three-mode fiber, achieving strong agreement between simulation and experiment. The optimized photonic processor significantly reduces modal dispersion and post-processing complexity. Our results establish a scalable framework for integrating photonic computation into the optical layer, enabling more efficient, high-capacity optical networks.

[LG-124] ransformer brain encoders explain human high-level visual responses

链接: https://arxiv.org/abs/2505.17329
作者: Hossein Adeli,Minni Sun,Nikolaus Kriegeskorte
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A major goal of neuroscience is to understand brain computations during visual processing in naturalistic settings. A dominant approach is to use image-computable deep neural networks trained with different task objectives as a basis for linear encoding models. However, in addition to requiring tuning a large number of parameters, the linear encoding approach ignores the structure of the feature maps both in the brain and the models. Recently proposed alternatives have focused on decomposing the linear mapping to spatial and feature components but focus on finding static receptive fields for units that are applicable only in early visual areas. In this work, we employ the attention mechanism used in the transformer architecture to study how retinotopic visual features can be dynamically routed to category-selective areas in high-level visual processing. We show that this computational motif is significantly more powerful than alternative methods in predicting brain activity during natural scene viewing, across different feature basis models and modalities. We also show that this approach is inherently more interpretable, without the need to create importance maps, by interpreting the attention routing signal for different high-level categorical areas. Our approach proposes a mechanistic model of how visual information from retinotopic maps can be routed based on the relevance of the input content to different category-selective regions.

[LG-125] Repulsive Ensembles for Bayesian Inference in Physics-informed Neural Networks

链接: https://arxiv.org/abs/2505.17308
作者: Philipp Pilar,Markus Heinonen,Niklas Wahlström
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have proven an effective tool for solving differential equations, in particular when considering non-standard or ill-posed settings. When inferring solutions and parameters of the differential equation from data, uncertainty estimates are preferable to point estimates, as they give an idea about the accuracy of the solution. In this work, we consider the inverse problem and employ repulsive ensembles of PINNs (RE-PINN) for obtaining such estimates. The repulsion is implemented by adding a particular repulsive term to the loss function, which has the property that the ensemble predictions correspond to the true Bayesian posterior in the limit of infinite ensemble members. Where possible, we compare the ensemble predictions to Monte Carlo baselines. Whereas the standard ensemble tends to collapse to maximum-a-posteriori solutions, the repulsive ensemble produces significantly more accurate uncertainty estimates and exhibits higher sample diversity.

[LG-126] Statistical Inference for Online Algorithms ALT

链接: https://arxiv.org/abs/2505.17300
作者: Selina Carter,Arun K Kuchibhotla
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: Although SGD is the most commonly mentioned method in machine learning, our simulations show that the performance of SGD is highly sensitive to the choice of tuning parameters of the algorithm. We could not find a simple remedy that improves performance and also makes the asymptotic properties manageable. We hope that our article acts as a word of caution to anyone using online algorithms blindly

点击查看摘要

Abstract:Construction of confidence intervals and hypothesis tests for functionals based on asymptotically normal estimators is a classical topic in statistical inference. The simplest and in many cases optimal inference procedure is the Wald interval or the likelihood ratio test, both of which require an estimator and an estimate of the asymptotic variance of the estimator. Estimators obtained from online/sequential algorithms forces one to consider the computational aspects of the inference problem, i.e., one cannot access all of the data as many times as needed. Several works on this topic explored the online estimation of asymptotic variance. In this article, we propose computationally efficient, rate-optimal, and asymptotically valid confidence regions based on the output of online algorithms \em without estimating the asymptotic variance. As a special case, this implies inference from any algorithm that yields an asymptotically normal estimator. We focus our efforts on stochastic gradient descent with Polyak averaging to understand the practical performance of the proposed method.

[LG-127] Optimal Transport with Heterogeneously Missing Data

链接: https://arxiv.org/abs/2505.17291
作者: Linus Bleistein,Aurélien Bellet,Julie Josse
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider the problem of solving the optimal transport problem between two empirical distributions with missing values. Our main assumption is that the data is missing completely at random (MCAR), but we allow for heterogeneous missingness probabilities across features and across the two distributions. As a first contribution, we show that the Wasserstein distance between empirical Gaussian distributions and linear Monge maps between arbitrary distributions can be debiased without significantly affecting the sample complexity. Secondly, we show that entropic regularized optimal transport can be estimated efficiently and consistently using iterative singular value thresholding (ISVT). We propose a validation set-free hyperparameter selection strategy for ISVT that leverages our estimator of the Bures-Wasserstein distance, which could be of independent interest in general matrix completion problems. Finally, we validate our findings on a wide range of numerical applications.

[LG-128] Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation

链接: https://arxiv.org/abs/2505.17288
作者: Seamus Somerstep,Vinod Raman,Unique Subedi,Yuekai Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new next token predictor on good generations. The second method, Best-of-N, trains a reward model to select good responses from a collection generated by an unaltered base model. If the learning setting is realizable, we find that supervised fine-tuning outperforms BoN through a better dependence on the response length in its rate of convergence. If realizability fails, then depending on the failure mode, BoN can enjoy a better rate of convergence in either n or a rate of convergence with better dependence on the response length.

[LG-129] Deconfounded Warm-Start Thompson Sampling with Applications to Precision Medicine

链接: https://arxiv.org/abs/2505.17283
作者: Prateek Jaiswal,Esmaeil Keyvanshokooh,Junyu Cao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Randomized clinical trials often require large patient cohorts before drawing definitive conclusions, yet abundant observational data from parallel studies remains underutilized due to confounding and hidden biases. To bridge this gap, we propose Deconfounded Warm-Start Thompson Sampling (DWTS), a practical approach that leverages a Doubly Debiased LASSO (DDL) procedure to identify a sparse set of reliable measured covariates and combines them with key hidden covariates to form a reduced context. By initializing Thompson Sampling (LinTS) priors with DDL-estimated means and variances on these measured features – while keeping uninformative priors on hidden features – DWTS effectively harnesses confounded observational data to kick-start adaptive clinical trials. Evaluated on both a purely synthetic environment and a virtual environment created using real cardiovascular risk dataset, DWTS consistently achieves lower cumulative regret than standard LinTS, showing how offline causal insights from observational data can improve trial efficiency and support more personalized treatment decisions.

[LG-130] Liouville PDE-based sliced-Wasserstein flow for fair regression

链接: https://arxiv.org/abs/2505.17204
作者: Pilhwa Lee,Jayshawn Cooper
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:The sliced Wasserstein flow (SWF), a nonparametric and implicit generative gradient flow, is applied to fair regression. We have improved the SWF in a few aspects. First, the stochastic diffusive term from the Fokker-Planck equation-based Monte Carlo is transformed to Liouville partial differential equation (PDE)-based transport with density estimation, however, without the diffusive term. Now, the computation of the Wasserstein barycenter is approximated by the SWF barycenter with the prescription of Kantorovich potentials for the induced gradient flow to generate its samples. These two efforts improve the convergence in training and testing SWF and SWF barycenters with reduced variance. Applying the generative SWF barycenter for fair regression demonstrates competent profiles in the accuracy-fairness Pareto curves.

[LG-131] ransfer Faster Price Smarter: Minimax Dynamic Pricing under Cross-Market Preference Shift

链接: https://arxiv.org/abs/2505.17203
作者: Yi Zhang,Elynn Chen,Yujun Yan
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:We study contextual dynamic pricing when a target market can leverage K auxiliary markets – offline logs or concurrent streams – whose mean utilities differ by a structured preference shift. We propose Cross-Market Transfer Dynamic Pricing (CM-TDP), the first algorithm that provably handles such model-shift transfer and delivers minimax-optimal regret for both linear and non-parametric utility models. For linear utilities of dimension d, where the difference between source- and target-task coefficients is s_0 -sparse, CM-TDP attains regret \tildeO((d*K^-1+s_0)\log T) . For nonlinear demand residing in a reproducing kernel Hilbert space with effective dimension \alpha , complexity \beta and task-similarity parameter H , the regret becomes \tildeO!(K^-2\alpha\beta/(2\alpha\beta+1)T^1/(2\alpha\beta+1) + H^2/(2\alpha+1)T^1/(2\alpha+1)) , matching information-theoretic lower bounds up to logarithmic factors. The RKHS bound is the first of its kind for transfer pricing and is of independent interest. Extensive simulations show up to 50% lower cumulative regret and 5 times faster learning relative to single-market pricing baselines. By bridging transfer learning, robust aggregation, and revenue optimization, CM-TDP moves toward pricing systems that transfer faster, price smarter. Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2505.17203 [stat.ME] (or arXiv:2505.17203v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2505.17203 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-132] Distillation-Enabled Knowledge Alignment Protocol for Semantic Communication in AI Agent Networks

链接: https://arxiv.org/abs/2505.17030
作者: Jingzhi Hu,Geoffrey Ye Li
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Future networks are envisioned to connect massive artificial intelligence (AI) agents, enabling their extensive collaboration on diverse tasks. Compared to traditional entities, these agents naturally suit the semantic communication (SC), which can significantly enhance the bandwidth efficiency. Nevertheless, SC requires the knowledge among agents to be aligned, while agents have distinct expert knowledge for their individual tasks in practice. In this paper, we propose a distillation-enabled knowledge alignment protocol (DeKAP), which distills the expert knowledge of each agent into parameter-efficient low-rank matrices, allocates them across the network, and allows agents to simultaneously maintain aligned knowledge for multiple tasks. We formulate the joint minimization of alignment loss, communication overhead, and storage cost as a large-scale integer linear programming problem and develop a highly efficient greedy algorithm. From computer simulation, the DeKAP establishes knowledge alignment with the lowest communication and computation resources compared to conventional approaches.

信息检索

[IR-0] Assessing the performance of 8 AI chatbots in bibliographic reference retrieval: Grok and DeepSeek outperform ChatGPT but none are fully accurate

链接: https://arxiv.org/abs/2505.18059
作者: Álvaro Cabezas-Clavijo,Pavel Sidorenko-Bautista
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This study analyzes the performance of eight generative artificial intelligence chatbots – ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Le Chat, and Perplexity – in their free versions, in the task of generating academic bibliographic references within the university context. A total of 400 references were evaluated across the five major areas of knowledge (Health, Engineering, Experimental Sciences, Social Sciences, and Humanities), based on a standardized prompt. Each reference was assessed according to five key components (authorship, year, title, source, and location), along with document type, publication age, and error count. The results show that only 26.5% of the references were fully correct, 33.8% partially correct, and 39.8% were either erroneous or entirely fabricated. Grok and DeepSeek stood out as the only chatbots that did not generate false references, while Copilot, Perplexity, and Claude exhibited the highest hallucination rates. Furthermore, the chatbots showed a greater tendency to generate book references over journal articles, although the latter had a significantly higher fabrication rate. A high degree of overlap was also detected among the sources provided by several models, particularly between DeepSeek, Grok, Gemini, and ChatGPT. These findings reveal structural limitations in current AI models, highlight the risks of uncritical use by students, and underscore the need to strengthen information and critical literacy regarding the use of AI tools in higher education.

[IR-1] Enhancing CTR Prediction with De-correlated Expert Networks

链接: https://arxiv.org/abs/2505.17925
作者: Jiancheng Wang,Mingjia Yin,Junwei Pan,Ximei Wang,Hao Wang,Enhong Chen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Modeling feature interactions is essential for accurate click-through rate (CTR) prediction in advertising systems. Recent studies have adopted the Mixture-of-Experts (MoE) approach to improve performance by ensembling multiple feature interaction experts. These studies employ various strategies, such as learning independent embedding tables for each expert or utilizing heterogeneous expert architectures, to differentiate the experts, which we refer to expert \emphde-correlation. However, it remains unclear whether these strategies effectively achieve de-correlated experts. To address this, we propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert this http URL, we propose a novel metric, termed Cross-Expert Correlation, to quantitatively evaluate the expert de-correlation degree. Based on this metric, we identify a key finding for MoE framework design: \emphdifferent de-correlation strategies are mutually compatible, and progressively employing them leads to reduced correlation and enhanced performance.Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle. Moreover, online A/B testing on Tencent’s advertising platforms demonstrates that D-MoE achieves a significant 1.19% Gross Merchandise Volume (GMV) lift compared to the Multi-Embedding MoE baseline.

[IR-2] Modeling Ranking Properties with In-Context Learning

链接: https://arxiv.org/abs/2505.17736
作者: Nilanjan Sinhababu,Andrew Parry,Debasis Ganguly,Pabitra Mitra
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 3 tables, 2 figures

点击查看摘要

Abstract:While standard IR models are mainly designed to optimize relevance, real-world search often needs to balance additional objectives such as diversity and fairness. These objectives depend on inter-document interactions and are commonly addressed using post-hoc heuristics or supervised learning methods, which require task-specific training for each ranking scenario and dataset. In this work, we propose an in-context learning (ICL) approach that eliminates the need for such training. Instead, our method relies on a small number of example rankings that demonstrate the desired trade-offs between objectives for past queries similar to the current input. We evaluate our approach on four IR test collections to investigate multiple auxiliary objectives: group fairness (TREC Fairness), polarity diversity (Touché), and topical diversity (TREC Deep Learning 2019/2020). We empirically validate that our method enables control over ranking behavior through demonstration engineering, allowing nuanced behavioral adjustments without explicit optimization.

[IR-3] EGA: A Unified End-to-End Generative Framework for Industrial Advertising Systems

链接: https://arxiv.org/abs/2505.17549
作者: Zuowu Zheng,Ze Wang,Fan Yang,Jiangke Fan,Teng Zhang,Xingxing Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Online industrial advertising system is fundamentally constrained by the inefficiency of multi-stage cascaded architectures, which filter out high-potential candidates early and fragment business decision logic across independent modules. Although recent advances in generative recommendation offer end-to-end solutions, they fall short of practical advertising requirements, lacking explicit modeling of bidding, creative selection, allocation mechanism, and payment computation that are essential for real-world deployment. To overcome these limitations, we propose End-to-end Generative Advertising (EGA), a first unified generative framework that seamlessly integrates user interests modeling, POI and creative generation, position allocation, and payment optimization within a single model. EGA leverages hierarchical tokenization and multi-token prediction to jointly generate candidate POI and creative contents, while a permutation-aware reward model and token-level bidding strategy ensure alignment with both user experiences and advertiser business objectives. Meanwhile, we decouple allocation from payment via a dedicated POI-level payment network with differentiable ex-post regret minimization, guaranteeing incentive compatibility approximately. Extensive offline and large-scale online experiments on real-world advertising systems demonstrate its effectiveness and practical advantages over traditional cascading architectures, highlighting its potential as one of the industry’s pioneering end-to-end generative advertising solutions.

[IR-4] Benchmarking Recommendation Classification and Tracing Based on Hugging Face Knowledge Graph SIGIR2025

链接: https://arxiv.org/abs/2505.17507
作者: Qiaosheng Chen,Kaijia Huang,Xiao Zhou,Weiqing Luo,Yuanning Cui,Gong Cheng
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 5 figures. Accepted at SIGIR 2025

点击查看摘要

Abstract:The rapid growth of open source machine learning (ML) resources, such as models and datasets, has accelerated IR research. However, existing platforms like Hugging Face do not explicitly utilize structured representations, limiting advanced queries and analyses such as tracing model evolution and recommending relevant datasets. To fill the gap, we construct HuggingKG, the first large-scale knowledge graph built from the Hugging Face community for ML resource management. With 2.6 million nodes and 6.2 million edges, HuggingKG captures domain-specific relations and rich textual attributes. It enables us to further present HuggingBench, a multi-task benchmark with three novel test collections for IR tasks including resource recommendation, classification, and tracing. Our experiments reveal unique characteristics of HuggingKG and the derived tasks. Both resources are publicly available, expected to advance research in open source resource sharing and management.

[IR-5] VoxRAG : A Step Toward Transcription-Free RAG Systems in Spoken Question Answering ACL2025

链接: https://arxiv.org/abs/2505.17326
作者: Zackary Rackauckas,Julia Hirschberg
类目: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to ACL 2025 Workshop MAGMaR

点击查看摘要

Abstract:We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation, speaker diarization, CLAP audio embeddings, and FAISS retrieval using L2-normalized cosine similarity. We construct a 50-query test set recorded as spoken input by a native English speaker. Retrieval quality was evaluated using LLM-as-a-judge annotations. For very relevant segments, cosine similarity achieved a Recall@10 of 0.34. For somewhat relevant segments, Recall@10 rose to 0.60 and nDCG@10 to 0.27, highlighting strong topical alignment. Answer quality was judged on a 0–2 scale across relevance, accuracy, completeness, and precision, with mean scores of 0.84, 0.58, 0.56, and 0.46 respectively. While precision and retrieval quality remain key limitations, VoxRAG shows that transcription-free speech-to-speech retrieval is feasible in RAG systems.

[IR-6] ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

链接: https://arxiv.org/abs/2505.17166
作者: Quentin Macé,António Loison,Manuel Faysse
类目: Information Retrieval (cs.IR)
*备注: Published as a HuggingFace Blog

点击查看摘要

Abstract:The ViDoRe Benchmark V1 was approaching saturation with top models exceeding 90% nDCG@5, limiting its ability to discern improvements. ViDoRe Benchmark V2 introduces realistic, challenging retrieval scenarios via blind contextual querying, long and cross-document queries, and a hybrid synthetic and human-in-the-loop query generation process. It comprises four diverse, multilingual datasets and provides clear evaluation instructions. Initial results demonstrate substantial room for advancement and highlight insights on model generalization and multilingual capability. This benchmark is designed as a living resource, inviting community contributions to maintain relevance through future evaluations.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-05-26

目录

概览 (2025-05-26)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载